* [PATCH 0/4] jump label patches
@ 2009-09-24 23:17 Jason Baron
  2009-09-24 23:17 ` [PATCH 1/4] jump label - make init_kernel_text() global Jason Baron
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Jason Baron @ 2009-09-24 23:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, tglx, rostedt, ak, roland, rth, mhiramat


hi,

For background, I introduced this patchset in:
http://marc.info/?l=linux-kernel&m=125200966226921&w=2

This patchset updates the jump label mechanism to look like:

#define JUMP_LABEL(tag, label, cond)                                       \
        do {                                                               \
                static const char __jlstrtab_##tag[]                       \
                __used __attribute__((section("__jump_strings")))  = #tag; \
                asm goto ("1:"                                             \
                        "jmp %l[" #label "] \n\t"                          \
                        ".pushsection __jump_table,  \"a\" \n\t"           \
                        _ASM_PTR "1b, %l[" #label "], %c0, 0 \n\t"         \
                        ".popsection \n\t"                                 \
                        : :  "i" (__jlstrtab_##tag) :  : label);           \
        } while (0)



and the usage case is:

#define DECLARE_TRACE(name, proto, args)                                 \
        extern struct tracepoint __tracepoint_##name;                    \
        static inline void trace_##name(proto)                           \
        {                                                                \
                JUMP_LABEL(name, trace_label, __tracepoint_##name.state);\
                __DO_TRACE(&__tracepoint_##name,                         \
                           TP_PROTO(proto), TP_ARGS(args));              \
trace_label:                                                             \
                return;                                                  \
        }                                                                \


So the idea now is that by default we jump over the disabled code. In order to
enable the 'disabled' code, or tracing, we simply replace the jump with a jump
of offset 0 (i.e. a jump to the very next instruction), so that execution
falls through into the tracing code. Thus, we avoid the issue of not having
enough instruction space. In most cases the jump we are patching is an 0xeb
opcode, which is just 2 bytes (occasionally it's an 0xe9, which is 5 bytes).
I have this patchset working nicely on x86_64. I'm not sure if we need to
recognize additional jump opcodes for 32-bit x86...I have not yet tested
there. In my x86_64 testing under memory pressure I saw about a 30 cycle
improvement per tracepoint with this code. Additionally, this code reduces
the icache footprint from 9 bytes (cmpl; je) to just 2 bytes (jmp).
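
For illustration (made-up addresses, the common 2-byte short-jump case), the
disabled/default state looks something like:

	0x00:	jmp 0x20	# eb 1e: jump over the tracing code
	0x02:	<tracing code>	# never reached
	...
	0x20:	<hot path>

and enabling a site just rewrites the displacement byte to zero:

	0x00:	jmp 0x02	# eb 00: fall through into the tracing code
	0x02:	<tracing code>
	...
	0x20:	<hot path>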

The patchset makes use of the text_poke_fixup() interface introduced by Masami
in: http://marc.info/?l=linux-kernel&m=125297186224609&w=2

That patchset needs to be applied first. The patches here are against the
latest -tip tree.

thanks,

-Jason



 arch/x86/include/asm/jump_label.h |   27 +++++++++++
 arch/x86/kernel/Makefile          |    2 +-
 arch/x86/kernel/jump_label.c      |   94 +++++++++++++++++++++++++++++++++++++
 include/asm-generic/vmlinux.lds.h |   11 ++++
 include/linux/jump_label.h        |   71 ++++++++++++++++++++++++++++
 include/linux/kernel.h            |    1 +
 include/linux/module.h            |   12 ++++-
 include/linux/tracepoint.h        |   31 +++++++------
 kernel/extable.c                  |    2 +-
 kernel/module.c                   |   31 ++++++++++++
 kernel/tracepoint.c               |   10 ++++
 11 files changed, 275 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/include/asm/jump_label.h
 create mode 100644 arch/x86/kernel/jump_label.c
 create mode 100644 include/linux/jump_label.h



* [PATCH 1/4] jump label - make init_kernel_text() global
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
@ 2009-09-24 23:17 ` Jason Baron
  2009-10-01 11:20   ` Ingo Molnar
  2009-09-24 23:17 ` [PATCH 2/4] jump label - base patch Jason Baron
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Jason Baron @ 2009-09-24 23:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, tglx, rostedt, ak, roland, rth, mhiramat


allow usage of init_kernel_text - we need this in jump labeling to avoid
attemtpting to patch code that has been freed as in the __init sections


Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 include/linux/kernel.h |    1 +
 kernel/extable.c       |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f61039e..9d3419f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
 extern char *get_options(const char *str, int nints, int *ints);
 extern unsigned long long memparse(const char *ptr, char **retptr);
 
+extern int init_kernel_text(unsigned long addr);
 extern int core_kernel_text(unsigned long addr);
 extern int __kernel_text_address(unsigned long addr);
 extern int kernel_text_address(unsigned long addr);
diff --git a/kernel/extable.c b/kernel/extable.c
index 7f8f263..f6893ad 100644
--- a/kernel/extable.c
+++ b/kernel/extable.c
@@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
 	return e;
 }
 
-static inline int init_kernel_text(unsigned long addr)
+int init_kernel_text(unsigned long addr)
 {
 	if (addr >= (unsigned long)_sinittext &&
 	    addr <= (unsigned long)_einittext)
-- 
1.6.2.5



* [PATCH 2/4] jump label - base patch
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
  2009-09-24 23:17 ` [PATCH 1/4] jump label - make init_kernel_text() global Jason Baron
@ 2009-09-24 23:17 ` Jason Baron
  2009-09-25  0:49   ` Roland McGrath
  2009-10-01 11:36   ` Ingo Molnar
  2009-09-24 23:17 ` [PATCH 3/4] jump label - add module support Jason Baron
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Jason Baron @ 2009-09-24 23:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, tglx, rostedt, ak, roland, rth, mhiramat


Base jump label patch which creates macros for jump patching. This feature is
dependent on gcc version >= 4.5

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 arch/x86/include/asm/jump_label.h |   27 +++++++++++
 arch/x86/kernel/Makefile          |    2 +-
 arch/x86/kernel/jump_label.c      |   92 +++++++++++++++++++++++++++++++++++++
 include/asm-generic/vmlinux.lds.h |   11 ++++
 include/linux/jump_label.h        |   53 +++++++++++++++++++++
 5 files changed, 184 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/include/asm/jump_label.h
 create mode 100644 arch/x86/kernel/jump_label.c
 create mode 100644 include/linux/jump_label.h

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
new file mode 100644
index 0000000..21a7b79
--- /dev/null
+++ b/arch/x86/include/asm/jump_label.h
@@ -0,0 +1,27 @@
+#ifndef _ASM_X86_JUMP_LABEL_H
+#define _ASM_X86_JUMP_LABEL_H
+
+#include <asm/nops.h>
+
+#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
+#define __HAVE_ARCH_JUMP_LABEL
+#endif
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+#define JUMP_LABEL(tag, label, cond)                                       \
+	do {								   \
+		static const char __jlstrtab_##tag[]                       \
+		__used __attribute__((section("__jump_strings")))  = #tag; \
+		asm goto ("1:"						   \
+			"jmp %l[" #label "] \n\t"		           \
+			".pushsection __jump_table,  \"a\" \n\t"	   \
+			_ASM_PTR "1b, %l[" #label "], %c0, 0 \n\t"	   \
+			".popsection \n\t"				   \
+			: :  "i" (__jlstrtab_##tag) :  : label);	   \
+	} while (0)
+
+#endif
+
+#endif
+
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 91d4189..b6b3e42 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -32,7 +32,7 @@ GCOV_PROFILE_paravirt.o		:= n
 obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o ldt.o dumpstack.o
-obj-y			+= setup.o x86_init.o i8259.o irqinit.o
+obj-y			+= setup.o x86_init.o i8259.o irqinit.o jump_label.o
 obj-$(CONFIG_X86_VISWS)	+= visws_quirks.o
 obj-$(CONFIG_X86_32)	+= probe_roms_32.o
 obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
new file mode 100644
index 0000000..207947d
--- /dev/null
+++ b/arch/x86/kernel/jump_label.c
@@ -0,0 +1,92 @@
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/kprobes.h>
+#include <linux/uaccess.h>
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern struct jump_entry __start___jump_table[];
+extern struct jump_entry __stop___jump_table[];
+
+DEFINE_MUTEX(jump_label_mutex);
+
+union jump_code_union {
+	char code[RELATIVEJUMP_SIZE];
+	struct {
+		char jump;
+		int offset;
+	} __attribute__((packed));
+};
+
+void jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
+{
+	union jump_code_union code;
+	long jump_length;
+	unsigned char opcode;
+	size_t opcode_length;
+
+	if (entry->type == UNKOWN_JUMP) {
+		if (probe_kernel_read(&opcode, (void *)entry->code, 1))
+			BUG();
+
+		if (opcode == 0xe9)
+			entry->type = LONG_JUMP;
+		else if (opcode == 0xeb)
+			entry->type = SHORT_JUMP;
+		else
+			BUG();
+	}
+
+	if (entry->type == LONG_JUMP)
+		opcode_length = 5;
+	else
+		opcode_length = 2;
+
+	if (type == JUMP_LABEL_ENABLE) {
+		if (entry->type == SHORT_JUMP) {
+			code.code[0] = 0xeb;
+			code.code[1] = 0;
+		} else {
+			code.code[0] = 0xe9;
+			code.offset = 0;
+		}
+	} else {
+		jump_length = entry->target - (entry->code + opcode_length);
+		if (entry->type == SHORT_JUMP) {
+			if ((jump_length > 127 || jump_length < -128)) {
+				BUG();
+			}
+			code.code[0] = 0xeb;
+			code.code[1] = (char)jump_length;
+		} else {
+			if ((jump_length > INT_MAX || jump_length < INT_MIN)) {
+				BUG();
+			}
+			code.code[0] = 0xe9;
+			code.offset = jump_length;
+		}
+	}
+
+	mutex_lock(&text_mutex);
+	text_poke_fixup((void *)entry->code, &code,
+			opcode_length, (void *)entry->code + opcode_length);
+	mutex_unlock(&text_mutex);
+}
+
+int jump_label_update(const char *name, enum jump_label_type type)
+{
+	struct jump_entry *iter;
+	
+	mutex_lock(&jump_label_mutex);
+	for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
+		if (!strcmp(name, iter->name)) {
+			if (!(system_state == SYSTEM_RUNNING &&
+					(init_kernel_text(iter->code))))
+				jump_label_transform(iter, type);
+		}
+	}
+	mutex_unlock(&jump_label_mutex);
+	return 0;
+}
+
+#endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index e7eae37..d2f38e2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -213,6 +213,7 @@
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
 		*(__tracepoints_strings)/* Tracepoints: strings */	\
+		*(__jump_strings)/* Jump: strings */	\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
@@ -220,6 +221,8 @@
 	}								\
 									\
 	BUG_TABLE							\
+	JUMP_TABLE							\
+									\
 									\
 	/* PCI quirks */						\
 	.pci_fixup        : AT(ADDR(.pci_fixup) - LOAD_OFFSET) {	\
@@ -564,6 +567,14 @@
 #define BUG_TABLE
 #endif
 
+#define JUMP_TABLE							\
+	. = ALIGN(8);							\
+	__jump_table : AT(ADDR(__jump_table) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start___jump_table) = .;		\
+		*(__jump_table)						\
+		VMLINUX_SYMBOL(__stop___jump_table) = .;			\
+	}
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
new file mode 100644
index 0000000..eb0a4ef
--- /dev/null
+++ b/include/linux/jump_label.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_JUMP_LABEL_H
+#define _LINUX_JUMP_LABEL_H
+
+#include <asm/jump_label.h>
+
+struct jump_entry {
+	unsigned long code;
+	unsigned long target;
+	char *name;
+	unsigned long type;
+};
+
+enum jump_label_type {
+	JUMP_LABEL_ENABLE,
+	JUMP_LABEL_DISABLE
+};
+
+enum jump_label_instruction {
+	UNKOWN_JUMP,
+	SHORT_JUMP,
+	LONG_JUMP
+};
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern int jump_label_update(const char *name, enum jump_label_type type);
+
+#define enable_jump_label(name) \
+	jump_label_update(name, JUMP_LABEL_ENABLE);
+
+#define disable_jump_label(name) \
+	jump_label_update(name, JUMP_LABEL_DISABLE);
+
+#else
+
+#define JUMP_LABEL(tag, label, cond)		\
+	if (likely(!cond))			\
+		goto label;
+
+static inline int enable_jump_label(const char *name, enum jump_label_type type)
+{
+	return 0;
+}
+
+static inline int disable_jump_label(const char *name,
+				     enum jump_label_type type)
+{
+	return 0;
+}
+
+#endif
+
+#endif
-- 
1.6.2.5



* [PATCH 3/4] jump label - add module support
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
  2009-09-24 23:17 ` [PATCH 1/4] jump label - make init_kernel_text() global Jason Baron
  2009-09-24 23:17 ` [PATCH 2/4] jump label - base patch Jason Baron
@ 2009-09-24 23:17 ` Jason Baron
  2009-09-24 23:18 ` [PATCH 4/4] jump label - tracepoint implementation Jason Baron
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Jason Baron @ 2009-09-24 23:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, tglx, rostedt, ak, roland, rth, mhiramat



Add module support to jump label patching.

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 arch/x86/kernel/jump_label.c |    4 +++-
 include/linux/jump_label.h   |   34 ++++++++++++++++++++++++++--------
 include/linux/module.h       |   12 +++++++++++-
 kernel/module.c              |   31 +++++++++++++++++++++++++++++++
 4 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 207947d..d057c3f 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -2,6 +2,7 @@
 #include <linux/memory.h>
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
+#include <linux/module.h>
 
 #ifdef __HAVE_ARCH_JUMP_LABEL
 
@@ -73,7 +74,7 @@ void jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
 	mutex_unlock(&text_mutex);
 }
 
-int jump_label_update(const char *name, enum jump_label_type type)
+int jump_label_update(const char *name, enum jump_label_type type, int module_lock)
 {
 	struct jump_entry *iter;
 	
@@ -85,6 +86,7 @@ int jump_label_update(const char *name, enum jump_label_type type)
 				jump_label_transform(iter, type);
 		}
 	}
+	jump_label_update_modules(name, type, module_lock);
 	mutex_unlock(&jump_label_mutex);
 	return 0;
 }
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index eb0a4ef..1033d35 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -23,13 +23,22 @@ enum jump_label_instruction {
 
 #ifdef __HAVE_ARCH_JUMP_LABEL
 
-extern int jump_label_update(const char *name, enum jump_label_type type);
+extern int jump_label_update(const char *name, enum jump_label_type type,
+			     int module_lock);
+extern void jump_label_transform(struct jump_entry *entry,
+				 enum jump_label_type type);
 
-#define enable_jump_label(name) \
-	jump_label_update(name, JUMP_LABEL_ENABLE);
+#define enable_jump_label_locked(name) \
+	jump_label_update(name, JUMP_LABEL_ENABLE, 1);
 
-#define disable_jump_label(name) \
-	jump_label_update(name, JUMP_LABEL_DISABLE);
+#define disable_jump_label_locked(name) \
+	jump_label_update(name, JUMP_LABEL_DISABLE, 1);
+
+#define enable_jump_label_unlocked(name) \
+	jump_label_update(name, JUMP_LABEL_ENABLE, 0);
+
+#define disable_jump_label_unlocked(name) \
+	jump_label_update(name, JUMP_LABEL_DISABLE, 0);
 
 #else
 
@@ -37,13 +46,22 @@ extern int jump_label_update(const char *name, enum jump_label_type type);
 	if (likely(!cond))			\
 		goto label;
 
-static inline int enable_jump_label(const char *name, enum jump_label_type type)
+static inline int enable_jump_label_locked(const char *name)
+{
+	return 0;
+}
+
+static inline int disable_jump_label_locked(const char *name)
+{
+	return 0;
+}
+
+static inline int enable_jump_label_unlocked(const char *name)
 {
 	return 0;
 }
 
-static inline int disable_jump_label(const char *name,
-				     enum jump_label_type type)
+static inline int disable_jump_label_unlocked(const char *name)
 {
 	return 0;
 }
diff --git a/include/linux/module.h b/include/linux/module.h
index 86863cd..911f5c9 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/marker.h>
 #include <linux/tracepoint.h>
+#include <linux/jump_label.h>
 
 #include <asm/local.h>
 #include <asm/module.h>
@@ -353,7 +354,10 @@ struct module
 	struct tracepoint *tracepoints;
 	unsigned int num_tracepoints;
 #endif
-
+#ifdef __HAVE_ARCH_JUMP_LABEL
+	struct jump_entry *jump_entries;
+	unsigned int num_jump_entries;
+#endif
 #ifdef CONFIG_TRACING
 	const char **trace_bprintk_fmt_start;
 	unsigned int num_trace_bprintk_fmt;
@@ -557,6 +561,7 @@ extern void module_update_markers(void);
 
 extern void module_update_tracepoints(void);
 extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+extern void jump_label_update_modules(const char *name, enum jump_label_type type, int module_lock);
 
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
@@ -682,6 +687,11 @@ static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
 	return 0;
 }
 
+static inline void jump_label_update_modules(const char *name, enum jump_label_type type, int module_lock)
+{
+	return;
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
diff --git a/kernel/module.c b/kernel/module.c
index 46580ed..01c2343 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -54,6 +54,7 @@
 #include <linux/async.h>
 #include <linux/percpu.h>
 #include <linux/kmemleak.h>
+#include <linux/jump_label.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
@@ -2247,6 +2248,12 @@ static noinline struct module *load_module(void __user *umod,
 					sizeof(*mod->tracepoints),
 					&mod->num_tracepoints);
 #endif
+#ifdef __HAVE_ARCH_JUMP_LABEL
+	mod->jump_entries = section_objs(hdr, sechdrs, secstrings,
+					"__jump_table",
+					sizeof(*mod->jump_entries),
+					&mod->num_jump_entries);
+#endif
 #ifdef CONFIG_EVENT_TRACING
 	mod->trace_events = section_objs(hdr, sechdrs, secstrings,
 					 "_ftrace_events",
@@ -3017,4 +3024,28 @@ int module_get_iter_tracepoints(struct tracepoint_iter *iter)
 	mutex_unlock(&module_mutex);
 	return found;
 }
+
+#endif
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+void jump_label_update_modules(const char *name, enum jump_label_type type, int module_lock)
+{
+	struct module *iter_mod;
+	struct jump_entry *entry;
+
+	if (!module_lock)
+		mutex_lock(&module_mutex);
+	list_for_each_entry(iter_mod, &modules, list) {
+		for(entry = iter_mod->jump_entries;
+		    entry < iter_mod->jump_entries + iter_mod->num_jump_entries;
+		    entry++) {
+				if (!strcmp(name, entry->name))
+					jump_label_transform(entry, type);
+		}
+	}
+	if (!module_lock)
+		mutex_unlock(&module_mutex);
+}
+
 #endif
-- 
1.6.2.5



* [PATCH 4/4] jump label - tracepoint implementation
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
                   ` (2 preceding siblings ...)
  2009-09-24 23:17 ` [PATCH 3/4] jump label - add module support Jason Baron
@ 2009-09-24 23:18 ` Jason Baron
  2009-10-06  5:39 ` [PATCH 0/4] jump label patches Roland McGrath
  2009-10-06  6:04 ` Roland McGrath
  5 siblings, 0 replies; 29+ messages in thread
From: Jason Baron @ 2009-09-24 23:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, tglx, rostedt, ak, roland, rth, mhiramat


Implement the tracepoint 'if' conditional on top of jump labels.

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 include/linux/tracepoint.h |   31 +++++++++++++++++--------------
 kernel/tracepoint.c        |   10 ++++++++++
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 63a3f7a..e7e2910 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -16,6 +16,7 @@
 
 #include <linux/types.h>
 #include <linux/rcupdate.h>
+#include <linux/jump_label.h>
 
 struct module;
 struct tracepoint;
@@ -63,20 +64,22 @@ struct tracepoint {
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
  */
-#define DECLARE_TRACE(name, proto, args)				\
-	extern struct tracepoint __tracepoint_##name;			\
-	static inline void trace_##name(proto)				\
-	{								\
-		if (unlikely(__tracepoint_##name.state))		\
-			__DO_TRACE(&__tracepoint_##name,		\
-				TP_PROTO(proto), TP_ARGS(args));	\
-	}								\
-	static inline int register_trace_##name(void (*probe)(proto))	\
-	{								\
-		return tracepoint_probe_register(#name, (void *)probe);	\
-	}								\
-	static inline int unregister_trace_##name(void (*probe)(proto))	\
-	{								\
+#define DECLARE_TRACE(name, proto, args)				 \
+	extern struct tracepoint __tracepoint_##name;			 \
+	static inline void trace_##name(proto)				 \
+	{								 \
+		JUMP_LABEL(name, trace_label, __tracepoint_##name.state);\
+		__DO_TRACE(&__tracepoint_##name,			 \
+			   TP_PROTO(proto), TP_ARGS(args));		 \
+trace_label:								 \
+		return;							 \
+	}								 \
+	static inline int register_trace_##name(void (*probe)(proto))	 \
+	{								 \
+		return tracepoint_probe_register(#name, (void *)probe);	 \
+	}								 \
+	static inline int unregister_trace_##name(void (*probe)(proto))	 \
+	{								 \
 		return tracepoint_probe_unregister(#name, (void *)probe);\
 	}
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 9489a0a..e06bce8 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/jump_label.h>
 
 extern struct tracepoint __start___tracepoints[];
 extern struct tracepoint __stop___tracepoints[];
@@ -256,6 +257,12 @@ static void set_tracepoint(struct tracepoint_entry **entry,
 	 * is used.
 	 */
 	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+
+	if (!elem->state && active) {
+		enable_jump_label_locked(elem->name);
+	} else if (elem->state && !active)
+		disable_jump_label_locked(elem->name);
+
 	elem->state = active;
 }
 
@@ -270,6 +277,9 @@ static void disable_tracepoint(struct tracepoint *elem)
 	if (elem->unregfunc && elem->state)
 		elem->unregfunc();
 
+	if (elem->state)
+		disable_jump_label_locked(elem->name);
+
 	elem->state = 0;
 	rcu_assign_pointer(elem->funcs, NULL);
 }
-- 
1.6.2.5



* Re: [PATCH 2/4] jump label - base patch
  2009-09-24 23:17 ` [PATCH 2/4] jump label - base patch Jason Baron
@ 2009-09-25  0:49   ` Roland McGrath
  2009-09-26 10:21     ` Steven Rostedt
  2009-10-01 11:36   ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Roland McGrath @ 2009-09-25  0:49 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, ak, rth, mhiramat

> +#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
> +#define __HAVE_ARCH_JUMP_LABEL
> +#endif

Put this in linux/compiler-gcc4.h (there, only test __GNUC_MINOR__).
If icc/whoever adds asm goto they can define it too.
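
I.e. something along these lines in compiler-gcc4.h (the macro name here is
just a suggestion):

	#if __GNUC_MINOR__ >= 5
	# define __HAVE_ASM_GOTO
	#endif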


Thanks,
Roland


* Re: [PATCH 2/4] jump label - base patch
  2009-09-25  0:49   ` Roland McGrath
@ 2009-09-26 10:21     ` Steven Rostedt
  0 siblings, 0 replies; 29+ messages in thread
From: Steven Rostedt @ 2009-09-26 10:21 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx, ak,
	rth, mhiramat

On Thu, 2009-09-24 at 17:49 -0700, Roland McGrath wrote:
> > +#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
> > +#define __HAVE_ARCH_JUMP_LABEL
> > +#endif
> 
> Put this in linux/compiler-gcc4.h (there, only test __GNUC_MINOR__).
> If icc/whoever adds asm goto they can define it too.

We need to be careful here. This code also depends on the text_poke
implementation, which is not defined for every arch.

The code around DEFINE_TRACE can be updated to test for
__HAVE_ARCH_JUMP_LABEL as well as the config that tells us whether we have
text_poke.
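
Something like this (untested, and CONFIG_ARCH_SUPPORTS_TEXT_POKE is just a
stand-in for whatever symbol ends up guarding the text_poke code):

	#if defined(__HAVE_ARCH_JUMP_LABEL) && \
	    defined(CONFIG_ARCH_SUPPORTS_TEXT_POKE)
	/* use the asm goto patching version of JUMP_LABEL */
	#else
	# define JUMP_LABEL(tag, label, cond)		\
		if (likely(!cond))			\
			goto label;
	#endif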

-- Steve




* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-09-24 23:17 ` [PATCH 1/4] jump label - make init_kernel_text() global Jason Baron
@ 2009-10-01 11:20   ` Ingo Molnar
  2009-10-01 12:58     ` Mathieu Desnoyers
  2009-10-01 20:39     ` Jason Baron
  0 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2009-10-01 11:20 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mathieu.desnoyers, tglx, rostedt, ak, roland, rth,
	mhiramat


* Jason Baron <jbaron@redhat.com> wrote:

> allow usage of init_kernel_text - we need this in jump labeling to 
> avoid attemtpting to patch code that has been freed as in the __init 
> sections

s/attemtpting/attempting

> Signed-off-by: Jason Baron <jbaron@redhat.com>
> ---
>  include/linux/kernel.h |    1 +
>  kernel/extable.c       |    2 +-
>  2 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index f61039e..9d3419f 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
>  extern char *get_options(const char *str, int nints, int *ints);
>  extern unsigned long long memparse(const char *ptr, char **retptr);
>  
> +extern int init_kernel_text(unsigned long addr);
>  extern int core_kernel_text(unsigned long addr);
>  extern int __kernel_text_address(unsigned long addr);
>  extern int kernel_text_address(unsigned long addr);
> diff --git a/kernel/extable.c b/kernel/extable.c
> index 7f8f263..f6893ad 100644
> --- a/kernel/extable.c
> +++ b/kernel/extable.c
> @@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
>  	return e;
>  }
>  
> -static inline int init_kernel_text(unsigned long addr)
> +int init_kernel_text(unsigned long addr)
>  {
>  	if (addr >= (unsigned long)_sinittext &&
>  	    addr <= (unsigned long)_einittext)

i'm confused. Later on jump_label_update() does:

+                       if (!(system_state == SYSTEM_RUNNING &&
+                                       (init_kernel_text(iter->code))))
+                               jump_label_transform(iter, type);

which is:

+                       if (system_state != SYSTEM_RUNNING ||
+                                       !init_kernel_text(iter->code)))
+                               jump_label_transform(iter, type);

What is the logic behind that? System going into SYSTEM_RUNNING does not 
coincide with free_initmem() precisely.

Also, do we ever want to patch init-text tracepoints? I think we want to 
stay away from them as much as possible.

It appears to me that what we want here is a straight:

                       if (kernel_text(iter->code))
                               jump_label_transform(iter, type);

Also, maybe a WARN_ONCE(!kernel_text()) - we should never even attempt 
to transform non-patchable code. If yes then we want to know about that 
in a noisy way and not skip it silently.
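
I.e., using kernel_text_address(), something like:

	if (WARN_ONCE(!kernel_text_address(iter->code),
		      "jump label: attempt to patch non-text address"))
		continue;
	jump_label_transform(iter, type);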

	Ingo


* Re: [PATCH 2/4] jump label - base patch
  2009-09-24 23:17 ` [PATCH 2/4] jump label - base patch Jason Baron
  2009-09-25  0:49   ` Roland McGrath
@ 2009-10-01 11:36   ` Ingo Molnar
  1 sibling, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2009-10-01 11:36 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mathieu.desnoyers, tglx, rostedt, ak, roland, rth,
	mhiramat


* Jason Baron <jbaron@redhat.com> wrote:

> Base jump label patch which creates macros for jump patching. This 
> feature is dependent on gcc version >= 4.5
> 
> Signed-off-by: Jason Baron <jbaron@redhat.com>
> ---
>  arch/x86/include/asm/jump_label.h |   27 +++++++++++
>  arch/x86/kernel/Makefile          |    2 +-
>  arch/x86/kernel/jump_label.c      |   92 +++++++++++++++++++++++++++++++++++++
>  include/asm-generic/vmlinux.lds.h |   11 ++++
>  include/linux/jump_label.h        |   53 +++++++++++++++++++++
>  5 files changed, 184 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/include/asm/jump_label.h
>  create mode 100644 arch/x86/kernel/jump_label.c
>  create mode 100644 include/linux/jump_label.h
> 
> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
> new file mode 100644
> index 0000000..21a7b79
> --- /dev/null
> +++ b/arch/x86/include/asm/jump_label.h
> @@ -0,0 +1,27 @@
> +#ifndef _ASM_X86_JUMP_LABEL_H
> +#define _ASM_X86_JUMP_LABEL_H
> +
> +#include <asm/nops.h>
> +
> +#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
> +#define __HAVE_ARCH_JUMP_LABEL
> +#endif

that should be:

> +#if (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))

Also, please use:

#if ...
# define
#endif

whitespace constructs to keep them more readable.

> +
> +#ifdef __HAVE_ARCH_JUMP_LABEL
> +
> +#define JUMP_LABEL(tag, label, cond)                                       \

(these spaces should be tabs)

> +	do {								   \
> +		static const char __jlstrtab_##tag[]                       \
> +		__used __attribute__((section("__jump_strings")))  = #tag; \
> +		asm goto ("1:"						   \
> +			"jmp %l[" #label "] \n\t"		           \
> +			".pushsection __jump_table,  \"a\" \n\t"	   \
> +			_ASM_PTR "1b, %l[" #label "], %c0, 0 \n\t"	   \
> +			".popsection \n\t"				   \
> +			: :  "i" (__jlstrtab_##tag) :  : label);	   \
> +	} while (0)
> +
> +#endif
> +
> +#endif
> +
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 91d4189..b6b3e42 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -32,7 +32,7 @@ GCOV_PROFILE_paravirt.o		:= n
>  obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
>  obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
>  obj-y			+= time.o ioport.o ldt.o dumpstack.o
> -obj-y			+= setup.o x86_init.o i8259.o irqinit.o
> +obj-y			+= setup.o x86_init.o i8259.o irqinit.o jump_label.o
>  obj-$(CONFIG_X86_VISWS)	+= visws_quirks.o
>  obj-$(CONFIG_X86_32)	+= probe_roms_32.o
>  obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> new file mode 100644
> index 0000000..207947d
> --- /dev/null
> +++ b/arch/x86/kernel/jump_label.c
> @@ -0,0 +1,92 @@
> +#include <linux/jump_label.h>
> +#include <linux/memory.h>
> +#include <linux/kprobes.h>
> +#include <linux/uaccess.h>
> +
> +#ifdef __HAVE_ARCH_JUMP_LABEL
> +
> +extern struct jump_entry __start___jump_table[];
> +extern struct jump_entry __stop___jump_table[];

externs should go into .h's.

> +
> +DEFINE_MUTEX(jump_label_mutex);

Shouldn't this code rely on text_mutex, which is our generic 'patch 
kernel text' exclusion primitive?

> +
> +union jump_code_union {
> +	char code[RELATIVEJUMP_SIZE];
> +	struct {
> +		char jump;
> +		int offset;
> +	} __attribute__((packed));
> +};
> +
> +void jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
> +{
> +	union jump_code_union code;
> +	long jump_length;
> +	unsigned char opcode;
> +	size_t opcode_length;
> +
> +	if (entry->type == UNKOWN_JUMP) {
> +		if (probe_kernel_read(&opcode, (void *)entry->code, 1))
> +			BUG();

a BUG() is not nice. It results in fewer decodable bug reports than a 
WARN_ON_ONCE().
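
I.e. something like:

	if (WARN_ON_ONCE(probe_kernel_read(&opcode, (void *)entry->code, 1)))
		return;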

> +
> +		if (opcode == 0xe9)
> +			entry->type = LONG_JUMP;
> +		else if (opcode == 0xeb)
> +			entry->type = SHORT_JUMP;
> +		else
> +			BUG();

ditto.

> +	}
> +
> +	if (entry->type == LONG_JUMP)
> +		opcode_length = 5;
> +	else
> +		opcode_length = 2;
> +
> +	if (type == JUMP_LABEL_ENABLE) {
> +		if (entry->type == SHORT_JUMP) {
> +			code.code[0] = 0xeb;
> +			code.code[1] = 0;
> +		} else {
> +			code.code[0] = 0xe9;
> +			code.offset = 0;
> +		}
> +	} else {
> +		jump_length = entry->target - (entry->code + opcode_length);
> +		if (entry->type == SHORT_JUMP) {
> +			if ((jump_length > 127 || jump_length < -128)) {
> +				BUG();
> +			}

WARN_ON_ONCE() again.

> +			code.code[0] = 0xeb;
> +			code.code[1] = (char)jump_length;
> +		} else {
> +			if ((jump_length > INT_MAX || jump_length < INT_MIN)) {
> +				BUG();
> +			}

ditto.

> +			code.code[0] = 0xe9;
> +			code.offset = jump_length;
> +		}
> +	}
> +
> +	mutex_lock(&text_mutex);
> +	text_poke_fixup((void *)entry->code, &code,
> +			opcode_length, (void *)entry->code + opcode_length);
> +	mutex_unlock(&text_mutex);

Note that here we take the text_mutex anyway. Can we drop jump_label_mutex 
and take text_mutex wider?
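
I.e. something like:

	mutex_lock(&text_mutex);
	for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
		...
	}
	mutex_unlock(&text_mutex);

with jump_label_transform() then no longer taking text_mutex itself.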

> +}
> +
> +int jump_label_update(const char *name, enum jump_label_type type)
> +{
> +	struct jump_entry *iter;
> +	
> +	mutex_lock(&jump_label_mutex);
> +	for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {

how big can this table become, realistically, long term?

> +		if (!strcmp(name, iter->name)) {
> +			if (!(system_state == SYSTEM_RUNNING &&
> +					(init_kernel_text(iter->code))))
> +				jump_label_transform(iter, type);
> +		}
> +	}
> +	mutex_unlock(&jump_label_mutex);
> +	return 0;
> +}

We always seem to return 0 - should be void?

> +
> +#endif
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index e7eae37..d2f38e2 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -213,6 +213,7 @@
>  		*(__vermagic)		/* Kernel version magic */	\
>  		*(__markers_strings)	/* Markers: strings */		\
> 		*(__tracepoints_strings)/* Tracepoints: strings */	\
> +		*(__jump_strings)/* Jump: strings */	\
>  	}								\

inserted line does not follow existing style carefully enough.

>  									\
>  	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
> @@ -220,6 +221,8 @@
>  	}								\
>  									\
>  	BUG_TABLE							\
> +	JUMP_TABLE							\
> +									\
>  									\
>  	/* PCI quirks */						\
>  	.pci_fixup        : AT(ADDR(.pci_fixup) - LOAD_OFFSET) {	\
> @@ -564,6 +567,14 @@
>  #define BUG_TABLE
>  #endif
>  
> +#define JUMP_TABLE							\
> +	. = ALIGN(8);							\
> +	__jump_table : AT(ADDR(__jump_table) - LOAD_OFFSET) {		\
> +		VMLINUX_SYMBOL(__start___jump_table) = .;		\
> +		*(__jump_table)						\
> +		VMLINUX_SYMBOL(__stop___jump_table) = .;			\
> +	}

ditto.

> +
>  #ifdef CONFIG_PM_TRACE
>  #define TRACEDATA							\
>  	. = ALIGN(4);							\
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> new file mode 100644
> index 0000000..eb0a4ef
> --- /dev/null
> +++ b/include/linux/jump_label.h
> @@ -0,0 +1,53 @@
> +#ifndef _LINUX_JUMP_LABEL_H
> +#define _LINUX_JUMP_LABEL_H
> +
> +#include <asm/jump_label.h>
> +
> +struct jump_entry {
> +	unsigned long code;
> +	unsigned long target;
> +	char *name;
> +	unsigned long type;
> +};

please use the new structure-definition alignment style, as in 
include/linux/perf_event.h.
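
I.e. something like:

	struct jump_entry {
		unsigned long		code;
		unsigned long		target;
		char			*name;
		unsigned long		type;
	};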

> +
> +enum jump_label_type {
> +	JUMP_LABEL_ENABLE,
> +	JUMP_LABEL_DISABLE
> +};

so 0 is enable and 1 is disable?

> +
> +enum jump_label_instruction {
> +	UNKOWN_JUMP,
> +	SHORT_JUMP,
> +	LONG_JUMP
> +};
> +
> +#ifdef __HAVE_ARCH_JUMP_LABEL
> +
> +extern int jump_label_update(const char *name, enum jump_label_type type);
> +
> +#define enable_jump_label(name) \
> +	jump_label_update(name, JUMP_LABEL_ENABLE);
> +
> +#define disable_jump_label(name) \
> +	jump_label_update(name, JUMP_LABEL_DISABLE);
> +
> +#else
> +
> +#define JUMP_LABEL(tag, label, cond)		\
> +	if (likely(!cond))			\
> +		goto label;
> +
> +static inline int enable_jump_label(const char *name, enum jump_label_type type)
> +{
> +	return 0;
> +}
> +
> +static inline int disable_jump_label(const char *name,
> +				     enum jump_label_type type)

just keep the prototype in a single line. (and ignore checkpatch for 
that one)

> +{
> +	return 0;
> +}
> +
> +#endif
> +
> +#endif
> -- 
> 1.6.2.5

	Ingo


* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-01 11:20   ` Ingo Molnar
@ 2009-10-01 12:58     ` Mathieu Desnoyers
  2009-10-01 20:39     ` Jason Baron
  1 sibling, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-01 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, linux-kernel, tglx, rostedt, ak, roland, rth, mhiramat

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > allow usage of init_kernel_text - we need this in jump labeling to 
> > avoid attemtpting to patch code that has been freed as in the __init 
> > sections
> 
> s/attemtpting/attempting
> 
> > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > ---
> >  include/linux/kernel.h |    1 +
> >  kernel/extable.c       |    2 +-
> >  2 files changed, 2 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index f61039e..9d3419f 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
> >  extern char *get_options(const char *str, int nints, int *ints);
> >  extern unsigned long long memparse(const char *ptr, char **retptr);
> >  
> > +extern int init_kernel_text(unsigned long addr);
> >  extern int core_kernel_text(unsigned long addr);
> >  extern int __kernel_text_address(unsigned long addr);
> >  extern int kernel_text_address(unsigned long addr);
> > diff --git a/kernel/extable.c b/kernel/extable.c
> > index 7f8f263..f6893ad 100644
> > --- a/kernel/extable.c
> > +++ b/kernel/extable.c
> > @@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
> >  	return e;
> >  }
> >  
> > -static inline int init_kernel_text(unsigned long addr)
> > +int init_kernel_text(unsigned long addr)
> >  {
> >  	if (addr >= (unsigned long)_sinittext &&
> >  	    addr <= (unsigned long)_einittext)
> 
> i'm confused. Later on jump_label_update() does:
> 
> +                       if (!(system_state == SYSTEM_RUNNING &&
> +                                       (init_kernel_text(iter->code))))
> +                               jump_label_transform(iter, type);
> 
> which is:
> 
> +                       if (system_state != SYSTEM_RUNNING ||
> +                                       !init_kernel_text(iter->code)))
> +                               jump_label_transform(iter, type);
> 
> What is the logic behind that? System going into SYSTEM_RUNNING does not 
> coincide with free_initmem() precisely.
> 
> Also, do we ever want to patch init-text tracepoints? I think we want to 
> stay away from them as much as possible.

My experience with tracepoint instrumentation is that as soon as we start
using them in inline functions (e.g. memory allocation, interrupt
entry/exit), we start making these functions unusable in init code. That's
a very unwelcome side-effect that I'd like to prevent by making sure
tracepoints and the patching mechanism underneath support init code
patching. Note that it does not have to be complicated.

But I think you are right that the logic seems fuzzy in this specific
spot of the patch.

Thanks,

Mathieu

> 
> It appears to me that what we want here is a straight:
> 
>                        if (kernel_text(iter->code))
>                                jump_label_transform(iter, type);
> 
> Also, maybe a WARN_ONCE(!kernel_text()) - we should never even attempt 
> to transform non-patchable code. If yes then we want to know about that 
> in a noisy way and not skip it silently.
> 
> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-01 11:20   ` Ingo Molnar
  2009-10-01 12:58     ` Mathieu Desnoyers
@ 2009-10-01 20:39     ` Jason Baron
  2009-10-03 10:43       ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Baron @ 2009-10-01 20:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mathieu.desnoyers, tglx, rostedt, ak, roland, rth,
	mhiramat

On Thu, Oct 01, 2009 at 01:20:03PM +0200, Ingo Molnar wrote:
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > allow usage of init_kernel_text - we need this in jump labeling to 
> > avoid attemtpting to patch code that has been freed as in the __init 
> > sections
> 
> s/attemtpting/attempting
> 
> > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > ---
> >  include/linux/kernel.h |    1 +
> >  kernel/extable.c       |    2 +-
> >  2 files changed, 2 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index f61039e..9d3419f 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
> >  extern char *get_options(const char *str, int nints, int *ints);
> >  extern unsigned long long memparse(const char *ptr, char **retptr);
> >  
> > +extern int init_kernel_text(unsigned long addr);
> >  extern int core_kernel_text(unsigned long addr);
> >  extern int __kernel_text_address(unsigned long addr);
> >  extern int kernel_text_address(unsigned long addr);
> > diff --git a/kernel/extable.c b/kernel/extable.c
> > index 7f8f263..f6893ad 100644
> > --- a/kernel/extable.c
> > +++ b/kernel/extable.c
> > @@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
> >  	return e;
> >  }
> >  
> > -static inline int init_kernel_text(unsigned long addr)
> > +int init_kernel_text(unsigned long addr)
> >  {
> >  	if (addr >= (unsigned long)_sinittext &&
> >  	    addr <= (unsigned long)_einittext)
> 
> i'm confused. Later on jump_label_update() does:
> 
> +                       if (!(system_state == SYSTEM_RUNNING &&
> +                                       (init_kernel_text(iter->code))))
> +                               jump_label_transform(iter, type);
> 
> which is:
> 
> +                       if (system_state != SYSTEM_RUNNING ||
> +                                       !init_kernel_text(iter->code)))
> +                               jump_label_transform(iter, type);
> 
> What is the logic behind that? System going into SYSTEM_RUNNING does not 
> coincide with free_initmem() precisely.
> 

The specific case I hit was in modifying code in arch_kdebugfs_init()
which is '__init' after the system was up and running. The tracepoint is
in 'kmalloc()' which is marked as __always_inline.


> Also, do we ever want to patch init-text tracepoints? I think we want to 
> stay away from them as much as possible.

I was trying to make sure that tracepoints in init-text were honored.

> 
> It appears to me that what we want here is a straight:
> 
>                        if (kernel_text(iter->code))
>                                jump_label_transform(iter, type);
> 
> Also, maybe a WARN_ONCE(!kernel_text()) - we should never even attempt 
> to transform non-patchable code. If yes then we want to know about that 
> in a noisy way and not skip it silently.
> 

hmmm....indeed, kernel_text_address() does do what I want here (I must
have mis-read its definition). Although, I'm not sure there isn't a race
here between freeing the init sections and possibly updating them. For
modules, there is no race since the module init free code takes the
module_mutex, and I do as well in this code...

I've now also tested this code on a 32-bit x86 system, and it seems to
perform nicely. I'm seeing a 15 cycle improvement per tracepoint.

I've based the text section updating on text_poke_fixup(), which has
recently come into question about safety of cross modifying code. I
could rebase my patches back to use stop_machine()? I guess I'm looking
for some advice on how to proceed here.

thanks,

-Jason






* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-01 20:39     ` Jason Baron
@ 2009-10-03 10:43       ` Ingo Molnar
  2009-10-03 12:39         ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2009-10-03 10:43 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mathieu.desnoyers, tglx, rostedt, ak, roland, rth,
	mhiramat


* Jason Baron <jbaron@redhat.com> wrote:

> On Thu, Oct 01, 2009 at 01:20:03PM +0200, Ingo Molnar wrote:
> > * Jason Baron <jbaron@redhat.com> wrote:
> > 
> > > allow usage of init_kernel_text - we need this in jump labeling to 
> > > avoid attemtpting to patch code that has been freed as in the __init 
> > > sections
> > 
> > s/attemtpting/attempting
> > 
> > > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > > ---
> > >  include/linux/kernel.h |    1 +
> > >  kernel/extable.c       |    2 +-
> > >  2 files changed, 2 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > > index f61039e..9d3419f 100644
> > > --- a/include/linux/kernel.h
> > > +++ b/include/linux/kernel.h
> > > @@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
> > >  extern char *get_options(const char *str, int nints, int *ints);
> > >  extern unsigned long long memparse(const char *ptr, char **retptr);
> > >  
> > > +extern int init_kernel_text(unsigned long addr);
> > >  extern int core_kernel_text(unsigned long addr);
> > >  extern int __kernel_text_address(unsigned long addr);
> > >  extern int kernel_text_address(unsigned long addr);
> > > diff --git a/kernel/extable.c b/kernel/extable.c
> > > index 7f8f263..f6893ad 100644
> > > --- a/kernel/extable.c
> > > +++ b/kernel/extable.c
> > > @@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
> > >  	return e;
> > >  }
> > >  
> > > -static inline int init_kernel_text(unsigned long addr)
> > > +int init_kernel_text(unsigned long addr)
> > >  {
> > >  	if (addr >= (unsigned long)_sinittext &&
> > >  	    addr <= (unsigned long)_einittext)
> > 
> > i'm confused. Later on jump_label_update() does:
> > 
> > +                       if (!(system_state == SYSTEM_RUNNING &&
> > +                                       (init_kernel_text(iter->code))))
> > +                               jump_label_transform(iter, type);
> > 
> > which is:
> > 
> > +                       if (system_state != SYSTEM_RUNNING ||
> > +                                       !init_kernel_text(iter->code)))
> > +                               jump_label_transform(iter, type);
> > 
> > What is the logic behind that? System going into SYSTEM_RUNNING does not 
> > coincide with free_initmem() precisely.
> > 
> 
> The specific case I hit was in modifying code in arch_kdebugfs_init()
> which is '__init' after the system was up and running. The tracepoint is
> in 'kmalloc()' which is marked as __always_inline.
> 
> 
> > Also, do we ever want to patch init-text tracepoints? I think we want to 
> > stay away from them as much as possible.
> 
> I was trying to make sure that tracepoints in init-text were honored.
> 
> > 
> > It appears to me that what we want here is a straight:
> > 
> >                        if (kernel_text(iter->code))
> >                                jump_label_transform(iter, type);
> > 
> > Also, maybe a WARN_ONCE(!kernel_text()) - we should never even attempt 
> > to transform non-patchable code. If yes then we want to know about that 
> > in a noisy way and not skip it silently.
> > 
> 
> hmmm....indeed, kernel_text_address() does do what I want here (I must 
> have mis-read its definition). Although, I'm not sure there isn't a 
> race here between freeing the init sections and possibly updating 
> them. For modules, there is no race since the module init free code 
> takes the module_mutex, and I do as well in this code...
> 
> I've now also tested this code on a 32-bit x86 system, and it seems to 
> perform nicely. I'm seeing a 15 cycle improvement per tracepoint.
> 
> I've based the text section updating on text_poke_fixup(), which has 
> recently come into question about safety of cross modifying code. I 
> could rebase my patches back to use stop_machine()? I guess I'm 
> looking for some advice on how to proceed here.

I think this very limited form of code patching that you are using here 
(patching a JMP) _should_ be safe - so we can avoid stop_machine().

	Ingo


* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-03 10:43       ` Ingo Molnar
@ 2009-10-03 12:39         ` Mathieu Desnoyers
  2009-10-07  1:54           ` Steven Rostedt
  0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-03 12:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jason Baron, linux-kernel, tglx, rostedt, ak, roland, rth, mhiramat

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > On Thu, Oct 01, 2009 at 01:20:03PM +0200, Ingo Molnar wrote:
> > > * Jason Baron <jbaron@redhat.com> wrote:
> > > 
> > > > allow usage of init_kernel_text - we need this in jump labeling to 
> > > > avoid attemtpting to patch code that has been freed as in the __init 
> > > > sections
> > > 
> > > s/attemtpting/attempting
> > > 
> > > > Signed-off-by: Jason Baron <jbaron@redhat.com>
> > > > ---
> > > >  include/linux/kernel.h |    1 +
> > > >  kernel/extable.c       |    2 +-
> > > >  2 files changed, 2 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > > > index f61039e..9d3419f 100644
> > > > --- a/include/linux/kernel.h
> > > > +++ b/include/linux/kernel.h
> > > > @@ -295,6 +295,7 @@ extern int get_option(char **str, int *pint);
> > > >  extern char *get_options(const char *str, int nints, int *ints);
> > > >  extern unsigned long long memparse(const char *ptr, char **retptr);
> > > >  
> > > > +extern int init_kernel_text(unsigned long addr);
> > > >  extern int core_kernel_text(unsigned long addr);
> > > >  extern int __kernel_text_address(unsigned long addr);
> > > >  extern int kernel_text_address(unsigned long addr);
> > > > diff --git a/kernel/extable.c b/kernel/extable.c
> > > > index 7f8f263..f6893ad 100644
> > > > --- a/kernel/extable.c
> > > > +++ b/kernel/extable.c
> > > > @@ -52,7 +52,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
> > > >  	return e;
> > > >  }
> > > >  
> > > > -static inline int init_kernel_text(unsigned long addr)
> > > > +int init_kernel_text(unsigned long addr)
> > > >  {
> > > >  	if (addr >= (unsigned long)_sinittext &&
> > > >  	    addr <= (unsigned long)_einittext)
> > > 
> > > i'm confused. Later on jump_label_update() does:
> > > 
> > > +                       if (!(system_state == SYSTEM_RUNNING &&
> > > +                                       (init_kernel_text(iter->code))))
> > > +                               jump_label_transform(iter, type);
> > > 
> > > which is:
> > > 
> > > +                       if (system_state != SYSTEM_RUNNING ||
> > > +                                       !init_kernel_text(iter->code)))
> > > +                               jump_label_transform(iter, type);
> > > 
> > > What is the logic behind that? System going into SYSTEM_RUNNING does not 
> > > coincide with free_initmem() precisely.
> > > 
> > 
> > The specific case I hit was in modifying code in arch_kdebugfs_init()
> > which is '__init' after the system was up and running. The tracepoint is
> > in 'kmalloc()' which is marked as __always_inline.
> > 
> > 
> > > Also, do we ever want to patch init-text tracepoints? I think we want to 
> > > stay away from them as much as possible.
> > 
> > I was trying to make sure that tracepoints in init-text were honored.
> > 
> > > 
> > > It appears to me that what we want here is a straight:
> > > 
> > >                        if (kernel_text(iter->code))
> > >                                jump_label_transform(iter, type);
> > > 
> > > Also, maybe a WARN_ONCE(!kernel_text()) - we should never even attempt 
> > > to transform non-patchable code. If yes then we want to know about that 
> > > in a noisy way and not skip it silently.
> > > 
> > 
> > hmmm....indeed, kernel_text_address() does do what I want here (I must 
> > have mis-read its definition). Although, I'm not sure there isn't a 
> > race here between freeing the init sections and possibly updating 
> > them. For modules, there is no race since the module init free code 
> > takes the module_mutex, and I do as well in this code...
> > 
> > I've now also tested this code on a 32-bit x86 system, and it seems to 
> > perform nicely. I'm seeing a 15 cycle improvement per tracepoint.
> > 
> > I've based the text section updating on text_poke_fixup(), which has 
> > recently come into question about safety of cross modifying code. I 
> > could rebase my patches back to use stop_machine()? I guess I'm 
> > looking for some advice on how to proceed here.
> 
> I think this very limited form of code patching that you are using here 
> (patching a JMP) _should_ be safe - so we can avoid stop_machine().
> 

I might be missing a bit of context here; I just want to make sure we
are on the same page: patching a jmp instruction is safe on UP, safe
with stop_machine(), is very likely safe with the breakpoint-ipi
approach (but we need the confirmation from Intel, which hpa is trying
to get), but is definitely _not_ safe if neither of these methods are
used on a SMP system. If a non-aligned multi-word jump is modified while
another CPU is fetching the instruction, bad things could happen.

BTW, patching kernel and module init sections can be done without
stop_machine(), because only one CPU is ever accessing the init code.

But again, I might be missing some context. If so, sorry for the noise.

Thanks,

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH 0/4] jump label patches
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
                   ` (3 preceding siblings ...)
  2009-09-24 23:18 ` [PATCH 4/4] jump label - tracepoint implementation Jason Baron
@ 2009-10-06  5:39 ` Roland McGrath
  2009-10-06 14:07   ` Jason Baron
  2009-10-06 23:24   ` Richard Henderson
  2009-10-06  6:04 ` Roland McGrath
  5 siblings, 2 replies; 29+ messages in thread
From: Roland McGrath @ 2009-10-06  5:39 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, ak, rth, mhiramat

I am, of course, fully in favor of this hack.  This version raises a new
concern for me vs what we had discussed before.  I don't know what the
conclusion about this should be, but I think it should be aired.

In the previous plan, we had approximately:

	asm goto ("1:" P6_NOP5
		  ".pushsection __jump_table\n"
		  _ASM_PTR "1b, %l[do_trace]\n"
		  ".popsection" : : : do_trace);
	if (0) { do_trace: ... tracing_path(); ... }
	... hot_path(); ...

That is, the straight-line code path is a 5-byte nop.  To enable the
"static if" at runtime, we replace that with a "jmp .Ldo_trace".
So, disabled:

	0x1:	nopl
	0x6:	hot path
	...
	0x100:	ret		# or jmp somewhere else, whatever
	...
	0x234:	tracing path	# never reached
	...
	0x250:	jmp 0x6

and enabled:

	0x1:	jmp 0x234
	0x6:	hot path
	...
	0x100:	ret
	...
	0x234:	tracing path
	...
	0x250:	jmp 0x6


In your new plan, instead we now have approximately:

	asm goto ("1: jmp %l[dont_trace]\n"
		  ".pushsection __jump_table\n"
		  _ASM_PTR "1b, %l[dont_trace]\n"
		  ".popsection" : : : dont_trace);
	... tracing path ...
	dont_trace:
	... hot_path(); ...

That is, we've inverted the sense of the control flow: the straight-line
code path is the tracing path, and in default "disabled" state we jump
around the tracing path to get to the hot path.
So, disabled:

	0x1:	jmp 0x1f
	0x3:	tracing path	# never reached
	...
	0x1f:	hot path
	...
	0x119:	ret

and enabled:

	0x1:	jmp 0x3
	0x3:	tracing path
	...
	0x1f:	hot path
	...
	0x119:	ret


As I understand it, the point of the exercise is to optimize the "disabled"
case to as close as possible to what we'd get with no tracing path compiled
in at all.  In the first example (with "nopl"), it's easy to see why we
presume the addition is pretty close to epsilon: the execution cost of
the 5-byte nop, plus the indirect effects of those 5 bytes polluting the
I-cache.  We only really know when we measure, but that just seems likely
to be minimally obtrusive.

In the second example (with "jmp around"), I really wonder what the actual
overhead is.  There's the cost of the jmp itself, plus maybe whatever extra
jumps do to branch predictions or pipelines or whatnots of which I know not
much, plus the entire tracing path being right there adjacent using up the
I-cache space that would otherwise be keeping more of the hot path hot.
I'm sure others on the list have more insight than I do into what the
specific performance impacts we can expect from one code sequence or the
other on various chips.

Of course, a first important point is what the actual compiled code
sequences look like.  I'm hoping Richard (who implemented the compiler
feature for us) can help us with making sure our expectations jibe with the
code we'll really get.  There's no benefit in optimizing our asm not to
introduce a jump into the hot path if the compiler actually generates the
tracing path first and gives the hot path a "jmp" around it anyway.

The code example above assumes that "if (0)" is enough for the compiler to
put that code fork (where the "do_trace:" label is) somewhere out of the
straight-line path rather than jumping around it.  Going on the "belt and
suspenders" theory as to being thoroughly explicit to the compiler what we
intend, I'd go for:

	if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) { ... }

But we need Richard et al to tell us what actually makes a difference to
the compiler's optimizer, and will reliably continue to do so in the future.
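
Spelled out, the combined construct I have in mind would read something
like the sketch below (tracing_path()/hot_path() are stand-ins, and
whether the cold attribute on the label actually does anything is
exactly what we need Richard to tell us):

	asm goto ("1:" P6_NOP5 "\n\t"
		  ".pushsection __jump_table\n"
		  _ASM_PTR "1b, %l[do_trace]\n"
		  ".popsection" : : : : do_trace);
	if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) {
		tracing_path();		/* hopefully placed out of line */
	}
	hot_path();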


Thanks,
Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
                   ` (4 preceding siblings ...)
  2009-10-06  5:39 ` [PATCH 0/4] jump label patches Roland McGrath
@ 2009-10-06  6:04 ` Roland McGrath
  2009-10-06 14:09   ` Steven Rostedt
  2009-10-06 14:13   ` Masami Hiramatsu
  5 siblings, 2 replies; 29+ messages in thread
From: Roland McGrath @ 2009-10-06  6:04 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, ak, rth, mhiramat

I think text_poke_fixup() is a good safe place to start, and it seems wise
to merge a version using that before worrying about anything subtler.  But it's
almost surely overkill and makes the enable/disable switching cost pretty
huge.  The rules as documented by Intel seem to indicate that simple
self-modification can work for UP and for SMP there should be some scheme
with IPIs that is not too terrible.  

Those can entail a multi-phase modification like the int3 patching style,
but int3 is not the only way to do it.  int3 has the benefit of being a
one-byte instruction you can patch in, but also the downside of requiring
the trap handling hair.  Another approach is:

start:
	   .balign 2
	2: nopl
	7: ...

phase 1:
	2: jmp 7
	4: <last 3 bytes of nopl>
	7: ...

phase 2:
	2: jmp 7
	4: {last 3 bytes of "jmp .Ldo_trace"}
	7: ...

phase 3:
	2: jmp .Ldo_trace
	7: ...

A scheme like that requires that the instruction to be patched be 2-byte
aligned so that the two-byte "jmp .+3" can be an atomic store not
straddling a word boundary.  On x86-64 (and, according to the Intel book,
everything >= Pentium), you can atomically store 8 bytes when aligned.  So
there you will usually actually be able to do this in one or two phases to
cover each particular 5 byte range with adequately aligned stores.
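
In C, the generic three-phase variant would look roughly like the
sketch below (patch_jmp5() is a made-up helper; it assumes the patched
instruction is 2-byte aligned, and it deliberately omits the cross-CPU
serialization being debated elsewhere in this thread):

	#include <linux/string.h>
	#include <linux/types.h>

	/* Replace an aligned 5-byte insn at addr with the 5 bytes in newinsn. */
	static void patch_jmp5(u8 *addr, const u8 newinsn[5])
	{
		/* phase 1: "jmp .+3" (eb 03) lands at addr + 5; after this
		 * atomic 2-byte store no CPU can decode bytes 2..4 */
		*(volatile u16 *)addr = 0x03eb;

		/* phase 2: rewrite the tail while it is jumped over */
		memcpy(addr + 2, newinsn + 2, 3);

		/* phase 3: atomically install the new first two bytes */
		*(volatile u16 *)addr = newinsn[0] | (newinsn[1] << 8);
	}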


Thanks,
Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06  5:39 ` [PATCH 0/4] jump label patches Roland McGrath
@ 2009-10-06 14:07   ` Jason Baron
  2009-10-06 23:24   ` Richard Henderson
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Baron @ 2009-10-06 14:07 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, ak, rth, mhiramat

On Mon, Oct 05, 2009 at 10:39:15PM -0700, Roland McGrath wrote:
> I am, of course, fully in favor of this hack.  This version raises a new
> concern for me vs what we had discussed before.  I don't know what the
> conclusion about this should be, but I think it should be aired.
> 
> In the previous plan, we had approximately:
> 
> 	asm goto ("1:" P6_NOP5
> 		  ".pushsection __jump_table\n"
> 		  _ASM_PTR "1b, %l[do_trace]\n"
> 		  ".popsection" : : : : do_trace);
> 	if (0) { do_trace: ... tracing_path(); ... }
> 	... hot_path(); ...
> 
> That is, the straight-line code path is a 5-byte nop.  To enable the
> "static if" at runtime, we replace that with a "jmp .Ldo_trace".
> So, disabled:
> 
> 	0x1:	nopl
> 	0x6:	hot path
> 	...
> 	0x100:	ret		# or jmp somewhere else, whatever
> 	...
> 	0x234:	tracing path	# never reached
> 	...
> 	0x250:	jmp 0x6
> 
> and enabled:
> 
> 	0x1:	jmp 0x234
> 	0x6:	hot path
> 	...
> 	0x100:	ret
> 	...
> 	0x234:	tracing path
> 	...
> 	0x250:	jmp 0x6
> 
> 
> In your new plan, instead we now have approximately:
> 
> 	asm goto ("1: jmp %l[dont_trace]\n"
> 		  ".pushsection __jump_table\n"
> 		  _ASM_PTR "1b, %l[dont_trace]\n"
> 		  ".popsection" : : : : dont_trace);
> 	... tracing path ...
> 	dont_trace:
> 	... hot_path(); ...
> 
> That is, we've inverted the sense of the control flow: the straight-line
> code path is the tracing path, and in default "disabled" state we jump
> around the tracing path to get to the hot path.
> So, disabled:
> 
> 	0x1:	jmp 0x1f
> 	0x3:	tracing path	# never reached
> 	...
> 	0x1f:	hot path
> 	...
> 	0x119:	ret
> 
> and enabled:
> 
> 	0x1:	jmp 0x3
> 	0x3:	tracing path
> 	...
> 	0x1f:	hot path
> 	...
> 	0x119:	ret
> 
> 
> As I understand it, the point of the exercise is to optimize the "disabled"
> case to as close as possible to what we'd get with no tracing path compiled
> in at all.  In the first example (with "nopl"), it's easy to see why we
> presume the addition is pretty close to epsilon: the execution cost of
> the 5-byte nop, plus the indirect effects of those 5 bytes polluting the
> I-cache.  We only really know when we measure, but that just seems likely
> to be minimally obtrusive.
> 
> In the second example (with "jmp around"), I really wonder what the actual
> overhead is.  There's the cost of the jmp itself, plus maybe whatever extra
> jumps do to branch predictions or pipelines or whatnots of which I know not
> much, plus the entire tracing path being right there adjacent using up the
> I-cache space that would otherwise be keeping more of the hot path hot.
> I'm sure others on the list have more insight than I do into what the
> specific performance impacts we can expect from one code sequence or the
> other on various chips.
> 
> Of course, a first important point is what the actual compiled code
> sequences look like.  I'm hoping Richard (who implemented the compiler
> feature for us) can help us with making sure our expectations jibe with the
> code we'll really get.  There's no benefit in optimizing our asm not to
> introduce a jump into the hot path if the compiler actually generates the
> tracing path first and gives the hot path a "jmp" around it anyway.
> 
> The code example above assumes that "if (0)" is enough for the compiler to
> put that code fork (where the "do_trace:" label is) somewhere out of the
> straight-line path rather than jumping around it.  Going on the "belt and
> suspenders" theory as to being thoroughly explicit to the compiler what we
> intend, I'd go for:
> 
> 	if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) { ... }
> 
> But we need Richard et al to tell us what actually makes a difference to
> the compiler's optimizer, and will reliably continue to do so in the future.
> 
> 
> Thanks,
> Roland

Right, thanks for clearly explaining some of the
advantages/disadvantages of the two schemes. The two reasons I moved from the
'nop' scheme to the 'jmp' scheme were:

1) The 'nop' scheme was still producing a 'jmp' around the hotpath. So
in the lingo from above, I was getting:

       0x1:    nopl
       0x6:    jmp 0x100    # jump to hot path
       0x8:    tracing path
       ...
       0x100:    hot path
       ...
       0x200:  ret             # or jmp somewhere else, whatever

So, since the compiler was still producing a jmp at instruction 0x6, I
was saving some icache by simply removing the nopl at instruction 0x1.
That said, I didn't try the 'builtin_expect' construct mentioned above
to move the tracing path out of line. However, the 'jmp' scheme could
use the same technique to move the tracing path out of line as well.
Finally, I do agree that if we had a chance to just place a single nop
or just place a jmp, the nop would certainly be better.


2) Concern over finding a 5-byte atomic nop that would work for all x86
processors. I think Steve Rostedt ran into this issue with ftrace...
The problem is that if the 5-byte nop is not atomic, then we risk being
interrupted and returning to a nonsensical opcode. The 'jmp' scheme, I
believe, solves that problem, since I assume the entire 'jmp'
instruction is atomic.
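
For reference, using the addresses from the example above, the two
states of the 2-byte jump differ only in the rel8 operand:

	disabled:   eb 1c      jmp .+0x1c    # jump over the tracing path
	enabled:    eb 00      jmp .+0x0     # fall through into it

so flipping the state is a single one-byte store of the operand byte,
which is trivially atomic; whether another CPU may legally observe the
change without a serializing instruction is the separate question
discussed elsewhere in this thread.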

thanks,

-Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06  6:04 ` Roland McGrath
@ 2009-10-06 14:09   ` Steven Rostedt
  2009-10-06 14:13   ` Masami Hiramatsu
  1 sibling, 0 replies; 29+ messages in thread
From: Steven Rostedt @ 2009-10-06 14:09 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx, ak,
	rth, mhiramat

On Mon, 2009-10-05 at 23:04 -0700, Roland McGrath wrote:
> I think text_poke_fixup() is a good safe place to start, and it seems wise
> to merge a version using that before worrying about anything subtler.  But it's
> almost surely overkill and makes the enable/disable switching cost pretty
> huge.  The rules as documented by Intel seem to indicate that simple
> self-modification can work for UP and for SMP there should be some scheme
> with IPIs that is not too terrible.  
> 

I think this is still under discussion. The thread about immediates
seems to express some concern.

  http://lkml.org/lkml/2009/9/24/134

-- Steve



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06  6:04 ` Roland McGrath
  2009-10-06 14:09   ` Steven Rostedt
@ 2009-10-06 14:13   ` Masami Hiramatsu
  2009-10-06 14:30     ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2009-10-06 14:13 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, ak, rth

Roland McGrath wrote:
> I think text_poke_fixup() is a good safe place to start, and it seems wise
> to merge a version using that before worrying about anything subtler.  But it's
> almost surely overkill and makes the enable/disable switching cost pretty
> huge.  The rules as documented by Intel seem to indicate that simple
> self-modification can work for UP and for SMP there should be some scheme
> with IPIs that is not too terrible.  
> 
> Those can entail a multi-phase modification like the int3 patching style,
> but int3 is not the only way to do it.  int3 has the benefit of being a
> one-byte instruction you can patch in, but also the downside of requiring
> the trap handling hair.

Hmm, would you want to put a tracepoint on the path of int3 handling?

>  Another approach is:
> 
> start:
> 	   .balign 2
> 	2: nopl
> 	7: ...
> 
> phase 1:
> 	2: jmp 7
> 	4: <last 3 bytes of nopl>
> 	7: ...
> 
> phase 2:
> 	2: jmp 7
> 	4: {last 3 bytes of "jmp .Ldo_trace"}
> 	7: ...
> 
> phase 3:
> 	2: jmp .Ldo_trace
> 	7: ...
> 
> A scheme like that requires that the instruction to be patched be 2-byte
> aligned so that the two-byte "jmp .+3" can be an atomic store not
> straddling a word boundary.  On x86-64 (and, according to the Intel book,
> everything >= Pentium), you can atomically store 8 bytes when aligned.  So
> there you will usually actually be able to do this in one or two phases to
> cover each particular 5 byte range with adequately aligned stores.

It is unclear whether we can atomically modify 2 bytes in icache (also, it
can cross cache lines or pages.)
I think int3 bypassing is a more generic way of patching if you don't mind
tracing the int3 path :-)


Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06 14:13   ` Masami Hiramatsu
@ 2009-10-06 14:30     ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-06 14:30 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Roland McGrath, Jason Baron, linux-kernel, mingo, tglx, rostedt, ak, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Roland McGrath wrote:
[...]
> 
> It is unclear whether we can atomically modify 2 bytes in icache (also, it
> can cross cache lines or pages.)
> I think int3 bypassing is a more generic way of patching if you don't mind
> tracing the int3 path :-)

I think their point is that by only changing the 2nd byte of a 2-byte
jmp instruction, leaving the opcode as-is, they think the processor will
either choose one or the other branch, seeing two "coherent"
instructions.

However, as I point out in the immediate values thread, the problem
might run deeper: the CPU needs to see consistent instructions across
multiple reads of the same code region (due to pipeline effects of some
sort). Mere atomicity of the modification does not seem to be enough.
What I gathered from discussing with Richard J Moore is that we really
need a synchronizing instruction between the moment a CPU sees the old
version and the moment it sees the new one. int3/iret is one. Sending an
IPI doing an mfence or cpuid is another. Both seem needed to ensure
SMP-safe code modification.

See the immediate values thread for details. We are waiting for Intel's
ruling on the int3+IPI scheme.
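
For reference, the IPI side of the scheme is tiny. A sketch, assuming
on_each_cpu() and the x86 sync_core() helper (the exact fence/cpuid mix
around it is precisely what is still being debated):

	#include <linux/smp.h>
	#include <asm/processor.h>

	static void do_sync_core(void *unused)
	{
		sync_core();	/* cpuid: serializes the instruction stream */
	}

	static void sync_all_cores(void)	/* hypothetical name */
	{
		on_each_cpu(do_sync_core, NULL, 1);	/* wait until all CPUs ran it */
	}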

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06  5:39 ` [PATCH 0/4] jump label patches Roland McGrath
  2009-10-06 14:07   ` Jason Baron
@ 2009-10-06 23:24   ` Richard Henderson
  2009-10-07  0:14     ` Roland McGrath
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Henderson @ 2009-10-06 23:24 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, ak, mhiramat

On 10/05/2009 10:39 PM, Roland McGrath wrote:
> Of course, a first important point is what the actual compiled code
> sequences look like.  I'm hoping Richard (who implemented the compiler
> feature for us) can help us with making sure our expectations jibe with the
> code we'll really get.  There's no benefit in optimizing our asm not to
> introduce a jump into the hot path if the compiler actually generates the
> tracing path first and gives the hot path a "jmp" around it anyway.

At present, the asm goto extension gives no prediction weights to any 
path.  I had hoped that the -freorder-blocks pass (enabled with -O2) 
would automatically place the relevant fallthrough blocks immediately 
after the asm goto.  It did happen for small test cases, but a message 
from Jason downthread indicates that it doesn't always happen.

> 	if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) { ... }

An attribute cold on a label is something that I've suggested, but have 
not yet implemented.  I think that might be the easiest way to add 
prediction weights to an asm goto.


r~

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-06 23:24   ` Richard Henderson
@ 2009-10-07  0:14     ` Roland McGrath
  2009-10-07 15:35       ` Richard Henderson
  0 siblings, 1 reply; 29+ messages in thread
From: Roland McGrath @ 2009-10-07  0:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, ak, mhiramat

> At present, the asm goto extension gives no prediction weights to any 
> path.  I had hoped that the -freorder-blocks pass (enabled with -O2) 
> would automatically place the relevant fallthrough blocks immediately 
> after the asm goto.  It did happen for small test cases, but a message 
> from Jason downthread indicates that it doesn't always happen.

Kernel builds usually use -Os.  Is there anything else we can do now (4.4)
to influence this placement (while keeping the unlikely target block inside
a scope, i.e. macro, with the asm goto)?

Also note that another usage pattern I'm considering for the general case
would (via fancy macros) look something like:

	if (({ __label__ yes; int maybe = 0;
	      asm goto ("jmp %l[yes]" :::: yes);
	      if (0) yes: maybe = 1;
	      maybe; }))
	  doit();
	else
	  dontit();

So please also advise on making something of that character come out optimal.

> > 	if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) { ... }
> 
> An attribute cold on a label is something that I've suggested, but have 
> not yet implemented.  I think that might be the easiest way to add 
> prediction weights to an asm goto.

Ok.  That syntax is accepted already now (4.4.1 anyway) but always generates:
	warning: 'cold' attribute ignored
so it would be annoying to start using it now, since we would need
compiler-version conditionals for a version where it does at most nothing.

Btw, if you do make that work I think it would be nice (and seems like it
would be not much different internally) to make:
	label:  __attribute__ ((cold, section (".text.lukewarm")))
work too.  I can see this being used for a few different kinds of things,
where a specialized user like the kernel might want to segregate multiple
different kinds of text.  e.g., "really cold" for rare exception stuff,
"tracepoint warm" for stuff that's cold by default but actually hot if
lots of tracing is enabled, etc.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-03 12:39         ` Mathieu Desnoyers
@ 2009-10-07  1:54           ` Steven Rostedt
  2009-10-07  2:32             ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2009-10-07  1:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Jason Baron, linux-kernel, tglx, ak, roland, rth, mhiramat

On Sat, 2009-10-03 at 08:39 -0400, Mathieu Desnoyers wrote:

> I might be missing a bit of context here, I just want to make sure we
> are on the same page: patching a jmp instruction is safe on UP, safe
> with stop_machine(), is very likely safe with the breakpoint-ipi

Hi Mathieu,

I've been reading through these threads (both this one and the immediate
one) and I'm still a bit confused. I really want to understand this in a
simple way, thus make sure everyone else understands it too.

From what Arjan said here:

  http://lkml.org/lkml/2009/9/25/98

The issue is going back from the int3 to the old value. How does the
breakpoint-ipi work?

Supposedly, we can add an int3 to the code without any worry. If another
CPU at that same time hits that code path, it will either run the old
code, or take the interrupt. The breakpoint interrupt handler will
handle that code path, and the execution continues.

Now what are the issues with removing the int3 and placing back the old
(or new) value. Is there an issue if another CPU is about to execute
that code path as we remove the int3? If so, how does sending an IPI
help the matter without adding more races?

Is there only an issue if we replace the old value with something else,
and you just need to send the IPI after you modify the old code and
before removing the int3?

I may just be totally confused, which I usually am. But when I'm not
confused, I feel that the code is practical ;-)

-- Steve

 




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07  1:54           ` Steven Rostedt
@ 2009-10-07  2:32             ` Mathieu Desnoyers
  2009-10-07  3:10               ` Masami Hiramatsu
                                 ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07  2:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Jason Baron, linux-kernel, tglx, ak, roland, rth, mhiramat

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Sat, 2009-10-03 at 08:39 -0400, Mathieu Desnoyers wrote:
> 
> > I might be missing a bit of context here, I just want to make sure we
> > are on the same page: patching a jmp instruction is safe on UP, safe
> > with stop_machine(), is very likely safe with the breakpoint-ipi
> 
> Hi Mathieu,
> 
> I've been reading through these threads (both this one and the immediate
> one) and I'm still a bit confused. I really want to understand this in a
> simple way, thus make sure everyone else understands it too.
> 
> From what Arjan said here:
> 
>   http://lkml.org/lkml/2009/9/25/98
> 
> The issue is going back from the int3 to the old value. How does the
> breakpoint-ipi work?
> 
> Supposedly, we can add an int3 to the code without any worry. If another
> CPU at that same time hits that code path, it will either run the old
> code, or take the interrupt. The breakpoint interrupt handler will
> handle that code path, and the execution continues.
> 
> Now what are the issues with removing the int3 and placing back the old
> (or new) value. Is there an issue if another CPU is about to execute
> that code path as we remove the int3? If so, how does sending an IPI
> help the matter without adding more races?
> 
> Is there only an issue if we replace the old value with something else,
> and you just need to send the IPI after you modify the old code and
> before removing the int3?
> 
> I may just be totally confused, which I usually am. But when I'm not
> confused, I feel that the code is practical ;-)
> 

Hi Steven,

OK, I'll make the explanation as straightforward as possible. I'll use a
race example to illustrate what we try to avoid by using the
breakpoint+ipi scheme. After that, I present the same scenario with the
breakpoint+ipi in place.

Each step shows what is executed, and what are the memory values seen by
the CPUs. CPU A is doing the code patching, CPU B executing the code.
(I intentionally left out some sfences required on CPU A for simplicity.)

Initially, let's say we have:
(1)  (2)
0xeb 0xe5    (jmp to offset 0xe5)

And we want to change this to:
(1)  (2)
0xeb 0xf0    (jmp to offset 0xf0)

(scenario "buggy")

CPU A       |       CPU B  (this is about as far as my ascii-art skills go)
-------------------------    ;)
0xeb 0xe5     0xeb 0xe5
0:            CPU B instruction pointer is earlier than (1)
              CPU B pipeline speculatively predicts branches,
              prefetches data, calculates speculated values.
1:            CPU B loads 0xeb
2:            CPU B loads 0xe5
3:
Write to (2)
0xeb 0xf0     0xeb 0xf0
4:            CPU B instruction pointer gets to (1), needs to validate
              all the pipeline speculation.
              But ! The CPU does not expect code to change underneath.
              General protection fault (or any other fault.. random..)


Now with the breakpoint+ipi/mb() scheme:
(scenario A: CPU B does not hit the breakpoint)

CPU A       |       CPU B
-------------------------
0xeb 0xe5     0xeb 0xe5
0:            CPU B instruction pointer is earlier than (1)
              CPU B pipeline speculatively predicts branches,
              prefetches data, calculates speculated values.
1:            CPU B loads 0xeb
2:            CPU B loads 0xe5
3:
Write to (1)
0xcc 0xe5     0xcc 0xe5  # breakpoint inserted
4: send IPI
5:            mfence     # serializing instruction. Flushes CPU B's
                         # pipeline
6:
Write to (2)
0xcc 0xf0     0xcc 0xf0
7:
Write to (1)
0xeb 0xf0     0xeb 0xf0
8:            CPU B instruction pointer gets to (1), needs to validate
              all the pipeline speculation. Because we flushed any
              speculation prior to the mfence, we're ok.


Now, I'll show why just using the breakpoint, without IPI, is
problematic:

CPU A       |       CPU B
-------------------------
0xeb 0xe5     0xeb 0xe5
0:            CPU B instruction pointer is earlier than (1)
              CPU B pipeline speculatively predicts branches,
              prefetches data, calculates speculated values.
1:            CPU B loads 0xeb
2:            CPU B loads 0xe5
3:
Write to (1)
0xcc 0xe5     0xeb 0xe5  # breakpoint inserted, but CPU B still sees old bytes
4:
Write to (2)
0xcc 0xf0     0xeb 0xf0  # Silly CPU B. Did not see nor use the breakpoint.
                         # Same problem as scenario "buggy".
5:
Write to (1)
0xeb 0xf0     0xeb 0xf0
6:            CPU B instruction pointer gets to (1), needs to validate
              all the pipeline speculation.
              But ! The CPU does not expect code to change underneath.
              General protection fault (or any other fault.. random..)

So, basically, we ensure that the only transitions CPU B will see are
either:

0xeb 0xe5 -> 0xcc 0xe5 : OK, adding breakpoint
0xcc 0xe5 -> 0xcc 0xf0 : OK, not using the operand anyway, it's a
                             breakpoint!
0xcc 0xf0 -> 0xeb 0xf0 : OK, removing breakpoint

*but*, the transition we guarantee that CPU B will *never* see without
having a mfence executed between the old and the new version is:

0xeb 0xe5 -> 0xeb 0xf0  <----- buggy.

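As code, the CPU A side of the safe sequence boils down to the sketch
below (text_poke() is the existing helper; sync_all_cores() stands in
for the IPI + serialize step, and the int3 handler must meanwhile make
any CPU hitting the 0xcc resume past the instruction):

	unsigned char bp = 0xcc, jmp8 = 0xeb, newoff = 0xf0;

	text_poke(addr, &bp, 1);         /* 0xeb 0xe5 -> 0xcc 0xe5 */
	sync_all_cores();                /* steps 4-5 above */
	text_poke(addr + 1, &newoff, 1); /* 0xcc 0xe5 -> 0xcc 0xf0 */
	text_poke(addr, &jmp8, 1);       /* 0xcc 0xf0 -> 0xeb 0xf0 */
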
Hope the explanation helps,

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07  2:32             ` Mathieu Desnoyers
@ 2009-10-07  3:10               ` Masami Hiramatsu
  2009-10-07  3:23                 ` Mathieu Desnoyers
  2009-10-07  3:29               ` Mathieu Desnoyers
  2009-10-07 12:56               ` Steven Rostedt
  2 siblings, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2009-10-07  3:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Ingo Molnar, Jason Baron, linux-kernel, tglx, ak,
	roland, rth



Mathieu Desnoyers wrote:
[...]
> 3:
> Write to (1)
> 0xcc 0xe5     0xcc 0xe5  # breakpoint inserted
> 4: send IPI
> 5:            mfence     # serializing instruction. Flushes CPU B's
>                          # pipeline
[...]
> Hope the explanation helps,

Thanks for the explanation.
One thing I'd like to know is why you are using mfence
instead of cpuid (a.k.a. sync_core()). I assume that old
processors don't have mfence; is that OK?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07  3:10               ` Masami Hiramatsu
@ 2009-10-07  3:23                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07  3:23 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Ingo Molnar, Jason Baron, linux-kernel, tglx, ak,
	roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
[...]
> 
> Thanks for the explanation.
> One thing I'd like to know is why you are using mfence
> instead of cpuid (a.k.a. sync_core()). I assume that old
> processors don't have mfence; is that OK?

Ah, right. Well then we can use cpuid for portability to older archs
lacking mfence.

The reason why I use mfence here is that I wanted to combine:

lfence
cpuid

into

mfence

I use an lfence on CPU B before the cpuid (and matching sfence on CPU A
between the moment I write the breakpoint instruction and the moment I
send the IPI, and another one between the moment CPU A knows the IPI has
been executed and the moment it writes the new operands).
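
In timeline form, roughly (the ack/wait plumbing is assumed):

	CPU A (patching)                 CPU B (IPI handler)
	----------------                 -------------------
	write 0xcc to (1)
	sfence
	send IPI            ------->     lfence
	                                 cpuid   (what mfence stood for above)
	wait for ack        <-------     ack
	sfence
	write operand to (2)
	write 0xeb to (1)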

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07  2:32             ` Mathieu Desnoyers
  2009-10-07  3:10               ` Masami Hiramatsu
@ 2009-10-07  3:29               ` Mathieu Desnoyers
  2009-10-07 12:56               ` Steven Rostedt
  2 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07  3:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Jason Baron, linux-kernel, tglx, ak, roland, rth, mhiramat

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
[...]
> (scenario "buggy")
> 
> CPU A       |       CPU B  (this is about as far as my ascii-art skills go)
> -------------------------    ;)
> 0xeb 0xe5     0xeb 0xe5
> 0:            CPU B instruction pointer is earlier than (1)
>               CPU B pipeline speculatively predicts branches,
>               prefetches data, calculates speculated values.

Clarification: going back to my past exchanges with Richard J Moore, the
specific CPU "state" that needs to be consistent on CPU B here across
steps 0-4 is the instruction trace cache.

Mathieu

[...]

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07  2:32             ` Mathieu Desnoyers
  2009-10-07  3:10               ` Masami Hiramatsu
  2009-10-07  3:29               ` Mathieu Desnoyers
@ 2009-10-07 12:56               ` Steven Rostedt
  2009-10-07 13:35                 ` Mathieu Desnoyers
  2 siblings, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2009-10-07 12:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Jason Baron, linux-kernel, tglx, ak, roland, rth, mhiramat

On Tue, 2009-10-06 at 22:32 -0400, Mathieu Desnoyers wrote:

[...]
> So, basically, we ensure that the only transitions CPU B will see are
> either:
> 
> 0xeb 0xe5 -> 0xcc 0xe5 : OK, adding breakpoint
> 0xcc 0xe5 -> 0xcc 0xf0 : OK, not using the operand anyway, it's a
>                              breakpoint!
> 0xcc 0xf0 -> 0xeb 0xf0 : OK, removing breakpoint
> 
> *but*, the transition we guarantee that CPU B will *never* see without
> having a mfence executed between the old and the new version is:
> 
> 0xeb 0xe5 -> 0xeb 0xf0  <----- buggy.
> 
> Hope the explanation helps,

Thanks Mathieu,

This does help explain a lot.

So, basically the IPI is to make sure the int3 is seen by other CPUs
before you modify the jump. Otherwise you risk setting up the int3 and
the other CPU not seeing it but still executing the change to the jmp
destination.

I'm assuming that the int3 handler will make the process on CPU B jump
to the next op (one not being modified).

Now we must get confirmation from Intel and AMD that it is OK to remove the int3.

-- Steve



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] jump label - make init_kernel_text() global
  2009-10-07 12:56               ` Steven Rostedt
@ 2009-10-07 13:35                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2009-10-07 13:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Jason Baron, linux-kernel, tglx, ak, roland, rth, mhiramat

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Tue, 2009-10-06 at 22:32 -0400, Mathieu Desnoyers wrote:
> 
> > 
> > Hi Steven,
> > 
> > OK, I'll make the explanation as straightforward as possible. I'll use a
> > race example to illustrate what we try to avoid by using the
> > breakpoint+ipi scheme. After that, I present the same scenario with the
> > breakpoint+ipi in place.
> > 
> > Each step shows what is executed, and what is the memory values seen by
> > the CPU. CPU A is doing the code patching, CPU B executing the code.
> > I intentionally left out some sfence required on CPU A for simplicity.)
> > 
> > Initially, let's say we have:
> > (1)  (2)
> > 0xeb 0xe5    (jmp to offset 0xe5)
> > 
> > And we want to change this to:
> > (1)  (2)
> > 0xeb 0xf0    (jmp to offset 0xf0)
> > 
> > (scenario "buggy")
> > 
> > CPU A       |       CPU B  (this is about as far as my ascii-art skills go)
> > -------------------------    ;)
> > 0xeb 0xe5     0xeb 0xe5
> > 0:            CPU B instruction pointer is earlier than (1)
> >               CPU B pipeline speculatively predicts branches,
> >               prefetches data, calculates speculated values.
> > 1:            CPU B loads 0xeb
> > 2:            CPU B loads 0xe5
> > 3:
> > Write to (2)
> > 0xeb 0xf0     0xeb 0xf0
> > 4:            CPU B instruction pointer gets to (1), needs to validate
> >               all the pipeline speculation.
> >               But ! The CPU does not expect code to change underneath.
> >               General protection fault (or any other fault.. random..)
> > 
> > 
> > Now with the breakpoint+ipi/mb() scheme:
> > (scenario A: CPU B does not hit the breakpoint)
> > 
> > CPU A       |       CPU B
> > -------------------------
> > 0xeb 0xe5     0xeb 0xe5
> > 0:            CPU B instruction pointer is earlier than (1)
> >               CPU B pipeline speculatively predicts branches,
> >               prefetches data, calculates speculated values.
> > 1:            CPU B loads 0xeb
> > 2:            CPU B loads 0xe5
> > 3:
> > Write to (1)
> > 0xcc 0xe5     0xcc 0xe5  # breakpoint inserted
> > 4: send IPI
> > 5:            mfence     # serializing instruction. Flushes CPU B's
> >                          # pipeline
> > 6:
> > Write to (2)
> > 0xcc 0xf0     0xcc 0xf0
> > 7:
> > Write to (1)
> > 0xeb 0xf0     0xeb 0xf0
> > 8:            CPU B instruction pointer gets to (1), needs to validate
> >               all the pipeline speculation. Because we flushed any
> >               speculation prior to the mfence, we're ok.
> > 
> > 
> > Now, I'll show why just using the breakpoint, without IPI, is
> > problematic:
> > 
> > CPU A       |       CPU B
> > -------------------------
> > 0xeb 0xe5     0xeb 0xe5
> > 0:            CPU B instruction pointer is earlier than (1)
> >               CPU B pipeline speculatively predicts branches,
> >               prefetches data, calculates speculated values.
> > 1:            CPU B loads 0xeb
> > 2:            CPU B loads 0xe5
> > 3:
> > Write to (1)
> > 0xcc 0xe5     0xcc 0xe5  # breakpoint inserted
> > 4:
> > Write to (2)
> > 0xcc 0xf0     0xcc 0xf0  # Silly CPU B. Its pipeline loaded 0xeb 0xe5
> >                          # before the writes, so it never saw nor used
> >                          # the breakpoint. Same problem as scenario
> >                          # "buggy".
> > 5:
> > Write to (1)
> > 0xeb 0xf0     0xeb 0xf0
> > 6:            CPU B instruction pointer gets to (1), needs to validate
> >               all the pipeline speculation.
> >               But! The CPU does not expect code to change underneath.
> >               General protection fault (or any other fault... random...)
> > 
> > So, basically, we ensure that the only transitions CPU B will see are
> > the following:
> > 
> > 0xeb 0xe5 -> 0xcc 0xe5 : OK, adding breakpoint
> > 0xcc 0xe5 -> 0xcc 0xf0 : OK, not using the operand anyway, it's a
> >                              breakpoint!
> > 0xcc 0xf0 -> 0xeb 0xf0 : OK, removing breakpoint
> > 
> > *but*, the transition we guarantee that CPU B will *never* see without
> > having an mfence executed between the old and the new version is:
> > 
> > 0xeb 0xe5 -> 0xeb 0xf0  <----- buggy.
> > 
> > Hope the explanation helps,
> 
> Thanks Mathieu,
> 
> This does help explain a lot.
> 
> So, basically the IPI is to make sure the int3 is seen by the other CPUs

- I might add: and that the other CPUs' instruction trace caches are
  flushed with a core serializing instruction -

> before you modify the jump. Otherwise you risk that the other CPU never
> sees the int3 but still executes the change to the jmp destination.

Yep.
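
In C, the sequence on the patching CPU boils down to something like the
sketch below. sync_core_all() and patch_jmp_operand() are made-up names
for illustration; this only shows the ordering, it is not the actual
text_poke_fixup() code:

#include <linux/smp.h>          /* smp_call_function() */
#include <asm/processor.h>      /* sync_core() */

static void do_sync_core(void *unused)
{
        sync_core();    /* cpuid: core serializing, flushes the pipeline */
}

/* run a core serializing instruction on every CPU and wait for it */
static void sync_core_all(void)
{
        smp_call_function(do_sync_core, NULL, 1);
        do_sync_core(NULL);
}

static void patch_jmp_operand(u8 *insn, u8 new_operand)
{
        insn[0] = 0xcc;         /* step 3: arm the breakpoint */
        sync_core_all();        /* steps 4-5: IPI + serialize on CPU B */
        insn[1] = new_operand;  /* step 6: safe, byte (2) is now only
                                   the operand of an int3 */
        insn[0] = 0xeb;         /* step 7: restore the short jmp opcode;
                                   x86 store ordering makes this write
                                   visible only after the one above */
}

Whether more serialization is needed before the int3 can be removed is
exactly what we want Intel and AMD to tell us.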

> 
> I'm assuming that the int3 handler will make the process on CPU B jump
> to the next op (the one not being modified).

Indeed.
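
Roughly, all the handler has to do is move the trapped context past the
bytes being patched. A minimal sketch, with made-up names (patch_site,
resume_addr, jump_patch_int3_notify); the real code hooks into the int3
trap path:

#include <asm/ptrace.h>         /* struct pt_regs */

static void *patch_site;        /* first byte of the 2-byte jmp */
static void *resume_addr;       /* where execution can safely resume */

/* returns 1 if the trap was ours and regs->ip has been fixed up */
static int jump_patch_int3_notify(struct pt_regs *regs)
{
        /* int3 reports the address just after the 0xcc byte */
        if ((void *)(regs->ip - 1) != patch_site)
                return 0;
        regs->ip = (unsigned long)resume_addr;
        return 1;
}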

> 
> Now we must get confirmation from Intel and AMD that it is OK to remove
> the int3.

Yep, that's what hpa is trying to get them to tell us.

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] jump label patches
  2009-10-07  0:14     ` Roland McGrath
@ 2009-10-07 15:35       ` Richard Henderson
  0 siblings, 0 replies; 29+ messages in thread
From: Richard Henderson @ 2009-10-07 15:35 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, ak, mhiramat

On 10/06/2009 05:14 PM, Roland McGrath wrote:
> Kernel builds usually use -Os.  Is there anything else we can do now
> (gcc 4.4) to influence this placement (while keeping the unlikely target
> block inside a scope, i.e. a macro, with the asm goto)?

I think -Os includes -freorder-blocks as well.

> 	      if (0) yes: maybe = 1;

Anything with "if (0)" in it (even with __builtin_expect) is going to be 
folded away too early to be useful.
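
For example (slow_path() is just a placeholder name):

extern void slow_path(void);

static inline void maybe_trace(void)
{
        if (0)                  /* folded at the tree level, even with
                                   __builtin_expect */
                slow_path();    /* no call is emitted, so no block is
                                   left for -freorder-blocks to place */
}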

I can't think of any way to manipulate block placement from the source 
level at this time that doesn't add more code along the fast path, 
obviating the asm goto.


r~

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2009-10-07 15:36 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-24 23:17 [PATCH 0/4] jump label patches Jason Baron
2009-09-24 23:17 ` [PATCH 1/4] jump label - make init_kernel_text() global Jason Baron
2009-10-01 11:20   ` Ingo Molnar
2009-10-01 12:58     ` Mathieu Desnoyers
2009-10-01 20:39     ` Jason Baron
2009-10-03 10:43       ` Ingo Molnar
2009-10-03 12:39         ` Mathieu Desnoyers
2009-10-07  1:54           ` Steven Rostedt
2009-10-07  2:32             ` Mathieu Desnoyers
2009-10-07  3:10               ` Masami Hiramatsu
2009-10-07  3:23                 ` Mathieu Desnoyers
2009-10-07  3:29               ` Mathieu Desnoyers
2009-10-07 12:56               ` Steven Rostedt
2009-10-07 13:35                 ` Mathieu Desnoyers
2009-09-24 23:17 ` [PATCH 2/4] jump label - base patch Jason Baron
2009-09-25  0:49   ` Roland McGrath
2009-09-26 10:21     ` Steven Rostedt
2009-10-01 11:36   ` Ingo Molnar
2009-09-24 23:17 ` [PATCH 3/4] jump label - add module support Jason Baron
2009-09-24 23:18 ` [PATCH 4/4] jump label - tracepoint implementation Jason Baron
2009-10-06  5:39 ` [PATCH 0/4] jump label patches Roland McGrath
2009-10-06 14:07   ` Jason Baron
2009-10-06 23:24   ` Richard Henderson
2009-10-07  0:14     ` Roland McGrath
2009-10-07 15:35       ` Richard Henderson
2009-10-06  6:04 ` Roland McGrath
2009-10-06 14:09   ` Steven Rostedt
2009-10-06 14:13   ` Masami Hiramatsu
2009-10-06 14:30     ` Mathieu Desnoyers
