* [RFC PATCH 0/6] jump label v3
@ 2009-11-18 22:43 Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Jason Baron
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

hi,

Refresh of the jump labeling patches. We introduce the following:

# ifdef CONFIG_X86_64
#  define JUMP_LABEL_NOP P6_NOP5
# else
#  define JUMP_LABEL_NOP ".byte 0xe9 \n\t .long 0\n\t"
# endif

# define JUMP_LABEL(tag, label, cond)                                       \
       do {                                                               \
               static const char __jlstrtab_##tag[]                       \
               __used __attribute__((section("__jump_strings")))  = #tag; \
               asm goto("1:"                                              \
                       JUMP_LABEL_NOP                                     \
                       ".pushsection __jump_table,  \"a\" \n\t"           \
                       _ASM_PTR "1b, %l[" #label "], %c0 \n\t"            \
                       ".popsection \n\t"                                 \
                       : :  "i" (__jlstrtab_##tag) :  : label);           \
       } while (0)

-------------

I'm using an atomic 5-byte no-op for x86_64 and a long jump for 32-bit x86.
My understanding is that not all 32-bit processors have an atomic 5-byte no-op,
and thus a long jump, or 'jump 0', is the safest choice for the off case.

This macro can then be used by the tracepoint code as:

#define DECLARE_TRACE(name, proto, args)                                \
       extern struct tracepoint __tracepoint_##name;                    \
       static inline void trace_##name(proto)                           \
       {                                                                \
               JUMP_LABEL(name, do_trace, __tracepoint_##name.state);   \
               return;                                                  \
do_trace:                                                               \
               __DO_TRACE(&__tracepoint_##name,                         \
                           TP_PROTO(proto), TP_ARGS(args));              \
       }


--------------

Thus, in the disabled tracing case we have a no-op followed by a jump around
the disabled code. When we enable the tracepoint, we simply patch the no-op
with a jump to the 'do_trace' label. This relies on the 'asm goto' construct,
which is already merged into gcc 4.5. In subsequent gcc versions, we also hope
to make use of a 'cold' label for the 'do_trace' section, making the disabled,
straight-line codepath simply a no-op.
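
For illustration (a sketch only -- the exact layout is up to gcc; the bytes are
just the P6_NOP5 and 0xe9 relative-jump encodings this series uses), a disabled
x86_64 site looks roughly like:

  1:  0f 1f 44 00 00   nopl 0x0(%rax,%rax,1)   /* P6_NOP5 at the patched site */
      ...                                      /* straight-line code continues */
  do_trace:
      ...                                      /* out-of-line __DO_TRACE() path */

and enabling the tracepoint rewrites the 5 bytes at 1: to 'e9 <rel32>'
(jmp do_trace), which is what jump_label_transform() in patch 4 does.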

As discussed in previous mails, I've seen an average improvement of 30 cycles
per tracepoint on the x86_64 systems that I've tested.

The first 2 patches of the series are a repost of Masami's text_poke_fixup()
function, which allows for efficient instruction patching. The final 4 patches
implement the jump patching mechanism for x86 and x86_64.

The implementation is a 'low' level one, in the sense that it is geared
specifically toward tracepoints. However, I believe this mechanism will be more
generally useful for other parts of the kernel. Thus, I will propose 'higher'
level interfaces into the jump label code (layered on these 'low' level ones),
as we go.

thanks,

-Jason

Masami Hiramatsu (2):
	x86: Introduce generic jump patching without stop_machine
	kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

Jason Baron (4):
  	move opcode defs from asm/kprobes.h to asm/alternative.h
  	jump-label-basic
  	jump-module-support
  	jump-label-tracepoints

 arch/x86/include/asm/alternative.h |   17 +++++
 arch/x86/include/asm/jump_label.h  |   35 +++++++++++
 arch/x86/include/asm/kprobes.h     |    3 -
 arch/x86/kernel/Makefile           |    2 +-
 arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/jump_label.c       |   66 ++++++++++++++++++++
 arch/x86/kernel/kprobes.c          |    2 +-
 include/asm-generic/vmlinux.lds.h  |   11 +++-
 include/linux/jump_label.h         |   47 ++++++++++++++
 include/linux/module.h             |   12 +++-
 include/linux/tracepoint.h         |   35 ++++++-----
 kernel/kprobes.c                   |    2 +-
 kernel/module.c                    |   27 ++++++++-
 kernel/tracepoint.c                |   25 ++++++--
 14 files changed, 372 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/include/asm/jump_label.h
 create mode 100644 arch/x86/kernel/jump_label.c
 create mode 100644 include/linux/jump_label.h



* [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine Jason Baron
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Rename the RELATIVEJUMP_INSTRUCTION macro to RELATIVEJUMP_OPCODE, since it
represents just the opcode byte.


Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
---
 arch/x86/include/asm/kprobes.h |    2 +-
 arch/x86/kernel/kprobes.c      |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 4fe681d..eaec8ea 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -32,7 +32,7 @@ struct kprobe;
 
 typedef u8 kprobe_opcode_t;
 #define BREAKPOINT_INSTRUCTION	0xcc
-#define RELATIVEJUMP_INSTRUCTION 0xe9
+#define RELATIVEJUMP_OPCODE 0xe9
 #define MAX_INSN_SIZE 16
 #define MAX_STACK_SIZE 64
 #define MIN_STACK_SIZE(ADDR)					       \
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 1f3186c..b6b75f1 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -115,7 +115,7 @@ static void __kprobes set_jmp_op(void *from, void *to)
 	} __attribute__((packed)) * jop;
 	jop = (struct __arch_jmp_op *)from;
 	jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
-	jop->op = RELATIVEJUMP_INSTRUCTION;
+	jop->op = RELATIVEJUMP_OPCODE;
 }
 
 /*
-- 
1.6.5.1



* [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-19  0:28   ` Mathieu Desnoyers
  2009-11-20 21:54   ` H. Peter Anvin
  2009-11-18 22:43 ` [RFC PATCH 3/6] jump label v3 - move opcode defs Jason Baron
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Add text_poke_fixup(), which takes a fixup address to which a processor
jumps if it hits the address being modified while the modification is in
progress. text_poke_fixup() performs the following steps for this purpose:

 1. Setup int3 handler for fixup.
 2. Put a breakpoint (int3) on the first byte of modifying region,
    and synchronize code on all CPUs.
 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
 4. Modify the first byte of modifying region, and synchronize code
    on all CPUs.
 5. Clear int3 handler.

Thus, if some other processor executes the address being modified during
steps 2 to 4, it is redirected to the fixup code.

This still has many limitations when modifying multiple instructions at once.
However, it is enough for patching a 5-byte nop with a jump, because:
 - The replaced instruction is just one instruction, which is executed
   atomically.
 - The replacing instruction is a jump, so we can set the fixup address to
   wherever the jump goes.
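
A usage sketch (illustrative only; 'site' and 'slowpath' are made-up names,
while the interface and the text_mutex rule are as defined in this patch):

	mutex_lock(&text_mutex);
	/*
	 * Builds "e9 <rel32>" and calls text_poke_fixup(site, jmp, 5, slowpath),
	 * so any CPU that hits the transient int3 resumes at 'slowpath'.
	 */
	ret = text_patch_jump(site, slowpath);
	mutex_unlock(&text_mutex);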

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
---
 arch/x86/include/asm/alternative.h |   12 ++++
 arch/x86/include/asm/kprobes.h     |    1 +
 arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
 kernel/kprobes.c                   |    2 +-
 4 files changed, 134 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index c240efc..03843a8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -160,4 +160,16 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 
+/*
+ * Setup int3 trap and fixup execution for cross-modifying on SMP case.
+ * If the other cpus execute modifying instruction, it will hit int3
+ * and go to fixup code. This just provides a minimal safety check.
+ * Additional checks/restrictions are required for completely safe
+ * cross-modifying.
+ */
+extern void *text_poke_fixup(void *addr, const void *opcode, size_t len,
+			     void *fixup);
+extern int text_patch_jump(void *addr, void *dest);
+extern void sync_core_all(void);
+
 #endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index eaec8ea..febab97 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -33,6 +33,7 @@ struct kprobe;
 typedef u8 kprobe_opcode_t;
 #define BREAKPOINT_INSTRUCTION	0xcc
 #define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
 #define MAX_INSN_SIZE 16
 #define MAX_STACK_SIZE 64
 #define MIN_STACK_SIZE(ADDR)					       \
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index de7353c..af47f12 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -4,6 +4,7 @@
 #include <linux/list.h>
 #include <linux/stringify.h>
 #include <linux/kprobes.h>
+#include <linux/kdebug.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/memory.h>
@@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 	local_irq_restore(flags);
 	return addr;
 }
+
+/*
+ * On pentium series, Unsynchronized cross-modifying code
+ * operations can cause unexpected instruction execution results.
+ * So after code modified, we should synchronize it on each processor.
+ */
+static void __kprobes __local_sync_core(void *info)
+{
+	sync_core();
+}
+
+void __kprobes sync_core_all(void)
+{
+	on_each_cpu(__local_sync_core, NULL, 1);
+}
+
+/* Safely cross-code modifying with fixup address */
+static void *patch_fixup_from;
+static void *patch_fixup_addr;
+
+static int __kprobes patch_exceptions_notify(struct notifier_block *self,
+					      unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (likely(!patch_fixup_from))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
+	    (unsigned long)patch_fixup_from != regs->ip)
+		return NOTIFY_DONE;
+
+	args->regs->ip = (unsigned long)patch_fixup_addr;
+	return NOTIFY_STOP;
+}
+
+/**
+ * text_poke_fixup() -- cross-modifying kernel text with fixup address.
+ * @addr:	Modifying address.
+ * @opcode:	New instruction.
+ * @len:	length of modifying bytes.
+ * @fixup:	Fixup address.
+ *
+ * Note: You must backup replaced instructions before calling this,
+ * if you need to recover it.
+ * Note: Must be called under text_mutex.
+ */
+void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
+				void *fixup)
+{
+	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
+	static const int int3_size = sizeof(int3_insn);
+
+	/* Replacing 1 byte can be done atomically. */
+	if (unlikely(len <= 1))
+		return text_poke(addr, opcode, len);
+
+	/* Preparing */
+	patch_fixup_addr = fixup;
+	wmb();
+	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
+
+	/* Cap by an int3 */
+	text_poke(addr, &int3_insn, int3_size);
+	sync_core_all();
+
+	/* Replace tail bytes */
+	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
+		  len - int3_size);
+	sync_core_all();
+
+	/* Replace int3 with head byte */
+	text_poke(addr, opcode, int3_size);
+	sync_core_all();
+
+	/* Cleanup */
+	patch_fixup_from = NULL;
+	wmb();
+	return addr;
+}
+
+/**
+ * text_patch_jump() -- cross-modifying kernel text with a relative jump
+ * @addr:	Address to jump from.
+ * @dest:	Address to jump to.
+ *
+ * Returns 0 on success; otherwise returns an error.
+ * Note: You must back up the replaced instructions before calling this,
+ * if you need to recover them.
+ * Note: Must be called under text_mutex.
+ */
+int __kprobes text_patch_jump(void *addr, void *dest)
+{
+	unsigned char jmp_code[RELATIVEJUMP_SIZE];
+	s32 rel = (s32)((long)dest - ((long)addr + RELATIVEJUMP_SIZE));
+
+	if (!addr || !dest ||
+	    (long)rel != ((long)dest - ((long)addr + RELATIVEJUMP_SIZE)))
+		return -EINVAL;
+
+	jmp_code[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&jmp_code[1]) = rel;
+
+	text_poke_fixup(addr, jmp_code, RELATIVEJUMP_SIZE, dest);
+	return 0;
+}
+
+static struct notifier_block patch_exceptions_nb = {
+	.notifier_call = patch_exceptions_notify,
+	.priority = 0x7fffffff /* we need to be notified first */
+};
+
+static int __init patch_init(void)
+{
+	return register_die_notifier(&patch_exceptions_nb);
+}
+
+arch_initcall(patch_init);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e5342a3..43a30d8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
 
 static struct notifier_block kprobe_exceptions_nb = {
 	.notifier_call = kprobe_exceptions_notify,
-	.priority = 0x7fffffff /* we need to be notified first */
+	.priority = 0x7ffffff0 /* High priority, but not first.  */
 };
 
 unsigned long __weak arch_deref_entry_point(void *entry)
-- 
1.6.5.1



* [RFC PATCH 3/6] jump label v3 - move opcode defs
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 4/6] jump label v3 - base patch Jason Baron
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Move x86 opcode macros from arch/x86/include/asm/kprobes.h to
arch/x86/include/asm/alternative.h so they are useful outside of kprobes.


Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 arch/x86/include/asm/alternative.h |    5 +++++
 arch/x86/include/asm/kprobes.h     |    4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 03843a8..6a22696 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -160,6 +160,11 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
+#define MAX_INSN_SIZE 16
+
 /*
  * Setup int3 trap and fixup execution for cross-modifying on SMP case.
  * If the other cpus execute modifying instruction, it will hit int3
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index febab97..15b1b12 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -31,10 +31,6 @@ struct pt_regs;
 struct kprobe;
 
 typedef u8 kprobe_opcode_t;
-#define BREAKPOINT_INSTRUCTION	0xcc
-#define RELATIVEJUMP_OPCODE 0xe9
-#define RELATIVEJUMP_SIZE 5
-#define MAX_INSN_SIZE 16
 #define MAX_STACK_SIZE 64
 #define MIN_STACK_SIZE(ADDR)					       \
 	(((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
-- 
1.6.5.1



* [RFC PATCH 4/6] jump label v3 - base patch
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
                   ` (2 preceding siblings ...)
  2009-11-18 22:43 ` [RFC PATCH 3/6] jump label v3 - move opcode defs Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-18 23:38   ` [PATCH] notifier atomic call chain notrace Mathieu Desnoyers
  2009-11-18 22:43 ` [RFC PATCH 5/6] jump label v3 - add module support Jason Baron
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Base patch to implement 'jump labeling'. Based on the new 'asm goto' inline
assembly gcc mechanism, we can now branch to labels from an 'asm goto'
statement. This allows us to create a 'no-op' fastpath, which can subsequently
be patched with a jump to the slowpath code. This is useful for code which
might be rarely used, but which we'd like to be able to call, if needed.
Tracepoints are the current use case that this is being implemented for.

Note that this is conditional on gcc >= 4.5. There is also a !lockdep
condition. This is because the die_notifier takes an rcu_read_lock() on the
int3 trap, which then causes another one, and so on. Since we aren't going
to be installing or removing the handler, the rcu_read_lock() could be
avoided for this case with some code restructuring.

Also, the patch is dependent on CONFIG_KPROBES. This is simply due to a
reliance on the int3 trap handler code that is only enabled with kprobes. This,
too, could be modified.
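
As an illustration, a minimal (hypothetical) user of the interface could look
like the following; foo(), my_slow_path() and 'use_slowpath' are made up, only
JUMP_LABEL(), enable_jump_label() and disable_jump_label() come from this
patch:

static int use_slowpath;  /* only read by the !__HAVE_ARCH_JUMP_LABEL fallback */

static void my_slow_path(void)
{
	/* rarely executed work */
}

static void foo(void)
{
	/* compiles down to a 5-byte no-op until the "foo" entries are patched */
	JUMP_LABEL(foo, do_slow, use_slowpath);
	return;
do_slow:
	my_slow_path();
}

static void foo_set(int on)
{
	use_slowpath = on;
	if (on)
		enable_jump_label("foo", NULL);	/* core kernel text: mod == NULL */
	else
		disable_jump_label("foo", NULL);
}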

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 arch/x86/include/asm/jump_label.h |   35 ++++++++++++++++++++
 arch/x86/kernel/Makefile          |    2 +-
 arch/x86/kernel/jump_label.c      |   63 +++++++++++++++++++++++++++++++++++++
 include/asm-generic/vmlinux.lds.h |   11 ++++++-
 include/linux/jump_label.h        |   45 ++++++++++++++++++++++++++
 5 files changed, 154 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/jump_label.h
 create mode 100644 arch/x86/kernel/jump_label.c
 create mode 100644 include/linux/jump_label.h

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
new file mode 100644
index 0000000..5817a86
--- /dev/null
+++ b/arch/x86/include/asm/jump_label.h
@@ -0,0 +1,35 @@
+#ifndef _ASM_X86_JUMP_LABEL_H
+#define _ASM_X86_JUMP_LABEL_H
+
+#include <asm/nops.h>
+
+#if (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)) && \
+	!defined(CONFIG_LOCKDEP) && defined(CONFIG_KPROBES)
+# define __HAVE_ARCH_JUMP_LABEL
+#endif
+
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+# ifdef CONFIG_X86_64
+#  define JUMP_LABEL_NOP P6_NOP5
+# else
+#  define JUMP_LABEL_NOP ".byte 0xe9 \n\t .long 0\n\t"
+# endif
+
+# define JUMP_LABEL(tag, label, cond)                                       \
+	do {								   \
+		static const char __jlstrtab_##tag[]                       \
+		__used __attribute__((section("__jump_strings")))  = #tag; \
+		asm goto("1:"						   \
+			JUMP_LABEL_NOP					   \
+			".pushsection __jump_table,  \"a\" \n\t"	   \
+			_ASM_PTR "1b, %l[" #label "], %c0 \n\t"		   \
+			".popsection \n\t"				   \
+			: :  "i" (__jlstrtab_##tag) :  : label);	   \
+	} while (0)
+
+# endif
+
+#endif
+
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4f2e66e..df3b341 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -32,7 +32,7 @@ GCOV_PROFILE_paravirt.o		:= n
 obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y			+= time.o ioport.o ldt.o dumpstack.o
-obj-y			+= setup.o x86_init.o i8259.o irqinit.o
+obj-y			+= setup.o x86_init.o i8259.o irqinit.o jump_label.o
 obj-$(CONFIG_X86_VISWS)	+= visws_quirks.o
 obj-$(CONFIG_X86_32)	+= probe_roms_32.o
 obj-$(CONFIG_X86_32)	+= sys_i386_32.o i386_ksyms_32.o
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
new file mode 100644
index 0000000..4131bbd
--- /dev/null
+++ b/arch/x86/kernel/jump_label.c
@@ -0,0 +1,63 @@
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/uaccess.h>
+#include <asm/alternative.h>
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern struct jump_entry __start___jump_table[];
+extern struct jump_entry __stop___jump_table[];
+
+union jump_code_union {
+	char code[RELATIVEJUMP_SIZE];
+	struct {
+		char jump;
+		int offset;
+	} __attribute__((packed));
+};
+
+void jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
+{
+	union jump_code_union code;
+
+	if (type == JUMP_LABEL_ENABLE) {
+		code.jump = 0xe9;
+		code.offset = entry->target - (entry->code + RELATIVEJUMP_SIZE);
+	} else {
+#ifdef CONFIG_X86_64
+		/* opcode for P6_NOP5 */
+		code.code[0] = 0x0f;
+		code.code[1] = 0x1f;
+		code.code[2] = 0x44;
+		code.code[3] = 0x00;
+		code.code[4] = 0x00;
+#else
+		code.jump = 0xe9;
+		code.offset = 0;
+#endif
+	}
+
+	mutex_lock(&text_mutex);
+	text_poke_fixup((void *)entry->code, &code, RELATIVEJUMP_SIZE,
+			(void *)entry->code + RELATIVEJUMP_SIZE);
+	mutex_unlock(&text_mutex);
+}
+
+int jump_label_update(const char *name, enum jump_label_type type, void *mod)
+{
+	struct jump_entry *iter;
+
+	if (mod)
+		return 0;
+
+	for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
+		if (!strcmp(name, iter->name)) {
+			if (kernel_text_address(iter->code))
+				jump_label_transform(iter, type);
+		}
+	}
+
+	return 0;
+}
+
+#endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5d5def0..8fcbe3b 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -171,7 +171,8 @@
 	BRANCH_PROFILE()		       				\
 	TRACE_PRINTKS()							\
 	FTRACE_EVENTS()							\
-	TRACE_SYSCALLS()
+	TRACE_SYSCALLS()						\
+	JUMP_TABLE()							\
 
 /*
  * Data section helpers
@@ -210,6 +211,7 @@
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
 		*(__tracepoints_strings)/* Tracepoints: strings */	\
+		*(__jump_strings)/* Jump: strings */	\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
@@ -218,6 +220,7 @@
 									\
 	BUG_TABLE							\
 									\
+									\
 	/* PCI quirks */						\
 	.pci_fixup        : AT(ADDR(.pci_fixup) - LOAD_OFFSET) {	\
 		VMLINUX_SYMBOL(__start_pci_fixups_early) = .;		\
@@ -561,6 +564,12 @@
 #define BUG_TABLE
 #endif
 
+#define JUMP_TABLE()							\
+	. = ALIGN(64);							\
+		VMLINUX_SYMBOL(__start___jump_table) = .;		\
+		*(__jump_table)						\
+		VMLINUX_SYMBOL(__stop___jump_table) = .;		\
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
new file mode 100644
index 0000000..2719506
--- /dev/null
+++ b/include/linux/jump_label.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_JUMP_LABEL_H
+#define _LINUX_JUMP_LABEL_H
+
+#include <asm/jump_label.h>
+
+struct jump_entry {
+	unsigned long code;
+	unsigned long target;
+	char *name;
+};
+
+enum jump_label_type {
+	JUMP_LABEL_ENABLE,
+	JUMP_LABEL_DISABLE
+};
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern int jump_label_update(const char *name, enum jump_label_type type, void *mod);
+
+#define enable_jump_label(name, mod) \
+	jump_label_update(name, JUMP_LABEL_ENABLE, mod);
+
+#define disable_jump_label(name, mod) \
+	jump_label_update(name, JUMP_LABEL_DISABLE, mod);
+
+#else
+
+#define JUMP_LABEL(tag, label, cond)		\
+	if (unlikely(cond))			\
+		goto label;
+
+static inline int enable_jump_label(const char *name, void *mod)
+{
+	return 0;
+}
+
+static inline int disable_jump_label(const char *name, void *mod)
+{
+	return 0;
+}
+
+#endif
+
+#endif
-- 
1.6.5.1



* [RFC PATCH 5/6] jump label v3 - add module support
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
                   ` (3 preceding siblings ...)
  2009-11-18 22:43 ` [RFC PATCH 4/6] jump label v3 - base patch Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-18 22:43 ` [RFC PATCH 6/6] jump label v3 - tracepoint support Jason Baron
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Add support for 'jump label' for modules.

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 arch/x86/kernel/jump_label.c |    5 ++++-
 include/linux/jump_label.h   |    2 ++
 include/linux/module.h       |   12 +++++++++++-
 kernel/module.c              |   25 +++++++++++++++++++++++++
 4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 4131bbd..080062d 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -1,6 +1,7 @@
 #include <linux/jump_label.h>
 #include <linux/memory.h>
 #include <linux/uaccess.h>
+#include <linux/module.h>
 #include <asm/alternative.h>
 
 #ifdef __HAVE_ARCH_JUMP_LABEL
@@ -47,8 +48,10 @@ int jump_label_update(const char *name, enum jump_label_type type, void *mod)
 {
 	struct jump_entry *iter;
 
-	if (mod)
+	if (mod) {
+		jump_label_update_module(name, type, mod);
 		return 0;
+	}
 
 	for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
 		if (!strcmp(name, iter->name)) {
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 2719506..7935ff6 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -17,6 +17,8 @@ enum jump_label_type {
 #ifdef __HAVE_ARCH_JUMP_LABEL
 
 extern int jump_label_update(const char *name, enum jump_label_type type, void *mod);
+extern void jump_label_transform(struct jump_entry *entry,
+				 enum jump_label_type type);
 
 #define enable_jump_label(name, mod) \
 	jump_label_update(name, JUMP_LABEL_ENABLE, mod);
diff --git a/include/linux/module.h b/include/linux/module.h
index 819354a..307b64a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -16,6 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
+#include <linux/jump_label.h>
 
 #include <asm/local.h>
 #include <asm/module.h>
@@ -355,7 +356,10 @@ struct module
 	struct tracepoint *tracepoints;
 	unsigned int num_tracepoints;
 #endif
-
+#ifdef __HAVE_ARCH_JUMP_LABEL
+	struct jump_entry *jump_entries;
+	unsigned int num_jump_entries;
+#endif
 #ifdef CONFIG_TRACING
 	const char **trace_bprintk_fmt_start;
 	unsigned int num_trace_bprintk_fmt;
@@ -557,6 +561,7 @@ extern void print_modules(void);
 
 extern void module_update_tracepoints(void);
 extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+extern void jump_label_update_module(const char *name, enum jump_label_type type, void *mod);
 
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
@@ -678,6 +683,11 @@ static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
 	return 0;
 }
 
+static inline void jump_label_update_module(const char *name, enum jump_label_type type, void *mod)
+{
+	return;
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
diff --git a/kernel/module.c b/kernel/module.c
index 8b7d880..6b8a754 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -55,6 +55,7 @@
 #include <linux/async.h>
 #include <linux/percpu.h>
 #include <linux/kmemleak.h>
+#include <linux/jump_label.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
@@ -2378,6 +2379,12 @@ static noinline struct module *load_module(void __user *umod,
 					sizeof(*mod->tracepoints),
 					&mod->num_tracepoints);
 #endif
+#ifdef __HAVE_ARCH_JUMP_LABEL
+	mod->jump_entries = section_objs(hdr, sechdrs, secstrings,
+					"__jump_table",
+					sizeof(*mod->jump_entries),
+					&mod->num_jump_entries);
+#endif
 #ifdef CONFIG_EVENT_TRACING
 	mod->trace_events = section_objs(hdr, sechdrs, secstrings,
 					 "_ftrace_events",
@@ -3143,4 +3150,22 @@ int module_get_iter_tracepoints(struct tracepoint_iter *iter)
 	mutex_unlock(&module_mutex);
 	return found;
 }
+
+#endif
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+void jump_label_update_module(const char *name, enum jump_label_type type, void *data)
+{
+	struct jump_entry *entry;
+	struct module *mod = data;
+
+	for (entry = mod->jump_entries;
+		entry < mod->jump_entries + mod->num_jump_entries;
+		entry++) {
+			if (!strcmp(name, entry->name))
+				jump_label_transform(entry, type);
+	}
+}
+
 #endif
-- 
1.6.5.1



* [RFC PATCH 6/6] jump label v3 - tracepoint support
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
                   ` (4 preceding siblings ...)
  2009-11-18 22:43 ` [RFC PATCH 5/6] jump label v3 - add module support Jason Baron
@ 2009-11-18 22:43 ` Jason Baron
  2009-11-18 22:51 ` [RFC PATCH 0/6] jump label v3 H. Peter Anvin
  2009-11-19  3:54 ` Roland McGrath
  7 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-18 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat

Make use of the jump label infrastructure for tracepoints.

Signed-off-by: Jason Baron <jbaron@redhat.com>
---
 include/linux/tracepoint.h |   35 +++++++++++++++++++----------------
 kernel/module.c            |    2 +-
 kernel/tracepoint.c        |   25 ++++++++++++++++++-------
 3 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 2aac8a8..7c4e9cb 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -16,6 +16,7 @@
 
 #include <linux/types.h>
 #include <linux/rcupdate.h>
+#include <linux/jump_label.h>
 
 struct module;
 struct tracepoint;
@@ -63,20 +64,22 @@ struct tracepoint {
  * not add unwanted padding between the beginning of the section and the
  * structure. Force alignment to the same alignment as the section start.
  */
-#define DECLARE_TRACE(name, proto, args)				\
-	extern struct tracepoint __tracepoint_##name;			\
-	static inline void trace_##name(proto)				\
-	{								\
-		if (unlikely(__tracepoint_##name.state))		\
-			__DO_TRACE(&__tracepoint_##name,		\
-				TP_PROTO(proto), TP_ARGS(args));	\
-	}								\
-	static inline int register_trace_##name(void (*probe)(proto))	\
-	{								\
-		return tracepoint_probe_register(#name, (void *)probe);	\
-	}								\
-	static inline int unregister_trace_##name(void (*probe)(proto))	\
-	{								\
+#define DECLARE_TRACE(name, proto, args)				 \
+	extern struct tracepoint __tracepoint_##name;			 \
+	static inline void trace_##name(proto)				 \
+	{								 \
+		JUMP_LABEL(name, do_trace, __tracepoint_##name.state);   \
+		return;							 \
+do_trace:								 \
+		__DO_TRACE(&__tracepoint_##name,			 \
+			   TP_PROTO(proto), TP_ARGS(args));		 \
+	}								 \
+	static inline int register_trace_##name(void (*probe)(proto))	 \
+	{								 \
+		return tracepoint_probe_register(#name, (void *)probe);	 \
+	}								 \
+	static inline int unregister_trace_##name(void (*probe)(proto))	 \
+	{								 \
 		return tracepoint_probe_unregister(#name, (void *)probe);\
 	}
 
@@ -97,7 +100,7 @@ struct tracepoint {
 	EXPORT_SYMBOL(__tracepoint_##name)
 
 extern void tracepoint_update_probe_range(struct tracepoint *begin,
-	struct tracepoint *end);
+	struct tracepoint *end, void *data);
 
 #else /* !CONFIG_TRACEPOINTS */
 #define DECLARE_TRACE(name, proto, args)				\
@@ -120,7 +123,7 @@ extern void tracepoint_update_probe_range(struct tracepoint *begin,
 #define EXPORT_TRACEPOINT_SYMBOL(name)
 
 static inline void tracepoint_update_probe_range(struct tracepoint *begin,
-	struct tracepoint *end)
+	struct tracepoint *end, void *data)
 { }
 #endif /* CONFIG_TRACEPOINTS */
 #endif /* DECLARE_TRACE */
diff --git a/kernel/module.c b/kernel/module.c
index 6b8a754..7ceba66 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3114,7 +3114,7 @@ void module_update_tracepoints(void)
 	list_for_each_entry(mod, &modules, list)
 		if (!mod->taints)
 			tracepoint_update_probe_range(mod->tracepoints,
-				mod->tracepoints + mod->num_tracepoints);
+				mod->tracepoints + mod->num_tracepoints, mod);
 	mutex_unlock(&module_mutex);
 }
 
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index cc89be5..d78f4b0 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,7 @@
 #include <linux/err.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/jump_label.h>
 
 extern struct tracepoint __start___tracepoints[];
 extern struct tracepoint __stop___tracepoints[];
@@ -239,7 +240,7 @@ static inline void remove_tracepoint(struct tracepoint_entry *e)
  * Sets the probe callback corresponding to one tracepoint.
  */
 static void set_tracepoint(struct tracepoint_entry **entry,
-	struct tracepoint *elem, int active)
+	struct tracepoint *elem, int active, void *data)
 {
 	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
 
@@ -256,6 +257,12 @@ static void set_tracepoint(struct tracepoint_entry **entry,
 	 * is used.
 	 */
 	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+
+	if (!elem->state && active)
+		enable_jump_label(elem->name, data);
+	else if (elem->state && !active)
+		disable_jump_label(elem->name, data);
+
 	elem->state = active;
 }
 
@@ -265,11 +272,14 @@ static void set_tracepoint(struct tracepoint_entry **entry,
  * function insures that the original callback is not used anymore. This insured
  * by preempt_disable around the call site.
  */
-static void disable_tracepoint(struct tracepoint *elem)
+static void disable_tracepoint(struct tracepoint *elem, void *data)
 {
 	if (elem->unregfunc && elem->state)
 		elem->unregfunc();
 
+	if (elem->state)
+		disable_jump_label(elem->name, data);
+
 	elem->state = 0;
 	rcu_assign_pointer(elem->funcs, NULL);
 }
@@ -282,7 +292,8 @@ static void disable_tracepoint(struct tracepoint *elem)
  * Updates the probe callback corresponding to a range of tracepoints.
  */
 void
-tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end)
+tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end,
+				void *data)
 {
 	struct tracepoint *iter;
 	struct tracepoint_entry *mark_entry;
@@ -295,9 +306,9 @@ tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end)
 		mark_entry = get_tracepoint(iter->name);
 		if (mark_entry) {
 			set_tracepoint(&mark_entry, iter,
-					!!mark_entry->refcount);
+					!!mark_entry->refcount, data);
 		} else {
-			disable_tracepoint(iter);
+			disable_tracepoint(iter, data);
 		}
 	}
 	mutex_unlock(&tracepoints_mutex);
@@ -310,7 +321,7 @@ static void tracepoint_update_probes(void)
 {
 	/* Core kernel tracepoints */
 	tracepoint_update_probe_range(__start___tracepoints,
-		__stop___tracepoints);
+		__stop___tracepoints, NULL);
 	/* tracepoints in modules. */
 	module_update_tracepoints();
 }
@@ -565,7 +576,7 @@ int tracepoint_module_notify(struct notifier_block *self,
 	case MODULE_STATE_COMING:
 	case MODULE_STATE_GOING:
 		tracepoint_update_probe_range(mod->tracepoints,
-			mod->tracepoints + mod->num_tracepoints);
+			mod->tracepoints + mod->num_tracepoints, data);
 		break;
 	}
 	return 0;
-- 
1.6.5.1



* Re: [RFC PATCH 0/6] jump label v3
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
                   ` (5 preceding siblings ...)
  2009-11-18 22:43 ` [RFC PATCH 6/6] jump label v3 - tracepoint support Jason Baron
@ 2009-11-18 22:51 ` H. Peter Anvin
  2009-11-18 23:07   ` Roland McGrath
  2009-11-19  3:54 ` Roland McGrath
  7 siblings, 1 reply; 35+ messages in thread
From: H. Peter Anvin @ 2009-11-18 22:51 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, andi,
	roland, rth, mhiramat

On 11/18/2009 02:43 PM, Jason Baron wrote:
> 
> I'm using an atomic 5 byte no-op for x86_64 and a long jump for 32-bit x86.
> My understanding is that not all 32-bit processors have an atomic 5 byte no-op,
> and thus using a long jump or jump 0, for the off case is the safest.
> 

67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP.  It's
not necessarily the fastest, though (I have no data on that.)
Similarly, 66 66 66 66 90 should also work.

	-hpa


* Re: [RFC PATCH 0/6] jump label v3
  2009-11-18 22:51 ` [RFC PATCH 0/6] jump label v3 H. Peter Anvin
@ 2009-11-18 23:07   ` Roland McGrath
  2009-11-18 23:18     ` H. Peter Anvin
  0 siblings, 1 reply; 35+ messages in thread
From: Roland McGrath @ 2009-11-18 23:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, andi, rth, mhiramat

> 67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP.  It's
> not necessarily the fastest, though (I have no data on that.)
> Similarly, 66 66 66 66 90 should also work.

We should get all the knowledge like that stored in places like the
Kconfig.cpu comments near X86_P6_NOP and/or asm/nops.h macros and comments.

Let's have an ASM_ATOMIC_NOP5 macro in asm/nops.h?  I've lost track of the
variants, and I'll leave that to you all who are close to the chip people.
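
Something along these lines, say (a sketch only; the name is the one suggested
above and the byte sequences are the two hpa listed -- which encoding to prefer
is exactly the open question):

#define ASM_ATOMIC_NOP5 ".byte 0x66, 0x66, 0x66, 0x66, 0x90\n"	/* prefixed nop */
/* or:			".byte 0x67, 0x66, 0x8d, 0x74, 0x00\n"	   lea si,[si+0] */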

I can't tell if it's the case that there will be kernel configurations
where there is one known-safe choice but a different choice that's optimal
if CPU model checks pass.  If so, we could compile in the safe choice and
then mass-change to the optimal choice at boot time.  (i.e. like the
alternatives support, but we don't really need to use that too since we
already have another dedicated table of all the PC addresses to touch.)


Thanks,
Roland


* Re: [RFC PATCH 0/6] jump label v3
  2009-11-18 23:07   ` Roland McGrath
@ 2009-11-18 23:18     ` H. Peter Anvin
  0 siblings, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2009-11-18 23:18 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, andi, rth, mhiramat

On 11/18/2009 03:07 PM, Roland McGrath wrote:
>> 67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP.  It's
>> not necessarily the fastest, though (I have no data on that.)
>> Similarly, 66 66 66 66 90 should also work.
> 
> We should get all the knowledge like that stored in places like the
> Kconfig.cpu comments near X86_P6_NOP and/or asm/nops.h macros and comments.
> 
> Let's have an ASM_ATOMIC_NOP5 macro in asm/nops.h?  I've lost track of the
> variants, and I'll leave that to you all who are close to the chip people.
> 
> I can't tell if it's the case that there will be kernel configurations
> where there is one known-safe choice but a different choice that's optimal
> if CPU model checks pass.  If so, we could compile in the safe choice and
> then mass-change to the optimal choice at boot time.  (i.e. like the
> alternatives support, but we don't really need to use that too since we
> already have another dedicated table of all the PC addresses to touch.)

Yes, I believe we do something like that already.
	
	-hpa


* [PATCH] notifier atomic call chain notrace
  2009-11-18 22:43 ` [RFC PATCH 4/6] jump label v3 - base patch Jason Baron
@ 2009-11-18 23:38   ` Mathieu Desnoyers
  2009-11-19  0:02     ` Paul E. McKenney
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-18 23:38 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat, Paul E. McKenney

* Jason Baron (jbaron@redhat.com) wrote:
> Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
> condition. This is b/c the die_notifier takes an rcu_read_lock() on the
> int3 trap, which then causes another one etc. Since, we aren't going to be
> installing removing the handler, the rcu_read_lock() could be avoided for this
> case with some code restructuring.
> 
[snip]

Would the following patch help ? I use it in the LTTng tree to alleviate
this problem.


notifier atomic call chain notrace

In LTTng, being able to use the atomic notifier from cpu idle entry to
ensure the tracer flushes the last events in the current subbuffer
requires the rcu read-side to be marked "notrace", otherwise it can end
up calling back into lockdep and the tracer.

Also apply this to the die notifier.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Jason Baron <jbaron@redhat.com>
CC: mingo@elte.hu
---
 kernel/notifier.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/kernel/notifier.c
===================================================================
--- linux-2.6-lttng.orig/kernel/notifier.c	2009-11-12 17:58:56.000000000 -0500
+++ linux-2.6-lttng/kernel/notifier.c	2009-11-12 18:03:28.000000000 -0500
@@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
 	spin_lock_irqsave(&nh->lock, flags);
 	ret = notifier_chain_unregister(&nh->head, n);
 	spin_unlock_irqrestore(&nh->lock, flags);
-	synchronize_rcu();
+	synchronize_sched();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
@@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
 {
 	int ret;
 
-	rcu_read_lock();
+	rcu_read_lock_sched_notrace();
 	ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
-	rcu_read_unlock();
+	rcu_read_unlock_sched_notrace();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] notifier atomic call chain notrace
  2009-11-18 23:38   ` [PATCH] notifier atomic call chain notrace Mathieu Desnoyers
@ 2009-11-19  0:02     ` Paul E. McKenney
  2009-11-19  3:59     ` Masami Hiramatsu
  2009-11-19 16:48     ` Jason Baron
  2 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2009-11-19  0:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, linux-kernel, mingo, hpa, tglx, rostedt, andi,
	roland, rth, mhiramat

On Wed, Nov 18, 2009 at 06:38:15PM -0500, Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
> > Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
> > condition. This is b/c the die_notifier takes an rcu_read_lock() on the
> > int3 trap, which then causes another one etc. Since, we aren't going to be
> > installing removing the handler, the rcu_read_lock() could be avoided for this
> > case with some code restructuring.
> > 
> [snip]
> 
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.
> 
> 
> notifier atomic call chain notrace
> 
> In LTTng, being able to use the atomic notifier from cpu idle entry to
> ensure the tracer flush the last events in the current subbuffer
> requires the rcu read-side to be marked "notrace", otherwise it can end
> up calling back into lockdep and the tracer.
> 
> Also apply to the the die notifier.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Jason Baron <jbaron@redhat.com>
> CC: mingo@elte.hu
> ---
>  kernel/notifier.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6-lttng/kernel/notifier.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/notifier.c	2009-11-12 17:58:56.000000000 -0500
> +++ linux-2.6-lttng/kernel/notifier.c	2009-11-12 18:03:28.000000000 -0500
> @@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
>  	spin_lock_irqsave(&nh->lock, flags);
>  	ret = notifier_chain_unregister(&nh->head, n);
>  	spin_unlock_irqrestore(&nh->lock, flags);
> -	synchronize_rcu();
> +	synchronize_sched();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
> @@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
>  {
>  	int ret;
> 
> -	rcu_read_lock();
> +	rcu_read_lock_sched_notrace();
>  	ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched_notrace();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-18 22:43 ` [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine Jason Baron
@ 2009-11-19  0:28   ` Mathieu Desnoyers
  2009-11-19  0:58     ` Paul E. McKenney
                       ` (2 more replies)
  2009-11-20 21:54   ` H. Peter Anvin
  1 sibling, 3 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-19  0:28 UTC (permalink / raw)
  To: Jason Baron, mingo, mhiramat, Paul E. McKenney
  Cc: linux-kernel, hpa, tglx, rostedt, andi, roland, rth

* Jason Baron (jbaron@redhat.com) wrote:
> Add text_poke_fixup() which takes a fixup address to where a processor
> jumps if it hits the modifying address while code modifying.
> text_poke_fixup() does following steps for this purpose.
> 
>  1. Setup int3 handler for fixup.
>  2. Put a breakpoint (int3) on the first byte of modifying region,
>     and synchronize code on all CPUs.
>  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>  4. Modify the first byte of modifying region, and synchronize code
>     on all CPUs.
>  5. Clear int3 handler.
> 

Hi Masami,

I like the approach and the API is clean. I have interspersed comments
below.

Ingo: I raise a question about text_mutex vs. on_each_cpu hangs I
experienced recently in the message below. It might be worth having a look;
I suspect this might have caused the hangs Paul McKenney had with his
past TREE RCU callback migration. I think he did take a mutex in the cpu
hotplug callbacks and might have used IPIs within that same lock.

> Thus, if some other processor execute modifying address when step2 to step4,
> it will be jumped to fixup code.
> 
> This still has many limitations for modifying multi-instructions at once.
> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> because;
>  - Replaced instruction is just one instruction, which is executed atomically.
>  - Replacing instruction is a jump, so we can set fixup address where the jump
>    goes to.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> ---
>  arch/x86/include/asm/alternative.h |   12 ++++
>  arch/x86/include/asm/kprobes.h     |    1 +
>  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
>  kernel/kprobes.c                   |    2 +-
>  4 files changed, 134 insertions(+), 1 deletions(-)
> 

[snip snip]

> index de7353c..af47f12 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -4,6 +4,7 @@
>  #include <linux/list.h>
>  #include <linux/stringify.h>
>  #include <linux/kprobes.h>
> +#include <linux/kdebug.h>
>  #include <linux/mm.h>
>  #include <linux/vmalloc.h>
>  #include <linux/memory.h>
> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>  	local_irq_restore(flags);
>  	return addr;
>  }
> +
> +/*
> + * On pentium series, Unsynchronized cross-modifying code
> + * operations can cause unexpected instruction execution results.
> + * So after code modified, we should synchronize it on each processor.
> + */
> +static void __kprobes __local_sync_core(void *info)
> +{
> +	sync_core();
> +}
> +
> +void __kprobes sync_core_all(void)
> +{
> +	on_each_cpu(__local_sync_core, NULL, 1);

OK, so you rely on the fact that on_each_cpu has memory barriers and
executes the remote "__local_sync_core" with the appropriate memory
barriers underneath, am I correct ?

> +}
> +
> +/* Safely cross-code modifying with fixup address */
> +static void *patch_fixup_from;
> +static void *patch_fixup_addr;
> +
> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> +					      unsigned long val, void *data)
> +{
> +	struct die_args *args = data;
> +	struct pt_regs *regs = args->regs;
> +
> +	if (likely(!patch_fixup_from))
> +		return NOTIFY_DONE;
> +
> +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> +	    (unsigned long)patch_fixup_from != regs->ip)
> +		return NOTIFY_DONE;
> +
> +	args->regs->ip = (unsigned long)patch_fixup_addr;
> +	return NOTIFY_STOP;
> +}
> +
> +/**
> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> + * @addr:	Modifying address.
> + * @opcode:	New instruction.
> + * @len:	length of modifying bytes.
> + * @fixup:	Fixup address.
> + *
> + * Note: You must backup replaced instructions before calling this,
> + * if you need to recover it.
> + * Note: Must be called under text_mutex.
> + */
> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> +				void *fixup)
> +{
> +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> +	static const int int3_size = sizeof(int3_insn);
> +
> +	/* Replacing 1 byte can be done atomically. */

I'm sure it can be done atomically, but can it be done safely though ?

(disclaimer: we're still waiting for the official answer on this
statement): Assuming instruction trace cache effects of SMP cross-code
modification, and that only int3 would be safe to patch, then even
changing 1 single byte could only be done by going to an intermediate
int3 and synchronizing all other cores.

> +	if (unlikely(len <= 1))
> +		return text_poke(addr, opcode, len);
> +
> +	/* Preparing */
> +	patch_fixup_addr = fixup;
> +	wmb();

hrm, missing comment ?

> +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> +
> +	/* Cap by an int3 */
> +	text_poke(addr, &int3_insn, int3_size);
> +	sync_core_all();
> +
> +	/* Replace tail bytes */
> +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> +		  len - int3_size);
> +	sync_core_all();
> +
> +	/* Replace int3 with head byte */
> +	text_poke(addr, opcode, int3_size);
> +	sync_core_all();
> +
> +	/* Cleanup */
> +	patch_fixup_from = NULL;
> +	wmb();

missing comment here too.

> +	return addr;

Little quiz question:

When patch_fixup_from is set to NULL, what ensures that the int3
handlers have completed their execution ?

I think it's probably OK, because the int3 is an interrupt gate, which
therefore disables interrupts as soon as it runs, and executes the
notifier while irqs are off. When we run sync_core_all() after replacing
the int3 by the new 1st byte, we only return when all other cores have
executed an interrupt, which implies that all int3 handlers previously
running should have ended. Is it right ? It looks to me as if this 3rd
sync_core_all() is only needed because of that. Adding a comment here would
probably be good.


Another thing: I've recently noticed that the following locking seems to
hang the system when doing stress-testing concurrently with cpu
hotplug/hotunplug:

mutex_lock(&text_mutex);
  on_each_cpu(something, NULL, 1);

The hang seems to be caused by the fact that alternative.c has:

within cpu hotplug (cpu hotplug lock held)
  mutex_lock(&text_mutex);

It might also be caused by the interaction with the stop_machine()
performed within the cpu hotplug lock. I did not find the root cause of
the problem, but this probably calls for lockdep improvements.

In any case, given you are dealing with the exact same locking scenario
here, I would recommend the following stress-test:

in a loop, use text_poke_jump()
in a loop, hotunplug all cpus, then hotplug all cpus

I had to fix this temporarily by taking get/put_online_cpus() around the
text_mutex.
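
i.e., roughly (a sketch of that temporary workaround, not a proposed final fix):

	get_online_cpus();		/* exclude cpu hotplug ...         */
	mutex_lock(&text_mutex);	/* ... before taking text_mutex,   */
	text_poke_fixup(addr, opcode, len, fixup); /* which IPIs all cpus */
	mutex_unlock(&text_mutex);
	put_online_cpus();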

[snip snip]

> +static struct notifier_block patch_exceptions_nb = {
> +	.notifier_call = patch_exceptions_notify,
> +	.priority = 0x7fffffff /* we need to be notified first */
> +};
> +
> +static int __init patch_init(void)
> +{
> +	return register_die_notifier(&patch_exceptions_nb);
> +}
> +
> +arch_initcall(patch_init);
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index e5342a3..43a30d8 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
>  
>  static struct notifier_block kprobe_exceptions_nb = {
>  	.notifier_call = kprobe_exceptions_notify,
> -	.priority = 0x7fffffff /* we need to be notified first */
> +	.priority = 0x7ffffff0 /* High priority, but not first.  */

It would be good to keep all these priorities in a centralized header.

Thanks,

Mathieu

>  };
>  
>  unsigned long __weak arch_deref_entry_point(void *entry)
> -- 
> 1.6.5.1
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  0:28   ` Mathieu Desnoyers
@ 2009-11-19  0:58     ` Paul E. McKenney
  2009-11-19  1:22       ` Steven Rostedt
  2009-11-19  1:57       ` Mathieu Desnoyers
  2009-11-19 14:04     ` Masami Hiramatsu
  2009-11-21  1:11     ` Masami Hiramatsu
  2 siblings, 2 replies; 35+ messages in thread
From: Paul E. McKenney @ 2009-11-19  0:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, mingo, mhiramat, linux-kernel, hpa, tglx, rostedt,
	andi, roland, rth

On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
> > Add text_poke_fixup() which takes a fixup address to where a processor
> > jumps if it hits the modifying address while code modifying.
> > text_poke_fixup() does following steps for this purpose.
> > 
> >  1. Setup int3 handler for fixup.
> >  2. Put a breakpoint (int3) on the first byte of modifying region,
> >     and synchronize code on all CPUs.
> >  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> >  4. Modify the first byte of modifying region, and synchronize code
> >     on all CPUs.
> >  5. Clear int3 handler.
> > 
> 
> Hi Masami,
> 
> I like the approach and the API is clean. I have intersped comments
> below.
> 
> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.

Hello, Mathieu,

By "mutex", do you mean "mutex_lock()"?  If so, I don't do that from
within the CPU hotplug notifiers.  I do spin_lock_irqsave().

People have been known to invoke synchronize_rcu() from CPU hotplug
notifiers, however -- and this does block.  Is that what you are
getting at?

I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
suspicions of late.  But this seems to bypass the smp_call_function()
code that is most definitely illegal to invoke with irqs disabled,
so no smoking gun.  All that aside, if invoking smp_send_reschedule()
with irqs disabled is in any way a bad idea, please let me know so I
can rearrange the code appropriately.

RCU is currently running reasonably well with the set of patches I have
submitted.  The kernel is now withstanding a level of punishment that
would have reduced 2.6.28 (for example) to a steaming pile of bits, with
or without the help of rcutorture.  And I am now hitting the occasional
non-RCU bug, so I am starting to feel like RCU is approaching stability.
Approaching, but not there yet -- a few suspicious areas remain.  Time
to crank the testing up another notch or two.  ;-)

							Thanx, Paul

> > Thus, if some other processor execute modifying address when step2 to step4,
> > it will be jumped to fixup code.
> > 
> > This still has many limitations for modifying multi-instructions at once.
> > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > because;
> >  - Replaced instruction is just one instruction, which is executed atomically.
> >  - Replacing instruction is a jump, so we can set fixup address where the jump
> >    goes to.
> > 
> > Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> > ---
> >  arch/x86/include/asm/alternative.h |   12 ++++
> >  arch/x86/include/asm/kprobes.h     |    1 +
> >  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
> >  kernel/kprobes.c                   |    2 +-
> >  4 files changed, 134 insertions(+), 1 deletions(-)
> > 
> 
> [snip snip]
> 
> > index de7353c..af47f12 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -4,6 +4,7 @@
> >  #include <linux/list.h>
> >  #include <linux/stringify.h>
> >  #include <linux/kprobes.h>
> > +#include <linux/kdebug.h>
> >  #include <linux/mm.h>
> >  #include <linux/vmalloc.h>
> >  #include <linux/memory.h>
> > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> >  	local_irq_restore(flags);
> >  	return addr;
> >  }
> > +
> > +/*
> > + * On pentium series, Unsynchronized cross-modifying code
> > + * operations can cause unexpected instruction execution results.
> > + * So after code modified, we should synchronize it on each processor.
> > + */
> > +static void __kprobes __local_sync_core(void *info)
> > +{
> > +	sync_core();
> > +}
> > +
> > +void __kprobes sync_core_all(void)
> > +{
> > +	on_each_cpu(__local_sync_core, NULL, 1);
> 
> OK, so you rely on the fact that on_each_cpu has memory barriers and
> executes the remote "__local_sync_core" with the appropriate memory
> barriers underneath, am I correct ?
> 
> > +}
> > +
> > +/* Safely cross-code modifying with fixup address */
> > +static void *patch_fixup_from;
> > +static void *patch_fixup_addr;
> > +
> > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > +					      unsigned long val, void *data)
> > +{
> > +	struct die_args *args = data;
> > +	struct pt_regs *regs = args->regs;
> > +
> > +	if (likely(!patch_fixup_from))
> > +		return NOTIFY_DONE;
> > +
> > +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > +	    (unsigned long)patch_fixup_from != regs->ip)
> > +		return NOTIFY_DONE;
> > +
> > +	args->regs->ip = (unsigned long)patch_fixup_addr;
> > +	return NOTIFY_STOP;
> > +}
> > +
> > +/**
> > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > + * @addr:	Modifying address.
> > + * @opcode:	New instruction.
> > + * @len:	length of modifying bytes.
> > + * @fixup:	Fixup address.
> > + *
> > + * Note: You must backup replaced instructions before calling this,
> > + * if you need to recover it.
> > + * Note: Must be called under text_mutex.
> > + */
> > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > +				void *fixup)
> > +{
> > +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > +	static const int int3_size = sizeof(int3_insn);
> > +
> > +	/* Replacing 1 byte can be done atomically. */
> 
> I'm sure it can be done atomically, but can it be done safely though ?
> 
> (disclaimer: we're still waiting for the official answer on this
> statement): Assuming instruction trace cache effects of SMP cross-code
> modification, and that only int3 would be safe to patch, then even
> changing 1 single byte could only be done by going to an intermediate
> int3 and synchronizing all other cores.
> 
> > +	if (unlikely(len <= 1))
> > +		return text_poke(addr, opcode, len);
> > +
> > +	/* Preparing */
> > +	patch_fixup_addr = fixup;
> > +	wmb();
> 
> hrm, missing comment ?
> 
> > +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > +
> > +	/* Cap by an int3 */
> > +	text_poke(addr, &int3_insn, int3_size);
> > +	sync_core_all();
> > +
> > +	/* Replace tail bytes */
> > +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > +		  len - int3_size);
> > +	sync_core_all();
> > +
> > +	/* Replace int3 with head byte */
> > +	text_poke(addr, opcode, int3_size);
> > +	sync_core_all();
> > +
> > +	/* Cleanup */
> > +	patch_fixup_from = NULL;
> > +	wmb();
> 
> missing comment here too.
> 
> > +	return addr;
> 
> Little quiz question:
> 
> When patch_fixup_from is set to NULL, what ensures that the int3
> handlers have completed their execution ?
> 
> I think it's probably OK, because the int3 is an interrupt gate, which
> therefore disables interrupts as soon as it runs, and executes the
> notifier while irqs are off. When we run sync_core_all() after replacing
> the int3 by the new 1st byte, we only return when all other cores have
> executed an interrupt, which implies that all int3 handlers previously
> running should have ended. Is it right ? It looks to me as if this 3rd
> sync_core_all() is only needed because of that. Probably that adding a
> comment would be good.
> 
> 
> Another thing: I've recently noticed that the following locking seems to
> hang the system with doing stress-testing concurrently with cpu
> hotplug/hotunplug:
> 
> mutex_lock(&text_mutex);
>   on_each_cpu(something, NULL, 1);
> 
> The hang seems to be caused by the fact that alternative.c has:
> 
> within cpu hotplug (cpu hotplug lock held)
>   mutex_lock(&text_mutex);
> 
> It might also be caused by the interaction with the stop_machine()
> performed within the cpu hotplug lock. I did not find the root cause of
> the problem, but this probably calls for lockdep improvements.
> 
> In any case, given you are dealing with the exact same locking scenario
> here, I would recomment the following stress-test:
> 
> in a loop, use text_poke_jump()
> in a loop, hotunplug all cpus, then hotplug all cpus
> 
> I had to fix this temporarily by taking get/put_online_cpus() about the
> text_mutex.
> 
> [snip snip]
> 
> > +static struct notifier_block patch_exceptions_nb = {
> > +	.notifier_call = patch_exceptions_notify,
> > +	.priority = 0x7fffffff /* we need to be notified first */
> > +};
> > +
> > +static int __init patch_init(void)
> > +{
> > +	return register_die_notifier(&patch_exceptions_nb);
> > +}
> > +
> > +arch_initcall(patch_init);
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index e5342a3..43a30d8 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> >  
> >  static struct notifier_block kprobe_exceptions_nb = {
> >  	.notifier_call = kprobe_exceptions_notify,
> > -	.priority = 0x7fffffff /* we need to be notified first */
> > +	.priority = 0x7ffffff0 /* High priority, but not first.  */
> 
> It would be good to keep all these priorities in a centralized header.
> 
> Thanks,
> 
> Mathieu
> 
> >  };
> >  
> >  unsigned long __weak arch_deref_entry_point(void *entry)
> > -- 
> > 1.6.5.1
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  0:58     ` Paul E. McKenney
@ 2009-11-19  1:22       ` Steven Rostedt
  2009-11-19  1:39         ` Paul E. McKenney
  2009-11-19  1:57       ` Mathieu Desnoyers
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2009-11-19  1:22 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Jason Baron, mingo, mhiramat, linux-kernel,
	hpa, tglx, andi, roland, rth

On Wed, 2009-11-18 at 16:58 -0800, Paul E. McKenney wrote:

> I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> suspicions of late.  But this seems to bypass the smp_call_function()
> code that is most definitely illegal to invoke with irqs disabled,
> so no smoking gun.  All that aside, if invoking smp_send_reschedule()
> with irqs disabled is in any way a bad idea, please let me know so I
> can rearrange the code appropriately.

I don't think you have anything to worry about here. If calling
smp_send_reschedule() with interrupts disabled were bad, then we
would have a lot of problems with the scheduler: it is called with the
rq lock held and with interrupts disabled.
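
Roughly, the pattern Steve is pointing at looks like this (a paraphrased
sketch from memory, not verbatim sched.c):

/* Paraphrased sketch: resched_task() runs under the rq lock with irqs
 * disabled and may still end up sending the reschedule IPI. */
static void resched_task_sketch(struct task_struct *p)
{
	int cpu;

	assert_spin_locked(&task_rq(p)->lock);	/* caller holds rq lock, irqs off */
	set_tsk_need_resched(p);
	cpu = task_cpu(p);
	if (cpu != smp_processor_id())
		smp_send_reschedule(cpu);	/* IPI sent with irqs disabled */
}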

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  1:22       ` Steven Rostedt
@ 2009-11-19  1:39         ` Paul E. McKenney
  0 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2009-11-19  1:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Jason Baron, mingo, mhiramat, linux-kernel,
	hpa, tglx, andi, roland, rth

On Wed, Nov 18, 2009 at 08:22:47PM -0500, Steven Rostedt wrote:
> On Wed, 2009-11-18 at 16:58 -0800, Paul E. McKenney wrote:
> 
> > I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> > suspicions of late.  But this seems to bypass the smp_call_function()
> > code that is most definitely illegal to invoke with irqs disabled,
> > so no smoking gun.  All that aside, if invoking smp_send_reschedule()
> > with irqs disabled is in any way a bad idea, please let me know so I
> > can rearrange the code appropriately.
> 
> I don't think you have anything to worry about here. If calling
> smp_send_reschedule was bad with keeping interrupts disabled, then we
> would have a lot of problems with the scheduler. That is called with the
> rq lock held and with interrupts disabled.

Whew!!!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  0:58     ` Paul E. McKenney
  2009-11-19  1:22       ` Steven Rostedt
@ 2009-11-19  1:57       ` Mathieu Desnoyers
  2009-11-19  4:16         ` Paul E. McKenney
  1 sibling, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-19  1:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jason Baron, mingo, mhiramat, linux-kernel, hpa, tglx, rostedt,
	andi, roland, rth

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> > * Jason Baron (jbaron@redhat.com) wrote:
> > > Add text_poke_fixup() which takes a fixup address to where a processor
> > > jumps if it hits the modifying address while code modifying.
> > > text_poke_fixup() does following steps for this purpose.
> > > 
> > >  1. Setup int3 handler for fixup.
> > >  2. Put a breakpoint (int3) on the first byte of modifying region,
> > >     and synchronize code on all CPUs.
> > >  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > >  4. Modify the first byte of modifying region, and synchronize code
> > >     on all CPUs.
> > >  5. Clear int3 handler.
> > > 
> > 
> > Hi Masami,
> > 
> > I like the approach and the API is clean. I have intersped comments
> > below.
> > 
> > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > experienced recently in the message below. Might be worth having a look,
> > I suspect this might have caused the hangs Paul McKenney had with his
> > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > hotplug callbacks and might have used IPIs within that same lock.
> 
> Hello, Mathieu,
> 
> By "mutex", do you mean "mutex_lock()"?  If so, I don't do that from
> within the CPU hotplug notifiers.  I do spin_lock_irqsave().
> 
> People have been known to invoke synchronize_rcu() from CPU hotplug
> notifiers, however -- and this does block.  Is that what you are
> getting at?

Hi Paul,

What I had in mind is more like an N-way deadlock involving the cpu
hotplug mutex, on_each_cpu, and possibly stop_machine interaction.
However, I did not push the lockup analysis far enough to know for sure,
but I felt it was worth letting others know. I am not sure if blocking
primitives could be affected by this.

> 
> I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> suspicions of late.  But this seems to bypass the smp_call_function()
> code that is most definitely illegal to invoke with irqs disabled,
> so no smoking gun.  All that aside, if invoking smp_send_reschedule()
> with irqs disabled is in any way a bad idea, please let me know so I
> can rearrange the code appropriately.
> 
> RCU is currently running reasonably well with the set of patches I have
> submitted.  It the kernel is now withstanding a level of punishment that
> would have reduced 2.6.28 (for example) to a steaming pile of bits, with
> or without the help of rcutorture.  And I am now hitting the occasional
> non-RCU bug, so I am starting to feel like RCU is approaching stability.
> Approaching, but not there yet -- a few suspicious areas remain.  Time
> to crank the testing up another notch or two.  ;-)

Nice :) I've been going through the same "stabilization" phase in the
past weeks for LTTng. It's good to see things stabilize even under heavy
cpu hotplug stress-tests.

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > > Thus, if some other processor execute modifying address when step2 to step4,
> > > it will be jumped to fixup code.
> > > 
> > > This still has many limitations for modifying multi-instructions at once.
> > > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > > because;
> > >  - Replaced instruction is just one instruction, which is executed atomically.
> > >  - Replacing instruction is a jump, so we can set fixup address where the jump
> > >    goes to.
> > > 
> > > Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> > > ---
> > >  arch/x86/include/asm/alternative.h |   12 ++++
> > >  arch/x86/include/asm/kprobes.h     |    1 +
> > >  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
> > >  kernel/kprobes.c                   |    2 +-
> > >  4 files changed, 134 insertions(+), 1 deletions(-)
> > > 
> > 
> > [snip snip]
> > 
> > > index de7353c..af47f12 100644
> > > --- a/arch/x86/kernel/alternative.c
> > > +++ b/arch/x86/kernel/alternative.c
> > > @@ -4,6 +4,7 @@
> > >  #include <linux/list.h>
> > >  #include <linux/stringify.h>
> > >  #include <linux/kprobes.h>
> > > +#include <linux/kdebug.h>
> > >  #include <linux/mm.h>
> > >  #include <linux/vmalloc.h>
> > >  #include <linux/memory.h>
> > > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > >  	local_irq_restore(flags);
> > >  	return addr;
> > >  }
> > > +
> > > +/*
> > > + * On pentium series, Unsynchronized cross-modifying code
> > > + * operations can cause unexpected instruction execution results.
> > > + * So after code modified, we should synchronize it on each processor.
> > > + */
> > > +static void __kprobes __local_sync_core(void *info)
> > > +{
> > > +	sync_core();
> > > +}
> > > +
> > > +void __kprobes sync_core_all(void)
> > > +{
> > > +	on_each_cpu(__local_sync_core, NULL, 1);
> > 
> > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > executes the remote "__local_sync_core" with the appropriate memory
> > barriers underneath, am I correct ?
> > 
> > > +}
> > > +
> > > +/* Safely cross-code modifying with fixup address */
> > > +static void *patch_fixup_from;
> > > +static void *patch_fixup_addr;
> > > +
> > > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > > +					      unsigned long val, void *data)
> > > +{
> > > +	struct die_args *args = data;
> > > +	struct pt_regs *regs = args->regs;
> > > +
> > > +	if (likely(!patch_fixup_from))
> > > +		return NOTIFY_DONE;
> > > +
> > > +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > > +	    (unsigned long)patch_fixup_from != regs->ip)
> > > +		return NOTIFY_DONE;
> > > +
> > > +	args->regs->ip = (unsigned long)patch_fixup_addr;
> > > +	return NOTIFY_STOP;
> > > +}
> > > +
> > > +/**
> > > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > > + * @addr:	Modifying address.
> > > + * @opcode:	New instruction.
> > > + * @len:	length of modifying bytes.
> > > + * @fixup:	Fixup address.
> > > + *
> > > + * Note: You must backup replaced instructions before calling this,
> > > + * if you need to recover it.
> > > + * Note: Must be called under text_mutex.
> > > + */
> > > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > > +				void *fixup)
> > > +{
> > > +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > > +	static const int int3_size = sizeof(int3_insn);
> > > +
> > > +	/* Replacing 1 byte can be done atomically. */
> > 
> > I'm sure it can be done atomically, but can it be done safely though ?
> > 
> > (disclaimer: we're still waiting for the official answer on this
> > statement): Assuming instruction trace cache effects of SMP cross-code
> > modification, and that only int3 would be safe to patch, then even
> > changing 1 single byte could only be done by going to an intermediate
> > int3 and synchronizing all other cores.
> > 
> > > +	if (unlikely(len <= 1))
> > > +		return text_poke(addr, opcode, len);
> > > +
> > > +	/* Preparing */
> > > +	patch_fixup_addr = fixup;
> > > +	wmb();
> > 
> > hrm, missing comment ?
> > 
> > > +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > > +
> > > +	/* Cap by an int3 */
> > > +	text_poke(addr, &int3_insn, int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Replace tail bytes */
> > > +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > > +		  len - int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Replace int3 with head byte */
> > > +	text_poke(addr, opcode, int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Cleanup */
> > > +	patch_fixup_from = NULL;
> > > +	wmb();
> > 
> > missing comment here too.
> > 
> > > +	return addr;
> > 
> > Little quiz question:
> > 
> > When patch_fixup_from is set to NULL, what ensures that the int3
> > handlers have completed their execution ?
> > 
> > I think it's probably OK, because the int3 is an interrupt gate, which
> > therefore disables interrupts as soon as it runs, and executes the
> > notifier while irqs are off. When we run sync_core_all() after replacing
> > the int3 by the new 1st byte, we only return when all other cores have
> > executed an interrupt, which implies that all int3 handlers previously
> > running should have ended. Is it right ? It looks to me as if this 3rd
> > sync_core_all() is only needed because of that. Probably that adding a
> > comment would be good.
> > 
> > 
> > Another thing: I've recently noticed that the following locking seems to
> > hang the system with doing stress-testing concurrently with cpu
> > hotplug/hotunplug:
> > 
> > mutex_lock(&text_mutex);
> >   on_each_cpu(something, NULL, 1);
> > 
> > The hang seems to be caused by the fact that alternative.c has:
> > 
> > within cpu hotplug (cpu hotplug lock held)
> >   mutex_lock(&text_mutex);
> > 
> > It might also be caused by the interaction with the stop_machine()
> > performed within the cpu hotplug lock. I did not find the root cause of
> > the problem, but this probably calls for lockdep improvements.
> > 
> > In any case, given you are dealing with the exact same locking scenario
> > here, I would recomment the following stress-test:
> > 
> > in a loop, use text_poke_jump()
> > in a loop, hotunplug all cpus, then hotplug all cpus
> > 
> > I had to fix this temporarily by taking get/put_online_cpus() about the
> > text_mutex.
> > 
> > [snip snip]
> > 
> > > +static struct notifier_block patch_exceptions_nb = {
> > > +	.notifier_call = patch_exceptions_notify,
> > > +	.priority = 0x7fffffff /* we need to be notified first */
> > > +};
> > > +
> > > +static int __init patch_init(void)
> > > +{
> > > +	return register_die_notifier(&patch_exceptions_nb);
> > > +}
> > > +
> > > +arch_initcall(patch_init);
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index e5342a3..43a30d8 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> > >  
> > >  static struct notifier_block kprobe_exceptions_nb = {
> > >  	.notifier_call = kprobe_exceptions_notify,
> > > -	.priority = 0x7fffffff /* we need to be notified first */
> > > +	.priority = 0x7ffffff0 /* High priority, but not first.  */
> > 
> > It would be good to keep all these priorities in a centralized header.
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > >  };
> > >  
> > >  unsigned long __weak arch_deref_entry_point(void *entry)
> > > -- 
> > > 1.6.5.1
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] jump label v3
  2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
                   ` (6 preceding siblings ...)
  2009-11-18 22:51 ` [RFC PATCH 0/6] jump label v3 H. Peter Anvin
@ 2009-11-19  3:54 ` Roland McGrath
  2009-11-19 21:55   ` Jason Baron
  7 siblings, 1 reply; 35+ messages in thread
From: Roland McGrath @ 2009-11-19  3:54 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi,
	rth, mhiramat

This looks like another good iteration, still imperfect but getting there.

There are three aspects of this work that are largely separable.

1. the actual instruction-poking for runtime switching

   Masami, Mathieu, et al are hashing this out and I have stopped trying to
   keep track of the details.

2. optimal compiled hot path code

   You and Richard have been working on this in gcc and we know the state
   of it now.  When we get the cold labels feature done, it will be ideal
   for -O(2?).  But people mostly use -Os, and no block reordering gets
   done there now (I think this may even mean likely/unlikely don't
   really change which path is the straight line; the source order
   of the blocks still determines it).  So we hope for more incremental
   improvements here, and maybe even really optimal code for -O2 soon.
   But at least for -Os it may not be better than "unconditional jump
   around" as the "straight line" path in the foreseeable future.  As
   noted, that alone is still a nice savings over the status quo for the
   disabled case.  (You gave an "average cycles saved" for this vs a load
   and test, but do you have any comparisons of how those two compare to
   no tracepoint at all?)

3. bookkeeping magic to find all the jumps to enable for a given tracepoint

   Here you have a working first draft, but it looks pretty clunky.
   That strcmp just makes me gag.  For a first version that's still
   pretty simple, I think it should be trivial to use a pointer
   comparison there.  For tracepoints, it can be the address of the
   struct tracepoint.  For the general case, it can be the address of
   the global that would be the flag variable in the case of no asm goto support.

   For more incremental improvements, we could cut down on running
   through the entire table for every switch.  If there are many
   different switches (as there are already for many different
   tracepoints), then you really just want to run through the
   insn-patch list for the particular switch when you toggle it.  

   It's possible to group this all statically at link time, but all
   the linker magic hacking required to get that to go is probably
   more trouble than it's worth.  

   A simple hack is to run through the big unsorted table at boot time
   and turn it into a contiguous table for each switch.  Then
   e.g. hang each table off the per-switch global variable by the same
   name that in a no-asm-goto build would be the simple global flag
   (a rough sketch of this follows below).

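As an illustration of that boot-time grouping (the entry layout and the
symbol names below are made up for this sketch, not taken from the posted
patches):

#include <linux/init.h>
#include <linux/sort.h>

/* Hypothetical per-site record; the posted patches currently key the
 * table on a string in __jump_strings rather than on a pointer. */
struct jump_entry {
	unsigned long code;	/* address of the patched NOP/JMP site */
	unsigned long target;	/* address of the do_trace label */
	void *key;		/* &__tracepoint_foo, or the flag variable */
};

extern struct jump_entry __start___jump_table[], __stop___jump_table[];

static int jump_entry_cmp(const void *a, const void *b)
{
	const struct jump_entry *ea = a, *eb = b;

	if (ea->key < eb->key)
		return -1;
	return ea->key > eb->key;
}

/* Boot time: sort the unsorted section so each switch owns one contiguous
 * run of entries; toggling a switch then walks only its own slice instead
 * of the whole table. */
static void __init sort_jump_table(void)
{
	sort(__start___jump_table,
	     __stop___jump_table - __start___jump_table,
	     sizeof(struct jump_entry), jump_entry_cmp, NULL);
}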

Finally, for using this for general purposes unrelated to tracepoints,
I envision something like:

	DECLARE_MOSTLY_NOT(foobar);

	foo(int x, int y)
	{
		if (x > y && mostly_not(foobar))
			do_foobar(x - y);
	}

	... set_mostly_not(foobar, onoff);

where it's:

#define DECLARE_MOSTLY_NOT(name) ... __something_##name
#define mostly_not(name) ({ int _doit = 0; __label__ _yes; \
			    JUMP_LABEL(name, _yes, __something_##name); \
			    if (0) _yes: __cold _doit = 1; \
			    unlikely (_doit); })

I don't think we've tried to figure out how well this compiles yet.
But it shows the sort of thing that we can do to expose this feature
in a way that's simple and unrestrictive for kernel code to use casually.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] notifier atomic call chain notrace
  2009-11-18 23:38   ` [PATCH] notifier atomic call chain notrace Mathieu Desnoyers
  2009-11-19  0:02     ` Paul E. McKenney
@ 2009-11-19  3:59     ` Masami Hiramatsu
  2009-11-19 16:48     ` Jason Baron
  2 siblings, 0 replies; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-19  3:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, linux-kernel, mingo, hpa, tglx, rostedt, andi,
	roland, rth, Paul E. McKenney

Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
>> Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
>> condition. This is b/c the die_notifier takes an rcu_read_lock() on the
>> int3 trap, which then causes another one etc. Since, we aren't going to be
>> installing removing the handler, the rcu_read_lock() could be avoided for this
>> case with some code restructuring.
>>
> [snip]
> 
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.

Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com>

The code itself seems OK to me. :-)

I'd just like to hear Ingo's opinion, since this change
switches all atomic-notifier chains to the notrace
read-side primitives.

> notifier atomic call chain notrace
> 
> In LTTng, being able to use the atomic notifier from cpu idle entry to
> ensure the tracer flush the last events in the current subbuffer
> requires the rcu read-side to be marked "notrace", otherwise it can end
> up calling back into lockdep and the tracer.
> 
> Also apply to the the die notifier.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> CC: Jason Baron <jbaron@redhat.com>
> CC: mingo@elte.hu
> ---
>  kernel/notifier.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6-lttng/kernel/notifier.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/notifier.c	2009-11-12 17:58:56.000000000 -0500
> +++ linux-2.6-lttng/kernel/notifier.c	2009-11-12 18:03:28.000000000 -0500
> @@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
>  	spin_lock_irqsave(&nh->lock, flags);
>  	ret = notifier_chain_unregister(&nh->head, n);
>  	spin_unlock_irqrestore(&nh->lock, flags);
> -	synchronize_rcu();
> +	synchronize_sched();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
> @@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
>  {
>  	int ret;
>  
> -	rcu_read_lock();
> +	rcu_read_lock_sched_notrace();
>  	ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched_notrace();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  1:57       ` Mathieu Desnoyers
@ 2009-11-19  4:16         ` Paul E. McKenney
  0 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2009-11-19  4:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, mingo, mhiramat, linux-kernel, hpa, tglx, rostedt,
	andi, roland, rth

On Wed, Nov 18, 2009 at 08:57:56PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> > > * Jason Baron (jbaron@redhat.com) wrote:
> > > > Add text_poke_fixup() which takes a fixup address to where a processor
> > > > jumps if it hits the modifying address while code modifying.
> > > > text_poke_fixup() does following steps for this purpose.
> > > > 
> > > >  1. Setup int3 handler for fixup.
> > > >  2. Put a breakpoint (int3) on the first byte of modifying region,
> > > >     and synchronize code on all CPUs.
> > > >  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > > >  4. Modify the first byte of modifying region, and synchronize code
> > > >     on all CPUs.
> > > >  5. Clear int3 handler.
> > > > 
> > > 
> > > Hi Masami,
> > > 
> > > I like the approach and the API is clean. I have intersped comments
> > > below.
> > > 
> > > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > > experienced recently in the message below. Might be worth having a look,
> > > I suspect this might have caused the hangs Paul McKenney had with his
> > > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > > hotplug callbacks and might have used IPIs within that same lock.
> > 
> > Hello, Mathieu,
> > 
> > By "mutex", do you mean "mutex_lock()"?  If so, I don't do that from
> > within the CPU hotplug notifiers.  I do spin_lock_irqsave().
> > 
> > People have been known to invoke synchronize_rcu() from CPU hotplug
> > notifiers, however -- and this does block.  Is that what you are
> > getting at?
> 
> Hi Paul,
> 
> What I had in mind is more like a N-way deadlock involving the cpu
> hotplug mutex, on_each_cpu, and possibly stop_machine interaction.
> However I did not push the lockup analysis far enough to know for sure,
> but I feel like it's good to let others know. I am not sure if blocking
> primitives could be affected by this.

Then you might be interested to hear that my attempts to move
rcu_offline_cpu() to CPU_DYING and rcu_online_cpu() to CPU_STARTING
did result in howls of pain from lockdep as well as deadlocks.
I abandoned such attempts.  ;-)

(The reason behind my attempts was to reduce the number and complexity
of race conditions in the RCU implementations -- but no big deal, as I
have found other ways.)

> > I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> > suspicions of late.  But this seems to bypass the smp_call_function()
> > code that is most definitely illegal to invoke with irqs disabled,
> > so no smoking gun.  All that aside, if invoking smp_send_reschedule()
> > with irqs disabled is in any way a bad idea, please let me know so I
> > can rearrange the code appropriately.
> > 
> > RCU is currently running reasonably well with the set of patches I have
> > submitted.  It the kernel is now withstanding a level of punishment that
> > would have reduced 2.6.28 (for example) to a steaming pile of bits, with
> > or without the help of rcutorture.  And I am now hitting the occasional
> > non-RCU bug, so I am starting to feel like RCU is approaching stability.
> > Approaching, but not there yet -- a few suspicious areas remain.  Time
> > to crank the testing up another notch or two.  ;-)
> 
> Nice :) I've been going through the same "stabilization" phase in the
> past weeks for LTTng. It's good to see things stabilize even under heavy
> cpu hotplug stress-tests.

Cool!!!  ;-)

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > 							Thanx, Paul
> > 
> > > > Thus, if some other processor execute modifying address when step2 to step4,
> > > > it will be jumped to fixup code.
> > > > 
> > > > This still has many limitations for modifying multi-instructions at once.
> > > > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > > > because;
> > > >  - Replaced instruction is just one instruction, which is executed atomically.
> > > >  - Replacing instruction is a jump, so we can set fixup address where the jump
> > > >    goes to.
> > > > 
> > > > Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> > > > ---
> > > >  arch/x86/include/asm/alternative.h |   12 ++++
> > > >  arch/x86/include/asm/kprobes.h     |    1 +
> > > >  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
> > > >  kernel/kprobes.c                   |    2 +-
> > > >  4 files changed, 134 insertions(+), 1 deletions(-)
> > > > 
> > > 
> > > [snip snip]
> > > 
> > > > index de7353c..af47f12 100644
> > > > --- a/arch/x86/kernel/alternative.c
> > > > +++ b/arch/x86/kernel/alternative.c
> > > > @@ -4,6 +4,7 @@
> > > >  #include <linux/list.h>
> > > >  #include <linux/stringify.h>
> > > >  #include <linux/kprobes.h>
> > > > +#include <linux/kdebug.h>
> > > >  #include <linux/mm.h>
> > > >  #include <linux/vmalloc.h>
> > > >  #include <linux/memory.h>
> > > > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > > >  	local_irq_restore(flags);
> > > >  	return addr;
> > > >  }
> > > > +
> > > > +/*
> > > > + * On pentium series, Unsynchronized cross-modifying code
> > > > + * operations can cause unexpected instruction execution results.
> > > > + * So after code modified, we should synchronize it on each processor.
> > > > + */
> > > > +static void __kprobes __local_sync_core(void *info)
> > > > +{
> > > > +	sync_core();
> > > > +}
> > > > +
> > > > +void __kprobes sync_core_all(void)
> > > > +{
> > > > +	on_each_cpu(__local_sync_core, NULL, 1);
> > > 
> > > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > > executes the remote "__local_sync_core" with the appropriate memory
> > > barriers underneath, am I correct ?
> > > 
> > > > +}
> > > > +
> > > > +/* Safely cross-code modifying with fixup address */
> > > > +static void *patch_fixup_from;
> > > > +static void *patch_fixup_addr;
> > > > +
> > > > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > > > +					      unsigned long val, void *data)
> > > > +{
> > > > +	struct die_args *args = data;
> > > > +	struct pt_regs *regs = args->regs;
> > > > +
> > > > +	if (likely(!patch_fixup_from))
> > > > +		return NOTIFY_DONE;
> > > > +
> > > > +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > > > +	    (unsigned long)patch_fixup_from != regs->ip)
> > > > +		return NOTIFY_DONE;
> > > > +
> > > > +	args->regs->ip = (unsigned long)patch_fixup_addr;
> > > > +	return NOTIFY_STOP;
> > > > +}
> > > > +
> > > > +/**
> > > > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > > > + * @addr:	Modifying address.
> > > > + * @opcode:	New instruction.
> > > > + * @len:	length of modifying bytes.
> > > > + * @fixup:	Fixup address.
> > > > + *
> > > > + * Note: You must backup replaced instructions before calling this,
> > > > + * if you need to recover it.
> > > > + * Note: Must be called under text_mutex.
> > > > + */
> > > > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > > > +				void *fixup)
> > > > +{
> > > > +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > > > +	static const int int3_size = sizeof(int3_insn);
> > > > +
> > > > +	/* Replacing 1 byte can be done atomically. */
> > > 
> > > I'm sure it can be done atomically, but can it be done safely though ?
> > > 
> > > (disclaimer: we're still waiting for the official answer on this
> > > statement): Assuming instruction trace cache effects of SMP cross-code
> > > modification, and that only int3 would be safe to patch, then even
> > > changing 1 single byte could only be done by going to an intermediate
> > > int3 and synchronizing all other cores.
> > > 
> > > > +	if (unlikely(len <= 1))
> > > > +		return text_poke(addr, opcode, len);
> > > > +
> > > > +	/* Preparing */
> > > > +	patch_fixup_addr = fixup;
> > > > +	wmb();
> > > 
> > > hrm, missing comment ?
> > > 
> > > > +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > > > +
> > > > +	/* Cap by an int3 */
> > > > +	text_poke(addr, &int3_insn, int3_size);
> > > > +	sync_core_all();
> > > > +
> > > > +	/* Replace tail bytes */
> > > > +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > > > +		  len - int3_size);
> > > > +	sync_core_all();
> > > > +
> > > > +	/* Replace int3 with head byte */
> > > > +	text_poke(addr, opcode, int3_size);
> > > > +	sync_core_all();
> > > > +
> > > > +	/* Cleanup */
> > > > +	patch_fixup_from = NULL;
> > > > +	wmb();
> > > 
> > > missing comment here too.
> > > 
> > > > +	return addr;
> > > 
> > > Little quiz question:
> > > 
> > > When patch_fixup_from is set to NULL, what ensures that the int3
> > > handlers have completed their execution ?
> > > 
> > > I think it's probably OK, because the int3 is an interrupt gate, which
> > > therefore disables interrupts as soon as it runs, and executes the
> > > notifier while irqs are off. When we run sync_core_all() after replacing
> > > the int3 by the new 1st byte, we only return when all other cores have
> > > executed an interrupt, which implies that all int3 handlers previously
> > > running should have ended. Is it right ? It looks to me as if this 3rd
> > > sync_core_all() is only needed because of that. Probably that adding a
> > > comment would be good.
> > > 
> > > 
> > > Another thing: I've recently noticed that the following locking seems to
> > > hang the system with doing stress-testing concurrently with cpu
> > > hotplug/hotunplug:
> > > 
> > > mutex_lock(&text_mutex);
> > >   on_each_cpu(something, NULL, 1);
> > > 
> > > The hang seems to be caused by the fact that alternative.c has:
> > > 
> > > within cpu hotplug (cpu hotplug lock held)
> > >   mutex_lock(&text_mutex);
> > > 
> > > It might also be caused by the interaction with the stop_machine()
> > > performed within the cpu hotplug lock. I did not find the root cause of
> > > the problem, but this probably calls for lockdep improvements.
> > > 
> > > In any case, given you are dealing with the exact same locking scenario
> > > here, I would recomment the following stress-test:
> > > 
> > > in a loop, use text_poke_jump()
> > > in a loop, hotunplug all cpus, then hotplug all cpus
> > > 
> > > I had to fix this temporarily by taking get/put_online_cpus() about the
> > > text_mutex.
> > > 
> > > [snip snip]
> > > 
> > > > +static struct notifier_block patch_exceptions_nb = {
> > > > +	.notifier_call = patch_exceptions_notify,
> > > > +	.priority = 0x7fffffff /* we need to be notified first */
> > > > +};
> > > > +
> > > > +static int __init patch_init(void)
> > > > +{
> > > > +	return register_die_notifier(&patch_exceptions_nb);
> > > > +}
> > > > +
> > > > +arch_initcall(patch_init);
> > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > index e5342a3..43a30d8 100644
> > > > --- a/kernel/kprobes.c
> > > > +++ b/kernel/kprobes.c
> > > > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> > > >  
> > > >  static struct notifier_block kprobe_exceptions_nb = {
> > > >  	.notifier_call = kprobe_exceptions_notify,
> > > > -	.priority = 0x7fffffff /* we need to be notified first */
> > > > +	.priority = 0x7ffffff0 /* High priority, but not first.  */
> > > 
> > > It would be good to keep all these priorities in a centralized header.
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > >  };
> > > >  
> > > >  unsigned long __weak arch_deref_entry_point(void *entry)
> > > > -- 
> > > > 1.6.5.1
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  0:28   ` Mathieu Desnoyers
  2009-11-19  0:58     ` Paul E. McKenney
@ 2009-11-19 14:04     ` Masami Hiramatsu
  2009-11-19 16:03       ` Mathieu Desnoyers
  2009-11-21  1:11     ` Masami Hiramatsu
  2 siblings, 1 reply; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-19 14:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>>  1. Setup int3 handler for fixup.
>>  2. Put a breakpoint (int3) on the first byte of modifying region,
>>     and synchronize code on all CPUs.
>>  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>  4. Modify the first byte of modifying region, and synchronize code
>>     on all CPUs.
>>  5. Clear int3 handler.
>>
> 
> Hi Masami,
> 
> I like the approach and the API is clean. I have intersped comments
> below.

Hi Mathieu,

Thank you for reviewing :-)

> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.
> 
>> Thus, if some other processor execute modifying address when step2 to step4,
>> it will be jumped to fixup code.
>>
>> This still has many limitations for modifying multi-instructions at once.
>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>> because;
>>  - Replaced instruction is just one instruction, which is executed atomically.
>>  - Replacing instruction is a jump, so we can set fixup address where the jump
>>    goes to.
>>
>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>> ---
>>  arch/x86/include/asm/alternative.h |   12 ++++
>>  arch/x86/include/asm/kprobes.h     |    1 +
>>  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
>>  kernel/kprobes.c                   |    2 +-
>>  4 files changed, 134 insertions(+), 1 deletions(-)
>>
> 
> [snip snip]
> 
>> index de7353c..af47f12 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/list.h>
>>  #include <linux/stringify.h>
>>  #include <linux/kprobes.h>
>> +#include <linux/kdebug.h>
>>  #include <linux/mm.h>
>>  #include <linux/vmalloc.h>
>>  #include <linux/memory.h>
>> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>>  	local_irq_restore(flags);
>>  	return addr;
>>  }
>> +
>> +/*
>> + * On pentium series, Unsynchronized cross-modifying code
>> + * operations can cause unexpected instruction execution results.
>> + * So after code modified, we should synchronize it on each processor.
>> + */
>> +static void __kprobes __local_sync_core(void *info)
>> +{
>> +	sync_core();
>> +}
>> +
>> +void __kprobes sync_core_all(void)
>> +{
>> +	on_each_cpu(__local_sync_core, NULL, 1);
> 
> OK, so you rely on the fact that on_each_cpu has memory barriers and
> executes the remote "__local_sync_core" with the appropriate memory
> barriers underneath, am I correct ?

Right, at least I expected that.

>> +}
>> +
>> +/* Safely cross-code modifying with fixup address */
>> +static void *patch_fixup_from;
>> +static void *patch_fixup_addr;
>> +
>> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
>> +					      unsigned long val, void *data)
>> +{
>> +	struct die_args *args = data;
>> +	struct pt_regs *regs = args->regs;
>> +
>> +	if (likely(!patch_fixup_from))
>> +		return NOTIFY_DONE;
>> +
>> +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
>> +	    (unsigned long)patch_fixup_from != regs->ip)
>> +		return NOTIFY_DONE;
>> +
>> +	args->regs->ip = (unsigned long)patch_fixup_addr;
>> +	return NOTIFY_STOP;
>> +}
>> +
>> +/**
>> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
>> + * @addr:	Modifying address.
>> + * @opcode:	New instruction.
>> + * @len:	length of modifying bytes.
>> + * @fixup:	Fixup address.
>> + *
>> + * Note: You must backup replaced instructions before calling this,
>> + * if you need to recover it.
>> + * Note: Must be called under text_mutex.
>> + */
>> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
>> +				void *fixup)
>> +{
>> +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
>> +	static const int int3_size = sizeof(int3_insn);
>> +
>> +	/* Replacing 1 byte can be done atomically. */
> 
> I'm sure it can be done atomically, but can it be done safely though ?
> 
> (disclaimer: we're still waiting for the official answer on this
> statement): Assuming instruction trace cache effects of SMP cross-code
> modification, and that only int3 would be safe to patch, then even
> changing 1 single byte could only be done by going to an intermediate
> int3 and synchronizing all other cores.

Ah, I see. Certainly, that's possible (int3 might be treated specially by
the cache coherency protocol). So even for len == 1, we might be better
off using the int3 capping.

> 
>> +	if (unlikely(len <= 1))
>> +		return text_poke(addr, opcode, len);
>> +
>> +	/* Preparing */
>> +	patch_fixup_addr = fixup;
>> +	wmb();
> 
> hrm, missing comment ?

Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
The int3 trap handler checks patch_fixup_from first and then refers to
patch_fixup_addr.

> 
>> +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>> +
>> +	/* Cap by an int3 */
>> +	text_poke(addr, &int3_insn, int3_size);
>> +	sync_core_all();
>> +
>> +	/* Replace tail bytes */
>> +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>> +		  len - int3_size);
>> +	sync_core_all();
>> +
>> +	/* Replace int3 with head byte */
>> +	text_poke(addr, opcode, int3_size);
>> +	sync_core_all();
>> +
>> +	/* Cleanup */
>> +	patch_fixup_from = NULL;
>> +	wmb();
> 
> missing comment here too.
> 
>> +	return addr;
> 
> Little quiz question:
> 
> When patch_fixup_from is set to NULL, what ensures that the int3
> handlers have completed their execution ?
> 
> I think it's probably OK, because the int3 is an interrupt gate, which
> therefore disables interrupts as soon as it runs, and executes the
> notifier while irqs are off. When we run sync_core_all() after replacing
> the int3 by the new 1st byte, we only return when all other cores have
> executed an interrupt, which implies that all int3 handlers previously
> running should have ended. Is it right ? It looks to me as if this 3rd
> sync_core_all() is only needed because of that. Probably that adding a
> comment would be good.

Thanks, it's a good point and that's along the lines of what I thought.
As you said, it is probably safe. Even if it's not, we can add an int3
fixup handler (with the lowest priority) which sets regs->ip back by one
if there is no int3 there anymore, for safety.

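Something along those lines, as a rough sketch only (the handler name,
the priority, and the use of probe_kernel_read() are just illustrative):

/* Lowest-priority safety net: if the byte we trapped on is no longer an
 * int3, the breakpoint has already been replaced, so back up and
 * re-execute the new first byte. */
static int __kprobes patch_int3_fallback_notify(struct notifier_block *self,
						unsigned long val, void *data)
{
	struct die_args *args = data;
	unsigned char byte;

	if (val != DIE_INT3 || !args->regs || user_mode_vm(args->regs))
		return NOTIFY_DONE;
	if (probe_kernel_read(&byte, (void *)(args->regs->ip - 1), 1))
		return NOTIFY_DONE;
	if (byte == BREAKPOINT_INSTRUCTION)
		return NOTIFY_DONE;	/* a real int3 is still there */
	args->regs->ip -= 1;		/* retry the now-patched instruction */
	return NOTIFY_STOP;
}

static struct notifier_block patch_int3_fallback_nb = {
	.notifier_call	= patch_int3_fallback_notify,
	.priority	= INT_MIN,	/* run after every other int3 user */
};
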
> Another thing: I've recently noticed that the following locking seems to
> hang the system with doing stress-testing concurrently with cpu
> hotplug/hotunplug:
> 
> mutex_lock(&text_mutex);
>   on_each_cpu(something, NULL, 1);
> 
> The hang seems to be caused by the fact that alternative.c has:
> 
> within cpu hotplug (cpu hotplug lock held)
>   mutex_lock(&text_mutex);
> 
> It might also be caused by the interaction with the stop_machine()
> performed within the cpu hotplug lock. I did not find the root cause of
> the problem, but this probably calls for lockdep improvements.

Hmm, do you mean it can happen even if we use stop_machine()
under text_mutex locking?
That seems like a bigger problem between cpu-hotplug and on_each_cpu(), etc.

> In any case, given you are dealing with the exact same locking scenario
> here, I would recomment the following stress-test:
> 
> in a loop, use text_poke_jump()
> in a loop, hotunplug all cpus, then hotplug all cpus
> 
> I had to fix this temporarily by taking get/put_online_cpus() about the
> text_mutex.
> 
> [snip snip]
> 
>> +static struct notifier_block patch_exceptions_nb = {
>> +	.notifier_call = patch_exceptions_notify,
>> +	.priority = 0x7fffffff /* we need to be notified first */
>> +};
>> +
>> +static int __init patch_init(void)
>> +{
>> +	return register_die_notifier(&patch_exceptions_nb);
>> +}
>> +
>> +arch_initcall(patch_init);
>> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
>> index e5342a3..43a30d8 100644
>> --- a/kernel/kprobes.c
>> +++ b/kernel/kprobes.c
>> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
>>  
>>  static struct notifier_block kprobe_exceptions_nb = {
>>  	.notifier_call = kprobe_exceptions_notify,
>> -	.priority = 0x7fffffff /* we need to be notified first */
>> +	.priority = 0x7ffffff0 /* High priority, but not first.  */
> 
> It would be good to keep all these priorities in a centralized header.

Sure, I'll put it in kprobes.h.
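
For example (names and exact values are only illustrative, mirroring the
two priorities discussed in this thread):

/* somewhere in include/linux/kprobes.h */
#define INT3_NOTIFY_PRIO_TEXT_PATCH	0x7fffffff	/* must run first */
#define INT3_NOTIFY_PRIO_KPROBE		0x7ffffff0	/* after the patcher */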

Thank you!
-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19 14:04     ` Masami Hiramatsu
@ 2009-11-19 16:03       ` Mathieu Desnoyers
  2009-11-20  1:00         ` Masami Hiramatsu
  0 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-19 16:03 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > * Jason Baron (jbaron@redhat.com) wrote:
> >> Add text_poke_fixup() which takes a fixup address to where a processor
> >> jumps if it hits the modifying address while code modifying.
> >> text_poke_fixup() does following steps for this purpose.
> >>
> >>  1. Setup int3 handler for fixup.
> >>  2. Put a breakpoint (int3) on the first byte of modifying region,
> >>     and synchronize code on all CPUs.
> >>  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> >>  4. Modify the first byte of modifying region, and synchronize code
> >>     on all CPUs.
> >>  5. Clear int3 handler.
> >>
> > 
> > Hi Masami,
> > 
> > I like the approach and the API is clean. I have intersped comments
> > below.
> 
> Hi Mathieu,
> 
> Thank you for reviewing :-)
> 
> > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > experienced recently in the message below. Might be worth having a look,
> > I suspect this might have caused the hangs Paul McKenney had with his
> > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > hotplug callbacks and might have used IPIs within that same lock.
> > 
> >> Thus, if some other processor execute modifying address when step2 to step4,
> >> it will be jumped to fixup code.
> >>
> >> This still has many limitations for modifying multi-instructions at once.
> >> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> >> because;
> >>  - Replaced instruction is just one instruction, which is executed atomically.
> >>  - Replacing instruction is a jump, so we can set fixup address where the jump
> >>    goes to.
> >>
> >> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> >> ---
> >>  arch/x86/include/asm/alternative.h |   12 ++++
> >>  arch/x86/include/asm/kprobes.h     |    1 +
> >>  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
> >>  kernel/kprobes.c                   |    2 +-
> >>  4 files changed, 134 insertions(+), 1 deletions(-)
> >>
> > 
> > [snip snip]
> > 
> >> index de7353c..af47f12 100644
> >> --- a/arch/x86/kernel/alternative.c
> >> +++ b/arch/x86/kernel/alternative.c
> >> @@ -4,6 +4,7 @@
> >>  #include <linux/list.h>
> >>  #include <linux/stringify.h>
> >>  #include <linux/kprobes.h>
> >> +#include <linux/kdebug.h>
> >>  #include <linux/mm.h>
> >>  #include <linux/vmalloc.h>
> >>  #include <linux/memory.h>
> >> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> >>  	local_irq_restore(flags);
> >>  	return addr;
> >>  }
> >> +
> >> +/*
> >> + * On pentium series, Unsynchronized cross-modifying code
> >> + * operations can cause unexpected instruction execution results.
> >> + * So after code modified, we should synchronize it on each processor.
> >> + */
> >> +static void __kprobes __local_sync_core(void *info)
> >> +{
> >> +	sync_core();
> >> +}
> >> +
> >> +void __kprobes sync_core_all(void)
> >> +{
> >> +	on_each_cpu(__local_sync_core, NULL, 1);
> > 
> > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > executes the remote "__local_sync_core" with the appropriate memory
> > barriers underneath, am I correct ?
> 
> Right, at least I expected that.
> 
> >> +}
> >> +
> >> +/* Safely cross-code modifying with fixup address */
> >> +static void *patch_fixup_from;
> >> +static void *patch_fixup_addr;
> >> +
> >> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> >> +					      unsigned long val, void *data)
> >> +{
> >> +	struct die_args *args = data;
> >> +	struct pt_regs *regs = args->regs;
> >> +
> >> +	if (likely(!patch_fixup_from))
> >> +		return NOTIFY_DONE;
> >> +
> >> +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> >> +	    (unsigned long)patch_fixup_from != regs->ip)
> >> +		return NOTIFY_DONE;
> >> +
> >> +	args->regs->ip = (unsigned long)patch_fixup_addr;
> >> +	return NOTIFY_STOP;
> >> +}
> >> +
> >> +/**
> >> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> >> + * @addr:	Modifying address.
> >> + * @opcode:	New instruction.
> >> + * @len:	length of modifying bytes.
> >> + * @fixup:	Fixup address.
> >> + *
> >> + * Note: You must backup replaced instructions before calling this,
> >> + * if you need to recover it.
> >> + * Note: Must be called under text_mutex.
> >> + */
> >> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> >> +				void *fixup)
> >> +{
> >> +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> >> +	static const int int3_size = sizeof(int3_insn);
> >> +
> >> +	/* Replacing 1 byte can be done atomically. */
> > 
> > I'm sure it can be done atomically, but can it be done safely though ?
> > 
> > (disclaimer: we're still waiting for the official answer on this
> > statement): Assuming instruction trace cache effects of SMP cross-code
> > modification, and that only int3 would be safe to patch, then even
> > changing 1 single byte could only be done by going to an intermediate
> > int3 and synchronizing all other cores.
> 
> Ah, I see. Certainly, that's possible (int3 might be a special token of
> cache coherency). So even len==1, we might better use int3 capping.
> 
> > 
> >> +	if (unlikely(len <= 1))
> >> +		return text_poke(addr, opcode, len);
> >> +
> >> +	/* Preparing */
> >> +	patch_fixup_addr = fixup;
> >> +	wmb();
> > 
> > hrm, missing comment ?
> 
> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.

Then an smp_wmb() is probably enough there, and the matching smp_rmb() is
missing in the int3 handler.

But why do you care about the order of these two? I agree that an
unrelated int3 handler (from kprobes?) could be running concurrently at
that point, but it clearly cannot be called for this specific address
until the int3 is written by text_poke.

What I am pretty much certain is missing is an smp_wmb()...

> 
> > 
> >> +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */

..right here, between where you write the data used by the int3
handler and where you write the actual breakpoint. On the read side,
this might be a problem on architectures like alpha, which need
smp_read_barrier_depends(), but not on Intel. However, in the spirit of
making this code solid, what I did in the immediate values code is:


                target_after_int3 = insn + BREAKPOINT_INS_LEN;
                /* register_die_notifier has memory barriers */
                register_die_notifier(&imv_notify);
                /* The breakpoint will single-step the bypass */
                text_poke((void *)insn,
                        ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);

And I unregister the die notifier at the end after having reached
quiescent state. At least we know that the die notifier chain read-side
has the proper memory barriers, which is not the case for the breakpoint
instruction itself.
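
So, on the write side, I would expect something like this (a minimal
sketch against your patch, untested):

        /* Publish the data the int3 notifier will read... */
        patch_fixup_addr = fixup;
        patch_fixup_from = (u8 *)addr + int3_size;  /* IP after the int3 */
        /*
         * ...and make both stores visible before any CPU can possibly
         * hit the breakpoint we are about to install.
         */
        smp_wmb();

        /* Only now install the int3 */
        text_poke(addr, &int3_insn, int3_size);
        sync_core_all();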


> >> +
> >> +	/* Cap by an int3 */
> >> +	text_poke(addr, &int3_insn, int3_size);
> >> +	sync_core_all();
> >> +
> >> +	/* Replace tail bytes */
> >> +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> >> +		  len - int3_size);
> >> +	sync_core_all();
> >> +
> >> +	/* Replace int3 with head byte */
> >> +	text_poke(addr, opcode, int3_size);
> >> +	sync_core_all();
> >> +
> >> +	/* Cleanup */
> >> +	patch_fixup_from = NULL;
> >> +	wmb();
> > 
> > missing comment here too.
> > 
> >> +	return addr;
> > 
> > Little quiz question:
> > 
> > When patch_fixup_from is set to NULL, what ensures that the int3
> > handlers have completed their execution ?
> > 
> > I think it's probably OK, because the int3 is an interrupt gate, which
> > therefore disables interrupts as soon as it runs, and executes the
> > notifier while irqs are off. When we run sync_core_all() after replacing
> > the int3 by the new 1st byte, we only return when all other cores have
> > executed an interrupt, which implies that all int3 handlers previously
> > running should have ended. Is it right ? It looks to me as if this 3rd
> > sync_core_all() is only needed because of that. Probably that adding a
> > comment would be good.
> 
> Thanks, it's a good point and that's more what I've thought.
> As you said, it is probably safe. Even if it's not safe,
> we can add some int3 fixup handler (with lowest priority)
> which set regs->ip-1 if there is no int3 anymore, for safety.

Well, just ensuring that we reach a "disabled IRQ code quiescent
state" should be enough. Another way would be to use
synchronize_sched(), but it might take longer. Actively poking the other
CPUs with IPIs seems quicker. So I would be tempted to leave your code
as is in this respect, but to add a comment.
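
Something like the following, for instance (only a sketch of the
suggested comment, on top of your existing code):

        /* Replace int3 with head byte */
        text_poke(addr, opcode, int3_size);
        /*
         * int3 is an interrupt gate, so the fixup notifier runs with
         * irqs off.  This last round of IPIs only returns once every
         * other CPU has come back from any int3 handler that could
         * still be using patch_fixup_from/patch_fixup_addr, so it is
         * safe to clear them afterwards.
         */
        sync_core_all();

        /* Cleanup */
        patch_fixup_from = NULL;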

> 
> > Another thing: I've recently noticed that the following locking seems to
> > hang the system with doing stress-testing concurrently with cpu
> > hotplug/hotunplug:
> > 
> > mutex_lock(&text_mutex);
> >   on_each_cpu(something, NULL, 1);
> > 
> > The hang seems to be caused by the fact that alternative.c has:
> > 
> > within cpu hotplug (cpu hotplug lock held)
> >   mutex_lock(&text_mutex);
> > 
> > It might also be caused by the interaction with the stop_machine()
> > performed within the cpu hotplug lock. I did not find the root cause of
> > the problem, but this probably calls for lockdep improvements.
> 
> Hmm, would you mean it will happen even if we use stop_machine()
> under text_mutex locking?
> It seems that bigger problem of cpu-hotplug and on_each_cpu() etc.

Yes, but, again.. this calls for more testing. Hopefully it's not
something else in my own code I haven't seen. For now I can just say
that I've been noticing hangs involving cpu hotplug and text mutex, and
taking the cpu hotplug mutex around text mutex (in my immediate values
code) fixed the problem.

Thanks,

Mathieu

> 
> > In any case, given you are dealing with the exact same locking scenario
> > here, I would recomment the following stress-test:
> > 
> > in a loop, use text_poke_jump()
> > in a loop, hotunplug all cpus, then hotplug all cpus
> > 
> > I had to fix this temporarily by taking get/put_online_cpus() about the
> > text_mutex.
> > 
> > [snip snip]
> > 
> >> +static struct notifier_block patch_exceptions_nb = {
> >> +	.notifier_call = patch_exceptions_notify,
> >> +	.priority = 0x7fffffff /* we need to be notified first */
> >> +};
> >> +
> >> +static int __init patch_init(void)
> >> +{
> >> +	return register_die_notifier(&patch_exceptions_nb);
> >> +}
> >> +
> >> +arch_initcall(patch_init);
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index e5342a3..43a30d8 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> >>  
> >>  static struct notifier_block kprobe_exceptions_nb = {
> >>  	.notifier_call = kprobe_exceptions_notify,
> >> -	.priority = 0x7fffffff /* we need to be notified first */
> >> +	.priority = 0x7ffffff0 /* High priority, but not first.  */
> > 
> > It would be good to keep all these priorities in a centralized header.
> 
> Sure, I'll put it in kprobes.h.
> 
> Thank you!
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] notifier atomic call chain notrace
  2009-11-18 23:38   ` [PATCH] notifier atomic call chain notrace Mathieu Desnoyers
  2009-11-19  0:02     ` Paul E. McKenney
  2009-11-19  3:59     ` Masami Hiramatsu
@ 2009-11-19 16:48     ` Jason Baron
  2 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-19 16:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, mingo, hpa, tglx, rostedt, andi, roland, rth,
	mhiramat, Paul E. McKenney

On Wed, Nov 18, 2009 at 06:38:15PM -0500, Mathieu Desnoyers wrote:
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.
> 

indeed. this patch solves the recursive lockups when lockdep is enabled
and we are doing the jump patching. thanks!

-Jason



* Re: [RFC PATCH 0/6] jump label v3
  2009-11-19  3:54 ` Roland McGrath
@ 2009-11-19 21:55   ` Jason Baron
  0 siblings, 0 replies; 35+ messages in thread
From: Jason Baron @ 2009-11-19 21:55 UTC (permalink / raw)
  To: Roland McGrath
  Cc: linux-kernel, mingo, mathieu.desnoyers, hpa, tglx, rostedt, andi,
	rth, mhiramat

On Wed, Nov 18, 2009 at 07:54:24PM -0800, Roland McGrath wrote:
> 2. optimal compiled hot path code
> 
>    You and Richard have been working on this in gcc and we know the state
>    of it now.  When we get the cold labels feature done, it will be ideal
>    for -O(2?).  But people mostly use -Os and there no block reordering
>    gets done now (I think perhaps this even means likely/unlikely don't
>    really change which path is the straight line, just the source order
>    of the blocks still determines it).  So we hope for more incremental
>    improvements here, and maybe even really optimal code for -O2 soon.
>    But at least for -Os it may not be better than "unconditional jump
>    around" as the "straight line" path in the foreseeable future.  As
>    noted, that alone is still a nice savings over the status quo for the
>    disabled case.  (You gave an "average cycles saved" for this vs a load
>    and test, but do you have any comparisons of how those two compare to
>    no tracepoint at all?)
> 

I've run that in the past, and for the nop + jump sequence it's between
2-4 cycles on average vs. no tracepoint.


> 3. bookkeeping magic to find all the jumps to enable for a given tracepoint
> 
>    Here you have a working first draft, but it looks pretty clunky.
>    That strcmp just makes me gag.  For a first version that's still
>    pretty simple, I think it should be trivial to use a pointer
>    comparison there.  For tracepoints, it can be the address of the
>    struct tracepoint.  For the general case, it can be the address of
>    the global that would be flag variable in case of no asm goto support.
> 
>    For more incremental improvements, we could cut down on running
>    through the entire table for every switch.  If there are many
>    different switches (as there are already for many different
>    tracepoints), then you really just want to run through the
>    insn-patch list for the particular switch when you toggle it.  
> 
>    It's possible to group this all statically at link time, but all
>    the linker magic hacking required to get that to go is probably
>    more trouble than it's worth.  
> 
>    A simple hack is to run through the big unsorted table at boot time
>    and turn it into a contiguous table for each switch.  Then
>    e.g. hang each table off the per-switch global variable by the same
>    name that in a no-asm-goto build would be the simple global flag.
> 

That probably makes the most sense: sort the jump table and then store an
offset/length pair with each switch. I was thinking of this as a follow-on
optimization (the tracepoint code is already O(N) per switch toggle, where
N = total number of tracepoint site locations, and not O(n), where
n = number of sites per tracepoint). Certainly, if this is a gating issue for
this patchset, I can fix it now.
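
Roughly the layout I have in mind (hypothetical structure names;
patch_site() is a stand-in for the actual nop<->jmp patching):

        /* one entry per patch site, emitted into the __jump_table section */
        struct jump_entry {
                unsigned long code;     /* address of the nop/jmp */
                unsigned long target;   /* the do_trace label */
                unsigned long key;      /* identifies the switch */
        };

        /* after the boot-time sort: one contiguous slice per switch */
        struct jump_label_key {
                struct jump_entry *entries;
                unsigned int nr_entries;
        };

        static void jump_label_update(struct jump_label_key *key, int enable)
        {
                unsigned int i;

                /* O(n) in the sites of *this* switch, not of all switches */
                for (i = 0; i < key->nr_entries; i++)
                        patch_site(&key->entries[i], enable);
        }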

> 
> Finally, for using this for general purposes unrelated to tracepoints,
> I envision something like:
> 
> 	DECLARE_MOSTLY_NOT(foobar);
> 
> 	foo(int x, int y)
> 	{
> 		if (x > y && mostly_not(foobar))
> 			do_foobar(x - y);
> 	}
> 
> 	... set_mostly_not(foobar, onoff);
> 
> where it's:
> 
> #define DECLARE_MOSTLY_NOT(name) ... __something_##name
> #define mostly_not(name) ({ int _doit = 0; __label__ _yes; \
> 			    JUMP_LABEL(name, _yes, __something_##name); \
> 			    if (0) _yes: __cold _doit = 1; \
> 			    unlikely (_doit); })
> 
> I don't think we've tried to figure out how well this compiles yet.
> But it shows the sort of thing that we can do to expose this feature
> in a way that's simple and unrestrictive for kernel code to use casually.
> 
> 

cool. the assembly output would be interesting here...

thanks,

-Jason


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19 16:03       ` Mathieu Desnoyers
@ 2009-11-20  1:00         ` Masami Hiramatsu
  2009-11-21 15:32           ` Mathieu Desnoyers
  0 siblings, 1 reply; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-20  1:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

Mathieu Desnoyers wrote:
[...]
>>>> +	if (unlikely(len<= 1))
>>>> +		return text_poke(addr, opcode, len);
>>>> +
>>>> +	/* Preparing */
>>>> +	patch_fixup_addr = fixup;
>>>> +	wmb();
>>>
>>> hrm, missing comment ?
>>
>> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
>> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.
>
>> Here a smp_wmb() is probably enough, and the matching smp_rmb() is
> missing in the int3 handler.

OK, thank you.

>> But why do you care about the order of these two ? I agree that an
> unrelated int3 handler (from kprobes ?) could be running concurrently at
> that point, but it clearly cannot be called for this specific address
> until the int3 is written by text_poke.

Ah, it's my fault. I fixed that a month ago, and forgot to push it...
Actually, we don't need to care about the order of those two. Instead,
we have to update the patch_fixup_* variables before embedding the int3.

>
> What I am pretty much certain is missing would be a smp_wmb()...

Agreed.

>
>>
>>>
>>>> +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>
> ..right here, between where you write to the data used by the int3
> handler and where you write the actual breakpoint. On the read-side,
> this might be a problem with architectures like alpha needing
> smp_read_barrier_depends(), but not for Intel. However, in a spirit to
> make this code solid, what I did in the immed. val. is:
>
>
>                  target_after_int3 = insn + BREAKPOINT_INS_LEN;
>                  /* register_die_notifier has memory barriers */
>                  register_die_notifier(&imv_notify);
>                  /* The breakpoint will single-step the bypass */
>                  text_poke((void *)insn,
>                          ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);

Hmm, it strongly depends on arch. Is smp_wmb() right after setting
patch_fixup_from enough on x86?

> And I unregister the die notifier at the end after having reached
> quiescent state. At least we know that the die notifier chain read-side
> has the proper memory barriers, which is not the case for the breakpoint
> instruction itself.
>
>
>>>> +
>>>> +	/* Cap by an int3 */
>>>> +	text_poke(addr,&int3_insn, int3_size);
>>>> +	sync_core_all();
>>>> +
>>>> +	/* Replace tail bytes */
>>>> +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>>>> +		  len - int3_size);
>>>> +	sync_core_all();
>>>> +
>>>> +	/* Replace int3 with head byte */
>>>> +	text_poke(addr, opcode, int3_size);
>>>> +	sync_core_all();
>>>> +
>>>> +	/* Cleanup */
>>>> +	patch_fixup_from = NULL;
>>>> +	wmb();
>>>
>>> missing comment here too.
>>>
>>>> +	return addr;
>>>
>>> Little quiz question:
>>>
>>> When patch_fixup_from is set to NULL, what ensures that the int3
>>> handlers have completed their execution ?
>>>
>>> I think it's probably OK, because the int3 is an interrupt gate, which
>>> therefore disables interrupts as soon as it runs, and executes the
>>> notifier while irqs are off. When we run sync_core_all() after replacing
>>> the int3 by the new 1st byte, we only return when all other cores have
>>> executed an interrupt, which implies that all int3 handlers previously
>>> running should have ended. Is it right ? It looks to me as if this 3rd
>>> sync_core_all() is only needed because of that. Probably that adding a
>>> comment would be good.
>>
>> Thanks, it's a good point and that's more what I've thought.
>> As you said, it is probably safe. Even if it's not safe,
>> we can add some int3 fixup handler (with lowest priority)
>> which set regs->ip-1 if there is no int3 anymore, for safety.
>
> Well, just ensuring that we reach a "disabled IRQ code quiescent
> state" should be enough. Another way would be to use
> synchronize_sched(), but it might take longer. Actively poking the other
> CPUs with IPIs seems quicker. So I would be tempted to leave your code
> as is in this respect, but to add a comment.

Agreed. synchronize_sched() waits too long for this purpose.
OK, I'll add a comment for that "waiting for disabled IRQ code quiescent state" :-)
Thanks for the good advice!

>>> Another thing: I've recently noticed that the following locking seems to
>>> hang the system with doing stress-testing concurrently with cpu
>>> hotplug/hotunplug:
>>>
>>> mutex_lock(&text_mutex);
>>>    on_each_cpu(something, NULL, 1);
>>>
>>> The hang seems to be caused by the fact that alternative.c has:
>>>
>>> within cpu hotplug (cpu hotplug lock held)
>>>    mutex_lock(&text_mutex);
>>>
>>> It might also be caused by the interaction with the stop_machine()
>>> performed within the cpu hotplug lock. I did not find the root cause of
>>> the problem, but this probably calls for lockdep improvements.
>>
>> Hmm, would you mean it will happen even if we use stop_machine()
>> under text_mutex locking?
>> It seems that bigger problem of cpu-hotplug and on_each_cpu() etc.
>
> Yes, but, again.. this calls for more testing. Hopefully it's not
> something else in my own code I haven't seen. For now I can just say
> that I've been noticing hangs involving cpu hotplug and text mutex, and
> taking the cpu hotplug mutex around text mutex (in my immediate values
> code) fixed the problem.

Hmm, I guess that we'd better merge those two mutexes since
text modification always requires disabling cpu-hotplug...

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-18 22:43 ` [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine Jason Baron
  2009-11-19  0:28   ` Mathieu Desnoyers
@ 2009-11-20 21:54   ` H. Peter Anvin
  2009-11-21  0:06     ` Masami Hiramatsu
  2009-11-21 16:12     ` Mathieu Desnoyers
  1 sibling, 2 replies; 35+ messages in thread
From: H. Peter Anvin @ 2009-11-20 21:54 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, mingo, mathieu.desnoyers, tglx, rostedt, andi,
	roland, rth, mhiramat

On 11/18/2009 02:43 PM, Jason Baron wrote:
> Add text_poke_fixup() which takes a fixup address to where a processor
> jumps if it hits the modifying address while code modifying.
> text_poke_fixup() does following steps for this purpose.
> 
>  1. Setup int3 handler for fixup.
>  2. Put a breakpoint (int3) on the first byte of modifying region,
>     and synchronize code on all CPUs.
>  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>  4. Modify the first byte of modifying region, and synchronize code
>     on all CPUs.
>  5. Clear int3 handler.
> 
> Thus, if some other processor execute modifying address when step2 to step4,
> it will be jumped to fixup code.
> 
> This still has many limitations for modifying multi-instructions at once.
> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> because;
>  - Replaced instruction is just one instruction, which is executed atomically.
>  - Replacing instruction is a jump, so we can set fixup address where the jump
>    goes to.
> 

I just had a thought about this... regardless of if this is safe or not
(which still remains to be determined)... I have a bit more of a
fundamental question about it:

This code ends up taking *two* global IPIs for each instruction
modification.  Each of those requires whole-system synchronization.  How
is this better than taking one IPI and having the other CPUs wait until
the modification is complete before returning?

	-hpa


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-20 21:54   ` H. Peter Anvin
@ 2009-11-21  0:06     ` Masami Hiramatsu
  2009-11-21  0:19       ` H. Peter Anvin
  2009-11-21 16:21       ` Mathieu Desnoyers
  2009-11-21 16:12     ` Mathieu Desnoyers
  1 sibling, 2 replies; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-21  0:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, andi, roland, rth

Hi Peter,

H. Peter Anvin wrote:
> On 11/18/2009 02:43 PM, Jason Baron wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>>   1. Setup int3 handler for fixup.
>>   2. Put a breakpoint (int3) on the first byte of modifying region,
>>      and synchronize code on all CPUs.
>>   3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>   4. Modify the first byte of modifying region, and synchronize code
>>      on all CPUs.
>>   5. Clear int3 handler.
>>
>> Thus, if some other processor execute modifying address when step2 to step4,
>> it will be jumped to fixup code.
>>
>> This still has many limitations for modifying multi-instructions at once.
>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>> because;
>>   - Replaced instruction is just one instruction, which is executed atomically.
>>   - Replacing instruction is a jump, so we can set fixup address where the jump
>>     goes to.
>>
>
> I just had a thought about this... regardless of if this is safe or not
> (which still remains to be determined)... I have a bit more of a
> fundamental question about it:
>
> This code ends up taking *two* global IPIs for each instruction
> modification.  Each of those requires whole-system synchronization.

As Mathieu and I talked, first IPI is for synchronizing code, and
second is for waiting for all int3 handling is done.

>  How
> is this better than taking one IPI and having the other CPUs wait until
> the modification is complete before returning?

Would you mean using stop_machine()? :-)

If we don't care about NMI, we can use stop_machine() (for
this reason, kprobe-jump-optimization can use stop_machine(),
because kprobes can't probe NMI code), but tracepoint has
to support NMI.

Actually, it might be possible, even it will be complicated.
If one-byte modifying(int3 injection/removing) is always
synchronized, I assume below timechart can work
(and it can support NMI/SMI too).

----
        <CPU0>                  <CPU1>
flag = 0
setup int3 handler
int3 injection[sync]
other-bytes modifying
smp_call_function(func)    func()
wait_until(flag==1)        irq_disable()
                            sync_core() for other-bytes modifying
                            flag = 1
first-byte modifying[sync] wait_until(flag==2)
flag = 2
wait_until(flag==3)        irq_enable()
                            flag = 3
cleanup int3 handler       return
return
----
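
In C, the handshake above would look roughly like below (only a sketch:
setup_int3_fixup()/clear_int3_fixup() are made-up helpers, and with more
than one other CPU the flag would have to be a counter or a cpumask):

        static atomic_t flag;

        static void remote_sync(void *info)     /* func() on CPU1 */
        {
                local_irq_disable();
                sync_core();                    /* see the tail-byte changes */
                atomic_set(&flag, 1);
                while (atomic_read(&flag) != 2)
                        cpu_relax();
                local_irq_enable();
                atomic_set(&flag, 3);
        }

        static void patch_with_one_ipi(void *addr, const void *opcode,
                                       size_t len, void *fixup)
        {
                static const unsigned char int3 = BREAKPOINT_INSTRUCTION;

                atomic_set(&flag, 0);
                setup_int3_fixup(addr, fixup);          /* made-up helper */
                text_poke(addr, &int3, 1);              /* int3 injection[sync] */
                text_poke((char *)addr + 1,             /* other-bytes modifying */
                          (const char *)opcode + 1, len - 1);
                smp_call_function(remote_sync, NULL, 0);
                while (atomic_read(&flag) != 1)
                        cpu_relax();
                text_poke(addr, opcode, 1);             /* first-byte modifying[sync] */
                atomic_set(&flag, 2);
                while (atomic_read(&flag) != 3)
                        cpu_relax();
                clear_int3_fixup();                     /* made-up helper */
        }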

I'm not so sure that this flag-based step-by-step code can
work faster than 2 IPIs :-(

Any comments are welcome! :-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-21  0:06     ` Masami Hiramatsu
@ 2009-11-21  0:19       ` H. Peter Anvin
  2009-11-21 16:21       ` Mathieu Desnoyers
  1 sibling, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2009-11-21  0:19 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jason Baron, linux-kernel, mingo, mathieu.desnoyers, tglx,
	rostedt, andi, roland, rth

On 11/20/2009 04:06 PM, Masami Hiramatsu wrote:
> 
> If we don't care about NMI, we can use stop_machine() (for
> this reason, kprobe-jump-optimization can use stop_machine(),
> because kprobes can't probe NMI code), but tracepoint has
> to support NMI.
> 

Ingo pointed out that the NMI issue can be dealt with by having the NMI
handler check, on entry, a flag indicating whether code modification is in
progress.  So yes, an NMI can trap out of the IPI hold, but it will not go
any further.
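
I.e., very early in the NMI path, something like this (purely a sketch;
text_patch_in_progress and wait_for_text_patch() are placeholders, not
existing symbols):

        /* set by the patching code around the actual modification */
        static int text_patch_in_progress;

        /* at the top of the NMI handler */
        if (unlikely(text_patch_in_progress)) {
                /*
                 * Don't run into possibly half-patched code; wait for
                 * (or help finish) the pending modification first.
                 */
                wait_for_text_patch();          /* placeholder */
        }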

	-hpa


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-19  0:28   ` Mathieu Desnoyers
  2009-11-19  0:58     ` Paul E. McKenney
  2009-11-19 14:04     ` Masami Hiramatsu
@ 2009-11-21  1:11     ` Masami Hiramatsu
  2009-11-21 15:38       ` Mathieu Desnoyers
  2 siblings, 1 reply; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-21  1:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

Mathieu Desnoyers wrote:
> * Jason Baron (jbaron@redhat.com) wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>>   1. Setup int3 handler for fixup.
>>   2. Put a breakpoint (int3) on the first byte of modifying region,
>>      and synchronize code on all CPUs.
>>   3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>   4. Modify the first byte of modifying region, and synchronize code
>>      on all CPUs.
>>   5. Clear int3 handler.
>>
>
> Hi Masami,
>
> I like the approach and the API is clean. I have intersped comments
> below.
>
> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.

Hi Mathieu,

I guess that the hang might happen as below;

----
lock text_mutex
modify code
on_each_cpu(do_something)
                               cpu-hotplug (down)
                               lock cpu-hotplug mutex
                               online_cpus is changed
                               native_cpu_die()
                               ->alternatives_smp_switch(0)
                                 ->lock text_mutex -> sleep
(wait for offlined cpu...)
----

If this is correct, I think we can fix it as below.

----
lock cpu-hotplug mutex
lock text_mutex
modify code
on_each_cpu(do_something)
unlock text_mutex
unlock cpu-hotplug mutex
                               cpu-hotplug (down)
                               lock cpu-hotplug mutex
                               online_cpus is changed
                               native_cpu_die()
                               ->alternatives_smp_switch(0)
                                 ->lock text_mutex
                                   modify code
                                   unlock text_mutex
                               ...
                               unlock cpu-hotplug mutex
----
Or,
----
lock text_mutex
modify code
unlock text_mutex
on_each_cpu(do_something)
                               cpu-hotplug (down)
                               lock cpu-hotplug mutex
                               online_cpus is changed
                               native_cpu_die()
                               ->alternatives_smp_switch(0)
                                 ->lock text_mutex
                                   modify code
                                   unlock text_mutex
                               ...
                               unlock cpu-hotplug mutex
----
The latter needs another mutex for int3 handler and
frequently mutex_lock/unlock in this patch.

Hmm?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-20  1:00         ` Masami Hiramatsu
@ 2009-11-21 15:32           ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-21 15:32 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> [...]
>>>>> +	if (unlikely(len<= 1))
>>>>> +		return text_poke(addr, opcode, len);
>>>>> +
>>>>> +	/* Preparing */
>>>>> +	patch_fixup_addr = fixup;
>>>>> +	wmb();
>>>>
>>>> hrm, missing comment ?
>>>
>>> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
>>> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.
>>
>> Here a smp_wmb() is probably enough, and the matching smp_rmb() is
>> missing in the int3 handler.
>
> OK, thank you.
>
>> But why do you care about the order of these two ? I agree that an
>> unrelated int3 handler (from kprobes ?) could be running concurrently at
>> that point, but it clearly cannot be called for this specific address
>> until the int3 is written by text_poke.
>
> Ah, it's my fault. I fixed that a month ago, and forgot to push it...
> Actually, we don't need to care the order of those two. Instead,
> we have to update the patch_fixup_* before int3 embedding.
>
>>
>> What I am pretty much certain is missing would be a smp_wmb()...
>
> Agreed.
>
>>
>>>
>>>>
>>>>> +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>>
>> ..right here, between where you write to the data used by the int3
>> handler and where you write the actual breakpoint. On the read-side,
>> this might be a problem with architectures like alpha needing
>> smp_read_barrier_depends(), but not for Intel. However, in a spirit to
>> make this code solid, what I did in the immed. val. is:
>>
>>
>>                  target_after_int3 = insn + BREAKPOINT_INS_LEN;
>>                  /* register_die_notifier has memory barriers */
>>                  register_die_notifier(&imv_notify);
>>                  /* The breakpoint will single-step the bypass */
>>                  text_poke((void *)insn,
>>                          ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
>
> Hmm, it strongly depends on arch. Is smp_wmb() right after setting
> patch_fixup_from enough on x86?

What else do you have in mind ? wmb() ? Or adding a
smp_read_barrier_depends() at the beginning of the handler ?

Clearly, smp_read_barrier_depends() is a no-op on x86, but it might be
good to add it just for code clarity (it helps commenting which ordering
has to be done on the read-side).
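
Concretely, at the top of the notifier, I mean something like (sketch):

        if (likely(!patch_fixup_from))
                return NOTIFY_DONE;
        /*
         * Pairs with the smp_wmb() done before the int3 is written.
         * A no-op on x86, but it documents that patch_fixup_addr must
         * be visible once a non-NULL patch_fixup_from is observed.
         */
        smp_read_barrier_depends();

        if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
            (unsigned long)patch_fixup_from != regs->ip)
                return NOTIFY_DONE;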


>
>> And I unregister the die notifier at the end after having reached
>> quiescent state. At least we know that the die notifier chain read-side
>> has the proper memory barriers, which is not the case for the breakpoint
>> instruction itself.
>>
>>
>>>>> +
>>>>> +	/* Cap by an int3 */
>>>>> +	text_poke(addr,&int3_insn, int3_size);
>>>>> +	sync_core_all();
>>>>> +
>>>>> +	/* Replace tail bytes */
>>>>> +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>>>>> +		  len - int3_size);
>>>>> +	sync_core_all();
>>>>> +
>>>>> +	/* Replace int3 with head byte */
>>>>> +	text_poke(addr, opcode, int3_size);
>>>>> +	sync_core_all();
>>>>> +
>>>>> +	/* Cleanup */
>>>>> +	patch_fixup_from = NULL;
>>>>> +	wmb();
>>>>
>>>> missing comment here too.
>>>>
>>>>> +	return addr;
>>>>
>>>> Little quiz question:
>>>>
>>>> When patch_fixup_from is set to NULL, what ensures that the int3
>>>> handlers have completed their execution ?
>>>>
>>>> I think it's probably OK, because the int3 is an interrupt gate, which
>>>> therefore disables interrupts as soon as it runs, and executes the
>>>> notifier while irqs are off. When we run sync_core_all() after replacing
>>>> the int3 by the new 1st byte, we only return when all other cores have
>>>> executed an interrupt, which implies that all int3 handlers previously
>>>> running should have ended. Is it right ? It looks to me as if this 3rd
>>>> sync_core_all() is only needed because of that. Probably that adding a
>>>> comment would be good.
>>>
>>> Thanks, it's a good point and that's more what I've thought.
>>> As you said, it is probably safe. Even if it's not safe,
>>> we can add some int3 fixup handler (with lowest priority)
>>> which set regs->ip-1 if there is no int3 anymore, for safety.
>>
>> Well, just ensuring that we reach a "disabled IRQ code quiescent
>> state" should be enough. Another way would be to use
>> synchronize_sched(), but it might take longer. Actively poking the other
>> CPUs with IPIs seems quicker. So I would be tempted to leave your code
>> as is in this respect, but to add a comment.
>
> Agreed. synchronize_sched() waits too long for this purpose.
> OK, I'll add a comment for that "waiting for disabled IRQ code
> quiescent state" :-) Thanks for the good advice!
>
>>>> Another thing: I've recently noticed that the following locking seems to
>>>> hang the system with doing stress-testing concurrently with cpu
>>>> hotplug/hotunplug:
>>>>
>>>> mutex_lock(&text_mutex);
>>>>    on_each_cpu(something, NULL, 1);
>>>>
>>>> The hang seems to be caused by the fact that alternative.c has:
>>>>
>>>> within cpu hotplug (cpu hotplug lock held)
>>>>    mutex_lock(&text_mutex);
>>>>
>>>> It might also be caused by the interaction with the stop_machine()
>>>> performed within the cpu hotplug lock. I did not find the root cause of
>>>> the problem, but this probably calls for lockdep improvements.
>>>
>>> Hmm, would you mean it will happen even if we use stop_machine()
>>> under text_mutex locking?
>>> It seems that bigger problem of cpu-hotplug and on_each_cpu() etc.
>>
>> Yes, but, again.. this calls for more testing. Hopefully it's not
>> something else in my own code I haven't seen. For now I can just say
>> that I've been noticing hangs involving cpu hotplug and text mutex, and
>> taking the cpu hotplug mutex around text mutex (in my immediate values
>> code) fixed the problem.
>
> Hmm, I guess that we'd better merge those two mutexes since
> text modification always requires disabling cpu-hotplug...

Maybe.. although it's not clear to me that CPU hotplug is required to be
disabled around on_each_cpu calls.

Mathieu

>
> Thank you,
>
>
> -- 
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: mhiramat@redhat.com
>

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-21  1:11     ` Masami Hiramatsu
@ 2009-11-21 15:38       ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-21 15:38 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jason Baron, mingo, Paul E. McKenney, linux-kernel, hpa, tglx,
	rostedt, andi, roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
>> * Jason Baron (jbaron@redhat.com) wrote:
>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>> jumps if it hits the modifying address while code modifying.
>>> text_poke_fixup() does following steps for this purpose.
>>>
>>>   1. Setup int3 handler for fixup.
>>>   2. Put a breakpoint (int3) on the first byte of modifying region,
>>>      and synchronize code on all CPUs.
>>>   3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>>   4. Modify the first byte of modifying region, and synchronize code
>>>      on all CPUs.
>>>   5. Clear int3 handler.
>>>
>>
>> Hi Masami,
>>
>> I like the approach and the API is clean. I have intersped comments
>> below.
>>
>> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
>> experienced recently in the message below. Might be worth having a look,
>> I suspect this might have caused the hangs Paul McKenney had with his
>> past TREE RCU callback migration. I think he did take a mutex in the cpu
>> hotplug callbacks and might have used IPIs within that same lock.
>
> Hi Mathieu,
>
> I guess that the hang might happen as below;
>
> ----
> lock text_mutex
> modify code
> on_each_cpu(do_something)
>                               cpu-hotplug (down)
>                               lock cpu-hotplug mutex
>                               online_cpus is changed
>                               native_cpu_die()
>                               ->alternatives_smp_switch(0)
>                                 ->lock text_mutex -> sleep
> (wait for offlined cpu...)
> ----
>
> If this is correct, I think we can fix it as below.
>
> ----
> lock cpu-hotplug mutex

Yes, this is the solution I used in my own immediate values code too.
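
I.e., simply (sketch):

        get_online_cpus();              /* the cpu-hotplug mutex */
        mutex_lock(&text_mutex);

        /* modify code, on_each_cpu(...), ... */

        mutex_unlock(&text_mutex);
        put_online_cpus();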

> lock text_mutex
> modify code
> on_each_cpu(do_something)
> unlock text_mutex
> unlock cpu-hotplug mutex
>                               cpu-hotplug (down)
>                               lock cpu-hotplug mutex
>                               online_cpus is changed
>                               native_cpu_die()
>                               ->alternatives_smp_switch(0)
>                                 ->lock text_mutex
>                                   modify code
>                                   unlock text_mutex
>                               ...
>                               unlock cpu-hotplug mutex
> ----
> Or,
> ----
> lock text_mutex
> modify code
> unlock text_mutex
> on_each_cpu(do_something)
>                               cpu-hotplug (down)
>                               lock cpu-hotplug mutex
>                               online_cpus is changed
>                               native_cpu_die()
>                               ->alternatives_smp_switch(0)
>                                 ->lock text_mutex
>                                   modify code
>                                   unlock text_mutex
>                               ...
>                               unlock cpu-hotplug mutex
> ----
> The latter needs another mutex for int3 handler and
> frequently mutex_lock/unlock in this patch.
>
> Hmm?

The simplest solution seems to be the best one IMHO. But have you been
able to reproduce the lockup ?

Mathieu

>
> Thank you,
>
> -- 
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: mhiramat@redhat.com
>

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-20 21:54   ` H. Peter Anvin
  2009-11-21  0:06     ` Masami Hiramatsu
@ 2009-11-21 16:12     ` Mathieu Desnoyers
  1 sibling, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-21 16:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jason Baron, linux-kernel, mingo, tglx, rostedt, andi, roland,
	rth, mhiramat

* H. Peter Anvin (hpa@zytor.com) wrote:
> On 11/18/2009 02:43 PM, Jason Baron wrote:
> > Add text_poke_fixup() which takes a fixup address to where a processor
> > jumps if it hits the modifying address while code modifying.
> > text_poke_fixup() does following steps for this purpose.
> > 
> >  1. Setup int3 handler for fixup.
> >  2. Put a breakpoint (int3) on the first byte of modifying region,
> >     and synchronize code on all CPUs.
> >  3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> >  4. Modify the first byte of modifying region, and synchronize code
> >     on all CPUs.
> >  5. Clear int3 handler.
> > 
> > Thus, if some other processor execute modifying address when step2 to step4,
> > it will be jumped to fixup code.
> > 
> > This still has many limitations for modifying multi-instructions at once.
> > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > because;
> >  - Replaced instruction is just one instruction, which is executed atomically.
> >  - Replacing instruction is a jump, so we can set fixup address where the jump
> >    goes to.
> > 
> 
> I just had a thought about this... regardless of if this is safe or not
> (which still remains to be determined)... I have a bit more of a
> fundamental question about it:
> 
> This code ends up taking *two* global IPIs for each instruction
> modification.  Each of those requires whole-system synchronization.  How
> is this better than taking one IPI and having the other CPUs wait until
> the modification is complete before returning?

Hi Peter,

There are two main benefit for int3 vs stop_machine() I can think of:

1. Diminish the "whole system interrupt off" latency.
2. Support NMIs, MCEs and traps without adding complexity to the NMI,
   MCE nor trap handler.

Also, code update is not performance-critical in terms of execution
time; the more important aspect is the effect on the irq latency
characteristics of the kernel, as explained below.


1. Latency

* Let's say we have a worse-case interrupt latency of 15ms in our
  kernel on CPU A (e.g. induced by printk to a serial console) (yeah,
  that's bad, but these exist).

* Concurrently, CPU B performs code modification with stop_machine().
  stop_machine() spawns a worker thread on each CPU, waiting for all
  CPUs to respond to the IPI.

* CPU A is still in its worse-case interrupt latency. All other CPUs
  are running in stop_cpu() waiting for CPU A to join the game.

So, basically, we just transformed a single-cpu worse-case latency to an
even worse, longer, whole-system worse case latency.


2. NMIs, MCEs and traps

Currently, the approach Ingo proposes targets NMIs only. Other
non-maskables interrupt sources should also be treated. I think everyone
assumes that stop_cpu never generates any trap. Is it a documented
requirement, or is it just wishful thinking ? Because if it is not, it
could be possible for an architecture to choose a design that would let
stop_cpu generate a trap, and therefore we would have to fiddle with the
trap handlers too. Should we add the fixup test to trap handlers too ?

NMIs can be dealt with using the scheme where the NMI does the
modification itself if it nests over the code modification code, as
proposed by Ingo. However, I'm concerned about adding any bit of
complexity to a code path like the nmi handler. Basically, it has always
been a mechanism meant to do a task with the minimal delay and with
minimal interaction with the system. Adding fixup code to these goes
against this idea. If someone really needs to hook their device on an NMI
handler, I am certain that they will not see these extra checks and this
extra NMI-response latency as welcome.

MCEs are yet another code path that could be useful to use for system
diagnostic, yet cannot be disabled, just like NMIs. Should we start
auditing all handlers which are not dealt with and add the fixup test
there too ?


So, in comparison (+ is for int3, - for stop machine):

+ Lower system-wide interrupt latency.
+ Lower NMI response-time.
+ Lower complexity in the NMI handler.
+ Handles all nested execution in one spot, without adding complexity
  elsewhere. Therefore deals with MCEs and traps.
+ Easier to test that _all_ nested executions are appropriately dealt
  with, because the solution is not case-specific.

- Two IPIs instead of one. I'm not even sure it's slower than the worker
  thread creation currently done in stop_machine.
- Execution of a breakpoint in the case a remote CPU (or local
  NMI, MCE, trap) hits the breakpoint.

Mathieu


> 
> 	-hpa

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-21  0:06     ` Masami Hiramatsu
  2009-11-21  0:19       ` H. Peter Anvin
@ 2009-11-21 16:21       ` Mathieu Desnoyers
  2009-11-21 21:55         ` Masami Hiramatsu
  1 sibling, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-21 16:21 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: H. Peter Anvin, Jason Baron, linux-kernel, mingo, tglx, rostedt,
	andi, roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Hi Peter,
>
> H. Peter Anvin wrote:
>> On 11/18/2009 02:43 PM, Jason Baron wrote:
>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>> jumps if it hits the modifying address while code modifying.
>>> text_poke_fixup() does following steps for this purpose.
>>>
>>>   1. Setup int3 handler for fixup.
>>>   2. Put a breakpoint (int3) on the first byte of modifying region,
>>>      and synchronize code on all CPUs.
>>>   3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>>   4. Modify the first byte of modifying region, and synchronize code
>>>      on all CPUs.
>>>   5. Clear int3 handler.
>>>
>>> Thus, if some other processor execute modifying address when step2 to step4,
>>> it will be jumped to fixup code.
>>>
>>> This still has many limitations for modifying multi-instructions at once.
>>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>>> because;
>>>   - Replaced instruction is just one instruction, which is executed atomically.
>>>   - Replacing instruction is a jump, so we can set fixup address where the jump
>>>     goes to.
>>>
>>
>> I just had a thought about this... regardless of if this is safe or not
>> (which still remains to be determined)... I have a bit more of a
>> fundamental question about it:
>>
>> This code ends up taking *two* global IPIs for each instruction
>> modification.  Each of those requires whole-system synchronization.
>
> As Mathieu and I talked, first IPI is for synchronizing code, and
> second is for waiting for all int3 handling is done.
>
>>  How
>> is this better than taking one IPI and having the other CPUs wait until
>> the modification is complete before returning?
>
> Would you mean using stop_machine()? :-)
>
> If we don't care about NMI, we can use stop_machine() (for
> this reason, kprobe-jump-optimization can use stop_machine(),
> because kprobes can't probe NMI code), but tracepoint has
> to support NMI.
>
> Actually, it might be possible, even it will be complicated.
> If one-byte modifying(int3 injection/removing) is always
> synchronized, I assume below timechart can work
> (and it can support NMI/SMI too).
>
> ----
>        <CPU0>                  <CPU1>
> flag = 0
> setup int3 handler
> int3 injection[sync]
> other-bytes modifying
> smp_call_function(func)    func()
> wait_until(flag==1)        irq_disable()
>                            sync_core() for other-bytes modifying
>                            flag = 1
> first-byte modifying[sync] wait_until(flag==2)

Hrm, I don't like this too much. In terms of latency, we can get:

CPU 0:                       CPU 1
                             interrupts off
                             * wait_util(flag == 2)
interrupted
softirq runs...                            
(we have a drink, network bh
processing, etc etc)
back to standard execution
flag = 2

So, as you see, we increase the interrupt latency on all other CPUs by
the duration of a softirq. This is, I think, an unwanted side-effect.

We should really do performance benchmarks comparing stop_machine() and
the int3-based approach rather than to try to come up with tricky
schemes. It's not a real problem until we prove there is indeed a
performance regression. I suspect that the combined effect of cache-line
bouncing, worker thread overhead and the IPI of stop_machine is probably
comparable to the two IPIs we propose for int3.
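
E.g., just to get the raw cost of a single site modification (a quick
sketch; measuring the irq latency induced on the other CPUs would need a
separate test):

        struct patch_req {
                void *addr;
                const void *opcode;
                size_t len;
                void *fixup;
        };

        static int do_patch_stopped(void *data) /* runs with machine stopped */
        {
                struct patch_req *r = data;

                text_poke(r->addr, r->opcode, r->len);
                return 0;
        }

        static void bench_one(struct patch_req *r)
        {
                ktime_t t0 = ktime_get();

                text_poke_fixup(r->addr, r->opcode, r->len, r->fixup);
                pr_info("int3 approach: %lld ns\n",
                        ktime_to_ns(ktime_sub(ktime_get(), t0)));

                t0 = ktime_get();
                stop_machine(do_patch_stopped, r, NULL);
                pr_info("stop_machine:  %lld ns\n",
                        ktime_to_ns(ktime_sub(ktime_get(), t0)));
        }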

Thanks,

Mathieu


> flag = 2
> wait_until(flag==3)        irq_enable()
>                            flag = 3
> cleanup int3 handler       return
> return
> ----
>
> I'm not so sure that this flag-based step-by-step code can
> work faster than 2 IPIs :-(
>
> Any comments are welcome! :-)
>
> Thank you,
>
> -- 
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: mhiramat@redhat.com
>

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-21 16:21       ` Mathieu Desnoyers
@ 2009-11-21 21:55         ` Masami Hiramatsu
  2009-11-22  1:46           ` Mathieu Desnoyers
  0 siblings, 1 reply; 35+ messages in thread
From: Masami Hiramatsu @ 2009-11-21 21:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Jason Baron, linux-kernel, mingo, tglx, rostedt,
	andi, roland, rth

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> > Hi Peter,
>> >
>> > H. Peter Anvin wrote:
>>> >> On 11/18/2009 02:43 PM, Jason Baron wrote:
>>>> >>> Add text_poke_fixup() which takes a fixup address to where a processor
>>>> >>> jumps if it hits the modifying address while code modifying.
>>>> >>> text_poke_fixup() does following steps for this purpose.
>>>> >>>
>>>> >>>   1. Setup int3 handler for fixup.
>>>> >>>   2. Put a breakpoint (int3) on the first byte of modifying region,
>>>> >>>      and synchronize code on all CPUs.
>>>> >>>   3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>>> >>>   4. Modify the first byte of modifying region, and synchronize code
>>>> >>>      on all CPUs.
>>>> >>>   5. Clear int3 handler.
>>>> >>>
>>>> >>> Thus, if some other processor execute modifying address when step2 to step4,
>>>> >>> it will be jumped to fixup code.
>>>> >>>
>>>> >>> This still has many limitations for modifying multi-instructions at once.
>>>> >>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>>>> >>> because;
>>>> >>>   - Replaced instruction is just one instruction, which is executed atomically.
>>>> >>>   - Replacing instruction is a jump, so we can set fixup address where the jump
>>>> >>>     goes to.
>>>> >>>
>>> >>
>>> >> I just had a thought about this... regardless of if this is safe or not
>>> >> (which still remains to be determined)... I have a bit more of a
>>> >> fundamental question about it:
>>> >>
>>> >> This code ends up taking *two* global IPIs for each instruction
>>> >> modification.  Each of those requires whole-system synchronization.
>> >
>> > As Mathieu and I talked, first IPI is for synchronizing code, and
>> > second is for waiting for all int3 handling is done.
>> >
>>> >>  How
>>> >> is this better than taking one IPI and having the other CPUs wait until
>>> >> the modification is complete before returning?
>> >
>> > Would you mean using stop_machine()? :-)
>> >
>> > If we don't care about NMI, we can use stop_machine() (for
>> > this reason, kprobe-jump-optimization can use stop_machine(),
>> > because kprobes can't probe NMI code), but tracepoint has
>> > to support NMI.
>> >
>> > Actually, it might be possible, even it will be complicated.
>> > If one-byte modifying(int3 injection/removing) is always
>> > synchronized, I assume below timechart can work
>> > (and it can support NMI/SMI too).
>> >
>> > ----
>> >        <CPU0>                  <CPU1>
>> > flag = 0
>> > setup int3 handler
>> > int3 injection[sync]
>> > other-bytes modifying
>> > smp_call_function(func)    func()
>> > wait_until(flag==1)        irq_disable()
>> >                            sync_core() for other-bytes modifying
>> >                            flag = 1
>> > first-byte modifying[sync] wait_until(flag==2)
> Hrm, I don't like this too much. In terms of latency, we can get:
> 
> CPU 0:                       CPU 1
>                              interrupts off
>                              * wait_util(flag == 2)
> interrupted
> softirq runs...                            
> (we have a drink, network bh
> processing, etc etc)
> back to standard execution
> flag = 2
> 
> So, as you see, we increase the interrupt latency on all other CPUs by
> the duration of a softirq. This is, I think, an unwanted side-effect.
> 
> We should really do performance benchmarks comparing stop_machine() and
> the int3-based approach rather than to try to come up with tricky
> schemes. It's not a real problem until we prove there is indeed a
> performance regression. I suspect that the combined effect of cache-line
> bouncing, worker thread overhead and the IPI of stop_machine is probably
> comparable to the two IPIs we propose for int3.

I assume that total latency of XMC is almost same on normal-size SMP.
However,
- stop_machine() can't support NMI/SMI.
- stop_machine() stops all other processors while XMC.

Anyway, the int3-based approach still needs its safety to be confirmed
by the processor architects. So, until then, the stop_machine() approach
is also useful for some cases.

Thank you,
-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
  2009-11-21 21:55         ` Masami Hiramatsu
@ 2009-11-22  1:46           ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2009-11-22  1:46 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: H. Peter Anvin, Jason Baron, linux-kernel, mingo, tglx, rostedt,
	andi, roland, rth

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > We should really do performance benchmarks comparing stop_machine() and
> > the int3-based approach rather than to try to come up with tricky
> > schemes. It's not a real problem until we prove there is indeed a
> > performance regression. I suspect that the combined effect of cache-line
> > bouncing, worker thread overhead and the IPI of stop_machine is probably
> > comparable to the two IPIs we propose for int3.
> 
> I assume that total latency of XMC is almost same on normal-size SMP.
> However,
> - stop_machine() can't support NMI/SMI.
> - stop_machine() stops all other processors while XMC.

I would also add that stop_machine() increases the system interrupt
latency by an amount O(num_online_cpus()), which I'd like to avoid given
the 90- to 128-core machines heading our way pretty quickly.

> 
> Anyway, the int3-based approach still needs its safety to be confirmed
> by the processor architects. So, until then, the stop_machine() approach
> is also useful for some cases.

True. This makes me think: If performance happens to be a problem, we
could do batched jump instruction modification. Using a hash table to
contain the pointers would allow us to perform only a single pair of IPIs
for a whole bunch of instruction modifications.
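
E.g. (a sketch; the int3 notifier would look the trapping address up in
that hash table to find the per-site fixup, which is not shown here):

        struct text_patch {
                void *addr;
                const void *opcode;
                size_t len;
                void *fixup;            /* registered in the hash table */
        };

        void text_poke_fixup_batch(struct text_patch *p, int count)
        {
                static const unsigned char int3 = BREAKPOINT_INSTRUCTION;
                int i;

                /* int3 on the first byte of every site */
                for (i = 0; i < count; i++)
                        text_poke(p[i].addr, &int3, 1);
                sync_core_all();

                /* tail bytes of every site */
                for (i = 0; i < count; i++)
                        text_poke((char *)p[i].addr + 1,
                                  (const char *)p[i].opcode + 1, p[i].len - 1);
                sync_core_all();

                /* head byte of every site, then one last sync */
                for (i = 0; i < count; i++)
                        text_poke(p[i].addr, p[i].opcode, 1);
                sync_core_all();
        }

Same rounds of sync_core_all() as for a single site, but amortized over
the whole batch.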

Mathieu

> 
> Thank you,
 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


end of thread, other threads:[~2009-11-22  1:46 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-18 22:43 [RFC PATCH 0/6] jump label v3 Jason Baron
2009-11-18 22:43 ` [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Jason Baron
2009-11-18 22:43 ` [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine Jason Baron
2009-11-19  0:28   ` Mathieu Desnoyers
2009-11-19  0:58     ` Paul E. McKenney
2009-11-19  1:22       ` Steven Rostedt
2009-11-19  1:39         ` Paul E. McKenney
2009-11-19  1:57       ` Mathieu Desnoyers
2009-11-19  4:16         ` Paul E. McKenney
2009-11-19 14:04     ` Masami Hiramatsu
2009-11-19 16:03       ` Mathieu Desnoyers
2009-11-20  1:00         ` Masami Hiramatsu
2009-11-21 15:32           ` Mathieu Desnoyers
2009-11-21  1:11     ` Masami Hiramatsu
2009-11-21 15:38       ` Mathieu Desnoyers
2009-11-20 21:54   ` H. Peter Anvin
2009-11-21  0:06     ` Masami Hiramatsu
2009-11-21  0:19       ` H. Peter Anvin
2009-11-21 16:21       ` Mathieu Desnoyers
2009-11-21 21:55         ` Masami Hiramatsu
2009-11-22  1:46           ` Mathieu Desnoyers
2009-11-21 16:12     ` Mathieu Desnoyers
2009-11-18 22:43 ` [RFC PATCH 3/6] jump label v3 - move opcode defs Jason Baron
2009-11-18 22:43 ` [RFC PATCH 4/6] jump label v3 - base patch Jason Baron
2009-11-18 23:38   ` [PATCH] notifier atomic call chain notrace Mathieu Desnoyers
2009-11-19  0:02     ` Paul E. McKenney
2009-11-19  3:59     ` Masami Hiramatsu
2009-11-19 16:48     ` Jason Baron
2009-11-18 22:43 ` [RFC PATCH 5/6] jump label v3 - add module support Jason Baron
2009-11-18 22:43 ` [RFC PATCH 6/6] jump label v3 - tracepoint support Jason Baron
2009-11-18 22:51 ` [RFC PATCH 0/6] jump label v3 H. Peter Anvin
2009-11-18 23:07   ` Roland McGrath
2009-11-18 23:18     ` H. Peter Anvin
2009-11-19  3:54 ` Roland McGrath
2009-11-19 21:55   ` Jason Baron
