* [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support
@ 2009-11-23 23:21 Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 01/10] kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Masami Hiramatsu
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: H. Peter Anvin, Frederic Weisbecker, Ananth N Mavinakayanahalli,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Anders Kaseorg, Tim Abbott, Andi Kleen,
	Jason Baron, Mathieu Desnoyers, systemtap, DLE

Hi,

Here is the kprobes jump optimization patchset, version 5
(a.k.a. Djprobe). Since it has not yet been verified that
int3-bypassing cross-modifying code is safe on all processors, I
introduced a stop_machine() version of XMC. Using stop_machine()
prevents us from probing NMI code paths, but kprobes itself cannot
probe those paths anyway, so this is not a problem. This version also
wraps the optimization in get/put_online_cpus() to avoid a deadlock
on text_mutex.

These patches can be applied on the latest -tip.

Changes in v5:
- Use stop_machine() to replace a breakpoint with a jump.
- get/put_online_cpus() around optimization.
- Post the generic jump patching interface as an RFC.

The kprobe stress test found no regressions (tested under kvm/x86).

Jump Optimized Kprobes
======================
o Concept
 Kprobes uses the int3 breakpoint instruction on x86 to insert probes
into a running kernel. Jump optimization allows kprobes to replace the
breakpoint with a jump instruction, reducing probing overhead
drastically.
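
 The encoding involved can be sketched in userspace C. This is only an
illustration, not kernel code: it assumes the x86 "jmp rel32" form,
opcode 0xe9 followed by a 32-bit displacement counted from the end of
the 5-byte instruction (the same encoding set_jmp_op() in patch 01
writes).

```c
#include <stdint.h>
#include <string.h>

#define RELATIVEJUMP_OPCODE 0xe9
#define RELATIVEJUMP_SIZE   5

/* Encode "jmp rel32" placed at address 'from', targeting address 'to'.
 * The displacement is relative to the first byte AFTER the jump. */
static void encode_jmp(uint8_t *buf, long from, long to)
{
	int32_t rel = (int32_t)(to - (from + RELATIVEJUMP_SIZE));

	buf[0] = RELATIVEJUMP_OPCODE;
	memcpy(buf + 1, &rel, sizeof(rel));
}
```

Decoding the displacement and adding it to from + 5 must give back
'to'; that round trip is the whole contract of the encoding.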

o Performance
 An optimized kprobe is about 5 times faster than a regular kprobe.

 A kprobe hit usually takes 0.5 to 1.0 microseconds to process. A jump
optimized probe hit, on the other hand, takes less than 0.1
microseconds (the actual number depends on the processor). Here are
sample overheads:

Intel(R) Xeon(R) CPU E5410  @ 2.33GHz (without debugging options)

                     x86-32  x86-64
kprobe:              0.68us  0.91us
kprobe+booster:      0.27us  0.40us
kprobe+optimized:    0.06us  0.06us

kretprobe:           0.95us  1.21us
kretprobe+booster:   0.53us  0.71us
kretprobe+optimized: 0.30us  0.35us

(booster skips single-stepping)

 Note that jump optimization also consumes more memory, but not much:
each optimized probe uses only ~200 extra bytes, so even ~10,000
probes consume just a few MB.


o Usage
 Set CONFIG_OPTPROBES=y when building the kernel, and all *probes will
be optimized where possible.

 Kprobes decodes the probed function and checks whether the target
instructions can safely be optimized (replaced with a jump). If they
cannot, Kprobes simply does not optimize the probe.


o Optimization
  Before preparing the optimization, Kprobes inserts the original
 (user-defined) kprobe at the specified address. So even if the kprobe
 cannot be optimized, it still works as a normal kprobe.

 - Safety check
  First, Kprobes takes the address of the probed function and checks
 that the optimized region, which will be replaced by a jump
 instruction, does NOT straddle the function boundary: if the
 optimized region reached into the next function, callers of that
 function would see unexpected results.
  Next, Kprobes decodes the whole body of the probed function and
 checks that there is NO indirect jump, NO instruction that can cause
 an exception (found by checking exception_tables; such an instruction
 jumps to fixup code, and the fixup code jumps back into the same
 function body), and NO near jump whose target lies inside the
 optimized region (other than its first byte), because a jump landing
 in the middle of another instruction also causes unexpected results.
  Kprobes also measures the length of the instructions that will be
 replaced by the jump instruction; since a jump instruction is longer
 than 1 byte, it may replace multiple instructions, and Kprobes checks
 that each of them can be executed out-of-line.
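
 The two boundary conditions above can be written as small predicates.
This is a hedged userspace sketch, not the kernel's actual checker;
function bounds and jump targets are modeled as plain integers.

```c
#include <stddef.h>

/* The region [probe_addr, probe_addr + optimized_len) must stay
 * inside the probed function, i.e. must not cross func_end. */
static int region_straddles_function(unsigned long probe_addr,
				     unsigned long func_end,
				     size_t optimized_len)
{
	return probe_addr + optimized_len > func_end;
}

/* A near jump may target the first byte of the optimized region, but
 * never a byte in the middle of it. */
static int jump_target_unsafe(unsigned long target,
			      unsigned long probe_addr,
			      size_t optimized_len)
{
	return target > probe_addr && target < probe_addr + optimized_len;
}
```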

 - Preparing detour code
  Then Kprobes prepares a "detour" buffer, which contains
 exception-emulating code (push/pop registers, call handler), the
 copied instructions (Kprobes copies the instructions that will be
 replaced by the jump into this buffer), and a jump back to the
 original execution path.
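
 Under a simple byte-buffer model (addresses as plain integers, a
stand-in stub for the emulation code), the detour layout and its
jump-back displacement can be sketched as below; build_detour is an
illustrative name, not kernel API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JMP_OPCODE 0xe9
#define JMP_SIZE   5

/* Lay out: [emulation stub][copied instructions][jmp back].
 * The back-jump resumes at the first byte after the patched region,
 * i.e. probe_addr + copied_len. Returns the bytes used. */
static size_t build_detour(uint8_t *buf, unsigned long buf_addr,
			   const uint8_t *stub, size_t stub_len,
			   const uint8_t *copied, size_t copied_len,
			   unsigned long probe_addr)
{
	size_t off = 0;
	unsigned long jmp_from;
	int32_t rel;

	memcpy(buf + off, stub, stub_len);
	off += stub_len;
	memcpy(buf + off, copied, copied_len);
	off += copied_len;

	jmp_from = buf_addr + off;
	rel = (int32_t)((probe_addr + copied_len) - (jmp_from + JMP_SIZE));
	buf[off] = JMP_OPCODE;
	memcpy(buf + off + 1, &rel, sizeof(rel));
	return off + JMP_SIZE;
}
```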

 - Pre-optimization
  After preparing the detour code, Kprobes enqueues the kprobe on the
 optimizing list and kicks the kprobe-optimizer workqueue. The
 optimizer delays its work so that other probes queued in the meantime
 can be optimized in the same batch.
  If the kprobe is hit before it is actually optimized, its handler
 changes the IP (instruction pointer) to the copied code and exits, so
 the instructions that were copied to the detour buffer are executed
 there.

 - Optimization
  The kprobe-optimizer does not start replacing instructions
 immediately; for safety it first waits for synchronize_sched(),
 because some processor may have been interrupted on one of the
 instructions that will be replaced by the jump instruction.
 synchronize_sched() can ensure that all interrupts that were in
 progress when it was called have finished, but only if
 CONFIG_PREEMPT=n. So this version supports only kernels built with
 CONFIG_PREEMPT=n. (*)
  After that, the kprobe-optimizer replaces the 4 bytes right after
 the int3 breakpoint with the relative-jump destination and
 synchronizes the caches on all processors. Next, it replaces the int3
 with the relative-jump opcode and synchronizes the caches again.
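
 The ordering matters: the first byte of the patched region is never
left in a half-written state. A minimal simulation of that ordering on
a plain byte array (sync_cores() standing in for the cross-processor
cache synchronization; this is an illustration, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xcc
#define JMP_OPCODE  0xe9

static void sync_cores(void)
{
	/* placeholder for synchronizing caches on all processors */
}

/* The site already holds int3 (the normal kprobe). Write the 4
 * displacement bytes first, sync, then flip the int3 into the jmp
 * opcode, sync again. */
static void optimize_site(uint8_t site[5], int32_t rel)
{
	assert(site[0] == INT3_OPCODE);		/* probe already armed */
	memcpy(site + 1, &rel, sizeof(rel));	/* phase 1: tail bytes */
	sync_cores();
	site[0] = JMP_OPCODE;			/* phase 2: opcode byte */
	sync_cores();
}
```

At every instant the first byte is either int3 or jmp, so a processor
fetching the site sees a valid instruction in both phases.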

 - Unoptimization
  An optimized kprobe is unoptimized when it is unregistered or
 disabled, or when it is blocked by another kprobe. If the
 kprobe-optimizer has not run yet, the kprobe is simply dequeued from
 the optimizing list. If the optimization has already been done, the
 jump is replaced with the int3 breakpoint and the original code:
 first, an int3 is put at the first byte of the jump and the caches
 are synchronized on all processors; then the 4 bytes right after the
 int3 are replaced with the original code.
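
 The reverse sequence can be sketched the same way; note that it
deliberately stops at the int3-plus-original-tail state, which is
exactly a normal kprobe site again (byte-array model as before,
illustrative only).

```c
#include <stdint.h>
#include <string.h>

#define INT3_OPCODE 0xcc

static void sync_cores(void)
{
	/* placeholder for synchronizing caches on all processors */
}

/* Turn an optimized site (jmp + rel32) back into a normal kprobe site
 * (int3 + original tail bytes). orig holds the pre-probe instruction
 * bytes; orig[0] itself is only restored when the kprobe is removed
 * entirely. */
static void unoptimize_site(uint8_t site[5], const uint8_t orig[5])
{
	site[0] = INT3_OPCODE;			/* trap execution first */
	sync_cores();
	memcpy(site + 1, orig + 1, 4);		/* then restore the tail */
	sync_cores();
}
```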

(*) This optimization-safety check may later be replaced with a
 stop-machine method like the one ksplice uses, in order to support
 CONFIG_PREEMPT=y kernels.


Thank you,

---

Masami Hiramatsu (10):
      [RFC] kprobes/x86: Use text_poke_fixup() for jump optimization
      [RFC] x86: Introduce generic jump patching without stop_machine
      kprobes: Add documents of jump optimization
      kprobes/x86: Support kprobes jump optimization on x86
      kprobes/x86: Cleanup save/restore registers
      kprobes/x86: Boost probes when reentering
      kprobes: Jump optimization sysctl interface
      kprobes: Introduce kprobes jump optimization
      kprobes: Introduce generic insn_slot framework
      kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE


 Documentation/kprobes.txt          |  192 +++++++++++-
 arch/Kconfig                       |   13 +
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/alternative.h |   11 +
 arch/x86/include/asm/kprobes.h     |   31 ++
 arch/x86/kernel/alternative.c      |  102 ++++++
 arch/x86/kernel/kprobes.c          |  585 +++++++++++++++++++++++++++++-------
 include/linux/kprobes.h            |   44 +++
 kernel/kprobes.c                   |  587 +++++++++++++++++++++++++++++++-----
 kernel/sysctl.c                    |   13 +
 10 files changed, 1374 insertions(+), 205 deletions(-)

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* [PATCH -tip v5 01/10] kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
@ 2009-11-23 23:21 ` Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 02/10] kprobes: Introduce generic insn_slot framework Masami Hiramatsu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Change RELATIVEJUMP_INSTRUCTION macro to RELATIVEJUMP_OPCODE since it
represents just the opcode byte.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/include/asm/kprobes.h |    2 +-
 arch/x86/kernel/kprobes.c      |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 4fe681d..eaec8ea 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -32,7 +32,7 @@ struct kprobe;
 
 typedef u8 kprobe_opcode_t;
 #define BREAKPOINT_INSTRUCTION	0xcc
-#define RELATIVEJUMP_INSTRUCTION 0xe9
+#define RELATIVEJUMP_OPCODE 0xe9
 #define MAX_INSN_SIZE 16
 #define MAX_STACK_SIZE 64
 #define MIN_STACK_SIZE(ADDR)					       \
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 1f3186c..b6b75f1 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -115,7 +115,7 @@ static void __kprobes set_jmp_op(void *from, void *to)
 	} __attribute__((packed)) * jop;
 	jop = (struct __arch_jmp_op *)from;
 	jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
-	jop->op = RELATIVEJUMP_INSTRUCTION;
+	jop->op = RELATIVEJUMP_OPCODE;
 }
 
 /*


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* [PATCH -tip v5 02/10] kprobes: Introduce generic insn_slot framework
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 01/10] kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Masami Hiramatsu
@ 2009-11-23 23:21 ` Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization Masami Hiramatsu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Make the insn_slot framework support slots of various sizes.
The current insn_slot supports only one fixed-size instruction buffer
slot, but kprobes jump optimization needs larger buffers.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 kernel/kprobes.c |  102 +++++++++++++++++++++++++++++++++---------------------
 1 files changed, 63 insertions(+), 39 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e5342a3..10d2ed5 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -103,26 +103,42 @@ static struct kprobe_blackpoint kprobe_blacklist[] = {
  * stepping on the instruction on a vmalloced/kmalloced/data page
  * is a recipe for disaster
  */
-#define INSNS_PER_PAGE	(PAGE_SIZE/(MAX_INSN_SIZE * sizeof(kprobe_opcode_t)))
-
 struct kprobe_insn_page {
 	struct list_head list;
 	kprobe_opcode_t *insns;		/* Page of instruction slots */
-	char slot_used[INSNS_PER_PAGE];
 	int nused;
 	int ngarbage;
+	char slot_used[];
 };
 
+#define KPROBE_INSN_PAGE_SIZE(slots)			\
+	(offsetof(struct kprobe_insn_page, slot_used) +	\
+	 (sizeof(char) * (slots)))
+
+struct kprobe_insn_cache {
+	struct list_head pages;	/* list of kprobe_insn_page */
+	size_t insn_size;	/* size of instruction slot */
+	int nr_garbage;
+};
+
+static int slots_per_page(struct kprobe_insn_cache *c)
+{
+	return PAGE_SIZE/(c->insn_size * sizeof(kprobe_opcode_t));
+}
+
 enum kprobe_slot_state {
 	SLOT_CLEAN = 0,
 	SLOT_DIRTY = 1,
 	SLOT_USED = 2,
 };
 
-static DEFINE_MUTEX(kprobe_insn_mutex);	/* Protects kprobe_insn_pages */
-static LIST_HEAD(kprobe_insn_pages);
-static int kprobe_garbage_slots;
-static int collect_garbage_slots(void);
+static DEFINE_MUTEX(kprobe_insn_mutex);	/* Protects kprobe_insn_slots */
+static struct kprobe_insn_cache kprobe_insn_slots = {
+	.pages = LIST_HEAD_INIT(kprobe_insn_slots.pages),
+	.insn_size = MAX_INSN_SIZE,
+	.nr_garbage = 0,
+};
+static int __kprobes collect_garbage_slots(struct kprobe_insn_cache *c);
 
 static int __kprobes check_safety(void)
 {
@@ -152,32 +168,33 @@ loop_end:
  * __get_insn_slot() - Find a slot on an executable page for an instruction.
  * We allocate an executable page if there's no room on existing ones.
  */
-static kprobe_opcode_t __kprobes *__get_insn_slot(void)
+static kprobe_opcode_t __kprobes *__get_insn_slot(struct kprobe_insn_cache *c)
 {
 	struct kprobe_insn_page *kip;
 
  retry:
-	list_for_each_entry(kip, &kprobe_insn_pages, list) {
-		if (kip->nused < INSNS_PER_PAGE) {
+	list_for_each_entry(kip, &c->pages, list) {
+		if (kip->nused < slots_per_page(c)) {
 			int i;
-			for (i = 0; i < INSNS_PER_PAGE; i++) {
+			for (i = 0; i < slots_per_page(c); i++) {
 				if (kip->slot_used[i] == SLOT_CLEAN) {
 					kip->slot_used[i] = SLOT_USED;
 					kip->nused++;
-					return kip->insns + (i * MAX_INSN_SIZE);
+					return kip->insns + (i * c->insn_size);
 				}
 			}
-			/* Surprise!  No unused slots.  Fix kip->nused. */
-			kip->nused = INSNS_PER_PAGE;
+			/* kip->nused is broken. Fix it. */
+			kip->nused = slots_per_page(c);
+			WARN_ON(1);
 		}
 	}
 
 	/* If there are any garbage slots, collect it and try again. */
-	if (kprobe_garbage_slots && collect_garbage_slots() == 0) {
+	if (c->nr_garbage && collect_garbage_slots(c) == 0)
 		goto retry;
-	}
-	/* All out of space.  Need to allocate a new page. Use slot 0. */
-	kip = kmalloc(sizeof(struct kprobe_insn_page), GFP_KERNEL);
+
+	/* All out of space.  Need to allocate a new page. */
+	kip = kmalloc(KPROBE_INSN_PAGE_SIZE(slots_per_page(c)), GFP_KERNEL);
 	if (!kip)
 		return NULL;
 
@@ -192,19 +209,20 @@ static kprobe_opcode_t __kprobes *__get_insn_slot(void)
 		return NULL;
 	}
 	INIT_LIST_HEAD(&kip->list);
-	list_add(&kip->list, &kprobe_insn_pages);
-	memset(kip->slot_used, SLOT_CLEAN, INSNS_PER_PAGE);
+	memset(kip->slot_used, SLOT_CLEAN, slots_per_page(c));
 	kip->slot_used[0] = SLOT_USED;
 	kip->nused = 1;
 	kip->ngarbage = 0;
+	list_add(&kip->list, &c->pages);
 	return kip->insns;
 }
 
+
 kprobe_opcode_t __kprobes *get_insn_slot(void)
 {
-	kprobe_opcode_t *ret;
+	kprobe_opcode_t *ret = NULL;
 	mutex_lock(&kprobe_insn_mutex);
-	ret = __get_insn_slot();
+	ret = __get_insn_slot(&kprobe_insn_slots);
 	mutex_unlock(&kprobe_insn_mutex);
 	return ret;
 }
@@ -221,7 +239,7 @@ static int __kprobes collect_one_slot(struct kprobe_insn_page *kip, int idx)
 		 * so as not to have to set it up again the
 		 * next time somebody inserts a probe.
 		 */
-		if (!list_is_singular(&kprobe_insn_pages)) {
+		if (!list_is_singular(&kip->list)) {
 			list_del(&kip->list);
 			module_free(NULL, kip->insns);
 			kfree(kip);
@@ -231,7 +249,7 @@ static int __kprobes collect_one_slot(struct kprobe_insn_page *kip, int idx)
 	return 0;
 }
 
-static int __kprobes collect_garbage_slots(void)
+static int __kprobes collect_garbage_slots(struct kprobe_insn_cache *c)
 {
 	struct kprobe_insn_page *kip, *next;
 
@@ -239,42 +257,48 @@ static int __kprobes collect_garbage_slots(void)
 	if (check_safety())
 		return -EAGAIN;
 
-	list_for_each_entry_safe(kip, next, &kprobe_insn_pages, list) {
+	list_for_each_entry_safe(kip, next, &c->pages, list) {
 		int i;
 		if (kip->ngarbage == 0)
 			continue;
 		kip->ngarbage = 0;	/* we will collect all garbages */
-		for (i = 0; i < INSNS_PER_PAGE; i++) {
+		for (i = 0; i < slots_per_page(c); i++) {
 			if (kip->slot_used[i] == SLOT_DIRTY &&
 			    collect_one_slot(kip, i))
 				break;
 		}
 	}
-	kprobe_garbage_slots = 0;
+	c->nr_garbage = 0;
 	return 0;
 }
 
-void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
+static void __kprobes __free_insn_slot(struct kprobe_insn_cache *c,
+				       kprobe_opcode_t *slot, int dirty)
 {
 	struct kprobe_insn_page *kip;
 
-	mutex_lock(&kprobe_insn_mutex);
-	list_for_each_entry(kip, &kprobe_insn_pages, list) {
-		if (kip->insns <= slot &&
-		    slot < kip->insns + (INSNS_PER_PAGE * MAX_INSN_SIZE)) {
-			int i = (slot - kip->insns) / MAX_INSN_SIZE;
+	list_for_each_entry(kip, &c->pages, list) {
+		long idx = ((long)slot - (long)kip->insns) / c->insn_size;
+		if (idx >= 0 && idx < slots_per_page(c)) {
+			WARN_ON(kip->slot_used[idx] != SLOT_USED);
 			if (dirty) {
-				kip->slot_used[i] = SLOT_DIRTY;
+				kip->slot_used[idx] = SLOT_DIRTY;
 				kip->ngarbage++;
+				if (++c->nr_garbage > slots_per_page(c))
+					collect_garbage_slots(c);
 			} else
-				collect_one_slot(kip, i);
-			break;
+				collect_one_slot(kip, idx);
+			return;
 		}
 	}
+	/* Could not free this slot. */
+	WARN_ON(1);
+}
 
-	if (dirty && ++kprobe_garbage_slots > INSNS_PER_PAGE)
-		collect_garbage_slots();
-
+void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
+{
+	mutex_lock(&kprobe_insn_mutex);
+	__free_insn_slot(&kprobe_insn_slots, slot, dirty);
 	mutex_unlock(&kprobe_insn_mutex);
 }
 #endif


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 01/10] kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 02/10] kprobes: Introduce generic insn_slot framework Masami Hiramatsu
@ 2009-11-23 23:21 ` Masami Hiramatsu
  2009-11-24  2:44   ` Frederic Weisbecker
  2009-11-23 23:21 ` [PATCH -tip v5 04/10] kprobes: Jump optimization sysctl interface Masami Hiramatsu
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Introduce the arch-independent parts of kprobes jump optimization.
Kprobes uses a breakpoint instruction to intercept the execution flow;
on some architectures it can be replaced by a jump instruction plus
interruption-emulation code, which improves kprobes' performance
drastically.

To enable this feature, set CONFIG_OPTPROBES=y (default y if the arch
supports OPTPROBE).

Changes in v5:
- Use get_online_cpus()/put_online_cpus() for avoiding text_mutex
  deadlock.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/Kconfig            |   13 ++
 include/linux/kprobes.h |   36 ++++
 kernel/kprobes.c        |  401 ++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 404 insertions(+), 46 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 28146cd..86a294a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -57,6 +57,17 @@ config KPROBES
 	  for kernel debugging, non-intrusive instrumentation and testing.
 	  If in doubt, say "N".
 
+config OPTPROBES
+	bool "Kprobes jump optimization support (EXPERIMENTAL)"
+	default y
+	depends on KPROBES
+	depends on !PREEMPT
+	depends on HAVE_OPTPROBES
+	select KALLSYMS_ALL
+	help
+	  This option will allow kprobes to optimize breakpoint to
+	  a jump for reducing its overhead.
+
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
 	help
@@ -99,6 +110,8 @@ config HAVE_KPROBES
 config HAVE_KRETPROBES
 	bool
 
+config HAVE_OPTPROBES
+	bool
 #
 # An arch should select this if it provides all these things:
 #
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index 1b672f7..aed1f95 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -122,6 +122,11 @@ struct kprobe {
 /* Kprobe status flags */
 #define KPROBE_FLAG_GONE	1 /* breakpoint has already gone */
 #define KPROBE_FLAG_DISABLED	2 /* probe is temporarily disabled */
+#define KPROBE_FLAG_OPTIMIZED	4 /*
+				   * probe is really optimized.
+				   * NOTE:
+				   * this flag is only for optimized_kprobe.
+				   */
 
 /* Has this kprobe gone ? */
 static inline int kprobe_gone(struct kprobe *p)
@@ -134,6 +139,12 @@ static inline int kprobe_disabled(struct kprobe *p)
 {
 	return p->flags & (KPROBE_FLAG_DISABLED | KPROBE_FLAG_GONE);
 }
+
+/* Is this kprobe really running optimized path ? */
+static inline int kprobe_optimized(struct kprobe *p)
+{
+	return p->flags & KPROBE_FLAG_OPTIMIZED;
+}
 /*
  * Special probe type that uses setjmp-longjmp type tricks to resume
  * execution at a specified entry with a matching prototype corresponding
@@ -249,6 +260,31 @@ extern kprobe_opcode_t *get_insn_slot(void);
 extern void free_insn_slot(kprobe_opcode_t *slot, int dirty);
 extern void kprobes_inc_nmissed_count(struct kprobe *p);
 
+#ifdef CONFIG_OPTPROBES
+/*
+ * Internal structure for direct jump optimized probe
+ */
+struct optimized_kprobe {
+	struct kprobe kp;
+	struct list_head list;	/* list for optimizing queue */
+	struct arch_optimized_insn optinsn;
+};
+
+/* Architecture dependent functions for direct jump optimization */
+extern int arch_prepared_optinsn(struct arch_optimized_insn *optinsn);
+extern int arch_check_optimized_kprobe(struct optimized_kprobe *op);
+extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op);
+extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op);
+extern int  arch_optimize_kprobe(struct optimized_kprobe *op);
+extern void arch_unoptimize_kprobe(struct optimized_kprobe *op);
+extern kprobe_opcode_t *get_optinsn_slot(void);
+extern void free_optinsn_slot(kprobe_opcode_t *slot, int dirty);
+extern int arch_within_optimized_kprobe(struct optimized_kprobe *op,
+					unsigned long addr);
+
+extern void opt_pre_handler(struct kprobe *p, struct pt_regs *regs);
+#endif /* CONFIG_OPTPROBES */
+
 /* Get the kprobe at this addr (if any) - called with preemption disabled */
 struct kprobe *get_kprobe(void *addr);
 void kretprobe_hash_lock(struct task_struct *tsk,
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 10d2ed5..15aa797 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include <linux/debugfs.h>
 #include <linux/kdebug.h>
 #include <linux/memory.h>
+#include <linux/cpu.h>
 
 #include <asm-generic/sections.h>
 #include <asm/cacheflush.h>
@@ -301,6 +302,31 @@ void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
 	__free_insn_slot(&kprobe_insn_slots, slot, dirty);
 	mutex_unlock(&kprobe_insn_mutex);
 }
+#ifdef CONFIG_OPTPROBES
+/* For optimized_kprobe buffer */
+static DEFINE_MUTEX(kprobe_optinsn_mutex); /* Protects kprobe_optinsn_slots */
+static struct kprobe_insn_cache kprobe_optinsn_slots = {
+	.pages = LIST_HEAD_INIT(kprobe_optinsn_slots.pages),
+	/* .insn_size is initialized later */
+	.nr_garbage = 0,
+};
+/* Get a slot for optimized_kprobe buffer */
+kprobe_opcode_t __kprobes *get_optinsn_slot(void)
+{
+	kprobe_opcode_t *ret = NULL;
+	mutex_lock(&kprobe_optinsn_mutex);
+	ret = __get_insn_slot(&kprobe_optinsn_slots);
+	mutex_unlock(&kprobe_optinsn_mutex);
+	return ret;
+}
+
+void __kprobes free_optinsn_slot(kprobe_opcode_t * slot, int dirty)
+{
+	mutex_lock(&kprobe_optinsn_mutex);
+	__free_insn_slot(&kprobe_optinsn_slots, slot, dirty);
+	mutex_unlock(&kprobe_optinsn_mutex);
+}
+#endif
 #endif
 
 /* We have preemption disabled.. so it is safe to use __ versions */
@@ -334,20 +360,270 @@ struct kprobe __kprobes *get_kprobe(void *addr)
 	return NULL;
 }
 
+static int __kprobes aggr_pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+/* Return true if the kprobe is an aggregator */
+static inline int kprobe_aggrprobe(struct kprobe *p)
+{
+	return p->pre_handler == aggr_pre_handler;
+}
+
+/*
+ * Keep all fields in the kprobe consistent
+ */
+static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
+{
+	memcpy(&p->opcode, &old_p->opcode, sizeof(kprobe_opcode_t));
+	memcpy(&p->ainsn, &old_p->ainsn, sizeof(struct arch_specific_insn));
+}
+
+#ifdef CONFIG_OPTPROBES
+/*
+ * Call all pre_handler on the list, but ignores its return value.
+ * This must be called from arch-dep optimized caller.
+ */
+void __kprobes opt_pre_handler(struct kprobe *p, struct pt_regs *regs)
+{
+	struct kprobe *kp;
+
+	list_for_each_entry_rcu(kp, &p->list, list) {
+		if (kp->pre_handler && likely(!kprobe_disabled(kp))) {
+			set_kprobe_instance(kp);
+			kp->pre_handler(kp, regs);
+		}
+		reset_kprobe_instance();
+	}
+}
+
+/* Return true(!0) if the kprobe is ready for optimization. */
+static inline int kprobe_optready(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	if (kprobe_aggrprobe(p)) {
+		op = container_of(p, struct optimized_kprobe, kp);
+		return arch_prepared_optinsn(&op->optinsn);
+	}
+	return 0;
+}
+
+/* Return an optimized kprobe which replaces instructions including addr. */
+struct kprobe *__kprobes get_optimized_kprobe(unsigned long addr)
+{
+	int i;
+	struct kprobe *p = NULL;
+	struct optimized_kprobe *op;
+	for (i = 0; !p && i < MAX_OPTIMIZED_LENGTH; i++)
+		p = get_kprobe((void *)(addr - i));
+
+	if (p && kprobe_optready(p)) {
+		op = container_of(p, struct optimized_kprobe, kp);
+		if (arch_within_optimized_kprobe(op, addr))
+			return p;
+	}
+	return NULL;
+}
+
+/* Optimization staging list, protected by kprobe_mutex */
+static LIST_HEAD(optimizing_list);
+
+static void kprobe_optimizer(struct work_struct *work);
+static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
+#define OPTIMIZE_DELAY 5
+
+/* Kprobe jump optimizer */
+static __kprobes void kprobe_optimizer(struct work_struct *work)
+{
+	struct optimized_kprobe *op, *tmp;
+
+	/* Lock modules while optimizing kprobes */
+	mutex_lock(&module_mutex);
+	mutex_lock(&kprobe_mutex);
+	if (kprobes_all_disarmed)
+		goto end;
+
+	/* Wait quiesence period for ensuring all interrupts are done */
+	synchronize_sched();
+
+	get_online_cpus();	/* Use online_cpus while optimizing */
+	mutex_lock(&text_mutex);
+	list_for_each_entry_safe(op, tmp, &optimizing_list, list) {
+		WARN_ON(kprobe_disabled(&op->kp));
+		if (arch_optimize_kprobe(op) < 0)
+			op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+		list_del_init(&op->list);
+	}
+	mutex_unlock(&text_mutex);
+	put_online_cpus();
+end:
+	mutex_unlock(&kprobe_mutex);
+	mutex_unlock(&module_mutex);
+}
+
+/* Optimize kprobe if p is ready to be optimized */
+static __kprobes void optimize_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	/* Check if the kprobe is disabled or not ready for optimization. */
+	if (!kprobe_optready(p) ||
+	    (kprobe_disabled(p) || kprobes_all_disarmed))
+		return;
+
+	/* Both of break_handler and post_handler are not supported. */
+	if (p->break_handler || p->post_handler)
+		return;
+
+	op = container_of(p, struct optimized_kprobe, kp);
+
+	/* Check there is no other kprobes at the optimized instructions */
+	if (arch_check_optimized_kprobe(op) < 0)
+		return;
+
+	/* Check if it is already optimized. */
+	if (op->kp.flags & KPROBE_FLAG_OPTIMIZED)
+		return;
+
+	op->kp.flags |= KPROBE_FLAG_OPTIMIZED;
+	list_add(&op->list, &optimizing_list);
+	if (!delayed_work_pending(&optimizing_work))
+		schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
+}
+
+/* Unoptimize a kprobe if p is optimized */
+static __kprobes void unoptimize_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	if ((p->flags & KPROBE_FLAG_OPTIMIZED) && kprobe_aggrprobe(p)) {
+		op = container_of(p, struct optimized_kprobe, kp);
+		if (!list_empty(&op->list))
+			/* Dequeue from the optimization queue */
+			list_del_init(&op->list);
+		else
+			/* Replace jump with break */
+			arch_unoptimize_kprobe(op);
+		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+	}
+}
+
+/* Remove optimized instructions */
+static void __kprobes kill_optimized_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	op = container_of(p, struct optimized_kprobe, kp);
+	if (!list_empty(&op->list)) {
+		/* Dequeue from the optimization queue */
+		list_del_init(&op->list);
+		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+	}
+	/* Don't unoptimize, because the target code will be freed. */
+	arch_remove_optimized_kprobe(op);
+}
+
+/* Try to prepare optimized instructions */
+static __kprobes void prepare_optimized_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	op = container_of(p, struct optimized_kprobe, kp);
+	arch_prepare_optimized_kprobe(op);
+}
+
+/* Free optimized instructions and optimized_kprobe */
+static __kprobes void free_aggr_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+	op = container_of(p, struct optimized_kprobe, kp);
+	arch_remove_optimized_kprobe(op);
+	kfree(op);
+}
+
+/* Allocate new optimized_kprobe and try to prepare optimized instructions */
+static __kprobes struct kprobe *alloc_aggr_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+
+	op = kzalloc(sizeof(struct optimized_kprobe), GFP_KERNEL);
+	if (!op)
+		return NULL;
+
+	INIT_LIST_HEAD(&op->list);
+	op->kp.addr = p->addr;
+	arch_prepare_optimized_kprobe(op);
+	return &op->kp;
+}
+
+static void __kprobes init_aggr_kprobe(struct kprobe *ap, struct kprobe *p);
+
+/*
+ * Prepare an optimized_kprobe and optimize it
+ * NOTE: p must be a normal registered kprobe
+ */
+static __kprobes void try_to_optimize_kprobe(struct kprobe *p)
+{
+	struct kprobe *ap;
+	struct optimized_kprobe *op;
+
+	ap = alloc_aggr_kprobe(p);
+	if (!ap)
+		return;
+
+	op = container_of(ap, struct optimized_kprobe, kp);
+	if (!arch_prepared_optinsn(&op->optinsn)) {
+		/* If failed to setup optimizing, fallback to kprobe */
+		free_aggr_kprobe(ap);
+		return;
+	}
+
+	init_aggr_kprobe(ap, p);
+	optimize_kprobe(ap);
+	return;
+}
+#else /* !CONFIG_OPTPROBES */
+#define get_optimized_kprobe(addr)		(NULL)
+#define optimize_kprobe(p)			do {} while (0)
+#define unoptimize_kprobe(p)			do {} while (0)
+#define kill_optimized_kprobe(p)		do {} while (0)
+#define prepare_optimized_kprobe(p)		do {} while (0)
+#define try_to_optimize_kprobe(p)		do {} while (0)
+
+static __kprobes void free_aggr_kprobe(struct kprobe *p)
+{
+	kfree(p);
+}
+
+static __kprobes struct kprobe *alloc_aggr_kprobe(struct kprobe *p)
+{
+	return kzalloc(sizeof(struct kprobe), GFP_KERNEL);
+}
+#endif /* CONFIG_OPTPROBES */
+
+static void __kprobes __arm_kprobe(struct kprobe *kp)
+{
+	arch_arm_kprobe(kp);
+	optimize_kprobe(kp);	/* Try to re-optimize */
+}
+
+static void __kprobes __disarm_kprobe(struct kprobe *kp)
+{
+	unoptimize_kprobe(kp);	/* Try to unoptimize */
+	arch_disarm_kprobe(kp);
+}
+
 /* Arm a kprobe with text_mutex */
 static void __kprobes arm_kprobe(struct kprobe *kp)
 {
+	/* optimize_kprobe doesn't need online_cpus. */
 	mutex_lock(&text_mutex);
-	arch_arm_kprobe(kp);
+	__arm_kprobe(kp);
 	mutex_unlock(&text_mutex);
 }
 
 /* Disarm a kprobe with text_mutex */
 static void __kprobes disarm_kprobe(struct kprobe *kp)
 {
+	get_online_cpus();	/* unoptimize_kprobe requires online_cpus */
 	mutex_lock(&text_mutex);
-	arch_disarm_kprobe(kp);
+	__disarm_kprobe(kp);
 	mutex_unlock(&text_mutex);
+	put_online_cpus();
 }
 
 /*
@@ -416,7 +692,7 @@ static int __kprobes aggr_break_handler(struct kprobe *p, struct pt_regs *regs)
 void __kprobes kprobes_inc_nmissed_count(struct kprobe *p)
 {
 	struct kprobe *kp;
-	if (p->pre_handler != aggr_pre_handler) {
+	if (!kprobe_aggrprobe(p)) {
 		p->nmissed++;
 	} else {
 		list_for_each_entry_rcu(kp, &p->list, list)
@@ -540,21 +816,16 @@ static void __kprobes cleanup_rp_inst(struct kretprobe *rp)
 }
 
 /*
- * Keep all fields in the kprobe consistent
- */
-static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
-{
-	memcpy(&p->opcode, &old_p->opcode, sizeof(kprobe_opcode_t));
-	memcpy(&p->ainsn, &old_p->ainsn, sizeof(struct arch_specific_insn));
-}
-
-/*
 * Add the new probe to ap->list. Fail if this is the
 * second jprobe at the address - two jprobes can't coexist
 */
 static int __kprobes add_new_kprobe(struct kprobe *ap, struct kprobe *p)
 {
 	BUG_ON(kprobe_gone(ap) || kprobe_gone(p));
+
+	if (p->break_handler || p->post_handler)
+		unoptimize_kprobe(ap);	/* Fall back to normal kprobe */
+
 	if (p->break_handler) {
 		if (ap->break_handler)
 			return -EEXIST;
@@ -569,7 +840,7 @@ static int __kprobes add_new_kprobe(struct kprobe *ap, struct kprobe *p)
 		ap->flags &= ~KPROBE_FLAG_DISABLED;
 		if (!kprobes_all_disarmed)
 			/* Arm the breakpoint again. */
-			arm_kprobe(ap);
+			__arm_kprobe(ap);
 	}
 	return 0;
 }
@@ -578,12 +849,13 @@ static int __kprobes add_new_kprobe(struct kprobe *ap, struct kprobe *p)
  * Fill in the required fields of the "manager kprobe". Replace the
  * earlier kprobe in the hlist with the manager kprobe
  */
-static inline void add_aggr_kprobe(struct kprobe *ap, struct kprobe *p)
+static void __kprobes init_aggr_kprobe(struct kprobe *ap, struct kprobe *p)
 {
+	/* Copy p's insn slot to ap */
 	copy_kprobe(p, ap);
 	flush_insn_slot(ap);
 	ap->addr = p->addr;
-	ap->flags = p->flags;
+	ap->flags = p->flags & ~KPROBE_FLAG_OPTIMIZED;
 	ap->pre_handler = aggr_pre_handler;
 	ap->fault_handler = aggr_fault_handler;
 	/* We don't care the kprobe which has gone. */
@@ -593,8 +865,9 @@ static inline void add_aggr_kprobe(struct kprobe *ap, struct kprobe *p)
 		ap->break_handler = aggr_break_handler;
 
 	INIT_LIST_HEAD(&ap->list);
-	list_add_rcu(&p->list, &ap->list);
+	INIT_HLIST_NODE(&ap->hlist);
 
+	list_add_rcu(&p->list, &ap->list);
 	hlist_replace_rcu(&p->hlist, &ap->hlist);
 }
 
@@ -608,12 +881,12 @@ static int __kprobes register_aggr_kprobe(struct kprobe *old_p,
 	int ret = 0;
 	struct kprobe *ap = old_p;
 
-	if (old_p->pre_handler != aggr_pre_handler) {
-		/* If old_p is not an aggr_probe, create new aggr_kprobe. */
-		ap = kzalloc(sizeof(struct kprobe), GFP_KERNEL);
+	if (!kprobe_aggrprobe(old_p)) {
+		/* If old_p is not an aggr_kprobe, create new aggr_kprobe. */
+		ap = alloc_aggr_kprobe(old_p);
 		if (!ap)
 			return -ENOMEM;
-		add_aggr_kprobe(ap, old_p);
+		init_aggr_kprobe(ap, old_p);
 	}
 
 	if (kprobe_gone(ap)) {
@@ -632,6 +905,9 @@ static int __kprobes register_aggr_kprobe(struct kprobe *old_p,
 			 */
 			return ret;
 
+		/* Prepare optimized instructions if possible. */
+		prepare_optimized_kprobe(ap);
+
 		/*
 		 * Clear gone flag to prevent allocating new slot again, and
 		 * set disabled flag because it is not armed yet.
@@ -640,6 +916,7 @@ static int __kprobes register_aggr_kprobe(struct kprobe *old_p,
 			    | KPROBE_FLAG_DISABLED;
 	}
 
+	/* Copy ap's insn slot to p */
 	copy_kprobe(ap, p);
 	return add_new_kprobe(ap, p);
 }
@@ -789,16 +1066,24 @@ int __kprobes register_kprobe(struct kprobe *p)
 	p->nmissed = 0;
 	INIT_LIST_HEAD(&p->list);
 	mutex_lock(&kprobe_mutex);
+	get_online_cpus();
+	mutex_lock(&text_mutex);
+
 	old_p = get_kprobe(p->addr);
 	if (old_p) {
+		/* Since this may unoptimize old_p, text_mutex must be held. */
 		ret = register_aggr_kprobe(old_p, p);
 		goto out;
 	}
 
-	mutex_lock(&text_mutex);
+	/* Check collision with other optimized kprobes */
+	old_p = get_optimized_kprobe((unsigned long)p->addr);
+	if (unlikely(old_p))
+		unoptimize_kprobe(old_p); /* Fallback to unoptimized kprobe */
+
 	ret = arch_prepare_kprobe(p);
 	if (ret)
-		goto out_unlock_text;
+		goto out;
 
 	INIT_HLIST_NODE(&p->hlist);
 	hlist_add_head_rcu(&p->hlist,
@@ -807,9 +1092,12 @@ int __kprobes register_kprobe(struct kprobe *p)
 	if (!kprobes_all_disarmed && !kprobe_disabled(p))
 		arch_arm_kprobe(p);
 
-out_unlock_text:
-	mutex_unlock(&text_mutex);
+	/* Try to optimize kprobe */
+	try_to_optimize_kprobe(p);
+
 out:
+	mutex_unlock(&text_mutex);
+	put_online_cpus();
 	mutex_unlock(&kprobe_mutex);
 
 	if (probed_mod)
@@ -831,7 +1119,7 @@ static int __kprobes __unregister_kprobe_top(struct kprobe *p)
 		return -EINVAL;
 
 	if (old_p == p ||
-	    (old_p->pre_handler == aggr_pre_handler &&
+	    (kprobe_aggrprobe(old_p) &&
 	     list_is_singular(&old_p->list))) {
 		/*
 		 * Only probe on the hash list. Disarm only if kprobes are
@@ -839,8 +1127,13 @@ static int __kprobes __unregister_kprobe_top(struct kprobe *p)
 		 * already have been removed. We save on flushing icache.
 		 */
 		if (!kprobes_all_disarmed && !kprobe_disabled(old_p))
-			disarm_kprobe(p);
+			disarm_kprobe(old_p);
 		hlist_del_rcu(&old_p->hlist);
+
+		/* If another kprobe was blocked, optimize it. */
+		old_p = get_optimized_kprobe((unsigned long)p->addr);
+		if (unlikely(old_p))
+			optimize_kprobe(old_p);
 	} else {
 		if (p->break_handler && !kprobe_gone(p))
 			old_p->break_handler = NULL;
@@ -855,8 +1148,13 @@ noclean:
 		list_del_rcu(&p->list);
 		if (!kprobe_disabled(old_p)) {
 			try_to_disable_aggr_kprobe(old_p);
-			if (!kprobes_all_disarmed && kprobe_disabled(old_p))
-				disarm_kprobe(old_p);
+			if (!kprobes_all_disarmed) {
+				if (kprobe_disabled(old_p))
+					disarm_kprobe(old_p);
+				else
+					/* Try to optimize this probe again */
+					optimize_kprobe(old_p);
+			}
 		}
 	}
 	return 0;
@@ -873,7 +1171,7 @@ static void __kprobes __unregister_kprobe_bottom(struct kprobe *p)
 		old_p = list_entry(p->list.next, struct kprobe, list);
 		list_del(&p->list);
 		arch_remove_kprobe(old_p);
-		kfree(old_p);
+		free_aggr_kprobe(old_p);
 	}
 }
 
@@ -1169,7 +1467,7 @@ static void __kprobes kill_kprobe(struct kprobe *p)
 	struct kprobe *kp;
 
 	p->flags |= KPROBE_FLAG_GONE;
-	if (p->pre_handler == aggr_pre_handler) {
+	if (kprobe_aggrprobe(p)) {
 		/*
 		 * If this is an aggr_kprobe, we have to list all the
 		 * chained probes and mark them GONE.
@@ -1178,6 +1476,7 @@ static void __kprobes kill_kprobe(struct kprobe *p)
 			kp->flags |= KPROBE_FLAG_GONE;
 		p->post_handler = NULL;
 		p->break_handler = NULL;
+		kill_optimized_kprobe(p);
 	}
 	/*
 	 * Here, we can remove insn_slot safely, because no thread calls
@@ -1287,6 +1586,11 @@ static int __init init_kprobes(void)
 		}
 	}
 
+#if defined(CONFIG_OPTPROBES) && defined(__ARCH_WANT_KPROBES_INSN_SLOT)
+	/* Init kprobe_optinsn_slots */
+	kprobe_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
+#endif
+
 	/* By default, kprobes are armed */
 	kprobes_all_disarmed = false;
 
@@ -1305,7 +1609,7 @@ static int __init init_kprobes(void)
 
 #ifdef CONFIG_DEBUG_FS
 static void __kprobes report_probe(struct seq_file *pi, struct kprobe *p,
-		const char *sym, int offset,char *modname)
+		const char *sym, int offset, char *modname, struct kprobe *pp)
 {
 	char *kprobe_type;
 
@@ -1315,19 +1619,21 @@ static void __kprobes report_probe(struct seq_file *pi, struct kprobe *p,
 		kprobe_type = "j";
 	else
 		kprobe_type = "k";
+
 	if (sym)
-		seq_printf(pi, "%p  %s  %s+0x%x  %s %s%s\n",
+		seq_printf(pi, "%p  %s  %s+0x%x  %s ",
 			p->addr, kprobe_type, sym, offset,
-			(modname ? modname : " "),
-			(kprobe_gone(p) ? "[GONE]" : ""),
-			((kprobe_disabled(p) && !kprobe_gone(p)) ?
-			 "[DISABLED]" : ""));
+			(modname ? modname : " "));
 	else
-		seq_printf(pi, "%p  %s  %p %s%s\n",
-			p->addr, kprobe_type, p->addr,
-			(kprobe_gone(p) ? "[GONE]" : ""),
-			((kprobe_disabled(p) && !kprobe_gone(p)) ?
-			 "[DISABLED]" : ""));
+		seq_printf(pi, "%p  %s  %p ",
+			p->addr, kprobe_type, p->addr);
+
+	if (!pp)
+		pp = p;
+	seq_printf(pi, "%s%s%s\n",
+		(kprobe_gone(p) ? "[GONE]" : ""),
+		((kprobe_disabled(p) && !kprobe_gone(p)) ?  "[DISABLED]" : ""),
+		(kprobe_optimized(pp) ? "[OPTIMIZED]" : ""));
 }
 
 static void __kprobes *kprobe_seq_start(struct seq_file *f, loff_t *pos)
@@ -1363,11 +1669,11 @@ static int __kprobes show_kprobe_addr(struct seq_file *pi, void *v)
 	hlist_for_each_entry_rcu(p, node, head, hlist) {
 		sym = kallsyms_lookup((unsigned long)p->addr, NULL,
 					&offset, &modname, namebuf);
-		if (p->pre_handler == aggr_pre_handler) {
+		if (kprobe_aggrprobe(p)) {
 			list_for_each_entry_rcu(kp, &p->list, list)
-				report_probe(pi, kp, sym, offset, modname);
+				report_probe(pi, kp, sym, offset, modname, p);
 		} else
-			report_probe(pi, p, sym, offset, modname);
+			report_probe(pi, p, sym, offset, modname, NULL);
 	}
 	preempt_enable();
 	return 0;
@@ -1470,12 +1776,13 @@ static void __kprobes arm_all_kprobes(void)
 	if (!kprobes_all_disarmed)
 		goto already_enabled;
 
+	/* Arming doesn't require online_cpus */
 	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			if (!kprobe_disabled(p))
-				arch_arm_kprobe(p);
+				__arm_kprobe(p);
 	}
 	mutex_unlock(&text_mutex);
 
@@ -1502,16 +1809,18 @@ static void __kprobes disarm_all_kprobes(void)
 
 	kprobes_all_disarmed = true;
 	printk(KERN_INFO "Kprobes globally disabled\n");
+	get_online_cpus();	/* Disarming will unoptimize kprobes */
 	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
 			if (!arch_trampoline_kprobe(p) && !kprobe_disabled(p))
-				arch_disarm_kprobe(p);
+				__disarm_kprobe(p);
 		}
 	}
 
 	mutex_unlock(&text_mutex);
+	put_online_cpus();
 	mutex_unlock(&kprobe_mutex);
 	/* Allow all currently running kprobes to complete */
 	synchronize_sched();


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH -tip v5 04/10] kprobes: Jump optimization sysctl interface
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (2 preceding siblings ...)
  2009-11-23 23:21 ` [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization Masami Hiramatsu
@ 2009-11-23 23:21 ` Masami Hiramatsu
  2009-11-23 23:21 ` [PATCH -tip v5 05/10] kprobes/x86: Boost probes when reentering Masami Hiramatsu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Add a /proc/sys/debug/kprobes-optimization sysctl that enables and disables
kprobes jump optimization on the fly, for debugging.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 include/linux/kprobes.h |    8 ++++
 kernel/kprobes.c        |   88 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c         |   13 +++++++
 3 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index aed1f95..e7d1b2e 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -283,6 +283,14 @@ extern int arch_within_optimized_kprobe(struct optimized_kprobe *op,
 					unsigned long addr);
 
 extern void opt_pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+#ifdef CONFIG_SYSCTL
+extern int sysctl_kprobes_optimization;
+extern int proc_kprobes_optimization_handler(struct ctl_table *table,
+					     int write, void __user *buffer,
+					     size_t *length, loff_t *ppos);
+#endif
+
 #endif /* CONFIG_OPTPROBES */
 
 /* Get the kprobe at this addr (if any) - called with preemption disabled */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 15aa797..1e862ed 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -42,6 +42,7 @@
 #include <linux/freezer.h>
 #include <linux/seq_file.h>
 #include <linux/debugfs.h>
+#include <linux/sysctl.h>
 #include <linux/kdebug.h>
 #include <linux/memory.h>
 #include <linux/cpu.h>
@@ -378,6 +379,9 @@ static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
 }
 
 #ifdef CONFIG_OPTPROBES
+/* NOTE: change this value only with kprobe_mutex held */
+static bool kprobes_allow_optimization;
+
 /*
  * Call all pre_handler on the list, but ignores its return value.
  * This must be called from arch-dep optimized caller.
@@ -438,7 +442,7 @@ static __kprobes void kprobe_optimizer(struct work_struct *work)
 	/* Lock modules while optimizing kprobes */
 	mutex_lock(&module_mutex);
 	mutex_lock(&kprobe_mutex);
-	if (kprobes_all_disarmed)
+	if (kprobes_all_disarmed || !kprobes_allow_optimization)
 		goto end;
 
 	/* Wait a quiescence period to ensure all interrupts are done */
@@ -464,7 +468,7 @@ static __kprobes void optimize_kprobe(struct kprobe *p)
 {
 	struct optimized_kprobe *op;
 	/* Check if the kprobe is disabled or not ready for optimization. */
-	if (!kprobe_optready(p) ||
+	if (!kprobe_optready(p) || !kprobes_allow_optimization ||
 	    (kprobe_disabled(p) || kprobes_all_disarmed))
 		return;
 
@@ -576,6 +580,80 @@ static __kprobes void try_to_optimize_kprobe(struct kprobe *p)
 	optimize_kprobe(ap);
 	return;
 }
+
+#ifdef CONFIG_SYSCTL
+static void __kprobes optimize_all_kprobes(void)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct kprobe *p;
+	unsigned int i;
+
+	/* If optimization is already allowed, just return */
+	if (kprobes_allow_optimization)
+		return;
+
+	kprobes_allow_optimization = true;
+	/* Optimizing doesn't require online_cpus */
+	mutex_lock(&text_mutex);
+	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
+		head = &kprobe_table[i];
+		hlist_for_each_entry_rcu(p, node, head, hlist)
+			if (!kprobe_disabled(p))
+				optimize_kprobe(p);
+	}
+	mutex_unlock(&text_mutex);
+	printk(KERN_INFO "Kprobes globally optimized\n");
+}
+
+static void __kprobes unoptimize_all_kprobes(void)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct kprobe *p;
+	unsigned int i;
+
+	/* If optimization is already prohibited, just return */
+	if (!kprobes_allow_optimization)
+		return;
+
+	kprobes_allow_optimization = false;
+	printk(KERN_INFO "Kprobes globally unoptimized\n");
+	get_online_cpus();	/* unoptimizing requires online_cpus */
+	mutex_lock(&text_mutex);
+	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
+		head = &kprobe_table[i];
+		hlist_for_each_entry_rcu(p, node, head, hlist) {
+			if (!kprobe_disabled(p))
+				unoptimize_kprobe(p);
+		}
+	}
+
+	mutex_unlock(&text_mutex);
+	put_online_cpus();
+	/* Allow all currently running kprobes to complete */
+	synchronize_sched();
+}
+
+int sysctl_kprobes_optimization;
+int proc_kprobes_optimization_handler(struct ctl_table *table, int write,
+				      void __user *buffer, size_t *length,
+				      loff_t *ppos)
+{
+	int ret;
+	mutex_lock(&kprobe_mutex);
+	sysctl_kprobes_optimization = kprobes_allow_optimization ? 1 : 0;
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+	if (sysctl_kprobes_optimization)
+		optimize_all_kprobes();
+	else
+		unoptimize_all_kprobes();
+	mutex_unlock(&kprobe_mutex);
+	return ret;
+}
+#endif /* CONFIG_SYSCTL */
+
 #else /* !CONFIG_OPTPROBES */
 #define get_optimized_kprobe(addr)		(NULL)
 #define optimize_kprobe(p)			do {} while (0)
@@ -1586,10 +1664,14 @@ static int __init init_kprobes(void)
 		}
 	}
 
-#if defined(CONFIG_OPTPROBES) && defined(__ARCH_WANT_KPROBES_INSN_SLOT)
+#if defined(CONFIG_OPTPROBES)
+#if defined(__ARCH_WANT_KPROBES_INSN_SLOT)
 	/* Init kprobe_optinsn_slots */
 	kprobe_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
 #endif
+	/* By default, kprobes can be optimized */
+	kprobes_allow_optimization = true;
+#endif
 
 	/* By default, kprobes are armed */
 	kprobes_all_disarmed = false;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0060ce7..330aa70 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -51,6 +51,7 @@
 #include <linux/ftrace.h>
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
+#include <linux/kprobes.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1621,6 +1622,18 @@ static struct ctl_table debug_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 #endif
+#if defined(CONFIG_OPTPROBES)
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "kprobes-optimization",
+		.data		= &sysctl_kprobes_optimization,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_kprobes_optimization_handler,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 



* [PATCH -tip v5 05/10] kprobes/x86: Boost probes when reentering
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (3 preceding siblings ...)
  2009-11-23 23:21 ` [PATCH -tip v5 04/10] kprobes: Jump optimization sysctl interface Masami Hiramatsu
@ 2009-11-23 23:21 ` Masami Hiramatsu
  2009-11-23 23:22 ` [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers Masami Hiramatsu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:21 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Integrate prepare_singlestep() into setup_singlestep(), so that reentrant
probes can also be boosted where possible.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/kernel/kprobes.c |   48 ++++++++++++++++++++++++---------------------
 1 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index b6b75f1..85d88f6 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -403,18 +403,6 @@ static void __kprobes restore_btf(void)
 		update_debugctlmsr(current->thread.debugctlmsr);
 }
 
-static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
-{
-	clear_btf();
-	regs->flags |= X86_EFLAGS_TF;
-	regs->flags &= ~X86_EFLAGS_IF;
-	/* single step inline if the instruction is an int3 */
-	if (p->opcode == BREAKPOINT_INSTRUCTION)
-		regs->ip = (unsigned long)p->addr;
-	else
-		regs->ip = (unsigned long)p->ainsn.insn;
-}
-
 void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
 				      struct pt_regs *regs)
 {
@@ -427,19 +415,38 @@ void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
 }
 
 static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs,
-				       struct kprobe_ctlblk *kcb)
+				       struct kprobe_ctlblk *kcb, int reenter)
 {
 #if !defined(CONFIG_PREEMPT) || defined(CONFIG_FREEZER)
 	if (p->ainsn.boostable == 1 && !p->post_handler) {
 		/* Boost up -- we can execute copied instructions directly */
-		reset_current_kprobe();
+		if (!reenter)
+			reset_current_kprobe();
+		/*
+		 * Reentering boosted probe doesn't reset current_kprobe,
+		 * nor set current_kprobe, because it doesn't use single
+		 * stepping.
+		 */
 		regs->ip = (unsigned long)p->ainsn.insn;
 		preempt_enable_no_resched();
 		return;
 	}
 #endif
-	prepare_singlestep(p, regs);
-	kcb->kprobe_status = KPROBE_HIT_SS;
+	if (reenter) {
+		save_previous_kprobe(kcb);
+		set_current_kprobe(p, regs, kcb);
+		kcb->kprobe_status = KPROBE_REENTER;
+	} else
+		kcb->kprobe_status = KPROBE_HIT_SS;
+	/* Prepare real single stepping */
+	clear_btf();
+	regs->flags |= X86_EFLAGS_TF;
+	regs->flags &= ~X86_EFLAGS_IF;
+	/* single step inline if the instruction is an int3 */
+	if (p->opcode == BREAKPOINT_INSTRUCTION)
+		regs->ip = (unsigned long)p->addr;
+	else
+		regs->ip = (unsigned long)p->ainsn.insn;
 }
 
 /*
@@ -453,11 +460,8 @@ static int __kprobes reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
 	switch (kcb->kprobe_status) {
 	case KPROBE_HIT_SSDONE:
 	case KPROBE_HIT_ACTIVE:
-		save_previous_kprobe(kcb);
-		set_current_kprobe(p, regs, kcb);
 		kprobes_inc_nmissed_count(p);
-		prepare_singlestep(p, regs);
-		kcb->kprobe_status = KPROBE_REENTER;
+		setup_singlestep(p, regs, kcb, 1);
 		break;
 	case KPROBE_HIT_SS:
 		/* A probe has been hit in the codepath leading up to, or just
@@ -532,13 +536,13 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
 			 * more here.
 			 */
 			if (!p->pre_handler || !p->pre_handler(p, regs))
-				setup_singlestep(p, regs, kcb);
+				setup_singlestep(p, regs, kcb, 0);
 			return 1;
 		}
 	} else if (kprobe_running()) {
 		p = __get_cpu_var(current_kprobe);
 		if (p->break_handler && p->break_handler(p, regs)) {
-			setup_singlestep(p, regs, kcb);
+			setup_singlestep(p, regs, kcb, 0);
 			return 1;
 		}
 	} /* else: not a kprobe fault; let the kernel handle it */



* [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (4 preceding siblings ...)
  2009-11-23 23:21 ` [PATCH -tip v5 05/10] kprobes/x86: Boost probes when reentering Masami Hiramatsu
@ 2009-11-23 23:22 ` Masami Hiramatsu
  2009-11-24  2:51   ` Frederic Weisbecker
  2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:22 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Introduce SAVE_REGS_STRING/RESTORE_REGS_STRING macros to clean up the
kretprobe-trampoline asm code. These macros will also be used for emulating
an interrupt.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/kernel/kprobes.c |  128 ++++++++++++++++++++++++---------------------
 1 files changed, 67 insertions(+), 61 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 85d88f6..73ac21e 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -551,6 +551,69 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+#define SAVE_REGS_STRING		\
+	/* Skip cs, ip, orig_ax. */	\
+	"	subq $24, %rsp\n"	\
+	"	pushq %rdi\n"		\
+	"	pushq %rsi\n"		\
+	"	pushq %rdx\n"		\
+	"	pushq %rcx\n"		\
+	"	pushq %rax\n"		\
+	"	pushq %r8\n"		\
+	"	pushq %r9\n"		\
+	"	pushq %r10\n"		\
+	"	pushq %r11\n"		\
+	"	pushq %rbx\n"		\
+	"	pushq %rbp\n"		\
+	"	pushq %r12\n"		\
+	"	pushq %r13\n"		\
+	"	pushq %r14\n"		\
+	"	pushq %r15\n"
+#define RESTORE_REGS_STRING		\
+	"	popq %r15\n"		\
+	"	popq %r14\n"		\
+	"	popq %r13\n"		\
+	"	popq %r12\n"		\
+	"	popq %rbp\n"		\
+	"	popq %rbx\n"		\
+	"	popq %r11\n"		\
+	"	popq %r10\n"		\
+	"	popq %r9\n"		\
+	"	popq %r8\n"		\
+	"	popq %rax\n"		\
+	"	popq %rcx\n"		\
+	"	popq %rdx\n"		\
+	"	popq %rsi\n"		\
+	"	popq %rdi\n"		\
+	/* Skip orig_ax, ip, cs */	\
+	"	addq $24, %rsp\n"
+#else
+#define SAVE_REGS_STRING		\
+	/* Skip cs, ip, orig_ax and gs. */	\
+	"	subl $16, %esp\n"	\
+	"	pushl %fs\n"		\
+	"	pushl %ds\n"		\
+	"	pushl %es\n"		\
+	"	pushl %eax\n"		\
+	"	pushl %ebp\n"		\
+	"	pushl %edi\n"		\
+	"	pushl %esi\n"		\
+	"	pushl %edx\n"		\
+	"	pushl %ecx\n"		\
+	"	pushl %ebx\n"
+#define RESTORE_REGS_STRING		\
+	"	popl %ebx\n"		\
+	"	popl %ecx\n"		\
+	"	popl %edx\n"		\
+	"	popl %esi\n"		\
+	"	popl %edi\n"		\
+	"	popl %ebp\n"		\
+	"	popl %eax\n"		\
+	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here */\
+	"	addl $24, %esp\n"
+#endif
+
 /*
  * When a retprobed function returns, this code saves registers and
  * calls trampoline_handler(), which in turn calls the kretprobe's handler.
@@ -564,65 +627,16 @@ static void __used __kprobes kretprobe_trampoline_holder(void)
 			/* We don't bother saving the ss register */
 			"	pushq %rsp\n"
 			"	pushfq\n"
-			/*
-			 * Skip cs, ip, orig_ax.
-			 * trampoline_handler() will plug in these values
-			 */
-			"	subq $24, %rsp\n"
-			"	pushq %rdi\n"
-			"	pushq %rsi\n"
-			"	pushq %rdx\n"
-			"	pushq %rcx\n"
-			"	pushq %rax\n"
-			"	pushq %r8\n"
-			"	pushq %r9\n"
-			"	pushq %r10\n"
-			"	pushq %r11\n"
-			"	pushq %rbx\n"
-			"	pushq %rbp\n"
-			"	pushq %r12\n"
-			"	pushq %r13\n"
-			"	pushq %r14\n"
-			"	pushq %r15\n"
+			SAVE_REGS_STRING
 			"	movq %rsp, %rdi\n"
 			"	call trampoline_handler\n"
 			/* Replace saved sp with true return address. */
 			"	movq %rax, 152(%rsp)\n"
-			"	popq %r15\n"
-			"	popq %r14\n"
-			"	popq %r13\n"
-			"	popq %r12\n"
-			"	popq %rbp\n"
-			"	popq %rbx\n"
-			"	popq %r11\n"
-			"	popq %r10\n"
-			"	popq %r9\n"
-			"	popq %r8\n"
-			"	popq %rax\n"
-			"	popq %rcx\n"
-			"	popq %rdx\n"
-			"	popq %rsi\n"
-			"	popq %rdi\n"
-			/* Skip orig_ax, ip, cs */
-			"	addq $24, %rsp\n"
+			RESTORE_REGS_STRING
 			"	popfq\n"
 #else
 			"	pushf\n"
-			/*
-			 * Skip cs, ip, orig_ax and gs.
-			 * trampoline_handler() will plug in these values
-			 */
-			"	subl $16, %esp\n"
-			"	pushl %fs\n"
-			"	pushl %es\n"
-			"	pushl %ds\n"
-			"	pushl %eax\n"
-			"	pushl %ebp\n"
-			"	pushl %edi\n"
-			"	pushl %esi\n"
-			"	pushl %edx\n"
-			"	pushl %ecx\n"
-			"	pushl %ebx\n"
+			SAVE_REGS_STRING
 			"	movl %esp, %eax\n"
 			"	call trampoline_handler\n"
 			/* Move flags to cs */
@@ -630,15 +644,7 @@ static void __used __kprobes kretprobe_trampoline_holder(void)
 			"	movl %edx, 52(%esp)\n"
 			/* Replace saved flags with true return address. */
 			"	movl %eax, 56(%esp)\n"
-			"	popl %ebx\n"
-			"	popl %ecx\n"
-			"	popl %edx\n"
-			"	popl %esi\n"
-			"	popl %edi\n"
-			"	popl %ebp\n"
-			"	popl %eax\n"
-			/* Skip ds, es, fs, gs, orig_ax and ip */
-			"	addl $24, %esp\n"
+			RESTORE_REGS_STRING
 			"	popf\n"
 #endif
 			"	ret\n");



* [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (5 preceding siblings ...)
  2009-11-23 23:22 ` [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers Masami Hiramatsu
@ 2009-11-23 23:22 ` Masami Hiramatsu
  2009-11-24  3:14   ` Frederic Weisbecker
                     ` (2 more replies)
  2009-11-23 23:22 ` [PATCH -tip v5 08/10] kprobes: Add documents of jump optimization Masami Hiramatsu
                   ` (3 subsequent siblings)
  10 siblings, 3 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:22 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Introduce x86 arch-specific optimization code, which supports both
x86-32 and x86-64.

This code also supports a safety check, which decodes the whole function
in which the probe is inserted and verifies the following conditions before
optimization:
 - The instructions that will be replaced by the jump instruction
   don't straddle the function boundary.
 - The function contains no indirect jump instruction, because it could
   jump into the address range that is replaced by the jump operand.
 - The function contains no jump/loop instruction whose target is inside
   the address range that is replaced by the jump operand.
 - Kprobes are not optimized if they are in functions into which fixup
   code will jump.

This uses stop_machine() for cross-modifying the code from int3 to jump.
That prevents us from modifying code on the NMI/SMI path; however, since
kprobes itself doesn't support probing NMI/SMI code, this is not a
problem.

Changes in v5:
 - Introduce stop_machine-based jump replacing.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/Kconfig               |    1 
 arch/x86/include/asm/kprobes.h |   29 +++
 arch/x86/kernel/kprobes.c      |  457 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 465 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 17abcfa..af0313e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -31,6 +31,7 @@ config X86
 	select ARCH_WANT_FRAME_POINTERS
 	select HAVE_DMA_ATTRS
 	select HAVE_KRETPROBES
+	select HAVE_OPTPROBES
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FUNCTION_TRACER
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index eaec8ea..4ffa345 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -33,6 +33,9 @@ struct kprobe;
 typedef u8 kprobe_opcode_t;
 #define BREAKPOINT_INSTRUCTION	0xcc
 #define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
+#define RELATIVECALL_OPCODE 0xe8
+#define RELATIVE_ADDR_SIZE 4
 #define MAX_INSN_SIZE 16
 #define MAX_STACK_SIZE 64
 #define MIN_STACK_SIZE(ADDR)					       \
@@ -44,6 +47,17 @@ typedef u8 kprobe_opcode_t;
 
 #define flush_insn_slot(p)	do { } while (0)
 
+/* optinsn template addresses */
+extern kprobe_opcode_t optprobe_template_entry;
+extern kprobe_opcode_t optprobe_template_val;
+extern kprobe_opcode_t optprobe_template_call;
+extern kprobe_opcode_t optprobe_template_end;
+#define MAX_OPTIMIZED_LENGTH (MAX_INSN_SIZE + RELATIVE_ADDR_SIZE)
+#define MAX_OPTINSN_SIZE 				\
+	(((unsigned long)&optprobe_template_end -	\
+	  (unsigned long)&optprobe_template_entry) +	\
+	 MAX_OPTIMIZED_LENGTH + RELATIVEJUMP_SIZE)
+
 extern const int kretprobe_blacklist_size;
 
 void arch_remove_kprobe(struct kprobe *p);
@@ -64,6 +78,21 @@ struct arch_specific_insn {
 	int boostable;
 };
 
+struct arch_optimized_insn {
+	/* copy of the original instructions */
+	kprobe_opcode_t copied_insn[RELATIVE_ADDR_SIZE];
+	/* detour code buffer */
+	kprobe_opcode_t *insn;
+	/* the size of instructions copied to detour code buffer */
+	size_t size;
+};
+
+/* Return true (!0) if optinsn is prepared for optimization. */
+static inline int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
+{
+	return optinsn->size;
+}
+
 struct prev_kprobe {
 	struct kprobe *kp;
 	unsigned long status;
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 73ac21e..6d81c11 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kdebug.h>
 #include <linux/kallsyms.h>
+#include <linux/stop_machine.h>
 
 #include <asm/cacheflush.h>
 #include <asm/desc.h>
@@ -106,16 +107,21 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = {
 };
 const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);
 
-/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
-static void __kprobes set_jmp_op(void *from, void *to)
+static void __kprobes __synthesize_relative_insn(void *from, void *to, u8 op)
 {
-	struct __arch_jmp_op {
-		char op;
+	struct __arch_relative_insn {
+		u8 op;
 		s32 raddr;
-	} __attribute__((packed)) * jop;
-	jop = (struct __arch_jmp_op *)from;
-	jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
-	jop->op = RELATIVEJUMP_OPCODE;
+	} __attribute__((packed)) *insn;
+	insn = (struct __arch_relative_insn *)from;
+	insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
+	insn->op = op;
+}
+
+/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
+static void __kprobes synthesize_reljump(void *from, void *to)
+{
+	__synthesize_relative_insn(from, to, RELATIVEJUMP_OPCODE);
 }
 
 /*
@@ -202,7 +208,7 @@ static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
 	/*
 	 *  Basically, kp->ainsn.insn has an original instruction.
 	 *  However, RIP-relative instruction can not do single-stepping
-	 *  at different place, fix_riprel() tweaks the displacement of
+	 *  at different place, __copy_instruction() tweaks the displacement of
 	 *  that instruction. In that case, we can't recover the instruction
 	 *  from the kp->ainsn.insn.
 	 *
@@ -284,21 +290,37 @@ static int __kprobes is_IF_modifier(kprobe_opcode_t *insn)
 }
 
 /*
- * Adjust the displacement if the instruction uses the %rip-relative
- * addressing mode.
+ * Copy an instruction and adjust the displacement if the instruction
+ * uses the %rip-relative addressing mode.
  * If it does, Return the address of the 32-bit displacement word.
  * If not, return null.
  * Only applicable to 64-bit x86.
  */
-static void __kprobes fix_riprel(struct kprobe *p)
+static int __kprobes __copy_instruction(u8 *dest, u8 *src, int recover)
 {
-#ifdef CONFIG_X86_64
 	struct insn insn;
-	kernel_insn_init(&insn, p->ainsn.insn);
+	int ret;
+	kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+	kernel_insn_init(&insn, src);
+	if (recover) {
+		insn_get_opcode(&insn);
+		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+			ret = recover_probed_instruction(buf,
+							 (unsigned long)src);
+			if (ret)
+				return 0;
+			kernel_insn_init(&insn, buf);
+		}
+	}
+	insn_get_length(&insn);
+	memcpy(dest, insn.kaddr, insn.length);
 
+#ifdef CONFIG_X86_64
 	if (insn_rip_relative(&insn)) {
 		s64 newdisp;
 		u8 *disp;
+		kernel_insn_init(&insn, dest);
 		insn_get_displacement(&insn);
 		/*
 		 * The copied instruction uses the %rip-relative addressing
@@ -312,20 +334,23 @@ static void __kprobes fix_riprel(struct kprobe *p)
 		 * extension of the original signed 32-bit displacement would
 		 * have given.
 		 */
-		newdisp = (u8 *) p->addr + (s64) insn.displacement.value -
-			  (u8 *) p->ainsn.insn;
+		newdisp = (u8 *) src + (s64) insn.displacement.value -
+			  (u8 *) dest;
 		BUG_ON((s64) (s32) newdisp != newdisp); /* Sanity check.  */
-		disp = (u8 *) p->ainsn.insn + insn_offset_displacement(&insn);
+		disp = (u8 *) dest + insn_offset_displacement(&insn);
 		*(s32 *) disp = (s32) newdisp;
 	}
 #endif
+	return insn.length;
 }
 
 static void __kprobes arch_copy_kprobe(struct kprobe *p)
 {
-	memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
-
-	fix_riprel(p);
+	/*
+	 * Copy an instruction without recovering int3, because it will be
+	 * put by another subsystem.
+	 */
+	__copy_instruction(p->ainsn.insn, p->addr, 0);
 
 	if (can_boost(p->addr))
 		p->ainsn.boostable = 0;
@@ -414,9 +439,20 @@ void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
 	*sara = (unsigned long) &kretprobe_trampoline;
 }
 
+#ifdef CONFIG_OPTPROBES
+static int  __kprobes setup_detour_execution(struct kprobe *p,
+					     struct pt_regs *regs,
+					     int reenter);
+#else
+#define setup_detour_execution(p, regs, reenter) (0)
+#endif
+
 static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs,
 				       struct kprobe_ctlblk *kcb, int reenter)
 {
+	if (setup_detour_execution(p, regs, reenter))
+		return;
+
 #if !defined(CONFIG_PREEMPT) || defined(CONFIG_FREEZER)
 	if (p->ainsn.boostable == 1 && !p->post_handler) {
 		/* Boost up -- we can execute copied instructions directly */
@@ -812,8 +848,8 @@ static void __kprobes resume_execution(struct kprobe *p,
 			 * These instructions can be executed directly if it
 			 * jumps back to correct address.
 			 */
-			set_jmp_op((void *)regs->ip,
-				   (void *)orig_ip + (regs->ip - copy_ip));
+			synthesize_reljump((void *)regs->ip,
+				(void *)orig_ip + (regs->ip - copy_ip));
 			p->ainsn.boostable = 1;
 		} else {
 			p->ainsn.boostable = -1;
@@ -1040,6 +1076,383 @@ int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
 	return 0;
 }
 
+
+#ifdef CONFIG_OPTPROBES
+
+/* Insert a call instruction at address 'from', which calls address 'to'.*/
+static void __kprobes synthesize_relcall(void *from, void *to)
+{
+	__synthesize_relative_insn(from, to, RELATIVECALL_OPCODE);
+}
+
+/* Insert a move instruction which sets a pointer to eax/rdi (1st arg). */
+static void __kprobes synthesize_set_arg1(kprobe_opcode_t *addr,
+					  unsigned long val)
+{
+#ifdef CONFIG_X86_64
+	*addr++ = 0x48;
+	*addr++ = 0xbf;
+#else
+	*addr++ = 0xb8;
+#endif
+	*(unsigned long *)addr = val;
+}
+
+void __kprobes kprobes_optinsn_template_holder(void)
+{
+	asm volatile (
+			".global optprobe_template_entry\n"
+			"optprobe_template_entry: \n"
+#ifdef CONFIG_X86_64
+			/* We don't bother saving the ss register */
+			"	pushq %rsp\n"
+			"	pushfq\n"
+			SAVE_REGS_STRING
+			"	movq %rsp, %rsi\n"
+			".global optprobe_template_val\n"
+			"optprobe_template_val: \n"
+			ASM_NOP5
+			ASM_NOP5
+			".global optprobe_template_call\n"
+			"optprobe_template_call: \n"
+			ASM_NOP5
+			/* Move flags to rsp */
+			"	movq 144(%rsp), %rdx\n"
+			"	movq %rdx, 152(%rsp)\n"
+			RESTORE_REGS_STRING
+			/* Skip flags entry */
+			"	addq $8, %rsp\n"
+			"	popfq\n"
+#else /* CONFIG_X86_32 */
+			"	pushf\n"
+			SAVE_REGS_STRING
+			"	movl %esp, %edx\n"
+			".global optprobe_template_val\n"
+			"optprobe_template_val: \n"
+			ASM_NOP5
+			".global optprobe_template_call\n"
+			"optprobe_template_call: \n"
+			ASM_NOP5
+			RESTORE_REGS_STRING
+			"	addl $4, %esp\n"	/* skip cs */
+			"	popf\n"
+#endif
+			".global optprobe_template_end\n"
+			"optprobe_template_end: \n");
+}
+
+#define TMPL_MOVE_IDX \
+	((long)&optprobe_template_val - (long)&optprobe_template_entry)
+#define TMPL_CALL_IDX \
+	((long)&optprobe_template_call - (long)&optprobe_template_entry)
+#define TMPL_END_IDX \
+	((long)&optprobe_template_end - (long)&optprobe_template_entry)
+
+#define INT3_SIZE sizeof(kprobe_opcode_t)
+
+/* Optimized kprobe call back function: called from optinsn */
+static void __kprobes optimized_callback(struct optimized_kprobe *op,
+					 struct pt_regs *regs)
+{
+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+
+	preempt_disable();
+	if (kprobe_running()) {
+		kprobes_inc_nmissed_count(&op->kp);
+	} else {
+		/* Save skipped registers */
+#ifdef CONFIG_X86_64
+		regs->cs = __KERNEL_CS;
+#else
+		regs->cs = __KERNEL_CS | get_kernel_rpl();
+		regs->gs = 0;
+#endif
+		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
+		regs->orig_ax = ~0UL;
+
+		__get_cpu_var(current_kprobe) = &op->kp;
+		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+		opt_pre_handler(&op->kp, regs);
+		__get_cpu_var(current_kprobe) = NULL;
+	}
+	preempt_enable_no_resched();
+}
+
+static int __kprobes copy_optimized_instructions(u8 *dest, u8 *src)
+{
+	int len = 0, ret;
+	while (len < RELATIVEJUMP_SIZE) {
+		ret = __copy_instruction(dest + len, src + len, 1);
+		if (!ret || !can_boost(dest + len))
+			return -EINVAL;
+		len += ret;
+	}
+	return len;
+}
+
+/* Check whether insn is indirect jump */
+static int __kprobes insn_is_indirect_jump(struct insn *insn)
+{
+	return (insn->opcode.bytes[0] == 0xff ||
+		insn->opcode.bytes[0] == 0xea);
+}
+
+/* Check whether insn jumps into specified address range */
+static int insn_jump_into_range(struct insn *insn, unsigned long start, int len)
+{
+	unsigned long target = 0;
+	switch (insn->opcode.bytes[0]) {
+	case 0xe0:	/* loopne */
+	case 0xe1:	/* loope */
+	case 0xe2:	/* loop */
+	case 0xe3:	/* jcxz */
+	case 0xe9:	/* near relative jump */
+	case 0xeb:	/* short relative jump */
+		break;
+	case 0x0f:
+		if ((insn->opcode.bytes[1] & 0xf0) == 0x80) /* jcc near */
+			break;
+		return 0;
+	default:
+		if ((insn->opcode.bytes[0] & 0xf0) == 0x70) /* jcc short */
+			break;
+		return 0;
+	}
+	target = (unsigned long)insn->next_byte + insn->immediate.value;
+	return (start <= target && target <= start + len);
+}
+
+/* Decode the whole function to ensure no instruction jumps into the target */
+static int __kprobes can_optimize(unsigned long paddr)
+{
+	int ret;
+	unsigned long addr, size = 0, offset = 0;
+	struct insn insn;
+	kprobe_opcode_t buf[MAX_INSN_SIZE];
+	/* Dummy buffers for lookup_symbol_attrs */
+	static char __dummy_buf[KSYM_NAME_LEN];
+
+	/* Lookup symbol including addr */
+	if (!kallsyms_lookup(paddr, &size, &offset, NULL, __dummy_buf))
+		return 0;
+
+	/* Check there is enough space for a relative jump. */
+	if (size - offset < RELATIVEJUMP_SIZE)
+		return 0;
+
+	/* Decode instructions */
+	addr = paddr - offset;
+	while (addr < paddr - offset + size) { /* Decode until function end */
+		if (search_exception_tables(addr))
+			/*
+			 * Since some fixup code may jump into this function,
+			 * we can't optimize a kprobe in this function.
+			 */
+			return 0;
+		kernel_insn_init(&insn, (void *)addr);
+		insn_get_opcode(&insn);
+		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
+			ret = recover_probed_instruction(buf, addr);
+			if (ret)
+				return 0;
+			kernel_insn_init(&insn, buf);
+		}
+		insn_get_length(&insn);
+		/* Recover address */
+		insn.kaddr = (void *)addr;
+		insn.next_byte = (void *)(addr + insn.length);
+		/* Check any instructions don't jump into target */
+		if (insn_is_indirect_jump(&insn) ||
+		    insn_jump_into_range(&insn, paddr + INT3_SIZE,
+					 RELATIVE_ADDR_SIZE))
+			return 0;
+		addr += insn.length;
+	}
+
+	return 1;
+}
+
+/* Check optimized_kprobe can actually be optimized. */
+int __kprobes arch_check_optimized_kprobe(struct optimized_kprobe *op)
+{
+	int i;
+	for (i = 1; i < op->optinsn.size; i++)
+		if (get_kprobe(op->kp.addr + i))
+			return -EEXIST;
+	return 0;
+}
+
+/* Check the addr is within the optimized instructions. */
+int __kprobes arch_within_optimized_kprobe(struct optimized_kprobe *op,
+					   unsigned long addr)
+{
+	return ((unsigned long)op->kp.addr <= addr &&
+		(unsigned long)op->kp.addr + op->optinsn.size > addr);
+}
+
+/* Free optimized instruction slot */
+static __kprobes
+void __arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
+{
+	if (op->optinsn.insn) {
+		free_optinsn_slot(op->optinsn.insn, dirty);
+		op->optinsn.insn = NULL;
+		op->optinsn.size = 0;
+	}
+}
+
+void __kprobes arch_remove_optimized_kprobe(struct optimized_kprobe *op)
+{
+	__arch_remove_optimized_kprobe(op, 1);
+}
+
+/*
+ * Copy the instructions that will be replaced by the jump
+ * Target instructions MUST be relocatable (checked inside)
+ */
+int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
+{
+	u8 *buf;
+	int ret;
+
+	if (!can_optimize((unsigned long)op->kp.addr))
+		return -EILSEQ;
+
+	op->optinsn.insn = get_optinsn_slot();
+	if (!op->optinsn.insn)
+		return -ENOMEM;
+
+	buf = (u8 *)op->optinsn.insn;
+
+	/* Copy instructions into the out-of-line buffer */
+	ret = copy_optimized_instructions(buf + TMPL_END_IDX, op->kp.addr);
+	if (ret < 0) {
+		__arch_remove_optimized_kprobe(op, 0);
+		return ret;
+	}
+	op->optinsn.size = ret;
+
+	/* Backup instructions which will be replaced by jump address */
+	memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_SIZE,
+	       RELATIVE_ADDR_SIZE);
+
+	/* Copy arch-dep-instance from template */
+	memcpy(buf, &optprobe_template_entry, TMPL_END_IDX);
+
+	/* Set probe information */
+	synthesize_set_arg1(buf + TMPL_MOVE_IDX, (unsigned long)op);
+
+	/* Set probe function call */
+	synthesize_relcall(buf + TMPL_CALL_IDX, optimized_callback);
+
+	/* Set returning jmp instruction at the tail of out-of-line buffer */
+	synthesize_reljump(buf + TMPL_END_IDX + op->optinsn.size,
+			   (u8 *)op->kp.addr + op->optinsn.size);
+
+	flush_icache_range((unsigned long) buf,
+			   (unsigned long) buf + TMPL_END_IDX +
+			   op->optinsn.size + RELATIVEJUMP_SIZE);
+	return 0;
+}
+
+/*
+ * Cross-modifying kernel text with stop_machine().
+ * This code originally comes from immediate value.
+ * This does _not_ protect against NMI and MCE. However,
+ * since kprobes can't probe NMI/MCE handler, it is OK for kprobes.
+ */
+static atomic_t stop_machine_first;
+static int wrote_text;
+
+struct text_poke_param {
+	void *addr;
+	const void *opcode;
+	size_t len;
+};
+
+static int __kprobes stop_machine_multibyte_poke(void *data)
+{
+	struct text_poke_param *tpp = data;
+
+	if (atomic_dec_and_test(&stop_machine_first)) {
+		text_poke(tpp->addr, tpp->opcode, tpp->len);
+		smp_wmb();	/* Make sure other cpus see that this has run */
+		wrote_text = 1;
+	} else {
+		while (!wrote_text)
+			smp_rmb();
+		sync_core();
+	}
+
+	flush_icache_range((unsigned long)tpp->addr,
+			   (unsigned long)tpp->addr + tpp->len);
+	return 0;
+}
+
+static void *__kprobes __multibyte_poke(void *addr, const void *opcode,
+					size_t len)
+{
+	struct text_poke_param tpp;
+
+	tpp.addr = addr;
+	tpp.opcode = opcode;
+	tpp.len = len;
+	atomic_set(&stop_machine_first, 1);
+	wrote_text = 0;
+	stop_machine(stop_machine_multibyte_poke, (void *)&tpp, NULL);
+	return addr;
+}
+
+/* Replace a breakpoint (int3) with a relative jump.  */
+int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
+{
+	unsigned char jmp_code[RELATIVEJUMP_SIZE];
+	s32 rel = (s32)((long)op->optinsn.insn -
+			((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
+	/* Check if the address gap is in 2GB range. */
+	if ((long)op->kp.addr + RELATIVEJUMP_SIZE + rel !=
+	    (long)op->optinsn.insn)
+		return -EINVAL;
+
+	jmp_code[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&jmp_code[1]) = rel;
+
+	__multibyte_poke(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE);
+	return 0;
+}
+
+/* Replace a relative jump with a breakpoint (int3).  */
+void __kprobes arch_unoptimize_kprobe(struct optimized_kprobe *op)
+{
+	u8 buf[RELATIVEJUMP_SIZE];
+
+	/* Set int3 to first byte for kprobes */
+	buf[0] = BREAKPOINT_INSTRUCTION;
+	memcpy(buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+	__multibyte_poke(op->kp.addr, buf, RELATIVEJUMP_SIZE);
+}
+
+static int  __kprobes setup_detour_execution(struct kprobe *p,
+					     struct pt_regs *regs,
+					     int reenter)
+{
+	struct optimized_kprobe *op;
+
+	if (p->flags & KPROBE_FLAG_OPTIMIZED) {
+		/* This kprobe is really able to run optimized path. */
+		op = container_of(p, struct optimized_kprobe, kp);
+		/* Detour through copied instructions */
+		regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
+		if (!reenter)
+			reset_current_kprobe();
+		preempt_enable_no_resched();
+		return 1;
+	}
+	return 0;
+}
+#endif
+
 int __init arch_init_kprobes(void)
 {
 	return 0;


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* [PATCH -tip v5 08/10] kprobes: Add documents of jump optimization
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (6 preceding siblings ...)
  2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
@ 2009-11-23 23:22 ` Masami Hiramatsu
  2009-11-23 23:22 ` [PATCH -tip v5 09/10] [RFC] x86: Introduce generic jump patching without stop_machine Masami Hiramatsu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:22 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Add documentation about kprobe jump optimization to Documentation/kprobes.txt.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 Documentation/kprobes.txt |  192 ++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 179 insertions(+), 13 deletions(-)

diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 053037a..e4b0504 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -1,6 +1,7 @@
 Title	: Kernel Probes (Kprobes)
 Authors	: Jim Keniston <jkenisto@us.ibm.com>
 	: Prasanna S Panchamukhi <prasanna@in.ibm.com>
+	: Masami Hiramatsu <mhiramat@redhat.com>
 
 CONTENTS
 
@@ -14,6 +15,7 @@ CONTENTS
 8. Kprobes Example
 9. Jprobes Example
 10. Kretprobes Example
+11. Optimization Example
 Appendix A: The kprobes debugfs interface
 
 1. Concepts: Kprobes, Jprobes, Return Probes
@@ -42,13 +44,13 @@ registration/unregistration of a group of *probes. These functions
 can speed up unregistration process when you have to unregister
 a lot of probes at once.
 
-The next three subsections explain how the different types of
-probes work.  They explain certain things that you'll need to
-know in order to make the best use of Kprobes -- e.g., the
-difference between a pre_handler and a post_handler, and how
-to use the maxactive and nmissed fields of a kretprobe.  But
-if you're in a hurry to start using Kprobes, you can skip ahead
-to section 2.
+The next four subsections explain how the different types of
+probes work and how the optimization works.  They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe.  But if you're in a hurry to start using Kprobes, you
+can skip ahead to section 2.
 
 1.1 How Does a Kprobe Work?
 
@@ -161,13 +163,110 @@ In case probed function is entered but there is no kretprobe_instance
 object available, then in addition to incrementing the nmissed count,
 the user entry_handler invocation is also skipped.
 
+1.4 How Does the Optimization Work?
+
+ If you configured your kernel with CONFIG_OPTPROBES=y (currently this
+option is supported on x86/x86-64, non-preemptive kernels) and the
+"debug.kprobes_optimization" sysctl is set to 1, kprobes tries to use a
+jump instruction instead of a breakpoint instruction automatically.
+
+1.4.1 Init a Kprobe
+
+ Before preparing for optimization, Kprobes inserts the original
+(user-defined) kprobe at the specified address. So, even if the kprobe
+cannot be optimized, it still works as a normal kprobe.
+
+1.4.2 Safety check
+
+ First, Kprobes gets the address of the probed function and checks that the
+optimized region, which will be replaced by a jump instruction, does NOT
+straddle the function boundary, because if the optimized region reaches the
+next function, its caller would get unexpected results.
+ Next, Kprobes decodes the whole body of the probed function and checks that
+there is NO indirect jump, NO instruction which can cause an exception (found
+by checking the exception tables; such an instruction jumps to fixup code,
+and the fixup code jumps back into the same function body) and NO near jump
+which jumps into the optimized region (except to its 1st byte), because a
+jump into the middle of another instruction causes unexpected results too.
+ Kprobes also measures the length of the instructions which will be replaced
+by a jump instruction: since a jump instruction is longer than 1 byte, it
+may replace multiple instructions, and Kprobes checks whether those
+instructions can be executed out-of-line.
+
+1.4.3 Preparing detour buffer
+
+ Then, Kprobes prepares a "detour" buffer, which contains exception-emulating
+code (push/pop registers, call the handler), the copied instructions (Kprobes
+copies the instructions which will be replaced by a jump into the detour
+buffer), and a jump which returns to the original execution path.
+
+1.4.4 Pre-optimization
+
+ After preparing the detour buffer, Kprobes checks that the probe is *NOT*
+in any of the cases below:
+ - The probe has either a break_handler or a post_handler.
+ - Other probes are probing the instructions which will be replaced by
+   a jump instruction.
+ - The probe is disabled.
+In the above cases, Kprobes simply doesn't start optimizing the probe.
+
+ If the kprobe can be optimized, Kprobes enqueues it on the optimizing
+list and kicks the kprobe-optimizer workqueue to optimize it. The optimizer
+delays its work so it can batch other probes awaiting optimization.
+ If the optimized kprobe is hit before optimization completes, its handler
+changes the IP (instruction pointer) to the copied code and exits. So the
+instructions which were copied to the detour buffer are executed there.
+
+1.4.5 Optimization
+
+ The kprobe-optimizer doesn't start replacing instructions right away; it
+waits for synchronize_sched() for safety, because some processors may have
+been interrupted on the instructions which will be replaced by a jump.
+Note that synchronize_sched() can only ensure that all interruptions which
+were in flight when synchronize_sched() was called have completed if
+CONFIG_PREEMPT=n. So, this version supports only kernels built with
+CONFIG_PREEMPT=n.(*)
+ After that, the kprobe-optimizer replaces the 4 bytes right after the int3
+breakpoint with the relative-jump destination, and synchronizes caches on all
+processors. It then replaces the int3 with the relative-jump opcode, and
+synchronizes caches again.
+
+ After optimizing the probe, a CPU hits the jump instruction and jumps to
+the out-of-line buffer directly. Thus the breakpoint exception is skipped.
+
+1.4.6 Unoptimization
+
+ When a kprobe is unregistered, disabled, or blocked by another kprobe,
+an optimized kprobe is unoptimized. If the kprobe-optimizer has not yet run,
+the kprobe is just dequeued from the optimizing list. If the optimization
+has been done, the jump is replaced with the int3 breakpoint and the
+original code: first int3 is put at the first byte of the jump, caches are
+synchronized on all processors, then the 4 bytes right after the int3 are
+restored to the original code and caches are synchronized again.
+
+(*)This optimization-safety check may be replaced with the stop-machine
+ method that ksplice uses, to support CONFIG_PREEMPT=y kernels.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel execution path
+by changing regs->ip and returning 1. However, after the probe is optimized,
+that modification is ignored. Thus, if you'd like to tweak the kernel
+execution path, you need to avoid optimization. In that case, you can
+do one of the following:
+ - Set an empty function as the post_handler or break_handler.
+ or
+ - Build with CONFIG_OPTPROBES=n.
+ or
+ - Execute 'sysctl -w debug.kprobes_optimization=0'
+
 2. Architectures Supported
 
 Kprobes, jprobes, and return probes are implemented on the following
 architectures:
 
-- i386
-- x86_64 (AMD-64, EM64T)
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
 - ppc64
 - ia64 (Does not support probes on instruction slot1.)
 - sparc64 (Return probes not yet implemented.)
@@ -193,6 +292,10 @@ it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
 so you can use "objdump -d -l vmlinux" to see the source-to-object
 code mapping.
 
+If you want to reduce probing overhead, set "Kprobes jump optimization
+support" (CONFIG_OPTPROBES) to "y". You can find this option under
+the "Kprobes" line.
+
 4. API Reference
 
 The Kprobes API includes a "register" function and an "unregister"
@@ -387,9 +490,12 @@ the probe which has been registered.
 
 5. Kprobes Features and Limitations
 
-Kprobes allows multiple probes at the same address.  Currently,
-however, there cannot be multiple jprobes on the same function at
-the same time.
+Kprobes allows multiple probes at the same address, even if it is optimized.
+Currently, however, there cannot be multiple jprobes on the same function
+at the same time. Also, optimized kprobes cannot invoke the
+post_handler or the break_handler. So if you attempt to install a probe
+which has a post_handler or a break_handler at the same address as an
+optimized kprobe, that kprobe will be unoptimized automatically.
 
 In general, you can install a probe anywhere in the kernel.
 In particular, you can probe interrupt handlers.  Known exceptions
@@ -453,6 +559,37 @@ reason, Kprobes doesn't support return probes (or kprobes or jprobes)
 on the x86_64 version of __switch_to(); the registration functions
 return -EINVAL.
 
+On x86/x86-64, since the Jump Optimization of Kprobes modifies a wide range
+of instructions, there are some limitations on optimization. To explain
+them, we introduce some terminology. Imagine a binary sequence that consists
+of a 2-byte instruction, a 2-byte instruction, and a 3-byte instruction.
+
+        IA
+         |
+[-2][-1][0][1][2][3][4][5][6][7]
+        [ins1][ins2][  ins3 ]
+	[<-     DCR       ->]
+	   [<- JTPR ->]
+
+ins1: 1st Instruction
+ins2: 2nd Instruction
+ins3: 3rd Instruction
+IA:  Insertion Address
+JTPR: Jump Target Prohibition Region
+DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the djprobe instance, because the bytes in JTPR are replaced by
+a jump instruction. So, there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border between functions.
+
+These limitations are checked by the in-kernel instruction decoder,
+so you don't need to worry about them.
+
 6. Probe Overhead
 
 On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
@@ -476,6 +613,19 @@ k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07
 ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
 k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99
 
+6.1 Optimized Probe Overhead
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures.
+k = unoptimized kprobe, b = boosted(single-step skipped), o = optimized kprobe,
+r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+k = 0.68 usec; b = 0.27; o = 0.06; r = 0.95; rb = 0.53; ro = 0.30
+
+x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+k = 0.91 usec; b = 0.40; o = 0.06; r = 1.21; rb = 0.71; ro = 0.35
+
 7. TODO
 
 a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
@@ -523,7 +673,8 @@ is also specified. Following columns show probe status. If the probe is on
 a virtual address that is no longer valid (module init sections, module
 virtual addresses that correspond to modules that've been unloaded),
 such probes are marked with [GONE]. If the probe is temporarily disabled,
-such probes are marked with [DISABLED].
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED].
 
 /sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
 
@@ -533,3 +684,18 @@ registered probes will be disarmed, till such time a "1" is echoed to this
 file. Note that this knob just disarms and arms all kprobes and doesn't
 change each probe's disabling state. This means that disabled kprobes (marked
 [DISABLED]) will be not enabled if you turn ON all kprobes by this knob.
+
+
+Appendix B: The kprobes sysctl interface
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and provides a knob
+to globally and forcibly turn jump optimization ON or OFF. By default,
+jump optimization is allowed (ON). If you echo "0" to this file or set
+"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
+unoptimized, and new probes registered after that will not be optimized.
+Note that this knob *changes* the optimized state. This means that optimized
+probes (marked [OPTIMIZED]) will be unoptimized (the [OPTIMIZED] tag will be
+removed), and they will be optimized again after the knob is turned back on.
+


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH -tip v5 09/10] [RFC] x86: Introduce generic jump patching without stop_machine
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (7 preceding siblings ...)
  2009-11-23 23:22 ` [PATCH -tip v5 08/10] kprobes: Add documents of jump optimization Masami Hiramatsu
@ 2009-11-23 23:22 ` Masami Hiramatsu
  2009-11-23 23:22 ` [PATCH -tip v5 10/10] [RFC] kprobes/x86: Use text_poke_fixup() for jump optimization Masami Hiramatsu
  2009-11-24  2:03 ` [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Frederic Weisbecker
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:22 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Add text_poke_fixup(), which takes a fixup address to which a processor
jumps if it hits the address being modified while the code is being
modified. text_poke_fixup() performs the following steps for this purpose.

 1. Setup int3 handler for fixup.
 2. Put a breakpoint (int3) on the first byte of the region being modified,
    and synchronize code on all CPUs.
 3. Modify the other bytes of the region, and synchronize code on all CPUs.
 4. Modify the first byte of the region, and synchronize code on all CPUs.
 5. Clear the int3 handler.

Thus, if another processor executes the address being modified during
steps 2 to 4, it jumps to the fixup code.

This still has many limitations when modifying multiple instructions at
once. However, it is sufficient for patching 'a 5-byte nop replaced with a
jump', because:
 - The replaced instruction is just one instruction, which is executed
   atomically.
 - The replacing instruction is a jump, so we can set the fixup address to
   where the jump goes.

Changes in v5
 - Add some comments.
 - Use smp_wmb()/smp_rmb()
 - Remove unneeded sync_core_all()

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/include/asm/alternative.h |   11 ++++
 arch/x86/kernel/alternative.c      |  102 ++++++++++++++++++++++++++++++++++++
 kernel/kprobes.c                   |    2 -
 3 files changed, 114 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index c240efc..b48ca4d 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -160,4 +160,15 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 
+/*
+ * Set up an int3 trap and fixup execution for cross-modifying code in
+ * the SMP case. If another CPU executes the instruction being modified,
+ * it will hit the int3 and go to the fixup code. This provides only a
+ * minimal safety check; additional checks/restrictions are required for
+ * completely safe cross-modification.
+ */
+extern void *text_poke_fixup(void *addr, const void *opcode, size_t len,
+			     void *fixup);
+extern void sync_core_all(void);
+
 #endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index de7353c..04576e4 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -4,6 +4,7 @@
 #include <linux/list.h>
 #include <linux/stringify.h>
 #include <linux/kprobes.h>
+#include <linux/kdebug.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/memory.h>
@@ -552,3 +553,104 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 	local_irq_restore(flags);
 	return addr;
 }
+
+/*
+ * On the Pentium series, unsynchronized cross-modifying code
+ * operations can cause unexpected instruction execution results,
+ * so after the code is modified, we should synchronize it on each processor.
+ */
+static void __kprobes __local_sync_core(void *info)
+{
+	sync_core();
+}
+
+void __kprobes sync_core_all(void)
+{
+	on_each_cpu(__local_sync_core, NULL, 1);
+}
+
+/* Safe cross-modifying code with a fixup address */
+static void *patch_fixup_from;
+static void *patch_fixup_addr;
+
+static int __kprobes patch_exceptions_notify(struct notifier_block *self,
+					      unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	smp_rmb();
+
+	if (likely(!patch_fixup_from))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
+	    (unsigned long)patch_fixup_from != regs->ip)
+		return NOTIFY_DONE;
+
+	args->regs->ip = (unsigned long)patch_fixup_addr;
+	return NOTIFY_STOP;
+}
+
+/**
+ * text_poke_fixup() -- cross-modifying kernel text with fixup address.
+ * @addr:	Modifying address.
+ * @opcode:	New instruction.
+ * @len:	length of modifying bytes.
+ * @fixup:	Fixup address.
+ *
+ * Note: You must back up the replaced instructions before calling this
+ * if you need to recover them.
+ * Note: Must be called under text_mutex.
+ */
+void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
+				void *fixup)
+{
+	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
+	static const int int3_size = sizeof(int3_insn);
+
+	/* Replacing 1 byte can be done atomically. */
+	if (unlikely(len <= 1))
+		return text_poke(addr, opcode, len);
+
+	/* Preparing fixup address */
+	patch_fixup_addr = fixup;
+	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
+	smp_wmb();
+
+	/* Cap by an int3 - expecting synchronously */
+	text_poke(addr, &int3_insn, int3_size);
+
+	/* Replace tail bytes */
+	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
+		  len - int3_size);
+	/* Synchronous code cache */
+	sync_core_all();
+
+	/* Replace int3 with head byte - expecting synchronously */
+	text_poke(addr, opcode, int3_size);
+
+	/*
+	 * Sync core again - this is for waiting for disabled IRQ code
+	 * quiescent state, IOW, waiting for all running int3 fixup
+	 * handlers.
+	 */
+	sync_core_all();
+
+	/* Cleanup fixup address */
+	patch_fixup_from = NULL;
+	smp_wmb();
+	return addr;
+}
+
+static struct notifier_block patch_exceptions_nb = {
+	.notifier_call = patch_exceptions_notify,
+	.priority = 0x7fffffff /* we need to be notified first */
+};
+
+static int __init patch_init(void)
+{
+	return register_die_notifier(&patch_exceptions_nb);
+}
+
+arch_initcall(patch_init);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 1e862ed..22c2ae5 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1298,7 +1298,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
 
 static struct notifier_block kprobe_exceptions_nb = {
 	.notifier_call = kprobe_exceptions_notify,
-	.priority = 0x7fffffff /* we need to be notified first */
+	.priority = 0x7ffffff0 /* High priority, but not first.  */
 };
 
 unsigned long __weak arch_deref_entry_point(void *entry)


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* [PATCH -tip v5 10/10] [RFC] kprobes/x86: Use text_poke_fixup() for jump optimization
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (8 preceding siblings ...)
  2009-11-23 23:22 ` [PATCH -tip v5 09/10] [RFC] x86: Introduce generic jump patching without stop_machine Masami Hiramatsu
@ 2009-11-23 23:22 ` Masami Hiramatsu
  2009-11-24  2:03 ` [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Frederic Weisbecker
  10 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-23 23:22 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Frederic Weisbecker, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Use text_poke_fixup() for jump optimization instead of text_poke() with
stop_machine().

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 arch/x86/kernel/kprobes.c |   54 +++------------------------------------------
 1 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 6d81c11..3c5e30f 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -49,7 +49,6 @@
 #include <linux/module.h>
 #include <linux/kdebug.h>
 #include <linux/kallsyms.h>
-#include <linux/stop_machine.h>
 
 #include <asm/cacheflush.h>
 #include <asm/desc.h>
@@ -1355,54 +1354,6 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
 	return 0;
 }
 
-/*
- * Cross-modifying kernel text with stop_machine().
- * This code originally comes from immediate value.
- * This does _not_ protect against NMI and MCE. However,
- * since kprobes can't probe NMI/MCE handler, it is OK for kprobes.
- */
-static atomic_t stop_machine_first;
-static int wrote_text;
-
-struct text_poke_param {
-	void *addr;
-	const void *opcode;
-	size_t len;
-};
-
-static int __kprobes stop_machine_multibyte_poke(void *data)
-{
-	struct text_poke_param *tpp = data;
-
-	if (atomic_dec_and_test(&stop_machine_first)) {
-		text_poke(tpp->addr, tpp->opcode, tpp->len);
-		smp_wmb();	/* Make sure other cpus see that this has run */
-		wrote_text = 1;
-	} else {
-		while (!wrote_text)
-			smp_rmb();
-		sync_core();
-	}
-
-	flush_icache_range((unsigned long)tpp->addr,
-			   (unsigned long)tpp->addr + tpp->len);
-	return 0;
-}
-
-static void *__kprobes __multibyte_poke(void *addr, const void *opcode,
-					size_t len)
-{
-	struct text_poke_param tpp;
-
-	tpp.addr = addr;
-	tpp.opcode = opcode;
-	tpp.len = len;
-	atomic_set(&stop_machine_first, 1);
-	wrote_text = 0;
-	stop_machine(stop_machine_multibyte_poke, (void *)&tpp, NULL);
-	return addr;
-}
-
 /* Replace a breakpoint (int3) with a relative jump.  */
 int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
 {
@@ -1418,7 +1369,8 @@ int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
 	jmp_code[0] = RELATIVEJUMP_OPCODE;
 	*(s32 *)(&jmp_code[1]) = rel;
 
-	__multibyte_poke(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE);
+	text_poke_fixup(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE,
+			op->optinsn.insn);
 	return 0;
 }
 
@@ -1430,7 +1382,7 @@ void __kprobes arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	/* Set int3 to first byte for kprobes */
 	buf[0] = BREAKPOINT_INSTRUCTION;
 	memcpy(buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
-	__multibyte_poke(op->kp.addr, buf, RELATIVEJUMP_SIZE);
+	text_poke_fixup(op->kp.addr, buf, RELATIVEJUMP_SIZE, op->optinsn.insn);
 }
 
 static int  __kprobes setup_detour_execution(struct kprobe *p,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


* Re: [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support
  2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
                   ` (9 preceding siblings ...)
  2009-11-23 23:22 ` [PATCH -tip v5 10/10] [RFC] kprobes/x86: Use text_poke_fixup() for jump optimization Masami Hiramatsu
@ 2009-11-24  2:03 ` Frederic Weisbecker
  2009-11-24  3:20   ` Frederic Weisbecker
  10 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  2:03 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, H. Peter Anvin,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Anders Kaseorg, Tim Abbott, Andi Kleen,
	Jason Baron, Mathieu Desnoyers, systemtap, DLE

On Mon, Nov 23, 2009 at 06:21:16PM -0500, Masami Hiramatsu wrote:
> Hi,
> 
> Here is the patchset of the kprobes jump optimization v5
> (a.k.a. Djprobe). Since it is not yet ensured that the int3-bypassing
> cross-modifying code is safe on all processors, I introduced a
> stop_machine() version of XMC. Using stop_machine() prevents us from
> probing NMI code, but kprobes itself can't probe that code anyway,
> so it's not a problem. This version also includes
> get/put_online_cpus() around optimization to avoid deadlock
> on text_mutex.
> 
> These patches can be applied on the latest -tip.
> 
> Changes in v5:
> - Use stop_machine() to replace a breakpoint with a jump.
> - get/put_online_cpus() around optimization.
> - Make generic jump patching interface RFC.
> 
> And the kprobe stress test didn't find any regressions from kprobes,
> under kvm/x86.
> 
> Jump Optimized Kprobes
> ======================
> o Concept
>  Kprobes uses the int3 breakpoint instruction on x86 for instrumenting
> probes into running kernel. Jump optimization allows kprobes to replace
> breakpoint with a jump instruction for reducing probing overhead drastically.
> 
> o Performance
>  An optimized kprobe is 5 times faster than a kprobe.
> 
>  Optimizing probes improves their performance. Usually, a kprobe hit takes
> 0.5 to 1.0 microseconds to process. On the other hand, a jump-optimized
> probe hit takes less than 0.1 microseconds (the actual number depends on the
> processor). Here are some sample overheads.
> 
> Intel(R) Xeon(R) CPU E5410  @ 2.33GHz (without debugging options)
> 
>                      x86-32  x86-64
> kprobe:              0.68us  0.91us
> kprobe+booster:	     0.27us  0.40us
> kprobe+optimized:    0.06us  0.06us
> 
> kretprobe :          0.95us  1.21us
> kretprobe+booster:   0.53us  0.71us
> kretprobe+optimized: 0.30us  0.35us
> 
> (booster skips single-stepping)
> 
>  Note that jump optimization also consumes more memory, but not much.
> It just uses ~200 bytes per probe, so even if you use ~10,000 probes, it
> only consumes a few MB.


Nice results.

But I have trouble figuring out the difference between the booster version
and the optimized version.


> o Optimization
>   Before preparing the optimization, Kprobes inserts the original
>  (user-defined) kprobe at the specified address. So, even if the kprobe
>  cannot be optimized, it simply works as a normal kprobe.
> 
>  - Safety check
>   First, Kprobes gets the address of the probed function and checks that the
>  optimized region, which will be replaced by a jump instruction, does NOT
>  straddle the function boundary, because if the optimized region reached into
>  the next function, its caller would get unexpected results.
>   Next, Kprobes decodes the whole body of the probed function and checks that
>  there is NO indirect jump, NO instruction which can cause an exception (found
>  by checking exception_tables; such an instruction jumps to fixup code, and
>  the fixup code jumps back into the same function body), and NO near jump
>  which jumps into the optimized region (except to its first byte), because a
>  jump landing in the middle of another instruction causes unexpected results
>  too.
>   Kprobes also measures the length of the instructions which will be replaced
>  by a jump instruction; because a jump instruction is longer than 1 byte, it
>  may replace multiple instructions. It also checks whether those instructions
>  can be executed out-of-line.
> 
>  - Preparing detour code
>   Then, Kprobes prepares a "detour" buffer, which contains exception-emulating
>  code (push/pop registers, call handler), the copied instructions (Kprobes
>  copies the instructions which will be replaced by a jump into the detour
>  buffer), and a jump back to the original execution path.
> 
>  - Pre-optimization
>   After preparing the detour code, Kprobes enqueues the kprobe on the
>  optimizing list and kicks the kprobe-optimizer workqueue to optimize it.
>  The kprobe-optimizer delays its work to batch up other probes being optimized.


Hmm, so it waits for, actually, non-optimized probes to finish, right?
The site for which you have built up a detour buffer has an int3 in place
that could have kprobes in processing and your are waiting for them
to complete before patching with the jump?


>   When the optimized kprobe is hit before optimization, its handler
>  changes the IP (instruction pointer) to the copied code and exits. So, the
>  instructions which were copied to the detour buffer are executed from the
>  detour buffer.



Hm, why is it playing such hybrid game there?
If I understand correctly, we have executed the int 3, executed the
handler and we jump back to the detour buffer?



>  - Optimization
>   The kprobe-optimizer doesn't start replacing instructions right away; it
>  waits for synchronize_sched() for safety, because some processors may have
>  been interrupted on the instructions which will be replaced by a jump.
>  As you know, synchronize_sched() can ensure that all interrupts which were
>  executing when synchronize_sched() was called have finished, but only if
>  CONFIG_PREEMPT=n. So, this version supports only kernels with
>  CONFIG_PREEMPT=n. (*)
>   After that, the kprobe-optimizer replaces the 4 bytes right after the int3
>  breakpoint with the relative-jump destination and synchronizes caches on all
>  processors. Next, it replaces the int3 with the relative-jump opcode and
>  synchronizes caches again.


You said you now use stop_machine() to patch the jumps, which looks the only
safe way to do that. May be the above explanation is out of date?


>  - Unoptimization
>   When a kprobe is unregistered or disabled, or is blocked by another kprobe,
>  an optimized kprobe will be unoptimized. If the kprobe-optimizer has not run
>  yet, the kprobe is just dequeued from the optimizing list. If the
>  optimization has already been done, the jump is replaced with the int3
>  breakpoint and the original code: first, int3 is put at the first byte of
>  the jump, caches are synchronized on all processors, and then the 4 bytes
>  right after the int3 are replaced with the original code.
> 
> (*) This optimization-safety check may be replaced with the stop-machine
>  method that ksplice uses, to support CONFIG_PREEMPT=y kernels.


And now that you use get_cpu()/put_cpu(), I guess this config
option is not required anymore.

I don't understand why the int 3 is still required in the sequence.

- Registration: You first patch the site with int 3, then try the jump
  and use the int 3 as a gate to protect your patching.

- Unregistration: Same in reverse


You are doing live patching while the code might be running concurrently,
which requires very tricky surgery, based on an int 3 gate and RCU as you
describe above.
But do we need to play such a dangerous (and complicated) game?
I mean, it's like training to be a tightrope walker while we have a
bridge just beside :)
Why not run stop_machine(), first trying the jump directly, patching it
if it's considered safe, and otherwise patching with int 3?

But you said you are using stop_machine() in the v5 changelog,
I should probably first look at the patches :)

Thanks.



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-23 23:21 ` [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization Masami Hiramatsu
@ 2009-11-24  2:44   ` Frederic Weisbecker
  2009-11-24  3:31     ` Frederic Weisbecker
  2009-11-24 15:34     ` Masami Hiramatsu
  0 siblings, 2 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  2:44 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Mon, Nov 23, 2009 at 06:21:41PM -0500, Masami Hiramatsu wrote:
> +config OPTPROBES
> +	bool "Kprobes jump optimization support (EXPERIMENTAL)"
> +	default y
> +	depends on KPROBES
> +	depends on !PREEMPT


Why does it depend on !PREEMPT?



> +	depends on HAVE_OPTPROBES
> +	select KALLSYMS_ALL
> +	help
> +	  This option will allow kprobes to optimize breakpoint to
> +	  a jump for reducing its overhead.
> +
>  config HAVE_EFFICIENT_UNALIGNED_ACCESS
>  	bool
>  	help
> @@ -99,6 +110,8 @@ config HAVE_KPROBES
>  config HAVE_KRETPROBES
>  	bool
>  
> +config HAVE_OPTPROBES
> +	bool
>  #
>  # An arch should select this if it provides all these things:
>  #
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 1b672f7..aed1f95 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -122,6 +122,11 @@ struct kprobe {
>  /* Kprobe status flags */
>  #define KPROBE_FLAG_GONE	1 /* breakpoint has already gone */
>  #define KPROBE_FLAG_DISABLED	2 /* probe is temporarily disabled */
> +#define KPROBE_FLAG_OPTIMIZED	4 /*
> +				   * probe is really optimized.
> +				   * NOTE:
> +				   * this flag is only for optimized_kprobe.
> +				   */
>  
>  /* Has this kprobe gone ? */
>  static inline int kprobe_gone(struct kprobe *p)
> @@ -134,6 +139,12 @@ static inline int kprobe_disabled(struct kprobe *p)
>  {
>  	return p->flags & (KPROBE_FLAG_DISABLED | KPROBE_FLAG_GONE);
>  }
> +
> +/* Is this kprobe really running optimized path ? */
> +static inline int kprobe_optimized(struct kprobe *p)
> +{
> +	return p->flags & KPROBE_FLAG_OPTIMIZED;
> +}
>  /*
>   * Special probe type that uses setjmp-longjmp type tricks to resume
>   * execution at a specified entry with a matching prototype corresponding
> @@ -249,6 +260,31 @@ extern kprobe_opcode_t *get_insn_slot(void);
>  extern void free_insn_slot(kprobe_opcode_t *slot, int dirty);
>  extern void kprobes_inc_nmissed_count(struct kprobe *p);
>  
> +#ifdef CONFIG_OPTPROBES
> +/*
> + * Internal structure for direct jump optimized probe
> + */
> +struct optimized_kprobe {
> +	struct kprobe kp;
> +	struct list_head list;	/* list for optimizing queue */
> +	struct arch_optimized_insn optinsn;
> +};
> +
> +/* Architecture dependent functions for direct jump optimization */
> +extern int arch_prepared_optinsn(struct arch_optimized_insn *optinsn);
> +extern int arch_check_optimized_kprobe(struct optimized_kprobe *op);
> +extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op);
> +extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op);
> +extern int  arch_optimize_kprobe(struct optimized_kprobe *op);
> +extern void arch_unoptimize_kprobe(struct optimized_kprobe *op);
> +extern kprobe_opcode_t *get_optinsn_slot(void);
> +extern void free_optinsn_slot(kprobe_opcode_t *slot, int dirty);
> +extern int arch_within_optimized_kprobe(struct optimized_kprobe *op,
> +					unsigned long addr);
> +
> +extern void opt_pre_handler(struct kprobe *p, struct pt_regs *regs);
> +#endif /* CONFIG_OPTPROBES */
> +
>  /* Get the kprobe at this addr (if any) - called with preemption disabled */
>  struct kprobe *get_kprobe(void *addr);
>  void kretprobe_hash_lock(struct task_struct *tsk,
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 10d2ed5..15aa797 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -44,6 +44,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/kdebug.h>
>  #include <linux/memory.h>
> +#include <linux/cpu.h>
>  
>  #include <asm-generic/sections.h>
>  #include <asm/cacheflush.h>
> @@ -301,6 +302,31 @@ void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
>  	__free_insn_slot(&kprobe_insn_slots, slot, dirty);
>  	mutex_unlock(&kprobe_insn_mutex);
>  }
> +#ifdef CONFIG_OPTPROBES
> +/* For optimized_kprobe buffer */
> +static DEFINE_MUTEX(kprobe_optinsn_mutex); /* Protects kprobe_optinsn_slots */
> +static struct kprobe_insn_cache kprobe_optinsn_slots = {
> +	.pages = LIST_HEAD_INIT(kprobe_optinsn_slots.pages),
> +	/* .insn_size is initialized later */
> +	.nr_garbage = 0,
> +};
> +/* Get a slot for optimized_kprobe buffer */
> +kprobe_opcode_t __kprobes *get_optinsn_slot(void)
> +{
> +	kprobe_opcode_t *ret = NULL;
> +	mutex_lock(&kprobe_optinsn_mutex);
> +	ret = __get_insn_slot(&kprobe_optinsn_slots);
> +	mutex_unlock(&kprobe_optinsn_mutex);
> +	return ret;
> +}



Just a small nano-neat: could you add a line between variable
declarations and the rest? And also just before the return?
It makes the code a bit easier to review.



> +
> +void __kprobes free_optinsn_slot(kprobe_opcode_t * slot, int dirty)
> +{
> +	mutex_lock(&kprobe_optinsn_mutex);
> +	__free_insn_slot(&kprobe_optinsn_slots, slot, dirty);
> +	mutex_unlock(&kprobe_optinsn_mutex);
> +}
> +#endif
>  #endif
>  
>  /* We have preemption disabled.. so it is safe to use __ versions */
> @@ -334,20 +360,270 @@ struct kprobe __kprobes *get_kprobe(void *addr)
>  	return NULL;
>  }
>  
> +static int __kprobes aggr_pre_handler(struct kprobe *p, struct pt_regs *regs);
> +
> +/* Return true if the kprobe is an aggregator */
> +static inline int kprobe_aggrprobe(struct kprobe *p)
> +{
> +	return p->pre_handler == aggr_pre_handler;
> +}
> +
> +/*
> + * Keep all fields in the kprobe consistent
> + */
> +static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
> +{
> +	memcpy(&p->opcode, &old_p->opcode, sizeof(kprobe_opcode_t));
> +	memcpy(&p->ainsn, &old_p->ainsn, sizeof(struct arch_specific_insn));
> +}
> +
> +#ifdef CONFIG_OPTPROBES
> +/*
> + * Call all pre_handler on the list, but ignores its return value.
> + * This must be called from arch-dep optimized caller.
> + */
> +void __kprobes opt_pre_handler(struct kprobe *p, struct pt_regs *regs)
> +{
> +	struct kprobe *kp;
> +
> +	list_for_each_entry_rcu(kp, &p->list, list) {
> +		if (kp->pre_handler && likely(!kprobe_disabled(kp))) {
> +			set_kprobe_instance(kp);
> +			kp->pre_handler(kp, regs);
> +		}
> +		reset_kprobe_instance();
> +	}
> +}
> +
> +/* Return true(!0) if the kprobe is ready for optimization. */
> +static inline int kprobe_optready(struct kprobe *p)
> +{
> +	struct optimized_kprobe *op;
> +	if (kprobe_aggrprobe(p)) {
> +		op = container_of(p, struct optimized_kprobe, kp);
> +		return arch_prepared_optinsn(&op->optinsn);
> +	}
> +	return 0;
> +}
> +
> +/* Return an optimized kprobe which replaces instructions including addr. */
> +struct kprobe *__kprobes get_optimized_kprobe(unsigned long addr)
> +{
> +	int i;
> +	struct kprobe *p = NULL;
> +	struct optimized_kprobe *op;
> +	for (i = 0; !p && i < MAX_OPTIMIZED_LENGTH; i++)
> +		p = get_kprobe((void *)(addr - i));
> +
> +	if (p && kprobe_optready(p)) {
> +		op = container_of(p, struct optimized_kprobe, kp);
> +		if (arch_within_optimized_kprobe(op, addr))
> +			return p;
> +	}
> +	return NULL;
> +}
> +
> +/* Optimization staging list, protected by kprobe_mutex */
> +static LIST_HEAD(optimizing_list);
> +
> +static void kprobe_optimizer(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
> +#define OPTIMIZE_DELAY 5
> +
> +/* Kprobe jump optimizer */
> +static __kprobes void kprobe_optimizer(struct work_struct *work)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +
> +	/* Lock modules while optimizing kprobes */
> +	mutex_lock(&module_mutex);
> +	mutex_lock(&kprobe_mutex);
> +	if (kprobes_all_disarmed)
> +		goto end;
> +
> +	/* Wait quiesence period for ensuring all interrupts are done */
> +	synchronize_sched();



It's not clear to me why you are doing that.
Is this waiting for pending int 3 kprobe handlers
to complete? If so, why, and what does that prevent?

Also, why is it a delayed work? I'm not sure what we are
waiting for here.


> +
> +	get_online_cpus();	/* Use online_cpus while optimizing */



And this comment doesn't tell us much about what this buys us.
The changelog says it's there to avoid a text_mutex deadlock.
I'm not sure why we would deadlock without it.

Again, I think this dance of live patching protected only by int 3
(which is itself a necessary preliminary stage) is a complicated
sequence that could be simplified by doing a single patching pass
at stop_machine() time.



* Re: [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers
  2009-11-23 23:22 ` [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers Masami Hiramatsu
@ 2009-11-24  2:51   ` Frederic Weisbecker
  2009-11-24 15:39     ` Masami Hiramatsu
  2009-11-24 15:40     ` Frank Ch. Eigler
  0 siblings, 2 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  2:51 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Mon, Nov 23, 2009 at 06:22:04PM -0500, Masami Hiramatsu wrote:
> +#ifdef CONFIG_X86_64
> +#define SAVE_REGS_STRING		\
> +	/* Skip cs, ip, orig_ax. */	\
> +	"	subq $24, %rsp\n"	\
> +	"	pushq %rdi\n"		\
> +	"	pushq %rsi\n"		\
> +	"	pushq %rdx\n"		\
> +	"	pushq %rcx\n"		\
> +	"	pushq %rax\n"		\
> +	"	pushq %r8\n"		\
> +	"	pushq %r9\n"		\
> +	"	pushq %r10\n"		\
> +	"	pushq %r11\n"		\
> +	"	pushq %rbx\n"		\
> +	"	pushq %rbp\n"		\
> +	"	pushq %r12\n"		\
> +	"	pushq %r13\n"		\
> +	"	pushq %r14\n"		\
> +	"	pushq %r15\n"
> +#define RESTORE_REGS_STRING		\
> +	"	popq %r15\n"		\
> +	"	popq %r14\n"		\
> +	"	popq %r13\n"		\
> +	"	popq %r12\n"		\
> +	"	popq %rbp\n"		\
> +	"	popq %rbx\n"		\
> +	"	popq %r11\n"		\
> +	"	popq %r10\n"		\
> +	"	popq %r9\n"		\
> +	"	popq %r8\n"		\
> +	"	popq %rax\n"		\
> +	"	popq %rcx\n"		\
> +	"	popq %rdx\n"		\
> +	"	popq %rsi\n"		\
> +	"	popq %rdi\n"		\


BTW, do you really need to push/pop every register
before/after calling a probe handler?

Is it possible to only save/restore the scratch ones?



* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
@ 2009-11-24  3:14   ` Frederic Weisbecker
  2009-11-24 16:27   ` Jason Baron
  2009-11-24 16:35   ` H. Peter Anvin
  2 siblings, 0 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  3:14 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Mon, Nov 23, 2009 at 06:22:11PM -0500, Masami Hiramatsu wrote:
> Introduce x86 arch-specific optimization code, which supports both
> x86-32 and x86-64.
> 
> This code also supports safety checking, which decodes the whole of the
> function in which the probe is inserted, and checks the following
> conditions before optimization:
>  - The optimized instructions which will be replaced by a jump instruction
>    don't straddle the function boundary.
>  - There is no indirect jump instruction, because it could jump into
>    the address range which is replaced by the jump operand.
>  - There is no jump/loop instruction which jumps into the address range
>    which is replaced by the jump operand.
>  - Kprobes are not optimized if they are in functions into which fixup
>    code may jump.
> 
> This uses stop_machine() for cross-modifying code from int3 to jump.
> It doesn't allow us to modify code on the NMI/SMI path. However, since
> kprobes itself doesn't support NMI/SMI code probing, it's not a
> problem.
> 
> Changes in v5:
>  - Introduce stop_machine-based jump replacing.



I realize now that int 3 live patching doesn't need stop_machine().
But still, I don't understand the seemingly unnecessary int 3 step.

You first force int 3 patching, and later try to optimize
with a jump, using stop_machine().

But why is the int 3 a necessary first step? I guess it was
necessary back when you used it as a gate:

- patch with int 3, go to handler, go to old instruction
  that was patched, jump to original code that follows
  instruction that was patched
- set up detour buffer, execute handler (from int 3)
  then route to detour buffer, and original code that
  follows
- the code to be patched with the jump is now a
  dead code, jump to it

And now that you use stop_machine(), the complexity could be
reduced to:

- decide kprobe mode
- if int 3, then do like usual
- if jmp, then prepare detour buffer, and patch with the jump,
  without worrying about routing int 3 to the detour buffer
  to create a dead code area. It is now safe because of stop_machine()

Of course it's possible I completely misunderstood the whole
thing :)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support
  2009-11-24  2:03 ` [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Frederic Weisbecker
@ 2009-11-24  3:20   ` Frederic Weisbecker
  2009-11-24  7:52     ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  3:20 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, H. Peter Anvin,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, Anders Kaseorg, Tim Abbott, Andi Kleen,
	Jason Baron, Mathieu Desnoyers, systemtap, DLE

On Tue, Nov 24, 2009 at 03:03:19AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2009 at 06:21:16PM -0500, Masami Hiramatsu wrote:
> >   When the optimized-kprobe is hit before optimization, its handler
> >  changes IP(instruction pointer) to copied code and exits. So, the
> >  instructions which were copied to detour buffer are executed on the detour
> >  buffer.
> 
> 
> 
> Hm, why is it playing such a hybrid game there?
> If I understand correctly, we have executed int 3, executed the
> handler, and we jump back to the detour buffer?
> 


I got it, I think. We have instructions to patch. And the above
turns this area into dead code, safe to patch.

But still, stop_machine() seems to make it not necessary anymore.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24  2:44   ` Frederic Weisbecker
@ 2009-11-24  3:31     ` Frederic Weisbecker
  2009-11-24 15:34       ` Masami Hiramatsu
  2009-11-24 15:34     ` Masami Hiramatsu
  1 sibling, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24  3:31 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Tue, Nov 24, 2009 at 03:44:19AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2009 at 06:21:41PM -0500, Masami Hiramatsu wrote:
> > +static void kprobe_optimizer(struct work_struct *work);
> > +static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
> > +#define OPTIMIZE_DELAY 5
> > +
> > +/* Kprobe jump optimizer */
> > +static __kprobes void kprobe_optimizer(struct work_struct *work)
> > +{
> > +	struct optimized_kprobe *op, *tmp;
> > +
> > +	/* Lock modules while optimizing kprobes */
> > +	mutex_lock(&module_mutex);
> > +	mutex_lock(&kprobe_mutex);
> > +	if (kprobes_all_disarmed)
> > +		goto end;
> > +
> > +	/* Wait for a quiescence period to ensure all interrupts are done */
> > +	synchronize_sched();
> 
> 
> 
> It's not clear to me why you are doing that.
> Is this waiting for pending int 3 kprobes handlers
> to complete? If so, why, and what does that prevent?


I _might_ have understood.
You have set up the optimized flags, then you wait for
any old-style int 3 kprobes to complete and route
to detour buffer so that you can patch the jump
safely in the dead code? (and finish with first byte
by patching the int 3 itself)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support
  2009-11-24  3:20   ` Frederic Weisbecker
@ 2009-11-24  7:52     ` Ingo Molnar
  2009-11-24 16:06       ` Masami Hiramatsu
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2009-11-24  7:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Masami Hiramatsu, Ananth N Mavinakayanahalli, lkml,
	H. Peter Anvin, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers, systemtap, DLE


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Tue, Nov 24, 2009 at 03:03:19AM +0100, Frederic Weisbecker wrote:
> > On Mon, Nov 23, 2009 at 06:21:16PM -0500, Masami Hiramatsu wrote:
> > >   When the optimized-kprobe is hit before optimization, its handler
> > >  changes IP(instruction pointer) to copied code and exits. So, the
> > >  instructions which were copied to detour buffer are executed on the detour
> > >  buffer.
> > 
> > 
> > 
> > Hm, why is it playing such a hybrid game there?
> > If I understand correctly, we have executed int 3, executed the
> > handler, and we jump back to the detour buffer?
> > 
> 
> I got it, I think. We have instructions to patch. And the above turns 
> this area into dead code, safe to patch.
> 
> But still, stop_machine() seems to make it not necessary anymore.

i think 'sending an IPI to all online CPUs' might be an adequate 
sequence to make sure patching effects have propagated. I.e. an 
smp_call_function() with a dummy function?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24  2:44   ` Frederic Weisbecker
  2009-11-24  3:31     ` Frederic Weisbecker
@ 2009-11-24 15:34     ` Masami Hiramatsu
  2009-11-24 19:45       ` Frederic Weisbecker
  1 sibling, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 15:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Hi Frederic,

Frederic Weisbecker wrote:
> On Mon, Nov 23, 2009 at 06:21:41PM -0500, Masami Hiramatsu wrote:
>> +config OPTPROBES
>> +	bool "Kprobes jump optimization support (EXPERIMENTAL)"
>> +	default y
>> +	depends on KPROBES
>> +	depends on !PREEMPT
> 
> 
> Why does it depends on !PREEMPT?

Oh, because it doesn't support preemptive kernels yet.
(I'd like to tell you why in another mail.)

>> @@ -301,6 +302,31 @@ void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
>>  	__free_insn_slot(&kprobe_insn_slots, slot, dirty);
>>  	mutex_unlock(&kprobe_insn_mutex);
>>  }
>> +#ifdef CONFIG_OPTPROBES
>> +/* For optimized_kprobe buffer */
>> +static DEFINE_MUTEX(kprobe_optinsn_mutex); /* Protects kprobe_optinsn_slots */
>> +static struct kprobe_insn_cache kprobe_optinsn_slots = {
>> +	.pages = LIST_HEAD_INIT(kprobe_optinsn_slots.pages),
>> +	/* .insn_size is initialized later */
>> +	.nr_garbage = 0,
>> +};
>> +/* Get a slot for optimized_kprobe buffer */
>> +kprobe_opcode_t __kprobes *get_optinsn_slot(void)
>> +{
>> +	kprobe_opcode_t *ret = NULL;
>> +	mutex_lock(&kprobe_optinsn_mutex);
>> +	ret = __get_insn_slot(&kprobe_optinsn_slots);
>> +	mutex_unlock(&kprobe_optinsn_mutex);
>> +	return ret;
>> +}
> 
> 
> 
> Just a small nano-neat: could you add a line between variable
> declarations and the rest? And also just before the return?
> It makes the code a bit easier to review.

Sure :-)

>> +static void kprobe_optimizer(struct work_struct *work);
>> +static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
>> +#define OPTIMIZE_DELAY 5
>> +
>> +/* Kprobe jump optimizer */
>> +static __kprobes void kprobe_optimizer(struct work_struct *work)
>> +{
>> +	struct optimized_kprobe *op, *tmp;
>> +
>> +	/* Lock modules while optimizing kprobes */
>> +	mutex_lock(&module_mutex);
>> +	mutex_lock(&kprobe_mutex);
>> +	if (kprobes_all_disarmed)
>> +		goto end;
>> +
>> +	/* Wait for a quiescence period to ensure all interrupts are done */
>> +	synchronize_sched();
> 
> 
> 
> It's not clear to me why you are doing that.
> Is this waiting for pending int 3 kprobes handlers
> to complete? If so, why, and what does that prevent?
> 
> Also, why is it a delayed work? I'm not sure what we are
> waiting for here.
[...]
> Again, I think this dance, where live patching is protected
> only by int 3, which is itself a necessary prior
> stage, is a complicated sequence that could be
> simplified by doing just one patching at stop_machine()
> time.

There is a reason why we have to wait here, and it is also the
reason why preemption isn't supported yet. I'll tell you
in the next mail :-)

>> +
>> +	get_online_cpus();	/* Use online_cpus while optimizing */
> 
> 
> 
> And this comment doesn't tell us much what this brings us.
> The changelog tells it stands to avoid a text_mutex deadlock.
> I'm not sure why we would deadlock without it.

As Mathieu and I discussed on LKML (http://lkml.org/lkml/2009/11/21/187),
text_mutex can be locked on the cpu-hotplug path.
Since kprobes locks text_mutex too and stop_machine() refers to online_cpus,
this can cause a deadlock. So, I decided to use get_online_cpus() to
lock out hotplug while optimizing/unoptimizing.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24  3:31     ` Frederic Weisbecker
@ 2009-11-24 15:34       ` Masami Hiramatsu
  2009-11-24 20:14         ` Frederic Weisbecker
  0 siblings, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 15:34 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Frederic Weisbecker wrote:
> On Tue, Nov 24, 2009 at 03:44:19AM +0100, Frederic Weisbecker wrote:
>> On Mon, Nov 23, 2009 at 06:21:41PM -0500, Masami Hiramatsu wrote:
>>> +static void kprobe_optimizer(struct work_struct *work);
>>> +static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
>>> +#define OPTIMIZE_DELAY 5
>>> +
>>> +/* Kprobe jump optimizer */
>>> +static __kprobes void kprobe_optimizer(struct work_struct *work)
>>> +{
>>> +	struct optimized_kprobe *op, *tmp;
>>> +
>>> +	/* Lock modules while optimizing kprobes */
>>> +	mutex_lock(&module_mutex);
>>> +	mutex_lock(&kprobe_mutex);
>>> +	if (kprobes_all_disarmed)
>>> +		goto end;
>>> +
>>> +	/* Wait for a quiescence period to ensure all interrupts are done */
>>> +	synchronize_sched();
>>
>>
>>
>> It's not clear to me why you are doing that.
>> Is this waiting for pending int 3 kprobes handlers
>> to complete? If so, why, and what does that prevent?
> 
> 
> I _might_ have understood.
> You have set up the optimized flags, then you wait for
> any old-style int 3 kprobes to complete and route
> to detour buffer so that you can patch the jump
> safely in the dead code? (and finish with first byte
> by patching the int 3 itself)
> 

Yeah, you've got it almost right.
The reason why we have to wait for scheduling on all processors
is that this code may modify N instructions (not a single
instruction). This means there is a chance that the 2nd to Nth
instructions are being interrupted on other cpus when we start
modifying the code.

Please imagine that the 2nd instruction is interrupted and
stop_machine() replaces that instruction with part of the jump
*address* while the interrupt handler is running. When the interrupt
returns to the original address, there is no valid instruction
there, which causes unexpected results.

To avoid this situation, we have to wait for a scheduler quiescent
state on all cpus, because that also ensures that all currently
running interrupts are done.

This also explains why we don't need to wait when unoptimizing,
and why preemptive kernels are not supported yet.

In the unoptimizing case, since there is just a single instruction
(the jump), there is no Nth instruction which can be interrupted.
Thus we can just use stop_machine(). :-)

On a preemptive kernel, waiting for scheduling does not work as it
does on a non-preemptive kernel. Since processes can be preempted
during an interrupt, we can't ensure that the currently running
interrupt handler is done. (I assume that a pair of freeze_processes
and thaw_processes may possibly ensure that, or maybe we can
share some stack-rewinding code with ksplice.)
So it depends on !PREEMPT.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers
  2009-11-24  2:51   ` Frederic Weisbecker
@ 2009-11-24 15:39     ` Masami Hiramatsu
  2009-11-24 20:19       ` Frederic Weisbecker
  2009-11-24 15:40     ` Frank Ch. Eigler
  1 sibling, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 15:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Frederic Weisbecker wrote:
> On Mon, Nov 23, 2009 at 06:22:04PM -0500, Masami Hiramatsu wrote:
>> +#ifdef CONFIG_X86_64
>> +#define SAVE_REGS_STRING		\
>> +	/* Skip cs, ip, orig_ax. */	\
>> +	"	subq $24, %rsp\n"	\
>> +	"	pushq %rdi\n"		\
>> +	"	pushq %rsi\n"		\
>> +	"	pushq %rdx\n"		\
>> +	"	pushq %rcx\n"		\
>> +	"	pushq %rax\n"		\
>> +	"	pushq %r8\n"		\
>> +	"	pushq %r9\n"		\
>> +	"	pushq %r10\n"		\
>> +	"	pushq %r11\n"		\
>> +	"	pushq %rbx\n"		\
>> +	"	pushq %rbp\n"		\
>> +	"	pushq %r12\n"		\
>> +	"	pushq %r13\n"		\
>> +	"	pushq %r14\n"		\
>> +	"	pushq %r15\n"
>> +#define RESTORE_REGS_STRING		\
>> +	"	popq %r15\n"		\
>> +	"	popq %r14\n"		\
>> +	"	popq %r13\n"		\
>> +	"	popq %r12\n"		\
>> +	"	popq %rbp\n"		\
>> +	"	popq %rbx\n"		\
>> +	"	popq %r11\n"		\
>> +	"	popq %r10\n"		\
>> +	"	popq %r9\n"		\
>> +	"	popq %r8\n"		\
>> +	"	popq %rax\n"		\
>> +	"	popq %rcx\n"		\
>> +	"	popq %rdx\n"		\
>> +	"	popq %rsi\n"		\
>> +	"	popq %rdi\n"		\
> 
> 
> BTW, do you really need to push/pop every register
> before/after calling a probe handler?

Yes, in both cases (kretprobe/optprobe) it needs to
emulate kprobes behavior. kprobes can be used for
fault injection, so it should pop the pt_regs back.

> Is it possible to only save/restore the scratch ones?

Hmm, what code did you mean?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore  registers
  2009-11-24  2:51   ` Frederic Weisbecker
  2009-11-24 15:39     ` Masami Hiramatsu
@ 2009-11-24 15:40     ` Frank Ch. Eigler
  2009-11-24 20:20       ` Frederic Weisbecker
  1 sibling, 1 reply; 37+ messages in thread
From: Frank Ch. Eigler @ 2009-11-24 15:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Masami Hiramatsu, Ingo Molnar, Ananth N Mavinakayanahalli, lkml,
	systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

Frederic Weisbecker <fweisbec@gmail.com> writes:

> [...]
>> +#define SAVE_REGS_STRING		\
>> +#define RESTORE_REGS_STRING		\
>
> BTW, do you really need to push/pop every register
> before/after calling a probe handler?

It's part of the definition of a kprobe that a populated
pt_regs* value is passed.  Clients can rely on that in order
to access registers etc.

- FChE

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support
  2009-11-24  7:52     ` Ingo Molnar
@ 2009-11-24 16:06       ` Masami Hiramatsu
  0 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 16:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Ananth N Mavinakayanahalli, lkml,
	H. Peter Anvin, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers, systemtap, DLE

Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
>> On Tue, Nov 24, 2009 at 03:03:19AM +0100, Frederic Weisbecker wrote:
>>> On Mon, Nov 23, 2009 at 06:21:16PM -0500, Masami Hiramatsu wrote:
>>>>   When the optimized-kprobe is hit before optimization, its handler
>>>>  changes IP(instruction pointer) to copied code and exits. So, the
>>>>  instructions which were copied to detour buffer are executed on the detour
>>>>  buffer.
>>>
>>>
>>>
>>> Hm, why is it playing such a hybrid game there?
>>> If I understand correctly, we have executed int 3, executed the
>>> handler, and we jump back to the detour buffer?
>>>
>>
>> I got it, I think. We have instructions to patch. And the above turns 
>> this area into dead code, safe to patch.
>>
>> But still, stop_machine() seems to make it not necessary anymore.
> 
> i think 'sending an IPI to all online CPUs' might be an adequate 
> sequence to make sure patching effects have propagated. I.e. an 
> smp_call_function() with a dummy function?

Hmm, I assume that you mean waiting for all int3 handlers to finish.

We have to separate the two issues below:
 - int3-based multi-byte code replacement
 - multi-instruction replacement with int3 detour code

The former is implemented in patches 9/10 and 10/10. As you can see,
those patches are in RFC status, because I'd like to wait for an
official statement on their safety from the processor architects.
And a dummy IPI may work for the 2nd IPI, because it is just
for waiting for int3 interrupts. But again, it is only presumed that
replacing with/recovering from int3 is automatically synchronized...

However, at least the stop_machine() method is officially described
in "7.1.3 Handling Self- and Cross-Modifying Code" of Intel's
Software Developer's Manual 3A. So we can use it for now.

For the latter issue, as I explained in my previous reply, we need
to wait for all running interrupts, including hardware interrupts.
Thus I used synchronize_sched().

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
  2009-11-24  3:14   ` Frederic Weisbecker
@ 2009-11-24 16:27   ` Jason Baron
  2009-11-24 17:46     ` Masami Hiramatsu
  2009-11-24 16:35   ` H. Peter Anvin
  2 siblings, 1 reply; 37+ messages in thread
From: Jason Baron @ 2009-11-24 16:27 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Mathieu Desnoyers

On Mon, Nov 23, 2009 at 06:22:11PM -0500, Masami Hiramatsu wrote:
> Introduce x86 arch-specific optimization code, which supports both
> x86-32 and x86-64.
> 
> This code also supports safety checking, which decodes the whole of the
> function in which the probe is inserted, and checks the following
> conditions before optimization:
>  - The optimized instructions which will be replaced by a jump instruction
>    don't straddle the function boundary.
>  - There is no indirect jump instruction, because it could jump into
>    the address range which is replaced by the jump operand.
>  - There is no jump/loop instruction which jumps into the address range
>    which is replaced by the jump operand.
>  - Kprobes are not optimized if they are in functions into which fixup
>    code may jump.
> 
> This uses stop_machine() for cross-modifying code from int3 to jump.
> It doesn't allow us to modify code on the NMI/SMI path. However, since
> kprobes itself doesn't support NMI/SMI code probing, it's not a
> problem.
> 
> Changes in v5:
>  - Introduce stop_machine-based jump replacing.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Jim Keniston <jkenisto@us.ibm.com>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Anders Kaseorg <andersk@ksplice.com>
> Cc: Tim Abbott <tabbott@ksplice.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Jason Baron <jbaron@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> ---
> 
>  arch/x86/Kconfig               |    1 
>  arch/x86/include/asm/kprobes.h |   29 +++
>  arch/x86/kernel/kprobes.c      |  457 ++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 465 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 17abcfa..af0313e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -31,6 +31,7 @@ config X86
>  	select ARCH_WANT_FRAME_POINTERS
>  	select HAVE_DMA_ATTRS
>  	select HAVE_KRETPROBES
> +	select HAVE_OPTPROBES
>  	select HAVE_FTRACE_MCOUNT_RECORD
>  	select HAVE_DYNAMIC_FTRACE
>  	select HAVE_FUNCTION_TRACER
> diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
> index eaec8ea..4ffa345 100644
> --- a/arch/x86/include/asm/kprobes.h
> +++ b/arch/x86/include/asm/kprobes.h
> @@ -33,6 +33,9 @@ struct kprobe;
>  typedef u8 kprobe_opcode_t;
>  #define BREAKPOINT_INSTRUCTION	0xcc
>  #define RELATIVEJUMP_OPCODE 0xe9
> +#define RELATIVEJUMP_SIZE 5
> +#define RELATIVECALL_OPCODE 0xe8
> +#define RELATIVE_ADDR_SIZE 4
>  #define MAX_INSN_SIZE 16
>  #define MAX_STACK_SIZE 64
>  #define MIN_STACK_SIZE(ADDR)					       \
> @@ -44,6 +47,17 @@ typedef u8 kprobe_opcode_t;
>  
>  #define flush_insn_slot(p)	do { } while (0)
>  
> +/* optinsn template addresses */
> +extern kprobe_opcode_t optprobe_template_entry;
> +extern kprobe_opcode_t optprobe_template_val;
> +extern kprobe_opcode_t optprobe_template_call;
> +extern kprobe_opcode_t optprobe_template_end;
> +#define MAX_OPTIMIZED_LENGTH (MAX_INSN_SIZE + RELATIVE_ADDR_SIZE)
> +#define MAX_OPTINSN_SIZE 				\
> +	(((unsigned long)&optprobe_template_end -	\
> +	  (unsigned long)&optprobe_template_entry) +	\
> +	 MAX_OPTIMIZED_LENGTH + RELATIVEJUMP_SIZE)
> +
>  extern const int kretprobe_blacklist_size;
>  
>  void arch_remove_kprobe(struct kprobe *p);
> @@ -64,6 +78,21 @@ struct arch_specific_insn {
>  	int boostable;
>  };
>  
> +struct arch_optimized_insn {
> +	/* copy of the original instructions */
> +	kprobe_opcode_t copied_insn[RELATIVE_ADDR_SIZE];
> +	/* detour code buffer */
> +	kprobe_opcode_t *insn;
> +	/* the size of instructions copied to detour code buffer */
> +	size_t size;
> +};
> +
> +/* Return true (!0) if optinsn is prepared for optimization. */
> +static inline int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
> +{
> +	return optinsn->size;
> +}
> +
>  struct prev_kprobe {
>  	struct kprobe *kp;
>  	unsigned long status;
> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> index 73ac21e..6d81c11 100644
> --- a/arch/x86/kernel/kprobes.c
> +++ b/arch/x86/kernel/kprobes.c
> @@ -49,6 +49,7 @@
>  #include <linux/module.h>
>  #include <linux/kdebug.h>
>  #include <linux/kallsyms.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/desc.h>
> @@ -106,16 +107,21 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = {
>  };
>  const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);
>  
> -/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
> -static void __kprobes set_jmp_op(void *from, void *to)
> +static void __kprobes __synthesize_relative_insn(void *from, void *to, u8 op)
>  {
> -	struct __arch_jmp_op {
> -		char op;
> +	struct __arch_relative_insn {
> +		u8 op;
>  		s32 raddr;
> -	} __attribute__((packed)) * jop;
> -	jop = (struct __arch_jmp_op *)from;
> -	jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
> -	jop->op = RELATIVEJUMP_OPCODE;
> +	} __attribute__((packed)) *insn;
> +	insn = (struct __arch_relative_insn *)from;
> +	insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
> +	insn->op = op;
> +}
> +
> +/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
> +static void __kprobes synthesize_reljump(void *from, void *to)
> +{
> +	__synthesize_relative_insn(from, to, RELATIVEJUMP_OPCODE);
>  }
>  
>  /*
> @@ -202,7 +208,7 @@ static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
>  	/*
>  	 *  Basically, kp->ainsn.insn has an original instruction.
>  	 *  However, RIP-relative instruction can not do single-stepping
> -	 *  at different place, fix_riprel() tweaks the displacement of
> +	 *  at different place, __copy_instruction() tweaks the displacement of
>  	 *  that instruction. In that case, we can't recover the instruction
>  	 *  from the kp->ainsn.insn.
>  	 *
> @@ -284,21 +290,37 @@ static int __kprobes is_IF_modifier(kprobe_opcode_t *insn)
>  }
>  
>  /*
> - * Adjust the displacement if the instruction uses the %rip-relative
> - * addressing mode.
> + * Copy an instruction and adjust the displacement if the instruction
> + * uses the %rip-relative addressing mode.
>   * If it does, Return the address of the 32-bit displacement word.
>   * If not, return null.
>   * Only applicable to 64-bit x86.
>   */
> -static void __kprobes fix_riprel(struct kprobe *p)
> +static int __kprobes __copy_instruction(u8 *dest, u8 *src, int recover)
>  {
> -#ifdef CONFIG_X86_64
>  	struct insn insn;
> -	kernel_insn_init(&insn, p->ainsn.insn);
> +	int ret;
> +	kprobe_opcode_t buf[MAX_INSN_SIZE];
> +
> +	kernel_insn_init(&insn, src);
> +	if (recover) {
> +		insn_get_opcode(&insn);
> +		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
> +			ret = recover_probed_instruction(buf,
> +							 (unsigned long)src);
> +			if (ret)
> +				return 0;
> +			kernel_insn_init(&insn, buf);
> +		}
> +	}
> +	insn_get_length(&insn);
> +	memcpy(dest, insn.kaddr, insn.length);
>  
> +#ifdef CONFIG_X86_64
>  	if (insn_rip_relative(&insn)) {
>  		s64 newdisp;
>  		u8 *disp;
> +		kernel_insn_init(&insn, dest);
>  		insn_get_displacement(&insn);
>  		/*
>  		 * The copied instruction uses the %rip-relative addressing
> @@ -312,20 +334,23 @@ static void __kprobes fix_riprel(struct kprobe *p)
>  		 * extension of the original signed 32-bit displacement would
>  		 * have given.
>  		 */
> -		newdisp = (u8 *) p->addr + (s64) insn.displacement.value -
> -			  (u8 *) p->ainsn.insn;
> +		newdisp = (u8 *) src + (s64) insn.displacement.value -
> +			  (u8 *) dest;
>  		BUG_ON((s64) (s32) newdisp != newdisp); /* Sanity check.  */
> -		disp = (u8 *) p->ainsn.insn + insn_offset_displacement(&insn);
> +		disp = (u8 *) dest + insn_offset_displacement(&insn);
>  		*(s32 *) disp = (s32) newdisp;
>  	}
>  #endif
> +	return insn.length;
>  }
>  
>  static void __kprobes arch_copy_kprobe(struct kprobe *p)
>  {
> -	memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
> -
> -	fix_riprel(p);
> +	/*
> +	 * Copy an instruction without recovering int3, because it will be
> +	 * put by another subsystem.
> +	 */
> +	__copy_instruction(p->ainsn.insn, p->addr, 0);
>  
>  	if (can_boost(p->addr))
>  		p->ainsn.boostable = 0;
> @@ -414,9 +439,20 @@ void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
>  	*sara = (unsigned long) &kretprobe_trampoline;
>  }
>  
> +#ifdef CONFIG_OPTPROBES
> +static int  __kprobes setup_detour_execution(struct kprobe *p,
> +					     struct pt_regs *regs,
> +					     int reenter);
> +#else
> +#define setup_detour_execution(p, regs, reenter) (0)
> +#endif
> +
>  static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs,
>  				       struct kprobe_ctlblk *kcb, int reenter)
>  {
> +	if (setup_detour_execution(p, regs, reenter))
> +		return;
> +
>  #if !defined(CONFIG_PREEMPT) || defined(CONFIG_FREEZER)
>  	if (p->ainsn.boostable == 1 && !p->post_handler) {
>  		/* Boost up -- we can execute copied instructions directly */
> @@ -812,8 +848,8 @@ static void __kprobes resume_execution(struct kprobe *p,
>  			 * These instructions can be executed directly if it
>  			 * jumps back to correct address.
>  			 */
> -			set_jmp_op((void *)regs->ip,
> -				   (void *)orig_ip + (regs->ip - copy_ip));
> +			synthesize_reljump((void *)regs->ip,
> +				(void *)orig_ip + (regs->ip - copy_ip));
>  			p->ainsn.boostable = 1;
>  		} else {
>  			p->ainsn.boostable = -1;
> @@ -1040,6 +1076,383 @@ int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
>  	return 0;
>  }
>  
> +
> +#ifdef CONFIG_OPTPROBES
> +
> +/* Insert a call instruction at address 'from', which calls address 'to'.*/
> +static void __kprobes synthesize_relcall(void *from, void *to)
> +{
> +	__synthesize_relative_insn(from, to, RELATIVECALL_OPCODE);
> +}
> +
> +/* Insert a move instruction which sets a pointer to eax/rdi (1st arg). */
> +static void __kprobes synthesize_set_arg1(kprobe_opcode_t *addr,
> +					  unsigned long val)
> +{
> +#ifdef CONFIG_X86_64
> +	*addr++ = 0x48;
> +	*addr++ = 0xbf;
> +#else
> +	*addr++ = 0xb8;
> +#endif
> +	*(unsigned long *)addr = val;
> +}
> +
> +void __kprobes kprobes_optinsn_template_holder(void)
> +{
> +	asm volatile (
> +			".global optprobe_template_entry\n"
> +			"optprobe_template_entry: \n"
> +#ifdef CONFIG_X86_64
> +			/* We don't bother saving the ss register */
> +			"	pushq %rsp\n"
> +			"	pushfq\n"
> +			SAVE_REGS_STRING
> +			"	movq %rsp, %rsi\n"
> +			".global optprobe_template_val\n"
> +			"optprobe_template_val: \n"
> +			ASM_NOP5
> +			ASM_NOP5
> +			".global optprobe_template_call\n"
> +			"optprobe_template_call: \n"
> +			ASM_NOP5
> +			/* Move flags to rsp */
> +			"	movq 144(%rsp), %rdx\n"
> +			"	movq %rdx, 152(%rsp)\n"
> +			RESTORE_REGS_STRING
> +			/* Skip flags entry */
> +			"	addq $8, %rsp\n"
> +			"	popfq\n"
> +#else /* CONFIG_X86_32 */
> +			"	pushf\n"
> +			SAVE_REGS_STRING
> +			"	movl %esp, %edx\n"
> +			".global optprobe_template_val\n"
> +			"optprobe_template_val: \n"
> +			ASM_NOP5
> +			".global optprobe_template_call\n"
> +			"optprobe_template_call: \n"
> +			ASM_NOP5
> +			RESTORE_REGS_STRING
> +			"	addl $4, %esp\n"	/* skip cs */
> +			"	popf\n"
> +#endif
> +			".global optprobe_template_end\n"
> +			"optprobe_template_end: \n");
> +}
> +
> +#define TMPL_MOVE_IDX \
> +	((long)&optprobe_template_val - (long)&optprobe_template_entry)
> +#define TMPL_CALL_IDX \
> +	((long)&optprobe_template_call - (long)&optprobe_template_entry)
> +#define TMPL_END_IDX \
> +	((long)&optprobe_template_end - (long)&optprobe_template_entry)
> +
> +#define INT3_SIZE sizeof(kprobe_opcode_t)
> +
> +/* Optimized kprobe callback function: called from optinsn */
> +static void __kprobes optimized_callback(struct optimized_kprobe *op,
> +					 struct pt_regs *regs)
> +{
> +	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
> +
> +	preempt_disable();
> +	if (kprobe_running()) {
> +		kprobes_inc_nmissed_count(&op->kp);
> +	} else {
> +		/* Save skipped registers */
> +#ifdef CONFIG_X86_64
> +		regs->cs = __KERNEL_CS;
> +#else
> +		regs->cs = __KERNEL_CS | get_kernel_rpl();
> +		regs->gs = 0;
> +#endif
> +		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
> +		regs->orig_ax = ~0UL;
> +
> +		__get_cpu_var(current_kprobe) = &op->kp;
> +		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
> +		opt_pre_handler(&op->kp, regs);
> +		__get_cpu_var(current_kprobe) = NULL;
> +	}
> +	preempt_enable_no_resched();
> +}
> +
> +static int __kprobes copy_optimized_instructions(u8 *dest, u8 *src)
> +{
> +	int len = 0, ret;
> +	while (len < RELATIVEJUMP_SIZE) {
> +		ret = __copy_instruction(dest + len, src + len, 1);
> +		if (!ret || !can_boost(dest + len))
> +			return -EINVAL;
> +		len += ret;
> +	}
> +	return len;
> +}
> +
> +/* Check whether insn is indirect jump */
> +static int __kprobes insn_is_indirect_jump(struct insn *insn)
> +{
> +	return (insn->opcode.bytes[0] == 0xff ||
> +		insn->opcode.bytes[0] == 0xea);
> +}
> +
> +/* Check whether insn jumps into specified address range */
> +static int insn_jump_into_range(struct insn *insn, unsigned long start, int len)
> +{
> +	unsigned long target = 0;
> +	switch (insn->opcode.bytes[0]) {
> +	case 0xe0:	/* loopne */
> +	case 0xe1:	/* loope */
> +	case 0xe2:	/* loop */
> +	case 0xe3:	/* jcxz */
> +	case 0xe9:	/* near relative jump */
> +	case 0xeb:	/* short relative jump */
> +		break;
> +	case 0x0f:
> +		if ((insn->opcode.bytes[1] & 0xf0) == 0x80) /* jcc near */
> +			break;
> +		return 0;
> +	default:
> +		if ((insn->opcode.bytes[0] & 0xf0) == 0x70) /* jcc short */
> +			break;
> +		return 0;
> +	}
> +	target = (unsigned long)insn->next_byte + insn->immediate.value;
> +	return (start <= target && target <= start + len);
> +}
> +
> +/* Decode the whole function to ensure no instruction jumps into the target */
> +static int __kprobes can_optimize(unsigned long paddr)
> +{
> +	int ret;
> +	unsigned long addr, size = 0, offset = 0;
> +	struct insn insn;
> +	kprobe_opcode_t buf[MAX_INSN_SIZE];
> +	/* Dummy buffer for kallsyms_lookup */
> +	static char __dummy_buf[KSYM_NAME_LEN];
> +
> +	/* Lookup symbol including addr */
> +	if (!kallsyms_lookup(paddr, &size, &offset, NULL, __dummy_buf))
> +		return 0;
> +
> +	/* Check there is enough space for a relative jump. */
> +	if (size - offset < RELATIVEJUMP_SIZE)
> +		return 0;
> +
> +	/* Decode instructions */
> +	addr = paddr - offset;
> +	while (addr < paddr - offset + size) { /* Decode until function end */
> +		if (search_exception_tables(addr))
> +			/*
> +			 * Since some fixup code will jump into this function,
> +			 * we can't optimize kprobes in this function.
> +			 */
> +			return 0;
> +		kernel_insn_init(&insn, (void *)addr);
> +		insn_get_opcode(&insn);
> +		if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
> +			ret = recover_probed_instruction(buf, addr);
> +			if (ret)
> +				return 0;
> +			kernel_insn_init(&insn, buf);
> +		}
> +		insn_get_length(&insn);
> +		/* Recover address */
> +		insn.kaddr = (void *)addr;
> +		insn.next_byte = (void *)(addr + insn.length);
> +		/* Check that no instruction jumps into the target */
> +		if (insn_is_indirect_jump(&insn) ||
> +		    insn_jump_into_range(&insn, paddr + INT3_SIZE,
> +					 RELATIVE_ADDR_SIZE))
> +			return 0;
> +		addr += insn.length;
> +	}
> +
> +	return 1;
> +}
> +
> +/* Check optimized_kprobe can actually be optimized. */
> +int __kprobes arch_check_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	int i;
> +	for (i = 1; i < op->optinsn.size; i++)
> +		if (get_kprobe(op->kp.addr + i))
> +			return -EEXIST;
> +	return 0;
> +}
> +
> +/* Check the addr is within the optimized instructions. */
> +int __kprobes arch_within_optimized_kprobe(struct optimized_kprobe *op,
> +					   unsigned long addr)
> +{
> +	return ((unsigned long)op->kp.addr <= addr &&
> +		(unsigned long)op->kp.addr + op->optinsn.size > addr);
> +}
> +
> +/* Free optimized instruction slot */
> +static __kprobes
> +void __arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
> +{
> +	if (op->optinsn.insn) {
> +		free_optinsn_slot(op->optinsn.insn, dirty);
> +		op->optinsn.insn = NULL;
> +		op->optinsn.size = 0;
> +	}
> +}
> +
> +void __kprobes arch_remove_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	__arch_remove_optimized_kprobe(op, 1);
> +}
> +
> +/*
> + * Copy replacing target instructions
> + * Target instructions MUST be relocatable (checked inside)
> + */
> +int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	u8 *buf;
> +	int ret;
> +
> +	if (!can_optimize((unsigned long)op->kp.addr))
> +		return -EILSEQ;
> +
> +	op->optinsn.insn = get_optinsn_slot();
> +	if (!op->optinsn.insn)
> +		return -ENOMEM;
> +
> +	buf = (u8 *)op->optinsn.insn;
> +
> +	/* Copy instructions into the out-of-line buffer */
> +	ret = copy_optimized_instructions(buf + TMPL_END_IDX, op->kp.addr);
> +	if (ret < 0) {
> +		__arch_remove_optimized_kprobe(op, 0);
> +		return ret;
> +	}
> +	op->optinsn.size = ret;
> +
> +	/* Backup instructions which will be replaced by jump address */
> +	memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_SIZE,
> +	       RELATIVE_ADDR_SIZE);
> +
> +	/* Copy arch-dep-instance from template */
> +	memcpy(buf, &optprobe_template_entry, TMPL_END_IDX);
> +
> +	/* Set probe information */
> +	synthesize_set_arg1(buf + TMPL_MOVE_IDX, (unsigned long)op);
> +
> +	/* Set probe function call */
> +	synthesize_relcall(buf + TMPL_CALL_IDX, optimized_callback);
> +
> +	/* Set returning jmp instruction at the tail of out-of-line buffer */
> +	synthesize_reljump(buf + TMPL_END_IDX + op->optinsn.size,
> +			   (u8 *)op->kp.addr + op->optinsn.size);
> +
> +	flush_icache_range((unsigned long) buf,
> +			   (unsigned long) buf + TMPL_END_IDX +
> +			   op->optinsn.size + RELATIVEJUMP_SIZE);
> +	return 0;
> +}
> +
> +/*
> + * Cross-modifying kernel text with stop_machine().
> + * This code originally comes from immediate value.
> + * This does _not_ protect against NMI and MCE. However,
> + * since kprobes can't probe NMI/MCE handler, it is OK for kprobes.
> + */
> +static atomic_t stop_machine_first;
> +static int wrote_text;
> +
> +struct text_poke_param {
> +	void *addr;
> +	const void *opcode;
> +	size_t len;
> +};
> +
> +static int __kprobes stop_machine_multibyte_poke(void *data)
> +{
> +	struct text_poke_param *tpp = data;
> +
> +	if (atomic_dec_and_test(&stop_machine_first)) {
> +		text_poke(tpp->addr, tpp->opcode, tpp->len);
> +		smp_wmb();	/* Make sure other cpus see that this has run */
> +		wrote_text = 1;
> +	} else {
> +		while (!wrote_text)
> +			smp_rmb();
> +		sync_core();
> +	}
> +
> +	flush_icache_range((unsigned long)tpp->addr,
> +			   (unsigned long)tpp->addr + tpp->len);
> +	return 0;
> +}
> +
> +static void *__kprobes __multibyte_poke(void *addr, const void *opcode,
> +					size_t len)
> +{
> +	struct text_poke_param tpp;
> +
> +	tpp.addr = addr;
> +	tpp.opcode = opcode;
> +	tpp.len = len;
> +	atomic_set(&stop_machine_first, 1);
> +	wrote_text = 0;
> +	stop_machine(stop_machine_multibyte_poke, (void *)&tpp, NULL);
> +	return addr;
> +}

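One subtlety in the quoted patch is the insn_jump_into_range() check in can_optimize(): a relative branch targets the byte after the instruction plus its immediate, and optimization must be refused if that target lands in the bytes being replaced. The address arithmetic can be sketched in user space (the helper name here is illustrative, not kernel API):

```c
#include <stddef.h>
#include <stdint.h>

/* User-space model of the quoted insn_jump_into_range() logic:
 * a relative branch whose instruction ends at 'next_byte' with
 * immediate 'imm' targets next_byte + imm. */
static int jump_into_range(uintptr_t next_byte, int32_t imm,
			   uintptr_t start, size_t len)
{
	uintptr_t target = next_byte + imm;

	/* Matching the quoted code, the upper bound is inclusive. */
	return start <= target && target <= start + len;
}
```

Negative immediates (backward branches) fall out of the same computation, since the displacement is simply added to the address of the next instruction.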
As you know, I'd like to have the jump label optimization for
tracepoints make use of this '__multibyte_poke()' interface. So perhaps
it could be moved to arch/x86/kernel/alternative.c, which is where
'text_poke()' and friends currently live.

Also, with multiple users we don't want to trample over each other's
code patching. Thus, if each sub-system could register some type of
'is_reserved()' callback, we could then call all these callbacks from
the '__multibyte_poke()' routine before doing any patching, to make
sure we aren't trampling on each other's code. After a successful
patching, each sub-system can update its reserved set of code as
appropriate. I can code up a prototype here, if this makes sense.
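A minimal user-space sketch of that reservation idea (the names text_reserve and text_is_reserved are hypothetical, not existing kernel API): each sub-system records the address ranges it has patched, and the poking routine refuses to touch an address someone else reserved.

```c
#include <stddef.h>

/* Hypothetical reservation table; a real implementation would use
 * per-subsystem callbacks rather than one flat array. */
struct text_reservation {
	unsigned long addr;
	size_t len;
};

#define MAX_RESERVATIONS 16
static struct text_reservation reservations[MAX_RESERVATIONS];
static int nr_reservations;

/* Record a patched range so other sub-systems keep off it. */
static int text_reserve(unsigned long addr, size_t len)
{
	if (nr_reservations >= MAX_RESERVATIONS)
		return -1;
	reservations[nr_reservations].addr = addr;
	reservations[nr_reservations].len = len;
	nr_reservations++;
	return 0;
}

/* Return 1 if the half-open range [addr, addr+len) overlaps any
 * reserved range; a poker would bail out before patching. */
static int text_is_reserved(unsigned long addr, size_t len)
{
	int i;

	for (i = 0; i < nr_reservations; i++) {
		unsigned long s = reservations[i].addr;
		unsigned long e = s + reservations[i].len;

		if (addr < e && addr + len > s)
			return 1;
	}
	return 0;
}
```

The ranges are half-open, so a patch starting exactly where a reservation ends is not treated as a conflict.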

thanks,

-Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
  2009-11-24  3:14   ` Frederic Weisbecker
  2009-11-24 16:27   ` Jason Baron
@ 2009-11-24 16:35   ` H. Peter Anvin
  2009-11-24 17:00     ` Masami Hiramatsu
  2 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2009-11-24 16:35 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On 11/23/2009 03:22 PM, Masami Hiramatsu wrote:
> 
> This uses stop_machine() for cross-modifying code from int3 to jump.
> It doesn't allow us to modify code on NMI/SMI path. However, since
> kprobes itself doesn't support NMI/SMI code probing, it's not a
> problem.
> 

I'm a bit confused by the above statement... does that mean you're
poking int3 and *then* do stop_machine()?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-24 16:35   ` H. Peter Anvin
@ 2009-11-24 17:00     ` Masami Hiramatsu
  0 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 17:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Hi Peter,

H. Peter Anvin wrote:
> On 11/23/2009 03:22 PM, Masami Hiramatsu wrote:
>>
>> This uses stop_machine() for cross-modifying code from int3 to jump.
>> It doesn't allow us to modify code on NMI/SMI path. However, since
>> kprobes itself doesn't support NMI/SMI code probing, it's not a
>> problem.
>>
>
> I'm a bit confused by the above statement... does that mean you're
> poking int3 and *then* do stop_machine()?

Yes, as I said in http://lkml.org/lkml/2009/11/24/310,
there are two separate issues.

----
We have to separate the two issues below:
  - int3-based multi-bytes code replacement
  - multi-instruction replacement with int3-detour code

The former is implemented in patches 9/10 and 10/10. As you can see,
those patches are in RFC status, because I'd like to wait for an
official confirmation of safety from the processor architects.
Also, it may be possible to use a dummy IPI for the 2nd IPI, because
it is just for waiting out int3 interrupts. But again, it is only an
estimate that replacing with/recovering from int3 is automatically synchronized...

However, at least the stop_machine() method is officially described
in section 7.1.3, "Handling Self- and Cross-Modifying Code", of Intel's
Software Developer's Manual 3A, so we can use it for now.

For the latter issue, as I explained in my previous reply, we need
to wait for all running interrupts, including hardware interrupts,
to finish. Thus I used synchronize_sched().
----

So the previous "x86 generic jump patching" patch is basically
for single-instruction replacement. For multi-instruction
replacement, we need to build detour code and wait for all running
interrupts to finish. (Of course, there are other static code
limitations, as I described in the "Safety check" section of patch 0/10.)
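For reference, the single-instruction replacement discussed above ends by writing a 5-byte relative jump, as synthesize_reljump() in the patch does. A user-space sketch of that encoding (the helper names here are illustrative): the displacement is relative to the end of the 5-byte instruction, i.e. dest - (src + 5).

```c
#include <stdint.h>
#include <string.h>

#define RELJMP_OPCODE 0xe9
#define RELJMP_SIZE   5

/* Encode "jmp rel32" into 'buf', as if the buffer lived at
 * 'from_addr' and targeted 'to_addr'. */
static void encode_reljump(uint8_t *buf, uintptr_t from_addr,
			   uintptr_t to_addr)
{
	int32_t disp = (int32_t)(to_addr - (from_addr + RELJMP_SIZE));

	buf[0] = RELJMP_OPCODE;
	memcpy(&buf[1], &disp, sizeof(disp));
}

/* Decode the target back out of an encoded jump. */
static uintptr_t decode_reljump_target(const uint8_t *insn,
				       uintptr_t insn_addr)
{
	int32_t disp;

	memcpy(&disp, &insn[1], sizeof(disp));
	return insn_addr + RELJMP_SIZE + disp;
}
```

Because the displacement is a signed 32-bit value, the same encoding covers both forward and backward jumps within +/-2GB.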

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-24 16:27   ` Jason Baron
@ 2009-11-24 17:46     ` Masami Hiramatsu
  2009-11-25 16:12       ` Masami Hiramatsu
  0 siblings, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 17:46 UTC (permalink / raw)
  To: Jason Baron
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Mathieu Desnoyers

Jason Baron wrote:
[...]
>> +/*
>> + * Cross-modifying kernel text with stop_machine().
>> + * This code originally comes from immediate value.
>> + * This does _not_ protect against NMI and MCE. However,
>> + * since kprobes can't probe NMI/MCE handler, it is OK for kprobes.
>> + */
>> +static atomic_t stop_machine_first;
>> +static int wrote_text;
>> +
>> +struct text_poke_param {
>> +	void *addr;
>> +	const void *opcode;
>> +	size_t len;
>> +};
>> +
>> +static int __kprobes stop_machine_multibyte_poke(void *data)
>> +{
>> +	struct text_poke_param *tpp = data;
>> +
>> +	if (atomic_dec_and_test(&stop_machine_first)) {
>> +		text_poke(tpp->addr, tpp->opcode, tpp->len);
>> +		smp_wmb();	/* Make sure other cpus see that this has run */
>> +		wrote_text = 1;
>> +	} else {
>> +		while (!wrote_text)
>> +			smp_rmb();
>> +		sync_core();
>> +	}
>> +
>> +	flush_icache_range((unsigned long)tpp->addr,
>> +			   (unsigned long)tpp->addr + tpp->len);
>> +	return 0;
>> +}
>> +
>> +static void *__kprobes __multibyte_poke(void *addr, const void *opcode,
>> +					size_t len)
>> +{
>> +	struct text_poke_param tpp;
>> +
>> +	tpp.addr = addr;
>> +	tpp.opcode = opcode;
>> +	tpp.len = len;
>> +	atomic_set(&stop_machine_first, 1);
>> +	wrote_text = 0;
>> +	stop_machine(stop_machine_multibyte_poke, (void *)&tpp, NULL);
>> +	return addr;
>> +}
>
> As you know, I'd like to have the jump label optimization for
> tracepoints make use of this '__multibyte_poke()' interface. So perhaps
> it can be moved to arch/x86/kernel/alternative.c. This is where 'text_poke()'
> and friends currently live.

Hmm, maybe the current text_poke() needs a singlebyte_poke() wrapper
to avoid confusion.

> Also, with multiple users we don't want to trample over each other's code
> patching. Thus, if each sub-system could register some type of
> 'is_reserved()' callback, and then we can call all these callbacks from
> the '__multibyte_poke()' routine before we do any patching to make sure
> that we aren't trampling on each others code. After a successful
> patching, each sub-system can update its reserved set of code as
> appropriate. I can code a prototype here, if this makes sense.

Hmm, we have to implement it carefully, because kprobes has already
inserted an int3 here and the optprobe rewrites that int3 again. If
is_reserved() returns 1 and multibyte_poke() returns an error, we can't
optimize it anymore.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 15:34     ` Masami Hiramatsu
@ 2009-11-24 19:45       ` Frederic Weisbecker
  2009-11-24 21:15         ` Masami Hiramatsu
  0 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24 19:45 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Tue, Nov 24, 2009 at 10:34:08AM -0500, Masami Hiramatsu wrote:
> > And this comment doesn't tell us much what this brings us.
> > The changelog tells it stands to avoid a text_mutex deadlock.
> > I'm not sure why we would deadlock without it.
> 
> As Mathieu and I discussed on LKML (http://lkml.org/lkml/2009/11/21/187),
> text_mutex is locked on the cpu-hotplug path.
> Since kprobes locks text_mutex too and stop_machine() refers to
> online_cpus, it can cause a deadlock. So I decided to use
> get_online_cpus() to block hotplug while optimizing/unoptimizing.


Ah ok :)
Could you add a comment in the code that explains it?

Thanks.



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 15:34       ` Masami Hiramatsu
@ 2009-11-24 20:14         ` Frederic Weisbecker
  2009-11-24 20:59           ` Masami Hiramatsu
  2009-11-24 21:08           ` H. Peter Anvin
  0 siblings, 2 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24 20:14 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Tue, Nov 24, 2009 at 10:34:16AM -0500, Masami Hiramatsu wrote:
> Frederic Weisbecker wrote:
> > I _might_ have understood.
> > You have set up the optimized flags, then you wait for
> > any old-style int 3 kprobes to complete and route
> > to detour buffer so that you can patch the jump
> > safely in the dead code? (and finish with first byte
> > by patching the int 3 itself)
> > 
> 
> Yeah, you've got it almost right.
> The reason why we have to wait for scheduling on all processors
> is that this code may modify N instructions (not a single
> instruction). This means there is a chance that the 2nd to Nth
> instructions are interrupted on other cpus when we start
> modifying code.


Aaah ok!

In this case, you probably just need the synchronize_sched()
thing. The delayed work looks unnecessary.

 
> Please imagine that the 2nd instruction is interrupted and
> stop_machine() replaces the 2nd instruction with a jump
> *address* while the interrupt handler is running. When the interrupt
> returns to the original address, there are no valid instructions
> there, and it causes unexpected results.


Yeah.


> 
> To avoid this situation, we have to wait for a scheduler quiescent
> state on all cpus, because that also ensures that all current
> interrupts are done.


Ok.


> This also explains why we don't need to wait when unoptimizing
> and why it doesn't support preemptive kernels yet.


I see...so the non-preemptible kernel requirement looks
> hard to work around :-s


> In the unoptimizing case, since there is just a single instruction
> (a jump), there is no Nth instruction which can be interrupted.
> Thus we can just use stop_machine(). :-)


Ok.


> 
> On a preemptive kernel, waiting for scheduling does not work as it
> does on a non-preemptive kernel. Since processes can be preempted
> inside an interrupt, we can't ensure that the currently running
> interrupt is done. (I assume that a pair of freeze_processes()
> and thaw_processes() may possibly ensure that, or maybe we can
> share some stack rewinding code with ksplice.)
> So it depends on !PREEMPT.



Right.
However, using freeze_processes() and thaw_processes() would probably
be too costly, and there's no guarantee that every process goes to
the refrigerator() :-), because some tasks are not freezable,
like kernel threads by default if I remember well, unless they
call set_freezable(). That's a pity; we would just have needed
to set __kprobe in refrigerator().


PS: hmm, btw, I remember a patch that
tagged refrigerator() as __cold, but it looks like it hasn't been
applied....

Thanks.



* Re: [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers
  2009-11-24 15:39     ` Masami Hiramatsu
@ 2009-11-24 20:19       ` Frederic Weisbecker
  0 siblings, 0 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24 20:19 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Tue, Nov 24, 2009 at 10:39:13AM -0500, Masami Hiramatsu wrote:
> Frederic Weisbecker wrote:
> > On Mon, Nov 23, 2009 at 06:22:04PM -0500, Masami Hiramatsu wrote:
> >> +#ifdef CONFIG_X86_64
> >> +#define SAVE_REGS_STRING		\
> >> +	/* Skip cs, ip, orig_ax. */	\
> >> +	"	subq $24, %rsp\n"	\
> >> +	"	pushq %rdi\n"		\
> >> +	"	pushq %rsi\n"		\
> >> +	"	pushq %rdx\n"		\
> >> +	"	pushq %rcx\n"		\
> >> +	"	pushq %rax\n"		\
> >> +	"	pushq %r8\n"		\
> >> +	"	pushq %r9\n"		\
> >> +	"	pushq %r10\n"		\
> >> +	"	pushq %r11\n"		\
> >> +	"	pushq %rbx\n"		\
> >> +	"	pushq %rbp\n"		\
> >> +	"	pushq %r12\n"		\
> >> +	"	pushq %r13\n"		\
> >> +	"	pushq %r14\n"		\
> >> +	"	pushq %r15\n"
> >> +#define RESTORE_REGS_STRING		\
> >> +	"	popq %r15\n"		\
> >> +	"	popq %r14\n"		\
> >> +	"	popq %r13\n"		\
> >> +	"	popq %r12\n"		\
> >> +	"	popq %rbp\n"		\
> >> +	"	popq %rbx\n"		\
> >> +	"	popq %r11\n"		\
> >> +	"	popq %r10\n"		\
> >> +	"	popq %r9\n"		\
> >> +	"	popq %r8\n"		\
> >> +	"	popq %rax\n"		\
> >> +	"	popq %rcx\n"		\
> >> +	"	popq %rdx\n"		\
> >> +	"	popq %rsi\n"		\
> >> +	"	popq %rdi\n"		\
> > 
> > 
> > BTW, do you really need to push/pop every registers
> > before/after calling a probe handler?
> 
Yes, in both cases (kretprobe/optprobe) it needs to
emulate kprobes behavior. Since kprobes can be used for
fault injection, it should pop pt_regs.
> 
> > Is it possible to only save/restore the scratch ones?
> 
> Hmm, what code did you mean?



Ah, this chain of push/pops is there to dump a struct pt_regs for
the handler?
Sorry, I just thought it was to save the registers of the probed
function.



* Re: [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers
  2009-11-24 15:40     ` Frank Ch. Eigler
@ 2009-11-24 20:20       ` Frederic Weisbecker
  0 siblings, 0 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2009-11-24 20:20 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Masami Hiramatsu, Ingo Molnar, Ananth N Mavinakayanahalli, lkml,
	systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Jason Baron,
	Mathieu Desnoyers

On Tue, Nov 24, 2009 at 10:40:24AM -0500, Frank Ch. Eigler wrote:
> Frederic Weisbecker <fweisbec@gmail.com> writes:
> 
> > [...]
> >> +#define SAVE_REGS_STRING		\
> >> +#define RESTORE_REGS_STRING		\
> >
> > BTW, do you really need to push/pop every registers
> > before/after calling a probe handler?
> 
> It's part of the definition of a kprobe, that a populated
> pt_regs* value is passed.  Clients can rely on that in order
> to access registers etc.
> 
> - FChE

Yeah, I got confused. Sorry.



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 20:14         ` Frederic Weisbecker
@ 2009-11-24 20:59           ` Masami Hiramatsu
  2009-11-25 21:08             ` Steven Rostedt
  2009-11-24 21:08           ` H. Peter Anvin
  1 sibling, 1 reply; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 20:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Frederic Weisbecker wrote:
> On Tue, Nov 24, 2009 at 10:34:16AM -0500, Masami Hiramatsu wrote:
>> Frederic Weisbecker wrote:
>>> I _might_ have understood.
>>> You have set up the optimized flags, then you wait for
>>> any old-style int 3 kprobes to complete and route
>>> to detour buffer so that you can patch the jump
>>> safely in the dead code? (and finish with first byte
>>> by patching the int 3 itself)
>>>
>>
>> Yeah, you've got it almost right.
>> The reason why we have to wait for scheduling on all processors
>> is that this code may modify N instructions (not a single
>> instruction). This means there is a chance that the 2nd to Nth
>> instructions are interrupted on other cpus when we start
>> modifying code.
>
>
> Aaah ok!
>
> In this case, you probably just need the synchronize_sched()
> thing. The delayed work looks unnecessary.

Yeah, the delayed work is for speeding up batch registration,
which kprobes already supports. Sometimes ~100 probes
are set via the batch registration interface.
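The batching described here can be sketched in user space (all names hypothetical): registration only queues a probe, and one deferred flush pays the expensive synchronization once for the whole batch.

```c
#include <stddef.h>

#define MAX_PENDING 128

static unsigned long pending[MAX_PENDING];
static int nr_pending;
static int nr_flush_passes;	/* each pass models one costly sync */
static int nr_optimized;

/* Registration only queues the probe; cheap and immediate. */
static int queue_optimization(unsigned long addr)
{
	if (nr_pending >= MAX_PENDING)
		return -1;
	pending[nr_pending++] = addr;
	return 0;
}

/* The deferred work: one expensive pass covers every queued probe. */
static void flush_optimizations(void)
{
	int i;

	if (!nr_pending)
		return;
	nr_flush_passes++;	/* stands in for stop_machine()/synchronize_sched() */
	for (i = 0; i < nr_pending; i++)
		nr_optimized++;	/* stands in for patching pending[i] */
	nr_pending = 0;
}
```

With ~100 queued probes, this turns ~100 synchronization rounds into one, which is the point of using delayed work rather than synchronizing per registration.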

>> Please imagine that the 2nd instruction is interrupted and
>> stop_machine() replaces the 2nd instruction with a jump
>> *address* while the interrupt handler is running. When the interrupt
>> returns to the original address, there are no valid instructions
>> there, and it causes unexpected results.
>
>
> Yeah.
>
>
>>
>> To avoid this situation, we have to wait for a scheduler quiescent
>> state on all cpus, because that also ensures that all current
>> interrupts are done.
>
>
> Ok.
>
>
>> This also explains why we don't need to wait when unoptimizing
>> and why it doesn't support preemptive kernels yet.
>
>
> I see...so the non-preemptible kernel requirement looks
> hard to work around :-s

It's the next challenge, I think :-)
Even so, kprobes itself still works on preemptive kernels,
so we don't lose any functionality.

>> In the unoptimizing case, since there is just a single instruction
>> (a jump), there is no Nth instruction which can be interrupted.
>> Thus we can just use stop_machine(). :-)
>
>
> Ok.
>
>
>>
>> On a preemptive kernel, waiting for scheduling does not work as it
>> does on a non-preemptive kernel. Since processes can be preempted
>> inside an interrupt, we can't ensure that the currently running
>> interrupt is done. (I assume that a pair of freeze_processes()
>> and thaw_processes() may possibly ensure that, or maybe we can
>> share some stack rewinding code with ksplice.)
>> So it depends on !PREEMPT.
>
>
>
> Right.
> However, using freeze_processes() and thaw_processes() would probably
> be too costly, and there's no guarantee that every process goes to
> the refrigerator() :-), because some tasks are not freezable,
> like kernel threads by default if I remember well, unless they
> call set_freezable(). That's a pity; we would just have needed
> to set __kprobe in refrigerator().

Ah, right. Even so, we still have the option of the ksplice approach.

Thank you,

> PS: hmm btw I remember about a patch that
> tagged refrigerator() as __cold but it looks like it hasn't been
> applied....
>
> Thanks.
>

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 20:14         ` Frederic Weisbecker
  2009-11-24 20:59           ` Masami Hiramatsu
@ 2009-11-24 21:08           ` H. Peter Anvin
  1 sibling, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2009-11-24 21:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Masami Hiramatsu, Ingo Molnar, Ananth N Mavinakayanahalli, lkml,
	systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On 11/24/2009 12:14 PM, Frederic Weisbecker wrote:
> 
> PS: hmm btw I remember about a patch that
> tagged refrigerator() as __cold but it looks like it hasn't been
> applied....
> 

Groan!  That hurt!

	-hpa



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 19:45       ` Frederic Weisbecker
@ 2009-11-24 21:15         ` Masami Hiramatsu
  0 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-24 21:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Ananth N Mavinakayanahalli, lkml, systemtap, DLE,
	Jim Keniston, Srikar Dronamraju, Christoph Hellwig,
	Steven Rostedt, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Frederic Weisbecker wrote:
> On Tue, Nov 24, 2009 at 10:34:08AM -0500, Masami Hiramatsu wrote:
>>> And this comment doesn't tell us much what this brings us.
>>> The changelog tells it stands to avoid a text_mutex deadlock.
>>> I'm not sure why we would deadlock without it.
>>
>> As Mathieu and I discussed on LKML (http://lkml.org/lkml/2009/11/21/187),
>> text_mutex is locked on the cpu-hotplug path.
>> Since kprobes locks text_mutex too and stop_machine() refers to
>> online_cpus, it can cause a deadlock. So I decided to use
>> get_online_cpus() to block hotplug while optimizing/unoptimizing.
>
>
> Ah ok :)
> Could you add a comment in the code that explains it?

Sure, of course :-)


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86
  2009-11-24 17:46     ` Masami Hiramatsu
@ 2009-11-25 16:12       ` Masami Hiramatsu
  0 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-25 16:12 UTC (permalink / raw)
  To: Jason Baron
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, Steven Rostedt, H. Peter Anvin,
	Anders Kaseorg, Tim Abbott, Andi Kleen, Mathieu Desnoyers

Masami Hiramatsu wrote:
> Jason Baron wrote:
>> Also, with multiple users we don't want to trample over each other's code
>> patching. Thus, if each sub-system could register some type of
>> 'is_reserved()' callback, and then we can call all these callbacks from
>> the '__multibyte_poke()' routine before we do any patching to make sure
>> that we aren't trampling on each others code. After a successful
>> patching, each sub-system can update its reserved set of code as
>> appropriate. I can code a prototype here, if this makes sense.
>
> Hmm, we have to implement it carefully, because here kprobes already
> inserted int3 and optprobe rewrites the int3 again. If is_reserved()
> returns 1 and multibyte_poke returns error, we can't optimize it anymore.

IMHO, all text-modifiers except kprobes should provide an is_reserved()
callback, and kprobes should cancel probing if its target address is
reserved, since only kprobes modifies text at arbitrary addresses while
the others modify text at fixed addresses.

Anyway, I think this will be another bugfix for the current kprobes/alternatives code.

Thank you,
-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-24 20:59           ` Masami Hiramatsu
@ 2009-11-25 21:08             ` Steven Rostedt
  2009-11-25 21:30               ` Masami Hiramatsu
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Rostedt @ 2009-11-25 21:08 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

On Tue, 2009-11-24 at 15:59 -0500, Masami Hiramatsu wrote:

> > I see... so the non-preemptible kernel requirement looks
> > hard to work around :-s
> 
> It's the next challenge, I think :-)
> Even so, kprobes itself still works on a preemptive kernel,
> so we don't lose any functionality.

From kstop_machine, we could search all tasks to see if any are about to
resume in the modified location. If so, we could either

1) insert a normal kprobe
2) modify the return address of the task to jump to some trampoline to 
   finish the work and return to the code spot with a direct jump.

#2 is kind of nasty but seems like a fun thing to implement ;-)

-- Steve




* Re: [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization
  2009-11-25 21:08             ` Steven Rostedt
@ 2009-11-25 21:30               ` Masami Hiramatsu
  0 siblings, 0 replies; 37+ messages in thread
From: Masami Hiramatsu @ 2009-11-25 21:30 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Ananth N Mavinakayanahalli,
	lkml, systemtap, DLE, Jim Keniston, Srikar Dronamraju,
	Christoph Hellwig, H. Peter Anvin, Anders Kaseorg, Tim Abbott,
	Andi Kleen, Jason Baron, Mathieu Desnoyers

Steven Rostedt wrote:
> On Tue, 2009-11-24 at 15:59 -0500, Masami Hiramatsu wrote:
> 
>>> I see... so the non-preemptible kernel requirement looks
>>> hard to work around :-s
>>
>> It's the next challenge, I think :-)
>> Even so, kprobes itself still works on a preemptive kernel,
>> so we don't lose any functionality.
> 
> From kstop_machine, we could search all tasks to see if any are about to
> resume in the modified location. If so, we could either
> 
> 1) insert a normal kprobe
> 2) modify the return address of the task to jump to some trampoline to
>     finish the work and return to the code spot with a direct jump.
> 
> #2 is kind of nasty but seems like a fun thing to implement ;-)

Sure; anyway, a normal kprobe is already inserted, so we can also
just wait until no task will resume into that spot :-)
(That's another reason why optimization uses delayed work:
 we can retry it again and again.)
And I think that code could be shared with ksplice too.

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



end of thread, other threads:[~2009-11-25 21:30 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-23 23:21 [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Masami Hiramatsu
2009-11-23 23:21 ` [PATCH -tip v5 01/10] kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE Masami Hiramatsu
2009-11-23 23:21 ` [PATCH -tip v5 02/10] kprobes: Introduce generic insn_slot framework Masami Hiramatsu
2009-11-23 23:21 ` [PATCH -tip v5 03/10] kprobes: Introduce kprobes jump optimization Masami Hiramatsu
2009-11-24  2:44   ` Frederic Weisbecker
2009-11-24  3:31     ` Frederic Weisbecker
2009-11-24 15:34       ` Masami Hiramatsu
2009-11-24 20:14         ` Frederic Weisbecker
2009-11-24 20:59           ` Masami Hiramatsu
2009-11-25 21:08             ` Steven Rostedt
2009-11-25 21:30               ` Masami Hiramatsu
2009-11-24 21:08           ` H. Peter Anvin
2009-11-24 15:34     ` Masami Hiramatsu
2009-11-24 19:45       ` Frederic Weisbecker
2009-11-24 21:15         ` Masami Hiramatsu
2009-11-23 23:21 ` [PATCH -tip v5 04/10] kprobes: Jump optimization sysctl interface Masami Hiramatsu
2009-11-23 23:21 ` [PATCH -tip v5 05/10] kprobes/x86: Boost probes when reentering Masami Hiramatsu
2009-11-23 23:22 ` [PATCH -tip v5 06/10] kprobes/x86: Cleanup save/restore registers Masami Hiramatsu
2009-11-24  2:51   ` Frederic Weisbecker
2009-11-24 15:39     ` Masami Hiramatsu
2009-11-24 20:19       ` Frederic Weisbecker
2009-11-24 15:40     ` Frank Ch. Eigler
2009-11-24 20:20       ` Frederic Weisbecker
2009-11-23 23:22 ` [PATCH -tip v5 07/10] kprobes/x86: Support kprobes jump optimization on x86 Masami Hiramatsu
2009-11-24  3:14   ` Frederic Weisbecker
2009-11-24 16:27   ` Jason Baron
2009-11-24 17:46     ` Masami Hiramatsu
2009-11-25 16:12       ` Masami Hiramatsu
2009-11-24 16:35   ` H. Peter Anvin
2009-11-24 17:00     ` Masami Hiramatsu
2009-11-23 23:22 ` [PATCH -tip v5 08/10] kprobes: Add documents of jump optimization Masami Hiramatsu
2009-11-23 23:22 ` [PATCH -tip v5 09/10] [RFC] x86: Introduce generic jump patching without stop_machine Masami Hiramatsu
2009-11-23 23:22 ` [PATCH -tip v5 10/10] [RFC] kprobes/x86: Use text_poke_fixup() for jump optimization Masami Hiramatsu
2009-11-24  2:03 ` [PATCH -tip v5 00/10] kprobes: Kprobes jump optimization support Frederic Weisbecker
2009-11-24  3:20   ` Frederic Weisbecker
2009-11-24  7:52     ` Ingo Molnar
2009-11-24 16:06       ` Masami Hiramatsu
