* [PATCH -tip 0/5] kprobes: batch (un)optimization support
@ 2010-05-10 17:53 Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize Masami Hiramatsu
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: Ananth N Mavinakayanahalli, Ingo Molnar, Jim Keniston,
	Jason Baron, Mathieu Desnoyers, systemtap, DLE


Hi,

Since the current kprobes jump optimization calls stop_machine() for each
probe, it can almost freeze the machine when a lot of probes (~1000) are
registered at once.
To avoid this issue, this patch series enhances kprobes and
text_poke:

- Introduce text_poke_smp_batch(), which modifies multiple code areas
  with a single stop_machine().
- Limit how many probes are optimized in one pass.
- Introduce delayed unoptimization so that unoptimization can be
  batched as well.

text_poke_smp_batch() should also help Jason's jump label work reduce
the overhead it incurs from text_poke_smp().

Currently, this series allocates its working memory statically; it could
be allocated dynamically instead, because the allocation is not done in a
critical section. That will be addressed in a later step or in the next
version.
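
For reference, the problematic case is simply registering many probes in
one call, roughly like the sketch below (illustrative only; the
target_symbols[] list is made up and error handling is omitted):

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>

#define NR_PROBES 1000

/* Hypothetical list of target function names, filled in elsewhere. */
static const char *target_symbols[NR_PROBES];

static struct kprobe probes[NR_PROBES];
static struct kprobe *kps[NR_PROBES];

static int sample_pre(struct kprobe *p, struct pt_regs *regs)
{
	return 0;	/* do nothing, just take the hit */
}

static int __init register_many_init(void)
{
	int i;

	for (i = 0; i < NR_PROBES; i++) {
		probes[i].symbol_name = target_symbols[i];
		probes[i].pre_handler = sample_pre;
		kps[i] = &probes[i];
	}
	/*
	 * Today each of these probes is optimized with its own
	 * stop_machine(); with this series the optimizations are
	 * queued and flushed in batches.
	 */
	return register_kprobes(kps, NR_PROBES);
}
module_init(register_many_init);

MODULE_LICENSE("GPL");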

Thank you,

---

Masami Hiramatsu (5):
      kprobes: Support delayed unoptimization
      kprobes/x86: Use text_poke_smp_batch
      x86: Introduce text_poke_smp_batch() for batch-code modifying
      kprobes: Limit maximum number of optimization at once
      [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize


 arch/x86/include/asm/alternative.h |    7 +
 arch/x86/include/asm/kprobes.h     |    4 
 arch/x86/kernel/alternative.c      |   49 +++-
 arch/x86/kernel/kprobes.c          |   78 ++++++
 include/linux/kprobes.h            |    4 
 kernel/kprobes.c                   |  437 ++++++++++++++++++++++++------------
 6 files changed, 416 insertions(+), 163 deletions(-)

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize
  2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
@ 2010-05-10 17:53 ` Masami Hiramatsu
  2010-05-11 12:35   ` Mathieu Desnoyers
  2010-05-10 17:53 ` [PATCH -tip 2/5] kprobes: Limit maximum number of optimization at once Masami Hiramatsu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Jason Baron, Mathieu Desnoyers

Remove text_mutex locking in optimize_all_kprobes(), because
that function doesn't modify text but just orders optimization
to the worker.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 kernel/kprobes.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 282035f..1d34eef 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -606,14 +606,12 @@ static void __kprobes optimize_all_kprobes(void)
 		return;
 
 	kprobes_allow_optimization = true;
-	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			if (!kprobe_disabled(p))
 				optimize_kprobe(p);
 	}
-	mutex_unlock(&text_mutex);
 	printk(KERN_INFO "Kprobes globally optimized\n");
 }
 


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -tip 2/5] kprobes: Limit maximum number of optimization at once
  2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize Masami Hiramatsu
@ 2010-05-10 17:53 ` Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 3/5] x86: Introduce text_poke_smp_batch() for batch-code modifying Masami Hiramatsu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Jason Baron, Mathieu Desnoyers

Since text_poke_smp() uses stop_machine(), the machine almost
stops for a while if we optimize a lot of probes at once. This patch
limits the maximum number of probes that are optimized (i.e. for which
text_poke_smp() is called) in one pass, and schedules another pass
after a delay if more probes than that limit are still waiting for
optimization.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 kernel/kprobes.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 1d34eef..aae368a 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -424,11 +424,13 @@ static LIST_HEAD(optimizing_list);
 static void kprobe_optimizer(struct work_struct *work);
 static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
 #define OPTIMIZE_DELAY 5
+#define MAX_OPTIMIZE_PROBES 64
 
 /* Kprobe jump optimizer */
 static __kprobes void kprobe_optimizer(struct work_struct *work)
 {
 	struct optimized_kprobe *op, *tmp;
+	int c = 0;
 
 	/* Lock modules while optimizing kprobes */
 	mutex_lock(&module_mutex);
@@ -462,9 +464,13 @@ static __kprobes void kprobe_optimizer(struct work_struct *work)
 		if (arch_optimize_kprobe(op) < 0)
 			op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
 		list_del_init(&op->list);
+		if (++c >= MAX_OPTIMIZE_PROBES)
+			break;
 	}
 	mutex_unlock(&text_mutex);
 	put_online_cpus();
+	if (!list_empty(&optimizing_list))
+		schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
 end:
 	mutex_unlock(&kprobe_mutex);
 	mutex_unlock(&module_mutex);


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -tip 3/5] x86: Introduce text_poke_smp_batch() for batch-code modifying
  2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 2/5] kprobes: Limit maximum number of optimization at once Masami Hiramatsu
@ 2010-05-10 17:53 ` Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch Masami Hiramatsu
  2010-05-10 17:53 ` [PATCH -tip 5/5] kprobes: Support delayed unoptimization Masami Hiramatsu
  4 siblings, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Jason Baron, Mathieu Desnoyers

Introduce text_poke_smp_batch(). This function modifies several
text areas with a single stop_machine() on SMP. Because calling
stop_machine() is a heavy operation, it is better to aggregate
text_poke requests into one call.
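
For illustration, a caller is expected to use the new interface roughly
as in the sketch below (not part of this patch; poke_all_sites(),
NR_SITES and the addrs/opcodes arrays are made up for the example):

#include <asm/alternative.h>

#define NR_SITES 16

static struct text_poke_param params[NR_SITES];

/*
 * Caller must hold text_mutex and call this between get_online_cpus()
 * and put_online_cpus(), as the kernel-doc below notes. nsites is
 * assumed to be <= NR_SITES.
 */
static void poke_all_sites(void **addrs, const void **opcodes,
			   size_t len, int nsites)
{
	int i;

	for (i = 0; i < nsites; i++) {
		params[i].addr   = addrs[i];
		params[i].opcode = opcodes[i];
		params[i].len    = len;
	}
	/* A single stop_machine() covers all nsites modifications. */
	text_poke_smp_batch(params, nsites);
}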

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 arch/x86/include/asm/alternative.h |    7 +++++
 arch/x86/kernel/alternative.c      |   49 +++++++++++++++++++++++++++++-------
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 03b6bb5..9a08773 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -176,7 +176,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
  * On the local CPU you need to be protected again NMI or MCE handlers seeing an
  * inconsistent instruction while you patch.
  */
+struct text_poke_param {
+	void *addr;
+	const void *opcode;
+	size_t len;
+};
+
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_smp(void *addr, const void *opcode, size_t len);
+extern void text_poke_smp_batch(struct text_poke_param *params, int n);
 
 #endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7023773..7256800 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -590,17 +590,21 @@ static atomic_t stop_machine_first;
 static int wrote_text;
 
 struct text_poke_params {
-	void *addr;
-	const void *opcode;
-	size_t len;
+	struct text_poke_param *params;
+	int nparams;
 };
 
 static int __kprobes stop_machine_text_poke(void *data)
 {
 	struct text_poke_params *tpp = data;
+	struct text_poke_param *p;
+	int i;
 
 	if (atomic_dec_and_test(&stop_machine_first)) {
-		text_poke(tpp->addr, tpp->opcode, tpp->len);
+		for (i = 0; i < tpp->nparams; i++) {
+			p = &tpp->params[i];
+			text_poke(p->addr, p->opcode, p->len);
+		}
 		smp_wmb();	/* Make sure other cpus see that this has run */
 		wrote_text = 1;
 	} else {
@@ -609,8 +613,12 @@ static int __kprobes stop_machine_text_poke(void *data)
 		smp_mb();	/* Load wrote_text before following execution */
 	}
 
-	flush_icache_range((unsigned long)tpp->addr,
-			   (unsigned long)tpp->addr + tpp->len);
+	for (i = 0; i < tpp->nparams; i++) {
+		p = &tpp->params[i];
+		flush_icache_range((unsigned long)p->addr,
+				   (unsigned long)p->addr + p->len);
+	}
+
 	return 0;
 }
 
@@ -630,13 +638,36 @@ static int __kprobes stop_machine_text_poke(void *data)
 void *__kprobes text_poke_smp(void *addr, const void *opcode, size_t len)
 {
 	struct text_poke_params tpp;
+	struct text_poke_param p;
 
-	tpp.addr = addr;
-	tpp.opcode = opcode;
-	tpp.len = len;
+	p.addr = addr;
+	p.opcode = opcode;
+	p.len = len;
+	tpp.params = &p;
+	tpp.nparams = 1;
 	atomic_set(&stop_machine_first, 1);
 	wrote_text = 0;
 	stop_machine(stop_machine_text_poke, (void *)&tpp, NULL);
 	return addr;
 }
 
+/**
+ * text_poke_smp_batch - Update instructions on a live kernel on SMP
+ * @params: an array of text_poke parameters
+ * @n: the number of elements in params.
+ *
+ * Modify multi-byte instruction by using stop_machine() on SMP. Since the
+ * stop_machine() is a heavy task, it is better to aggregate text_poke requests
+ * and do it once if possible.
+ *
+ * Note: Must be called under get_online_cpus() and text_mutex.
+ */
+void __kprobes text_poke_smp_batch(struct text_poke_param *params, int n)
+{
+	struct text_poke_params tpp = {.params = params, .nparams = n};
+
+	atomic_set(&stop_machine_first, 1);
+	wrote_text = 0;
+	stop_machine(stop_machine_text_poke, (void *)&tpp, NULL);
+}
+


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
                   ` (2 preceding siblings ...)
  2010-05-10 17:53 ` [PATCH -tip 3/5] x86: Introduce text_poke_smp_batch() for batch-code modifying Masami Hiramatsu
@ 2010-05-10 17:53 ` Masami Hiramatsu
  2010-05-11 14:40   ` Mathieu Desnoyers
  2010-05-10 17:53 ` [PATCH -tip 5/5] kprobes: Support delayed unoptimization Masami Hiramatsu
  4 siblings, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Jason Baron, Mathieu Desnoyers

Use text_poke_smp_batch() in the optimization path to reduce
the number of stop_machine() calls.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
 include/linux/kprobes.h   |    2 +-
 kernel/kprobes.c          |   13 +------------
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 345a4b1..63a5c24 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
 	return 0;
 }
 
-/* Replace a breakpoint (int3) with a relative jump.  */
-int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
+#define MAX_OPTIMIZE_PROBES 256
+static struct text_poke_param jump_params[MAX_OPTIMIZE_PROBES];
+static char jump_code_buf[MAX_OPTIMIZE_PROBES][RELATIVEJUMP_SIZE];
+
+static void __kprobes setup_optimize_kprobe(struct text_poke_param *tprm,
+					    char *insn_buf,
+					    struct optimized_kprobe *op)
 {
-	unsigned char jmp_code[RELATIVEJUMP_SIZE];
 	s32 rel = (s32)((long)op->optinsn.insn -
 			((long)op->kp.addr + RELATIVEJUMP_SIZE));
 
@@ -1396,16 +1400,35 @@ int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
 	memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_SIZE,
 	       RELATIVE_ADDR_SIZE);
 
-	jmp_code[0] = RELATIVEJUMP_OPCODE;
-	*(s32 *)(&jmp_code[1]) = rel;
+	insn_buf[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&insn_buf[1]) = rel;
+
+	tprm->addr = op->kp.addr;
+	tprm->opcode = insn_buf;
+	tprm->len = RELATIVEJUMP_SIZE;
+}
+
+/* Replace a breakpoint (int3) with a relative jump.  */
+void __kprobes arch_optimize_kprobes(struct list_head *oplist)
+{
+	struct optimized_kprobe *op, *tmp;
+	int c = 0;
+
+	list_for_each_entry_safe(op, tmp, oplist, list) {
+		WARN_ON(kprobe_disabled(&op->kp));
+		/* Setup param */
+		setup_optimize_kprobe(&jump_params[c], jump_code_buf[c], op);
+		list_del_init(&op->list);
+		if (++c >= MAX_OPTIMIZE_PROBES)
+			break;
+	}
 
 	/*
 	 * text_poke_smp doesn't support NMI/MCE code modifying.
 	 * However, since kprobes itself also doesn't support NMI/MCE
 	 * code probing, it's not a problem.
 	 */
-	text_poke_smp(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE);
-	return 0;
+	text_poke_smp_batch(jump_params, c);
 }
 
 /* Replace a relative jump with a breakpoint (int3).  */
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index e7d1b2e..fe157ba 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -275,7 +275,7 @@ extern int arch_prepared_optinsn(struct arch_optimized_insn *optinsn);
 extern int arch_check_optimized_kprobe(struct optimized_kprobe *op);
 extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op);
 extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op);
-extern int  arch_optimize_kprobe(struct optimized_kprobe *op);
+extern void arch_optimize_kprobes(struct list_head *oplist);
 extern void arch_unoptimize_kprobe(struct optimized_kprobe *op);
 extern kprobe_opcode_t *get_optinsn_slot(void);
 extern void free_optinsn_slot(kprobe_opcode_t *slot, int dirty);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index aae368a..c824c23 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -424,14 +424,10 @@ static LIST_HEAD(optimizing_list);
 static void kprobe_optimizer(struct work_struct *work);
 static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
 #define OPTIMIZE_DELAY 5
-#define MAX_OPTIMIZE_PROBES 64
 
 /* Kprobe jump optimizer */
 static __kprobes void kprobe_optimizer(struct work_struct *work)
 {
-	struct optimized_kprobe *op, *tmp;
-	int c = 0;
-
 	/* Lock modules while optimizing kprobes */
 	mutex_lock(&module_mutex);
 	mutex_lock(&kprobe_mutex);
@@ -459,14 +455,7 @@ static __kprobes void kprobe_optimizer(struct work_struct *work)
 	 */
 	get_online_cpus();
 	mutex_lock(&text_mutex);
-	list_for_each_entry_safe(op, tmp, &optimizing_list, list) {
-		WARN_ON(kprobe_disabled(&op->kp));
-		if (arch_optimize_kprobe(op) < 0)
-			op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
-		list_del_init(&op->list);
-		if (++c >= MAX_OPTIMIZE_PROBES)
-			break;
-	}
+	arch_optimize_kprobes(&optimizing_list);
 	mutex_unlock(&text_mutex);
 	put_online_cpus();
 	if (!list_empty(&optimizing_list))


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH -tip 5/5] kprobes: Support delayed unoptimization
  2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
                   ` (3 preceding siblings ...)
  2010-05-10 17:53 ` [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch Masami Hiramatsu
@ 2010-05-10 17:53 ` Masami Hiramatsu
  4 siblings, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-10 17:53 UTC (permalink / raw)
  To: Ingo Molnar, lkml
  Cc: systemtap, DLE, Masami Hiramatsu, Ananth N Mavinakayanahalli,
	Ingo Molnar, Jim Keniston, Jason Baron, Mathieu Desnoyers

Unoptimization occurs when a probe is unregistered or disabled, and
it is heavy because it restores the original instructions by using
stop_machine(). This patch delays unoptimization operations and
unoptimizes several probes at once by using text_poke_smp_batch().
This avoids unexpected system slowdowns caused by repeated
stop_machine() calls.
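
With this change, the optimizer worker ends up doing roughly the
following (a condensed sketch of the kernel/kprobes.c hunks below, with
comments added for this summary):

static __kprobes void kprobe_optimizer(struct work_struct *work)
{
	LIST_HEAD(free_list);

	mutex_lock(&module_mutex);
	mutex_lock(&kprobe_mutex);

	do_unoptimize_kprobes(&free_list);	/* batched jump -> int3 */
	synchronize_sched();			/* wait for in-flight users */
	do_optimize_kprobes();			/* batched int3 -> jump */
	do_free_unused_kprobes(&free_list);	/* free dead aggrprobes */

	mutex_unlock(&kprobe_mutex);
	mutex_unlock(&module_mutex);
	/* ...then reschedule itself or wake up waiters, see below. */
}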

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

 arch/x86/include/asm/kprobes.h |    4 
 arch/x86/kernel/kprobes.c      |   41 ++++
 include/linux/kprobes.h        |    2 
 kernel/kprobes.c               |  434 +++++++++++++++++++++++++++-------------
 4 files changed, 340 insertions(+), 141 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 5478825..e8a864c 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -79,12 +79,12 @@ struct arch_specific_insn {
 };
 
 struct arch_optimized_insn {
-	/* copy of the original instructions */
-	kprobe_opcode_t copied_insn[RELATIVE_ADDR_SIZE];
 	/* detour code buffer */
 	kprobe_opcode_t *insn;
 	/* the size of instructions copied to detour code buffer */
 	size_t size;
+	/* copy of the original instructions */
+	kprobe_opcode_t copied_insn[RELATIVE_ADDR_SIZE];
 };
 
 /* Return true (!0) if optinsn is prepared for optimization. */
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 63a5c24..bd5ea3a 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -1167,6 +1167,10 @@ static void __kprobes optimized_callback(struct optimized_kprobe *op,
 {
 	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 
+	/* This is possible if op is under delayed unoptimizing */
+	if (kprobe_disabled(&op->kp))
+		return;
+
 	preempt_disable();
 	if (kprobe_running()) {
 		kprobes_inc_nmissed_count(&op->kp);
@@ -1431,6 +1435,19 @@ void __kprobes arch_optimize_kprobes(struct list_head *oplist)
 	text_poke_smp_batch(jump_params, c);
 }
 
+static void __kprobes setup_unoptimize_kprobe(struct text_poke_param *tprm,
+					      char *insn_buf,
+					      struct optimized_kprobe *op)
+{
+	/* Set int3 to first byte for kprobes */
+	insn_buf[0] = BREAKPOINT_INSTRUCTION;
+	memcpy(insn_buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+	tprm->addr = op->kp.addr;
+	tprm->opcode = insn_buf;
+	tprm->len = RELATIVEJUMP_SIZE;
+}
+
 /* Replace a relative jump with a breakpoint (int3).  */
 void __kprobes arch_unoptimize_kprobe(struct optimized_kprobe *op)
 {
@@ -1442,6 +1459,30 @@ void __kprobes arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	text_poke_smp(op->kp.addr, buf, RELATIVEJUMP_SIZE);
 }
 
+extern void arch_unoptimize_kprobes(struct list_head *oplist,
+				    struct list_head *done_list)
+{
+	struct optimized_kprobe *op, *tmp;
+	int c = 0;
+
+	list_for_each_entry_safe(op, tmp, oplist, list) {
+		/* Setup param */
+		setup_unoptimize_kprobe(&jump_params[c], jump_code_buf[c], op);
+		list_del_init(&op->list);
+		list_add(&op->list, done_list);
+		if (++c >= MAX_OPTIMIZE_PROBES)
+			break;
+	}
+
+	/*
+	 * text_poke_smp doesn't support NMI/MCE code modifying.
+	 * However, since kprobes itself also doesn't support NMI/MCE
+	 * code probing, it's not a problem.
+	 */
+	text_poke_smp_batch(jump_params, c);
+}
+
+
 static int  __kprobes setup_detour_execution(struct kprobe *p,
 					     struct pt_regs *regs,
 					     int reenter)
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index fe157ba..b78edb5 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -276,6 +276,8 @@ extern int arch_check_optimized_kprobe(struct optimized_kprobe *op);
 extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op);
 extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op);
 extern void arch_optimize_kprobes(struct list_head *oplist);
+extern void arch_unoptimize_kprobes(struct list_head *oplist,
+				    struct list_head *done_list);
 extern void arch_unoptimize_kprobe(struct optimized_kprobe *op);
 extern kprobe_opcode_t *get_optinsn_slot(void);
 extern void free_optinsn_slot(kprobe_opcode_t *slot, int dirty);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index c824c23..08298ed 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -352,6 +352,13 @@ static inline int kprobe_aggrprobe(struct kprobe *p)
 	return p->pre_handler == aggr_pre_handler;
 }
 
+/* Return true(!0) if the kprobe is unused */
+static inline int kprobe_unused(struct kprobe *p)
+{
+	return kprobe_aggrprobe(p) && kprobe_disabled(p) &&
+	       list_empty(&p->list);
+}
+
 /*
  * Keep all fields in the kprobe consistent
  */
@@ -382,6 +389,17 @@ void __kprobes opt_pre_handler(struct kprobe *p, struct pt_regs *regs)
 	}
 }
 
+/* Free optimized instructions and optimized_kprobe */
+static __kprobes void free_aggr_kprobe(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+
+	op = container_of(p, struct optimized_kprobe, kp);
+	arch_remove_optimized_kprobe(op);
+	arch_remove_kprobe(p);
+	kfree(op);
+}
+
 /* Return true(!0) if the kprobe is ready for optimization. */
 static inline int kprobe_optready(struct kprobe *p)
 {
@@ -395,6 +413,20 @@ static inline int kprobe_optready(struct kprobe *p)
 	return 0;
 }
 
+/* Return true(!0) if the kprobe is disarmed. Note: p must be on hash list */
+static inline int kprobe_disarmed(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+
+	/* If kprobe is not aggr/opt probe, just return kprobe is disabled */
+	if (!kprobe_aggrprobe(p))
+		return kprobe_disabled(p);
+
+	op = container_of(p, struct optimized_kprobe, kp);
+
+	return kprobe_disabled(p) && list_empty(&op->list);
+}
+
 /*
  * Return an optimized kprobe whose optimizing code replaces
  * instructions including addr (exclude breakpoint).
@@ -420,29 +452,19 @@ struct kprobe *__kprobes get_optimized_kprobe(unsigned long addr)
 
 /* Optimization staging list, protected by kprobe_mutex */
 static LIST_HEAD(optimizing_list);
+static LIST_HEAD(unoptimizing_list);
 
 static void kprobe_optimizer(struct work_struct *work);
 static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
 #define OPTIMIZE_DELAY 5
+static DECLARE_COMPLETION(optimizer_comp);
 
-/* Kprobe jump optimizer */
-static __kprobes void kprobe_optimizer(struct work_struct *work)
+static __kprobes void do_unoptimize_kprobes(struct list_head *free_list)
 {
-	/* Lock modules while optimizing kprobes */
-	mutex_lock(&module_mutex);
-	mutex_lock(&kprobe_mutex);
-	if (kprobes_all_disarmed || !kprobes_allow_optimization)
-		goto end;
-
-	/*
-	 * Wait for quiesence period to ensure all running interrupts
-	 * are done. Because optprobe may modify multiple instructions
-	 * there is a chance that Nth instruction is interrupted. In that
-	 * case, running interrupt can return to 2nd-Nth byte of jump
-	 * instruction. This wait is for avoiding it.
-	 */
-	synchronize_sched();
+	struct optimized_kprobe *op, *tmp;
 
+	if (list_empty(&unoptimizing_list))
+		return;
 	/*
 	 * The optimization/unoptimization refers online_cpus via
 	 * stop_machine() and cpu-hotplug modifies online_cpus.
@@ -455,14 +477,93 @@ static __kprobes void kprobe_optimizer(struct work_struct *work)
 	 */
 	get_online_cpus();
 	mutex_lock(&text_mutex);
+	/* Do batch unoptimization */
+	arch_unoptimize_kprobes(&unoptimizing_list, free_list);
+	mutex_unlock(&text_mutex);
+	put_online_cpus();
+
+	list_for_each_entry_safe(op, tmp, free_list, list) {
+		/* Disarm probes if disabled */
+		if (kprobe_disabled(&op->kp))
+			arch_disarm_kprobe(&op->kp);
+		/* leave only unused probes on free_list */
+		if (!kprobe_unused(&op->kp))
+			list_del_init(&op->list);
+		else
+			/*
+			 * Remove unused probe from hash list. After waiting
+			 * for synchronization, this probe can be freed safely
+			 * (which will be freed by do_free_unused_kprobes.)
+			 */
+			hlist_del_rcu(&op->kp.hlist);
+	}
+}
+
+static __kprobes void do_free_unused_kprobes(struct list_head *free_list)
+{
+	struct optimized_kprobe *op, *tmp;
+
+	list_for_each_entry_safe(op, tmp, free_list, list) {
+		list_del_init(&op->list);
+		free_aggr_kprobe(&op->kp);
+	}
+}
+
+static __kprobes void do_optimize_kprobes(void)
+{
+	/* Optimization never allowed when disarmed */
+	if (kprobes_all_disarmed || !kprobes_allow_optimization ||
+	    list_empty(&optimizing_list))
+		return;
+
+	get_online_cpus();
+	mutex_lock(&text_mutex);
 	arch_optimize_kprobes(&optimizing_list);
 	mutex_unlock(&text_mutex);
 	put_online_cpus();
-	if (!list_empty(&optimizing_list))
-		schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
-end:
+}
+
+/* Kprobe jump optimize/unoptimize and collect unused aggrprobes */
+static __kprobes void kprobe_optimizer(struct work_struct *work)
+{
+	LIST_HEAD(free_list);
+
+	/* Lock modules while optimizing kprobes */
+	mutex_lock(&module_mutex);
+	mutex_lock(&kprobe_mutex);
+
+	do_unoptimize_kprobes(&free_list);
+
+	/*
+	 * Wait for quiesence period to ensure all running interrupts
+	 * are done. Because optprobe may modify multiple instructions
+	 * there is a chance that Nth instruction is interrupted. In that
+	 * case, running interrupt can return to 2nd-Nth byte of jump
+	 * instruction. This wait is for avoiding it.
+	 */
+	synchronize_sched();
+
+	do_optimize_kprobes();
+
+	/* Release all releasable probes */
+	do_free_unused_kprobes(&free_list);
+
 	mutex_unlock(&kprobe_mutex);
 	mutex_unlock(&module_mutex);
+
+	if (!list_empty(&optimizing_list) || !list_empty(&unoptimizing_list)) {
+		if (!delayed_work_pending(&optimizing_work))
+			schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
+	} else
+		/* Wake up all waiters */
+		complete_all(&optimizer_comp);
+}
+
+/* Wait for completion */
+static __kprobes void wait_for_optimizer(void)
+{
+	if (delayed_work_pending(&optimizing_work))
+		wait_for_completion(&optimizer_comp);
 }
 
 /* Optimize kprobe if p is ready to be optimized */
@@ -475,6 +576,10 @@ static __kprobes void optimize_kprobe(struct kprobe *p)
 	    (kprobe_disabled(p) || kprobes_all_disarmed))
 		return;
 
+	/* Check if it is already optimized. */
+	if (kprobe_optimized(p))
+		return;
+
 	/* Both of break_handler and post_handler are not supported. */
 	if (p->break_handler || p->post_handler)
 		return;
@@ -485,45 +590,96 @@ static __kprobes void optimize_kprobe(struct kprobe *p)
 	if (arch_check_optimized_kprobe(op) < 0)
 		return;
 
-	/* Check if it is already optimized. */
-	if (op->kp.flags & KPROBE_FLAG_OPTIMIZED)
-		return;
-
 	op->kp.flags |= KPROBE_FLAG_OPTIMIZED;
-	list_add(&op->list, &optimizing_list);
-	if (!delayed_work_pending(&optimizing_work))
-		schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
+	if (!list_empty(&op->list))
+		/* Unoptimizing probe: dequeue probe */
+		list_del_init(&op->list);
+	else {
+		list_add(&op->list, &optimizing_list);
+		if (!delayed_work_pending(&optimizing_work))
+			schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
+	}
 }
 
-/* Unoptimize a kprobe if p is optimized */
-static __kprobes void unoptimize_kprobe(struct kprobe *p)
+static __kprobes void force_unoptimize_kprobe(struct optimized_kprobe *op)
+{
+	get_online_cpus();
+	arch_unoptimize_kprobe(op);
+	put_online_cpus();
+	if (kprobe_disabled(&op->kp))
+		arch_disarm_kprobe(&op->kp);
+}
+
+/* Unoptimize a kprobe if p is optimized. */
+static __kprobes void unoptimize_kprobe(struct kprobe *p, int force)
 {
 	struct optimized_kprobe *op;
 
-	if ((p->flags & KPROBE_FLAG_OPTIMIZED) && kprobe_aggrprobe(p)) {
-		op = container_of(p, struct optimized_kprobe, kp);
-		if (!list_empty(&op->list))
-			/* Dequeue from the optimization queue */
+	if (!kprobe_aggrprobe(p) || kprobe_disarmed(p))
+		return;	/* This is not an optprobe nor optimized */
+
+	op = container_of(p, struct optimized_kprobe, kp);
+	if (!kprobe_optimized(p)) {
+		/* Unoptimized or unoptimizing */
+		if (force && !list_empty(&op->list)) {
+			/*
+			 * If unoptimizing probe and forced option, forcibly
+			 * unoptimize it.
+			*/
 			list_del_init(&op->list);
-		else
-			/* Replace jump with break */
-			arch_unoptimize_kprobe(op);
-		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+			force_unoptimize_kprobe(op);
+		}
+		return;
+	}
+
+	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+	if (!list_empty(&op->list)) {
+		/* Dequeue from the optimization queue */
+		list_del_init(&op->list);
+		return;
+	}
+
+	if (force)	/* Forcibly update the code: this is a special case */
+		force_unoptimize_kprobe(op);
+	else {
+		list_add(&op->list, &unoptimizing_list);
+		if (!delayed_work_pending(&optimizing_work))
+			schedule_delayed_work(&optimizing_work, OPTIMIZE_DELAY);
 	}
 }
 
+/* Cancel unoptimizing for reusing */
+static void reuse_unused_kprobe(struct kprobe *ap)
+{
+	struct optimized_kprobe *op;
+
+	BUG_ON(!kprobe_unused(ap));
+	/*
+	 * Unused kprobe MUST be on the way of delayed unoptimizing (means
+	 * there is still a relative jump) and disabled.
+	 */
+	op = container_of(ap, struct optimized_kprobe, kp);
+	if (unlikely(list_empty(&op->list)))
+		printk(KERN_WARNING "Warning: found a stray unused "
+			"aggrprobe@%p\n", ap->addr);
+	/* Enable the probe again */
+	ap->flags &= ~KPROBE_FLAG_DISABLED;
+	/* Optimize it again (remove from op->list) */
+	BUG_ON(!kprobe_optready(ap));
+	optimize_kprobe(ap);
+}
+
 /* Remove optimized instructions */
 static void __kprobes kill_optimized_kprobe(struct kprobe *p)
 {
 	struct optimized_kprobe *op;
 
 	op = container_of(p, struct optimized_kprobe, kp);
-	if (!list_empty(&op->list)) {
-		/* Dequeue from the optimization queue */
+	if (!list_empty(&op->list))
+		/* Dequeue from the (un)optimization queue */
 		list_del_init(&op->list);
-		op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
-	}
-	/* Don't unoptimize, because the target code will be freed. */
+	op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
+	/* Don't unoptimize, because the target code is already freed. */
 	arch_remove_optimized_kprobe(op);
 }
 
@@ -536,16 +692,6 @@ static __kprobes void prepare_optimized_kprobe(struct kprobe *p)
 	arch_prepare_optimized_kprobe(op);
 }
 
-/* Free optimized instructions and optimized_kprobe */
-static __kprobes void free_aggr_kprobe(struct kprobe *p)
-{
-	struct optimized_kprobe *op;
-
-	op = container_of(p, struct optimized_kprobe, kp);
-	arch_remove_optimized_kprobe(op);
-	kfree(op);
-}
-
 /* Allocate new optimized_kprobe and try to prepare optimized instructions */
 static __kprobes struct kprobe *alloc_aggr_kprobe(struct kprobe *p)
 {
@@ -623,20 +769,15 @@ static void __kprobes unoptimize_all_kprobes(void)
 
 	kprobes_allow_optimization = false;
 	printk(KERN_INFO "Kprobes globally unoptimized\n");
-	get_online_cpus();	/* For avoiding text_mutex deadlock */
-	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
 			if (!kprobe_disabled(p))
-				unoptimize_kprobe(p);
+				unoptimize_kprobe(p, 0);
 		}
 	}
-
-	mutex_unlock(&text_mutex);
-	put_online_cpus();
-	/* Allow all currently running kprobes to complete */
-	synchronize_sched();
+	/* Wait for unoptimizing completion */
+	wait_for_optimizer();
 }
 
 int sysctl_kprobes_optimization;
@@ -667,37 +808,64 @@ static void __kprobes __arm_kprobe(struct kprobe *p)
 	/* Check collision with other optimized kprobes */
 	old_p = get_optimized_kprobe((unsigned long)p->addr);
 	if (unlikely(old_p))
-		unoptimize_kprobe(old_p); /* Fallback to unoptimized kprobe */
+		unoptimize_kprobe(old_p, 1);/* Fallback to unoptimized kprobe */
 
 	arch_arm_kprobe(p);
 	optimize_kprobe(p);	/* Try to optimize (add kprobe to a list) */
 }
 
+/* Return true(!0) if the probe is queued on (un)optimizing lists */
+static int __kprobes kprobe_queued(struct kprobe *p)
+{
+	struct optimized_kprobe *op;
+
+	if (kprobe_aggrprobe(p)) {
+		op = container_of(p, struct optimized_kprobe, kp);
+		if (!list_empty(&op->list))
+			return 1;
+	}
+	return 0;
+}
+
 static void __kprobes __disarm_kprobe(struct kprobe *p)
 {
 	struct kprobe *old_p;
 
-	unoptimize_kprobe(p);	/* Try to unoptimize */
-	arch_disarm_kprobe(p);
+	/* Delayed unoptimizing if needed (add kprobe to a list) */
+	if (kprobe_optimized(p))
+		unoptimize_kprobe(p, 0);
 
-	/* If another kprobe was blocked, optimize it. */
-	old_p = get_optimized_kprobe((unsigned long)p->addr);
-	if (unlikely(old_p))
-		optimize_kprobe(old_p);
+	if (!kprobe_queued(p)) {
+		arch_disarm_kprobe(p);
+		/* If another kprobe was blocked, optimize it. */
+		old_p = get_optimized_kprobe((unsigned long)p->addr);
+		if (unlikely(old_p))
+			optimize_kprobe(old_p);
+	}
 }
 
 #else /* !CONFIG_OPTPROBES */
 
 #define optimize_kprobe(p)			do {} while (0)
-#define unoptimize_kprobe(p)			do {} while (0)
+#define unoptimize_kprobe(p, f)			do {} while (0)
 #define kill_optimized_kprobe(p)		do {} while (0)
 #define prepare_optimized_kprobe(p)		do {} while (0)
 #define try_to_optimize_kprobe(p)		do {} while (0)
 #define __arm_kprobe(p)				arch_arm_kprobe(p)
 #define __disarm_kprobe(p)			arch_disarm_kprobe(p)
+#define kprobe_disarmed(p)			kprobe_disabled(p)
+#define wait_for_optimizer()			do {} while (0)
+
+/* There should be no unused kprobes that can be reused without optimization */
+static void reuse_unused_kprobe(struct kprobe *ap)
+{
+	printk(KERN_ERR "Error: There should be no unused kprobe here.\n");
+	BUG_ON(kprobe_unused(ap));
+}
 
 static __kprobes void free_aggr_kprobe(struct kprobe *p)
 {
+	arch_remove_kprobe(p);
 	kfree(p);
 }
 
@@ -720,16 +888,6 @@ static void __kprobes arm_kprobe(struct kprobe *kp)
 	mutex_unlock(&text_mutex);
 }
 
-/* Disarm a kprobe with text_mutex */
-static void __kprobes disarm_kprobe(struct kprobe *kp)
-{
-	get_online_cpus();	/* For avoiding text_mutex deadlock */
-	mutex_lock(&text_mutex);
-	__disarm_kprobe(kp);
-	mutex_unlock(&text_mutex);
-	put_online_cpus();
-}
-
 /*
  * Aggregate handlers for multiple kprobes support - these handlers
  * take care of invoking the individual kprobe handlers on p->list
@@ -928,7 +1086,7 @@ static int __kprobes add_new_kprobe(struct kprobe *ap, struct kprobe *p)
 	BUG_ON(kprobe_gone(ap) || kprobe_gone(p));
 
 	if (p->break_handler || p->post_handler)
-		unoptimize_kprobe(ap);	/* Fall back to normal kprobe */
+		unoptimize_kprobe(ap, 1);/* Fall back to normal kprobe */
 
 	if (p->break_handler) {
 		if (ap->break_handler)
@@ -991,7 +1149,9 @@ static int __kprobes register_aggr_kprobe(struct kprobe *old_p,
 		if (!ap)
 			return -ENOMEM;
 		init_aggr_kprobe(ap, old_p);
-	}
+	} else if (kprobe_unused(ap))
+		/* This probe is going to die. Rescue it */
+		reuse_unused_kprobe(ap);
 
 	if (kprobe_gone(ap)) {
 		/*
@@ -1025,23 +1185,6 @@ static int __kprobes register_aggr_kprobe(struct kprobe *old_p,
 	return add_new_kprobe(ap, p);
 }
 
-/* Try to disable aggr_kprobe, and return 1 if succeeded.*/
-static int __kprobes try_to_disable_aggr_kprobe(struct kprobe *p)
-{
-	struct kprobe *kp;
-
-	list_for_each_entry_rcu(kp, &p->list, list) {
-		if (!kprobe_disabled(kp))
-			/*
-			 * There is an active probe on the list.
-			 * We can't disable aggr_kprobe.
-			 */
-			return 0;
-	}
-	p->flags |= KPROBE_FLAG_DISABLED;
-	return 1;
-}
-
 static int __kprobes in_kprobes_functions(unsigned long addr)
 {
 	struct kprobe_blackpoint *kb;
@@ -1208,6 +1351,45 @@ out:
 }
 EXPORT_SYMBOL_GPL(register_kprobe);
 
+/* Check if all probes on the aggrprobe are disabled */
+static int __kprobes aggr_kprobe_disabled(struct kprobe *ap)
+{
+	struct kprobe *kp;
+
+	list_for_each_entry_rcu(kp, &ap->list, list) {
+		if (!kprobe_disabled(kp))
+			/*
+			 * There is an active probe on the list.
+			 * We can't disable aggr_kprobe.
+			 */
+			return 0;
+	}
+	return 1;
+}
+
+/* Disable one kprobe: Make sure called under kprobe_mutex is locked */
+static struct kprobe *__kprobes __disable_kprobe(struct kprobe *p)
+{
+	struct kprobe *orig_p;
+
+	/* Check whether specified probe is valid, and get real probe */
+	orig_p = __get_valid_kprobe(p);
+	if (unlikely(orig_p == NULL))
+		return NULL;
+
+	if (!kprobe_disabled(p)) {
+		if (p != orig_p)
+			p->flags |= KPROBE_FLAG_DISABLED;
+
+		if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
+			__disarm_kprobe(orig_p);
+			orig_p->flags |= KPROBE_FLAG_DISABLED;
+		}
+	}
+
+	return orig_p;
+}
+
 /*
  * Unregister a kprobe without a scheduler synchronization.
  */
@@ -1215,22 +1397,18 @@ static int __kprobes __unregister_kprobe_top(struct kprobe *p)
 {
 	struct kprobe *old_p, *list_p;
 
-	old_p = __get_valid_kprobe(p);
+	/* Disable kprobe. This will disarm it if needed. */
+	old_p = __disable_kprobe(p);
 	if (old_p == NULL)
 		return -EINVAL;
 
 	if (old_p == p ||
 	    (kprobe_aggrprobe(old_p) &&
-	     list_is_singular(&old_p->list))) {
-		/*
-		 * Only probe on the hash list. Disarm only if kprobes are
-		 * enabled and not gone - otherwise, the breakpoint would
-		 * already have been removed. We save on flushing icache.
-		 */
-		if (!kprobes_all_disarmed && !kprobe_disabled(old_p))
-			disarm_kprobe(old_p);
+	     list_is_singular(&old_p->list) && kprobe_disarmed(old_p)))
+		/* Only disarmed probe on the hash list or on aggr_probe */
 		hlist_del_rcu(&old_p->hlist);
-	} else {
+	else {
+		/* This probe is on an aggr_probe */
 		if (p->break_handler && !kprobe_gone(p))
 			old_p->break_handler = NULL;
 		if (p->post_handler && !kprobe_gone(p)) {
@@ -1242,16 +1420,12 @@ static int __kprobes __unregister_kprobe_top(struct kprobe *p)
 		}
 noclean:
 		list_del_rcu(&p->list);
-		if (!kprobe_disabled(old_p)) {
-			try_to_disable_aggr_kprobe(old_p);
-			if (!kprobes_all_disarmed) {
-				if (kprobe_disabled(old_p))
-					disarm_kprobe(old_p);
-				else
-					/* Try to optimize this probe again */
-					optimize_kprobe(old_p);
-			}
-		}
+		if (!kprobe_disabled(old_p) && !kprobes_all_disarmed)
+			/*
+			 * Try to optimize this probe again,
+			 * because post handler may be changed.
+			 */
+			optimize_kprobe(old_p);
 	}
 	return 0;
 }
@@ -1260,13 +1434,13 @@ static void __kprobes __unregister_kprobe_bottom(struct kprobe *p)
 {
 	struct kprobe *old_p;
 
+	/* If this probe was removed from hlist, remove kprobes insn buffer */
 	if (list_empty(&p->list))
 		arch_remove_kprobe(p);
 	else if (list_is_singular(&p->list)) {
 		/* "p" is the last child of an aggr_kprobe */
 		old_p = list_entry(p->list.next, struct kprobe, list);
 		list_del(&p->list);
-		arch_remove_kprobe(old_p);
 		free_aggr_kprobe(old_p);
 	}
 }
@@ -1308,6 +1482,7 @@ void __kprobes unregister_kprobes(struct kprobe **kps, int num)
 	mutex_unlock(&kprobe_mutex);
 
 	synchronize_sched();
+
 	for (i = 0; i < num; i++)
 		if (kps[i]->addr)
 			__unregister_kprobe_bottom(kps[i]);
@@ -1585,29 +1760,10 @@ static void __kprobes kill_kprobe(struct kprobe *p)
 int __kprobes disable_kprobe(struct kprobe *kp)
 {
 	int ret = 0;
-	struct kprobe *p;
 
 	mutex_lock(&kprobe_mutex);
-
-	/* Check whether specified probe is valid. */
-	p = __get_valid_kprobe(kp);
-	if (unlikely(p == NULL)) {
+	if (__disable_kprobe(kp) == NULL)
 		ret = -EINVAL;
-		goto out;
-	}
-
-	/* If the probe is already disabled (or gone), just return */
-	if (kprobe_disabled(kp))
-		goto out;
-
-	kp->flags |= KPROBE_FLAG_DISABLED;
-	if (p != kp)
-		/* When kp != p, p is always enabled. */
-		try_to_disable_aggr_kprobe(p);
-
-	if (!kprobes_all_disarmed && kprobe_disabled(p))
-		disarm_kprobe(p);
-out:
 	mutex_unlock(&kprobe_mutex);
 	return ret;
 }
@@ -1928,8 +2084,8 @@ static void __kprobes disarm_all_kprobes(void)
 	mutex_unlock(&text_mutex);
 	put_online_cpus();
 	mutex_unlock(&kprobe_mutex);
-	/* Allow all currently running kprobes to complete */
-	synchronize_sched();
+
+	wait_for_optimizer();
 	return;
 
 already_disabled:


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize
  2010-05-10 17:53 ` [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize Masami Hiramatsu
@ 2010-05-11 12:35   ` Mathieu Desnoyers
  2010-05-11 20:06     ` Masami Hiramatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2010-05-11 12:35 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Remove text_mutex locking in optimize_all_kprobes(), because
> that function doesn't modify text but just orders optimization
> to the worker.

Hi Masami,

A few comments:

1) optimize_all_kprobes/unoptimize_all_kprobes should have comments saying that
they are always called with kprobe_mutex held.

2) The sentence above in the changelog could be changed into:

 ..."because this function doesn't modify text. It simply queues optimizations
for the kprobe_optimizer worker thread."

3)

static DEFINE_MUTEX(kprobe_mutex);      /* Protects kprobe_table */

.. should also state that it protects optimizing_list.

Thanks,

Mathieu

> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Jim Keniston <jkenisto@us.ibm.com>
> Cc: Jason Baron <jbaron@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
> 
>  kernel/kprobes.c |    2 --
>  1 files changed, 0 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 282035f..1d34eef 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -606,14 +606,12 @@ static void __kprobes optimize_all_kprobes(void)
>  		return;
>  
>  	kprobes_allow_optimization = true;
> -	mutex_lock(&text_mutex);
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
>  		hlist_for_each_entry_rcu(p, node, head, hlist)
>  			if (!kprobe_disabled(p))
>  				optimize_kprobe(p);
>  	}
> -	mutex_unlock(&text_mutex);
>  	printk(KERN_INFO "Kprobes globally optimized\n");
>  }
>  
> 
> 
> -- 
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-10 17:53 ` [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch Masami Hiramatsu
@ 2010-05-11 14:40   ` Mathieu Desnoyers
  2010-05-12  0:41     ` Masami Hiramatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2010-05-11 14:40 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Use text_poke_smp_batch() in the optimization path to reduce
> the number of stop_machine() calls.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Jim Keniston <jkenisto@us.ibm.com>
> Cc: Jason Baron <jbaron@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
> 
>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
>  include/linux/kprobes.h   |    2 +-
>  kernel/kprobes.c          |   13 +------------
>  3 files changed, 32 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> index 345a4b1..63a5c24 100644
> --- a/arch/x86/kernel/kprobes.c
> +++ b/arch/x86/kernel/kprobes.c
> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
>  	return 0;
>  }
>  
> -/* Replace a breakpoint (int3) with a relative jump.  */
> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
> +#define MAX_OPTIMIZE_PROBES 256

So what kind of interrupt latency does a 256-probe batch generate on the
system ?  Are we talking about a few milliseconds, a few seconds ?

Thanks,

Mathieu

> +static struct text_poke_param jump_params[MAX_OPTIMIZE_PROBES];
> +static char jump_code_buf[MAX_OPTIMIZE_PROBES][RELATIVEJUMP_SIZE];
> +
> +static void __kprobes setup_optimize_kprobe(struct text_poke_param *tprm,
> +					    char *insn_buf,
> +					    struct optimized_kprobe *op)
>  {
> -	unsigned char jmp_code[RELATIVEJUMP_SIZE];
>  	s32 rel = (s32)((long)op->optinsn.insn -
>  			((long)op->kp.addr + RELATIVEJUMP_SIZE));
>  
> @@ -1396,16 +1400,35 @@ int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
>  	memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_SIZE,
>  	       RELATIVE_ADDR_SIZE);
>  
> -	jmp_code[0] = RELATIVEJUMP_OPCODE;
> -	*(s32 *)(&jmp_code[1]) = rel;
> +	insn_buf[0] = RELATIVEJUMP_OPCODE;
> +	*(s32 *)(&insn_buf[1]) = rel;
> +
> +	tprm->addr = op->kp.addr;
> +	tprm->opcode = insn_buf;
> +	tprm->len = RELATIVEJUMP_SIZE;
> +}
> +
> +/* Replace a breakpoint (int3) with a relative jump.  */
> +void __kprobes arch_optimize_kprobes(struct list_head *oplist)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +	int c = 0;
> +
> +	list_for_each_entry_safe(op, tmp, oplist, list) {
> +		WARN_ON(kprobe_disabled(&op->kp));
> +		/* Setup param */
> +		setup_optimize_kprobe(&jump_params[c], jump_code_buf[c], op);
> +		list_del_init(&op->list);
> +		if (++c >= MAX_OPTIMIZE_PROBES)
> +			break;
> +	}
>  
>  	/*
>  	 * text_poke_smp doesn't support NMI/MCE code modifying.
>  	 * However, since kprobes itself also doesn't support NMI/MCE
>  	 * code probing, it's not a problem.
>  	 */
> -	text_poke_smp(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE);
> -	return 0;
> +	text_poke_smp_batch(jump_params, c);
>  }
>  
>  /* Replace a relative jump with a breakpoint (int3).  */
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index e7d1b2e..fe157ba 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -275,7 +275,7 @@ extern int arch_prepared_optinsn(struct arch_optimized_insn *optinsn);
>  extern int arch_check_optimized_kprobe(struct optimized_kprobe *op);
>  extern int arch_prepare_optimized_kprobe(struct optimized_kprobe *op);
>  extern void arch_remove_optimized_kprobe(struct optimized_kprobe *op);
> -extern int  arch_optimize_kprobe(struct optimized_kprobe *op);
> +extern void arch_optimize_kprobes(struct list_head *oplist);
>  extern void arch_unoptimize_kprobe(struct optimized_kprobe *op);
>  extern kprobe_opcode_t *get_optinsn_slot(void);
>  extern void free_optinsn_slot(kprobe_opcode_t *slot, int dirty);
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index aae368a..c824c23 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -424,14 +424,10 @@ static LIST_HEAD(optimizing_list);
>  static void kprobe_optimizer(struct work_struct *work);
>  static DECLARE_DELAYED_WORK(optimizing_work, kprobe_optimizer);
>  #define OPTIMIZE_DELAY 5
> -#define MAX_OPTIMIZE_PROBES 64
>  
>  /* Kprobe jump optimizer */
>  static __kprobes void kprobe_optimizer(struct work_struct *work)
>  {
> -	struct optimized_kprobe *op, *tmp;
> -	int c = 0;
> -
>  	/* Lock modules while optimizing kprobes */
>  	mutex_lock(&module_mutex);
>  	mutex_lock(&kprobe_mutex);
> @@ -459,14 +455,7 @@ static __kprobes void kprobe_optimizer(struct work_struct *work)
>  	 */
>  	get_online_cpus();
>  	mutex_lock(&text_mutex);
> -	list_for_each_entry_safe(op, tmp, &optimizing_list, list) {
> -		WARN_ON(kprobe_disabled(&op->kp));
> -		if (arch_optimize_kprobe(op) < 0)
> -			op->kp.flags &= ~KPROBE_FLAG_OPTIMIZED;
> -		list_del_init(&op->list);
> -		if (++c >= MAX_OPTIMIZE_PROBES)
> -			break;
> -	}
> +	arch_optimize_kprobes(&optimizing_list);
>  	mutex_unlock(&text_mutex);
>  	put_online_cpus();
>  	if (!list_empty(&optimizing_list))
> 
> 
> -- 
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize
  2010-05-11 12:35   ` Mathieu Desnoyers
@ 2010-05-11 20:06     ` Masami Hiramatsu
  0 siblings, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-11 20:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

Hi Mathieu,

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Remove text_mutex locking in optimize_all_kprobes(), because
>> that function doesn't modify text but just orders optimization
>> to the worker.
> 
> Hi Masami,
> 
> A few comments:
> 
> 1) optimize_all_kprobes/unoptimize_all_kprobes should have comments saying that
> they are always called with kprobe_mutex held.
> 
> 2) The sentence above in the changelog could be changed into:
> 
>  ..."because this function doesn't modify text. It simply queues optimizations
> for the kprobe_optimizer worker thread."
> 
> 3)
> 
> static DEFINE_MUTEX(kprobe_mutex);      /* Protects kprobe_table */
> 
> .. should also state that it protects optimizing_list.
> 

Thanks! All comments are good to me!

Thank you again,




-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-11 14:40   ` Mathieu Desnoyers
@ 2010-05-12  0:41     ` Masami Hiramatsu
  2010-05-12 15:27       ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-12  0:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Use text_poke_smp_batch() in the optimization path to reduce
>> the number of stop_machine() calls.
>>
>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
>> Cc: Ingo Molnar <mingo@elte.hu>
>> Cc: Jim Keniston <jkenisto@us.ibm.com>
>> Cc: Jason Baron <jbaron@redhat.com>
>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> ---
>>
>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
>>  include/linux/kprobes.h   |    2 +-
>>  kernel/kprobes.c          |   13 +------------
>>  3 files changed, 32 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
>> index 345a4b1..63a5c24 100644
>> --- a/arch/x86/kernel/kprobes.c
>> +++ b/arch/x86/kernel/kprobes.c
>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
>>  	return 0;
>>  }
>>  
>> -/* Replace a breakpoint (int3) with a relative jump.  */
>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
>> +#define MAX_OPTIMIZE_PROBES 256
> 
> So what kind of interrupt latency does a 256-probe batch generate on the
> system ?  Are we talking about a few milliseconds, a few seconds ?

From my experiment on kvm/4cpu, it took about 3 seconds on average.
With this patch, it went down to 30ms. (x100 faster :))

Thank you,
-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-12  0:41     ` Masami Hiramatsu
@ 2010-05-12 15:27       ` Mathieu Desnoyers
  2010-05-12 17:43         ` Masami Hiramatsu
  2010-05-13 19:07         ` Masami Hiramatsu
  0 siblings, 2 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 15:27 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >> Use text_poke_smp_batch() in the optimization path to reduce
> >> the number of stop_machine() calls.
> >>
> >> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> >> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> >> Cc: Ingo Molnar <mingo@elte.hu>
> >> Cc: Jim Keniston <jkenisto@us.ibm.com>
> >> Cc: Jason Baron <jbaron@redhat.com>
> >> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> ---
> >>
> >>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
> >>  include/linux/kprobes.h   |    2 +-
> >>  kernel/kprobes.c          |   13 +------------
> >>  3 files changed, 32 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> >> index 345a4b1..63a5c24 100644
> >> --- a/arch/x86/kernel/kprobes.c
> >> +++ b/arch/x86/kernel/kprobes.c
> >> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
> >>  	return 0;
> >>  }
> >>  
> >> -/* Replace a breakpoint (int3) with a relative jump.  */
> >> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
> >> +#define MAX_OPTIMIZE_PROBES 256
> > 
> > So what kind of interrupt latency does a 256-probe batch generate on the
> > system ?  Are we talking about a few milliseconds, a few seconds ?
> 
> From my experiment on kvm/4cpu, it took about 3 seconds on average.

That's 3 seconds for multiple calls to stop_machine(). So we can expect
latencies in the area of a few microseconds for each call, right ?

> With this patch, it went down to 30ms. (x100 faster :))

This is beefing up the latency from a few microseconds to 30ms. It sounds like a
regression rather than a gain to me.

Thanks,

Mathieu

> 
> Thank you,
> -- 
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-12 15:27       ` Mathieu Desnoyers
@ 2010-05-12 17:43         ` Masami Hiramatsu
  2010-05-12 17:48           ` Mathieu Desnoyers
  2010-05-13 19:07         ` Masami Hiramatsu
  1 sibling, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-12 17:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Use text_poke_smp_batch() in the optimization path to reduce
>>>> the number of stop_machine() calls.
>>>>
>>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>>>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
>>>> Cc: Ingo Molnar <mingo@elte.hu>
>>>> Cc: Jim Keniston <jkenisto@us.ibm.com>
>>>> Cc: Jason Baron <jbaron@redhat.com>
>>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> ---
>>>>
>>>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
>>>>  include/linux/kprobes.h   |    2 +-
>>>>  kernel/kprobes.c          |   13 +------------
>>>>  3 files changed, 32 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
>>>> index 345a4b1..63a5c24 100644
>>>> --- a/arch/x86/kernel/kprobes.c
>>>> +++ b/arch/x86/kernel/kprobes.c
>>>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> -/* Replace a breakpoint (int3) with a relative jump.  */
>>>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
>>>> +#define MAX_OPTIMIZE_PROBES 256
>>>
>>> So what kind of interrupt latency does a 256-probes batch generate on the
>>> system ?  Are we talking about a few milliseconds, a few seconds ?
>>
>> From my experiment on kvm/4cpu, it took about 3 seconds in average.
> 
> That's 3 seconds for multiple calls to stop_machine(). So we can expect
> latencies in the area of few microseconds for each call, right ?

Theoretically, yes.
But if we register more than 1000 probes at once, the machine can hardly do
anything except optimizing for a while (more than 10 sec), because it stops
the machine so frequently.
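
For illustration, the batch path in this series boils down to queueing the
jump parameters and doing all the writes under a single stop_machine() call.
A stripped-down sketch (the struct layout follows the patch, but the helper
below is simplified; the real code also synchronizes the other CPUs and
flushes their icaches):

#include <linux/stop_machine.h>
#include <asm/alternative.h>	/* text_poke() */

/* Stripped-down sketch of text_poke_smp_batch(); details differ in the patch */
struct text_poke_param {
	void		*addr;		/* instruction to patch */
	const void	*opcode;	/* new bytes (e.g. a relative jump) */
	size_t		len;
};

struct text_poke_params {
	struct text_poke_param	*params;
	int			nparams;
};

static int patch_all_params(void *data)
{
	struct text_poke_params *tpp = data;
	int i;

	/* Runs while every other CPU spins with interrupts disabled */
	for (i = 0; i < tpp->nparams; i++)
		text_poke(tpp->params[i].addr, tpp->params[i].opcode,
			  tpp->params[i].len);
	return 0;
}

void text_poke_smp_batch(struct text_poke_param *params, int n)
{
	struct text_poke_params tpp = { .params = params, .nparams = n };

	/* One stop_machine() covers all n pokes instead of n separate calls */
	stop_machine(patch_all_params, &tpp, NULL);
}

So registering 1000 probes costs roughly 1000/256 = 4 stop_machine() calls
instead of 1000.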

>> With this patch, it went down to 30ms. (x100 faster :))
> 
> This is beefing up the latency from few microseconds to 30ms. It sounds like a
> regression rather than a gain to me.

If that is not acceptable, I can add a knob to control how many probes are
optimized/unoptimized at once. In any case, the latency is predictable (it
happens right after registering/unregistering probes) and it will be small
if only a few probes are in place (30ms is the worst case).
And if you want, it can be disabled via sysctl.
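
Just to sketch what I mean (hypothetical, for illustration only -- not part
of this series): a sysctl capping the batch size, which could sit next to
the existing debug.kprobes-optimization entry; the names and defaults below
are made up:

#include <linux/sysctl.h>

/* Hypothetical knob, for illustration only -- not part of this series */
static int kprobes_batch_max = 256;	/* probes patched per stop_machine() */
static int batch_min;			/* 0 = batching effectively disabled */
static int batch_limit = 256;		/* cap at MAX_OPTIMIZE_PROBES */

static struct ctl_table kprobe_batch_table[] = {
	{
		.procname	= "kprobes-batch-max",
		.data		= &kprobes_batch_max,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &batch_min,
		.extra2		= &batch_limit,
	},
	{ }
};

The optimizer would then simply stop filling the parameter array once it has
queued kprobes_batch_max probes for the current pass.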

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-12 17:43         ` Masami Hiramatsu
@ 2010-05-12 17:48           ` Mathieu Desnoyers
  2010-05-12 19:11             ` Masami Hiramatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2010-05-12 17:48 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >> Mathieu Desnoyers wrote:
> >>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >>>> Use text_poke_smp_batch() in optimization path for reducing
> >>>> the number of stop_machine() issues.
> >>>>
> >>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> >>>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> >>>> Cc: Ingo Molnar <mingo@elte.hu>
> >>>> Cc: Jim Keniston <jkenisto@us.ibm.com>
> >>>> Cc: Jason Baron <jbaron@redhat.com>
> >>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>> ---
> >>>>
> >>>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
> >>>>  include/linux/kprobes.h   |    2 +-
> >>>>  kernel/kprobes.c          |   13 +------------
> >>>>  3 files changed, 32 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> >>>> index 345a4b1..63a5c24 100644
> >>>> --- a/arch/x86/kernel/kprobes.c
> >>>> +++ b/arch/x86/kernel/kprobes.c
> >>>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> -/* Replace a breakpoint (int3) with a relative jump.  */
> >>>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
> >>>> +#define MAX_OPTIMIZE_PROBES 256
> >>>
> >>> So what kind of interrupt latency does a 256-probes batch generate on the
> >>> system ?  Are we talking about a few milliseconds, a few seconds ?
> >>
> >> From my experiment on kvm/4cpu, it took about 3 seconds in average.
> > 
> > That's 3 seconds for multiple calls to stop_machine(). So we can expect
> > latencies in the area of few microseconds for each call, right ?
> 
> Theoretically yes.
> But if we register more than 1000 probes at once, it's hard to do
> anything except optimizing a while(more than 10 sec), because
> it stops machine so frequently.
> 
> >> With this patch, it went down to 30ms. (x100 faster :))
> > 
> > This is beefing up the latency from few microseconds to 30ms. It sounds like a
> > regression rather than a gain to me.
> 
> If it is not acceptable, I can add a knob for control how many probes
> optimize/unoptimize at once. Anyway, it is expectable latency (after
> registering/unregistering probes) and it will be small if we put a few probes.
> (30ms is the worst case)
> And if you want, it can be disabled by sysctl.

I think we are starting to see that the stop_machine() approach really limits
our ability to do even a relatively small amount of work without hurting
responsiveness significantly.

What's the current showstopper with the breakpoint-bypass-ipi approach that
solves this issue properly and makes this batching approach unnecessary?
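
(For reference, the idea is roughly the following. This is only a sketch of
the sequence, not mergeable code: text_poke(), on_each_cpu() and sync_core()
are the existing primitives, everything else here is made up for
illustration, and the int3 fixup details are exactly the part that still
needs a definitive answer.)

#include <linux/smp.h>		/* on_each_cpu() */
#include <linux/types.h>
#include <asm/alternative.h>	/* text_poke() */
#include <asm/processor.h>	/* sync_core() */

/*
 * Sketch of int3-based cross-modifying code, no stop_machine() needed:
 *  1. put an int3 on the first byte of the target instruction,
 *  2. make every CPU serialize so they all see the int3 (the int3 handler,
 *     not shown, checks bp_patching_in_progress and sets regs->ip to
 *     bp_int3_resume for any CPU hitting the site during the transition),
 *  3. write the remaining bytes of the new instruction, serialize again,
 *  4. replace the int3 with the first byte of the new instruction,
 *  5. serialize one last time and clear the in-progress flag.
 */
static void *bp_int3_resume;		/* where a trapped CPU should continue */
static bool bp_patching_in_progress;

static void do_sync_core(void *info)
{
	sync_core();			/* serializing instruction on this CPU */
}

static void text_poke_bp_sketch(void *addr, const void *opcode, size_t len,
				void *resume)
{
	unsigned char int3 = 0xcc;

	bp_int3_resume = resume;
	bp_patching_in_progress = true;

	text_poke(addr, &int3, 1);			/* step 1 */
	on_each_cpu(do_sync_core, NULL, 1);		/* step 2 */

	if (len > 1) {
		text_poke((u8 *)addr + 1, (const u8 *)opcode + 1, len - 1);
		on_each_cpu(do_sync_core, NULL, 1);	/* step 3 */
	}

	text_poke(addr, opcode, 1);			/* step 4 */
	on_each_cpu(do_sync_core, NULL, 1);		/* step 5 */
	bp_patching_in_progress = false;
}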

Thanks,

Mathieu

> 
> Thank you,
> 
> -- 
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-12 17:48           ` Mathieu Desnoyers
@ 2010-05-12 19:11             ` Masami Hiramatsu
  0 siblings, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-12 19:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron, H. Peter Anvin

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Mathieu Desnoyers wrote:
>>>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>>>> Use text_poke_smp_batch() in optimization path for reducing
>>>>>> the number of stop_machine() issues.
>>>>>>
>>>>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>>>>>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
>>>>>> Cc: Ingo Molnar <mingo@elte.hu>
>>>>>> Cc: Jim Keniston <jkenisto@us.ibm.com>
>>>>>> Cc: Jason Baron <jbaron@redhat.com>
>>>>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>>>> ---
>>>>>>
>>>>>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
>>>>>>  include/linux/kprobes.h   |    2 +-
>>>>>>  kernel/kprobes.c          |   13 +------------
>>>>>>  3 files changed, 32 insertions(+), 20 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
>>>>>> index 345a4b1..63a5c24 100644
>>>>>> --- a/arch/x86/kernel/kprobes.c
>>>>>> +++ b/arch/x86/kernel/kprobes.c
>>>>>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
>>>>>>  	return 0;
>>>>>>  }
>>>>>>  
>>>>>> -/* Replace a breakpoint (int3) with a relative jump.  */
>>>>>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
>>>>>> +#define MAX_OPTIMIZE_PROBES 256
>>>>>
>>>>> So what kind of interrupt latency does a 256-probes batch generate on the
>>>>> system ?  Are we talking about a few milliseconds, a few seconds ?
>>>>
>>>> From my experiment on kvm/4cpu, it took about 3 seconds in average.
>>>
>>> That's 3 seconds for multiple calls to stop_machine(). So we can expect
>>> latencies in the area of few microseconds for each call, right ?
>>
>> Theoretically yes.
>> But if we register more than 1000 probes at once, it's hard to do
>> anything except optimizing a while(more than 10 sec), because
>> it stops machine so frequently.
>>
>>>> With this patch, it went down to 30ms. (x100 faster :))
>>>
>>> This is beefing up the latency from few microseconds to 30ms. It sounds like a
>>> regression rather than a gain to me.
>>
>> If it is not acceptable, I can add a knob for control how many probes
>> optimize/unoptimize at once. Anyway, it is expectable latency (after
>> registering/unregistering probes) and it will be small if we put a few probes.
>> (30ms is the worst case)
>> And if you want, it can be disabled by sysctl.
> 
> I think we are starting to see the stop_machine() approach is really limiting
> our ability to do even relatively small amount of work without hurting
> responsiveness significantly.
> 
> What's the current showstopper with the breakpoint-bypass-ipi approach that
> solves this issue properly and makes this batching approach unnecessary ?

We still do not have any official answer from the chip vendors.
As you know, the basic implementation has already been done.

Thank you,

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-12 15:27       ` Mathieu Desnoyers
  2010-05-12 17:43         ` Masami Hiramatsu
@ 2010-05-13 19:07         ` Masami Hiramatsu
  2010-05-13 21:20           ` Mathieu Desnoyers
  1 sibling, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2010-05-13 19:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Use text_poke_smp_batch() in optimization path for reducing
>>>> the number of stop_machine() issues.
>>>>
>>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>>>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
>>>> Cc: Ingo Molnar <mingo@elte.hu>
>>>> Cc: Jim Keniston <jkenisto@us.ibm.com>
>>>> Cc: Jason Baron <jbaron@redhat.com>
>>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> ---
>>>>
>>>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
>>>>  include/linux/kprobes.h   |    2 +-
>>>>  kernel/kprobes.c          |   13 +------------
>>>>  3 files changed, 32 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
>>>> index 345a4b1..63a5c24 100644
>>>> --- a/arch/x86/kernel/kprobes.c
>>>> +++ b/arch/x86/kernel/kprobes.c
>>>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> -/* Replace a breakpoint (int3) with a relative jump.  */
>>>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
>>>> +#define MAX_OPTIMIZE_PROBES 256
>>>
>>> So what kind of interrupt latency does a 256-probes batch generate on the
>>> system ?  Are we talking about a few milliseconds, a few seconds ?
>>
>> From my experiment on kvm/4cpu, it took about 3 seconds in average.
> 
> That's 3 seconds for multiple calls to stop_machine(). So we can expect
> latencies in the area of few microseconds for each call, right ?

Sorry, my bad. A non-tuned kvm guest is really slow...
I've checked it again on a *bare machine* (4-core Xeon 2.33GHz, 4 cpus).
I found that even without this patch, optimizing 256 probes took 770us on
average (min 150us, max 3.3ms).
With this patch, it went down to 90us on average (min 14us, max 324us!).

Isn't that low enough latency? :)

>> With this patch, it went down to 30ms. (x100 faster :))
> 
> This is beefing up the latency from few microseconds to 30ms. It sounds like a
> regression rather than a gain to me.

So it takes just 90us. I hope that is acceptable.

Thank you,


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch
  2010-05-13 19:07         ` Masami Hiramatsu
@ 2010-05-13 21:20           ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2010-05-13 21:20 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, lkml, systemtap, DLE, Ananth N Mavinakayanahalli,
	Jim Keniston, Jason Baron

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >> Mathieu Desnoyers wrote:
> >>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >>>> Use text_poke_smp_batch() in optimization path for reducing
> >>>> the number of stop_machine() issues.
> >>>>
> >>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> >>>> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> >>>> Cc: Ingo Molnar <mingo@elte.hu>
> >>>> Cc: Jim Keniston <jkenisto@us.ibm.com>
> >>>> Cc: Jason Baron <jbaron@redhat.com>
> >>>> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>> ---
> >>>>
> >>>>  arch/x86/kernel/kprobes.c |   37 ++++++++++++++++++++++++++++++-------
> >>>>  include/linux/kprobes.h   |    2 +-
> >>>>  kernel/kprobes.c          |   13 +------------
> >>>>  3 files changed, 32 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> >>>> index 345a4b1..63a5c24 100644
> >>>> --- a/arch/x86/kernel/kprobes.c
> >>>> +++ b/arch/x86/kernel/kprobes.c
> >>>> @@ -1385,10 +1385,14 @@ int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> -/* Replace a breakpoint (int3) with a relative jump.  */
> >>>> -int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
> >>>> +#define MAX_OPTIMIZE_PROBES 256
> >>>
> >>> So what kind of interrupt latency does a 256-probes batch generate on the
> >>> system ?  Are we talking about a few milliseconds, a few seconds ?
> >>
> >> From my experiment on kvm/4cpu, it took about 3 seconds in average.
> > 
> > That's 3 seconds for multiple calls to stop_machine(). So we can expect
> > latencies in the area of few microseconds for each call, right ?
> 
> Sorry, my bad. Non tuned kvm guest is so slow...
> I've tried to check it again on *bare machine* (4core Xeon 2.33GHz, 4cpu).
> I found that even without this patch, optimizing 256 probes took 770us in
> average (min 150us, max 3.3ms.)
> With this patch, it went down to 90us in average (min 14us, max 324us!)
> 
> Isn't it enough low latency? :)
> 
> >> With this patch, it went down to 30ms. (x100 faster :))
> > 
> > This is beefing up the latency from few microseconds to 30ms. It sounds like a
> > regression rather than a gain to me.
> 
> so, it just takes 90us. I hope it is acceptable.

Yes, this is far below the scheduler tick, which is much more acceptable.

Thanks,

Mathieu

> 
> Thank you,
> 
> 
> -- 
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-05-13 21:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-10 17:53 [PATCH -tip 0/5] kprobes: batch (un)optimization support Masami Hiramatsu
2010-05-10 17:53 ` [PATCH -tip 1/5] [CLEANUP] kprobes: Remove redundant text_mutex lock in optimize Masami Hiramatsu
2010-05-11 12:35   ` Mathieu Desnoyers
2010-05-11 20:06     ` Masami Hiramatsu
2010-05-10 17:53 ` [PATCH -tip 2/5] kprobes: Limit maximum number of optimization at once Masami Hiramatsu
2010-05-10 17:53 ` [PATCH -tip 3/5] x86: Introduce text_poke_smp_batch() for batch-code modifying Masami Hiramatsu
2010-05-10 17:53 ` [PATCH -tip 4/5] kprobes/x86: Use text_poke_smp_batch Masami Hiramatsu
2010-05-11 14:40   ` Mathieu Desnoyers
2010-05-12  0:41     ` Masami Hiramatsu
2010-05-12 15:27       ` Mathieu Desnoyers
2010-05-12 17:43         ` Masami Hiramatsu
2010-05-12 17:48           ` Mathieu Desnoyers
2010-05-12 19:11             ` Masami Hiramatsu
2010-05-13 19:07         ` Masami Hiramatsu
2010-05-13 21:20           ` Mathieu Desnoyers
2010-05-10 17:53 ` [PATCH -tip 5/5] kprobes: Support delayed unoptimization Masami Hiramatsu
