* [PATCH 1/7] perf: Ensure we do not free event->parent before event
2013-02-04 19:02 [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
From: Oleg Nesterov @ 2013-02-04 19:02 UTC
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
__perf_event_exit_task() does:
sync_child_event(event)
free_event(event)
sync_child_event() does put_event(event->parent), which can actually
free ->parent. This means that event->destroy(event) is called with
->parent pointing to already-freed memory.
perf_free_event() likewise does put_event(parent) before free_event(child).
Afaics, currently this is fine. But the tracing "subclasses" (like
trace_uprobe) may want to track the events and their parents, and even
the fact that parent->destroy() is called before child->destroy() can
complicate this.
Move this put_event() from sync_child_event() to its single caller,
__perf_event_exit_task(). Change perf_free_event() the same way.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/events/core.c | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 301079d..1b2e516 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6787,12 +6787,6 @@ static void sync_child_event(struct perf_event *child_event,
mutex_lock(&parent_event->child_mutex);
list_del_init(&child_event->child_list);
mutex_unlock(&parent_event->child_mutex);
-
- /*
- * Release the parent event, if this was the last
- * reference to it.
- */
- put_event(parent_event);
}
static void
@@ -6800,7 +6794,9 @@ __perf_event_exit_task(struct perf_event *child_event,
struct perf_event_context *child_ctx,
struct task_struct *child)
{
- if (child_event->parent) {
+ struct perf_event *parent_event = child_event->parent;
+
+ if (parent_event) {
raw_spin_lock_irq(&child_ctx->lock);
perf_group_detach(child_event);
raw_spin_unlock_irq(&child_ctx->lock);
@@ -6813,9 +6809,14 @@ __perf_event_exit_task(struct perf_event *child_event,
* that are still around due to the child reference. These
* events need to be zapped.
*/
- if (child_event->parent) {
+ if (parent_event) {
sync_child_event(child_event, child);
free_event(child_event);
+ /*
+ * Release the parent event, if this was the last
+ * reference to it.
+ */
+ put_event(parent_event);
}
}
@@ -6867,8 +6868,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* We can recurse on the same lock type through:
*
* __perf_event_exit_task()
- * sync_child_event()
- * put_event()
+ * put_event(parent_event)
* mutex_lock(&ctx->mutex)
*
* But since its the parent context it won't be the same instance.
@@ -6937,11 +6937,11 @@ static void perf_free_event(struct perf_event *event,
list_del_init(&event->child_list);
mutex_unlock(&parent->child_mutex);
- put_event(parent);
-
perf_group_detach(event);
list_del_event(event, ctx);
free_event(event);
+
+ put_event(parent);
}
/*
--
1.5.5.1
* Re: [PATCH 1/7] perf: Ensure we do not free event->parent before event
From: Jiri Olsa @ 2013-03-20 13:35 UTC
To: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo, Oleg Nesterov, Ingo Molnar,
Peter Zijlstra, Srikar Dronamraju, Steven Rostedt, Anton Arapov,
Frank Eigler, Josh Stone, Masami Hiramatsu, Suzuki K. Poulose,
linux-kernel
On Mon, Feb 04, 2013 at 08:02:43PM +0100, Oleg Nesterov wrote:
> __perf_event_exit_task() does:
>
> sync_child_event(event)
> free_event(event)
>
> sync_child_event() does put_event(event->parent), which can actually
> free ->parent. This means that event->destroy(event) is called with
> ->parent pointing to already-freed memory.
>
> perf_free_event() likewise does put_event(parent) before free_event(child).
>
> Afaics, currently this is fine. But the tracing "subclasses" (like
> trace_uprobe) may want to track the events and their parents, and even
> the fact that parent->destroy() is called before child->destroy() can
> complicate this.
>
> Move this put_event() from sync_child_event() to its single caller,
> __perf_event_exit_task(). Change perf_free_event() the same way.
looks ok to me, could prevent future headaches ;-)
jirka
> [...]
* [PATCH 2/7] perf: Introduce hw_perf_event->tp_target and ->tp_list
From: Oleg Nesterov @ 2013-02-04 19:02 UTC
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
sys_perf_event_open()->perf_init_event(event) is called before
find_get_context(event), this means that event->ctx == NULL when
class->reg(TRACE_REG_PERF_REGISTER/OPEN) is called and thus it
can't know if this event is per-task or system-wide.
This patch adds hw_perf_event->tp_target for PERF_TYPE_TRACEPOINT,
this is analogous to the PERF_TYPE_BREAKPOINT/bp_target we already have.
The patch also moves ->bp_target up so that it overlaps the new
member; this can help the compiler generate better code.
trace_uprobe_register() will use it for prefiltering, to avoid installing
unnecessary breakpoints in mm's we do not want to trace.
->tp_target doesn't have its own reference, but we can rely on the
fact that either sys_perf_event_open() holds a reference, or it is
equal to event->ctx->task. So this pointer is always valid until
free_event().
Also add the "struct list_head tp_list" into this union. It is not
strictly necessary, but it can simplify the next changes and we can
add it for free.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/perf_event.h | 9 +++++++--
kernel/events/core.c | 5 ++++-
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2fa..c9775e9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -135,16 +135,21 @@ struct hw_perf_event {
struct { /* software */
struct hrtimer hrtimer;
};
+ struct { /* tracepoint */
+ struct task_struct *tp_target;
+ /* for tp_event->class */
+ struct list_head tp_list;
+ };
#ifdef CONFIG_HAVE_HW_BREAKPOINT
struct { /* breakpoint */
- struct arch_hw_breakpoint info;
- struct list_head bp_list;
/*
* Crufty hack to avoid the chicken and egg
* problem hw_breakpoint has with context
* creation and event initalization.
*/
struct task_struct *bp_target;
+ struct arch_hw_breakpoint info;
+ struct list_head bp_list;
};
#endif
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1b2e516..340fb53 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6162,11 +6162,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (task) {
event->attach_state = PERF_ATTACH_TASK;
+
+ if (attr->type == PERF_TYPE_TRACEPOINT)
+ event->hw.tp_target = task;
#ifdef CONFIG_HAVE_HW_BREAKPOINT
/*
* hw_breakpoint is a bit difficult here..
*/
- if (attr->type == PERF_TYPE_BREAKPOINT)
+ else if (attr->type == PERF_TYPE_BREAKPOINT)
event->hw.bp_target = task;
#endif
}
--
1.5.5.1
* Re: [PATCH 2/7] perf: Introduce hw_perf_event->tp_target and ->tp_list
From: Srikar Dronamraju @ 2013-02-11 9:44 UTC
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:02:46]:
> sys_perf_event_open()->perf_init_event(event) is called before
> find_get_context(event), this means that event->ctx == NULL when
> class->reg(TRACE_REG_PERF_REGISTER/OPEN) is called and thus it
> can't know if this event is per-task or system-wide.
>
> This patch adds hw_perf_event->tp_target for PERF_TYPE_TRACEPOINT,
> this is analogous to PERF_TYPE_BREAKPOINT/bp_target we already have.
> The patch also moves ->bp_target up so that it overlaps the new
> member; this can help the compiler generate better code.
>
> trace_uprobe_register() will use it for prefiltering to avoid the
> unnecessary breakpoints in mm's we do not want to trace.
>
> ->tp_target doesn't have its own reference, but we can rely on the
> fact that either sys_perf_event_open() holds a reference, or it is
> equal to event->ctx->task. So this pointer is always valid until
> free_event().
>
> Also add the "struct list_head tp_list" into this union. It is not
> strictly necessary, but it can simplify the next changes and we can
> add it for free.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> [...]
--
Thanks and Regards
Srikar Dronamraju
* [PATCH 3/7] uprobes: Introduce uprobe_apply()
From: Oleg Nesterov @ 2013-02-04 19:02 UTC
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
Currently it is not possible to change the filtering constraints after
uprobe_register(), so a consumer can not, say, start to trace a task/mm
which was previously filtered out, or remove breakpoints that are no
longer needed.
Introduce uprobe_apply() which simply does register_for_each_vma() again
to consult uprobe_consumer->filter() and install/remove the breakpoints.
The only complication is that register_for_each_vma() can no longer
assume that uprobe->consumers should be consulted if is_register == T,
so we change it to accept "struct uprobe_consumer *new" instead.
Unlike uprobe_register(), uprobe_apply(true) doesn't do "unregister" if
register_for_each_vma() fails; it is up to the caller to handle the error.
Note: we probably need to cleanup the current interface, it is strange
that uprobe_apply/unregister need inode/offset. We should either change
uprobe_register() to return "struct uprobe *", or add a private ->uprobe
member in uprobe_consumer. And in the long term uprobe_apply() should
take a single argument, uprobe or consumer, even "bool add" should go
away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/uprobes.h | 6 ++++++
kernel/events/uprobes.c | 39 +++++++++++++++++++++++++++++++++++----
2 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 95d0002..02b83db 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -101,6 +101,7 @@ extern int __weak set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsign
extern int __weak set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool __weak is_swbp_insn(uprobe_opcode_t *insn);
extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
@@ -124,6 +125,11 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
{
return -ENOSYS;
}
+static inline int
+uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
+{
+ return -ENOSYS;
+}
static inline void
uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
{
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 23340a7..2bcd08e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -733,8 +733,10 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
return curr;
}
-static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
+static int
+register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
{
+ bool is_register = !!new;
struct map_info *info;
int err = 0;
@@ -765,7 +767,7 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
if (is_register) {
/* consult only the "caller", new consumer. */
- if (consumer_filter(uprobe->consumers,
+ if (consumer_filter(new,
UPROBE_FILTER_REGISTER, mm))
err = install_breakpoint(uprobe, mm, vma, info->vaddr);
} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
@@ -788,7 +790,7 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
consumer_add(uprobe, uc);
- return register_for_each_vma(uprobe, true);
+ return register_for_each_vma(uprobe, uc);
}
static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
@@ -798,7 +800,7 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u
if (!consumer_del(uprobe, uc)) /* WARN? */
return;
- err = register_for_each_vma(uprobe, false);
+ err = register_for_each_vma(uprobe, NULL);
/* TODO : cant unregister? schedule a worker thread */
if (!uprobe->consumers && !err)
delete_uprobe(uprobe);
@@ -855,6 +857,35 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *
EXPORT_SYMBOL_GPL(uprobe_register);
/*
+ * uprobe_apply - re-apply the filter of an already registered probe.
+ * @inode: the file in which the probe resides.
+ * @offset: offset from the start of the file.
+ * @uc: consumer which wants to add more or remove some breakpoints
+ * @add: add or remove the breakpoints
+ */
+int uprobe_apply(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *uc, bool add)
+{
+ struct uprobe *uprobe;
+ struct uprobe_consumer *con;
+ int ret = -ENOENT;
+
+ uprobe = find_uprobe(inode, offset);
+ if (!uprobe)
+ return ret;
+
+ down_write(&uprobe->register_rwsem);
+ for (con = uprobe->consumers; con && con != uc ; con = con->next)
+ ;
+ if (con)
+ ret = register_for_each_vma(uprobe, add ? uc : NULL);
+ up_write(&uprobe->register_rwsem);
+ put_uprobe(uprobe);
+
+ return ret;
+}
+
+/*
* uprobe_unregister - unregister a already registered probe.
* @inode: the file in which the probe has to be removed.
* @offset: offset from the start of the file.
--
1.5.5.1
* Re: [PATCH 3/7] uprobes: Introduce uprobe_apply()
From: Srikar Dronamraju @ 2013-02-11 9:43 UTC
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:02:50]:
> Currently it is not possible to change the filtering constraints after
> uprobe_register(), so a consumer can not, say, start to trace a task/mm
> which was previously filtered out, or remove the no longer needed bp's.
>
> Introduce uprobe_apply() which simply does register_for_each_vma() again
> to consult uprobe_consumer->filter() and install/remove the breakpoints.
> The only complication is that register_for_each_vma() can no longer
> assume that uprobe->consumers should be consulted if is_register == T,
> so we change it to accept "struct uprobe_consumer *new" instead.
>
> Unlike uprobe_register(), uprobe_apply(true) doesn't do "unregister" if
> register_for_each_vma() fails, it is up to caller to handle the error.
>
> Note: we probably need to cleanup the current interface, it is strange
> that uprobe_apply/unregister need inode/offset. We should either change
> uprobe_register() to return "struct uprobe *", or add a private ->uprobe
> member in uprobe_consumer. And in the long term uprobe_apply() should
> take a single argument, uprobe or consumer, even "bool add" should go
> away.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> [...]
--
Thanks and Regards
Srikar Dronamraju
* [PATCH 4/7] uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
From: Oleg Nesterov @ 2013-02-04 19:02 UTC
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
Introduce "struct trace_uprobe_filter" which records the "active"
perf_event's attached to ftrace_event_call. For a start we simply use
a list_head; we can optimize this later if needed. For example, we do
not really need to record an event with ->parent != NULL; we can rely
on parent->child_list. And we can certainly do some optimizations for
the case when 2 events have the same ->tp_target or tp_target->mm.
Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
and add/del this perf_event to the list.
We can probably avoid any locking, but let's start with the "obviously
correct" trace_uprobe_filter->rwlock which protects everything.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/trace/trace_uprobe.c | 55 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 0a9a8de..f05ec32 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -28,6 +28,12 @@
#define UPROBE_EVENT_SYSTEM "uprobes"
+struct trace_uprobe_filter {
+ rwlock_t rwlock;
+ int nr_systemwide;
+ struct list_head perf_events;
+};
+
/*
* uprobe event core functions
*/
@@ -35,6 +41,7 @@ struct trace_uprobe {
struct list_head list;
struct ftrace_event_class class;
struct ftrace_event_call call;
+ struct trace_uprobe_filter filter;
struct uprobe_consumer consumer;
struct inode *inode;
char *filename;
@@ -58,6 +65,18 @@ static LIST_HEAD(uprobe_list);
static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+static inline void init_trace_uprobe_filter(struct trace_uprobe_filter *filter)
+{
+ rwlock_init(&filter->rwlock);
+ filter->nr_systemwide = 0;
+ INIT_LIST_HEAD(&filter->perf_events);
+}
+
+static inline bool uprobe_filter_is_empty(struct trace_uprobe_filter *filter)
+{
+ return !filter->nr_systemwide && list_empty(&filter->perf_events);
+}
+
/*
* Allocate new trace_uprobe and initialize it (including uprobes).
*/
@@ -87,6 +106,7 @@ alloc_trace_uprobe(const char *group, const char *event, int nargs)
INIT_LIST_HEAD(&tu->list);
tu->consumer.handler = uprobe_dispatcher;
+ init_trace_uprobe_filter(&tu->filter);
return tu;
error:
@@ -541,6 +561,8 @@ static int probe_event_enable(struct trace_uprobe *tu, int flag)
if (is_trace_uprobe_enabled(tu))
return -EINTR;
+ WARN_ON(!uprobe_filter_is_empty(&tu->filter));
+
tu->flags |= flag;
ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
if (ret)
@@ -554,6 +576,8 @@ static void probe_event_disable(struct trace_uprobe *tu, int flag)
if (!is_trace_uprobe_enabled(tu))
return;
+ WARN_ON(!uprobe_filter_is_empty(&tu->filter));
+
uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
tu->flags &= ~flag;
}
@@ -629,6 +653,30 @@ static int set_print_fmt(struct trace_uprobe *tu)
}
#ifdef CONFIG_PERF_EVENTS
+static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
+{
+ write_lock(&tu->filter.rwlock);
+ if (event->hw.tp_target)
+ list_add(&event->hw.tp_list, &tu->filter.perf_events);
+ else
+ tu->filter.nr_systemwide++;
+ write_unlock(&tu->filter.rwlock);
+
+ return 0;
+}
+
+static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
+{
+ write_lock(&tu->filter.rwlock);
+ if (event->hw.tp_target)
+ list_del(&event->hw.tp_list);
+ else
+ tu->filter.nr_systemwide--;
+ write_unlock(&tu->filter.rwlock);
+
+ return 0;
+}
+
/* uprobe profile handler */
static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
{
@@ -684,6 +732,13 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
case TRACE_REG_PERF_UNREGISTER:
probe_event_disable(tu, TP_FLAG_PROFILE);
return 0;
+
+ case TRACE_REG_PERF_OPEN:
+ return uprobe_perf_open(tu, data);
+
+ case TRACE_REG_PERF_CLOSE:
+ return uprobe_perf_close(tu, data);
+
#endif
default:
return 0;
--
1.5.5.1
* Re: [PATCH 4/7] uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
From: Srikar Dronamraju @ 2013-02-11 9:45 UTC
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:02:54]:
> Introduce "struct trace_uprobe_filter" which records the "active"
> perf_event's attached to ftrace_event_call. For a start we simply use
> a list_head; we can optimize this later if needed. For example, we do
> not really need to record an event with ->parent != NULL; we can rely
> on parent->child_list. And we can certainly do some optimizations for
> the case when 2 events have the same ->tp_target or tp_target->mm.
>
> Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
> and add/del this perf_event to the list.
>
> We can probably avoid any locking, but let's start with the "obviously
> correct" trace_uprobe_filter->rwlock which protects everything.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> [...]
> }
>
> #ifdef CONFIG_PERF_EVENTS
> +static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
> +{
> + write_lock(&tu->filter.rwlock);
> + if (event->hw.tp_target)
> + list_add(&event->hw.tp_list, &tu->filter.perf_events);
> + else
> + tu->filter.nr_systemwide++;
> + write_unlock(&tu->filter.rwlock);
> +
> + return 0;
> +}
> +
> +static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
> +{
> + write_lock(&tu->filter.rwlock);
> + if (event->hw.tp_target)
> + list_del(&event->hw.tp_list);
> + else
> + tu->filter.nr_systemwide--;
> + write_unlock(&tu->filter.rwlock);
> +
> + return 0;
> +}
> +
> /* uprobe profile handler */
> static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
> {
> @@ -684,6 +732,13 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
> case TRACE_REG_PERF_UNREGISTER:
> probe_event_disable(tu, TP_FLAG_PROFILE);
> return 0;
> +
> + case TRACE_REG_PERF_OPEN:
> + return uprobe_perf_open(tu, data);
> +
> + case TRACE_REG_PERF_CLOSE:
> + return uprobe_perf_close(tu, data);
> +
> #endif
> default:
> return 0;
> --
> 1.5.5.1
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 21+ messages in thread
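[Editor's note] The bookkeeping this patch adds — a per-event list plus a systemwide counter behind trace_uprobe_filter->rwlock — can be sketched as a small single-threaded userspace model. All `model_*`/`filter_*` names below are invented for illustration; the kernel version guards the fields with the rwlock, which is omitted here to keep the logic visible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of struct trace_uprobe_filter. */
struct model_event {
    void *tp_target;            /* NULL stands in for a systemwide event */
    struct model_event *next;   /* stands in for hw.tp_list */
};

struct model_filter {
    int nr_systemwide;
    struct model_event *perf_events;
};

static void filter_init(struct model_filter *f)
{
    f->nr_systemwide = 0;
    f->perf_events = NULL;
}

static bool filter_is_empty(const struct model_filter *f)
{
    return !f->nr_systemwide && f->perf_events == NULL;
}

/* Mirrors uprobe_perf_open(): per-task events go on the list,
 * systemwide events only bump the counter. */
static void filter_open(struct model_filter *f, struct model_event *ev)
{
    if (ev->tp_target) {
        ev->next = f->perf_events;
        f->perf_events = ev;
    } else {
        f->nr_systemwide++;
    }
}

/* Mirrors uprobe_perf_close(): unlink the event or drop the counter. */
static void filter_close(struct model_filter *f, struct model_event *ev)
{
    if (ev->tp_target) {
        struct model_event **p = &f->perf_events;
        while (*p && *p != ev)
            p = &(*p)->next;
        if (*p)
            *p = ev->next;
    } else {
        f->nr_systemwide--;
    }
}
```

Note how the filter is considered non-empty while either a per-task event is listed or a systemwide counter is held, which is exactly what the WARN_ON(!uprobe_filter_is_empty()) checks in probe_event_enable/disable rely on.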
* [PATCH 5/7] uprobes/perf: Teach trace_uprobe/perf code to pre-filter
2013-02-04 19:02 [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
` (3 preceding siblings ...)
2013-02-04 19:02 ` [PATCH 4/7] uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's Oleg Nesterov
@ 2013-02-04 19:02 ` Oleg Nesterov
2013-02-11 9:46 ` Srikar Dronamraju
2013-02-04 19:03 ` [PATCH 6/7] uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE Oleg Nesterov
` (2 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-04 19:02 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
->perf_events to figure out whether we need to insert the breakpoint.
uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
the new perf event comes or goes away.
Note that currently this is very suboptimal:
- uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
heavy nop; consumer->filter() always returns false at this stage.
As it was already discussed we need uprobe_register_only() to
avoid the costly register_for_each_vma() when possible.
- uprobe_apply() is often overkill. Unless "nr_systemwide != 0"
changes we need uprobe_apply_mm(), unapply_uprobe() is almost
what we need.
- uprobe_apply() can be simply avoided sometimes, see the next
changes.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perl -e 'syscall -1 while 1' &
[1] 530
# perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 79291
A huge ->nrhit == 79291 reflects the fact that the background process
530 constantly hits this breakpoint too, even if it doesn't contribute to
the output.
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 10
This shows that only the target process was punished by int3.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/trace/trace_uprobe.c | 46 ++++++++++++++++++++++++++++++++++++++++--
1 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f05ec32..5d5a261 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -554,7 +554,12 @@ static inline bool is_trace_uprobe_enabled(struct trace_uprobe *tu)
return tu->flags & (TP_FLAG_TRACE | TP_FLAG_PROFILE);
}
-static int probe_event_enable(struct trace_uprobe *tu, int flag)
+typedef bool (*filter_func_t)(struct uprobe_consumer *self,
+ enum uprobe_filter_ctx ctx,
+ struct mm_struct *mm);
+
+static int
+probe_event_enable(struct trace_uprobe *tu, int flag, filter_func_t filter)
{
int ret = 0;
@@ -564,6 +569,7 @@ static int probe_event_enable(struct trace_uprobe *tu, int flag)
WARN_ON(!uprobe_filter_is_empty(&tu->filter));
tu->flags |= flag;
+ tu->consumer.filter = filter;
ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
if (ret)
tu->flags &= ~flag;
@@ -653,6 +659,22 @@ static int set_print_fmt(struct trace_uprobe *tu)
}
#ifdef CONFIG_PERF_EVENTS
+static bool
+__uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
+{
+ struct perf_event *event;
+
+ if (filter->nr_systemwide)
+ return true;
+
+ list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
+ if (event->hw.tp_target->mm == mm)
+ return true;
+ }
+
+ return false;
+}
+
static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
{
write_lock(&tu->filter.rwlock);
@@ -662,6 +684,8 @@ static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
tu->filter.nr_systemwide++;
write_unlock(&tu->filter.rwlock);
+ uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
+
return 0;
}
@@ -674,9 +698,25 @@ static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
tu->filter.nr_systemwide--;
write_unlock(&tu->filter.rwlock);
+ uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
+
return 0;
}
+static bool uprobe_perf_filter(struct uprobe_consumer *uc,
+ enum uprobe_filter_ctx ctx, struct mm_struct *mm)
+{
+ struct trace_uprobe *tu;
+ int ret;
+
+ tu = container_of(uc, struct trace_uprobe, consumer);
+ read_lock(&tu->filter.rwlock);
+ ret = __uprobe_perf_filter(&tu->filter, mm);
+ read_unlock(&tu->filter.rwlock);
+
+ return ret;
+}
+
/* uprobe profile handler */
static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
{
@@ -719,7 +759,7 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
switch (type) {
case TRACE_REG_REGISTER:
- return probe_event_enable(tu, TP_FLAG_TRACE);
+ return probe_event_enable(tu, TP_FLAG_TRACE, NULL);
case TRACE_REG_UNREGISTER:
probe_event_disable(tu, TP_FLAG_TRACE);
@@ -727,7 +767,7 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
- return probe_event_enable(tu, TP_FLAG_PROFILE);
+ return probe_event_enable(tu, TP_FLAG_PROFILE, uprobe_perf_filter);
case TRACE_REG_PERF_UNREGISTER:
probe_event_disable(tu, TP_FLAG_PROFILE);
--
1.5.5.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
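[Editor's note] The core decision of this patch — __uprobe_perf_filter() answering "does any consumer want a breakpoint in this mm?" — can be modeled in userspace. The `model_*` names are invented for illustration; the kernel walks hw.tp_list under filter->rwlock, while this sketch uses a plain linked list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of __uprobe_perf_filter(): probe an mm only if a
 * systemwide event exists or a listed per-task event targets that mm. */
struct model_event {
    const void *mm;                  /* stands in for hw.tp_target->mm */
    const struct model_event *next;  /* stands in for hw.tp_list */
};

struct model_filter {
    int nr_systemwide;
    const struct model_event *perf_events;
};

static bool model_perf_filter(const struct model_filter *f, const void *mm)
{
    const struct model_event *ev;

    if (f->nr_systemwide)
        return true;

    for (ev = f->perf_events; ev; ev = ev->next) {
        if (ev->mm == mm)
            return true;
    }
    return false;
}
```

This is why the background perl process in the testing above stops being punished: its mm matches no listed event and no systemwide counter is held, so the breakpoint is never installed there.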
* Re: [PATCH 5/7] uprobes/perf: Teach trace_uprobe/perf code to pre-filter
2013-02-04 19:02 ` [PATCH 5/7] uprobes/perf: Teach trace_uprobe/perf code to pre-filter Oleg Nesterov
@ 2013-02-11 9:46 ` Srikar Dronamraju
0 siblings, 0 replies; 21+ messages in thread
From: Srikar Dronamraju @ 2013-02-11 9:46 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:02:58]:
> Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
> ->perf_events to figure out whether we need to insert the breakpoint.
>
> uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
> the new perf event comes or goes away.
>
> Note that currently this is very suboptimal:
>
> - uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
> heavy nop; consumer->filter() always returns false at this stage.
>
> As it was already discussed we need uprobe_register_only() to
> avoid the costly register_for_each_vma() when possible.
>
> - uprobe_apply() is often overkill. Unless "nr_systemwide != 0"
> changes we need uprobe_apply_mm(), unapply_uprobe() is almost
> what we need.
>
> - uprobe_apply() can be simply avoided sometimes, see the next
> changes.
>
> Testing:
>
> # perf probe -x /lib/libc.so.6 syscall
>
> # perl -e 'syscall -1 while 1' &
> [1] 530
>
> # perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'
>
> # perf report --show-total-period
> 100.00% 10 perl libc-2.8.so [.] syscall
>
> Before this patch:
>
> # cat /sys/kernel/debug/tracing/uprobe_profile
> /lib/libc.so.6 syscall 79291
>
> A huge ->nrhit == 79291 reflects the fact that the background process
> 530 constantly hits this breakpoint too, even if it doesn't contribute to
> the output.
>
> After the patch:
>
> # cat /sys/kernel/debug/tracing/uprobe_profile
> /lib/libc.so.6 syscall 10
>
> This shows that only the target process was punished by int3.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> kernel/trace/trace_uprobe.c | 46 ++++++++++++++++++++++++++++++++++++++++--
> 1 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index f05ec32..5d5a261 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -554,7 +554,12 @@ static inline bool is_trace_uprobe_enabled(struct trace_uprobe *tu)
> return tu->flags & (TP_FLAG_TRACE | TP_FLAG_PROFILE);
> }
>
> -static int probe_event_enable(struct trace_uprobe *tu, int flag)
> +typedef bool (*filter_func_t)(struct uprobe_consumer *self,
> + enum uprobe_filter_ctx ctx,
> + struct mm_struct *mm);
> +
> +static int
> +probe_event_enable(struct trace_uprobe *tu, int flag, filter_func_t filter)
> {
> int ret = 0;
>
> @@ -564,6 +569,7 @@ static int probe_event_enable(struct trace_uprobe *tu, int flag)
> WARN_ON(!uprobe_filter_is_empty(&tu->filter));
>
> tu->flags |= flag;
> + tu->consumer.filter = filter;
> ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
> if (ret)
> tu->flags &= ~flag;
> @@ -653,6 +659,22 @@ static int set_print_fmt(struct trace_uprobe *tu)
> }
>
> #ifdef CONFIG_PERF_EVENTS
> +static bool
> +__uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
> +{
> + struct perf_event *event;
> +
> + if (filter->nr_systemwide)
> + return true;
> +
> + list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
> + if (event->hw.tp_target->mm == mm)
> + return true;
> + }
> +
> + return false;
> +}
> +
> static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
> {
> write_lock(&tu->filter.rwlock);
> @@ -662,6 +684,8 @@ static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
> tu->filter.nr_systemwide++;
> write_unlock(&tu->filter.rwlock);
>
> + uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
> +
> return 0;
> }
>
> @@ -674,9 +698,25 @@ static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
> tu->filter.nr_systemwide--;
> write_unlock(&tu->filter.rwlock);
>
> + uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
> +
> return 0;
> }
>
> +static bool uprobe_perf_filter(struct uprobe_consumer *uc,
> + enum uprobe_filter_ctx ctx, struct mm_struct *mm)
> +{
> + struct trace_uprobe *tu;
> + int ret;
> +
> + tu = container_of(uc, struct trace_uprobe, consumer);
> + read_lock(&tu->filter.rwlock);
> + ret = __uprobe_perf_filter(&tu->filter, mm);
> + read_unlock(&tu->filter.rwlock);
> +
> + return ret;
> +}
> +
> /* uprobe profile handler */
> static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
> {
> @@ -719,7 +759,7 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
>
> switch (type) {
> case TRACE_REG_REGISTER:
> - return probe_event_enable(tu, TP_FLAG_TRACE);
> + return probe_event_enable(tu, TP_FLAG_TRACE, NULL);
>
> case TRACE_REG_UNREGISTER:
> probe_event_disable(tu, TP_FLAG_TRACE);
> @@ -727,7 +767,7 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
>
> #ifdef CONFIG_PERF_EVENTS
> case TRACE_REG_PERF_REGISTER:
> - return probe_event_enable(tu, TP_FLAG_PROFILE);
> + return probe_event_enable(tu, TP_FLAG_PROFILE, uprobe_perf_filter);
>
> case TRACE_REG_PERF_UNREGISTER:
> probe_event_disable(tu, TP_FLAG_PROFILE);
> --
> 1.5.5.1
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 6/7] uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
2013-02-04 19:02 [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
` (4 preceding siblings ...)
2013-02-04 19:02 ` [PATCH 5/7] uprobes/perf: Teach trace_uprobe/perf code to pre-filter Oleg Nesterov
@ 2013-02-04 19:03 ` Oleg Nesterov
2013-02-11 9:54 ` Srikar Dronamraju
2013-02-04 19:03 ` [PATCH 7/7] uprobes/perf: Avoid uprobe_apply() whenever possible Oleg Nesterov
2013-02-06 18:10 ` [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
7 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-04 19:03 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
uprobe_dispatcher() to return "trace_ret | perf_ret", although this is not
really needed; currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually
exclusive.
The only functional change is that uprobe_perf_func() checks the filtering
too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 20
A child process doesn't have a counter, but it still hits this breakpoint
"copied" by dup_mmap().
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 11
The child process hits this int3 only once and does unapply_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/trace/trace_uprobe.c | 21 ++++++++++++++-------
1 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 5d5a261..1114619 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -483,7 +483,7 @@ static const struct file_operations uprobe_profile_ops = {
};
/* uprobe handler */
-static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
+static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
{
struct uprobe_trace_entry_head *entry;
struct ring_buffer_event *event;
@@ -501,7 +501,7 @@ static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
size, irq_flags, pc);
if (!event)
- return;
+ return 0;
entry = ring_buffer_event_data(event);
entry->ip = instruction_pointer(task_pt_regs(current));
@@ -511,6 +511,8 @@ static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
if (!filter_current_check_discard(buffer, call, entry, event))
trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+
+ return 0;
}
/* Event entry printers */
@@ -718,7 +720,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
}
/* uprobe profile handler */
-static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
+static int uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
{
struct ftrace_event_call *call = &tu->call;
struct uprobe_trace_entry_head *entry;
@@ -727,11 +729,14 @@ static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
int size, __size, i;
int rctx;
+ if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+ return UPROBE_HANDLER_REMOVE;
+
__size = sizeof(*entry) + tu->size;
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
- return;
+ return 0;
preempt_disable();
@@ -749,6 +754,7 @@ static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
out:
preempt_enable();
+ return 0;
}
#endif /* CONFIG_PERF_EVENTS */
@@ -789,18 +795,19 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
{
struct trace_uprobe *tu;
+ int ret = 0;
tu = container_of(con, struct trace_uprobe, consumer);
tu->nhit++;
if (tu->flags & TP_FLAG_TRACE)
- uprobe_trace_func(tu, regs);
+ ret |= uprobe_trace_func(tu, regs);
#ifdef CONFIG_PERF_EVENTS
if (tu->flags & TP_FLAG_PROFILE)
- uprobe_perf_func(tu, regs);
+ ret |= uprobe_perf_func(tu, regs);
#endif
- return 0;
+ return ret;
}
static struct trace_event_functions uprobe_funcs = {
--
1.5.5.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
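[Editor's note] The dispatcher change above can be condensed into a userspace sketch: each handler returns 0 or a removal request, and the results are OR'ed so the breakpoint is removed only when every active consumer path agrees. The `model_*` names and `MODEL_HANDLER_REMOVE` are invented stand-ins for the kernel's uprobe_dispatcher() and UPROBE_HANDLER_REMOVE.

```c
#include <stdbool.h>

/* Stand-in for the kernel's UPROBE_HANDLER_REMOVE return code. */
#define MODEL_HANDLER_REMOVE 1

/* Mirrors uprobe_perf_func() after this patch: ask the filter first,
 * and request removal if nobody wants to trace current. */
static int model_perf_func(bool filter_wants_current)
{
    if (!filter_wants_current)
        return MODEL_HANDLER_REMOVE;
    /* ... emit the perf sample ... */
    return 0;
}

/* Mirrors uprobe_trace_func(): the ftrace path never asks for removal. */
static int model_trace_func(void)
{
    return 0;
}

/* Mirrors uprobe_dispatcher(): OR the per-handler return codes. */
static int model_dispatcher(bool trace_on, bool perf_on,
                            bool filter_wants_current)
{
    int ret = 0;

    if (trace_on)
        ret |= model_trace_func();
    if (perf_on)
        ret |= model_perf_func(filter_wants_current);
    return ret;
}
```

In the fork test above, the child takes the inherited int3 once, the filter rejects its mm, the dispatcher returns the removal code, and unapply_uprobe() strips the breakpoint — hence nrhit going from 20 to 11.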
* Re: [PATCH 6/7] uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
2013-02-04 19:03 ` [PATCH 6/7] uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE Oleg Nesterov
@ 2013-02-11 9:54 ` Srikar Dronamraju
0 siblings, 0 replies; 21+ messages in thread
From: Srikar Dronamraju @ 2013-02-11 9:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:03:01]:
> Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
> uprobe_dispatcher() to return "trace_ret | perf_ret", although this is not
> really needed; currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually
> exclusive.
>
> The only functional change is that uprobe_perf_func() checks the filtering
> too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.
>
> Testing:
>
> # perf probe -x /lib/libc.so.6 syscall
>
> # perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'
>
> # perf report --show-total-period
> 100.00% 10 perl libc-2.8.so [.] syscall
>
> Before this patch:
>
> # cat /sys/kernel/debug/tracing/uprobe_profile
> /lib/libc.so.6 syscall 20
>
> A child process doesn't have a counter, but it still hits this breakpoint
> "copied" by dup_mmap().
>
> After the patch:
>
> # cat /sys/kernel/debug/tracing/uprobe_profile
> /lib/libc.so.6 syscall 11
>
> The child process hits this int3 only once and does unapply_uprobe().
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> kernel/trace/trace_uprobe.c | 21 ++++++++++++++-------
> 1 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 5d5a261..1114619 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -483,7 +483,7 @@ static const struct file_operations uprobe_profile_ops = {
> };
>
> /* uprobe handler */
> -static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
> +static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
> {
> struct uprobe_trace_entry_head *entry;
> struct ring_buffer_event *event;
> @@ -501,7 +501,7 @@ static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
> event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
> size, irq_flags, pc);
> if (!event)
> - return;
> + return 0;
>
> entry = ring_buffer_event_data(event);
> entry->ip = instruction_pointer(task_pt_regs(current));
> @@ -511,6 +511,8 @@ static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs)
>
> if (!filter_current_check_discard(buffer, call, entry, event))
> trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
> +
> + return 0;
> }
>
> /* Event entry printers */
> @@ -718,7 +720,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
> }
>
> /* uprobe profile handler */
> -static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
> +static int uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
> {
> struct ftrace_event_call *call = &tu->call;
> struct uprobe_trace_entry_head *entry;
> @@ -727,11 +729,14 @@ static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
> int size, __size, i;
> int rctx;
>
> + if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
> + return UPROBE_HANDLER_REMOVE;
> +
> __size = sizeof(*entry) + tu->size;
> size = ALIGN(__size + sizeof(u32), sizeof(u64));
> size -= sizeof(u32);
> if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
> - return;
> + return 0;
>
> preempt_disable();
>
> @@ -749,6 +754,7 @@ static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs)
>
> out:
> preempt_enable();
> + return 0;
> }
> #endif /* CONFIG_PERF_EVENTS */
>
> @@ -789,18 +795,19 @@ int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type,
> static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> {
> struct trace_uprobe *tu;
> + int ret = 0;
>
> tu = container_of(con, struct trace_uprobe, consumer);
> tu->nhit++;
>
> if (tu->flags & TP_FLAG_TRACE)
> - uprobe_trace_func(tu, regs);
> + ret |= uprobe_trace_func(tu, regs);
>
> #ifdef CONFIG_PERF_EVENTS
> if (tu->flags & TP_FLAG_PROFILE)
> - uprobe_perf_func(tu, regs);
> + ret |= uprobe_perf_func(tu, regs);
> #endif
> - return 0;
> + return ret;
> }
>
> static struct trace_event_functions uprobe_funcs = {
> --
> 1.5.5.1
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 7/7] uprobes/perf: Avoid uprobe_apply() whenever possible
2013-02-04 19:02 [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
` (5 preceding siblings ...)
2013-02-04 19:03 ` [PATCH 6/7] uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE Oleg Nesterov
@ 2013-02-04 19:03 ` Oleg Nesterov
2013-02-11 9:55 ` Srikar Dronamraju
2013-02-06 18:10 ` [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
7 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-04 19:03 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
uprobe_perf_open/close call the costly uprobe_apply() every time,
we can avoid it if:
- "nr_systemwide != 0" is not changed.
- There is another process/thread with the same ->mm.
- copy_process() does inherit_event(). dup_mmap() preserves the
inserted breakpoints.
- event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
called by exec/mmap paths.
- tp_target is exiting. Only _close() checks PF_EXITING, I don't
think TRACE_REG_PERF_OPEN can hit the dying task too often.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/trace/trace_uprobe.c | 42 ++++++++++++++++++++++++++++++++++++------
1 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 1114619..e4aab34 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -677,30 +677,60 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
return false;
}
+static inline bool
+uprobe_filter_event(struct trace_uprobe *tu, struct perf_event *event)
+{
+ return __uprobe_perf_filter(&tu->filter, event->hw.tp_target->mm);
+}
+
static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
{
+ bool done;
+
write_lock(&tu->filter.rwlock);
- if (event->hw.tp_target)
+ if (event->hw.tp_target) {
+ /*
+ * event->parent != NULL means copy_process(), we can avoid
+ * uprobe_apply(). current->mm must be probed and we can rely
+ * on dup_mmap() which preserves the already installed bp's.
+ *
+ * attr.enable_on_exec means that exec/mmap will install the
+ * breakpoints we need.
+ */
+ done = tu->filter.nr_systemwide ||
+ event->parent || event->attr.enable_on_exec ||
+ uprobe_filter_event(tu, event);
list_add(&event->hw.tp_list, &tu->filter.perf_events);
- else
+ } else {
+ done = tu->filter.nr_systemwide;
tu->filter.nr_systemwide++;
+ }
write_unlock(&tu->filter.rwlock);
- uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
+ if (!done)
+ uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
return 0;
}
static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
{
+ bool done;
+
write_lock(&tu->filter.rwlock);
- if (event->hw.tp_target)
+ if (event->hw.tp_target) {
list_del(&event->hw.tp_list);
- else
+ done = tu->filter.nr_systemwide ||
+ (event->hw.tp_target->flags & PF_EXITING) ||
+ uprobe_filter_event(tu, event);
+ } else {
tu->filter.nr_systemwide--;
+ done = tu->filter.nr_systemwide;
+ }
write_unlock(&tu->filter.rwlock);
- uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
+ if (!done)
+ uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
return 0;
}
--
1.5.5.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
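[Editor's note] The "done" short-circuit in uprobe_perf_open() above boils down to a predicate over a handful of booleans. The sketch below models just that decision; `open_needs_apply` and its parameters are invented names standing in for the kernel-side checks (nr_systemwide before the increment, event->parent, attr.enable_on_exec, and uprobe_filter_event() on the target mm).

```c
#include <stdbool.h>

/* Model of the open-side "done" logic from this patch: return true
 * only when the new event can actually change the filtering decision
 * and the costly uprobe_apply() is therefore required. */
static bool open_needs_apply(bool per_task, bool nr_systemwide,
                             bool inherited, bool enable_on_exec,
                             bool mm_already_probed)
{
    bool done;

    if (per_task)
        /* inherited: dup_mmap() copied the breakpoints already;
         * enable_on_exec: exec/mmap will install them later;
         * mm_already_probed: another event covers this mm. */
        done = nr_systemwide || inherited || enable_on_exec ||
               mm_already_probed;
    else
        /* counter sampled before the increment: only the first
         * systemwide event widens the filter. */
        done = nr_systemwide;

    return !done;
}
```

The close side is symmetric, except that it additionally skips uprobe_apply() when the target task has PF_EXITING set, since the dying mm is about to go away anyway.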
* Re: [PATCH 7/7] uprobes/perf: Avoid uprobe_apply() whenever possible
2013-02-04 19:03 ` [PATCH 7/7] uprobes/perf: Avoid uprobe_apply() whenever possible Oleg Nesterov
@ 2013-02-11 9:55 ` Srikar Dronamraju
0 siblings, 0 replies; 21+ messages in thread
From: Srikar Dronamraju @ 2013-02-11 9:55 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Steven Rostedt, Anton Arapov, Frank Eigler, Jiri Olsa,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
* Oleg Nesterov <oleg@redhat.com> [2013-02-04 20:03:05]:
> uprobe_perf_open/close call the costly uprobe_apply() every time,
> we can avoid it if:
>
> - "nr_systemwide != 0" is not changed.
>
> - There is another process/thread with the same ->mm.
>
> - copy_process() does inherit_event(). dup_mmap() preserves the
> inserted breakpoints.
>
> - event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
> called by exec/mmap paths.
>
> - tp_target is exiting. Only _close() checks PF_EXITING, I don't
> think TRACE_REG_PERF_OPEN can hit the dying task too often.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> kernel/trace/trace_uprobe.c | 42 ++++++++++++++++++++++++++++++++++++------
> 1 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 1114619..e4aab34 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -677,30 +677,60 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm)
> return false;
> }
>
> +static inline bool
> +uprobe_filter_event(struct trace_uprobe *tu, struct perf_event *event)
> +{
> + return __uprobe_perf_filter(&tu->filter, event->hw.tp_target->mm);
> +}
> +
> static int uprobe_perf_open(struct trace_uprobe *tu, struct perf_event *event)
> {
> + bool done;
> +
> write_lock(&tu->filter.rwlock);
> - if (event->hw.tp_target)
> + if (event->hw.tp_target) {
> + /*
> + * event->parent != NULL means copy_process(), we can avoid
> + * uprobe_apply(). current->mm must be probed and we can rely
> + * on dup_mmap() which preserves the already installed bp's.
> + *
> + * attr.enable_on_exec means that exec/mmap will install the
> + * breakpoints we need.
> + */
> + done = tu->filter.nr_systemwide ||
> + event->parent || event->attr.enable_on_exec ||
> + uprobe_filter_event(tu, event);
> list_add(&event->hw.tp_list, &tu->filter.perf_events);
> - else
> + } else {
> + done = tu->filter.nr_systemwide;
> tu->filter.nr_systemwide++;
> + }
> write_unlock(&tu->filter.rwlock);
>
> - uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
> + if (!done)
> + uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
>
> return 0;
> }
>
> static int uprobe_perf_close(struct trace_uprobe *tu, struct perf_event *event)
> {
> + bool done;
> +
> write_lock(&tu->filter.rwlock);
> - if (event->hw.tp_target)
> + if (event->hw.tp_target) {
> list_del(&event->hw.tp_list);
> - else
> + done = tu->filter.nr_systemwide ||
> + (event->hw.tp_target->flags & PF_EXITING) ||
> + uprobe_filter_event(tu, event);
> + } else {
> tu->filter.nr_systemwide--;
> + done = tu->filter.nr_systemwide;
> + }
> write_unlock(&tu->filter.rwlock);
>
> - uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
> + if (!done)
> + uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
>
> return 0;
> }
> --
> 1.5.5.1
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/7] uprobes/perf: pre-filtering
2013-02-04 19:02 [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
` (6 preceding siblings ...)
2013-02-04 19:03 ` [PATCH 7/7] uprobes/perf: Avoid uprobe_apply() whenever possible Oleg Nesterov
@ 2013-02-06 18:10 ` Oleg Nesterov
2013-02-06 19:42 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Oleg Nesterov
7 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-06 18:10 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
On 02/04, Oleg Nesterov wrote:
>
> Based on git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc uprobes/core,
> on top of "[PATCH 0/5] uprobes: kill uprobe_trace_consumer and other cleanups"
> series.
>
> With this series 'perf record -e uprobe ...' does not install the breakpoints
> into the processes it doesn't want to probe.
>
> Probably this all needs more testing (the last patch for sure) but please
> review. I'll try to do more testing tomorrow.
Well, everything seems to work....
And nobody cares^Wobjects. I am going to add this to "oleg/misc uprobes/core"
and ask Ingo to pull.
> A special note about 1/7. It was actually needed for initial implementation,
> this version doesn't need it. Still I am sending it at least for review, it
> looks "natural" and potentially useful, and I would like to know if it is
> correct or not.
But since 1/7 wasn't reviewed, I'll drop it.
Btw, during the testing I noticed a minor problem (or problems) in tools/perf/,
I'll try to investigate...
Oleg.
* [PATCH 0/1] (Was uprobes/perf: pre-filtering)
2013-02-06 18:10 ` [PATCH 0/7] uprobes/perf: pre-filtering Oleg Nesterov
@ 2013-02-06 19:42 ` Oleg Nesterov
2013-02-06 19:42 ` [PATCH 1/1] perf/tools: Fix "perf record -C... workload" behaviour Oleg Nesterov
2013-02-07 6:01 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Namhyung Kim
0 siblings, 2 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-06 19:42 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
On 02/06, Oleg Nesterov wrote:
>
> Btw, during the testing I noticed a minor problem (or problems) in tools/perf/,
> I'll try to investigate...
To avoid confusion: this has nothing to do with the perf/uprobe pre-filtering
changes.
Well, I am very new to this code, and I hadn't even used perf before;
it is quite possible I missed something (or everything).
But I simply can't understand the logic behind perf_evlist__create_maps/etc.
OK, builtin-record.c:record sets .target->uses_mmap == T, this is correct,
we are going to use perf_mmap().
But why do we need it? It is for perf_evlist__create_maps() to ensure we
do not use cpu_map__dummy_new().
But. Even if we use perf_mmap(), "event->cpu == -1 && !event->attr.inherit"
is fine. And indeed, "perf record -e whatever -p1" creates a single counter
with "cpu = -1" and this works. Because, damn, perf_evlist__config_attrs()
notices "evlist->cpus->map[0] < 0" and sets "opts->no_inherit = true".
But at the same time, this does not allow one to do, say,
"perf record -e whatever -C0 -p1". -C0 is simply ignored because when
perf_evlist__create_maps() is called perf_target__has_task() == T due to "-p1".
Not to mention, there is no way to trace init and the children it will
fork...
And otoh. "perf record -e whatever -i true" creates a counter for each
cpu due to ->uses_mmap. This is suboptimal, but of course the hack like
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1081,6 +1078,8 @@ int cmd_record(int argc, const char **argv, const char *prefix __maybe_unused)
if (!argc && perf_target__none(&rec->opts.target))
usage_with_options(record_usage, record_options);
+ rec->opts.target.uses_mmap = !rec->opts.no_inherit;
+
if (rec->force && rec->append_file) {
ui__error("Can't overwrite and append at the same time."
" You need to choose between -f and -A");
should not even be discussed.
And why is opts->target.system_wide only set by OPT_BOOLEAN("all-cpus")?
I mean, why can't I do "perf record -e whatever -C0" to create a "global"
counter on CPU_0? This doesn't work because __cmd_record() sees !.system_wide
and assumes we need perf_event__synthesize_thread_map(), which silently fails.
So I am sending a single patch to fix the problem which complicated my
testing. It is trivial but I am not sure it is correct, please review.
Oleg.
* [PATCH 1/1] perf/tools: Fix "perf record -C... workload" behaviour
2013-02-06 19:42 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Oleg Nesterov
@ 2013-02-06 19:42 ` Oleg Nesterov
2013-02-25 9:58 ` Jiri Olsa
2013-02-07 6:01 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Namhyung Kim
1 sibling, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-06 19:42 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt
Cc: Anton Arapov, Frank Eigler, Jiri Olsa, Josh Stone,
Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
Test-case:
# perf record -e probe_libc:syscall -C0 true
true: Terminated
The child process is killed by perf_record__sig_exit(). This is because
perf_evlist__prepare_workload() doesn't initialize threads->map[] and
then perf_event__synthesize_mmap_events() (silently!) fails. This patch
seems to fix the problem, but perhaps we should also move
'pr_err("Couldn't run the workload!\n")' down to 'out_delete_session:',
so that problems like this would be easier to diagnose.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
tools/perf/util/evlist.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 7052934..8b3f6ef 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -783,7 +783,7 @@ int perf_evlist__prepare_workload(struct perf_evlist *evlist,
exit(-1);
}
- if (perf_target__none(&opts->target))
+ if (!perf_target__has_task(&opts->target))
evlist->threads->map[0] = evlist->workload.pid;
close(child_ready_pipe[1]);
--
1.5.5.1
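The one-line change above swaps which targets get the forked workload's pid stored into threads->map[0]. A stand-alone sketch of the two predicates involved: the struct fields and helper names mirror tools/perf's perf_target API, but the bodies here are simplified re-implementations for illustration only, and `stores_workload_pid_old`/`_new` are invented names for the before/after condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of struct perf_target; only the fields the two
 * predicates look at are kept. */
struct perf_target {
	const char *pid;
	const char *tid;
	const char *cpu_list;
};

static bool target_has_task(const struct perf_target *t)
{
	return t->pid || t->tid;
}

static bool target_has_cpu(const struct perf_target *t)
{
	return t->cpu_list != NULL;
}

static bool target_none(const struct perf_target *t)
{
	return !target_has_task(t) && !target_has_cpu(t);
}

/* Old condition: only a fully empty target stored the workload pid. */
static bool stores_workload_pid_old(const struct perf_target *t)
{
	return target_none(t);
}

/* New condition: any target without an explicit task does, so
 * "perf record -C0 true" also records the forked child's pid. */
static bool stores_workload_pid_new(const struct perf_target *t)
{
	return !target_has_task(t);
}
```

With "-C0", the old predicate is false (a cpu list makes the target non-empty), so threads->map[0] kept its bogus value and the synthesize step failed; the new predicate still excludes explicit "-p"/"-t" targets.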
* Re: [PATCH 1/1] perf/tools: Fix "perf record -C... workload" behaviour
2013-02-06 19:42 ` [PATCH 1/1] perf/tools: Fix "perf record -C... workload" behaviour Oleg Nesterov
@ 2013-02-25 9:58 ` Jiri Olsa
0 siblings, 0 replies; 21+ messages in thread
From: Jiri Olsa @ 2013-02-25 9:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt, Anton Arapov, Frank Eigler,
Josh Stone, Masami Hiramatsu, Suzuki K. Poulose, linux-kernel
On Wed, Feb 06, 2013 at 08:42:58PM +0100, Oleg Nesterov wrote:
> Test-case:
>
> # perf record -e probe_libc:syscall -C0 true
> true: Terminated
>
> The child process is killed by perf_record__sig_exit(). This is because
> perf_evlist__prepare_workload() doesn't initialize threads->map[] and
> then perf_event__synthesize_mmap_events() (silently!) fails. This patch
hi,
sorry for the late reply.
I think the reason it fails is that it actually got inside
perf_event__synthesize_mmap_events() while there's a cpu list
specified.
I sent possible fix:
http://marc.info/?l=linux-kernel&m=136178604718144&w=2
> seems to fix the problem, but perhaps we should also move
> 'pr_err("Couldn't run the workload!\n")' down to 'out_delete_session:',
I'll check ;-)
thanks,
jirka
* Re: [PATCH 0/1] (Was uprobes/perf: pre-filtering)
2013-02-06 19:42 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Oleg Nesterov
2013-02-06 19:42 ` [PATCH 1/1] perf/tools: Fix "perf record -C... workload" behaviour Oleg Nesterov
@ 2013-02-07 6:01 ` Namhyung Kim
2013-02-07 15:22 ` Oleg Nesterov
1 sibling, 1 reply; 21+ messages in thread
From: Namhyung Kim @ 2013-02-07 6:01 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt, Anton Arapov, Frank Eigler,
Jiri Olsa, Josh Stone, Masami Hiramatsu, Suzuki K. Poulose,
linux-kernel
Hi Oleg,
On Wed, 6 Feb 2013 20:42:18 +0100, Oleg Nesterov wrote:
> OK, builtin-record.c:record sets .target->uses_mmap == T, this is correct,
> we are going to use perf_mmap().
>
> But why do we need it? It is for perf_evlist__create_maps() to ensure we
> do not use cpu_map__dummy_new().
>
> But. Even if we use perf_mmap(), "event->cpu == -1 && !event->attr.inherit"
> is fine. And indeed, "perf record -e whatever -p1" creates a single counter
> with "cpu = -1" and this works. Because, damn, perf_evlist__config_attrs()
> notices "evlist->cpus->map[0] < 0" and sets "opts->no_inherit = true".
>
> But at the same time, this does not allow one to do, say,
> "perf record -e whatever -C0 -p1". -C0 is simply ignored because when
> perf_evlist__create_maps() is called perf_target__has_task() == T due to "-p1".
>
> Not to mention, there is no way to trace init and the children it will
> fork...
>
> And otoh. "perf record -e whatever -i true" creates a counter for each
> cpu due to ->uses_mmap. This is suboptimal, but of course the hack like
>
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -1081,6 +1078,8 @@ int cmd_record(int argc, const char **argv, const char *prefix __maybe_unused)
> if (!argc && perf_target__none(&rec->opts.target))
> usage_with_options(record_usage, record_options);
>
> + rec->opts.target.uses_mmap = !rec->opts.no_inherit;
> +
> if (rec->force && rec->append_file) {
> ui__error("Can't overwrite and append at the same time."
> " You need to choose between -f and -A");
>
> should not even be discussed.
>
> And why is opts->target.system_wide only set by OPT_BOOLEAN("all-cpus")?
> I mean, why can't I do "perf record -e whatever -C0" to create a "global"
> counter on CPU_0? This doesn't work because __cmd_record() sees !.system_wide
> and assumes we need perf_event__synthesize_thread_map(), which silently fails.
>
> So I am sending a single patch to fix the problem which complicated my
> testing. It is trivial but I am not sure it is correct, please review.
Yes, it's not clear how it handles the above (-C0) case. I think it should
be treated as a system-wide mode, like --all-cpus (-a). So we could set
->system_wide to true when -C is given, test perf_target__has_cpu() before
calling perf_event__synthesize_thread_map(), or both.
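The suggestion can be sketched as a small fix-up pass over the parsed target. Everything below is hypothetical illustration, not actual tools/perf code: `fixup_target()` and `should_synthesize_thread_map()` are invented names, and the struct is a cut-down stand-in for perf_target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for tools/perf's struct perf_target. */
struct perf_target {
	const char *pid;
	const char *tid;
	const char *cpu_list;
	bool system_wide;
};

static bool has_task(const struct perf_target *t) { return t->pid || t->tid; }
static bool has_cpu(const struct perf_target *t)  { return t->cpu_list != NULL; }

/* Treat a bare -C<cpu> like --all-cpus: a cpu list without a task
 * target implies system-wide counters on those cpus. */
static void fixup_target(struct perf_target *t)
{
	if (has_cpu(t) && !has_task(t))
		t->system_wide = true;
}

/* __cmd_record() would then synthesize a per-thread map only for real
 * task targets, instead of failing silently for -C<cpu>. */
static bool should_synthesize_thread_map(const struct perf_target *t)
{
	return !t->system_wide && has_task(t);
}
```

Under this sketch, "perf record -C0" becomes system-wide on CPU 0 and never reaches the thread-map synthesis that currently fails.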
Thanks,
Namhyung
* Re: [PATCH 0/1] (Was uprobes/perf: pre-filtering)
2013-02-07 6:01 ` [PATCH 0/1] (Was uprobes/perf: pre-filtering) Namhyung Kim
@ 2013-02-07 15:22 ` Oleg Nesterov
0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2013-02-07 15:22 UTC (permalink / raw)
To: Namhyung Kim
Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
Srikar Dronamraju, Steven Rostedt, Anton Arapov, Frank Eigler,
Jiri Olsa, Josh Stone, Masami Hiramatsu, Suzuki K. Poulose,
linux-kernel
Hi,
On 02/07, Namhyung Kim wrote:
>
> On Wed, 6 Feb 2013 20:42:18 +0100, Oleg Nesterov wrote:
> >
> > And why is opts->target.system_wide only set by OPT_BOOLEAN("all-cpus")?
> > I mean, why can't I do "perf record -e whatever -C0" to create a "global"
> > counter on CPU_0? This doesn't work because __cmd_record() sees !.system_wide
> > and assumes we need perf_event__synthesize_thread_map(), which silently fails.
> >
> > So I am sending a single patch to fix the problem which complicated my
> > testing. It is trivial but I am not sure it is correct, please review.
>
> Yes, it's not clear how it handles the above (-C0) case. I think it should
> be treated as a system-wide mode, like --all-cpus (-a). So we could set
> ->system_wide to true when -C is given, test perf_target__has_cpu() before
> calling perf_event__synthesize_thread_map(), or both.
Yes, thanks... but to be honest I do not understand opts->systemwide
either.
OK, both 'perf record ... sleep 1' and 'perf record ... -a sleep 1' attach
the counter(s) to the child process, but opts->systemwide differs. In the
latter case run_command() does perf_event__synthesize_threads(). Not sure
this is right.
Btw, I am just curious. You can override the target with "-p" and run the
command, but it seems that it is not possible to create a global counter
and run the command. Not that important, but it could be handy.
Oleg.