All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH tip 0/3] Improvements of scheduler related Tracepoints
@ 2017-12-14 20:20 Teng Qin
  2017-12-14 20:20 ` [PATCH tip 1/3] Improve sched_switch Tracepoint Teng Qin
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Teng Qin @ 2017-12-14 20:20 UTC (permalink / raw)
  To: peterz
  Cc: mingo, ast, bgregg, daniel, yhs, linux-kernel, Kernel-team, Teng Qin

This set of commits attempts to improve three scheduler related
Tracepoints: sched_switch, sched_process_fork, sched_process_exit.

Firstly, these commit add additional flag values, namely preempt,
clone_flags and group_dead to these Tracepoints, to make information
exposed via the Tracepoints more useful and complete.

Secondly, these commits exposes task_struct pointers in these
Tracepoints. The task_struct pointers are arguments of the Tracepoints
and currently only used to compute struct field values. But for BPF
programs attached to these Tracepoints, we may want to read additional
task information via the task_struct pointers. This is currently either
impossible, or we have to make assumption of whether the Tracepoint is
running from previous / parent or next / child, and use current pointer
instead. Exposing the task_struct pointers explicitly makes such use
case easier and more reliable.

Teng Qin (3):
  Improve sched_switch Tracepoint
  Improve sched_process_fork Tracepoint
  Improve sched_process_exit Tracepoint

 include/trace/events/sched.h | 54 ++++++++++++++++++++++++++++++++++++--------
 kernel/exit.c                |  2 +-
 kernel/fork.c                |  2 +-
 3 files changed, 46 insertions(+), 12 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH tip 1/3] Improve sched_switch Tracepoint
  2017-12-14 20:20 [PATCH tip 0/3] Improvements of scheduler related Tracepoints Teng Qin
@ 2017-12-14 20:20 ` Teng Qin
  2017-12-14 20:20 ` [PATCH tip 2/3] Improve sched_process_fork Tracepoint Teng Qin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Teng Qin @ 2017-12-14 20:20 UTC (permalink / raw)
  To: peterz
  Cc: mingo, ast, bgregg, daniel, yhs, linux-kernel, Kernel-team, Teng Qin

This commit adds value of the preempt flag to sched_switch's Tracepoint
struct. The value of the flag is already passed into the Tracepoint as
argument but currently only used to compute previous task's state.
Exposing the value is useful during debugging of contention issues.

This commit also exposes pointers of the previous and next task_struct in
the Tracepoint's struct. BPF programs can read task information via task
struct pointer. Exposing these pointers explicitly gives BPF programs an
easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <qinteng@fb.com>
---
 include/trace/events/sched.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bc01e06..9297b33 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -141,6 +141,9 @@ TRACE_EVENT(sched_switch,
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
 		__field(	int,	next_prio			)
+		__field(	bool,	preempt				)
+		__field(	void *,	prev_task			)
+		__field(	void *,	next_task			)
 	),
 
 	TP_fast_assign(
@@ -151,6 +154,9 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		__entry->preempt	= preempt;
+		__entry->prev_task	= (void *)prev;
+		__entry->next_task	= (void *)next;
 		/* XXX SCHED_DEADLINE */
 	),
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH tip 2/3] Improve sched_process_fork Tracepoint
  2017-12-14 20:20 [PATCH tip 0/3] Improvements of scheduler related Tracepoints Teng Qin
  2017-12-14 20:20 ` [PATCH tip 1/3] Improve sched_switch Tracepoint Teng Qin
@ 2017-12-14 20:20 ` Teng Qin
  2017-12-14 20:20 ` [PATCH tip 3/3] Improve sched_process_exit Tracepoint Teng Qin
  2017-12-14 20:49 ` [PATCH tip 0/3] Improvements of scheduler related Tracepoints Peter Zijlstra
  3 siblings, 0 replies; 12+ messages in thread
From: Teng Qin @ 2017-12-14 20:20 UTC (permalink / raw)
  To: peterz
  Cc: mingo, ast, bgregg, daniel, yhs, linux-kernel, Kernel-team, Teng Qin

This commit adds value of the clone_flag to sched_process_fork's
Tracepoint struct. The value is passed as an argument of the Tracepoint.
It is useful when debugging Processes created via different clone flags.

This commit also exposes pointers of the parent and child task_struct in
the Tracepoint's struct. BPF programs can read task information via task
struct pointer. Exposing these pointers explicitly gives BPF programs an
easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <qinteng@fb.com>
---
 include/trace/events/sched.h | 11 +++++++++--
 kernel/fork.c                |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9297b33..e8bb6b0 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -278,15 +278,19 @@ TRACE_EVENT(sched_process_wait,
  */
 TRACE_EVENT(sched_process_fork,
 
-	TP_PROTO(struct task_struct *parent, struct task_struct *child),
+	TP_PROTO(struct task_struct *parent, struct task_struct *child,
+		 unsigned long clone_flags),
 
-	TP_ARGS(parent, child),
+	TP_ARGS(parent, child, clone_flags),
 
 	TP_STRUCT__entry(
 		__array(	char,	parent_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	parent_pid			)
 		__array(	char,	child_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	child_pid			)
+		__field(	unsigned long,	clone_flags		)
+		__field(	void *,	parent_task			)
+		__field(	void *,	child_task			)
 	),
 
 	TP_fast_assign(
@@ -294,6 +298,9 @@ TRACE_EVENT(sched_process_fork,
 		__entry->parent_pid	= parent->pid;
 		memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
 		__entry->child_pid	= child->pid;
+		__entry->clone_flags	= clone_flags;
+		__entry->parent_task	= (void *)parent;
+		__entry->child_task	= (void *)child;
 	),
 
 	TP_printk("comm=%s pid=%d child_comm=%s child_pid=%d",
diff --git a/kernel/fork.c b/kernel/fork.c
index 432eadf..777985e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2053,7 +2053,7 @@ long _do_fork(unsigned long clone_flags,
 		struct completion vfork;
 		struct pid *pid;
 
-		trace_sched_process_fork(current, p);
+		trace_sched_process_fork(current, p, clone_flags);
 
 		pid = get_task_pid(p, PIDTYPE_PID);
 		nr = pid_vnr(pid);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH tip 3/3] Improve sched_process_exit Tracepoint
  2017-12-14 20:20 [PATCH tip 0/3] Improvements of scheduler related Tracepoints Teng Qin
  2017-12-14 20:20 ` [PATCH tip 1/3] Improve sched_switch Tracepoint Teng Qin
  2017-12-14 20:20 ` [PATCH tip 2/3] Improve sched_process_fork Tracepoint Teng Qin
@ 2017-12-14 20:20 ` Teng Qin
  2017-12-14 20:49 ` [PATCH tip 0/3] Improvements of scheduler related Tracepoints Peter Zijlstra
  3 siblings, 0 replies; 12+ messages in thread
From: Teng Qin @ 2017-12-14 20:20 UTC (permalink / raw)
  To: peterz
  Cc: mingo, ast, bgregg, daniel, yhs, linux-kernel, Kernel-team, Teng Qin

This commit adds group_dead value to sched_process_exit's Tracepoint
struct. The value is passed as an argument of the Tracepoint. It is
useful to differentiate exits of a regular process or the last process
of the process group.

This commit also exposes pointers of exited task. BPF program can read task
information via task struct pointer. Exposing the pointer explicitly gives
BPF programs an easy and reliable way of using the Tracepoint.

Signed-off-by: Teng Qin <qinteng@fb.com>
---
 include/trace/events/sched.h | 37 +++++++++++++++++++++++++++++--------
 kernel/exit.c                |  2 +-
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index e8bb6b0..c741f4e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -204,6 +204,35 @@ TRACE_EVENT(sched_migrate_task,
 		  __entry->orig_cpu, __entry->dest_cpu)
 );
 
+/*
+ * Tracepoint for a task exiting:
+ */
+TRACE_EVENT(sched_process_exit,
+
+	TP_PROTO(struct task_struct *p, bool group_dead),
+
+	TP_ARGS(p, group_dead),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	prio			)
+		__field(	bool,	group_dead		)
+		__field(	void *,	task_struct		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->prio		= p->prio;
+		__entry->group_dead	= group_dead;
+		__entry->task_struct	= (void *)p;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d",
+		  __entry->comm, __entry->pid, __entry->prio)
+);
+
 DECLARE_EVENT_CLASS(sched_process_template,
 
 	TP_PROTO(struct task_struct *p),
@@ -233,14 +262,6 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 	     TP_PROTO(struct task_struct *p),
 	     TP_ARGS(p));
 	     
-
-/*
- * Tracepoint for a task exiting:
- */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
-	     TP_PROTO(struct task_struct *p),
-	     TP_ARGS(p));
-
 /*
  * Tracepoint for waiting on task to unschedule:
  */
diff --git a/kernel/exit.c b/kernel/exit.c
index 6b4298a..b613594 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -853,7 +853,7 @@ void __noreturn do_exit(long code)
 
 	if (group_dead)
 		acct_process();
-	trace_sched_process_exit(tsk);
+	trace_sched_process_exit(tsk, group_dead);
 
 	exit_sem(tsk);
 	exit_shm(tsk);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-14 20:20 [PATCH tip 0/3] Improvements of scheduler related Tracepoints Teng Qin
                   ` (2 preceding siblings ...)
  2017-12-14 20:20 ` [PATCH tip 3/3] Improve sched_process_exit Tracepoint Teng Qin
@ 2017-12-14 20:49 ` Peter Zijlstra
  2017-12-15  3:16   ` Alexei Starovoitov
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2017-12-14 20:49 UTC (permalink / raw)
  To: Teng Qin; +Cc: mingo, ast, bgregg, daniel, yhs, linux-kernel, Kernel-team

On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> This set of commits attempts to improve three scheduler related
> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
> 
> Firstly, these commit add additional flag values, namely preempt,
> clone_flags and group_dead to these Tracepoints, to make information
> exposed via the Tracepoints more useful and complete.
> 
> Secondly, these commits exposes task_struct pointers in these
> Tracepoints. The task_struct pointers are arguments of the Tracepoints
> and currently only used to compute struct field values. But for BPF
> programs attached to these Tracepoints, we may want to read additional
> task information via the task_struct pointers. This is currently either
> impossible, or we have to make assumption of whether the Tracepoint is
> running from previous / parent or next / child, and use current pointer
> instead. Exposing the task_struct pointers explicitly makes such use
> case easier and more reliable.
> 

NAK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-14 20:49 ` [PATCH tip 0/3] Improvements of scheduler related Tracepoints Peter Zijlstra
@ 2017-12-15  3:16   ` Alexei Starovoitov
  2017-12-15  7:39     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2017-12-15  3:16 UTC (permalink / raw)
  To: Peter Zijlstra, Teng Qin
  Cc: mingo, bgregg, daniel, yhs, linux-kernel, Kernel-team

On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
>> This set of commits attempts to improve three scheduler related
>> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
>>
>> Firstly, these commit add additional flag values, namely preempt,
>> clone_flags and group_dead to these Tracepoints, to make information
>> exposed via the Tracepoints more useful and complete.
>>
>> Secondly, these commits exposes task_struct pointers in these
>> Tracepoints. The task_struct pointers are arguments of the Tracepoints
>> and currently only used to compute struct field values. But for BPF
>> programs attached to these Tracepoints, we may want to read additional
>> task information via the task_struct pointers. This is currently either
>> impossible, or we have to make assumption of whether the Tracepoint is
>> running from previous / parent or next / child, and use current pointer
>> instead. Exposing the task_struct pointers explicitly makes such use
>> case easier and more reliable.
>>
>
> NAK

not sure what is the concern here.
Is it first or second part of the above ?
preempt and group_dead are bool and clone_flags has uapi defined
flags, so no kernel internals being exposed.
Two task_struct pointers are unusable outside of bpf progs.
There are plenty of other tracepoints that store pointers to
kernel structs and bpf progs are looking into them.
So nothing new being exposed here as well.

Note that TP_printk() kept unchanged, so typical user tracing
that parses trace_pipe won't see any difference.
Apps that use binary apis are using libs like libtracecmd and also fine.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-15  3:16   ` Alexei Starovoitov
@ 2017-12-15  7:39     ` Peter Zijlstra
  2017-12-15  8:53       ` Teng Qin
  2017-12-15 17:09       ` Alexei Starovoitov
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2017-12-15  7:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Teng Qin, mingo, bgregg, daniel, yhs, linux-kernel, Kernel-team

On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
> On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> > On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> > > This set of commits attempts to improve three scheduler related
> > > Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
> > > 
> > > Firstly, these commit add additional flag values, namely preempt,
> > > clone_flags and group_dead to these Tracepoints, to make information
> > > exposed via the Tracepoints more useful and complete.
> > > 
> > > Secondly, these commits exposes task_struct pointers in these
> > > Tracepoints. The task_struct pointers are arguments of the Tracepoints
> > > and currently only used to compute struct field values. But for BPF
> > > programs attached to these Tracepoints, we may want to read additional
> > > task information via the task_struct pointers. This is currently either
> > > impossible, or we have to make assumption of whether the Tracepoint is
> > > running from previous / parent or next / child, and use current pointer
> > > instead. Exposing the task_struct pointers explicitly makes such use
> > > case easier and more reliable.
> > > 
> > 
> > NAK
> 
> not sure what is the concern here.
> Is it first or second part of the above ?

Definitely the second, but also the first. You know I would have ripped
out all scheduler tracepoints if I could have. They're a pain in the
arse.

A lot of people want to add to the tracepoints, with the end result that
they'll end up a big bloated pile of useless crap. The first part is
just the pieces you want added.

As to the second, that's complete crap; that just makes everything
slower for bodies benefit. If you register a traceprobe you already get
access to these things.

I think your problem is that you use perf to get access to the
tracepoints, which them means you have to do disgusting things like
this.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-15  7:39     ` Peter Zijlstra
@ 2017-12-15  8:53       ` Teng Qin
  2017-12-15  9:53         ` Peter Zijlstra
  2017-12-15 17:09       ` Alexei Starovoitov
  1 sibling, 1 reply; 12+ messages in thread
From: Teng Qin @ 2017-12-15  8:53 UTC (permalink / raw)
  To: Peter Zijlstra, Alexei Starovoitov
  Cc: mingo, bgregg, daniel, Yonghong Song, linux-kernel, Kernel Team



On 12/14/17, 23:40, "Peter Zijlstra" <peterz@infradead.org> wrote:
> On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
> > On 12/14/17 12:49 PM, Peter Zijlstra wrote:
> > > On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
> > > > This set of commits attempts to improve three scheduler related
> > > > Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
> > > >
> > > > Firstly, these commit add additional flag values, namely preempt,
> > > > clone_flags and group_dead to these Tracepoints, to make information
> > > > exposed via the Tracepoints more useful and complete.
> > > >
> > > > Secondly, these commits exposes task_struct pointers in these
> > > > Tracepoints. The task_struct pointers are arguments of the Tracepoints
> > > > and currently only used to compute struct field values. But for BPF
> > > > programs attached to these Tracepoints, we may want to read additional
> > > > task information via the task_struct pointers. This is currently either
> > > > impossible, or we have to make assumption of whether the Tracepoint is
> > > > running from previous / parent or next / child, and use current pointer
> > > > instead. Exposing the task_struct pointers explicitly makes such use
> > > > case easier and more reliable.
> > > >
> > >
> > > NAK
> >
> > not sure what is the concern here.
> > Is it first or second part of the above ?
>
> Definitely the second, but also the first. You know I would have ripped
> out all scheduler tracepoints if I could have. They're a pain in the
> arse.
>
> A lot of people want to add to the tracepoints, with the end result that
> they'll end up a big bloated pile of useless crap. The first part is
> just the pieces you want added.
>
> As to the second, that's complete crap; that just makes everything
> slower for bodies benefit. If you register a traceprobe you already get
> access to these things.

To have access to related task_struct is one of the main purposes of these
patches. Take sched_switch as an example. We depend on the implementation
of the Tracepoint is called from prev or next (which could, although unlikedly,
change) and use current to get that task_struct, which feels, correct
me if I'm wrong, kind of defeating the purpose of Tracepoints being more
implementation-independent than kprobes. Then we have to figure out another
Tracepoint or most likely a kprobe function to get the other (prev or next)
task_struct.

> I think your problem is that you use perf to get access to the
> tracepoints, which them means you have to do disgusting things like
> this.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-15  8:53       ` Teng Qin
@ 2017-12-15  9:53         ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2017-12-15  9:53 UTC (permalink / raw)
  To: Teng Qin
  Cc: Alexei Starovoitov, mingo, bgregg, daniel, Yonghong Song,
	linux-kernel, Kernel Team

On Fri, Dec 15, 2017 at 08:53:30AM +0000, Teng Qin wrote:
> To have access to related task_struct is one of the main purposes of these
> patches. Take sched_switch as an example. We depend on the implementation
> of the Tracepoint is called from prev or next (which could, although unlikedly,
> change) and use current to get that task_struct, which feels, correct
> me if I'm wrong, kind of defeating the purpose of Tracepoints being more
> implementation-independent than kprobes. Then we have to figure out another
> Tracepoint or most likely a kprobe function to get the other (prev or next)
> task_struct.

Go read how tracepoints work. The tracepoint_probe things get you
exactly what you want.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-15  7:39     ` Peter Zijlstra
  2017-12-15  8:53       ` Teng Qin
@ 2017-12-15 17:09       ` Alexei Starovoitov
  2017-12-18  9:11         ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2017-12-15 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Teng Qin, mingo, bgregg, daniel, yhs, linux-kernel, Kernel-team

On 12/14/17 11:39 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 07:16:00PM -0800, Alexei Starovoitov wrote:
>> On 12/14/17 12:49 PM, Peter Zijlstra wrote:
>>> On Thu, Dec 14, 2017 at 12:20:41PM -0800, Teng Qin wrote:
>>>> This set of commits attempts to improve three scheduler related
>>>> Tracepoints: sched_switch, sched_process_fork, sched_process_exit.
>>>>
>>>> Firstly, these commit add additional flag values, namely preempt,
>>>> clone_flags and group_dead to these Tracepoints, to make information
>>>> exposed via the Tracepoints more useful and complete.
>>>>
>>>> Secondly, these commits exposes task_struct pointers in these
>>>> Tracepoints. The task_struct pointers are arguments of the Tracepoints
>>>> and currently only used to compute struct field values. But for BPF
>>>> programs attached to these Tracepoints, we may want to read additional
>>>> task information via the task_struct pointers. This is currently either
>>>> impossible, or we have to make assumption of whether the Tracepoint is
>>>> running from previous / parent or next / child, and use current pointer
>>>> instead. Exposing the task_struct pointers explicitly makes such use
>>>> case easier and more reliable.
>>>>
>>>
>>> NAK
>>
>> not sure what is the concern here.
>> Is it first or second part of the above ?
>
> Definitely the second, but also the first. You know I would have ripped
> out all scheduler tracepoints if I could have. They're a pain in the
> arse.
>
> A lot of people want to add to the tracepoints, with the end result that
> they'll end up a big bloated pile of useless crap. The first part is
> just the pieces you want added.
>
> As to the second, that's complete crap; that just makes everything
> slower for bodies benefit. If you register a traceprobe you already get
> access to these things.
>
> I think your problem is that you use perf to get access to the
> tracepoints, which them means you have to do disgusting things like
> this.

yeah. Currently bpf progs are called at the end of
perf_trace_##call()
{
   .. regular tracepoint copy craft
   perf_trace_run_bpf_submit( &copied args )
}

from bpf pov we'd rather get access to raw args passed into
perf_trace_##call.
Sounds like you're suggesting to let bpf side register its
progs directly via tracepoint_probe_register() ?
That would solve the whole thing really nicely indeed.

How such api would look like ?
Something like extending kprobe/uprobe fd-based perf_event_open?
https://www.spinics.net/lists/netdev/msg470567.html
btw could you please apply that set to tip tree
or you want us to route it via bpf-next -> net-next ?

Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-15 17:09       ` Alexei Starovoitov
@ 2017-12-18  9:11         ` Peter Zijlstra
  2017-12-21  2:03           ` tracepoint_probe_register and bpf. Was: " Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2017-12-18  9:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Teng Qin, mingo, bgregg, daniel, yhs, linux-kernel, Kernel-team

On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:

> yeah. Currently bpf progs are called at the end of
> perf_trace_##call()
> {
>   .. regular tracepoint copy craft
>   perf_trace_run_bpf_submit( &copied args )
> }
> 
> from bpf pov we'd rather get access to raw args passed into
> perf_trace_##call.
> Sounds like you're suggesting to let bpf side register its
> progs directly via tracepoint_probe_register() ?
> That would solve the whole thing really nicely indeed.

Not sure I thought that far; but if you want the probe arguments either
grab them from the perf probe you're running in or indeed register your
own, don't play silly buggers and copy them in the trace data.

> How such api would look like ?
> Something like extending kprobe/uprobe fd-based perf_event_open?
> https://www.spinics.net/lists/netdev/msg470567.html
> btw could you please apply that set to tip tree
> or you want us to route it via bpf-next -> net-next ?

Oh right, lemme prod Ingo on that.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
  2017-12-18  9:11         ` Peter Zijlstra
@ 2017-12-21  2:03           ` Alexei Starovoitov
  0 siblings, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2017-12-21  2:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Teng Qin, mingo, bgregg, daniel, linux-kernel, kernel-team,
	netdev, David S. Miller, brouer

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
> 
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> >   .. regular tracepoint copy craft
> >   perf_trace_run_bpf_submit( &copied args )
> > }
> > 
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
> 
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. grab arguments from:
perf_trace_##call(void *__data, proto)

it works ok, but it adds 50-70 bytes of .text to every perf_trace_*
functions which multiplied by the number of tracepoints (~500) which in
total adds ~30k of .text to vmlinux
Not too bad, but it also adds extra branch to critical
execution path, so not ideal

2. register your own
This approach I think is much better.
The trick I'm doing is the following:
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)  \
/* no 'static' here. The bpf probe functions are global */              \
notrace void                                                            \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
        struct trace_event_call *event_call = __data;                   \
        \
        __FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args)); \
}

Where CAST_TO_U64 is a macro to cast all args to u64 one by one.
bpf_trace_run*() functions are defined as:
void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3);

so in assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:	mov    %ecx,%ecx
   0xffffffff81132082 <+2>:	jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3() and all other registers are passed as-is.
So all of ~500 of __bpf_trace_*() functions are only 5-10 _byte_ long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since I want them to appear in kallsyms, so I don't
have to add another 8-byte fields to 'struct trace_event_class'
and 'struct trace_event_call'

Such approach, I think, is the lowest overhead we can possibly have
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since many folks are starting to use tracepoint+bpf at speeds of
1M+ events per second this would be very valuable optimization.

Diff stat so far:
 drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
 fs/btrfs/super.c                  |  4 ++++
 include/linux/trace_events.h      | 27 +++++++++++++++++++++++++++
 include/trace/bpf_probe.h         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/trace/define_trace.h      |  1 +
 include/uapi/linux/bpf.h          |  4 ++++
 kernel/trace/bpf_trace.c          | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on ftrace or perf side,
I'm thinking to extend BPF_PROG_ATTACH command to specify
which tracepoint to attach to.
The user space will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to
find __tracepoint_xdp_exception record, check that 
tp->name == "xdp_exception",
then find __bpf_trace_xdp_exception() address (also in kallsyms),
and finally use tracepoint_probe_register() to connect the two.

Thoughts?

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-12-21  2:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-14 20:20 [PATCH tip 0/3] Improvements of scheduler related Tracepoints Teng Qin
2017-12-14 20:20 ` [PATCH tip 1/3] Improve sched_switch Tracepoint Teng Qin
2017-12-14 20:20 ` [PATCH tip 2/3] Improve sched_process_fork Tracepoint Teng Qin
2017-12-14 20:20 ` [PATCH tip 3/3] Improve sched_process_exit Tracepoint Teng Qin
2017-12-14 20:49 ` [PATCH tip 0/3] Improvements of scheduler related Tracepoints Peter Zijlstra
2017-12-15  3:16   ` Alexei Starovoitov
2017-12-15  7:39     ` Peter Zijlstra
2017-12-15  8:53       ` Teng Qin
2017-12-15  9:53         ` Peter Zijlstra
2017-12-15 17:09       ` Alexei Starovoitov
2017-12-18  9:11         ` Peter Zijlstra
2017-12-21  2:03           ` tracepoint_probe_register and bpf. Was: " Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.