LKML Archive on lore.kernel.org
* Rough idea of implementing blocking perf calls for system call tracepoints
@ 2018-11-28 18:47 Steven Rostedt
  2018-11-28 19:18 ` Steven Rostedt
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2018-11-28 18:47 UTC (permalink / raw)
  To: LKML
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, Jiri Olsa,
	ldv, esyr, Frederic Weisbecker


[
 Sorry for the late reply on this; when I got back from Plumbers, my
 work was really piled up, and then Turkey day just added more
 to the chaos.
]

This comes from our discussion at the Linux Plumbers strace talk about
implementing strace with perf. Since strace is required to be lossless,
it currently cannot be implemented with perf, because there's always a
chance of losing events. The idea here is to provide a way to record
system calls with perf that also blocks when the perf ring buffer is
full.

Below is a patch I wrote that gives an idea of what needs to be done.
It is by no means a real patch (it won't even compile), and I left out
the wake-up part, as I'm not familiar enough with how perf works to
implement it. But hopefully someone on this list can :-)

The idea here is that we tag the sys_enter and sys_exit tracepoints
with a new flag called TRACE_EVENT_FL_BLOCK. When the perf code records
the event and the buffer is full, it sets a "perf_block" field in
the current task structure to point to the tp_event, provided the
tp_event has the BLOCK flag set.

Then, on the exit of the syscall tracepoints, the perf_block field is
checked; if it is set, the task knows that the event was dropped and
will add itself to a wait queue. When the reader reads the perf buffer
and hits a watermark, it can wake whatever is on the queue (I'm not
sure where to put this queue, but someone can figure it out).

Once woken, the task will try to write to the perf system call
tracepoint again (notice that it only retries perf and doesn't call the
generic tracepoint code, as only perf requires a repeat).

This is just a basic idea patch, to hopefully show someone else what I
envision. I think it can work, and if it does, I can imagine that it
would greatly improve the performance of strace!

Thoughts?

-- Steve

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..57fe95950a24 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -123,8 +123,22 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	}
 #endif
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
+		current->perf_block = NULL;
 		trace_sys_enter(regs, regs->orig_ax);
+		while (current->perf_block) {
+			DECLARE_WAITQUEUE(wait, current);
+			struct trace_event_call *tp_event = current->perf_block;
+
+			current->perf_block = NULL;
+
+			set_current_state(TASK_INTERRUPTIBLE);
+			add_wait_queue(&tp_event->block_queue, &wait);
+			perf_trace_sys_enter(tp_event, regs, regs->orig_ax);
+			if (current->perf_block)
+				schedule();
+		}
+	}
 
 	do_audit_syscall_entry(regs, arch);
 
@@ -224,8 +238,22 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
 
 	audit_syscall_exit(regs);
 
-	if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+	if (cached_flags & _TIF_SYSCALL_TRACEPOINT) {
+		current->perf_block = NULL;
 		trace_sys_exit(regs, regs->ax);
+		while (current->perf_block) {
+			DECLARE_WAITQUEUE(wait, current);
+			struct trace_event_call *tp_event = current->perf_block;
+
+			current->perf_block = NULL;
+
+			set_current_state(TASK_INTERRUPTIBLE);
+			add_wait_queue(&tp_event->block_queue, &wait);
+			perf_trace_sys_exit(tp_event, regs, regs->ax);
+			if (current->perf_block)
+				schedule();
+		}
+	}
 
 	/*
 	 * If TIF_SYSCALL_EMU is set, we only get here because of
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d6183a55e8eb..39a98da0e03b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	struct trace_event_call		*perf_block;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	unsigned long			preempt_disable_ip;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 4130a5497d40..35a71f853d64 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -226,6 +226,7 @@ enum {
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
 	TRACE_EVENT_FL_KPROBE_BIT,
 	TRACE_EVENT_FL_UPROBE_BIT,
+	TRACE_EVENT_FL_BLOCK_BIT,
 };
 
 /*
@@ -246,6 +247,7 @@ enum {
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
 	TRACE_EVENT_FL_KPROBE		= (1 << TRACE_EVENT_FL_KPROBE_BIT),
 	TRACE_EVENT_FL_UPROBE		= (1 << TRACE_EVENT_FL_UPROBE_BIT),
+	TRACE_EVENT_FL_BLOCK		= (1 << TRACE_EVENT_FL_BLOCK_BIT),
 };
 
 #define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index 44a3259ed4a5..3fc4ecafebcf 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -39,7 +39,7 @@ TRACE_EVENT_FN(sys_enter,
 	syscall_regfunc, syscall_unregfunc
 );
 
-TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
+TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY | TRACE_EVENT_FL_BLOCK)
 
 TRACE_EVENT_FN(sys_exit,
 
@@ -63,7 +63,7 @@ TRACE_EVENT_FN(sys_exit,
 	syscall_regfunc, syscall_unregfunc
 );
 
-TRACE_EVENT_FLAGS(sys_exit, TRACE_EVENT_FL_CAP_ANY)
+TRACE_EVENT_FLAGS(sys_exit, TRACE_EVENT_FL_CAP_ANY | TRACE_EVENT_FL_BLOCK)
 
 #endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84530ab358c3..64be3de9ee41 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6505,8 +6505,12 @@ __perf_event_output(struct perf_event *event,
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size)) {
+		if (event->tp_event &&
+		    event->tp_event->flags & TRACE_EVENT_FL_BLOCK)
+			current->perf_block = event->tp_event;
 		goto exit;
+	}
 
 	perf_output_sample(&handle, &header, data, event);
 


* Re: Rough idea of implementing blocking perf calls for system call tracepoints
  2018-11-28 18:47 Rough idea of implementing blocking perf calls for system call tracepoints Steven Rostedt
@ 2018-11-28 19:18 ` Steven Rostedt
  2018-11-30 10:40   ` Jiri Olsa
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2018-11-28 19:18 UTC (permalink / raw)
  To: LKML
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, Jiri Olsa,
	ldv, esyr, Frederic Weisbecker, Masami Hiramatsu, Namhyung Kim


Adding Masami and Namhyung to this as well.

-- Steve




* Re: Rough idea of implementing blocking perf calls for system call tracepoints
  2018-11-28 19:18 ` Steven Rostedt
@ 2018-11-30 10:40   ` Jiri Olsa
  2018-11-30 11:48     ` Jiri Olsa
  0 siblings, 1 reply; 6+ messages in thread
From: Jiri Olsa @ 2018-11-30 10:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, ldv, esyr,
	Frederic Weisbecker, Masami Hiramatsu, Namhyung Kim

On Wed, Nov 28, 2018 at 02:18:08PM -0500, Steven Rostedt wrote:
> 
> Adding Masami and Namhyung to this as well.
> 
> -- Steve
> 
> 
> On Wed, 28 Nov 2018 13:47:00 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > @@ -123,8 +123,22 @@ static long syscall_trace_enter(struct pt_regs *regs)
> >  	}
> >  #endif
> >  
> > -	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> > +	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
> > +		current->perf_block = NULL;
> >  		trace_sys_enter(regs, regs->orig_ax);
> > +		while (current->perf_block) {
> > +			DECLARE_WAITQUEUE(wait, current);
> > +			struct trace_event_call *tp_event = current->perf_block;
> > +
> > +			current->perf_block = NULL;
> > +
> > +			set_current_state(TASK_INTERRUPTIBLE);
> > +			add_wait_queue(&tp_event->block_queue, &wait);
> > +			perf_trace_sys_enter(tp_event, regs, regs->orig_ax);
> > +			if (current->perf_block)
> > +				schedule();

the space gets freed up by user space moving the tail pointer,
so I wonder whether we actually need to do some polling in here

also how about making this a ring buffer feature so it's not specific
just to sys_enter/sys_exit.. I'll check

jirka


* Re: Rough idea of implementing blocking perf calls for system call tracepoints
  2018-11-30 10:40   ` Jiri Olsa
@ 2018-11-30 11:48     ` Jiri Olsa
  2018-11-30 14:01       ` Jiri Olsa
  0 siblings, 1 reply; 6+ messages in thread
From: Jiri Olsa @ 2018-11-30 11:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, ldv, esyr,
	Frederic Weisbecker, Masami Hiramatsu, Namhyung Kim

On Fri, Nov 30, 2018 at 11:40:44AM +0100, Jiri Olsa wrote:
> On Wed, Nov 28, 2018 at 02:18:08PM -0500, Steven Rostedt wrote:
> > 
> > Adding Masami and Namhyung to this as well.
> > 
> > -- Steve
> > 
> > 
> > On Wed, 28 Nov 2018 13:47:00 -0500
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > @@ -123,8 +123,22 @@ static long syscall_trace_enter(struct pt_regs *regs)
> > >  	}
> > >  #endif
> > >  
> > > -	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> > > +	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
> > > +		current->perf_block = NULL;
> > >  		trace_sys_enter(regs, regs->orig_ax);
> > > +		while (current->perf_block) {
> > > +			DECLARE_WAITQUEUE(wait, current);
> > > +			struct trace_event_call *tp_event = current->perf_block;
> > > +
> > > +			current->perf_block = NULL;
> > > +
> > > +			set_current_state(TASK_INTERRUPTIBLE);
> > > +			add_wait_queue(&tp_event->block_queue, &wait);
> > > +			perf_trace_sys_enter(tp_event, regs, regs->orig_ax);
> > > +			if (current->perf_block)
> > > +				schedule();
> 
> the space gets freed up by user space moving the tail pointer,
> so I wonder whether we actually need to do some polling in here
> 
> also how about making this a ring buffer feature so it's not specific
> just to sys_enter/sys_exit.. I'll check

or perhaps just tracepoints.. it does not seem to make much
sense for the other events

jirka


* Re: Rough idea of implementing blocking perf calls for system call tracepoints
  2018-11-30 11:48     ` Jiri Olsa
@ 2018-11-30 14:01       ` Jiri Olsa
  2018-11-30 14:45         ` Steven Rostedt
  0 siblings, 1 reply; 6+ messages in thread
From: Jiri Olsa @ 2018-11-30 14:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, ldv, esyr,
	Frederic Weisbecker, Masami Hiramatsu, Namhyung Kim

On Fri, Nov 30, 2018 at 12:48:09PM +0100, Jiri Olsa wrote:
> On Fri, Nov 30, 2018 at 11:40:44AM +0100, Jiri Olsa wrote:
> > > > @@ -123,8 +123,22 @@ static long syscall_trace_enter(struct pt_regs *regs)
> > > >  	}
> > > >  #endif
> > > >  
> > > > -	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> > > > +	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
> > > > +		current->perf_block = NULL;
> > > >  		trace_sys_enter(regs, regs->orig_ax);
> > > > +		while (current->perf_block) {
> > > > +			DECLARE_WAITQUEUE(wait, current);
> > > > +			struct trace_event_call *tp_event = current->perf_block;
> > > > +
> > > > +			current->perf_block = NULL;
> > > > +
> > > > +			set_current_state(TASK_INTERRUPTIBLE);
> > > > +			add_wait_queue(&tp_event->block_queue, &wait);
> > > > +			perf_trace_sys_enter(tp_event, regs, regs->orig_ax);
> > > > +			if (current->perf_block)
> > > > +				schedule();
> > 
> > the space gets freed up by user space moving the tail pointer,
> > so I wonder whether we actually need to do some polling in here
> > 
> > also how about making this a ring buffer feature so it's not specific
> > just to sys_enter/sys_exit.. I'll check
> 
> or perhaps just tracepoints.. it does not seem to make much
> sense for the other events

actually the syscalls seem to be enough for now ;-) I tried
something and ended up with the hack below

to test, you can take the user-space change from my perf/sys_block
branch and run something like:

  # perf record -e syscalls:sys_enter_close/block/ -p <ex-pid>

with the ex prog being:

	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		unsigned int i = 0;

		while (1) {
			close(-1);
			if (i++ % 100000 == 0) {
				printf(".");
				fflush(NULL);
			}
		}
	}

when I stop perf record, I can see ex being stopped as well

there's probably a better way to do the polling; I'll need
to check on that.. among other things ;-)

jirka


---
 arch/x86/entry/common.c         | 16 ++++++++++++++--
 include/linux/perf_event.h      |  4 +++-
 include/linux/sched.h           |  1 +
 include/linux/syscalls.h        |  2 ++
 include/linux/trace_events.h    |  6 +++---
 include/trace/perf.h            |  2 +-
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c            | 28 ++++++++++++++++++++++++----
 kernel/trace/trace_event_perf.c |  2 +-
 kernel/trace/trace_kprobe.c     |  4 ++--
 kernel/trace/trace_syscalls.c   | 28 ++++++++++++++++++++++++----
 kernel/trace/trace_uprobe.c     |  2 +-
 12 files changed, 78 insertions(+), 20 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..139f857a2cce 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -123,8 +123,14 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	}
 #endif
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
 		trace_sys_enter(regs, regs->orig_ax);
+		while (current->perf_block) {
+			schedule_timeout(100 * HZ);
+			current->perf_block = false;
+			perf_trace_sys_enter_blocked(regs, regs->orig_ax);
+		}
+	}
 
 	do_audit_syscall_entry(regs, arch);
 
@@ -224,8 +230,14 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
 
 	audit_syscall_exit(regs);
 
-	if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+	if (cached_flags & _TIF_SYSCALL_TRACEPOINT) {
 		trace_sys_exit(regs, regs->ax);
+		while (current->perf_block) {
+			schedule_timeout(100 * HZ);
+			current->perf_block = false;
+			perf_trace_sys_exit_blocked(regs, regs->ax);
+		}
+	}
 
 	/*
 	 * If TIF_SYSCALL_EMU is set, we only get here because of
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 47a31d01df5a..af722b95135e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -695,6 +695,8 @@ struct perf_event {
 #endif
 
 	struct list_head		sb_list;
+
+	bool				blocked;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
@@ -1200,7 +1202,7 @@ extern void perf_event_init(void);
 extern void perf_tp_event(u16 event_type, u64 count, void *record,
 			  int entry_size, struct pt_regs *regs,
 			  struct hlist_head *head, int rctx,
-			  struct task_struct *task);
+			  struct task_struct *task, bool blocked);
 extern void perf_bp_event(struct perf_event *event, void *data);
 
 #ifndef perf_misc_flags
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a51c13c2b1a0..60463fcfe795 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	bool				perf_block;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	unsigned long			preempt_disable_ip;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ac3d13a915b..9ae11151e861 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1296,4 +1296,6 @@ static inline unsigned int ksys_personality(unsigned int personality)
 	return old;
 }
 
+void perf_trace_sys_enter_blocked(struct pt_regs *regs, long id);
+void perf_trace_sys_exit_blocked(struct pt_regs *regs, long ret);
 #endif
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 4130a5497d40..d8d0289c955c 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -618,14 +618,14 @@ void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
 void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct trace_event_call *call, u64 count,
 			       struct pt_regs *regs, struct hlist_head *head,
-			       struct task_struct *task);
+			       struct task_struct *task, bool blocked);
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u16 type,
 		       u64 count, struct pt_regs *regs, void *head,
-		       struct task_struct *task)
+		       struct task_struct *task, bool blocked)
 {
-	perf_tp_event(type, count, raw_data, size, regs, head, rctx, task);
+	perf_tp_event(type, count, raw_data, size, regs, head, rctx, task, blocked);
 }
 
 #endif
diff --git a/include/trace/perf.h b/include/trace/perf.h
index dbc6c74defc3..ac9a18fa54b4 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -67,7 +67,7 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	perf_trace_run_bpf_submit(entry, __entry_size, rctx,		\
 				  event_call, __count, __regs,		\
-				  head, __task);			\
+				  head, __task, false);			\
 }
 
 /*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9de8780ac8d9..92bae4cf279c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
 				namespaces     :  1, /* include namespaces data */
-				__reserved_1   : 35;
+				block          :  1, /* block process if there's no space in RB (syscall tracepoints only) */
+				__reserved_1   : 34;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7403a27363f8..889e95e9de0e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6505,8 +6505,11 @@ __perf_event_output(struct perf_event *event,
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size)) {
+		if (event->attr.block)
+			event->blocked = current->perf_block = true;
 		goto exit;
+	}
 
 	perf_output_sample(&handle, &header, data, event);
 
@@ -8262,7 +8265,7 @@ static int perf_tp_event_match(struct perf_event *event,
 void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct trace_event_call *call, u64 count,
 			       struct pt_regs *regs, struct hlist_head *head,
-			       struct task_struct *task)
+			       struct task_struct *task, bool blocked)
 {
 	if (bpf_prog_array_valid(call)) {
 		*(struct pt_regs **)raw_data = regs;
@@ -8272,13 +8275,22 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 		}
 	}
 	perf_tp_event(call->event.type, count, raw_data, size, regs, head,
-		      rctx, task);
+		      rctx, task, blocked);
 }
 EXPORT_SYMBOL_GPL(perf_trace_run_bpf_submit);
 
+static bool perf_event_blocked(struct perf_event *event)
+{
+	bool blocked = event->attr.block && event->blocked;
+
+	if (blocked)
+		event->blocked = false;
+	return blocked;
+}
+
 void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 		   struct pt_regs *regs, struct hlist_head *head, int rctx,
-		   struct task_struct *task)
+		   struct task_struct *task, bool blocked)
 {
 	struct perf_sample_data data;
 	struct perf_event *event;
@@ -8296,6 +8308,9 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 	perf_trace_buf_update(record, event_type);
 
 	hlist_for_each_entry_rcu(event, head, hlist_entry) {
+		if (blocked && !perf_event_blocked(event))
+			continue;
+
 		if (perf_tp_event_match(event, &data, regs))
 			perf_swevent_event(event, count, &data, regs);
 	}
@@ -8314,6 +8329,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 			goto unlock;
 
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+			if (blocked && !perf_event_blocked(event))
+				continue;
 			if (event->cpu != smp_processor_id())
 				continue;
 			if (event->attr.type != PERF_TYPE_TRACEPOINT)
@@ -10461,6 +10478,9 @@ SYSCALL_DEFINE5(perf_event_open,
 			return -EINVAL;
 	}
 
+	if (attr.block && attr.type != PERF_TYPE_TRACEPOINT)
+		return -EINVAL;
+
 	/* Only privileged users can get physical addresses */
 	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
 	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 76217bbef815..fb0d00000ff2 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -462,7 +462,7 @@ perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
 	entry->ip = ip;
 	entry->parent_ip = parent_ip;
 	perf_trace_buf_submit(entry, ENTRY_SIZE, rctx, TRACE_FN,
-			      1, &regs, &head, NULL);
+			      1, &regs, &head, NULL, false);
 
 #undef ENTRY_SIZE
 }
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index fec67188c4d2..0f740e87c3f6 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1200,7 +1200,7 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 	memset(&entry[1], 0, dsize);
 	store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize);
 	perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
-			      head, NULL);
+			      head, NULL, false);
 	return 0;
 }
 NOKPROBE_SYMBOL(kprobe_perf_func);
@@ -1236,7 +1236,7 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 	entry->ret_ip = (unsigned long)ri->ret_addr;
 	store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize);
 	perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
-			      head, NULL);
+			      head, NULL, false);
 }
 NOKPROBE_SYMBOL(kretprobe_perf_func);
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index f93a56d2db27..87db553bf681 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -578,7 +578,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
 	return trace_call_bpf(call, &param);
 }
 
-static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
+static void __perf_syscall_enter(struct pt_regs *regs, long id, bool blocked)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_enter *rec;
@@ -625,7 +625,17 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->enter_event->event.type, 1, regs,
-			      head, NULL);
+			      head, NULL, blocked);
+}
+
+static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
+{
+	__perf_syscall_enter(regs, id, false);
+}
+
+void perf_trace_sys_enter_blocked(struct pt_regs *regs, long id)
+{
+	__perf_syscall_enter(regs, id, true);
 }
 
 static int perf_sysenter_enable(struct trace_event_call *call)
@@ -677,7 +687,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg
 	return trace_call_bpf(call, &param);
 }
 
-static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
+static void __perf_syscall_exit(struct pt_regs *regs, long ret, bool blocked)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_exit *rec;
@@ -721,7 +731,17 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	}
 
 	perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
-			      1, regs, head, NULL);
+			      1, regs, head, NULL, blocked);
+}
+
+static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
+{
+	__perf_syscall_exit(regs, ret, false);
+}
+
+void perf_trace_sys_exit_blocked(struct pt_regs *regs, long ret)
+{
+	__perf_syscall_exit(regs, ret, true);
 }
 
 static int perf_sysexit_enable(struct trace_event_call *call)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 31ea48eceda1..228c444d1598 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1153,7 +1153,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	}
 
 	perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
-			      head, NULL);
+			      head, NULL, false);
  out:
 	preempt_enable();
 }
-- 
2.17.2
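
To make the intended semantics concrete, here is a small userspace analogue of the block-on-full behavior the patch is sketching. This is not kernel code and not part of the patch: the writer fails and marks itself blocked when the ring is full (standing in for `__perf_event_output()` setting `current->perf_block`), and the reader clears the blocked state once a watermark's worth of space has been freed (standing in for the missing wake-up path), after which the writer retries and no event is lost.

```c
#include <stdbool.h>

#define RB_SIZE        4
#define WAKE_WATERMARK 2   /* "wake" blocked writers once this many slots are free */

struct ring {
	int buf[RB_SIZE];
	unsigned head, tail;   /* head: next write, tail: next read (free-running) */
	bool writer_blocked;   /* analogue of current->perf_block */
};

static unsigned rb_used(const struct ring *rb)
{
	return rb->head - rb->tail;
}

/* Like __perf_event_output() with attr.block set: on a full buffer the
 * event is dropped, but the writer is marked blocked so it knows to
 * wait and retry instead of silently losing the record. */
static int rb_write(struct ring *rb, int ev)
{
	if (rb_used(rb) == RB_SIZE) {
		rb->writer_blocked = true;
		return -1;
	}
	rb->buf[rb->head++ % RB_SIZE] = ev;
	return 0;
}

/* Reader side: drain one event; once enough space frees up, clear the
 * blocked state (a wake_up() on the wait queue in the real design). */
static bool rb_read(struct ring *rb, int *ev)
{
	if (rb_used(rb) == 0)
		return false;
	*ev = rb->buf[rb->tail++ % RB_SIZE];
	if (rb->writer_blocked && RB_SIZE - rb_used(rb) >= WAKE_WATERMARK)
		rb->writer_blocked = false;
	return true;
}
```

The lossless property falls out of the retry: a failed `rb_write()` is never a lost event, only a deferred one, which is exactly what strace needs and what plain perf cannot guarantee today.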



* Re: Rough idea of implementing blocking perf calls for system call tracepoints
  2018-11-30 14:01       ` Jiri Olsa
@ 2018-11-30 14:45         ` Steven Rostedt
  0 siblings, 0 replies; 6+ messages in thread
From: Steven Rostedt @ 2018-11-30 14:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Luis Claudio R. Goncalves, Arnaldo Carvalho de Melo, ldv, esyr,
	Frederic Weisbecker, Masami Hiramatsu, Namhyung Kim

On Fri, 30 Nov 2018 15:01:58 +0100
Jiri Olsa <jolsa@redhat.com> wrote:

> > or perhaps just tracepoints.. does not seem to make much
> > sense for the events  
> 
> actually the syscalls seem to be enough for now ;-) I tried
> something and ended up with the hack below

Actually, we *only* want to do syscalls. We really don't want to add
blocking to trace events. That would be a nightmare for deadlocks and
such.

I was thinking about making this a PTRACE feature. That is, we start
the tracing via PTRACE, have PTRACE signal the tracer process when the
tracee blocks, and the tracer can then wake the tracee back up.

A tracer must have PTRACE permissions anyway. The goal of this is to
speed up syscall tracing.
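
From the tracer's side, the setup would look roughly like the sketch below. This is hypothetical code against the proposed ABI: the `block` attribute bit from the patch does not exist in mainline `perf_event.h`, so it appears only in a comment; everything else (PERF_TYPE_TRACEPOINT, the watermark wakeup fields, `perf_event_open()`) is the existing API, and the tracepoint id path is the usual tracefs location.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Configure a per-task tracepoint event the way a lossless,
 * strace-like tracer would.  `id` is the tracepoint id read from
 * /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/id. */
static void setup_blocking_attr(struct perf_event_attr *attr, __u64 id)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_TRACEPOINT;  /* attr.block would only be valid here */
	attr->config = id;
	attr->sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW;
	attr->sample_period = 1;            /* record every syscall, no sampling */
	attr->watermark = 1;
	attr->wakeup_watermark = 4096;      /* reader wakeup; would also drive waking blocked tasks */
	/* attr->block = 1;  proposed bit, not in mainline perf_event.h */
}

/* Attach to one task; cpu == -1 follows the task across CPUs. */
static int open_syscall_event(struct perf_event_attr *attr, pid_t pid)
{
	return (int)syscall(SYS_perf_event_open, attr, pid, -1, -1, 0);
}
```

Per the perf_event_open() hunk in the patch, setting the block bit on anything other than PERF_TYPE_TRACEPOINT would get -EINVAL, which keeps the blocking behavior confined to the syscall tracepoints.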

-- Steve



Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-28 18:47 Rough idea of implementing blocking perf calls for system call tracepoints Steven Rostedt
2018-11-28 19:18 ` Steven Rostedt
2018-11-30 10:40   ` Jiri Olsa
2018-11-30 11:48     ` Jiri Olsa
2018-11-30 14:01       ` Jiri Olsa
2018-11-30 14:45         ` Steven Rostedt
