linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 1/8] perf: Block perf calls for system call tracepoints
@ 2018-12-05 16:05 Jiri Olsa
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

hi,
Steven asked for the possibility to block the traced
process in case the tracer can't keep up and we would
lose events.

This (RFC) patchset tries to implement that by adding a
'block' bool to struct perf_event_attr, to be set only
for syscall tracepoints, and by adding code to block the
traced process if there's no space in the event's ring
buffer.

The blocking code will periodically poll for space and
continue once the event is successfully written.

Available also here:
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  perf/block

Thoughts?

thanks,
jirka


---
Jiri Olsa (8):
      perf: Allow to block process in syscall tracepoints
      perf tools: Sync uapi perf_event.h
      perf record: Add --block option
      perf trace: Add --block option
      perf tools: Add block term support for tracepoints
      perf tools: Add ordered_events__flush_time interface
      perf trace: Move event delivery to deliver_event function
      perf trace: Add ordered processing for --block option

 arch/x86/entry/common.c               | 36 ++++++++++++++++++++++++++++++++++--
 include/linux/perf_event.h            |  2 ++
 include/linux/sched.h                 |  2 ++
 include/linux/syscalls.h              |  2 ++
 include/uapi/linux/perf_event.h       |  3 ++-
 kernel/events/core.c                  | 40 ++++++++++++++++++++++++++++++++++++++--
 kernel/events/ring_buffer.c           |  4 +++-
 kernel/trace/trace_event_perf.c       |  4 ++++
 kernel/trace/trace_syscalls.c         | 28 ++++++++++++++++++++++++----
 tools/include/uapi/linux/perf_event.h |  3 ++-
 tools/perf/builtin-record.c           |  9 +++++++++
 tools/perf/builtin-trace.c            | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 tools/perf/perf.h                     |  1 +
 tools/perf/util/evsel.c               |  7 +++++++
 tools/perf/util/evsel.h               |  2 ++
 tools/perf/util/ordered-events.c      | 34 ++++++++++++++++++++++++++++++----
 tools/perf/util/ordered-events.h      |  3 +++
 tools/perf/util/parse-events.c        |  8 ++++++++
 tools/perf/util/parse-events.h        |  1 +
 tools/perf/util/parse-events.l        |  1 +
 20 files changed, 263 insertions(+), 24 deletions(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-05 17:35   ` Steven Rostedt
                     ` (3 more replies)
  2018-12-05 16:05 ` [PATCH 2/8] perf tools: Sync uapi perf_event.h Jiri Olsa
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Adding support to specify the 'block' bool in struct perf_event_attr
for syscall tracepoints, allowing the event to block the process
if there's no space in the ring buffer.

The blocking code will periodically poll for space and continue
once the event is successfully written.

It's allowed only for syscall tracepoint events attached to a
process. The following syscall events are supported:

  raw_syscalls:sys_enter
  raw_syscalls:sys_exit
  syscalls:sys_enter_accept
  syscalls:sys_enter_accept4
  syscalls:sys_enter_access
  syscalls:sys_enter_acct
  syscalls:sys_enter_add_key
  ...

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/n/tip-ocz7zwwkkx11v0mkxrtcddih@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/entry/common.c         | 36 +++++++++++++++++++++++++++--
 include/linux/perf_event.h      |  2 ++
 include/linux/sched.h           |  2 ++
 include/linux/syscalls.h        |  2 ++
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c            | 40 +++++++++++++++++++++++++++++++--
 kernel/events/ring_buffer.c     |  4 +++-
 kernel/trace/trace_event_perf.c |  4 ++++
 kernel/trace/trace_syscalls.c   | 28 +++++++++++++++++++----
 9 files changed, 111 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..e55cf9169a03 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -60,6 +60,32 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
 	}
 }
 
+static void trace_block_syscall(struct pt_regs *regs, bool enter)
+{
+	current->perf_blocked = true;
+
+	do {
+		schedule_timeout(100 * HZ);
+		current->perf_blocked_cnt = 0;
+
+		if (enter) {
+			/* perf syscalls:* enter */
+			perf_trace_syscall_enter(regs);
+
+			/* perf raw_syscalls:* enter */
+			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
+		} else {
+		/* perf syscalls:* exit */
+			perf_trace_syscall_exit(regs);
+
+		/* perf raw_syscalls:* exit */
+			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
+		}
+	} while (current->perf_blocked_cnt);
+
+	current->perf_blocked = false;
+}
+
 /*
  * Returns the syscall nr to run (which should match regs->orig_ax) or -1
  * to skip the syscall.
@@ -123,8 +149,11 @@ static long syscall_trace_enter(struct pt_regs *regs)
 	}
 #endif
 
-	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
+	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT))) {
 		trace_sys_enter(regs, regs->orig_ax);
+		if (current->perf_blocked_cnt)
+			trace_block_syscall(regs, true);
+	}
 
 	do_audit_syscall_entry(regs, arch);
 
@@ -224,8 +253,11 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
 
 	audit_syscall_exit(regs);
 
-	if (cached_flags & _TIF_SYSCALL_TRACEPOINT)
+	if (cached_flags & _TIF_SYSCALL_TRACEPOINT) {
 		trace_sys_exit(regs, regs->ax);
+		if (current->perf_blocked_cnt)
+			trace_block_syscall(regs, false);
+	}
 
 	/*
 	 * If TIF_SYSCALL_EMU is set, we only get here because of
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 47a31d01df5a..904b7245357a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -695,6 +695,8 @@ struct perf_event {
 #endif
 
 	struct list_head		sb_list;
+
+	bool				blocked;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a51c13c2b1a0..aea741ef29ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1009,6 +1009,8 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	bool				perf_blocked;
+	unsigned int			perf_blocked_cnt;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	unsigned long			preempt_disable_ip;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ac3d13a915b..3c8012ca9aa3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1296,4 +1296,6 @@ static inline unsigned int ksys_personality(unsigned int personality)
 	return old;
 }
 
+void perf_trace_syscall_enter(struct pt_regs *regs);
+void perf_trace_syscall_exit(struct pt_regs *regs);
 #endif
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9de8780ac8d9..92bae4cf279c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
 				namespaces     :  1, /* include namespaces data */
-				__reserved_1   : 35;
+				block          :  1, /* block process if there's no space in RB (syscall tracepoints only) */
+				__reserved_1   : 34;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7403a27363f8..8955c3ebbb58 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6489,6 +6489,23 @@ void perf_prepare_sample(struct perf_event_header *header,
 		data->phys_addr = perf_virt_to_phys(data->addr);
 }
 
+static bool perf_event_is_blocked(struct perf_event *event)
+{
+	bool blocked = event->attr.block && event->blocked;
+
+	if (blocked)
+		event->blocked = false;
+	return blocked;
+}
+
+static void perf_event_set_blocked(struct perf_event *event)
+{
+	if (event->attr.block) {
+		current->perf_blocked_cnt++;
+		event->blocked = true;
+	}
+}
+
 static __always_inline void
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
@@ -6505,8 +6522,10 @@ __perf_event_output(struct perf_event *event,
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size)) {
+		perf_event_set_blocked(event);
 		goto exit;
+	}
 
 	perf_output_sample(&handle, &header, data, event);
 
@@ -8264,7 +8283,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct pt_regs *regs, struct hlist_head *head,
 			       struct task_struct *task)
 {
-	if (bpf_prog_array_valid(call)) {
+	if (!current->perf_blocked && bpf_prog_array_valid(call)) {
 		*(struct pt_regs **)raw_data = regs;
 		if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
 			perf_swevent_put_recursion_context(rctx);
@@ -8296,6 +8315,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 	perf_trace_buf_update(record, event_type);
 
 	hlist_for_each_entry_rcu(event, head, hlist_entry) {
+		if (current->perf_blocked && !perf_event_is_blocked(event))
+			continue;
 		if (perf_tp_event_match(event, &data, regs))
 			perf_swevent_event(event, count, &data, regs);
 	}
@@ -8314,6 +8335,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 			goto unlock;
 
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+			if (current->perf_blocked && !perf_event_is_blocked(event))
+				continue;
 			if (event->cpu != smp_processor_id())
 				continue;
 			if (event->attr.type != PERF_TYPE_TRACEPOINT)
@@ -10461,6 +10484,19 @@ SYSCALL_DEFINE5(perf_event_open,
 			return -EINVAL;
 	}
 
+	if (attr.block) {
+		/*
+		 * Allow only syscall tracepoints, check for syscall class
+		 * is made in the tracepoint event_init callback.
+		 */
+		if (attr.type != PERF_TYPE_TRACEPOINT)
+			return -EINVAL;
+
+		/* Allow to block only if we attach to a process. */
+		if (pid == -1)
+			return -EINVAL;
+	}
+
 	/* Only privileged users can get physical addresses */
 	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
 	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4a9937076331..d28849365431 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -223,7 +223,9 @@ __perf_output_begin(struct perf_output_handle *handle,
 	return 0;
 
 fail:
-	local_inc(&rb->lost);
+	/* Do not count lost if we are going to block and try again. */
+	if (!event->attr.block)
+		local_inc(&rb->lost);
 	perf_output_put_handle(handle);
 out:
 	rcu_read_unlock();
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 76217bbef815..1efbb819539d 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
 
 #include <linux/module.h>
 #include <linux/kprobes.h>
+#include <linux/syscalls.h>
 #include "trace.h"
 #include "trace_probe.h"
 
@@ -85,6 +86,9 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 	if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (p_event->attr.block && !is_syscall_trace_event(tp_event))
+		return -EINVAL;
+
 	return 0;
 }
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index f93a56d2db27..a8fd7a81361e 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -578,7 +578,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
 	return trace_call_bpf(call, &param);
 }
 
-static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
+static void __perf_syscall_enter(struct pt_regs *regs, long id)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_enter *rec;
@@ -616,7 +616,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
 
-	if ((valid_prog_array &&
+	if ((!current->perf_blocked && valid_prog_array &&
 	     !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
@@ -628,6 +628,16 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 			      head, NULL);
 }
 
+static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
+{
+	__perf_syscall_enter(regs, id);
+}
+
+void perf_trace_syscall_enter(struct pt_regs *regs)
+{
+	__perf_syscall_enter(regs, regs->orig_ax);
+}
+
 static int perf_sysenter_enable(struct trace_event_call *call)
 {
 	int ret = 0;
@@ -677,7 +687,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg
 	return trace_call_bpf(call, &param);
 }
 
-static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
+static void __perf_syscall_exit(struct pt_regs *regs, long ret)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_exit *rec;
@@ -713,7 +723,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	if ((valid_prog_array &&
+	if ((!current->perf_blocked && valid_prog_array &&
 	     !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
@@ -724,6 +734,16 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 			      1, regs, head, NULL);
 }
 
+static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
+{
+	__perf_syscall_exit(regs, ret);
+}
+
+void perf_trace_syscall_exit(struct pt_regs *regs)
+{
+	__perf_syscall_exit(regs, regs->ax);
+}
+
 static int perf_sysexit_enable(struct trace_event_call *call)
 {
 	int ret = 0;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 2/8] perf tools: Sync uapi perf_event.h
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-05 16:05 ` [PATCH 3/8] perf record: Add --block option Jiri Olsa
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Link: http://lkml.kernel.org/n/tip-lh4vrg3t3sjhvue16i5f3snr@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/include/uapi/linux/perf_event.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 9de8780ac8d9..92bae4cf279c 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -372,7 +372,8 @@ struct perf_event_attr {
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
 				namespaces     :  1, /* include namespaces data */
-				__reserved_1   : 35;
+				block          :  1, /* block process if there's no space in RB (syscall tracepoints only) */
+				__reserved_1   : 34;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 3/8] perf record: Add --block option
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
  2018-12-05 16:05 ` [PATCH 2/8] perf tools: Sync uapi perf_event.h Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-05 16:05 ` [PATCH 4/8] perf trace: " Jiri Olsa
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Adding a --block option to allow the record command to set
the block bool in struct perf_event_attr for an event, like:

  # perf record -e raw_syscalls:sys_enter --block ls
  # perf record -e raw_syscalls:sys_enter --block -p ...

It's allowed only for syscall tracepoint events
attached to a process.

Link: http://lkml.kernel.org/n/tip-ocz7zwwkkx11v0mkxrtcddih@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/builtin-record.c | 9 +++++++++
 tools/perf/perf.h           | 1 +
 tools/perf/util/evsel.c     | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4736dc96c4ca..79c8d2f94f86 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1937,6 +1937,8 @@ static struct option __record_options[] = {
 		    "append timestamp to output filename"),
 	OPT_BOOLEAN(0, "timestamp-boundary", &record.timestamp_boundary,
 		    "Record timestamp boundary (time of first/last samples)"),
+	OPT_BOOLEAN(0, "block", &record.opts.block,
+		    "Request blocked tracing (for syscall tracepoints)"),
 	OPT_STRING_OPTARG_SET(0, "switch-output", &record.switch_output.str,
 			  &record.switch_output.set, "signal,size,time",
 			  "Switch output when receive SIGUSR2 or cross size,time threshold",
@@ -2116,6 +2118,13 @@ int cmd_record(int argc, const char **argv)
 		goto out;
 	}
 
+	if (rec->opts.block &&
+	    !target__has_task(&rec->opts.target) &&
+	    !target__none(&rec->opts.target)) {
+		pr_err("Can't use --block on non task targets\n");
+		goto out;
+	}
+
 	/* Enable ignoring missing threads when -u/-p option is defined. */
 	rec->opts.ignore_missing_thread = rec->opts.target.uid != UINT_MAX || rec->opts.target.pid;
 
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 4d40baa45a5f..7884701247b8 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -66,6 +66,7 @@ struct record_opts {
 	bool	     ignore_missing_thread;
 	bool	     strict_freq;
 	bool	     sample_id;
+	bool	     block;
 	unsigned int freq;
 	unsigned int mmap_pages;
 	unsigned int auxtrace_mmap_pages;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index dbc0466db368..e4943e8f811b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1045,6 +1045,9 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
 	if (opts->sample_transaction)
 		perf_evsel__set_sample_bit(evsel, TRANSACTION);
 
+	if (opts->block)
+		attr->block = true;
+
 	if (opts->running_time) {
 		evsel->attr.read_format |=
 			PERF_FORMAT_TOTAL_TIME_ENABLED |
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 4/8] perf trace: Add --block option
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
                   ` (2 preceding siblings ...)
  2018-12-05 16:05 ` [PATCH 3/8] perf record: Add --block option Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-05 16:05 ` [PATCH 5/8] perf tools: Add block term support for tracepoints Jiri Olsa
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Adding a --block option to allow the trace command to pass
the --block option along to the record command, or to set it
for the trace run session, like:

  # perf trace --block kill
  # perf trace --block -p ...
  # perf trace -e raw_syscalls:sys_enter --block -p ...

It's allowed only for syscall tracepoint events
attached to a process.

Link: http://lkml.kernel.org/n/tip-ocz7zwwkkx11v0mkxrtcddih@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/builtin-trace.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index a57a9ae1fd4b..a663ec346a8e 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -2379,10 +2379,12 @@ static int trace__record(struct trace *trace, int argc, const char **argv)
 	unsigned int majpf_args_nr = ARRAY_SIZE(majpf_args);
 	const char * const minpf_args[] = { "-e", "minor-faults" };
 	unsigned int minpf_args_nr = ARRAY_SIZE(minpf_args);
+	const char * const block_args[] = { "--block", };
+	unsigned int block_args_nr = ARRAY_SIZE(block_args);
 
 	/* +1 is for the event string below */
 	rec_argc = ARRAY_SIZE(record_args) + sc_args_nr + 1 +
-		majpf_args_nr + minpf_args_nr + argc;
+		majpf_args_nr + minpf_args_nr + block_args_nr + argc;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
 	if (rec_argv == NULL)
@@ -2416,6 +2418,9 @@ static int trace__record(struct trace *trace, int argc, const char **argv)
 		for (i = 0; i < minpf_args_nr; i++)
 			rec_argv[j++] = minpf_args[i];
 
+	if (trace->opts.block)
+		rec_argv[j++] = block_args[0];
+
 	for (i = 0; i < (unsigned int)argc; i++)
 		rec_argv[j++] = argv[i];
 
@@ -3409,6 +3414,8 @@ int cmd_trace(int argc, const char **argv)
 		     trace__parse_events_option),
 	OPT_BOOLEAN(0, "comm", &trace.show_comm,
 		    "show the thread COMM next to its id"),
+	OPT_BOOLEAN(0, "block", &trace.opts.block,
+		    "Request blocked tracing (for syscall tracepoints)"),
 	OPT_BOOLEAN(0, "tool_stats", &trace.show_tool_stats, "show tool stats"),
 	OPT_CALLBACK(0, "expr", &trace, "expr", "list of syscalls/events to trace",
 		     trace__parse_events_option),
@@ -3500,6 +3507,14 @@ int cmd_trace(int argc, const char **argv)
 				       "cgroup monitoring only available in system-wide mode");
 	}
 
+	if (trace.opts.block) {
+		if (!target__has_task(&trace.opts.target) &&
+		    !target__none(&trace.opts.target)) {
+			pr_err("ERROR: Can't use --block on non task targets\n");
+			goto out;
+		}
+	}
+
 	evsel = bpf__setup_output_event(trace.evlist, "__augmented_syscalls__");
 	if (IS_ERR(evsel)) {
 		bpf__strerror_setup_output_event(trace.evlist, PTR_ERR(evsel), bf, sizeof(bf));
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 5/8] perf tools: Add block term support for tracepoints
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
                   ` (3 preceding siblings ...)
  2018-12-05 16:05 ` [PATCH 4/8] perf trace: " Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-05 16:05 ` [PATCH 6/8] perf tools: Add ordered_events__flush_time interface Jiri Olsa
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Allow specifying the 'block' term for a tracepoint in the -e
command line option, like:

  # perf record -e raw_syscalls:sys_enter/block/ -p ...
  # perf trace  -e raw_syscalls:sys_enter/block/ -p ...

to set this tracepoint as blocking (sets struct
perf_event_attr::block).

It's allowed only for syscall tracepoint events
attached to a process.

Link: http://lkml.kernel.org/n/tip-jj5uwlt3izp1gpk8jcspulze@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/util/evsel.c        | 4 ++++
 tools/perf/util/evsel.h        | 2 ++
 tools/perf/util/parse-events.c | 8 ++++++++
 tools/perf/util/parse-events.h | 1 +
 tools/perf/util/parse-events.l | 1 +
 5 files changed, 16 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index e4943e8f811b..886b0dbec534 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -811,6 +811,9 @@ static void apply_config_terms(struct perf_evsel *evsel,
 			break;
 		case PERF_EVSEL__CONFIG_TERM_DRV_CFG:
 			break;
+		case PERF_EVSEL__CONFIG_TERM_BLOCK:
+			attr->block = term->val.block ? 1 : 0;
+			break;
 		default:
 			break;
 		}
@@ -1655,6 +1658,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(context_switch, p_unsigned);
 	PRINT_ATTRf(write_backward, p_unsigned);
 	PRINT_ATTRf(namespaces, p_unsigned);
+	PRINT_ATTRf(block, p_unsigned);
 
 	PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned);
 	PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 82a289ce8b0c..c5d11035a21b 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -50,6 +50,7 @@ enum term_type {
 	PERF_EVSEL__CONFIG_TERM_OVERWRITE,
 	PERF_EVSEL__CONFIG_TERM_DRV_CFG,
 	PERF_EVSEL__CONFIG_TERM_BRANCH,
+	PERF_EVSEL__CONFIG_TERM_BLOCK,
 };
 
 struct perf_evsel_config_term {
@@ -67,6 +68,7 @@ struct perf_evsel_config_term {
 		bool	overwrite;
 		char	*branch;
 		unsigned long max_events;
+		bool	block;
 	} val;
 	bool weak;
 };
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index caecd2d9494a..6b6e70f1ae1f 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -930,6 +930,7 @@ static const char *config_term_names[__PARSE_EVENTS__TERM_TYPE_NR] = {
 	[PARSE_EVENTS__TERM_TYPE_OVERWRITE]		= "overwrite",
 	[PARSE_EVENTS__TERM_TYPE_NOOVERWRITE]		= "no-overwrite",
 	[PARSE_EVENTS__TERM_TYPE_DRV_CFG]		= "driver-config",
+	[PARSE_EVENTS__TERM_TYPE_BLOCK]			= "block",
 };
 
 static bool config_term_shrinked;
@@ -1041,6 +1042,9 @@ do {									   \
 	case PARSE_EVENTS__TERM_TYPE_MAX_EVENTS:
 		CHECK_TYPE_VAL(NUM);
 		break;
+	case PARSE_EVENTS__TERM_TYPE_BLOCK:
+		CHECK_TYPE_VAL(NUM);
+		break;
 	default:
 		err->str = strdup("unknown term");
 		err->idx = term->err_term;
@@ -1091,6 +1095,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
 	case PARSE_EVENTS__TERM_TYPE_MAX_EVENTS:
 	case PARSE_EVENTS__TERM_TYPE_OVERWRITE:
 	case PARSE_EVENTS__TERM_TYPE_NOOVERWRITE:
+	case PARSE_EVENTS__TERM_TYPE_BLOCK:
 		return config_term_common(attr, term, err);
 	default:
 		if (err) {
@@ -1179,6 +1184,9 @@ do {								\
 		case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
 			ADD_CONFIG_TERM(DRV_CFG, drv_cfg, term->val.str);
 			break;
+		case PARSE_EVENTS__TERM_TYPE_BLOCK:
+			ADD_CONFIG_TERM(BLOCK, block, term->val.num ? 1 : 0);
+			break;
 		default:
 			break;
 		}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 5ed035cbcbb7..204f4bf1c62a 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -75,6 +75,7 @@ enum {
 	PARSE_EVENTS__TERM_TYPE_NOOVERWRITE,
 	PARSE_EVENTS__TERM_TYPE_OVERWRITE,
 	PARSE_EVENTS__TERM_TYPE_DRV_CFG,
+	PARSE_EVENTS__TERM_TYPE_BLOCK,
 	__PARSE_EVENTS__TERM_TYPE_NR,
 };
 
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 7805c71aaae2..69141fb38347 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -274,6 +274,7 @@ inherit			{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_INHERIT); }
 no-inherit		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_NOINHERIT); }
 overwrite		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_OVERWRITE); }
 no-overwrite		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_NOOVERWRITE); }
+block			{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_BLOCK); }
 ,			{ return ','; }
 "/"			{ BEGIN(INITIAL); return '/'; }
 {name_minus}		{ return str(yyscanner, PE_NAME); }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 6/8] perf tools: Add ordered_events__flush_time interface
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
                   ` (4 preceding siblings ...)
  2018-12-05 16:05 ` [PATCH 5/8] perf tools: Add block term support for tracepoints Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-14 21:00   ` [tip:perf/core] perf ordered_events: " tip-bot for Jiri Olsa
  2018-12-18 14:27   ` tip-bot for Jiri Olsa
  2018-12-05 16:05 ` [PATCH 7/8] perf trace: Move event delivery to deliver_event function Jiri Olsa
  2018-12-05 16:05 ` [PATCH 8/8] perf trace: Add ordered processing for --block option Jiri Olsa
  7 siblings, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Adding the OE_FLUSH__TIME flush type, to be able to flush
only a certain amount of the queue based on the provided
timestamp. It will be used in the following patches.

Link: http://lkml.kernel.org/n/tip-a3na77vemwr1g92lfhlrztg5@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/util/ordered-events.c | 23 +++++++++++++++++++----
 tools/perf/util/ordered-events.h |  2 ++
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/ordered-events.c b/tools/perf/util/ordered-events.c
index c5412db05683..b5c6a854379f 100644
--- a/tools/perf/util/ordered-events.c
+++ b/tools/perf/util/ordered-events.c
@@ -219,8 +219,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 	return 0;
 }
 
-static int __ordered_events__flush(struct ordered_events *oe,
-				   bool show_progress)
+static int do_flush(struct ordered_events *oe, bool show_progress)
 {
 	struct list_head *head = &oe->events;
 	struct ordered_event *tmp, *iter;
@@ -263,7 +262,8 @@ static int __ordered_events__flush(struct ordered_events *oe,
 	return 0;
 }
 
-int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+static int __ordered_events__flush(struct ordered_events *oe, enum oe_flush how,
+				   u64 time)
 {
 	static const char * const str[] = {
 		"NONE",
@@ -302,6 +302,11 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		break;
 	}
 
+	case OE_FLUSH__TIME:
+		oe->next_flush = time;
+		show_progress = false;
+		break;
+
 	case OE_FLUSH__ROUND:
 	case OE_FLUSH__NONE:
 	default:
@@ -312,7 +317,7 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		   str[how], oe->nr_events);
 	pr_oe_time(oe->max_timestamp, "max_timestamp\n");
 
-	err = __ordered_events__flush(oe, show_progress);
+	err = do_flush(oe, show_progress);
 
 	if (!err) {
 		if (how == OE_FLUSH__ROUND)
@@ -328,6 +333,16 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 	return err;
 }
 
+int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+{
+	return __ordered_events__flush(oe, how, 0);
+}
+
+int ordered_events__flush_time(struct ordered_events *oe, u64 time)
+{
+	return __ordered_events__flush(oe, OE_FLUSH__TIME, time);
+}
+
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data)
 {
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 0c6e26aec0e3..6ef81e5be052 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -19,6 +19,7 @@ enum oe_flush {
 	OE_FLUSH__ROUND,
 	OE_FLUSH__HALF,
 	OE_FLUSH__TOP,
+	OE_FLUSH__TIME,
 };
 
 struct ordered_events;
@@ -55,6 +56,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 			  u64 timestamp, u64 file_offset);
 void ordered_events__delete(struct ordered_events *oe, struct ordered_event *event);
 int ordered_events__flush(struct ordered_events *oe, enum oe_flush how);
+int ordered_events__flush_time(struct ordered_events *oe, u64 time);
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data);
 void ordered_events__free(struct ordered_events *oe);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 7/8] perf trace: Move event delivery to deliver_event function
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
                   ` (5 preceding siblings ...)
  2018-12-05 16:05 ` [PATCH 6/8] perf tools: Add ordered_events__flush_time interface Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-14 21:01   ` [tip:perf/core] perf trace: Move event delivery to a new deliver_event() function tip-bot for Jiri Olsa
  2018-12-18 14:28   ` tip-bot for Jiri Olsa
  2018-12-05 16:05 ` [PATCH 8/8] perf trace: Add ordered processing for --block option Jiri Olsa
  7 siblings, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Moving the event delivery code into a new deliver_event() function,
so it's easier to add the conditional ordered delivery coming in the
following patches.

Link: http://lkml.kernel.org/n/tip-qz9wzti2cw6q7swja0h2x4sf@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/builtin-trace.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index a663ec346a8e..b4b1dafe882a 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -2641,6 +2641,21 @@ static int trace__set_filter_pids(struct trace *trace)
 	return err;
 }
 
+static int deliver_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	struct perf_sample sample;
+	int err;
+
+	err = perf_evlist__parse_sample(evlist, event, &sample);
+	if (err)
+		fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
+	else
+		trace__handle_event(trace, event, &sample);
+
+	return 0;
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2806,18 +2821,10 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 			continue;
 
 		while ((event = perf_mmap__read_event(md)) != NULL) {
-			struct perf_sample sample;
-
 			++trace->nr_events;
 
-			err = perf_evlist__parse_sample(evlist, event, &sample);
-			if (err) {
-				fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
-				goto next_event;
-			}
+			deliver_event(trace, event);
 
-			trace__handle_event(trace, event, &sample);
-next_event:
 			perf_mmap__consume(md);
 
 			if (interrupted)
-- 
2.17.2



* [PATCH 8/8] perf trace: Add ordered processing for --block option
  2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
                   ` (6 preceding siblings ...)
  2018-12-05 16:05 ` [PATCH 7/8] perf trace: Move event delivery to deliver_event function Jiri Olsa
@ 2018-12-05 16:05 ` Jiri Olsa
  2018-12-14 21:02   ` [tip:perf/core] perf trace: Add ordered processing tip-bot for Jiri Olsa
  2018-12-18 14:29   ` tip-bot for Jiri Olsa
  7 siblings, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 16:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Steven Rostedt, Peter Zijlstra
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Thomas Gleixner, Luis Claudio R. Goncalves, ldv, esyr,
	Frederic Weisbecker

Adding support to sort the trace events if the --block option is set.
In this mode we don't lose events, and we are potentially slow by
definition, because the traced process could be blocked.

In this case it makes sense to sort events and provide the precise
outcome of ordered events.

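The flush policy added in flush_ordered_events() in the patch below boils down to a one-second reordering window: flush only once the oldest queued event is older than the last seen timestamp minus one second. A self-contained sketch of just that computation (hypothetical names; like the patch, it assumes timestamps are well past one second so the subtraction does not wrap):

```c
#include <assert.h>

#define TOY_NSEC_PER_SEC 1000000000ULL

/* Return the timestamp to flush up to, or 0 when the queue is empty
 * ('first' == 0) or the oldest event is still inside the one-second
 * window.  Mirrors the flush_ordered_events() check; names are
 * illustrative, not perf's. */
static unsigned long long toy_flush_limit(unsigned long long first,
					  unsigned long long last)
{
	unsigned long long flush = last - TOY_NSEC_PER_SEC;

	if (first && first < flush)
		return flush;
	return 0;
}
```

The window bounds how long delivery lags behind the ring buffers while still giving events from different CPUs time to arrive and be sorted.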
Link: http://lkml.kernel.org/n/tip-me69opepwec6tjtpy2cxrrzd@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/perf/builtin-trace.c       | 59 +++++++++++++++++++++++++++++++-
 tools/perf/util/ordered-events.c | 11 ++++++
 tools/perf/util/ordered-events.h |  1 +
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index b4b1dafe882a..d650d8cd421b 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -127,6 +127,11 @@ struct trace {
 	bool			force;
 	bool			vfs_getname;
 	int			trace_pgfaults;
+
+	struct {
+		struct ordered_events	data;
+		u64			last;
+	} oe;
 };
 
 struct tp_field {
@@ -2656,6 +2661,43 @@ static int deliver_event(struct trace *trace, union perf_event *event)
 	return 0;
 }
 
+
+static int flush_ordered_events(struct trace *trace)
+{
+	u64 first = ordered_events__first_time(&trace->oe.data);
+	u64 flush = trace->oe.last - NSEC_PER_SEC;
+
+	/* Is there something to flush? */
+	if (first && first < flush)
+		return ordered_events__flush_time(&trace->oe.data, flush);
+
+	return 0;
+}
+
+static int deliver_ordered_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	int err;
+
+	err = perf_evlist__parse_sample_timestamp(evlist, event, &trace->oe.last);
+	if (err && err != -1)
+		return err;
+
+	err = ordered_events__queue(&trace->oe.data, event, trace->oe.last, 0);
+	if (err)
+		return err;
+
+	return flush_ordered_events(trace);
+}
+
+static int ordered_events__deliver_event(struct ordered_events *oe,
+					 struct ordered_event *event)
+{
+	struct trace *trace = container_of(oe, struct trace, oe.data);
+
+	return deliver_event(trace, event->event);
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2823,7 +2865,13 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 		while ((event = perf_mmap__read_event(md)) != NULL) {
 			++trace->nr_events;
 
-			deliver_event(trace, event);
+			if (trace->opts.block) {
+				err = deliver_ordered_event(trace, event);
+				if (err)
+					goto out_disable;
+			} else {
+				deliver_event(trace, event);
+			}
 
 			perf_mmap__consume(md);
 
@@ -2846,6 +2894,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 				draining = true;
 
 			goto again;
+		} else {
+			if (trace->opts.block && flush_ordered_events(trace))
+				goto out_disable;
 		}
 	} else {
 		goto again;
@@ -2856,6 +2907,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 
 	perf_evlist__disable(evlist);
 
+	if (trace->opts.block)
+		ordered_events__flush(&trace->oe.data, OE_FLUSH__FINAL);
+
 	if (!err) {
 		if (trace->summary)
 			trace__fprintf_thread_summary(trace, trace->output);
@@ -3520,6 +3574,9 @@ int cmd_trace(int argc, const char **argv)
 			pr_err("ERROR: Can't use --block on non task targets\n");
 			goto out;
 		}
+
+		ordered_events__init(&trace.oe.data, ordered_events__deliver_event, &trace);
+		ordered_events__set_copy_on_queue(&trace.oe.data, true);
 	}
 
 	evsel = bpf__setup_output_event(trace.evlist, "__augmented_syscalls__");
diff --git a/tools/perf/util/ordered-events.c b/tools/perf/util/ordered-events.c
index b5c6a854379f..6bd9f50ac83e 100644
--- a/tools/perf/util/ordered-events.c
+++ b/tools/perf/util/ordered-events.c
@@ -343,6 +343,17 @@ int ordered_events__flush_time(struct ordered_events *oe, u64 time)
 	return __ordered_events__flush(oe, OE_FLUSH__TIME, time);
 }
 
+u64 ordered_events__first_time(struct ordered_events *oe)
+{
+	struct ordered_event *event;
+
+	if (list_empty(&oe->events))
+		return 0;
+
+	event = list_first_entry(&oe->events, struct ordered_event, list);
+	return event->timestamp;
+}
+
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data)
 {
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 6ef81e5be052..4f75f1d2f9db 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -61,6 +61,7 @@ void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t d
 			  void *data);
 void ordered_events__free(struct ordered_events *oe);
 void ordered_events__reinit(struct ordered_events *oe);
+u64 ordered_events__first_time(struct ordered_events *oe);
 
 static inline
 void ordered_events__set_alloc_size(struct ordered_events *oe, u64 size)
-- 
2.17.2



* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
@ 2018-12-05 17:35   ` Steven Rostedt
  2018-12-05 17:56     ` Jiri Olsa
  2018-12-06  8:09   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2018-12-05 17:35 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Wed,  5 Dec 2018 17:05:02 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 3b2490b81918..e55cf9169a03 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -60,6 +60,32 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
>  	}
>  }
>  
> +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> +{
> +	current->perf_blocked = true;
> +
> +	do {
> +		schedule_timeout(100 * HZ);
> +		current->perf_blocked_cnt = 0;
> +
> +		if (enter) {
> +			/* perf syscalls:* enter */
> +			perf_trace_syscall_enter(regs);
> +
> +			/* perf raw_syscalls:* enter */
> +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> +		} else {
> +			/* perf syscalls:* exit */
> +			perf_trace_syscall_exit(regs);
> +
> +			/* perf raw_syscalls:* exit */
> +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> +		}
> +	} while (current->perf_blocked_cnt);

I was thinking, if the process reading the perf buffer dies, how do we
tell this task to continue on?

-- Steve

> +
> +	current->perf_blocked = false;
> +}
> +


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 17:35   ` Steven Rostedt
@ 2018-12-05 17:56     ` Jiri Olsa
  0 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-05 17:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Wed, Dec 05, 2018 at 12:35:03PM -0500, Steven Rostedt wrote:
> On Wed,  5 Dec 2018 17:05:02 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > index 3b2490b81918..e55cf9169a03 100644
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -60,6 +60,32 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
> >  	}
> >  }
> >  
> > +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> > +{
> > +	current->perf_blocked = true;
> > +
> > +	do {
> > +		schedule_timeout(100 * HZ);
> > +		current->perf_blocked_cnt = 0;
> > +
> > +		if (enter) {
> > +			/* perf syscalls:* enter */
> > +			perf_trace_syscall_enter(regs);
> > +
> > +			/* perf raw_syscalls:* enter */
> > +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> > +		} else {
> > +			/* perf syscalls:* exit */
> > +			perf_trace_syscall_exit(regs);
> > +
> > +			/* perf raw_syscalls:* exit */
> > +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> > +		}
> > +	} while (current->perf_blocked_cnt);
> 
> I was thinking, if the process reading the perf buffer dies, how do we
> tell this task to continue on?

when the tracer process dies, its event will get uninstalled from the
trace_event_call::perf_events list, so the next iteration of the above
loop won't find any blocking event and will leave (assuming there's
just a single event)

jirka


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
  2018-12-05 17:35   ` Steven Rostedt
@ 2018-12-06  8:09   ` Peter Zijlstra
  2018-12-06 10:30     ` Jiri Olsa
  2018-12-06  8:10   ` Peter Zijlstra
  2018-12-06  8:17   ` Peter Zijlstra
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-06  8:09 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> Adding support to specify 'block' bool in struct perf_event_attr

NAK for having _Bool in structures.


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
  2018-12-05 17:35   ` Steven Rostedt
  2018-12-06  8:09   ` Peter Zijlstra
@ 2018-12-06  8:10   ` Peter Zijlstra
  2018-12-06  8:24     ` Jiri Olsa
  2018-12-06  8:34     ` Peter Zijlstra
  2018-12-06  8:17   ` Peter Zijlstra
  3 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-06  8:10 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> +{
> +	current->perf_blocked = true;
> +
> +	do {
> +		schedule_timeout(100 * HZ);
> +		current->perf_blocked_cnt = 0;
> +
> +		if (enter) {
> +			/* perf syscalls:* enter */
> +			perf_trace_syscall_enter(regs);
> +
> +			/* perf raw_syscalls:* enter */
> +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> +		} else {
> +			/* perf syscalls:* exit */
> +			perf_trace_syscall_exit(regs);
> +
> +			/* perf raw_syscalls:* exit */
> +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> +		}
> +	} while (current->perf_blocked_cnt);
> +
> +	current->perf_blocked = false;
> +}

I don't understand this.. why are we using schedule_timeout() and all
that?


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
                     ` (2 preceding siblings ...)
  2018-12-06  8:10   ` Peter Zijlstra
@ 2018-12-06  8:17   ` Peter Zijlstra
  2018-12-06 10:27     ` Jiri Olsa
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-06  8:17 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> @@ -10461,6 +10484,19 @@ SYSCALL_DEFINE5(perf_event_open,
>  			return -EINVAL;
>  	}
>  
> +	if (attr.block) {
> +		/*
> +		 * Allow only syscall tracepoints, check for syscall class
> +		 * is made in the tracepoint event_init callback.
> +		 */
> +		if (attr.type != PERF_TYPE_TRACEPOINT)
> +			return -EINVAL;
> +
> +		/* Allow to block only if we attach to a process. */
> +		if (pid == -1)
> +			return -EINVAL;
> +	}
> +
>  	/* Only privileged users can get physical addresses */
>  	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
>  	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))

That's fairly horrible and will, when copied, lead to a giant mess.

Please see the whole exclusion patch set here:

  https://lkml.kernel.org/r/1543230756-15319-1-git-send-email-andrew.murray@arm.com

PERF_PMU_CAP_BLOCK sounds like something you'd want here.


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:10   ` Peter Zijlstra
@ 2018-12-06  8:24     ` Jiri Olsa
  2018-12-06 10:31       ` Peter Zijlstra
  2018-12-06  8:34     ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Jiri Olsa @ 2018-12-06  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Steven Rostedt, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:10:28AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> > +{
> > +	current->perf_blocked = true;
> > +
> > +	do {
> > +		schedule_timeout(100 * HZ);
> > +		current->perf_blocked_cnt = 0;
> > +
> > +		if (enter) {
> > +			/* perf syscalls:* enter */
> > +			perf_trace_syscall_enter(regs);
> > +
> > +			/* perf raw_syscalls:* enter */
> > +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> > +		} else {
> > +			/* perf syscalls:* enter */
> > +			perf_trace_syscall_exit(regs);
> > +
> > +			/* perf raw_syscalls:* enter */
> > +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> > +		}
> > +	} while (current->perf_blocked_cnt);
> > +
> > +	current->perf_blocked = false;
> > +}
> 
> I don't understand this.. why are we using schedule_timeout() and all
> that?

the idea is to block the process and try to deliver the event later

the ring buffer space is freed by the user space process moving the tail
pointer, so I can't see doing this any other way than polling
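[The constraint described here is the standard perf mmap protocol: the kernel writes at a head pointer, user space frees space by advancing a tail pointer, and a write fails when it would overwrite unconsumed data. A toy model of that space check, with illustrative names rather than the kernel's __perf_output_begin():]

```c
#include <assert.h>

/* Toy single-producer/single-consumer byte ring: the "kernel" advances
 * 'head' on writes, "user space" advances 'tail' as it consumes.
 * Names and layout are illustrative, not the perf ABI. */
struct toy_rb {
	unsigned long long head;   /* producer (kernel) position */
	unsigned long long tail;   /* consumer (user space) position */
	unsigned long long size;   /* buffer size in bytes */
};

/* Try to reserve 'len' bytes; 1 on success, 0 when the buffer is full
 * (this is the point where the event would normally be lost). */
static int toy_rb_reserve(struct toy_rb *rb, unsigned long long len)
{
	if (rb->head + len - rb->tail > rb->size)
		return 0;
	rb->head += len;
	return 1;
}
```

Until the consumer moves the tail, every retry of the reservation fails the same way, which is why the blocked task above has nothing better to do than poll.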

jirka


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:10   ` Peter Zijlstra
  2018-12-06  8:24     ` Jiri Olsa
@ 2018-12-06  8:34     ` Peter Zijlstra
  2018-12-06 10:31       ` Jiri Olsa
  2018-12-06 18:19       ` Steven Rostedt
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-06  8:34 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:10:28AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> > +{
> > +	current->perf_blocked = true;
> > +
> > +	do {
> > +		schedule_timeout(100 * HZ);
> > +		current->perf_blocked_cnt = 0;
> > +
> > +		if (enter) {
> > +			/* perf syscalls:* enter */
> > +			perf_trace_syscall_enter(regs);
> > +
> > +			/* perf raw_syscalls:* enter */
> > +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> > +		} else {
> > +			/* perf syscalls:* exit */
> > +			perf_trace_syscall_exit(regs);
> > +
> > +			/* perf raw_syscalls:* exit */
> > +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> > +		}
> > +	} while (current->perf_blocked_cnt);
> > +
> > +	current->perf_blocked = false;
> > +}
> 
> I don't understand this.. why are we using schedule_timeout() and all
> that?

Urgh.. in fact, the more I look at this the more I hate it.

We want to block in __perf_output_begin(), but we cannot because both
tracepoints and perf will have preemptability disabled down there.

So what we do is fail the event, fake the lost count and go all the way
up that callstack, detect the failure and then poll-wait and retry.

And only do this for a few special events...  *yuck*


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:17   ` Peter Zijlstra
@ 2018-12-06 10:27     ` Jiri Olsa
  0 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-06 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Steven Rostedt, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:17:07AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > @@ -10461,6 +10484,19 @@ SYSCALL_DEFINE5(perf_event_open,
> >  			return -EINVAL;
> >  	}
> >  
> > +	if (attr.block) {
> > +		/*
> > +		 * Allow only syscall tracepoints, check for syscall class
> > +		 * is made in the tracepoint event_init callback.
> > +		 */
> > +		if (attr.type != PERF_TYPE_TRACEPOINT)
> > +			return -EINVAL;
> > +
> > +		/* Allow to block only if we attach to a process. */
> > +		if (pid == -1)
> > +			return -EINVAL;
> > +	}
> > +
> >  	/* Only privileged users can get physical addresses */
> >  	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
> >  	    perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
> 
> That's fairly horrible and will, when copied, lead to a giant mess.
> 
> Please see the whole exclusion patch set here:
> 
>   https://lkml.kernel.org/r/1543230756-15319-1-git-send-email-andrew.murray@arm.com
> 
> PERF_PMU_CAP_BLOCK sounds like something you'd want here.

ok, that's better

jirka


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:09   ` Peter Zijlstra
@ 2018-12-06 10:30     ` Jiri Olsa
  0 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-06 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Steven Rostedt, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:09:10AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > Adding support to specify 'block' bool in struct perf_event_attr
> 
> NAK for having _Bool in structures.

oops, will change.. if there's ever another version ;-)

jirka


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:34     ` Peter Zijlstra
@ 2018-12-06 10:31       ` Jiri Olsa
  2018-12-06 18:19       ` Steven Rostedt
  1 sibling, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-06 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Steven Rostedt, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:34:00AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 06, 2018 at 09:10:28AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > > +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> > > +{
> > > +	current->perf_blocked = true;
> > > +
> > > +	do {
> > > +		schedule_timeout(100 * HZ);
> > > +		current->perf_blocked_cnt = 0;
> > > +
> > > +		if (enter) {
> > > +			/* perf syscalls:* enter */
> > > +			perf_trace_syscall_enter(regs);
> > > +
> > > +			/* perf raw_syscalls:* enter */
> > > +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> > > +		} else {
> > > +			/* perf syscalls:* exit */
> > > +			perf_trace_syscall_exit(regs);
> > > +
> > > +			/* perf raw_syscalls:* exit */
> > > +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> > > +		}
> > > +	} while (current->perf_blocked_cnt);
> > > +
> > > +	current->perf_blocked = false;
> > > +}
> > 
> > I don't understand this.. why are we using schedule_timeout() and all
> > that?
> 
> Urgh.. in fact, the more I look at this the more I hate it.
> 
> We want to block in __perf_output_begin(), but we cannot because both
> tracepoints and perf will have preemptability disabled down there.
> 
> So what we do is fail the event, fake the lost count and go all the way
> up that callstack, detect the failure and then poll-wait and retry.

right

> 
> And only do this for a few special events...  *yuck*

yes ;-)

jirka


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:24     ` Jiri Olsa
@ 2018-12-06 10:31       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-06 10:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Steven Rostedt, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 09:24:55AM +0100, Jiri Olsa wrote:
> On Thu, Dec 06, 2018 at 09:10:28AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 05, 2018 at 05:05:02PM +0100, Jiri Olsa wrote:
> > > +static void trace_block_syscall(struct pt_regs *regs, bool enter)
> > > +{
> > > +	current->perf_blocked = true;
> > > +
> > > +	do {
> > > +		schedule_timeout(100 * HZ);
> > > +		current->perf_blocked_cnt = 0;
> > > +
> > > +		if (enter) {
> > > +			/* perf syscalls:* enter */
> > > +			perf_trace_syscall_enter(regs);
> > > +
> > > +			/* perf raw_syscalls:* enter */
> > > +			perf_trace_sys_enter(&event_sys_enter, regs, regs->orig_ax);
> > > +		} else {
> > > +			/* perf syscalls:* exit */
> > > +			perf_trace_syscall_exit(regs);
> > > +
> > > +			/* perf raw_syscalls:* exit */
> > > +			perf_trace_sys_exit(&event_sys_exit, regs, regs->ax);
> > > +		}
> > > +	} while (current->perf_blocked_cnt);
> > > +
> > > +	current->perf_blocked = false;
> > > +}
> > 
> > I don't understand this.. why are we using schedule_timeout() and all
> > that?
> 
> the idea is to block the process and try to deliver the event later
> 
> the ring buffer space is freed by user space process moving the tail
> pointer, so I can't see doing this other way than polling

Right, figured that out.

I really really hate this stuff. poll-waiting fundamentally stinks, but
also perf is designed to not/minimally interfere with the observed
tasks, as is ftrace.

And bolting, as it really doesn't fit in the fundamental design, this
fugly thing on top for this special case.. urgh.


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06  8:34     ` Peter Zijlstra
  2018-12-06 10:31       ` Jiri Olsa
@ 2018-12-06 18:19       ` Steven Rostedt
  2018-12-07  8:44         ` Jiri Olsa
  2018-12-07  8:58         ` Peter Zijlstra
  1 sibling, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2018-12-06 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, 6 Dec 2018 09:34:00 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > I don't understand this.. why are we using schedule_timeout() and all
> > that?  
> 
> Urgh.. in fact, the more I look at this the more I hate it.
> 
> We want to block in __perf_output_begin(), but we cannot because both
> tracepoints and perf will have preemptability disabled down there.
> 
> So what we do is fail the event, fake the lost count and go all the way
> up that callstack, detect the failure and then poll-wait and retry.
> 
> And only do this for a few special events...  *yuck*

Since this is a special case, we should add a new option to the perf
system call: (1) it states that it wants the traced process to block
(and must have PTRACE permission to do so), and (2) after it reads from
the buffer, it needs to check a bit that says "this process is blocked,
please wake it up" and then do another perf call to kick the process to
continue.

I really dislike the polling too. But because this is not a default
case, and is a new feature, we can add more infrastructure to make it
work properly, instead of trying to hack the current method into
something that does something poorly.

-- Steve


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06 18:19       ` Steven Rostedt
@ 2018-12-07  8:44         ` Jiri Olsa
  2018-12-07  8:58         ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-07  8:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Jiri Olsa, Arnaldo Carvalho de Melo, lkml,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 01:19:46PM -0500, Steven Rostedt wrote:
> On Thu, 6 Dec 2018 09:34:00 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > 
> > > I don't understand this.. why are we using schedule_timeout() and all
> > > that?  
> > 
> > Urgh.. in fact, the more I look at this the more I hate it.
> > 
> > We want to block in __perf_output_begin(), but we cannot because both
> > tracepoints and perf will have preemptability disabled down there.
> > 
> > So what we do is fail the event, fake the lost count and go all the way
> > up that callstack, detect the failure and then poll-wait and retry.
> > 
> > And only do this for a few special events...  *yuck*
> 
> Since this is a special case, we should add a new option to the perf
> system call: (1) it states that it wants the traced process to block
> (and must have PTRACE permission to do so), and (2) after it reads from
> the buffer, it needs to check a bit that says "this process is blocked,
> please wake it up" and then do another perf call to kick the process to
> continue.

so instead of polling, the traced process would properly wait for the
tracer to kick it again after it reads/frees the buffer

I guess we could use the control mmap page (struct perf_event_mmap_page)
to communicate the 'we are blocked' message to the tracer and have a new
ioctl to wake the waiting process
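[A hedged sketch of the handshake being proposed here; none of these names exist in the perf ABI, this only illustrates the tracer-side "check the blocked bit, then kick" step:]

```c
#include <assert.h>

/* Hypothetical control-page flag; in the real proposal this would
 * live in (or next to) struct perf_event_mmap_page. */
struct toy_ctl_page {
	int blocked;    /* traced task is waiting for buffer space */
};

/* Tracer side: after draining the ring buffer, wake a blocked task.
 * Returns 1 if a wakeup was needed, 0 otherwise.  Clearing the flag
 * stands in for the proposed wake-up ioctl. */
static int toy_tracer_kick(struct toy_ctl_page *ctl)
{
	if (!ctl->blocked)
		return 0;
	ctl->blocked = 0;
	return 1;
}
```

Compared with the RFC's schedule_timeout() loop, the blocked task here would sleep until the tracer explicitly kicks it, so there is no periodic polling on either side.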

jirka

> 
> I really dislike the polling too. But because this is not a default
> case, and is a new feature, we can add more infrastructure to make it
> work properly, instead of trying to hack the current method into
> something that does something poorly.
> 
> -- Steve


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-06 18:19       ` Steven Rostedt
  2018-12-07  8:44         ` Jiri Olsa
@ 2018-12-07  8:58         ` Peter Zijlstra
  2018-12-07 13:41           ` Steven Rostedt
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-07  8:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Thu, Dec 06, 2018 at 01:19:46PM -0500, Steven Rostedt wrote:
> On Thu, 6 Dec 2018 09:34:00 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > 
> > > I don't understand this.. why are we using schedule_timeout() and all
> > > that?  
> > 
> > Urgh.. in fact, the more I look at this the more I hate it.
> > 
> > We want to block in __perf_output_begin(), but we cannot because both
> > tracepoints and perf will have preemptability disabled down there.
> > 
> > So what we do is fail the event, fake the lost count and go all the way
> > up that callstack, detect the failure and then poll-wait and retry.
> > 
> > And only do this for a few special events...  *yuck*
> 
> Since this is a special case, we should add a new option to the perf
> system call that, 1 states that it wants the traced process to block
> (and must have PTRACE permission to do so) and 2, after it reads from
> the buffer, it needs to check a bit that says "this process is blocked,
> please wake it up" and then do another perf call to kick the process to
> continue.
> 
> I really dislike the polling too. But because this is not a default
> case, and is a new feature, we can add more infrastructure to make it
> work properly, instead of trying to hack the current method into
> something that does something poorly.

So why are we doing this? What makes the syscall tracepoints so much
more special than many of the others that we need to overhaul our
fundamental design principles for them?

These patches give no justification *what*so*ever* for why we're doing
ugly arse things like this. And why does this, whatever this is, need to
be done in perf?

IOW, what problem are we solving ?


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07  8:58         ` Peter Zijlstra
@ 2018-12-07 13:41           ` Steven Rostedt
  2018-12-07 15:11             ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2018-12-07 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Fri, 7 Dec 2018 09:58:39 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> These patches give no justification *what*so*ever* for why we're doing
> ugly arse things like this. And why does this, whatever this is, need to
> be done in perf?
> 
> IOW, what problem are we solving ?

I guess the cover letter should have had a link (or copy) of this:

 http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home

-- Steve


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07 13:41           ` Steven Rostedt
@ 2018-12-07 15:11             ` Peter Zijlstra
  2018-12-07 15:49               ` Arnaldo Carvalho de Melo
  2018-12-07 20:14               ` Steven Rostedt
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-07 15:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Fri, Dec 07, 2018 at 08:41:18AM -0500, Steven Rostedt wrote:
> On Fri, 7 Dec 2018 09:58:39 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > These patches give no justification *what*so*ever* for why we're doing
> > ugly arse things like this. And why does this, whatever this is, need to
> > be done in perf?
> > 
> > IOW, what problem are we solving ?
> 
> I guess the cover letter should have had a link (or copy) of this:
> 
>  http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home

That doesn't even begin to explain it. Who cares about strace and why? And
why is it such a bad thing to lose the occasional record, etc.?



* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07 15:11             ` Peter Zijlstra
@ 2018-12-07 15:49               ` Arnaldo Carvalho de Melo
  2018-12-08 10:41                 ` Peter Zijlstra
  2018-12-07 20:14               ` Steven Rostedt
  1 sibling, 1 reply; 45+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-12-07 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Jiri Olsa, lkml, Ingo Molnar, Namhyung Kim,
	Alexander Shishkin, Thomas Gleixner, Luis Claudio R. Goncalves,
	ldv, esyr, Frederic Weisbecker

Em Fri, Dec 07, 2018 at 04:11:05PM +0100, Peter Zijlstra escreveu:
> On Fri, Dec 07, 2018 at 08:41:18AM -0500, Steven Rostedt wrote:
> > On Fri, 7 Dec 2018 09:58:39 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> > > These patches give no justification *what*so*ever* for why we're doing
> > > ugly arse things like this. And why does this, whatever this is, need to
> > > be done in perf?

> > > IOW, what problem are we solving ?

> > I guess the cover letter should have had a link (or copy) of this:

> >  http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home

> That doesn't even begin to explain. Who cares about strace and why? And
> why is it such a bad thing to lose the occasional record etc..

The following almost addresses all the details needed for doing such a
perf-based strace tool (modulo signals). It was started by Thomas a long
time ago [1] and has been in the tree for a long time, with BPF attached
to the raw_syscalls:sys_{enter,exit} tracepoints (that -e
augmented_raw_syscalls.c thing that lives in tools/perf/examples/bpf now
but would be put in place transparently and pre-compiled as
augmented_raw_syscalls.o) to collect pointer contents without using ptrace:

[root@seventh bpf]# perf trace --no-inherit -e augmented_raw_syscalls.c sleep 1
     0.019 ( 0.001 ms): brk(                                                                  ) = 0x559ca9dd7000
     0.031 ( 0.004 ms): access(filename: , mode: R                                            ) = -1 ENOENT No such file or directory
     0.039 ( 0.003 ms): openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC           ) = 3
     0.043 ( 0.002 ms): fstat(fd: 3, statbuf: 0x7ffdfa50e940                                  ) = 0
     0.045 ( 0.003 ms): mmap(len: 103484, prot: READ, flags: PRIVATE, fd: 3                   ) = 0x7fa76b2ce000
     0.049 ( 0.001 ms): close(fd: 3                                                           ) = 0
     0.056 ( 0.003 ms): openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC           ) = 3
     0.060 ( 0.002 ms): read(fd: 3, buf: 0x7ffdfa50eb08, count: 832                           ) = 832
     0.063 ( 0.001 ms): fstat(fd: 3, statbuf: 0x7ffdfa50e9a0                                  ) = 0
     0.065 ( 0.002 ms): mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS            ) = 0x7fa76b2cc000
     0.070 ( 0.003 ms): mmap(len: 3889792, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3   ) = 0x7fa76ad0e000
     0.075 ( 0.005 ms): mprotect(start: 0x7fa76aebb000, len: 2093056                          ) = 0
     0.081 ( 0.005 ms): mmap(addr: 0x7fa76b0ba000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 1753088) = 0x7fa76b0ba000
     0.089 ( 0.002 ms): mmap(addr: 0x7fa76b0c0000, len: 14976, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa76b0c0000
     0.095 ( 0.001 ms): close(fd: 3                                                           ) = 0
     0.105 ( 0.001 ms): arch_prctl(option: 4098, arg2: 140357034366208                        ) = 0
     0.144 ( 0.004 ms): mprotect(start: 0x7fa76b0ba000, len: 16384, prot: READ                ) = 0
     0.159 ( 0.002 ms): mprotect(start: 0x559ca81aa000, len: 4096, prot: READ                 ) = 0
     0.163 ( 0.002 ms): mprotect(start: 0x7fa76b2e8000, len: 4096, prot: READ                 ) = 0
     0.167 ( 0.008 ms): munmap(addr: 0x7fa76b2ce000, len: 103484                              ) = 0
     0.218 ( 0.001 ms): brk(                                                                  ) = 0x559ca9dd7000
     0.220 ( 0.002 ms): brk(brk: 0x559ca9df8000                                               ) = 0x559ca9df8000
     0.224 ( 0.001 ms): brk(                                                                  ) = 0x559ca9df8000
     0.228 ( 0.004 ms): open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC         ) = 3
     0.233 ( 0.001 ms): fstat(fd: 3, statbuf: 0x7fa76b0bfaa0                                  ) = 0
     0.235 ( 0.003 ms): mmap(len: 113045344, prot: READ, flags: PRIVATE, fd: 3                ) = 0x7fa76413f000
     0.241 ( 0.001 ms): close(fd: 3                                                           ) = 0
     0.264 (1000.163 ms): nanosleep(rqtp: 0x7ffdfa50f680                                        ) = 0
  1000.440 ( 0.002 ms): close(fd: 1                                                           ) = 0
  1000.443 ( 0.001 ms): close(fd: 2                                                           ) = 0
  1000.449 (         ): exit_group(                                                           )
[root@seventh bpf]#

doing just:

   # perf trace --filter-pids pid-of-your-local-xterm,gnome-shell-pid-for-your-term

For system-wide stracing, eliminating the feedback loop of your GUI
components will work most of the time with the default ring buffer size;
when it doesn't, you'll get things like that "LOST events!" message, and
for such extreme cases, just use '-m' to bump the rb size some more.

# trace  --filter-pids 2279,1643
<SNIP>
  8793.387 ( 0.003 ms): cc1/19097 openat(dfd: CWD, filename: 0x3df9940, flags: NOCTTY                   ) = 21
  8793.390 ( 0.001 ms): cc1/19097 fstat(fd: 21, statbuf: 0x3df9648                                      ) = 0
  8793.394 ( 0.003 ms): cc1/19097 read(fd: 21, buf: 0x3d6e0f0, count: 5804                              ) = 5804
  8793.398 ( 0.001 ms): cc1/19097 close(fd: 21                                                          ) = 0
  8793.403 ( 0.002 ms): cc1/19097 openat(dfd: CWD, filename: 0x3dfca70, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  8793.406 ( 0.001 ms): cc1/19097 openat(dfd: CWD, filename: 0x3dfca70, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  8793.409 ( 0.002 ms): cc1/19097 openat(dfd: CWD, filename: 0x3dfca70, flags: NOCTTY                   ) = 21
  8793.412 ( 0.001 ms): cc1/19097 fstat(fd: 21, statbuf: 0x3df9738                                      ) = 0
  8793.414 ( 0.022 ms): cc1/19097 read(fd: 21, buf: 0x3eb6d90, count: 25315                             ) = 25315
  8793.437 ( 0.001 ms): cc1/19097 close(fd: 21                                                          ) = 0
  8793.605 ( 0.005 ms): cc1/19097 openat(dfd: CWD, filename: 0x3dfca70                                  ) = 21
  8793.614 ( 0.001 ms): cc1/19097 close(fd: 5                                                           ) = 0
  8793.618 ( 0.001 ms): cc1/19097 fstat(fd: 21, statbuf: 0x7ffd653ed380                                 ) = 0
  8793.621 ( 0.002 ms): cc1/19097 read(fd: 21, buf: 0x3d41e30, count: 8192                              ) = 8192
  8793.626 ( 0.002 ms): cc1/19097 read(fd: 21, buf: 0x3ebf090, count: 8192                              ) = 8192
  8793.635 ( 0.003 ms): cc1/19097 read(fd: 21, buf: 0x3ed4730, count: 16384                             ) = 8931
  8793.640 ( 0.001 ms): cc1/19097 read(fd: 21, buf: 0x3ed6a13, count: 4096                              ) = 0
LOST 34956 events!
  9497.921 ( 0.006 ms): gcc/19180 openat(dfd: CWD, filename: 0x3115e20, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.935 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3115e20, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.939 ( 0.003 ms): gcc/19180 openat(dfd: CWD, filename: 0x3115e20, flags: NOCTTY                   ) = 9
  9497.943 ( 0.001 ms): gcc/19180 fstat(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, statbuf: 0x3153568) = 0
  9497.945 ( 0.006 ms): gcc/19180 read(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, buf: 0x31c00a0, count: 40574) = 40574
  9497.952 ( 0.001 ms): gcc/19180 close(fd: 9</home/acme/git/linux/include/linux/page_ref.h>            ) = 0
  9497.959 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x31540a0, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.962 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x31540a0, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.965 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x31540a0, flags: NOCTTY                   ) = 9
  9497.969 ( 0.001 ms): gcc/19180 fstat(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, statbuf: 0x30dcf38) = 0
  9497.970 ( 0.003 ms): gcc/19180 read(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, buf: 0x31b8320, count: 23058) = 23058
  9497.975 ( 0.001 ms): gcc/19180 close(fd: 9</home/acme/git/linux/include/linux/page_ref.h>            ) = 0
  9497.979 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154020, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.982 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154020, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9497.986 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154020, flags: NOCTTY                   ) = 9
  9497.989 ( 0.001 ms): gcc/19180 fstat(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, statbuf: 0x3116178) = 0
  9497.990 ( 0.001 ms): gcc/19180 read(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, buf: 0x31509c0, count: 1948) = 1948
  9497.993 ( 0.001 ms): gcc/19180 close(fd: 9</home/acme/git/linux/include/linux/page_ref.h>            ) = 0
  9498.039 ( 0.003 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154060, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9498.043 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154060, flags: NOCTTY                   ) = -1 ENOENT No such file or directory
  9498.046 ( 0.002 ms): gcc/19180 openat(dfd: CWD, filename: 0x3154060, flags: NOCTTY                   ) = 9
  9498.049 ( 0.001 ms): gcc/19180 fstat(fd: 9</home/acme/git/linux/include/linux/page_ref.h>, statbuf: 0x3153a08) = 0

'trace' is just 'perf trace': perf has logic to see if argv[0] is
'trace', and then it goes and does a 'perf trace'. The above case doesn't
use BPF at all; it gets details from /proc.

With bpf we get the 'filename' contents:

[root@seventh bpf]# trace -e augmented_raw_syscalls.c  --filter-pids 2279,1643
<SNIP>
 19766.027 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libz.so.1, flags: CLOEXEC           ) = 5
 19766.035 ( 0.001 ms): gcc/27524 fstat(fd: 5, statbuf: 0x7ffe9323e2a0                                  ) = 0
 19766.037 ( 0.003 ms): gcc/27524 mmap(len: 2187272, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 5   ) = 0x7fa2df435000
 19766.042 ( 0.003 ms): gcc/27524 mprotect(start: 0x7fa2df44b000, len: 2093056                          ) = 0
 19766.046 ( 0.004 ms): gcc/27524 mmap(addr: 0x7fa2df64a000, len: 4096, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 5, off: 86016) = 0x7fa2df64a000
 19766.051 ( 0.002 ms): gcc/27524 mmap(addr: 0x7fa2df64b000, len: 8, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa2df64b000
 19766.057 ( 0.001 ms): gcc/27524 close(fd: 5                                                           ) = 0
 19766.062 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC           ) = 5
<SNIP>

- Arnaldo

[1] https://lwn.net/Articles/415728/


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07 15:11             ` Peter Zijlstra
  2018-12-07 15:49               ` Arnaldo Carvalho de Melo
@ 2018-12-07 20:14               ` Steven Rostedt
  2018-12-08 10:44                 ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2018-12-07 20:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Fri, 7 Dec 2018 16:11:05 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Dec 07, 2018 at 08:41:18AM -0500, Steven Rostedt wrote:
> > On Fri, 7 Dec 2018 09:58:39 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > These patches give no justification *what*so*ever* for why we're doing
> > > ugly arse things like this. And why does this, whatever this is, need to
> > > be done in perf?
> > > 
> > > IOW, what problem are we solving ?  
> > 
> > I guess the cover letter should have had a link (or copy) of this:
> > 
> >  http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home  
> 
> That doesn't even begin to explain. Who cares about strace and why? And
> why is it such a bad thing to lose the occasional record etc..

Who cares about strace? Do I really need to answer that? It's one of
the most used tools for seeing what a program is doing.

Why do we care about lost events? Because strace records *all* events,
as that's what it does and that's what it always has done. It would be
a break in functionality (a regression) if it were to start losing
events. I use strace to see everything that an application is doing.

Peter, I think you've spent too much time in the kernel. There's a
whole world out there that lives in userspace ;-)

When we discussed this at plumbers, Oracle people came to me and said
how awesome it would be to run strace against their database accesses.
The problem today is that strace causes such a large overhead that it
isn't feasible to trace any high-speed applications, especially if
there are time constraints involved.

If you don't like this for perf, I'll be happy to implement something in
ftrace. I just figured that the perf interface was more suitable for
something like this.

-- Steve


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07 15:49               ` Arnaldo Carvalho de Melo
@ 2018-12-08 10:41                 ` Peter Zijlstra
  2018-12-08 17:34                   ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-08 10:41 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, Jiri Olsa, lkml, Ingo Molnar, Namhyung Kim,
	Alexander Shishkin, Thomas Gleixner, Luis Claudio R. Goncalves,
	ldv, esyr, Frederic Weisbecker

On Fri, Dec 07, 2018 at 12:49:21PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Dec 07, 2018 at 04:11:05PM +0100, Peter Zijlstra escreveu:
> > On Fri, Dec 07, 2018 at 08:41:18AM -0500, Steven Rostedt wrote:
> > > On Fri, 7 Dec 2018 09:58:39 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > These patches give no justification *what*so*ever* for why we're doing
> > > > ugly arse things like this. And why does this, whatever this is, need to
> > > > be done in perf?
> 
> > > > IOW, what problem are we solving ?
> 
> > > I guess the cover letter should have had a link (or copy) of this:
> 
> > >  http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home
> 
> > That doesn't even begin to explain. Who cares about strace and why? And
> > why is it such a bad thing to lose the occasional record etc..
> 
> The following almost addresses all the details for doing such a perf
> based strace tool (modulo signals), was started by Thomas a long time
> ago [1] and is in the tree for a long time, with BPF attached to the
> raw_syscalls:sys_{enter,exit} (that -e augmented_raw_syscalls.c thing
> that lives in tools/perf/examples/bpf now but would be put in place
> transparently and pre-compiled as augmented_raw_syscalls.o) to collect
> pointer contents without using ptrace:

> For system-wide stracing, eliminating the feedback loop of your GUI
> components will most of the time work with the default ring buffer size
> and when it doesn't, you'll get things like that "LOST events!" message,
> and for such extreme cases, just use '-m' to bump the rb size some more.

> # trace  --filter-pids 2279,1643

>   8793.640 ( 0.001 ms): cc1/19097 read(fd: 21, buf: 0x3ed6a13, count: 4096                              ) = 0
> LOST 34956 events!
>   9497.921 ( 0.006 ms): gcc/19180 openat(dfd: CWD, filename: 0x3115e20, flags: NOCTTY                   ) = -1 ENOENT No such file or directory

> 'trace' is just 'perf trace', perf has logic to see if argv[0]  is
> 'trace', then it goes and does a 'perf trace'. The above case doesn't use
> bpf at all, gets details from /proc.
> 
> With bpf we get the 'filename' contents:
> 
> [root@seventh bpf]# trace -e augmented_raw_syscalls.c  --filter-pids 2279,1643
> <SNIP>
>  19766.027 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libz.so.1, flags: CLOEXEC           ) = 5
>  19766.035 ( 0.001 ms): gcc/27524 fstat(fd: 5, statbuf: 0x7ffe9323e2a0                                  ) = 0
>  19766.037 ( 0.003 ms): gcc/27524 mmap(len: 2187272, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 5   ) = 0x7fa2df435000
>  19766.042 ( 0.003 ms): gcc/27524 mprotect(start: 0x7fa2df44b000, len: 2093056                          ) = 0
>  19766.046 ( 0.004 ms): gcc/27524 mmap(addr: 0x7fa2df64a000, len: 4096, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 5, off: 86016) = 0x7fa2df64a000
>  19766.051 ( 0.002 ms): gcc/27524 mmap(addr: 0x7fa2df64b000, len: 8, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa2df64b000
>  19766.057 ( 0.001 ms): gcc/27524 close(fd: 5                                                           ) = 0
>  19766.062 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC           ) = 5
> <SNIP>

Right; and that is all nice. And it exactly doesn't answer my question. Why
do we care about those LOST entries so much that we have to do such
horribly ugly things?

Esp. as you point out, they're clearly marked in the output and easily
avoided by using a slightly larger buffer.


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-07 20:14               ` Steven Rostedt
@ 2018-12-08 10:44                 ` Peter Zijlstra
  2018-12-08 17:38                   ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-08 10:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Fri, Dec 07, 2018 at 03:14:33PM -0500, Steven Rostedt wrote:
> On Fri, 7 Dec 2018 16:11:05 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Dec 07, 2018 at 08:41:18AM -0500, Steven Rostedt wrote:
> > > On Fri, 7 Dec 2018 09:58:39 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >   
> > > > These patches give no justification *what*so*ever* for why we're doing
> > > > ugly arse things like this. And why does this, whatever this is, need to
> > > > be done in perf?
> > > > 
> > > > IOW, what problem are we solving ?  
> > > 
> > > I guess the cover letter should have had a link (or copy) of this:
> > > 
> > >  http://lkml.kernel.org/r/20181128134700.212ed035@gandalf.local.home  
> > 
> > That doesn't even begin to explain. Who cares about strace and why? And
> > why is it such a bad thing to lose the occasional record etc..
> 
> Who cares about strace? Do I really need to answer that? It's one of
> the most used tools for seeing what a program is doing.

It's a tool I haven't used in years, given we have so many better tools
around these days.

> Why do we care about lost events? Because strace records *all* events,
> as that's what it does and that's what it always has done. It would be
> a break in functionality (a regression) if it were to start losing
> events. I use strace to see everything that an application is doing.

So make a new tool; break the expectation of all events. See if there's
anybody that really cares.

> When we discussed this at plumbers, Oracle people came to me and said
> how awesome it would be to run strace against their database accesses.
> The problem today is that strace causes such a large overhead that it
> isn't feasible to trace any high speed applications, especially if
> there are time restraints involved.

So have them run that perf thing acme pointed to.

So far nobody's made a good argument for why we cannot have LOST events.


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-08 10:41                 ` Peter Zijlstra
@ 2018-12-08 17:34                   ` Steven Rostedt
  0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2018-12-08 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Sat, 8 Dec 2018 11:41:21 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > [root@seventh bpf]# trace -e augmented_raw_syscalls.c  --filter-pids 2279,1643
> > <SNIP>
> >  19766.027 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libz.so.1, flags: CLOEXEC           ) = 5
> >  19766.035 ( 0.001 ms): gcc/27524 fstat(fd: 5, statbuf: 0x7ffe9323e2a0                                  ) = 0
> >  19766.037 ( 0.003 ms): gcc/27524 mmap(len: 2187272, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 5   ) = 0x7fa2df435000
> >  19766.042 ( 0.003 ms): gcc/27524 mprotect(start: 0x7fa2df44b000, len: 2093056                          ) = 0
> >  19766.046 ( 0.004 ms): gcc/27524 mmap(addr: 0x7fa2df64a000, len: 4096, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 5, off: 86016) = 0x7fa2df64a000
> >  19766.051 ( 0.002 ms): gcc/27524 mmap(addr: 0x7fa2df64b000, len: 8, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa2df64b000
> >  19766.057 ( 0.001 ms): gcc/27524 close(fd: 5                                                           ) = 0
> >  19766.062 ( 0.003 ms): gcc/27524 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC           ) = 5
> > <SNIP>  
> 
> Right; and that is all nice. And exactly doesn't answer my question. Why
> do we care about those LOST entries so much that we have to do such
> horribly ugly things?
> 
> Esp. as you point out, they're clearly marked in the output and easily
> avoided by using a slightly larger buffer.

For small cases like this a slightly larger buffer won't help. And it
would suck if you are tracing something that takes hours to run to find
out why there's some kind of anomaly, only to discover that the anomaly
happened in the lost events.

Yes, there is a use case for a guarantee of no lost events!


-- Steve


* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-08 10:44                 ` Peter Zijlstra
@ 2018-12-08 17:38                   ` Steven Rostedt
  2018-12-10 10:18                     ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2018-12-08 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Sat, 8 Dec 2018 11:44:23 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> It's a tool I haven't used in years, given we have so many better tools
> around these days.

So because you don't use it, it's useless? As you don't care about lost
events, I can see why you may think there are better tools out there.
But since those tools don't guarantee no lost events, they are
obviously not better for those who do care about lost events.

> 
> > Why do we care about lost events? Because strace records *all* events,
> > as that's what it does and that's what it always has done. It would be
> > a break in functionality (a regression) if it were to start losing
> > events. I use strace to see everything that an application is doing.  
> 
> So make a new tool; break the expectation of all events. See if there's
> anybody that really cares.

Basically you are saying, break strace and see if anyone notices?

> 
> > When we discussed this at plumbers, Oracle people came to me and said
> > how awesome it would be to run strace against their database accesses.
> > The problem today is that strace causes such a large overhead that it
> > isn't feasible to trace any high speed applications, especially if
> > there are time restraints involved.  
> 
> So have them run that perf thing acme pointed to.
> 
> So far nobody's made a good argument for why we cannot have LOST events.

If you don't see the use case, I'm not sure anyone can convince you.
Again, I like the fact that when I strace an application I know that all
system calls made by the application I'm tracing are recorded. I don't
need to worry about what happened in the "lost events" space.

-- Steve



* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-08 17:38                   ` Steven Rostedt
@ 2018-12-10 10:18                     ` Peter Zijlstra
  2018-12-13  0:39                       ` Dmitry V. Levin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-10 10:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, ldv, esyr, Frederic Weisbecker

On Sat, Dec 08, 2018 at 12:38:05PM -0500, Steven Rostedt wrote:
> On Sat, 8 Dec 2018 11:44:23 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:

> > > Why do we care about lost events? Because strace records *all* events,
> > > as that's what it does and that's what it always has done. It would be
> > > a break in functionality (a regression) if it were to start losing
> > > events. I use strace to see everything that an application is doing.  
> > 
> > So make a new tool; break the expectation of all events. See if there's
> > anybody that really cares.
> 
> Basically you are saying, break strace and see if anyone notices?

Nah, give it a new name. Clearly mark this is a new tool.

> > > When we discussed this at plumbers, Oracle people came to me and said
> > > how awesome it would be to run strace against their database accesses.
> > > The problem today is that strace causes such a large overhead that it
> > > isn't feasible to trace any high speed applications, especially if
> > > there are time restraints involved.  
> > 
> > So have them run that perf thing acme pointed to.
> > 
> > So far nobody's made a good argument for why we cannot have LOST events.
> 
> If you don't see the use case, I'm not sure anyone can convince you.
> Again, I like the fact that when I do a strace of an application I know
> that all system calls that the application I'm tracing is recorded. I
> don't need to worry about what happened in the "lost events" space.

You're the one pushing for this crap without _any_ justification. Why
are you getting upset if I ask for some?

If people care so much, it shouldn't be hard to write up a coherent
story on this, so far all I seem to get is: because it's always been
like that.

Which really isn't much of an argument.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-10 10:18                     ` Peter Zijlstra
@ 2018-12-13  0:39                       ` Dmitry V. Levin
  2018-12-13  1:26                         ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Dmitry V. Levin @ 2018-12-13  0:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Jiri Olsa, Arnaldo Carvalho de Melo, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

[-- Attachment #1: Type: text/plain, Size: 3160 bytes --]

Hi Peter,

On Mon, Dec 10, 2018 at 11:18:18AM +0100, Peter Zijlstra wrote:
> On Sat, Dec 08, 2018 at 12:38:05PM -0500, Steven Rostedt wrote:
> > On Sat, 8 Dec 2018 11:44:23 +0100, Peter Zijlstra wrote:
> 
> > > > Why do we care about lost events? Because strace records *all* events,
> > > > as that's what it does and that's what it always has done. It would be
> > > > a break in functionality (a regression) if it were to start losing
> > > > events. I use strace to see everything that an application is doing.  
> > > 
> > > So make a new tool; break the expectation of all events. See if there's
> > > anybody that really cares.
> > 
> > Basically you are saying, break strace and see if anyone notices?
> 
> Nah, give it a new name. Clearly mark this as a new tool.
> 
> > > > When we discussed this at plumbers, Oracle people came to me and said
> > > > how awesome it would be to run strace against their database accesses.
> > > > The problem today is that strace causes such a large overhead that it
> > > > isn't feasible to trace any high speed applications, especially if
> > > > there are time constraints involved.  
> > > 
> > > So have them run that perf thing acme pointed to.
> > > 
> > > So far nobody's made a good argument for why we cannot have LOST events.
> > 
> > If you don't see the use case, I'm not sure anyone can convince you.
> > Again, I like the fact that when I do a strace of an application I know
> > that all system calls the application I'm tracing makes are recorded. I
> > don't need to worry about what happened in the "lost events" space.
> 
> You're the one pushing for this crap without _any_ justification. Why
> are you getting upset if I ask for some?
> 
> If people care so much, it shouldn't be hard to write up a coherent
> story on this; so far all I seem to get is: because it's always been
> like that.
> 
> Which really isn't much of an argument.

As you rightly pointed out, strace users expect that no events are
lost because it's always been like that, and it would require some effort
to imagine what kinds of things are going to break if this is no longer
the case.

Last FOSDEM I attended a talk [1] by Philippe Ombredanne, he was speaking
about a strace-based tool called TraceCode that constructs build graphs
that are used to find out "exactly which files were built, by what and how
they were transformed in multiple steps from sources to your final binaries".

Imagine telling Philippe that strace now works faster but is no longer
reliable because some events may be lost, and that he would have to repeat
builds under strace again and again until he got lucky.  I can imagine his
reaction to this piece of news, and I certainly wouldn't like to be the
messenger.

btw, I didn't ask for the implementation to be ugly.
You don't have to introduce polling into the kernel if you don't want to,
userspace is perfectly capable of invoking wait4(2) in a loop.
Just block the tracee, notify the tracer, and let it pick up the pieces.

[1] https://archive.fosdem.org/2018/schedule/event/debugging_tools_stracing_build/


-- 
ldv

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13  0:39                       ` Dmitry V. Levin
@ 2018-12-13  1:26                         ` Steven Rostedt
  2018-12-13  1:49                           ` Dmitry V. Levin
  2018-12-13 10:01                           ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2018-12-13  1:26 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Peter Zijlstra, Jiri Olsa, Arnaldo Carvalho de Melo, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

On Thu, 13 Dec 2018 03:39:38 +0300
"Dmitry V. Levin" <ldv@altlinux.org> wrote:

> btw, I didn't ask for the implementation to be ugly.
> You don't have to introduce polling into the kernel if you don't want to,
> userspace is perfectly capable of invoking wait4(2) in a loop.
> Just block the tracee, notify the tracer, and let it pick up the pieces.

Note, there's been some discussion offlist to only have perf set a flag
when it dropped an event and have the ptrace code do the heavy lifting
of blocking the task and waking it back up. I think that would be a
cleaner solution and won't muck with perf as badly.

-- Steve

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13  1:26                         ` Steven Rostedt
@ 2018-12-13  1:49                           ` Dmitry V. Levin
  2018-12-13 10:01                           ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Dmitry V. Levin @ 2018-12-13  1:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Jiri Olsa, Arnaldo Carvalho de Melo, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

[-- Attachment #1: Type: text/plain, Size: 910 bytes --]

On Wed, Dec 12, 2018 at 08:26:39PM -0500, Steven Rostedt wrote:
> On Thu, 13 Dec 2018 03:39:38 +0300, "Dmitry V. Levin" <ldv@altlinux.org> wrote:
> 
> > btw, I didn't ask for the implementation to be ugly.
> > You don't have to introduce polling into the kernel if you don't want to,
> > userspace is perfectly capable of invoking wait4(2) in a loop.
> > Just block the tracee, notify the tracer, and let it pick up the pieces.
> 
> Note, there's been some discussion offlist to only have perf set a flag
> when it dropped an event and have the ptrace code do the heavy lifting
> of blocking the task and waking it back up. I think that would be a
> cleaner solution and won't muck with perf as badly.

Yes, if perf could be instructed to invoke something like
tracehook_report_syscall_entry/exit when it drops the event
of entering/exiting a syscall, that should probably be enough
for the ptracer to do the recovery.


-- 
ldv

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13  1:26                         ` Steven Rostedt
  2018-12-13  1:49                           ` Dmitry V. Levin
@ 2018-12-13 10:01                           ` Peter Zijlstra
  2018-12-13 10:05                             ` Peter Zijlstra
                                               ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-13 10:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Dmitry V. Levin, Jiri Olsa, Arnaldo Carvalho de Melo,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

On Wed, Dec 12, 2018 at 08:26:39PM -0500, Steven Rostedt wrote:
> On Thu, 13 Dec 2018 03:39:38 +0300
> "Dmitry V. Levin" <ldv@altlinux.org> wrote:
> 
> > btw, I didn't ask for the implementation to be ugly.
> > You don't have to introduce polling into the kernel if you don't want to,
> > userspace is perfectly capable of invoking wait4(2) in a loop.
> > Just block the tracee, notify the tracer, and let it pick up the pieces.
> 
> Note, there's been some discussion offlist to only have perf set a flag
> when it dropped an event and have the ptrace code do the heavy lifting
> of blocking the task and waking it back up. I think that would be a
> cleaner solution and won't muck with perf as badly.

It's still really horrid -- the question is not if we can come up with
something, anything, to make strace work. The question is if we can
extend something in a sane and maintainable manner to allow this.

So there's a whole bunch of problems I see with all this, in no
particular order:

 - we cannot block when writing to the actual buffer, and have to unroll
   the callstack and bolt on the blocking manually in a few specific
   sites. This is ugly, inconsistent and maintenance heavy.

 - it only works for some 'magic' events that got the treatment, but not
   for many others you might expect it to work for, with no real
   indication of which or why.

 - the wakeups side is icky; the best I can come up with is making the
   data page R/O and single stepping on write fault, but that isn't
   multi-threading safe.

   Another alternative would be keeping the whole page R/O and
   using write(2) or an ioctl() to update the head pointer.

Again, if we're going to do this, it needs to be done well and
consistently, not as a special hack to enable strace-like
functionality. And without clean and sane solutions to the above I just
don't see it happening.

Note that the first 2 points are equally true for ftrace; so I don't see
how we could sanely add it there either.


One, very big maybe, would be to add a new tracepoint type that includes
a might_sleep() and we very carefully undo all the preempt_disable and
go sleep where we should. That also gives the tracepoint crud the
information it needs to publish the capability to userspace.

We also have to consider (and possibly forbid) mixing blocking and
!blocking events to the same buffer.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13 10:01                           ` Peter Zijlstra
@ 2018-12-13 10:05                             ` Peter Zijlstra
  2018-12-13 10:08                             ` Peter Zijlstra
  2018-12-13 11:29                             ` Jiri Olsa
  2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-13 10:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Dmitry V. Levin, Jiri Olsa, Arnaldo Carvalho de Melo,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

On Thu, Dec 13, 2018 at 11:01:49AM +0100, Peter Zijlstra wrote:
> One, very big maybe, would be to add a new tracepoint type that includes
> a might_sleep() and we very carefully undo all the preempt_disable and
> go sleep where we should. That also gives the tracepoint crud the
> information it needs to publish the capability to userspace.

Note that this includes handling fun things like the buffer getting
closed while we wait.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13 10:01                           ` Peter Zijlstra
  2018-12-13 10:05                             ` Peter Zijlstra
@ 2018-12-13 10:08                             ` Peter Zijlstra
  2018-12-13 11:29                             ` Jiri Olsa
  2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-12-13 10:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Dmitry V. Levin, Jiri Olsa, Arnaldo Carvalho de Melo,
	Ingo Molnar, Namhyung Kim, Alexander Shishkin, Thomas Gleixner,
	Luis Claudio R. Goncalves, Eugene Syromyatnikov,
	Frederic Weisbecker, lkml

On Thu, Dec 13, 2018 at 11:01:49AM +0100, Peter Zijlstra wrote:
>  - the wakeups side is icky; the best I can come up with is making the
>    data page R/O and single stepping on write fault, but that isn't
>    multi-threading safe.

We can emulate the instruction, that would actually work and be thread
safe. Just a small matter of decoding and interpreting any possible
write instruction ;-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] perf: Allow to block process in syscall tracepoints
  2018-12-13 10:01                           ` Peter Zijlstra
  2018-12-13 10:05                             ` Peter Zijlstra
  2018-12-13 10:08                             ` Peter Zijlstra
@ 2018-12-13 11:29                             ` Jiri Olsa
  2 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2018-12-13 11:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Dmitry V. Levin, Jiri Olsa,
	Arnaldo Carvalho de Melo, Ingo Molnar, Namhyung Kim,
	Alexander Shishkin, Thomas Gleixner, Luis Claudio R. Goncalves,
	Eugene Syromyatnikov, Frederic Weisbecker, lkml

On Thu, Dec 13, 2018 at 11:01:49AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 12, 2018 at 08:26:39PM -0500, Steven Rostedt wrote:
> > On Thu, 13 Dec 2018 03:39:38 +0300
> > "Dmitry V. Levin" <ldv@altlinux.org> wrote:
> > 
> > > btw, I didn't ask for the implementation to be ugly.
> > > You don't have to introduce polling into the kernel if you don't want to,
> > > userspace is perfectly capable of invoking wait4(2) in a loop.
> > > Just block the tracee, notify the tracer, and let it pick up the pieces.
> > 
> > Note, there's been some discussion offlist to only have perf set a flag
> > when it dropped an event and have the ptrace code do the heavy lifting
> > of blocking the task and waking it back up. I think that would be a
> > cleaner solution and won't muck with perf as badly.
> 
> It's still really horrid -- the question is not if we can come up with
> something, anything, to make strace work. The question is if we can
> extend something in a sane and maintainable manner to allow this.
> 
> So there's a whole bunch of problems I see with all this, in no
> particular order:
> 
>  - we cannot block when writing to the actual buffer, and have to unroll
>    the callstack and bolt on the blocking manually in a few specific
>    sites. This is ugly, inconsistent and maintenance heavy.
> 
>  - it only works for some 'magic' events that got the treatment, but not
>    for many others you might expect it to work for, with no real
>    indication of which or why.
> 
>  - the wakeups side is icky; the best I can come up with is making the
>    data page R/O and single stepping on write fault, but that isn't
>    multi-threading safe.
> 
>    Another alternative would be keeping the whole page R/O and
>    using write(2) or an ioctl() to update the head pointer.
> 
> Again, if we're going to do this, it needs to be done well and
> consistently, not as a special hack to enable strace-like
> functionality. And without clean and sane solutions to the above I just
> don't see it happening.
> 
> Note that the first 2 points are equally true for ftrace; so I don't see
> how we could sanely add it there either.
> 
> 
> One, very big maybe, would be to add a new tracepoint type that includes
> a might_sleep() and we very carefully undo all the preempt_disable and
> go sleep where we should. That also gives the tracepoint crud the
> information it needs to publish the capability to userspace.

nice, I like this one.. seems like the cleanest solution

jirka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [tip:perf/core] perf ordered_events: Add ordered_events__flush_time interface
  2018-12-05 16:05 ` [PATCH 6/8] perf tools: Add ordered_events__flush_time interface Jiri Olsa
@ 2018-12-14 21:00   ` tip-bot for Jiri Olsa
  2018-12-18 14:27   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-14 21:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ldv, mingo, hpa, tglx, alexander.shishkin, fweisbec,
	linux-kernel, lclaudio, esyr, rostedt, acme, namhyung, jolsa,
	peterz

Commit-ID:  a4a77b93ae0bfd752caad14a34fa9465ea6d30a9
Gitweb:     https://git.kernel.org/tip/a4a77b93ae0bfd752caad14a34fa9465ea6d30a9
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:07 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Thu, 6 Dec 2018 16:43:28 -0300

perf ordered_events: Add ordered_events__flush_time interface

Add OE_FLUSH__TIME flush type, to be able to flush only a certain amount
of the queue based on the provided timestamp. It will be used in the
following patches.

Link: http://lkml.kernel.org/n/tip-a3na77vemwr1g92lfhlrztg5@git.kernel.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-7-jolsa@kernel.org
[ Fix the build on older systems such as centos 5 and 6 where 'time' shadows a global declaration ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/ordered-events.c | 23 +++++++++++++++++++----
 tools/perf/util/ordered-events.h |  2 ++
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/ordered-events.c b/tools/perf/util/ordered-events.c
index c5412db05683..46965643020b 100644
--- a/tools/perf/util/ordered-events.c
+++ b/tools/perf/util/ordered-events.c
@@ -219,8 +219,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 	return 0;
 }
 
-static int __ordered_events__flush(struct ordered_events *oe,
-				   bool show_progress)
+static int do_flush(struct ordered_events *oe, bool show_progress)
 {
 	struct list_head *head = &oe->events;
 	struct ordered_event *tmp, *iter;
@@ -263,7 +262,8 @@ static int __ordered_events__flush(struct ordered_events *oe,
 	return 0;
 }
 
-int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+static int __ordered_events__flush(struct ordered_events *oe, enum oe_flush how,
+				   u64 timestamp)
 {
 	static const char * const str[] = {
 		"NONE",
@@ -302,6 +302,11 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		break;
 	}
 
+	case OE_FLUSH__TIME:
+		oe->next_flush = timestamp;
+		show_progress = false;
+		break;
+
 	case OE_FLUSH__ROUND:
 	case OE_FLUSH__NONE:
 	default:
@@ -312,7 +317,7 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		   str[how], oe->nr_events);
 	pr_oe_time(oe->max_timestamp, "max_timestamp\n");
 
-	err = __ordered_events__flush(oe, show_progress);
+	err = do_flush(oe, show_progress);
 
 	if (!err) {
 		if (how == OE_FLUSH__ROUND)
@@ -328,6 +333,16 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 	return err;
 }
 
+int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+{
+	return __ordered_events__flush(oe, how, 0);
+}
+
+int ordered_events__flush_time(struct ordered_events *oe, u64 timestamp)
+{
+	return __ordered_events__flush(oe, OE_FLUSH__TIME, timestamp);
+}
+
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data)
 {
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 0c6e26aec0e3..f352af20e167 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -19,6 +19,7 @@ enum oe_flush {
 	OE_FLUSH__ROUND,
 	OE_FLUSH__HALF,
 	OE_FLUSH__TOP,
+	OE_FLUSH__TIME,
 };
 
 struct ordered_events;
@@ -55,6 +56,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 			  u64 timestamp, u64 file_offset);
 void ordered_events__delete(struct ordered_events *oe, struct ordered_event *event);
 int ordered_events__flush(struct ordered_events *oe, enum oe_flush how);
+int ordered_events__flush_time(struct ordered_events *oe, u64 timestamp);
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data);
 void ordered_events__free(struct ordered_events *oe);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:perf/core] perf trace: Move event delivery to a new deliver_event() function
  2018-12-05 16:05 ` [PATCH 7/8] perf trace: Move event delivery to deliver_event function Jiri Olsa
@ 2018-12-14 21:01   ` tip-bot for Jiri Olsa
  2018-12-18 14:28   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-14 21:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ldv, peterz, rostedt, fweisbec, acme, linux-kernel, tglx,
	namhyung, jolsa, hpa, mingo, lclaudio, alexander.shishkin, esyr

Commit-ID:  b2595c9b1e5ad11db05c6767e6d00cd03d76c96e
Gitweb:     https://git.kernel.org/tip/b2595c9b1e5ad11db05c6767e6d00cd03d76c96e
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:08 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Thu, 6 Dec 2018 16:43:28 -0300

perf trace: Move event delivery to a new deliver_event() function

Move event delivery code to a new trace__deliver_event() function, so
it's easier to add ordered delivery coming in the following patches.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-8-jolsa@kernel.org
[ Add trace__ prefix to the deliver_event method ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-trace.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 98aff12af9e6..3b6b1fecf2bb 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -2637,6 +2637,21 @@ static int trace__set_filter_pids(struct trace *trace)
 	return err;
 }
 
+static int trace__deliver_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	struct perf_sample sample;
+	int err;
+
+	err = perf_evlist__parse_sample(evlist, event, &sample);
+	if (err)
+		fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
+	else
+		trace__handle_event(trace, event, &sample);
+
+	return 0;
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2802,18 +2817,10 @@ again:
 			continue;
 
 		while ((event = perf_mmap__read_event(md)) != NULL) {
-			struct perf_sample sample;
-
 			++trace->nr_events;
 
-			err = perf_evlist__parse_sample(evlist, event, &sample);
-			if (err) {
-				fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
-				goto next_event;
-			}
+			trace__deliver_event(trace, event);
 
-			trace__handle_event(trace, event, &sample);
-next_event:
 			perf_mmap__consume(md);
 
 			if (interrupted)

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:perf/core] perf trace: Add ordered processing
  2018-12-05 16:05 ` [PATCH 8/8] perf trace: Add ordered processing for --block option Jiri Olsa
@ 2018-12-14 21:02   ` tip-bot for Jiri Olsa
  2018-12-18 14:29   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-14 21:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, acme, tglx, jolsa, namhyung, esyr, peterz, fweisbec,
	linux-kernel, hpa, lclaudio, mingo, ldv, alexander.shishkin

Commit-ID:  4085fed6373f15892959e3392efe5552504cfc94
Gitweb:     https://git.kernel.org/tip/4085fed6373f15892959e3392efe5552504cfc94
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:09 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Thu, 6 Dec 2018 16:43:28 -0300

perf trace: Add ordered processing

Sort events to provide the precise outcome of ordered events, just as
is done with 'perf report' and 'perf top'.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-9-jolsa@kernel.org
[ split from a larger patch, added trace__ prefixes to new 'struct trace' methods ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-trace.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 3b6b1fecf2bb..366ec3c8f580 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -127,6 +127,10 @@ struct trace {
 	bool			force;
 	bool			vfs_getname;
 	int			trace_pgfaults;
+	struct {
+		struct ordered_events	data;
+		u64			last;
+	} oe;
 };
 
 struct tp_field {
@@ -2652,6 +2656,42 @@ static int trace__deliver_event(struct trace *trace, union perf_event *event)
 	return 0;
 }
 
+static int trace__flush_ordered_events(struct trace *trace)
+{
+	u64 first = ordered_events__first_time(&trace->oe.data);
+	u64 flush = trace->oe.last - NSEC_PER_SEC;
+
+	/* Is there some thing to flush.. */
+	if (first && first < flush)
+		return ordered_events__flush_time(&trace->oe.data, flush);
+
+	return 0;
+}
+
+static int trace__deliver_ordered_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	int err;
+
+	err = perf_evlist__parse_sample_timestamp(evlist, event, &trace->oe.last);
+	if (err && err != -1)
+		return err;
+
+	err = ordered_events__queue(&trace->oe.data, event, trace->oe.last, 0);
+	if (err)
+		return err;
+
+	return trace__flush_ordered_events(trace);
+}
+
+static int ordered_events__deliver_event(struct ordered_events *oe,
+					 struct ordered_event *event)
+{
+	struct trace *trace = container_of(oe, struct trace, oe.data);
+
+	return trace__deliver_event(trace, event->event);
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2819,7 +2859,9 @@ again:
 		while ((event = perf_mmap__read_event(md)) != NULL) {
 			++trace->nr_events;
 
-			trace__deliver_event(trace, event);
+			err = trace__deliver_ordered_event(trace, event);
+			if (err)
+				goto out_disable;
 
 			perf_mmap__consume(md);
 
@@ -2842,6 +2884,9 @@ again:
 				draining = true;
 
 			goto again;
+		} else {
+			if (trace__flush_ordered_events(trace))
+				goto out_disable;
 		}
 	} else {
 		goto again;
@@ -2852,6 +2897,8 @@ out_disable:
 
 	perf_evlist__disable(evlist);
 
+	ordered_events__flush(&trace->oe.data, OE_FLUSH__FINAL);
+
 	if (!err) {
 		if (trace->summary)
 			trace__fprintf_thread_summary(trace, trace->output);
@@ -3562,6 +3609,9 @@ int cmd_trace(int argc, const char **argv)
 		}
 	}
 
+	ordered_events__init(&trace.oe.data, ordered_events__deliver_event, &trace);
+	ordered_events__set_copy_on_queue(&trace.oe.data, true);
+
 	/*
 	 * If we are augmenting syscalls, then combine what we put in the
 	 * __augmented_syscalls__ BPF map with what is in the

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:perf/core] perf ordered_events: Add ordered_events__flush_time interface
  2018-12-05 16:05 ` [PATCH 6/8] perf tools: Add ordered_events__flush_time interface Jiri Olsa
  2018-12-14 21:00   ` [tip:perf/core] perf ordered_events: " tip-bot for Jiri Olsa
@ 2018-12-18 14:27   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-18 14:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, jolsa, peterz, tglx, ldv, hpa, fweisbec, esyr,
	alexander.shishkin, namhyung, mingo, linux-kernel, lclaudio,
	rostedt

Commit-ID:  68ca5d07de207e56f57e887de23b03fbc1ebc2a6
Gitweb:     https://git.kernel.org/tip/68ca5d07de207e56f57e887de23b03fbc1ebc2a6
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:07 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 17 Dec 2018 15:02:12 -0300

perf ordered_events: Add ordered_events__flush_time interface

Add OE_FLUSH__TIME flush type, to be able to flush only a certain amount
of the queue based on the provided timestamp. It will be used in the
following patches.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-7-jolsa@kernel.org
[ Fix the build on older systems such as centos 5 and 6 where 'time' shadows a global declaration ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/ordered-events.c | 23 +++++++++++++++++++----
 tools/perf/util/ordered-events.h |  2 ++
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/ordered-events.c b/tools/perf/util/ordered-events.c
index c5412db05683..46965643020b 100644
--- a/tools/perf/util/ordered-events.c
+++ b/tools/perf/util/ordered-events.c
@@ -219,8 +219,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 	return 0;
 }
 
-static int __ordered_events__flush(struct ordered_events *oe,
-				   bool show_progress)
+static int do_flush(struct ordered_events *oe, bool show_progress)
 {
 	struct list_head *head = &oe->events;
 	struct ordered_event *tmp, *iter;
@@ -263,7 +262,8 @@ static int __ordered_events__flush(struct ordered_events *oe,
 	return 0;
 }
 
-int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+static int __ordered_events__flush(struct ordered_events *oe, enum oe_flush how,
+				   u64 timestamp)
 {
 	static const char * const str[] = {
 		"NONE",
@@ -302,6 +302,11 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		break;
 	}
 
+	case OE_FLUSH__TIME:
+		oe->next_flush = timestamp;
+		show_progress = false;
+		break;
+
 	case OE_FLUSH__ROUND:
 	case OE_FLUSH__NONE:
 	default:
@@ -312,7 +317,7 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 		   str[how], oe->nr_events);
 	pr_oe_time(oe->max_timestamp, "max_timestamp\n");
 
-	err = __ordered_events__flush(oe, show_progress);
+	err = do_flush(oe, show_progress);
 
 	if (!err) {
 		if (how == OE_FLUSH__ROUND)
@@ -328,6 +333,16 @@ int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
 	return err;
 }
 
+int ordered_events__flush(struct ordered_events *oe, enum oe_flush how)
+{
+	return __ordered_events__flush(oe, how, 0);
+}
+
+int ordered_events__flush_time(struct ordered_events *oe, u64 timestamp)
+{
+	return __ordered_events__flush(oe, OE_FLUSH__TIME, timestamp);
+}
+
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data)
 {
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 0c6e26aec0e3..f352af20e167 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -19,6 +19,7 @@ enum oe_flush {
 	OE_FLUSH__ROUND,
 	OE_FLUSH__HALF,
 	OE_FLUSH__TOP,
+	OE_FLUSH__TIME,
 };
 
 struct ordered_events;
@@ -55,6 +56,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 			  u64 timestamp, u64 file_offset);
 void ordered_events__delete(struct ordered_events *oe, struct ordered_event *event);
 int ordered_events__flush(struct ordered_events *oe, enum oe_flush how);
+int ordered_events__flush_time(struct ordered_events *oe, u64 timestamp);
 void ordered_events__init(struct ordered_events *oe, ordered_events__deliver_t deliver,
 			  void *data);
 void ordered_events__free(struct ordered_events *oe);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [tip:perf/core] perf trace: Move event delivery to a new deliver_event() function
  2018-12-05 16:05 ` [PATCH 7/8] perf trace: Move event delivery to deliver_event function Jiri Olsa
  2018-12-14 21:01   ` [tip:perf/core] perf trace: Move event delivery to a new deliver_event() function tip-bot for Jiri Olsa
@ 2018-12-18 14:28   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-18 14:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, fweisbec, linux-kernel, tglx, lclaudio, namhyung, peterz,
	ldv, alexander.shishkin, mingo, esyr, jolsa, hpa, rostedt

Commit-ID:  1f44b3e2fc5d7a2c5f6d1e67c80ebc226d6bd25f
Gitweb:     https://git.kernel.org/tip/1f44b3e2fc5d7a2c5f6d1e67c80ebc226d6bd25f
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:08 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 17 Dec 2018 15:02:17 -0300

perf trace: Move event delivery to a new deliver_event() function

Move event delivery code to a new trace__deliver_event() function, so
it's easier to add ordered delivery coming in the following patches.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-8-jolsa@kernel.org
[ Add trace__ prefix to the deliver_event method ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-trace.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 98aff12af9e6..3b6b1fecf2bb 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -2637,6 +2637,21 @@ static int trace__set_filter_pids(struct trace *trace)
 	return err;
 }
 
+static int trace__deliver_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	struct perf_sample sample;
+	int err;
+
+	err = perf_evlist__parse_sample(evlist, event, &sample);
+	if (err)
+		fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
+	else
+		trace__handle_event(trace, event, &sample);
+
+	return 0;
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2802,18 +2817,10 @@ again:
 			continue;
 
 		while ((event = perf_mmap__read_event(md)) != NULL) {
-			struct perf_sample sample;
-
 			++trace->nr_events;
 
-			err = perf_evlist__parse_sample(evlist, event, &sample);
-			if (err) {
-				fprintf(trace->output, "Can't parse sample, err = %d, skipping...\n", err);
-				goto next_event;
-			}
+			trace__deliver_event(trace, event);
 
-			trace__handle_event(trace, event, &sample);
-next_event:
 			perf_mmap__consume(md);
 
 			if (interrupted)


* [tip:perf/core] perf trace: Add ordered processing
  2018-12-05 16:05 ` [PATCH 8/8] perf trace: Add ordered processing for --block option Jiri Olsa
  2018-12-14 21:02   ` [tip:perf/core] perf trace: Add ordered processing tip-bot for Jiri Olsa
@ 2018-12-18 14:29   ` tip-bot for Jiri Olsa
  1 sibling, 0 replies; 45+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-12-18 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, lclaudio, rostedt, namhyung, esyr, mingo, acme,
	fweisbec, tglx, alexander.shishkin, jolsa, ldv, linux-kernel

Commit-ID:  028713aa8389d960cb1935a9954327bdaa163cf8
Gitweb:     https://git.kernel.org/tip/028713aa8389d960cb1935a9954327bdaa163cf8
Author:     Jiri Olsa <jolsa@kernel.org>
AuthorDate: Wed, 5 Dec 2018 17:05:09 +0100
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 17 Dec 2018 15:21:17 -0300

perf trace: Add ordered processing

Sort events so they are delivered in timestamp order, just like is done
in 'perf report' and 'perf top'.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dmitry Levin <ldv@altlinux.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Luis Cláudio Gonçalves <lclaudio@uudg.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181205160509.1168-9-jolsa@kernel.org
[ split from a larger patch, added trace__ prefixes to new 'struct trace' methods ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-trace.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 3b6b1fecf2bb..366ec3c8f580 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -127,6 +127,10 @@ struct trace {
 	bool			force;
 	bool			vfs_getname;
 	int			trace_pgfaults;
+	struct {
+		struct ordered_events	data;
+		u64			last;
+	} oe;
 };
 
 struct tp_field {
@@ -2652,6 +2656,42 @@ static int trace__deliver_event(struct trace *trace, union perf_event *event)
 	return 0;
 }
 
+static int trace__flush_ordered_events(struct trace *trace)
+{
+	u64 first = ordered_events__first_time(&trace->oe.data);
+	u64 flush = trace->oe.last - NSEC_PER_SEC;
+
+	/* Is there something to flush? */
+	if (first && first < flush)
+		return ordered_events__flush_time(&trace->oe.data, flush);
+
+	return 0;
+}
+
+static int trace__deliver_ordered_event(struct trace *trace, union perf_event *event)
+{
+	struct perf_evlist *evlist = trace->evlist;
+	int err;
+
+	err = perf_evlist__parse_sample_timestamp(evlist, event, &trace->oe.last);
+	if (err && err != -1)
+		return err;
+
+	err = ordered_events__queue(&trace->oe.data, event, trace->oe.last, 0);
+	if (err)
+		return err;
+
+	return trace__flush_ordered_events(trace);
+}
+
+static int ordered_events__deliver_event(struct ordered_events *oe,
+					 struct ordered_event *event)
+{
+	struct trace *trace = container_of(oe, struct trace, oe.data);
+
+	return trace__deliver_event(trace, event->event);
+}
+
 static int trace__run(struct trace *trace, int argc, const char **argv)
 {
 	struct perf_evlist *evlist = trace->evlist;
@@ -2819,7 +2859,9 @@ again:
 		while ((event = perf_mmap__read_event(md)) != NULL) {
 			++trace->nr_events;
 
-			trace__deliver_event(trace, event);
+			err = trace__deliver_ordered_event(trace, event);
+			if (err)
+				goto out_disable;
 
 			perf_mmap__consume(md);
 
@@ -2842,6 +2884,9 @@ again:
 				draining = true;
 
 			goto again;
+		} else {
+			if (trace__flush_ordered_events(trace))
+				goto out_disable;
 		}
 	} else {
 		goto again;
@@ -2852,6 +2897,8 @@ out_disable:
 
 	perf_evlist__disable(evlist);
 
+	ordered_events__flush(&trace->oe.data, OE_FLUSH__FINAL);
+
 	if (!err) {
 		if (trace->summary)
 			trace__fprintf_thread_summary(trace, trace->output);
@@ -3562,6 +3609,9 @@ int cmd_trace(int argc, const char **argv)
 		}
 	}
 
+	ordered_events__init(&trace.oe.data, ordered_events__deliver_event, &trace);
+	ordered_events__set_copy_on_queue(&trace.oe.data, true);
+
 	/*
 	 * If we are augmenting syscalls, then combine what we put in the
 	 * __augmented_syscalls__ BPF map with what is in the


Thread overview: 45+ messages
2018-12-05 16:05 [RFC 1/8] perf: Block perf calls for system call tracepoints Jiri Olsa
2018-12-05 16:05 ` [PATCH 1/8] perf: Allow to block process in syscall tracepoints Jiri Olsa
2018-12-05 17:35   ` Steven Rostedt
2018-12-05 17:56     ` Jiri Olsa
2018-12-06  8:09   ` Peter Zijlstra
2018-12-06 10:30     ` Jiri Olsa
2018-12-06  8:10   ` Peter Zijlstra
2018-12-06  8:24     ` Jiri Olsa
2018-12-06 10:31       ` Peter Zijlstra
2018-12-06  8:34     ` Peter Zijlstra
2018-12-06 10:31       ` Jiri Olsa
2018-12-06 18:19       ` Steven Rostedt
2018-12-07  8:44         ` Jiri Olsa
2018-12-07  8:58         ` Peter Zijlstra
2018-12-07 13:41           ` Steven Rostedt
2018-12-07 15:11             ` Peter Zijlstra
2018-12-07 15:49               ` Arnaldo Carvalho de Melo
2018-12-08 10:41                 ` Peter Zijlstra
2018-12-08 17:34                   ` Steven Rostedt
2018-12-07 20:14               ` Steven Rostedt
2018-12-08 10:44                 ` Peter Zijlstra
2018-12-08 17:38                   ` Steven Rostedt
2018-12-10 10:18                     ` Peter Zijlstra
2018-12-13  0:39                       ` Dmitry V. Levin
2018-12-13  1:26                         ` Steven Rostedt
2018-12-13  1:49                           ` Dmitry V. Levin
2018-12-13 10:01                           ` Peter Zijlstra
2018-12-13 10:05                             ` Peter Zijlstra
2018-12-13 10:08                             ` Peter Zijlstra
2018-12-13 11:29                             ` Jiri Olsa
2018-12-06  8:17   ` Peter Zijlstra
2018-12-06 10:27     ` Jiri Olsa
2018-12-05 16:05 ` [PATCH 2/8] perf tools: Sync uapi perf_event.h Jiri Olsa
2018-12-05 16:05 ` [PATCH 3/8] perf record: Add --block option Jiri Olsa
2018-12-05 16:05 ` [PATCH 4/8] perf trace: " Jiri Olsa
2018-12-05 16:05 ` [PATCH 5/8] perf tools: Add block term support for tracepoints Jiri Olsa
2018-12-05 16:05 ` [PATCH 6/8] perf tools: Add ordered_events__flush_time interface Jiri Olsa
2018-12-14 21:00   ` [tip:perf/core] perf ordered_events: " tip-bot for Jiri Olsa
2018-12-18 14:27   ` tip-bot for Jiri Olsa
2018-12-05 16:05 ` [PATCH 7/8] perf trace: Move event delivery to deliver_event function Jiri Olsa
2018-12-14 21:01   ` [tip:perf/core] perf trace: Move event delivery to a new deliver_event() function tip-bot for Jiri Olsa
2018-12-18 14:28   ` tip-bot for Jiri Olsa
2018-12-05 16:05 ` [PATCH 8/8] perf trace: Add ordered processing for --block option Jiri Olsa
2018-12-14 21:02   ` [tip:perf/core] perf trace: Add ordered processing tip-bot for Jiri Olsa
2018-12-18 14:29   ` tip-bot for Jiri Olsa
