linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/14] export perf overheads information
@ 2016-11-23  9:44 kan.liang
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
                   ` (14 more replies)
  0 siblings, 15 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Profiling brings additional overhead. High overhead may change the
behavior of the profiled workload, reduce the accuracy of the profiling
result, and even hang the system.
Currently, perf has a dynamic interrupt throttling mechanism to lower
the sample rate and the overhead. But it has limitations:
 - The mechanism only addresses the overhead from the NMI handler.
   However, other parts also bring significant overhead, e.g.
   multiplexing.
 - The hint from the mechanism doesn't work with a fixed period.
 - The changes the mechanism makes to the system are not recorded in
   perf.data. Users have no idea about the overhead and its impact.
Actually, any passive approach like dynamic interrupt throttling is
only palliative. The best way is to export the overhead information,
provide more hints, and help users design a more suitable perf command.

According to our tests, there are four parts which can bring significant
overhead: the NMI handler, the multiplexing handler, iterating side-band
events, and writing data to file. Two new perf record types,
PERF_RECORD_OVERHEAD and PERF_RECORD_USER_OVERHEAD, are introduced to
record the overhead information in kernel and user space respectively.
The overhead information is the per-CPU system overhead, not per-event
overhead. The implementation takes advantage of the existing event log
mechanism.
To limit the cost of logging the overhead information itself, the
overhead information is only output when an event is about to be
disabled or a task is scheduled out.

In perf report, the overhead is checked automatically. If the overhead
rate is larger than 10%, a warning is displayed.
A new option is also introduced to display detailed per-CPU overhead
information.

The current implementation only covers these four overhead sources.
There could be more in other parts. A new overhead source can easily be
added as a new type.

Kan Liang (14):
  perf/x86: Introduce PERF_RECORD_OVERHEAD
  perf/x86: output NMI overhead
  perf/x86: output multiplexing overhead
  perf/x86: output side-band events overhead
  perf tools: handle PERF_RECORD_OVERHEAD record type
  perf tools: show NMI overhead
  perf tools: show multiplexing overhead
  perf tools: show side-band events overhead
  perf tools: make get_nsecs visible for buildin files
  perf tools: introduce PERF_RECORD_USER_OVERHEAD
  perf tools: record write data overhead
  perf tools: record elapsed time
  perf tools: warn on high overhead
  perf script: show overhead events

 arch/x86/events/core.c                   |  19 ++++-
 arch/x86/events/perf_event.h             |   2 +
 include/linux/perf_event.h               |   9 ++
 include/uapi/linux/perf_event.h          |  33 ++++++++
 kernel/events/core.c                     | 139 +++++++++++++++++++++++++------
 tools/include/uapi/linux/perf_event.h    |  33 ++++++++
 tools/perf/Documentation/perf-report.txt |   3 +
 tools/perf/builtin-record.c              |  71 +++++++++++++++-
 tools/perf/builtin-report.c              |  25 ++++++
 tools/perf/builtin-sched.c               |   2 +-
 tools/perf/builtin-script.c              |  36 ++++++++
 tools/perf/builtin.h                     |   1 +
 tools/perf/util/event.c                  |  47 +++++++++++
 tools/perf/util/event.h                  |  37 ++++++++
 tools/perf/util/machine.c                |  35 ++++++++
 tools/perf/util/machine.h                |   2 +
 tools/perf/util/session.c                |  68 +++++++++++++++
 tools/perf/util/symbol.h                 |   3 +-
 tools/perf/util/tool.h                   |   1 +
 19 files changed, 535 insertions(+), 31 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 58+ messages in thread
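The 10% check described in the cover letter can be sketched as a small
user-space helper. This is a minimal sketch, assuming only the
perf_overhead_entry layout from patch 01; overhead_too_high() is a
hypothetical name, not part of the series:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the uapi layout proposed in patch 01. */
struct perf_overhead_entry {
	uint32_t cpu;
	uint64_t nr;
	uint64_t time;	/* total overhead cost, ns */
};

/*
 * Hypothetical helper: sum the per-CPU overhead times and compare them
 * against the elapsed wall time, as perf report is described to do.
 * Returns 1 if the overhead rate exceeds 10%.
 */
static int overhead_too_high(const struct perf_overhead_entry *entries,
			     int nr_entries, uint64_t elapsed_ns)
{
	uint64_t total = 0;
	int i;

	for (i = 0; i < nr_entries; i++)
		total += entries[i].time;

	return total * 10 > elapsed_ns;	/* i.e. > 10% of elapsed time */
}
```

The integer comparison avoids floating point, matching the kernel-style
code in the rest of the series.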

* [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 20:11   ` Peter Zijlstra
                     ` (2 more replies)
  2016-11-23  9:44 ` [PATCH 02/14] perf/x86: output NMI overhead kan.liang
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

A new perf record type is introduced to export perf overhead
information to user space, so the overhead of sampling can be measured
directly.

To output the overhead information, it takes advantage of the existing
event log mechanism. Note that the overhead information is the
system-wide overhead, not per-event overhead.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 include/linux/perf_event.h      |  4 ++++
 include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
 kernel/events/core.c            | 43 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..3a530a7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -998,6 +998,10 @@ perf_event__output_id_sample(struct perf_event *event,
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
 
+extern void
+perf_log_overhead(struct perf_event *event, u32 type,
+		  struct perf_overhead_entry *entry);
+
 static inline bool is_sampling_event(struct perf_event *event)
 {
 	return event->attr.sample_period != 0;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485..ae571be 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -862,6 +862,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_SWITCH_CPU_WIDE		= 15,
 
+	/*
+	 * Records perf overhead
+	 * struct {
+	 * 	struct perf_event_header 	header;
+	 * 	u32				type;
+	 * 	struct perf_overhead_entry	entry;
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_OVERHEAD			= 16,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -980,4 +991,22 @@ struct perf_branch_entry {
 		reserved:44;
 };
 
+enum perf_record_overhead_type {
+	PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * 	 cpu: the CPU on which the overhead occurs
+ * 	  nr: number of times the overhead happened,
+ * 	      e.g. for NMI, nr == number of times the NMI handler was called
+ * 	time: total overhead cost (ns)
+ */
+struct perf_overhead_entry {
+	__u32	cpu;
+	__u64	nr;
+	__u64	time;
+};
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ff230bb..590c399 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7037,6 +7037,49 @@ static void perf_log_itrace_start(struct perf_event *event)
 	perf_output_end(&handle);
 }
 
+
+/*
+ * Record overhead logging
+ *
+ * The overhead logged here is the system overhead, not per-event overhead.
+ * This function only takes advantage of the existing event log mechanism
+ * to log the overhead record.
+ *
+ */
+void perf_log_overhead(struct perf_event *event, u32 type,
+		       struct perf_overhead_entry *entry)
+{
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int ret;
+
+	struct {
+		struct perf_event_header	header;
+		u32				type;
+		struct perf_overhead_entry	overhead;
+	} overhead_event = {
+		.header = {
+			.type = PERF_RECORD_OVERHEAD,
+			.misc = 0,
+			.size = sizeof(overhead_event),
+		},
+	};
+	overhead_event.type = type;
+	memcpy(&overhead_event.overhead, entry, sizeof(overhead_event.overhead));
+
+	perf_event_header__init_id(&overhead_event.header, &sample, event);
+	ret = perf_output_begin(&handle, event, overhead_event.header.size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, overhead_event);
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+	memset(entry, 0, sizeof(*entry));
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread
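From user space, a record of the layout documented in the uapi comment
above could be decoded roughly as follows. The type value 16 and the
field order come from the patch; the decoder itself, and the omission of
the trailing sample_id, are illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PERF_RECORD_OVERHEAD	16	/* from the patched uapi header */

struct perf_event_header {
	uint32_t type;
	uint16_t misc;
	uint16_t size;
};

struct perf_overhead_entry {
	uint32_t cpu;
	uint64_t nr;
	uint64_t time;
};

/* Layout documented in the uapi comment of this patch (sample_id omitted). */
struct perf_overhead_record {
	struct perf_event_header	header;
	uint32_t			type;	/* enum perf_record_overhead_type */
	struct perf_overhead_entry	entry;
};

/* Hypothetical decoder: copy one overhead record out of a raw byte stream. */
static int decode_overhead(const void *buf, struct perf_overhead_record *out)
{
	memcpy(out, buf, sizeof(*out));
	return out->header.type == PERF_RECORD_OVERHEAD ? 0 : -1;
}
```

A real consumer would read the record out of the perf mmap ring buffer
and honor header.size rather than sizeof().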

* [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 20:06   ` Peter Zijlstra
  2016-11-24 16:19   ` Mark Rutland
  2016-11-23  9:44 ` [PATCH 03/14] perf/x86: output multiplexing overhead kan.liang
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

The NMI handler is one of the most important sources of overhead.

Many NMIs fire during sampling, and it would be very expensive to log
each one. So the accumulated time and NMI count are output only when an
event is about to be disabled or a task is scheduled out.
The newly introduced flag PERF_EF_LOG indicates that the overhead log
should be output.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/events/core.c          | 19 ++++++++++++++-
 arch/x86/events/perf_event.h    |  2 ++
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h |  2 ++
 kernel/events/core.c            | 54 ++++++++++++++++++++++-------------------
 5 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d31735f..6c3b0ef 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1397,6 +1397,11 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 
 	perf_event_update_userpage(event);
 
+	if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
+		cpuc->nmi_overhead.cpu = smp_processor_id();
+		perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
+	}
+
 do_del:
 	if (x86_pmu.del) {
 		/*
@@ -1475,11 +1480,21 @@ void perf_events_lapic_init(void)
 	apic_write(APIC_LVTPC, APIC_DM_NMI);
 }
 
+static void
+perf_caculate_nmi_overhead(u64 time)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->nmi_overhead.nr++;
+	cpuc->nmi_overhead.time += time;
+}
+
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
 	u64 start_clock;
 	u64 finish_clock;
+	u64 clock;
 	int ret;
 
 	/*
@@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	start_clock = sched_clock();
 	ret = x86_pmu.handle_irq(regs);
 	finish_clock = sched_clock();
+	clock = finish_clock - start_clock;
 
-	perf_sample_event_took(finish_clock - start_clock);
+	perf_caculate_nmi_overhead(clock);
+	perf_sample_event_took(clock);
 
 	return ret;
 }
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5874d8d..c6b25ac 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -230,6 +230,8 @@ struct cpu_hw_events {
 	struct intel_excl_cntrs		*excl_cntrs;
 	int excl_thread_id; /* 0 or 1 */
 
+	struct perf_overhead_entry     nmi_overhead;
+
 	/*
 	 * AMD specific bits
 	 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3a530a7..632647f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -313,6 +313,7 @@ struct pmu {
 #define PERF_EF_START	0x01		/* start the counter when adding    */
 #define PERF_EF_RELOAD	0x02		/* reload the counter when starting */
 #define PERF_EF_UPDATE	0x04		/* update the counter when stopping */
+#define PERF_EF_LOG	0x08
 
 	/*
 	 * Adds/Removes a counter to/from the PMU, can be done inside a
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ae571be..071323d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -992,6 +992,8 @@ struct perf_branch_entry {
 };
 
 enum perf_record_overhead_type {
+	PERF_NMI_OVERHEAD	= 0,
+
 	PERF_OVERHEAD_MAX,
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 590c399..d82e6ca 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1785,7 +1785,8 @@ event_filter_match(struct perf_event *event)
 static void
 event_sched_out(struct perf_event *event,
 		  struct perf_cpu_context *cpuctx,
-		  struct perf_event_context *ctx)
+		  struct perf_event_context *ctx,
+		  bool log_overhead)
 {
 	u64 tstamp = perf_event_time(event);
 	u64 delta;
@@ -1810,9 +1811,8 @@ event_sched_out(struct perf_event *event,
 		return;
 
 	perf_pmu_disable(event->pmu);
-
 	event->tstamp_stopped = tstamp;
-	event->pmu->del(event, 0);
+	event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
 	event->oncpu = -1;
 	event->state = PERF_EVENT_STATE_INACTIVE;
 	if (event->pending_disable) {
@@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
 static void
 group_sched_out(struct perf_event *group_event,
 		struct perf_cpu_context *cpuctx,
-		struct perf_event_context *ctx)
+		struct perf_event_context *ctx,
+		bool log_overhead)
 {
 	struct perf_event *event;
 	int state = group_event->state;
 
 	perf_pmu_disable(ctx->pmu);
 
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, cpuctx, ctx, log_overhead);
 
 	/*
 	 * Schedule out siblings (if any):
 	 */
 	list_for_each_entry(event, &group_event->sibling_list, group_entry)
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, cpuctx, ctx, log_overhead);
 
 	perf_pmu_enable(ctx->pmu);
 
@@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
 {
 	unsigned long flags = (unsigned long)info;
 
-	event_sched_out(event, cpuctx, ctx);
+	event_sched_out(event, cpuctx, ctx, false);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	list_del_event(event, ctx);
@@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
 	update_cgrp_time_from_event(event);
 	update_group_times(event);
 	if (event == event->group_leader)
-		group_sched_out(event, cpuctx, ctx);
+		group_sched_out(event, cpuctx, ctx, true);
 	else
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, cpuctx, ctx, true);
 	event->state = PERF_EVENT_STATE_OFF;
 }
 
@@ -2141,10 +2142,10 @@ group_sched_in(struct perf_event *group_event,
 			event->tstamp_running += now - event->tstamp_stopped;
 			event->tstamp_stopped = now;
 		} else {
-			event_sched_out(event, cpuctx, ctx);
+			event_sched_out(event, cpuctx, ctx, false);
 		}
 	}
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, cpuctx, ctx, false);
 
 	pmu->cancel_txn(pmu);
 
@@ -2198,7 +2199,8 @@ static void add_event_to_ctx(struct perf_event *event,
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
+			  enum event_type_t event_type,
+			  bool log_overhead);
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -2206,7 +2208,8 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			       struct perf_event_context *ctx)
+			       struct perf_event_context *ctx,
+			       bool log_overhead)
 {
 	if (!cpuctx->task_ctx)
 		return;
@@ -2214,7 +2217,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
+	ctx_sched_out(ctx, cpuctx, EVENT_ALL, log_overhead);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2234,7 +2237,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 {
 	perf_pmu_disable(cpuctx->ctx.pmu);
 	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx);
+		task_ctx_sched_out(cpuctx, task_ctx, false);
 	cpu_ctx_sched_out(cpuctx, EVENT_ALL);
 	perf_event_sched_in(cpuctx, task_ctx, current);
 	perf_pmu_enable(cpuctx->ctx.pmu);
@@ -2281,7 +2284,7 @@ static int  __perf_install_in_context(void *info)
 	}
 
 	if (activate) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx);
 	} else {
@@ -2396,7 +2399,7 @@ static void __perf_event_enable(struct perf_event *event,
 		return;
 
 	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
 
 	__perf_event_mark_enabled(event);
 
@@ -2611,7 +2614,8 @@ EXPORT_SYMBOL_GPL(perf_event_refresh);
 
 static void ctx_sched_out(struct perf_event_context *ctx,
 			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+			  enum event_type_t event_type,
+			  bool log_overhead)
 {
 	int is_active = ctx->is_active;
 	struct perf_event *event;
@@ -2662,12 +2666,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 	perf_pmu_disable(ctx->pmu);
 	if (is_active & EVENT_PINNED) {
 		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+			group_sched_out(event, cpuctx, ctx, log_overhead);
 	}
 
 	if (is_active & EVENT_FLEXIBLE) {
 		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
-			group_sched_out(event, cpuctx, ctx);
+			group_sched_out(event, cpuctx, ctx, log_overhead);
 	}
 	perf_pmu_enable(ctx->pmu);
 }
@@ -2850,7 +2854,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		task_ctx_sched_out(cpuctx, ctx);
+		task_ctx_sched_out(cpuctx, ctx, false);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -2958,7 +2962,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 			      enum event_type_t event_type)
 {
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, false);
 }
 
 static void
@@ -3350,7 +3354,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 
 	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE, false);
 
 	rotate_ctx(&cpuctx->ctx);
 	if (ctx)
@@ -3415,7 +3419,7 @@ static void perf_event_enable_on_exec(int ctxn)
 
 	cpuctx = __get_cpu_context(ctx);
 	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, cpuctx, EVENT_TIME, false);
 	list_for_each_entry(event, &ctx->event_list, event_entry)
 		enabled |= event_enable_on_exec(event, ctx);
 
@@ -10190,7 +10194,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * in.
 	 */
 	raw_spin_lock_irq(&child_ctx->lock);
-	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);
+	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, true);
 
 	/*
 	 * Now that the context is inactive, destroy the task <-> ctx relation
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread
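The NMI-overhead logic in this patch follows an accumulate-then-flush
pattern: accounting per NMI is cheap, and the one-time log happens on
disable/sched-out. A user-space sketch of the same pattern, with
illustrative names:

```c
#include <assert.h>
#include <stdint.h>

struct overhead_acc {
	uint64_t nr;	/* how many times the handler ran */
	uint64_t time;	/* accumulated cost, ns */
};

/* Per invocation: cheap, just accumulate (cf. the NMI accounting above). */
static void overhead_account(struct overhead_acc *acc, uint64_t delta_ns)
{
	acc->nr++;
	acc->time += delta_ns;
}

/*
 * On flush (event disable / task sched-out in the patch): report once,
 * then reset, mirroring the memset() at the end of perf_log_overhead().
 * Returns the number of entries logged (0 or 1).
 */
static int overhead_flush(struct overhead_acc *acc, struct overhead_acc *log)
{
	if (!acc->nr)
		return 0;	/* nothing accumulated, skip the log */
	*log = *acc;
	acc->nr = acc->time = 0;
	return 1;
}
```

The nr check mirrors the cpuc->nmi_overhead.nr test in x86_pmu_del(),
which avoids emitting empty records.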

* [PATCH 03/14] perf/x86: output multiplexing overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
  2016-11-23  9:44 ` [PATCH 02/14] perf/x86: output NMI overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 20:06   ` Peter Zijlstra
  2016-11-23  9:44 ` [PATCH 04/14] perf/x86: output side-band events overhead kan.liang
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Multiplexing overhead is one of the key overheads when the number of
events exceeds the number of available counters.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/core.c            | 16 ++++++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 632647f..f72b97a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -793,6 +793,8 @@ struct perf_cpu_context {
 
 	struct list_head		sched_cb_entry;
 	int				sched_cb_usage;
+
+	struct perf_overhead_entry	mux_overhead;
 };
 
 struct perf_output_handle {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 071323d..9124c7c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -993,6 +993,7 @@ struct perf_branch_entry {
 
 enum perf_record_overhead_type {
 	PERF_NMI_OVERHEAD	= 0,
+	PERF_MUX_OVERHEAD,
 
 	PERF_OVERHEAD_MAX,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d82e6ca..9934059 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1829,6 +1829,11 @@ event_sched_out(struct perf_event *event,
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
+	if (log_overhead && cpuctx->mux_overhead.nr) {
+		cpuctx->mux_overhead.cpu = smp_processor_id();
+		perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+	}
+
 	perf_pmu_enable(event->pmu);
 }
 
@@ -3330,9 +3335,17 @@ static void rotate_ctx(struct perf_event_context *ctx)
 		list_rotate_left(&ctx->flexible_groups);
 }
 
+static void
+perf_caculate_mux_overhead(struct perf_cpu_context *cpuctx, u64 time)
+{
+	cpuctx->mux_overhead.nr++;
+	cpuctx->mux_overhead.time += time;
+}
+
 static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 {
 	struct perf_event_context *ctx = NULL;
+	u64 start_clock, end_clock;
 	int rotate = 0;
 
 	if (cpuctx->ctx.nr_events) {
@@ -3349,6 +3362,7 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 	if (!rotate)
 		goto done;
 
+	start_clock = perf_clock();
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 
@@ -3364,6 +3378,8 @@ static int perf_rotate_context(struct perf_cpu_context *cpuctx)
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+	end_clock = perf_clock();
+	perf_caculate_mux_overhead(cpuctx, end_clock - start_clock);
 done:
 
 	return rotate;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread
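The rotation path is instrumented by reading the clock before and after
the critical section and accounting the delta. A minimal user-space
analogue, with the clock and the rotation work injected so the sketch is
testable (the kernel uses perf_clock() and the real rotation code):

```c
#include <assert.h>
#include <stdint.h>

struct mux_overhead {
	uint64_t nr;
	uint64_t time;
};

/*
 * Analogue of the perf_rotate_context() instrumentation in this patch:
 * read the clock around the rotation work and accumulate the delta.
 */
static void timed_rotate(struct mux_overhead *oh,
			 uint64_t (*clock_fn)(void),
			 void (*rotate_work)(void))
{
	uint64_t start = clock_fn();

	rotate_work();			/* the multiplexing rotation itself */

	oh->nr++;
	oh->time += clock_fn() - start;
}

/* Fake clock for demonstration: advances 5 ns per read. */
static uint64_t fake_now;
static uint64_t fake_clock(void) { fake_now += 5; return fake_now; }
static void noop_rotate(void) { }
```

As in the patch, the cost of the clock reads themselves is part of what
gets accounted, which keeps the instrumentation honest about its own
overhead.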

* [PATCH 04/14] perf/x86: output side-band events overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (2 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 03/14] perf/x86: output multiplexing overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 20:06   ` Peter Zijlstra
  2016-11-24 16:21   ` Mark Rutland
  2016-11-23  9:44 ` [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type kan.liang
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Iterating over all events which need to receive side-band events also
brings some overhead.
Save the overhead information in the task context or the CPU context,
whichever is available.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/core.c            | 32 ++++++++++++++++++++++++++++----
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f72b97a..ec3cb7f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -764,6 +764,8 @@ struct perf_event_context {
 #endif
 	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
+
+	struct perf_overhead_entry	sb_overhead;
 };
 
 /*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9124c7c..5e7c522 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -994,6 +994,7 @@ struct perf_branch_entry {
 enum perf_record_overhead_type {
 	PERF_NMI_OVERHEAD	= 0,
 	PERF_MUX_OVERHEAD,
+	PERF_SB_OVERHEAD,
 
 	PERF_OVERHEAD_MAX,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9934059..51e9df7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 
-	if (log_overhead && cpuctx->mux_overhead.nr) {
-		cpuctx->mux_overhead.cpu = smp_processor_id();
-		perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+	if (log_overhead) {
+		if (cpuctx->mux_overhead.nr) {
+			cpuctx->mux_overhead.cpu = smp_processor_id();
+			perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
+		}
+		if (ctx->sb_overhead.nr) {
+			ctx->sb_overhead.cpu = smp_processor_id();
+			perf_log_overhead(event, PERF_SB_OVERHEAD, &ctx->sb_overhead);
+		}
 	}
 
 	perf_pmu_enable(event->pmu);
@@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f output, void *data)
 	}
 }
 
+static void
+perf_caculate_sb_overhead(struct perf_event_context *ctx,
+			  u64 time)
+{
+	ctx->sb_overhead.nr++;
+	ctx->sb_overhead.time += time;
+}
+
 /*
  * Iterate all events that need to receive side-band events.
  *
@@ -6143,9 +6157,12 @@ static void
 perf_iterate_sb(perf_iterate_f output, void *data,
 	       struct perf_event_context *task_ctx)
 {
+	struct perf_event_context *overhead_ctx = task_ctx;
 	struct perf_event_context *ctx;
+	u64 start_clock, end_clock;
 	int ctxn;
 
+	start_clock = perf_clock();
 	rcu_read_lock();
 	preempt_disable();
 
@@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void *data,
 
 	for_each_task_context_nr(ctxn) {
 		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (ctx)
+		if (ctx) {
 			perf_iterate_ctx(ctx, output, data, false);
+			if (!overhead_ctx)
+				overhead_ctx = ctx;
+		}
 	}
 done:
 	preempt_enable();
 	rcu_read_unlock();
+
+	end_clock = perf_clock();
+	if (overhead_ctx)
+		perf_caculate_sb_overhead(overhead_ctx, end_clock - start_clock);
 }
 
 /*
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread
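The distinctive piece of this patch is the context selection: prefer the
caller-supplied task context, otherwise fall back to the first task
context found while iterating. A sketch of that fallback, with
illustrative names and a flat array standing in for the per-task context
list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ctx {
	uint64_t sb_nr;
	uint64_t sb_time;
};

/*
 * Analogue of the overhead_ctx selection in perf_iterate_sb() above:
 * prefer the caller-supplied task context; otherwise the first non-NULL
 * context found wins. If no context exists, the overhead cannot be
 * accounted anywhere and is dropped.
 */
static struct ctx *pick_overhead_ctx(struct ctx *task_ctx,
				     struct ctx **found, int nr_found)
{
	int i;

	if (task_ctx)
		return task_ctx;
	for (i = 0; i < nr_found; i++) {
		if (found[i])
			return found[i];
	}
	return NULL;
}
```

This mirrors the overhead_ctx variable in the patch, which starts as
task_ctx and is only updated when it is still NULL.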

* [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (3 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 04/14] perf/x86: output side-band events overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 22:35   ` Jiri Olsa
  2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Add the infrastructure to handle the PERF_RECORD_OVERHEAD record type.
A new perf report option is also introduced as a knob to show the
overhead information.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/include/uapi/linux/perf_event.h    | 33 ++++++++++++++++++++++++++++++++
 tools/perf/Documentation/perf-report.txt |  3 +++
 tools/perf/builtin-report.c              |  6 ++++++
 tools/perf/util/event.c                  |  9 +++++++++
 tools/perf/util/event.h                  | 11 +++++++++++
 tools/perf/util/machine.c                |  8 ++++++++
 tools/perf/util/machine.h                |  2 ++
 tools/perf/util/session.c                |  5 +++++
 tools/perf/util/symbol.h                 |  3 ++-
 tools/perf/util/tool.h                   |  1 +
 10 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index c66a485..5e7c522 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -862,6 +862,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_SWITCH_CPU_WIDE		= 15,
 
+	/*
+	 * Records perf overhead
+	 * struct {
+	 * 	struct perf_event_header 	header;
+	 * 	u32				type;
+	 * 	struct perf_overhead_entry	entry;
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_OVERHEAD			= 16,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -980,4 +991,26 @@ struct perf_branch_entry {
 		reserved:44;
 };
 
+enum perf_record_overhead_type {
+	PERF_NMI_OVERHEAD	= 0,
+	PERF_MUX_OVERHEAD,
+	PERF_SB_OVERHEAD,
+
+	PERF_OVERHEAD_MAX,
+};
+
+/*
+ * single overhead record layout:
+ *
+ * 	 cpu: the CPU on which the overhead occurs
+ * 	  nr: number of times the overhead happened,
+ * 	      e.g. for NMI, nr == number of times the NMI handler was called
+ * 	time: total overhead cost (ns)
+ */
+struct perf_overhead_entry {
+	__u32	cpu;
+	__u64	nr;
+	__u64	time;
+};
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 2d17462..fea8bea 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -412,6 +412,9 @@ include::itrace.txt[]
 --hierarchy::
 	Enable hierarchical output.
 
+--show-overhead::
+	Show the extra overhead which perf introduces during monitoring
+
 include::callchain-overhead-calculation.txt[]
 
 SEE ALSO
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 3dfbfff..1416c39 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -368,6 +368,10 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 	struct perf_evsel *pos;
 
 	fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
+	if (symbol_conf.show_overhead) {
+		fprintf(stdout, "# Overhead:\n");
+		fprintf(stdout, "#\n");
+	}
 	evlist__for_each_entry(evlist, pos) {
 		struct hists *hists = evsel__hists(pos);
 		const char *evname = perf_evsel__name(pos);
@@ -830,6 +834,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT(0, "stdio-color", NULL, "mode",
 			     "'always' (default), 'never' or 'auto' only applicable to --stdio mode",
 			     stdio__config_color, "always"),
+	OPT_BOOLEAN(0, "show-overhead", &symbol_conf.show_overhead,
+		    "Show perf overhead"),
 	OPT_END()
 	};
 	struct perf_data_file file = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 8ab0d7d..ca98c4c 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -31,6 +31,7 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_LOST_SAMPLES]		= "LOST_SAMPLES",
 	[PERF_RECORD_SWITCH]			= "SWITCH",
 	[PERF_RECORD_SWITCH_CPU_WIDE]		= "SWITCH_CPU_WIDE",
+	[PERF_RECORD_OVERHEAD]			= "OVERHEAD",
 	[PERF_RECORD_HEADER_ATTR]		= "ATTR",
 	[PERF_RECORD_HEADER_EVENT_TYPE]		= "EVENT_TYPE",
 	[PERF_RECORD_HEADER_TRACING_DATA]	= "TRACING_DATA",
@@ -1056,6 +1057,14 @@ int perf_event__process_switch(struct perf_tool *tool __maybe_unused,
 	return machine__process_switch_event(machine, event);
 }
 
+int perf_event__process_overhead(struct perf_tool *tool __maybe_unused,
+				 union perf_event *event,
+				 struct perf_sample *sample __maybe_unused,
+				 struct machine *machine)
+{
+	return machine__process_overhead_event(machine, event);
+}
+
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp)
 {
 	return fprintf(fp, " %d/%d: [%#" PRIx64 "(%#" PRIx64 ") @ %#" PRIx64 "]: %c %s\n",
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index c735c53..d1b179b 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -480,6 +480,12 @@ struct time_conv_event {
 	u64 time_zero;
 };
 
+struct perf_overhead {
+	struct perf_event_header	header;
+	u32				type;
+	struct perf_overhead_entry	entry;
+};
+
 union perf_event {
 	struct perf_event_header	header;
 	struct mmap_event		mmap;
@@ -509,6 +515,7 @@ union perf_event {
 	struct stat_event		stat;
 	struct stat_round_event		stat_round;
 	struct time_conv_event		time_conv;
+	struct perf_overhead		overhead;
 };
 
 void perf_event__print_totals(void);
@@ -587,6 +594,10 @@ int perf_event__process_switch(struct perf_tool *tool,
 			       union perf_event *event,
 			       struct perf_sample *sample,
 			       struct machine *machine);
+int perf_event__process_overhead(struct perf_tool *tool,
+				 union perf_event *event,
+				 struct perf_sample *sample,
+				 struct machine *machine);
 int perf_event__process_mmap(struct perf_tool *tool,
 			     union perf_event *event,
 			     struct perf_sample *sample,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 9b33bef..1101757 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -555,6 +555,12 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
 	return 0;
 }
 
+int machine__process_overhead_event(struct machine *machine __maybe_unused,
+				    union perf_event *event __maybe_unused)
+{
+	return 0;
+}
+
 static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
 {
 	const char *dup_filename;
@@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
 	case PERF_RECORD_SWITCH:
 	case PERF_RECORD_SWITCH_CPU_WIDE:
 		ret = machine__process_switch_event(machine, event); break;
+	case PERF_RECORD_OVERHEAD:
+		ret = machine__process_overhead_event(machine, event); break;
 	default:
 		ret = -1;
 		break;
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 354de6e..ec2dd4d 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -97,6 +97,8 @@ int machine__process_itrace_start_event(struct machine *machine,
 					union perf_event *event);
 int machine__process_switch_event(struct machine *machine,
 				  union perf_event *event);
+int machine__process_overhead_event(struct machine *machine,
+				    union perf_event *event);
 int machine__process_mmap_event(struct machine *machine, union perf_event *event,
 				struct perf_sample *sample);
 int machine__process_mmap2_event(struct machine *machine, union perf_event *event,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f268201..bc0bc21 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -373,6 +373,8 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
 		tool->itrace_start = perf_event__process_itrace_start;
 	if (tool->context_switch == NULL)
 		tool->context_switch = perf_event__process_switch;
+	if (tool->overhead == NULL)
+		tool->overhead = perf_event__process_overhead;
 	if (tool->read == NULL)
 		tool->read = process_event_sample_stub;
 	if (tool->throttle == NULL)
@@ -786,6 +788,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_LOST_SAMPLES]	  = perf_event__all64_swap,
 	[PERF_RECORD_SWITCH]		  = perf_event__switch_swap,
 	[PERF_RECORD_SWITCH_CPU_WIDE]	  = perf_event__switch_swap,
+	[PERF_RECORD_OVERHEAD]		  = perf_event__all64_swap,
 	[PERF_RECORD_HEADER_ATTR]	  = perf_event__hdr_attr_swap,
 	[PERF_RECORD_HEADER_EVENT_TYPE]	  = perf_event__event_type_swap,
 	[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1267,6 +1270,8 @@ static int machines__deliver_event(struct machines *machines,
 	case PERF_RECORD_SWITCH:
 	case PERF_RECORD_SWITCH_CPU_WIDE:
 		return tool->context_switch(tool, event, sample, machine);
+	case PERF_RECORD_OVERHEAD:
+		return tool->overhead(tool, event, sample, machine);
 	default:
 		++evlist->stats.nr_unknown_events;
 		return -1;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 2d0a905..2d96bdb 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -117,7 +117,8 @@ struct symbol_conf {
 			show_ref_callgraph,
 			hide_unresolved,
 			raw_trace,
-			report_hierarchy;
+			report_hierarchy,
+			show_overhead;
 	const char	*vmlinux_name,
 			*kallsyms_name,
 			*source_prefix,
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index ac2590a..c5bbb34 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -47,6 +47,7 @@ struct perf_tool {
 			aux,
 			itrace_start,
 			context_switch,
+			overhead,
 			throttle,
 			unthrottle;
 	event_attr_op	attr;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 06/14] perf tools: show NMI overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (4 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 22:51   ` Jiri Olsa
                     ` (2 more replies)
  2016-11-23  9:44 ` [PATCH 07/14] perf tools: show multiplexing overhead kan.liang
                   ` (8 subsequent siblings)
  14 siblings, 3 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Calculate the total NMI overhead on each CPU, and display it in perf
report.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-report.c | 11 +++++++++++
 tools/perf/util/event.h     |  4 ++++
 tools/perf/util/machine.c   |  9 +++++++++
 tools/perf/util/session.c   | 18 ++++++++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 1416c39..b1437586 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -365,11 +365,22 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 					 struct report *rep,
 					 const char *help)
 {
+	struct perf_session *session = rep->session;
 	struct perf_evsel *pos;
+	int cpu;
 
 	fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
 	if (symbol_conf.show_overhead) {
 		fprintf(stdout, "# Overhead:\n");
+		for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
+			if (!evlist->stats.total_nmi_overhead[cpu][0])
+				continue;
+			if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
+				continue;
+			fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
+				cpu, evlist->stats.total_nmi_overhead[cpu][0],
+				evlist->stats.total_nmi_overhead[cpu][1]);
+		}
 		fprintf(stdout, "#\n");
 	}
 	evlist__for_each_entry(evlist, pos) {
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d1b179b..7d40d54 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -262,6 +262,9 @@ enum auxtrace_error_type {
  * multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
  * the total number of low level events, it is necessary to to sum all struct
  * sample_event.period and stash the result in total_period.
+ *
+ * The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
+ * The total NMI# is stored in [0], while the accumulated time is in [1].
  */
 struct events_stats {
 	u64 total_period;
@@ -270,6 +273,7 @@ struct events_stats {
 	u64 total_lost_samples;
 	u64 total_aux_lost;
 	u64 total_invalid_chains;
+	u64 total_nmi_overhead[MAX_NR_CPUS][2];
 	u32 nr_events[PERF_RECORD_HEADER_MAX];
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 1101757..58076f2 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -558,6 +558,15 @@ int machine__process_switch_event(struct machine *machine __maybe_unused,
 int machine__process_overhead_event(struct machine *machine __maybe_unused,
 				    union perf_event *event __maybe_unused)
 {
+	if (event->overhead.type == PERF_NMI_OVERHEAD) {
+		dump_printf(" NMI nr: %llu  time: %llu cpu %u\n",
+			    event->overhead.entry.nr,
+			    event->overhead.entry.time,
+			    event->overhead.entry.cpu);
+	} else {
+		dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
+	}
+
 	return 0;
 }
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index bc0bc21..a79ab99 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1207,6 +1207,23 @@ static int
 					    &sample->read.one, machine);
 }
 
+static void
+overhead_stats_update(struct perf_tool *tool,
+		      struct perf_evlist *evlist,
+		      union perf_event *event)
+{
+	if (tool->overhead == perf_event__process_overhead) {
+		switch (event->overhead.type) {
+		case PERF_NMI_OVERHEAD:
+			evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+			evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+			break;
+		default:
+			break;
+		}
+	}
+}
+
 static int machines__deliver_event(struct machines *machines,
 				   struct perf_evlist *evlist,
 				   union perf_event *event,
@@ -1271,6 +1288,7 @@ static int machines__deliver_event(struct machines *machines,
 	case PERF_RECORD_SWITCH_CPU_WIDE:
 		return tool->context_switch(tool, event, sample, machine);
 	case PERF_RECORD_OVERHEAD:
+		overhead_stats_update(tool, evlist, event);
 		return tool->overhead(tool, event, sample, machine);
 	default:
 		++evlist->stats.nr_unknown_events;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 07/14] perf tools: show multiplexing overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (5 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23  9:44 ` [PATCH 08/14] perf tools: show side-band events overhead kan.liang
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Calculate the total multiplexing overhead on each CPU, and display it
in perf report.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-report.c | 8 ++++++--
 tools/perf/util/event.h     | 3 +++
 tools/perf/util/machine.c   | 5 +++++
 tools/perf/util/session.c   | 4 ++++
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index b1437586..2515d7a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -377,9 +377,13 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 				continue;
 			if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
 				continue;
-			fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
-				cpu, evlist->stats.total_nmi_overhead[cpu][0],
+			fprintf(stdout, "#\tCPU %d\n", cpu);
+			fprintf(stdout, "#\t\tNMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
+				evlist->stats.total_nmi_overhead[cpu][0],
 				evlist->stats.total_nmi_overhead[cpu][1]);
+			fprintf(stdout, "#\t\tMultiplexing#: %" PRIu64 " time: %" PRIu64 " ns\n",
+				evlist->stats.total_mux_overhead[cpu][0],
+				evlist->stats.total_mux_overhead[cpu][1]);
 		}
 		fprintf(stdout, "#\n");
 	}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 7d40d54..70e2508 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -265,6 +265,8 @@ enum auxtrace_error_type {
  *
  * The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
  * The total NMI# is stored in [0], while the accumulated time is in [1].
+ * The total_mux_overhead tells exactly the Multiplexing overhead on each CPU.
+ * The total rotate# is stored in [0], while the accumulated time is in [1].
  */
 struct events_stats {
 	u64 total_period;
@@ -274,6 +276,7 @@ struct events_stats {
 	u64 total_aux_lost;
 	u64 total_invalid_chains;
 	u64 total_nmi_overhead[MAX_NR_CPUS][2];
+	u64 total_mux_overhead[MAX_NR_CPUS][2];
 	u32 nr_events[PERF_RECORD_HEADER_MAX];
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 58076f2..eca1f8b 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -563,6 +563,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
 			    event->overhead.entry.nr,
 			    event->overhead.entry.time,
 			    event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_MUX_OVERHEAD) {
+		dump_printf(" Multiplexing nr: %llu  time: %llu cpu %u\n",
+			    event->overhead.entry.nr,
+			    event->overhead.entry.time,
+			    event->overhead.entry.cpu);
 	} else {
 		dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
 	}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a79ab99..594fd5e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1218,6 +1218,10 @@ overhead_stats_update(struct perf_tool *tool,
 			evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
 			evlist->stats.total_nmi_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
 			break;
+		case PERF_MUX_OVERHEAD:
+			evlist->stats.total_mux_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+			evlist->stats.total_mux_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+			break;
 		default:
 			break;
 		}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 08/14] perf tools: show side-band events overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (6 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 07/14] perf tools: show multiplexing overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23  9:44 ` [PATCH 09/14] perf tools: make get_nsecs visible for buildin files kan.liang
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Calculate the total overhead of invoking the side-band events handler
function, and display it in perf report.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-report.c | 3 +++
 tools/perf/util/event.h     | 5 +++++
 tools/perf/util/machine.c   | 5 +++++
 tools/perf/util/session.c   | 4 ++++
 4 files changed, 17 insertions(+)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 2515d7a..9c0a424 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -384,6 +384,9 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 			fprintf(stdout, "#\t\tMultiplexing#: %" PRIu64 " time: %" PRIu64 " ns\n",
 				evlist->stats.total_mux_overhead[cpu][0],
 				evlist->stats.total_mux_overhead[cpu][1]);
+			fprintf(stdout, "#\t\tSB#: %" PRIu64 " time: %" PRIu64 " ns\n",
+				evlist->stats.total_sb_overhead[cpu][0],
+				evlist->stats.total_sb_overhead[cpu][1]);
 		}
 		fprintf(stdout, "#\n");
 	}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 70e2508..3357529 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -267,6 +267,10 @@ enum auxtrace_error_type {
  * The total NMI# is stored in [0], while the accumulated time is in [1].
  * The total_mux_overhead tells exactly the Multiplexing overhead on each CPU.
  * The total rotate# is stored in [0], while the accumulated time is in [1].
+ * The total_sb_overhead tells exactly the overhead to output side-band
+ * events on each CPU.
+ * The total number of accessing side-band events handler function is stored
+ * in [0], while the accumulated processing time is in [1].
  */
 struct events_stats {
 	u64 total_period;
@@ -277,6 +281,7 @@ struct events_stats {
 	u64 total_invalid_chains;
 	u64 total_nmi_overhead[MAX_NR_CPUS][2];
 	u64 total_mux_overhead[MAX_NR_CPUS][2];
+	u64 total_sb_overhead[MAX_NR_CPUS][2];
 	u32 nr_events[PERF_RECORD_HEADER_MAX];
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_warned;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index eca1f8b..d8cde21 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -568,6 +568,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
 			    event->overhead.entry.nr,
 			    event->overhead.entry.time,
 			    event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_SB_OVERHEAD) {
+		dump_printf(" SB nr: %llu  time: %llu cpu %u\n",
+			    event->overhead.entry.nr,
+			    event->overhead.entry.time,
+			    event->overhead.entry.cpu);
 	} else {
 		dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
 	}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 594fd5e..e3aa9d7 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1222,6 +1222,10 @@ overhead_stats_update(struct perf_tool *tool,
 			evlist->stats.total_mux_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
 			evlist->stats.total_mux_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
 			break;
+		case PERF_SB_OVERHEAD:
+			evlist->stats.total_sb_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+			evlist->stats.total_sb_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+			break;
 		default:
 			break;
 		}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 09/14] perf tools: make get_nsecs visible for buildin files
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (7 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 08/14] perf tools: show side-band events overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23  9:44 ` [PATCH 10/14] perf tools: introduce PERF_RECORD_USER_OVERHEAD kan.liang
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

get_nsecs() will be used to get the current time in the following patches.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-sched.c | 2 +-
 tools/perf/builtin.h       | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index fb34412..bf0d571e8 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -194,7 +194,7 @@ struct perf_sched {
 	struct perf_sched_map map;
 };
 
-static u64 get_nsecs(void)
+u64 get_nsecs(void)
 {
 	struct timespec ts;
 
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index 0bcf68e..a564e15 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -12,6 +12,7 @@ const char *help_unknown_cmd(const char *cmd);
 void prune_packed_objects(int);
 int read_line_with_nul(char *buf, int size, FILE *file);
 int check_pager_config(const char *cmd);
+u64 get_nsecs(void);
 
 int cmd_annotate(int argc, const char **argv, const char *prefix);
 int cmd_bench(int argc, const char **argv, const char *prefix);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 10/14] perf tools: introduce PERF_RECORD_USER_OVERHEAD
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (8 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 09/14] perf tools: make get_nsecs visible for buildin files kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23  9:44 ` [PATCH 11/14] perf tools: record write data overhead kan.liang
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

The user space perf tool also brings overhead. Introduce
PERF_RECORD_USER_OVERHEAD to track this overhead information.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/util/event.c   | 1 +
 tools/perf/util/event.h   | 1 +
 tools/perf/util/session.c | 4 ++++
 3 files changed, 6 insertions(+)

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index ca98c4c..6cd43c9 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -48,6 +48,7 @@ static const char *perf_event__names[] = {
 	[PERF_RECORD_STAT_ROUND]		= "STAT_ROUND",
 	[PERF_RECORD_EVENT_UPDATE]		= "EVENT_UPDATE",
 	[PERF_RECORD_TIME_CONV]			= "TIME_CONV",
+	[PERF_RECORD_USER_OVERHEAD]		= "USER_OVERHEAD",
 };
 
 const char *perf_event__name(unsigned int id)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 3357529..1ef1a9d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -237,6 +237,7 @@ enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_STAT_ROUND			= 77,
 	PERF_RECORD_EVENT_UPDATE		= 78,
 	PERF_RECORD_TIME_CONV			= 79,
+	PERF_RECORD_USER_OVERHEAD		= 80,
 	PERF_RECORD_HEADER_MAX
 };
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e3aa9d7..27a5c8a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -804,6 +804,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
 	[PERF_RECORD_STAT_ROUND]	  = perf_event__stat_round_swap,
 	[PERF_RECORD_EVENT_UPDATE]	  = perf_event__event_update_swap,
 	[PERF_RECORD_TIME_CONV]		  = perf_event__all64_swap,
+	[PERF_RECORD_USER_OVERHEAD]	  = perf_event__all64_swap,
 	[PERF_RECORD_HEADER_MAX]	  = NULL,
 };
 
@@ -1382,6 +1383,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 	case PERF_RECORD_TIME_CONV:
 		session->time_conv = event->time_conv;
 		return tool->time_conv(tool, event, session);
+	case PERF_RECORD_USER_OVERHEAD:
+		overhead_stats_update(tool, session->evlist, event);
+		return tool->overhead(tool, event, NULL, NULL);
 	default:
 		return -EINVAL;
 	}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/14] perf tools: record write data overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (9 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 10/14] perf tools: introduce PERF_RECORD_USER_OVERHEAD kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 23:02   ` Jiri Olsa
  2016-11-23 23:06   ` Jiri Olsa
  2016-11-23  9:44 ` [PATCH 12/14] perf tools: record elapsed time kan.liang
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

perf record writes all kinds of events to file, which brings overhead as
well. Introduce PERF_USER_WRITE_OVERHEAD to save the overhead
information.
The overhead is measured as the time taken by the write itself, not the
elapsed system time.
Calculate the total user write overhead on each CPU, and display it in
perf report.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-record.c | 61 +++++++++++++++++++++++++++++++++++++++++++--
 tools/perf/util/event.h     |  9 +++++++
 tools/perf/util/machine.c   |  5 ++++
 tools/perf/util/session.c   |  4 +++
 4 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 67d2a90..492058e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -45,6 +45,11 @@
 #include <asm/bug.h>
 #include <linux/time64.h>
 
+struct write_overhead{
+	u64	nr;
+	u64	time;
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -63,17 +68,40 @@ struct record {
 	bool			timestamp_filename;
 	bool			switch_output;
 	unsigned long long	samples;
+	struct write_overhead	overhead[MAX_NR_CPUS];
 };
 
+static u64 get_vnsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
+
+	return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+}
+
 static int record__write(struct record *rec, void *bf, size_t size)
 {
+	int cpu = sched_getcpu();
+	u64 start, end;
+	int ret = 0;
+
+	start = get_vnsecs();
 	if (perf_data_file__write(rec->session->file, bf, size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
-		return -1;
+		ret = -1;
+		goto done;
 	}
 
 	rec->bytes_written += size;
-	return 0;
+done:
+	end = get_vnsecs();
+	if (cpu >= 0) {
+		rec->overhead[cpu].nr++;
+		rec->overhead[cpu].time += (end - start);
+	}
+
+	return ret;
 }
 
 static int process_synthesized_event(struct perf_tool *tool,
@@ -813,6 +841,33 @@ static int record__synthesize(struct record *rec, bool tail)
 	return err;
 }
 
+static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t process)
+{
+	int cpu;
+
+	union perf_event event = {
+		.overhead = {
+			.header = {
+				.type = PERF_RECORD_USER_OVERHEAD,
+				.size = sizeof(struct perf_overhead),
+			},
+		},
+	};
+
+	event.overhead.type = PERF_USER_WRITE_OVERHEAD;
+
+	for (cpu = 0; cpu < MAX_NR_CPUS; cpu++) {
+		if (!rec->overhead[cpu].nr)
+			continue;
+
+		event.overhead.entry.cpu = cpu;
+		event.overhead.entry.nr = rec->overhead[cpu].nr;
+		event.overhead.entry.time = rec->overhead[cpu].time;
+
+		(void)process(&rec->tool, &event, NULL, NULL);
+	}
+}
+
 static int __cmd_record(struct record *rec, int argc, const char **argv)
 {
 	int err;
@@ -1074,6 +1129,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		goto out_child;
 	}
 
+	perf_event__synth_overhead(rec, process_synthesized_event);
+
 	if (!quiet)
 		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
 
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 1ef1a9d..9927cf9 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -272,6 +272,9 @@ enum auxtrace_error_type {
  * events on each CPU.
  * The total number of accessing side-band events handler function is stored
  * in [0], while the accumulated processing time is in [1].
+ * The total_user_write_overhead tells exactly the overhead to write data in
+ * perf record.
+ * The total write# is stored in [0], while the accumulated time is in [1].
  */
 struct events_stats {
 	u64 total_period;
@@ -283,6 +286,7 @@ struct events_stats {
 	u64 total_nmi_overhead[MAX_NR_CPUS][2];
 	u64 total_mux_overhead[MAX_NR_CPUS][2];
 	u64 total_sb_overhead[MAX_NR_CPUS][2];
+	u64 total_user_write_overhead[MAX_NR_CPUS][2];
 	u32 nr_events[PERF_RECORD_HEADER_MAX];
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_warned;
@@ -493,6 +497,11 @@ struct time_conv_event {
 	u64 time_zero;
 };
 
+enum perf_user_overhead_event_type { /* above any possible kernel type */
+	PERF_USER_OVERHEAD_TYPE_START	= 100,
+	PERF_USER_WRITE_OVERHEAD	= 100,
+};
+
 struct perf_overhead {
 	struct perf_event_header	header;
 	u32				type;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index d8cde21..ce7a0ea 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -573,6 +573,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
 			    event->overhead.entry.nr,
 			    event->overhead.entry.time,
 			    event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
+		dump_printf(" User write nr: %llu  time: %llu cpu %u\n",
+			    event->overhead.entry.nr,
+			    event->overhead.entry.time,
+			    event->overhead.entry.cpu);
 	} else {
 		dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
 	}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 27a5c8a..a72992b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1227,6 +1227,10 @@ overhead_stats_update(struct perf_tool *tool,
 			evlist->stats.total_sb_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
 			evlist->stats.total_sb_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
 			break;
+		case PERF_USER_WRITE_OVERHEAD:
+			evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
+			evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
+			break;
 		default:
 			break;
 		}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 12/14] perf tools: record elapsed time
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (10 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 11/14] perf tools: record write data overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23  9:44 ` [PATCH 13/14] perf tools: warn on high overhead kan.liang
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Record the elapsed time of perf record, and display it in perf report.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-record.c | 10 ++++++++++
 tools/perf/builtin-report.c |  1 +
 tools/perf/util/event.h     |  3 +++
 tools/perf/util/machine.c   |  3 +++
 tools/perf/util/session.c   |  3 +++
 5 files changed, 20 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 492058e..ea94e10 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -69,6 +69,7 @@ struct record {
 	bool			switch_output;
 	unsigned long long	samples;
 	struct write_overhead	overhead[MAX_NR_CPUS];
+	u64			elapsed_time;
 };
 
 static u64 get_vnsecs(void)
@@ -866,6 +867,12 @@ static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t
 
 		(void)process(&rec->tool, &event, NULL, NULL);
 	}
+
+	event.overhead.type = PERF_USER_ELAPSED_TIME;
+	event.overhead.entry.cpu = -1;
+	event.overhead.entry.nr = 1;
+	event.overhead.entry.time = rec->elapsed_time;
+	(void)process(&rec->tool, &event, NULL, NULL);
 }
 
 static int __cmd_record(struct record *rec, int argc, const char **argv)
@@ -1129,6 +1136,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		goto out_child;
 	}
 
+	rec->elapsed_time = get_nsecs() - rec->elapsed_time;
 	perf_event__synth_overhead(rec, process_synthesized_event);
 
 	if (!quiet)
@@ -1601,6 +1609,8 @@ int cmd_record(int argc, const char **argv, const char *prefix __maybe_unused)
 # undef REASON
 #endif
 
+	rec->elapsed_time = get_nsecs();
+
 	rec->evlist = perf_evlist__new();
 	if (rec->evlist == NULL)
 		return -ENOMEM;
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 9c0a424..de2a9b6 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -371,6 +371,7 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 
 	fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
 	if (symbol_conf.show_overhead) {
+		fprintf(stdout, "# Elapsed time: %" PRIu64 " ns\n",  evlist->stats.elapsed_time);
 		fprintf(stdout, "# Overhead:\n");
 		for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
 			if (!evlist->stats.total_nmi_overhead[cpu][0])
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 9927cf9..ceb0968 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -275,6 +275,7 @@ enum auxtrace_error_type {
  * The total_user_write_overhead tells exactly the overhead to write data in
  * perf record.
  * The total write# is stored in [0], while the accumulated time is in [1].
+ * The elapsed_time tells the elapsed time of the perf record session.
  */
 struct events_stats {
 	u64 total_period;
@@ -287,6 +288,7 @@ struct events_stats {
 	u64 total_mux_overhead[MAX_NR_CPUS][2];
 	u64 total_sb_overhead[MAX_NR_CPUS][2];
 	u64 total_user_write_overhead[MAX_NR_CPUS][2];
+	u64 elapsed_time;
 	u32 nr_events[PERF_RECORD_HEADER_MAX];
 	u32 nr_non_filtered_samples;
 	u32 nr_lost_warned;
@@ -497,6 +499,7 @@ struct time_conv_event {
 	u64 time_zero;
 };
 
+#define PERF_USER_ELAPSED_TIME	200 /* above any possible overhead type */
 enum perf_user_overhead_event_type { /* above any possible kernel type */
 	PERF_USER_OVERHEAD_TYPE_START	= 100,
 	PERF_USER_WRITE_OVERHEAD	= 100,
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index ce7a0ea..150071f 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -578,6 +578,9 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
 			    event->overhead.entry.nr,
 			    event->overhead.entry.time,
 			    event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_USER_ELAPSED_TIME) {
+		dump_printf(" Elapsed time: %llu\n",
+			    event->overhead.entry.time);
 	} else {
 		dump_printf("\tUNSUPPORT OVERHEAD TYPE 0x%x!\n", event->overhead.type);
 	}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a72992b..e84808f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1231,6 +1231,9 @@ overhead_stats_update(struct perf_tool *tool,
 			evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][0] += event->overhead.entry.nr;
 			evlist->stats.total_user_write_overhead[event->overhead.entry.cpu][1] += event->overhead.entry.time;
 			break;
+		case PERF_USER_ELAPSED_TIME:
+			evlist->stats.elapsed_time = event->overhead.entry.time;
+			break;
 		default:
 			break;
 		}
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/14] perf tools: warn on high overhead
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (11 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 12/14] perf tools: record elapsed time kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 20:25   ` Andi Kleen
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
  2016-11-24  4:27 ` [PATCH 00/14] export perf overheads information Ingo Molnar
  14 siblings, 1 reply; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

The rough overhead rate can be calculated as the sum of all kinds of
overhead divided by the elapsed time.
If the overhead rate is higher than 10%, warn the user.

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/util/session.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e84808f..decfc48 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1559,6 +1559,30 @@ perf_session__warn_order(const struct perf_session *session)
 		ui__warning("%u out of order events recorded.\n", oe->nr_unordered_events);
 }
 
+static void
+perf_session__warn_overhead(const struct perf_session *session)
+{
+	const struct events_stats *stats = &session->evlist->stats;
+	double overhead_rate;
+	u64 overhead;
+	int i;
+
+	for (i = 0; i < session->header.env.nr_cpus_online; i++) {
+		overhead = stats->total_nmi_overhead[i][1];
+		overhead += stats->total_mux_overhead[i][1];
+		overhead += stats->total_sb_overhead[i][1];
+		overhead += stats->total_user_write_overhead[i][1];
+
+		overhead_rate = (double)overhead / (double)stats->elapsed_time;
+
+		if (overhead_rate > 0.1) {
+			ui__warning("Perf overhead is high! The overhead rate is %3.2f%% on CPU %d\n\n"
+				    "Please consider reducing the number of events, increasing the period, or decreasing the frequency.\n\n",
+				    overhead_rate * 100.0, i);
+		}
+	}
+}
+
 static void perf_session__warn_about_errors(const struct perf_session *session)
 {
 	const struct events_stats *stats = &session->evlist->stats;
@@ -1632,6 +1656,8 @@ static void perf_session__warn_about_errors(const struct perf_session *session)
 			    "Increase it by --proc-map-timeout\n",
 			    stats->nr_proc_map_timeout);
 	}
+
+	perf_session__warn_overhead(session);
 }
 
 static int perf_session__flush_thread_stack(struct thread *thread,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/14] perf script: show overhead events
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (12 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 13/14] perf tools: warn on high overhead kan.liang
@ 2016-11-23  9:44 ` kan.liang
  2016-11-23 23:17   ` Jiri Olsa
                     ` (3 more replies)
  2016-11-24  4:27 ` [PATCH 00/14] export perf overheads information Ingo Molnar
  14 siblings, 4 replies; 58+ messages in thread
From: kan.liang @ 2016-11-23  9:44 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: alexander.shishkin, tglx, namhyung, jolsa, adrian.hunter,
	wangnan0, mark.rutland, andi, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Introduce a new option --show-overhead to show overhead events in perf
script

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 tools/perf/builtin-script.c | 36 ++++++++++++++++++++++++++++++++++++
 tools/perf/util/event.c     | 37 +++++++++++++++++++++++++++++++++++++
 tools/perf/util/event.h     |  1 +
 3 files changed, 74 insertions(+)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e1daff3..76d9747 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -829,6 +829,7 @@ struct perf_script {
 	bool			show_task_events;
 	bool			show_mmap_events;
 	bool			show_switch_events;
+	bool			show_overhead;
 	bool			allocated;
 	struct cpu_map		*cpus;
 	struct thread_map	*threads;
@@ -1264,6 +1265,37 @@ static int process_switch_event(struct perf_tool *tool,
 	return 0;
 }
 
+static int process_overhead_event(struct perf_tool *tool,
+				  union perf_event *event,
+				  struct perf_sample *sample,
+				  struct machine *machine)
+{
+	struct thread *thread;
+	struct perf_script *script = container_of(tool, struct perf_script, tool);
+	struct perf_session *session = script->session;
+	struct perf_evsel *evsel;
+
+	if (perf_event__process_switch(tool, event, sample, machine) < 0)
+		return -1;
+	if (sample) {
+		evsel = perf_evlist__id2evsel(session->evlist, sample->id);
+		thread = machine__findnew_thread(machine, sample->pid, sample->tid);
+		if (thread == NULL) {
+			pr_debug("problem processing OVERHEAD event, skipping it.\n");
+			return -1;
+		}
+
+		print_sample_start(sample, thread, evsel);
+		perf_event__fprintf(event, stdout);
+		thread__put(thread);
+	} else {
+		/* USER OVERHEAD event */
+		perf_event__fprintf(event, stdout);
+	}
+
+	return 0;
+}
+
 static void sig_handler(int sig __maybe_unused)
 {
 	session_done = 1;
@@ -1287,6 +1319,8 @@ static int __cmd_script(struct perf_script *script)
 	}
 	if (script->show_switch_events)
 		script->tool.context_switch = process_switch_event;
+	if (script->show_overhead)
+		script->tool.overhead  = process_overhead_event;
 
 	ret = perf_session__process_events(script->session);
 
@@ -2172,6 +2206,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
 		    "Show the mmap events"),
 	OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
 		    "Show context switch events (if recorded)"),
+	OPT_BOOLEAN('\0', "show-overhead", &script.show_overhead,
+		    "Show overhead events"),
 	OPT_BOOLEAN('f', "force", &file.force, "don't complain, do it"),
 	OPT_BOOLEAN(0, "ns", &nanosecs,
 		    "Use 9 decimal places when displaying time"),
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 6cd43c9..cd4f3aa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -1190,6 +1190,39 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp)
 		       event->context_switch.next_prev_tid);
 }
 
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp)
+{
+	size_t ret;
+
+	if (event->overhead.type == PERF_NMI_OVERHEAD) {
+		ret = fprintf(fp, " [NMI] nr: %llu  time: %llu cpu %u\n",
+			      event->overhead.entry.nr,
+			      event->overhead.entry.time,
+			      event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_MUX_OVERHEAD) {
+		ret = fprintf(fp, " [MUX] nr: %llu  time: %llu cpu %u\n",
+			      event->overhead.entry.nr,
+			      event->overhead.entry.time,
+			      event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_SB_OVERHEAD) {
+		ret = fprintf(fp, " [SB] nr: %llu  time: %llu cpu %u\n",
+			      event->overhead.entry.nr,
+			      event->overhead.entry.time,
+			      event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
+		ret = fprintf(fp, " [USER WRITE] nr: %llu  time: %llu cpu %u\n",
+			      event->overhead.entry.nr,
+			      event->overhead.entry.time,
+			      event->overhead.entry.cpu);
+	} else if (event->overhead.type == PERF_USER_ELAPSED_TIME) {
+		ret = fprintf(fp, " [ELAPSED TIME] time: %llu\n",
+			      event->overhead.entry.time);
+	} else {
+		ret = fprintf(fp, " unhandled!\n");
+	}
+	return ret;
+}
+
 size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 {
 	size_t ret = fprintf(fp, "PERF_RECORD_%s",
@@ -1219,6 +1252,10 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 	case PERF_RECORD_SWITCH_CPU_WIDE:
 		ret += perf_event__fprintf_switch(event, fp);
 		break;
+	case PERF_RECORD_OVERHEAD:
+	case PERF_RECORD_USER_OVERHEAD:
+		ret += perf_event__fprintf_overhead(event, fp);
+		break;
 	default:
 		ret += fprintf(fp, "\n");
 	}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ceb0968..36e295d 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -690,6 +690,7 @@ size_t perf_event__fprintf_switch(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_thread_map(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_cpu_map(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_overhead(union perf_event *event, FILE *fp);
 
 u64 kallsyms__get_function_start(const char *kallsyms_filename,
 				 const char *symbol_name);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-23  9:44 ` [PATCH 02/14] perf/x86: output NMI overhead kan.liang
@ 2016-11-23 20:06   ` Peter Zijlstra
  2016-11-24 16:19   ` Mark Rutland
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-23 20:06 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:40AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> The NMI handler is one of the most important parts that bring overhead.
> 
> There are lots of NMIs during sampling. It's very expensive to log each
> NMI. So the accumulated time and NMI# will be output when the event is
> going to be disabled or the task is scheduled out.
> The newly introduced flag PERF_EF_LOG indicates that the overhead log
> should be output.
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  arch/x86/events/core.c          | 19 ++++++++++++++-
>  arch/x86/events/perf_event.h    |  2 ++
>  include/linux/perf_event.h      |  1 +
>  include/uapi/linux/perf_event.h |  2 ++
>  kernel/events/core.c            | 54 ++++++++++++++++++++++-------------------
>  5 files changed, 52 insertions(+), 26 deletions(-)

The subject says this patch is x86 specific, but most changes are in
core code.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 03/14] perf/x86: output multiplexing overhead
  2016-11-23  9:44 ` [PATCH 03/14] perf/x86: output multiplexing overhead kan.liang
@ 2016-11-23 20:06   ` Peter Zijlstra
  2016-11-23 20:09     ` Liang, Kan
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-23 20:06 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:41AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Multiplexing overhead is one of the key overheads when the number of
> events exceeds the available counters.
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  include/linux/perf_event.h      |  2 ++
>  include/uapi/linux/perf_event.h |  1 +
>  kernel/events/core.c            | 16 ++++++++++++++++
>  3 files changed, 19 insertions(+)

The subject says x86 specific, but its _all_ core code.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/14] perf/x86: output side-band events overhead
  2016-11-23  9:44 ` [PATCH 04/14] perf/x86: output side-band events overhead kan.liang
@ 2016-11-23 20:06   ` Peter Zijlstra
  2016-11-24 16:21   ` Mark Rutland
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-23 20:06 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:42AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Iterating over all events which need to receive side-band events also
> brings some overhead.
> Save the overhead information in task context or CPU context, whichever
> context is available.
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  include/linux/perf_event.h      |  2 ++
>  include/uapi/linux/perf_event.h |  1 +
>  kernel/events/core.c            | 32 ++++++++++++++++++++++++++++----
>  3 files changed, 31 insertions(+), 4 deletions(-)

Guess?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 03/14] perf/x86: output multiplexing overhead
  2016-11-23 20:06   ` Peter Zijlstra
@ 2016-11-23 20:09     ` Liang, Kan
  0 siblings, 0 replies; 58+ messages in thread
From: Liang, Kan @ 2016-11-23 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi



> 
> On Wed, Nov 23, 2016 at 04:44:41AM -0500, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > Multiplexing overhead is one of the key overheads when the number of
> > events exceeds the available counters.
> >
> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> > ---
> >  include/linux/perf_event.h      |  2 ++
> >  include/uapi/linux/perf_event.h |  1 +
> >  kernel/events/core.c            | 16 ++++++++++++++++
> >  3 files changed, 19 insertions(+)
> 
> The subject says x86 specific, but its _all_ core code.

Oh, sorry. I will change the subject in the next version.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
@ 2016-11-23 20:11   ` Peter Zijlstra
  2016-11-23 20:13   ` Peter Zijlstra
  2016-11-23 23:41   ` Jiri Olsa
  2 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-23 20:11 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:39AM -0500, kan.liang@intel.com wrote:
> +struct perf_overhead_entry {
> +	__u32	cpu;
> +	__u64	nr;
> +	__u64	time;
> +};

> +void perf_log_overhead(struct perf_event *event, u32 type,
> +		       struct perf_overhead_entry *entry)
> +{
> +	struct perf_output_handle handle;
> +	struct perf_sample_data sample;
> +	int ret;
> +
> +	struct {
> +		struct perf_event_header	header;
> +		u32				type;

There's two u32 sized holes here.

> +		struct perf_overhead_entry	overhead;
> +	} overhead_event = {
> +		.header = {
> +			.type = PERF_RECORD_OVERHEAD,
> +			.misc = 0,
> +			.size = sizeof(overhead_event),
> +		},
		.type = type,
		.overhead = *event,
> +	};

> +	overhead_event.type = type;
> +	memcpy(&overhead_event.overhead, entry, sizeof(overhead_event.overhead));

Do the above named initialization instead.

> +
> +	perf_event_header__init_id(&overhead_event.header, &sample, event);
> +	ret = perf_output_begin(&handle, event, overhead_event.header.size);
> +
> +	if (ret)
> +		return;
> +
> +	perf_output_put(&handle, overhead_event);
> +	perf_event__output_id_sample(event, &handle, &sample);
> +
> +	perf_output_end(&handle);
> +	memset(entry, 0, sizeof(*entry));
> +}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
  2016-11-23 20:11   ` Peter Zijlstra
@ 2016-11-23 20:13   ` Peter Zijlstra
  2016-11-23 23:41   ` Jiri Olsa
  2 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-23 20:13 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, linux-kernel, alexander.shishkin, tglx, namhyung,
	jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:39AM -0500, kan.liang@intel.com wrote:
> +/*
> + * single overhead record layout:
> + *
> + * 	 cpu: The cpu on which the overhead occurs

This is duplicate information; it's already present in sample_id when
PERF_SAMPLE_CPU is set, and without that we don't care.

> + * 	  nr: Number of times the overhead happened.
> + * 	      E.g. for NMI, nr == number of times the NMI handler was called.
> + * 	time: Total overhead cost (ns)
> + */
> +struct perf_overhead_entry {
> +	__u32	cpu;
> +	__u64	nr;
> +	__u64	time;
> +};

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 13/14] perf tools: warn on high overhead
  2016-11-23  9:44 ` [PATCH 13/14] perf tools: warn on high overhead kan.liang
@ 2016-11-23 20:25   ` Andi Kleen
  2016-11-23 22:03     ` Liang, Kan
  0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2016-11-23 20:25 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland

kan.liang@intel.com writes:

> From: Kan Liang <kan.liang@intel.com>
>
> The rough overhead rate can be calculated as the sum of all kinds of
> overhead divided by the elapsed time.
> If the overhead rate is higher than 10%, warn the user.

Thinking about this more: this is comparing the cost of a single
CPU to the total wall clock time. This isn't very good and can
give confusing results with many cores.

Perhaps we need two separate metrics here:

- cost of perf record on its CPU (or later on multiple CPUs if it
  gets multi-threaded). Warn if this is >50% or so.
- average perf collection overhead on a CPU. The 10% threshold here
  seems appropriate.

-Andi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 13/14] perf tools: warn on high overhead
  2016-11-23 20:25   ` Andi Kleen
@ 2016-11-23 22:03     ` Liang, Kan
  2016-11-25 20:42       ` Andi Kleen
  0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-23 22:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland



> 
> kan.liang@intel.com writes:
> 
> > From: Kan Liang <kan.liang@intel.com>
> >
> > The rough overhead rate can be calculated as the sum of all kinds of
> > overhead divided by the elapsed time.
> > If the overhead rate is higher than 10%, warn the user.
> 
> Thinking about this more: this is comparing the cost of a single CPU to the
> total wall clock time. This isn't very good and can give confusing results with
> many cores.
> 
> Perhaps we need two separate metrics here:
> 
> - cost of perf record on its CPU (or later on multiple CPUs if it
>   gets multi-threaded). Warn if this is >50% or so.

What's the formula for cost of perf record on its CPU?
The cost only includes user space overhead or all overhead?
What is the divisor?


> - average perf collection overhead on a CPU. The 10% threshold here
>   seems appropriate.
For the average, do you mean adding all overheads across CPUs together
and dividing by the CPU count?

To calculate the rate, the divisor is wall clock time, right?

Thanks,
Kan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type
  2016-11-23  9:44 ` [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type kan.liang
@ 2016-11-23 22:35   ` Jiri Olsa
  2016-11-23 22:58     ` Jiri Olsa
  0 siblings, 1 reply; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 22:35 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:43AM -0500, kan.liang@intel.com wrote:

SNIP

> +
>  static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
>  {
>  	const char *dup_filename;
> @@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
>  	case PERF_RECORD_SWITCH:
>  	case PERF_RECORD_SWITCH_CPU_WIDE:
>  		ret = machine__process_switch_event(machine, event); break;
> +	case PERF_RECORD_OVERHEAD:
> +		ret = machine__process_overhead_event(machine, event); break;

missing breaks

jirka

>  	default:
>  		ret = -1;
>  		break;

SNIP

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
@ 2016-11-23 22:51   ` Jiri Olsa
  2016-11-24 13:37     ` Liang, Kan
  2016-11-23 22:52   ` Jiri Olsa
  2016-11-23 22:52   ` Jiri Olsa
  2 siblings, 1 reply; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 22:51 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Calculate the total NMI overhead on each CPU, and display it in perf
> report

so the output looks like this:

---
# Elapsed time: 1720167944 ns
# Overhead:
#       CPU 6
#               NMI#: 27 time: 111379 ns
#               Multiplexing#: 0 time: 0 ns
#               SB#: 57 time: 90045 ns
#
# Samples: 26  of event 'cycles:u'
# Event count (approx.): 1677531
#
# Overhead  Command  Shared Object     Symbol                 
# ........  .......  ................  .......................
#
    24.20%  ls       ls                [.] _init
    17.18%  ls       libc-2.24.so      [.] __strcoll_l
    11.85%  ls       ld-2.24.so        [.] _dl_relocate_object
---


few things:

- I wonder if we want to put this overhead output separately from the
  main perf output.. this scales badly with bigger cpu counts

- we might want to call it some other way, because we already
  use 'overhead' for the event count %

- how about TUI output? ;-) I don't think it's necessary, however
  currently 'perf report --show-overhead' does not show anything
  if TUI is the default output, unless you use the --stdio option

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
  2016-11-23 22:51   ` Jiri Olsa
@ 2016-11-23 22:52   ` Jiri Olsa
  2016-11-23 22:52   ` Jiri Olsa
  2 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 22:52 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Calculate the total NMI overhead on each CPU, and display it in perf
> report

please put example output into chagelog

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
  2016-11-23 22:51   ` Jiri Olsa
  2016-11-23 22:52   ` Jiri Olsa
@ 2016-11-23 22:52   ` Jiri Olsa
  2 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 22:52 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Calculate the total NMI overhead on each CPU, and display it in perf
> report
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  tools/perf/builtin-report.c | 11 +++++++++++
>  tools/perf/util/event.h     |  4 ++++
>  tools/perf/util/machine.c   |  9 +++++++++
>  tools/perf/util/session.c   | 18 ++++++++++++++++++
>  4 files changed, 42 insertions(+)
> 
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 1416c39..b1437586 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -365,11 +365,22 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
>  					 struct report *rep,
>  					 const char *help)
>  {
> +	struct perf_session *session = rep->session;
>  	struct perf_evsel *pos;
> +	int cpu;
>  
>  	fprintf(stdout, "#\n# Total Lost Samples: %" PRIu64 "\n#\n", evlist->stats.total_lost_samples);
>  	if (symbol_conf.show_overhead) {
>  		fprintf(stdout, "# Overhead:\n");
> +		for (cpu = 0; cpu < session->header.env.nr_cpus_online; cpu++) {
> +			if (!evlist->stats.total_nmi_overhead[cpu][0])
> +				continue;
> +			if (rep->cpu_list && !test_bit(cpu, rep->cpu_bitmap))
> +				continue;
> +			fprintf(stdout, "#\tCPU %d: NMI#: %" PRIu64 " time: %" PRIu64 " ns\n",
> +				cpu, evlist->stats.total_nmi_overhead[cpu][0],
> +				evlist->stats.total_nmi_overhead[cpu][1]);
> +		}
>  		fprintf(stdout, "#\n");
>  	}
>  	evlist__for_each_entry(evlist, pos) {
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index d1b179b..7d40d54 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -262,6 +262,9 @@ enum auxtrace_error_type {
>   * multipling nr_events[PERF_EVENT_SAMPLE] by a frequency isn't possible to get
>   * the total number of low level events, it is necessary to to sum all struct
>   * sample_event.period and stash the result in total_period.
> + *
> + * The total_nmi_overhead tells exactly the NMI handler overhead on each CPU.
> + * The total NMI# is stored in [0], while the accumulated time is in [1].
>   */

hum, why can't this be stored in a struct instead.. ?

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type
  2016-11-23 22:35   ` Jiri Olsa
@ 2016-11-23 22:58     ` Jiri Olsa
  0 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 22:58 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 11:35:59PM +0100, Jiri Olsa wrote:
> On Wed, Nov 23, 2016 at 04:44:43AM -0500, kan.liang@intel.com wrote:
> 
> SNIP
> 
> > +
> >  static void dso__adjust_kmod_long_name(struct dso *dso, const char *filename)
> >  {
> >  	const char *dup_filename;
> > @@ -1536,6 +1542,8 @@ int machine__process_event(struct machine *machine, union perf_event *event,
> >  	case PERF_RECORD_SWITCH:
> >  	case PERF_RECORD_SWITCH_CPU_WIDE:
> >  		ret = machine__process_switch_event(machine, event); break;
> > +	case PERF_RECORD_OVERHEAD:
> > +		ret = machine__process_overhead_event(machine, event); break;
> 
> missing breaks

ugh.. im blind.. sry :-\

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/14] perf tools: record write data overhead
  2016-11-23  9:44 ` [PATCH 11/14] perf tools: record write data overhead kan.liang
@ 2016-11-23 23:02   ` Jiri Olsa
  2016-11-23 23:06   ` Jiri Olsa
  1 sibling, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:02 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:49AM -0500, kan.liang@intel.com wrote:

SNIP

> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index d8cde21..ce7a0ea 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -573,6 +573,11 @@ int machine__process_overhead_event(struct machine *machine __maybe_unused,
>  			    event->overhead.entry.nr,
>  			    event->overhead.entry.time,
>  			    event->overhead.entry.cpu);
> +	} else if (event->overhead.type == PERF_USER_WRITE_OVERHEAD) {
> +		dump_printf(" User write nr: %llu  time: %llu cpu %u\n",
> +			    event->overhead.entry.nr,
> +			    event->overhead.entry.time,
> +			    event->overhead.entry.cpu);

should this rather be in the switch?

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/14] perf tools: record write data overhead
  2016-11-23  9:44 ` [PATCH 11/14] perf tools: record write data overhead kan.liang
  2016-11-23 23:02   ` Jiri Olsa
@ 2016-11-23 23:06   ` Jiri Olsa
  1 sibling, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:06 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:49AM -0500, kan.liang@intel.com wrote:

SNIP

>  
> +static void perf_event__synth_overhead(struct record *rec, perf_event__handler_t process)
> +{
> +	int cpu;
> +
> +	union perf_event event = {
> +		.overhead = {
> +			.header = {
> +				.type = PERF_RECORD_USER_OVERHEAD,
> +				.size = sizeof(struct perf_overhead),
> +			},
> +		},
> +	};
> +
> +	event.overhead.type = PERF_USER_WRITE_OVERHEAD;
> +
> +	for (cpu = 0; cpu < MAX_NR_CPUS; cpu++) {
> +		if (!rec->overhead[cpu].nr)
> +			continue;
> +
> +		event.overhead.entry.cpu = cpu;
> +		event.overhead.entry.nr = rec->overhead[cpu].nr;
> +		event.overhead.entry.time = rec->overhead[cpu].time;
> +
> +		(void)process(&rec->tool, &event, NULL, NULL);

please make perf_event__synth_overhead return int and check the 'process' return value above

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] perf script: show overhead events
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
@ 2016-11-23 23:17   ` Jiri Olsa
  2016-11-23 23:18   ` Jiri Olsa
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:17 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:52AM -0500, kan.liang@intel.com wrote:

SNIP

> +}
> +
>  static void sig_handler(int sig __maybe_unused)
>  {
>  	session_done = 1;
> @@ -1287,6 +1319,8 @@ static int __cmd_script(struct perf_script *script)
>  	}
>  	if (script->show_switch_events)
>  		script->tool.context_switch = process_switch_event;
> +	if (script->show_overhead)
> +		script->tool.overhead  = process_overhead_event;
>  
>  	ret = perf_session__process_events(script->session);
>  
> @@ -2172,6 +2206,8 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
>  		    "Show the mmap events"),
>  	OPT_BOOLEAN('\0', "show-switch-events", &script.show_switch_events,
>  		    "Show context switch events (if recorded)"),
> +	OPT_BOOLEAN('\0', "show-overhead", &script.show_overhead,
> +		    "Show overhead events"),

please add the '-events' suffix, as for the mmap and task options

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] perf script: show overhead events
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
  2016-11-23 23:17   ` Jiri Olsa
@ 2016-11-23 23:18   ` Jiri Olsa
  2016-11-23 23:19   ` Jiri Olsa
  2016-11-23 23:22   ` Jiri Olsa
  3 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:18 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:52AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Introduce a new option --show-overhead to show overhead events in perf
> script
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  tools/perf/builtin-script.c | 36 ++++++++++++++++++++++++++++++++++++
>  tools/perf/util/event.c     | 37 +++++++++++++++++++++++++++++++++++++
>  tools/perf/util/event.h     |  1 +
>  3 files changed, 74 insertions(+)
> 
> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
> index e1daff3..76d9747 100644
> --- a/tools/perf/builtin-script.c
> +++ b/tools/perf/builtin-script.c
> @@ -829,6 +829,7 @@ struct perf_script {
>  	bool			show_task_events;
>  	bool			show_mmap_events;
>  	bool			show_switch_events;
> +	bool			show_overhead;
>  	bool			allocated;
>  	struct cpu_map		*cpus;
>  	struct thread_map	*threads;
> @@ -1264,6 +1265,37 @@ static int process_switch_event(struct perf_tool *tool,
>  	return 0;
>  }
>  
> +static int process_overhead_event(struct perf_tool *tool,
> +				  union perf_event *event,
> +				  struct perf_sample *sample,
> +				  struct machine *machine)
> +{
> +	struct thread *thread;
> +	struct perf_script *script = container_of(tool, struct perf_script, tool);
> +	struct perf_session *session = script->session;
> +	struct perf_evsel *evsel;
> +
> +	if (perf_event__process_switch(tool, event, sample, machine) < 0)
> +		return -1;

process_switch event? copy&paste error?

jirka

> +	if (sample) {
> +		evsel = perf_evlist__id2evsel(session->evlist, sample->id);
> +		thread = machine__findnew_thread(machine, sample->pid, sample->tid);
> +		if (thread == NULL) {
> +			pr_debug("problem processing OVERHEAD event, skipping it.\n");
> +			return -1;
> +		}
> +
> +		print_sample_start(sample, thread, evsel);
> +		perf_event__fprintf(event, stdout);
> +		thread__put(thread);
> +	} else {
> +		/* USER OVERHEAD event */
> +		perf_event__fprintf(event, stdout);
> +	}
> +
> +	return 0;
> +}
> +

SNIP

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] perf script: show overhead events
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
  2016-11-23 23:17   ` Jiri Olsa
  2016-11-23 23:18   ` Jiri Olsa
@ 2016-11-23 23:19   ` Jiri Olsa
  2016-11-23 23:22   ` Jiri Olsa
  3 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:19 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:52AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Introduce a new option --show-overhead to show overhead events in perf
> script

please add example output to the changelog

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] perf script: show overhead events
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
                     ` (2 preceding siblings ...)
  2016-11-23 23:19   ` Jiri Olsa
@ 2016-11-23 23:22   ` Jiri Olsa
  3 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:22 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:52AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Introduce a new option --show-overhead to show overhead events in perf
> script

            perf  7356 [001]  7292.203517:     482010 cycles:pp:  ffffffff818e2150 _raw_spin_unlock_irqrestore+0x40 (/lib/modules/4.9.0-rc1+/build/vmlinux)
PERF_RECORD_USER_OVERHEAD [USER WRITE] nr: 1995  time: 14790661 cpu 1
PERF_RECORD_USER_OVERHEAD [ELAPSED TIME] time: 2721649901
            perf  7356 [001]  7292.203766:     482076 cycles:pp:  ffffffff81117f17 lock_release+0x37 (/lib/modules/4.9.0-rc1+/build/vmlinux)

I guess those 2 overhead events don't have context..?
should we make sure they are the last events?

thanks,
jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
  2016-11-23 20:11   ` Peter Zijlstra
  2016-11-23 20:13   ` Peter Zijlstra
@ 2016-11-23 23:41   ` Jiri Olsa
  2016-11-24 13:45     ` Liang, Kan
  2 siblings, 1 reply; 58+ messages in thread
From: Jiri Olsa @ 2016-11-23 23:41 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi

On Wed, Nov 23, 2016 at 04:44:39AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> A new perf record is introduced to export perf overhead information to
> userspace. So we can measure the overhead of sampling directly.
> 
> To output the overhead information, it takes advantage of the existing
> event log mechanism. But the overhead information is the system
> overhead, not per-event overhead.
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  include/linux/perf_event.h      |  4 ++++
>  include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
>  kernel/events/core.c            | 43 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 4741ecd..3a530a7 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -998,6 +998,10 @@ perf_event__output_id_sample(struct perf_event *event,
>  extern void
>  perf_log_lost_samples(struct perf_event *event, u64 lost);
>  
> +extern void
> +perf_log_overhead(struct perf_event *event, u32 type,
> +		  struct perf_overhead_entry *entry);
> +
>  static inline bool is_sampling_event(struct perf_event *event)
>  {
>  	return event->attr.sample_period != 0;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index c66a485..ae571be 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -862,6 +862,17 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_SWITCH_CPU_WIDE		= 15,
>  
> +	/*
> +	 * Records perf overhead
> +	 * struct {
> +	 * 	struct perf_event_header 	header;
> +	 * 	u32				type;
> +	 * 	struct perf_overhead_entry	entry;
> +	 * 	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_OVERHEAD			= 16,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };

I think we should make this optional/configurable like the
rest of the aux events, like below..

jirka


---
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 5e7c52278ef0..853a919bd9ae 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -344,7 +344,8 @@ struct perf_event_attr {
 				use_clockid    :  1, /* use @clockid for time fields */
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
-				__reserved_1   : 36;
+				overhead       :  1,
+				__reserved_1   : 35;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 00/14] export perf overheads information
  2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
                   ` (13 preceding siblings ...)
  2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
@ 2016-11-24  4:27 ` Ingo Molnar
  14 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2016-11-24  4:27 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, mark.rutland, andi


* kan.liang@intel.com <kan.liang@intel.com> wrote:

> From: Kan Liang <kan.liang@intel.com>
> 
> Profiling brings additional overhead. High overhead may impact the
> behavior of the profiled object, impact the accuracy of the
> profiling result, and even hang the system.
> Currently, perf has a dynamic interrupt throttle mechanism to lower the
> sample rate and overhead. But it has limitations.
>  - The mechanism only focuses on the overhead from NMI. However, there
>    are other parts which bring big overhead, e.g. multiplexing.
>  - The hint from the mechanism doesn't work with a fixed period.
>  - The system changes caused by the mechanism are not recorded
>    in the perf.data. Users have no idea about the overhead and its
>    impact.
> Actually, any passive way like the dynamic interrupt throttle mechanism
> is only palliative. The best way is to export overhead information,
> provide more hints, and help the users design a more proper perf command.
> 
> According to our test, there are four parts which can bring big overhead.
> They include the NMI handler, the multiplexing handler, iterating side-band
> events, and writing data to file. Two new perf record types,
> PERF_RECORD_OVERHEAD and PERF_RECORD_USER_OVERHEAD, are introduced to record
> the overhead information in kernel and user space respectively.
> The overhead information is the system per-CPU overhead, not per-event
> overhead. The implementation takes advantage of the existing event log
> mechanism.
> To reduce the additional overhead from logging overhead information, the
> overhead information is only output when the event is going to be
> disabled or the task is scheduling out.
> 
> In perf report, the overhead will be checked automatically. If the
> overhead rate is larger than 10%, a warning will be displayed.
> A new option is also introduced to display detailed per-CPU overhead
> information.
> 
> The current implementation only includes four overhead sources. There could
> be more in other parts. A new overhead source can be easily added as a
> new type.

Please include sample output of the new instrumentation!

Not even the tooling patches show any of the output, nor is it clear anywhere what 
kind of 'overhead' measurement it is, what the units are, what the metrics are, 
how users can _use_ this information, etc.

This is a totally inadequate description.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-23 22:51   ` Jiri Olsa
@ 2016-11-24 13:37     ` Liang, Kan
  2016-11-24 15:27       ` Jiri Olsa
  0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 13:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi



> 
> On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > Calculate the total NMI overhead on each CPU, and display it in perf
> > report
> 
> so the output looks like this:
> 
> ---
> # Elapsed time: 1720167944 ns
> # Overhead:
> #       CPU 6
> #               NMI#: 27 time: 111379 ns
> #               Multiplexing#: 0 time: 0 ns
> #               SB#: 57 time: 90045 ns
> #
> # Samples: 26  of event 'cycles:u'
> # Event count (approx.): 1677531
> #
> # Overhead  Command  Shared Object     Symbol
> # ........  .......  ................  .......................
> #
>     24.20%  ls       ls                [.] _init
>     17.18%  ls       libc-2.24.so      [.] __strcoll_l
>     11.85%  ls       ld-2.24.so        [.] _dl_relocate_object
> ---
> 
> 
> few things:
> 
> - I wonder if we want to put this overhead output separately from the
>   main perf output.. this scales badly with bigger cpu counts
> 
This output can only be shown when the user explicitly applies
the --show-overhead option. I think the user should expect the big
header.
Or I can add a --show-overhead-only option which only shows the
overhead information. It would be like what we do for --header and
--header-only.

Any suggestions?

> - we might want to call it some other way, because we already
>   use 'overhead' for the event count %
>

"operating_cost"? "processing_cost"? "perf_cost"? "perf_overhead"?
Suggestions?
 
> - how about TUI output? ;-) I don't think it's necessary, however
>   currently 'perf report --show-overhead' does not show anything
>   if TUI is the default output, unless you use the --stdio option

I will try to add something in TUI mode.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-23 23:41   ` Jiri Olsa
@ 2016-11-24 13:45     ` Liang, Kan
  2016-11-24 13:50       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 13:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi



> 
> On Wed, Nov 23, 2016 at 04:44:39AM -0500, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > A new perf record is introduced to export perf overhead information to
> > userspace. So we can measure the overhead of sampling directly.
> >
> > To output the overhead information, it takes advantage of the existing
> > event log mechanism. But the overhead information is the system
> > overhead, not per-event overhead.
> >
> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> > ---
> >  include/linux/perf_event.h      |  4 ++++
> >  include/uapi/linux/perf_event.h | 29 +++++++++++++++++++++++++++
> >  kernel/events/core.c            | 43
> +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 76 insertions(+)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 4741ecd..3a530a7 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -998,6 +998,10 @@ perf_event__output_id_sample(struct
> perf_event
> > *event,  extern void  perf_log_lost_samples(struct perf_event *event,
> > u64 lost);
> >
> > +extern void
> > +perf_log_overhead(struct perf_event *event, u32 type,
> > +		  struct perf_overhead_entry *entry);
> > +
> >  static inline bool is_sampling_event(struct perf_event *event)  {
> >  	return event->attr.sample_period != 0; diff --git
> > a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index c66a485..ae571be 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -862,6 +862,17 @@ enum perf_event_type {
> >  	 */
> >  	PERF_RECORD_SWITCH_CPU_WIDE		= 15,
> >
> > +	/*
> > +	 * Records perf overhead
> > +	 * struct {
> > +	 * 	struct perf_event_header 	header;
> > +	 * 	u32				type;
> > +	 * 	struct perf_overhead_entry	entry;
> > +	 * 	struct sample_id		sample_id;
> > +	 * };
> > +	 */
> > +	PERF_RECORD_OVERHEAD			= 16,
> > +
> >  	PERF_RECORD_MAX,			/* non-ABI */
> >  };
> 
> I think we should make this optional/configurable like the rest of the aux
> events, like below..
> 

The overhead logging only happens when the event is going to be disabled or
the task is scheduling out. It should not be frequent or expensive.

Peter,

What do you think?
Should we make it configurable?

Thanks,
Kan

> jirka
> 
> 
> ---
> diff --git a/include/uapi/linux/perf_event.h
> b/include/uapi/linux/perf_event.h index 5e7c52278ef0..853a919bd9ae
> 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -344,7 +344,8 @@ struct perf_event_attr {
>  				use_clockid    :  1, /* use @clockid for time
> fields */
>  				context_switch :  1, /* context switch data */
>  				write_backward :  1, /* Write ring buffer
> from end to beginning */
> -				__reserved_1   : 36;
> +				overhead       :  1,
> +				__reserved_1   : 35;
> 
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n
> events */

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 13:45     ` Liang, Kan
@ 2016-11-24 13:50       ` Peter Zijlstra
  2016-11-24 13:56         ` Liang, Kan
  2016-11-24 18:28         ` Andi Kleen
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-24 13:50 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Jiri Olsa, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi

On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > I think we should make this optional/configurable like the rest of the aux
> > events, like below..
> > 
> 
> The overhead logging only happens when event is going to be disabled or
> the task is scheduling out. It should not be much and expensive.
> 
> Peter,
> 
> What do you think?
> Should we make it configurable?

Is there a downside to having it optional?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 13:50       ` Peter Zijlstra
@ 2016-11-24 13:56         ` Liang, Kan
  2016-11-24 14:27           ` Jiri Olsa
  2016-11-24 18:28         ` Andi Kleen
  1 sibling, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi



> 
> On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > I think we should make this optional/configurable like the rest of
> > > the aux events, like below..
> > >
> >
> > The overhead logging only happens when event is going to be disabled
> > or the task is scheduling out. It should not be much and expensive.
> >
> > Peter,
> >
> > What do you think?
> > Should we make it configurable?
> 
> Is there a downside to having it optional?

No. There will be no overhead information dumped into perf.data, just like
the current implementation.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 13:56         ` Liang, Kan
@ 2016-11-24 14:27           ` Jiri Olsa
  2016-11-24 14:39             ` Liang, Kan
  0 siblings, 1 reply; 58+ messages in thread
From: Jiri Olsa @ 2016-11-24 14:27 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, mingo, acme, linux-kernel, alexander.shishkin,
	tglx, namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland,
	andi

On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
> 
> 
> > 
> > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > I think we should make this optional/configurable like the rest of
> > > > the aux events, like below..
> > > >
> > >
> > > The overhead logging only happens when event is going to be disabled
> > > or the task is scheduling out. It should not be much and expensive.
> > >
> > > Peter,
> > >
> > > What do you think?
> > > Should we make it configurable?
> > 
> > Is there a downside to having it optional?
> 
> NO. There will be no overhead information dumped into perf.data. Just like
> current implementation.

old perf tools will get unexpected events if it's not optional

[root@ibm-x3650m4-01 perf]# ./perf c2c record -ag
^C[ perf record: Woken up 1 times to write data ]
Warning:
Found 1 unknown events!

Is this an older tool processing a perf.data file generated by a more recent tool?

If that is not the case, consider reporting to linux-kernel@vger.kernel.org.

[ perf record: Captured and wrote 2.333 MB perf.data (9370 samples) ]


jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 14:27           ` Jiri Olsa
@ 2016-11-24 14:39             ` Liang, Kan
  2016-11-24 14:47               ` Jiri Olsa
  0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 14:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, mingo, acme, linux-kernel, alexander.shishkin,
	tglx, namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland,
	andi



> On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
> >
> >
> > >
> > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > I think we should make this optional/configurable like the rest
> > > > > of the aux events, like below..
> > > > >
> > > >
> > > > The overhead logging only happens when event is going to be
> > > > disabled or the task is scheduling out. It should not be much and
> expensive.
> > > >
> > > > Peter,
> > > >
> > > > What do you think?
> > > > Should we make it configurable?
> > >
> > > Is there a downside to having it optional?
> >
> > NO. There will be no overhead information dumped into perf.data. Just
> > like current implementation.
> 
> old perf tools will get unexpected events if it's not optional
>

That's expected because a new record type is introduced.
 
OK, I will make it configurable. Anyway, there is nothing to lose. :)

Thanks,
Kan

> [root@ibm-x3650m4-01 perf]# ./perf c2c record -ag ^C[ perf record: Woken
> up 1 times to write data ]
> Warning:
> Found 1 unknown events!
> 
> Is this an older tool processing a perf.data file generated by a more recent
> tool?
> 
> If that is not the case, consider reporting to linux-kernel@vger.kernel.org.
> 
> [ perf record: Captured and wrote 2.333 MB perf.data (9370 samples) ]
> 
> 
> jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 14:39             ` Liang, Kan
@ 2016-11-24 14:47               ` Jiri Olsa
  0 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-24 14:47 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, mingo, acme, linux-kernel, alexander.shishkin,
	tglx, namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland,
	andi

On Thu, Nov 24, 2016 at 02:39:26PM +0000, Liang, Kan wrote:
> 
> 
> > On Thu, Nov 24, 2016 at 01:56:51PM +0000, Liang, Kan wrote:
> > >
> > >
> > > >
> > > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > > I think we should make this optional/configurable like the rest
> > > > > > of the aux events, like below..
> > > > > >
> > > > >
> > > > > The overhead logging only happens when event is going to be
> > > > > disabled or the task is scheduling out. It should not be much and
> > expensive.
> > > > >
> > > > > Peter,
> > > > >
> > > > > What do you think?
> > > > > Should we make it configurable?
> > > >
> > > > Is there a downside to having it optional?
> > >
> > > NO. There will be no overhead information dumped into perf.data. Just
> > > like current implementation.
> > 
> > old perf tools will get unexpected events if it's not optional
> >
> 
> That's expected because a new record type is introduced.

but not if it's configurable.. older tools won't get it

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-24 13:37     ` Liang, Kan
@ 2016-11-24 15:27       ` Jiri Olsa
  2016-11-24 23:20         ` Namhyung Kim
  2016-11-25  0:21         ` Andi Kleen
  0 siblings, 2 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-24 15:27 UTC (permalink / raw)
  To: Liang, Kan
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, mark.rutland, andi

On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
> 
> 
> > 
> > On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> > > From: Kan Liang <kan.liang@intel.com>
> > >
> > > Calculate the total NMI overhead on each CPU, and display it in perf
> > > report
> > 
> > so the output looks like this:
> > 
> > ---
> > # Elapsed time: 1720167944 ns
> > # Overhead:
> > #       CPU 6
> > #               NMI#: 27 time: 111379 ns
> > #               Multiplexing#: 0 time: 0 ns
> > #               SB#: 57 time: 90045 ns
> > #
> > # Samples: 26  of event 'cycles:u'
> > # Event count (approx.): 1677531
> > #
> > # Overhead  Command  Shared Object     Symbol
> > # ........  .......  ................  .......................
> > #
> >     24.20%  ls       ls                [.] _init
> >     17.18%  ls       libc-2.24.so      [.] __strcoll_l
> >     11.85%  ls       ld-2.24.so        [.] _dl_relocate_object
> > ---

how about we display the overhead information the same way as the main perf output:

  CPU    NMI   NMI time    MTX  MTX time      SB   SB time
  ...  .....   ........  .....  ........  ......  ........
    6     27     111379      0         0      57     90045


it would just be a matter of adding new sort objects

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-23  9:44 ` [PATCH 02/14] perf/x86: output NMI overhead kan.liang
  2016-11-23 20:06   ` Peter Zijlstra
@ 2016-11-24 16:19   ` Mark Rutland
  2016-11-24 19:02     ` Peter Zijlstra
  2016-11-24 19:40     ` Liang, Kan
  1 sibling, 2 replies; 58+ messages in thread
From: Mark Rutland @ 2016-11-24 16:19 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, andi

On Wed, Nov 23, 2016 at 04:44:40AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> NMI handler is one of the most important part which brings overhead.
> 
> There are lots of NMI during sampling. It's very expensive to log each
> NMI. So the accumulated time and NMI# will be output when event is going
> to be disabled or task is scheduling out.
> The newly introduced flag PERF_EF_LOG indicate to output the overhead
> log.
> 
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  arch/x86/events/core.c          | 19 ++++++++++++++-
>  arch/x86/events/perf_event.h    |  2 ++
>  include/linux/perf_event.h      |  1 +
>  include/uapi/linux/perf_event.h |  2 ++
>  kernel/events/core.c            | 54 ++++++++++++++++++++++-------------------
>  5 files changed, 52 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index d31735f..6c3b0ef 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1397,6 +1397,11 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>  
>  	perf_event_update_userpage(event);
>  
> +	if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
> +		cpuc->nmi_overhead.cpu = smp_processor_id();
> +		perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
> +	}
> +
>  do_del:
>  	if (x86_pmu.del) {
>  		/*
> @@ -1475,11 +1480,21 @@ void perf_events_lapic_init(void)
>  	apic_write(APIC_LVTPC, APIC_DM_NMI);
>  }
>  
> +static void
> +perf_caculate_nmi_overhead(u64 time)

s/caculate/calculate/ - this tripped me up when grepping.

> @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  	start_clock = sched_clock();
>  	ret = x86_pmu.handle_irq(regs);
>  	finish_clock = sched_clock();
> +	clock = finish_clock - start_clock;
>  
> -	perf_sample_event_took(finish_clock - start_clock);
> +	perf_caculate_nmi_overhead(clock);
> +	perf_sample_event_took(clock);

Ah, so it's the *sampling* overhead, not the NMI overhead.

This doesn't take into account the cost of entering/exiting the handler, which
could be larger than the sampling overhead (e.g. if the PMU is connected
through chained interrupt controllers).

>  enum perf_record_overhead_type {
> +	PERF_NMI_OVERHEAD	= 0,

As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this doesn't count
the entire cost of the NMI, and other architectures may want to implement this,
yet don't have NMI.

[...]

>  static void
>  event_sched_out(struct perf_event *event,
>  		  struct perf_cpu_context *cpuctx,
> -		  struct perf_event_context *ctx)
> +		  struct perf_event_context *ctx,
> +		  bool log_overhead)

Boolean parameters are always confusing. Why not pass the flags directly? That
way we can pass *which* overhead to log, and make the callsites easier to
understand.

>  	event->tstamp_stopped = tstamp;
> -	event->pmu->del(event, 0);
> +	event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);

... which we could pass on here.

> @@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
>  static void
>  group_sched_out(struct perf_event *group_event,
>  		struct perf_cpu_context *cpuctx,
> -		struct perf_event_context *ctx)
> +		struct perf_event_context *ctx,
> +		bool log_overhead)

Likewise.

> @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
>  
> -	event_sched_out(event, cpuctx, ctx);
> +	event_sched_out(event, cpuctx, ctx, false);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
>  	list_del_event(event, ctx);
> @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
>  	update_cgrp_time_from_event(event);
>  	update_group_times(event);
>  	if (event == event->group_leader)
> -		group_sched_out(event, cpuctx, ctx);
> +		group_sched_out(event, cpuctx, ctx, true);
>  	else
> -		event_sched_out(event, cpuctx, ctx);
> +		event_sched_out(event, cpuctx, ctx, true);

Why does this differ from __perf_remove_from_context()?

What's the policy for when we do or do not measure overhead?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/14] perf/x86: output side-band events overhead
  2016-11-23  9:44 ` [PATCH 04/14] perf/x86: output side-band events overhead kan.liang
  2016-11-23 20:06   ` Peter Zijlstra
@ 2016-11-24 16:21   ` Mark Rutland
  2016-11-24 19:40     ` Liang, Kan
  1 sibling, 1 reply; 58+ messages in thread
From: Mark Rutland @ 2016-11-24 16:21 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, andi

On Wed, Nov 23, 2016 at 04:44:42AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> Iterating all events which need to receive side-band events also brings
> some overhead.
> Save the overhead information in task context or CPU context, whichever
> context is available.

Do we really want to expose this concept to userspace?

What if the implementation changes?

Thanks,
Mark.

> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  include/linux/perf_event.h      |  2 ++
>  include/uapi/linux/perf_event.h |  1 +
>  kernel/events/core.c            | 32 ++++++++++++++++++++++++++++----
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f72b97a..ec3cb7f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -764,6 +764,8 @@ struct perf_event_context {
>  #endif
>  	void				*task_ctx_data; /* pmu specific data */
>  	struct rcu_head			rcu_head;
> +
> +	struct perf_overhead_entry	sb_overhead;
>  };
>  
>  /*
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 9124c7c..5e7c522 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -994,6 +994,7 @@ struct perf_branch_entry {
>  enum perf_record_overhead_type {
>  	PERF_NMI_OVERHEAD	= 0,
>  	PERF_MUX_OVERHEAD,
> +	PERF_SB_OVERHEAD,
>  
>  	PERF_OVERHEAD_MAX,
>  };
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 9934059..51e9df7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
>  	if (event->attr.exclusive || !cpuctx->active_oncpu)
>  		cpuctx->exclusive = 0;
>  
> -	if (log_overhead && cpuctx->mux_overhead.nr) {
> -		cpuctx->mux_overhead.cpu = smp_processor_id();
> -		perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
> +	if (log_overhead) {
> +		if (cpuctx->mux_overhead.nr) {
> +			cpuctx->mux_overhead.cpu = smp_processor_id();
> +			perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx->mux_overhead);
> +		}
> +		if (ctx->sb_overhead.nr) {
> +			ctx->sb_overhead.cpu = smp_processor_id();
> +			perf_log_overhead(event, PERF_SB_OVERHEAD, &ctx->sb_overhead);
> +		}
>  	}
>  
>  	perf_pmu_enable(event->pmu);
> @@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f output, void *data)
>  	}
>  }
>  
> +static void
> +perf_caculate_sb_overhead(struct perf_event_context *ctx,
> +			  u64 time)
> +{
> +	ctx->sb_overhead.nr++;
> +	ctx->sb_overhead.time += time;
> +}
> +
>  /*
>   * Iterate all events that need to receive side-band events.
>   *
> @@ -6143,9 +6157,12 @@ static void
>  perf_iterate_sb(perf_iterate_f output, void *data,
>  	       struct perf_event_context *task_ctx)
>  {
> +	struct perf_event_context *overhead_ctx = task_ctx;
>  	struct perf_event_context *ctx;
> +	u64 start_clock, end_clock;
>  	int ctxn;
>  
> +	start_clock = perf_clock();
>  	rcu_read_lock();
>  	preempt_disable();
>  
> @@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void *data,
>  
>  	for_each_task_context_nr(ctxn) {
>  		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> -		if (ctx)
> +		if (ctx) {
>  			perf_iterate_ctx(ctx, output, data, false);
> +			if (!overhead_ctx)
> +				overhead_ctx = ctx;
> +		}
>  	}
>  done:
>  	preempt_enable();
>  	rcu_read_unlock();
> +
> +	end_clock = perf_clock();
> +	if (overhead_ctx)
> +		perf_caculate_sb_overhead(overhead_ctx, end_clock - start_clock);
>  }
>  
>  /*
> -- 
> 2.5.5
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread
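The accounting this patch adds can be modeled outside the kernel. Below is an illustrative Python sketch (not kernel C) of the timing logic in perf_iterate_sb(): stamp a clock around the whole iteration and charge the delta to whichever context is available. `time.monotonic_ns()` stands in for perf_clock(), and everything except the nr/time fields is a simplified assumption:

```python
import time

class OverheadEntry:
    """Models struct perf_overhead_entry: an invocation count plus
    accumulated time in nanoseconds."""
    def __init__(self):
        self.nr = 0
        self.time = 0

def iterate_sb(contexts, task_ctx, work):
    """Models perf_iterate_sb(): time the whole iteration once and charge
    the elapsed time to the task context if one was passed in, otherwise
    to the first live context found while iterating."""
    overhead_ctx = task_ctx
    start = time.monotonic_ns()          # stands in for perf_clock()
    for ctx in contexts:
        if ctx is not None:
            work(ctx)
            if overhead_ctx is None:
                overhead_ctx = ctx
    elapsed = time.monotonic_ns() - start
    # corresponds to perf_caculate_sb_overhead() in the patch
    if overhead_ctx is not None:
        overhead_ctx.nr += 1
        overhead_ctx.time += elapsed
    return overhead_ctx
```

Note the design point questioned in the thread: the whole iteration is charged to a single context, so per-event attribution is not attempted.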

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 13:50       ` Peter Zijlstra
  2016-11-24 13:56         ` Liang, Kan
@ 2016-11-24 18:28         ` Andi Kleen
  2016-11-24 18:58           ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2016-11-24 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, Jiri Olsa, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland, andi

On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > I think we should make this optional/configurable like the rest of the aux
> > > events, like below..
> > > 
> > 
> > The overhead logging only happens when the event is going to be disabled
> > or the task is scheduled out. It should not be frequent or expensive.
> > 
> > Peter,
> > 
> > What do you think?
> > Should we make it configurable?
> 
> Is there a downside to having it optional?

It would be good to always have at least a one-line overhead summary in the
default output. Then if someone sends you a perf report output file that shows
suspicious overhead, you can investigate.

-Andi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 18:28         ` Andi Kleen
@ 2016-11-24 18:58           ` Peter Zijlstra
  2016-11-24 19:02             ` Andi Kleen
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-24 18:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Liang, Kan, Jiri Olsa, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland

On Thu, Nov 24, 2016 at 10:28:22AM -0800, Andi Kleen wrote:
> On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > I think we should make this optional/configurable like the rest of the aux
> > > > events, like below..
> > > > 
> > > 
> > > The overhead logging only happens when the event is going to be disabled
> > > or the task is scheduled out. It should not be frequent or expensive.
> > > 
> > > Peter,
> > > 
> > > What do you think?
> > > Should we make it configurable?
> > 
> > Is there a downside to having it optional?
> 
> It would be good to always have at least a one-line overhead summary in the
> default output. Then if someone sends you a perf report output file that shows
> suspicious overhead, you can investigate.

Sure, but that's a tool thing, totally irrelevant for the kernel.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 18:58           ` Peter Zijlstra
@ 2016-11-24 19:02             ` Andi Kleen
  2016-11-24 19:08               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2016-11-24 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Liang, Kan, Jiri Olsa, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland

On Thu, Nov 24, 2016 at 07:58:49PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 10:28:22AM -0800, Andi Kleen wrote:
> > On Thu, Nov 24, 2016 at 02:50:43PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 24, 2016 at 01:45:28PM +0000, Liang, Kan wrote:
> > > > > I think we should make this optional/configurable like the rest of the aux
> > > > > events, like below..
> > > > > 
> > > > 
> > > > The overhead logging only happens when the event is going to be disabled
> > > > or the task is scheduled out. It should not be frequent or expensive.
> > > > 
> > > > Peter,
> > > > 
> > > > What do you think?
> > > > Should we make it configurable?
> > > 
> > > Is there a downside to having it optional?
> > 
> > It would be good to always have at least a one-line overhead summary in the
> > default output. Then if someone sends you a perf report output file that shows
> > suspicious overhead, you can investigate.
> 
> Sure, but that's a tool thing, totally irrelevant for the kernel.

It requires enabling the overhead records by default.

If there is real concern about not supporting old tools, it would need to be
a disable option instead of an enable option.

-Andi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-24 16:19   ` Mark Rutland
@ 2016-11-24 19:02     ` Peter Zijlstra
  2016-11-24 19:40     ` Liang, Kan
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-24 19:02 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kan.liang, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, adrian.hunter, wangnan0, andi

On Thu, Nov 24, 2016 at 04:19:09PM +0000, Mark Rutland wrote:
> >  static void
> >  event_sched_out(struct perf_event *event,
> >  		  struct perf_cpu_context *cpuctx,
> > -		  struct perf_event_context *ctx)
> > +		  struct perf_event_context *ctx,
> > +		  bool log_overhead)
> 
> Boolean parameter are always confusing. Why not pass the flags directly? That
> way we can pass *which* overhead to log, and make the callsites easier to
> understand.
> 
> >  	event->tstamp_stopped = tstamp;
> > -	event->pmu->del(event, 0);
> > +	event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);
> 
> ... which we could pass on here.
> 
> > @@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
> >  static void
> >  group_sched_out(struct perf_event *group_event,
> >  		struct perf_cpu_context *cpuctx,
> > -		struct perf_event_context *ctx)
> > +		struct perf_event_context *ctx,
> > +		bool log_overhead)
> 
> Likewise.
> 
> > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
> >  {
> >  	unsigned long flags = (unsigned long)info;
> >  
> > -	event_sched_out(event, cpuctx, ctx);
> > +	event_sched_out(event, cpuctx, ctx, false);
> >  	if (flags & DETACH_GROUP)
> >  		perf_group_detach(event);
> >  	list_del_event(event, ctx);
> > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
> >  	update_cgrp_time_from_event(event);
> >  	update_group_times(event);
> >  	if (event == event->group_leader)
> > -		group_sched_out(event, cpuctx, ctx);
> > +		group_sched_out(event, cpuctx, ctx, true);
> >  	else
> > -		event_sched_out(event, cpuctx, ctx);
> > +		event_sched_out(event, cpuctx, ctx, true);
> 
> Why does this differ from __perf_remove_from_context()?

So I'm not a great fan of sprinkling all this through the core code at
all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD
  2016-11-24 19:02             ` Andi Kleen
@ 2016-11-24 19:08               ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-11-24 19:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Liang, Kan, Jiri Olsa, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland

On Thu, Nov 24, 2016 at 11:02:16AM -0800, Andi Kleen wrote:
> > Sure, but that's a tool thing, totally irrelevant for the kernel.
> 
> It requires enabling the overhead records by default.

by the tool, not the kernel.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-24 16:19   ` Mark Rutland
  2016-11-24 19:02     ` Peter Zijlstra
@ 2016-11-24 19:40     ` Liang, Kan
  2016-11-24 23:26       ` Namhyung Kim
  1 sibling, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 19:40 UTC (permalink / raw)
  To: Mark Rutland
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, andi



> > @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd,
> struct pt_regs *regs)
> >  	start_clock = sched_clock();
> >  	ret = x86_pmu.handle_irq(regs);
> >  	finish_clock = sched_clock();
> > +	clock = finish_clock - start_clock;
> >
> > -	perf_sample_event_took(finish_clock - start_clock);
> > +	perf_caculate_nmi_overhead(clock);
> > +	perf_sample_event_took(clock);
> 
> Ah, so it's the *sampling* overhead, not the NMI overhead.
> 
> This doesn't take into account the cost of entering/exiting the handler,
> which could be larger than the sampling overhead (e.g. if the PMU is
> connected through chained interrupt controllers).
> 
> >  enum perf_record_overhead_type {
> > +	PERF_NMI_OVERHEAD	= 0,
> 
> As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this

I think PERF_NMI stands for the NMI overhead in the perf part.

PERF_SAMPLE_OVERHEAD looks too generic to me. It sounds like the sum of all
overheads in sampling. After all, we collect the overhead at different stages
of sampling: the NMI handler, multiplexing, side-band events...


> doesn't count the entire cost of the NMI, and other architectures may want
> to implement this, yet don't have NMI.
>

I think I can change it to PERF_X86_NMI_OVERHEAD, if you think that's clearer.
Other architectures can implement their own overhead types and simply ignore
the NMI one.


 
> > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event
> > *event,  {
> >  	unsigned long flags = (unsigned long)info;
> >
> > -	event_sched_out(event, cpuctx, ctx);
> > +	event_sched_out(event, cpuctx, ctx, false);
> >  	if (flags & DETACH_GROUP)
> >  		perf_group_detach(event);
> >  	list_del_event(event, ctx);
> > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct
> perf_event *event,
> >  	update_cgrp_time_from_event(event);
> >  	update_group_times(event);
> >  	if (event == event->group_leader)
> > -		group_sched_out(event, cpuctx, ctx);
> > +		group_sched_out(event, cpuctx, ctx, true);
> >  	else
> > -		event_sched_out(event, cpuctx, ctx);
> > +		event_sched_out(event, cpuctx, ctx, true);
> 
> Why does this differ from __perf_remove_from_context()?
> 

Both of them are called when removing an event, so I think we only need to
log the overhead in one place.

I just ran some tests. It looks like __perf_remove_from_context is called
after __perf_event_disable, so I will log the overhead in
__perf_remove_from_context in the next version.


> What's the policy for when we do or do not measure overhead?

Currently, it's enabled all the time. Jirka suggested making it configurable,
which I will do in the next version. I still prefer to have it enabled by
default, since it doesn't bring additional overhead based on my tests.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 04/14] perf/x86: output side-band events overhead
  2016-11-24 16:21   ` Mark Rutland
@ 2016-11-24 19:40     ` Liang, Kan
  0 siblings, 0 replies; 58+ messages in thread
From: Liang, Kan @ 2016-11-24 19:40 UTC (permalink / raw)
  To: Mark Rutland
  Cc: peterz, mingo, acme, linux-kernel, alexander.shishkin, tglx,
	namhyung, jolsa, Hunter, Adrian, wangnan0, andi



> 
> On Wed, Nov 23, 2016 at 04:44:42AM -0500, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > Iterating all events which need to receive side-band events also brings
> > some overhead.
> > Save the overhead information in task context or CPU context,
> > whichever context is available.
> 
> Do we really want to expose this concept to userspace?
> 
> What if the implementation changes?

Will the concept of side-band events be removed? I thought we would just
replace the list with an rb-tree. No matter how we implement it, we do need
to account for its overhead, unless the concept goes away entirely or is
merged with another overhead type, because in my tests it brings significant
overhead in some cases.

Thanks,
Kan

> 
> Thanks,
> Mark.
> 
> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> > ---
> >  include/linux/perf_event.h      |  2 ++
> >  include/uapi/linux/perf_event.h |  1 +
> >  kernel/events/core.c            | 32 ++++++++++++++++++++++++++++----
> >  3 files changed, 31 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index f72b97a..ec3cb7f 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -764,6 +764,8 @@ struct perf_event_context {  #endif
> >  	void				*task_ctx_data; /* pmu specific data
> */
> >  	struct rcu_head			rcu_head;
> > +
> > +	struct perf_overhead_entry	sb_overhead;
> >  };
> >
> >  /*
> > diff --git a/include/uapi/linux/perf_event.h
> > b/include/uapi/linux/perf_event.h index 9124c7c..5e7c522 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -994,6 +994,7 @@ struct perf_branch_entry {  enum
> > perf_record_overhead_type {
> >  	PERF_NMI_OVERHEAD	= 0,
> >  	PERF_MUX_OVERHEAD,
> > +	PERF_SB_OVERHEAD,
> >
> >  	PERF_OVERHEAD_MAX,
> >  };
> > diff --git a/kernel/events/core.c b/kernel/events/core.c index
> > 9934059..51e9df7 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1829,9 +1829,15 @@ event_sched_out(struct perf_event *event,
> >  	if (event->attr.exclusive || !cpuctx->active_oncpu)
> >  		cpuctx->exclusive = 0;
> >
> > -	if (log_overhead && cpuctx->mux_overhead.nr) {
> > -		cpuctx->mux_overhead.cpu = smp_processor_id();
> > -		perf_log_overhead(event, PERF_MUX_OVERHEAD, &cpuctx-
> >mux_overhead);
> > +	if (log_overhead) {
> > +		if (cpuctx->mux_overhead.nr) {
> > +			cpuctx->mux_overhead.cpu = smp_processor_id();
> > +			perf_log_overhead(event, PERF_MUX_OVERHEAD,
> &cpuctx->mux_overhead);
> > +		}
> > +		if (ctx->sb_overhead.nr) {
> > +			ctx->sb_overhead.cpu = smp_processor_id();
> > +			perf_log_overhead(event, PERF_SB_OVERHEAD,
> &ctx->sb_overhead);
> > +		}
> >  	}
> >
> >  	perf_pmu_enable(event->pmu);
> > @@ -6133,6 +6139,14 @@ static void perf_iterate_sb_cpu(perf_iterate_f
> output, void *data)
> >  	}
> >  }
> >
> > +static void
> > +perf_caculate_sb_overhead(struct perf_event_context *ctx,
> > +			  u64 time)
> > +{
> > +	ctx->sb_overhead.nr++;
> > +	ctx->sb_overhead.time += time;
> > +}
> > +
> >  /*
> >   * Iterate all events that need to receive side-band events.
> >   *
> > @@ -6143,9 +6157,12 @@ static void
> >  perf_iterate_sb(perf_iterate_f output, void *data,
> >  	       struct perf_event_context *task_ctx)  {
> > +	struct perf_event_context *overhead_ctx = task_ctx;
> >  	struct perf_event_context *ctx;
> > +	u64 start_clock, end_clock;
> >  	int ctxn;
> >
> > +	start_clock = perf_clock();
> >  	rcu_read_lock();
> >  	preempt_disable();
> >
> > @@ -6163,12 +6180,19 @@ perf_iterate_sb(perf_iterate_f output, void
> > *data,
> >
> >  	for_each_task_context_nr(ctxn) {
> >  		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> > -		if (ctx)
> > +		if (ctx) {
> >  			perf_iterate_ctx(ctx, output, data, false);
> > +			if (!overhead_ctx)
> > +				overhead_ctx = ctx;
> > +		}
> >  	}
> >  done:
> >  	preempt_enable();
> >  	rcu_read_unlock();
> > +
> > +	end_clock = perf_clock();
> > +	if (overhead_ctx)
> > +		perf_caculate_sb_overhead(overhead_ctx, end_clock -
> start_clock);
> >  }
> >
> >  /*
> > --
> > 2.5.5
> >

^ permalink raw reply	[flat|nested] 58+ messages in thread
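The flush path discussed above, where event_sched_out() emits the accumulated entries, can also be modeled. This is an illustrative Python sketch, assuming simplified names; the only behavior taken from the patch is that entries are emitted only when logging is requested, only when something was accumulated, and that the CPU is stamped at flush time:

```python
class OverheadEntry:
    """Accumulated overhead for one source (NMI, multiplexing, side-band)."""
    def __init__(self):
        self.nr = 0      # accumulated invocation count
        self.time = 0    # accumulated nanoseconds
        self.cpu = -1    # stamped when the entry is flushed

def flush_overhead(entries, log_overhead, cpu, emit):
    """Models the flush in event_sched_out(): each entry is emitted at most
    once per flush, only if logging was requested and the entry is nonzero."""
    if not log_overhead:
        return 0
    emitted = 0
    for kind, entry in entries.items():
        if entry.nr:
            entry.cpu = cpu          # smp_processor_id() in the kernel
            emit(kind, entry)
            emitted += 1
    return emitted
```

This matches the stated goal of the series: to keep the logging itself cheap, overhead records are only written when the event is going away, not on every sample.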

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-24 15:27       ` Jiri Olsa
@ 2016-11-24 23:20         ` Namhyung Kim
  2016-11-24 23:45           ` Jiri Olsa
  2016-11-25  0:21         ` Andi Kleen
  1 sibling, 1 reply; 58+ messages in thread
From: Namhyung Kim @ 2016-11-24 23:20 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Liang, Kan, peterz, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, jolsa, Hunter, Adrian, wangnan0,
	mark.rutland, andi

Hi,

On Thu, Nov 24, 2016 at 04:27:21PM +0100, Jiri Olsa wrote:
> On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
> > 
> > 
> > > 
> > > On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> > > > From: Kan Liang <kan.liang@intel.com>
> > > >
> > > > Calculate the total NMI overhead on each CPU, and display it in the
> > > > perf report
> > > 
> > > so the output looks like this:
> > > 
> > > ---
> > > # Elapsed time: 1720167944 ns
> > > # Overhead:
> > > #       CPU 6
> > > #               NMI#: 27 time: 111379 ns
> > > #               Multiplexing#: 0 time: 0 ns
> > > #               SB#: 57 time: 90045 ns
> > > #
> > > # Samples: 26  of event 'cycles:u'
> > > # Event count (approx.): 1677531
> > > #
> > > # Overhead  Command  Shared Object     Symbol
> > > # ........  .......  ................  .......................
> > > #
> > >     24.20%  ls       ls                [.] _init
> > >     17.18%  ls       libc-2.24.so      [.] __strcoll_l
> > >     11.85%  ls       ld-2.24.so        [.] _dl_relocate_object
> > > ---
> 
> > how about we display the overhead information the same way as the main perf output:
> 
>   CPU    NMI   NMI time    MTX  MTX time      SB   SB time
>   ...  .....   ........  .....  ........  ......  ........
>     6     27     111379      0         0      57     90045
> 
> 
> > it would just be a matter of adding new sort objects

How would you connect those to hist entries then?  It'd be possible if
the sort key had 'cpu' only, no?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 58+ messages in thread
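Jiri's proposed columnar layout quoted above is straightforward to prototype. A hedged Python sketch (column names and widths follow the example in the thread; the formatting itself is an assumption, not the tool's actual implementation):

```python
def render_overhead_table(rows):
    """Render one row per CPU with count and time (ns) columns for each
    overhead source: NMI, multiplexing (MTX), and side-band (SB)."""
    header = "  CPU    NMI   NMI time    MTX  MTX time      SB   SB time"
    dots   = "  ...  .....   ........  .....  ........  ......  ........"
    lines = [header, dots]
    for cpu, nmi, nmi_t, mtx, mtx_t, sb, sb_t in rows:
        lines.append(f"{cpu:>5}{nmi:>7}{nmi_t:>11}{mtx:>7}{mtx_t:>10}{sb:>8}{sb_t:>10}")
    return "\n".join(lines)

print(render_overhead_table([(6, 27, 111379, 0, 0, 57, 90045)]))
```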

* Re: [PATCH 02/14] perf/x86: output NMI overhead
  2016-11-24 19:40     ` Liang, Kan
@ 2016-11-24 23:26       ` Namhyung Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Namhyung Kim @ 2016-11-24 23:26 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mark Rutland, peterz, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, jolsa, Hunter, Adrian, wangnan0, andi

On Thu, Nov 24, 2016 at 07:40:21PM +0000, Liang, Kan wrote:
> 
> 
> > > @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd,
> > struct pt_regs *regs)
> > >  	start_clock = sched_clock();
> > >  	ret = x86_pmu.handle_irq(regs);
> > >  	finish_clock = sched_clock();
> > > +	clock = finish_clock - start_clock;
> > >
> > > -	perf_sample_event_took(finish_clock - start_clock);
> > > +	perf_caculate_nmi_overhead(clock);
> > > +	perf_sample_event_took(clock);
> > 
> > Ah, so it's the *sampling* overhead, not the NMI overhead.
> > 
> > This doesn't take into account the cost of entering/exiting the handler,
> > which could be larger than the sampling overhead (e.g. if the PMU is
> > connected through chained interrupt controllers).
> > 
> > >  enum perf_record_overhead_type {
> > > +	PERF_NMI_OVERHEAD	= 0,
> > 
> > As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this
> 
> I think PERF_NMI stands for the NMI overhead in the perf part.
> 
> PERF_SAMPLE_OVERHEAD looks too generic to me. It sounds like the sum of all
> overheads in sampling. After all, we collect the overhead at different stages
> of sampling: the NMI handler, multiplexing, side-band events...
> 
> 
> > doesn't count the entire cost of the NMI, and other architectures may want
> > to implement this, yet don't have NMI.
> >
> 
> I think I can change it to PERF_X86_NMI_OVERHEAD, if you think that's clearer.
> Other architectures can implement their own overhead types and simply ignore
> the NMI one.

I think it'd be better to make it arch-agnostic if possible.  What
about PERF_PMU_OVERHEAD or PERF_PMU_SAMPLE_OVERHEAD?

Thanks,
Namhyung

> 
> 
>  
> > > @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event
> > > *event,  {
> > >  	unsigned long flags = (unsigned long)info;
> > >
> > > -	event_sched_out(event, cpuctx, ctx);
> > > +	event_sched_out(event, cpuctx, ctx, false);
> > >  	if (flags & DETACH_GROUP)
> > >  		perf_group_detach(event);
> > >  	list_del_event(event, ctx);
> > > @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct
> > perf_event *event,
> > >  	update_cgrp_time_from_event(event);
> > >  	update_group_times(event);
> > >  	if (event == event->group_leader)
> > > -		group_sched_out(event, cpuctx, ctx);
> > > +		group_sched_out(event, cpuctx, ctx, true);
> > >  	else
> > > -		event_sched_out(event, cpuctx, ctx);
> > > +		event_sched_out(event, cpuctx, ctx, true);
> > 
> > Why does this differ from __perf_remove_from_context()?
> > 
> 
> Both of them are called when removing an event, so I think we only need to
> log the overhead in one place.
> 
> I just ran some tests. It looks like __perf_remove_from_context is called
> after __perf_event_disable, so I will log the overhead in
> __perf_remove_from_context in the next version.
> 
> 
> > What's the policy for when we do or do not measure overhead?
> 
> Currently, it's enabled all the time. Jirka suggested making it configurable,
> which I will do in the next version. I still prefer to have it enabled by
> default, since it doesn't bring additional overhead based on my tests.
> 
> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-24 23:20         ` Namhyung Kim
@ 2016-11-24 23:45           ` Jiri Olsa
  0 siblings, 0 replies; 58+ messages in thread
From: Jiri Olsa @ 2016-11-24 23:45 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Liang, Kan, peterz, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, jolsa, Hunter, Adrian, wangnan0,
	mark.rutland, andi

On Fri, Nov 25, 2016 at 08:20:13AM +0900, Namhyung Kim wrote:
> Hi,
> 
> On Thu, Nov 24, 2016 at 04:27:21PM +0100, Jiri Olsa wrote:
> > On Thu, Nov 24, 2016 at 01:37:04PM +0000, Liang, Kan wrote:
> > > 
> > > 
> > > > 
> > > > On Wed, Nov 23, 2016 at 04:44:44AM -0500, kan.liang@intel.com wrote:
> > > > > From: Kan Liang <kan.liang@intel.com>
> > > > >
> > > > > Calculate the total NMI overhead on each CPU, and display it in the
> > > > > perf report
> > > > 
> > > > so the output looks like this:
> > > > 
> > > > ---
> > > > # Elapsed time: 1720167944 ns
> > > > # Overhead:
> > > > #       CPU 6
> > > > #               NMI#: 27 time: 111379 ns
> > > > #               Multiplexing#: 0 time: 0 ns
> > > > #               SB#: 57 time: 90045 ns
> > > > #
> > > > # Samples: 26  of event 'cycles:u'
> > > > # Event count (approx.): 1677531
> > > > #
> > > > # Overhead  Command  Shared Object     Symbol
> > > > # ........  .......  ................  .......................
> > > > #
> > > >     24.20%  ls       ls                [.] _init
> > > >     17.18%  ls       libc-2.24.so      [.] __strcoll_l
> > > >     11.85%  ls       ld-2.24.so        [.] _dl_relocate_object
> > > > ---
> > 
> > how about we display the overhead information the same way as the main perf output:
> > 
> >   CPU    NMI   NMI time    MTX  MTX time      SB   SB time
> >   ...  .....   ........  .....  ........  ......  ........
> >     6     27     111379      0         0      57     90045
> > 
> > 
> > it would just be a matter of adding new sort objects
> 
> How would you connect those to hist entries then?  It'd be possible if
> the sort key had 'cpu' only, no?

right, I should have said fields then..

jirka

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/14] perf tools: show NMI overhead
  2016-11-24 15:27       ` Jiri Olsa
  2016-11-24 23:20         ` Namhyung Kim
@ 2016-11-25  0:21         ` Andi Kleen
  1 sibling, 0 replies; 58+ messages in thread
From: Andi Kleen @ 2016-11-25  0:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Liang, Kan, peterz, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland, andi

> how about we display the overhead information the same way as the main perf output:
> 
>   CPU    NMI   NMI time    MTX  MTX time      SB   SB time
>   ...  .....   ........  .....  ........  ......  ........
>     6     27     111379      0         0      57     90045
> 
> 
> it would just be a matter of adding new sort objects

The problem with making overhead a standard sort key is that you have
to choose between an output format that makes sense for overhead
and one that makes sense for normal samples.

But overhead is more "auxiliary" information, so it should be possible
to access it together with normal sampling information in a single
output file.

So I think it's better handled separately.

-Andi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 13/14] perf tools: warn on high overhead
  2016-11-23 22:03     ` Liang, Kan
@ 2016-11-25 20:42       ` Andi Kleen
  0 siblings, 0 replies; 58+ messages in thread
From: Andi Kleen @ 2016-11-25 20:42 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Andi Kleen, peterz, mingo, acme, linux-kernel,
	alexander.shishkin, tglx, namhyung, jolsa, Hunter, Adrian,
	wangnan0, mark.rutland

On Wed, Nov 23, 2016 at 10:03:24PM +0000, Liang, Kan wrote:
> > Perhaps we need two separate metrics here:
> > 
> > - cost of perf record on its CPU (or later on if it gets multi threaded
> >   more multiple). Warn if this is >50% or so.
> 
> What's the formula for the cost of perf record on its CPU?
> Does the cost include only user-space overhead, or all overhead?
> What is the divisor?

It would be all the overhead in the process. Accounting for overhead in
kernel threads or interrupts caused by IO is difficult; we could leave
that out for now.

Sum of:
     For each perf thread:
         thread cpu time / monotonic wall time

I guess a sum is better than an average here because the perf threads are
likely running (or could be) on the same CPU. If perf record were changed to
flush buffers more aggressively on the local CPUs this would need to change,
but I presume it's good enough for now.

> 
> 
> > - average perf collection overhead on a CPU. The 10% threshold here
> >   seems appropiate.
> For the average, do you mean adding all the overheads across CPUs together
> and dividing by the number of CPUs?

Right. Possibly also max of all too.

> 
> To calculate the rate, the divisor is wall clock time, right?

monotonic wall clock time yes.

-Andi

^ permalink raw reply	[flat|nested] 58+ messages in thread
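The two metrics Andi describes above can be written down concretely. A minimal Python sketch, using hypothetical numbers; the ~50% and 10% thresholds come from the thread, everything else is illustrative:

```python
def perf_process_cost(thread_cpu_ns, wall_ns):
    """Metric 1: total CPU time of all perf record threads as a fraction of
    monotonic wall time. A sum, not an average, since the threads may share
    (or contend for) the same CPU."""
    return sum(thread_cpu_ns) / wall_ns

def avg_cpu_overhead(per_cpu_overhead_ns, wall_ns):
    """Metric 2: per-CPU collection overhead, averaged over all CPUs, as a
    fraction of monotonic wall time."""
    return sum(per_cpu_overhead_ns) / len(per_cpu_overhead_ns) / wall_ns

# Hypothetical figures: two perf threads, four CPUs, one second of wall time.
WALL_NS = 1_000_000_000
cost = perf_process_cost([300_000_000, 250_000_000], WALL_NS)
rate = avg_cpu_overhead([40_000_000, 0, 80_000_000, 40_000_000], WALL_NS)
warn_record = cost > 0.50   # warn when perf record consumes most of a CPU
warn_cpus   = rate > 0.10   # the 10% threshold for per-CPU collection overhead
```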

end of thread, other threads:[~2016-11-25 20:42 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-23  9:44 [PATCH 00/14] export perf overheads information kan.liang
2016-11-23  9:44 ` [PATCH 01/14] perf/x86: Introduce PERF_RECORD_OVERHEAD kan.liang
2016-11-23 20:11   ` Peter Zijlstra
2016-11-23 20:13   ` Peter Zijlstra
2016-11-23 23:41   ` Jiri Olsa
2016-11-24 13:45     ` Liang, Kan
2016-11-24 13:50       ` Peter Zijlstra
2016-11-24 13:56         ` Liang, Kan
2016-11-24 14:27           ` Jiri Olsa
2016-11-24 14:39             ` Liang, Kan
2016-11-24 14:47               ` Jiri Olsa
2016-11-24 18:28         ` Andi Kleen
2016-11-24 18:58           ` Peter Zijlstra
2016-11-24 19:02             ` Andi Kleen
2016-11-24 19:08               ` Peter Zijlstra
2016-11-23  9:44 ` [PATCH 02/14] perf/x86: output NMI overhead kan.liang
2016-11-23 20:06   ` Peter Zijlstra
2016-11-24 16:19   ` Mark Rutland
2016-11-24 19:02     ` Peter Zijlstra
2016-11-24 19:40     ` Liang, Kan
2016-11-24 23:26       ` Namhyung Kim
2016-11-23  9:44 ` [PATCH 03/14] perf/x86: output multiplexing overhead kan.liang
2016-11-23 20:06   ` Peter Zijlstra
2016-11-23 20:09     ` Liang, Kan
2016-11-23  9:44 ` [PATCH 04/14] perf/x86: output side-band events overhead kan.liang
2016-11-23 20:06   ` Peter Zijlstra
2016-11-24 16:21   ` Mark Rutland
2016-11-24 19:40     ` Liang, Kan
2016-11-23  9:44 ` [PATCH 05/14] perf tools: handle PERF_RECORD_OVERHEAD record type kan.liang
2016-11-23 22:35   ` Jiri Olsa
2016-11-23 22:58     ` Jiri Olsa
2016-11-23  9:44 ` [PATCH 06/14] perf tools: show NMI overhead kan.liang
2016-11-23 22:51   ` Jiri Olsa
2016-11-24 13:37     ` Liang, Kan
2016-11-24 15:27       ` Jiri Olsa
2016-11-24 23:20         ` Namhyung Kim
2016-11-24 23:45           ` Jiri Olsa
2016-11-25  0:21         ` Andi Kleen
2016-11-23 22:52   ` Jiri Olsa
2016-11-23 22:52   ` Jiri Olsa
2016-11-23  9:44 ` [PATCH 07/14] perf tools: show multiplexing overhead kan.liang
2016-11-23  9:44 ` [PATCH 08/14] perf tools: show side-band events overhead kan.liang
2016-11-23  9:44 ` [PATCH 09/14] perf tools: make get_nsecs visible for buildin files kan.liang
2016-11-23  9:44 ` [PATCH 10/14] perf tools: introduce PERF_RECORD_USER_OVERHEAD kan.liang
2016-11-23  9:44 ` [PATCH 11/14] perf tools: record write data overhead kan.liang
2016-11-23 23:02   ` Jiri Olsa
2016-11-23 23:06   ` Jiri Olsa
2016-11-23  9:44 ` [PATCH 12/14] perf tools: record elapsed time kan.liang
2016-11-23  9:44 ` [PATCH 13/14] perf tools: warn on high overhead kan.liang
2016-11-23 20:25   ` Andi Kleen
2016-11-23 22:03     ` Liang, Kan
2016-11-25 20:42       ` Andi Kleen
2016-11-23  9:44 ` [PATCH 14/14] perf script: show overhead events kan.liang
2016-11-23 23:17   ` Jiri Olsa
2016-11-23 23:18   ` Jiri Olsa
2016-11-23 23:19   ` Jiri Olsa
2016-11-23 23:22   ` Jiri Olsa
2016-11-24  4:27 ` [PATCH 00/14] export perf overheads information Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).