* [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Andi Kleen @ 2016-05-24 19:52 UTC
  To: acme; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add basic plumbing for TopDown in perf stat

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one of them a fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound means the CPU cannot fetch and decode instructions
fast enough, while BackendBound means computation or memory access
is the bottleneck. BadSpeculation measures slots wasted on
misspeculated work, and Retiring covers slots doing useful work.

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out-of-order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out-of-order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation

These metrics then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
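
For reference, the level 1 formulas (implemented in patch 2/4) are,
computed in slots:

  Retiring       = topdown-slots-retired / topdown-total-slots
  BadSpeculation = (topdown-slots-issued - topdown-slots-retired
                    + topdown-recovery-bubbles) / topdown-total-slots
  FrontendBound  = topdown-fetch-bubbles / topdown-total-slots
  BackendBound   = 1.0 - BadSpeculation - Retiring - FrontendBound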

Add a new --topdown option to enable the events.
When --topdown is specified, set up events for all topdown
events supported by the kernel.
Add topdown-* as a special case to the event parser, as is
needed for all event names containing a '-'.
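
With that in place the new names parse like any other kernel PMU
event, for example (assuming the kernel exports the topdown events
on the cpu PMU):

  perf stat -e topdown-total-slots,topdown-slots-retired -a sleep 1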

The actual code to compute the metrics is in follow-on patches.

v2: Use standard sysctl read function.
v3: Move x86 specific code to arch/
v4: Enable --metric-only implicitly for topdown.
v5: Add --single-thread option to not force per core mode
v6: Fix output order of topdown metrics
v7: Allow combining with -d
v8: Remove --single-thread again
v9: Rename functions, adding arch_ and topdown_.
v10: Expand man page and describe TopDown better
Paste intro into commit description.
Print error when malloc fails.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |  32 +++++++++
 tools/perf/arch/x86/util/Build         |   1 +
 tools/perf/arch/x86/util/group.c       |  27 ++++++++
 tools/perf/builtin-stat.c              | 119 ++++++++++++++++++++++++++++++++-
 tools/perf/util/group.h                |   7 ++
 tools/perf/util/parse-events.l         |   1 +
 6 files changed, 184 insertions(+), 3 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/group.c
 create mode 100644 tools/perf/util/group.h

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 04f23b404bbc..d96ccd4844df 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -204,6 +204,38 @@ Aggregate counts per physical processor for system-wide mode measurements.
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
 
+--topdown::
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU-bound workloads,
+by breaking down the cycles consumed into frontend bound, backend bound,
+bad speculation and retiring.
+
+Frontend bound means that the CPU cannot fetch and decode instructions fast
+enough. Backend bound means that computation or memory access is the
+bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
+mispredictions and similar issues. Retiring means that the CPU computed without
+an apparent bottleneck. The bottleneck is only the real bottleneck
+if the workload is actually bound by the CPU and not by something else.
+
+For best results it is usually a good idea to use it with interval
+mode like -I 1000, as the bottleneck of workloads can change often.
+
+The top down metrics are collected per core instead of per
+CPU thread. Per core mode is automatically enabled
+and -a (global monitoring) is needed, requiring root rights or
+kernel.perf_event_paranoid=-1.
+
+Topdown uses the full Performance Monitoring Unit, and needs
+disabling of the NMI watchdog (as root):
+echo 0 > /proc/sys/kernel/nmi_watchdog
+for best results. Otherwise the bottlenecks may be inconsistent
+on workloads with changing phases.
+
+This enables --metric-only, unless overridden with --no-metric-only.
+
+To interpret the results it usually helps to know on which
+CPUs the workload runs. If needed, the CPUs can be pinned using
+taskset.
 
 EXAMPLES
 --------
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 465970370f3e..4cd8a16b1b7b 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,7 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += group.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/group.c b/tools/perf/arch/x86/util/group.c
new file mode 100644
index 000000000000..37f92aa39a5d
--- /dev/null
+++ b/tools/perf/arch/x86/util/group.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include "api/fs/fs.h"
+#include "util/group.h"
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group, we may get bad results due to multiplexing.
+ */
+bool arch_topdown_check_group(bool *warn)
+{
+	int n;
+
+	if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+		return false;
+	if (n > 0) {
+		*warn = true;
+		return false;
+	}
+	return true;
+}
+
+void arch_topdown_group_warn(void)
+{
+	fprintf(stderr,
+		"nmi_watchdog enabled with topdown. May give wrong results.\n"
+		"Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 715a1128daeb..dab184d86816 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,10 +59,13 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/group.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/group.h"
 #include "asm/bug.h"
 
+#include <api/fs/fs.h>
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
@@ -98,6 +101,15 @@ static const char * transaction_limited_attrs = {
 	"}"
 };
 
+static const char * topdown_attrs[] = {
+	"topdown-total-slots",
+	"topdown-slots-retired",
+	"topdown-recovery-bubbles",
+	"topdown-fetch-bubbles",
+	"topdown-slots-issued",
+	NULL,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct target target = {
@@ -112,6 +124,7 @@ static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
 static bool			transaction_run;
+static bool			topdown_run			= false;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -124,6 +137,7 @@ static unsigned int		initial_delay			= 0;
 static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static bool			metric_only			= false;
+static bool			force_metric_only		= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
 static aggr_get_id_t		aggr_get_id;
@@ -1520,6 +1534,14 @@ static int stat__set_big_num(const struct option *opt __maybe_unused,
 	return 0;
 }
 
+static int enable_metric_only(const struct option *opt __maybe_unused,
+			      const char *s __maybe_unused, int unset)
+{
+	force_metric_only = true;
+	metric_only = !unset;
+	return 0;
+}
+
 static const struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1578,8 +1600,10 @@ static const struct option stat_options[] = {
 		     "aggregate counts per thread", AGGR_THREAD),
 	OPT_UINTEGER('D', "delay", &initial_delay,
 		     "ms to wait before starting measurement after program start"),
-	OPT_BOOLEAN(0, "metric-only", &metric_only,
-			"Only print computed metrics. No raw values"),
+	OPT_CALLBACK_NOOPT(0, "metric-only", &metric_only, NULL,
+			"Only print computed metrics. No raw values", enable_metric_only),
+	OPT_BOOLEAN(0, "topdown", &topdown_run,
+			"measure topdown level 1 statistics"),
 	OPT_END()
 };
 
@@ -1772,12 +1796,62 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	return 0;
 }
 
+static int topdown_filter_events(const char **attr, char **str, bool use_group)
+{
+	int off = 0;
+	int i;
+	int len = 0;
+	char *s;
+
+	for (i = 0; attr[i]; i++) {
+		if (pmu_have_event("cpu", attr[i])) {
+			len += strlen(attr[i]) + 1;
+			attr[i - off] = attr[i];
+		} else
+			off++;
+	}
+	attr[i - off] = NULL;
+
+	*str = malloc(len + 1 + 2);
+	if (!*str)
+		return -1;
+	s = *str;
+	if (i - off == 0) {
+		*s = 0;
+		return 0;
+	}
+	if (use_group)
+		*s++ = '{';
+	for (i = 0; attr[i]; i++) {
+		strcpy(s, attr[i]);
+		s += strlen(s);
+		*s++ = ',';
+	}
+	if (use_group) {
+		s[-1] = '}';
+		*s = 0;
+	} else
+		s[-1] = 0;
+	return 0;
+}
+
+__weak bool arch_topdown_check_group(bool *warn)
+{
+	*warn = false;
+	return false;
+}
+
+__weak void arch_topdown_group_warn(void)
+{
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
  */
 static int add_default_attributes(void)
 {
+	int err;
 	struct perf_event_attr default_attrs0[] = {
 
   { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
@@ -1896,7 +1970,6 @@ static int add_default_attributes(void)
 		return 0;
 
 	if (transaction_run) {
-		int err;
 		if (pmu_have_event("cpu", "cycles-ct") &&
 		    pmu_have_event("cpu", "el-start"))
 			err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1909,6 +1982,46 @@ static int add_default_attributes(void)
 		return 0;
 	}
 
+	if (topdown_run) {
+		char *str = NULL;
+		bool warn = false;
+
+		if (stat_config.aggr_mode != AGGR_GLOBAL &&
+		    stat_config.aggr_mode != AGGR_CORE) {
+			pr_err("top down event configuration requires --per-core mode\n");
+			return -1;
+		}
+		stat_config.aggr_mode = AGGR_CORE;
+		if (nr_cgroups || !target__has_cpu(&target)) {
+			pr_err("top down event configuration requires system-wide mode (-a)\n");
+			return -1;
+		}
+
+		if (!force_metric_only)
+			metric_only = true;
+		if (topdown_filter_events(topdown_attrs, &str,
+				arch_topdown_check_group(&warn)) < 0) {
+			pr_err("Out of memory\n");
+			return -1;
+		}
+		if (topdown_attrs[0] && str) {
+			if (warn)
+				arch_topdown_group_warn();
+			err = parse_events(evsel_list, str, NULL);
+			if (err) {
+				fprintf(stderr,
+					"Cannot set up top down events %s: %d\n",
+					str, err);
+				free(str);
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "System does not support topdown\n");
+			return -1;
+		}
+		free(str);
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (perf_evlist__add_default_attrs(evsel_list, default_attrs0) < 0)
 			return -1;
diff --git a/tools/perf/util/group.h b/tools/perf/util/group.h
new file mode 100644
index 000000000000..116debe7a995
--- /dev/null
+++ b/tools/perf/util/group.h
@@ -0,0 +1,7 @@
+#ifndef GROUP_H
+#define GROUP_H 1
+
+bool arch_topdown_check_group(bool *warn);
+void arch_topdown_group_warn(void);
+
+#endif
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 1477fbc78993..744ebe3fa30f 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -259,6 +259,7 @@ cycles-ct					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 cycles-t					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-loads					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-stores					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
 L1-icache|l1-i|l1i|L1-instruction	|
-- 
2.5.5


* [PATCH 2/4] perf stat: Add computation of TopDown formulas
From: Andi Kleen @ 2016-05-24 19:52 UTC
  To: acme; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.
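
As a worked example of these formulas, here is a small standalone
sketch doing the same arithmetic the td_*() helpers below implement,
with made-up counts for a hypothetical 4-wide core (the numbers are
purely illustrative, not measurements):

  #include <stdio.h>

  int main(void)
  {
  	/* hypothetical raw counts over 1e9 cycles, 4 slots/cycle */
  	double total_slots   = 4.0e9;	/* topdown-total-slots */
  	double slots_issued  = 3.2e9;	/* topdown-slots-issued */
  	double slots_retired = 2.8e9;	/* topdown-slots-retired */
  	double fetch_bubbles = 0.4e9;	/* topdown-fetch-bubbles */
  	double recovery_bub  = 0.2e9;	/* topdown-recovery-bubbles */

  	double bad_spec = (slots_issued - slots_retired +
  			   recovery_bub) / total_slots;
  	double retiring = slots_retired / total_slots;
  	double fe_bound = fetch_bubbles / total_slots;
  	double be_bound = 1.0 - bad_spec - retiring - fe_bound;

  	/* prints: bad spec 15.0%, retiring 70.0%, fe 10.0%, be 5.0% */
  	printf("bad spec %.1f%%, retiring %.1f%%, fe %.1f%%, be %.1f%%\n",
  	       100.0 * bad_spec, 100.0 * retiring,
  	       100.0 * fe_bound, 100.0 * be_bound);
  	return 0;
  }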

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4:
Only print one decimal digit
Fix color printing of one metric
v5: Avoid printing -0.0
v6: Remove extra frontend event lookup
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/util/stat-shadow.c | 162 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c        |   5 ++
 tools/perf/util/stat.h        |   5 ++
 3 files changed, 172 insertions(+)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index fdb71961143e..509ed62a064e 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,11 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
 }
 
 /*
@@ -104,6 +114,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -301,6 +321,107 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, "A Top-Down Method for Performance Analysis and Counters
+ * Architecture", ISPASS 2014
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition means out-of-order execution that is thrown away
+ * (for example branch mispredictions)
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory
+ * Retiring is good execution that is not directly bottlenecked
+ *
+ * The formulas are computed in slots.
+ * A slot is an entry in the pipeline each for the pipeline width
+ * (for example a 4-wide pipeline has 4 slots for each cycle)
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined where possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double sanitize_val(double x)
+{
+	if (x < 0 && x >= -0.02)
+		return 0.0;
+	return x;
+}
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return sanitize_val(bad_spec);
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return sanitize_val(1.0 - sum);
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -308,6 +429,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -450,6 +572,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d0653861..c1ba255f2abe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e786ccc7..c29bb94c48a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 
-- 
2.5.5


* [PATCH 3/4] perf stat: Print topology/time headers with --metric-only
From: Andi Kleen @ 2016-05-24 19:52 UTC
  To: acme; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

When --metric-only is enabled, no headers were printed for the
topology in interval mode. Also, when headers were printed they
ended up on a separate line.

Before:

$ perf stat  --metric-only  -A -I 1000 -a
     1.001038376       frontend cycles idle insn per cycle       stalled cycles per insn branch-misses of all branches
     1.001038376 CPU0     123.54%               0.23                5.29                    7.61%
     1.001038376 CPU1     137.78%               0.24                5.13                   10.07%
     1.001038376 CPU2      64.48%               0.22                5.50                    6.84%

After:

$ perf stat  --metric-only  -A -I 1000 -a
     1.001111114 CPU0      82.46%               0.32                2.60                    7.64%
     1.001111114 CPU1     126.63%               0.02               42.83                    0.15%
     1.001111114 CPU2     193.54%               0.32                2.59                    6.92%

v2: Move all headers on a single line
Reported-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-stat.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index dab184d86816..c01fc7100dad 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1316,7 +1316,7 @@ static int aggr_header_lens[] = {
 	[AGGR_GLOBAL] = 0,
 };
 
-static void print_metric_headers(char *prefix)
+static void print_metric_headers(const char *prefix, bool no_indent)
 {
 	struct perf_stat_output_ctx out;
 	struct perf_evsel *counter;
@@ -1327,7 +1327,7 @@ static void print_metric_headers(char *prefix)
 	if (prefix)
 		fprintf(stat_config.output, "%s", prefix);
 
-	if (!csv_output)
+	if (!csv_output && !no_indent)
 		fprintf(stat_config.output, "%*s",
 			aggr_header_lens[stat_config.aggr_mode], "");
 
@@ -1352,28 +1352,40 @@ static void print_interval(char *prefix, struct timespec *ts)
 
 	sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, csv_sep);
 
-	if (num_print_interval == 0 && !csv_output && !metric_only) {
+	if (num_print_interval == 0 && !csv_output) {
 		switch (stat_config.aggr_mode) {
 		case AGGR_SOCKET:
-			fprintf(output, "#           time socket cpus             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time socket cpus");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_CORE:
-			fprintf(output, "#           time core         cpus             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time core         cpus");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_NONE:
-			fprintf(output, "#           time CPU                counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time CPU");
+			if (!metric_only)
+				fprintf(output, "                counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_THREAD:
-			fprintf(output, "#           time             comm-pid                  counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time             comm-pid");
+			if (!metric_only)
+				fprintf(output, "                  counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_GLOBAL:
 		default:
-			fprintf(output, "#           time             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 		case AGGR_UNSET:
 			break;
 		}
 	}
 
+	if (num_print_interval == 0 && metric_only)
+		print_metric_headers(" ", true);
 	if (++num_print_interval == 25)
 		num_print_interval = 0;
 }
@@ -1442,8 +1454,8 @@ static void print_counters(struct timespec *ts, int argc, const char **argv)
 	if (metric_only) {
 		static int num_print_iv;
 
-		if (num_print_iv == 0)
-			print_metric_headers(prefix);
+		if (num_print_iv == 0 && !interval)
+			print_metric_headers(prefix, false);
 		if (num_print_iv++ == 25)
 			num_print_iv = 0;
 		if (stat_config.aggr_mode == AGGR_GLOBAL && prefix)
-- 
2.5.5


* [PATCH 4/4] perf stat: Add missing aggregation headers for --metric-only CSV
From: Andi Kleen @ 2016-05-24 19:52 UTC
  To: acme; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

In CSV mode --metric-only outputs a header, unlike the other
modes. Previously it did not print headers for the aggregation
columns, so the headers were shifted relative to the actual
values.

Fix this here by outputting the correct headers for CSV.
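
As an illustration (hypothetical header; the metric columns depend
on the events counted), a --per-core -I 1000 -x, run now starts the
header with the aggregation columns:

  time,core,cpus,insn per cycle,branch-misses of all branches,...

instead of beginning directly at the metric names.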

v2: Indent array.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-stat.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c01fc7100dad..5a28976ae374 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1316,6 +1316,14 @@ static int aggr_header_lens[] = {
 	[AGGR_GLOBAL] = 0,
 };
 
+static const char *aggr_header_csv[] = {
+	[AGGR_CORE] 	= 	"core,cpus,",
+	[AGGR_SOCKET] 	= 	"socket,cpus",
+	[AGGR_NONE] 	= 	"cpu,",
+	[AGGR_THREAD] 	= 	"comm-pid,",
+	[AGGR_GLOBAL] 	=	""
+};
+
 static void print_metric_headers(const char *prefix, bool no_indent)
 {
 	struct perf_stat_output_ctx out;
@@ -1330,6 +1338,12 @@ static void print_metric_headers(const char *prefix, bool no_indent)
 	if (!csv_output && !no_indent)
 		fprintf(stat_config.output, "%*s",
 			aggr_header_lens[stat_config.aggr_mode], "");
+	if (csv_output) {
+		if (stat_config.interval)
+			fputs("time,", stat_config.output);
+		fputs(aggr_header_csv[stat_config.aggr_mode],
+			stat_config.output);
+	}
 
 	/* Print metrics headers only */
 	evlist__for_each(evsel_list, counter) {
-- 
2.5.5


* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Arnaldo Carvalho de Melo @ 2016-05-30 16:01 UTC
  To: Andi Kleen; +Cc: jolsa, linux-kernel, Andi Kleen

On Tue, May 24, 2016 at 12:52:36PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add basic plumbing for TopDown in perf stat
> 
> TopDown is intended to replace the frontend cycles idle/
> backend cycles idle metrics in standard perf stat output.
> These metrics are not reliable in many workloads,
> due to out-of-order effects.
> 
> This implements a new --topdown mode in perf stat
> (similar to --transaction) that measures the pipeline
> bottlenecks using standardized formulas. The measurement
> can all be done with 5 counters (one of them a fixed counter).
> 
> The result is four metrics:
> FrontendBound, BackendBound, BadSpeculation, Retiring
> 
> that describe the CPU pipeline behavior on a high level.
> 
> FrontendBound means the CPU cannot fetch and decode instructions
> fast enough, while BackendBound means computation or memory access
> is the bottleneck. BadSpeculation measures slots wasted on
> misspeculated work, and Retiring covers slots doing useful work.
> 
> The full top down methodology has many hierarchical metrics.
> This implementation only supports level 1 which can be
> collected without multiplexing. A full implementation
> of top down on top of perf is available in pmu-tools toplev.
> (http://github.com/andikleen/pmu-tools)
> 
> The current version works on Intel Core CPUs starting
> with Sandy Bridge, and Atom CPUs starting with Silvermont.
> In principle the generic metrics should also be implementable
> on other out-of-order CPUs.

After applying [1/4] (this patch):

[root@jouet linux]# lscpu | grep Model
Model:                 61
Model name:            Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
[root@jouet linux]#

[root@jouet linux]# perf stat --topdown -a usleep 1
System does not support topdown
[root@jouet linux]#

[root@jouet linux]# uname -r
4.6.0+

Which is a Broadwell-U, dual-core, 14 nm.

[root@jouet linux]# echo 0 > /proc/sys/kernel/nmi_watchdog
[root@jouet linux]# perf stat --topdown -a usleep 1
System does not support topdown
[root@jouet linux]# cat /proc/sys/kernel/nmi_watchdog
0
[root@jouet linux]#

Please advise.

- Arnaldo

 

* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Andi Kleen @ 2016-05-30 16:04 UTC
  To: Arnaldo Carvalho de Melo; +Cc: Andi Kleen, jolsa, linux-kernel

> Which is a Broadwell-U, dual-core, 14 nm.
> 
> [root@jouet linux]# echo 0 > /proc/sys/kernel/nmi_watchdog
> [root@jouet linux]# perf stat --topdown -a usleep 1
> System does not support topdown
> [root@jouet linux]# cat /proc/sys/kernel/nmi_watchdog
> 0
> [root@jouet linux]#
> 
> Please advise.

You don't have the kernel patches?

They're in PeterZ's tree, but not yet in tip
https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=perf/core

-Andi


* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Arnaldo Carvalho de Melo @ 2016-05-30 16:19 UTC
  To: Andi Kleen; +Cc: Andi Kleen, jolsa, linux-kernel

On Mon, May 30, 2016 at 09:04:02AM -0700, Andi Kleen wrote:
> > Which is a Broadwell-U, dual-core, 14 nm.
> > 
> > [root@jouet linux]# echo 0 > /proc/sys/kernel/nmi_watchdog
> > [root@jouet linux]# perf stat --topdown -a usleep 1
> > System does not support topdown
> > [root@jouet linux]# cat /proc/sys/kernel/nmi_watchdog
> > 0
> > [root@jouet linux]#
> > 
> > Please advise.
> 
> You don't have the kernel patches?
> 
> They're in PeterZ's tree, but not yet in tip
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=perf/core

oops, will try applying those patches, thanks for clarifying.

- Arnaldo


* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Nilay Vaish @ 2016-06-01 14:24 UTC
  To: Andi Kleen; +Cc: acme, jolsa, Linux Kernel list, Andi Kleen

On 24 May 2016 at 14:52, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 715a1128daeb..dab184d86816 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -59,10 +59,13 @@
>  #include "util/thread.h"
>  #include "util/thread_map.h"
>  #include "util/counts.h"
> +#include "util/group.h"
>  #include "util/session.h"
>  #include "util/tool.h"
> +#include "util/group.h"
>  #include "asm/bug.h"

You have included util/group.h twice.  Is this intentional?

> +static int topdown_filter_events(const char **attr, char **str, bool use_group)
> +{
> +       int off = 0;
> +       int i;
> +       int len = 0;
> +       char *s;
> +
> +       for (i = 0; attr[i]; i++) {
> +               if (pmu_have_event("cpu", attr[i])) {
> +                       len += strlen(attr[i]) + 1;
> +                       attr[i - off] = attr[i];
> +               } else
> +                       off++;
> +       }
> +       attr[i - off] = NULL;
> +
> +       *str = malloc(len + 1 + 2);
> +       if (!*str)
> +               return -1;
> +       s = *str;
> +       if (i - off == 0) {
> +               *s = 0;
> +               return 0;
> +       }

I think we are leaking some memory here.  If i == off, then we set
attr[0] = NULL and do not free the memory allocated to str.


> @@ -1909,6 +1982,46 @@ static int add_default_attributes(void)
>                 return 0;
>         }
>
> +       if (topdown_run) {
> +               char *str = NULL;
> +               bool warn = false;
> +
> +               if (stat_config.aggr_mode != AGGR_GLOBAL &&
> +                   stat_config.aggr_mode != AGGR_CORE) {
> +                       pr_err("top down event configuration requires --per-core mode\n");
> +                       return -1;
> +               }
> +               stat_config.aggr_mode = AGGR_CORE;
> +               if (nr_cgroups || !target__has_cpu(&target)) {
> +                       pr_err("top down event configuration requires system-wide mode (-a)\n");
> +                       return -1;
> +               }
> +
> +               if (!force_metric_only)
> +                       metric_only = true;
> +               if (topdown_filter_events(topdown_attrs, &str,
> +                               arch_topdown_check_group(&warn)) < 0) {
> +                       pr_err("Out of memory\n");
> +                       return -1;
> +               }
> +               if (topdown_attrs[0] && str) {
> +                       if (warn)
> +                               arch_topdown_group_warn();
> +                       err = parse_events(evsel_list, str, NULL);
> +                       if (err) {
> +                               fprintf(stderr,
> +                                       "Cannot set up top down events %s: %d\n",
> +                                       str, err);
> +                               free(str);
> +                               return -1;
> +                       }
> +               } else {
> +                       fprintf(stderr, "System does not support topdown\n");
> +                       return -1;
> +               }
> +               free(str);
> +       }
> +

Continuing with my comment about the memory leak above: if i == off,
topdown_attrs[0] will be NULL.  So we would enter the else portion
here and return -1.  But we never free the string we allocated in
topdown_filter_events().  I think we are leaking some memory,
though it seems to be only about 3 bytes.
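
A minimal sketch of one way to plug this, freeing str on the
unsupported path before returning (the v11 repost below notes
"Free memory before exit"):

		} else {
			fprintf(stderr, "System does not support topdown\n");
			free(str);	/* may hold an empty string here */
			return -1;
		}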

--
Nilay


* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Andi Kleen @ 2016-06-01 14:31 UTC
  To: Nilay Vaish; +Cc: Andi Kleen, acme, jolsa, Linux Kernel list

On Wed, Jun 01, 2016 at 09:24:33AM -0500, Nilay Vaish wrote:
> On 24 May 2016 at 14:52, Andi Kleen <andi@firstfloor.org> wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> >
> > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > index 715a1128daeb..dab184d86816 100644
> > --- a/tools/perf/builtin-stat.c
> > +++ b/tools/perf/builtin-stat.c
> > @@ -59,10 +59,13 @@
> >  #include "util/thread.h"
> >  #include "util/thread_map.h"
> >  #include "util/counts.h"
> > +#include "util/group.h"
> >  #include "util/session.h"
> >  #include "util/tool.h"
> > +#include "util/group.h"
> >  #include "asm/bug.h"
> 
> You have included util/group.h twice.  Is this intentional?

No, that was a mistake, thanks.

> Continuing with my comment from above memory leak, if i == off,
> topdown_attrs[0] will be NULL.  So we would enter the else portion
> here and return -1.  But we never free the string we allocated in the
> function topdown_filter_events().   I think we are leaking some memory
> though seems only about 3 bytes.

Right, however the process immediately exits afterwards, so it gets
cleaned up anyway.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas
From: Nilay Vaish @ 2016-06-01 14:50 UTC
  To: Andi Kleen; +Cc: acme, jolsa, Linux Kernel list, Andi Kleen

On 24 May 2016 at 14:52, Andi Kleen <andi@firstfloor.org> wrote:
> +static double td_be_bound(int ctx, int cpu)
> +{
> +       double sum = (td_fe_bound(ctx, cpu) +
> +                     td_bad_spec(ctx, cpu) +
> +                     td_retiring(ctx, cpu));
> +       if (sum == 0)
> +               return 0;
> +       return sanitize_val(1.0 - sum);
> +}
> +

Can you explain why we need the check on sum?

--
Nilay


* Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas
From: Andi Kleen @ 2016-06-01 14:56 UTC
  To: Nilay Vaish; +Cc: Andi Kleen, acme, jolsa, Linux Kernel list

On Wed, Jun 01, 2016 at 09:50:07AM -0500, Nilay Vaish wrote:
> On 24 May 2016 at 14:52, Andi Kleen <andi@firstfloor.org> wrote:
> > +static double td_be_bound(int ctx, int cpu)
> > +{
> > +       double sum = (td_fe_bound(ctx, cpu) +
> > +                     td_bad_spec(ctx, cpu) +
> > +                     td_retiring(ctx, cpu));
> > +       if (sum == 0)
> > +               return 0;
> > +       return sanitize_val(1.0 - sum);
> > +}
> > +
> 
> Can you explain why we need the check on sum?

You mean the if statement?

Otherwise if nothing was measured it would always report everything backend bound,
which wouldn't be correct.
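
(Concretely: with all five topdown counts at zero, td_fe_bound(),
td_bad_spec() and td_retiring() each return 0, so without the check
td_be_bound() would return sanitize_val(1.0 - 0) = 1.0, i.e. 100%
backend bound, even though nothing was measured.)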

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
From: Andi Kleen @ 2016-06-01 15:24 UTC
  To: Nilay Vaish; +Cc: Andi Kleen, acme, jolsa, Linux Kernel list


Here's an updated patch that addresses the duplicated include and explicitly frees memory.

---

Add basic plumbing for TopDown in perf stat

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one of them a fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound means the CPU cannot fetch and decode instructions
fast enough, while BackendBound means computation or memory access
is the bottleneck. BadSpeculation measures slots wasted on
misspeculated work, and Retiring covers slots doing useful work.

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots   Available slots in the pipeline
topdown-slots-issued          Slots issued into the pipeline
topdown-slots-retired         Slots successfully retired
topdown-fetch-bubbles         Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation

These metrics then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

Add a new --topdown option to enable the events.
When --topdown is specified, set up events for all topdown
events supported by the kernel.
Add topdown-* as a special case to the event parser, as is
needed for all event names containing a dash.

The actual code to compute the metrics is in follow-on patches.

v2: Use standard sysctl read function.
v3: Move x86 specific code to arch/
v4: Enable --metric-only implicitly for topdown.
v5: Add --single-thread option to not force per core mode
v6: Fix output order of topdown metrics
v7: Allow combining with -d
v8: Remove --single-thread again
v9: Rename functions, adding arch_ and topdown_.
v10: Expand man page and describe TopDown better
Paste intro into commit description.
Print error when malloc fails.
v11:
Free memory before exit
Remove duplicate include.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |  32 +++++++++
 tools/perf/arch/x86/util/Build         |   1 +
 tools/perf/arch/x86/util/group.c       |  27 ++++++++
 tools/perf/builtin-stat.c              | 119 ++++++++++++++++++++++++++++++++-
 tools/perf/util/group.h                |   7 ++
 tools/perf/util/parse-events.l         |   1 +
 6 files changed, 184 insertions(+), 3 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/group.c
 create mode 100644 tools/perf/util/group.h

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 04f23b404bbc..d96ccd4844df 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -204,6 +204,38 @@ Aggregate counts per physical processor for system-wide mode measurements.
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
 
+--topdown::
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU bound workloads,
+by breaking down the cycles consumed into frontend bound, backend bound,
+bad speculation and retiring.
+
+Frontend bound means that the CPU cannot fetch and decode instructions fast
+enough. Backend bound means that computation or memory access is the
+bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
+mispredictions and similar issues. Retiring means that the CPU computed without
+an apparent bottleneck. The bottleneck is only the real bottleneck
+if the workload is actually bound by the CPU and not by something else.
+
+For best results it is usually a good idea to use it with interval
+mode like -I 1000, as the bottleneck of workloads can change often.
+
+The top down metrics are collected per core instead of per
+CPU thread. Per core mode is automatically enabled
+and -a (global monitoring) is needed, requiring root rights or
+perf.perf_event_paranoid=-1.
+
+Topdown uses the full Performance Monitoring Unit, and needs
+disabling of the NMI watchdog (as root):
+echo 0 > /proc/sys/kernel/nmi_watchdog
+for best results. Otherwise the bottlenecks may be inconsistent
+on workloads with changing phases.
+
+This enables --metric-only, unless overridden with --no-metric-only.
+
+To interpret the results it is usually necessary to know which
+CPUs the workload runs on. If needed the CPUs can be forced using
+taskset.
 
 EXAMPLES
 --------
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 465970370f3e..4cd8a16b1b7b 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,7 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += group.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/group.c b/tools/perf/arch/x86/util/group.c
new file mode 100644
index 000000000000..37f92aa39a5d
--- /dev/null
+++ b/tools/perf/arch/x86/util/group.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include "api/fs/fs.h"
+#include "util/group.h"
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group we may get bad results due to multiplexing.
+ */
+bool arch_topdown_check_group(bool *warn)
+{
+	int n;
+
+	if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+		return false;
+	if (n > 0) {
+		*warn = true;
+		return false;
+	}
+	return true;
+}
+
+void arch_topdown_group_warn(void)
+{
+	fprintf(stderr,
+		"nmi_watchdog enabled with topdown. May give wrong results.\n"
+		"Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 715a1128daeb..0e7d65969028 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,10 +59,12 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/group.h"
 #include "util/session.h"
 #include "util/tool.h"
 #include "asm/bug.h"
 
+#include <api/fs/fs.h>
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
@@ -98,6 +100,15 @@ static const char * transaction_limited_attrs = {
 	"}"
 };
 
+static const char * topdown_attrs[] = {
+	"topdown-total-slots",
+	"topdown-slots-retired",
+	"topdown-recovery-bubbles",
+	"topdown-fetch-bubbles",
+	"topdown-slots-issued",
+	NULL,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct target target = {
@@ -112,6 +123,7 @@ static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
 static bool			transaction_run;
+static bool			topdown_run			= false;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -124,6 +136,7 @@ static unsigned int		initial_delay			= 0;
 static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static bool			metric_only			= false;
+static bool			force_metric_only		= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
 static aggr_get_id_t		aggr_get_id;
@@ -1520,6 +1533,14 @@ static int stat__set_big_num(const struct option *opt __maybe_unused,
 	return 0;
 }
 
+static int enable_metric_only(const struct option *opt __maybe_unused,
+			      const char *s __maybe_unused, int unset)
+{
+	force_metric_only = true;
+	metric_only = !unset;
+	return 0;
+}
+
 static const struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1578,8 +1599,10 @@ static const struct option stat_options[] = {
 		     "aggregate counts per thread", AGGR_THREAD),
 	OPT_UINTEGER('D', "delay", &initial_delay,
 		     "ms to wait before starting measurement after program start"),
-	OPT_BOOLEAN(0, "metric-only", &metric_only,
-			"Only print computed metrics. No raw values"),
+	OPT_CALLBACK_NOOPT(0, "metric-only", &metric_only, NULL,
+			"Only print computed metrics. No raw values", enable_metric_only),
+	OPT_BOOLEAN(0, "topdown", &topdown_run,
+			"measure topdown level 1 statistics"),
 	OPT_END()
 };
 
@@ -1772,12 +1795,62 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	return 0;
 }
 
+static int topdown_filter_events(const char **attr, char **str, bool use_group)
+{
+	int off = 0;
+	int i;
+	int len = 0;
+	char *s;
+
+	for (i = 0; attr[i]; i++) {
+		if (pmu_have_event("cpu", attr[i])) {
+			len += strlen(attr[i]) + 1;
+			attr[i - off] = attr[i];
+		} else
+			off++;
+	}
+	attr[i - off] = NULL;
+
+	*str = malloc(len + 1 + 2);
+	if (!*str)
+		return -1;
+	s = *str;
+	if (i - off == 0) {
+		*s = 0;
+		return 0;
+	}
+	if (use_group)
+		*s++ = '{';
+	for (i = 0; attr[i]; i++) {
+		strcpy(s, attr[i]);
+		s += strlen(s);
+		*s++ = ',';
+	}
+	if (use_group) {
+		s[-1] = '}';
+		*s = 0;
+	} else
+		s[-1] = 0;
+	return 0;
+}
+
+__weak bool arch_topdown_check_group(bool *warn)
+{
+	*warn = false;
+	return false;
+}
+
+__weak void arch_topdown_group_warn(void)
+{
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
  */
 static int add_default_attributes(void)
 {
+	int err;
 	struct perf_event_attr default_attrs0[] = {
 
   { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
@@ -1896,7 +1969,6 @@ static int add_default_attributes(void)
 		return 0;
 
 	if (transaction_run) {
-		int err;
 		if (pmu_have_event("cpu", "cycles-ct") &&
 		    pmu_have_event("cpu", "el-start"))
 			err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1909,6 +1981,47 @@ static int add_default_attributes(void)
 		return 0;
 	}
 
+	if (topdown_run) {
+		char *str = NULL;
+		bool warn = false;
+
+		if (stat_config.aggr_mode != AGGR_GLOBAL &&
+		    stat_config.aggr_mode != AGGR_CORE) {
+			pr_err("top down event configuration requires --per-core mode\n");
+			return -1;
+		}
+		stat_config.aggr_mode = AGGR_CORE;
+		if (nr_cgroups || !target__has_cpu(&target)) {
+			pr_err("top down event configuration requires system-wide mode (-a)\n");
+			return -1;
+		}
+
+		if (!force_metric_only)
+			metric_only = true;
+		if (topdown_filter_events(topdown_attrs, &str,
+				arch_topdown_check_group(&warn)) < 0) {
+			pr_err("Out of memory\n");
+			return -1;
+		}
+		if (topdown_attrs[0] && str) {
+			if (warn)
+				arch_topdown_group_warn();
+			err = parse_events(evsel_list, str, NULL);
+			if (err) {
+				fprintf(stderr,
+					"Cannot set up top down events %s: %d\n",
+					str, err);
+				free(str);
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "System does not support topdown\n");
+			free(str);
+			return -1;
+		}
+		free(str);
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (perf_evlist__add_default_attrs(evsel_list, default_attrs0) < 0)
 			return -1;
diff --git a/tools/perf/util/group.h b/tools/perf/util/group.h
new file mode 100644
index 000000000000..116debe7a995
--- /dev/null
+++ b/tools/perf/util/group.h
@@ -0,0 +1,7 @@
+#ifndef GROUP_H
+#define GROUP_H 1
+
+bool arch_topdown_check_group(bool *warn);
+void arch_topdown_group_warn(void);
+
+#endif
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 1477fbc78993..744ebe3fa30f 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -259,6 +259,7 @@ cycles-ct					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 cycles-t					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-loads					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-stores					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
 L1-icache|l1-i|l1i|L1-instruction	|
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread
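
For reference, when all five events are exposed by the PMU and grouping
is allowed (nmi_watchdog off), the string that topdown_filter_events()
hands to parse_events() is a brace-wrapped, comma-separated group. A
standalone sketch of the same filter-and-join logic, with
pmu_have_event() stubbed to always succeed (illustrative only; real
CPUs may lack some of the events):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  /* Stub: pretend the PMU exposes every event. */
  static bool pmu_have_event(const char *pmu, const char *name)
  {
          (void)pmu; (void)name;
          return true;
  }

  static const char *topdown_attrs[] = {
          "topdown-total-slots",
          "topdown-slots-retired",
          "topdown-recovery-bubbles",
          "topdown-fetch-bubbles",
          "topdown-slots-issued",
          NULL,
  };

  int main(void)
  {
          char buf[256] = "{";
          int i, n = 0;

          for (i = 0; topdown_attrs[i]; i++) {
                  if (!pmu_have_event("cpu", topdown_attrs[i]))
                          continue;       /* skip missing events */
                  if (n++)
                          strcat(buf, ",");
                  strcat(buf, topdown_attrs[i]);
          }
          strcat(buf, "}");
          printf("%s\n", buf);    /* the string fed to parse_events() */
          return 0;
  }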

* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
  2016-06-01 15:24   ` Andi Kleen
@ 2016-06-02 11:52     ` Nilay Vaish
  0 siblings, 0 replies; 22+ messages in thread
From: Nilay Vaish @ 2016-06-02 11:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, acme, jolsa, Linux Kernel list

This patch looks fine to me.

--
Nilay

On 1 June 2016 at 10:24, Andi Kleen <ak@linux.intel.com> wrote:
>
> Here's an updated patch that addresses the duplicate include issue and
> explicitly frees memory.
>
> <SNIP>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas
  2016-06-01 14:56     ` Andi Kleen
@ 2016-06-02 11:56       ` Nilay Vaish
  2016-06-02 14:26         ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Nilay Vaish @ 2016-06-02 11:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, acme, jolsa, Linux Kernel list

Andi, I am talking about the if statement.  I don't know why it would
happen that nothing got measured; I am guessing you saw it happen.
Maybe we can add a comment in the patch noting that it is possible for
all counter values to be zero, and that therefore we need the if statement.

--
Nilay

On 1 June 2016 at 09:56, Andi Kleen <ak@linux.intel.com> wrote:
> On Wed, Jun 01, 2016 at 09:50:07AM -0500, Nilay Vaish wrote:
>> On 24 May 2016 at 14:52, Andi Kleen <andi@firstfloor.org> wrote:
>> > +static double td_be_bound(int ctx, int cpu)
>> > +{
>> > +       double sum = (td_fe_bound(ctx, cpu) +
>> > +                     td_bad_spec(ctx, cpu) +
>> > +                     td_retiring(ctx, cpu));
>> > +       if (sum == 0)
>> > +               return 0;
>> > +       return sanitize_val(1.0 - sum);
>> > +}
>> > +
>>
>> Can you explain why we need the check on sum?
>
> You mean the if statement?
>
> Otherwise if nothing was measured it would always report everything backend bound,
> which wouldn't be correct.
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] perf stat: Add computation of TopDown formulas
  2016-06-02 11:56       ` Nilay Vaish
@ 2016-06-02 14:26         ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2016-06-02 14:26 UTC (permalink / raw)
  To: Nilay Vaish; +Cc: Andi Kleen, Andi Kleen, acme, jolsa, Linux Kernel list

On Thu, Jun 02, 2016 at 06:56:51AM -0500, Nilay Vaish wrote:
> Andi,  I am talking about the if statement.  I don't know why it would
> happen that nothing got measured.  I am guessing you saw it happen.
> May be we can add a comment in the patch that it is possible that all
> counter values are zero and therefore we need that if statement.

Sure it can happen that nothing got measured, for example if the program
didn't run, or the group didn't get scheduled.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
  2016-05-30 16:19     ` Arnaldo Carvalho de Melo
@ 2016-06-06 14:00       ` Arnaldo Carvalho de Melo
  2016-06-06 14:11         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-06-06 14:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, Jiri Olsa, linux-kernel

Em Mon, May 30, 2016 at 01:19:42PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, May 30, 2016 at 09:04:02AM -0700, Andi Kleen escreveu:
> > > Which is a Broadwell-U, dual-core, 14 nm.
> > > 
> > > [root@jouet linux]# echo 0 > /proc/sys/kernel/nmi_watchdog
> > > [root@jouet linux]# perf stat --topdown -a usleep 1
> > > System does not support topdown
> > > [root@jouet linux]# cat /proc/sys/kernel/nmi_watchdog
> > > 0
> > > [root@jouet linux]#
> > > 
> > > Please advise.
> > 
> > You don't have the kernel patches?
> > 
> > They're in PeterZ's tree, but not yet in tip
> > https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=perf/core
> 
oops, will try applying those patches, thanks for clarifying.

The kernel patches are in tip/perf/core now, so I applied the patches
and they seem to work, but they broke one 'perf test' entry:

# perf test 5
 5: parse events tests                                       : FAILED!

# perf test -v 5
<SNIP>
running test 50 '4:0x6530160/name=numpmu/'
running test 51 'L1-dcache-misses/name=cachepmu/'
running test 0 'cpu/config=10,config1,config2=3,period=1000/u'
running test 1 'cpu/config=1,name=krava/u,cpu/config=2/u'
running test 2 'cpu/config=1,call-graph=fp,time,period=100000/,cpu/config=2,call-graph=no,time=0,period=2000/'
Invalid sysfs entry event=topdown-recovery-bubbles.scale
failed to parse event 'cpu/event=topdown-recovery-bubbles.scale/u', err 1
test child finished with 1
---- end ----
parse events tests: FAILED!
#

Please check that,

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
  2016-06-06 14:00       ` Arnaldo Carvalho de Melo
@ 2016-06-06 14:11         ` Arnaldo Carvalho de Melo
  2016-06-06 14:36           ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-06-06 14:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, Jiri Olsa, linux-kernel

Em Mon, Jun 06, 2016 at 11:00:03AM -0300, Arnaldo Carvalho de Melo escreveu:
> The kernel patches are in tip/perf/core now, so I applied the patches
> and they seem to work, but they broke one 'perf test' entry:
 
> # perf test -v 5
> <SNIP>
> running test 50 '4:0x6530160/name=numpmu/'
> running test 51 'L1-dcache-misses/name=cachepmu/'
> running test 0 'cpu/config=10,config1,config2=3,period=1000/u'
> running test 1 'cpu/config=1,name=krava/u,cpu/config=2/u'
> running test 2 'cpu/config=1,call-graph=fp,time,period=100000/,cpu/config=2,call-graph=no,time=0,period=2000/'
> Invalid sysfs entry event=topdown-recovery-bubbles.scale
> failed to parse event 'cpu/event=topdown-recovery-bubbles.scale/u', err 1
> test child finished with 1
> ---- end ----
> parse events tests: FAILED!
> #

Ok, the test failure is not due to this patchkit; the reason is the new
topdown*.scale files in the kernel. Was there a patch to handle those
that is missing in my branch?

- Arnaldo

^ permalink raw reply	[flat|nested] 22+ messages in thread
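
For context (an assumption about the sysfs ABI, not spelled out in the
thread): each event file under /sys/bus/event_source/devices/cpu/events/
may be accompanied by sibling attribute files such as <event>.scale and
<event>.unit. A test that treats every directory entry as an event name
will therefore try to parse "topdown-recovery-bubbles.scale" as an event
and fail, as above. A sketch of the kind of filtering the fix needs,
skipping any entry containing a dot (hypothetical code, not the actual
patch):

  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *path = "/sys/bus/event_source/devices/cpu/events";
          struct dirent *ent;
          DIR *dir = opendir(path);

          if (!dir)
                  return 1;
          while ((ent = readdir(dir)) != NULL) {
                  /* .scale, .unit etc. are attributes, not event names */
                  if (strchr(ent->d_name, '.'))
                          continue;
                  printf("event: %s\n", ent->d_name);
          }
          closedir(dir);
          return 0;
  }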

* Re: [PATCH 1/4] perf stat: Basic support for TopDown in perf stat
  2016-06-06 14:11         ` Arnaldo Carvalho de Melo
@ 2016-06-06 14:36           ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2016-06-06 14:36 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Andi Kleen, Jiri Olsa, linux-kernel

On Mon, Jun 06, 2016 at 11:11:40AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jun 06, 2016 at 11:00:03AM -0300, Arnaldo Carvalho de Melo escreveu:
> > The kernel patches are in tip/perf/core now, so I applied the patches
> > and they seem to work, but they broke one 'perf test' entry:
>  
> > # perf test -v 5
> > <SNIP>
> > running test 50 '4:0x6530160/name=numpmu/'
> > running test 51 'L1-dcache-misses/name=cachepmu/'
> > running test 0 'cpu/config=10,config1,config2=3,period=1000/u'
> > running test 1 'cpu/config=1,name=krava/u,cpu/config=2/u'
> > running test 2 'cpu/config=1,call-graph=fp,time,period=100000/,cpu/config=2,call-graph=no,time=0,period=2000/'
> > Invalid sysfs entry event=topdown-recovery-bubbles.scale
> > failed to parse event 'cpu/event=topdown-recovery-bubbles.scale/u', err 1
> > test child finished with 1
> > ---- end ----
> > parse events tests: FAILED!
> > #
> 
> Ok, the test failure is not due to this patchkit; the reason is the new
> topdown*.scale files in the kernel. Was there a patch to handle those
> that is missing in my branch?

I sent you a patch separately now that fixes it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:perf/core] perf stat: Basic support for TopDown in perf stat
  2016-05-24 19:52 [PATCH 1/4] perf stat: Basic support for TopDown in perf stat Andi Kleen
                   ` (4 preceding siblings ...)
  2016-06-01 14:24 ` Nilay Vaish
@ 2016-06-08  8:38 ` tip-bot for Andi Kleen
  5 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-06-08  8:38 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, acme, ak, mingo, jolsa, linux-kernel

Commit-ID:  44b1e60ab576c343aa592a2a6c679297cc69740d
Gitweb:     http://git.kernel.org/tip/44b1e60ab576c343aa592a2a6c679297cc69740d
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Mon, 30 May 2016 12:49:42 -0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 6 Jun 2016 17:04:15 -0300

perf stat: Basic support for TopDown in perf stat

Add basic plumbing for TopDown in perf stat

TopDown is intended to replace the frontend cycles idle/ backend cycles
idle metrics in standard perf stat output.  These metrics are not
reliable in many workloads, due to out of order effects.

This implements a new --topdown mode in perf stat (similar to
--transaction) that measures the pipeline bottlenecks using
standardized formulas. The measurement can all be done with 5 counters
(one of them a fixed counter).

The result is four metrics:

FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

The full top down methodology has many hierarchical metrics.  This
implementation only supports level 1 which can be collected without
multiplexing. A full implementation of top down on top of perf is
available in pmu-tools toplev.  (http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting with Sandy Bridge,
and Atom CPUs starting with Silvermont.  In principle the generic
metrics should also be implementable on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which are generic to
out of order CPU cores (although some CPUs may not implement all of
them):

  topdown-total-slots       Available slots in the pipeline
  topdown-slots-issued      Slots issued into the pipeline
  topdown-slots-retired     Slots successfully retired
  topdown-fetch-bubbles     Pipeline gaps in the frontend
  topdown-recovery-bubbles  Pipeline gaps during recovery
                            from misspeculation

These metrics then allow computing four useful metrics:

FrontendBound, BackendBound, Retiring, BadSpeculation.

Add a new --topdown option to enable the events.  When --topdown is
specified, set up events for all topdown events supported by the kernel.
Add topdown-* as a special case to the event parser, as is needed for
all event names containing a dash.

The actual code to compute the metrics is in follow-on patches.

v2: Use standard sysctl read function.
v3: Move x86 specific code to arch/
v4: Enable --metric-only implicitly for topdown.
v5: Add --single-thread option to not force per core mode
v6: Fix output order of topdown metrics
v7: Allow combining with -d
v8: Remove --single-thread again
v9: Rename functions, adding arch_ and topdown_.
v10: Expand man page and describe TopDown better
Paste intro into commit description.
Print error when malloc fails.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/1464119559-17203-1-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-stat.txt |  32 +++++++++
 tools/perf/arch/x86/util/Build         |   1 +
 tools/perf/arch/x86/util/group.c       |  27 ++++++++
 tools/perf/builtin-stat.c              | 119 ++++++++++++++++++++++++++++++++-
 tools/perf/util/group.h                |   7 ++
 tools/perf/util/parse-events.l         |   1 +
 6 files changed, 184 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 04f23b4..d96ccd4 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -204,6 +204,38 @@ Aggregate counts per physical processor for system-wide mode measurements.
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
 
+--topdown::
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU bound workloads,
+by breaking down the cycles consumed into frontend bound, backend bound,
+bad speculation and retiring.
+
+Frontend bound means that the CPU cannot fetch and decode instructions fast
+enough. Backend bound means that computation or memory access is the
+bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
+mispredictions and similar issues. Retiring means that the CPU computed without
+an apparent bottleneck. The bottleneck is only the real bottleneck
+if the workload is actually bound by the CPU and not by something else.
+
+For best results it is usually a good idea to use it with interval
+mode like -I 1000, as the bottleneck of workloads can change often.
+
+The top down metrics are collected per core instead of per
+CPU thread. Per core mode is automatically enabled
+and -a (global monitoring) is needed, requiring root rights or
+perf.perf_event_paranoid=-1.
+
+Topdown uses the full Performance Monitoring Unit, and needs
+disabling of the NMI watchdog (as root):
+echo 0 > /proc/sys/kernel/nmi_watchdog
+for best results. Otherwise the bottlenecks may be inconsistent
+on workloads with changing phases.
+
+This enables --metric-only, unless overridden with --no-metric-only.
+
+To interpret the results it is usually necessary to know which
+CPUs the workload runs on. If needed the CPUs can be forced using
+taskset.
 
 EXAMPLES
 --------
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 4659703..4cd8a16 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,7 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += group.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/group.c b/tools/perf/arch/x86/util/group.c
new file mode 100644
index 0000000..37f92aa
--- /dev/null
+++ b/tools/perf/arch/x86/util/group.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include "api/fs/fs.h"
+#include "util/group.h"
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group we may get bad results due to multiplexing.
+ */
+bool arch_topdown_check_group(bool *warn)
+{
+	int n;
+
+	if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+		return false;
+	if (n > 0) {
+		*warn = true;
+		return false;
+	}
+	return true;
+}
+
+void arch_topdown_group_warn(void)
+{
+	fprintf(stderr,
+		"nmi_watchdog enabled with topdown. May give wrong results.\n"
+		"Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index ee7ada7..fd76bb0 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,10 +59,13 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/group.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/group.h"
 #include "asm/bug.h"
 
+#include <api/fs/fs.h>
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
@@ -98,6 +101,15 @@ static const char * transaction_limited_attrs = {
 	"}"
 };
 
+static const char * topdown_attrs[] = {
+	"topdown-total-slots",
+	"topdown-slots-retired",
+	"topdown-recovery-bubbles",
+	"topdown-fetch-bubbles",
+	"topdown-slots-issued",
+	NULL,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct target target = {
@@ -112,6 +124,7 @@ static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
 static bool			transaction_run;
+static bool			topdown_run			= false;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -124,6 +137,7 @@ static unsigned int		initial_delay			= 0;
 static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static bool			metric_only			= false;
+static bool			force_metric_only		= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
 static aggr_get_id_t		aggr_get_id;
@@ -1520,6 +1534,14 @@ static int stat__set_big_num(const struct option *opt __maybe_unused,
 	return 0;
 }
 
+static int enable_metric_only(const struct option *opt __maybe_unused,
+			      const char *s __maybe_unused, int unset)
+{
+	force_metric_only = true;
+	metric_only = !unset;
+	return 0;
+}
+
 static const struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1578,8 +1600,10 @@ static const struct option stat_options[] = {
 		     "aggregate counts per thread", AGGR_THREAD),
 	OPT_UINTEGER('D', "delay", &initial_delay,
 		     "ms to wait before starting measurement after program start"),
-	OPT_BOOLEAN(0, "metric-only", &metric_only,
-			"Only print computed metrics. No raw values"),
+	OPT_CALLBACK_NOOPT(0, "metric-only", &metric_only, NULL,
+			"Only print computed metrics. No raw values", enable_metric_only),
+	OPT_BOOLEAN(0, "topdown", &topdown_run,
+			"measure topdown level 1 statistics"),
 	OPT_END()
 };
 
@@ -1772,12 +1796,62 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	return 0;
 }
 
+static int topdown_filter_events(const char **attr, char **str, bool use_group)
+{
+	int off = 0;
+	int i;
+	int len = 0;
+	char *s;
+
+	for (i = 0; attr[i]; i++) {
+		if (pmu_have_event("cpu", attr[i])) {
+			len += strlen(attr[i]) + 1;
+			attr[i - off] = attr[i];
+		} else
+			off++;
+	}
+	attr[i - off] = NULL;
+
+	*str = malloc(len + 1 + 2);
+	if (!*str)
+		return -1;
+	s = *str;
+	if (i - off == 0) {
+		*s = 0;
+		return 0;
+	}
+	if (use_group)
+		*s++ = '{';
+	for (i = 0; attr[i]; i++) {
+		strcpy(s, attr[i]);
+		s += strlen(s);
+		*s++ = ',';
+	}
+	if (use_group) {
+		s[-1] = '}';
+		*s = 0;
+	} else
+		s[-1] = 0;
+	return 0;
+}
+
+__weak bool arch_topdown_check_group(bool *warn)
+{
+	*warn = false;
+	return false;
+}
+
+__weak void arch_topdown_group_warn(void)
+{
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
  */
 static int add_default_attributes(void)
 {
+	int err;
 	struct perf_event_attr default_attrs0[] = {
 
   { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
@@ -1896,7 +1970,6 @@ static int add_default_attributes(void)
 		return 0;
 
 	if (transaction_run) {
-		int err;
 		if (pmu_have_event("cpu", "cycles-ct") &&
 		    pmu_have_event("cpu", "el-start"))
 			err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1909,6 +1982,46 @@ static int add_default_attributes(void)
 		return 0;
 	}
 
+	if (topdown_run) {
+		char *str = NULL;
+		bool warn = false;
+
+		if (stat_config.aggr_mode != AGGR_GLOBAL &&
+		    stat_config.aggr_mode != AGGR_CORE) {
+			pr_err("top down event configuration requires --per-core mode\n");
+			return -1;
+		}
+		stat_config.aggr_mode = AGGR_CORE;
+		if (nr_cgroups || !target__has_cpu(&target)) {
+			pr_err("top down event configuration requires system-wide mode (-a)\n");
+			return -1;
+		}
+
+		if (!force_metric_only)
+			metric_only = true;
+		if (topdown_filter_events(topdown_attrs, &str,
+				arch_topdown_check_group(&warn)) < 0) {
+			pr_err("Out of memory\n");
+			return -1;
+		}
+		if (topdown_attrs[0] && str) {
+			if (warn)
+				arch_topdown_group_warn();
+			err = parse_events(evsel_list, str, NULL);
+			if (err) {
+				fprintf(stderr,
+					"Cannot set up top down events %s: %d\n",
+					str, err);
+				free(str);
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "System does not support topdown\n");
+			return -1;
+		}
+		free(str);
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (target__has_cpu(&target))
 			default_attrs0[0].config = PERF_COUNT_SW_CPU_CLOCK;
diff --git a/tools/perf/util/group.h b/tools/perf/util/group.h
new file mode 100644
index 0000000..116debe
--- /dev/null
+++ b/tools/perf/util/group.h
@@ -0,0 +1,7 @@
+#ifndef GROUP_H
+#define GROUP_H 1
+
+bool arch_topdown_check_group(bool *warn);
+void arch_topdown_group_warn(void);
+
+#endif
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 01af1ee..3c15b33 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -260,6 +260,7 @@ cycles-ct					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 cycles-t					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-loads					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-stores					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
 L1-icache|l1-i|l1i|L1-instruction	|

^ permalink raw reply related	[flat|nested] 22+ messages in thread
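
The arch hook wiring in the patch relies on weak symbols:
builtin-stat.c provides __weak fallbacks (no group support, no
warning), and linking in arch/x86/util/group.c silently replaces them
with the real implementations. A minimal two-file sketch of that
linkage mechanism, with made-up names and assuming a GCC/Clang
toolchain on ELF:

  /* generic.c -- generic fallback, like the __weak stubs above */
  __attribute__((weak)) int arch_check(void)
  {
          return 0;       /* generic answer: feature not supported */
  }

  /* x86.c -- if this object is linked in, it overrides the weak stub */
  int arch_check(void)
  {
          return 1;
  }

  /* main.c */
  #include <stdio.h>
  int arch_check(void);

  int main(void)
  {
          /* prints 1 when x86.o is linked, 0 otherwise */
          printf("arch_check() = %d\n", arch_check());
          return 0;
  }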

* [tip:perf/core] perf stat: Add computation of TopDown formulas
  2016-05-24 19:52 ` [PATCH 2/4] perf stat: Add computation of TopDown formulas Andi Kleen
  2016-06-01 14:50   ` Nilay Vaish
@ 2016-06-08  8:39   ` tip-bot for Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-06-08  8:39 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, acme, jolsa, hpa, mingo, ak, tglx

Commit-ID:  239bd47f8355eb5defc865cf408824b6cfeca5dc
Gitweb:     http://git.kernel.org/tip/239bd47f8355eb5defc865cf408824b6cfeca5dc
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Tue, 24 May 2016 12:52:37 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 6 Jun 2016 17:04:16 -0300

perf stat: Add computation of TopDown formulas

Implement the TopDown formulas in 'perf stat'. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed and
output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.

Committer note:

Output example:

  # perf stat --topdown -a usleep 1

   Performance counter stats for 'system wide':

             retiring     bad speculation   frontend bound   backend bound
  S0-C0    2     23.8%       11.6%            28.3%           36.3%
  S0-C1    2     16.2%       15.7%            36.5%           31.6%

         0.000579956 seconds time elapsed
  #

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4: Only print one decimal digit
    Fix color printing of one metric
v5: Avoid printing -0.0
v6: Remove extra frontend event lookup

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/1464119559-17203-2-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/stat-shadow.c | 162 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c        |   5 ++
 tools/perf/util/stat.h        |   5 ++
 3 files changed, 172 insertions(+)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index aa9efe0..8a2bbd2 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,11 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
 }
 
 /*
@@ -105,6 +115,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu],count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -302,6 +322,107 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top Down Method for Performance analysis and Counter architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition means out of order execution that is thrown away
+ * (for example branch mispredictions)
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory
+ * Retiring is good execution that is not directly bottlenecked
+ *
+ * The formulas are computed in slots.
+ * A slot is an entry in the pipeline, one per unit of pipeline width
+ * (for example a 4-wide pipeline has 4 slots for each cycle)
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined, as possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double sanitize_val(double x)
+{
+	if (x < 0 && x >= -0.02)
+		return 0.0;
+	return x;
+}
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return sanitize_val(bad_spec);
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return sanitize_val(1.0 - sum);
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -309,6 +430,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -452,6 +574,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d06..c1ba255 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e78..c29bb94 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 


* [tip:perf/core] perf stat: Print topology/time headers with --metric-only
  2016-05-24 19:52 ` [PATCH 3/4] perf stat: Print topology/time headers with --metric-only Andi Kleen
@ 2016-06-08  8:39   ` tip-bot for Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-06-08  8:39 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, ak, hpa, jolsa, acme, tglx, mingo

Commit-ID:  41c8ca2a924b359e8f1768f8550487cd13a1ec03
Gitweb:     http://git.kernel.org/tip/41c8ca2a924b359e8f1768f8550487cd13a1ec03
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Tue, 24 May 2016 12:52:38 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 6 Jun 2016 17:04:16 -0300

perf stat: Print topology/time headers with --metric-only

When --metric-only is enabled there were no headers for the topology
columns in interval mode.  Also, when headers were printed, they ended
up on a separate line from the topology/time header.

Before:

  $ perf stat  --metric-only  -A -I 1000 -a
    1.001038376     frontend cycles idle insn per cycle  stalled cycles per insn branch-misses of all branches
    1.001038376 CPU0   123.54%               0.23           5.29                    7.61%
    1.001038376 CPU1   137.78%               0.24           5.13                   10.07%
    1.001038376 CPU2    64.48%               0.22           5.50                    6.84%

After:

  $ perf stat  --metric-only  -A -I 1000 -a
    1.001111114 CPU0    82.46%               0.32           2.60                    7.64%
    1.001111114 CPU1   126.63%               0.02          42.83                    0.15%
    1.001111114 CPU2   193.54%               0.32           2.59                    6.92%

v2: Move all headers onto a single line

Reported-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/1464119559-17203-3-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-stat.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index fd76bb0..a168e72 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1316,7 +1316,7 @@ static int aggr_header_lens[] = {
 	[AGGR_GLOBAL] = 0,
 };
 
-static void print_metric_headers(char *prefix)
+static void print_metric_headers(const char *prefix, bool no_indent)
 {
 	struct perf_stat_output_ctx out;
 	struct perf_evsel *counter;
@@ -1327,7 +1327,7 @@ static void print_metric_headers(char *prefix)
 	if (prefix)
 		fprintf(stat_config.output, "%s", prefix);
 
-	if (!csv_output)
+	if (!csv_output && !no_indent)
 		fprintf(stat_config.output, "%*s",
 			aggr_header_lens[stat_config.aggr_mode], "");
 
@@ -1352,28 +1352,40 @@ static void print_interval(char *prefix, struct timespec *ts)
 
 	sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, csv_sep);
 
-	if (num_print_interval == 0 && !csv_output && !metric_only) {
+	if (num_print_interval == 0 && !csv_output) {
 		switch (stat_config.aggr_mode) {
 		case AGGR_SOCKET:
-			fprintf(output, "#           time socket cpus             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time socket cpus");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_CORE:
-			fprintf(output, "#           time core         cpus             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time core         cpus");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_NONE:
-			fprintf(output, "#           time CPU                counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time CPU");
+			if (!metric_only)
+				fprintf(output, "                counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_THREAD:
-			fprintf(output, "#           time             comm-pid                  counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time             comm-pid");
+			if (!metric_only)
+				fprintf(output, "                  counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_GLOBAL:
 		default:
-			fprintf(output, "#           time             counts %*s events\n", unit_width, "unit");
+			fprintf(output, "#           time");
+			if (!metric_only)
+				fprintf(output, "             counts %*s events\n", unit_width, "unit");
 		case AGGR_UNSET:
 			break;
 		}
 	}
 
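+	/*
+	 * In --metric-only mode the metric headers are appended to the
+	 * interval/topology header line started above.
+	 */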
+	if (num_print_interval == 0 && metric_only)
+		print_metric_headers(" ", true);
 	if (++num_print_interval == 25)
 		num_print_interval = 0;
 }
@@ -1442,8 +1454,8 @@ static void print_counters(struct timespec *ts, int argc, const char **argv)
 	if (metric_only) {
 		static int num_print_iv;
 
-		if (num_print_iv == 0)
-			print_metric_headers(prefix);
+		if (num_print_iv == 0 && !interval)
+			print_metric_headers(prefix, false);
 		if (num_print_iv++ == 25)
 			num_print_iv = 0;
 		if (stat_config.aggr_mode == AGGR_GLOBAL && prefix)


* [tip:perf/core] perf stat: Add missing aggregation headers for --metric-only CSV
  2016-05-24 19:52 ` [PATCH 4/4] perf stat: Add missing aggregation headers for --metric-only CSV Andi Kleen
@ 2016-06-08  8:40   ` tip-bot for Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-06-08  8:40 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, ak, linux-kernel, jolsa, hpa, acme

Commit-ID:  c51fd6395d67a6d414834db7f892c95594247d6f
Gitweb:     http://git.kernel.org/tip/c51fd6395d67a6d414834db7f892c95594247d6f
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Tue, 24 May 2016 12:52:39 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 6 Jun 2016 17:43:12 -0300

perf stat: Add missing aggregation headers for --metric-only CSV

In CSV mode --metric-only outputs a header, unlike the other modes.
Previously it did not print headers for the aggregation columns, so the
headers were shifted relative to the actual values.

Fix this here by outputting the correct headers for CSV.
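
For example (a hypothetical run with -I, --per-core and -x,; the metric
names and values are illustrative only), the header row now carries the
time and aggregation columns in front of the metric columns:

  time,core,cpus,insn per cycle,branch-misses of all branches
  1.001038376,S0-C0,2,0.32,7.64%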

v2: Indent array.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/1464119559-17203-4-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-stat.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index a168e72..dff6373 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1316,6 +1316,14 @@ static int aggr_header_lens[] = {
 	[AGGR_GLOBAL] = 0,
 };
 
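+/* CSV header fragments for the aggregation columns, indexed by aggr mode. */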
+static const char *aggr_header_csv[] = {
+	[AGGR_CORE] 	= 	"core,cpus,",
+	[AGGR_SOCKET] 	= 	"socket,cpus,",
+	[AGGR_NONE] 	= 	"cpu,",
+	[AGGR_THREAD] 	= 	"comm-pid,",
+	[AGGR_GLOBAL] 	=	""
+};
+
 static void print_metric_headers(const char *prefix, bool no_indent)
 {
 	struct perf_stat_output_ctx out;
@@ -1330,6 +1338,12 @@ static void print_metric_headers(const char *prefix, bool no_indent)
 	if (!csv_output && !no_indent)
 		fprintf(stat_config.output, "%*s",
 			aggr_header_lens[stat_config.aggr_mode], "");
+	if (csv_output) {
+		if (stat_config.interval)
+			fputs("time,", stat_config.output);
+		fputs(aggr_header_csv[stat_config.aggr_mode],
+			stat_config.output);
+	}
 
 	/* Print metrics headers only */
 	evlist__for_each(evsel_list, counter) {


Thread overview: 22+ messages
2016-05-24 19:52 [PATCH 1/4] perf stat: Basic support for TopDown in perf stat Andi Kleen
2016-05-24 19:52 ` [PATCH 2/4] perf stat: Add computation of TopDown formulas Andi Kleen
2016-06-01 14:50   ` Nilay Vaish
2016-06-01 14:56     ` Andi Kleen
2016-06-02 11:56       ` Nilay Vaish
2016-06-02 14:26         ` Andi Kleen
2016-06-08  8:39   ` [tip:perf/core] " tip-bot for Andi Kleen
2016-05-24 19:52 ` [PATCH 3/4] perf stat: Print topology/time headers with --metric-only Andi Kleen
2016-06-08  8:39   ` [tip:perf/core] " tip-bot for Andi Kleen
2016-05-24 19:52 ` [PATCH 4/4] perf stat: Add missing aggregation headers for --metric-only CSV Andi Kleen
2016-06-08  8:40   ` [tip:perf/core] " tip-bot for Andi Kleen
2016-05-30 16:01 ` [PATCH 1/4] perf stat: Basic support for TopDown in perf stat Arnaldo Carvalho de Melo
2016-05-30 16:04   ` Andi Kleen
2016-05-30 16:19     ` Arnaldo Carvalho de Melo
2016-06-06 14:00       ` Arnaldo Carvalho de Melo
2016-06-06 14:11         ` Arnaldo Carvalho de Melo
2016-06-06 14:36           ` Andi Kleen
2016-06-01 14:24 ` Nilay Vaish
2016-06-01 14:31   ` Andi Kleen
2016-06-01 15:24   ` Andi Kleen
2016-06-02 11:52     ` Nilay Vaish
2016-06-08  8:38 ` [tip:perf/core] " tip-bot for Andi Kleen
