* Add top down metrics to perf stat
@ 2016-05-14  1:44 Andi Kleen
  2016-05-14  1:44 ` [PATCH 1/8] x86/topology: Add topology_max_smt_threads() Andi Kleen
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]
[v5: Rebased on latest tree. Move debug messages to -vv]
[v6: Rebased. Remove .aggr-per-core and --single-thread to not
break old perf binaries. Put SMT enumeration into 
generic topology API.]
[v7: Address review comments. Change patch title headers.]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle /
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one of them a fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior at a high level.

FrontendBound and BackendBound point at where the pipeline
stalls (instruction fetch/decode versus execution and memory).
BadSpeculation measures slots wasted on misspeculated work
that is later thrown away; Retiring is the useful work.

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1, which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev
(http://github.com/andikleen/pmu-tools).

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be
implementable on other out-of-order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then allow computing four useful derived metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.
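
For reference, these are the Level 1 formulas, computed in
slots (they are also documented in the perf stat patch at the
end of this series):

	BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) / TotalSlots
	Retiring       = SlotsRetired / TotalSlots
	FrontendBound  = FetchBubbles / TotalSlots
	BackendBound   = 1.0 - BadSpeculation - Retiring - FrontendBound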


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per
logical CPU on Core CPUs. (Atom CPUs have no Hyper-Threading,
so TopDown is per thread there.)

In this case perf stat automatically enables --per-core mode.
It also requires global mode (-a) and no other filters
(no cgroup mode). One side effect is that this may require root
rights or a kernel.perf_event_paranoid=-1 setting.
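
For example, the setting can be relaxed at runtime with:

	# echo -1 > /proc/sys/kernel/perf_event_paranoid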

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-21


* [PATCH 1/8] x86/topology: Add topology_max_smt_threads()
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 2/8] perf/x86: Support sysfs files depending on SMT status Andi Kleen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen, tglx

From: Andi Kleen <ak@linux.intel.com>

For SMT-specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.

Add a global variable that is updated with the maximum number
of SMT threads on any core on CPU online/offline, and return it
from topology_max_smt_threads().

The single call is easier to use than a loop.

Not exported to user space because user space already can use
the existing sibling interfaces to find this out.
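
A minimal sketch of the intended use, before and after this
patch (the helper names smt_active_old/new are hypothetical;
patch 5 later performs a similar conversion in fixup_ht_bug()):

	/* Before: open-coded scan of the sibling masks */
	static bool smt_active_old(void)
	{
		int cpu, threads = 0;

		for_each_online_cpu(cpu)
			threads = max(threads,
				      cpumask_weight(topology_sibling_cpumask(cpu)));
		return threads > 1;
	}

	/* After: a single global query, no loop */
	static bool smt_active_new(void)
	{
		return topology_max_smt_threads() > 1;
	}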

v2: Code formatting changes and use __ for variable name
v3: Use inlines instead of defines.
Cc: tglx@linutronix.de
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/topology.h |  9 +++++++++
 arch/x86/kernel/smpboot.c       | 25 ++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7f991bd5031b..e346572841a0 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -129,6 +129,14 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
+
+extern int __max_smt_threads;
+
+static inline int topology_max_smt_threads(void)
+{
+	return __max_smt_threads;
+}
+
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 extern int topology_phys_to_logical_pkg(unsigned int pkg);
 #else
@@ -136,6 +144,7 @@ extern int topology_phys_to_logical_pkg(unsigned int pkg);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_max_smt_threads(void) { return 1; }
 #endif
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0e4329ed91ef..b753a559ceb9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -105,6 +105,9 @@ static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 
+/* Maximum number of SMT threads on any online core */
+int __max_smt_threads __read_mostly;
+
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
 {
 	unsigned long flags;
@@ -493,7 +496,7 @@ void set_cpu_sibling_map(int cpu)
 	bool has_mp = has_smt || boot_cpu_data.x86_max_cores > 1;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	struct cpuinfo_x86 *o;
-	int i;
+	int i, threads;
 
 	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
 
@@ -550,6 +553,10 @@ void set_cpu_sibling_map(int cpu)
 		if (match_die(c, o) && !topology_same_node(c, o))
 			primarily_use_numa_for_topology();
 	}
+
+	threads = cpumask_weight(topology_sibling_cpumask(cpu));
+	if (threads > __max_smt_threads)
+		__max_smt_threads = threads;
 }
 
 /* maps the cpu to the sched domain representing multi-core */
@@ -1441,6 +1448,21 @@ __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/* Recompute SMT state for all CPUs on offline */
+static void recompute_smt_state(void)
+{
+	int max_threads, cpu;
+
+	max_threads = 0;
+	for_each_online_cpu (cpu) {
+		int threads = cpumask_weight(topology_sibling_cpumask(cpu));
+
+		if (threads > max_threads)
+			max_threads = threads;
+	}
+	__max_smt_threads = max_threads;
+}
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1465,6 +1487,7 @@ static void remove_siblinginfo(int cpu)
 	c->phys_proc_id = 0;
 	c->cpu_core_id = 0;
 	cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
+	recompute_smt_state();
 }
 
 static void remove_cpu_from_maps(int cpu)
-- 
2.5.5


* [PATCH 2/8] perf/x86: Support sysfs files depending on SMT status
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
  2016-05-14  1:44 ` [PATCH 1/8] x86/topology: Add topology_max_smt_threads() Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 3/8] perf/x86/intel: Add Top Down events to Intel Core Andi Kleen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add a way to show different sysfs event attributes depending on
whether HyperThreading is on or off. This is difficult to determine
early at boot, so we just do it dynamically when the sysfs
attribute is read.
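
As an illustration of the resulting interface, a PMU driver
declares an event with two variants using the new macro (this
example is taken from the Core patch later in this series);
reading the sysfs file then returns one string or the other
depending on topology_max_smt_threads():

	EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
		"event=0x3c,umask=0x0",			/* shown with SMT off */
		"event=0x3c,umask=0x0,any=1");		/* shown with SMT on */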

v2:
Compute HT status only once in CPU online/offline hooks.
v3: Use topology_max_smt_threads()
v4: Change formatting. Avoid duplicated prototype.
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/events/core.c       | 23 +++++++++++++++++++++++
 arch/x86/events/perf_event.h | 10 ++++++++++
 include/linux/perf_event.h   |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5e5e76a52f58..383657368452 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1622,6 +1622,29 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr, cha
 }
 EXPORT_SYMBOL_GPL(events_sysfs_show);
 
+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page)
+{
+	struct perf_pmu_events_ht_attr *pmu_attr =
+		container_of(attr, struct perf_pmu_events_ht_attr, attr);
+
+	/*
+	 * Report conditional events depending on Hyper-Threading.
+	 *
+	 * This is overly conservative as usually the HT special
+	 * handling is not needed if the other CPU thread is idle.
+	 *
+	 * Note this does not (cannot) handle the case when thread
+	 * siblings are invisible, for example with virtualization
+	 * if they are owned by some other guest.  The user tool
+	 * has to re-read when a thread sibling gets onlined later.
+	 */
+	return sprintf(page, "%s",
+			topology_max_smt_threads() > 1 ?
+			pmu_attr->event_str_ht :
+			pmu_attr->event_str_noht);
+}
+
 EVENT_ATTR(cpu-cycles,			CPU_CYCLES		);
 EVENT_ATTR(instructions,		INSTRUCTIONS		);
 EVENT_ATTR(cache-references,		CACHE_REFERENCES	);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8bd764df815d..e2d7285a2dac 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -668,6 +668,14 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
 	.event_str	= str,						\
 };
 
+#define EVENT_ATTR_STR_HT(_name, v, noht, ht)				\
+static struct perf_pmu_events_ht_attr event_attr_##v = {		\
+	.attr		= __ATTR(_name, 0444, events_ht_sysfs_show, NULL),\
+	.id		= 0,						\
+	.event_str_noht	= noht,						\
+	.event_str_ht	= ht,						\
+}
+
 extern struct x86_pmu x86_pmu __read_mostly;
 
 static inline bool x86_pmu_has_lbr_callstack(void)
@@ -803,6 +811,8 @@ struct attribute **merge_attr(struct attribute **a, struct attribute **b);
 
 ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 			  char *page);
+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page);
 
 #ifdef CONFIG_CPU_SUP_AMD
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9e1c3ada91c4..43e5a345fa8a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1304,6 +1304,13 @@ struct perf_pmu_events_attr {
 	const char *event_str;
 };
 
+struct perf_pmu_events_ht_attr {
+	struct device_attribute			attr;
+	u64 					id;
+	const char				*event_str_ht;
+	const char				*event_str_noht;
+};
+
 ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
 			      char *page);
 
-- 
2.5.5


* [PATCH 3/8] perf/x86/intel: Add Top Down events to Intel Core
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
  2016-05-14  1:44 ` [PATCH 1/8] x86/topology: Add topology_max_smt_threads() Andi Kleen
  2016-05-14  1:44 ` [PATCH 2/8] perf/x86: Support sysfs files depending on SMT status Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 4/8] perf/x86/intel: Add Top Down events to Intel Atom Andi Kleen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add declarations for the events needed for TopDown to the
Intel big core CPUs starting with Sandy Bridge. We need
to report different values depending on whether HyperThreading
is on or off.

All this patch does is export some events in sysfs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots	  Available slots in the pipeline
topdown-slots-issued	  Slots issued into the pipeline
topdown-slots-retired	  Slots successfully retired
topdown-fetch-bubbles	  Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
			  from misspeculation

A slot is a single operation in the CPU pipeline.

These metrics then allow computing four useful derived metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.

The kernel declares the events supported by the current
CPU and their scaling factors (such as the pipeline width)
and perf stat then computes the formulas based on the
available metrics.  This is similar to how existing
perf metrics, such as TSC metrics or IPC, are implemented.
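
As a worked example of the scaling on a 4-wide core: with SMT
off, topdown-total-slots maps to cpu_clk_unhalted.thread with
scale 4, so

	TotalSlots = cycles * 4

With SMT on, it maps to cpu_clk_unhalted.thread_any, which
counts both thread siblings, so the scale becomes 2:

	TotalSlots = (cycles_t0 + cycles_t1) / 2 * 4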

This abstracts all CPU pipeline-specific knowledge into the
kernel driver, but still avoids the need for larger scale perf
interface changes.

For HyperThreading the "any" bit is needed to get accurate
values when both threads are executing. This implies that
the events can only be collected as root or with
perf_event_paranoid=-1 for now.

The basic scheme is based on the following paper:
Ahmad Yasin, "A Top-Down Method for Performance Analysis and
Counters Architecture", ISPASS 2014.
(PDF available via Google)

v2: Rework scaling. Fix formulas for HyperThreading.
v3: Rename agg-per-core to aggr-per-core
Always set aggr-per-core to one to get same output for HT off.
v4: Separate between forced and advisory aggr-per-core
v5: Remove .aggr-per-core attributes
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7c666958a625..c8b2076edba8 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -230,9 +230,46 @@ struct attribute *nhm_events_attrs[] = {
 	NULL,
 };
 
+/*
+ * TopDown events for Core.
+ *
+ * The events are all in slots, which is a free slot in a 4 wide
+ * pipeline. Some events are already reported in slots, for cycle
+ * events we multiply by the pipeline width (4).
+ *
+ * With Hyper Threading on, TopDown metrics are either summed or averaged
+ * between the threads of a core: (count_t0 + count_t1).
+ *
+ * For the average case the metric is always scaled to pipeline width,
+ * so we use factor 2 ((count_t0 + count_t1) / 2 * 4)
+ */
+
+EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
+	"event=0x3c,umask=0x0",			/* cpu_clk_unhalted.thread */
+	"event=0x3c,umask=0x0,any=1");		/* cpu_clk_unhalted.thread_any */
+EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
+	"event=0xe,umask=0x1");			/* uops_issued.any */
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
+	"event=0xc2,umask=0x2");		/* uops_retired.retire_slots */
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
+	"event=0x9c,umask=0x1");		/* idq_uops_not_delivered_core */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
+	"event=0xd,umask=0x3,cmask=1",		/* int_misc.recovery_cycles */
+	"event=0xd,umask=0x3,cmask=1,any=1");	/* int_misc.recovery_cycles_any */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
+	"4", "2");
+
 struct attribute *snb_events_attrs[] = {
 	EVENT_PTR(mem_ld_snb),
 	EVENT_PTR(mem_st_snb),
+	EVENT_PTR(td_slots_issued),
+	EVENT_PTR(td_slots_retired),
+	EVENT_PTR(td_fetch_bubbles),
+	EVENT_PTR(td_total_slots),
+	EVENT_PTR(td_total_slots_scale),
+	EVENT_PTR(td_recovery_bubbles),
+	EVENT_PTR(td_recovery_bubbles_scale),
 	NULL,
 };
 
@@ -3437,6 +3474,13 @@ static struct attribute *hsw_events_attrs[] = {
 	EVENT_PTR(cycles_ct),
 	EVENT_PTR(mem_ld_hsw),
 	EVENT_PTR(mem_st_hsw),
+	EVENT_PTR(td_slots_issued),
+	EVENT_PTR(td_slots_retired),
+	EVENT_PTR(td_fetch_bubbles),
+	EVENT_PTR(td_total_slots),
+	EVENT_PTR(td_total_slots_scale),
+	EVENT_PTR(td_recovery_bubbles),
+	EVENT_PTR(td_recovery_bubbles_scale),
 	NULL
 };
 
@@ -3805,6 +3849,12 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 		intel_pmu_lbr_init_skl();
 
+		/* INT_MISC.RECOVERY_CYCLES has umask 1 in Skylake */
+		event_attr_td_recovery_bubbles.event_str_noht =
+			"event=0xd,umask=0x1,cmask=1";
+		event_attr_td_recovery_bubbles.event_str_ht =
+			"event=0xd,umask=0x1,cmask=1,any=1";
+
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-- 
2.5.5


* [PATCH 4/8] perf/x86/intel: Add Top Down events to Intel Atom
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (2 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 3/8] perf/x86/intel: Add Top Down events to Intel Core Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 5/8] perf/x86/intel: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add topdown event declarations to Silvermont / Airmont.
These cores do not support the full TopDown metrics, but a
useful subset (FrontendBound, Retiring, and a combined
BackendBound/BadSpeculation).

The perf stat tool automatically handles the missing events
and combines the available metrics.
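
Concretely, because topdown-recovery-bubbles is missing here,
BadSpeculation cannot be separated out, and the final patch in
this series reports it combined with BackendBound (labeled
"backend bound/bad spec"), using the Level 1 identity:

	BackendBound + BadSpeculation = 1.0 - Retiring - FrontendBound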

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c8b2076edba8..b337ad7db2b2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1369,6 +1369,29 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+EVENT_ATTR_STR(topdown-total-slots, td_total_slots_slm, "event=0x3c");
+EVENT_ATTR_STR(topdown-total-slots.scale, td_total_slots_scale_slm, "2");
+/* no_alloc_cycles.not_delivered */
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles_slm,
+	       "event=0xca,umask=0x50");
+EVENT_ATTR_STR(topdown-fetch-bubbles.scale, td_fetch_bubbles_scale_slm, "2");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued_slm,
+	       "event=0xc2,umask=0x10");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired_slm,
+	       "event=0xc2,umask=0x10");
+
+static struct attribute *slm_events_attrs[] = {
+	EVENT_PTR(td_total_slots_slm),
+	EVENT_PTR(td_total_slots_scale_slm),
+	EVENT_PTR(td_fetch_bubbles_slm),
+	EVENT_PTR(td_fetch_bubbles_scale_slm),
+	EVENT_PTR(td_slots_issued_slm),
+	EVENT_PTR(td_slots_retired_slm),
+	NULL
+};
+
 static struct extra_reg intel_slm_extra_regs[] __read_mostly =
 {
 	/* must define OFFCORE_RSP_X first, see intel_fixup_er() */
@@ -3631,6 +3654,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_slm_extra_regs;
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+		x86_pmu.cpu_events = slm_events_attrs;
 		pr_cont("Silvermont events, ");
 		break;
 
-- 
2.5.5


* [PATCH 5/8] perf/x86/intel: Use new topology_max_smt_threads() in HT leak workaround
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (3 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 4/8] perf/x86/intel: Add Top Down events to Intel Atom Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 6/8] perf stat: Avoid fractional digits for integer scales Andi Kleen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Now that we have topology_max_smt_threads(), use it to detect
whether the HT workarounds for older CPUs are needed.

v2: Use topology_max_smt_threads()
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b337ad7db2b2..12b438435999 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3991,16 +3991,14 @@ __init int intel_pmu_init(void)
  */
 static __init int fixup_ht_bug(void)
 {
-	int cpu = smp_processor_id();
-	int w, c;
+	int c;
 	/*
 	 * problem not present on this CPU model, nothing to do
 	 */
 	if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
 		return 0;
 
-	w = cpumask_weight(topology_sibling_cpumask(cpu));
-	if (w > 1) {
+	if (topology_max_smt_threads() > 1) {
 		pr_info("PMU erratum BJ122, BV98, HSD29 worked around, HT is on\n");
 		return 0;
 	}
-- 
2.5.5


* [PATCH 6/8] perf stat: Avoid fractional digits for integer scales
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (4 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 5/8] perf/x86/intel: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-16 12:53   ` Jiri Olsa
  2016-05-14  1:44 ` [PATCH 7/8] perf stat: Basic support for TopDown in perf stat Andi Kleen
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

When the scaling factor is a full integer, don't display fractional
digits. This avoids unnecessary .00 output for topdown metrics
with integer scale factors.
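
For example (hypothetical values): with an integer scale such
as 4, a scaled count of 12345.0 now prints as 12,345 rather
than 12,345.00, while a fractional scale such as 0.5 still
prints two decimal digits.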

v2: Remove redundant check.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-stat.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5645a8361de6..db84bfc0a478 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -66,6 +66,7 @@
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
+#include <math.h>
 
 #define DEFAULT_SEPARATOR	" "
 #define CNTR_NOT_SUPPORTED	"<not supported>"
@@ -986,12 +987,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
 	const char *fmt;
 
 	if (csv_output) {
-		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
+		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
 	} else {
 		if (big_num)
-			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
 		else
-			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
 	}
 
 	aggr_printout(evsel, id, nr);
-- 
2.5.5


* [PATCH 7/8] perf stat: Basic support for TopDown in perf stat
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (5 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 6/8] perf stat: Avoid fractional digits for integer scales Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-14  1:44 ` [PATCH 8/8] perf stat: Add computation of TopDown formulas Andi Kleen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add basic plumbing for TopDown in perf stat.

Add a new --topdown option to enable the events.
When --topdown is specified, set up events for all topdown
events supported by the kernel.
Add topdown-* as a special case to the event parser, as is
needed for all events containing a dash.

The actual code to compute the metrics is in follow-on patches.
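
For example, on a CPU that supports all five events, and with
the NMI watchdog disabled so that a group can be used, the
string handed to the event parser ends up as:

	{topdown-total-slots,topdown-slots-retired,topdown-recovery-bubbles,topdown-fetch-bubbles,topdown-slots-issued}

With the watchdog enabled the braces are dropped and the events
are measured without grouping.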

v2: Use standard sysctl read function.
v3: Move x86 specific code to arch/
v4: Enable --metric-only implicitly for topdown.
v5: Add --single-thread option to not force per core mode
v6: Fix output order of topdown metrics
v7: Allow combining with -d
v8: Remove --single-thread again
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |  16 +++++
 tools/perf/arch/x86/util/Build         |   1 +
 tools/perf/arch/x86/util/group.c       |  27 ++++++++
 tools/perf/builtin-stat.c              | 114 ++++++++++++++++++++++++++++++++-
 tools/perf/util/group.h                |   7 ++
 tools/perf/util/parse-events.l         |   1 +
 6 files changed, 163 insertions(+), 3 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/group.c
 create mode 100644 tools/perf/util/group.h

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 04f23b404bbc..3aaa2916f604 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -204,6 +204,22 @@ Aggregate counts per physical processor for system-wide mode measurements.
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
 
+--topdown::
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU bound workloads,
+by breaking them down into frontend bound, backend bound, bad speculation
+and retiring. Metrics are always printed, and colored when they cross a threshold.
+
+The top down metrics may be collected per core instead of per
+CPU thread. In this case per core mode is automatically enabled
+and -a (global monitoring) is needed, requiring root rights or
+kernel.perf_event_paranoid=-1.
+
+This enables --metric-only, unless overridden with --no-metric-only.
+
+To interpret the results it is usually necessary to know on which
+CPUs the workload runs. If needed, the CPUs can be pinned using
+taskset.
 
 EXAMPLES
 --------
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 465970370f3e..4cd8a16b1b7b 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,7 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += group.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/group.c b/tools/perf/arch/x86/util/group.c
new file mode 100644
index 000000000000..f3039b5ce8b1
--- /dev/null
+++ b/tools/perf/arch/x86/util/group.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include "api/fs/fs.h"
+#include "util/group.h"
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group may get bad results due to multiplexing.
+ */
+bool check_group(bool *warn)
+{
+	int n;
+
+	if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+		return false;
+	if (n > 0) {
+		*warn = true;
+		return false;
+	}
+	return true;
+}
+
+void group_warn(void)
+{
+	fprintf(stderr,
+		"nmi_watchdog enabled with topdown. May give wrong results.\n"
+		"Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index db84bfc0a478..7c5c50b61b28 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,10 +59,12 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/group.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/group.h"
 #include "asm/bug.h"
 
+#include <api/fs/fs.h>
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
@@ -98,6 +101,15 @@ static const char * transaction_limited_attrs = {
 	"}"
 };
 
+static const char * topdown_attrs[] = {
+	"topdown-total-slots",
+	"topdown-slots-retired",
+	"topdown-recovery-bubbles",
+	"topdown-fetch-bubbles",
+	"topdown-slots-issued",
+	NULL,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct target target = {
@@ -112,6 +124,7 @@ static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
 static bool			transaction_run;
+static bool			topdown_run			= false;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -124,6 +137,7 @@ static unsigned int		initial_delay			= 0;
 static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static bool			metric_only			= false;
+static bool			force_metric_only		= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
 static aggr_get_id_t		aggr_get_id;
@@ -1515,6 +1529,14 @@ static int stat__set_big_num(const struct option *opt __maybe_unused,
 	return 0;
 }
 
+static int enable_metric_only(const struct option *opt __maybe_unused,
+			      const char *s __maybe_unused, int unset)
+{
+	force_metric_only = true;
+	metric_only = !unset;
+	return 0;
+}
+
 static const struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1573,8 +1595,10 @@ static const struct option stat_options[] = {
 		     "aggregate counts per thread", AGGR_THREAD),
 	OPT_UINTEGER('D', "delay", &initial_delay,
 		     "ms to wait before starting measurement after program start"),
-	OPT_BOOLEAN(0, "metric-only", &metric_only,
-			"Only print computed metrics. No raw values"),
+	OPT_CALLBACK_NOOPT(0, "metric-only", &metric_only, NULL,
+			"Only print computed metrics. No raw values", enable_metric_only),
+	OPT_BOOLEAN(0, "topdown", &topdown_run,
+			"measure topdown level 1 statistics"),
 	OPT_END()
 };
 
@@ -1767,12 +1791,61 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	return 0;
 }
 
+static void filter_events(const char **attr, char **str, bool use_group)
+{
+	int off = 0;
+	int i;
+	int len = 0;
+	char *s;
+
+	for (i = 0; attr[i]; i++) {
+		if (pmu_have_event("cpu", attr[i])) {
+			len += strlen(attr[i]) + 1;
+			attr[i - off] = attr[i];
+		} else
+			off++;
+	}
+	attr[i - off] = NULL;
+
+	*str = malloc(len + 1 + 2);
+	if (!*str)
+		return;
+	s = *str;
+	if (i - off == 0) {
+		*s = 0;
+		return;
+	}
+	if (use_group)
+		*s++ = '{';
+	for (i = 0; attr[i]; i++) {
+		strcpy(s, attr[i]);
+		s += strlen(s);
+		*s++ = ',';
+	}
+	if (use_group) {
+		s[-1] = '}';
+		*s = 0;
+	} else
+		s[-1] = 0;
+}
+
+__weak bool check_group(bool *warn)
+{
+	*warn = false;
+	return false;
+}
+
+__weak void group_warn(void)
+{
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
  */
 static int add_default_attributes(void)
 {
+	int err;
 	struct perf_event_attr default_attrs0[] = {
 
   { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
@@ -1891,7 +1964,6 @@ static int add_default_attributes(void)
 		return 0;
 
 	if (transaction_run) {
-		int err;
 		if (pmu_have_event("cpu", "cycles-ct") &&
 		    pmu_have_event("cpu", "el-start"))
 			err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1904,6 +1976,42 @@ static int add_default_attributes(void)
 		return 0;
 	}
 
+	if (topdown_run) {
+		char *str = NULL;
+		bool warn = false;
+
+		if (stat_config.aggr_mode != AGGR_GLOBAL &&
+		    stat_config.aggr_mode != AGGR_CORE) {
+			pr_err("top down event configuration requires --per-core mode\n");
+			return -1;
+		}
+		stat_config.aggr_mode = AGGR_CORE;
+		if (nr_cgroups || !target__has_cpu(&target)) {
+			pr_err("top down event configuration requires system-wide mode (-a)\n");
+			return -1;
+		}
+
+		if (!force_metric_only)
+			metric_only = true;
+		filter_events(topdown_attrs, &str, check_group(&warn));
+		if (topdown_attrs[0] && str) {
+			if (warn)
+				group_warn();
+			err = parse_events(evsel_list, str, NULL);
+			if (err) {
+				fprintf(stderr,
+					"Cannot set up top down events %s: %d\n",
+					str, err);
+				free(str);
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "System does not support topdown\n");
+			return -1;
+		}
+		free(str);
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (perf_evlist__add_default_attrs(evsel_list, default_attrs0) < 0)
 			return -1;
diff --git a/tools/perf/util/group.h b/tools/perf/util/group.h
new file mode 100644
index 000000000000..daad3ffdc68d
--- /dev/null
+++ b/tools/perf/util/group.h
@@ -0,0 +1,7 @@
+#ifndef GROUP_H
+#define GROUP_H 1
+
+bool check_group(bool *warn);
+void group_warn(void);
+
+#endif
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 1477fbc78993..744ebe3fa30f 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -259,6 +259,7 @@ cycles-ct					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 cycles-t					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-loads					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-stores					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
 L1-icache|l1-i|l1i|L1-instruction	|
-- 
2.5.5


* [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (6 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 7/8] perf stat: Basic support for TopDown in perf stat Andi Kleen
@ 2016-05-14  1:44 ` Andi Kleen
  2016-05-20 10:23   ` Jiri Olsa
  2016-05-16 12:58 ` Add top down metrics to perf stat Jiri Olsa
  2016-05-20 10:24 ` Jiri Olsa
  9 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.
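
A typical invocation (with 'sleep 10' standing in for the
workload of interest) looks like:

	$ perf stat --topdown -a -- sleep 10

This measures the whole system in per-core mode for the
duration of the workload and prints the four percentages,
colored when they cross their thresholds.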

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4:
Only print one decimal digit
Fix color printing of one metric
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/util/stat-shadow.c | 156 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c        |   5 ++
 tools/perf/util/stat.h        |   5 ++
 3 files changed, 166 insertions(+)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index fdb71961143e..182050f2785a 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
+	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
 }
 
 /*
@@ -104,6 +115,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -301,6 +322,100 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top Down Method for Performance analysis and Counter architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition means out of order execution that is thrown away
+ * (for example branch mispredictions)
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory
+ * Retiring is good execution that is not directly bottlenecked
+ *
+ * The formulas are computed in slots.
+ * A slot is an entry in the pipeline each for the pipeline width
+ * (for example a 4-wide pipeline has 4 slots for each cycle)
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined, as possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return bad_spec;
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return 1.0 - sum;
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -308,6 +423,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -450,6 +566,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d0653861..c1ba255f2abe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e786ccc7..c29bb94c48a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 
-- 
2.5.5


* Re: [PATCH 6/8] perf stat: Avoid fractional digits for integer scales
  2016-05-14  1:44 ` [PATCH 6/8] perf stat: Avoid fractional digits for integer scales Andi Kleen
@ 2016-05-16 12:53   ` Jiri Olsa
  0 siblings, 0 replies; 21+ messages in thread
From: Jiri Olsa @ 2016-05-16 12:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel, Andi Kleen

On Fri, May 13, 2016 at 06:44:55PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> When the scaling factor is a full integer don't display fractional
> digits. This avoids unnecessary .00 output for topdown metrics
> with scale factors.
> 
> v2: Remove redundant check.
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

> ---
>  tools/perf/builtin-stat.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 5645a8361de6..db84bfc0a478 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -66,6 +66,7 @@
>  #include <stdlib.h>
>  #include <sys/prctl.h>
>  #include <locale.h>
> +#include <math.h>
>  
>  #define DEFAULT_SEPARATOR	" "
>  #define CNTR_NOT_SUPPORTED	"<not supported>"
> @@ -986,12 +987,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
>  	const char *fmt;
>  
>  	if (csv_output) {
> -		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
> +		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
>  	} else {
>  		if (big_num)
> -			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> +			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
>  		else
> -			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> +			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
>  	}
>  
>  	aggr_printout(evsel, id, nr);
> -- 
> 2.5.5
> 


* Re: Add top down metrics to perf stat
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (7 preceding siblings ...)
  2016-05-14  1:44 ` [PATCH 8/8] perf stat: Add computation of TopDown formulas Andi Kleen
@ 2016-05-16 12:58 ` Jiri Olsa
  2016-05-19 23:51   ` Andi Kleen
  2016-05-20 10:24 ` Jiri Olsa
  9 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2016-05-16 12:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:

SNIP

>     
> The formulas to compute the metrics are generic, they
> only change based on the availability on the abstracted
> input values.
>     
> The kernel declares the events supported by the current
> CPU and perf stat then computes the formulas based on the
> available metrics.
> 
> 
> Example output:
> 
> $ perf stat --topdown -I 1000 cmd
>      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
>      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
>      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           

you've lost the first 3 header columns (time/core/cpus):

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
#           time core         cpus             counts unit events
     1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
     1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
     1.000310344 S0-C2           2      3,764,428,422      cycles                                                      

also I'm still getting -0% as I mentioned in my previous comment:

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
     1.001615409                   retiring             bad speculation      frontend bound       backend bound        
     1.001615409 S0-C0           2     38.3%                0.0%               58.4%                3.3%           
     1.001615409 S0-C1           2     38.1%               -0.0%               59.3%                2.6%           
     1.001615409 S0-C2           2     38.1%                0.0%               58.9%                2.9%           
     1.001615409 S0-C3           2     38.1%               -0.0%               58.9%                3.0%           
     1.001615409 S0-C4           2     38.0%                0.0%               59.0%                2.9%           
     1.001615409 S0-C5           2     38.1%               -0.0%               58.6%                3.3%           
     1.001615409 S1-C0           2     49.7%                1.9%               44.7%                3.7%           

thanks,
jirka


* Re: Add top down metrics to perf stat
  2016-05-16 12:58 ` Add top down metrics to perf stat Jiri Olsa
@ 2016-05-19 23:51   ` Andi Kleen
  2016-05-20  9:59     ` Jiri Olsa
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2016-05-19 23:51 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

On Mon, May 16, 2016 at 02:58:38PM +0200, Jiri Olsa wrote:
> On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> 
> SNIP
> 
> >     
> > The formulas to compute the metrics are generic, they
> > only change based on the availability on the abstracted
> > input values.
> >     
> > The kernel declares the events supported by the current
> > CPU and perf stat then computes the formulas based on the
> > available metrics.
> > 
> > 
> > Example output:
> > 
> > $ perf stat --topdown -I 1000 cmd
> >      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
> >      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
> >      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           

Hi Jiri,
> 
> you've lost first 3 header lines (time/core/cpus):
> 
> [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
> #           time core         cpus             counts unit events
>      1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
>      1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
>      1.000310344 S0-C2           2      3,764,428,422      cycles                                                      

I can't reproduce that.

The headers look the same as before.

> 
> also I'm still getting -0% as I mentioned in my previous comment:

Keeping the NMI watchdog enabled can make the formulas inaccurate
because the grouping is disabled, and parts of the formulas
may be measured at different times where the execution profile
is different.

But anyway, even without that it can be caused by small
inaccuracies: for example, if slots-issued counts marginally
below slots-retired, BadSpeculation comes out slightly negative
and then rounds to -0.0%.
I can remove the - for this case.

Otherwise the data looks reasonable.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: Add top down metrics to perf stat
  2016-05-19 23:51   ` Andi Kleen
@ 2016-05-20  9:59     ` Jiri Olsa
  2016-05-20 14:18       ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2016-05-20  9:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Thu, May 19, 2016 at 04:51:30PM -0700, Andi Kleen wrote:
> On Mon, May 16, 2016 at 02:58:38PM +0200, Jiri Olsa wrote:
> > On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> > 
> > SNIP
> > 
> > >     
> > > The formulas to compute the metrics are generic, they
> > > only change based on the availability on the abstracted
> > > input values.
> > >     
> > > The kernel declares the events supported by the current
> > > CPU and perf stat then computes the formulas based on the
> > > available metrics.
> > > 
> > > 
> > > Example output:
> > > 
> > > $ perf stat --topdown -I 1000 cmd
> > >      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
> > >      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
> > >      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
> 
> Hi Jiri,
> > 
> > you've lost first 3 header lines (time/core/cpus):
> > 
> > [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
> > #           time core         cpus             counts unit events
> >      1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
> >      1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
> >      1.000310344 S0-C2           2      3,764,428,422      cycles                                                      
> 
> I can't reproduce that.
> 

I can.. your latest code does not display the 'time', 'core'
and 'cpus' headers, nor the initial '#':

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
     1.002097350                   retiring             bad speculation      frontend bound       backend bound        
     1.002097350 S0-C0           2     38.1%                0.0%               59.2%                2.7%           
     1.002097350 S0-C1           2     38.1%                0.1%               59.7%                2.1%           


thanks,
jirka


* Re: [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-14  1:44 ` [PATCH 8/8] perf stat: Add computation of TopDown formulas Andi Kleen
@ 2016-05-20 10:23   ` Jiri Olsa
  2016-05-20 13:38     ` Andi Kleen
  2016-05-20 13:43     ` Andi Kleen
  0 siblings, 2 replies; 21+ messages in thread
From: Jiri Olsa @ 2016-05-20 10:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel, Andi Kleen

On Fri, May 13, 2016 at 06:44:57PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Implement the TopDown formulas in perf stat. The topdown basic metrics
> reported by the kernel are collected, and the formulas are computed
> and output as normal metrics.
> 
> See the kernel commit exporting the events for details on the used
> metrics.
> 
> v2: Always print all metrics, only use thresholds for coloring.
> v3: Mark retiring over threshold green, not red.
> v4:
> Only print one decimal digit
> Fix color printing of one metric
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  tools/perf/util/stat-shadow.c | 156 ++++++++++++++++++++++++++++++++++++++++++
>  tools/perf/util/stat.c        |   5 ++
>  tools/perf/util/stat.h        |   5 ++
>  3 files changed, 166 insertions(+)
> 
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index fdb71961143e..182050f2785a 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
>  static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
>  static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
>  static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
> +static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
> +static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
> +static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
> +static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
> +static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
>  static bool have_frontend_stalled;
>  
>  struct stats walltime_nsecs_stats;
> @@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
>  		sizeof(runtime_transaction_stats));
>  	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
>  	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
> +	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
> +	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
> +	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
> +	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
> +	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
> +	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");

why initialize this one here? it's already done in perf_stat__init_shadow_stats

jirka

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add top down metrics to perf stat
  2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
                   ` (8 preceding siblings ...)
  2016-05-16 12:58 ` Add top down metrics to perf stat Jiri Olsa
@ 2016-05-20 10:24 ` Jiri Olsa
  9 siblings, 0 replies; 21+ messages in thread
From: Jiri Olsa @ 2016-05-20 10:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> Note to reviewers: includes both tools and kernel patches.
> The kernel patches are at the beginning.
> 
> [v2: Address review feedback.
> Metrics are now always printed, but colored when crossing threshold.
> --topdown implies --metric-only.
> Various smaller fixes, see individual patches]
> [v3: Add --single-thread option and support it with HT off.
> Clean up old HT workaround.
> Improve documentation.
> Various smaller fixes, see individual patches.]
> [v4: Rebased on latest tree]
> [v5: Rebased on latest tree. Move debug messages to -vv]
> [v6: Rebased. Remove .aggr-per-core and --single-thread to not
> break old perf binaries. Put SMT enumeration into 
> generic topology API.]
> [v7: Address review comments. Change patch title headers.]

other than the missing headers and the unneeded initialization
of have_frontend_stalled I'm ok with the perf tools part

thanks,
jirka

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-20 10:23   ` Jiri Olsa
@ 2016-05-20 13:38     ` Andi Kleen
  2016-05-20 13:43       ` Jiri Olsa
  2016-05-20 13:43     ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2016-05-20 13:38 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

> > @@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
> >  		sizeof(runtime_transaction_stats));
> >  	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
> >  	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
> > +	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
> > +	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
> > +	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
> > +	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
> > +	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
> > +	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
> 
> > why initialize this one here? it's already done in perf_stat__init_shadow_stats

You mean the have_frontend_stalled?

I don't think it's in init_shadow_stats yet, but it could be moved.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-20 10:23   ` Jiri Olsa
  2016-05-20 13:38     ` Andi Kleen
@ 2016-05-20 13:43     ` Andi Kleen
  1 sibling, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-20 13:43 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

Here's an updated patch without the extra line:

---

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4:
Only print one decimal digit
Fix color printing of one metric
v5: Avoid printing -0.0
v6: Remove extra frontend event lookup
Signed-off-by: Andi Kleen <ak@linux.intel.com>

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index fdb71961143e..509ed62a064e 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,11 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
 }
 
 /*
@@ -104,6 +114,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -301,6 +321,107 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top Down Method for Performance analysis and Counter architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition covers work that is executed out of order
+ * and then thrown away (for example due to branch mispredictions).
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory.
+ * Retiring is good execution that is not directly bottlenecked.
+ *
+ * The formulas are computed in slots.
+ * A slot is an entry in the pipeline, one per unit of pipeline width
+ * (for example, a 4-wide pipeline has 4 slots per cycle).
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined where possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double sanitize_val(double x)
+{
+	if (x < 0 && x >= -0.02)
+		return 0.0;
+	return x;
+}
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return sanitize_val(bad_spec);
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return sanitize_val(1.0 - sum);
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -308,6 +429,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -450,6 +572,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d0653861..c1ba255f2abe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e786ccc7..c29bb94c48a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 

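For reference, a worked pass through the Level 1 formulas from the
stat-shadow.c comment above, with illustrative numbers only (assuming
a 4-wide pipeline, so TotalSlots = Cycles * 4):

    Cycles          = 1.0e9  ->  TotalSlots = 4.0e9
    SlotsIssued     = 2.2e9
    SlotsRetired    = 1.8e9
    RecoveryBubbles = 0.2e9
    FetchBubbles    = 1.0e9

    BadSpeculation = ((2.2e9 - 1.8e9) + 0.2e9) / 4.0e9 = 0.15 -> 15.0%
    Retiring       = 1.8e9 / 4.0e9                      = 0.45 -> 45.0%
    FrontendBound  = 1.0e9 / 4.0e9                      = 0.25 -> 25.0%
    BackendBound   = 1 - 0.15 - 0.45 - 0.25             = 0.15 -> 15.0%

The four areas sum to 100%, matching the per-core percentages that
--topdown prints.
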
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-20 13:38     ` Andi Kleen
@ 2016-05-20 13:43       ` Jiri Olsa
  2016-05-20 14:15         ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Jiri Olsa @ 2016-05-20 13:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 20, 2016 at 06:38:33AM -0700, Andi Kleen wrote:
> > > @@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
> > >  		sizeof(runtime_transaction_stats));
> > >  	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
> > >  	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
> > > +	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
> > > +	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
> > > +	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
> > > +	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
> > > +	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
> > > +	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
> > 
> > why initialize this one here? it's already done in perf_stat__init_shadow_stats
> 
> You mean the have_frontend_stalled?
> 
> I don't think it's in init_shadow_stats yet, but it could be moved.
> 

fb4605ba47e7 perf stat: Check for frontend stalled for metrics

+void perf_stat__init_shadow_stats(void)
+{
+       have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
+}


jirka

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-20 13:43       ` Jiri Olsa
@ 2016-05-20 14:15         ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-20 14:15 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andi Kleen, Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 20, 2016 at 03:43:46PM +0200, Jiri Olsa wrote:
> On Fri, May 20, 2016 at 06:38:33AM -0700, Andi Kleen wrote:
> > > > @@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
> > > >  		sizeof(runtime_transaction_stats));
> > > >  	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
> > > >  	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
> > > > +	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
> > > > +	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
> > > > +	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
> > > > +	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
> > > > +	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
> > > > +	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
> > > 
> > > why initialize this one here? it's already done in perf_stat__init_shadow_stats
> > 
> > You mean the have_frontend_stalled?
> > 
> > I don't think it's in init_shadow_stats yet, but it could be moved.
> > 
> 
> fb4605ba47e7 perf stat: Check for frontend stalled for metrics
> 
> +void perf_stat__init_shadow_stats(void)
> +{
> +       have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
> +}

Ok I removed the line and posted a new patch. Thanks.
-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Add top down metrics to perf stat
  2016-05-20  9:59     ` Jiri Olsa
@ 2016-05-20 14:18       ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-20 14:18 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

> [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
> nmi_watchdog enabled with topdown. May give wrong results.
> Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
>      1.002097350                   retiring             bad speculation      frontend bound       backend bound        
>      1.002097350 S0-C0           2     38.1%                0.0%               59.2%                2.7%           
>      1.002097350 S0-C1           2     38.1%                0.1%               59.7%                2.1%           

Ah, I see now: this is --metric-only not displaying the headers.
--topdown enables --metric-only implicitly.

I'll send a separate patch for that, because --metric-only was
already merged separately. So it's not really a problem in this
patchkit, but in a previous one.
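
A standalone sketch of the failure mode (not perf source; names and
layout are illustrative assumptions only):

#include <stdbool.h>
#include <stdio.h>

static bool metric_only;	/* set implicitly by --topdown */
static int num_print_interval;

/* Illustrative reconstruction: in interval mode the header banner is
 * printed once, but the metric-only branch used to print nothing. */
static void print_interval_header(void)
{
	if (num_print_interval++ != 0)
		return;
	if (metric_only) {
		/* The bug: this branch emitted no banner at all. The
		 * intended output keeps the prefix columns and then
		 * the metric name headers: */
		printf("#           time core         cpus");
		printf("   ...metric name headers...\n");
	} else {
		printf("#           time core         cpus"
		       "             counts unit events\n");
	}
}

int main(void)
{
	metric_only = true;	/* --topdown given */
	print_interval_header();
	return 0;
}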

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 8/8] perf stat: Add computation of TopDown formulas
  2016-05-20  0:09 Andi Kleen
@ 2016-05-20  0:10 ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2016-05-20  0:10 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, linux-kernel, mingo, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4:
Only print one decimal digit
Fix color printing of one metric
v5: Avoid printing -0.0
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/util/stat-shadow.c | 163 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c        |   5 ++
 tools/perf/util/stat.h        |   5 ++
 3 files changed, 173 insertions(+)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index fdb71961143e..1b6d8eeacd07 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
+	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
 }
 
 /*
@@ -104,6 +115,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -301,6 +322,107 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top Down Method for Performance analysis and Counter architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition covers work that is executed out of order
+ * and then thrown away (for example due to branch mispredictions).
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory.
+ * Retiring is good execution that is not directly bottlenecked.
+ *
+ * The formulas are computed in slots.
+ * A slot is an entry in the pipeline, one per unit of pipeline width
+ * (for example, a 4-wide pipeline has 4 slots per cycle).
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined where possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double sanitize_val(double x)
+{
+	if (x < 0 && x >= -0.02)
+		return 0.0;
+	return x;
+}
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return sanitize_val(bad_spec);
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return sanitize_val(1.0 - sum);
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -308,6 +430,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -450,6 +573,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d0653861..c1ba255f2abe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e786ccc7..c29bb94c48a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-05-20 14:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-14  1:44 Add top down metrics to perf stat Andi Kleen
2016-05-14  1:44 ` [PATCH 1/8] x86/topology: Add topology_max_smt_threads() Andi Kleen
2016-05-14  1:44 ` [PATCH 2/8] perf/x86: Support sysfs files depending on SMT status Andi Kleen
2016-05-14  1:44 ` [PATCH 3/8] perf/x86/intel: Add Top Down events to Intel Core Andi Kleen
2016-05-14  1:44 ` [PATCH 4/8] perf/x86/intel: Add Top Down events to Intel Atom Andi Kleen
2016-05-14  1:44 ` [PATCH 5/8] perf/x86/intel: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
2016-05-14  1:44 ` [PATCH 6/8] perf stat: Avoid fractional digits for integer scales Andi Kleen
2016-05-16 12:53   ` Jiri Olsa
2016-05-14  1:44 ` [PATCH 7/8] perf stat: Basic support for TopDown in perf stat Andi Kleen
2016-05-14  1:44 ` [PATCH 8/8] perf stat: Add computation of TopDown formulas Andi Kleen
2016-05-20 10:23   ` Jiri Olsa
2016-05-20 13:38     ` Andi Kleen
2016-05-20 13:43       ` Jiri Olsa
2016-05-20 14:15         ` Andi Kleen
2016-05-20 13:43     ` Andi Kleen
2016-05-16 12:58 ` Add top down metrics to perf stat Jiri Olsa
2016-05-19 23:51   ` Andi Kleen
2016-05-20  9:59     ` Jiri Olsa
2016-05-20 14:18       ` Andi Kleen
2016-05-20 10:24 ` Jiri Olsa
2016-05-20  0:09 Andi Kleen
2016-05-20  0:10 ` [PATCH 8/8] perf stat: Add computation of TopDown formulas Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).