linux-kernel.vger.kernel.org archive mirror
* Add top down metrics to perf stat
@ 2016-05-05 23:03 Andi Kleen
  2016-05-05 23:03 ` [PATCH 01/10] x86: Add topology_max_smt_threads() Andi Kleen
                   ` (10 more replies)
  0 siblings, 11 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:03 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]
[v5: Rebased on latest tree. Move debug messages to -vv]
[v6: Rebased. Remove .aggr-per-core and --single-thread to not
break old perf binaries. Put SMT enumeration into 
generic topology API.]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads, 
due to out of order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should be also implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These events then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.
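
As a rough illustration of the formulas (not the code from this
patchkit), the level 1 metrics can be derived from the five abstracted
events as described in the TopDown paper. The struct and function names
below are made up for this sketch, and all inputs are assumed to be
already scaled to slots:

```c
#include <assert.h>

/* Level 1 inputs, all expressed in pipeline slots (hypothetical layout). */
struct topdown_counts {
	double total_slots;
	double slots_issued;
	double slots_retired;
	double fetch_bubbles;
	double recovery_bubbles;
};

/* The four level 1 metrics, each as a fraction of total slots. */
struct topdown_l1 {
	double frontend_bound;
	double bad_speculation;
	double retiring;
	double backend_bound;
};

static struct topdown_l1 topdown_level1(const struct topdown_counts *c)
{
	struct topdown_l1 m;

	assert(c->total_slots > 0);

	/* Slots the frontend failed to deliver. */
	m.frontend_bound = c->fetch_bubbles / c->total_slots;
	/* Slots issued but never retired, plus recovery from misspeculation. */
	m.bad_speculation = (c->slots_issued - c->slots_retired +
			     c->recovery_bubbles) / c->total_slots;
	/* Slots that did useful work. */
	m.retiring = c->slots_retired / c->total_slots;
	/* Whatever remains is attributed to the backend. */
	m.backend_bound = 1.0 - (m.frontend_bound + m.bad_speculation +
				 m.retiring);
	return m;
}
```

By construction the four fractions add up to 1, which matches the rows
in the example output below summing to 100%. On CPUs that lack some of
the events, fewer metrics can be computed.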


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper-Threading and TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode, and also
requires global mode (-a) and no other filters (no cgroup mode).
One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-20


* [PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
@ 2016-05-05 23:03 ` Andi Kleen
  2016-05-06 10:13   ` Peter Zijlstra
  2016-05-05 23:03 ` [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status Andi Kleen
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:03 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

For SMT specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.

Add a global variable that is updated with the maximum number
of SMT threads on any CPU on online/offline, and use it for
topology_max_smt_threads().

The single call is easier to use than a loop.

Not exported to user space because user space already can use
the existing sibling interfaces to find this out.
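
For illustration only, a user-space tool could derive the same number
from the thread_siblings_list files under
/sys/devices/system/cpu/cpuN/topology/. The sketch below only parses
the list strings (for example "0,4" or "0-1"); the helper names are
hypothetical and reading the files themselves is left out:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the CPUs in a sysfs CPU list such as "0,4" or "0-3,8-11".
 * Returns -1 on malformed input. A trailing newline is accepted.
 */
static int count_cpu_list(const char *s)
{
	int count = 0;

	assert(s != NULL);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {
			/* Range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		count += (int)(hi - lo + 1);
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return count;
}

/* Maximum number of siblings over the lists of all online CPUs. */
static int max_threads_in_lists(const char *const *lists, int n)
{
	int max = 0;

	for (int i = 0; i < n; i++) {
		int threads = count_cpu_list(lists[i]);

		if (threads > max)
			max = threads;
	}
	return max;
}
```

A result greater than 1 corresponds to topology_max_smt_threads() > 1
in the kernel, i.e. SMT is active on at least one online core.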

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/topology.h |  5 +++++
 arch/x86/kernel/smpboot.c       | 24 ++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7f991bd5031b..4707555f209e 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -129,6 +129,10 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
+
+extern int max_smt_threads;
+#define topology_max_smt_threads()		max_smt_threads
+
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 extern int topology_phys_to_logical_pkg(unsigned int pkg);
 #else
@@ -136,6 +140,7 @@ extern int topology_phys_to_logical_pkg(unsigned int pkg);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+#define topology_max_smt_threads()		1
 #endif
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a2065d3b3b39..6e5a721857f5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -105,6 +105,9 @@ static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 
+/* Maximum number of SMT threads on any online core */
+int max_smt_threads __read_mostly;
+
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
 {
 	unsigned long flags;
@@ -489,6 +492,7 @@ void set_cpu_sibling_map(int cpu)
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	struct cpuinfo_x86 *o;
 	int i;
+	int threads;
 
 	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
 
@@ -545,6 +549,10 @@ void set_cpu_sibling_map(int cpu)
 		if (match_die(c, o) && !topology_same_node(c, o))
 			primarily_use_numa_for_topology();
 	}
+
+	threads = cpumask_weight(topology_sibling_cpumask(cpu));
+	if (threads > max_smt_threads)
+		max_smt_threads = threads;
 }
 
 /* maps the cpu to the sched domain representing multi-core */
@@ -1436,6 +1444,21 @@ __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/* Recompute SMT state for all CPUs on offline */
+static void recompute_smt_state(void)
+{
+	int max_threads;
+	int cpu;
+
+	max_threads = 0;
+	for_each_online_cpu (cpu) {
+		int threads = cpumask_weight(topology_sibling_cpumask(cpu));
+		if (threads > max_threads)
+			max_threads = threads;
+	}
+	max_smt_threads = max_threads;
+}
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1460,6 +1483,7 @@ static void remove_siblinginfo(int cpu)
 	c->phys_proc_id = 0;
 	c->cpu_core_id = 0;
 	cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
+	recompute_smt_state();
 }
 
 static void remove_cpu_from_maps(int cpu)
-- 
2.5.5


* [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
  2016-05-05 23:03 ` [PATCH 01/10] x86: Add topology_max_smt_threads() Andi Kleen
@ 2016-05-05 23:03 ` Andi Kleen
  2016-05-09  9:42   ` Peter Zijlstra
  2016-05-12  8:05   ` Ingo Molnar
  2016-05-05 23:04 ` [PATCH 03/10] x86, perf: Add Top Down events to Intel Core Andi Kleen
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:03 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add a way to show different sysfs event attributes depending on
whether HyperThreading is on or off. This is difficult to determine
early at boot, so we just do it dynamically when the sysfs
attribute is read.
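
The selection itself is simple. A user-space model of it, with a
hypothetical struct and function name standing in for the kernel
attribute and its show callback, could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel attribute: one event string per SMT state. */
struct ht_event_attr {
	const char *event_str_ht;
	const char *event_str_noht;
};

/*
 * Pick the event string at read time. max_smt_threads models the value
 * returned by topology_max_smt_threads() when the attribute is read.
 */
static const char *ht_event_str(const struct ht_event_attr *attr,
				int max_smt_threads)
{
	assert(attr != NULL);
	return max_smt_threads > 1 ? attr->event_str_ht
				   : attr->event_str_noht;
}
```

Because the choice is made on every read, a tool that cached the string
has to read the file again after a thread sibling comes online.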

v2:
Compute HT status only once in CPU online/offline hooks.
v3: Use topology_max_smt_threads()
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/core.c       | 24 ++++++++++++++++++++++++
 arch/x86/events/perf_event.h | 14 ++++++++++++++
 include/linux/perf_event.h   |  7 +++++++
 3 files changed, 45 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5e5e76a52f58..ec26d7a6ed40 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1622,6 +1622,30 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr, cha
 }
 EXPORT_SYMBOL_GPL(events_sysfs_show);
 
+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page)
+{
+	struct perf_pmu_events_ht_attr *pmu_attr =
+		container_of(attr, struct perf_pmu_events_ht_attr, attr);
+
+	/*
+	 * Report conditional events depending on Hyper-Threading.
+	 *
+	 * This is overly conservative as usually the HT special
+	 * handling is not needed if the other CPU thread is idle.
+	 *
+	 * Note this does not (cannot) handle the case when thread
+	 * siblings are invisible, for example with virtualization
+	 * if they are owned by some other guest.  The user tool
+	 * has to re-read when a thread sibling gets onlined later.
+	 */
+
+	return sprintf(page, "%s",
+			topology_max_smt_threads() > 1 ?
+			pmu_attr->event_str_ht :
+			pmu_attr->event_str_noht);
+}
+
 EVENT_ATTR(cpu-cycles,			CPU_CYCLES		);
 EVENT_ATTR(instructions,		INSTRUCTIONS		);
 EVENT_ATTR(cache-references,		CACHE_REFERENCES	);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8bd764df815d..ad2e870f77d9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -668,6 +668,14 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
 	.event_str	= str,						\
 };
 
+#define EVENT_ATTR_STR_HT(_name, v, noht, ht)				\
+static struct perf_pmu_events_ht_attr event_attr_##v = {		\
+	.attr		= __ATTR(_name, 0444, events_ht_sysfs_show, NULL),\
+	.id		= 0,						\
+	.event_str_noht	= noht,						\
+	.event_str_ht	= ht,						\
+}
+
 extern struct x86_pmu x86_pmu __read_mostly;
 
 static inline bool x86_pmu_has_lbr_callstack(void)
@@ -938,6 +946,12 @@ int p6_pmu_init(void);
 
 int knc_pmu_init(void);
 
+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page);
+
+ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page);
+
 static inline int is_ht_workaround_enabled(void)
 {
 	return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9e1c3ada91c4..b425f2d24b26 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1304,6 +1304,13 @@ struct perf_pmu_events_attr {
 	const char *event_str;
 };
 
+struct perf_pmu_events_ht_attr {
+	struct device_attribute attr;
+	u64 id;
+	const char *event_str_ht;
+	const char *event_str_noht;
+};
+
 ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
 			      char *page);
 
-- 
2.5.5


* [PATCH 03/10] x86, perf: Add Top Down events to Intel Core
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
  2016-05-05 23:03 ` [PATCH 01/10] x86: Add topology_max_smt_threads() Andi Kleen
  2016-05-05 23:03 ` [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-11 13:23   ` Jiri Olsa
  2016-05-12  8:10   ` Ingo Molnar
  2016-05-05 23:04 ` [PATCH 04/10] x86, perf: Add Top Down events to Intel Atom Andi Kleen
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add declarations for the events needed for TopDown to the
Intel big core CPUs starting with Sandy Bridge. We need
to report different values if HyperThreading is on or off.

The only thing this patch does is to export some events
in sysfs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):

topdown-total-slots	  Available slots in the pipeline
topdown-slots-issued	  Slots issued into the pipeline
topdown-slots-retired	  Slots successfully retired
topdown-fetch-bubbles	  Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
			  from misspeculation

A slot is a single operation in the CPU pipeline.

These events then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.

The kernel declares the events supported by the current
CPU and their scaling factors (such as the pipeline width)
and perf stat then computes the formulas based on the
available metrics.  This is similar to how existing
perf metrics, such as TSC metrics or IPC, are implemented.

This abstracts all CPU pipeline specific knowledge in the
kernel driver, but still avoids the need for larger scale perf
interface changes.

For HyperThreading the any bit is needed to get accurate
values when both threads are executing. This implies that
the events can only be collected as root or with
perf_event_paranoid=-1 for now.
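
To make the scaling concrete, here is a small sketch of how the cycle
based events turn into slots, assuming the 4 wide pipeline and the scale
factors exported by this patch ("4" with HT off, "2" with HT on). The
function names are invented for the example:

```c
#include <assert.h>

/*
 * HT off: one thread per core. cpu_clk_unhalted.thread counts the
 * core's cycles and is scaled by the pipeline width (4).
 */
static unsigned long long total_slots_noht(unsigned long long cycles)
{
	return cycles * 4;
}

/*
 * HT on: both threads count with any=1, so each counter sees the
 * cycles of the whole core. Per-core aggregation sums the two
 * counters, and the scale of 2 yields (t0 + t1) / 2 * 4.
 */
static unsigned long long total_slots_ht(unsigned long long any_t0,
					 unsigned long long any_t1)
{
	return (any_t0 + any_t1) * 2;
}

/* Both modes should agree for a core that was busy for the same time. */
static void check_ht_scaling(void)
{
	assert(total_slots_noht(1000) == total_slots_ht(1000, 1000));
}
```

The point of the different factors is that a core that ran for the same
number of cycles reports the same number of total slots in both modes.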

The basic scheme is based on the following paper:
A. Yasin, "A Top-Down Method for Performance Analysis and
Counter Architecture", ISPASS 2014
(pdf available via google)

v2: Rework scaling. Fix formulas for HyperThreading.
v3: Rename agg-per-core to aggr-per-core
Always set aggr-per-core to one to get same output for HT off.
v4: Separate between forced and advisory aggr-per-core
v5: Remove .aggr-per-core attributes
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index cd319400dc10..8b146007c264 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -230,9 +230,46 @@ struct attribute *nhm_events_attrs[] = {
 	NULL,
 };
 
+/*
+ * TopDown events for Core.
+ *
+ * The events are all in slots, which is a free slot in a 4 wide
+ * pipeline. Some events are already reported in slots, for cycle
+ * events we multiply by the pipeline width (4).
+ *
+ * With Hyper Threading on, TopDown metrics are either summed or averaged
+ * between the threads of a core: (count_t0 + count_t1).
+ *
+ * For the average case the metric is always scaled to pipeline width,
+ * so we use factor 2 ((count_t0 + count_t1) / 2 * 4)
+ */
+
+EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
+	"event=0x3c,umask=0x0",			/* cpu_clk_unhalted.thread */
+	"event=0x3c,umask=0x0,any=1");		/* cpu_clk_unhalted.thread_any */
+EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
+	"event=0xe,umask=0x1");			/* uops_issued.any */
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
+	"event=0xc2,umask=0x2");		/* uops_retired.retire_slots */
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
+	"event=0x9c,umask=0x1");		/* idq_uops_not_delivered_core */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
+	"event=0xd,umask=0x3,cmask=1",		/* int_misc.recovery_cycles */
+	"event=0xd,umask=0x3,cmask=1,any=1");	/* int_misc.recovery_cycles_any */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
+	"4", "2");
+
 struct attribute *snb_events_attrs[] = {
 	EVENT_PTR(mem_ld_snb),
 	EVENT_PTR(mem_st_snb),
+	EVENT_PTR(td_slots_issued),
+	EVENT_PTR(td_slots_retired),
+	EVENT_PTR(td_fetch_bubbles),
+	EVENT_PTR(td_total_slots),
+	EVENT_PTR(td_total_slots_scale),
+	EVENT_PTR(td_recovery_bubbles),
+	EVENT_PTR(td_recovery_bubbles_scale),
 	NULL,
 };
 
@@ -3437,6 +3474,13 @@ static struct attribute *hsw_events_attrs[] = {
 	EVENT_PTR(cycles_ct),
 	EVENT_PTR(mem_ld_hsw),
 	EVENT_PTR(mem_st_hsw),
+	EVENT_PTR(td_slots_issued),
+	EVENT_PTR(td_slots_retired),
+	EVENT_PTR(td_fetch_bubbles),
+	EVENT_PTR(td_total_slots),
+	EVENT_PTR(td_total_slots_scale),
+	EVENT_PTR(td_recovery_bubbles),
+	EVENT_PTR(td_recovery_bubbles_scale),
 	NULL
 };
 
@@ -3805,6 +3849,12 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 		intel_pmu_lbr_init_skl();
 
+		/* INT_MISC.RECOVERY_CYCLES has umask 1 in Skylake */
+		event_attr_td_recovery_bubbles.event_str_noht =
+			"event=0xd,umask=0x1,cmask=1";
+		event_attr_td_recovery_bubbles.event_str_ht =
+			"event=0xd,umask=0x1,cmask=1,any=1";
+
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-- 
2.5.5


* [PATCH 04/10] x86, perf: Add Top Down events to Intel Atom
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (2 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 03/10] x86, perf: Add Top Down events to Intel Core Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-05 23:04 ` [PATCH 05/10] x86, perf: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add topdown event declarations to Silvermont / Airmont.
These cores do not support the full Top Down metrics, but a useful
subset (FrontendBound, Retiring, Backend Bound/Bad Speculation).

The perf stat tool automatically handles the missing events
and combines the available metrics.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8b146007c264..6ea16f705de4 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1369,6 +1369,29 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+EVENT_ATTR_STR(topdown-total-slots, td_total_slots_slm, "event=0x3c");
+EVENT_ATTR_STR(topdown-total-slots.scale, td_total_slots_scale_slm, "2");
+/* no_alloc_cycles.not_delivered */
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles_slm,
+	       "event=0xca,umask=0x50");
+EVENT_ATTR_STR(topdown-fetch-bubbles.scale, td_fetch_bubbles_scale_slm, "2");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued_slm,
+	       "event=0xc2,umask=0x10");
+/* uops_retired.all */
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired_slm,
+	       "event=0xc2,umask=0x10");
+
+static struct attribute *slm_events_attrs[] = {
+	EVENT_PTR(td_total_slots_slm),
+	EVENT_PTR(td_total_slots_scale_slm),
+	EVENT_PTR(td_fetch_bubbles_slm),
+	EVENT_PTR(td_fetch_bubbles_scale_slm),
+	EVENT_PTR(td_slots_issued_slm),
+	EVENT_PTR(td_slots_retired_slm),
+	NULL
+};
+
 static struct extra_reg intel_slm_extra_regs[] __read_mostly =
 {
 	/* must define OFFCORE_RSP_X first, see intel_fixup_er() */
@@ -3631,6 +3654,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_slm_extra_regs;
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+		x86_pmu.cpu_events = slm_events_attrs;
 		pr_cont("Silvermont events, ");
 		break;
 
-- 
2.5.5


* [PATCH 05/10] x86, perf: Use new topology_max_smt_threads() in HT leak workaround
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (3 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 04/10] x86, perf: Add Top Down events to Intel Atom Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-05 23:04 ` [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales Andi Kleen
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Now that we have topology_max_smt_threads() use it
to detect the HT workarounds for older CPUs.

v2: Use topology_max_smt_threads()
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6ea16f705de4..fcc9a010cb0a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3991,16 +3991,14 @@ __init int intel_pmu_init(void)
  */
 static __init int fixup_ht_bug(void)
 {
-	int cpu = smp_processor_id();
-	int w, c;
+	int c;
 	/*
 	 * problem not present on this CPU model, nothing to do
 	 */
 	if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED))
 		return 0;
 
-	w = cpumask_weight(topology_sibling_cpumask(cpu));
-	if (w > 1) {
+	if (topology_max_smt_threads() > 1) {
 		pr_info("PMU erratum BJ122, BV98, HSD29 worked around, HT is on\n");
 		return 0;
 	}
-- 
2.5.5


* [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (4 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 05/10] x86, perf: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-07 19:10   ` Jiri Olsa
  2016-05-20  6:42   ` [tip:perf/urgent] perf " tip-bot for Andi Kleen
  2016-05-05 23:04 ` [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics Andi Kleen
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

When the scaling factor is a whole number, don't display fractional
digits. This avoids unnecessary .00 output for topdown metrics
with scale factors.
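
A stand-alone sketch of the check follows. The patch itself uses
floor(sc) != sc; the cast below is an equivalent test for scales that
fit into a long long and is used here only to keep the example free of
libm. The helper names are hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Number of fractional digits to print for a given scale factor. */
static int scale_precision(double sc)
{
	/* Whole-number scales (1, 2, 4, ...) need no fractional digits. */
	return (double)(long long)sc == sc ? 0 : 2;
}

/* Format an already scaled count the way the CSV branch would. */
static int format_count(char *buf, size_t len, double value, double sc)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%.*f", scale_precision(sc), value);
}
```

Before this change only a scale of exactly 1.0 selected the integer
format, so a count scaled by 4 was printed with a trailing .00.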

v2: Remove redundant check.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-stat.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1f19f2f999c8..b407a11c6e22 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -66,6 +66,7 @@
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
+#include <math.h>
 
 #define DEFAULT_SEPARATOR	" "
 #define CNTR_NOT_SUPPORTED	"<not supported>"
@@ -978,12 +979,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
 	const char *fmt;
 
 	if (csv_output) {
-		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
+		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
 	} else {
 		if (big_num)
-			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
 		else
-			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
 	}
 
 	aggr_printout(evsel, id, nr);
-- 
2.5.5


* [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (5 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-07 19:14   ` Jiri Olsa
  2016-05-10 20:30   ` [tip:perf/core] perf " tip-bot for Andi Kleen
  2016-05-05 23:04 ` [PATCH 08/10] perf, tools, stat: Basic support for TopDown in perf stat Andi Kleen
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Scale values by unit before passing them to the metrics printing functions.
This is needed for TopDown, because it needs to scale the slots correctly
by pipeline width / SMTness.

For existing metrics it shouldn't make any difference, as those generally
use events that don't have any units.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/util/stat.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 4d9b481cf3b6..ffa1d0653861 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -307,6 +307,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	struct perf_counts_values *aggr = &counter->counts->aggr;
 	struct perf_stat_evsel *ps = counter->priv;
 	u64 *count = counter->counts->aggr.values;
+	u64 val;
 	int i, ret;
 
 	aggr->val = aggr->ena = aggr->run = 0;
@@ -346,7 +347,8 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	/*
 	 * Save the full runtime - to allow normalization during printout:
 	 */
-	perf_stat__update_shadow_stats(counter, count, 0);
+	val = counter->scale * *count;
+	perf_stat__update_shadow_stats(counter, &val, 0);
 
 	return 0;
 }
-- 
2.5.5


* [PATCH 08/10] perf, tools, stat: Basic support for TopDown in perf stat
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (6 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-05 23:04 ` [PATCH 09/10] perf, tools, stat: Add computation of TopDown formulas Andi Kleen
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add basic plumbing for TopDown in perf stat.

Add a new --topdown option to enable events.
When --topdown is specified, set up events for all topdown
events supported by the kernel.
Add topdown-* as a special case to the event parser, as is
needed for all events containing -.

The actual code to compute the metrics is in follow-on patches.

v2: Use standard sysctl read function.
v3: Move x86 specific code to arch/
v4: Enable --metric-only implicitly for topdown.
v5: Add --single-thread option to not force per core mode
v6: Fix output order of topdown metrics
v7: Allow combining with -d
v8: Remove --single-thread again
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |  16 +++++
 tools/perf/arch/x86/util/Build         |   1 +
 tools/perf/arch/x86/util/group.c       |  27 ++++++++
 tools/perf/builtin-stat.c              | 114 ++++++++++++++++++++++++++++++++-
 tools/perf/util/group.h                |   7 ++
 tools/perf/util/parse-events.l         |   1 +
 6 files changed, 163 insertions(+), 3 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/group.c
 create mode 100644 tools/perf/util/group.h

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 04f23b404bbc..3aaa2916f604 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -204,6 +204,22 @@ Aggregate counts per physical processor for system-wide mode measurements.
 --no-aggr::
 Do not aggregate counts across all monitored CPUs.
 
+--topdown::
+Print top down level 1 metrics if supported by the CPU. This allows
+determining bottlenecks in the CPU pipeline for CPU bound workloads,
+by breaking it down into frontend bound, backend bound, bad speculation
+and retiring. Metrics are only printed when they cross a threshold.
+
+The top down metrics may be collected per core instead of per
+CPU thread. In this case per core mode is automatically enabled
+and -a (global monitoring) is needed, requiring root rights or
+kernel.perf_event_paranoid=-1.
+
+This enables --metric-only, unless overridden with --no-metric-only.
+
+To interpret the results it is usually necessary to know which
+CPUs the workload runs on. If needed the CPUs can be forced using
+taskset.
 
 EXAMPLES
 --------
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 465970370f3e..4cd8a16b1b7b 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -3,6 +3,7 @@ libperf-y += tsc.o
 libperf-y += pmu.o
 libperf-y += kvm-stat.o
 libperf-y += perf_regs.o
+libperf-y += group.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/group.c b/tools/perf/arch/x86/util/group.c
new file mode 100644
index 000000000000..f3039b5ce8b1
--- /dev/null
+++ b/tools/perf/arch/x86/util/group.c
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include "api/fs/fs.h"
+#include "util/group.h"
+
+/*
+ * Check whether we can use a group for top down.
+ * Without a group may get bad results due to multiplexing.
+ */
+bool check_group(bool *warn)
+{
+	int n;
+
+	if (sysctl__read_int("kernel/nmi_watchdog", &n) < 0)
+		return false;
+	if (n > 0) {
+		*warn = true;
+		return false;
+	}
+	return true;
+}
+
+void group_warn(void)
+{
+	fprintf(stderr,
+		"nmi_watchdog enabled with topdown. May give wrong results.\n"
+		"Disable with echo 0 > /proc/sys/kernel/nmi_watchdog\n");
+}
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b407a11c6e22..707eef9314da 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -59,10 +59,13 @@
 #include "util/thread.h"
 #include "util/thread_map.h"
 #include "util/counts.h"
+#include "util/group.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/group.h"
 #include "asm/bug.h"
 
+#include <api/fs/fs.h>
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
@@ -98,6 +101,15 @@ static const char * transaction_limited_attrs = {
 	"}"
 };
 
+static const char * topdown_attrs[] = {
+	"topdown-total-slots",
+	"topdown-slots-retired",
+	"topdown-recovery-bubbles",
+	"topdown-fetch-bubbles",
+	"topdown-slots-issued",
+	NULL,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct target target = {
@@ -112,6 +124,7 @@ static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
 static bool			transaction_run;
+static bool			topdown_run			= false;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -124,6 +137,7 @@ static unsigned int		initial_delay			= 0;
 static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static bool			metric_only			= false;
+static bool			force_metric_only		= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
 static aggr_get_id_t		aggr_get_id;
@@ -1507,6 +1521,14 @@ static int stat__set_big_num(const struct option *opt __maybe_unused,
 	return 0;
 }
 
+static int enable_metric_only(const struct option *opt __maybe_unused,
+			      const char *s __maybe_unused, int unset)
+{
+	force_metric_only = true;
+	metric_only = !unset;
+	return 0;
+}
+
 static const struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1565,8 +1587,10 @@ static const struct option stat_options[] = {
 		     "aggregate counts per thread", AGGR_THREAD),
 	OPT_UINTEGER('D', "delay", &initial_delay,
 		     "ms to wait before starting measurement after program start"),
-	OPT_BOOLEAN(0, "metric-only", &metric_only,
-			"Only print computed metrics. No raw values"),
+	OPT_CALLBACK_NOOPT(0, "metric-only", &metric_only, NULL,
+			"Only print computed metrics. No raw values", enable_metric_only),
+	OPT_BOOLEAN(0, "topdown", &topdown_run,
+			"measure topdown level 1 statistics"),
 	OPT_END()
 };
 
@@ -1759,12 +1783,61 @@ static int perf_stat_init_aggr_mode_file(struct perf_stat *st)
 	return 0;
 }
 
+static void filter_events(const char **attr, char **str, bool use_group)
+{
+	int off = 0;
+	int i;
+	int len = 0;
+	char *s;
+
+	for (i = 0; attr[i]; i++) {
+		if (pmu_have_event("cpu", attr[i])) {
+			len += strlen(attr[i]) + 1;
+			attr[i - off] = attr[i];
+		} else
+			off++;
+	}
+	attr[i - off] = NULL;
+
+	*str = malloc(len + 1 + 2);
+	if (!*str)
+		return;
+	s = *str;
+	if (i - off == 0) {
+		*s = 0;
+		return;
+	}
+	if (use_group)
+		*s++ = '{';
+	for (i = 0; attr[i]; i++) {
+		strcpy(s, attr[i]);
+		s += strlen(s);
+		*s++ = ',';
+	}
+	if (use_group) {
+		s[-1] = '}';
+		*s = 0;
+	} else
+		s[-1] = 0;
+}
+
+__weak bool check_group(bool *warn)
+{
+	*warn = false;
+	return false;
+}
+
+__weak void group_warn(void)
+{
+}
+
 /*
  * Add default attributes, if there were no attributes specified or
  * if -d/--detailed, -d -d or -d -d -d is used:
  */
 static int add_default_attributes(void)
 {
+	int err;
 	struct perf_event_attr default_attrs0[] = {
 
   { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
@@ -1883,7 +1956,6 @@ static int add_default_attributes(void)
 		return 0;
 
 	if (transaction_run) {
-		int err;
 		if (pmu_have_event("cpu", "cycles-ct") &&
 		    pmu_have_event("cpu", "el-start"))
 			err = parse_events(evsel_list, transaction_attrs, NULL);
@@ -1896,6 +1968,42 @@ static int add_default_attributes(void)
 		return 0;
 	}
 
+	if (topdown_run) {
+		char *str = NULL;
+		bool warn = false;
+
+		if (stat_config.aggr_mode != AGGR_GLOBAL &&
+		    stat_config.aggr_mode != AGGR_CORE) {
+			pr_err("top down event configuration requires --per-core mode\n");
+			return -1;
+		}
+		stat_config.aggr_mode = AGGR_CORE;
+		if (nr_cgroups || !target__has_cpu(&target)) {
+			pr_err("top down event configuration requires system-wide mode (-a)\n");
+			return -1;
+		}
+
+		if (!force_metric_only)
+			metric_only = true;
+		filter_events(topdown_attrs, &str, check_group(&warn));
+		if (topdown_attrs[0] && str) {
+			if (warn)
+				group_warn();
+			err = parse_events(evsel_list, str, NULL);
+			if (err) {
+				fprintf(stderr,
+					"Cannot set up top down events %s: %d\n",
+					str, err);
+				free(str);
+				return -1;
+			}
+		} else {
+			fprintf(stderr, "System does not support topdown\n");
+			return -1;
+		}
+		free(str);
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (perf_evlist__add_default_attrs(evsel_list, default_attrs0) < 0)
 			return -1;
diff --git a/tools/perf/util/group.h b/tools/perf/util/group.h
new file mode 100644
index 000000000000..daad3ffdc68d
--- /dev/null
+++ b/tools/perf/util/group.h
@@ -0,0 +1,7 @@
+#ifndef GROUP_H
+#define GROUP_H 1
+
+bool check_group(bool *warn);
+void group_warn(void);
+
+#endif
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 1477fbc78993..744ebe3fa30f 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -259,6 +259,7 @@ cycles-ct					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 cycles-t					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-loads					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 mem-stores					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
+topdown-[a-z-]+					{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
 L1-icache|l1-i|l1i|L1-instruction	|
-- 
2.5.5
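
Editorial sketch of the event-string construction done by filter_events()
in the patch above: unsupported events are dropped, the remaining names are
joined with commas, and the list is wrapped in {} when a group can be used.
This is a standalone illustration, not the perf code. demo_have_event() is a
hypothetical stand-in for pmu_have_event(), and the choice of which event is
"missing" is an assumption made only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for pmu_have_event(): pretend that every event
 * except "topdown-recovery-bubbles" is exported by the PMU.
 */
static bool demo_have_event(const char *name)
{
	return strcmp(name, "topdown-recovery-bubbles") != 0;
}

/*
 * Join the supported names from a NULL-terminated list with commas and
 * wrap them in {} when use_group is set. Returns a malloc()ed string
 * (empty when nothing is supported) or NULL on allocation failure.
 */
static char *demo_build_event_str(const char *const *names, bool use_group)
{
	size_t len = 0;
	int i, supported = 0, written = 0;
	char *str, *s;

	assert(names);
	for (i = 0; names[i]; i++) {
		if (demo_have_event(names[i])) {
			len += strlen(names[i]) + 1;	/* name plus ',' or NUL */
			supported++;
		}
	}

	str = malloc(len + 3);			/* room for '{', '}' and NUL */
	if (!str)
		return NULL;
	s = str;
	if (!supported) {
		*s = '\0';
		return str;
	}
	if (use_group)
		*s++ = '{';
	for (i = 0; names[i]; i++) {
		if (!demo_have_event(names[i]))
			continue;
		if (written++)
			*s++ = ',';
		strcpy(s, names[i]);
		s += strlen(s);
	}
	if (use_group)
		*s++ = '}';
	*s = '\0';
	return str;
}
```

The caller owns the returned string and frees it after handing it to the
event parser, as add_default_attributes() does with str in the patch.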


* [PATCH 09/10] perf, tools, stat: Add computation of TopDown formulas
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (7 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 08/10] perf, tools, stat: Basic support for TopDown in perf stat Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-05 23:04 ` [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv Andi Kleen
  2016-05-12  7:47 ` Add top down metrics to perf stat Jiri Olsa
  10 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Implement the TopDown formulas in perf stat. The topdown basic metrics
reported by the kernel are collected, and the formulas are computed
and output as normal metrics.

See the kernel commit exporting the events for details on the used
metrics.

v2: Always print all metrics, only use thresholds for coloring.
v3: Mark retiring over threshold green, not red.
v4:
Only print one decimal digit
Fix color printing of one metric
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/util/stat-shadow.c | 156 ++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c        |   5 ++
 tools/perf/util/stat.h        |   5 ++
 3 files changed, 166 insertions(+)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index fdb71961143e..182050f2785a 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -36,6 +36,11 @@ static struct stats runtime_dtlb_cache_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_cycles_in_tx_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_transaction_stats[NUM_CTX][MAX_NR_CPUS];
 static struct stats runtime_elision_stats[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_total_slots[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_issued[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_slots_retired[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_fetch_bubbles[NUM_CTX][MAX_NR_CPUS];
+static struct stats runtime_topdown_recovery_bubbles[NUM_CTX][MAX_NR_CPUS];
 static bool have_frontend_stalled;
 
 struct stats walltime_nsecs_stats;
@@ -82,6 +87,12 @@ void perf_stat__reset_shadow_stats(void)
 		sizeof(runtime_transaction_stats));
 	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
+	memset(runtime_topdown_total_slots, 0, sizeof(runtime_topdown_total_slots));
+	memset(runtime_topdown_slots_retired, 0, sizeof(runtime_topdown_slots_retired));
+	memset(runtime_topdown_slots_issued, 0, sizeof(runtime_topdown_slots_issued));
+	memset(runtime_topdown_fetch_bubbles, 0, sizeof(runtime_topdown_fetch_bubbles));
+	memset(runtime_topdown_recovery_bubbles, 0, sizeof(runtime_topdown_recovery_bubbles));
+	have_frontend_stalled = pmu_have_event("cpu", "stalled-cycles-frontend");
 }
 
 /*
@@ -104,6 +115,16 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 *count,
 		update_stats(&runtime_transaction_stats[ctx][cpu], count[0]);
 	else if (perf_stat_evsel__is(counter, ELISION_START))
 		update_stats(&runtime_elision_stats[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_TOTAL_SLOTS))
+		update_stats(&runtime_topdown_total_slots[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_ISSUED))
+		update_stats(&runtime_topdown_slots_issued[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_SLOTS_RETIRED))
+		update_stats(&runtime_topdown_slots_retired[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_BUBBLES))
+		update_stats(&runtime_topdown_fetch_bubbles[ctx][cpu], count[0]);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
+		update_stats(&runtime_topdown_recovery_bubbles[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_stats(&runtime_stalled_cycles_front_stats[ctx][cpu], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
@@ -301,6 +322,100 @@ static void print_ll_cache_misses(int cpu,
 	out->print_metric(out->ctx, color, "%7.2f%%", "of all LL-cache hits", ratio);
 }
 
+/*
+ * High level "TopDown" CPU core pipeline bottleneck breakdown.
+ *
+ * Basic concept following
+ * Yasin, A Top-Down Method for Performance Analysis and Counters Architecture
+ * ISPASS14
+ *
+ * The CPU pipeline is divided into 4 areas that can be bottlenecks:
+ *
+ * Frontend -> Backend -> Retiring
+ * BadSpeculation in addition means out of order execution that is thrown away
+ * (for example branch mispredictions)
+ * Frontend is instruction decoding.
+ * Backend is execution, like computation and accessing data in memory
+ * Retiring is good execution that is not directly bottlenecked
+ *
+ * The formulas are computed in slots.
+ * A slot is one pipeline entry per cycle, up to the pipeline width
+ * (for example a 4-wide pipeline has 4 slots for each cycle)
+ *
+ * Formulas:
+ * BadSpeculation = ((SlotsIssued - SlotsRetired) + RecoveryBubbles) /
+ *			TotalSlots
+ * Retiring = SlotsRetired / TotalSlots
+ * FrontendBound = FetchBubbles / TotalSlots
+ * BackendBound = 1.0 - BadSpeculation - Retiring - FrontendBound
+ *
+ * The kernel provides the mapping to the low level CPU events and any scaling
+ * needed for the CPU pipeline width, for example:
+ *
+ * TotalSlots = Cycles * 4
+ *
+ * The scaling factor is communicated in the sysfs unit.
+ *
+ * In some cases the CPU may not be able to measure all the formulas due to
+ * missing events. In this case multiple formulas are combined where possible.
+ *
+ * Full TopDown supports more levels to sub-divide each area: for example
+ * BackendBound into computing bound and memory bound. For now we only
+ * support Level 1 TopDown.
+ */
+
+static double td_total_slots(int ctx, int cpu)
+{
+	return avg_stats(&runtime_topdown_total_slots[ctx][cpu]);
+}
+
+static double td_bad_spec(int ctx, int cpu)
+{
+	double bad_spec = 0;
+	double total_slots;
+	double total;
+
+	total = avg_stats(&runtime_topdown_slots_issued[ctx][cpu]) -
+		avg_stats(&runtime_topdown_slots_retired[ctx][cpu]) +
+		avg_stats(&runtime_topdown_recovery_bubbles[ctx][cpu]);
+	total_slots = td_total_slots(ctx, cpu);
+	if (total_slots)
+		bad_spec = total / total_slots;
+	return bad_spec;
+}
+
+static double td_retiring(int ctx, int cpu)
+{
+	double retiring = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double ret_slots = avg_stats(&runtime_topdown_slots_retired[ctx][cpu]);
+
+	if (total_slots)
+		retiring = ret_slots / total_slots;
+	return retiring;
+}
+
+static double td_fe_bound(int ctx, int cpu)
+{
+	double fe_bound = 0;
+	double total_slots = td_total_slots(ctx, cpu);
+	double fetch_bub = avg_stats(&runtime_topdown_fetch_bubbles[ctx][cpu]);
+
+	if (total_slots)
+		fe_bound = fetch_bub / total_slots;
+	return fe_bound;
+}
+
+static double td_be_bound(int ctx, int cpu)
+{
+	double sum = (td_fe_bound(ctx, cpu) +
+		      td_bad_spec(ctx, cpu) +
+		      td_retiring(ctx, cpu));
+	if (sum == 0)
+		return 0;
+	return 1.0 - sum;
+}
+
 void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				   double avg, int cpu,
 				   struct perf_stat_output_ctx *out)
@@ -308,6 +423,7 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 	void *ctxp = out->ctx;
 	print_metric_t print_metric = out->print_metric;
 	double total, ratio = 0.0, total2;
+	const char *color = NULL;
 	int ctx = evsel_context(evsel);
 
 	if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
@@ -450,6 +566,46 @@ void perf_stat__print_shadow_stats(struct perf_evsel *evsel,
 				     avg / ratio);
 		else
 			print_metric(ctxp, NULL, NULL, "CPUs utilized", 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_BUBBLES)) {
+		double fe_bound = td_fe_bound(ctx, cpu);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_RETIRED)) {
+		double retiring = td_retiring(ctx, cpu);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RECOVERY_BUBBLES)) {
+		double bad_spec = td_bad_spec(ctx, cpu);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_SLOTS_ISSUED)) {
+		double be_bound = td_be_bound(ctx, cpu);
+		const char *name = "backend bound";
+		static int have_recovery_bubbles = -1;
+
+		/* In case the CPU does not support topdown-recovery-bubbles */
+		if (have_recovery_bubbles < 0)
+			have_recovery_bubbles = pmu_have_event("cpu",
+					"topdown-recovery-bubbles");
+		if (!have_recovery_bubbles)
+			name = "backend bound/bad spec";
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		if (td_total_slots(ctx, cpu) > 0)
+			print_metric(ctxp, color, "%8.1f%%", name,
+					be_bound * 100.);
+		else
+			print_metric(ctxp, NULL, NULL, name, 0);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 		char unit_buf[10];
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index ffa1d0653861..c1ba255f2abe 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -79,6 +79,11 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TRANSACTION_START,	cpu/tx-start/),
 	ID(ELISION_START,	cpu/el-start/),
 	ID(CYCLES_IN_TX_CP,	cpu/cycles-ct/),
+	ID(TOPDOWN_TOTAL_SLOTS, topdown-total-slots),
+	ID(TOPDOWN_SLOTS_ISSUED, topdown-slots-issued),
+	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
+	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
+	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
 };
 #undef ID
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 0150e786ccc7..c29bb94c48a4 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -17,6 +17,11 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TRANSACTION_START,
 	PERF_STAT_EVSEL_ID__ELISION_START,
 	PERF_STAT_EVSEL_ID__CYCLES_IN_TX_CP,
+	PERF_STAT_EVSEL_ID__TOPDOWN_TOTAL_SLOTS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_ISSUED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
 	PERF_STAT_EVSEL_ID__MAX,
 };
 
-- 
2.5.5
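
Editorial sketch of the level 1 formulas from the comment block in the patch
above, written as a standalone function so the arithmetic can be checked by
hand. The struct and function names are hypothetical and are not part of
perf; the inputs are assumed to be already scaled slot counts.

```c
#include <assert.h>

/* Hypothetical container for the five scaled counter values. */
struct td_counts {
	double total_slots;
	double slots_issued;
	double slots_retired;
	double fetch_bubbles;
	double recovery_bubbles;
};

/* The four level 1 metrics, each as a fraction of total slots. */
struct td_level1 {
	double frontend_bound;
	double bad_speculation;
	double retiring;
	double backend_bound;
};

static struct td_level1 td_compute(const struct td_counts *c)
{
	struct td_level1 r = { 0.0, 0.0, 0.0, 0.0 };

	assert(c);
	/* Without a slot count nothing can be normalized; report zeros. */
	if (c->total_slots <= 0.0)
		return r;

	r.frontend_bound = c->fetch_bubbles / c->total_slots;
	r.bad_speculation = (c->slots_issued - c->slots_retired +
			     c->recovery_bubbles) / c->total_slots;
	r.retiring = c->slots_retired / c->total_slots;
	/* Backend bound is whatever the other three do not account for. */
	r.backend_bound = 1.0 - r.frontend_bound - r.bad_speculation -
			  r.retiring;
	return r;
}
```

With 4000 total slots, 2200 issued, 2000 retired, 800 fetch bubbles and 200
recovery bubbles, the fractions work out to 0.2 frontend bound, 0.1 bad
speculation, 0.5 retiring and 0.2 backend bound.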


* [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (8 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 09/10] perf, tools, stat: Add computation of TopDown formulas Andi Kleen
@ 2016-05-05 23:04 ` Andi Kleen
  2016-05-12  8:03   ` Jiri Olsa
  2016-05-12  7:47 ` Add top down metrics to perf stat Jiri Olsa
  10 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-05-05 23:04 UTC (permalink / raw)
  To: acme, peterz; +Cc: jolsa, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add debug output of raw counter values per CPU when
perf stat -vv is specified, together with their cpu numbers.
This is very useful to debug problems with per core counters,
where we can normally only see aggregated values.

v2: Make it depend on -vv, not -v
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-stat.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 707eef9314da..7c5c50b61b28 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -313,6 +313,14 @@ static int read_counter(struct perf_evsel *counter)
 					return -1;
 				}
 			}
+
+			if (verbose > 1) {
+				fprintf(stat_config.output,
+					"%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
+						perf_evsel__name(counter),
+						cpu,
+						count->val, count->ena, count->run);
+			}
 		}
 	}
 
-- 
2.5.5
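
Editorial sketch of the debug line this patch prints: event name, CPU number,
then the raw value and the enabled and running times. The helper names and
the struct are hypothetical. The ratio helper assumes the usual perf meaning
of the last two fields, where run smaller than ena indicates that the counter
was multiplexed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the three values printed per counter and CPU. */
struct demo_counts {
	uint64_t val;	/* raw counter value */
	uint64_t ena;	/* time the event was enabled */
	uint64_t run;	/* time the event was actually counting */
};

/* Format one line in the same layout as the fprintf() in the patch. */
static int demo_format_raw(char *buf, size_t size, const char *name,
			   int cpu, const struct demo_counts *c)
{
	assert(buf && name && c);
	return snprintf(buf, size,
			"%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
			name, cpu, c->val, c->ena, c->run);
}

/* Fraction of the enabled time during which the counter was running. */
static double demo_running_fraction(const struct demo_counts *c)
{
	assert(c);
	return c->ena ? (double)c->run / (double)c->ena : 0.0;
}
```

A fraction below 1.0 for any of the topdown events is a hint that the group
did not fit and the values were multiplexed.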


* Re: [PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-05 23:03 ` [PATCH 01/10] x86: Add topology_max_smt_threads() Andi Kleen
@ 2016-05-06 10:13   ` Peter Zijlstra
  2016-05-06 10:47     ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-05-06 10:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, jolsa, linux-kernel, Andi Kleen, Thomas Gleixner, x86


It might help if you actually Cc'd tglx and x86 in general.

On Thu, May 05, 2016 at 04:03:58PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> For SMT specific workarounds it is useful to know if SMT is active
> on any online CPU in the system. This currently requires a loop
> over all online CPUs.
> 
> Add a global variable that is updated with the maximum number
> of smt threads on any CPU on online/offline, and use it for
> topology_max_smt_threads()
> 
> The single call is easier to use than a loop.
> 
> Not exported to user space because user space already can use
> the existing sibling interfaces to find this out.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/topology.h |  5 +++++
>  arch/x86/kernel/smpboot.c       | 24 ++++++++++++++++++++++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7f991bd5031b..4707555f209e 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -129,6 +129,10 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
>  
>  extern unsigned int __max_logical_packages;
>  #define topology_max_packages()			(__max_logical_packages)
> +
> +extern int max_smt_threads;
> +#define topology_max_smt_threads()		max_smt_threads
> +
>  int topology_update_package_map(unsigned int apicid, unsigned int cpu);
>  extern int topology_phys_to_logical_pkg(unsigned int pkg);
>  #else
> @@ -136,6 +140,7 @@ extern int topology_phys_to_logical_pkg(unsigned int pkg);
>  static inline int
>  topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
>  static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
> +#define topology_max_smt_threads()		1
>  #endif
>  
>  static inline void arch_fix_phys_package_id(int num, u32 slot)
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index a2065d3b3b39..6e5a721857f5 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -105,6 +105,9 @@ static unsigned int max_physical_pkg_id __read_mostly;
>  unsigned int __max_logical_packages __read_mostly;
>  EXPORT_SYMBOL(__max_logical_packages);
>  
> +/* Maximum number of SMT threads on any online core */
> +int max_smt_threads __read_mostly;
> +
>  static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
>  {
>  	unsigned long flags;
> @@ -489,6 +492,7 @@ void set_cpu_sibling_map(int cpu)
>  	struct cpuinfo_x86 *c = &cpu_data(cpu);
>  	struct cpuinfo_x86 *o;
>  	int i;
> +	int threads;
>  
>  	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
>  
> @@ -545,6 +549,10 @@ void set_cpu_sibling_map(int cpu)
>  		if (match_die(c, o) && !topology_same_node(c, o))
>  			primarily_use_numa_for_topology();
>  	}
> +
> +	threads = cpumask_weight(topology_sibling_cpumask(cpu));
> +	if (threads > max_smt_threads)
> +		max_smt_threads = threads;
>  }
>  
>  /* maps the cpu to the sched domain representing multi-core */
> @@ -1436,6 +1444,21 @@ __init void prefill_possible_map(void)
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  
> +/* Recompute SMT state for all CPUs on offline */
> +static void recompute_smt_state(void)
> +{
> +	int max_threads;
> +	int cpu;
> +
> +	max_threads = 0;
> +	for_each_online_cpu (cpu) {
> +		int threads = cpumask_weight(topology_sibling_cpumask(cpu));
> +		if (threads > max_threads)
> +			max_threads = threads;
> +	}
> +	max_smt_threads = max_threads;
> +}
> +
>  static void remove_siblinginfo(int cpu)
>  {
>  	int sibling;
> @@ -1460,6 +1483,7 @@ static void remove_siblinginfo(int cpu)
>  	c->phys_proc_id = 0;
>  	c->cpu_core_id = 0;
>  	cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
> +	recompute_smt_state();
>  }
>  
>  static void remove_cpu_from_maps(int cpu)
> -- 
> 2.5.5
> 


* Re: [PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-06 10:13   ` Peter Zijlstra
@ 2016-05-06 10:47     ` Thomas Gleixner
  2016-05-06 17:24       ` [UPDATED PATCH " Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2016-05-06 10:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, acme, jolsa, linux-kernel, Andi Kleen, x86

On Fri, 6 May 2016, Peter Zijlstra wrote:
> On Thu, May 05, 2016 at 04:03:58PM -0700, Andi Kleen wrote:
> >  
> >  extern unsigned int __max_logical_packages;
> >  #define topology_max_packages()			(__max_logical_packages)
> > +
> > +extern int max_smt_threads;

Please follow the above convention and prepend it with underscores. That way
it's clear that the variable should not be touched outside of the core which
initializes it.

> > +#define topology_max_smt_threads()		max_smt_threads
> > +

> >  static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
> >  {
> >  	unsigned long flags;
> > @@ -489,6 +492,7 @@ void set_cpu_sibling_map(int cpu)
> >  	struct cpuinfo_x86 *c = &cpu_data(cpu);
> >  	struct cpuinfo_x86 *o;
> >  	int i;
> > +	int threads;

  int i, threads; 

Please

> >  
> >  	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
> >  
> > @@ -545,6 +549,10 @@ void set_cpu_sibling_map(int cpu)
> >  		if (match_die(c, o) && !topology_same_node(c, o))
> >  			primarily_use_numa_for_topology();
> >  	}
> > +
> > +	threads = cpumask_weight(topology_sibling_cpumask(cpu));
> > +	if (threads > max_smt_threads)
> > +		max_smt_threads = threads;
> >  }
> >  
> >  /* maps the cpu to the sched domain representing multi-core */
> > @@ -1436,6 +1444,21 @@ __init void prefill_possible_map(void)
> >  
> >  #ifdef CONFIG_HOTPLUG_CPU
> >  
> > +/* Recompute SMT state for all CPUs on offline */
> > +static void recompute_smt_state(void)
> > +{
> > +	int max_threads;
> > +	int cpu;

Ditto

> > +
> > +	max_threads = 0;
> > +	for_each_online_cpu (cpu) {
> > +		int threads = cpumask_weight(topology_sibling_cpumask(cpu));

Missing newline.

Otherwise that looks good.

Thanks,

	tglx


* [UPDATED PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-06 10:47     ` Thomas Gleixner
@ 2016-05-06 17:24       ` Andi Kleen
  2016-05-07  8:11         ` Thomas Gleixner
  2016-05-12  8:07         ` Ingo Molnar
  0 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-06 17:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Andi Kleen, acme, jolsa, linux-kernel, Andi Kleen, x86

For SMT specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.
    
Add a global variable that is updated with the maximum number
of smt threads on any CPU on online/offline, and use it for
topology_max_smt_threads()
    
The single call is easier to use than a loop.
    
Not exported to user space because user space already can use
the existing sibling interfaces to find this out.
    
v2: Code formatting changes and use __ for variable name
Cc: tglx@linutronix.de
Signed-off-by: Andi Kleen <ak@linux.intel.com>

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7f991bd5031b..f79181c03561 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -129,6 +129,10 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
+
+extern int __max_smt_threads;
+#define topology_max_smt_threads()		__max_smt_threads
+
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 extern int topology_phys_to_logical_pkg(unsigned int pkg);
 #else
@@ -136,6 +140,7 @@ extern int topology_phys_to_logical_pkg(unsigned int pkg);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+#define topology_max_smt_threads()		1
 #endif
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a2065d3b3b39..76bf9d855ee3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -105,6 +105,9 @@ static unsigned int max_physical_pkg_id __read_mostly;
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 
+/* Maximum number of SMT threads on any online core */
+int __max_smt_threads __read_mostly;
+
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
 {
 	unsigned long flags;
@@ -488,7 +491,7 @@ void set_cpu_sibling_map(int cpu)
 	bool has_mp = has_smt || boot_cpu_data.x86_max_cores > 1;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	struct cpuinfo_x86 *o;
-	int i;
+	int i, threads;
 
 	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
 
@@ -545,6 +548,10 @@ void set_cpu_sibling_map(int cpu)
 		if (match_die(c, o) && !topology_same_node(c, o))
 			primarily_use_numa_for_topology();
 	}
+
+	threads = cpumask_weight(topology_sibling_cpumask(cpu));
+	if (threads > __max_smt_threads)
+		__max_smt_threads = threads;
 }
 
 /* maps the cpu to the sched domain representing multi-core */
@@ -1436,6 +1443,21 @@ __init void prefill_possible_map(void)
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+/* Recompute SMT state for all CPUs on offline */
+static void recompute_smt_state(void)
+{
+	int max_threads, cpu;
+
+	max_threads = 0;
+	for_each_online_cpu (cpu) {
+		int threads = cpumask_weight(topology_sibling_cpumask(cpu));
+
+		if (threads > max_threads)
+			max_threads = threads;
+	}
+	__max_smt_threads = max_threads;
+}
+
 static void remove_siblinginfo(int cpu)
 {
 	int sibling;
@@ -1460,6 +1482,7 @@ static void remove_siblinginfo(int cpu)
 	c->phys_proc_id = 0;
 	c->cpu_core_id = 0;
 	cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
+	recompute_smt_state();
 }
 
 static void remove_cpu_from_maps(int cpu)
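
Editorial sketch of the bookkeeping in the patch above as a user-space model:
bringing a CPU online can only raise the maximum, so a compare is enough,
while taking one offline requires a recompute over all online CPUs. The
topology table, the array sizes and all demo_ names are assumptions for
illustration and do not correspond to kernel interfaces.

```c
#include <assert.h>

#define DEMO_NR_CPUS 8

/* Hypothetical topology: two hardware threads per core. */
static const int demo_core_of[DEMO_NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };
static int demo_online[DEMO_NR_CPUS];
static int demo_max_smt_threads;

/* Number of online CPUs sharing a core with cpu, including cpu itself. */
static int demo_sibling_weight(int cpu)
{
	int i, n = 0;

	for (i = 0; i < DEMO_NR_CPUS; i++)
		if (demo_online[i] && demo_core_of[i] == demo_core_of[cpu])
			n++;
	return n;
}

/* Online path: the maximum can only grow, so one compare suffices. */
static void demo_cpu_online(int cpu)
{
	int threads;

	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	demo_online[cpu] = 1;
	threads = demo_sibling_weight(cpu);
	if (threads > demo_max_smt_threads)
		demo_max_smt_threads = threads;
}

/* Offline path: the maximum may shrink, so recompute it from scratch. */
static void demo_cpu_offline(int cpu)
{
	int i, max_threads = 0;

	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	demo_online[cpu] = 0;
	for (i = 0; i < DEMO_NR_CPUS; i++) {
		int threads;

		if (!demo_online[i])
			continue;
		threads = demo_sibling_weight(i);
		if (threads > max_threads)
			max_threads = threads;
	}
	demo_max_smt_threads = max_threads;
}
```

Readers of the value then need no loop of their own, which is the point of
topology_max_smt_threads() in the patch.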


* Re: [UPDATED PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-06 17:24       ` [UPDATED PATCH " Andi Kleen
@ 2016-05-07  8:11         ` Thomas Gleixner
  2016-05-12  8:07         ` Ingo Molnar
  1 sibling, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2016-05-07  8:11 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Peter Zijlstra, acme, jolsa, linux-kernel, Andi Kleen, x86

On Fri, 6 May 2016, Andi Kleen wrote:

> For SMT specific workarounds it is useful to know if SMT is active
> on any online CPU in the system. This currently requires a loop
> over all online CPUs.
>     
> Add a global variable that is updated with the maximum number
> of smt threads on any CPU on online/offline, and use it for
> topology_max_smt_threads()
>     
> The single call is easier to use than a loop.
>     
> Not exported to user space because user space already can use
> the existing sibling interfaces to find this out.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales
  2016-05-05 23:04 ` [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales Andi Kleen
@ 2016-05-07 19:10   ` Jiri Olsa
  2016-05-07 19:24     ` Andi Kleen
  2016-05-20  6:42   ` [tip:perf/urgent] perf " tip-bot for Andi Kleen
  1 sibling, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-07 19:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen

On Thu, May 05, 2016 at 04:04:03PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> When the scaling factor is a full integer don't display fractional
> digits. This avoids unnecessary .00 output for topdown metrics
> with scale factors.
> 
> v2: Remove redundant check.
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  tools/perf/builtin-stat.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 1f19f2f999c8..b407a11c6e22 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -66,6 +66,7 @@
>  #include <stdlib.h>
>  #include <sys/prctl.h>
>  #include <locale.h>
> +#include <math.h>
>  
>  #define DEFAULT_SEPARATOR	" "
>  #define CNTR_NOT_SUPPORTED	"<not supported>"
> @@ -978,12 +979,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
>  	const char *fmt;
>  
>  	if (csv_output) {
> -		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
> +		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
>  	} else {
>  		if (big_num)
> -			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> +			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
>  		else
> -			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> +			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";

how about the rest of the code? we display % also in print_running
and print_noise_pct functions and maybe some place else

would be nice having unified output for %

thanks,
jirka

>  	}
>  
>  	aggr_printout(evsel, id, nr);
> -- 
> 2.5.5
> 
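
Editorial sketch of the format selection discussed in this patch: fractional
digits are printed only when the scale factor itself has a fractional part.
The quoted patch tests this with floor(sc) != sc; the sketch uses a cast
instead so that it builds without libm, which is an assumption of the example
and gives the same answer for scale factors of ordinary magnitude. All demo_
names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* True when sc has no fractional part (for |sc| well below 2^63). */
static int demo_scale_is_integer(double sc)
{
	return (double)(long long)sc == sc;
}

/*
 * Print an already scaled value followed by its unit, with two
 * fractional digits only when the scale factor is not an integer.
 */
static int demo_print_count(char *buf, size_t size, double value,
			    double sc, const char *unit)
{
	assert(buf && unit);
	if (demo_scale_is_integer(sc))
		return snprintf(buf, size, "%.0f%s", value, unit);
	return snprintf(buf, size, "%.2f%s", value, unit);
}
```

A scale of 4 therefore prints 1000 rather than 1000.00, while a scale of 0.5
keeps the two digits.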


* Re: [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics
  2016-05-05 23:04 ` [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics Andi Kleen
@ 2016-05-07 19:14   ` Jiri Olsa
  2016-05-10 20:30   ` [tip:perf/core] perf " tip-bot for Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-05-07 19:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen

On Thu, May 05, 2016 at 04:04:04PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Scale values by unit before passing them to the metrics printing functions.
> This is needed for TopDown, because it needs to scale the slots correctly
> by pipeline width / SMTness.
> 
> For existing metrics it shouldn't make any difference, as those generally
> use events that don't have any units.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

> ---
>  tools/perf/util/stat.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
> index 4d9b481cf3b6..ffa1d0653861 100644
> --- a/tools/perf/util/stat.c
> +++ b/tools/perf/util/stat.c
> @@ -307,6 +307,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
>  	struct perf_counts_values *aggr = &counter->counts->aggr;
>  	struct perf_stat_evsel *ps = counter->priv;
>  	u64 *count = counter->counts->aggr.values;
> +	u64 val;
>  	int i, ret;
>  
>  	aggr->val = aggr->ena = aggr->run = 0;
> @@ -346,7 +347,8 @@ int perf_stat_process_counter(struct perf_stat_config *config,
>  	/*
>  	 * Save the full runtime - to allow normalization during printout:
>  	 */
> -	perf_stat__update_shadow_stats(counter, count, 0);
> +	val = counter->scale * *count;
> +	perf_stat__update_shadow_stats(counter, &val, 0);
>  
>  	return 0;
>  }
> -- 
> 2.5.5
> 


* Re: [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales
  2016-05-07 19:10   ` Jiri Olsa
@ 2016-05-07 19:24     ` Andi Kleen
  2016-05-11 13:00       ` Jiri Olsa
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-05-07 19:24 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, linux-kernel, Andi Kleen

> >  	if (csv_output) {
> > -		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
> > +		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
> >  	} else {
> >  		if (big_num)
> > -			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> > +			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
> >  		else
> > -			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> > +			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
> 
> how about the rest of the code? we display % also in print_running
> and print_noise_pct functions and maybe some place else

For those it doesn't matter. In fact it's probably better there
to always show the fractions.

It is just confusing for metrics.

-Andi
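
For readers following the format selection quoted above, here is a
minimal user-space sketch of the idea, not the perf code itself. The
helper names scale_is_integer() and format_count() are made up for
illustration. The integer test uses a truncating cast instead of the
patch's floor() call so that the sketch builds without libm; the two
tests agree for any scale that fits in a long long.

```c
#include <stdio.h>

/* Hypothetical helper: true when the scale has no fractional part. */
static int scale_is_integer(double sc)
{
	return (double)(long long)sc == sc;
}

/*
 * Hypothetical helper: print a value the way the CSV branch of the
 * patch does, keeping two fractional digits only for fractional scales.
 */
static int format_count(char *buf, size_t len, double sc, double val,
			const char *unit)
{
	if (scale_is_integer(sc))
		return snprintf(buf, len, "%.0f%s", val, unit);

	return snprintf(buf, len, "%.2f%s", val, unit);
}
```

With a scale of 4 (the no-HT value of topdown-total-slots.scale in
patch 03) the value prints without ".00"; with a fractional scale such
as 0.5 the two digits are kept.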


* Re: [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status
  2016-05-05 23:03 ` [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status Andi Kleen
@ 2016-05-09  9:42   ` Peter Zijlstra
  2016-05-09 14:27     ` Andi Kleen
  2016-05-12  8:05   ` Ingo Molnar
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-05-09  9:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, jolsa, linux-kernel, Andi Kleen, Thomas Gleixner, x86

On Thu, May 05, 2016 at 04:03:59PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add a way to show different sysfs event attributes depending on
> whether HyperThreading is on or off. This is difficult to determine
> early at boot, so we just do it dynamically when the sysfs
> attribute is read.
> 
> v2:
> Compute HT status only once in CPU online/offline hooks.
> v3: Use topology_max_smt_threads()
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/events/core.c       | 24 ++++++++++++++++++++++++
>  arch/x86/events/perf_event.h | 14 ++++++++++++++
>  include/linux/perf_event.h   |  7 +++++++
>  3 files changed, 45 insertions(+)
> 

Should this not now live in /sys/devices/system/cpu/ ? Thomas?


* Re: [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status
  2016-05-09  9:42   ` Peter Zijlstra
@ 2016-05-09 14:27     ` Andi Kleen
  2016-05-09 14:34       ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-05-09 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, acme, jolsa, linux-kernel, Thomas Gleixner, x86

On Mon, May 09, 2016 at 11:42:19AM +0200, Peter Zijlstra wrote:
> On Thu, May 05, 2016 at 04:03:59PM -0700, Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > 
> > Add a way to show different sysfs event attributes depending on
> > whether HyperThreading is on or off. This is difficult to determine
> > early at boot, so we just do it dynamically when the sysfs
> > attribute is read.
> > 
> > v2:
> > Compute HT status only once in CPU online/offline hooks.
> > v3: Use topology_max_smt_threads()
> > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > ---
> >  arch/x86/events/core.c       | 24 ++++++++++++++++++++++++
> >  arch/x86/events/perf_event.h | 14 ++++++++++++++
> >  include/linux/perf_event.h   |  7 +++++++
> >  3 files changed, 45 insertions(+)
> > 
> 
> Should this not now live in /sys/devices/system/cpu/ ? Thomas?

This would be incompatible with all previous perf tools.

Also not clear why you would want to move such events, just
because they depend on SMT.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status
  2016-05-09 14:27     ` Andi Kleen
@ 2016-05-09 14:34       ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-05-09 14:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, acme, jolsa, linux-kernel, Thomas Gleixner, x86

On Mon, May 09, 2016 at 07:27:41AM -0700, Andi Kleen wrote:
> On Mon, May 09, 2016 at 11:42:19AM +0200, Peter Zijlstra wrote:
> > On Thu, May 05, 2016 at 04:03:59PM -0700, Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > 
> > > Add a way to show different sysfs event attributes depending on
> > > whether HyperThreading is on or off. This is difficult to determine
> > > early at boot, so we just do it dynamically when the sysfs
> > > attribute is read.
> > > 
> > > v2:
> > > Compute HT status only once in CPU online/offline hooks.
> > > v3: Use topology_max_smt_threads()
> > > Signed-off-by: Andi Kleen <ak@linux.intel.com>
> > > ---
> > >  arch/x86/events/core.c       | 24 ++++++++++++++++++++++++
> > >  arch/x86/events/perf_event.h | 14 ++++++++++++++
> > >  include/linux/perf_event.h   |  7 +++++++
> > >  3 files changed, 45 insertions(+)
> > > 
> > 
> > Should this not now live in /sys/devices/system/cpu/ ? Thomas?
> 
> This would be incompatible with all previous perf tools.
> 
> Also not clear why you would want to move such events, just
> because they depend on SMT.

Durr, my bad; I read the patch wrong. I'll go have another look.

Thanks


* [tip:perf/core] perf stat: Scale values by unit before metrics
  2016-05-05 23:04 ` [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics Andi Kleen
  2016-05-07 19:14   ` Jiri Olsa
@ 2016-05-10 20:30   ` tip-bot for Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: tip-bot for Andi Kleen @ 2016-05-10 20:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, peterz, jolsa, acme, hpa, ak, mingo, tglx

Commit-ID:  f340c5fc93bda334efd9f2b5855ef0d3746e1564
Gitweb:     http://git.kernel.org/tip/f340c5fc93bda334efd9f2b5855ef0d3746e1564
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Thu, 5 May 2016 16:04:04 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 9 May 2016 13:42:09 -0300

perf stat: Scale values by unit before metrics

Scale values by unit before passing them to the metrics printing
functions.  This is needed for TopDown, because it needs to scale the
slots correctly by pipeline width / SMTness.

For existing metrics it shouldn't make any difference, as those
generally use events that don't have any units.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1462489447-31832-8-git-send-email-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/stat.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 4d9b481..ffa1d06 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -307,6 +307,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	struct perf_counts_values *aggr = &counter->counts->aggr;
 	struct perf_stat_evsel *ps = counter->priv;
 	u64 *count = counter->counts->aggr.values;
+	u64 val;
 	int i, ret;
 
 	aggr->val = aggr->ena = aggr->run = 0;
@@ -346,7 +347,8 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 	/*
 	 * Save the full runtime - to allow normalization during printout:
 	 */
-	perf_stat__update_shadow_stats(counter, count, 0);
+	val = counter->scale * *count;
+	perf_stat__update_shadow_stats(counter, &val, 0);
 
 	return 0;
 }
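
As an illustration of the scaling step in the diff above, here is a
small stand-alone sketch. scale_count() is a hypothetical helper, not
a perf function; it only mirrors the expression
"val = counter->scale * *count", where the product is computed in
double and converted back to an unsigned 64-bit value.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring "val = counter->scale * *count".
 * The multiplication happens in double, so the fractional part is
 * truncated on conversion and counts above 2^53 lose precision.
 */
static uint64_t scale_count(double scale, uint64_t count)
{
	return (uint64_t)(scale * (double)count);
}
```

Using the scale strings from patch 03, 1000 counted cycles become 4000
slots with the no-HT scale of 4 and 2000 slots with the HT scale of 2.
Events without a unit have a scale of 1, so their values pass through
unchanged.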


* Re: [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales
  2016-05-07 19:24     ` Andi Kleen
@ 2016-05-11 13:00       ` Jiri Olsa
  2016-05-11 16:43         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-11 13:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen

On Sat, May 07, 2016 at 12:24:25PM -0700, Andi Kleen wrote:
> > >  	if (csv_output) {
> > > -		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
> > > +		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
> > >  	} else {
> > >  		if (big_num)
> > > -			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> > > +			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
> > >  		else
> > > -			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> > > +			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
> > 
> > how about the rest of the code? we display % also in print_running
> > and print_noise_pct functions and maybe some place else
> 
> For those it doesn't matter. In fact it's probably better there
> to always show the fractions.
> 
> It is just confusing for metrics.

ok, let's try and see, we can always follow up
with the rest if there's a need

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka


* Re: [PATCH 03/10] x86, perf: Add Top Down events to Intel Core
  2016-05-05 23:04 ` [PATCH 03/10] x86, perf: Add Top Down events to Intel Core Andi Kleen
@ 2016-05-11 13:23   ` Jiri Olsa
  2016-05-11 13:29     ` Peter Zijlstra
  2016-05-12  8:10   ` Ingo Molnar
  1 sibling, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-11 13:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen

On Thu, May 05, 2016 at 04:04:00PM -0700, Andi Kleen wrote:

SNIP

> +
> +EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
> +	"event=0x3c,umask=0x0",			/* cpu_clk_unhalted.thread */
> +	"event=0x3c,umask=0x0,any=1");		/* cpu_clk_unhalted.thread_any */
> +EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
> +EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
> +	"event=0xe,umask=0x1");			/* uops_issued.any */
> +EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
> +	"event=0xc2,umask=0x2");		/* uops_retired.retire_slots */
> +EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
> +	"event=0x9c,umask=0x1");		/* idq_uops_not_delivered_core */
> +EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
> +	"event=0xd,umask=0x3,cmask=1",		/* int_misc.recovery_cycles */
> +	"event=0xd,umask=0x3,cmask=1,any=1");	/* int_misc.recovery_cycles_any */
> +EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
> +	"4", "2");
> +
>  struct attribute *snb_events_attrs[] = {
>  	EVENT_PTR(mem_ld_snb),
>  	EVENT_PTR(mem_st_snb),
> +	EVENT_PTR(td_slots_issued),
> +	EVENT_PTR(td_slots_retired),
> +	EVENT_PTR(td_fetch_bubbles),
> +	EVENT_PTR(td_total_slots),
> +	EVENT_PTR(td_total_slots_scale),
> +	EVENT_PTR(td_recovery_bubbles),
> +	EVENT_PTR(td_recovery_bubbles_scale),

Peter, Ingo,
any thoughts about adding these events? The rest of the
tooling code is based on them being accepted..

thanks,
jirka


* Re: [PATCH 03/10] x86, perf: Add Top Down events to Intel Core
  2016-05-11 13:23   ` Jiri Olsa
@ 2016-05-11 13:29     ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-05-11 13:29 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, jolsa, linux-kernel, Andi Kleen

On Wed, May 11, 2016 at 03:23:36PM +0200, Jiri Olsa wrote:
> On Thu, May 05, 2016 at 04:04:00PM -0700, Andi Kleen wrote:
> 
> SNIP
> 
> > +
> > +EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
> > +	"event=0x3c,umask=0x0",			/* cpu_clk_unhalted.thread */
> > +	"event=0x3c,umask=0x0,any=1");		/* cpu_clk_unhalted.thread_any */
> > +EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
> > +EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
> > +	"event=0xe,umask=0x1");			/* uops_issued.any */
> > +EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
> > +	"event=0xc2,umask=0x2");		/* uops_retired.retire_slots */
> > +EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
> > +	"event=0x9c,umask=0x1");		/* idq_uops_not_delivered_core */
> > +EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
> > +	"event=0xd,umask=0x3,cmask=1",		/* int_misc.recovery_cycles */
> > +	"event=0xd,umask=0x3,cmask=1,any=1");	/* int_misc.recovery_cycles_any */
> > +EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
> > +	"4", "2");
> > +
> >  struct attribute *snb_events_attrs[] = {
> >  	EVENT_PTR(mem_ld_snb),
> >  	EVENT_PTR(mem_st_snb),
> > +	EVENT_PTR(td_slots_issued),
> > +	EVENT_PTR(td_slots_retired),
> > +	EVENT_PTR(td_fetch_bubbles),
> > +	EVENT_PTR(td_total_slots),
> > +	EVENT_PTR(td_total_slots_scale),
> > +	EVENT_PTR(td_recovery_bubbles),
> > +	EVENT_PTR(td_recovery_bubbles_scale),
> 
> Peter, Ingo,
> any thoughts about adding these events? The rest of the
> tooling code is based on them being accepted..

I queued up these patches; but left the tool parts.


* Re: [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales
  2016-05-11 13:00       ` Jiri Olsa
@ 2016-05-11 16:43         ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 48+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-05-11 16:43 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, peterz, jolsa, linux-kernel, Andi Kleen

On Wed, May 11, 2016 at 03:00:31PM +0200, Jiri Olsa wrote:
> On Sat, May 07, 2016 at 12:24:25PM -0700, Andi Kleen wrote:
> > > >  	if (csv_output) {
> > > > -		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
> > > > +		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
> > > >  	} else {
> > > >  		if (big_num)
> > > > -			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
> > > > +			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
> > > >  		else
> > > > -			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
> > > > +			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
> > > 
> > > how about the rest of the code? we display % also in print_running
> > > and print_noise_pct functions and maybe some place else
> > 
> > For those it doesn't matter. In fact it's probably better there
> > to always show the fractions.
> > 
> > It is just confusing for metrics.
> 
> ok, let's try and see, we can always follow up
> with the rest if there's a need
> 
> Acked-by: Jiri Olsa <jolsa@kernel.org>

Missed this one, applied.

- Arnaldo


* Re: Add top down metrics to perf stat
  2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
                   ` (9 preceding siblings ...)
  2016-05-05 23:04 ` [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv Andi Kleen
@ 2016-05-12  7:47 ` Jiri Olsa
  10 siblings, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-05-12  7:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel

On Thu, May 05, 2016 at 04:03:57PM -0700, Andi Kleen wrote:

SNIP

> The kernel declares the events supported by the current
> CPU and perf stat then computes the formulas based on the
> available metrics.
> 
> 
> Example output:
> 
> $ perf stat --topdown -I 1000 cmd
>      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
>      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
>      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
>      2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
>      2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
>      3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
>      3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
>      4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
>      4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
>      5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
>      5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
>      6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
>      6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
>      7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           


I'm getting -0.0% for bad speculation; I'm on your perf/top-down-20

thanks,
jirka


[root@ibm-x3650m4-01 perf]# ./perf stat --topdown -a -I 1000
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
     1.002322346                   retiring             bad speculation      frontend bound       backend bound        
     1.002322346 S0-C0           2     38.3%                0.0%               57.9%                3.8%           
     1.002322346 S0-C1           2     38.3%                0.0%               59.1%                2.6%           
     1.002322346 S0-C2           2     38.3%                0.0%               59.0%                2.6%           
     1.002322346 S0-C3           2     38.3%                0.0%               58.7%                3.0%           
     1.002322346 S0-C4           2     38.3%               -0.0%               58.6%                3.1%           
     1.002322346 S0-C5           2     38.4%               -0.0%               58.3%                3.3%           
     1.002322346 S1-C0           2     38.3%               -0.0%               58.7%                3.0%           
     1.002322346 S1-C1           2     38.3%                0.0%               59.7%                2.0%           
     1.002322346 S1-C2           2     38.3%               -0.0%               59.3%                2.5%           
     1.002322346 S1-C3           2     38.3%               -0.0%               59.1%                2.5%           
     1.002322346 S1-C4           2     38.3%                0.0%               59.1%                2.6%           
     1.002322346 S1-C5           2     38.3%               -0.0%               59.1%                2.7%           
     2.005429451 S0-C0           2     38.3%                0.0%               57.9%                3.8%           
     ...
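
One possible explanation for the -0.0% values, which this thread does
not confirm, is that the computed ratio comes out as a tiny negative
number and printf() keeps the sign when rounding it to one digit. The
sketch below only demonstrates that printf() behavior and one way to
clamp it. clamp_ratio() and format_pct() are hypothetical helpers, and
the "%.1f%%" format is an assumption about the output, not taken from
the perf source.

```c
#include <stdio.h>

/* Hypothetical helper: map negative values and -0.0 to +0.0. */
static double clamp_ratio(double ratio)
{
	return ratio > 0.0 ? ratio : 0.0;
}

/* Hypothetical helper: print a ratio as a percentage with one digit. */
static int format_pct(char *buf, size_t len, double ratio)
{
	return snprintf(buf, len, "%.1f%%", ratio * 100.0);
}
```

A ratio of -0.0001 formats as "-0.0%", while the clamped value formats
as "0.0%".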


* Re: [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv
  2016-05-05 23:04 ` [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv Andi Kleen
@ 2016-05-12  8:03   ` Jiri Olsa
  2016-05-20  0:05     ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-12  8:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen

On Thu, May 05, 2016 at 04:04:07PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add debug output of raw counter values per CPU when
> perf stat -vv is specified, together with their CPU numbers.
> This is very useful to debug problems with per core counters,
> where we can normally only see aggregated values.
> 
> v2: Make it depend on -vv, not -v
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  tools/perf/builtin-stat.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 707eef9314da..7c5c50b61b28 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -313,6 +313,14 @@ static int read_counter(struct perf_evsel *counter)
>  					return -1;
>  				}
>  			}
> +
> +			if (verbose > 1) {
> +				fprintf(stat_config.output,
> +					"%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
> +						perf_evsel__name(counter),
> +						cpu,
> +						count->val, count->ena, count->run);
> +			}

hi,
we already have similar output for aggregated counters,
could you please consider something like below to clearly
separate them?

[root@ibm-x3650m4-01 perf]# ./perf stat  -e cycles -I 1000 -vv -a -C 0,1
...
cycles: CPU 0: 1298783264 1000126956 1000126956
cycles: CPU 1: 1298791660 1000134589 1000134589
cycles: AGGR: 2597574924 2000261545 2000261545
...

thanks,
jirka


---
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7c5c50b61b28..bd0d67ebb757 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -316,7 +316,7 @@ static int read_counter(struct perf_evsel *counter)
 
 			if (verbose > 1) {
 				fprintf(stat_config.output,
-					"%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
+					"%s: CPU %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
 						perf_evsel__name(counter),
 						cpu,
 						count->val, count->ena, count->run);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index c1ba255f2abe..5ddeea1399ee 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -345,7 +345,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
 		update_stats(&ps->res_stats[i], count[i]);
 
 	if (verbose) {
-		fprintf(config->output, "%s: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
+		fprintf(config->output, "%s: AGGR: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
 			perf_evsel__name(counter), count[0], count[1], count[2]);
 	}
 


* Re: [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status
  2016-05-05 23:03 ` [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status Andi Kleen
  2016-05-09  9:42   ` Peter Zijlstra
@ 2016-05-12  8:05   ` Ingo Molnar
  1 sibling, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2016-05-12  8:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen


* Andi Kleen <andi@firstfloor.org> wrote:

> +ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
> +			  char *page)
> +{
> +	struct perf_pmu_events_ht_attr *pmu_attr =
> +		container_of(attr, struct perf_pmu_events_ht_attr, attr);
> +
> +	/*
> +	 * Report conditional events depending on Hyper-Threading.
> +	 *
> +	 * This is overly conservative as usually the HT special
> +	 * handling is not needed if the other CPU thread is idle.
> +	 *
> +	 * Note this does not (cannot) handle the case when thread
> +	 * siblings are invisible, for example with virtualization
> +	 * if they are owned by some other guest.  The user tool
> +	 * has to re-read when a thread sibling gets onlined later.
> +	 */
> +
> +	return sprintf(page, "%s",
> +			topology_max_smt_threads() > 1 ?
> +			pmu_attr->event_str_ht :
> +			pmu_attr->event_str_noht);
> +}
> +
>  EVENT_ATTR(cpu-cycles,			CPU_CYCLES		);
>  EVENT_ATTR(instructions,		INSTRUCTIONS		);
>  EVENT_ATTR(cache-references,		CACHE_REFERENCES	);
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 8bd764df815d..ad2e870f77d9 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -668,6 +668,14 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
>  	.event_str	= str,						\
>  };
>  
> +#define EVENT_ATTR_STR_HT(_name, v, noht, ht)				\
> +static struct perf_pmu_events_ht_attr event_attr_##v = {		\
> +	.attr		= __ATTR(_name, 0444, events_ht_sysfs_show, NULL),\
> +	.id		= 0,						\
> +	.event_str_noht	= noht,						\
> +	.event_str_ht	= ht,						\
> +}
> +
>  extern struct x86_pmu x86_pmu __read_mostly;
>  
>  static inline bool x86_pmu_has_lbr_callstack(void)
> @@ -938,6 +946,12 @@ int p6_pmu_init(void);
>  
>  int knc_pmu_init(void);
>  
> +ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
> +			  char *page);
> +
> +ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
> +			  char *page);
> +
>  static inline int is_ht_workaround_enabled(void)
>  {
>  	return !!(x86_pmu.flags & PMU_FL_EXCL_ENABLED);
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 9e1c3ada91c4..b425f2d24b26 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1304,6 +1304,13 @@ struct perf_pmu_events_attr {
>  	const char *event_str;
>  };
>  
> +struct perf_pmu_events_ht_attr {
> +	struct device_attribute attr;
> +	u64 id;
> +	const char *event_str_ht;
> +	const char *event_str_noht;
> +};
> +
>  ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
>  			      char *page);
>  

NAK for the following stylistic reasons:

 - structure definition does not follow existing style.

 - silly line breaks inserted into random positions that make the code ugly.

Thanks,

	Ingo
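
For readers who want to see the selection logic of the quoted
events_ht_sysfs_show() in isolation, here is a user-space sketch. It
is an illustration only: struct ht_event_attr and ht_event_show() are
hypothetical stand-ins, and the SMT thread count is passed in as a
parameter instead of being read from topology_max_smt_threads().

```c
#include <stdio.h>

/* Hypothetical stand-in for struct perf_pmu_events_ht_attr. */
struct ht_event_attr {
	const char	*event_str_ht;
	const char	*event_str_noht;
};

/*
 * Hypothetical helper: copy the HT or the non-HT event string into
 * the output page, depending on the maximum SMT thread count.
 */
static int ht_event_show(const struct ht_event_attr *attr,
			 int max_smt_threads, char *page, size_t len)
{
	return snprintf(page, len, "%s",
			max_smt_threads > 1 ? attr->event_str_ht
					    : attr->event_str_noht);
}
```

With the topdown-total-slots strings from patch 03, a thread count of 2
yields the "any=1" variant and a thread count of 1 yields the plain
event string.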


* Re: [UPDATED PATCH 01/10] x86: Add topology_max_smt_threads()
  2016-05-06 17:24       ` [UPDATED PATCH " Andi Kleen
  2016-05-07  8:11         ` Thomas Gleixner
@ 2016-05-12  8:07         ` Ingo Molnar
  1 sibling, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2016-05-12  8:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Peter Zijlstra, acme, jolsa, linux-kernel,
	Andi Kleen, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> For SMT specific workarounds it is useful to know if SMT is active
> on any online CPU in the system. This currently requires a loop
> over all online CPUs.
>     
> Add a global variable that is updated with the maximum number
> of smt threads on any CPU on online/offline, and use it for
> topology_max_smt_threads()
>     
> The single call is easier to use than a loop.
>     
> Not exported to user space because user space already can use
> the existing sibling interfaces to find this out.
>     
> v2: Code formatting changes and use __ for variable name
> Cc: tglx@linutronix.de
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7f991bd5031b..f79181c03561 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -129,6 +129,10 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
>  
>  extern unsigned int __max_logical_packages;
>  #define topology_max_packages()			(__max_logical_packages)
> +
> +extern int __max_smt_threads;
> +#define topology_max_smt_threads()		__max_smt_threads
> +
>  int topology_update_package_map(unsigned int apicid, unsigned int cpu);
>  extern int topology_phys_to_logical_pkg(unsigned int pkg);
>  #else
> @@ -136,6 +140,7 @@ extern int topology_phys_to_logical_pkg(unsigned int pkg);
>  static inline int
>  topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
>  static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
> +#define topology_max_smt_threads()		1

Is there a good reason why this is a CPP macro instead of an inline function like 
the code above it uses?

> +/* Recompute SMT state for all CPUs on offline */

s/on offline/when a CPU gets offlined/

Thanks,

	Ingo
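
To make the idea in the quoted changelog concrete, here is a
user-space sketch of caching the maximum SMT thread count and reading
it through an inline function rather than a CPP macro. Everything in
it is hypothetical: the threads_per_core array stands in for the
kernel's sibling masks, and none of the names are the kernel's.

```c
/*
 * Hypothetical input: threads_per_core[i] is the number of online SMT
 * threads on the core that online CPU i belongs to.
 */
static int max_smt_threads(const int *threads_per_core, int nr_online)
{
	int i, max = 1;

	for (i = 0; i < nr_online; i++) {
		if (threads_per_core[i] > max)
			max = threads_per_core[i];
	}

	return max;
}

/* Cached value, recomputed only when a CPU goes online or offline. */
static int cached_max_smt_threads = 1;

/* Hypothetical hook body for CPU online/offline events. */
static void update_max_smt_threads(const int *threads_per_core,
				   int nr_online)
{
	cached_max_smt_threads = max_smt_threads(threads_per_core,
						 nr_online);
}

/* Inline accessor in place of a macro; callers need no loop. */
static inline int max_smt_threads_cached(void)
{
	return cached_max_smt_threads;
}
```

The loop runs only in the update path, so readers of the cached value
pay for a single load.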


* Re: [PATCH 03/10] x86, perf: Add Top Down events to Intel Core
  2016-05-05 23:04 ` [PATCH 03/10] x86, perf: Add Top Down events to Intel Core Andi Kleen
  2016-05-11 13:23   ` Jiri Olsa
@ 2016-05-12  8:10   ` Ingo Molnar
  1 sibling, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2016-05-12  8:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, linux-kernel, Andi Kleen


* Andi Kleen <andi@firstfloor.org> wrote:


>  Subject: Re: [PATCH 03/10] x86, perf: Add Top Down events to Intel Core

>  arch/x86/events/intel/core.c | 50 ++++++++++++++++++++++++++++++++++++++++++++

You consistently mis-spell the subject prefix of patches to the x86 perf code,
and for a large series this adds unnecessary maintainer work.

Use 'perf/x86:' for patches that affect all PMU using x86 CPUs, and 
'perf/x86/intel:' for patches that affect only Intel CPUs.

Generally a 'git log arch/x86/events/<file>.c' will tell you what pattern to use.

This applies to most of the other patches of yours in this series as well, please 
fix it - and use this consistently for future patches as well.

Thanks,

	Ingo


* Re: [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv
  2016-05-12  8:03   ` Jiri Olsa
@ 2016-05-20  0:05     ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-20  0:05 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, linux-kernel, Andi Kleen

> hi,
> we already have similar output for aggregated counters,
> could you please consider something like below to clearly
> separate them?

I think Arnaldo has already merged the patch, so he should merge
the fix too. The fix is fine with me.

-Andi

> 
> [root@ibm-x3650m4-01 perf]# ./perf stat  -e cycles -I 1000 -vv -a -C 0,1
> ...
> cycles: CPU 0: 1298783264 1000126956 1000126956
> cycles: CPU 1: 1298791660 1000134589 1000134589
> cycles: AGGR: 2597574924 2000261545 2000261545
> ...
> 
> thanks,
> jirka
> 
> 
> ---
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 7c5c50b61b28..bd0d67ebb757 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -316,7 +316,7 @@ static int read_counter(struct perf_evsel *counter)
>  
>  			if (verbose > 1) {
>  				fprintf(stat_config.output,
> -					"%s: %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
> +					"%s: CPU %d: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
>  						perf_evsel__name(counter),
>  						cpu,
>  						count->val, count->ena, count->run);
> diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
> index c1ba255f2abe..5ddeea1399ee 100644
> --- a/tools/perf/util/stat.c
> +++ b/tools/perf/util/stat.c
> @@ -345,7 +345,7 @@ int perf_stat_process_counter(struct perf_stat_config *config,
>  		update_stats(&ps->res_stats[i], count[i]);
>  
>  	if (verbose) {
> -		fprintf(config->output, "%s: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
> +		fprintf(config->output, "%s: AGGR: %" PRIu64 " %" PRIu64 " %" PRIu64 "\n",
>  			perf_evsel__name(counter), count[0], count[1], count[2]);
>  	}
>  
> 

-- 
ak@linux.intel.com -- Speaking for myself only.

* [tip:perf/urgent] perf stat: Avoid fractional digits for integer scales
  2016-05-05 23:04 ` [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales Andi Kleen
  2016-05-07 19:10   ` Jiri Olsa
@ 2016-05-20  6:42   ` tip-bot for Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: tip-bot for Andi Kleen @ 2016-05-20  6:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, tglx, hpa, jolsa, mingo, acme, ak, peterz

Commit-ID:  e3b03b6c1a4f3b4564be08809f58584592621a0a
Gitweb:     http://git.kernel.org/tip/e3b03b6c1a4f3b4564be08809f58584592621a0a
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Thu, 5 May 2016 16:04:03 -0700
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 16 May 2016 23:11:13 -0300

perf stat: Avoid fractional digits for integer scales

When the scaling factor is a full integer don't display fractional
digits. This avoids unnecessary .00 output for topdown metrics with
scale factors.

v2: Remove redundant check.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1462489447-31832-7-git-send-email-andi@firstfloor.org
[ Rename 'round' to 'stat_round' as 'round' is defined in math.h,
  included by this patch, and this breaks the build on ubuntu 12.04 ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-stat.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 5645a83..16a923c 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -66,6 +66,7 @@
 #include <stdlib.h>
 #include <sys/prctl.h>
 #include <locale.h>
+#include <math.h>
 
 #define DEFAULT_SEPARATOR	" "
 #define CNTR_NOT_SUPPORTED	"<not supported>"
@@ -986,12 +987,12 @@ static void abs_printout(int id, int nr, struct perf_evsel *evsel, double avg)
 	const char *fmt;
 
 	if (csv_output) {
-		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
+		fmt = floor(sc) != sc ?  "%.2f%s" : "%.0f%s";
 	} else {
 		if (big_num)
-			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+			fmt = floor(sc) != sc ? "%'18.2f%s" : "%'18.0f%s";
 		else
-			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+			fmt = floor(sc) != sc ? "%18.2f%s" : "%18.0f%s";
 	}
 
 	aggr_printout(evsel, id, nr);
@@ -1995,7 +1996,7 @@ static int process_stat_round_event(struct perf_tool *tool __maybe_unused,
 				    union perf_event *event,
 				    struct perf_session *session)
 {
-	struct stat_round_event *round = &event->stat_round;
+	struct stat_round_event *stat_round = &event->stat_round;
 	struct perf_evsel *counter;
 	struct timespec tsh, *ts = NULL;
 	const char **argv = session->header.env.cmdline_argv;
@@ -2004,12 +2005,12 @@ static int process_stat_round_event(struct perf_tool *tool __maybe_unused,
 	evlist__for_each(evsel_list, counter)
 		perf_stat_process_counter(&stat_config, counter);
 
-	if (round->type == PERF_STAT_ROUND_TYPE__FINAL)
-		update_stats(&walltime_nsecs_stats, round->time);
+	if (stat_round->type == PERF_STAT_ROUND_TYPE__FINAL)
+		update_stats(&walltime_nsecs_stats, stat_round->time);
 
-	if (stat_config.interval && round->time) {
-		tsh.tv_sec  = round->time / NSECS_PER_SEC;
-		tsh.tv_nsec = round->time % NSECS_PER_SEC;
+	if (stat_config.interval && stat_round->time) {
+		tsh.tv_sec  = stat_round->time / NSECS_PER_SEC;
+		tsh.tv_nsec = stat_round->time % NSECS_PER_SEC;
 		ts = &tsh;
 	}
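
The format selection above can be sketched in isolation as follows. This is
only an illustration: the helper name pick_count_format is hypothetical, it
covers just the non-CSV, non-big_num branch, and the integer test uses a cast
instead of floor() so the sketch builds without linking libm. For scale
factors that fit in a long long the two tests agree.

```c
/*
 * Hypothetical helper, not part of the patch: choose the counter format
 * from the scale factor. Integer scales get no fractional digits.
 * The cast is only valid while sc fits in a long long.
 */
const char *pick_count_format(double sc)
{
	int is_integer = (double)(long long)sc == sc;

	return is_integer ? "%18.0f%s" : "%18.2f%s";
}
```

With the old "sc != 1.0" test, an integer scale such as 2.0 would still have
selected the "%18.2f%s" format; with the integer test it selects "%18.0f%s".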
 

* Re: Add top down metrics to perf stat
  2016-05-20  9:59     ` Jiri Olsa
@ 2016-05-20 14:18       ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-20 14:18 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

> [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
> nmi_watchdog enabled with topdown. May give wrong results.
> Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
>      1.002097350                   retiring             bad speculation      frontend bound       backend bound        
>      1.002097350 S0-C0           2     38.1%                0.0%               59.2%                2.7%           
>      1.002097350 S0-C1           2     38.1%                0.1%               59.7%                2.1%           

Ah, I see now. This is --metric-only not displaying the headers. --topdown enables
--metric-only implicitly.

I'll send a separate patch for that because metric only was already
merged separately. So it's not really a problem in this patchkit,
but in a previous one.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: Add top down metrics to perf stat
  2016-05-14  1:44 Andi Kleen
  2016-05-16 12:58 ` Jiri Olsa
@ 2016-05-20 10:24 ` Jiri Olsa
  1 sibling, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-05-20 10:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> Note to reviewers: includes both tools and kernel patches.
> The kernel patches are at the beginning.
> 
> [v2: Address review feedback.
> Metrics are now always printed, but colored when crossing threshold.
> --topdown implies --metric-only.
> Various smaller fixes, see individual patches]
> [v3: Add --single-thread option and support it with HT off.
> Clean up old HT workaround.
> Improve documentation.
> Various smaller fixes, see individual patches.]
> [v4: Rebased on latest tree]
> [v5: Rebased on latest tree. Move debug messages to -vv]
> [v6: Rebased. Remove .aggr-per-core and --single-thread to not
> break old perf binaries. Put SMT enumeration into 
> generic topology API.]
> [v7: Address review comments. Change patch title headers.]

other than the missing headers and unneeded initialization
of have_frontend_stalled I'm ok with the perf tools part

thanks,
jirka

* Re: Add top down metrics to perf stat
  2016-05-19 23:51   ` Andi Kleen
@ 2016-05-20  9:59     ` Jiri Olsa
  2016-05-20 14:18       ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-20  9:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Thu, May 19, 2016 at 04:51:30PM -0700, Andi Kleen wrote:
> On Mon, May 16, 2016 at 02:58:38PM +0200, Jiri Olsa wrote:
> > On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> > 
> > SNIP
> > 
> > >     
> > > The formulas to compute the metrics are generic, they
> > > only change based on the availability on the abstracted
> > > input values.
> > >     
> > > The kernel declares the events supported by the current
> > > CPU and perf stat then computes the formulas based on the
> > > available metrics.
> > > 
> > > 
> > > Example output:
> > > 
> > > $ perf stat --topdown -I 1000 cmd
> > >      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
> > >      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
> > >      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
> 
> Hi Jiri,
> > 
> > you've lost first 3 header lines (time/core/cpus):
> > 
> > [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
> > #           time core         cpus             counts unit events
> >      1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
> >      1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
> >      1.000310344 S0-C2           2      3,764,428,422      cycles                                                      
> 
> I can't reproduce that.
> 

I can.. your latest code does not display headers: 'time' 'core' 'cpus'
also the initial '#'

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
     1.002097350                   retiring             bad speculation      frontend bound       backend bound        
     1.002097350 S0-C0           2     38.1%                0.0%               59.2%                2.7%           
     1.002097350 S0-C1           2     38.1%                0.1%               59.7%                2.1%           


thanks,
jirka

* Add top down metrics to perf stat
@ 2016-05-20  0:09 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-20  0:09 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, linux-kernel, mingo

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]
[v5: Rebased on latest tree. Move debug messages to -vv]
[v6: Rebased. Remove .aggr-per-core and --single-thread to not
break old perf binaries. Put SMT enumeration into 
generic topology API.]
[v7: Address review comments. Change patch title headers.]
[v8: Avoid -0.00 output]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should be also implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These input values then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.
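
For reference, the Level 1 formulas can be sketched in C as below. This is a
minimal sketch that assumes the commonly published TopDown Level 1
definitions. The struct and function names are made up for illustration and
are not the names used in the patches; the inputs correspond to the five
abstracted events listed above.

```c
/* Raw counts for one core over one interval (illustrative field names). */
struct topdown_counts {
	double total_slots;      /* topdown-total-slots */
	double slots_issued;     /* topdown-slots-issued */
	double slots_retired;    /* topdown-slots-retired */
	double fetch_bubbles;    /* topdown-fetch-bubbles */
	double recovery_bubbles; /* topdown-recovery-bubbles */
};

/* The four Level 1 metrics, each as a fraction of total slots. */
struct topdown_l1 {
	double frontend_bound;
	double bad_speculation;
	double retiring;
	double backend_bound;
};

/* Hypothetical helper: derive the Level 1 metrics from the raw counts. */
struct topdown_l1 topdown_level1(const struct topdown_counts *c)
{
	struct topdown_l1 m = { 0.0, 0.0, 0.0, 0.0 };

	/* Nothing was counted: report all zeros instead of dividing by zero. */
	if (c->total_slots <= 0.0)
		return m;

	m.frontend_bound = c->fetch_bubbles / c->total_slots;
	m.bad_speculation = (c->slots_issued - c->slots_retired +
			     c->recovery_bubbles) / c->total_slots;
	m.retiring = c->slots_retired / c->total_slots;
	/* Whatever is left over is attributed to the backend. */
	m.backend_bound = 1.0 - (m.frontend_bound + m.bad_speculation +
				 m.retiring);
	return m;
}
```

Because BackendBound is computed as the remainder, the four fractions sum to
one. If the inputs are not measured at the same time, BadSpeculation can come
out slightly negative.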


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading, so TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).
One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-22

* Re: Add top down metrics to perf stat
  2016-05-16 12:58 ` Jiri Olsa
@ 2016-05-19 23:51   ` Andi Kleen
  2016-05-20  9:59     ` Jiri Olsa
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-05-19 23:51 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, mingo, linux-kernel

On Mon, May 16, 2016 at 02:58:38PM +0200, Jiri Olsa wrote:
> On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:
> 
> SNIP
> 
> >     
> > The formulas to compute the metrics are generic, they
> > only change based on the availability on the abstracted
> > input values.
> >     
> > The kernel declares the events supported by the current
> > CPU and perf stat then computes the formulas based on the
> > available metrics.
> > 
> > 
> > Example output:
> > 
> > $ perf stat --topdown -I 1000 cmd
> >      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
> >      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
> >      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           

Hi Jiri,
> 
> you've lost first 3 header lines (time/core/cpus):
> 
> [jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
> #           time core         cpus             counts unit events
>      1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
>      1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
>      1.000310344 S0-C2           2      3,764,428,422      cycles                                                      

I can't reproduce that.

The headers look the same as before.

> 
> also I'm still getting -0% as I mentioned in my previous comment:

Keeping the NMI watchdog enabled can make the formulas inaccurate
because the grouping is disabled, and parts of the formulas
may be measured at different times where the execution profile
is different.

But anyways even without that it can be caused by small inaccuracies,
and then during rounding the value rounds to 0.
I can remove the - for this case.
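
To illustrate the rounding effect: a slightly negative value printed with one
fractional digit keeps its sign, so printf("%.1f", -0.01) prints "-0.0". One
way to suppress that is sketched below. The helper names are hypothetical and
this is not the code from the patch, only an illustration of the idea.

```c
#include <stdio.h>

/*
 * Hypothetical helper: values that would print as "-0.0" with one
 * fractional digit (including IEEE negative zero) are flushed to 0.0.
 * Larger negative values are left alone so real problems stay visible.
 */
double flush_negative_zero(double pct)
{
	if (pct > -0.05 && pct <= 0.0)
		return 0.0;
	return pct;
}

/* Format a percentage with one fractional digit and no "-0.0". */
int format_pct(char *buf, size_t len, double pct)
{
	return snprintf(buf, len, "%.1f%%", flush_negative_zero(pct));
}
```

The -0.05 threshold matches the single fractional digit used in the output
quoted above; a format with two digits would need -0.005 instead.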

Otherwise the data looks reasonable.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: Add top down metrics to perf stat
  2016-05-14  1:44 Andi Kleen
@ 2016-05-16 12:58 ` Jiri Olsa
  2016-05-19 23:51   ` Andi Kleen
  2016-05-20 10:24 ` Jiri Olsa
  1 sibling, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-05-16 12:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, mingo, linux-kernel

On Fri, May 13, 2016 at 06:44:49PM -0700, Andi Kleen wrote:

SNIP

>     
> The formulas to compute the metrics are generic, they
> only change based on the availability on the abstracted
> input values.
>     
> The kernel declares the events supported by the current
> CPU and perf stat then computes the formulas based on the
> available metrics.
> 
> 
> Example output:
> 
> $ perf stat --topdown -I 1000 cmd
>      1.000735655                   frontend bound       retiring             bad speculation      backend bound        
>      1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
>      1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           

you've lost first 3 header lines (time/core/cpus):

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --per-core -e cycles -I 1000 -a
#           time core         cpus             counts unit events
     1.000310344 S0-C0           2      3,764,470,414      cycles                                                      
     1.000310344 S0-C1           2      3,764,445,293      cycles                                                      
     1.000310344 S0-C2           2      3,764,428,422      cycles                                                      

also I'm still getting -0% as I mentioned in my previous comment:

[jolsa@ibm-x3650m4-01 perf]$ sudo ./perf stat --topdown -I 1000 -a
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
     1.001615409                   retiring             bad speculation      frontend bound       backend bound        
     1.001615409 S0-C0           2     38.3%                0.0%               58.4%                3.3%           
     1.001615409 S0-C1           2     38.1%               -0.0%               59.3%                2.6%           
     1.001615409 S0-C2           2     38.1%                0.0%               58.9%                2.9%           
     1.001615409 S0-C3           2     38.1%               -0.0%               58.9%                3.0%           
     1.001615409 S0-C4           2     38.0%                0.0%               59.0%                2.9%           
     1.001615409 S0-C5           2     38.1%               -0.0%               58.6%                3.3%           
     1.001615409 S1-C0           2     49.7%                1.9%               44.7%                3.7%           

thanks,
jirka

* Add top down metrics to perf stat
@ 2016-05-14  1:44 Andi Kleen
  2016-05-16 12:58 ` Jiri Olsa
  2016-05-20 10:24 ` Jiri Olsa
  0 siblings, 2 replies; 48+ messages in thread
From: Andi Kleen @ 2016-05-14  1:44 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, mingo, linux-kernel

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]
[v5: Rebased on latest tree. Move debug messages to -vv]
[v6: Rebased. Remove .aggr-per-core and --single-thread to not
break old perf binaries. Put SMT enumeration into 
generic topology API.]
[v7: Address review comments. Change patch title headers.]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should be also implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These input values then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading, so TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).
One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-21

* Add top down metrics to perf stat
@ 2016-04-27 20:00 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-04-27 20:00 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, linux-kernel

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]
[v5: Rebased on latest tree. Move debug messages to -vv]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should be also implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These input values then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading, so TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

When Hyper Threading is off this can be overridden with the --single-thread
option. When Hyper Threading is on it is enforced; the only way to
not require -a here is to offline the logical CPUs of the second
threads.

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-19

* Add top down metrics to perf stat
@ 2016-04-04 20:41 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-04-04 20:41 UTC (permalink / raw)
  To: peterz, acme; +Cc: jolsa, linux-kernel, mingo

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]
[v4: Rebased on latest tree]

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads,
due to out-of-order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then make it possible to compute four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.
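
As a rough illustration of the discovery step described above, the
following Python sketch lists which topdown-* events a kernel exposes.
It assumes the events show up as files under
/sys/bus/event_source/devices/cpu/events (the usual sysfs location for
named CPU PMU events). The helper name list_topdown_events is made up
for this example and is not part of perf.

```python
import os

# Assumed default location of named CPU PMU events in sysfs.
SYSFS_EVENTS_DIR = "/sys/bus/event_source/devices/cpu/events"


def list_topdown_events(events_dir=SYSFS_EVENTS_DIR):
    """Return the sorted topdown-* event names found in events_dir.

    Auxiliary files such as '<event>.scale' or '<event>.unit' are skipped.
    Returns an empty list if the directory cannot be read.
    """
    try:
        names = os.listdir(events_dir)
    except OSError:
        return []
    return sorted(n for n in names
                  if n.startswith("topdown-") and "." not in n)


if __name__ == "__main__":
    found = list_topdown_events()
    if found:
        print("\n".join(found))
    else:
        print("no topdown events exported by this kernel/CPU")
```

An empty result would mean the running kernel or CPU does not declare
the events, in which case the metrics cannot be computed.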


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading and TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

When Hyper Threading is off this can be overridden with the --single-thread
option. When Hyper Threading is on it is enforced; the only way to
not require -a here is to offline the logical CPUs of the second
threads.
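
To illustrate what "the logical CPUs of the second threads" means in
practice, here is a small Python sketch that picks, for every core, all
hardware threads except the lowest-numbered one. It assumes the standard
sysfs topology file
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list; the helper
names are invented for this example. It only prints the candidates.
Actually taking them offline means writing 0 to
/sys/devices/system/cpu/cpuN/online, which needs root.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,4' or '0-1' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def second_threads(siblings_by_cpu):
    """Given {cpu: thread_siblings_list text}, return the sorted CPUs
    that are not the lowest-numbered thread of their core."""
    extra = set()
    for text in siblings_by_cpu.values():
        siblings = sorted(parse_cpu_list(text))
        extra.update(siblings[1:])
    return sorted(extra)


def read_siblings(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU that exposes topology."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology",
                           "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)", path[len(root):]).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


if __name__ == "__main__":
    for cpu in second_threads(read_siblings()):
        # Offlining would be: echo 0 > /sys/devices/system/cpu/cpuN/online
        print("cpu%d is a second thread" % cpu)
```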

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting. 

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-17


* Re: Add top down metrics to perf stat
  2016-03-27 11:27 ` Jiri Olsa
@ 2016-03-27 15:22   ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-03-27 15:22 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Andi Kleen, acme, peterz, jolsa, eranian, mingo, linux-kernel

> can't see this one (-16):
> 
> [jolsa@krava perf]$ git remote update ak
> Fetching ak
> [jolsa@krava perf]$ git branch -r | grep top-down
>   ak/perf/top-down-10
>   ak/perf/top-down-11
>   ak/perf/top-down-13
>   ak/perf/top-down-2

Please try again, I pushed it again.

-Andi


* Re: Add top down metrics to perf stat
  2016-03-22 23:08 Andi Kleen
@ 2016-03-27 11:27 ` Jiri Olsa
  2016-03-27 15:22   ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-03-27 11:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, peterz, jolsa, eranian, mingo, linux-kernel

On Tue, Mar 22, 2016 at 04:08:46PM -0700, Andi Kleen wrote:

SNIP

> In this case perf stat automatically enables --per-core mode and also requires
> global mode (-a) and avoiding other filters (no cgroup mode)
> 
> When Hyper Threading is off this can be overridden with the --single-thread
> option. When Hyper Threading is on it is enforced; the only way to
> not require -a here is to offline the logical CPUs of the second
> threads.
> 
> One side effect is that this may require root rights or a
> kernel.perf_event_paranoid=-1 setting. 
> 
> Full tree available in 
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-16
> 

can't see this one (-16):

[jolsa@krava perf]$ git remote update ak
Fetching ak
[jolsa@krava perf]$ git branch -r | grep top-down
  ak/perf/top-down-10
  ak/perf/top-down-11
  ak/perf/top-down-13
  ak/perf/top-down-2

thanks,
jirka


* Add top down metrics to perf stat
@ 2016-03-22 23:08 Andi Kleen
  2016-03-27 11:27 ` Jiri Olsa
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-03-22 23:08 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, eranian, mingo, linux-kernel

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the beginning.

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads, 
due to out of order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then make it possible to compute four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading and TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

When Hyper Threading is off this can be overridden with the --single-thread
option. When Hyper Threading is on it is enforced; the only way to
not require -a here is to offline the logical CPUs of the second
threads.

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting. 

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-16


* Add top down metrics to perf stat
@ 2016-01-20  2:27 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-01-20  2:27 UTC (permalink / raw)
  To: acme; +Cc: jolsa, mingo, linux-kernel, eranian

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]
[v3: Add --single-thread option and support it with HT off.
Clean up old HT workaround.
Improve documentation.
Various smaller fixes, see individual patches.]

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the end.

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads, 
due to out of order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then make it possible to compute four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ perf stat --topdown -I 1000 cmd
     1.000735655                   frontend bound       retiring             bad speculation      backend bound        
     1.000735655 S0-C0           2    47.84%              11.69%               8.37%              32.10%           
     1.000735655 S0-C1           2    45.53%              11.39%               8.52%              34.56%           
     2.003978563 S0-C0           2    49.47%              12.22%               8.65%              29.66%           
     2.003978563 S0-C1           2    47.21%              12.98%               8.77%              31.04%           
     3.004901432 S0-C0           2    49.35%              12.26%               8.68%              29.70%           
     3.004901432 S0-C1           2    47.23%              12.67%               8.76%              31.35%           
     4.005766611 S0-C0           2    48.44%              12.14%               8.59%              30.82%           
     4.005766611 S0-C1           2    46.07%              12.41%               8.67%              32.85%           
     5.006580592 S0-C0           2    47.91%              12.08%               8.57%              31.44%           
     5.006580592 S0-C1           2    45.57%              12.27%               8.63%              33.53%           
     6.007545125 S0-C0           2    47.45%              12.02%               8.57%              31.96%           
     6.007545125 S0-C1           2    45.13%              12.17%               8.57%              34.14%           
     7.008539347 S0-C0           2    47.07%              12.03%               8.61%              32.29%           
...


 
For Level 1, TopDown computes metrics per core instead of per logical CPU
on Core CPUs. (On Atom CPUs there is no Hyper Threading and TopDown
is per thread.)

In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

When Hyper Threading is off this can be overridden with the --single-thread
option. When Hyper Threading is on it is enforced; the only way to
not require -a here is to offline the logical CPUs of the second
threads.

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting. 

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-13


* Add top down metrics to perf stat
@ 2016-01-16  1:12 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-01-16  1:12 UTC (permalink / raw)
  To: acme; +Cc: peterz, jolsa, eranian, linux-kernel, mingo

[v2: Address review feedback.
Metrics are now always printed, but colored when crossing threshold.
--topdown implies --metric-only.
Various smaller fixes, see individual patches]

Note to reviewers: includes both tools and kernel patches.
The kernel patches are at the end.

This patchkit adds support for TopDown measurements to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads, 
due to out of order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then make it possible to compute four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ perf stat --topdown -I 100 ./BC1s
     0.100576098 frontend bound           retiring                 bad speculation          backend bound            
     0.100576098     8.83%                  48.93%                  35.24%                   7.00%               
     0.200800845     8.84%                  48.49%                  35.53%                   7.13%               
     0.300905983     8.73%                  48.64%                  35.58%                   7.05%            
...


 
On Hyper Threaded CPUs, TopDown computes metrics per core instead of per logical CPU.
In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.  

On systems without Hyper Threading it can be used per process.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-11


* Add top down metrics to perf stat
@ 2015-08-08  1:06 Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2015-08-08  1:06 UTC (permalink / raw)
  To: acme; +Cc: jolsa, linux-kernel, eranian, namhyung, peterz, mingo

This patchkit adds support for TopDown to perf stat.
It applies on top of my earlier metrics patchkit, posted
separately.

TopDown is intended to replace the frontend cycles idle/
backend cycles idle metrics in standard perf stat output.
These metrics are not reliable in many workloads, 
due to out of order effects.

This implements a new --topdown mode in perf stat
(similar to --transaction) that measures the pipeline
bottlenecks using standardized formulas. The measurement
can all be done with 5 counters (one fixed counter).

The result is four metrics:
FrontendBound, BackendBound, BadSpeculation, Retiring

that describe the CPU pipeline behavior on a high level.

FrontendBound and BackendBound
BadSpeculation is a higher

The full top down methodology has many hierarchical metrics.
This implementation only supports level 1 which can be
collected without multiplexing. A full implementation
of top down on top of perf is available in pmu-tools toplev.
(http://github.com/andikleen/pmu-tools)

The current version works on Intel Core CPUs starting
with Sandy Bridge, and Atom CPUs starting with Silvermont.
In principle the generic metrics should also be implementable
on other out of order CPUs.

TopDown level 1 uses a set of abstracted metrics which
are generic to out of order CPU cores (although some
CPUs may not implement all of them):
    
topdown-total-slots       Available slots in the pipeline
topdown-slots-issued      Slots issued into the pipeline
topdown-slots-retired     Slots successfully retired
topdown-fetch-bubbles     Pipeline gaps in the frontend
topdown-recovery-bubbles  Pipeline gaps during recovery
                          from misspeculation
    
These metrics then make it possible to compute four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.
    
The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
    
The kernel declares the events supported by the current
CPU and perf stat then computes the formulas based on the
available metrics.


Example output:

$ ./perf stat --topdown -a ./BC1s 

 Performance counter stats for 'system wide':

S0-C0           2           19650790      topdown-total-slots                                           (100.00%)
S0-C0           2         4445680.00      topdown-fetch-bubbles     #    22.62% frontend bound          (100.00%)
S0-C0           2         1743552.00      topdown-slots-retired                                         (100.00%)
S0-C0           2             622954      topdown-recovery-bubbles                                      (100.00%)
S0-C0           2         2025498.00      topdown-slots-issued      #    63.90% backend bound         
S0-C1           2        16685216540      topdown-total-slots                                           (100.00%)
S0-C1           2       962557931.00      topdown-fetch-bubbles                                         (100.00%)
S0-C1           2      4175583320.00      topdown-slots-retired                                         (100.00%)
S0-C1           2         1743329246      topdown-recovery-bubbles  #    22.22% bad speculation         (100.00%)
S0-C1           2      6138901193.50      topdown-slots-issued      #    46.99% backend bound         

       1.535832673 seconds time elapsed
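
The percentages in this output can be reproduced by hand. The Python
sketch below applies the commonly published level 1 TopDown formulas to
the raw counts printed for core S0-C0. The formulas are an assumption
here, not copied from the patches themselves, and the function name is
made up for illustration.

```python
def topdown_level1(total_slots, slots_issued, slots_retired,
                   fetch_bubbles, recovery_bubbles):
    """Return the four level 1 metrics as fractions of total_slots."""
    frontend_bound = fetch_bubbles / total_slots
    retiring = slots_retired / total_slots
    # Slots issued but never retired, plus slots lost while recovering.
    bad_speculation = (slots_issued - slots_retired
                       + recovery_bubbles) / total_slots
    # Whatever is left over is attributed to the backend.
    backend_bound = 1.0 - (frontend_bound + retiring + bad_speculation)
    return {
        "frontend bound": frontend_bound,
        "retiring": retiring,
        "bad speculation": bad_speculation,
        "backend bound": backend_bound,
    }


# Raw counts for core S0-C0, taken from the output above.
s0_c0 = topdown_level1(total_slots=19650790,
                       slots_issued=2025498.0,
                       slots_retired=1743552.0,
                       fetch_bubbles=4445680.0,
                       recovery_bubbles=622954)

for name, value in s0_c0.items():
    print("%-16s %6.2f%%" % (name, 100 * value))
```

With these inputs the sketch gives 22.62% frontend bound and 63.90%
backend bound, the same values perf stat printed for S0-C0.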
 
On Hyper Threaded CPUs, TopDown computes metrics per core instead of per logical CPU.
In this case perf stat automatically enables --per-core mode and also requires
global mode (-a) and avoiding other filters (no cgroup mode).

One side effect is that this may require root rights or a
kernel.perf_event_paranoid=-1 setting.  

On systems without Hyper Threading it can be used per process.

Full tree available in 
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/top-down-2



Thread overview: 48+ messages
2016-05-05 23:03 Add top down metrics to perf stat Andi Kleen
2016-05-05 23:03 ` [PATCH 01/10] x86: Add topology_max_smt_threads() Andi Kleen
2016-05-06 10:13   ` Peter Zijlstra
2016-05-06 10:47     ` Thomas Gleixner
2016-05-06 17:24       ` [UPDATED PATCH " Andi Kleen
2016-05-07  8:11         ` Thomas Gleixner
2016-05-12  8:07         ` Ingo Molnar
2016-05-05 23:03 ` [PATCH 02/10] x86, perf: Support sysfs files depending on SMT status Andi Kleen
2016-05-09  9:42   ` Peter Zijlstra
2016-05-09 14:27     ` Andi Kleen
2016-05-09 14:34       ` Peter Zijlstra
2016-05-12  8:05   ` Ingo Molnar
2016-05-05 23:04 ` [PATCH 03/10] x86, perf: Add Top Down events to Intel Core Andi Kleen
2016-05-11 13:23   ` Jiri Olsa
2016-05-11 13:29     ` Peter Zijlstra
2016-05-12  8:10   ` Ingo Molnar
2016-05-05 23:04 ` [PATCH 04/10] x86, perf: Add Top Down events to Intel Atom Andi Kleen
2016-05-05 23:04 ` [PATCH 05/10] x86, perf: Use new topology_max_smt_threads() in HT leak workaround Andi Kleen
2016-05-05 23:04 ` [PATCH 06/10] perf, tools, stat: Avoid fractional digits for integer scales Andi Kleen
2016-05-07 19:10   ` Jiri Olsa
2016-05-07 19:24     ` Andi Kleen
2016-05-11 13:00       ` Jiri Olsa
2016-05-11 16:43         ` Arnaldo Carvalho de Melo
2016-05-20  6:42   ` [tip:perf/urgent] perf " tip-bot for Andi Kleen
2016-05-05 23:04 ` [PATCH 07/10] perf, tools, stat: Scale values by unit before metrics Andi Kleen
2016-05-07 19:14   ` Jiri Olsa
2016-05-10 20:30   ` [tip:perf/core] perf " tip-bot for Andi Kleen
2016-05-05 23:04 ` [PATCH 08/10] perf, tools, stat: Basic support for TopDown in perf stat Andi Kleen
2016-05-05 23:04 ` [PATCH 09/10] perf, tools, stat: Add computation of TopDown formulas Andi Kleen
2016-05-05 23:04 ` [PATCH 10/10] perf, tools, stat: Add extra output of counter values with -vv Andi Kleen
2016-05-12  8:03   ` Jiri Olsa
2016-05-20  0:05     ` Andi Kleen
2016-05-12  7:47 ` Add top down metrics to perf stat Jiri Olsa
  -- strict thread matches above, loose matches on Subject: below --
2016-05-20  0:09 Andi Kleen
2016-05-14  1:44 Andi Kleen
2016-05-16 12:58 ` Jiri Olsa
2016-05-19 23:51   ` Andi Kleen
2016-05-20  9:59     ` Jiri Olsa
2016-05-20 14:18       ` Andi Kleen
2016-05-20 10:24 ` Jiri Olsa
2016-04-27 20:00 Andi Kleen
2016-04-04 20:41 Andi Kleen
2016-03-22 23:08 Andi Kleen
2016-03-27 11:27 ` Jiri Olsa
2016-03-27 15:22   ` Andi Kleen
2016-01-20  2:27 Andi Kleen
2016-01-16  1:12 Andi Kleen
2015-08-08  1:06 Andi Kleen
