* perf, x86: Add parts of the remaining haswell PMU functionality
@ 2013-08-09  1:15 Andi Kleen
  2013-08-09  1:15 ` [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4 Andi Kleen
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Andi Kleen @ 2013-08-09  1:15 UTC (permalink / raw)
  To: mingo; +Cc: peterz, linux-kernel, acme, jolsa, eranian

Add some more TSX functionality to the basic Haswell PMU.

A lot of the infrastructure needed for these patches has
been merged earlier, so it is all quite straightforward
now.

- Add the checkpointed counter workaround.
(Parts of this have been already merged earlier)
- Add support for reporting PEBS transaction abort cost as weight.
This is useful to judge the cost of aborts and concentrate
on expensive ones first.
(Large parts of this have been already merged earlier,
this is just adding the final few lines to the PEBS handler)
- Add TSX event aliases, needed for perf stat -T and general
usability.
(Infrastructure also already in)
- Add perf stat -T support to give a user-friendly, high-level
counting frontend for transactions (see the example below).
This version should also be usable for POWER8 eventually.
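
For illustration, a usage sketch (the workload name is hypothetical;
the -T option itself is added in patch 4/4):

perf stat -T -- ./tsx-workload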

Not included:

Support for transaction flags and TSX LBR flags.

-Andi

* [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4
  2013-08-09  1:15 perf, x86: Add parts of the remaining haswell PMU functionality Andi Kleen
@ 2013-08-09  1:15 ` Andi Kleen
  2013-08-13 10:29   ` Peter Zijlstra
  2013-08-09  1:15 ` [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight Andi Kleen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-08-09  1:15 UTC (permalink / raw)
  To: mingo; +Cc: peterz, linux-kernel, acme, jolsa, eranian, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

With checkpointed counters there can be a situation where the counter
overflows, aborts the transaction, is set back to a non-overflowing
checkpoint, and then causes an interrupt. The interrupt handler doesn't
see the overflow because it has been checkpointed away.  This results in
a spurious PMI, typically with an ugly NMI message.  It can also lead to
excessive aborts.

Avoid this problem by:
- Using the full counter width for counting counters (earlier patch)
- Forbid sampling for checkpointed counters. It's not too useful anyway,
as checkpointing is mainly for counting. The check is approximate
(to still handle KVM), but should catch the majority of cases.
- On a PMI always set back checkpointed counters to zero.
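
As an illustration only (not part of the patch), a minimal user-space
sketch of the case the new check rejects. It assumes the in_tx/in_tx_cp
format bits map to config bits 32/33, matching the cycles-ct alias
added later in this series:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_checkpointed_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	/* cycles with in_tx=1,in_tx_cp=1 (config bits 32 and 33) */
	attr.config = 0x3c | (1ULL << 32) | (1ULL << 33);
	attr.sample_period = 0;	/* pure counting: still allowed */
	/* a small nonzero sample_period would now fail with EOPNOTSUPP */

	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}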

v2: Add unlikely. Add comment
v3: Allow large sampling periods with CP for KVM
v4: Use event_is_checkpointed. Use EOPNOTSUPP. (Stephane Eranian)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 39 ++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a45d8d4..9218025 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1134,6 +1134,11 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
 }
 
+static inline bool event_is_checkpointed(struct perf_event *event)
+{
+	return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
+}
+
 /*
  * Save and restart an expired event. Called by NMI contexts,
  * so it has to be careful about preempting normal event ops:
@@ -1141,6 +1146,17 @@ static void intel_pmu_enable_event(struct perf_event *event)
 int intel_pmu_save_and_restart(struct perf_event *event)
 {
 	x86_perf_event_update(event);
+	/*
+	 * For a checkpointed counter always reset back to 0.  This
+	 * avoids a situation where the counter overflows, aborts the
+	 * transaction and is then set back to shortly before the
+	 * overflow, and overflows and aborts again.
+	 */
+	if (unlikely(event_is_checkpointed(event))) {
+		/* No race with NMIs because the counter should not be armed */
+		wrmsrl(event->hw.event_base, 0);
+		local64_set(&event->hw.prev_count, 0);
+	}
 	return x86_perf_event_set_period(event);
 }
 
@@ -1224,6 +1240,15 @@ again:
 		x86_pmu.drain_pebs(regs);
 	}
 
+	/*
+	 * To avoid spurious interrupts with perf stat always reset checkpointed
+	 * counters.
+	 *
+	 * XXX move somewhere else.
+	 */
+	if (cpuc->events[2] && event_is_checkpointed(cpuc->events[2]))
+		status |= (1ULL << 2);
+
 	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
@@ -1689,6 +1714,20 @@ static int hsw_hw_config(struct perf_event *event)
 	      event->attr.precise_ip > 0))
 		return -EOPNOTSUPP;
 
+	if (event_is_checkpointed(event)) {
+		/*
+		 * Sampling of checkpointed events can cause situations where
+		 * the CPU constantly aborts because of a overflow, which is
+		 * then checkpointed back and ignored. Forbid checkpointing
+		 * for sampling.
+		 *
+		 * But still allow a long sampling period, so that perf stat
+		 * from KVM works.
+		 */
+		if (event->attr.sample_period > 0 &&
+		    event->attr.sample_period < 0x7fffffff)
+			return -EOPNOTSUPP;
+	}
 	return 0;
 }
 
-- 
1.8.3.1


* [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-09  1:15 perf, x86: Add parts of the remaining haswell PMU functionality Andi Kleen
  2013-08-09  1:15 ` [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4 Andi Kleen
@ 2013-08-09  1:15 ` Andi Kleen
  2013-08-13 11:23   ` Peter Zijlstra
  2013-08-09  1:15 ` [PATCH 3/4] perf, x86: Add Haswell TSX event aliases v6 Andi Kleen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-08-09  1:15 UTC (permalink / raw)
  To: mingo; +Cc: peterz, linux-kernel, acme, jolsa, eranian, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Use the existing weight reporting facility to report the transaction
abort cost, that is the number of cycles wasted in aborts.
Haswell reports this in the PEBS record.

This was in fact the original user for weight.

This is a very useful sort key to concentrate on the most
costly aborts and a good metric for TSX tuning.
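
A hedged collection sketch (assumes the tx-abort alias from patch 3/4
is present and that perf report supports the weight sort key):

perf record -W -e cpu/tx-abort/pp -a -- sleep 10
perf report --sort weight,symbol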

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3065c57..8959cc7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -838,6 +838,12 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 		x86_pmu.intel_cap.pebs_format >= 1)
 		data.addr = pebs->dla;
 
+	if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) &&
+	    !fll &&
+	    (x86_pmu.intel_cap.pebs_format >= 2) &&
+	    pebs_hsw->tsx_tuning)
+		data.weight = pebs_hsw->tsx_tuning & 0xffffffff;
+
 	if (has_branch_stack(event))
 		data.br_stack = &cpuc->lbr_stack;
 
-- 
1.8.3.1


* [PATCH 3/4] perf, x86: Add Haswell TSX event aliases v6
  2013-08-09  1:15 perf, x86: Add parts of the remaining haswell PMU functionality Andi Kleen
  2013-08-09  1:15 ` [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4 Andi Kleen
  2013-08-09  1:15 ` [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight Andi Kleen
@ 2013-08-09  1:15 ` Andi Kleen
  2013-08-09  1:15 ` [PATCH 4/4] perf, tools: Add perf stat --transaction v3 Andi Kleen
  2013-09-02  6:55 ` perf, x86: Add parts of the remaining haswell PMU functionality Ingo Molnar
  4 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2013-08-09  1:15 UTC (permalink / raw)
  To: mingo; +Cc: peterz, linux-kernel, acme, jolsa, eranian, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add TSX event aliases, and export them from the kernel to perf.

These are used by perf stat -T and to allow
more user-friendly access to events. The events are designed to
be fairly generic and may also apply to other architectures
implementing HTM.  They all cover common situations that
happen during tuning of transactional code.

For Haswell we have to separate the HLE and RTM events,
as they are separate in the PMU.

This adds the following events.

tx-start	Count transaction starts (used by perf stat -T)
tx-commit	Count transaction commits
tx-abort	Count all aborts
tx-conflict	Count aborts due to a conflict with another CPU
tx-capacity	Count capacity aborts (transaction too large)

Plus matching el-* events for HLE.

cycles-t	Transactional cycles (used by perf stat -T)
* also exists on POWER8
cycles-ct	Transactional cycles committed (used by perf stat -T)
* according to Michael Ellerman POWER8 has a cycles-transactional-committed;
* perf stat -T handles both cases

Note that for useful abort profiling, precise often has to be set,
as Haswell can only report the point inside the transaction
with precise=2.

(I had another patchkit to allow exporting precise too, but Vince
Weaver pointed out it violates the ABI, so it's dropped for now.)

For some classes of aborts, like conflicts, this is not needed,
as it makes more sense to look at the complete critical section.

This gives a clean set of generalized events to examine transaction
success and aborts. Haswell has additional events for TSX, but those are more
specialized for very specific situations.
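
A usage sketch with the new aliases (same cpu// syntax that perf stat
-T relies on in patch 4/4):

perf stat -e cpu/tx-start/,cpu/tx-commit/,cpu/tx-abort/ -a sleep 10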

v2: Move to new sysfs infrastructure
v3: Use own sysfs functions now
v4: Add tx/el-abort-return for better conflict sampling
v5: Different white space.
v6: Cut down events, rewrite description.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9218025..ca7f02c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2076,7 +2076,34 @@ static __init void intel_nehalem_quirk(void)
 EVENT_ATTR_STR(mem-loads,      mem_ld_hsw,     "event=0xcd,umask=0x1,ldlat=3");
 EVENT_ATTR_STR(mem-stores,     mem_st_hsw,     "event=0xd0,umask=0x82")
 
+/* Haswell special events */
+EVENT_ATTR_STR(tx-start,        tx_start,       "event=0xc9,umask=0x1");
+EVENT_ATTR_STR(tx-commit,       tx_commit,      "event=0xc9,umask=0x2");
+EVENT_ATTR_STR(tx-abort,        tx_abort,	"event=0xc9,umask=0x4");
+EVENT_ATTR_STR(tx-capacity,     tx_capacity,	"event=0x54,umask=0x2");
+EVENT_ATTR_STR(tx-conflict,     tx_conflict,	"event=0x54,umask=0x1");
+EVENT_ATTR_STR(el-start,        el_start,       "event=0xc8,umask=0x1");
+EVENT_ATTR_STR(el-commit,       el_commit,      "event=0xc8,umask=0x2");
+EVENT_ATTR_STR(el-abort,        el_abort,	"event=0xc8,umask=0x4");
+EVENT_ATTR_STR(el-capacity,     el_capacity,    "event=0x54,umask=0x2");
+EVENT_ATTR_STR(el-conflict,     el_conflict,    "event=0x54,umask=0x1");
+EVENT_ATTR_STR(cycles-t,        cycles_t,       "event=0x3c,in_tx=1");
+EVENT_ATTR_STR(cycles-ct,       cycles_ct,
+					"event=0x3c,in_tx=1,in_tx_cp=1");
+
 static struct attribute *hsw_events_attrs[] = {
+	EVENT_PTR(tx_start),
+	EVENT_PTR(tx_commit),
+	EVENT_PTR(tx_abort),
+	EVENT_PTR(tx_capacity),
+	EVENT_PTR(tx_conflict),
+	EVENT_PTR(el_start),
+	EVENT_PTR(el_commit),
+	EVENT_PTR(el_abort),
+	EVENT_PTR(el_capacity),
+	EVENT_PTR(el_conflict),
+	EVENT_PTR(cycles_t),
+	EVENT_PTR(cycles_ct),
 	EVENT_PTR(mem_ld_hsw),
 	EVENT_PTR(mem_st_hsw),
 	NULL
-- 
1.8.3.1


* [PATCH 4/4] perf, tools: Add perf stat --transaction v3
  2013-08-09  1:15 perf, x86: Add parts of the remaining haswell PMU functionality Andi Kleen
                   ` (2 preceding siblings ...)
  2013-08-09  1:15 ` [PATCH 3/4] perf, x86: Add Haswell TSX event aliases v6 Andi Kleen
@ 2013-08-09  1:15 ` Andi Kleen
  2013-09-02  6:55 ` perf, x86: Add parts of the remaining haswell PMU functionality Ingo Molnar
  4 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2013-08-09  1:15 UTC (permalink / raw)
  To: mingo; +Cc: peterz, linux-kernel, acme, jolsa, eranian, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add support to perf stat to print the basic transactional execution statistics:
total cycles, cycles in transactions, and cycles in aborted transactions
(using the in_tx and in_tx_checkpoint qualifiers), plus transaction starts
and elision starts, to compute the average transaction length.

This gives a reasonable overview of the success of the transactions.

Enable with a new --transaction / -T option.

This requires measuring these events in a group, since they depend on each
other.

This is implemented using the TM sysfs events exported by the kernel.
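
The -T option is then roughly equivalent to this manual group (taken
from transaction_attrs below; the limited variant drops the el-start
and cycles-ct members):

perf stat -e task-clock -e '{instructions,cycles,cpu/cycles-t/,cpu/tx-start/,cpu/el-start/,cpu/cycles-ct/}' -a sleep 10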

v2: Only print the extended statistics when the option is enabled.
This avoids negative output when the user specifies the -T events
in separate groups.
v3: Port to latest tree
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |   5 ++
 tools/perf/builtin-stat.c              | 132 ++++++++++++++++++++++++++++++++-
 tools/perf/util/evsel.h                |   6 ++
 tools/perf/util/pmu.c                  |  16 ++++
 tools/perf/util/pmu.h                  |   1 +
 5 files changed, 157 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 2fe87fb..40bc65a 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -132,6 +132,11 @@ is a useful mode to detect imbalance between physical cores.  To enable this mod
 use --per-core in addition to -a. (system-wide).  The output includes the
 core number and the number of online logical processors on that physical processor.
 
+-T::
+--transaction::
+
+Print statistics of transactional execution if supported.
+
 EXAMPLES
 --------
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 352fbd7..d68bf93 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -46,6 +46,7 @@
 #include "util/util.h"
 #include "util/parse-options.h"
 #include "util/parse-events.h"
+#include "util/pmu.h"
 #include "util/event.h"
 #include "util/evlist.h"
 #include "util/evsel.h"
@@ -70,6 +71,41 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix);
 static void print_counter(struct perf_evsel *counter, char *prefix);
 static void print_aggr(char *prefix);
 
+/* Default events used for perf stat -T */
+static const char * const transaction_attrs[] = {
+	"task-clock",
+	"{"
+	"instructions,"
+	"cycles,"
+	"cpu/cycles-t/,"
+	"cpu/tx-start/,"
+	"cpu/el-start/,"
+	"cpu/cycles-ct/"
+	"}"
+};
+
+/* More limited version when the CPU does not have all events. */
+static const char * const transaction_limited_attrs[] = {
+	"task-clock",
+	"{"
+	"instructions,"
+	"cycles,"
+	"cpu/cycles-t/,"
+	"cpu/tx-start/"
+	"}"
+};
+
+/* must match the transaction_attrs above */
+enum {
+	T_TASK_CLOCK,
+	T_INSTRUCTIONS,
+	T_CYCLES,
+	T_CYCLES_IN_TX,
+	T_TRANSACTION_START,
+	T_ELISION_START,
+	T_CYCLES_IN_TX_CP,
+};
+
 static struct perf_evlist	*evsel_list;
 
 static struct perf_target	target = {
@@ -90,6 +126,7 @@ static enum aggr_mode		aggr_mode			= AGGR_GLOBAL;
 static volatile pid_t		child_pid			= -1;
 static bool			null_run			=  false;
 static int			detailed_run			=  0;
+static bool			transaction_run;
 static bool			big_num				=  true;
 static int			big_num_opt			=  -1;
 static const char		*csv_sep			= NULL;
@@ -213,7 +250,10 @@ static struct stats runtime_l1_icache_stats[MAX_NR_CPUS];
 static struct stats runtime_ll_cache_stats[MAX_NR_CPUS];
 static struct stats runtime_itlb_cache_stats[MAX_NR_CPUS];
 static struct stats runtime_dtlb_cache_stats[MAX_NR_CPUS];
+static struct stats runtime_cycles_in_tx_stats[MAX_NR_CPUS];
 static struct stats walltime_nsecs_stats;
+static struct stats runtime_transaction_stats[MAX_NR_CPUS];
+static struct stats runtime_elision_stats[MAX_NR_CPUS];
 
 static void perf_stat__reset_stats(struct perf_evlist *evlist)
 {
@@ -235,6 +275,11 @@ static void perf_stat__reset_stats(struct perf_evlist *evlist)
 	memset(runtime_ll_cache_stats, 0, sizeof(runtime_ll_cache_stats));
 	memset(runtime_itlb_cache_stats, 0, sizeof(runtime_itlb_cache_stats));
 	memset(runtime_dtlb_cache_stats, 0, sizeof(runtime_dtlb_cache_stats));
+	memset(runtime_cycles_in_tx_stats, 0,
+			sizeof(runtime_cycles_in_tx_stats));
+	memset(runtime_transaction_stats, 0,
+		sizeof(runtime_transaction_stats));
+	memset(runtime_elision_stats, 0, sizeof(runtime_elision_stats));
 	memset(&walltime_nsecs_stats, 0, sizeof(walltime_nsecs_stats));
 }
 
@@ -272,6 +317,18 @@ static inline int nsec_counter(struct perf_evsel *evsel)
 	return 0;
 }
 
+static struct perf_evsel *nth_evsel(int n)
+{
+	struct perf_evsel *ev;
+	int j;
+
+	j = 0;
+	list_for_each_entry(ev, &evsel_list->entries, node)
+		if (j++ == n)
+			return ev;
+	return NULL;
+}
+
 /*
  * Update various tracking values we maintain to print
  * more semantic information such as miss/hit ratios,
@@ -283,8 +340,12 @@ static void update_shadow_stats(struct perf_evsel *counter, u64 *count)
 		update_stats(&runtime_nsecs_stats[0], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_CPU_CYCLES))
 		update_stats(&runtime_cycles_stats[0], count[0]);
-	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
-		update_stats(&runtime_stalled_cycles_front_stats[0], count[0]);
+	else if (perf_evsel__cmp(counter, nth_evsel(T_CYCLES_IN_TX)))
+		update_stats(&runtime_cycles_in_tx_stats[0], count[0]);
+	else if (perf_evsel__cmp(counter, nth_evsel(T_TRANSACTION_START)))
+		update_stats(&runtime_transaction_stats[0], count[0]);
+	else if (perf_evsel__cmp(counter, nth_evsel(T_ELISION_START)))
+		update_stats(&runtime_elision_stats[0], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_BACKEND))
 		update_stats(&runtime_stalled_cycles_back_stats[0], count[0]);
 	else if (perf_evsel__match(counter, HARDWARE, HW_BRANCH_INSTRUCTIONS))
@@ -807,7 +868,7 @@ static void print_ll_cache_misses(int cpu,
 
 static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 {
-	double total, ratio = 0.0;
+	double total, ratio = 0.0, total2;
 	const char *fmt;
 
 	if (csv_output)
@@ -903,6 +964,43 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 			ratio = 1.0 * avg / total;
 
 		fprintf(output, " # %8.3f GHz                    ", ratio);
+	} else if (perf_evsel__cmp(evsel, nth_evsel(T_CYCLES_IN_TX)) &&
+		   transaction_run) {
+		total = avg_stats(&runtime_cycles_stats[cpu]);
+		if (total)
+			fprintf(output,
+				" #   %5.2f%% transactional cycles   ",
+				100.0 * (avg / total));
+	} else if (perf_evsel__cmp(evsel, nth_evsel(T_CYCLES_IN_TX_CP)) &&
+		   transaction_run) {
+		total = avg_stats(&runtime_cycles_stats[cpu]);
+		total2 = avg_stats(&runtime_cycles_in_tx_stats[cpu]);
+		if (total2 < avg)
+			total2 = avg;
+		if (total)
+			fprintf(output,
+				" #   %5.2f%% aborted cycles         ",
+				100.0 * ((total2-avg) / total));
+	} else if (perf_evsel__cmp(evsel, nth_evsel(T_TRANSACTION_START)) &&
+		   avg > 0 &&
+		   runtime_cycles_in_tx_stats[cpu].n != 0 &&
+		   transaction_run) {
+		total = avg_stats(&runtime_cycles_in_tx_stats[cpu]);
+
+		if (total)
+			ratio = total / avg;
+
+		fprintf(output, " # %8.0f cycles / transaction   ", ratio);
+	} else if (perf_evsel__cmp(evsel, nth_evsel(T_ELISION_START)) &&
+		   avg > 0 &&
+		   runtime_cycles_in_tx_stats[cpu].n != 0 &&
+		   transaction_run) {
+		total = avg_stats(&runtime_cycles_in_tx_stats[cpu]);
+
+		if (total)
+			ratio = total / avg;
+
+		fprintf(output, " # %8.0f cycles / elision       ", ratio);
 	} else if (runtime_nsecs_stats[cpu].n != 0) {
 		char unit = 'M';
 
@@ -1216,6 +1314,16 @@ static int perf_stat_init_aggr_mode(void)
 	return 0;
 }
 
+static int setup_events(const char * const *attrs, unsigned len)
+{
+	unsigned i;
+
+	for (i = 0; i < len; i++) {
+		if (parse_events(evsel_list, attrs[i]))
+			return -1;
+	}
+	return 0;
+}
 
 /*
  * Add default attributes, if there were no attributes specified or
@@ -1334,6 +1442,22 @@ static int add_default_attributes(void)
 	if (null_run)
 		return 0;
 
+	if (transaction_run) {
+		int err;
+		if (pmu_have_event("cpu", "cycles-ct") &&
+		    pmu_have_event("cpu", "el-start"))
+			err = setup_events(transaction_attrs,
+					ARRAY_SIZE(transaction_attrs));
+		else
+				err = setup_events(transaction_limited_attrs,
+				 ARRAY_SIZE(transaction_limited_attrs));
+		if (err < 0) {
+			fprintf(stderr, "Cannot set up transaction events\n");
+			return -1;
+		}
+		return 0;
+	}
+
 	if (!evsel_list->nr_entries) {
 		if (perf_evlist__add_default_attrs(evsel_list, default_attrs) < 0)
 			return -1;
@@ -1419,6 +1543,8 @@ int cmd_stat(int argc, const char **argv, const char *prefix __maybe_unused)
 		     "aggregate counts per processor socket", AGGR_SOCKET),
 	OPT_SET_UINT(0, "per-core", &aggr_mode,
 		     "aggregate counts per physical processor core", AGGR_CORE),
+	OPT_BOOLEAN('T', "transaction", &transaction_run,
+		    "hardware transaction statistics"),
 	OPT_END()
 	};
 	const char * const stat_usage[] = {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 3f156cc..2f3dc86 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -180,6 +180,12 @@ static inline bool perf_evsel__match2(struct perf_evsel *e1,
 	       (e1->attr.config == e2->attr.config);
 }
 
+#define perf_evsel__cmp(a, b)			\
+	((a) &&					\
+	 (b) &&					\
+	 (a)->attr.type == (b)->attr.type &&	\
+	 (a)->attr.config == (b)->attr.config)
+
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
 			      int cpu, int thread, bool scale);
 
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index bc9d806..64362fe 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -637,3 +637,19 @@ void print_pmu_events(const char *event_glob, bool name_only)
 		printf("\n");
 	free(aliases);
 }
+
+bool pmu_have_event(const char *pname, const char *name)
+{
+	struct perf_pmu *pmu;
+	struct perf_pmu_alias *alias;
+
+	pmu = NULL;
+	while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+		if (strcmp(pname, pmu->name))
+			continue;
+		list_for_each_entry(alias, &pmu->aliases, list)
+			if (!strcmp(alias->name, name))
+				return true;
+	}
+	return false;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 6b2cbe2..1179b26 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -42,6 +42,7 @@ int perf_pmu__format_parse(char *dir, struct list_head *head);
 struct perf_pmu *perf_pmu__scan(struct perf_pmu *pmu);
 
 void print_pmu_events(const char *event_glob, bool name_only);
+bool pmu_have_event(const char *pname, const char *name);
 
 int perf_pmu__test(void);
 #endif /* __PMU_H */
-- 
1.8.3.1


* Re: [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4
  2013-08-09  1:15 ` [PATCH 1/4] perf, x86: Avoid checkpointed counters causing excessive TSX aborts v4 Andi Kleen
@ 2013-08-13 10:29   ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2013-08-13 10:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

On Thu, Aug 08, 2013 at 06:15:43PM -0700, Andi Kleen wrote:
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c

> @@ -1141,6 +1146,17 @@ static void intel_pmu_enable_event(struct perf_event *event)
>  int intel_pmu_save_and_restart(struct perf_event *event)
>  {
>  	x86_perf_event_update(event);
> +	/*
> +	 * For a checkpointed counter always reset back to 0.  This
> +	 * avoids a situation where the counter overflows, aborts the
> +	 * transaction and is then set back to shortly before the
> +	 * overflow, and overflows and aborts again.
> +	 */
> +	if (unlikely(event_is_checkpointed(event))) {
> +		/* No race with NMIs because the counter should not be armed */
> +		wrmsrl(event->hw.event_base, 0);
> +		local64_set(&event->hw.prev_count, 0);
> +	}

Right, if it wasn't for KVM you could've done a smaller special case
handler for checkpointed events, but as it stands I suppose it makes
sense to use the normal paths.

>  	return x86_perf_event_set_period(event);
>  }
>  
> @@ -1224,6 +1240,15 @@ again:
>  		x86_pmu.drain_pebs(regs);
>  	}
>  
> +	/*
> +	 * To avoid spurious interrupts with perf stat always reset checkpointed
> +	 * counters.
> +	 *
> +	 * XXX move somewhere else.

Like where? Afaict it needs to be here. You could write it prettier, I
suppose, and I guess we'll eventually need to assume all events can be
checkpointed, but I don't see how it could be done elsewhere.

> +	 */
> +	if (cpuc->events[2] && event_is_checkpointed(cpuc->events[2]))
> +		status |= (1ULL << 2);
> +
>  	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
>  		struct perf_event *event = cpuc->events[bit];
>  



* Re: [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-09  1:15 ` [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight Andi Kleen
@ 2013-08-13 11:23   ` Peter Zijlstra
  2013-08-13 14:35     ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2013-08-13 11:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

On Thu, Aug 08, 2013 at 06:15:44PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Use the existing weight reporting facility to report the transaction
> abort cost, that is the number of cycles wasted in aborts.
> Haswell reports this in the PEBS record.
> 
> This was in fact the original user for weight.
> 
> This is a very useful sort key to concentrate on the most
> costly aborts and a good metric for TSX tuning.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 3065c57..8959cc7 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -838,6 +838,12 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>  		x86_pmu.intel_cap.pebs_format >= 1)
>  		data.addr = pebs->dla;
>  
> +	if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) &&
> +	    !fll &&
> +	    (x86_pmu.intel_cap.pebs_format >= 2) &&
> +	    pebs_hsw->tsx_tuning)
> +		data.weight = pebs_hsw->tsx_tuning & 0xffffffff;
> +
>  	if (has_branch_stack(event))
>  		data.br_stack = &cpuc->lbr_stack;


How about something like the below instead? I didn't copy the !fll test
because I couldn't find why that was. Section 18.10.5.1 (Aug 2012)
doesn't mention anything like that and I figure the reason bits would be
0 when the thing isn't appropriate.

---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 63 +++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3065c57..52cb1fa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -178,20 +178,15 @@ struct pebs_record_nhm {
 	u64 status, dla, dse, lat;
 };
 
-/*
- * Same as pebs_record_nhm, with two additional fields.
- */
 struct pebs_record_hsw {
-	struct pebs_record_nhm nhm;
-	/*
-	 * Real IP of the event. In the Intel documentation this
-	 * is called eventingrip.
-	 */
-	u64 real_ip;
-	/*
-	 * TSX tuning information field: abort cycles and abort flags.
-	 */
-	u64 tsx_tuning;
+	u64 flags, ip;
+	u64 ax, bx, cx, dx;
+	u64 si, di, bp, sp;
+	u64 r8,  r9,  r10, r11;
+	u64 r12, r13, r14, r15;
+	u64 status, dla, dse, lat;
+	u64 real_ip; /* the actual eventing ip */
+	u64 tsx_tuning; /* TSX abort cycles and flags */
 };
 
 void init_debug_store_on_cpu(int cpu)
@@ -759,16 +754,41 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	return 0;
 }
 
+union hsw_tsx_tuning {
+	struct {
+		u64 cycles_last_block : 32,
+		    hle_abort         : 1,
+		    rtm_abort         : 1,
+		    ins_abort         : 1,
+		    non_ins_abort     : 1,
+		    retry             : 1,
+		    mem_data_conflict : 1,
+		    capacity          : 1;
+	} bits;
+	u64 value;
+};
+
+static inline u64 intel_hsw_weight(struct pebs_record_hsw *pebs)
+{
+	u64 weight = 0;
+
+	if (pebs->tsx_tuning) {
+		union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning };
+		weight = tsx.bits.cycles_last_block;
+	}
+
+	return weight;
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs, void *__pebs)
 {
 	/*
-	 * We cast to pebs_record_nhm to get the load latency data
-	 * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
+	 * We cast to the biggest PEBS record and are careful not
+	 * to access out-of-bounds members.
 	 */
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct pebs_record_nhm *pebs = __pebs;
-	struct pebs_record_hsw *pebs_hsw = __pebs;
+	struct pebs_record_hsw *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	u64 sample_type;
@@ -826,8 +846,9 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	regs.bp = pebs->bp;
 	regs.sp = pebs->sp;
 
+
 	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
-		regs.ip = pebs_hsw->real_ip;
+		regs.ip = pebs->real_ip;
 		regs.flags |= PERF_EFLAGS_EXACT;
 	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
 		regs.flags |= PERF_EFLAGS_EXACT;
@@ -835,9 +856,13 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 
 	if ((event->attr.sample_type & PERF_SAMPLE_ADDR) &&
-		x86_pmu.intel_cap.pebs_format >= 1)
+	    x86_pmu.intel_cap.pebs_format >= 1)
 		data.addr = pebs->dla;
 
+	if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) &&
+	    x86_pmu.intel_cap.pebs_format >= 2)
+		data.weight = intel_hsw_weight(pebs);
+
 	if (has_branch_stack(event))
 		data.br_stack = &cpuc->lbr_stack;
 

* Re: [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-13 11:23   ` Peter Zijlstra
@ 2013-08-13 14:35     ` Andi Kleen
  2013-08-13 15:27       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-08-13 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

> How about something like the below instead? I didn't copy the !fll test
> because I couldn't find why that was. Section 18.10.5.1 (Aug 2012)

!fll is so that if a memory weight is requested we don't overwrite it.

>  	u64 status, dla, dse, lat;
>  };
>  
> -/*
> - * Same as pebs_record_nhm, with two additional fields.
> - */
>  struct pebs_record_hsw {
> -	struct pebs_record_nhm nhm;
> -	/*
> -	 * Real IP of the event. In the Intel documentation this
> -	 * is called eventingrip.
> -	 */
> -	u64 real_ip;
> -	/*
> -	 * TSX tuning information field: abort cycles and abort flags.
> -	 */
> -	u64 tsx_tuning;
> +	u64 flags, ip;
> +	u64 ax, bx, cx, dx;
> +	u64 si, di, bp, sp;
> +	u64 r8,  r9,  r10, r11;
> +	u64 r12, r13, r14, r15;
> +	u64 status, dla, dse, lat;

Seems like an unrelated change.

> +	u64 real_ip; /* the actual eventing ip */
> +	u64 tsx_tuning; /* TSX abort cycles and flags */
>  };
>  
>  void init_debug_store_on_cpu(int cpu)
> @@ -759,16 +754,41 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
>  	return 0;
>  }
>  
> +union hsw_tsx_tuning {
> +	struct {
> +		u64 cycles_last_block : 32,
> +		    hle_abort         : 1,
> +		    rtm_abort         : 1,
> +		    ins_abort         : 1,
> +		    non_ins_abort     : 1,
> +		    retry             : 1,
> +		    mem_data_conflict : 1,
> +		    capacity          : 1;

I think you used an old SDM for this, there were some changes in the
latest.

This would break my next patch which copies the abort bits into 
a new field (well, it would need a union at least).

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hsw/pmu7&id=a88a029a6b3cb95148452584c93cbb4004f77f28

Other than that it seems ok and would likely generate the same
code as mine. I prefer mine as it's simpler (I don't think there
is anything in the kernel that needs to look at the individual bits,
they should be just reported together)

-Andi


* Re: [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-13 14:35     ` Andi Kleen
@ 2013-08-13 15:27       ` Peter Zijlstra
  2013-08-13 18:25         ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2013-08-13 15:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

On Tue, Aug 13, 2013 at 04:35:17PM +0200, Andi Kleen wrote:
> > How about something like the below instead? I didn't copy the !fll test
> > because I couldn't find why that was. Section 18.10.5.1 (Aug 2012)
> 
> !fll is so that if a memory weight is requested we don't overwrite it.

Oh, hum.. it would be good to document that collision. Neither your
changelog nor your patch clarified this. So fail on you.

> >  	u64 status, dla, dse, lat;
> >  };
> >  
> > -/*
> > - * Same as pebs_record_nhm, with two additional fields.
> > - */
> >  struct pebs_record_hsw {
> > -	struct pebs_record_nhm nhm;
> > -	/*
> > -	 * Real IP of the event. In the Intel documentation this
> > -	 * is called eventingrip.
> > -	 */
> > -	u64 real_ip;
> > -	/*
> > -	 * TSX tuning information field: abort cycles and abort flags.
> > -	 */
> > -	u64 tsx_tuning;
> > +	u64 flags, ip;
> > +	u64 ax, bx, cx, dx;
> > +	u64 si, di, bp, sp;
> > +	u64 r8,  r9,  r10, r11;
> > +	u64 r12, r13, r14, r15;
> > +	u64 status, dla, dse, lat;
> 
> Seems like an unrelated change.

Sorta, it gets rid of pebs_hsw. That should've never been introduced.

> > +	u64 real_ip; /* the actual eventing ip */
> > +	u64 tsx_tuning; /* TSX abort cycles and flags */
> >  };
> >  
> >  void init_debug_store_on_cpu(int cpu)
> > @@ -759,16 +754,41 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
> >  	return 0;
> >  }
> >  
> > +union hsw_tsx_tuning {
> > +	struct {
> > +		u64 cycles_last_block : 32,
> > +		    hle_abort         : 1,
> > +		    rtm_abort         : 1,
> > +		    ins_abort         : 1,
> > +		    non_ins_abort     : 1,
> > +		    retry             : 1,
> > +		    mem_data_conflict : 1,
> > +		    capacity          : 1;
> 
> I think you used an old SDM for this, there were some changes in the
> latest.

Like I said, Aug 2012. If only Intel would properly announce new SDMs,
and with better names than 325462.pdf.

Let me go fetch a new one.

> This would break my next patch which copies the abort bits into 
> a new field (well, it would need a union at least).
> 
> https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/commit/?h=hsw/pmu7&id=a88a029a6b3cb95148452584c93cbb4004f77f28

Make it a bigger mess: :-)

struct hsw_tsx_abort_info {
	union {
		u64 value;
		struct {
			u32 cycles_last_tx;
			union {
				u32 abort_reason : 8;
				struct {
					u32 hle_abort : 1,
					    rtm_abort : 1,
					    ins_abort : 1,
					    non_ins_abort : 1,
					    retry : 1,
					    data_conflict : 1,
					    capacity_writes : 1,
					    capacity_reads : 1;
				};
			};
		};
	};
};

Also, I think your patch is 'broken' in that it dumps the reserved bits
out to userspace and this brand spanking new SDM doesn't say they're 0.

> Other than that it seems ok and would likely generate the same
> code as mine. I prefer mine as it's simpler (I don't think there
> is anything in the kernel that needs to look at the individual bits,
> they should be just reported together)

The sole reason I even bothered poking at this was that your 'simple'
patch was outright ugly. The moment you need to struggle with that many
line breaks for a conditional you just know you've failed.

__intel_pmu_pebs_event() isn't getting any prettier with all those
pebs_format tests; but I'm not seeing anything to really fix that.

* Re: [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-13 15:27       ` Peter Zijlstra
@ 2013-08-13 18:25         ` Andi Kleen
  2013-08-14  9:33           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-08-13 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

> Make it a bigger mess: :-)

Ok.  The second union alone is enough; it only needs the flags,
not the cycles.

> 
> struct hsw_tsx_abort_info {
> 	union {
> 		u64 value;
> 		struct {
> 			u32 cycles_last_tx;
> 			union {
> };
> 
> Also, I think your patch is 'broken' in that it dumps the reserved bits
> out to userspace and this brand spanking new SDM doesn't say they're 0.

Will fix.

> __intel_pmu_pebs_event() isn't getting any prettier with all those
> pebs_format tests; but I'm not seeing anything to really fix that.

Ok. Are you merging your patch with these changes (fll, union) 
or should I send a new one?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [PATCH 2/4] perf, x86: Report TSX transaction abort cost as weight
  2013-08-13 18:25         ` Andi Kleen
@ 2013-08-14  9:33           ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2013-08-14  9:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel, acme, jolsa, eranian, Andi Kleen

On Tue, Aug 13, 2013 at 08:25:18PM +0200, Andi Kleen wrote:
> > Make it a bigger mess: :-)
> 
> Ok.  Only the second union is enough, it only needs the flags,
> not the cycles.
> 
> > 
> > struct hsw_tsx_abort_info {
> > 	union {
> > 		u64 value;
> > 		struct {
> > 			u32 cycles_last_tx;
> > 			union {
> > };
> > 
> > Also, I think your patch is 'broken' in that it dumps the reserved bits
> > out to userspace and this brand spanking new SDM doesn't say they're 0.
> 
> Will fix.
> 
> > __intel_pmu_pebs_event() isn't getting any prettier with all those
> > pebs_format tests; but I'm not seeing anything to really fix that.
> 
> Ok. Are you merging your patch with these changes (fll, union) 
> or should I send a new one?

Please send a new one that is actually tested. Mine didn't see a
compiler up close nor do I have the hardware to verify it actually does
something sane.

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-08-09  1:15 perf, x86: Add parts of the remaining haswell PMU functionality Andi Kleen
                   ` (3 preceding siblings ...)
  2013-08-09  1:15 ` [PATCH 4/4] perf, tools: Add perf stat --transaction v3 Andi Kleen
@ 2013-09-02  6:55 ` Ingo Molnar
  2013-09-05 13:15   ` Ingo Molnar
  4 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2013-09-02  6:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, linux-kernel, acme, jolsa, eranian


One thing I'm not seeing in the current Haswell code is the config set up 
for PERF_COUNT_HW_STALLED_CYCLES_FRONTEND/BACKEND. Both SB and IB have them
configured.

Thanks,

	Ingo

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-09-02  6:55 ` perf, x86: Add parts of the remaining haswell PMU functionality Ingo Molnar
@ 2013-09-05 13:15   ` Ingo Molnar
  2013-09-05 15:10     ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2013-09-05 13:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, linux-kernel, acme, jolsa, eranian


* Ingo Molnar <mingo@kernel.org> wrote:

> One thing I'm not seeing in the current Haswell code is the config set 
> up for PERF_COUNT_HW_STALLED_CYCLES_FRONTEND/BACKEND. Both SB and IB have 
> them configured.

Ping? Consider this a regression report.

Thanks,

	Ingo

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-09-05 13:15   ` Ingo Molnar
@ 2013-09-05 15:10     ` Andi Kleen
  2013-09-05 17:04       ` Ingo Molnar
  2013-09-05 17:12       ` Ingo Molnar
  0 siblings, 2 replies; 17+ messages in thread
From: Andi Kleen @ 2013-09-05 15:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, peterz, linux-kernel, acme, jolsa, eranian

On Thu, Sep 05, 2013 at 03:15:02PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > One thing I'm not seeing in the current Haswell code is the config set 
> > up for PERF_COUNT_HW_STALLED_CYCLES_FRONTEND/BACKEND. Both SB and IB have 
> > them configured.
> 
> Ping? Consider this a regression report.

AFAIK they don't work. You only get the correct answer
in some situations, but in others they either overestimate
frontend or underestimate backend stalls badly.

The correct way is to implement it like TopDown level 1,
but I don't know how to put that into the kernel.

http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues

It requires running 4 counters and computing some equations.

My toplev tool in http://github.com/andikleen/pmu-tools
has an implementation on top of perf.

I could put it into perf stat if you want, but it would
be somewhat Intel specific.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-09-05 15:10     ` Andi Kleen
@ 2013-09-05 17:04       ` Ingo Molnar
  2013-09-05 19:33         ` Andi Kleen
  2013-09-05 17:12       ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2013-09-05 17:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, linux-kernel, acme, jolsa, eranian


* Andi Kleen <andi@firstfloor.org> wrote:

> On Thu, Sep 05, 2013 at 03:15:02PM +0200, Ingo Molnar wrote:
> > 
> > * Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > > One thing I'm not seeing in the current Haswell code is the config set 
> > > up for PERF_COUNT_HW_STALLED_CYCLES_FRONTEND/BACKEND. Both SB and IB have 
> > > them configured.
> > 
> > Ping? Consider this a regression report.
> 
> AFAIK they don't work. You only get the correct answer in some 
> situations, but in others they either overestimate frontend or 
> underestimate backend stalls badly.

Well, at least the front-end side is still documented in the SDM as being 
usable to count stalled cycles.

AFAICS backend stall cycles are documented to work on Ivy Bridge.

On Haswell there's only UOPS_EXECUTED.CORE (0xb1 0x02) - this will 
over-count but could still be useful if we halved its value and considered 
it only statistically correct.

For perf stat -a alike system-wide workloads it should still produce 
usable results that way.

I.e. something like the patch below (it does not solve the double counting 
yet).

Thanks,

	Ingo

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0abf674..a61dd79 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2424,6 +2424,10 @@ __init int intel_pmu_init(void)
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
 
+		/* UOPS_EXECUTED.THREAD,c=1,i=1 to count stall cycles*/
+		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
+			X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
+
 		pr_cont("IvyBridge events, ");
 		break;
 
@@ -2450,6 +2454,15 @@ __init int intel_pmu_init(void)
 		x86_pmu.hw_config = hsw_hw_config;
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
+
+		/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
+		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
+			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
+
+		/* UOPS_EXECUTED.CORE,c=1,i=1 to count stall cycles*/
+		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
+			X86_CONFIG(.event=0xb1, .umask=0x02, .inv=1, .cmask=1);
+
 		pr_cont("Haswell events, ");
 		break;
 

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-09-05 15:10     ` Andi Kleen
  2013-09-05 17:04       ` Ingo Molnar
@ 2013-09-05 17:12       ` Ingo Molnar
  1 sibling, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2013-09-05 17:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, linux-kernel, acme, jolsa, eranian


* Andi Kleen <andi@firstfloor.org> wrote:

> The correct way is to implement it like TopDown level 1, but I don't 
> know how to put that into the kernel.

Create an event group, with some callbacks to do the 
additions/subtractions to get at the right figures? (If it's plain linear 
arithmetic then that could be encoded in some simple operation flags as 
well, executed and calculated when the group count is accessed.)

That's something that would be useful to have in the kernel anyway, to 
abstract away simple concepts that are not so simple to measure.

> http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
> 
> It requires running 4 counters and computing some equations.
>
> My toplev tool in http://github.com/andikleen/pmu-tools has an 
> implementation on top of perf.
> 
> I could put it into perf stat if you want, but it would be somewhat 
> Intel specific.

Yeah, would be nice to hide this mostly transparently, behind a group of 
events or so.

Thanks,

	Ingo

* Re: perf, x86: Add parts of the remaining haswell PMU functionality
  2013-09-05 17:04       ` Ingo Molnar
@ 2013-09-05 19:33         ` Andi Kleen
  0 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2013-09-05 19:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, peterz, linux-kernel, acme, jolsa, eranian

> Well, at least the front-end side is still documented in the SDM as being 
> usable to count stalled cycles.

Stalled frontend cycles do not necessarily mean frontend bound.
The real bottleneck can still be somewhere later in the pipeline.
Out-of-order CPUs are complex.

> 
> AFAICS backend stall cycles are documented to work on Ivy Bridge.

I'm not aware of any documentation that presents these events
as accurate frontend/backend stalls without using the full
TopDown methodology (Optimization manual B.3.2).

The level 1 top down method for IvyBridge and Haswell is:

PipelineWidth = 4
Slots = PipelineWidth*CPU_CLK_UNHALTED
FrontendBound = IDQ_UOPS_NOT_DELIVERED.CORE / Slots
BadSpeculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS +
PipelineWidth*INT_MISC.RECOVERY_CYCLES) / Slots
Retiring = UOPS_RETIRED.RETIRE_SLOTS / Slots
BackendBound = 1 - (FrontendBound + BadSpeculation + Retiring)
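
As a minimal arithmetic sketch of the equations above (names are
illustrative; this is not kernel code):

struct topdown_l1 {
	double frontend_bound, bad_speculation, retiring, backend_bound;
};

static struct topdown_l1 topdown(double cycles, double not_delivered,
				 double issued, double retire_slots,
				 double recovery_cycles)
{
	const double width = 4.0;	/* pipeline width on IVB/HSW */
	double slots = width * cycles;
	struct topdown_l1 t;

	t.frontend_bound  = not_delivered / slots;
	t.bad_speculation = (issued - retire_slots +
			     width * recovery_cycles) / slots;
	t.retiring        = retire_slots / slots;
	t.backend_bound   = 1.0 - (t.frontend_bound +
				   t.bad_speculation + t.retiring);
	return t;
}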

> For perf stat -a alike system-wide workloads it should still produce 
> usable results that way.

For some classes of workloads it will be a large unpredictable
systematic error.

> I.e. something like the patch below (it does not solve the double counting 
> yet).

Well you can add it, but I'm not going to Ack it.

-Andi

