* perf: Add basic Skylake PMU support
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel

This patchkit adds support for the Intel Skylake core PMU to perf, as
documented in the recently released SDM revision 054[1], Vol. 3,
sections 17.9 and 18.12.

The main user-visible features are timed branch records, which make it
possible to get cycle counts for individual basic blocks, and a time
stamp in PEBS records, which improves multi-record PEBS. The LBRs
(branch records) have also been extended to 32 entries, which allows
more accurate branch sampling and deeper call stacks.
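
For illustration, perf report -D (patch 9) will then print branch
stack entries with a per-branch cycle count, along these lines
(addresses and values illustrative):

  .....  0: 0000000000400540 -> 0000000000400580 11 cycles  P   0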

-Andi

[1] http://www.cps.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html


* [PATCH 1/9] x86: Add a native_perf_sched_clock_from_tsc
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen, tglx

From: Andi Kleen <ak@linux.intel.com>

PEBSv3 has a raw TSC time stamp in its memory buffer that
later needs to be converted to perf_clock. Add a
native_sched_clock_from_tsc() that works the same as
native_sched_clock(), but starts from an already-read TSC
value.

Paravirt is ignored for now; such callers will just get the
native clock. But there is no paravirtualized PEBS anyway.
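
For illustration, a minimal sketch of the intended use (the caller
shown is hypothetical; patch 2 adds the real one):

	/*
	 * Convert a TSC value captured by hardware at event time
	 * (e.g. from a PEBS record) into the sched_clock/perf time
	 * base, instead of reading the TSC again now.
	 */
	u64 hw_tsc = pebs->tsc;		/* raw TSC from the record */
	u64 now_ns = native_sched_clock_from_tsc(hw_tsc);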

Cc: tglx@linutronix.de
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/tsc.h | 1 +
 arch/x86/kernel/tsc.c      | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0..aad56eb 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -51,6 +51,7 @@ extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern int check_tsc_disabled(void);
 extern unsigned long native_calibrate_tsc(void);
+extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
 
 extern int tsc_clocksource_reliable;
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 5054497..c16afd5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -296,6 +296,14 @@ u64 native_sched_clock(void)
 	return cycles_2_ns(tsc_now);
 }
 
+/*
+ * Generate a sched_clock if you already have a TSC value.
+ */
+u64 native_sched_clock_from_tsc(u64 tsc)
+{
+	return cycles_2_ns(tsc);
+}
+
 /* We need to define a real function for sched_clock, to override the
    weak default version */
 #ifdef CONFIG_PARAVIRT
-- 
1.9.3



* [PATCH 2/9] x86, perf: Add support for PEBSv3 profiling
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

PEBSv3 is the same as the existing PEBSv2 used on Haswell,
but it adds a new TSC field. Add support to the generic
PEBS handler to handle the new format, and overwrite
the perf time stamp using the new native_sched_clock_from_tsc().
Right now the time stamp is only slightly more accurate,
as it is nearer the actual event trigger point. With
the PEBS threshold > 1 patchkit it will be much more accurate,
avoiding the problems with MMAP mismatches seen earlier.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 32 ++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 813f75d..2b51af5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -224,6 +224,19 @@ union hsw_tsx_tuning {
 
 #define PEBS_HSW_TSX_FLAGS	0xff00000000ULL
 
+/* Same as HSW, plus TSC */
+
+struct pebs_record_v3 {
+	u64 flags, ip;
+	u64 ax, bx, cx, dx;
+	u64 si, di, bp, sp;
+	u64 r8,  r9,  r10, r11;
+	u64 r12, r13, r14, r15;
+	u64 status, dla, dse, lat;
+	u64 real_ip, tsx_tuning;
+	u64 tsc;
+};
+
 void init_debug_store_on_cpu(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -827,7 +840,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	return 0;
 }
 
-static inline u64 intel_hsw_weight(struct pebs_record_hsw *pebs)
+static inline u64 intel_hsw_weight(struct pebs_record_v3 *pebs)
 {
 	if (pebs->tsx_tuning) {
 		union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning };
@@ -836,7 +849,7 @@ static inline u64 intel_hsw_weight(struct pebs_record_hsw *pebs)
 	return 0;
 }
 
-static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
+static inline u64 intel_hsw_transaction(struct pebs_record_v3 *pebs)
 {
 	u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
 
@@ -858,7 +871,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * unconditionally access the 'extra' entries.
 	 */
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct pebs_record_hsw *pebs = __pebs;
+	struct pebs_record_v3 *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	u64 sample_type;
@@ -958,6 +971,13 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 			data.txn = intel_hsw_transaction(pebs);
 	}
 
+	/*
+	 * v3 supplies an accurate time stamp, so we use that
+	 * for the time stamp.
+	 */
+	if (x86_pmu.intel_cap.pebs_format >= 3)
+		data.time = native_sched_clock_from_tsc(pebs->tsc);
+
 	if (has_branch_stack(event))
 		data.br_stack = &cpuc->lbr_stack;
 
@@ -1098,6 +1118,12 @@ void __init intel_ds_init(void)
 			x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
 			break;
 
+		case 3:
+			pr_cont("PEBS fmt3%c, ", pebs_type);
+			x86_pmu.pebs_record_size = sizeof(struct pebs_record_v3);
+			x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
+			break;
+
 		default:
 			printk(KERN_CONT "no PEBS fmt%d%c, ", format, pebs_type);
 			x86_pmu.pebs = 0;
-- 
1.9.3



* [PATCH 3/9] x86: Add new MSRs and MSR bits used for Skylake perfmon
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add new MSRs (LBR_INFO) and some new MSR bits used by the Skylake
PMU driver.
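
For illustration, a minimal sketch of how the new bits are meant to
be consumed (patch 5 adds the real code; 'idx' is a hypothetical
LBR index):

	u64 info;
	bool mispred, in_tx;
	u16 cycles;

	rdmsrl(MSR_LBR_INFO_0 + idx, info);	/* idx = 0 .. 31 */
	mispred	= !!(info & LBR_INFO_MISPRED);
	in_tx	= !!(info & LBR_INFO_IN_TX);
	cycles	= info & LBR_INFO_CYCLES;	/* cycles since last branch */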

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/perf_event.h     | 7 +++++++
 arch/x86/include/uapi/asm/msr-index.h | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index dc0f6ed..7bcb861 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -159,6 +159,13 @@ struct x86_pmu_capability {
  */
 #define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 16)
 
+#define GLOBAL_STATUS_COND_CHG				BIT_ULL(63)
+#define GLOBAL_STATUS_BUFFER_OVF			BIT_ULL(62)
+#define GLOBAL_STATUS_UNC_OVF				BIT_ULL(61)
+#define GLOBAL_STATUS_ASIF				BIT_ULL(60)
+#define GLOBAL_STATUS_COUNTERS_FROZEN			BIT_ULL(59)
+#define GLOBAL_STATUS_LBRS_FROZEN			BIT_ULL(58)
+
 /*
  * IBS cpuid feature detection
  */
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index c469490..a6e1a2d 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -72,6 +72,12 @@
 #define MSR_LBR_CORE_FROM		0x00000040
 #define MSR_LBR_CORE_TO			0x00000060
 
+#define MSR_LBR_INFO_0			0x00000dc0 /* ... 0xddf for _31 */
+#define LBR_INFO_MISPRED		(1UL << 63)
+#define LBR_INFO_IN_TX			(1UL << 62)
+#define LBR_INFO_ABORT			(1UL << 61)
+#define LBR_INFO_CYCLES			0xffff
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
-- 
1.9.3



* [PATCH 4/9] perf: Add cycles to branch_info
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Skylake supports reporting the time in cycles that a branch
recorded in the LBR took, to give a rough indication of basic
block performance.

Export the cycle information in the branch_info structure.
This can be done by simply reusing some of the currently
zeroed padding.

This is just the generic header change. The architecture
still needs to fill it in.

There's no attempt to convert to real time, as we really
want cycles here.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/uapi/linux/perf_event.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 309211b..9ef489e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -904,6 +904,7 @@ union perf_mem_data_src {
  *
  *     in_tx: running in a hardware transaction
  *     abort: aborting a hardware transaction
+ *    cycles: cycles from last branch (or 0 if not supported)
  */
 struct perf_branch_entry {
 	__u64	from;
@@ -912,7 +913,8 @@ struct perf_branch_entry {
 		predicted:1,/* target predicted */
 		in_tx:1,    /* in transaction */
 		abort:1,    /* transaction abort */
-		reserved:60;
+		cycles:16,  /* cycle count to last branch */
+		reserved:44;
 };
 
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
-- 
1.9.3



* [PATCH 5/9] x86, perf: Add support for LBRv5
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add support for the new LBRv5 format used on Skylake.

The flags for mispredict, abort, in_tx etc. have moved to a range
of separate LBR_INFO_* MSRs. Teach the LBR code to read those. The
original LBR registers stay the same, except that they are now
fully sign-extended.

LBR_INFO also reports a cycle count since the last branch.
Report the cycle information using the new "cycles" branch_info
output field.

In addition we have to context switch and clear the new INFO
MSRs to avoid any information leaks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  1 +
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 21 ++++++++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index c2031cb..6dabfdc 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -595,6 +595,7 @@ struct x86_pmu {
 struct x86_perf_task_context {
 	u64 lbr_from[MAX_LBR_ENTRIES];
 	u64 lbr_to[MAX_LBR_ENTRIES];
+	u64 lbr_info[MAX_LBR_ENTRIES];
 	int lbr_callstack_users;
 	int lbr_stack_state;
 };
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 94e5b50..d938cae 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -13,7 +13,8 @@ enum {
 	LBR_FORMAT_EIP		= 0x02,
 	LBR_FORMAT_EIP_FLAGS	= 0x03,
 	LBR_FORMAT_EIP_FLAGS2	= 0x04,
-	LBR_FORMAT_MAX_KNOWN    = LBR_FORMAT_EIP_FLAGS2,
+	LBR_FORMAT_INFO		= 0x05,
+	LBR_FORMAT_MAX_KNOWN    = LBR_FORMAT_INFO,
 };
 
 static enum {
@@ -184,6 +185,8 @@ static void intel_pmu_lbr_reset_64(void)
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		wrmsrl(x86_pmu.lbr_from + i, 0);
 		wrmsrl(x86_pmu.lbr_to   + i, 0);
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			wrmsrl(MSR_LBR_INFO_0 + i, 0);
 	}
 }
 
@@ -232,6 +235,8 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 		lbr_idx = (tos - i) & mask;
 		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
 		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			wrmsrl(MSR_LBR_INFO_0 + i, task_ctx->lbr_info[i]);
 	}
 	task_ctx->lbr_stack_state = LBR_NONE;
 }
@@ -253,6 +258,8 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 		lbr_idx = (tos - i) & mask;
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
 		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			rdmsrl(MSR_LBR_INFO_0 + i, task_ctx->lbr_info[i]);
 	}
 	task_ctx->lbr_stack_state = LBR_VALID;
 }
@@ -417,11 +424,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		unsigned long lbr_idx = (tos - i) & mask;
 		u64 from, to, mis = 0, pred = 0, in_tx = 0, abort = 0;
 		int skip = 0;
+		u16 cycles = 0;
 		int lbr_flags = lbr_desc[lbr_format];
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
+		if (lbr_format == LBR_FORMAT_INFO) {
+			u64 info;
+
+			rdmsrl(MSR_LBR_INFO_0 + lbr_idx, info);
+			mis = !!(info & LBR_INFO_MISPRED);
+			pred = !mis;
+			in_tx = !!(info & LBR_INFO_IN_TX);
+			abort = !!(info & LBR_INFO_ABORT);
+			cycles = (info & LBR_INFO_CYCLES);
+		}
 		if (lbr_flags & LBR_EIP_FLAGS) {
 			mis = !!(from & LBR_FROM_FLAG_MISPRED);
 			pred = !mis;
@@ -451,6 +469,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		cpuc->lbr_entries[out].predicted = pred;
 		cpuc->lbr_entries[out].in_tx	 = in_tx;
 		cpuc->lbr_entries[out].abort	 = abort;
+		cpuc->lbr_entries[out].cycles	 = cycles;
 		cpuc->lbr_entries[out].reserved	 = 0;
 		out++;
 	}
-- 
1.9.3



* [PATCH 6/9] x86, perf: Add Skylake support
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add perf core PMU support for future Intel Skylake CPU cores.

The code is based on Haswell/Broadwell.

There is a new cache event list, based on the updated Haswell
event list.

Skylake has removed most counter constraints on basic
events, so the basic constraints table now only has a single
entry (plus the fixed counters).

TSX support and various other setups are all shared with Haswell.

Skylake has 32 LBR entries. Add a new LBR init function
to set this up. The filters are all the same as Haswell.

It also has a new LBR format with separate LBR_INFO_* MSRs,
but support for that was already added earlier in this series.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |   6 +-
 arch/x86/kernel/cpu/perf_event_intel.c     | 232 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |  22 +++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  20 +++
 4 files changed, 279 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 6dabfdc..a7577a5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -138,7 +138,7 @@ struct intel_excl_cntrs {
 	unsigned	core_id;	/* per-core: core id */
 };
 
-#define MAX_LBR_ENTRIES		16
+#define MAX_LBR_ENTRIES		32
 
 enum {
 	X86_PERF_KFREE_SHARED = 0,
@@ -841,6 +841,8 @@ extern struct event_constraint intel_ivb_pebs_event_constraints[];
 
 extern struct event_constraint intel_hsw_pebs_event_constraints[];
 
+extern struct event_constraint intel_skl_pebs_event_constraints[];
+
 struct event_constraint *intel_pebs_constraints(struct perf_event *event);
 
 void intel_pmu_pebs_enable(struct perf_event *event);
@@ -877,6 +879,8 @@ void intel_pmu_lbr_init_snb(void);
 
 void intel_pmu_lbr_init_hsw(void);
 
+void intel_pmu_lbr_init_skl(void);
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 void intel_pt_interrupt(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 219d3fb..39fcd07 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -177,6 +177,14 @@ static struct event_constraint intel_slm_event_constraints[] __read_mostly =
 	EVENT_CONSTRAINT_END
 };
 
+struct event_constraint intel_skl_event_constraints[] = {
+	FIXED_EVENT_CONSTRAINT(0x00c0, 0),	/* INST_RETIRED.ANY */
+	FIXED_EVENT_CONSTRAINT(0x003c, 1),	/* CPU_CLK_UNHALTED.CORE */
+	FIXED_EVENT_CONSTRAINT(0x0300, 2),	/* CPU_CLK_UNHALTED.REF */
+	INTEL_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
+	EVENT_CONSTRAINT_END
+};
+
 static struct extra_reg intel_snb_extra_regs[] __read_mostly = {
 	/* must define OFFCORE_RSP_X first, see intel_fixup_er() */
 	INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3f807f8fffull, RSP_0),
@@ -193,6 +201,13 @@ static struct extra_reg intel_snbep_extra_regs[] __read_mostly = {
 	EVENT_EXTRA_END
 };
 
+static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
+	INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
+	INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
+	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
+	EVENT_EXTRA_END
+};
+
 EVENT_ATTR_STR(mem-loads,	mem_ld_nhm,	"event=0x0b,umask=0x10,ldlat=3");
 EVENT_ATTR_STR(mem-loads,	mem_ld_snb,	"event=0xcd,umask=0x1,ldlat=3");
 EVENT_ATTR_STR(mem-stores,	mem_st_snb,	"event=0xcd,umask=0x2");
@@ -244,6 +259,200 @@ static u64 intel_pmu_event_map(int hw_event)
 	return intel_perfmon_event_map[hw_event];
 }
 
+/*
+ * Notes on the events:
+ * - data reads do not include code reads (comparable to earlier tables)
+ * - data counts include speculative execution (except L1 write, dtlb, bpu)
+ * - remote node access includes remote memory, remote cache, remote mmio.
+ * - prefetches are not included in the counts.
+ * - icache miss does not include decoded icache
+ */
+
+#define SKL_DEMAND_DATA_RD		BIT_ULL(0)
+#define SKL_DEMAND_RFO			BIT_ULL(1)
+#define SKL_ANY_RESPONSE		BIT_ULL(16)
+#define SKL_SUPPLIER_NONE		BIT_ULL(17)
+#define SKL_L3_MISS_LOCAL_DRAM		BIT_ULL(26)
+#define SKL_L3_MISS_REMOTE_HOP0_DRAM	BIT_ULL(27)
+#define SKL_L3_MISS_REMOTE_HOP1_DRAM	BIT_ULL(28)
+#define SKL_L3_MISS_REMOTE_HOP2P_DRAM	BIT_ULL(29)
+#define SKL_L3_MISS			(SKL_L3_MISS_LOCAL_DRAM| \
+					 SKL_L3_MISS_REMOTE_HOP0_DRAM| \
+					 SKL_L3_MISS_REMOTE_HOP1_DRAM| \
+					 SKL_L3_MISS_REMOTE_HOP2P_DRAM)
+#define SKL_SPL_HIT			BIT_ULL(30)
+#define SKL_SNOOP_NONE			BIT_ULL(31)
+#define SKL_SNOOP_NOT_NEEDED		BIT_ULL(32)
+#define SKL_SNOOP_MISS			BIT_ULL(33)
+#define SKL_SNOOP_HIT_NO_FWD		BIT_ULL(34)
+#define SKL_SNOOP_HIT_WITH_FWD		BIT_ULL(35)
+#define SKL_SNOOP_HITM			BIT_ULL(36)
+#define SKL_SNOOP_NON_DRAM		BIT_ULL(37)
+#define SKL_ANY_SNOOP			(SKL_SPL_HIT|SKL_SNOOP_NONE| \
+					 SKL_SNOOP_NOT_NEEDED|SKL_SNOOP_MISS| \
+					 SKL_SNOOP_HIT_NO_FWD|SKL_SNOOP_HIT_WITH_FWD| \
+					 SKL_SNOOP_HITM|SKL_SNOOP_NON_DRAM)
+#define SKL_DEMAND_READ			SKL_DEMAND_DATA_RD
+#define SKL_SNOOP_DRAM			(SKL_SNOOP_NONE| \
+					 SKL_SNOOP_NOT_NEEDED|SKL_SNOOP_MISS| \
+					 SKL_SNOOP_HIT_NO_FWD|SKL_SNOOP_HIT_WITH_FWD| \
+					 SKL_SNOOP_HITM|SKL_SPL_HIT)
+#define SKL_DEMAND_WRITE		SKL_DEMAND_RFO
+#define SKL_LLC_ACCESS			SKL_ANY_RESPONSE
+#define SKL_L3_MISS_REMOTE		(SKL_L3_MISS_REMOTE_HOP0_DRAM| \
+					 SKL_L3_MISS_REMOTE_HOP1_DRAM| \
+					 SKL_L3_MISS_REMOTE_HOP2P_DRAM)
+
+static __initconst const u64 skl_hw_cache_event_ids
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX] =
+{
+ [ C(L1D ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x81d0,	/* MEM_INST_RETIRED.ALL_LOADS */
+		[ C(RESULT_MISS)   ] = 0x151,	/* L1D.REPLACEMENT */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x82d0,	/* MEM_INST_RETIRED.ALL_STORES */
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+ [ C(L1I ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x283,	/* ICACHE_64B.MISS */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+ [ C(LL  ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x1b7,	/* OFFCORE_RESPONSE */
+		[ C(RESULT_MISS)   ] = 0x1b7,	/* OFFCORE_RESPONSE */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x1b7,	/* OFFCORE_RESPONSE */
+		[ C(RESULT_MISS)   ] = 0x1b7,	/* OFFCORE_RESPONSE */
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+ [ C(DTLB) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x81d0,	/* MEM_INST_RETIRED.ALL_LOADS */
+		[ C(RESULT_MISS)   ] = 0x608,	/* DTLB_LOAD_MISSES.WALK_COMPLETED */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x82d0,	/* MEM_INST_RETIRED.ALL_STORES */
+		[ C(RESULT_MISS)   ] = 0x649,	/* DTLB_STORE_MISSES.WALK_COMPLETED */
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+ [ C(ITLB) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x2085,	/* ITLB_MISSES.STLB_HIT */
+		[ C(RESULT_MISS)   ] = 0xe85,	/* ITLB_MISSES.WALK_COMPLETED */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+ },
+ [ C(BPU ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0xc4,	/* BR_INST_RETIRED.ALL_BRANCHES */
+		[ C(RESULT_MISS)   ] = 0xc5,	/* BR_MISP_RETIRED.ALL_BRANCHES */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+ },
+ [ C(NODE) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x1b7,	/* OFFCORE_RESPONSE */
+		[ C(RESULT_MISS)   ] = 0x1b7,	/* OFFCORE_RESPONSE */
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x1b7,	/* OFFCORE_RESPONSE */
+		[ C(RESULT_MISS)   ] = 0x1b7,	/* OFFCORE_RESPONSE */
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+};
+
+static __initconst const u64 skl_hw_cache_extra_regs
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX] =
+{
+ [ C(LL  ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = SKL_DEMAND_READ|
+				       SKL_LLC_ACCESS|SKL_ANY_SNOOP,
+		[ C(RESULT_MISS)   ] = SKL_DEMAND_READ|
+				       SKL_L3_MISS|SKL_ANY_SNOOP|
+				       SKL_SUPPLIER_NONE,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = SKL_DEMAND_WRITE|
+				       SKL_LLC_ACCESS|SKL_ANY_SNOOP,
+		[ C(RESULT_MISS)   ] = SKL_DEMAND_WRITE|
+				       SKL_L3_MISS|SKL_ANY_SNOOP|
+				       SKL_SUPPLIER_NONE,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+ [ C(NODE) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = SKL_DEMAND_READ|
+				       SKL_L3_MISS_LOCAL_DRAM|SKL_SNOOP_DRAM,
+		[ C(RESULT_MISS)   ] = SKL_DEMAND_READ|
+				       SKL_L3_MISS_REMOTE|SKL_SNOOP_DRAM,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = SKL_DEMAND_WRITE|
+				       SKL_L3_MISS_LOCAL_DRAM|SKL_SNOOP_DRAM,
+		[ C(RESULT_MISS)   ] = SKL_DEMAND_WRITE|
+				       SKL_L3_MISS_REMOTE|SKL_SNOOP_DRAM,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = 0x0,
+		[ C(RESULT_MISS)   ] = 0x0,
+	},
+ },
+};
+
 #define SNB_DMND_DATA_RD	(1ULL << 0)
 #define SNB_DMND_RFO		(1ULL << 1)
 #define SNB_DMND_IFETCH		(1ULL << 2)
@@ -3292,6 +3501,29 @@ __init int intel_pmu_init(void)
 		pr_cont("Broadwell events, ");
 		break;
 
+	case 78: /* 14nm Skylake Mobile */
+	case 94: /* 14nm Skylake Desktop */
+		x86_pmu.late_ack = true;
+		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
+		intel_pmu_lbr_init_skl();
+
+		x86_pmu.event_constraints = intel_skl_event_constraints;
+		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
+		x86_pmu.extra_regs = intel_skl_extra_regs;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		/* all extra regs are per-cpu when HT is on */
+		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+
+		x86_pmu.hw_config = hsw_hw_config;
+		x86_pmu.get_event_constraints = hsw_get_event_constraints;
+		x86_pmu.cpu_events = hsw_events_attrs;
+		WARN_ON(!x86_pmu.format_attrs);
+		x86_pmu.cpu_events = hsw_events_attrs;
+		pr_cont("Skylake events, ");
+		break;
+
 	default:
 		switch (x86_pmu.version) {
 		case 1:
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 2b51af5..20ecfd0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -678,6 +678,28 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 	EVENT_CONSTRAINT_END
 };
 
+struct event_constraint intel_skl_pebs_event_constraints[] = {
+	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
+	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x21d0, 0xf), /* MEM_INST_RETIRED.LOCK_LOADS */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x22d0, 0xf), /* MEM_INST_RETIRED.LOCK_STORES */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x41d0, 0xf), /* MEM_INST_RETIRED.SPLIT_LOADS */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x42d0, 0xf), /* MEM_INST_RETIRED.SPLIT_STORES */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x81d0, 0xf), /* MEM_INST_RETIRED.ALL_LOADS */
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x82d0, 0xf), /* MEM_INST_RETIRED.ALL_STORES */
+	INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd1, 0xf),    /* MEM_LOAD_RETIRED.* */
+	INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd2, 0xf),    /* MEM_LOAD_L3_HIT_RETIRED.* */
+	INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD(0xd3, 0xf),    /* MEM_LOAD_L3_MISS_RETIRED.* */
+	/* Allow all events as PEBS with no flags */
+	INTEL_ALL_EVENT_CONSTRAINT(0, 0xf),
+	EVENT_CONSTRAINT_END
+};
+
 struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 {
 	struct event_constraint *c;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d938cae..6c48c97 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -961,6 +961,26 @@ void intel_pmu_lbr_init_hsw(void)
 	pr_cont("16-deep LBR, ");
 }
 
+/* skylake */
+__init void intel_pmu_lbr_init_skl(void)
+{
+	x86_pmu.lbr_nr	 = 32;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
+
+	/*
+	 * SW branch filter usage:
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
+	pr_cont("32-deep LBR, ");
+}
+
 /* atom */
 void __init intel_pmu_lbr_init_atom(void)
 {
-- 
1.9.3



* [PATCH 7/9] perf, x86: Handle new status bits
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Arch Perfmon v4 has some new status bits in GLOBAL_STATUS.
These need to be ignored when deciding whether an NMI came
from the PMU, to avoid eating all NMIs when they stay set
(see commit b292d7a104).

This patch ignores the new ASIF bit, which indicates that
SGX interfered with the PMU, and also the new LBR freeze
bits, which are set when the LBRs get frozen.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 39fcd07..2c8a268 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1821,13 +1821,14 @@ again:
 	intel_pmu_lbr_read();
 
 	/*
-	 * CondChgd bit 63 doesn't mean any overflow status. Ignore
-	 * and clear the bit.
+	 * Ignore a range of extra bits in status that do not indicate
+	 * overflow by themselves.
 	 */
-	if (__test_and_clear_bit(63, (unsigned long *)&status)) {
-		if (!status)
-			goto done;
-	}
+	status &= ~(GLOBAL_STATUS_COND_CHG |
+		    GLOBAL_STATUS_ASIF |
+		    GLOBAL_STATUS_LBRS_FROZEN);
+	if (!status)
+		goto done;
 
 	/*
 	 * PEBS overflow sets bit 62 in the global status register
-- 
1.9.3



* [PATCH 8/9] perf, x86: Optimize v4 LBR unfreezing
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

In Arch Perfmon v4 the GLOBAL_STATUS ack automatically unfreezes
the LBRs, so there is no need to do it manually in the LBR code.
Add a check to skip it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 6c48c97..64d3122 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -147,6 +147,13 @@ static void __intel_pmu_lbr_enable(bool pmi)
 		wrmsrl(MSR_LBR_SELECT, lbr_select);
 	}
 
+	/*
+	 * No need to unfreeze manually, as v4 can do that as part
+	 * of the GLOBAL_STATUS ack.
+	 */
+	if (pmi && x86_pmu.version >= 4)
+		return;
+
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 	orig_debugctl = debugctl;
 	debugctl |= DEBUGCTLMSR_LBR;
-- 
1.9.3



* [PATCH 9/9] perf, tools: Add tools support for cycles, weight branch_info field
From: Andi Kleen @ 2015-05-07 22:56 UTC
  To: peterz; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

cycles is a new branch_info field available on some CPUs
that indicates the time delta between branches in the LBR.

Add a sort key and output code for cycles, to allow displaying
the basic block cycles individually in perf report.

We also pass in the cycles as the weight when LBRs are processed,
which allows using the global and local weights to compute
averages.

Also print the cycles information in perf report -D, and add
printing for the previously missing LBR flags (mispredict etc.).
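
For example (illustrative invocation with this series applied):

  perf record -b -e cycles:pp ./workload
  perf report --sort symbol_from,symbol_to,cycles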

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt |  1 +
 tools/perf/util/event.h                  |  3 ++-
 tools/perf/util/hist.c                   |  3 ++-
 tools/perf/util/hist.h                   |  1 +
 tools/perf/util/session.c                | 16 ++++++++++++----
 tools/perf/util/sort.c                   | 24 ++++++++++++++++++++++++
 tools/perf/util/sort.h                   |  1 +
 7 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 27190ed..034a2b4 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -108,6 +108,7 @@ OPTIONS
 	- mispredict: "N" for predicted branch, "Y" for mispredicted branch
 	- in_tx: branch in TSX transaction
 	- abort: TSX transaction abort.
+	- cycles: Cycles in basic block
 
 	And default sort keys are changed to comm, dso_from, symbol_from, dso_to
 	and symbol_to, see '--branch-stack'.
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 7eecd5e..f028a3c 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -129,7 +129,8 @@ struct branch_flags {
 	u64 predicted:1;
 	u64 in_tx:1;
 	u64 abort:1;
-	u64 reserved:60;
+	u64 cycles:16;
+	u64 reserved:44;
 };
 
 struct branch_entry {
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 3387706..302fc05 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -623,7 +623,8 @@ iter_add_next_branch_entry(struct hist_entry_iter *iter, struct addr_location *a
 	 * and not events sampled. Thus we use a pseudo period of 1.
 	 */
 	he = __hists__add_entry(hists, al, iter->parent, &bi[i], NULL,
-				1, 1, 0, true);
+				1, bi->flags.cycles ? bi->flags.cycles : 1,
+				0, true);
 	if (he == NULL)
 		return -ENOMEM;
 
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 9f31b89..b55c904 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -47,6 +47,7 @@ enum hist_column {
 	HISTC_MEM_SNOOP,
 	HISTC_MEM_DCACHELINE,
 	HISTC_TRANSACTION,
+	HISTC_CYCLES,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index e722107..484b974 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -738,10 +738,18 @@ static void branch_stack__printf(struct perf_sample *sample)
 
 	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
 
-	for (i = 0; i < sample->branch_stack->nr; i++)
-		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 "\n",
-			i, sample->branch_stack->entries[i].from,
-			sample->branch_stack->entries[i].to);
+	for (i = 0; i < sample->branch_stack->nr; i++) {
+		struct branch_entry *e = &sample->branch_stack->entries[i];
+
+		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 " %hu cycles %s%s%s%s %x\n",
+			i, e->from, e->to,
+			e->flags.cycles,
+			e->flags.mispred ? "M" : " ",
+			e->flags.predicted ? "P" : " ",
+			e->flags.abort ? "A" : " ",
+			e->flags.in_tx ? "T" : " ",
+			(unsigned)e->flags.reserved);
+	}
 }
 
 static void regs_dump__printf(u64 mask, u64 *regs)
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 4593f36..03d8e6e 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -528,6 +528,29 @@ static int hist_entry__mispredict_snprintf(struct hist_entry *he, char *bf,
 	return repsep_snprintf(bf, size, "%-*.*s", width, width, out);
 }
 
+static int64_t
+sort__cycles_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return left->branch_info->flags.cycles -
+		right->branch_info->flags.cycles;
+}
+
+static int hist_entry__cycles_snprintf(struct hist_entry *he, char *bf,
+				    size_t size, unsigned int width)
+{
+	if (he->branch_info->flags.cycles == 0)
+		return repsep_snprintf(bf, size, "%-*s", width, "-");
+	return repsep_snprintf(bf, size, "%-*hd", width,
+			       he->branch_info->flags.cycles);
+}
+
+struct sort_entry sort_cycles = {
+	.se_header	= "Basic Block Cycles",
+	.se_cmp		= sort__cycles_cmp,
+	.se_snprintf	= hist_entry__cycles_snprintf,
+	.se_width_idx	= HISTC_CYCLES,
+};
+
 /* --sort daddr_sym */
 static int64_t
 sort__daddr_cmp(struct hist_entry *left, struct hist_entry *right)
@@ -1192,6 +1215,7 @@ static struct sort_dimension bstack_sort_dimensions[] = {
 	DIM(SORT_MISPREDICT, "mispredict", sort_mispredict),
 	DIM(SORT_IN_TX, "in_tx", sort_in_tx),
 	DIM(SORT_ABORT, "abort", sort_abort),
+	DIM(SORT_CYCLES, "cycles", sort_cycles),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index e97cd47..bc6c87a 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -185,6 +185,7 @@ enum sort_type {
 	SORT_MISPREDICT,
 	SORT_ABORT,
 	SORT_IN_TX,
+	SORT_CYCLES,
 
 	/* memory mode specific sort keys */
 	__SORT_MEMORY_MODE,
-- 
1.9.3



* Re: [PATCH 2/9] x86, perf: Add support for PEBSv3 profiling
From: Peter Zijlstra @ 2015-05-08 10:59 UTC
  To: Andi Kleen; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

On Thu, May 07, 2015 at 03:56:25PM -0700, Andi Kleen wrote:
> +/* Same as HSW, plus TSC */
> +
> +struct pebs_record_v3 {

This is inconsistently named.

> +	u64 flags, ip;
> +	u64 ax, bx, cx, dx;
> +	u64 si, di, bp, sp;
> +	u64 r8,  r9,  r10, r11;
> +	u64 r12, r13, r14, r15;
> +	u64 status, dla, dse, lat;
> +	u64 real_ip, tsx_tuning;
> +	u64 tsc;
> +};
> @@ -958,6 +971,13 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>  			data.txn = intel_hsw_transaction(pebs);
>  	}
>  
> +	/*
> +	 * v3 supplies an accurate time stamp, so we use that
> +	 * for the time stamp.
> +	 */
> +	if (x86_pmu.intel_cap.pebs_format >= 3)
> +		data.time = native_sched_clock_from_tsc(pebs->tsc);

That's wrong. It does not respect perf_event_attr::clock_id.




* Re: [PATCH 8/9] perf, x86: Optimize v4 LBR unfreezing
From: Peter Zijlstra @ 2015-05-08 11:19 UTC
  To: Andi Kleen; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

On Thu, May 07, 2015 at 03:56:31PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> In Arch Perfmon v4 the GLOBAL_STATUS ack automatically unfreezes
> the LBRs, so there is no need to do it manually in the LBR code.
> Add a check to skip it.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 6c48c97..64d3122 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -147,6 +147,13 @@ static void __intel_pmu_lbr_enable(bool pmi)
>  		wrmsrl(MSR_LBR_SELECT, lbr_select);
>  	}
>  
> +	/*
> +	 * No need to unfreeze manually, as v4 can do that as part
> +	 * of the GLOBAL_STATUS ack.
> +	 */
> +	if (pmi && x86_pmu.version >= 4)
> +		return;
> +

But that block above is !pmi, so strictly exclusive with this condition.
So why not put this at the start?

>  	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>  	orig_debugctl = debugctl;
>  	debugctl |= DEBUGCTLMSR_LBR;


* Re: [PATCH 2/9] x86, perf: Add support for PEBSv3 profiling
From: Andi Kleen @ 2015-05-08 11:59 UTC
  To: Peter Zijlstra
  Cc: Andi Kleen, kan.liang, eranian, acme, linux-kernel, Andi Kleen

> That's wrong. It does not respect perf_event_attr::clock_id.

Ok.

Are we really defaulting to real time now? (CLOCK_REALTIME == 0)
What happens when ntpd is active and lets the time drift? That
sounds very bad for event accuracy.

The clockid patch looks broken to me. At least it should default
to MONOTONIC time, not just the clock that happens to be 0 by
accident.

Also, I question whether we really need that many different kinds
of time stamps.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 2/9] x86, perf: Add support for PEBSv3 profiling
From: Peter Zijlstra @ 2015-05-08 12:06 UTC
  To: Andi Kleen; +Cc: kan.liang, eranian, acme, linux-kernel, Andi Kleen

On Fri, May 08, 2015 at 01:59:54PM +0200, Andi Kleen wrote:
> > That's wrong. It does not respect perf_event_attr::clock_id.
> 
> Ok.
> 
> Are we really defaulting to real time now? 

No, we still default to the thing you implemented.

However, when the event is configured differently, you'll write the
wrong stamps in. This code needs to consider that.

> (CLOCK_REALTIME == 0)
> What happens when ntpd is active and lets the time drift? That
> sounds very bad for event accuracy.

NTP slew rate adjustment should still keep the thing monotonic, so
it's still useful to order events.

But as is, CLOCK_REALTIME is not supported for hardware events, only
CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW are.

> The clockid patch looks broken to me. At least it should default
> to MONOTONIC time, not just the clock that happens to be 0 by
> accident.

You've not looked hard enough: the changelog states, and the code
matches, that the default is unchanged.

> Also, I question whether we really need that many different kinds
> of time stamps.

CLOCK_MONOTONIC is useful for cluster wide timestamps,
CLOCK_MONOTONIC_RAW is useful for things that need/want to correlate to
fixed rate clocks.

Most of the other clocks are just a side effect of using the generic
clockid API for selecting clocks.

