linux-kernel.vger.kernel.org archive mirror
* perf: Some improvements for Skylake perf
@ 2015-10-15 23:37 Andi Kleen
  2015-10-15 23:37 ` [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Andi Kleen @ 2015-10-15 23:37 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel

This patchkit improves the perf port for the Skylake core PMU:

- Fix problems with LBR freezing with a new APIC ack sequence
- Improve performance and accuracy of the NMI handler with counter freezing
- Improve sampling quality with cycles:pp

The last feature could be used on other CPUs too, but for now it's only
enabled on Skylake.

-Andi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-15 23:37 perf: Some improvements for Skylake perf Andi Kleen
@ 2015-10-15 23:37 ` Andi Kleen
  2015-10-16 11:51   ` Peter Zijlstra
  2015-10-15 23:37 ` [PATCH 2/4] x86, perf: Factor out BTS enable/disable functions Andi Kleen
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2015-10-15 23:37 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The SKL PMU code had a problem with LBR freezing. When a counter
overflows while we are still in the PMI handler, the LBRs would be
frozen early and not be unfrozen until the next PMI. This means we
would get stale LBR information.

Depending on the workload this could happen for a few percent of the
PMIs for cycles in adaptive frequency mode, because the frequency
algorithm regularly drops to very low periods.

This patch implements a new PMU ack sequence that avoids this problem.
The new sequence is:

- (Counters are disabled with GLOBAL_CTRL.)

There should be no further increments of the counters by later
instructions, and thus no additional PMIs (and thus no additional
freezing).

- Ack the APIC.

Clear the APIC PMI LVT entry so that any later interrupt is delivered
and is not lost due to the PMI LVT entry being masked. A lost PMI
interrupt could lead to the LBRs staying frozen without ever entering
the PMI handler.

- Ack the PMU counters. This unfreezes the LBRs on Skylake (but not on
earlier CPUs, which rely on DEBUGCTL writes for this).

- Reenable the counters.

The WRMSR will start the counters counting again (and will be ordered
after the APIC LVT PMI entry write, since WRMSR is architecturally
serializing). Since the APIC PMI LVT is unmasked, any PMI caused by
these perfmon counters will trigger an NMI (but the delivery may be
delayed until after the next IRET).
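
To make the ordering concrete, here is a minimal user-space sketch of the
sequence (a compilable illustration only: the MSR and APIC accesses are
stubbed out with prints, and the names below are placeholders rather than
the actual kernel helpers):

#include <stdio.h>
#include <stdint.h>

/* Stubs standing in for the real MSR/APIC accesses so the sketch runs anywhere. */
static void wrmsr(const char *msr, uint64_t val)
{
	printf("WRMSR %-15s <- %#llx\n", msr, (unsigned long long)val);
}

static uint64_t rdmsr(const char *msr)
{
	printf("RDMSR %s\n", msr);
	return 0x1;				/* pretend counter 0 overflowed */
}

int main(void)
{
	uint64_t status;

	wrmsr("GLOBAL_CTRL", 0);		/* counters disabled: no new PMIs, no new freezes */
	status = rdmsr("GLOBAL_STATUS");	/* ... handle the overflowed counters ... */
	printf("APIC LVTPC <- DM_NMI\n");	/* ack the APIC: a later PMI is not lost */
	wrmsr("GLOBAL_OVF_CTRL", status);	/* ack the PMU: unfreezes the LBRs on Skylake */
	wrmsr("GLOBAL_CTRL", ~0ULL);		/* reenable (the real enable mask is more selective) */
	return 0;
}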

One side effect is that the old retry loop is not possible anymore,
as the counters stay unacked for the majority of the PMI handler,
but that is not a big loss, as "profiling" the PMI handler was always
a bit dubious. With the old ack sequence the retry loop is still supported.

In principle the sequence should work on other CPUs too, but
since I only tested on Skylake it is only enabled there.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h       |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c | 35 ++++++++++++++++++++++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 499f533..fcf01c7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -551,6 +551,7 @@ struct x86_pmu {
 	struct x86_pmu_quirk *quirks;
 	int		perfctr_second_write;
 	bool		late_ack;
+	bool		status_ack_after_apic;
 	unsigned	(*limit_period)(struct perf_event *event, unsigned l);
 
 	/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f63360b..69a545e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1789,6 +1789,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	struct cpu_hw_events *cpuc;
 	int bit, loops;
 	u64 status;
+	u64 orig_status;
 	int handled;
 
 	cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1803,13 +1804,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	handled = intel_pmu_drain_bts_buffer();
 	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
+	orig_status = status;
 	if (!status)
 		goto done;
 
 	loops = 0;
 again:
 	intel_pmu_lbr_read();
-	intel_pmu_ack_status(status);
+	if (!x86_pmu.status_ack_after_apic)
+		intel_pmu_ack_status(status);
+
 	if (++loops > 100) {
 		static bool warned = false;
 		if (!warned) {
@@ -1877,15 +1881,20 @@ again:
 			x86_pmu_stop(event, 0);
 	}
 
-	/*
-	 * Repeat if there is more work to be done:
-	 */
-	status = intel_pmu_get_status();
-	if (status)
-		goto again;
+
+	if (!x86_pmu.status_ack_after_apic) {
+		/*
+		 * Repeat if there is more work to be done:
+		 */
+		status = intel_pmu_get_status();
+		if (status)
+			goto again;
+	}
 
 done:
-	__intel_pmu_enable_all(0, true);
+	if (!x86_pmu.status_ack_after_apic)
+		__intel_pmu_enable_all(0, true);
+
 	/*
 	 * Only unmask the NMI after the overflow counters
 	 * have been reset. This avoids spurious NMIs on
@@ -1893,6 +1902,15 @@ done:
 	 */
 	if (x86_pmu.late_ack)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+	/*
+	 * Ack the PMU late. This avoids bogus freezing
+	 * on Skylake CPUs.
+	 */
+	if (x86_pmu.status_ack_after_apic) {
+		intel_pmu_ack_status(orig_status);
+		__intel_pmu_enable_all(0, true);
+	}
 	return handled;
 }
 
@@ -3514,6 +3532,7 @@ __init int intel_pmu_init(void)
 	case 78: /* 14nm Skylake Mobile */
 	case 94: /* 14nm Skylake Desktop */
 		x86_pmu.late_ack = true;
+		x86_pmu.status_ack_after_apic = true;
 		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 		intel_pmu_lbr_init_skl();
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/4] x86, perf: Factor out BTS enable/disable functions
  2015-10-15 23:37 perf: Some improvements for Skylake perf Andi Kleen
  2015-10-15 23:37 ` [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake Andi Kleen
@ 2015-10-15 23:37 ` Andi Kleen
  2015-10-15 23:37 ` [PATCH 3/4] perf, x86: Use counter freezing with Arch Perfmon v4 Andi Kleen
  2015-10-15 23:38 ` [PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Andi Kleen
  3 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2015-10-15 23:37 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Move the BTS enable and disable code into their own functions. They
will be used by the next patch.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 43 ++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 69a545e..a466055 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1457,20 +1457,39 @@ static __initconst const u64 slm_hw_cache_event_ids
  },
 };
 
-/*
- * Use from PMIs where the LBRs are already disabled.
- */
-static void __intel_pmu_disable_all(void)
+static inline void intel_pmu_maybe_disable_bts(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
-
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
 	else
 		intel_bts_disable_local();
+}
+
+static inline void intel_pmu_maybe_enable_bts(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
+	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
+		struct perf_event *event =
+			cpuc->events[INTEL_PMC_IDX_FIXED_BTS];
+
+		if (WARN_ON_ONCE(!event))
+			return;
+
+		intel_pmu_enable_bts(event->hw.config);
+	} else
+		intel_bts_enable_local();
+}
+
+/*
+ * Use from PMIs where the LBRs are already disabled.
+ */
+static void __intel_pmu_disable_all(void)
+{
+	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+	intel_pmu_maybe_disable_bts();
 	intel_pmu_pebs_disable_all();
 }
 
@@ -1488,17 +1507,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
 	intel_pmu_lbr_enable_all(pmi);
 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
 			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask);
-
-	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
-		struct perf_event *event =
-			cpuc->events[INTEL_PMC_IDX_FIXED_BTS];
-
-		if (WARN_ON_ONCE(!event))
-			return;
-
-		intel_pmu_enable_bts(event->hw.config);
-	} else
-		intel_bts_enable_local();
+	intel_pmu_maybe_enable_bts();
 }
 
 static void intel_pmu_enable_all(int added)
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/4] perf, x86: Use counter freezing with Arch Perfmon v4
  2015-10-15 23:37 perf: Some improvements for Skylake perf Andi Kleen
  2015-10-15 23:37 ` [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake Andi Kleen
  2015-10-15 23:37 ` [PATCH 2/4] x86, perf: Factor out BTS enable/disable functions Andi Kleen
@ 2015-10-15 23:37 ` Andi Kleen
  2015-10-15 23:38 ` [PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Andi Kleen
  3 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2015-10-15 23:37 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Now that we have reliable LBR unfreezing we can also use
real counter freezing.

Arch Perfmon v4 has an improved counter freezing implementation.

With counter freezing the PMU automatically "freezes" all counters
on a counter overflow, so that the PMI handler does not need to
explicitly disable the counter.

With Arch Perfmon v4 the freeze bits are now explicit bits in
GLOBAL_STATUS, which avoids a range of races the previous implementation
had, and also avoids an extra write to GLOBAL_CTRL.

Advantages of counter freezing:
- It avoids a couple of costly extra MSR writes in the PMI handler.
- It makes the PMI handler more accurate, as all counters get frozen
atomically as soon as any counter overflows, so the PMI handler itself
perturbs the counts much less.

With freezing we don't need to disable or enable counters or PEBS in
the PMI handler. Only BTS, which does not support auto-freezing, still
needs to be explicitly disabled.

The ack of the counter status serves as the reenable action, by
clearing the freeze bits in the status register.

So previously for a PEBS counter the PMI would do (each line an MSR access):

disable global ctrl
disable pebs
read status
ack status
read status again
reenable global ctrl
reenable pebs

(5x WRMSR, 2x RDMSR)

With the new counter freezing support this is simplified to:

read status
ack

(1x WRMSR, 1x RDMSR)
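
For comparison, a user-space sketch of the freeze-mode PMI path (again an
illustration only, with the MSR accesses stubbed out; the MSR names are
placeholders, and the sketch assumes the v4 behaviour where acking the
status also clears the freeze bits):

#include <stdio.h>
#include <stdint.h>

static uint64_t rdmsr(const char *msr)
{
	printf("RDMSR %s\n", msr);
	return 0x1;			/* pretend counter 0 overflowed and froze */
}

static void wrmsr(const char *msr, uint64_t val)
{
	printf("WRMSR %s <- %#llx\n", msr, (unsigned long long)val);
}

int main(void)
{
	/*
	 * The counters froze themselves on overflow (FREEZE_PERFMON_ON_PMI
	 * is set in DEBUGCTL), so no GLOBAL_CTRL or PEBS disable is needed.
	 */
	uint64_t status = rdmsr("GLOBAL_STATUS");	/* the one RDMSR */

	/* ... process the overflowed counters, drain PEBS buffers ... */

	/*
	 * The one WRMSR: acking the status clears the overflow and freeze
	 * bits, which also resumes counting.
	 */
	wrmsr("GLOBAL_OVF_CTRL", status);
	return 0;
}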

The counter freezing code is only used when the CPU model has opted
into the new status-ack-after-APIC sequence, so it is currently only
used on Skylake.

One issue is that the hardware doesn't like the period being changed
while in freeze mode. To avoid any problems here we only use counter
freezing when no counter is in frequency mode, i.e. all counters have a
fixed period. This is tracked with new per-CPU state.

In frequency mode the old scheme is still used.

Performance:

When profiling a kernel build on Skylake with different perf options,
measuring the length of all NMI handlers using the nmi handler trace point:

(lower is better)

perf options                avg     max   delta
-                           962   37248
-c 100000                   445   31217   -53%  with counter freezing
-g                         1753   47312
-g -c 100000                966   33698   -44%  with counter freezing
--call-graph lbr           3770   37164
--call-graph lbr -c 100000 2433   36930   -35%  with counter freezing
--c.g. dwarf               2055   33598
--c.g. dwarf -c 100000     1478   30491   -28%  with counter freezing

So the average cost of an NMI handler is cut down significantly with
freezing.

At least on this workload this makes -g competitive with the previous
non -g.

The max cost isn't really improved, since that is dominated by
other overhead.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kernel/cpu/perf_event.h       |  5 +++
 arch/x86/kernel/cpu/perf_event_intel.c | 79 +++++++++++++++++++++++++++++++++-
 3 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 54390bc..a527e77 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -140,6 +140,7 @@
 #define DEBUGCTLMSR_BTS_OFF_OS		(1UL <<  9)
 #define DEBUGCTLMSR_BTS_OFF_USR		(1UL << 10)
 #define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI	(1UL << 11)
+#define DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI (1UL << 12)
 
 #define MSR_PEBS_FRONTEND		0x000003f7
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fcf01c7..2cbba8c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -240,6 +240,11 @@ struct cpu_hw_events {
 	int excl_thread_id; /* 0 or 1 */
 
 	/*
+	 * Counter freezing state.
+	 */
+	int				frozen_enabled;
+
+	/*
 	 * AMD specific bits
 	 */
 	struct amd_nb			*amd_nb;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a466055..4c46cd7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1596,6 +1596,25 @@ static void intel_pmu_nhm_enable_all(int added)
 	intel_pmu_enable_all(added);
 }
 
+static inline bool event_can_freeze(struct perf_event *event)
+{
+	if (!x86_pmu.status_ack_after_apic)
+		return false;
+	return !event->attr.freq;
+}
+
+static void enable_counter_freeze(void)
+{
+	update_debugctlmsr(get_debugctlmsr() |
+			DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
+static void disable_counter_freeze(void)
+{
+	update_debugctlmsr(get_debugctlmsr() &
+			~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
 static inline u64 intel_pmu_get_status(void)
 {
 	u64 status;
@@ -1649,6 +1668,14 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	if (needs_branch_stack(event))
 		intel_pmu_lbr_disable(event);
 
+	/*
+	 * We could disable freezing here, but it doesn't hurt if it stays on.
+	 * perf remembers the state, and someone else will likely
+	 * reinitialize it.
+	 *
+	 * This avoids an extra MSR write in many situations.
+	 */
+
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_disable_fixed(hwc);
 		return;
@@ -1715,6 +1742,26 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	if (event->attr.exclude_guest)
 		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
 
+	if (x86_pmu.version >= 4) {
+		/*
+		 * Enable freezing if this event is suitable for freezing,
+		 * and no other event is in frequency mode.
+		 * Otherwise disable freezing for everyone.
+		 */
+		if (event_can_freeze(event) && event->ctx->nr_freq == 0) {
+			if (!cpuc->frozen_enabled) {
+				enable_counter_freeze();
+				cpuc->frozen_enabled = 1;
+			}
+		} else if (cpuc->frozen_enabled) {
+			/* Disable freezing if it's on */
+			intel_pmu_disable_all();
+			cpuc->frozen_enabled = 0;
+			disable_counter_freeze();
+			intel_pmu_enable_all(0);
+		}
+	}
+
 	if (unlikely(event_is_checkpointed(event)))
 		cpuc->intel_cp_status |= (1ull << hwc->idx);
 
@@ -1800,16 +1847,29 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	u64 status;
 	u64 orig_status;
 	int handled;
+	bool freeze;
 
 	cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	/*
+	 * With counter freezing the CPU freezes counters on PMI.
+	 * This makes measurements more accurate and generally has
+	 * lower overhead, as we need to change fewer registers.
+	 *
+	 * We only freeze when all events are in fixed period mode.
+	 */
+	freeze = cpuc->frozen_enabled > 0;
+
+	/*
 	 * No known reason to not always do late ACK,
 	 * but just in case do it opt-in.
 	 */
 	if (!x86_pmu.late_ack)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
-	__intel_pmu_disable_all();
+	if (!freeze)
+		__intel_pmu_disable_all();
+	else
+		intel_pmu_maybe_disable_bts();
 	handled = intel_pmu_drain_bts_buffer();
 	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
@@ -1918,7 +1978,10 @@ done:
 	 */
 	if (x86_pmu.status_ack_after_apic) {
 		intel_pmu_ack_status(orig_status);
-		__intel_pmu_enable_all(0, true);
+		if (!freeze)
+			__intel_pmu_enable_all(0, true);
+		else
+			intel_pmu_maybe_enable_bts();
 	}
 	return handled;
 }
@@ -2908,6 +2971,11 @@ static void intel_pmu_cpu_dying(int cpu)
 	free_excl_cntrs(cpu);
 
 	fini_debug_store_on_cpu(cpu);
+
+	if (cpuc->frozen_enabled) {
+		cpuc->frozen_enabled = 0;
+		disable_counter_freeze();
+	}
 }
 
 static void intel_pmu_sched_task(struct perf_event_context *ctx,
@@ -3646,6 +3714,13 @@ __init int intel_pmu_init(void)
 		pr_cont("full-width counters, ");
 	}
 
+	/*
+	 * For arch perfmon 4 use counter freezing to avoid
+	 * several MSR accesses in the PMI.
+	 */
+	if (x86_pmu.version >= 4 && x86_pmu.status_ack_after_apic)
+		pr_cont("counter freezing, ");
+
 	return 0;
 }
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake
  2015-10-15 23:37 perf: Some improvements for Skylake perf Andi Kleen
                   ` (2 preceding siblings ...)
  2015-10-15 23:37 ` [PATCH 3/4] perf, x86: Use counter freezing with Arch Perfmon v4 Andi Kleen
@ 2015-10-15 23:38 ` Andi Kleen
  3 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2015-10-15 23:38 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Switch the cycles:pp alias from UOPS_RETIRED to INST_RETIRED.PREC_DIST.
The basic mechanism of abusing the inverse cmask to get all cycles
works the same as before.

PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
PREC_DIST can only be scheduled on counter 1, but that is ok for cycle
sampling, as there is normally no need to do multiple cycle sampling
runs in parallel. It is still possible to run perf top in parallel, as
that doesn't use precise mode. Also of course multiplexing can still
allow parallel operation.

The old UOPS_RETIRED based alias is still available in raw form.
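
For reference, here is a small stand-alone sketch of how the raw
encodings work out (it mirrors the event/umask/inv/cmask bit layout that
X86_CONFIG packs in the patch below; the helper is only an illustration,
not a kernel or perf API):

#include <stdio.h>
#include <stdint.h>

/*
 * Pack an event in the architectural perfmon layout:
 * event[7:0], umask[15:8], inv at bit 23, cmask[31:24].
 */
static uint64_t raw_config(unsigned event, unsigned umask, unsigned inv,
			   unsigned cmask)
{
	return (uint64_t)event | (uint64_t)umask << 8 |
	       (uint64_t)inv << 23 | (uint64_t)cmask << 24;
}

int main(void)
{
	/* New cycles:pp alias: INST_RETIRED.PREC_DIST (0x01c0), inv=1, cmask=16 */
	printf("PREC_DIST cycles alias:    %#llx\n",
	       (unsigned long long)raw_config(0xc0, 0x01, 1, 16));

	/* Old alias, still usable as a raw event: UOPS_RETIRED.ALL (0x01c2) */
	printf("UOPS_RETIRED cycles alias: %#llx\n",
	       (unsigned long long)raw_config(0xc2, 0x01, 1, 16));
	return 0;
}

This prints 0x108001c0 and 0x108001c2, the same values that appear in the
PEBS constraint table below.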

Example:

Sample a loop with 10 sqrt instructions with the old cycles:pp:

  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
  9.13 │      sqrtps %xmm1,%xmm0
 11.58 │      sqrtps %xmm1,%xmm0
 11.51 │      sqrtps %xmm1,%xmm0
  6.27 │      sqrtps %xmm1,%xmm0
 10.38 │      sqrtps %xmm1,%xmm0
 12.20 │      sqrtps %xmm1,%xmm0
 12.74 │      sqrtps %xmm1,%xmm0
  5.40 │      sqrtps %xmm1,%xmm0
 10.14 │      sqrtps %xmm1,%xmm0
 10.51 │    ↑ jmp    10

We expect all 10 sqrt instructions to get roughly the same number of
samples.

But you can see that the instruction directly after the jmp is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST based sampling this problem is gone
and all instructions show up roughly evenly:

  9.51 │10:   sqrtps %xmm1,%xmm0
 11.74 │      sqrtps %xmm1,%xmm0
 11.84 │      sqrtps %xmm1,%xmm0
  6.05 │      sqrtps %xmm1,%xmm0
 10.46 │      sqrtps %xmm1,%xmm0
 12.25 │      sqrtps %xmm1,%xmm0
 12.18 │      sqrtps %xmm1,%xmm0
  5.26 │      sqrtps %xmm1,%xmm0
 10.13 │      sqrtps %xmm1,%xmm0
 10.43 │      sqrtps %xmm1,%xmm0
  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result
is not completely even, but systematic shadow effects are
significantly reduced.

The improvements are mainly expected to make a difference
in high IPC code. With low IPC it should be similar.

The PREC_DIST event is supported back to IvyBridge, but I only tested
it on Skylake for now, so I only enabled it on Skylake.

On earlier parts there were various hardware bugs in it
(but no showstoppers on IvyBridge and up, I believe),
so it could be enabled there after sufficient testing.
On Sandy Bridge PREC_DIST can only be scheduled as a single
event on the PMU, which is too limiting. Before Sandy
Bridge it was not supported.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c    | 27 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  4 +++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4c46cd7..b79427a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2565,6 +2565,31 @@ static void intel_pebs_aliases_snb(struct perf_event *event)
 	}
 }
 
+static void intel_pebs_aliases_skl(struct perf_event *event)
+{
+	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
+		/*
+		 * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
+		 * (0x003c) so that we can use it with PEBS.
+		 *
+		 * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
+		 * PEBS capable. However we can use INST_RETIRED.PREC_DIST
+		 * (0x01c0), which is a PEBS capable event, to get the same
+		 * count.
+		 *
+		 * The PREC_DIST event has special support to minimize sample
+		 * shadowing effects. One drawback is that it can be
+		 * only programmed on counter 1, but that seems like an
+		 * acceptable trade off.
+		 */
+		u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16);
+
+		alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
+		event->hw.config = alt_config;
+	}
+}
+
+
 static unsigned long intel_pmu_free_running_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.free_running_flags;
@@ -3617,7 +3642,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 5db1c77..acaebc7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -719,8 +719,10 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
-	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
+	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (old style cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-15 23:37 ` [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake Andi Kleen
@ 2015-10-16 11:51   ` Peter Zijlstra
  2015-10-16 13:35     ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2015-10-16 11:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On Thu, Oct 15, 2015 at 04:37:57PM -0700, Andi Kleen wrote:

> One side effect is that the old retry loop is not possible anymore,
> as the counters stay unacked for the majority of the PMI handler,
> but that is not a big loss, as "profiling" the PMI handler was always
> a bit dubious. With the old ack sequence the retry loop is still supported.

It's not a self-profiling thing, it's a safety feature. The interrupt
handler very explicitly disables all PMU counters, which would make
self-profiling impossible.

And note that the "perfevents: irq loop stuck!" WARN is still
triggerable on my IVB (although I've not managed to find the root cause
of that).

What would happen with a 'stuck' event in the new scheme?

> In principle the sequence should work on other CPUs too, but
> since I only tested on Skylake it is only enabled there.

I would very much like a reduction of the ack states. You introduced the
late thing, which should also work for everyone, and now you introduce
yet another variant.

I would very much prefer a single ack scheme if at all possible.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-16 11:51   ` Peter Zijlstra
@ 2015-10-16 13:35     ` Andi Kleen
  2015-10-16 15:00       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2015-10-16 13:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner

> What would happen with a 'stuck' event in the new scheme?

The usual reason events are stuck is that perf programs too low a
period in frequency mode after fork.

We have two safety mechanisms:

- The perf overflow code checks against the maximum number of
overflows per second, and forces a lower sampling rate if needed.
- The NMI duration limiter checks the total length of the NMI handler.

With the low period problem it will bump into the first limit.
The second limit also makes sure not too much CPU time is lost.

It wouldn't handle a case where the interrupt handler gets re-issued
without overflowing, but I don't think that can really happen (except
for the checkpoint case, which is also covered).
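
As a rough illustration of the first limit (a simplified stand-alone
model only, not the actual perf core code; the function and names here
are made up for illustration):

#include <stdio.h>
#include <stdint.h>

/*
 * If a counter fires more often than the allowed sample rate, scale the
 * period up so the effective rate drops back under the limit.
 */
static uint64_t limit_rate(uint64_t period, uint64_t observed_hz,
			   uint64_t max_sample_rate_hz)
{
	if (observed_hz > max_sample_rate_hz)
		period = period * observed_hz / max_sample_rate_hz;
	return period;
}

int main(void)
{
	/*
	 * E.g. a too-low period after fork firing at ~1 MHz against a
	 * 100 kHz limit gets its period scaled up by 10x.
	 */
	printf("adjusted period: %llu\n",
	       (unsigned long long)limit_rate(1000, 1000000, 100000));
	return 0;
}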

> > In principle the sequence should work on other CPUs too, but
> > since I only tested on Skylake it is only enabled there.
> 
> I would very much like a reduction of the ack states. You introduced the
> late thing, which should also work for everyone, and now you introduce
> yet another variant.

Ingo suggested to do it this way. Originally I thought it wasn't needed,
but I think now that late-ack made some of the races that eventually
caused Skylake LBR to fall over worse. So in hindsight it was a good idea
to not use it everywhere. 

> I would very much prefer a single ack scheme if at all possible.

Could enable it everywhere, but then users would need to test it
on most types of CPUs, as I can't.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-16 13:35     ` Andi Kleen
@ 2015-10-16 15:00       ` Peter Zijlstra
  2015-10-16 16:14         ` Mike Galbraith
  2015-10-19  7:08         ` Ingo Molnar
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2015-10-16 15:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner, Mike Galbraith

On Fri, Oct 16, 2015 at 06:35:14AM -0700, Andi Kleen wrote:
> > > In principle the sequence should work on other CPUs too, but
> > > since I only tested on Skylake it is only enabled there.
> > 
> > I would very much like a reduction of the ack states. You introduced the
> > late thing, which should also work for everyone, and now you introduce
> > yet another variant.
> 
> Ingo suggested to do it this way. Originally I thought it wasn't needed,
> but I think now that late-ack made some of the races that eventually
> caused Skylake LBR to fall over worse. So in hindsight it was a good idea
> to not use it everywhere. 
> 
> > I would very much prefer a single ack scheme if at all possible.
> 
> Could enable it everywhere, but then users would need to test it
> on most types of CPUs, as I can't.

I think Mike still has a Core2 machine (and I might be able to dig out a
laptop), Ingo should have a NHM(-EP), I have SNB, IVB-EP, HSW. So if you
could test at least BDW and SKL we might have decent test coverage.

Ingo, do you want to first merge the safe patch and then clean up?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-16 15:00       ` Peter Zijlstra
@ 2015-10-16 16:14         ` Mike Galbraith
  2015-10-19  7:08         ` Ingo Molnar
  1 sibling, 0 replies; 10+ messages in thread
From: Mike Galbraith @ 2015-10-16 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Andi Kleen, linux-kernel, Ingo Molnar, Thomas Gleixner

On Fri, 2015-10-16 at 17:00 +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2015 at 06:35:14AM -0700, Andi Kleen wrote:
> > > > In principle the sequence should work on other CPUs too, but
> > > > since I only tested on Skylake it is only enabled there.
> > > 
> > > I would very much like a reduction of the ack states. You introduced the
> > > late thing, which should also work for everyone, and now you introduce
> > > yet another variant.
> > 
> > Ingo suggested to do it this way. Originally I thought it wasn't needed,
> > but I think now that late-ack made some of the races that eventually
> > caused Skylake LBR to fall over worse. So in hindsight it was a good idea
> > to not use it everywhere. 
> > 
> > > I would very much prefer a single ack scheme if at all possible.
> > 
> > Could enable it everywhere, but then users would need to test it
> > on most types of CPUs, as I can't.
> 
> I think Mike still has a Core2 machine (and I might be able to dig out a
> laptop), Ingo should have a NHM(-EP), I have SNB, IVB-EP, HSW. So if you
> could test at least BDW and SKL we might have decent test coverage.

Yeah, beloved ole Q6600 box, a U4100 lappy too.

	-Mike


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
  2015-10-16 15:00       ` Peter Zijlstra
  2015-10-16 16:14         ` Mike Galbraith
@ 2015-10-19  7:08         ` Ingo Molnar
  1 sibling, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2015-10-19  7:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Andi Kleen, linux-kernel, Thomas Gleixner, Mike Galbraith


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Oct 16, 2015 at 06:35:14AM -0700, Andi Kleen wrote:
> > > > In principle the sequence should work on other CPUs too, but
> > > > since I only tested on Skylake it is only enabled there.
> > > 
> > > I would very much like a reduction of the ack states. You introduced the 
> > > late thing, which should also work for everyone, and now you introduce yet 
> > > another variant.
> > 
> > Ingo suggested to do it this way. Originally I thought it wasn't needed, but I 
> > think now that late-ack made some of the races that eventually caused Skylake 
> > LBR to fall over worse. So in hindsight it was a good idea to not use it 
> > everywhere.
> > 
> > > I would very much prefer a single ack scheme if at all possible.
> > 
> > Could enable it everywhere, but then users would need to test it on most types 
> > of CPUs, as I can't.
> 
> I think Mike still has a Core2 machine (and I might be able to dig out a 
> laptop), Ingo should have a NHM(-EP), I have SNB, IVB-EP, HSW. So if you could 
> test at least BDW and SKL we might have decent test coverage.
> 
> Ingo, do you want to first merge the safe patch and then clean up?

Yeah, would be nice to structure it that way, out of general paranoia.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-10-19  7:08 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-15 23:37 perf: Some improvements for Skylake perf Andi Kleen
2015-10-15 23:37 ` [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake Andi Kleen
2015-10-16 11:51   ` Peter Zijlstra
2015-10-16 13:35     ` Andi Kleen
2015-10-16 15:00       ` Peter Zijlstra
2015-10-16 16:14         ` Mike Galbraith
2015-10-19  7:08         ` Ingo Molnar
2015-10-15 23:37 ` [PATCH 2/4] x86, perf: Factor out BTS enable/disable functions Andi Kleen
2015-10-15 23:37 ` [PATCH 3/4] perf, x86: Use counter freezing with Arch Perfmon v4 Andi Kleen
2015-10-15 23:38 ` [PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Andi Kleen
