linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
@ 2015-12-01  0:28 Andi Kleen
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Andi Kleen @ 2015-12-01  0:28 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, mingo, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list
for cycles:pp. But the event is not documented for Skylake, and has
some issues.

The recommended replacement for cycles:pp is to use INST_RETIRED.ANY+pebs
as a base, similar to what CPUs before Sandy Bridge did. This new event
is called INST_RETIRED.TOTAL_CYCLES_PS. The event is not really
new, but has already been used by perf before Sandy Bridge
for the original cycles:p.

Note the SDM doesn't document that event either, but it's
being documented in the latest version of the event list on
https://download.01.org/perfmon/SKL

This patch does:

- Remove UOPS_RETIRED.ALL from the Skylake PEBS event list
- Add INST_RETIRED.ANY to the Skylake PEBS event list, and a table entry to
allow cmask=16,inv=1 for cycles:pp (decoded in the sketch below)
- We don't need an extra entry for the base INST_RETIRED event,
because it is already covered by the catch-all PEBS table entry.
- Switch Skylake to use the Core2 PEBS alias (which is
INST_RETIRED.TOTAL_CYCLES_PS)
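
For reference, the cmask=16,inv=1 encoding decodes as follows against the
architectural PERFEVTSEL layout; the 0x108000c0 constraint value in the
patch below matches it. A minimal sketch (EVSEL_INV/EVSEL_CMASK are
illustrative names, not kernel macros):

  /* PERFEVTSELx layout: bits 0-7 event select, bits 8-15 umask,
   * bit 23 invert counter mask (inv), bits 24-31 counter mask (cmask).
   */
  #define EVSEL_INV	(1ULL << 23)
  #define EVSEL_CMASK(n)	((unsigned long long)(n) << 24)

  /* INST_RETIRED (event 0xc0, umask 0x00) with inv=1, cmask=16 counts
   * every cycle in which fewer than 16 instructions retire - i.e. all
   * cycles - while remaining PEBS capable.
   */
  unsigned long long cfg = 0xc0 | EVSEL_INV | EVSEL_CMASK(16); /* 0x108000c0 */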

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c    | 2 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 297960a..09f4399 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3521,7 +3521,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 5db1c77..dcab005 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -718,9 +718,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
-	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
-	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */
-- 
2.4.3



* [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp
  2015-12-01  0:28 [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Andi Kleen
@ 2015-12-01  0:28 ` Andi Kleen
  2015-12-01 13:41   ` Peter Zijlstra
                     ` (2 more replies)
  2015-12-01 13:19 ` [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 22+ messages in thread
From: Andi Kleen @ 2015-12-01  0:28 UTC (permalink / raw)
  To: peterz; +Cc: linux-kernel, mingo, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add a new 'three-p' precise level that uses INST_RETIRED.PREC_DIST
as its base. The basic mechanism of abusing the inverse
cmask to get all cycles works the same as before.

PREC_DIST is available on Sandy Bridge or later. It had some problems
on Sandy Bridge, so we only use it on IvyBridge and later.
I tested it on Broadwell and Skylake.

PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
PREC_DIST can only schedule on counter 1, but that is ok for cycle
sampling, as there is normally no need to do multiple cycle sampling
runs in parallel. It is still possible to run perf top in parallel,
as that doesn't use precise mode. Also of course the multiplexing
can still allow parallel operation.

:pp stays with the previous event.
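
To make the inverse-cmask trick concrete: with cmask=16 and inv=1 the
counter increments on every cycle in which fewer than 16 instructions
retire, which is every cycle. A hedged sketch of requesting the PREC_DIST
encoding directly through perf_event_open() (the config value matches the
constraint tables below; the sample period is an arbitrary example):

  struct perf_event_attr attr = {
  	.type		= PERF_TYPE_RAW,
  	.size		= sizeof(attr),
  	/* INST_RETIRED.PREC_DIST (0x01c0) with inv=1, cmask=16 */
  	.config		= 0x108001c0,
  	.sample_period	= 100003,
  	.precise_ip	= 2,	/* request PEBS */
  };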

Example:

Sample a loop with 10 sqrt with old cycles:pp

  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
  9.13 │      sqrtps %xmm1,%xmm0
 11.58 │      sqrtps %xmm1,%xmm0
 11.51 │      sqrtps %xmm1,%xmm0
  6.27 │      sqrtps %xmm1,%xmm0
 10.38 │      sqrtps %xmm1,%xmm0
 12.20 │      sqrtps %xmm1,%xmm0
 12.74 │      sqrtps %xmm1,%xmm0
  5.40 │      sqrtps %xmm1,%xmm0
 10.14 │      sqrtps %xmm1,%xmm0
 10.51 │    ↑ jmp    10

We expect all 10 sqrt to get roughly the same number of samples.

But you can see that the instruction directly after the jmp is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST based sampling this problem is gone
and all instructions show up roughly evenly:

  9.51 │10:   sqrtps %xmm1,%xmm0
 11.74 │      sqrtps %xmm1,%xmm0
 11.84 │      sqrtps %xmm1,%xmm0
  6.05 │      sqrtps %xmm1,%xmm0
 10.46 │      sqrtps %xmm1,%xmm0
 12.25 │      sqrtps %xmm1,%xmm0
 12.18 │      sqrtps %xmm1,%xmm0
  5.26 │      sqrtps %xmm1,%xmm0
 10.13 │      sqrtps %xmm1,%xmm0
 10.43 │      sqrtps %xmm1,%xmm0
  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result
is not completely even, but systematic shadow effects are
significantly reduced.

The improvements are mainly expected to make a difference
in high IPC code. With low IPC it should be similar.
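
The test program is not included here; a minimal sketch of a loop that
reproduces the annotation above (assumes GCC-style inline asm on x86-64;
the register contents are irrelevant for the timing):

  int main(void)
  {
  	long i;

  	/* ten independent sqrtps per iteration, as in the annotated loop */
  	for (i = 0; i < (1L << 28); i++)
  		asm volatile(".rept 10\n\t"
  			     "sqrtps %%xmm1, %%xmm0\n\t"
  			     ".endr"
  			     ::: "xmm0");
  	return 0;
  }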

v2:
Change to use precise_ip == 3 for the new cycles event.
Split original patch into two.
Use from Ivy Bridge and up.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c          |  3 ++
 arch/x86/kernel/cpu/perf_event_intel.c    | 46 ++++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  6 ++++
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9dfbba5..ba41899 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			/* Support for IP fixup */
 			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
 				precise++;
+
+			if (x86_pmu.pebs_aliases)
+				precise++;
 		}
 
 		if (event->attr.precise_ip > precise)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 09f4399..671c1c0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2473,6 +2473,44 @@ static void intel_pebs_aliases_snb(struct perf_event *event)
 	}
 }
 
+static void intel_pebs_aliases_precdist(struct perf_event *event)
+{
+	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
+		/*
+		 * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
+		 * (0x003c) so that we can use it with PEBS.
+		 *
+		 * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
+		 * PEBS capable. However we can use INST_RETIRED.PREC_DIST
+		 * (0x01c0), which is a PEBS capable event, to get the same
+		 * count.
+		 *
+		 * The PREC_DIST event has special support to minimize sample
+		 * shadowing effects. One drawback is that it can be
+		 * only programmed on counter 1, but that seems like an
+		 * acceptable trade off.
+		 */
+		u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16);
+
+		alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
+		event->hw.config = alt_config;
+	}
+}
+
+static void intel_pebs_aliases_ivb(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_snb(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
+static void intel_pebs_aliases_skl(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_core2(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
 static unsigned long intel_pmu_free_running_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.free_running_flags;
@@ -3434,7 +3472,7 @@ __init int intel_pmu_init(void)
 
 		x86_pmu.event_constraints = intel_ivb_event_constraints;
 		x86_pmu.pebs_constraints = intel_ivb_pebs_event_constraints;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
 		if (boot_cpu_data.x86_model == 62)
 			x86_pmu.extra_regs = intel_snbep_extra_regs;
 		else
@@ -3466,7 +3504,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3500,7 +3538,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_bdw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3521,7 +3559,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index dcab005..7f11784 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -686,6 +686,8 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
 	INTEL_PST_CONSTRAINT(0x02cd, 0x8),    /* MEM_TRANS_RETIRED.PRECISE_STORES */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -700,6 +702,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 	INTEL_PLD_CONSTRAINT(0x01cd, 0xf),    /* MEM_TRANS_RETIRED.* */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
@@ -718,6 +722,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
-- 
2.4.3



* Re: [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
  2015-12-01  0:28 [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Andi Kleen
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
@ 2015-12-01 13:19 ` Peter Zijlstra
  2015-12-04 11:50 ` [tip:perf/core] perf/x86: " tip-bot for Andi Kleen
  2016-01-06 18:51 ` tip-bot for Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-01 13:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, mingo, Andi Kleen

On Mon, Nov 30, 2015 at 04:28:08PM -0800, Andi Kleen wrote:
> +	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
> +	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),

Hmm, so this event is still limited to a single counter. Better than
nothing I suppose.

Thanks!


* Re: [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
@ 2015-12-01 13:41   ` Peter Zijlstra
  2015-12-01 14:54     ` Andi Kleen
  2015-12-04 11:50   ` [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp tip-bot for Andi Kleen
  2016-01-06 18:51   ` tip-bot for Andi Kleen
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-01 13:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, mingo, Andi Kleen

On Mon, Nov 30, 2015 at 04:28:09PM -0800, Andi Kleen wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 9dfbba5..ba41899 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
>  			/* Support for IP fixup */
>  			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
>  				precise++;
> +
> +			if (x86_pmu.pebs_aliases)
> +				precise++;

This is not accurate; it would allow :ppp for core2, for example, which
does not support PREC_DIST events at all.

Something like so on top?

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -483,7 +483,7 @@ int x86_pmu_hw_config(struct perf_event
 			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
 				precise++;
 
-			if (x86_pmu.pebs_aliases)
+			if (x86_pmu.pebs_prec_dist)
 				precise++;
 		}
 
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -593,7 +593,8 @@ struct x86_pmu {
 			bts_active	:1,
 			pebs		:1,
 			pebs_active	:1,
-			pebs_broken	:1;
+			pebs_broken	:1,
+			pebs_prec_dist	:1;
 	int		pebs_record_size;
 	void		(*drain_pebs)(struct pt_regs *regs);
 	struct event_constraint *pebs_constraints;
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3470,6 +3470,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_ivb_event_constraints;
 		x86_pmu.pebs_constraints = intel_ivb_pebs_event_constraints;
 		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		if (boot_cpu_data.x86_model == 62)
 			x86_pmu.extra_regs = intel_snbep_extra_regs;
 		else
@@ -3503,6 +3504,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
 		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3538,6 +3540,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
 		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3560,6 +3563,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
 		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;


* Re: [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp
  2015-12-01 13:41   ` Peter Zijlstra
@ 2015-12-01 14:54     ` Andi Kleen
  2015-12-01 14:55       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2015-12-01 14:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, linux-kernel, mingo

On Tue, Dec 01, 2015 at 02:41:51PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 04:28:09PM -0800, Andi Kleen wrote:
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 9dfbba5..ba41899 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
> >  			/* Support for IP fixup */
> >  			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
> >  				precise++;
> > +
> > +			if (x86_pmu.pebs_aliases)
> > +				precise++;
> 
> This is not accurate; it would allow :ppp for core2, for example, which
> does not support PREC_DIST events at all.
> 
> Something like so on top?

Yes that's fine. Thanks.

Are you just applying it with that change, or should I resend a combined patch?

-Andi


* Re: [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp
  2015-12-01 14:54     ` Andi Kleen
@ 2015-12-01 14:55       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-01 14:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, mingo

On Tue, Dec 01, 2015 at 06:54:14AM -0800, Andi Kleen wrote:
> On Tue, Dec 01, 2015 at 02:41:51PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 30, 2015 at 04:28:09PM -0800, Andi Kleen wrote:
> > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > > index 9dfbba5..ba41899 100644
> > > --- a/arch/x86/kernel/cpu/perf_event.c
> > > +++ b/arch/x86/kernel/cpu/perf_event.c
> > > @@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
> > >  			/* Support for IP fixup */
> > >  			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
> > >  				precise++;
> > > +
> > > +			if (x86_pmu.pebs_aliases)
> > > +				precise++;
> > 
> > This is not accurate; it would allow :ppp for core2, for example, which
> > does not support PREC_DIST events at all.
> > 
> > Something like so on top?
> 
> Yes that's fine. Thanks.
> 
> Are you just applying it with that change, or should I resend a combined patch?

Yep, got it in. Thanks.


* [tip:perf/core] perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
  2015-12-01  0:28 [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Andi Kleen
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
  2015-12-01 13:19 ` [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Peter Zijlstra
@ 2015-12-04 11:50 ` tip-bot for Andi Kleen
  2016-01-06 18:51 ` tip-bot for Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2015-12-04 11:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, vincent.weaver, jolsa, tglx, ak, linux-kernel,
	mingo, acme, eranian, torvalds

Commit-ID:  ac1e1d30cf90fdc7716150a9c130bdd7d7b9fdd9
Gitweb:     http://git.kernel.org/tip/ac1e1d30cf90fdc7716150a9c130bdd7d7b9fdd9
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Mon, 30 Nov 2015 16:28:08 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:08:04 +0100

perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake

I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for
cycles:pp. But the event is not documented for Skylake, and has some
issues.

The recommended replacement for cycles:pp is to use
INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy
Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The
event is not really new, but has already been used by perf before
Sandy Bridge for the original cycles:p.

Note the SDM doesn't document that event either, but it's being
documented in the latest version of the event list on:

  https://download.01.org/perfmon/SKL

This patch does:

 - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list

 - Add INST_RETIRED.ANY to the Skylake PEBS event list, and a table entry to
   allow cmask=16,inv=1 for cycles:pp

 - We don't need an extra entry for the base INST_RETIRED event,
   because it is already covered by the catch-all PEBS table entry.

 - Switch Skylake to use the Core2 PEBS alias (which is
   INST_RETIRED.TOTAL_CYCLES_PS)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c    | 2 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f63360b..e7daab3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3521,7 +3521,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 5db1c77..dcab005 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -718,9 +718,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
-	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
-	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */


* [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
  2015-12-01 13:41   ` Peter Zijlstra
@ 2015-12-04 11:50   ` tip-bot for Andi Kleen
  2015-12-06 13:11     ` Ingo Molnar
  2016-01-06 18:51   ` tip-bot for Andi Kleen
  2 siblings, 1 reply; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2015-12-04 11:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, vincent.weaver, eranian, jolsa, torvalds, acme, tglx,
	linux-kernel, peterz, hpa, ak

Commit-ID:  4576ceaa56a86bd0c041c204d51c3f3ca582a49c
Gitweb:     http://git.kernel.org/tip/4576ceaa56a86bd0c041c204d51c3f3ca582a49c
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Mon, 30 Nov 2015 16:28:09 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:08:04 +0100

perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp

Add a new 'three-p' precise level that uses INST_RETIRED.PREC_DIST as
its base. The basic mechanism of abusing the inverse cmask to get all
cycles works the same as before.

PREC_DIST is available on Sandy Bridge or later. It had some problems
on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
on Broadwell and Skylake.

PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
PREC_DIST can only schedule on counter 1, but that is ok for cycle
sampling, as there is normally no need to do multiple cycle sampling
runs in parallel. It is still possible to run perf top in parallel, as
that doesn't use precise mode. Also of course the multiplexing can
still allow parallel operation.

:pp stays with the previous event.

Example:

Sample a loop with 10 sqrt with old cycles:pp

	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
	  9.13 │      sqrtps %xmm1,%xmm0
	 11.58 │      sqrtps %xmm1,%xmm0
	 11.51 │      sqrtps %xmm1,%xmm0
	  6.27 │      sqrtps %xmm1,%xmm0
	 10.38 │      sqrtps %xmm1,%xmm0
	 12.20 │      sqrtps %xmm1,%xmm0
	 12.74 │      sqrtps %xmm1,%xmm0
	  5.40 │      sqrtps %xmm1,%xmm0
	 10.14 │      sqrtps %xmm1,%xmm0
	 10.51 │    ↑ jmp    10

We expect all 10 sqrt to get roughly the same number of samples.

But you can see that the instruction directly after the JMP is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST based sampling this problem is gone and all
instructions show up roughly evenly:

	  9.51 │10:   sqrtps %xmm1,%xmm0
	 11.74 │      sqrtps %xmm1,%xmm0
	 11.84 │      sqrtps %xmm1,%xmm0
	  6.05 │      sqrtps %xmm1,%xmm0
	 10.46 │      sqrtps %xmm1,%xmm0
	 12.25 │      sqrtps %xmm1,%xmm0
	 12.18 │      sqrtps %xmm1,%xmm0
	  5.26 │      sqrtps %xmm1,%xmm0
	 10.13 │      sqrtps %xmm1,%xmm0
	 10.43 │      sqrtps %xmm1,%xmm0
	  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result is not
completely even, but systematic shadow effects are significantly
reduced.

The improvements are mainly expected to make a difference in high IPC
code. With low IPC it should be similar.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c          |  3 ++
 arch/x86/kernel/cpu/perf_event.h          |  3 +-
 arch/x86/kernel/cpu/perf_event_intel.c    | 50 ++++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  6 ++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2bf79d7..ec2cbe5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			/* Support for IP fixup */
 			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
 				precise++;
+
+			if (x86_pmu.pebs_prec_dist)
+				precise++;
 		}
 
 		if (event->attr.precise_ip > precise)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index ab18b8a..ce0ce26 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -589,7 +589,8 @@ struct x86_pmu {
 			bts_active	:1,
 			pebs		:1,
 			pebs_active	:1,
-			pebs_broken	:1;
+			pebs_broken	:1,
+			pebs_prec_dist	:1;
 	int		pebs_record_size;
 	void		(*drain_pebs)(struct pt_regs *regs);
 	struct event_constraint *pebs_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e7daab3..6fac441 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2475,6 +2475,44 @@ static void intel_pebs_aliases_snb(struct perf_event *event)
 	}
 }
 
+static void intel_pebs_aliases_precdist(struct perf_event *event)
+{
+	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
+		/*
+		 * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
+		 * (0x003c) so that we can use it with PEBS.
+		 *
+		 * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
+		 * PEBS capable. However we can use INST_RETIRED.PREC_DIST
+		 * (0x01c0), which is a PEBS capable event, to get the same
+		 * count.
+		 *
+		 * The PREC_DIST event has special support to minimize sample
+		 * shadowing effects. One drawback is that it can be
+		 * only programmed on counter 1, but that seems like an
+		 * acceptable trade off.
+		 */
+		u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16);
+
+		alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
+		event->hw.config = alt_config;
+	}
+}
+
+static void intel_pebs_aliases_ivb(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_snb(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
+static void intel_pebs_aliases_skl(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_core2(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
 static unsigned long intel_pmu_free_running_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.free_running_flags;
@@ -3431,7 +3469,8 @@ __init int intel_pmu_init(void)
 
 		x86_pmu.event_constraints = intel_ivb_event_constraints;
 		x86_pmu.pebs_constraints = intel_ivb_pebs_event_constraints;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		if (boot_cpu_data.x86_model == 62)
 			x86_pmu.extra_regs = intel_snbep_extra_regs;
 		else
@@ -3464,7 +3503,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3499,7 +3539,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_bdw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3521,7 +3562,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index dcab005..7f11784 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -686,6 +686,8 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
 	INTEL_PST_CONSTRAINT(0x02cd, 0x8),    /* MEM_TRANS_RETIRED.PRECISE_STORES */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -700,6 +702,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 	INTEL_PLD_CONSTRAINT(0x01cd, 0xf),    /* MEM_TRANS_RETIRED.* */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
@@ -718,6 +722,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x01),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-04 11:50   ` [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp tip-bot for Andi Kleen
@ 2015-12-06 13:11     ` Ingo Molnar
  2015-12-06 22:06       ` Peter Zijlstra
  2015-12-07 13:51       ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-12-06 13:11 UTC (permalink / raw)
  To: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak, peterz,
	linux-kernel, hpa
  Cc: linux-tip-commits


* tip-bot for Andi Kleen <tipbot@zytor.com> wrote:

> Commit-ID:  4576ceaa56a86bd0c041c204d51c3f3ca582a49c
> Gitweb:     http://git.kernel.org/tip/4576ceaa56a86bd0c041c204d51c3f3ca582a49c
> Author:     Andi Kleen <ak@linux.intel.com>
> AuthorDate: Mon, 30 Nov 2015 16:28:09 -0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 4 Dec 2015 10:08:04 +0100
> 
> perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp
> 
> Add a new 'three-p' precise level that uses INST_RETIRED.PREC_DIST as
> its base. The basic mechanism of abusing the inverse cmask to get all
> cycles works the same as before.

So this commit regressed on the following Intel Nehalem box:

 processor       : 15
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 26
 model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
 stepping        : 5
 cpu MHz         : 2793.000
 cache size      : 8192 KB

it has this perf bootup signature:

   [    0.272504] Performance Events: PEBS fmt1+, 16-deep LBR, Nehalem events, Intel PMU driver.
   [    0.281356] perf_event_intel: CPU erratum AAJ80 worked around

the symptom is that the latest 'perf top' and 'perf record' produce no samples.

So I've removed this commit and the related Skylake commit from -tip:

 4576ceaa56a8 perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp
 ac1e1d30cf90 perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake

you need to do better testing next time around. Note: if you resubmit the patches, 
please also pick up the updated commit changelogs from the tip-bot emails, don't 
use your original changelogs.

Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
just replace 'pp' with this event - 'pp' is meant to be our most precise event.

Thanks,

	Ingo


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-06 13:11     ` Ingo Molnar
@ 2015-12-06 22:06       ` Peter Zijlstra
  2015-12-07  6:48         ` Ingo Molnar
  2015-12-07 13:51       ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-06 22:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits

On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
> Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
> just replace 'pp' with this event - 'pp' is meant to be our most precise event.

I requested this because the PREC_DIST events can only be scheduled on a
single counter, whereas the existing :pp events can be had on all 4.

This means you can have 2 concurrent :pp users (without RR), but not :ppp.
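
Concretely, the second argument of the constraint macros is the mask of
generic counters the event may be scheduled on; an illustrative decode of
the two entries involved, using the values from the patches above:

  /*
   * INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2):
   *	PREC_DIST-based cycles:ppp; mask 0x2 = generic counter 1 only.
   * INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf):
   *	UOPS_RETIRED-based cycles:pp; mask 0xf = any of counters 0-3.
   */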


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-06 22:06       ` Peter Zijlstra
@ 2015-12-07  6:48         ` Ingo Molnar
  2015-12-07 10:26           ` Peter Zijlstra
  2015-12-07 14:02           ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-12-07  6:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits, Arnaldo Carvalho de Melo


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
>
> > Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
> > just replace 'pp' with this event - 'pp' is meant to be our most precise 
> > event.
> 
> I requested this because the PREC_DIST events can only be scheduled on a single 
> counter, whereas the existing :pp events can be had on all 4.
> 
> This means you can have 2 concurrent :pp users (without RR), but not :ppp.

Ok. Will tooling do the right thing? I.e. will the first user of 'perf top' get 
:ppp automatically, while the second one falls back to :pp?

Thanks,

	Ingo


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-07  6:48         ` Ingo Molnar
@ 2015-12-07 10:26           ` Peter Zijlstra
  2015-12-07 14:02           ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-07 10:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits, Arnaldo Carvalho de Melo

On Mon, Dec 07, 2015 at 07:48:41AM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
> >
> > > Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
> > > just replace 'pp' with this event - 'pp' is meant to be our most precise 
> > > event.
> > 
> > I requested this because the PREC_DIST events can only be scheduled on a single 
> > counter, whereas the existing :pp events can be had on all 4.
> > 
> > This means you can have 2 concurrent :pp users (without RR), but not :ppp.
> 
> Ok. Will tooling do the right thing? I.e. will the first user of 'perf top' get 
> :ppp automatically, while the second one falls back to :pp?

if not I'll make it so.


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-06 13:11     ` Ingo Molnar
  2015-12-06 22:06       ` Peter Zijlstra
@ 2015-12-07 13:51       ` Peter Zijlstra
  2015-12-08  5:36         ` Ingo Molnar
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-07 13:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits

On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:

> the symptom is that the latest 'perf top' and 'perf record' produce no samples.
> 
> So I've removed this commit and the related Skylake commit from -tip:
> 
>  4576ceaa56a8 perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp
>  ac1e1d30cf90 perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
> 
> you need to do better testing next time around. Note: if you resubmit the patches, 
> please also pick up the updated commit changelogs from the tip-bot emails, don't 
> use your original changelogs.

OK, so I need:

 lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org
 lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org

On my IVB, because that keeps triggering that case with :ppp; other than
that, this seems to work fine on IVB.

I've also tried on SNB (which does not support the new :ppp thing) and
that too works.

Sadly I do not have older hardware anymore :/ 


Also, perf tools DTRT and will use :ppp by default if available.
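
For anyone wanting to check, something along the lines of:

  perf record -e cycles:ppp -- ./workload

should pick the PREC_DIST-based encoding for the first session, while a
second concurrent :ppp session has to multiplex on the single allowed
counter.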


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-07  6:48         ` Ingo Molnar
  2015-12-07 10:26           ` Peter Zijlstra
@ 2015-12-07 14:02           ` Arnaldo Carvalho de Melo
  2015-12-07 14:12             ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-12-07 14:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, vincent.weaver, jolsa, torvalds, eranian, tglx,
	ak, linux-kernel, hpa, linux-tip-commits

Em Mon, Dec 07, 2015 at 07:48:41AM +0100, Ingo Molnar escreveu:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
> > > Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
> > > just replace 'pp' with this event - 'pp' is meant to be our most precise 
> > > event.

> > I requested this because the PREC_DIST events can only be scheduled on a single 
> > counter, whereas the existing :pp events can be had on all 4.

> > This means you can have 2 concurrent :pp users (without RR), but not :ppp.

> Ok. Will tooling do the right thing? I.e. will the first user of 'perf top' get 
> :ppp automatically, while the second one falls back to :pp?

I guess so:

void perf_event_attr__set_max_precise_ip(struct perf_event_attr *attr)
{
        attr->precise_ip = 3;
<SNIP>
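
A minimal sketch of the probe-and-fall-back idea behind that function
(not the verbatim snipped body; assumes the perf tools'
sys_perf_event_open() wrapper):

  attr->precise_ip = 3;
  while (attr->precise_ip != 0) {
  	int fd = sys_perf_event_open(attr, 0, -1, -1, 0);

  	if (fd != -1) {
  		close(fd);
  		break;
  	}
  	--attr->precise_ip;
  }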

- Arnaldo


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-07 14:02           ` Arnaldo Carvalho de Melo
@ 2015-12-07 14:12             ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-07 14:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, vincent.weaver, jolsa, torvalds, eranian, tglx, ak,
	linux-kernel, hpa, linux-tip-commits

On Mon, Dec 07, 2015 at 11:02:58AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 07, 2015 at 07:48:41AM +0100, Ingo Molnar escreveu:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
> > > > Also, I'm not convinced we need a new 'ppp' qualifier for any of this, why not 
> > > > just replace 'pp' with this event - 'pp' is meant to be our most precise 
> > > > event.
> 
> > > I requested this because the PREC_DIST events can only be scheduled on a single 
> > > counter, whereas the existing :pp events can be had on all 4.
> 
> > > This means you can have 2 concurrent :pp users (without RR), but not :ppp.
> 
> > Ok. Will tooling do the right thing? I.e. will the first user of 'perf top' get 
> > :ppp automatically, while the second one falls back to :pp?
> 
> I guess so:
> 
> void perf_event_attr__set_max_precise_ip(struct perf_event_attr *attr)
> {
>         attr->precise_ip = 3;

Indeed so, I've since confirmed it also works in practice.


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-07 13:51       ` Peter Zijlstra
@ 2015-12-08  5:36         ` Ingo Molnar
  2015-12-08  8:50           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2015-12-08  5:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Dec 06, 2015 at 02:11:02PM +0100, Ingo Molnar wrote:
> 
> > the symptom is that the latest 'perf top' and 'perf record' produce no samples.
> > 
> > So I've removed this commit and the related Skylake commit from -tip:
> > 
> >  4576ceaa56a8 perf/x86: Use INST_RETIRED.PREC_DIST for cycles:ppp
> >  ac1e1d30cf90 perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
> > 
> > you need to do better testing next time around. Note: if you resubmit the patches, 
> > please also pick up the updated commit changelogs from the tip-bot emails, don't 
> > use your original changelogs.
> 
> OK, so I need:
> 
>  lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org
>  lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org
> 
> On my IVB because that keeps triggering that case with :ppp, other than
> that this seems to work fine on IVB.
> 
> I've also tried on SNB (which does not support the new :ppp thing) and
> that too works.
> 
> Sadly I do not have older hardware anymore :/ 
> 
> 
> Also, perf tools DTRT and will use :ppp by default if available.

So I checked my NHM box with your latest queue and it now works correctly. Do you 
have any idea what the difference is/was?

Thanks,

	Ingo


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-08  5:36         ` Ingo Molnar
@ 2015-12-08  8:50           ` Peter Zijlstra
  2015-12-08  8:57             ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-08  8:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits

On Tue, Dec 08, 2015 at 06:36:04AM +0100, Ingo Molnar wrote:

> So I checked my NHM box with your latest queue and it now works correctly. Do you 
> have any idea what the difference is/was?

Sadly, no clue :/

I went over those patches and cannot find anything that should affect
NHM (or <=SNB in fact).


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-08  8:50           ` Peter Zijlstra
@ 2015-12-08  8:57             ` Peter Zijlstra
  2015-12-08 13:27               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-08  8:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits

On Tue, Dec 08, 2015 at 09:50:26AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 08, 2015 at 06:36:04AM +0100, Ingo Molnar wrote:
> 
> > So I checked my NHM box with your latest queue and it now works correctly. Do you 
> > have any idea what the difference is/was?
> 
> Sadly, no clue :/
> 
> I went over those patches and cannot find anything that should affect
> NHM (or <=SNB in fact).

Oh wait, I spoke too soon, the two new patches affect everything with
PEBS. And esp. the latter one:

  lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org

But note that this is a pre-existing issue and should not be changed
by the skl-:pp and :ppp patches.




* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-08  8:57             ` Peter Zijlstra
@ 2015-12-08 13:27               ` Peter Zijlstra
  2015-12-09  8:29                 ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-08 13:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits

On Tue, Dec 08, 2015 at 09:57:07AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 08, 2015 at 09:50:26AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 08, 2015 at 06:36:04AM +0100, Ingo Molnar wrote:
> > 
> > > So I checked my NHM box with your latest queue and it now works correctly. Do you 
> > > have any idea what the difference is/was?
> > 
> > Sadly, no clue :/
> > 
> > I went over those patches and cannot find anything that should affect
> > NHM (or <=SNB in fact).
> 
> Oh wait, I spoke too soon, the two new patches affect everything with
> PEBS. And esp. the latter one:
> 
>   lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org
> 

If that is the patch that makes your NHM go again, then running 2
concurrent perf-top sessions should make it all go dark again.


* Re: [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-08 13:27               ` Peter Zijlstra
@ 2015-12-09  8:29                 ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-12-09  8:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vincent.weaver, jolsa, torvalds, eranian, tglx, acme, ak,
	linux-kernel, hpa, linux-tip-commits


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Dec 08, 2015 at 09:57:07AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 08, 2015 at 09:50:26AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 08, 2015 at 06:36:04AM +0100, Ingo Molnar wrote:
> > > 
> > > > So I checked my NHM box with your latest queue and it now works correctly. Do you 
> > > > have any idea what the difference is/was?
> > > 
> > > Sadly, no clue :/
> > > 
> > > I went over those patches and cannot find anything that should affect
> > > NHM (or <=SNB in fact).
> > 
> > Oh wait, I spoke too soon, the two new patches affect everything with
> > PEBS. And esp. the latter one:
> > 
> >   lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org
> > 
> 
> If that is the patch that makes your NHM go again, then running 2
> concurrent perf-top sessions should make it all go dark again.

Hm, so I tried again your latest perf/urgent (1221fc3b3e3a) merged into tip:master 
(c9586e904a68), and booted it on the NHM, but 'perf top' refuses to break.

One thing that was special about my first NHM test was that the box had gone 
through about 24 hours of perf stress testing. OTOH 'cycles:p' did work; it was 
only 'cycles:pp' that produced no events whatsoever - so it wasn't a total 
breakage, only :pp related.

Weird ...

Thanks,

	Ingo


* [tip:perf/core] perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
  2015-12-01  0:28 [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Andi Kleen
                   ` (2 preceding siblings ...)
  2015-12-04 11:50 ` [tip:perf/core] perf/x86: " tip-bot for Andi Kleen
@ 2016-01-06 18:51 ` tip-bot for Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-01-06 18:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, eranian, peterz, tglx, torvalds, vincent.weaver, acme, ak,
	linux-kernel, jolsa, hpa

Commit-ID:  442f5c74cbeaf54939980397ece59360c0a824e9
Gitweb:     http://git.kernel.org/tip/442f5c74cbeaf54939980397ece59360c0a824e9
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Fri, 4 Dec 2015 03:50:32 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Jan 2016 11:15:32 +0100

perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake

I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for
cycles:pp. But the event is not documented for Skylake, and has some
issues.

The recommended replacement for cycles:pp is to use
INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy
Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The
event is not really new, but has been already used by perf before
Sandy Bridge for the original cycles:p

Note the SDM doesn't document that event either, but it's being
documented in the latest version of the event list on:

  https://download.01.org/perfmon/SKL

This patch does:

 - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list

 - Add INST_RETIRED.ANY to the Skylake PEBS event list, and a table entry to
   allow cmask=16,inv=1 for cycles:pp

 - We don't need an extra entry for the base INST_RETIRED event,
   because it is already covered by the catch-all PEBS table entry.

 - Switch Skylake to use the Core2 PEBS alias (which is
   INST_RETIRED.TOTAL_CYCLES_PS)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c    | 2 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 33b4b67..5ed6e0d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -3521,7 +3521,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index cd1993e..56b5015 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -718,9 +718,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
-	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
-	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
-	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */


* [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
  2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
  2015-12-01 13:41   ` Peter Zijlstra
  2015-12-04 11:50   ` [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp tip-bot for Andi Kleen
@ 2016-01-06 18:51   ` tip-bot for Andi Kleen
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Andi Kleen @ 2016-01-06 18:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, peterz, tglx, jolsa, hpa, acme, ak, vincent.weaver,
	torvalds, mingo, eranian

Commit-ID:  724697648eec540b2a7561089b1c87cb33e6a0eb
Gitweb:     http://git.kernel.org/tip/724697648eec540b2a7561089b1c87cb33e6a0eb
Author:     Andi Kleen <ak@linux.intel.com>
AuthorDate: Fri, 4 Dec 2015 03:50:52 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Jan 2016 11:15:32 +0100

perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp

Add a new 'three-p' precise level that uses INST_RETIRED.PREC_DIST as
its base. The basic mechanism of abusing the inverse cmask to get all
cycles works the same as before.
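
To spell the trick out: with cmask=16 and inv=1 the counter increments
on every cycle in which fewer than 16 instructions retire, which is
every cycle, so the event effectively counts cycles. A minimal
user-space sketch of the resulting raw encoding (assuming the standard
IA32_PERFEVTSELx bit layout; the helper name is illustrative):

	#include <stdint.h>
	#include <stdio.h>

	/* IA32_PERFEVTSELx layout: event[7:0], umask[15:8],
	 * inv at bit 23, cmask[31:24]. */
	static uint64_t raw_config(uint8_t event, uint8_t umask,
				   int inv, uint8_t cmask)
	{
		return (uint64_t)event | ((uint64_t)umask << 8) |
		       ((uint64_t)(inv != 0) << 23) |
		       ((uint64_t)cmask << 24);
	}

	int main(void)
	{
		/* INST_RETIRED.PREC_DIST (0x01c0) with inv=1, cmask=16:
		 * matches the 0x108001c0 constraints added below. */
		printf("%#llx\n", (unsigned long long)
		       raw_config(0xc0, 0x01, 1, 16));
		return 0;
	}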

PREC_DIST is available on Sandy Bridge or later. It had some problems
on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
on Broadwell and Skylake.

PREC_DIST has special support for avoiding shadow effects, which can
give better results compared to UOPS_RETIRED. The drawback is that
PREC_DIST can only be scheduled on counter 1, but that is OK for cycle
sampling, as there is normally no need to run multiple cycle-sampling
sessions in parallel. It is still possible to run perf top in parallel,
as that doesn't use precise mode. And, of course, multiplexing can
still allow parallel operation.

:pp stays with the previous event.
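
From user space the new level is requested like the existing ones, e.g.
'perf record -e cycles:ppp'. Going through the syscall directly looks
roughly like this (a minimal sketch, error handling omitted):

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int open_cycles_ppp(pid_t pid)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.sample_period = 100003;
		/* :ppp; rejected with EOPNOTSUPP where unsupported */
		attr.precise_ip = 3;

		/* measure 'pid' on any CPU */
		return syscall(__NR_perf_event_open, &attr,
			       pid, -1, -1, 0);
	}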

Example:

Sample a loop with 10 sqrt with old cycles:pp

	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
	  9.13 │      sqrtps %xmm1,%xmm0
	 11.58 │      sqrtps %xmm1,%xmm0
	 11.51 │      sqrtps %xmm1,%xmm0
	  6.27 │      sqrtps %xmm1,%xmm0
	 10.38 │      sqrtps %xmm1,%xmm0
	 12.20 │      sqrtps %xmm1,%xmm0
	 12.74 │      sqrtps %xmm1,%xmm0
	  5.40 │      sqrtps %xmm1,%xmm0
	 10.14 │      sqrtps %xmm1,%xmm0
	 10.51 │    ↑ jmp    10

We expect all 10 sqrt instructions to get roughly the same number of samples.

But you can see that the instruction directly after the JMP is
systematically underestimated in the result, due to sampling shadow
effects.

With the new PREC_DIST-based sampling this problem is gone and all
instructions show up roughly evenly:

	  9.51 │10:   sqrtps %xmm1,%xmm0
	 11.74 │      sqrtps %xmm1,%xmm0
	 11.84 │      sqrtps %xmm1,%xmm0
	  6.05 │      sqrtps %xmm1,%xmm0
	 10.46 │      sqrtps %xmm1,%xmm0
	 12.25 │      sqrtps %xmm1,%xmm0
	 12.18 │      sqrtps %xmm1,%xmm0
	  5.26 │      sqrtps %xmm1,%xmm0
	 10.13 │      sqrtps %xmm1,%xmm0
	 10.43 │      sqrtps %xmm1,%xmm0
	  0.16 │    ↑ jmp    10

Even with PREC_DIST there is still sampling skid and the result is not
completely even, but systematic shadow effects are significantly
reduced.

The improvements are mainly expected to make a difference in high-IPC
code; with low IPC the results should be similar to before.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c          |  3 ++
 arch/x86/kernel/cpu/perf_event.h          |  3 +-
 arch/x86/kernel/cpu/perf_event_intel.c    | 50 ++++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  6 ++++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9dfbba5..e7e63a9 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -482,6 +482,9 @@ int x86_pmu_hw_config(struct perf_event *event)
 			/* Support for IP fixup */
 			if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
 				precise++;
+
+			if (x86_pmu.pebs_prec_dist)
+				precise++;
 		}
 
 		if (event->attr.precise_ip > precise)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 799e6bd..ce8768f 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -583,7 +583,8 @@ struct x86_pmu {
 			bts_active	:1,
 			pebs		:1,
 			pebs_active	:1,
-			pebs_broken	:1;
+			pebs_broken	:1,
+			pebs_prec_dist	:1;
 	int		pebs_record_size;
 	void		(*drain_pebs)(struct pt_regs *regs);
 	struct event_constraint *pebs_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 5ed6e0d..762c602 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2475,6 +2475,44 @@ static void intel_pebs_aliases_snb(struct perf_event *event)
 	}
 }
 
+static void intel_pebs_aliases_precdist(struct perf_event *event)
+{
+	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
+		/*
+		 * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P
+		 * (0x003c) so that we can use it with PEBS.
+		 *
+		 * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't
+		 * PEBS capable. However we can use INST_RETIRED.PREC_DIST
+		 * (0x01c0), which is a PEBS capable event, to get the same
+		 * count.
+		 *
+		 * The PREC_DIST event has special support to minimize sample
+		 * shadowing effects. One drawback is that it can be
+		 * only programmed on counter 1, but that seems like an
+		 * acceptable trade off.
+		 */
+		u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16);
+
+		alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK);
+		event->hw.config = alt_config;
+	}
+}
+
+static void intel_pebs_aliases_ivb(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_snb(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
+static void intel_pebs_aliases_skl(struct perf_event *event)
+{
+	if (event->attr.precise_ip < 3)
+		return intel_pebs_aliases_core2(event);
+	return intel_pebs_aliases_precdist(event);
+}
+
 static unsigned long intel_pmu_free_running_flags(struct perf_event *event)
 {
 	unsigned long flags = x86_pmu.free_running_flags;
@@ -3431,7 +3469,8 @@ __init int intel_pmu_init(void)
 
 		x86_pmu.event_constraints = intel_ivb_event_constraints;
 		x86_pmu.pebs_constraints = intel_ivb_pebs_event_constraints;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		if (boot_cpu_data.x86_model == 62)
 			x86_pmu.extra_regs = intel_snbep_extra_regs;
 		else
@@ -3464,7 +3503,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3499,7 +3539,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_bdw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_snbep_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_snb;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_ivb;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
@@ -3521,7 +3562,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.event_constraints = intel_skl_event_constraints;
 		x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_skl_extra_regs;
-		x86_pmu.pebs_aliases = intel_pebs_aliases_core2;
+		x86_pmu.pebs_aliases = intel_pebs_aliases_skl;
+		x86_pmu.pebs_prec_dist = true;
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 56b5015..9c0f8d4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -686,6 +686,8 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
 	INTEL_PST_CONSTRAINT(0x02cd, 0x8),    /* MEM_TRANS_RETIRED.PRECISE_STORES */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_EXCLEVT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
 	INTEL_EXCLEVT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -700,6 +702,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 	INTEL_PLD_CONSTRAINT(0x01cd, 0xf),    /* MEM_TRANS_RETIRED.* */
 	/* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf),
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x11d0, 0xf), /* MEM_UOPS_RETIRED.STLB_MISS_LOADS */
 	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_XLD(0x21d0, 0xf), /* MEM_UOPS_RETIRED.LOCK_LOADS */
@@ -718,6 +722,8 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = {
 
 struct event_constraint intel_skl_pebs_event_constraints[] = {
 	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2),	/* INST_RETIRED.PREC_DIST */
+	/* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:ppp). */
+	INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2),
 	/* INST_RETIRED.TOTAL_CYCLES_PS (inv=1, cmask=16) (cycles:p). */
 	INTEL_FLAGS_EVENT_CONSTRAINT(0x108000c0, 0x0f),
 	INTEL_PLD_CONSTRAINT(0x1cd, 0xf),		      /* MEM_TRANS_RETIRED.* */

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-01-06 18:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-01  0:28 [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Andi Kleen
2015-12-01  0:28 ` [PATCH 2/2] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:ppp Andi Kleen
2015-12-01 13:41   ` Peter Zijlstra
2015-12-01 14:54     ` Andi Kleen
2015-12-01 14:55       ` Peter Zijlstra
2015-12-04 11:50   ` [tip:perf/core] perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp tip-bot for Andi Kleen
2015-12-06 13:11     ` Ingo Molnar
2015-12-06 22:06       ` Peter Zijlstra
2015-12-07  6:48         ` Ingo Molnar
2015-12-07 10:26           ` Peter Zijlstra
2015-12-07 14:02           ` Arnaldo Carvalho de Melo
2015-12-07 14:12             ` Peter Zijlstra
2015-12-07 13:51       ` Peter Zijlstra
2015-12-08  5:36         ` Ingo Molnar
2015-12-08  8:50           ` Peter Zijlstra
2015-12-08  8:57             ` Peter Zijlstra
2015-12-08 13:27               ` Peter Zijlstra
2015-12-09  8:29                 ` Ingo Molnar
2016-01-06 18:51   ` tip-bot for Andi Kleen
2015-12-01 13:19 ` [PATCH 1/2] perf, x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake Peter Zijlstra
2015-12-04 11:50 ` [tip:perf/core] perf/x86: " tip-bot for Andi Kleen
2016-01-06 18:51 ` tip-bot for Andi Kleen
