From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753430AbbJOXiI (ORCPT ); Thu, 15 Oct 2015 19:38:08 -0400 Received: from mga01.intel.com ([192.55.52.88]:18012 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752422AbbJOXiE (ORCPT ); Thu, 15 Oct 2015 19:38:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.17,687,1437462000"; d="scan'208";a="828086171" From: Andi Kleen To: peterz@infradead.org Cc: linux-kernel@vger.kernel.org, Andi Kleen Subject: [PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake Date: Thu, 15 Oct 2015 16:38:00 -0700 Message-Id: <1444952280-24184-5-git-send-email-andi@firstfloor.org> X-Mailer: git-send-email 2.4.3 In-Reply-To: <1444952280-24184-1-git-send-email-andi@firstfloor.org> References: <1444952280-24184-1-git-send-email-andi@firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Andi Kleen Switch the cycles:pp alias from UOPS_RETITRED to INST_RETIRED.PREC_DIST. The basic mechanism of abusing the inverse cmask to get all cycles works the same as before. PREC_DIST has special support for avoiding shadow effects, which can give better results compare to UOPS_RETIRED. The drawback is that PREC_DIST can only schedule on counter 1, but that is ok for cycle sampling, as there is normally no need to do multiple cycle sampling runs in parallel. It is still possible to run perf top in parallel, as that doesn't use precise mode. Also of course the multiplexing can still allow parallel operation. The UOPS_RETIRED old alias is still available in raw form. Example: Sample a loop with 10 sqrt with old cycles:pp 0.14 │10: sqrtps %xmm1,%xmm0 <-------------- 9.13 │ sqrtps %xmm1,%xmm0 11.58 │ sqrtps %xmm1,%xmm0 11.51 │ sqrtps %xmm1,%xmm0 6.27 │ sqrtps %xmm1,%xmm0 10.38 │ sqrtps %xmm1,%xmm0 12.20 │ sqrtps %xmm1,%xmm0 12.74 │ sqrtps %xmm1,%xmm0 5.40 │ sqrtps %xmm1,%xmm0 10.14 │ sqrtps %xmm1,%xmm0 10.51 │ ↑ jmp 10 We expect all 10 sqrt to get roughly the sample number of samples. But you can see that the instruction directly after the jmp is systematically underestimated in the result, due to sampling shadow effects. With the new PREC_DIST based sampling this problem is gone and all instructions show up roughly evenly: 9.51 │10: sqrtps %xmm1,%xmm0 11.74 │ sqrtps %xmm1,%xmm0 11.84 │ sqrtps %xmm1,%xmm0 6.05 │ sqrtps %xmm1,%xmm0 10.46 │ sqrtps %xmm1,%xmm0 12.25 │ sqrtps %xmm1,%xmm0 12.18 │ sqrtps %xmm1,%xmm0 5.26 │ sqrtps %xmm1,%xmm0 10.13 │ sqrtps %xmm1,%xmm0 10.43 │ sqrtps %xmm1,%xmm0 0.16 │ ↑ jmp 10 Even with PREC_DIST there is still sampling skid and the result is not completely even, but systematic shadow effects are significantly reduced. The improvements are mainly expected to make a difference in high IPC code. With low IPC it should be similar. The PREC_DIST event is supported back to IvyBridge, but I only tested it on Skylake for now, so I only enabled it on Skylake. On earlier parts there were various hardware bugs in it (but no show stopper on IvyBridge and up I believe), so it could be enabled there after sufficient testing. On Sandy Bridge PREC_DIST can only be scheduled as a single event on the PMU, which is too limiting. Before Sandy Bridge it was not supported. Signed-off-by: Andi Kleen --- arch/x86/kernel/cpu/perf_event_intel.c | 27 ++++++++++++++++++++++++++- arch/x86/kernel/cpu/perf_event_intel_ds.c | 4 +++- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 4c46cd7..b79427a 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -2565,6 +2565,31 @@ static void intel_pebs_aliases_snb(struct perf_event *event) } } +static void intel_pebs_aliases_skl(struct perf_event *event) +{ + if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) { + /* + * Use an alternative encoding for CPU_CLK_UNHALTED.THREAD_P + * (0x003c) so that we can use it with PEBS. + * + * The regular CPU_CLK_UNHALTED.THREAD_P event (0x003c) isn't + * PEBS capable. However we can use INST_RETIRED.PREC_DIST + * (0x01c0), which is a PEBS capable event, to get the same + * count. + * + * The PREC_DIST event has special support to minimize sample + * shadowing effects. One drawback is that it can be + * only programmed on counter 1, but that seems like an + * acceptable trade off. + */ + u64 alt_config = X86_CONFIG(.event=0xc0, .umask=0x01, .inv=1, .cmask=16); + + alt_config |= (event->hw.config & ~X86_RAW_EVENT_MASK); + event->hw.config = alt_config; + } +} + + static unsigned long intel_pmu_free_running_flags(struct perf_event *event) { unsigned long flags = x86_pmu.free_running_flags; @@ -3617,7 +3642,7 @@ __init int intel_pmu_init(void) x86_pmu.event_constraints = intel_skl_event_constraints; x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints; x86_pmu.extra_regs = intel_skl_extra_regs; - x86_pmu.pebs_aliases = intel_pebs_aliases_snb; + x86_pmu.pebs_aliases = intel_pebs_aliases_skl; /* all extra regs are per-cpu when HT is on */ x86_pmu.flags |= PMU_FL_HAS_RSP_1; x86_pmu.flags |= PMU_FL_NO_HT_SHARING; diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index 5db1c77..acaebc7 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -719,8 +719,10 @@ struct event_constraint intel_hsw_pebs_event_constraints[] = { struct event_constraint intel_skl_pebs_event_constraints[] = { INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x2), /* INST_RETIRED.PREC_DIST */ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA(0x01c2, 0xf), /* UOPS_RETIRED.ALL */ - /* UOPS_RETIRED.ALL, inv=1, cmask=16 (cycles:p). */ + /* UOPS_RETIRED.ALL, inv=1, cmask=16 (old style cycles:p). */ INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c2, 0xf), + /* INST_RETIRED.PREC_DIST, inv=1, cmask=16 (cycles:p). */ + INTEL_FLAGS_EVENT_CONSTRAINT(0x108001c0, 0x2), INTEL_PLD_CONSTRAINT(0x1cd, 0xf), /* MEM_TRANS_RETIRED.* */ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x11d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_LOADS */ INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x12d0, 0xf), /* MEM_INST_RETIRED.STLB_MISS_STORES */ -- 2.4.3