From: Andi Kleen
To: peterz@infradead.org
Cc: linux-kernel@vger.kernel.org, Andi Kleen
Subject: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
Date: Thu, 15 Oct 2015 16:37:57 -0700
Message-Id: <1444952280-24184-2-git-send-email-andi@firstfloor.org>
X-Mailer: git-send-email 2.4.3
In-Reply-To: <1444952280-24184-1-git-send-email-andi@firstfloor.org>
References: <1444952280-24184-1-git-send-email-andi@firstfloor.org>

From: Andi Kleen

The SKL PMU code had a problem with LBR freezing: when a counter
overflowed while the PMI handler was already running, the LBRs would be
frozen early and not be unfrozen until the next PMI, so we would get
stale LBR information. Depending on the workload this could happen for
a few percent of the PMIs when sampling cycles in adaptive-frequency
mode, because the frequency algorithm regularly goes down to very low
periods.

This patch implements a new PMU ack sequence that avoids the problem.
The new sequence is:

- (counters are disabled with GLOBAL_CTRL)
  There should be no further increments of the counters by later
  instructions, and thus no additional PMIs (and thus no additional
  freezing).

- Ack the APIC
  Clear the APIC PMI LVT entry so that any later interrupt is delivered
  and is not lost due to the PMI LVT entry being masked. A lost PMI
  interrupt could leave the LBRs frozen without the PMI handler ever
  being entered again.

- Ack the PMU counters
  This unfreezes the LBRs on Skylake (but not on earlier CPUs, which
  rely on DEBUGCTL writes for this).

- Reenable the counters
  The WRMSR starts the counters counting again and is ordered after the
  APIC LVT PMI entry write, since WRMSR is architecturally serializing.
  Because the APIC PMI LVT is already unmasked, any PMI caused by these
  perfmon counters will trigger an NMI (though delivery may be delayed
  until after the next IRET).

One side effect is that the old retry loop is no longer possible, as
the counters stay unacked for the majority of the PMI handler; but that
is not a big loss, as "profiling" the PMI handler itself was always a
bit dubious. The retry loop is still used with the old ack sequence.

In principle the new sequence should work on other CPUs too, but since
I have only tested it on Skylake, it is only enabled there.
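For reference, a condensed sketch of the resulting handler ordering on
the Skylake (status_ack_after_apic) path, pieced together from the
hunks below. The function body is heavily abbreviated for illustration
and is not the literal upstream code:

	/* Illustration only: condensed from the diff below */
	static int intel_pmu_handle_irq(struct pt_regs *regs)
	{
		u64 status, orig_status;
		int handled = 0;

		/* counters are already disabled via GLOBAL_CTRL here */
		status = intel_pmu_get_status();
		orig_status = status;

		/*
		 * Process the overflowed counters (LBR read, overflow
		 * handlers) WITHOUT acking GLOBAL_STATUS, so the LBRs
		 * stay frozen while the samples are collected.
		 */

		/* 1. Unmask the APIC PMI LVT entry so no later PMI is lost */
		apic_write(APIC_LVTPC, APIC_DM_NMI);

		/* 2. Ack the PMU; on Skylake this also unfreezes the LBRs */
		intel_pmu_ack_status(orig_status);

		/* 3. Reenable counters; the WRMSR is ordered after the LVT write */
		__intel_pmu_enable_all(0, true);

		return handled;
	}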
Signed-off-by: Andi Kleen
---
 arch/x86/kernel/cpu/perf_event.h       |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c | 35 +++++++++++++++++++++++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 499f533..fcf01c7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -551,6 +551,7 @@ struct x86_pmu {
 	struct x86_pmu_quirk *quirks;
 	int		perfctr_second_write;
 	bool		late_ack;
+	bool		status_ack_after_apic;
 	unsigned	(*limit_period)(struct perf_event *event, unsigned l);
 
 	/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f63360b..69a545e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1789,6 +1789,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	struct cpu_hw_events *cpuc;
 	int bit, loops;
 	u64 status;
+	u64 orig_status;
 	int handled;
 
 	cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1803,13 +1804,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	handled = intel_pmu_drain_bts_buffer();
 	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
+	orig_status = status;
 	if (!status)
 		goto done;
 
 	loops = 0;
 again:
 	intel_pmu_lbr_read();
-	intel_pmu_ack_status(status);
+	if (!x86_pmu.status_ack_after_apic)
+		intel_pmu_ack_status(status);
+
 	if (++loops > 100) {
 		static bool warned = false;
 		if (!warned) {
@@ -1877,15 +1881,20 @@ again:
 			x86_pmu_stop(event, 0);
 	}
 
-	/*
-	 * Repeat if there is more work to be done:
-	 */
-	status = intel_pmu_get_status();
-	if (status)
-		goto again;
+
+	if (!x86_pmu.status_ack_after_apic) {
+		/*
+		 * Repeat if there is more work to be done:
+		 */
+		status = intel_pmu_get_status();
+		if (status)
+			goto again;
+	}
 
 done:
-	__intel_pmu_enable_all(0, true);
+	if (!x86_pmu.status_ack_after_apic)
+		__intel_pmu_enable_all(0, true);
+
 	/*
 	 * Only unmask the NMI after the overflow counters
 	 * have been reset. This avoids spurious NMIs on
@@ -1893,6 +1902,15 @@ done:
 	 */
 	if (x86_pmu.late_ack)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+	/*
+	 * Ack the PMU late. This avoids bogus freezing
+	 * on Skylake CPUs.
+	 */
+	if (x86_pmu.status_ack_after_apic) {
+		intel_pmu_ack_status(orig_status);
+		__intel_pmu_enable_all(0, true);
+	}
 	return handled;
 }
 
@@ -3514,6 +3532,7 @@ __init int intel_pmu_init(void)
 	case 78: /* 14nm Skylake Mobile */
 	case 94: /* 14nm Skylake Desktop */
 		x86_pmu.late_ack = true;
+		x86_pmu.status_ack_after_apic = true;
 		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 		intel_pmu_lbr_init_skl();
-- 
2.4.3