From: Andi Kleen
To: peterz@infradead.org
Cc: linux-kernel@vger.kernel.org, Andi Kleen
Subject: [PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake
Date: Thu, 15 Oct 2015 16:37:57 -0700
Message-Id: <1444952280-24184-2-git-send-email-andi@firstfloor.org>
X-Mailer: git-send-email 2.4.3
In-Reply-To: <1444952280-24184-1-git-send-email-andi@firstfloor.org>
References: <1444952280-24184-1-git-send-email-andi@firstfloor.org>

From: Andi Kleen

The SKL PMU code had a problem with LBR freezing: when a counter
overflowed while the PMI handler was already running, the LBRs would be
frozen early and not be unfrozen until the next PMI, so we would get
stale LBR information. Depending on the workload this could happen for
a few percent of the PMIs when sampling cycles in adaptive-frequency
mode, because the frequency algorithm regularly goes down to very low
periods.

This patch implements a new PMU ack sequence that avoids the problem.
The new sequence is:

- (counters are disabled with GLOBAL_CTRL)
  There should be no further increments of the counters by later
  instructions, and thus no additional PMIs (and thus no additional
  freezing).

- Ack the APIC
  Clear the APIC PMI LVT entry so that any later interrupt is delivered
  and is not lost due to the PMI LVT entry being masked. A lost PMI
  interrupt could leave the LBRs frozen without the PMI handler ever
  being entered again.

- Ack the PMU counters
  This unfreezes the LBRs on Skylake (but not on earlier CPUs, which
  rely on DEBUGCTL writes for this).

- Reenable the counters
  The WRMSR starts the counters counting again and is ordered after the
  APIC LVT PMI entry write, since WRMSR is architecturally serializing.
  Because the APIC PMI LVT is already unmasked, any PMI caused by these
  perfmon counters will trigger an NMI (though delivery may be delayed
  until after the next IRET).

One side effect is that the old retry loop is no longer possible, as
the counters stay unacked for the majority of the PMI handler; but that
is not a big loss, as "profiling" the PMI handler itself was always a
bit dubious. The retry loop is still used with the old ack sequence.

In principle the new sequence should work on other CPUs too, but since
I have only tested it on Skylake, it is only enabled there.
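For reference, a condensed sketch of the resulting handler ordering on
the Skylake (status_ack_after_apic) path, pieced together from the
hunks below. The function body is heavily abbreviated for illustration
and is not the literal upstream code:

	/* Illustration only: condensed from the diff below */
	static int intel_pmu_handle_irq(struct pt_regs *regs)
	{
		u64 status, orig_status;
		int handled = 0;

		/* counters are already disabled via GLOBAL_CTRL here */
		status = intel_pmu_get_status();
		orig_status = status;

		/*
		 * Process the overflowed counters (LBR read, overflow
		 * handlers) WITHOUT acking GLOBAL_STATUS, so the LBRs
		 * stay frozen while the samples are collected.
		 */

		/* 1. Unmask the APIC PMI LVT entry so no later PMI is lost */
		apic_write(APIC_LVTPC, APIC_DM_NMI);

		/* 2. Ack the PMU; on Skylake this also unfreezes the LBRs */
		intel_pmu_ack_status(orig_status);

		/* 3. Reenable counters; the WRMSR is ordered after the LVT write */
		__intel_pmu_enable_all(0, true);

		return handled;
	}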
Signed-off-by: Andi Kleen
---
 arch/x86/kernel/cpu/perf_event.h       |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c | 35 +++++++++++++++++++++++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 499f533..fcf01c7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -551,6 +551,7 @@ struct x86_pmu {
 	struct x86_pmu_quirk *quirks;
 	int		perfctr_second_write;
 	bool		late_ack;
+	bool		status_ack_after_apic;
 	unsigned	(*limit_period)(struct perf_event *event, unsigned l);
 
 	/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f63360b..69a545e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1789,6 +1789,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	struct cpu_hw_events *cpuc;
 	int bit, loops;
 	u64 status;
+	u64 orig_status;
 	int handled;
 
 	cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1803,13 +1804,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	handled = intel_pmu_drain_bts_buffer();
 	handled += intel_bts_interrupt();
 	status = intel_pmu_get_status();
+	orig_status = status;
 	if (!status)
 		goto done;
 
 	loops = 0;
 again:
 	intel_pmu_lbr_read();
-	intel_pmu_ack_status(status);
+	if (!x86_pmu.status_ack_after_apic)
+		intel_pmu_ack_status(status);
+
 	if (++loops > 100) {
 		static bool warned = false;
 		if (!warned) {
@@ -1877,15 +1881,20 @@ again:
 			x86_pmu_stop(event, 0);
 	}
 
-	/*
-	 * Repeat if there is more work to be done:
-	 */
-	status = intel_pmu_get_status();
-	if (status)
-		goto again;
+
+	if (!x86_pmu.status_ack_after_apic) {
+		/*
+		 * Repeat if there is more work to be done:
+		 */
+		status = intel_pmu_get_status();
+		if (status)
+			goto again;
+	}
 
 done:
-	__intel_pmu_enable_all(0, true);
+	if (!x86_pmu.status_ack_after_apic)
+		__intel_pmu_enable_all(0, true);
+
 	/*
 	 * Only unmask the NMI after the overflow counters
 	 * have been reset. This avoids spurious NMIs on
@@ -1893,6 +1902,15 @@ done:
 	 */
 	if (x86_pmu.late_ack)
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+	/*
+	 * Ack the PMU late. This avoids bogus freezing
+	 * on Skylake CPUs.
+	 */
+	if (x86_pmu.status_ack_after_apic) {
+		intel_pmu_ack_status(orig_status);
+		__intel_pmu_enable_all(0, true);
+	}
 	return handled;
 }
 
@@ -3514,6 +3532,7 @@ __init int intel_pmu_init(void)
 	case 78: /* 14nm Skylake Mobile */
 	case 94: /* 14nm Skylake Desktop */
 		x86_pmu.late_ack = true;
+		x86_pmu.status_ack_after_apic = true;
 		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 		intel_pmu_lbr_init_skl();
-- 
2.4.3