From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753282AbbFLJIp (ORCPT <rfc822;w@1wt.eu>);
	Fri, 12 Jun 2015 05:08:45 -0400
Received: from mga01.intel.com ([192.55.52.88]:13248 "EHLO mga01.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752211AbbFLJIj (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 12 Jun 2015 05:08:39 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.13,601,1427785200"; 
   d="scan'208";a="725909109"
From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org,
        adrian.hunter@intel.com, x86@kernel.org, hpa@zytor.com,
        acme@infradead.org
Subject: Re: [PATCH 2/2] perf/x86/intel: Fix PMI handling for Intel PT
In-Reply-To: <20150611140020.GU19282@twins.programming.kicks-ass.net>
References: <1434024837-9916-1-git-send-email-alexander.shishkin@linux.intel.com> <1434024837-9916-3-git-send-email-alexander.shishkin@linux.intel.com> <20150611140020.GU19282@twins.programming.kicks-ass.net>
User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Date: Fri, 12 Jun 2015 12:08:35 +0300
Message-ID: <87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Jun 11, 2015 at 03:13:57PM +0300, Alexander Shishkin wrote:
>> @@ -1426,7 +1426,23 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>>  	u64 finish_clock;
>>  	int ret;
>>  
>> -	if (!atomic_read(&active_events))
>> +	/*
>> +	 * Intel PT is a separate PMUs, it doesn't increment active_events
>> +	 * nor does it use resources associated with it, such as DS area.
>> +	 * However, it still uses the same PMI handler, so here we need
>> +	 * to make sure to call it *also* if any PT events are present.
>> +	 *
>> +	 * BTS events currently increment active_events implicitly through
>> +	 * x86_reserve_hardware(), where it acts as DS area reference
>> +	 * counter, so no need to check x86_lbr_exclusive_bts counter here;
>> +	 * otherwise we'd have to do the same for BTS.
>> +	 *
>> +	 * Theoretically, they could be made into separate PMI handlers, but
>> +	 * that can create additional challenges as PT uses the same PMI status
>> +	 * register as x86_pmu.
>> +	 */
>> +	if (!atomic_read(&active_events) &&
>> +	    !atomic_read(&x86_pmu.lbr_exclusive[x86_lbr_exclusive_pt]))
>>  		return NMI_DONE;
>>  
>
> Urgh, sad. That's two cache misses instead of one. Cant we somehow keep
> that a single value to read?

Indeed. One thing here is that active_events actually serves dual
purpose and does reference counting for
{reserve,release}_{ds_buffers,pmc_hardware}(). We can split it in two
and then we can add PT events to active_events as well, so we can make
do with just one atomic_read() here. How about the patch below instead
of the previously posted one?

>>From 6f0e86cb0bd556cbfaa5ada6a8ef3aa0e2a6b60a Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Date: Tue, 9 Jun 2015 13:03:26 +0300
Subject: [PATCH] perf/x86/intel: Fix PMI handling for Intel PT

Since Intel PT is a separate pmu and is not using any of the x86_pmu
code paths, which means in particular that active_events counter remains
intact when new PT events are created. However, PT uses x86_pmu PMI
handler for its PMI handling needs. The problem here is that the latter
checks active_events and in case of it being zero, exits without calling
the actual x86_pmu.handle_nmi(), which results in unknown NMI errors and
massive data loss for PT.

The effect is not visible if there are other perf events in the system
at the same time that keep active_events counter non-zero, for instance
if the NMI watchdog is running, so one needs to disable it to reproduce
the problem.

At the same time, the active_events counter besides doing what the name
suggests also implicitly serves as a pmc hardware and DS area reference
counter.

This patch adds a separate reference counter for the pmc hardware, leaving
active_events for actually counting the events and makes sure it also
counts PT and BTS events.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8487714da2..3b0257e222 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -135,6 +135,7 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
 }
 
 static atomic_t active_events;
+static atomic_t pmc_refcount;
 static DEFINE_MUTEX(pmc_reserve_mutex);
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -271,6 +272,7 @@ msr_fail:
 static void hw_perf_event_destroy(struct perf_event *event)
 {
 	x86_release_hardware();
+	atomic_dec(&active_events);
 }
 
 void hw_perf_lbr_event_destroy(struct perf_event *event)
@@ -324,16 +326,16 @@ int x86_reserve_hardware(void)
 {
 	int err = 0;
 
-	if (!atomic_inc_not_zero(&active_events)) {
+	if (!atomic_inc_not_zero(&pmc_refcount)) {
 		mutex_lock(&pmc_reserve_mutex);
-		if (atomic_read(&active_events) == 0) {
+		if (atomic_read(&pmc_refcount) == 0) {
 			if (!reserve_pmc_hardware())
 				err = -EBUSY;
 			else
 				reserve_ds_buffers();
 		}
 		if (!err)
-			atomic_inc(&active_events);
+			atomic_inc(&pmc_refcount);
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 
@@ -342,7 +344,7 @@ int x86_reserve_hardware(void)
 
 void x86_release_hardware(void)
 {
-	if (atomic_dec_and_mutex_lock(&active_events, &pmc_reserve_mutex)) {
+	if (atomic_dec_and_mutex_lock(&pmc_refcount, &pmc_reserve_mutex)) {
 		release_pmc_hardware();
 		release_ds_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
@@ -371,12 +373,24 @@ int x86_add_exclusive(unsigned int what)
 
 out:
 	mutex_unlock(&pmc_reserve_mutex);
+
+	/*
+	 * Assuming that all exclusive events will share the PMI handler
+	 * (which checks active_events for whether there is work to do),
+	 * we can bump active_events counter right here, except for
+	 * x86_lbr_exclusive_lbr events that go through x86_pmu_event_init()
+	 * path, which already bumps active_events for them.
+	 */
+	if (!ret && what != x86_lbr_exclusive_lbr)
+		atomic_inc(&active_events);
+
 	return ret;
 }
 
 void x86_del_exclusive(unsigned int what)
 {
 	atomic_dec(&x86_pmu.lbr_exclusive[what]);
+	atomic_dec(&active_events);
 }
 
 int x86_setup_perfctr(struct perf_event *event)
@@ -557,6 +571,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	if (err)
 		return err;
 
+	atomic_inc(&active_events);
 	event->destroy = hw_perf_event_destroy;
 
 	event->hw.idx = -1;
@@ -1426,6 +1441,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * All PMUs/events that share this PMI handler should make sure to
+	 * increment active_events for their events.
+	 */
 	if (!atomic_read(&active_events))
 		return NMI_DONE;
 
-- 
2.1.4