From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756104AbbJUQYK (ORCPT );
	Wed, 21 Oct 2015 12:24:10 -0400
Received: from mail-wi0-f173.google.com ([209.85.212.173]:37578 "EHLO
	mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755320AbbJUQYF (ORCPT );
	Wed, 21 Oct 2015 12:24:05 -0400
Date: Wed, 21 Oct 2015 18:24:00 +0200
From: Ingo Molnar
To: Peter Zijlstra
Cc: Andi Kleen, acme@kernel.org, jolsa@kernel.org,
	linux-kernel@vger.kernel.org, Andi Kleen, stable@vger.kernel.org,
	Andy Lutomirski, Linus Torvalds, Thomas Gleixner,
	"H. Peter Anvin"
Subject: Re: [PATCH 1/5] x86, perf: Fix LBR call stack save/restore
Message-ID: <20151021162400.GA28914@gmail.com>
References: <1445366797-30894-1-git-send-email-andi@firstfloor.org>
	<20151021131310.GE3604@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20151021131310.GE3604@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Peter Zijlstra wrote:

> >  	mask = x86_pmu.lbr_nr - 1;
> > -	tos = intel_pmu_lbr_tos();
> > +	tos = task_ctx->tos;
> >  	for (i = 0; i < tos; i++) {
> >  		lbr_idx = (tos - i) & mask;
> >  		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> >  		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> >  			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> >  	}
> > +	wrmsrl(x86_pmu.lbr_tos, tos);
> >  	task_ctx->lbr_stack_state = LBR_NONE;
> >  }
>
> Any idea how much more expensive that wrmsr() is compared to the rdmsr() it
> replaces?
>
> If it's significant we could think about having this behaviour depend on
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical Intel
WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference:

[  170.798574] x86/bench: -------------------------------------------------------------------
[  170.807258] x86/bench: |        RDTSC-cycles:      hot (±noise) /    cold (±noise)
[  170.816115] x86/bench: -------------------------------------------------------------------
[  212.146982] x86/bench: rdtsc                    :       16 /       60
[  213.725998] x86/bench: rdmsr                    :      100 /      148
[  215.469958] x86/bench: wrmsr                    :      456 /      708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So it's roughly 350-550 RDTSC cycles extra (456 vs. 100 cache-hot, 708 vs. 148
cache-cold) ...

Thanks,

	Ingo
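
The x86/bench tool quoted above is out of tree, so as a rough point of
reference only: a serialized hot/cold MSR timing of this kind can be
structured like the minimal kernel-module sketch below. The function names
(tsc_serialized, time_one_rdmsr, time_one_wrmsr) are hypothetical; only
rdmsrl()/wrmsrl(), local_irq_save()/local_irq_restore() and the RDTSC/CPUID
instructions are real interfaces.

#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/msr.h>

/*
 * Read the TSC behind a serializing CPUID, so that the measured MSR
 * access cannot be reordered outside the [t0, t1] window.
 */
static u64 tsc_serialized(void)
{
	u32 lo, hi;

	asm volatile("xor %%eax, %%eax\n\t"
		     "cpuid\n\t"
		     "rdtsc"
		     : "=a" (lo), "=d" (hi)
		     : : "rbx", "rcx", "memory");
	return ((u64)hi << 32) | lo;
}

/*
 * Cycle cost of a single RDMSR. This includes the fixed measurement
 * overhead, which the 'rdtsc' row in the table above calibrates out.
 */
static u64 time_one_rdmsr(u32 msr)
{
	unsigned long flags;
	u64 t0, t1, val;

	local_irq_save(flags);
	t0 = tsc_serialized();
	rdmsrl(msr, val);
	t1 = tsc_serialized();
	local_irq_restore(flags);

	return t1 - t0;
}

/* Same for WRMSR - 'msr' must be side-effect free to write. */
static u64 time_one_wrmsr(u32 msr, u64 val)
{
	unsigned long flags;
	u64 t0, t1;

	local_irq_save(flags);
	t0 = tsc_serialized();
	wrmsrl(msr, val);
	t1 = tsc_serialized();
	local_irq_restore(flags);

	return t1 - t0;
}

A "hot" figure would then be the minimum over many back-to-back calls on the
same MSR, while "cold" presumably means a first access after enough unrelated
work has run in between.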
Peter Anvin" Subject: Re: [PATCH 1/5] x86, perf: Fix LBR call stack save/restore Message-ID: <20151021162400.GA28914@gmail.com> References: <1445366797-30894-1-git-send-email-andi@firstfloor.org> <20151021131310.GE3604@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20151021131310.GE3604@twins.programming.kicks-ass.net> Sender: stable-owner@vger.kernel.org List-ID: * Peter Zijlstra wrote: > > mask = x86_pmu.lbr_nr - 1; > > - tos = intel_pmu_lbr_tos(); > > + tos = task_ctx->tos; > > for (i = 0; i < tos; i++) { > > lbr_idx = (tos - i) & mask; > > wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]); > > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx) > > if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO) > > wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]); > > } > > + wrmsrl(x86_pmu.lbr_tos, tos); > > task_ctx->lbr_stack_state = LBR_NONE; > > } > > Any idea who much more expensive that wrmsr() is compared to the rdmsr() it > replaces? > > If its significant we could think about having this behaviour depend on > callstacks. The WRMSR extra cost is probably rather significant - here is a typical Intel WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference: [ 170.798574] x86/bench: ------------------------------------------------------------------- [ 170.807258] x86/bench: | RDTSC-cycles: hot (�noise) / cold (�noise) [ 170.816115] x86/bench: ------------------------------------------------------------------- [ 212.146982] x86/bench: rdtsc : 16 / 60 [ 213.725998] x86/bench: rdmsr : 100 / 148 [ 215.469958] x86/bench: wrmsr : 456 / 708 That's on a Xeon E7-4890 (22nm IvyBridge-EX). So it's 350-550 RDTSC cycles ... Thanks, Ingo