From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756104AbbJUQYK (ORCPT );
	Wed, 21 Oct 2015 12:24:10 -0400
Received: from mail-wi0-f173.google.com ([209.85.212.173]:37578 "EHLO
	mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755320AbbJUQYF (ORCPT );
	Wed, 21 Oct 2015 12:24:05 -0400
Date: Wed, 21 Oct 2015 18:24:00 +0200
From: Ingo Molnar
To: Peter Zijlstra
Cc: Andi Kleen, acme@kernel.org, jolsa@kernel.org,
	linux-kernel@vger.kernel.org, Andi Kleen, stable@vger.kernel.org,
	Andy Lutomirski, Linus Torvalds, Thomas Gleixner,
	"H. Peter Anvin"
Subject: Re: [PATCH 1/5] x86, perf: Fix LBR call stack save/restore
Message-ID: <20151021162400.GA28914@gmail.com>
References: <1445366797-30894-1-git-send-email-andi@firstfloor.org>
	<20151021131310.GE3604@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20151021131310.GE3604@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Peter Zijlstra wrote:

> >  	mask = x86_pmu.lbr_nr - 1;
> > -	tos = intel_pmu_lbr_tos();
> > +	tos = task_ctx->tos;
> >  	for (i = 0; i < tos; i++) {
> >  		lbr_idx = (tos - i) & mask;
> >  		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> >  		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> >  			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> >  	}
> > +	wrmsrl(x86_pmu.lbr_tos, tos);
> >  	task_ctx->lbr_stack_state = LBR_NONE;
> >  }
>
> Any idea how much more expensive that wrmsr() is compared to the rdmsr() it
> replaces?
>
> If it's significant we could think about having this behaviour depend on
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical Intel
WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference:

[  170.798574] x86/bench: -------------------------------------------------------------------
[  170.807258] x86/bench: |        RDTSC-cycles:      hot (±noise) /    cold (±noise)
[  170.816115] x86/bench: -------------------------------------------------------------------
[  212.146982] x86/bench: rdtsc                    :       16 /       60
[  213.725998] x86/bench: rdmsr                    :      100 /      148
[  215.469958] x86/bench: wrmsr                    :      456 /      708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So it's roughly 350-550 RDTSC cycles extra (456 vs. 100 cache-hot, 708 vs. 148
cache-cold) ...

Thanks,

	Ingo
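
The x86/bench tool quoted above is out of tree, so as a rough point of
reference only: a serialized hot/cold MSR timing of this kind can be
structured like the minimal kernel-module sketch below. The function names
(tsc_serialized, time_one_rdmsr, time_one_wrmsr) are hypothetical; only
rdmsrl()/wrmsrl(), local_irq_save()/local_irq_restore() and the RDTSC/CPUID
instructions are real interfaces.

#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/msr.h>

/*
 * Read the TSC behind a serializing CPUID, so that the measured MSR
 * access cannot be reordered outside the [t0, t1] window.
 */
static u64 tsc_serialized(void)
{
	u32 lo, hi;

	asm volatile("xor %%eax, %%eax\n\t"
		     "cpuid\n\t"
		     "rdtsc"
		     : "=a" (lo), "=d" (hi)
		     : : "rbx", "rcx", "memory");
	return ((u64)hi << 32) | lo;
}

/*
 * Cycle cost of a single RDMSR. This includes the fixed measurement
 * overhead, which the 'rdtsc' row in the table above calibrates out.
 */
static u64 time_one_rdmsr(u32 msr)
{
	unsigned long flags;
	u64 t0, t1, val;

	local_irq_save(flags);
	t0 = tsc_serialized();
	rdmsrl(msr, val);
	t1 = tsc_serialized();
	local_irq_restore(flags);

	return t1 - t0;
}

/* Same for WRMSR - 'msr' must be side-effect free to write. */
static u64 time_one_wrmsr(u32 msr, u64 val)
{
	unsigned long flags;
	u64 t0, t1;

	local_irq_save(flags);
	t0 = tsc_serialized();
	wrmsrl(msr, val);
	t1 = tsc_serialized();
	local_irq_restore(flags);

	return t1 - t0;
}

A "hot" figure would then be the minimum over many back-to-back calls on the
same MSR, while "cold" presumably means a first access after enough unrelated
work has run in between.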
Peter Anvin" Subject: Re: [PATCH 1/5] x86, perf: Fix LBR call stack save/restore Message-ID: <20151021162400.GA28914@gmail.com> References: <1445366797-30894-1-git-send-email-andi@firstfloor.org> <20151021131310.GE3604@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20151021131310.GE3604@twins.programming.kicks-ass.net> Sender: stable-owner@vger.kernel.org List-ID: * Peter Zijlstra wrote: > > mask = x86_pmu.lbr_nr - 1; > > - tos = intel_pmu_lbr_tos(); > > + tos = task_ctx->tos; > > for (i = 0; i < tos; i++) { > > lbr_idx = (tos - i) & mask; > > wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]); > > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx) > > if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO) > > wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]); > > } > > + wrmsrl(x86_pmu.lbr_tos, tos); > > task_ctx->lbr_stack_state = LBR_NONE; > > } > > Any idea who much more expensive that wrmsr() is compared to the rdmsr() it > replaces? > > If its significant we could think about having this behaviour depend on > callstacks. The WRMSR extra cost is probably rather significant - here is a typical Intel WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference: [ 170.798574] x86/bench: ------------------------------------------------------------------- [ 170.807258] x86/bench: | RDTSC-cycles: hot (�noise) / cold (�noise) [ 170.816115] x86/bench: ------------------------------------------------------------------- [ 212.146982] x86/bench: rdtsc : 16 / 60 [ 213.725998] x86/bench: rdmsr : 100 / 148 [ 215.469958] x86/bench: wrmsr : 456 / 708 That's on a Xeon E7-4890 (22nm IvyBridge-EX). So it's 350-550 RDTSC cycles ... Thanks, Ingo