linux-kernel.vger.kernel.org archive mirror
* [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
@ 2012-03-01 22:28 Vince Weaver
  2012-06-06 16:17 ` [tip:perf/core] perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernel tip-bot for Vince Weaver
  0 siblings, 1 reply; 8+ messages in thread
From: Vince Weaver @ 2012-03-01 22:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, a.p.zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

Hello

The rdpmc instruction is faster than the equivalent rdmsr call,
so use it when possible in the kernel.

The perfctr kernel patches did this, after extensive testing showed
rdpmc to always be faster. (One can look in etc/costs in the perfctr-2.6
package to see a historical list of the overheads.)

I have done some tests on a 3.2 kernel; the kernel module I used
was included in the first posting of this patch:

                    rdmsr            rdpmc
Core2 T9900:      203.9 cycles     30.9 cycles
AMD fam0fh:        56.2 cycles      9.8 cycles
Atom 6/28/2:      129.7 cycles     50.6 cycles

The speedup from using rdpmc is large, although granted
it really is a drop in the bucket compared to the other overheads
involved.

It's probably possible (and desirable) to do this without 
requiring a new field in the hw_perf_event structure, but the fixed events 
make this tricky.
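
For reference, a minimal illustrative helper (not part of the patch)
showing how the rdpmc counter index is encoded; on Intel, bit 30 of
ECX selects the fixed-counter space:

	static inline u64 read_pmc(unsigned int idx)
	{
		u32 low, high;

		/* idx = 0, 1, ... reads general-purpose counter idx;
		 * idx = n | (1<<30) reads fixed counter n */
		asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (idx));

		return ((u64)high << 32) | low;
	}

(Real kernel code should go through the rdpmc() macro rather than raw
inline asm, per the paravirt discussion elsewhere in this thread.)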

Changes since the last version: properly use the "rdpmc" macro,
      make event_base_rdpmc an int rather than unsigned long

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5adce10..e1ddfc7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -70,6 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	struct hw_perf_event *hwc = &event->hw;
 	int shift = 64 - x86_pmu.cntval_bits;
 	u64 prev_raw_count, new_raw_count;
+	u32 high, low;
 	int idx = hwc->idx;
 	s64 delta;
 
@@ -85,7 +86,8 @@ u64 x86_perf_event_update(struct perf_event *event)
 	 */
 again:
 	prev_raw_count = local64_read(&hwc->prev_count);
-	rdmsrl(hwc->event_base, new_raw_count);
+	rdpmc(hwc->event_base_rdpmc, low, high);
+	new_raw_count = ((u64)high << 32) | (u64)low;
 
 	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
 					new_raw_count) != prev_raw_count)
@@ -768,9 +770,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
 		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
+		hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
 	} else {
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
+		hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index abb2776..8caf91a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -562,6 +562,7 @@ struct hw_perf_event {
 			u64		last_tag;
 			unsigned long	config_base;
 			unsigned long	event_base;
+			int		event_base_rdpmc;
 			int		idx;
 			int		last_cpu;
 			struct hw_perf_event_extra extra_reg;



* [tip:perf/core] perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernel
  2012-03-01 22:28 [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel Vince Weaver
@ 2012-06-06 16:17 ` tip-bot for Vince Weaver
  2012-06-15 17:43   ` [PATCH] perf, amd: Fix rdpmc index calculation for AMD family 15h Robert Richter
  0 siblings, 1 reply; 8+ messages in thread
From: tip-bot for Vince Weaver @ 2012-06-06 16:17 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, vweaver1

Commit-ID:  c48b60538c3ba05a7a2713c4791b25405525431b
Gitweb:     http://git.kernel.org/tip/c48b60538c3ba05a7a2713c4791b25405525431b
Author:     Vince Weaver <vweaver1@eecs.utk.edu>
AuthorDate: Thu, 1 Mar 2012 17:28:14 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Jun 2012 17:23:35 +0200

perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernel

The rdpmc instruction is faster than the equivalent rdmsr call,
so use it when possible in the kernel.

The perfctr kernel patches did this, after extensive testing showed
rdpmc to always be faster. (One can look in etc/costs in the perfctr-2.6
package to see a historical list of the overheads.)

I have done some tests on a 3.2 kernel; the kernel module I used
was included in the first posting of this patch:

                   rdmsr           rdpmc
 Core2 T9900:      203.9 cycles     30.9 cycles
 AMD fam0fh:        56.2 cycles      9.8 cycles
 Atom 6/28/2:      129.7 cycles     50.6 cycles

The speedup of using rdpmc is large.

[ It's probably possible (and desirable) to do this without
  requiring a new field in the hw_perf_event structure, but
  the fixed events make this tricky. ]

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1203011724030.26934@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c |    4 +++-
 include/linux/perf_event.h       |    1 +
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 43c2017..000a474 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -75,7 +75,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	 */
 again:
 	prev_raw_count = local64_read(&hwc->prev_count);
-	rdmsrl(hwc->event_base, new_raw_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
 
 	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
 					new_raw_count) != prev_raw_count)
@@ -819,9 +819,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
 		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
+		hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
 	} else {
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
+		hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 45db49f..1ce887a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -677,6 +677,7 @@ struct hw_perf_event {
 			u64		last_tag;
 			unsigned long	config_base;
 			unsigned long	event_base;
+			int		event_base_rdpmc;
 			int		idx;
 			int		last_cpu;
 


* [PATCH] perf, amd: Fix rdpmc index calculation for AMD family 15h
  2012-06-06 16:17 ` [tip:perf/core] perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernel tip-bot for Vince Weaver
@ 2012-06-15 17:43   ` Robert Richter
  0 siblings, 0 replies; 8+ messages in thread
From: Robert Richter @ 2012-06-15 17:43 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, a.p.zijlstra, tglx, vweaver1; +Cc: linux-tip-commits

On 06.06.12 09:17:02, tip-bot for Vince Weaver wrote:
> Commit-ID:  c48b60538c3ba05a7a2713c4791b25405525431b
> Gitweb:     http://git.kernel.org/tip/c48b60538c3ba05a7a2713c4791b25405525431b
> Author:     Vince Weaver <vweaver1@eecs.utk.edu>
> AuthorDate: Thu, 1 Mar 2012 17:28:14 -0500
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 6 Jun 2012 17:23:35 +0200
> 
> perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernel
> 
> The rdpmc instruction is faster than the equivalent rdmsr call,
> so use it when possible in the kernel.
> 
> The perfctr kernel patches did this, after extensive testing showed
> rdpmc to always be faster. (One can look in etc/costs in the perfctr-2.6
> package to see a historical list of the overheads.)
> 
> I have done some tests on a 3.2 kernel; the kernel module I used
> was included in the first posting of this patch:
> 
>                    rdmsr           rdpmc
>  Core2 T9900:      203.9 cycles     30.9 cycles
>  AMD fam0fh:        56.2 cycles      9.8 cycles
>  Atom 6/28/2:      129.7 cycles     50.6 cycles
> 
> The speedup of using rdpmc is large.
> 
> [ It's probably possible (and desirable) to do this without
>   requiring a new field in the hw_perf_event structure, but
>   the fixed events make this tricky. ]
> 
> Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1203011724030.26934@cl320.eecs.utk.edu
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/cpu/perf_event.c |    4 +++-
>  include/linux/perf_event.h       |    1 +
>  2 files changed, 4 insertions(+), 1 deletions(-)

This crashes on AMD family 15h. Fix below.

-Robert


From cf346c12eb0878046d8326a003d4b86aafc8f923 Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Fri, 15 Jun 2012 19:06:44 +0200
Subject: [PATCH] perf, amd: Fix rdpmc index calculation for AMD family 15h

The rdpmc index calculation is wrong for AMD family 15h
(X86_FEATURE_PERFCTR_CORE set). This leads to a #GP fault when
accessing the counter. While the MSR address offset is (index << 1),
we must use the plain index to select the correct counter via rdpmc.
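
Concretely, a sketch of the difference (assuming the documented family
15h layout: PERF_CTL0 at MSR 0xc0010200, PERF_CTR0 at 0xc0010201, with
control/counter registers interleaved):

	/* counter MSRs are spaced two apart:
	 *   CTR0 0xc0010201, CTR1 0xc0010203, CTR2 0xc0010205, ...
	 * so the MSR address uses (idx << 1), while rdpmc numbers
	 * the counters linearly from 0: */
	hwc->event_base       = MSR_F15H_PERF_CTR + (hwc->idx << 1);
	hwc->event_base_rdpmc = hwc->idx;	/* not (idx << 1) */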

 Pid: 2237, comm: syslog-ng Not tainted 3.5.0-rc1-perf-x86_64-standard-g130ff90 #135 AMD Pike/Pike
 RIP: 0010:[<ffffffff8100dc33>]  [<ffffffff8100dc33>] x86_perf_event_update+0x27/0x66
 RSP: 0000:ffff88042ec07d90  EFLAGS: 00010083
 RAX: 0000000000000005 RBX: 0000000000000005 RCX: 000000000000000a
 RDX: 0000000000000000 RSI: ffff8803f4a68160 RDI: ffff8803f4a68000
 RBP: ffff88042ec07d98 R08: ffffffffffff6c21 R09: 0000000000000030
 R10: ffff88041d33f000 R11: 000000000000001f R12: 0000000000000004
 R13: ffff88042ec0b790 R14: ffff88042ec0b998 R15: ffff8803f4a68000
 FS:  00007f109a313700(0000) GS:ffff88042ec00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffff600400 CR3: 00000003f485f000 CR4: 00000000000407f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process syslog-ng (pid: 2237, threadinfo ffff8803f4972000, task ffff8803f4ed9850)
 Stack:
  0000000000000005 ffff88042ec07e48 ffffffff8100f5b9 ffff88042ec07ef8
  ffff88042ec0b990 00000000000001c7 00007f1099445f4d 000008bd000008bd
  0000004835d3bc7a 0000000000000000 0000000000000051 000000000000002a
 Call Trace:
  <NMI>
  [<ffffffff8100f5b9>] x86_pmu_handle_irq+0xc6/0x16a
  [<ffffffff8145db68>] perf_event_nmi_handler+0x19/0x1b
  [<ffffffff8145d4b2>] nmi_handle+0x4d/0x71
  [<ffffffff8145d57e>] do_nmi+0xa8/0x2b9
  [<ffffffff8145cce2>] end_repeat_nmi+0x1a/0x1e
  <<EOE>>
 Code: c9 c3 90 90 31 d2 83 bf 1c 01 00 00 30 55 44 8b 0d 83 c4 6f 00 48 89 e5 53 74 49 48 8d b7 60 01 00 00 4c 8b 06 8b 8f 18 01 00 00 <0f> 33 48 c1 e2 20 89 c0 48 09 c2 4c 89 c0 48 0f b1 16 4c 39 c0
 RIP  [<ffffffff8100dc33>] x86_perf_event_update+0x27/0x66
  RSP <ffff88042ec07d90>
 ---[ end trace 69da2f3afd06ff37 ]---
 Kernel panic - not syncing: Fatal exception in interrupt

Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 000a474..e7540c8 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -823,7 +823,7 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	} else {
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
-		hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
+		hwc->event_base_rdpmc = hwc->idx;
 	}
 }
 
-- 
1.7.8.4



-- 
Advanced Micro Devices, Inc.
Operating System Research Center



* Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
  2012-02-27 16:18     ` Peter Zijlstra
@ 2012-02-27 17:04       ` Vince Weaver
  0 siblings, 0 replies; 8+ messages in thread
From: Vince Weaver @ 2012-02-27 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

On Mon, 27 Feb 2012, Peter Zijlstra wrote:

> > I couldn't find another usable rdpmc() call in the kernel.  Should I add 
> > one?  I admit I hadn't thought that this might break VMs not expecting
> > rdpmc calls from the kernel. 
> 
> arch/x86/include/asm/msr.h
> 
> #define rdpmc(counter, low, high)			\
> do {						\
>         u64 _l = native_read_pmc((counter));	\
>         (low)  = (u32)_l;			\
>         (high) = (u32)(_l >> 32);		\
> } while (0)

Ugh, I did multiple recursive greps and definitely searched msr.h by
hand several times looking for that, and somehow missed it each time.

I'll update the patch.

Vince



* Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
  2012-02-27 16:06   ` Vince Weaver
@ 2012-02-27 16:18     ` Peter Zijlstra
  2012-02-27 17:04       ` Vince Weaver
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2012-02-27 16:18 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, mingo, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

On Mon, 2012-02-27 at 11:06 -0500, Vince Weaver wrote:
> > > +   new_raw_count=native_read_pmc(hwc->event_base_rdpmc);
> > 
> > That really wants to be rdpmc(), bypassing paravirt like that isn't
> > nice.
> 
> I couldn't find another usable rdpmc() call in the kernel.  Should I add 
> one?  I admit I hadn't thought that this might break VMs not expecting
> rdpmc calls from the kernel. 

arch/x86/include/asm/msr.h

#define rdpmc(counter, low, high)			\
do {						\
        u64 _l = native_read_pmc((counter));	\
        (low)  = (u32)_l;			\
        (high) = (u32)(_l >> 32);		\
} while (0)
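
For reference, the call site would then look roughly like this (which is
what the v2 patch at the top of this thread ends up doing):

	u32 low, high;

	rdpmc(hwc->event_base_rdpmc, low, high);
	new_raw_count = ((u64)high << 32) | low;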



* Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
  2012-02-27 11:39 ` Peter Zijlstra
@ 2012-02-27 16:06   ` Vince Weaver
  2012-02-27 16:18     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Vince Weaver @ 2012-02-27 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

On Mon, 27 Feb 2012, Peter Zijlstra wrote:

> On Mon, 2012-02-20 at 17:38 -0500, Vince Weaver wrote:
> 
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 5adce10..5550047 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
> >  	 */
> >  again:
> >  	prev_raw_count = local64_read(&hwc->prev_count);
> > -	rdmsrl(hwc->event_base, new_raw_count);
> > +	new_raw_count=native_read_pmc(hwc->event_base_rdpmc);
> 
> That really wants to be rdpmc(), bypassing paravirt like that isn't
> nice.

I couldn't find another usable rdpmc() call in the kernel.  Should I add 
one?  I admit I hadn't thought that this might break VMs not expecting
rdpmc calls from the kernel.

> > index abb2776..432ac69 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -562,6 +562,7 @@ struct hw_perf_event {
> >  			u64		last_tag;
> >  			unsigned long	config_base;
> >  			unsigned long	event_base;
> > +			unsigned long	event_base_rdpmc;
> 
> We could make that unsigned int; the SDM explicitly states that
> rdmsr/wrmsr/rdpmc take ECX (and ignore the upper 32 bits of RCX).

OK.
 
> >  			int		idx;
> >  			int		last_cpu;
> >  			struct hw_perf_event_extra extra_reg;
> 
> But yeah, avoiding the extra variable will add a conditional and extra
> instructions to the rdpmc path.

Yes, I couldn't think of a good way to get rid of the extra field
without adding extra conditional branches in key code paths.

Mostly that's the fault of the Intel fixed counters; otherwise you could
possibly fix things by cleverly adding or subtracting off the msr_base
to get the counter number.
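
To make the tricky part concrete, here is a rough sketch (assuming the
usual Intel layout, with general counters at MSR 0xc1 + idx and fixed
counters at MSR 0x309 + n; rdpmc_idx is a hypothetical local):

	/* general-purpose counter: the rdpmc index falls out of a
	 * simple subtraction from the counter MSR address */
	rdpmc_idx = hwc->event_base - MSR_ARCH_PERFMON_PERFCTR0;

	/* fixed counter n: the MSR is 0x309 + n, but rdpmc wants
	 * (n | 1<<30), so no single base you add or subtract covers
	 * both cases without a conditional */
	rdpmc_idx = (hwc->event_base - MSR_ARCH_PERFMON_FIXED_CTR0) | 1<<30;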

Vince



* Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
  2012-02-20 22:38 [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel Vince Weaver
@ 2012-02-27 11:39 ` Peter Zijlstra
  2012-02-27 16:06   ` Vince Weaver
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2012-02-27 11:39 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, mingo, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

On Mon, 2012-02-20 at 17:38 -0500, Vince Weaver wrote:

> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 5adce10..5550047 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
>  	 */
>  again:
>  	prev_raw_count = local64_read(&hwc->prev_count);
> -	rdmsrl(hwc->event_base, new_raw_count);
> +	new_raw_count=native_read_pmc(hwc->event_base_rdpmc);

That really wants to be rdpmc(), bypassing paravirt like that isn't
nice.

>  
>  	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
>  					new_raw_count) != prev_raw_count)
> @@ -768,9 +768,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
>  	} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
>  		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
>  		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
> +		hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
>  	} else {
>  		hwc->config_base = x86_pmu_config_addr(hwc->idx);
>  		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
> +		hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
>  	}
>  }
>  
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index abb2776..432ac69 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -562,6 +562,7 @@ struct hw_perf_event {
>  			u64		last_tag;
>  			unsigned long	config_base;
>  			unsigned long	event_base;
> +			unsigned long	event_base_rdpmc;

We could make that unsigned int; the SDM explicitly states that
rdmsr/wrmsr/rdpmc take ECX (and ignore the upper 32 bits of RCX).

>  			int		idx;
>  			int		last_cpu;
>  			struct hw_perf_event_extra extra_reg;

But yeah, avoiding the extra variable will add a conditional and extra
instructions to the rdpmc path.


* [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel
@ 2012-02-20 22:38 Vince Weaver
  2012-02-27 11:39 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Vince Weaver @ 2012-02-20 22:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, a.p.zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Stephane Eranian

[-- Attachment #1: Type: text/plain, Size: 2431 bytes --]

Hello

The perfctr interface uses rdpmc rather than rdmsr when possible in the
kernel, as rdpmc tends to have lower latencies. (One can look in
etc/costs in the perfctr-2.6 package to see a historical list of the
overheads.)

I have done some tests on a 3.2 kernel:

                    rdmsr            rdpmc
Core2 T9900:      203.9 cycles     30.9 cycles
AMD fam0fh:        56.2 cycles      9.8 cycles
Atom 6/28/2:      129.7 cycles     50.6 cycles

As you can see, the speedup from using rdpmc is large, although granted
it really is a drop in the bucket compared to the other overheads
involved.

I've attached a kernel module that can be used to test this on your own 
personal x86 machine.

Below is a patch that changes perf_event to use rdpmc rather than rdmsr 
when possible.  It's probably possible (and desirable) to do this without 
requiring a new field in the hw_perf_event structure, but the fixed events 
make this tricky.

Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5adce10..5550047 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	 */
 again:
 	prev_raw_count = local64_read(&hwc->prev_count);
-	rdmsrl(hwc->event_base, new_raw_count);
+	new_raw_count=native_read_pmc(hwc->event_base_rdpmc);
 
 	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
 					new_raw_count) != prev_raw_count)
@@ -768,9 +768,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
 		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
+		hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
 	} else {
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
+		hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index abb2776..432ac69 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -562,6 +562,7 @@ struct hw_perf_event {
 			u64		last_tag;
 			unsigned long	config_base;
 			unsigned long	event_base;
+			unsigned long	event_base_rdpmc;
 			int		idx;
 			int		last_cpu;
 			struct hw_perf_event_extra extra_reg;

[-- Attachment #2: Makefile --]
[-- Type: text/plain, Size: 197 bytes --]

obj-m := rdpmc-module.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules

#clean:	
#	rm -f *.o *.ko *~ *.symvers *.order

[-- Attachment #3: rdpmc-module.c --]
[-- Type: text/x-csrc; name="rdpmc-module.c", Size: 7352 bytes --]

/* Code taken from perfctr by Mikael Pettersson */
/* http://user.it.uu.se/~mikpe/linux/perfctr/   */
/* Slightly modified by Vince Weaver <vweaver1@eecs.utk.edu> */

#include <linux/module.h>
#include <linux/init.h>

#include <asm/msr.h>

#define MSR_P5_CESR             0x11
#define MSR_P5_CTR0             0x12
#define P5_CESR_VAL             (0x16 | (3<<6))

#define P6_EVNTSEL0_VAL         (0xC0 | (3<<16) | (1<<22))

#define K7_EVNTSEL0_VAL         (0xC0 | (3<<16) | (1<<22))

#define MSR_P4_IQ_COUNTER0      0x30C
#define P4_CRU_ESCR0_VAL        ((2<<25) | (1<<9) | (0x3<<2))
#define P4_IQ_CCCR0_VAL         ((0x3<<16) | (4<<13) | (1<<12))

#define CORE2_PMC_FIXED_CTR0    ((1<<30) | 0)


#define NITER   64
#define X2(S)   S";"S
#define X8(S)   X2(X2(X2(S)))
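/* X2() pastes a statement after itself, so X8("rdpmc") expands to a
 * string of eight back-to-back rdpmc instructions; the do_*() loops
 * below thus execute NITER instructions in NITER/8 iterations. */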

#ifdef __x86_64__
#define CR4MOV  "movq"
#else
#define CR4MOV  "movl"
#endif


#define rdtsc_low(low) \
  __asm__ __volatile__("rdtsc" : "=a"(low) : : "edx")


static void __init do_rdpmc(unsigned pmc, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8("rdpmc") : : "c"(pmc) : "eax", "edx");
}

static void __init do_rdmsr(unsigned msr, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8("rdmsr") : : "c"(msr) : "eax", "edx");
}

static void __init do_wrmsr(unsigned msr, unsigned data)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8("wrmsr") : : "c"(msr), "a"(data), "d"(0));
}

static void __init do_rdcr4(unsigned unused1, unsigned unused2)
{
  unsigned i;
  unsigned long dummy;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8(CR4MOV" %%cr4,%0") : "=r"(dummy));
}

static void __init do_wrcr4(unsigned cr4, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8(CR4MOV" %0,%%cr4") : : "r"((long)cr4));
}

static void __init do_rdtsc(unsigned unused1, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__(X8("rdtsc") : : : "eax", "edx");
}

#if 0
static void __init do_wrlvtpc(unsigned val, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i) {
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
    apic_write(APIC_LVTPC, val);
  }
}
#endif

static void __init do_sync_core(unsigned unused1, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i) {
    sync_core();
    sync_core();
    sync_core();
    sync_core();
    sync_core();
    sync_core();
    sync_core();
    sync_core();
  }
}


static void __init do_empty_loop(unsigned unused1, unsigned unused2)
{
  unsigned i;
  for(i = 0; i < NITER/8; ++i)
    __asm__ __volatile__("" : : "c"(0));
}

static unsigned __init run(void (*doit)(unsigned, unsigned),
                           unsigned arg1, unsigned arg2)
{
  unsigned start, stop;
  sync_core();
  rdtsc_low(start);
  (*doit)(arg1, arg2);    /* should take < 2^32 cycles to complete */
  sync_core();
  rdtsc_low(stop);
  return stop - start;
}




static void __init
measure_overheads(unsigned msr_evntsel0, unsigned evntsel0, unsigned msr_perfctr0,
                  unsigned msr_cccr, unsigned cccr_val, unsigned is_core2)
{
  int i;
  unsigned int loop, ticks[14];
  const char *name[14];

  if( msr_evntsel0 )
    wrmsr(msr_evntsel0, 0, 0);
  if( msr_cccr )
    wrmsr(msr_cccr, 0, 0);

  name[0] = "rdtsc";
  ticks[0] = run(do_rdtsc, 0, 0);
  name[1] = "rdpmc";
  ticks[1] = run(do_rdpmc,1,0);
  name[2] = "rdmsr (counter)";
  ticks[2] = msr_perfctr0 ? run(do_rdmsr, msr_perfctr0, 0) : 0;
  name[3] = msr_cccr ? "rdmsr (escr)" : "rdmsr (evntsel)";
  ticks[3] = msr_evntsel0 ? run(do_rdmsr, msr_evntsel0, 0) : 0;
  name[4] = "wrmsr (counter)";
  ticks[4] = msr_perfctr0 ? run(do_wrmsr, msr_perfctr0, 0) : 0;
  name[5] = msr_cccr ? "wrmsr (escr)" : "wrmsr (evntsel)";
  ticks[5] = msr_evntsel0 ? run(do_wrmsr, msr_evntsel0, evntsel0) : 0;
  name[6] = "read cr4";
  ticks[6] = run(do_rdcr4, 0, 0);
  name[7] = "write cr4";
  ticks[7] = run(do_wrcr4, read_cr4(), 0);
  name[8] = "rdpmc (fast)";
  ticks[8] = msr_cccr ? run(do_rdpmc, 0x80000001, 0) : 0;
  name[9] = "rdmsr (cccr)";
  ticks[9] = msr_cccr ? run(do_rdmsr, msr_cccr, 0) : 0;
  name[10] = "wrmsr (cccr)";
  ticks[10] = msr_cccr ? run(do_wrmsr, msr_cccr, cccr_val) : 0;
  name[11] = "sync_core";
  ticks[11] = run(do_sync_core, 0, 0);
  name[12] = "read fixed_ctr0";
  ticks[12] = is_core2 ? run(do_rdpmc, CORE2_PMC_FIXED_CTR0, 0) : 0;
  name[13] = "wrmsr fixed_ctr_ctrl";
  ticks[13] = is_core2 ? run(do_wrmsr, MSR_CORE_PERF_FIXED_CTR_CTRL, 0) : 0;
  /*
  name[14] = "write LVTPC";
  ticks[14] = (perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT)
    ? run(do_wrlvtpc, APIC_DM_NMI|APIC_LVT_MASKED, 0) : 0;
  */

  loop = run(do_empty_loop, 0, 0);

  if( msr_evntsel0 )
    wrmsr(msr_evntsel0, 0, 0);
  if( msr_cccr )
    wrmsr(msr_cccr, 0, 0);

  printk(KERN_INFO "COUNTER_OVERHEAD: NITER == %u\n", NITER);
  printk(KERN_INFO "COUNTER_OVERHEAD: loop overhead is %u cycles\n", loop);
  for(i = 0; i < ARRAY_SIZE(ticks); ++i) {
    unsigned int x;
    if( !ticks[i] )
      continue;
    x = ((ticks[i] - loop) * 10) / NITER;
    printk(KERN_INFO "COUNTER_OVERHEAD: %s cost is %u.%u cycles (%u total)\n",
	   name[i], x/10, x%10, ticks[i]);
  }
}


static inline void perfctr_p5_init_tests(void)
{
  measure_overheads(MSR_P5_CESR, P5_CESR_VAL, MSR_P5_CTR0, 0, 0, 0);
}

static inline void perfctr_p6_init_tests(void)
{
  measure_overheads(MSR_P6_EVNTSEL0, P6_EVNTSEL0_VAL, MSR_P6_PERFCTR0, 0, 
		    0, 0);
}

static inline void perfctr_core2_init_tests(void)
{
  measure_overheads(MSR_P6_EVNTSEL0, P6_EVNTSEL0_VAL, MSR_P6_PERFCTR0, 0, 
		    0, 1);
}

static inline void perfctr_p4_init_tests(void)
{
  measure_overheads(MSR_P4_CRU_ESCR0, P4_CRU_ESCR0_VAL, MSR_P4_IQ_COUNTER0,
		    MSR_P4_IQ_CCCR0, P4_IQ_CCCR0_VAL, 0);
}



static inline void perfctr_k7_init_tests(void)
{
  measure_overheads(MSR_K7_EVNTSEL0, K7_EVNTSEL0_VAL, MSR_K7_PERFCTR0, 0, 0, 0);
}

static inline void perfctr_generic_init_tests(void)
{
  measure_overheads(0, 0, 0, 0, 0, 0);
}




static int __init mymodule_init(void)
{

  printk(KERN_INFO "COUNTER_OVERHEAD: vendor %u, family %u, model %u, stepping %u\n",
	 boot_cpu_data.x86_vendor,
	 boot_cpu_data.x86,
	 boot_cpu_data.x86_model,
	 boot_cpu_data.x86_mask);

  if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
     perfctr_k7_init_tests();
  }
  else if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
     /* dispatch on CPU family (boot_cpu_data.x86), not model:
        P5 is family 5, P4 is family 15; within family 6, models
        before Core2 (model 15) lack the fixed counters that the
        core2 tests poke at */
     if (boot_cpu_data.x86==5) perfctr_p5_init_tests();
     else if (boot_cpu_data.x86==15) perfctr_p4_init_tests();
     else if (boot_cpu_data.x86==6 && boot_cpu_data.x86_model<15)
        perfctr_p6_init_tests();
     else perfctr_core2_init_tests();
  }
  else {
     perfctr_generic_init_tests();
  }
  return 0;
}

static void __exit mymodule_exit(void)
{
  printk(KERN_INFO "COUNTER_OVERHEAD: Unloading.\n");
}

module_init(mymodule_init);
module_exit(mymodule_exit);

MODULE_LICENSE("GPL");
