* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
@ 2010-01-27 16:45 Dirk Behme
2010-01-28 13:03 ` Catalin Marinas
2010-01-29 12:17 ` Leif Lindholm
0 siblings, 2 replies; 7+ messages in thread
From: Dirk Behme @ 2010-01-27 16:45 UTC
To: linux-arm-kernel
On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
[1]) by adding a nop to __delay():
--- a/arch/arm/lib/delay.S
+++ b/arch/arm/lib/delay.S
@@ -41,6 +41,9 @@ ENTRY(__const_udelay) @ 0 <= r0 <= 0x
@ Delay routine
ENTRY(__delay)
+#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
+ nop
+#endif
subs r0, r0, #1
#if 0
movls pc, lr
Any ideas what might happen here?
Many thanks and best regards
Dirk
[1] 2.6.32 without and with additional nop in __delay():
====> Clean 2.6.32 without nop in __delay():
...
Calibrating delay loop... 159.74 BogoMIPS (lpj=798720)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
Calibrating local timer... 199.98MHz.
CPU1: Booted secondary processor
Calibrating delay loop... 159.33 BogoMIPS (lpj=796672)
CPU2: Booted secondary processor
Calibrating delay loop... 159.74 BogoMIPS (lpj=798720)
Brought up 3 CPUs
SMP: Total of 3 processors activated (478.82 BogoMIPS).
...
Disassembly:
|@ Delay routine
|ENTRY(__delay)
C0940600|E2500001 __delay: subs r0,r0,#0x1
C0940604|8AFFFFFD bhi 0xC0940600 ; __delay
C0940608|E1A0F00E cpy pc,r14
====> With an additional nop in __delay():
...
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
Calibrating local timer... 199.97MHz.
CPU1: Booted secondary processor
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
CPU2: Booted secondary processor
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
Brought up 3 CPUs
SMP: Total of 3 processors activated (1196.85 BogoMIPS).
...
Disassembly:
|@ Delay routine
|ENTRY(__delay)
|#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
C0940600|E320F000 __delay: nop
|#endif
C0940604|E2500001 subs r0,r0,#0x1
C0940608|8AFFFFFC bhi 0xC0940600 ; __delay
C094060C|E1A0F00E cpy pc,r14
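As a cross-check of the figures above, here is a minimal sketch that recomputes the printed BogoMIPS values from lpj and estimates the cycles spent per loop iteration. HZ = 100 is an assumption inferred from the log values (it is not stated in this thread); the 400MHz core clock is taken from the report. The helper names are purely illustrative.

```python
# Sketch only. HZ = 100 is an assumption inferred from the logs above.
HZ = 100
CPU_HZ = 400_000_000  # 400MHz core clock, as stated in the report


def bogomips(lpj, hz=HZ):
    """Return BogoMIPS as 'whole.frac' with two truncated decimals."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"


def cycles_per_loop(lpj, hz=HZ, cpu_hz=CPU_HZ):
    """Estimate CPU cycles per __delay() iteration from lpj."""
    loops_per_second = lpj * hz
    return cpu_hz / loops_per_second


for lpj in (798720, 796672, 1994752):
    print(lpj, bogomips(lpj), round(cycles_per_loop(lpj), 2))
```

Under these assumptions the loop costs roughly 5 cycles per iteration without the nop and roughly 2 cycles with it, even though the second variant executes one more instruction per iteration.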
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
2010-01-27 16:45 ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj Dirk Behme
@ 2010-01-28 13:03 ` Catalin Marinas
2010-01-29 5:08 ` Shilimkar, Santosh
2010-01-29 12:17 ` Leif Lindholm
1 sibling, 1 reply; 7+ messages in thread
From: Catalin Marinas @ 2010-01-28 13:03 UTC
To: linux-arm-kernel
On Wed, 2010-01-27 at 16:45 +0000, Dirk Behme wrote:
> On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> [1]) by adding a nop to __delay():
>
> --- a/arch/arm/lib/delay.S
> +++ b/arch/arm/lib/delay.S
> @@ -41,6 +41,9 @@ ENTRY(__const_udelay) @ 0 <= r0 <= 0x
> @ Delay routine
> ENTRY(__delay)
> +#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
> + nop
> +#endif
> subs r0, r0, #1
> #if 0
> movls pc, lr
>
> Any ideas what might happen here?
Branch (mis-)prediction? Alignment?
It doesn't really matter, bogomips should not be used as some form of
performance checking.
BTW, local timers give a more accurate estimate of the CPU frequency
(they are counting at half this frequency).
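As a quick worked example of that relation, the sketch below doubles the "Calibrating local timer" values from the two boot logs in the original report. The function name is illustrative only.

```python
def cpu_mhz_from_local_timer(timer_mhz):
    """Local timers count at half the CPU frequency, so double the reading."""
    return 2 * timer_mhz


# Values from the two boot logs in the original report.
for timer_mhz in (199.98, 199.97):
    print(timer_mhz, "->", round(cpu_mhz_from_local_timer(timer_mhz), 2), "MHz")
```

Both readings land within 0.1MHz of the nominal 400MHz, independent of whether the nop is present.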
--
Catalin
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
2010-01-28 13:03 ` Catalin Marinas
@ 2010-01-29 5:08 ` Shilimkar, Santosh
0 siblings, 0 replies; 7+ messages in thread
From: Shilimkar, Santosh @ 2010-01-29 5:08 UTC
To: linux-arm-kernel
> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel-
> bounces at lists.infradead.org] On Behalf Of Catalin Marinas
> Sent: Thursday, January 28, 2010 6:33 PM
> To: Dirk Behme
> Cc: linux-arm-kernel at lists.infradead.org
> Subject: Re: ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
>
> On Wed, 2010-01-27 at 16:45 +0000, Dirk Behme wrote:
> > On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> > 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> > [1]) by adding a nop to __delay():
> >
> > --- a/arch/arm/lib/delay.S
> > +++ b/arch/arm/lib/delay.S
> > @@ -41,6 +41,9 @@ ENTRY(__const_udelay) @ 0 <= r0 <= 0x
> > @ Delay routine
> > ENTRY(__delay)
> > +#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
> > + nop
> > +#endif
> > subs r0, r0, #1
> > #if 0
> > movls pc, lr
> >
> > Any ideas what might happen here?
>
> Branch (mis-)prediction? Alignment?
>
> It doesn't really matter, bogomips should not be used as some form of
> performance checking.
>
> BTW, local timers give a more accurate estimate of the CPU frequency
> (they are counting at half this frequency).
Last time I experimented with this, the data I got for the A9 was that "loop
prediction" makes this faster on hardware that supports fast loop mode.
It is a feature of the Cortex-A9 pipeline that enables it to spot short loops
like "BHI {pc}-4 ; 0x100 **" and just store them in the pipeline queue, to be
dispatched from there rather than being fetched from the I-cache every
time.
Regards,
Santosh
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
2010-01-27 16:45 ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj Dirk Behme
2010-01-28 13:03 ` Catalin Marinas
@ 2010-01-29 12:17 ` Leif Lindholm
1 sibling, 0 replies; 7+ messages in thread
From: Leif Lindholm @ 2010-01-29 12:17 UTC
To: linux-arm-kernel
> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-
> kernel-bounces at lists.infradead.org] On Behalf Of Dirk Behme
> Sent: 27 January 2010 16:45
> On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> [1]) by adding a nop to __delay():
The reason for this is that the ARM11 MPCore doesn't fold branch
instructions for busy-wait-style loops. Inserting the nop (or any other
non-branch instruction) removes the branch instruction from the
execution stream.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/ch06s02s04.html
But as Catalin says, this makes no functional difference, even though it
might look "more impressive" :)
/
Leif
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
2010-01-29 15:17 ` Leif Lindholm
@ 2010-01-29 15:26 ` Russell King - ARM Linux
0 siblings, 0 replies; 7+ messages in thread
From: Russell King - ARM Linux @ 2010-01-29 15:26 UTC
To: linux-arm-kernel
On Fri, Jan 29, 2010 at 03:17:59PM -0000, Leif Lindholm wrote:
> > -----Original Message-----
> > From: Uwe Kleine-König [mailto:u.kleine-koenig at pengutronix.de]
> > Sent: 29 January 2010 14:54
>
> > > But as Catalin says, this makes no functional difference, even though
> > > it might look "more impressive" :)
> >
> > With a bigger number of loops_per_seconds the maximal period that can
> > be delayed for is shorter and the granularity is better, no?
>
> I concur that this aspect of my comment was technically incorrect - but if
> your code was depending on either of those, surely that would be a problem
> in itself?
>
> The minimum time would be questionable anyway, as loops_per_second is an
> average, and the branch predictor is fairly likely to miss on the first time
> through the loop unless your code uses a lot of delays.
Not only that, but it takes a certain number of cycles to calculate the
number of loops for any given delay; the smaller the delay, the more
significant that is. For very short delays (e.g. one loop), it
swamps the loop itself.
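To illustrate how a fixed setup cost dominates short delays, here is a small sketch. The setup cost of 20 cycles is a purely hypothetical number chosen for illustration (nothing in this thread measures it), and 2 cycles per loop is the approximate figure implied by the lpj reported with the nop.

```python
def overhead_fraction(loops, cycles_per_loop, setup_cycles):
    """Fraction of the total delay spent outside the delay loop itself."""
    loop_cycles = loops * cycles_per_loop
    return setup_cycles / (setup_cycles + loop_cycles)


SETUP_CYCLES = 20      # hypothetical cost of computing the loop count
CYCLES_PER_LOOP = 2    # approximate, implied by the lpj seen with the nop

for loops in (1, 10, 100, 1000):
    share = overhead_fraction(loops, CYCLES_PER_LOOP, SETUP_CYCLES)
    print(f"{loops:5d} loops: {share:.1%} of the delay is setup")
```

With these made-up numbers a one-loop delay is about 91% setup, while a 1000-loop delay is about 1% setup.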
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
2010-01-29 14:54 Uwe Kleine-König
@ 2010-01-29 15:17 ` Leif Lindholm
2010-01-29 15:26 ` Russell King - ARM Linux
0 siblings, 1 reply; 7+ messages in thread
From: Leif Lindholm @ 2010-01-29 15:17 UTC
To: linux-arm-kernel
> -----Original Message-----
> From: Uwe Kleine-König [mailto:u.kleine-koenig at pengutronix.de]
> Sent: 29 January 2010 14:54
> > But as Catalin says, this makes no functional difference, even though
> > it might look "more impressive" :)
>
> With a bigger number of loops_per_seconds the maximal period that can
> be delayed for is shorter and the granularity is better, no?
I concur that this aspect of my comment was technically incorrect - but if
your code was depending on either of those, surely that would be a problem
in itself?
The minimum time would be questionable anyway, as loops_per_second is an
average, and the branch predictor is fairly likely to miss on the first time
through the loop unless your code uses a lot of delays.
/
Leif
* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
@ 2010-01-29 14:54 Uwe Kleine-König
2010-01-29 15:17 ` Leif Lindholm
0 siblings, 1 reply; 7+ messages in thread
From: Uwe Kleine-König @ 2010-01-29 14:54 UTC
To: linux-arm-kernel
Hello,
On Fri, Jan 29, 2010 at 12:17:18PM -0000, Leif Lindholm wrote:
> > -----Original Message-----
> > From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-
> > kernel-bounces at lists.infradead.org] On Behalf Of Dirk Behme
> > Sent: 27 January 2010 16:45
>
> > On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> > 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> > [1]) by adding a nop to __delay():
>
> The reason for this is that the ARM11 MPCore doesn't fold branch
> instructions for busy-wait-style loops. Inserting the nop (or any other
> non-branch instruction) removes the branch instruction from the
> execution stream.
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/ch06s02s04.html
>
> But as Catalin says, this makes no functional difference, even though it
> might look "more impressive" :)
With a bigger number of loops_per_seconds the maximal period that can be
delayed for is shorter and the granularity is better, no?
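To put rough numbers on this, here is a sketch using the two lpj values from the original report. HZ = 100 is an assumption inferred from the reported BogoMIPS figures, and treating the limit as a plain 32-bit loop counter passed to __delay() is a simplification; the real udelay() path has its own, tighter limits.

```python
HZ = 100  # assumption inferred from the reported lpj/BogoMIPS pairs


def ns_per_loop(lpj, hz=HZ):
    """Delay granularity: nanoseconds per loop iteration."""
    return 1e9 / (lpj * hz)


def max_delay_seconds(lpj, hz=HZ, counter_bits=32):
    """Longest delay a counter of counter_bits could express (simplified)."""
    return (2 ** counter_bits) / (lpj * hz)


for label, lpj in (("without nop", 798720), ("with nop", 1994752)):
    print(f"{label}: {ns_per_loop(lpj):.2f} ns/loop, "
          f"max {max_delay_seconds(lpj):.1f} s")
```

Under these assumptions the granularity improves from about 12.5 ns to about 5 ns per loop, while the longest expressible delay shrinks from about 54 s to about 22 s.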
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Thread overview: 7+ messages
2010-01-27 16:45 ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj Dirk Behme
2010-01-28 13:03 ` Catalin Marinas
2010-01-29 5:08 ` Shilimkar, Santosh
2010-01-29 12:17 ` Leif Lindholm
2010-01-29 14:54 Uwe Kleine-König
2010-01-29 15:17 ` Leif Lindholm
2010-01-29 15:26 ` Russell King - ARM Linux