* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
@ 2010-01-27 16:45 Dirk Behme
  2010-01-28 13:03 ` Catalin Marinas
  2010-01-29 12:17 ` Leif Lindholm
  0 siblings, 2 replies; 7+ messages in thread
From: Dirk Behme @ 2010-01-27 16:45 UTC (permalink / raw)
  To: linux-arm-kernel


On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
2.6.32 we found that BogoMIPS/loops per jiffy (lpj) more than double
(roughly 2.5x, see [1] below) when a nop is added to __delay():

--- a/arch/arm/lib/delay.S
+++ b/arch/arm/lib/delay.S
@@ -41,6 +41,9 @@ ENTRY(__const_udelay)    @ 0 <= r0 <= 0x
  @ Delay routine
  ENTRY(__delay)
+#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
+        nop
+#endif
          subs    r0, r0, #1
  #if 0
          movls    pc, lr

Any ideas what might be happening here?

Many thanks and best regards

Dirk

[1] 2.6.32 without and with additional nop in __delay():

====> Clean 2.6.32 without nop in __delay():

...
Calibrating delay loop... 159.74 BogoMIPS (lpj=798720)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
Calibrating local timer... 199.98MHz.
CPU1: Booted secondary processor
Calibrating delay loop... 159.33 BogoMIPS (lpj=796672)
CPU2: Booted secondary processor
Calibrating delay loop... 159.74 BogoMIPS (lpj=798720)
Brought up 3 CPUs
SMP: Total of 3 processors activated (478.82 BogoMIPS).
...

Disassembly:

         |@ Delay routine
         |ENTRY(__delay)
C0940600|E2500001  __delay:  subs    r0,r0,#0x1
C0940604|8AFFFFFD        bhi     0xC0940600       ; __delay
C0940608|E1A0F00E        cpy     pc,r14



====> With an additional nop in __delay():

...
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
Calibrating local timer... 199.97MHz.
CPU1: Booted secondary processor
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
CPU2: Booted secondary processor
Calibrating delay loop... 398.95 BogoMIPS (lpj=1994752)
Brought up 3 CPUs
SMP: Total of 3 processors activated (1196.85 BogoMIPS).
...

Disassembly:

         |@ Delay routine
         |ENTRY(__delay)
         |#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
C0940600|E320F000  __delay:  nop
         |#endif
C0940604|E2500001        subs    r0,r0,#0x1
C0940608|8AFFFFFC        bhi     0xC0940600       ; __delay
C094060C|E1A0F00E        cpy     pc,r14
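
For reference, the BogoMIPS figures in the two logs follow directly from
lpj. A minimal sketch of that arithmetic in C, assuming HZ=100 (not stated
in the boot log, but consistent with the printed values); the helper names
are illustrative and not kernel API:

```c
#include <assert.h>
#include <stdio.h>

/* Integer part of the printed BogoMIPS value: lpj / (500000 / HZ). */
unsigned long bogomips_whole(unsigned long lpj, unsigned long hz)
{
    assert(hz > 0 && hz <= 5000);   /* keeps both divisors non-zero */
    return lpj / (500000 / hz);
}

/* Two-digit fractional part: (lpj / (5000 / HZ)) % 100. */
unsigned long bogomips_frac(unsigned long lpj, unsigned long hz)
{
    assert(hz > 0 && hz <= 5000);
    return (lpj / (5000 / hz)) % 100;
}

/* Format the value the way the calibration message shows it. */
void format_bogomips(char *buf, size_t len,
                     unsigned long lpj, unsigned long hz)
{
    snprintf(buf, len, "%lu.%02lu BogoMIPS (lpj=%lu)",
             bogomips_whole(lpj, hz), bogomips_frac(lpj, hz), lpj);
}
```

With hz=100, lpj=798720 formats as "159.74 BogoMIPS (lpj=798720)" and
lpj=1994752 as "398.95 BogoMIPS (lpj=1994752)", matching the two logs.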


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
  2010-01-27 16:45 ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj Dirk Behme
@ 2010-01-28 13:03 ` Catalin Marinas
  2010-01-29  5:08   ` Shilimkar, Santosh
  2010-01-29 12:17 ` Leif Lindholm
  1 sibling, 1 reply; 7+ messages in thread
From: Catalin Marinas @ 2010-01-28 13:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-01-27 at 16:45 +0000, Dirk Behme wrote:
> On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> [1]) by adding a nop to __delay():
> 
> --- a/arch/arm/lib/delay.S
> +++ b/arch/arm/lib/delay.S
> @@ -41,6 +41,9 @@ ENTRY(__const_udelay)    @ 0 <= r0 <= 0x
>   @ Delay routine
>   ENTRY(__delay)
> +#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
> +        nop
> +#endif
>           subs    r0, r0, #1
>   #if 0
>           movls    pc, lr
> 
> Any ideas what might happen here?

Branch (mis-)prediction? Alignment?

It doesn't really matter; BogoMIPS should not be used as a form of
performance measurement.

BTW, local timers give a more accurate estimate of the CPU frequency
(they count at half the CPU frequency, so the 199.98MHz calibrated in
your log corresponds to roughly 400MHz).

-- 
Catalin


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
  2010-01-28 13:03 ` Catalin Marinas
@ 2010-01-29  5:08   ` Shilimkar, Santosh
  0 siblings, 0 replies; 7+ messages in thread
From: Shilimkar, Santosh @ 2010-01-29  5:08 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel-
> bounces at lists.infradead.org] On Behalf Of Catalin Marinas
> Sent: Thursday, January 28, 2010 6:33 PM
> To: Dirk Behme
> Cc: linux-arm-kernel at lists.infradead.org
> Subject: Re: ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
> 
> On Wed, 2010-01-27 at 16:45 +0000, Dirk Behme wrote:
> > On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> > 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> > [1]) by adding a nop to __delay():
> >
> > --- a/arch/arm/lib/delay.S
> > +++ b/arch/arm/lib/delay.S
> > @@ -41,6 +41,9 @@ ENTRY(__const_udelay)    @ 0 <= r0 <= 0x
> >   @ Delay routine
> >   ENTRY(__delay)
> > +#if defined(CONFIG_CPU_V6) && defined(CONFIG_SMP)
> > +        nop
> > +#endif
> >           subs    r0, r0, #1
> >   #if 0
> >           movls    pc, lr
> >
> > Any ideas what might happen here?
> 
> Branch (mis-)prediction? Alignment?
> 
> It doesn't really matter, bogomips should not be used as some form of
> performance checking.
> 
> BTW, local timers give a more accurate estimate of the CPU frequency
> (they are counting at half this frequency).

Last time I experimented with this, the data I got for the A9 was that
"loop prediction" makes this faster on hardware that supports fast loop
mode. It is a feature of the Cortex-A9 pipeline that enables it to spot
short loops like
"BHI      {pc}-4 ; 0x100  **" and just store them in the pipeline queue,
to be dispatched from there rather than being fetched from the I-cache
all the time.

Regards,
Santosh


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
  2010-01-27 16:45 ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj Dirk Behme
  2010-01-28 13:03 ` Catalin Marinas
@ 2010-01-29 12:17 ` Leif Lindholm
  1 sibling, 0 replies; 7+ messages in thread
From: Leif Lindholm @ 2010-01-29 12:17 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-
> kernel-bounces at lists.infradead.org] On Behalf Of Dirk Behme
> Sent: 27 January 2010 16:45

> On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> [1]) by adding a nop to __delay():

The reason for this is that the ARM11 MPCore doesn't fold branch
instructions for busy-wait-style loops. Inserting the nop (or any other
non-branch instruction) removes the branch instruction from the
execution stream.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/ch06s02s04.html

But as Catalin says, this makes no functional difference, even though it
might look "more impressive" :)
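
As a rough cross-check of this explanation, the lpj values from the boot
logs can be turned into CPU cycles per loop iteration. The sketch below
assumes HZ=100 and takes the CPU clock as the nominal 400MHz from the
original report; the function name is made up for illustration:

```c
#include <assert.h>

/* CPU cycles spent per __delay() loop iteration, from the calibrated lpj. */
double cycles_per_iteration(double cpu_hz, unsigned long lpj, unsigned long hz)
{
    assert(lpj > 0 && hz > 0);
    return cpu_hz / ((double)lpj * (double)hz);
}
```

Under those assumptions, lpj=798720 gives about 5.0 cycles per iteration
for the subs/bhi loop, and lpj=1994752 gives about 2.0 cycles for the
nop/subs/bhi loop. That would fit a loop in which the branch no longer
costs cycles of its own, but it is only arithmetic on the logged numbers,
not a pipeline measurement.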

/
	Leif


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
  2010-01-29 15:17 ` Leif Lindholm
@ 2010-01-29 15:26   ` Russell King - ARM Linux
  0 siblings, 0 replies; 7+ messages in thread
From: Russell King - ARM Linux @ 2010-01-29 15:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jan 29, 2010 at 03:17:59PM -0000, Leif Lindholm wrote:
> > -----Original Message-----
> > From: Uwe Kleine-König [mailto:u.kleine-koenig at pengutronix.de]
> > Sent: 29 January 2010 14:54
> 
> > > But as Catalin says, this makes no functional difference, even though
> > > it might look "more impressive" :)
> >
> > With a bigger number of loops_per_seconds the maximal period that can
> > be delayed for is shorter and the granularity is better, no?
> 
> I concur that this aspect of my comment was technically incorrect - but if
> your code was depending on either of those, surely that would be a problem
> in itself?
> 
> The minimum time would be questionable anyway, as loops_per_second is an
> average, and the branch predictor is fairly likely to miss on the first time
> through the loop unless your code uses a lot of delays.

Not only that, but it takes a certain number of cycles to calculate the
number of loops for any given delay; the smaller the delay the more
significant that is.  For very short delays (e.g. one loop) it
swamps the loop itself.
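
To put rough numbers on this, here is a simplified model in C. It is not
the kernel's udelay() arithmetic (which uses fixed-point scaling); HZ=100,
a cost of 5 cycles per loop and a fixed overhead of 20 cycles are
assumptions chosen purely for illustration, and the function names are
hypothetical:

```c
#include <assert.h>

/* Loop iterations for a delay of `usecs` microseconds (truncating). */
unsigned long long loops_for_usecs(unsigned long usecs,
                                   unsigned long lpj, unsigned long hz)
{
    return (unsigned long long)usecs * lpj * hz / 1000000ULL;
}

/* Share (in percent) of the total delay spent in fixed setup overhead. */
unsigned long long overhead_percent(unsigned long long overhead_cycles,
                                    unsigned long long loops,
                                    unsigned long long cycles_per_loop)
{
    unsigned long long total = overhead_cycles + loops * cycles_per_loop;

    assert(total > 0);
    return 100 * overhead_cycles / total;
}
```

In this model a 1us delay at lpj=798720 is 79 loops, where 20 cycles of
overhead are about 4% of the total; for a single loop the same overhead is
80% of the total.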


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
  2010-01-29 14:54 Uwe Kleine-König
@ 2010-01-29 15:17 ` Leif Lindholm
  2010-01-29 15:26   ` Russell King - ARM Linux
  0 siblings, 1 reply; 7+ messages in thread
From: Leif Lindholm @ 2010-01-29 15:17 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Uwe Kleine-König [mailto:u.kleine-koenig at pengutronix.de]
> Sent: 29 January 2010 14:54

> > But as Catalin says, this makes no functional difference, even though
> > it might look "more impressive" :)
>
> With a bigger number of loops_per_seconds the maximal period that can
> be delayed for is shorter and the granularity is better, no?

I concur that this aspect of my comment was technically incorrect - but if
your code was depending on either of those, surely that would be a problem
in itself?

The minimum time would be questionable anyway, as loops_per_second is an
average, and the branch predictor is fairly likely to miss on the first time
through the loop unless your code uses a lot of delays.
 
/
	Leif


* ARM11 MPCore: Adding nop to __delay() doubles the BogoMIPS/lpj
@ 2010-01-29 14:54 Uwe Kleine-König
  2010-01-29 15:17 ` Leif Lindholm
  0 siblings, 1 reply; 7+ messages in thread
From: Uwe Kleine-König @ 2010-01-29 14:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

On Fri, Jan 29, 2010 at 12:17:18PM -0000, Leif Lindholm wrote:
> > -----Original Message-----
> > From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-
> > kernel-bounces at lists.infradead.org] On Behalf Of Dirk Behme
> > Sent: 27 January 2010 16:45
> 
> > On a 400MHz ARM11 MPCore system (NEC NaviEngine based) with kernel
> > 2.6.32 we found that BogoMIPS/loops per jiffies ~doubles (see below
> > [1]) by adding a nop to __delay():
> 
> The reason for this is that the ARM11 MPCore doesn't fold branch
> instructions for busy-wait-style loops. Inserting the nop (or any other
> non-branch instruction) removes the branch instruction from the
> execution stream.
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0360f/ch06s02s04.html
> 
> But as Catalin says, this makes no functional difference, even though it
> might look "more impressive" :)

With a larger number of loops per second, the maximal period that can be
delayed for is shorter and the granularity is better, no?
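
Both effects can be estimated from the lpj values in the original report.
The sketch below assumes HZ=100 and treats the upper bound as that of a
raw 32-bit loop count handed to __delay(); udelay() has its own, tighter
argument limits, which are not modelled here. The function names are
illustrative only:

```c
#include <assert.h>

/* Duration of one loop iteration in picoseconds (truncating). */
unsigned long long loop_granularity_ps(unsigned long lpj, unsigned long hz)
{
    assert(lpj > 0 && hz > 0);
    return 1000000000000ULL / ((unsigned long long)lpj * hz);
}

/* Longest delay in milliseconds that a 32-bit loop count can express. */
unsigned long long max_delay_ms(unsigned long lpj, unsigned long hz)
{
    assert(lpj > 0 && hz > 0);
    return 0xFFFFFFFFULL * 1000ULL / ((unsigned long long)lpj * hz);
}
```

For lpj=798720 this gives roughly 12.5ns per iteration and a ceiling of
about 53.8s; for lpj=1994752, roughly 5.0ns and about 21.5s.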

Best regards
Uwe

-- 
Pengutronix e.K.                              | Uwe Kleine-König            |
Industrial Linux Solutions                    | http://www.pengutronix.de/  |


