* SMP spin-locks
@ 2001-06-14 17:26 Richard B. Johnson
  2001-06-14 17:32 ` David S. Miller
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Richard B. Johnson @ 2001-06-14 17:26 UTC (permalink / raw)
  To: Linux kernel



I __finally__ got back on "the list". They finally fixed the
company firewall!

During my absence, I had the chance to look at some SMP code
because of a performance problem (a few microseconds out of
spec on a 130 MHz embedded system) and I have a question about
the current spin-locks.

Spin-locks now transfer control to the .text.lock segment.
This is a separate segment that can be at an offset that
is far away from the currently executing code. That may
cause the cache to be reloaded. Further, each spin-lock
invocation generates separate code within that segment.

Question 1: Why?

Question 2: What is the purpose of the code sequence, "repz nop" 
generated by the spin-lock code? Is this a processor BUG work-around?
`as` doesn't "like" this sequence and, Intel doesn't seem to
document it.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.




* Re: SMP spin-locks
  2001-06-14 17:26 SMP spin-locks Richard B. Johnson
@ 2001-06-14 17:32 ` David S. Miller
  2001-06-14 17:35 ` Kurt Garloff
  2001-06-14 20:42 ` Roger Larsson
  2 siblings, 0 replies; 14+ messages in thread
From: David S. Miller @ 2001-06-14 17:32 UTC (permalink / raw)
  To: root; +Cc: Linux kernel


Richard B. Johnson writes:
 > Spin-locks now transfer control to the .text.lock segment.
 > This is a separate segment that can be at an offset that
 > is far away from the currently executing code. That may
 > cause the cache to be reloaded. Further, each spin-lock
 > invocation generates separate code within that segment.
 > 
 > Question 1: Why?

Because this increases the code density of the common
case, getting the lock immediately.

 > Question 2: What is the purpose of the code sequence, "repz nop" 
 > generated by the spin-lock code? Is this a processor BUG work-around?
 > `as` doesn't "like" this sequence and, Intel doesn't seem to
 > document it.

It is a hint to the processor that we are executing a spinlock loop
(it does something wrt. keeping the cacheline of the lock in shared
state when possible).

I believe it is documented in the Pentium 4 manuals, previous chips
ignore this sequence and treat it as a pure nop from what I understand.

Later,
David S. Miller
davem@redhat.com
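
A minimal user-space sketch of the two answers above, assuming GCC or
Clang and C11 atomics. The names sketch_lock_t, sketch_lock(),
sketch_lock_slowpath() and spin_hint() are hypothetical, and this is not
the kernel's implementation. The uncontended path stays small and inline,
the contended path sits in a separate out-of-line function, and the wait
loop issues "rep; nop" (the byte sequence later documented as PAUSE) on
x86:

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration only. */
typedef struct { atomic_flag held; } sketch_lock_t;
#define SKETCH_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin-wait hint: "rep; nop" on x86, a plain compiler barrier elsewhere. */
static inline void spin_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Contended case: kept out of line so callers do not carry the loop. */
__attribute__((noinline, cold))
static void sketch_lock_slowpath(sketch_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        spin_hint();
}

/* Common case: one atomic test-and-set and a rarely taken branch. */
static inline void sketch_lock(sketch_lock_t *l)
{
    if (__builtin_expect(
            atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire), 0))
        sketch_lock_slowpath(l);
}

static inline void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Every call site pays only for the short inline sequence; the loop exists
once, away from the hot code, which is the code-density argument made
above.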


* Re: SMP spin-locks
  2001-06-14 17:26 SMP spin-locks Richard B. Johnson
  2001-06-14 17:32 ` David S. Miller
@ 2001-06-14 17:35 ` Kurt Garloff
  2001-06-15  6:51   ` Doug Ledford
  2001-06-14 20:42 ` Roger Larsson
  2 siblings, 1 reply; 14+ messages in thread
From: Kurt Garloff @ 2001-06-14 17:35 UTC (permalink / raw)
  To: linux-kernel


On Thu, Jun 14, 2001 at 01:26:05PM -0400, Richard B. Johnson wrote:
> Question 2: What is the purpose of the code sequence, "repz nop" 

Puts iP4 into low power mode.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, FRG                               SCSI, Security



* Re: SMP spin-locks
  2001-06-14 17:26 SMP spin-locks Richard B. Johnson
  2001-06-14 17:32 ` David S. Miller
  2001-06-14 17:35 ` Kurt Garloff
@ 2001-06-14 20:42 ` Roger Larsson
  2001-06-14 21:05   ` Richard B. Johnson
  2 siblings, 1 reply; 14+ messages in thread
From: Roger Larsson @ 2001-06-14 20:42 UTC (permalink / raw)
  To: Richard B. Johnson, Linux kernel

Hi,

Wait a minute...

Spinlocks on an embedded system? Is it _really_ SMP?

What kind of performance problem do you have?
My guess, since you are mentioning spin locks, is that you are
having a latency problem - RT process does not execute/start
quickly enough?

If that is the case you should look at Andrew Morton's low-latency
patches.
 http://www.uow.edu.au/~andrewm/linux/schedlat.html

/RogerL

On Thursday 14 June 2001 19:26, Richard B. Johnson wrote:
> I __finally__ got back on "the list". They finally fixed the
> company firewall!
>
> During my absence, I had the chance to look at some SMP code
> because of a performance problem (a few microseconds out of
> spec on a 130 MHz embedded system) and I have a question about
> the current spin-locks.
>
> Spin-locks now transfer control to the .text.lock segment.
> This is a separate segment that can be at an offset that
> is far away from the currently executing code. That may
> cause the cache to be reloaded. Further, each spin-lock
> invocation generates separate code within that segment.
>
> Question 1: Why?
>
> Question 2: What is the purpose of the code sequence, "repz nop"
> generated by the spin-lock code? Is this a processor BUG work-around?
> `as` doesn't "like" this sequence and, Intel doesn't seem to
> document it.
>
>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).
>
> "Memory is like gasoline. You use it up when you are running. Of
> course you get it all back when you reboot..."; Actual explanation
> obtained from the Micro$oft help desk.
>
>

-- 
Roger Larsson
Skellefteå
Sweden


* Re: SMP spin-locks
  2001-06-14 20:42 ` Roger Larsson
@ 2001-06-14 21:05   ` Richard B. Johnson
  2001-06-14 21:30     ` Roger Larsson
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Richard B. Johnson @ 2001-06-14 21:05 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Linux kernel

On Thu, 14 Jun 2001, Roger Larsson wrote:

> Hi,
> 
> Wait a minute...
> 
> Spinlocks on an embedded system? Is it _really_ SMP?
>

The embedded system is not SMP. However, there is definite
advantage to using an unmodified kernel that may/may-not
have been compiled for SMP. Of course spin-locks are used
to prevent interrupts from screwing up buffer pointers, etc.
 
> What kind of performance problem do you have?

The problem is that a data acquisition board across the PCI bus
gives a data transfer rate of 10 to 11 megabytes per second
with a UP kernel, and the transfer drops to 5-6 megabytes per
second with a SMP kernel. The ISR is really simple and copies
data, that's all.

The 'read()' routine uses a spinlock when it modifies pointers.

I started to look into where all the CPU clocks were going. The
SMP spinlock code is where it's going. There is often contention
for the lock because interrupts normally occur at 50 to 60 kHz.

When there is contention, a very long........jump occurs into
the .text.lock segment. I think this is flushing queues.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.




* Re: SMP spin-locks
  2001-06-14 21:05   ` Richard B. Johnson
@ 2001-06-14 21:30     ` Roger Larsson
  2001-06-15  3:21       ` Richard B. Johnson
  2001-06-15 12:10     ` Ingo Oeser
  2001-06-15 15:52     ` Pavel Machek
  2 siblings, 1 reply; 14+ messages in thread
From: Roger Larsson @ 2001-06-14 21:30 UTC (permalink / raw)
  To: root; +Cc: Linux kernel

On Thursday 14 June 2001 23:05, you wrote:
> On Thu, 14 Jun 2001, Roger Larsson wrote:
> > Hi,
> >
> > Wait a minute...
> >
> > Spinlocks on an embedded system? Is it _really_ SMP?
>
> The embedded system is not SMP. However, there is definite
> advantage to using an unmodified kernel that may/may-not
> have been compiled for SMP. Of course spin-locks are used
> to prevent interrupts from screwing up buffer pointers, etc.
>

Not really - it prevents another processor entering the same code
segment  (spin_lock_irqsave prevents both another processor and
local interrupts).

An interrupt on UP cannot wait on a spin lock (it would never be released,
since no code other than the spinning interrupt handler would be able to
execute).


> > What kind of performance problem do you have?
>
> The problem is that a data acquisition board across the PCI bus
> gives a data transfer rate of 10 to 11 megabytes per second
> with a UP kernel, and the transfer drops to 5-6 megabytes per
> second with a SMP kernel. The ISR is really simple and copies
> data, that's all.
>
> The 'read()' routine uses a spinlock when it modifies pointers.
>
> I started to look into where all the CPU clocks were going. The
> SMP spinlock code is where it's going. There is often contention
> for the lock because interrupts normally occur at 50 to 60 kHz.
>

SMP compiled kernel, but running on UP hardware - right?
Then this _should not_ happen!

see linux/Documentation/spinlocks.txt

Is it your spinlocks that are causing this, or?

> When there is contention, a very long........jump occurs into
> the .text.lock segment. I think this is flushing queues.
>

It does not matter, if there is contention - let it take time. Waiting is what
spinlocking is about anyway...

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden


* Re: SMP spin-locks
  2001-06-15  3:21       ` Richard B. Johnson
@ 2001-06-15  2:33         ` David Lang
  2001-06-15 10:35         ` David Schwartz
  1 sibling, 0 replies; 14+ messages in thread
From: David Lang @ 2001-06-15  2:33 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Roger Larsson, Linux kernel

I thought that when you compiled a kernel as UP it replaced the spin-lock
macros with versions that are blank. As a result a UP kernel spends no
time doing spinlocks at all.

that's why a SMP kernel on a UP box is slightly slower, there is more code
to be executed

David Lang
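
A minimal sketch of the compile-time switch described above, assuming
C11 atomics. DEMO_SMP, demo_lock_t, demo_spin_lock() and
demo_increment() are hypothetical names, not kernel symbols. Note that,
per the listing quoted below, the interrupt-saving lock variants still
save flags and execute cli on UP; only the plain lock operation is
modeled here:

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration only. */
typedef struct { atomic_flag held; } demo_lock_t;
#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

#ifdef DEMO_SMP
/* SMP build: a real atomic test-and-set loop. */
static inline void demo_spin_lock(demo_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}
static inline void demo_spin_unlock(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
#else
/* UP build: no other CPU can contend, so the operations expand to nothing. */
#define demo_spin_lock(l)   do { (void)(l); } while (0)
#define demo_spin_unlock(l) do { (void)(l); } while (0)
#endif

static int demo_counter;
static demo_lock_t demo_counter_lock = DEMO_LOCK_INIT;

/* The calling code is identical in both builds. */
static int demo_increment(void)
{
    demo_spin_lock(&demo_counter_lock);
    int value = ++demo_counter;
    demo_spin_unlock(&demo_counter_lock);
    return value;
}
```

Built without DEMO_SMP, demo_increment() contains no locking
instructions at all; built with it, the same source carries the
test-and-set loop, which is the extra code an SMP kernel executes on a
UP box.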



 On Thu, 14 Jun 2001, Richard B. Johnson wrote:

> Date: Thu, 14 Jun 2001 23:21:35 -0400 (EDT)
> From: Richard B. Johnson <root@chaos.analogic.com>
> To: Roger Larsson <roger.larsson@norran.net>
> Cc: Linux kernel <linux-kernel@vger.kernel.org>
> Subject: Re: SMP spin-locks
>
> On Thu, 14 Jun 2001, Roger Larsson wrote:
>
> > On Thursday 14 June 2001 23:05, you wrote:
> > > On Thu, 14 Jun 2001, Roger Larsson wrote:
> > > > Hi,
> > > >
> > > > Wait a minute...
> > > >
> > > > Spinlocks on an embedded system? Is it _really_ SMP?
> > >
> > > The embedded system is not SMP. However, there is definite
> > > advantage to using an unmodified kernel that may/may-not
> > > have been compiled for SMP. Of course spin-locks are used
> > > to prevent interrupts from screwing up buffer pointers, etc.
> > >
> >
> > Not really - it prevents another processor entering the same code
> > segment  (spin_lock_irqsave prevents both another processor and
> > local interrupts).
> >
> > An interrupt on UP cannot wait on a spin lock (it would never be released,
> > since no code other than the spinning interrupt handler would be able to
> > execute).
>
> An interrupt on a UP system will never spin, nor will the IP from
> another CPU because there isn't another CPU. A spin-lock, compiled
> for UP is:
>
> 	pushf
> 	popl	some_register, currently EBX
> 	cli	; Clear the interrupts on the only CPU you have
>
> 	do_some_code_that_must_not_be_interrupted();
>
> 	pushl	same_register_as_above
> 	popf	; Restore interrupts if they were enabled
>
>
> For SMP is:
>
> 	pushf
> 	popl	some_register
> 	cli	; Clear interrupts
> 	modify_a_memory_variable
> x:	see_if_it_is_what_you_expect
> 	if_not_loop_to x
>
> 	do_some_code_that_must_not_be_interrupted();
>
> 	modify_the_memory_variable_back
> 	pushl	same_register_as_above
> 	popf
>
>
> Since `cli` will only stop interrupts on the CPU that actually
> fetches the instruction, another CPU can enter the code unless
> it is forced to spin until the lock is released.
>
> If this code is executed on a UP machine, the memory variable
> will always become exactly as expected so it will never spin.
> Therefore SMP code should be perfectly safe on a UP machine,
> in fact must be perfectly safe, or it's broken.
>
> The current spinlock code does work perfectly on a UP machine.
> However, the large difference in performance shows that something
> is quite less than optimum in the coding.
>
> Spinlocks are machine dependent. A simple increment of a byte
> memory variable, spinning if it's not 1 will do fine. Decrementing
> this variable will release the lock. A `lock` prefix is not necessary
> because  all Intel byte operations are atomic anyway. This assumes
> that the lock was initialized to 0. It doesn't have to be. It
> could be initialized to 0xaa (anything) and spin if it's not
> 0xab (or anything + 1).
>
>
> >
> > SMP compiled kernel, but running on UP hardware - right?
> > Then this _should not_ happen!
> >
> > see linux/Documentation/spinlocks.txt
> >
>
> This, in fact, will happen. Machines booted from the network should
> have SMP code so a SMP machine can use all its CPUs. This same
> code, booted from the network, should have no measurable performance
> penalty in UP machines.
>
> Also, when you develop drivers on a workstation, test them on
> a workstation, then upload everything to an embedded system, you
> had better be executing the same code, kernel, drivers, et al.,
> or you are in a world of hurt. Many embedded systems don't have
> any 'standard I/O' so you can't prove that it meets its specs
> (exception handling, etc) on the target. You have to test that
> logic elsewhere.
>
> This workstation has two CPUs. All drivers are modules. It uses
> initrd to install the ones for my SCSI disks, network, etc.
>
> Script started on Thu Jun 14 23:13:10 2001
> lsmod
> Module                  Size  Used by
> ramdisk                 4448   0
> loop                    8212   0  (autoclean)
> ipx                    19248   0  (unused)
> 3c59x                  25020   1  (autoclean)
> nls_cp437               4408   4  (autoclean)
> BusLogic               38320   6
> sd_mod                 10932   6
> scsi_mod               59460   2  [BusLogic sd_mod]
> # exit
> exit
>
> Script done on Thu Jun 14 23:13:45 2001
>
> The same kernel, uploaded to an embedded system, also uses
> initrd to load the machine-specific drivers. In this way, only
> the drivers that are actually used, are loaded. The kernel remains
> small. There is a slight performance penalty for using modules,
> but none other.
>
> # telnet platinum
> Trying 10.106.100.166...
> Connected to platinum.analogic.com.
> Escape character is '^]'.
>
>   Enter "help" for commands
>
> PLATINUM> sho modules
>
> pcilynx                13468   1
> raw1394                 7984   1
> ieee1394               22984   0 [pcilynx raw1394]
> rtc_drvr                2372   0
> vxibus                 10660   6
> gpib_drvr              19200   2
> ramdisk                 4428   0
> pcnet32se              15640   1
>
> PLATINUM> exit
> 	Exit
>
> Connection closed by foreign host.
> # exit
> exit
>
>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).
>
> "Memory is like gasoline. You use it up when you are running. Of
> course you get it all back when you reboot..."; Actual explanation
> obtained from the Micro$oft help desk.
>
>


* Re: SMP spin-locks
  2001-06-14 21:30     ` Roger Larsson
@ 2001-06-15  3:21       ` Richard B. Johnson
  2001-06-15  2:33         ` David Lang
  2001-06-15 10:35         ` David Schwartz
  0 siblings, 2 replies; 14+ messages in thread
From: Richard B. Johnson @ 2001-06-15  3:21 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Linux kernel

On Thu, 14 Jun 2001, Roger Larsson wrote:

> On Thursday 14 June 2001 23:05, you wrote:
> > On Thu, 14 Jun 2001, Roger Larsson wrote:
> > > Hi,
> > >
> > > Wait a minute...
> > >
> > > Spinlocks on an embedded system? Is it _really_ SMP?
> >
> > The embedded system is not SMP. However, there is definite
> > advantage to using an unmodified kernel that may/may-not
> > have been compiled for SMP. Of course spin-locks are used
> > to prevent interrupts from screwing up buffer pointers, etc.
> >
> 
> Not really - it prevents another processor entering the same code
> segment  (spin_lock_irqsave prevents both another processor and
> local interrupts).
> 
> An interrupt on UP cannot wait on a spin lock (it would never be released,
> since no code other than the spinning interrupt handler would be able to
> execute).

An interrupt on a UP system will never spin, nor will the IP from
another CPU because there isn't another CPU. A spin-lock, compiled
for UP is:

	pushf
	popl	some_register, currently EBX
	cli	; Clear the interrupts on the only CPU you have

	do_some_code_that_must_not_be_interrupted();

	pushl	same_register_as_above
	popf	; Restore interrupts if they were enabled


For SMP is:

	pushf
	popl	some_register
	cli	; Clear interrupts
	modify_a_memory_variable
x:	see_if_it_is_what_you_expect
	if_not_loop_to x

	do_some_code_that_must_not_be_interrupted();

	modify_the_memory_variable_back
	pushl	same_register_as_above
	popf


Since `cli` will only stop interrupts on the CPU that actually
fetches the instruction, another CPU can enter the code unless
it is forced to spin until the lock is released.

If this code is executed on a UP machine, the memory variable
will always become exactly as expected so it will never spin.
Therefore SMP code should be perfectly safe on a UP machine,
in fact must be perfectly safe, or it's broken.

The current spinlock code does work perfectly on a UP machine.
However, the large difference in performance shows that something
is quite less than optimum in the coding.

Spinlocks are machine dependent. A simple increment of a byte
memory variable, spinning if it's not 1 will do fine. Decrementing
this variable will release the lock. A `lock` prefix is not necessary
because  all Intel byte operations are atomic anyway. This assumes
that the lock was initialized to 0. It doesn't have to be. It
could be initialized to 0xaa (anything) and spin if it's not
0xab (or anything + 1).


> 
> SMP compiled kernel, but running on UP hardware - right?
> Then this _should not_ happen!
> 
> see linux/Documentation/spinlocks.txt
>

This, in fact, will happen. Machines booted from the network should
have SMP code so a SMP machine can use all its CPUs. This same
code, booted from the network, should have no measurable performance
penalty in UP machines.

Also, when you develop drivers on a workstation, test them on
a workstation, then upload everything to an embedded system, you
had better be executing the same code, kernel, drivers, et al.,
or you are in a world of hurt. Many embedded systems don't have
any 'standard I/O' so you can't prove that it meets its specs
(exception handling, etc) on the target. You have to test that
logic elsewhere.

This workstation has two CPUs. All drivers are modules. It uses
initrd to install the ones for my SCSI disks, network, etc.

Script started on Thu Jun 14 23:13:10 2001
lsmod
Module                  Size  Used by
ramdisk                 4448   0 
loop                    8212   0  (autoclean)
ipx                    19248   0  (unused)
3c59x                  25020   1  (autoclean)
nls_cp437               4408   4  (autoclean)
BusLogic               38320   6 
sd_mod                 10932   6 
scsi_mod               59460   2  [BusLogic sd_mod]
# exit
exit

Script done on Thu Jun 14 23:13:45 2001

The same kernel, uploaded to an embedded system, also uses
initrd to load the machine-specific drivers. In this way, only
the drivers that are actually used, are loaded. The kernel remains
small. There is a slight performance penalty for using modules,
but none other.

# telnet platinum
Trying 10.106.100.166...
Connected to platinum.analogic.com.
Escape character is '^]'.

  Enter "help" for commands

PLATINUM> sho modules

pcilynx                13468   1
raw1394                 7984   1
ieee1394               22984   0 [pcilynx raw1394]
rtc_drvr                2372   0
vxibus                 10660   6
gpib_drvr              19200   2
ramdisk                 4428   0
pcnet32se              15640   1

PLATINUM> exit
	Exit 

Connection closed by foreign host.
# exit
exit


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.




* Re: SMP spin-locks
  2001-06-14 17:35 ` Kurt Garloff
@ 2001-06-15  6:51   ` Doug Ledford
  0 siblings, 0 replies; 14+ messages in thread
From: Doug Ledford @ 2001-06-15  6:51 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel

Kurt Garloff wrote:
> 
> On Thu, Jun 14, 2001 at 01:26:05PM -0400, Richard B. Johnson wrote:
> > Question 2: What is the purpose of the code sequence, "repz nop"
> 
> Puts iP4 into low power mode.

Umm, slightly more accurate would be to say that it makes the P4 processor
wait before resuming the loop to give the lock a chance to have been
released.  It makes the code go from a constant busy loop to a check/wait
small amount of time/check again loop.  This in turn keeps your processor
from constantly checking the lock itself, which is supposed to have
benefits in terms of inter-processor bus pressure.

-- 

 Doug Ledford <dledford@redhat.com>  http://people.redhat.com/dledford
      Please check my web site for aic7xxx updates/answers before
                      e-mailing me about problems


* RE: SMP spin-locks
  2001-06-15  3:21       ` Richard B. Johnson
  2001-06-15  2:33         ` David Lang
@ 2001-06-15 10:35         ` David Schwartz
  2001-06-15 13:26           ` Richard B. Johnson
  1 sibling, 1 reply; 14+ messages in thread
From: David Schwartz @ 2001-06-15 10:35 UTC (permalink / raw)
  To: root, Roger Larsson; +Cc: Linux kernel


> Spinlocks are machine dependent. A simple increment of a byte
> memory variable, spinning if it's not 1 will do fine. Decrementing
> this variable will release the lock. A `lock` prefix is not necessary
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> because  all Intel byte operations are atomic anyway. This assumes
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> that the lock was initialized to 0. It doesn't have to be. It
> could be initialized to 0xaa (anything) and spin if it's not
> 0xab (or anything + 1).

	If this is true, atomicity isn't enough to do it. Atomicity means that
there's a single instruction (and so it can't be interrupted mid-modify).
Atomicity (at least as the term is normally used) doesn't prevent the
cache-coherency logic from ping-ponging the memory location between two
processors' caches during the atomic operation.

	DS
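
A minimal sketch of the point under dispute, assuming C11 atomics. The
names byte_lock_t, byte_trylock(), byte_lock() and cpu_relax_hint() are
hypothetical. An increment or decrement of a byte in memory is a
read-modify-write; on x86 it is atomic with respect to other processors
only when it carries a lock prefix (xchg with a memory operand is locked
implicitly). The sketch therefore acquires the lock with an atomic
exchange rather than a plain increment:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical byte-sized lock: 1 means free, 0 means held. */
typedef struct { atomic_schar slot; } byte_lock_t;

static inline void cpu_relax_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

static inline void byte_lock_init(byte_lock_t *l)
{
    atomic_init(&l->slot, 1);
}

/* One indivisible exchange: whoever reads back 1 owns the lock. */
static inline bool byte_trylock(byte_lock_t *l)
{
    return atomic_exchange_explicit(&l->slot, 0, memory_order_acquire) > 0;
}

static inline void byte_lock(byte_lock_t *l)
{
    while (!byte_trylock(l)) {
        /* Wait with plain loads so the line is not written while spinning. */
        while (atomic_load_explicit(&l->slot, memory_order_relaxed) <= 0)
            cpu_relax_hint();
    }
}

static inline void byte_unlock(byte_lock_t *l)
{
    atomic_store_explicit(&l->slot, 1, memory_order_release);
}
```

With an unlocked increment, two CPUs could both read the old value and
both conclude they hold the lock; the exchange makes the read and the
write a single bus-visible operation, so only one caller can see the
"free" value.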



* Re: SMP spin-locks
  2001-06-14 21:05   ` Richard B. Johnson
  2001-06-14 21:30     ` Roger Larsson
@ 2001-06-15 12:10     ` Ingo Oeser
  2001-06-15 12:49       ` Richard B. Johnson
  2001-06-15 15:52     ` Pavel Machek
  2 siblings, 1 reply; 14+ messages in thread
From: Ingo Oeser @ 2001-06-15 12:10 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Roger Larsson, Linux kernel

On Thu, Jun 14, 2001 at 05:05:07PM -0400, Richard B. Johnson wrote:
> The problem is that a data acquisition board across the PCI bus
> gives a data transfer rate of 10 to 11 megabytes per second
> with a UP kernel, and the transfer drops to 5-6 megabytes per
> second with a SMP kernel. The ISR is really simple and copies
> data, that's all.
> 
> The 'read()' routine uses a spinlock when it modifies pointers.
> 
> I started to look into where all the CPU clocks were going. The
> SMP spinlock code is where it's going. There is often contention
> for the lock because interrupts normally occur at 50 to 60 kHz.

Then you need another (better?) queueing mechanism.

Use multiple queues and a _overflowable_ sequence number as
global variable between the queues. 

N Queues (N := no. of CPUs + 1), which have a spin_lock for each
queue.

optionally: One reader packet reassembly priority queue (APQ) ordered by
   sequence number (implicitly or explicitly), if this shouldn't
   be done in user space.

In the writer ISR: 

   For each queue in RR order (start with the remembered one):
   - Try to lock it with spin_trylock (totally inline!)
     + Failed
        * if we failed to find a free queue for x "rounds", disable
          device (we have no reader) and notify user space somehow
       * increment "rounds" 
       * next queue
     + Succeed
       * Increment sequence number
       * Put data record into queue
      (* remember this queue as last queue used)
      (* mark queue "not empty")
       * do other IRQ work...

In the reader routine:
   For each queue in RR order (start with the remembered one):
   - No data counter above threshold -> EAGAIN [1] 
   - Try to lock it with spin_trylock (totally inline!)
     + Failed -> next queue
     + Succeed
       * if queue empty, unlock and try next one
      (* remember this queue as last queue used)
       * Get one data record from queue (in queue order!)
       * Move data record into APQ
       * Unlock queue
       * Deliver as much data from the APQ, as the user wants and
         is available
    - if all queues empty or locked -> increment "no data round"
      counter
  

Notes:
   The "last queue used" variable is static, but local to routine.
   It is there to decrease the number of iterations and distribute
   the data across all queues more equally.


   Statistics about lock contention per queue, per round and per
   try would be nice here to estimate the number of queues
   needed.

   The APQ can grow quite large if the sequence numbers are badly
   distributed and some queues tend to be locked whenever the
   reader wants to read from them.

   The above can be solved by 2^N "one-entry queues" (aka slots)
   with sequence numbers mapping to these slots. If you need many
   slots (more than 256, I would say) then this is again
   unacceptable, because of the iteration cost in the ISR.
   
What do you think? After some polishing this should decrease lock
contention noticeably.


Regards

Ingo Oeser

[1] Blocking will be harder to implement here, since we need to
   notify the reader routine, that we have data available, which
   involves some latency you cannot afford. Maybe this could be
   done via schedule_task(), if needed.
-- 
Use ReiserFS to get a faster fsck and Ext2 to fsck slowly and gently.
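
The writer side of the scheme above can be sketched in user-space C,
assuming C11 atomics and a single writer (so the sequence number needs
no lock of its own). NQUEUES, QDEPTH, MAX_ROUNDS, struct multi_queue and
mq_put() are hypothetical names and sizes; the reader side and the
reassembly queue are left out:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NQUEUES    3   /* assumed: number of CPUs + 1 */
#define QDEPTH     8   /* assumed per-queue capacity */
#define MAX_ROUNDS 4   /* assumed limit before giving up */

struct record { unsigned seq; int payload; };

struct queue {
    atomic_flag   lock;
    size_t        count;
    struct record items[QDEPTH];
};

struct multi_queue {
    struct queue q[NQUEUES];
    unsigned     next_seq;   /* allowed to wrap around */
    size_t       last_used;  /* queue remembered from the last success */
};

static bool queue_trylock(struct queue *q)
{
    return !atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire);
}

static void queue_unlock(struct queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static void mq_init(struct multi_queue *mq)
{
    for (size_t i = 0; i < NQUEUES; i++) {
        atomic_flag_clear(&mq->q[i].lock);
        mq->q[i].count = 0;
    }
    mq->next_seq = 0;
    mq->last_used = 0;
}

/* Returns the index of the queue used, or -1 if none was free or had room. */
static int mq_put(struct multi_queue *mq, int payload)
{
    for (size_t tries = 0; tries < (size_t)NQUEUES * MAX_ROUNDS; tries++) {
        size_t idx = (mq->last_used + tries) % NQUEUES;
        struct queue *q = &mq->q[idx];

        if (!queue_trylock(q))
            continue;                 /* never spin: move to the next queue */
        if (q->count < QDEPTH) {
            q->items[q->count].seq = mq->next_seq++;
            q->items[q->count].payload = payload;
            q->count++;
            mq->last_used = idx;
            queue_unlock(q);
            return (int)idx;
        }
        queue_unlock(q);              /* full: try the next queue */
    }
    return -1;  /* caller decides what to do, e.g. disable the device */
}
```

Because the writer only ever uses a trylock, it never waits behind the
reader; the cost moves into the loop over queues and into reordering by
sequence number on the read side.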


* Re: SMP spin-locks
  2001-06-15 12:10     ` Ingo Oeser
@ 2001-06-15 12:49       ` Richard B. Johnson
  0 siblings, 0 replies; 14+ messages in thread
From: Richard B. Johnson @ 2001-06-15 12:49 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Roger Larsson, Linux kernel

On Fri, 15 Jun 2001, Ingo Oeser wrote:

> On Thu, Jun 14, 2001 at 05:05:07PM -0400, Richard B. Johnson wrote:
> > The problem is that a data acquisition board across the PCI bus
> > gives a data transfer rate of 10 to 11 megabytes per second
> > with a UP kernel, and the transfer drops to 5-6 megabytes per
> > second with a SMP kernel. The ISR is really simple and copies
> > data, that's all.
> > 
> > The 'read()' routine uses a spinlock when it modifies pointers.
> > 
> > I started to look into where all the CPU clocks were going. The
> > SMP spinlock code is where it's going. There is often contention
> > for the lock because interrupts normally occur at 50 to 60 kHz.
> 
> Then you need another (better?) queueing mechanism.
> 
> Use multiple queues and a _overflowable_ sequence number as
> global variable between the queues. 
> 
> N Queues (N := no. of CPUs + 1), which have a spin_lock for each
> queue.
> 
> optionally: One reader packet reassembly priority queue (APQ) ordered by
>    sequence number (implicitly or explicitly), if this shouldn't
>    be done in user space.
> 
> In the writer ISR: 
> 
>    For each queue in RR order (start with the remembered one):
>    - Try to lock it with spin_trylock (totally inline!)
>      + Failed
>         * if we failed to find a free queue for x "rounds", disable
>           device (we have no reader) and notify user space somehow
>        * increment "rounds" 
>        * next queue
>      + Succeed
>        * Increment sequence number
>        * Put data record into queue
>       (* remember this queue as last queue used)
>       (* mark queue "not empty")
>        * do other IRQ work...
> 
> In the reader routine:
>    For each queue in RR order (start with the remembered one):
>    - No data counter above threshold -> EAGAIN [1] 
>    - Try to lock it with spin_trylock (totally inline!)
>      + Failed -> next queue
>      + Succeed
>        * if queue empty, unlock and try next one
>       (* remember this queue as last queue used)
>        * Get one data record from queue (in queue order!)
>        * Move data record into APQ
>        * Unlock queue
>        * Deliver as much data from the APQ, as the user wants and
>          is available
>     - if all queues empty or locked -> increment "no data round"
>       counter
>   
> 
> Notes:
>    The "last queue used" variable is static, but local to routine.
>    It is there to decrease the number of iterations and distribute
>    the data across all queues more equally.
> 
> 
>    Statistics about lock contention per queue, per round and per
>    try would be nice here to estimate the number of queues
>    needed.
> 
>    The APQ can grow quite large if the sequence numbers are badly
>    distributed and some queues tend to be locked whenever the
>    reader wants to read from them.
> 
>    The above can be solved by 2^N "one-entry queues" (aka slots)
>    with sequence numbers mapping to these slots. If you need many
>    slots (more than 256, I would say) then this is again
>    unacceptable, because of the iteration cost in the ISR.
>    
> What do you think? After some polishing this should decrease lock
> contention noticeably.
> 
> 
> Regards
> 
> Ingo Oeser
> 
> [1] Blocking will be harder to implement here, since we need to
>    notify the reader routine, that we have data available, which
>    involves some latency you cannot afford. Maybe this could be
>    done via schedule_task(), if needed.
> -- 
> Use ReiserFS to get a faster fsck and Ext2 to fsck slowly and gently.
> 

For further discussion I will take this off-list. However, you are
correct. The very simple ISR that I have (I did preallocate buffers)
leaves a great deal of room for improvement.

However, the logic that you mention has execution overhead as well.


Cheers,
Dick Johnson





* RE: SMP spin-locks
  2001-06-15 10:35         ` David Schwartz
@ 2001-06-15 13:26           ` Richard B. Johnson
  0 siblings, 0 replies; 14+ messages in thread
From: Richard B. Johnson @ 2001-06-15 13:26 UTC (permalink / raw)
  To: David Schwartz; +Cc: Roger Larsson, Linux kernel

On Fri, 15 Jun 2001, David Schwartz wrote:

> 
> > Spinlocks are machine dependent. A simple increment of a byte
> > memory variable, spinning if it's not 1 will do fine. Decrementing
> > this variable will release the lock. A `lock` prefix is not necessary
>                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > because  all Intel byte operations are atomic anyway. This assumes
>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > that the lock was initialized to 0. It doesn't have to be. It
> > could be initialized to 0xaa (anything) and spin if it's not
> > 0xab (or anything + 1).
> 
> 	If this is true, atomicity isn't enough to do it. Atomicity means that
> there's a single instruction (and so it can't be interrupted mid-modify).
> Atomicity (at least as the term is normally used) doesn't prevent the
> cache-coherency logic from ping-ponging the memory location between two
> processors' caches during the atomic operation.
> 
> 	DS

Try it. You'll like it. There are no simultaneous accesses from
different CPUs to any address space of any kind on an Intel-based
SMP machine. That is a fact. This is because there is only one
group of decodes for this address space. This applies for both memory
and I/O. Basically, the bus, even though it may be broken into
several units of different speeds, operates as a unit. So, only
one operation can be occurring at any instant. 

Now, suppose you have a DSP that accesses its own memory. It's
on a different board than the main CPU. You provide a mechanism
whereby your CPU can share a portion (or all) of this memory.
For this, you "dual-port" the memory, or you access it via a
PCI bus. Anyway, the DSP's memory now appears in your address
space. When you access this memory at a time that the DSP could
be writing to it, you need a `lock` prefix. Also hardware needs
to handle the #LOCK signal properly or you may get some funny
values from the DSP.

As shown in the '486 manual, if you perform a read/modify/write
operation you may need a lock prefix. Unlike CPUs that can only
perform load and store operations upon memory, the ix86 can
perform many operations directly. Amongst many of these wonderful
instructions is the ability to increment or decrement a byte anywhere
in memory. The CPU does not perform a read/modify/write operation
in the general sense when it does this. Instead, the data is read,
modified, and written in a single bus cycle. There is no way
that another CPU can access the bus in between these operations.
Memory access instructions that are complete in a single bus cycle
(this is not a single CPU clock), would never need a lock prefix.
The lock-prefix executes in only a single CPU clock.
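
For readers who want to experiment, the increment/decrement byte lock described in this thread can be sketched in portable C11. This is an illustration under stated assumptions, not the kernel's spinlock: the names `byte_lock`, `byte_trylock`, `byte_lock_acquire`, and `byte_unlock` are made up, and the step where a losing contender backs its increment out before spinning is an addition so that more than two contenders cannot leave the counter stuck above 1. Note that `atomic_fetch_add` is typically compiled to a locked instruction on x86, so the sketch does not test the claim that the prefix is unnecessary; it only shows the counting logic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One byte of lock state: 0 means free, 1 means held. */
typedef struct {
    atomic_uchar count;
} byte_lock;

/* Try once: increment, and own the lock only if the old value was 0. */
static int byte_trylock(byte_lock *l)
{
    assert(l != NULL);
    if (atomic_fetch_add_explicit(&l->count, 1, memory_order_acquire) == 0)
        return 1;
    /* Lost the race: undo our increment (assumed step, see lead-in). */
    atomic_fetch_sub_explicit(&l->count, 1, memory_order_relaxed);
    return 0;
}

/* Spin until the lock is ours. */
static void byte_lock_acquire(byte_lock *l)
{
    while (!byte_trylock(l)) {
        /* Wait with plain loads so the cache line is not written while busy. */
        while (atomic_load_explicit(&l->count, memory_order_relaxed) != 0)
            ;
    }
}

/* Decrementing the byte releases the lock. */
static void byte_unlock(byte_lock *l)
{
    assert(l != NULL);
    atomic_fetch_sub_explicit(&l->count, 1, memory_order_release);
}
```

Because the state is a single byte, the sketch assumes far fewer than 256 simultaneous contenders; a wider counter would remove that limit.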

The idea is not to get rid of this. The idea is to get rid of the
awful spin_lock_irqsave()/spin_unlock_irqrestore() code that has grown
like some virus and replace it with simple working code that
does not use a separate segment for the spinning, etc.

Also, the cache of all CPUs "knows" when a write within its cache-line
has occurred so the next CPU will correctly see the result of
the previous operation. 


Cheers,
Dick Johnson





* Re: SMP spin-locks
  2001-06-14 21:05   ` Richard B. Johnson
  2001-06-14 21:30     ` Roger Larsson
  2001-06-15 12:10     ` Ingo Oeser
@ 2001-06-15 15:52     ` Pavel Machek
  2 siblings, 0 replies; 14+ messages in thread
From: Pavel Machek @ 2001-06-15 15:52 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Roger Larsson, Linux kernel

Hi!

> The 'read()' routine uses a spinlock when it modifies pointers.
> 
> I started to look into where all the CPU clocks were going. The
> SMP spinlock code is where it's going. There is often contention
> for the lock because interrupts normally occur at 50 to 60 kHz.
> 
> When there is contention, a very long........jump occurs into
> the .text.lock segment. I think this is flushing queues. 

On UP, there's *never* contention on the lock, because irqsave lock
disables interrupts. Right? Something else must be slowing you.
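
A toy model of that argument, in plain user-space C, for anyone who wants to see it step by step. Nothing here is kernel code: `model_lock_irqsave`, `model_unlock_irqrestore`, `raise_irq`, and the flag variables are hypothetical stand-ins for a single CPU's interrupt flag and a lock shared with an interrupt handler. The point it illustrates is that once taking the lock also disables interrupts, the handler can only run after the lock has been dropped, so it never finds the lock held.

```c
#include <assert.h>

static int irq_enabled = 1;     /* models the CPU's interrupt-enable flag */
static int irq_pending;         /* an interrupt arrived while disabled */
static int lock_held;           /* the lock shared with the handler */
static int isr_runs;            /* how often the handler actually ran */
static int isr_saw_lock_held;   /* stays 0 if the handler never contends */

/* The interrupt handler: records whether it would have had to spin. */
static void isr(void)
{
    if (lock_held)
        isr_saw_lock_held = 1;
    isr_runs++;
}

/* Hardware raising an interrupt: deferred while interrupts are off. */
static void raise_irq(void)
{
    if (irq_enabled)
        isr();
    else
        irq_pending = 1;
}

/* Save the interrupt state, disable interrupts, then take the lock. */
static unsigned long model_lock_irqsave(void)
{
    unsigned long flags = (unsigned long)irq_enabled;

    irq_enabled = 0;
    assert(!lock_held);   /* one CPU, interrupts off: nobody else holds it */
    lock_held = 1;
    return flags;
}

/* Drop the lock, restore the interrupt state, deliver what was deferred. */
static void model_unlock_irqrestore(unsigned long flags)
{
    lock_held = 0;
    irq_enabled = (int)flags;
    if (irq_enabled && irq_pending) {
        irq_pending = 0;
        isr();
    }
}
```

In this model an interrupt raised inside the critical section is simply delayed until the unlock, which costs latency but never a spin.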

								Pavel
PS: But that's bad. Performance should not come down twice --
this will bite you even on real SMP.

-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



Thread overview: 14+ messages
2001-06-14 17:26 SMP spin-locks Richard B. Johnson
2001-06-14 17:32 ` David S. Miller
2001-06-14 17:35 ` Kurt Garloff
2001-06-15  6:51   ` Doug Ledford
2001-06-14 20:42 ` Roger Larsson
2001-06-14 21:05   ` Richard B. Johnson
2001-06-14 21:30     ` Roger Larsson
2001-06-15  3:21       ` Richard B. Johnson
2001-06-15  2:33         ` David Lang
2001-06-15 10:35         ` David Schwartz
2001-06-15 13:26           ` Richard B. Johnson
2001-06-15 12:10     ` Ingo Oeser
2001-06-15 12:49       ` Richard B. Johnson
2001-06-15 15:52     ` Pavel Machek
