linux-kernel.vger.kernel.org archive mirror
* threading question
@ 2001-06-12 18:24 ognen
  2001-06-12 18:39 ` Davide Libenzi
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: ognen @ 2001-06-12 18:24 UTC (permalink / raw)
  To: linux-kernel

Hello,

I am a summer student implementing a multi-threaded version of a very
popular bioinformatics tool. So far it compiles and runs without problems
(as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
OSF/1 running on Alpha. I have run a lot of timing tests compared to the
sequential version of the tool on all of these machines (most of them are
dual-CPU, although I am also running tests on 12-CPU Solaris and 108-CPU
SGI IRIX). On dual-CPU machines the speedups are as follows: my version
is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
kernel. Why are the numbers on Linux machines so much lower? It is the
same multi-threaded code, I am not using any tricks, the code basically
uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
size is set to 8K (but the numbers are the same with larger/smaller stack
sizes).

Is there anything I am missing? Is this to be expected due to the Linux way of
handling threads (clone call)? I am just trying to explain the numbers and
nothing else comes to mind....

Best regards,
Ognen Duzlevski
-- 
ognen@gene.pbi.nrc.ca
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team



* RE: threading question
  2001-06-12 18:24 threading question ognen
@ 2001-06-12 18:39 ` Davide Libenzi
  2001-06-12 18:57 ` from dmesg: kernel BUG at inode.c:486 Olivier Sessink
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Davide Libenzi @ 2001-06-12 18:39 UTC (permalink / raw)
  To: ognen; +Cc: linux-kernel


On 12-Jun-2001 ognen@gene.pbi.nrc.ca wrote:
> Hello,
> 
> I am a summer student implementing a multi-threaded version of a very
> popular bioinformatics tool. So far it compiles and runs without problems
> (as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
> OSF/1 running on Alpha. I have run a lot of timing tests compared to the
> sequential version of the tool on all of these machines (most of them are
> dual-CPU, although I am also running tests on 12-CPU Solaris and 108-CPU
> SGI IRIX). On dual-CPU machines the speedups are as follows: my version
> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower? It is the
> same multi-threaded code, I am not using any tricks, the code basically
> uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> size is set to 8K (but the numbers are the same with larger/smaller stack
> sizes).
> 
> Is there anything I am missing? Is this to be expected due to the Linux way of
> handling threads (clone call)? I am just trying to explain the numbers and
> nothing else comes to mind....

What does vmstat show while your tool is running?



- Davide



* from dmesg: kernel BUG at inode.c:486
  2001-06-12 18:24 threading question ognen
  2001-06-12 18:39 ` Davide Libenzi
@ 2001-06-12 18:57 ` Olivier Sessink
  2001-06-12 18:58 ` threading question Christoph Hellwig
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Olivier Sessink @ 2001-06-12 18:57 UTC (permalink / raw)
  To: linux-kernel

Hi all,

Today my girlfriend reported that all programs that accessed my
NFS-mounted drive were crashing. I use Linux 2.4.5 on the client
with these .config options (for NFS):
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_ROOT_NFS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y

The server is a very old install, running a user-space NFS daemon:
fender:~$ /usr/sbin/rpc.nfsd --version
Universal NFS Server 2.2beta41

When running dmesg on the client I got this output:

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68 
kernel BUG at inode.c:486!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b   ebx: cc703ba0   ecx: 00000001   edx: c025ba84
esi: c025ec60   edi: c976eac0   ebp: c32fdfa4   esp: c32fdeec
ds: 0018   es: 0018   ss: 0018
Process gmc (pid: 1193, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 cc703ba0 c01409c7 cc703ba0 cceee320 cc703ba0
       c015e62a cc703ba0 c013e5d6 cceee320 cc703ba0 cceee320 00000000 c013723c
       cceee320 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c8587000 00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>] [<c0137f68>] [<c0135276>]
       [<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68 
kernel BUG at inode.c:486!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b   ebx: c62eb840   ecx: 00000001   edx: c025ba84
esi: c025ec60   edi: c976eac0   ebp: c7135fa4   esp: c7135eec
ds: 0018   es: 0018   ss: 0018
Process gmc (pid: 1239, stackpage=c7135000)
Stack: c021b86d c021b8cc 000001e6 c62eb840 c01409c7 c62eb840 cf7285e0 c62eb840
       c015e62a c62eb840 c013e5d6 cf7285e0 c62eb840 cf7285e0 00000000 c013723c
       cf7285e0 c7135f68 c013795a c976eac0 c7135f68 00000000 c89ac000 00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>] [<c0137f68>] [<c0135276>]
       [<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68 
kernel BUG at inode.c:486!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c013fffb>]
EFLAGS: 00010286
eax: 0000001b   ebx: c62ebde0   ecx: 00000001   edx: c025ba84
esi: c025ec60   edi: c976eac0   ebp: c32fdfa4   esp: c32fdeec
ds: 0018   es: 0018   ss: 0018
Process gmc (pid: 1243, stackpage=c32fd000)
Stack: c021b86d c021b8cc 000001e6 c62ebde0 c01409c7 c62ebde0 cf7288e0 c62ebde0
       c015e62a c62ebde0 c013e5d6 cf7288e0 c62ebde0 cf7288e0 00000000 c013723c
       cf7288e0 c32fdf68 c013795a c976eac0 c32fdf68 00000000 c55df000 00000000
Call Trace: [<c01409c7>] [<c015e62a>] [<c013e5d6>] [<c013723c>] [<c013795a>] [<d8e7dda3>] [<c0137f68>]
       [<c0135276>] [<c0106a7b>] [<c010002b>]

Code: 0f 0b 83 c4 0c f6 83 f4 00 00 00 10 75 19 68 e8 01 00 00 68 

I have no idea what this means, but it looks serious to me, so I decided to
post it to the kernel mailing list. Is this a real bug? If I have to provide
more detailed information please tell me what you need and how to get it.

thanks,
	Olivier Sessink
	



* Re: threading question
  2001-06-12 18:24 threading question ognen
  2001-06-12 18:39 ` Davide Libenzi
  2001-06-12 18:57 ` from dmesg: kernel BUG at inode.c:486 Olivier Sessink
@ 2001-06-12 18:58 ` Christoph Hellwig
  2001-06-12 19:07   ` ognen
  2001-06-12 21:44   ` Davide Libenzi
  2001-06-12 19:06 ` Kip Macy
  2001-06-12 22:41 ` threading question Pavel Machek
  4 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2001-06-12 18:58 UTC (permalink / raw)
  To: ognen; +Cc: linux-kernel

In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you wrote:
> On dual-CPU machines the speedups are as follows: my version
> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower?

Does your measurement include the time needed to actually create the
threads or do you even frequently create and destroy threads?

The code for creating threads is _horribly_ slow in Linuxthreads due
to the way it is implemented.
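
[Editorial sketch, not part of the original message.] One way to check how much of a run goes into thread creation is to time it in isolation. The following is a minimal sketch, assuming a POSIX system with `clock_gettime`; the helper name `avg_create_join_seconds` is hypothetical, and it times joinable threads (create plus join) rather than the detached threads the original poster uses, so it only approximates that workload.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Thread body that does nothing, so only creation/teardown cost is measured. */
static void *noop(void *arg)
{
    return arg;
}

/*
 * Create and join `n` threads one after another and return the average
 * wall-clock cost of one create+join pair in seconds, or -1.0 on error.
 */
double avg_create_join_seconds(int n)
{
    struct timespec t0, t1;

    if (n <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, noop, NULL) != 0)
            return -1.0;
        if (pthread_join(tid, NULL) != 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return elapsed / n;
}
```

Multiplying the returned value by the number of threads a run creates, and comparing that with the total run time on each platform, gives a rough idea of whether creation overhead could explain the gap.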

> It is the
> same multi-threaded code, I am not using any tricks, the code basically
> uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> size is set to 8K (but the numbers are the same with larger/smaller stack
> sizes).
>
> Is there anything I am missing? Is this to be expected due to the Linux way of
> handling threads (clone call)? I am just trying to explain the numbers and
> nothing else comes to mind....

Linuxthreads is a rather bad pthreads implementation performance-wise,
mostly due to the rather different linux-native threading model.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.


* Re: threading question
  2001-06-12 18:24 threading question ognen
                   ` (2 preceding siblings ...)
  2001-06-12 18:58 ` threading question Christoph Hellwig
@ 2001-06-12 19:06 ` Kip Macy
  2001-06-12 19:14   ` Alexander Viro
                     ` (2 more replies)
  2001-06-12 22:41 ` threading question Pavel Machek
  4 siblings, 3 replies; 37+ messages in thread
From: Kip Macy @ 2001-06-12 19:06 UTC (permalink / raw)
  To: ognen; +Cc: linux-kernel

This may sound like flamebait, but it's not. Linux threads are basically
just processes that share the same address space. Their performance is
measurably worse than it is on most commercial Unixes and FreeBSD.
They are not, or at least two years ago were not, POSIX compliant
(they behaved badly with respect to signals). The impoverished
implementation of threads is not an accidental oversight, threads are not
looked upon favorably by most of the core linux kernel hackers. A quote
from Larry McVoy's home page attributed to Alan Cox illustrates this
reasonably well: "A computer is a state machine. Threads are for people
who can't program state machines." Sorry for not being more helpful.

		-Kip


On Tue, 12 Jun 2001 ognen@gene.pbi.nrc.ca wrote:

> Hello,
> 
> I am a summer student implementing a multi-threaded version of a very
> popular bioinformatics tool. So far it compiles and runs without problems
> (as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
> OSF/1 running on Alpha. I have run a lot of timing tests compared to the
> sequential version of the tool on all of these machines (most of them are
> dual-CPU, although I am also running tests on 12-CPU Solaris and 108-CPU
> SGI IRIX). On dual-CPU machines the speedups are as follows: my version
> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower? It is the
> same multi-threaded code, I am not using any tricks, the code basically
> uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> size is set to 8K (but the numbers are the same with larger/smaller stack
> sizes).
> 
> Is there anything I am missing? Is this to be expected due to the Linux way of
> handling threads (clone call)? I am just trying to explain the numbers and
> nothing else comes to mind....
> 
> Best regards,
> Ognen Duzlevski
> -- 
> ognen@gene.pbi.nrc.ca
> Plant Biotechnology Institute
> National Research Council of Canada
> Bioinformatics team
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



* Re: threading question
  2001-06-12 18:58 ` threading question Christoph Hellwig
@ 2001-06-12 19:07   ` ognen
  2001-06-12 19:15     ` Kip Macy
                       ` (2 more replies)
  2001-06-12 21:44   ` Davide Libenzi
  1 sibling, 3 replies; 37+ messages in thread
From: ognen @ 2001-06-12 19:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel

Hello,

due to the nature of the problem (a pairwise mutual alignment of n
sequences results in max. n^2 alignments which can each be done in a
separate thread), I need to create and destroy the threads frequently.

I am not really comfortable with 1.4 - 1.5 speedups since the solution was
intended as a Linux one primarily and it just happened that it works (and
now even better) on Solaris/SGI/OSF...

Best regards,
Ognen

On Tue, 12 Jun 2001, Christoph Hellwig wrote:

> In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you wrote:
> > On dual-CPU machines the speedups are as follows: my version
> > is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> > 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> > kernel. Why are the numbers on Linux machines so much lower?
>
> Does your measurement include the time needed to actually create the
> threads or do you even frequently create and destroy threads?
>
> The code for creating threads is _horribly_ slow in Linuxthreads due
> to the way it is implemented.
>
> > It is the
> > same multi-threaded code, I am not using any tricks, the code basically
> > uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> > size is set to 8K (but the numbers are the same with larger/smaller stack
> > sizes).
> >
> > Is there anything I am missing? Is this to be expected due to the Linux way of
> > handling threads (clone call)? I am just trying to explain the numbers and
> > nothing else comes to mind....
>
> Linuxthreads is a rather bad pthreads implementation performance-wise,
> mostly due to the rather different linux-native threading model.
>
> 	Christoph



* Re: threading question
  2001-06-12 19:06 ` Kip Macy
@ 2001-06-12 19:14   ` Alexander Viro
  2001-06-12 19:25     ` Russell Leighton
  2001-06-13 17:31   ` bert hubert
  2001-06-14 18:28   ` Alan Cox
  2 siblings, 1 reply; 37+ messages in thread
From: Alexander Viro @ 2001-06-12 19:14 UTC (permalink / raw)
  To: Kip Macy; +Cc: ognen, linux-kernel



On Tue, 12 Jun 2001, Kip Macy wrote:

> implementation of threads is not an accidental oversight, threads are not
> looked upon favorably by most of the core linux kernel hackers. A quote

s/threads/POSIX threads/.



* Re: threading question
  2001-06-12 19:07   ` ognen
@ 2001-06-12 19:15     ` Kip Macy
  2001-06-12 19:29       ` Christoph Hellwig
  2001-06-12 19:15     ` Christoph Hellwig
  2001-06-13 12:20     ` Kurt Garloff
  2 siblings, 1 reply; 37+ messages in thread
From: Kip Macy @ 2001-06-12 19:15 UTC (permalink / raw)
  To: ognen; +Cc: linux-kernel

For heavy threading, try a user-level threads package.

		-Kip


On Tue, 12 Jun 2001 ognen@gene.pbi.nrc.ca wrote:

> Hello,
> 
> due to the nature of the problem (a pairwise mutual alignment of n
> sequences results in max. n^2 alignments which can each be done in a
> separate thread), I need to create and destroy the threads frequently.
> 
> I am not really comfortable with 1.4 - 1.5 speedups since the solution was
> intended as a Linux one primarily and it just happened that it works (and
> now even better) on Solaris/SGI/OSF...
> 
> Best regards,
> Ognen
> 
> On Tue, 12 Jun 2001, Christoph Hellwig wrote:
> 
> > In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you wrote:
> > > On dual-CPU machines the speedups are as follows: my version
> > > is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> > > 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> > > kernel. Why are the numbers on Linux machines so much lower?
> >
> > Does your measurement include the time needed to actually create the
> > threads or do you even frequently create and destroy threads?
> >
> > The code for creating threads is _horribly_ slow in Linuxthreads due
> > to the way it is implemented.
> >
> > > It is the
> > > same multi-threaded code, I am not using any tricks, the code basically
> > > uses PTHREAD_CREATE_DETACHED and PTHREAD_SCOPE_SYSTEM and the thread stack
> > > size is set to 8K (but the numbers are the same with larger/smaller stack
> > > sizes).
> > >
> > > Is there anything I am missing? Is this to be expected due to the Linux way of
> > > handling threads (clone call)? I am just trying to explain the numbers and
> > > nothing else comes to mind....
> >
> > Linuxthreads is a rather bad pthreads implementation performance-wise,
> > mostly due to the rather different linux-native threading model.
> >
> > 	Christoph
> 



* Re: threading question
  2001-06-12 19:07   ` ognen
  2001-06-12 19:15     ` Kip Macy
@ 2001-06-12 19:15     ` Christoph Hellwig
  2001-06-13 12:20     ` Kurt Garloff
  2 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2001-06-12 19:15 UTC (permalink / raw)
  To: ognen; +Cc: linux-kernel

On Tue, Jun 12, 2001 at 01:07:11PM -0600, ognen@gene.pbi.nrc.ca wrote:
> Hello,
> 
> due to the nature of the problem (a pairwise mutual alignment of n
> sequences results in max. n^2 alignments which can each be done in a
> separate thread), I need to create and destroy the threads frequently.
> 
> I am not really comfortable with 1.4 - 1.5 speedups since the solution was
> intended as a Linux one primarily and it just happened that it works (and
> now even better) on Solaris/SGI/OSF...

If you heavily create threads under load you're rather screwed.  If you want
to stay with the (IMHO rather suboptimal) POSIX threads API you might want
to take a look at the stuff IBM has produced:

	http://oss.software.ibm.com/developerworks/projects/pthreads/

Otherwise a simple wrapper for clone might be a _lot_ faster, but has its
own disadvantages: no ready-to-use locking primitives, no cross-platform
support (ok, it should be portable to the FreeBSD rfork easily).

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.


* Re: threading question
  2001-06-12 19:14   ` Alexander Viro
@ 2001-06-12 19:25     ` Russell Leighton
  2001-06-12 23:27       ` Mike Castle
  0 siblings, 1 reply; 37+ messages in thread
From: Russell Leighton @ 2001-06-12 19:25 UTC (permalink / raw)
  To: linux-kernel




Any recommendations for alternate threading packages?

Alexander Viro wrote:

> On Tue, 12 Jun 2001, Kip Macy wrote:
>
> > implementation of threads is not an accidental oversight, threads are not
> > looked upon favorably by most of the core linux kernel hackers. A quote
>
> s/threads/POSIX threads/.
>

--
---------------------------------------------------
Russell Leighton    russell.leighton@247media.com
---------------------------------------------------




* Re: threading question
  2001-06-12 19:15     ` Kip Macy
@ 2001-06-12 19:29       ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2001-06-12 19:29 UTC (permalink / raw)
  To: Kip Macy; +Cc: linux-kernel

In article <Pine.GSO.4.10.10106121214380.20809-100000@orbit-fe.eng.netapp.com> you wrote:
> For heavy threading, try a user-level threads package.

Sure, userlevel threading is the best way to get SMP-scalability...

-- 
Of course it doesn't work. We've performed a software upgrade.


* Re: threading question
  2001-06-12 18:58 ` threading question Christoph Hellwig
  2001-06-12 19:07   ` ognen
@ 2001-06-12 21:44   ` Davide Libenzi
  2001-06-12 21:48     ` ognen
  2001-06-12 21:58     ` threading question Albert D. Cahalan
  1 sibling, 2 replies; 37+ messages in thread
From: Davide Libenzi @ 2001-06-12 21:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, ognen


On 12-Jun-2001 Christoph Hellwig wrote:
> In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you
> wrote:
>> On dual-CPU machines the speedups are as follows: my version
>> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
>> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
>> kernel. Why are the numbers on Linux machines so much lower?
> 
> Does your measurement include the time needed to actually create the
> threads or do you even frequently create and destroy threads?

This is an extract of the busiest part of the vmstat report taken while running his tool:

12  0  0  15508  40980  24880 355480   0   0     0     0  141   481 100   0   0
19  0  0  15508  40248  24880 355480   0   0     0     0  142   564 100   0   0
12  0  0  15508  40112  24880 355480   0   0     0     0  150   543 100   0   0
11  0  0  15508  41272  24880 355480   0   0     0     0  156   594  99   1   0
17  0  0  15508  40408  24880 355480   0   0     0     0  156   474  99   1   0
17  0  0  15508  39840  24880 355480   0   0     0     0  135   475 100   0   0
21  0  0  15508  39568  24880 355480   0   0     0     0  125   409 100   0   0
21  0  0  15508  39668  24880 355480   0   0     0     0  135   420 100   0   0
16  0  0  15508  39760  24880 355480   0   0     0     0  149   486 100   0   0


The context switch rate is very low and user CPU utilization is 100%, so I don't
think the system is responsible here (clearly a CPU-bound program).
Even though the run queue is long, the context switch rate is low.
I have next to me a dual PIII 1GHz workstation that runs an MTA that uses
Linux pthreads with context switching ranging between 5000 and 11000 with a
thread creation rate of about 300 threads/sec (relaying 600000 msg/hour).
No problem at all with the system even if the load avg is a bit high
(about 8).




- Davide



* Re: threading question
  2001-06-12 21:44   ` Davide Libenzi
@ 2001-06-12 21:48     ` ognen
  2001-06-14 18:15       ` Alan Cox
  2001-06-12 21:58     ` threading question Albert D. Cahalan
  1 sibling, 1 reply; 37+ messages in thread
From: ognen @ 2001-06-12 21:48 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: linux-kernel

Hello,

a good suggestion was given to me: create only as many threads as
there are CPUs (or a few more) and have each one ask for more work when
it is done. This should help (and avoid the repeated pthread_create and
pthread_exit calls). I will implement this and report my results if there is
interest.
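
[Editorial sketch, not part of the original message.] Sizing the pool to the machine needs the number of online CPUs. A minimal sketch, assuming a system whose `sysconf` supports the widely available (but not strictly POSIX) `_SC_NPROCESSORS_ONLN` name; the helper name `default_worker_count` is hypothetical.

```c
#include <unistd.h>

/*
 * Number of worker threads to start: one per online CPU,
 * falling back to 1 if the count cannot be determined.
 */
int default_worker_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}
```

Adding one or two to the result would match the "or a few more" part of the suggestion; whether that helps depends on how often the workers block.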

Thank you all,
Ognen

On Tue, 12 Jun 2001, Davide Libenzi wrote:

>
> On 12-Jun-2001 Christoph Hellwig wrote:
> > In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you
> > wrote:
> >> On dual-CPU machines the speedups are as follows: my version
> >> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> >> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> >> kernel. Why are the numbers on Linux machines so much lower?
> >
> > Does your measurement include the time needed to actually create the
> > threads or do you even frequently create and destroy threads?
>
> This is an extract of the busiest part of the vmstat report taken while running his tool:
>
> 12  0  0  15508  40980  24880 355480   0   0     0     0  141   481 100   0   0
> 19  0  0  15508  40248  24880 355480   0   0     0     0  142   564 100   0   0
> 12  0  0  15508  40112  24880 355480   0   0     0     0  150   543 100   0   0
> 11  0  0  15508  41272  24880 355480   0   0     0     0  156   594  99   1   0
> 17  0  0  15508  40408  24880 355480   0   0     0     0  156   474  99   1   0
> 17  0  0  15508  39840  24880 355480   0   0     0     0  135   475 100   0   0
> 21  0  0  15508  39568  24880 355480   0   0     0     0  125   409 100   0   0
> 21  0  0  15508  39668  24880 355480   0   0     0     0  135   420 100   0   0
> 16  0  0  15508  39760  24880 355480   0   0     0     0  149   486 100   0   0
>
>
> The context switch rate is very low and user CPU utilization is 100%, so I don't
> think the system is responsible here (clearly a CPU-bound program).
> Even though the run queue is long, the context switch rate is low.
> I have next to me a dual PIII 1GHz workstation that runs an MTA that uses
> Linux pthreads with context switching ranging between 5000 and 11000 with a
> thread creation rate of about 300 threads/sec (relaying 600000 msg/hour).
> No problem at all with the system even if the load avg is a bit high
> (about 8).

-- 
Ognen Duzlevski
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team



* Re: threading question
  2001-06-12 21:44   ` Davide Libenzi
  2001-06-12 21:48     ` ognen
@ 2001-06-12 21:58     ` Albert D. Cahalan
  2001-06-12 23:48       ` J . A . Magallon
  1 sibling, 1 reply; 37+ messages in thread
From: Albert D. Cahalan @ 2001-06-12 21:58 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Christoph Hellwig, linux-kernel, ognen

Davide Libenzi writes:
> On 12-Jun-2001 Christoph Hellwig wrote:
>> In article <Pine.LNX.4.30.0106121213570.24593-100000@gene.pbi.nrc.ca> you
>> wrote:

>>> On dual-CPU machines the speedups are as follows: my version
>>> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
>>> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux
>>> 2.4 kernel. Why are the numbers on Linux machines so much lower?
...
> The context switch rate is very low and user CPU utilization is 100%,
> so I don't think the system is responsible here (clearly a CPU-bound
> program).  Even though the run queue is long, the context switch rate is low.
> I have next to me a dual PIII 1GHz workstation that runs an MTA
> that uses Linux pthreads with context switching ranging between 5000
> and 11000 with a thread creation rate of about 300 threads/sec
> (relaying 600000 msg/hour).  No problem at all with the system even
> if the load avg is a bit high (about 8).

In that case, this could be a hardware issue. Note that he seems
to be comparing an x86 PC against SGI MIPS, Sun SPARC, and Compaq
Alpha hardware.

His data set is most likely huge. It's DNA data.

The x86 box likely has small caches, a fast core, and a slow bus.
So most of the time the CPU will be stalled waiting for a memory
operation.

Maybe there are performance monitor registers that could be used
to determine if this is the case.

(Not that the app design is sane though.)



* Re: threading question
  2001-06-12 18:24 threading question ognen
                   ` (3 preceding siblings ...)
  2001-06-12 19:06 ` Kip Macy
@ 2001-06-12 22:41 ` Pavel Machek
  4 siblings, 0 replies; 37+ messages in thread
From: Pavel Machek @ 2001-06-12 22:41 UTC (permalink / raw)
  To: ognen, linux-kernel

Hi!

> I am a summer student implementing a multi-threaded version of a very
> popular bioinformatics tool. So far it compiles and runs without problems
> (as far as I can tell ;) on Linux 2.2.x, Sun Solaris, SGI IRIX and Compaq
> OSF/1 running on Alpha. I have run a lot of timing tests compared to the
> sequential version of the tool on all of these machines (most of them are
> dual-CPU, although I am also running tests on 12-CPU Solaris and 108-CPU
> SGI IRIX). On dual-CPU machines the speedups are as follows: my version
> is 1.88 times faster than the sequential one on IRIX, 1.81 times on Solaris,
> 1.8 times on OSF/1, 1.43 times on Linux 2.2.x and 1.52 times on Linux 2.4
> kernel. Why are the numbers on Linux machines so much lower? It is
> the

But this is all different hw, no?

So dual cpu SPARC is more efficient than dual cpu i686. Maybe SPARCs
have faster RAM and slower cpus... 
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org


* Re: threading question
  2001-06-12 19:25     ` Russell Leighton
@ 2001-06-12 23:27       ` Mike Castle
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Castle @ 2001-06-12 23:27 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jun 12, 2001 at 03:25:54PM -0400, Russell Leighton wrote:
> Any recommendations for alternate threading packages?

Does NSPR use native methods (ie, clone), or just ride on top of pthreads?

What about the gnu threading package?

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc


* Re: threading question
  2001-06-12 21:58     ` threading question Albert D. Cahalan
@ 2001-06-12 23:48       ` J . A . Magallon
  0 siblings, 0 replies; 37+ messages in thread
From: J . A . Magallon @ 2001-06-12 23:48 UTC (permalink / raw)
  To: Albert D . Cahalan; +Cc: Davide Libenzi, Christoph Hellwig, linux-kernel, ognen


On 20010612 Albert D. Cahalan wrote:
> 
> In that case, this could be a hardware issue. Note that he seems
> to be comparing an x86 PC against SGI MIPS, Sun SPARC, and Compaq
> Alpha hardware.
> 
> His data set is most likely huge. It's DNA data.
> 
> The x86 box likely has small caches, a fast core, and a slow bus.
> So most of the time the CPU will be stalled waiting for a memory
> operation.
> 

Perhaps it is just synchronization of caches.
Say you want to sum all the elements of a vector in parallel, split in
two pieces:

int total=0;
thread 1:
	for first half
		total += v[i]
thread 2:
	for second half
		total += v[i]

and you thought: 'well, I need a mutex for access to total. That will slow
things down, let's use separate counters':

int bigtotal;
int total[2];
thread 1:
	for first half
		total[0] += v[i]
thread 2:
	for second half
		total[1] += v[i]

bigtotal = total[0]+total[1]

The problem? total[0] and total[1] are right next to each other, so they are in
the same cache line. So on every write to total[?], even though they are
independent, the system has to synchronize caches.

Big iron (SGI, Sparc) has special hardware, but cheap PC mobos...
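
[Editorial sketch, not part of the original message.] The usual remedy for the cache-line sharing described above is to give each thread's counter its own cache line. A minimal C11 sketch follows; the 64-byte line size is an assumption (it varies by CPU), and the names `padded_sum`, `slice`, and `parallel_sum` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define NTHREADS   2

/* Alignment forces each counter onto its own cache line. */
struct padded_sum {
    _Alignas(CACHE_LINE) long value;
};

struct slice {
    const int *v;
    size_t begin, end;          /* half-open range [begin, end) */
    struct padded_sum *out;
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    for (size_t i = s->begin; i < s->end; i++)
        s->out->value += s->v[i];   /* writes stay within this thread's line */
    return NULL;
}

/* Sum v[0..n) using NTHREADS threads with one padded counter each. */
long parallel_sum(const int *v, size_t n)
{
    struct padded_sum totals[NTHREADS] = {{0}};
    struct slice slices[NTHREADS];
    pthread_t tids[NTHREADS];
    int started[NTHREADS];
    long bigtotal = 0;

    for (int t = 0; t < NTHREADS; t++) {
        slices[t].v = v;
        slices[t].begin = n * t / NTHREADS;
        slices[t].end = n * (t + 1) / NTHREADS;
        slices[t].out = &totals[t];
        started[t] = pthread_create(&tids[t], NULL, sum_slice, &slices[t]) == 0;
        if (!started[t])
            sum_slice(&slices[t]);  /* fall back to the calling thread */
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (started[t])
            pthread_join(tids[t], NULL);
        bigtotal += totals[t].value;
    }
    return bigtotal;
}
```

Accumulating into a local variable inside `sum_slice` and storing it once at the end would avoid the shared writes altogether; the padded layout is shown because it mirrors the separate-counters code in the message.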

-- 
J.A. Magallon                           #  Let the source be with you...        
mailto:jamagallon@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686


* Re: threading question
  2001-06-12 19:07   ` ognen
  2001-06-12 19:15     ` Kip Macy
  2001-06-12 19:15     ` Christoph Hellwig
@ 2001-06-13 12:20     ` Kurt Garloff
  2001-06-13 13:35       ` J . A . Magallon
  2 siblings, 1 reply; 37+ messages in thread
From: Kurt Garloff @ 2001-06-13 12:20 UTC (permalink / raw)
  To: ognen; +Cc: Christoph Hellwig, linux-kernel


On Tue, Jun 12, 2001 at 01:07:11PM -0600, ognen@gene.pbi.nrc.ca wrote:
> due to the nature of the problem (a pairwise mutual alignment of n
> sequences results in max. n^2 alignments which can each be done in a
> separate thread), I need to create and destroy the threads frequently.
> 
> I am not really comfortable with 1.4 - 1.5 speedups since the solution was
> intended as a Linux one primarily and it just happened that it works (and
> now even better) on Solaris/SGI/OSF...

Nor would I. 

What I do in my numerics code to avoid this problem is to create all the
threads (as many as there are CPUs) on program startup and have them wait
(block) for a condition. As soon as there's something to do, variables for
the thread are set up (protected by a mutex) and the thread gets signalled
(cond_signal).
If you're interested in the code, tell me.

This is supposed to be much faster than thread creation.
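
A rough sketch of such a pool, for readers who want something concrete. This
is not the code offered above; the struct pool layout and the names
pool_start, pool_submit, pool_finish and square_job are invented for the
illustration, and POOL_SIZE is simply assumed to equal the number of CPUs.

```c
#include <assert.h>
#include <pthread.h>

#define POOL_SIZE 2                 /* assumed: one worker per CPU */

struct pool {
    pthread_mutex_t lock;
    pthread_cond_t  work_ready;     /* signalled when a job is posted */
    pthread_cond_t  slot_free;      /* signalled when a job is taken or done */
    void (*fn)(void *);             /* pending job, NULL if none */
    void *arg;
    int busy;                       /* jobs currently executing */
    int shutdown;
    pthread_t tid[POOL_SIZE];
};

static void *worker(void *p)
{
    struct pool *pl = p;
    pthread_mutex_lock(&pl->lock);
    for (;;) {
        while (pl->fn == NULL && !pl->shutdown)
            pthread_cond_wait(&pl->work_ready, &pl->lock);
        if (pl->fn == NULL)         /* shutdown requested, nothing pending */
            break;
        void (*fn)(void *) = pl->fn;
        void *arg = pl->arg;
        pl->fn = NULL;              /* claim the job */
        pl->busy++;
        pthread_cond_broadcast(&pl->slot_free);
        pthread_mutex_unlock(&pl->lock);

        fn(arg);                    /* run the job outside the lock */

        pthread_mutex_lock(&pl->lock);
        pl->busy--;
        pthread_cond_broadcast(&pl->slot_free);
    }
    pthread_mutex_unlock(&pl->lock);
    return NULL;
}

/* Create all worker threads once, at program startup. */
void pool_start(struct pool *pl)
{
    pthread_mutex_init(&pl->lock, NULL);
    pthread_cond_init(&pl->work_ready, NULL);
    pthread_cond_init(&pl->slot_free, NULL);
    pl->fn = NULL; pl->arg = NULL; pl->busy = 0; pl->shutdown = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        int rc = pthread_create(&pl->tid[i], NULL, worker, pl);
        assert(rc == 0);
    }
}

/* Post one job; blocks while the previous job is still unclaimed. */
void pool_submit(struct pool *pl, void (*fn)(void *), void *arg)
{
    pthread_mutex_lock(&pl->lock);
    while (pl->fn != NULL)
        pthread_cond_wait(&pl->slot_free, &pl->lock);
    pl->fn = fn;
    pl->arg = arg;
    pthread_cond_signal(&pl->work_ready);
    pthread_mutex_unlock(&pl->lock);
}

/* Wait until every job has finished, then stop and join the workers. */
void pool_finish(struct pool *pl)
{
    pthread_mutex_lock(&pl->lock);
    while (pl->fn != NULL || pl->busy > 0)
        pthread_cond_wait(&pl->slot_free, &pl->lock);
    pl->shutdown = 1;
    pthread_cond_broadcast(&pl->work_ready);
    pthread_mutex_unlock(&pl->lock);
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_join(pl->tid[i], NULL);
}

/* Example job: square the int that arg points to. */
void square_job(void *arg)
{
    int *x = arg;
    *x = *x * *x;
}
```

The waits are wrapped in while loops that re-check the condition under the
mutex, so a wakeup that arrives before a thread starts waiting, or a spurious
one, is harmless.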

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, FRG                               SCSI, Security

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-13 12:20     ` Kurt Garloff
@ 2001-06-13 13:35       ` J . A . Magallon
  2001-06-13 14:17         ` Philips
  0 siblings, 1 reply; 37+ messages in thread
From: J . A . Magallon @ 2001-06-13 13:35 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: ognen, Christoph Hellwig, linux-kernel


On 20010613 Kurt Garloff wrote:
> 
> What I do in my numerics code to avoid this problem, is to create all the
> threads (as many as there are CPUs) on program startup and have then wait
> (block) for a condition. As soon as there's something to to, variables for
> the thread are setup (protected by a mutex) and the thread gets signalled
> (cond_signal).
> If you're interested in the code, tell me.
> 

I use the reverse approach. You feed work to the threads; I create the threads
and let them ask a master for work until it says 'done'. When the
master is queried for work, it locks a mutex, decides the next piece of work
for that thread, and unlocks it. I think it gives less contention and
is simpler to manage.
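
A minimal sketch of that pull model, assuming the 'master' is nothing more
than a mutex-protected index into an array of work items. The names master,
get_work and process_all, and the doubling that stands in for real work, are
placeholders for the illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NWORKERS 2                  /* assumed: one worker per CPU */

struct master {
    pthread_mutex_t lock;
    int next;                       /* next unassigned item */
    int nitems;
    int *items;
};

/* Ask the master for work: returns an item index, or -1 meaning 'done'. */
static int get_work(struct master *m)
{
    int idx = -1;
    pthread_mutex_lock(&m->lock);
    if (m->next < m->nitems)
        idx = m->next++;
    pthread_mutex_unlock(&m->lock);
    return idx;
}

static void *worker(void *arg)
{
    struct master *m = arg;
    int idx;
    while ((idx = get_work(m)) >= 0)
        m->items[idx] *= 2;         /* stand-in for the real computation */
    return NULL;
}

/* Double every element of items[0..n) with NWORKERS self-scheduling threads. */
void process_all(int *items, int n)
{
    struct master m;
    pthread_t tid[NWORKERS];

    pthread_mutex_init(&m.lock, NULL);
    m.next = 0;
    m.nitems = n;
    m.items = items;

    for (int t = 0; t < NWORKERS; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, &m);
        assert(rc == 0);
    }
    for (int t = 0; t < NWORKERS; t++)
        pthread_join(tid[t], NULL);
    pthread_mutex_destroy(&m.lock);
}
```

The lock is held only long enough to hand out an index, so contention stays
low as long as each item is much more expensive than the toy work used here.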

-- 
J.A. Magallon                           #  Let the source be with you...        
mailto:jamagallon@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-13 13:35       ` J . A . Magallon
@ 2001-06-13 14:17         ` Philips
  2001-06-13 15:06           ` ognen
  0 siblings, 1 reply; 37+ messages in thread
From: Philips @ 2001-06-13 14:17 UTC (permalink / raw)
  To: linux-kernel

"J . A . Magallon" wrote:
> 
> On 20010613 Kurt Garloff wrote:
> >
> > What I do in my numerics code to avoid this problem, is to create all the
> > threads (as many as there are CPUs) on program startup and have then wait
> > (block) for a condition. As soon as there's something to to, variables for
> > the thread are setup (protected by a mutex) and the thread gets signalled
> > (cond_signal).
> > If you're interested in the code, tell me.
> >
> 
> I use the reverse approach. you feed work to the threads, I create the threads
> and let them ask for work to a master until it says 'done'. When the
> master is queried for work, it locks a mutex, decide the next work for
> that thread, and unlocks it. I think it gives the lesser contention and
> is simpler to manage.
> 

	BTW.
	A question kept popping into my mind and finally got a negative answer from my mind ;-)

	Is it possible to do something like:


	char a[100] = {...}
	char b[100] = {...}
	char c[100];
	char d[100];
	
	1: { // run this on first CPU
		for (int i=0; i<100; i++) c[i] = a[i] + b[i];
	};
	2: { // run this on any other CPU
		for (int i=0; i<100; i++) d[i] = a[i] * b[i];
	};
	
	...
	// do something else...
	...
	
	wait 1,2; // to be sure c[] and d[] are ready.


	What was popping into my mind: some prefix (like the 0x66 Intel uses for 32-bit
instructions) to say that this instruction should run on another CPU?
	I know - stupid idea. Too many questions will arise.
	Suppose we do

	PREFIX jmp far some_routine

	and this routine runs on another CPU without blocking the current execution thread.
	(Who will clean the stack? When? Questions without answers...)

	Is there anything like this in the computer world? I heard about old computers that
have a special instruction set to implicitly run code on a given processor.
	Is it possible to emulate this behavior on PCs?
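
	For comparison, the closest portable emulation is probably plain pthreads:
start each block with pthread_create() and use pthread_join() as the
"wait 1,2". The sketch below assumes that reading; the names vecs, add_task,
mul_task and add_and_mul are made up, int is used instead of char to avoid
overflow in the products, and which CPU each thread lands on is left to the
scheduler.

```c
#include <assert.h>
#include <pthread.h>

#define N 100

struct vecs {
    const int *a, *b;
    int *out;
};

static void *add_task(void *p)      /* block 1: c[i] = a[i] + b[i] */
{
    struct vecs *t = p;
    for (int i = 0; i < N; i++)
        t->out[i] = t->a[i] + t->b[i];
    return NULL;
}

static void *mul_task(void *p)      /* block 2: d[i] = a[i] * b[i] */
{
    struct vecs *t = p;
    for (int i = 0; i < N; i++)
        t->out[i] = t->a[i] * t->b[i];
    return NULL;
}

/* Compute c = a + b and d = a * b concurrently, then wait for both. */
void add_and_mul(const int *a, const int *b, int *c, int *d)
{
    struct vecs t1 = { a, b, c }, t2 = { a, b, d };
    pthread_t th1, th2;
    int rc;

    rc = pthread_create(&th1, NULL, add_task, &t1);
    assert(rc == 0);
    rc = pthread_create(&th2, NULL, mul_task, &t2);
    assert(rc == 0);

    /* ... do something else here ... */

    pthread_join(th1, NULL);        /* "wait 1,2": c[] and d[] are ready */
    pthread_join(th2, NULL);
}
```

For loops this short the cost of creating the threads is likely to outweigh
the work itself, so this only pays off for much larger blocks.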

[-- Attachment #2: Card for Philips --]
[-- Type: text/x-vcard, Size: 407 bytes --]

begin:vcard 
n:Filiapau;Ihar
tel;pager:+375 (0) 17 2850000#6683
tel;fax:+375 (0) 17 2841537
tel;home:+375 (0) 17 2118441
tel;work:+375 (0) 17 2841371
x-mozilla-html:TRUE
url:www.iph.to
org:Enformatica Ltd.;Linux Developement Department
adr:;;Kalinine str. 19-18;Minsk;BY;220012;Belarus
version:2.1
email;internet:philips@iph.to
title:Software Developer
note:(none)
x-mozilla-cpt:;18368
fn:Philips
end:vcard

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-13 14:17         ` Philips
@ 2001-06-13 15:06           ` ognen
  0 siblings, 0 replies; 37+ messages in thread
From: ognen @ 2001-06-13 15:06 UTC (permalink / raw)
  To: Philips; +Cc: linux-kernel

Solaris has pset_create() and pset_bind() where you can bind LWPs to
specific processors, but I doubt this works on anything else....

Best regards,
Ognen

On Wed, 13 Jun 2001, Philips wrote:

> 	BTW.
> 	Question was poping in my mind and finally got negative answer by my mind ;-)
>
> 	Is it possible to make somethis like:
>
>
> 	char a[100] = {...}
> 	char b[100] = {...}
> 	char c[100];
> 	char d[100];
>
> 	1: { // run this on first CPU
> 		for (int i=0; i<100; i++) c[i] = a[i] + b[i];
> 	};
> 	2: { // run this on any other CPU
> 		for (int i=0; i<100; i++) d[i] = a[i] * b[i];
> 	};
>
> 	...
> 	// do something else...
> 	...
>
> 	wait 1,2; // to be sure c[] and d[] are ready.
>
>
> 	what was popping in my mind - some prefix (like 0x66 Intel used for 32
> instructions) to say this instruction should run on other CPU?
> 	I know - stupid idea. Too many questions will arise.
> 	If we will do
>
> 	PREFIX jmp far some_routing
>
> 	and this routing will run on other CPU not blocking current execution thread.
> 	(who will clean stack? when?.. question without answers...)
>
> 	Is there anything like this in computerworld? I heard about old computers that
> have a speacial instruction set to implicit run code on given processor.
> 	Is it possible to emulate this behavior on PCs?

-- 
Ognen Duzlevski
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-12 19:06 ` Kip Macy
  2001-06-12 19:14   ` Alexander Viro
@ 2001-06-13 17:31   ` bert hubert
  2001-06-14  6:45     ` Helge Hafting
  2001-06-14 18:28   ` Alan Cox
  2 siblings, 1 reply; 37+ messages in thread
From: bert hubert @ 2001-06-13 17:31 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jun 12, 2001 at 12:06:40PM -0700, Kip Macy wrote:
> This may sound like flamebait, but its not. Linux threads are basically
> just processes that share the same address space. Their performance is
> measurably worse than it is on most commercial Unixes and FreeBSD.

Thread creation may be a bit slow. But the kludges to provide posix threads
completely from userspace also hurt. Notably, they do not scale over
multiple CPUs.

> They are not, or at least two years ago, were not POSIX compliant
> (they behaved badly with respect to signals). The impoverished

POSIX threads are silly with respect to signals. I do almost all my
programming these days with pthreads and I find that I really do not miss
signals at all.

> from Larry McVoy's home page attributed to Alan Cox illustrates this
> reasonably well: "A computer is a state machine. Threads are for people
> who can't program state machines." Sorry for not being more helpful.

I got that response too. When I pressed kernel people for details it turns
out that they think having hundreds of runnable threads/processes (mostly
the same thing under Linux) is wasteful. The scheduler is just not optimised
for that.

Regards,

bert

-- 
http://www.PowerDNS.com      Versatile DNS Services  
Trilab                       The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-13 17:31   ` bert hubert
@ 2001-06-14  6:45     ` Helge Hafting
  0 siblings, 0 replies; 37+ messages in thread
From: Helge Hafting @ 2001-06-14  6:45 UTC (permalink / raw)
  To: bert hubert; +Cc: linux-kernel

bert hubert wrote:

> > from Larry McVoy's home page attributed to Alan Cox illustrates this
> > reasonably well: "A computer is a state machine. Threads are for people
> > who can't program state machines." Sorry for not being more helpful.
> 
> I got that response too. When I pressed kernel people for details it turns
> out that they think having hundreds of runnable threads/processes (mostly
> the same thing under Linux) is wasteful. The scheduler is just not optimised
> for that.

The scheduler can be optimized for that, so far at the cost of pessimizing
the common case with few threads.  The bigger problem here is that
your CPU (particularly the TLBs and caches) isn't optimized for switching
between a lot of threads either.  This will always be a problem as long
as CPUs have level 1 caches much smaller than the combined working
set of your threads.  So run one thread per CPU, perhaps two if you expect
I/O stalls.  The task at hand may easily be divided into many more parts,
but serializing those extra parts will be better for performance.
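
A small sketch of that sizing rule: ask the system how many CPUs are online,
start that many threads, and give each one a contiguous slice of the parts to
run serially. sysconf(_SC_NPROCESSORS_ONLN) is a common extension (glibc and
Solaris provide it) rather than part of POSIX; worker_count and slice are
names invented for this example.

```c
#include <assert.h>
#include <unistd.h>

/* Number of worker threads to start: one per online CPU, at least 1. */
long worker_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);  /* extension, not POSIX */
    return n > 0 ? n : 1;
}

/* Give thread t (0-based) of nthreads a contiguous slice [*begin, *end)
 * of n parts; the thread then works through its slice serially. */
void slice(long n, long nthreads, long t, long *begin, long *end)
{
    assert(nthreads > 0 && t >= 0 && t < nthreads);
    *begin = n * t / nthreads;
    *end   = n * (t + 1) / nthreads;
}
```

With 10 parts and 3 threads, for instance, the slices come out as [0,3),
[3,6) and [6,10), so every part is covered exactly once.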

Helge Hafting

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-12 21:48     ` ognen
@ 2001-06-14 18:15       ` Alan Cox
  2001-06-14 22:42         ` threading question (results after thread pooling) ognen
  0 siblings, 1 reply; 37+ messages in thread
From: Alan Cox @ 2001-06-14 18:15 UTC (permalink / raw)
  To: ognen; +Cc: Davide Libenzi, linux-kernel

> they are done. This should help it (and avoid the pthread_create,
> pthread_exit). I will implement this and report my results if there is
> interest.

You should also check up on the cache colouring. x86 boxes have relatively poor
memory performance and most x86 chips have lousy behaviour when data bounces
between processors or is driven out of cache.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-12 19:06 ` Kip Macy
  2001-06-12 19:14   ` Alexander Viro
  2001-06-13 17:31   ` bert hubert
@ 2001-06-14 18:28   ` Alan Cox
  2001-06-14 19:01     ` bert hubert
                       ` (2 more replies)
  2 siblings, 3 replies; 37+ messages in thread
From: Alan Cox @ 2001-06-14 18:28 UTC (permalink / raw)
  To: Kip Macy; +Cc: ognen, linux-kernel

> just processes that share the same address space. Their performance is
> measurably worse than it is on most commercial Unixes and FreeBSD.

Actually their performance is massively superior. But that is because we were
not stupid enough to burden the kernel with all of the posix pthread crap.
Pthreads is an ugly compromise API that can be badly implemented in both
userland and kernel space. Unfortunately it's also a standard.

So you have two choices
1.	Pthread performance is poorer due to library glue
2.	Every single signal delivery is 20% slower threaded or otherwise due
	to all the crap that it adds 
	And it does damage to other calls too.

In the big picture #1 is definitely preferable. 

There are really only two reasons for threaded programming. 

- Poor programmer skills/language expression of event handling

- OS implementation flaws (and yes the posix/sus unix api has some of these)

Co-routines or better language choices are much more efficient ways to express
the event handling problem.

fork() is often a better approach than pthreads at least for the design of an
SMP threaded application because unless you explicitly think about what you
share you will never get the cache affinity you need for good performance.
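
As an illustration of "explicitly think about what you share", here is a
sketch of a two-process sum in which the only shared memory is a small result
area mapped with MAP_SHARED; everything else becomes private after fork().
The function name fork_sum is invented, and MAP_ANONYMOUS is a widely
available extension rather than strict POSIX.

```c
#define _DEFAULT_SOURCE             /* for MAP_ANONYMOUS under strict -std= */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sum v[0..n) in two processes; only the two result slots are shared. */
long fork_sum(const int *v, size_t n)
{
    /* The one region both processes write to, shared on purpose. */
    long *result = mmap(NULL, 2 * sizeof(long), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(result != MAP_FAILED);

    size_t half = n / 2;
    long local = 0;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                 /* child: second half */
        for (size_t i = half; i < n; i++)
            local += v[i];
        result[1] = local;          /* one write to shared memory */
        _exit(0);
    }

    for (size_t i = 0; i < half; i++)   /* parent: first half */
        local += v[i];
    result[0] = local;

    int status;
    waitpid(pid, &status, 0);       /* child's write is visible after this */
    long total = result[0] + result[1];
    munmap(result, 2 * sizeof(long));
    return total;
}
```

The input array is only read, so both processes keep using the same physical
pages copy-on-write, and each writes its result slot exactly once.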

And if you don't care about cache affinity then you shouldn't care about
pthread_create overhead because quite frankly pthread_create overhead is easily
mitigated (thread cache) and in most real world applications considerably less
of a performance hit.

Alan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-14 18:28   ` Alan Cox
@ 2001-06-14 19:01     ` bert hubert
  2001-06-14 19:22       ` Russell Leighton
  2001-06-15 11:29       ` Anil Kumar
  2001-06-14 23:05     ` J . A . Magallon
  2001-06-16 14:16     ` Michael Rothwell
  2 siblings, 2 replies; 37+ messages in thread
From: bert hubert @ 2001-06-14 19:01 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kip Macy, ognen, linux-kernel

On Thu, Jun 14, 2001 at 07:28:32PM +0100, Alan Cox wrote:

> There are really only two reasons for threaded programming. 
> 
> - Poor programmer skills/language expression of event handling

The converse is that pthreads are:

 - Very easy to use from C at a reasonable runtime overhead

It is very convenient for a userspace coder to be able to just start a
function in a different thread. Now it might be so that a kernel is not
there to provide ease of use for userspace coders but it is a factor.

I see lots of people only using:
	pthread_create()/pthread_join()
	mutex_lock/unlock
	sem_post/sem_wait
	no signals
	
My gut feeling is that you could implement this subset in a way that is both
fast and right - although it would not be 'pthreads compliant'. Can anybody
confirm this feeling?
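
To make the subset concrete, here is a small sketch that uses only those
calls: pthread_create()/pthread_join(), a mutex, and sem_post()/sem_wait(),
with no signals. It passes integers through a one-element slot to two
consumer threads. The names handoff, put_value and sum_via_handoff are
invented for the example, and it relies on unnamed semaphores (sem_init),
which Linux supports.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

struct handoff {
    sem_t empty;                /* 1 when the slot may be written */
    sem_t full;                 /* 1 when the slot holds a value */
    int slot;
    pthread_mutex_t lock;       /* protects total */
    long total;
};

static void *consumer(void *arg)
{
    struct handoff *h = arg;
    for (;;) {
        sem_wait(&h->full);
        int v = h->slot;        /* we own the slot until we post empty */
        sem_post(&h->empty);
        if (v < 0)              /* sentinel: no more values */
            return NULL;
        pthread_mutex_lock(&h->lock);
        h->total += v;
        pthread_mutex_unlock(&h->lock);
    }
}

static void put_value(struct handoff *h, int v)
{
    sem_wait(&h->empty);
    h->slot = v;
    sem_post(&h->full);
}

/* Feed 1..n through the slot to two consumers; returns their combined sum. */
long sum_via_handoff(int n)
{
    struct handoff h;
    pthread_t c[2];

    sem_init(&h.empty, 0, 1);
    sem_init(&h.full, 0, 0);
    pthread_mutex_init(&h.lock, NULL);
    h.total = 0;

    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&c[i], NULL, consumer, &h);
        assert(rc == 0);
    }
    for (int v = 1; v <= n; v++)
        put_value(&h, v);
    for (int i = 0; i < 2; i++)
        put_value(&h, -1);      /* one sentinel per consumer */
    for (int i = 0; i < 2; i++)
        pthread_join(c[i], NULL);

    long total = h.total;
    sem_destroy(&h.empty);
    sem_destroy(&h.full);
    pthread_mutex_destroy(&h.lock);
    return total;
}
```

Because no signals are involved, the sketch does not retry sem_wait() on
EINTR; code that installs signal handlers would need to.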

Regards,

bert

-- 
http://www.PowerDNS.com      Versatile DNS Services  
Trilab                       The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-14 19:01     ` bert hubert
@ 2001-06-14 19:22       ` Russell Leighton
  2001-06-15 11:29       ` Anil Kumar
  1 sibling, 0 replies; 37+ messages in thread
From: Russell Leighton @ 2001-06-14 19:22 UTC (permalink / raw)
  To: linux-kernel


bert hubert wrote:

> <stuff deleted>
>
> I see lots of people only using:
>         pthread_create()/pthread_join()
>         mutex_lock/unlock
>         sem_post/sem_wait
>         no signals
>
> My gut feeling is that you could implement this subset in a way that is both
> fast and right - although it would not be 'pthreads compliant'. Can anybody
> confirm this feeling?

... add condition variables (maybe a small per-thread storage area)
and I'd toss out pthreads for most apps I write...especially if it is very efficient.

--
---------------------------------------------------
Russell Leighton    russell.leighton@247media.com
---------------------------------------------------



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question (results after thread pooling)
  2001-06-14 18:15       ` Alan Cox
@ 2001-06-14 22:42         ` ognen
  2001-06-14 23:00           ` Mike Castle
  0 siblings, 1 reply; 37+ messages in thread
From: ognen @ 2001-06-14 22:42 UTC (permalink / raw)
  To: linux-kernel

Hello,

I have implemented thread pooling (with an environment variable
where I can give the number of threads to be created). Results:

1. Linux, no change in the times (not under 2.2.x or 2.4)

2. SGI/Solaris/OSF/1: times decreased when the number of threads matched
the number of processors available. The times were the same as with my
previous version, or a couple of percent better, when I exaggerated the
number of threads to create, say, 128 threads on a 2-CPU machine.

3. The load on the machines has decreased considerably with the new
solution. I consider this to be the only positive impact I have seen from
this solution.

The solution is basically designed in the following way:

1. Threads are created and they wait on a condition with pthread_cond_wait
2. The main thread sets up the data (which are global) and then signals
that there is work to be done on the same condition variable. The first
thread to be awakened takes the work; the remaining threads keep waiting.
3. Go to 2 while there is work to distribute

I am now pretty much inclined to believe that it is either a) a hardware
issue (someone mentioned that SPARCs and MIPSes handle things differently)
or b) Linux for some reason just can't give me what IRIX/Solaris can in
this particular case.

Regretfully, the organization I work for prohibits me from releasing the
code I am talking about until the lawyers decide what to do with it. My
hope is to be able to release it for free to anyone interested since this
sequence alignment tool is used a lot :). This kind of defeats the purpose
of my question(s) since without the code it is difficult to talk.

Best regards,
Ognen Duzlevski

On Thu, 14 Jun 2001, Alan Cox wrote:

> > they are done. This should help it (and avoid the pthread_create,
> > pthread_exit). I will implement this and report my results if there is
> > interest.
>
> You should also check up the cache colouring. X86 boxes have relatively poor
> memory performance and most x86 chips have lousy behaviour when data bounces
> between processors or is driven out of cache

-- 
Ognen Duzlevski
Plant Biotechnology Institute
National Research Council of Canada
Bioinformatics team



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question (results after thread pooling)
  2001-06-14 22:42         ` threading question (results after thread pooling) ognen
@ 2001-06-14 23:00           ` Mike Castle
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Castle @ 2001-06-14 23:00 UTC (permalink / raw)
  To: linux-kernel

On Thu, Jun 14, 2001 at 04:42:29PM -0600, ognen@gene.pbi.nrc.ca wrote:
> 2. The main thread sets up the data (which are global) and then signals
> that there is work to be done on the same condition variable. The first
> thread to get awaken takes the work. the remaining threads keep waiting.

For curiosities sake, at what point would this technique result in a
thundering herd issue?  Does it happen near the level at which the number of
schedulable entities equal the number of processors or does it have to be
much greater than that?

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-14 18:28   ` Alan Cox
  2001-06-14 19:01     ` bert hubert
@ 2001-06-14 23:05     ` J . A . Magallon
  2001-06-16 14:16     ` Michael Rothwell
  2 siblings, 0 replies; 37+ messages in thread
From: J . A . Magallon @ 2001-06-14 23:05 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kip Macy, ognen, linux-kernel


On 20010614 Alan Cox wrote:
>
>So you have two choices
>1.	Pthread performance is poorer due to library glue
>2.	Every single signal delivery is 20% slower threaded or otherwise due
>	to all the crap that it adds 
>	And it does damage to other calls too.
>

Pthreads are a standard. You say 'use Linux native calls, they are faster and
make signal management efficient'. But then portability goes to hell. Now
I can run the same code on Linux, Irix and Solaris. Your way, I would
have to write three versions with clone(), sproc() and lwp_xxxx().
Take the example of OpenGL on IRIX boxes. Some time ago it was a wrapper over
IrisGL. Now it is native. If you have a notably poor implementation of
a standard, nobody will use your system.

>In the big picture #1 is definitely preferable. 
>
>There are really only two reasons for threaded programming. 
>
>- Poor programmer skills/language expression of event handling
>
>- OS implementation flaws (and yes the posix/sus unix api has some of these)
>
>Co-routines or better language choices are much more efficient ways to express
>the event handling problem.
>
>fork() is often a better approach than pthreads at least for the design of an
>SMP threaded application because unless you explicitly think about what you
>share you will never get the cache affinity you need for good performance.
>

Joking? That only works if your most complex structure is an array. Try
to take a rendering program with a complex linked list/tree data structure
for the geometry, materials, textures, etc. and
think about cache affinity. You can only think about that locally: mmm, I
need a counter for each thread, I should not put them all in an array because
I will thrash the caches, let's put them in separate variables; I need to
return data to a segment of a big array, let's use a local copy and then pass
it back. But no more. Yes, you can change all your malloc() or new calls to
shm's, but what is the gain? That is the beauty of shared memory boxes.

What Linux needs is a good implementation of POSIX threads. I do not mean
putting pthreads right into the kernel, but perhaps some small change or
addition can make the user space part much, much faster. There are many apps
that can benefit a lot from using threads: they use a big data segment in
read-only mode and just communicate a bit between them (a threaded web server,
a rendering program).

-- 
J.A. Magallon                           #  Let the source be with you...        
mailto:jamagallon@able.es
Linux Mandrake release 8.1 (Cooker) for i586
Linux werewolf 2.4.5-ac13 #1 SMP Sun Jun 10 21:42:28 CEST 2001 i686

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: threading question
  2001-06-14 19:01     ` bert hubert
  2001-06-14 19:22       ` Russell Leighton
@ 2001-06-15 11:29       ` Anil Kumar
  1 sibling, 0 replies; 37+ messages in thread
From: Anil Kumar @ 2001-06-15 11:29 UTC (permalink / raw)
  To: bert hubert, Alan Cox; +Cc: Kip Macy, ognen, linux-kernel

When we use only a small subset of the primitives pthreads provides, we still
carry the burden of maintaining all the other primitives. So I too feel
that when only a small part is used, it is better to implement it in our own
way, as required, for performance reasons.

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of bert hubert
Sent: Friday, June 15, 2001 12:32 AM
To: Alan Cox
Cc: Kip Macy; ognen@gene.pbi.nrc.ca; linux-kernel@vger.kernel.org
Subject: Re: threading question


On Thu, Jun 14, 2001 at 07:28:32PM +0100, Alan Cox wrote:

> There are really only two reasons for threaded programming.
>
> - Poor programmer skills/language expression of event handling

The converse is that pthreads are:

 - Very easy to use from C at a reasonable runtime overhead

It is very convenient for a userspace coder to be able to just start a
function in a different thread. Now it might be so that a kernel is not
there to provide ease of use for userspace coders but it is a factor.

I see lots of people only using:
	pthread_create()/pthread_join()
	mutex_lock/unlock
	sem_post/sem_wait
	no signals

My gut feeling is that you could implement this subset in a way that is both
fast and right - although it would not be 'pthreads compliant'. Can anybody
confirm this feeling?

Regards,

bert

--
http://www.PowerDNS.com      Versatile DNS Services
Trilab                       The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-14 18:28   ` Alan Cox
  2001-06-14 19:01     ` bert hubert
  2001-06-14 23:05     ` J . A . Magallon
@ 2001-06-16 14:16     ` Michael Rothwell
  2001-06-16 15:19       ` Alan Cox
  2 siblings, 1 reply; 37+ messages in thread
From: Michael Rothwell @ 2001-06-16 14:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On 14 Jun 2001 19:28:32 +0100, Alan Cox wrote:

> Co-routines or better language choices are much more efficient ways to express
> the event handling problem.

Can you provide any info and/or examples of co-routines? I'm curious to
see a good example of co-routines' "betterness."

Thanks,

--
Michael Rothwell
rothwell@holly-springs.nc.us



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-16 14:16     ` Michael Rothwell
@ 2001-06-16 15:19       ` Alan Cox
  2001-06-16 18:33         ` Russell Leighton
  2001-06-16 19:06         ` Michael Rothwell
  0 siblings, 2 replies; 37+ messages in thread
From: Alan Cox @ 2001-06-16 15:19 UTC (permalink / raw)
  To: Michael Rothwell; +Cc: Alan Cox, linux-kernel

> Can you provide any info and/or examples of co-routines? I'm curious to
> see a good example of co-routines' "betterness."

With co-routines you don't need

	8K of kernel stack
	Scheduler overhead
	Fancy locking

You don't get the automatic thread switching stuff though.

So you might get code that reads like this (note that aio_ stuff works rather
well combined with co-routines as it fixes a lack of asynchronicity in the
unix disk I/O world)


	select(....)

	if(FD_ISSET(copier_fd))
		run_coroutine(&copier_state);

	...


and the copier might be something like

	while(1)
	{
		// Yes 1 at a time is dumb but this is an example..
		// Yes Im ignoring EOF for this
		if(read(copier_fd, &buf[bufptr], 1)==-1)
		{
			if(errno==EWOULDBLOCK)
			{
				coroutine_return();
				continue;
			}
		}
		if(bufptr==255  || buf[bufptr]=='\n')
		{
			run_coroutine(run_command, buf);
			bufptr=0;
		}
		else
			bufptr++;
	}


It lets you express a state machine as a set of multiple such small state
machines instead.  run_coroutine() will continue a routine where it last
coroutine_return()'d from. Thus in the above case we are expressing "read
bytes until you see a new line" cleanly - not mangled in with keeping state
in global structures but by using natural C local variables and code flow.
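
One possible way to get run_coroutine() and coroutine_return() in user space
is the <ucontext.h> family (getcontext, makecontext, swapcontext) that glibc
provides. The sketch below is an assumption about how such a pair could be
built, not the implementation meant above; coroutine_init, struct coroutine,
the 64K stack size and counter_body are invented for the example, and the
ucontext calls were marked obsolescent in POSIX.1-2008.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define CORO_STACK (64 * 1024)      /* assumed stack size per coroutine */

struct coroutine {
    ucontext_t ctx;                 /* saved state of the coroutine */
    ucontext_t caller;              /* where coroutine_return() goes back to */
    void (*fn)(void);
    char *stack;
    int finished;
};

static struct coroutine *current;   /* coroutine running right now, if any */

static void trampoline(void)
{
    current->fn();
    current->finished = 1;          /* returning from here follows uc_link */
}

void coroutine_init(struct coroutine *co, void (*fn)(void))
{
    co->fn = fn;
    co->finished = 0;
    co->stack = malloc(CORO_STACK);
    assert(co->stack != NULL);
    getcontext(&co->ctx);
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = CORO_STACK;
    co->ctx.uc_link = &co->caller;  /* resume the caller when fn returns */
    makecontext(&co->ctx, trampoline, 0);
}

/* Continue co from where it last called coroutine_return(). */
void run_coroutine(struct coroutine *co)
{
    struct coroutine *prev = current;
    if (co->finished)
        return;
    current = co;
    swapcontext(&co->caller, &co->ctx);
    current = prev;
}

/* Suspend the running coroutine and go back to whoever ran it. */
void coroutine_return(void)
{
    struct coroutine *co = current;
    swapcontext(&co->ctx, &co->caller);
}

int trace[8];
int ntrace;

/* Example body: its loop counter lives in an ordinary local variable. */
void counter_body(void)
{
    for (int i = 1; i <= 3; i++) {
        trace[ntrace++] = i;
        coroutine_return();         /* give up control until resumed */
    }
}
```

Each run_coroutine() call on a coroutine built from counter_body records one
more value in trace[], and the value of i survives between calls without any
global state structure. There is no preemption and no locking here, which
matches the trade-off described above.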

Alan




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-16 15:19       ` Alan Cox
@ 2001-06-16 18:33         ` Russell Leighton
  2001-06-16 19:06         ` Michael Rothwell
  1 sibling, 0 replies; 37+ messages in thread
From: Russell Leighton @ 2001-06-16 18:33 UTC (permalink / raw)
  To: Alan Cox, linux-kernel


Is there a user-space implementation (library?) for coroutines that would work from C?


Alan Cox wrote:

> > Can you provide any info and/or examples of co-routines? I'm curious to
> > see a good example of co-routines' "betterness."
>
> With co-routines you don't need
>
>         8K of kernel stack
>         Scheduler overhead
>         Fancy locking
>
> You don't get the automatic thread switching stuff though.
>
> So you might get code that reads like this (note that aio_ stuff works rather
> well combined with co-routines as it fixes a lack of asynchronicity in the
> unix disk I/O world)
>
>         select(....)
>
>         if(FD_ISSET(copier_fd))
>                 run_coroutine(&copier_state);
>
>         ...
>
> and the copier might be something like
>
>         while(1)
>         {
>                 // Yes 1 at a time is dumb but this is an example..
>                 // Yes Im ignoring EOF for this
>                 if(read(copier_fd, buf[bufptr], 1)==-1)
>                 {
>                         if(errno==-EWOULDBLOCK)
>                         {
>                                 coroutine_return();
>                                 continue;
>                         }
>                 }
>                 if(bufptr==255  || buf[bufptr]=='\n')
>                 {
>                         run_coroutine(run_command, buf);
>                         bufptr=0;
>                 }
>                 else
>                         bufptr++;
>         }
>
> it lets you express a state machine as a set of multiple such small state
> machines instead.  run_coroutine() will continue a routine where it last
> coroutine_return()'d from. Thus in the above case we are expressing read
> bytes until you see a new line cleanly - not mangled in with keeping state
> in global structures but by using natural C local variables and code flow
>
> Alan
>

--
---------------------------------------------------
Russell Leighton    russell.leighton@247media.com
---------------------------------------------------



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: threading question
  2001-06-16 15:19       ` Alan Cox
  2001-06-16 18:33         ` Russell Leighton
@ 2001-06-16 19:06         ` Michael Rothwell
  2001-06-16 21:30           ` Coroutines [was Re: threading question] Russell Leighton
  1 sibling, 1 reply; 37+ messages in thread
From: Michael Rothwell @ 2001-06-16 19:06 UTC (permalink / raw)
  To: Russell Leighton; +Cc: linux-kernel

Try this:

http://lecker.essen.de/~froese/coro/

-M

On 16 Jun 2001 14:33:50 -0400, Russell Leighton wrote:
> 
> Is there a user-space implemenation (library?) for coroutines that would work from C?
> 
> 
> Alan Cox wrote:
> 
> > > Can you provide any info and/or examples of co-routines? I'm curious to
> > > see a good example of co-routines' "betterness."
> >
> > With co-routines you don't need
> >
> >         8K of kernel stack
> >         Scheduler overhead
> >         Fancy locking
> >
> > You don't get the automatic thread switching stuff though.
> >
> > So you might get code that reads like this (note that aio_ stuff works rather
> > well combined with co-routines as it fixes a lack of asynchronicity in the
> > unix disk I/O world)
> >
> >         select(....)
> >
> >         if(FD_ISSET(copier_fd))
> >                 run_coroutine(&copier_state);
> >
> >         ...
> >
> > and the copier might be something like
> >
> >         while(1)
> >         {
> >                 // Yes 1 at a time is dumb but this is an example..
> >                 // Yes Im ignoring EOF for this
> >                 if(read(copier_fd, buf[bufptr], 1)==-1)
> >                 {
> >                         if(errno==-EWOULDBLOCK)
> >                         {
> >                                 coroutine_return();
> >                                 continue;
> >                         }
> >                 }
> >                 if(bufptr==255  || buf[bufptr]=='\n')
> >                 {
> >                         run_coroutine(run_command, buf);
> >                         bufptr=0;
> >                 }
> >                 else
> >                         bufptr++;
> >         }
> >
> > it lets you express a state machine as a set of multiple such small state
> > machines instead.  run_coroutine() will continue a routine where it last
> > coroutine_return()'d from. Thus in the above case we are expressing read
> > bytes until you see a new line cleanly - not mangled in with keeping state
> > in global structures but by using natural C local variables and code flow
> >
> > Alan
> >
> 
> --
> ---------------------------------------------------
> Russell Leighton    russell.leighton@247media.com
> ---------------------------------------------------
> 
> 
--
Michael Rothwell
rothwell@holly-springs.nc.us



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Coroutines [was Re: threading question]
  2001-06-16 19:06         ` Michael Rothwell
@ 2001-06-16 21:30           ` Russell Leighton
  0 siblings, 0 replies; 37+ messages in thread
From: Russell Leighton @ 2001-06-16 21:30 UTC (permalink / raw)
  To: Michael Rothwell; +Cc: linux-kernel



Any chance this, or an equivalent, could become part of glibc?

This seems a very handy abstraction. In many apps, threads
would then really only be needed for true parallelism.
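
For readers wondering what such a library might look like: below is a
minimal sketch, assuming the getcontext()/makecontext()/swapcontext()
family from <ucontext.h>, which glibc provides (these calls were dropped
from POSIX.1-2008, so treat them as a glibc convenience rather than a
portable guarantee). The names run_coroutine() and coroutine_return()
are borrowed from Alan's example quoted below; struct coroutine,
coroutine_init(), the 64K stack size and count_to_three() are
hypothetical and exist only for this illustration. It is not the
implementation behind the coro library linked below.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Hypothetical coroutine handle; not taken from any real library. */
struct coroutine {
    ucontext_t ctx;         /* saved state of the coroutine itself */
    ucontext_t caller;      /* where to go back to when it yields */
    void (*fn)(void *);     /* body of the coroutine */
    void *arg;              /* argument handed to fn */
    int finished;           /* set once fn has returned */
    char stack[64 * 1024];  /* private stack (size is an assumption) */
};

/* Coroutine that is running right now, or NULL in the main flow. */
static struct coroutine *current;

/* makecontext() wants a void(void) entry point, so go through this. */
static void trampoline(void)
{
    struct coroutine *co = current;

    co->fn(co->arg);
    co->finished = 1;
    /* Returning here resumes uc_link, i.e. co->caller. */
}

int coroutine_init(struct coroutine *co, void (*fn)(void *), void *arg)
{
    if (getcontext(&co->ctx) == -1)
        return -1;
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = sizeof co->stack;
    co->ctx.uc_link = &co->caller;
    co->fn = fn;
    co->arg = arg;
    co->finished = 0;
    makecontext(&co->ctx, trampoline, 0);
    return 0;
}

/* Start the coroutine, or continue it after its last coroutine_return(). */
void run_coroutine(struct coroutine *co)
{
    struct coroutine *prev = current;

    if (co->finished)
        return;
    current = co;
    swapcontext(&co->caller, &co->ctx);
    current = prev;
}

/* Called from inside a coroutine: hand control back to run_coroutine(). */
void coroutine_return(void)
{
    struct coroutine *co = current;

    assert(co != NULL);  /* only valid while a coroutine is running */
    swapcontext(&co->ctx, &co->caller);
}

/* Example body: the loop counter lives in an ordinary local variable. */
void count_to_three(void *arg)
{
    int *out = arg;

    for (int i = 1; i <= 3; i++) {
        *out = i;
        coroutine_return();
    }
}
```

Each call to run_coroutine() on a handle set up with
coroutine_init(&co, count_to_three, &n) advances the loop by one step,
and the value of i survives between calls on the coroutine's own stack.
No kernel thread or lock is involved, which is the trade-off Alan
describes: you give up automatic switching and true parallelism.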


Michael Rothwell wrote:

> Try this:
>
> http://lecker.essen.de/~froese/coro/
>
> -M
>
> On 16 Jun 2001 14:33:50 -0400, Russell Leighton wrote:
> >
> > Is there a user-space implementation (library?) for coroutines that would work from C?
> >
> >
> > Alan Cox wrote:
> >
> > > > Can you provide any info and/or examples of co-routines? I'm curious to
> > > > see a good example of co-routines' "betterness."
> > >
> > > With co-routines you don't need
> > >
> > >         8K of kernel stack
> > >         Scheduler overhead
> > >         Fancy locking
> > >
> > > You don't get the automatic thread switching stuff though.
> > >
> > > So you might get code that reads like this (note that aio_ stuff works rather
> > > well combined with co-routines as it fixes a lack of asynchronicity in the
> > > unix disk I/O world)
> > >
> > >         select(....)
> > >
> > >         if(FD_ISSET(copier_fd))
> > >                 run_coroutine(&copier_state);
> > >
> > >         ...
> > >
> > > and the copier might be something like
> > >
> > >         while(1)
> > >         {
> > >                 // Yes 1 at a time is dumb but this is an example..
> > >                 // Yes, I'm ignoring EOF for this
> > >                 if(read(copier_fd, &buf[bufptr], 1)==-1)
> > >                 {
> > >                         if(errno==EWOULDBLOCK)
> > >                         {
> > >                                 coroutine_return();
> > >                                 continue;
> > >                         }
> > >                 }
> > >                 if(bufptr==255  || buf[bufptr]=='\n')
> > >                 {
> > >                         run_coroutine(run_command, buf);
> > >                         bufptr=0;
> > >                 }
> > >                 else
> > >                         bufptr++;
> > >         }
> > >
> > > it lets you express a state machine as a set of multiple such small state
> > > machines instead.  run_coroutine() will continue a routine where it last
> > > coroutine_return()'d from. Thus in the above case we are expressing read
> > > bytes until you see a new line cleanly - not mangled in with keeping state
> > > in global structures but by using natural C local variables and code flow
> > >
> > > Alan
> > >
> >
> > --
> > ---------------------------------------------------
> > Russell Leighton    russell.leighton@247media.com
> > ---------------------------------------------------
> >
> >
> --
> Michael Rothwell
> rothwell@holly-springs.nc.us



--
---------------------------------------------------
Russell Leighton    russell.leighton@247media.com
---------------------------------------------------




* Re: threading question (results after thread pooling)
@ 2001-06-14 23:20 Dieter Nützel
  0 siblings, 0 replies; 37+ messages in thread
From: Dieter Nützel @ 2001-06-14 23:20 UTC (permalink / raw)
  To: Ognen Duzlevski; +Cc: Linux Kernel List

> Hello,
>
> I have implemented thread pooling (with an environment variable
> where I can give the number of threads to be created). Results:
>
> 1. Linux, no change in the times (not under 2.2.x or 2.4)
[snip]
> I am now pretty much inclined to believe that it is either a) a hardware
> issue (someone mentioned that SPARCs and MIPSes handle things differently)
> or b) Linux for some reason just can't give me what IRIX/Solaris can in
> this particular case
[snip]

Hello Ognen,

can you get your hands on a dual AMD Athlon MP 1/1.2 GHz system?
The only mobo currently on the market is the AMD 760MP based Tyan Thunder K7.
It has (all) the good stuff (point-to-point bus, crossbar) which formerly only
the (big) Alphas/SUN/SGI etc. had.

http://www.amd.com/products/cpg/server/athlon/index.html
http://www.tyan.com/products/html/thunderk7.html

Regards,
	Dieter
-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
Cognitive Systems Group
Vogt-Kölln-Straße 30
D-22527 Hamburg, Germany

email: nuetzel@kogs.informatik.uni-hamburg.de
@home: Dieter.Nuetzel@hamburg.de


end of thread, other threads:[~2001-06-16 21:28 UTC | newest]

Thread overview: 37+ messages
2001-06-12 18:24 threading question ognen
2001-06-12 18:39 ` Davide Libenzi
2001-06-12 18:57 ` from dmesg: kernel BUG at inode.c:486 Olivier Sessink
2001-06-12 18:58 ` threading question Christoph Hellwig
2001-06-12 19:07   ` ognen
2001-06-12 19:15     ` Kip Macy
2001-06-12 19:29       ` Christoph Hellwig
2001-06-12 19:15     ` Christoph Hellwig
2001-06-13 12:20     ` Kurt Garloff
2001-06-13 13:35       ` J . A . Magallon
2001-06-13 14:17         ` Philips
2001-06-13 15:06           ` ognen
2001-06-12 21:44   ` Davide Libenzi
2001-06-12 21:48     ` ognen
2001-06-14 18:15       ` Alan Cox
2001-06-14 22:42         ` threading question (results after thread pooling) ognen
2001-06-14 23:00           ` Mike Castle
2001-06-12 21:58     ` threading question Albert D. Cahalan
2001-06-12 23:48       ` J . A . Magallon
2001-06-12 19:06 ` Kip Macy
2001-06-12 19:14   ` Alexander Viro
2001-06-12 19:25     ` Russell Leighton
2001-06-12 23:27       ` Mike Castle
2001-06-13 17:31   ` bert hubert
2001-06-14  6:45     ` Helge Hafting
2001-06-14 18:28   ` Alan Cox
2001-06-14 19:01     ` bert hubert
2001-06-14 19:22       ` Russell Leighton
2001-06-15 11:29       ` Anil Kumar
2001-06-14 23:05     ` J . A . Magallon
2001-06-16 14:16     ` Michael Rothwell
2001-06-16 15:19       ` Alan Cox
2001-06-16 18:33         ` Russell Leighton
2001-06-16 19:06         ` Michael Rothwell
2001-06-16 21:30           ` Coroutines [was Re: threading question] Russell Leighton
2001-06-12 22:41 ` threading question Pavel Machek
2001-06-14 23:20 threading question (results after thread pooling) Dieter Nützel
