linux-kernel.vger.kernel.org archive mirror
* Re: sched_yield() makes OpenLDAP slow
@ 2006-01-14 19:29 Howard Chu
  0 siblings, 0 replies; 49+ messages in thread
From: Howard Chu @ 2006-01-14 19:29 UTC (permalink / raw)
  To: linux-kernel

Resurrecting a dead horse...
> From: Lee Revell
> Date: Sat Aug 20 2005 - 15:57:36 EST
>
> On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> > But I also found that I needed to add a new
> > yield(), to work around yet another unexpected issue on this system -
> > we have a number of threads waiting on a condition variable, and the
> > thread holding the mutex signals the var, unlocks the mutex, and then
> > immediately relocks it. The expectation here is that upon unlocking
> > the mutex, the calling thread would block while some waiting thread
> > (that just got signaled) would get to run. In fact what happened is
> > that the calling thread unlocked and relocked the mutex without
> > allowing any of the waiting threads to run. In this case the only
> > solution was to insert a yield() after the mutex_unlock().
>
> That's exactly the behavior I would expect. Why would you expect
> unlocking a mutex to cause a reschedule, if the calling thread still has
> timeslice left?
>
> Lee

POSIX requires a reschedule to occur, as noted here:
http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/

The relevant SUSv3 text is here
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html

I suppose if pthread_mutex_unlock() actually behaved correctly we could 
remove the other sched_yield() hacks that didn't belong there in the 
first place and go on our merry way.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  8:30           ` Helge Hafting
  2006-01-26  9:01             ` Nick Piggin
@ 2006-01-26 10:50             ` Nikita Danilov
  1 sibling, 0 replies; 49+ messages in thread
From: Nikita Danilov @ 2006-01-26 10:50 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Linux Kernel Mailing List, hancockr

Helge Hafting writes:

[...]

 > 
 > > nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
 > > before any other thread gets around to getting it.
 >
 > Wrong.
 > The spec says that the mutex must be given to a waiter (if any) at the
 > moment of release.  The waiter doesn't have to be scheduled at that
 > point; it may keep sleeping with its freshly unlocked mutex.  So the
 > unlocking thread may continue - but if it tries to reacquire the mutex
 > it will find the mutex taken and go to sleep at that point. Then other

You just described convoy formation: a phenomenon that all reasonable
mutex implementations try to avoid at all costs. If that's what the
standard prescribes, then the standard has to be amended.

 > 
 > Helge Hafting

Nikita.


* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  8:30           ` Helge Hafting
@ 2006-01-26  9:01             ` Nick Piggin
  2006-01-26 10:50             ` Nikita Danilov
  1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2006-01-26  9:01 UTC (permalink / raw)
  To: Helge Hafting; +Cc: davids, Linux Kernel Mailing List, hancockr

Helge Hafting wrote:
> David Schwartz wrote:

>> nothing says that it can't call pthread_mutex_lock and re-acquire the
>> mutex before any other thread gets around to getting it.
> Wrong.
> The spec says that the mutex must be given to a waiter (if any) at the
> moment of release.

Repeating myself here...

To me it says that the scheduling policy decides at the moment of release.
What if the scheduling policy decides *right then* to give the mutex to
the next running thread that tries to acquire it?

That would be the logical way for a scheduling policy to decide the next
owner of the mutex.

-- 
SUSE Labs, Novell Inc.


* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  1:07         ` David Schwartz
@ 2006-01-26  8:30           ` Helge Hafting
  2006-01-26  9:01             ` Nick Piggin
  2006-01-26 10:50             ` Nikita Danilov
  0 siblings, 2 replies; 49+ messages in thread
From: Helge Hafting @ 2006-01-26  8:30 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List, hancockr

David Schwartz wrote:

>> Robert Hancock wrote:
>>
>> But the thread that released the
>> mutex is not one of the waiting threads, and is not eligible for
>> consideration.
>
> Where are you getting this from? Nothing requires the scheduler to schedule
> any threads when the mutex is released.
>
Correct.

> All that must happen is that the mutex must be unlocked. The scheduler is
> permitted to allow any thread it wants to run at that point, or no thread.
> Nothing says the thread that released the mutex can't continue running and
>
Correct. The releasing thread may keep running.

> nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
> before any other thread gets around to getting it.
>
Wrong.
The spec says that the mutex must be given to a waiter (if any) at the
moment of release.  The waiter doesn't have to be scheduled at that
point; it may keep sleeping with its freshly unlocked mutex.  So the
unlocking thread may continue - but if it tries to reacquire the mutex
it will find the mutex taken and go to sleep at that point. Then other
processes will schedule, and at some time the one now owning the mutex
will wake up and do its work.

>	In general, it is very bad karma for the scheduler to stop a thread before
>its timeslice is up if it doesn't have to. Consider one CPU and two threads,
>each needing to do 100 quick lock/unlock cycles. Why force 200 context
>switches?
>
Good point, except it is a strange program that does this.  Locking the
mutex once, doing 100 operations, then unlocking is the better way. :-)

Helge Hafting


* RE: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
  2006-01-25 14:38         ` Robert Hancock
  2006-01-25 17:49         ` Christopher Friesen
@ 2006-01-26  1:07         ` David Schwartz
  2006-01-26  8:30           ` Helge Hafting
  2 siblings, 1 reply; 49+ messages in thread
From: David Schwartz @ 2006-01-26  1:07 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: hancockr


> Robert Hancock wrote:

>  > "If there are threads blocked on the mutex object referenced by mutex
>  > when pthread_mutex_unlock() is called, resulting in the mutex becoming
>  > available, the scheduling policy shall determine which thread shall
>  > acquire the mutex."
>  >
>  > This says nothing about requiring a reschedule. The "scheduling policy"
>  > can well decide that the thread which just released the mutex can
>  > re-acquire it.

> No, because the thread that just released the mutex is obviously not one
> of  the threads blocked on the mutex.

	So what?

> When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that.

	This is false and is nowhere found in the standard.

> But the thread that released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

	Where are you getting this from? Nothing requires the scheduler to schedule
any threads when the mutex is released.

	All that must happen is that the mutex must be unlocked. The scheduler is
permitted to allow any thread it wants to run at that point, or no thread.
Nothing says the thread that released the mutex can't continue running and
nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
before any other thread gets around to getting it.

	In general, it is very bad karma for the scheduler to stop a thread before
its timeslice is up if it doesn't have to. Consider one CPU and two threads,
each needing to do 100 quick lock/unlock cycles. Why force 200 context
switches?

	DS




* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
  2006-01-25 14:38         ` Robert Hancock
@ 2006-01-25 17:49         ` Christopher Friesen
  2006-01-26  1:07         ` David Schwartz
  2 siblings, 0 replies; 49+ messages in thread
From: Christopher Friesen @ 2006-01-25 17:49 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> 
> Robert Hancock wrote:

>  > This says nothing about requiring a reschedule. The "scheduling policy"
>  > can well decide that the thread which just released the mutex can
>  > re-acquire it.
> 
> No, because the thread that just released the mutex is obviously not one
> of the threads blocked on the mutex. When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that. But the thread that released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

Is it *required* that the new owner of the mutex is determined at the 
time of mutex release?

If the kernel doesn't actually determine the new owner of the mutex
until the currently running thread swaps out, it would be possible for
the currently running thread to re-acquire the mutex.

Chris


* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
@ 2006-01-25 14:38         ` Robert Hancock
  2006-01-25 17:49         ` Christopher Friesen
  2006-01-26  1:07         ` David Schwartz
  2 siblings, 0 replies; 49+ messages in thread
From: Robert Hancock @ 2006-01-25 14:38 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu wrote:
No, because the thread that just released the mutex is obviously not one
of the threads blocked on the mutex. When a mutex is unlocked, one of
the *waiting* threads at the time of the unlock must acquire it, and the
scheduling policy can determine that. But the thread that released the
mutex is not one of the waiting threads, and is not eligible for
consideration.

That statement does not imply that any reschedule needs to happen at the 
time of the mutex unlock at all, only that the other threads waiting on 
the mutex can attempt to reacquire it when the scheduler allows them to.
In all likelihood, what tends to happen is that either the thread that
had the mutex previously still has time left in its timeslice and is 
allowed to keep running and reacquire the mutex, or another thread is 
woken up (perhaps on another CPU) but doesn't reacquire the mutex before 
the original thread carries on and acquires it, and therefore goes back 
to sleep.

Forcing the mutex to ping-pong between different threads would be quite 
inefficient (especially on SMP machines), and is not something that 
POSIX requires.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 12:11     ` Olaf Kirch
@ 2006-01-25 13:51       ` Howard Chu
  2006-01-25 14:38         ` Robert Hancock
                           ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Howard Chu @ 2006-01-25 13:51 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: hancockr


Robert Hancock wrote:
 > Howard Chu wrote:
 > > POSIX requires a reschedule to occur, as noted here:
 > > http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/
 >
 > No, it doesn't:
 >
 > >
 > > The relevant SUSv3 text is here
 > > http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html
 >
 > "If there are threads blocked on the mutex object referenced by mutex
 > when pthread_mutex_unlock() is called, resulting in the mutex becoming
 > available, the scheduling policy shall determine which thread shall
 > acquire the mutex."
 >
 > This says nothing about requiring a reschedule. The "scheduling policy"
 > can well decide that the thread which just released the mutex can
 > re-acquire it.

No, because the thread that just released the mutex is obviously not one
of the threads blocked on the mutex. When a mutex is unlocked, one of
the *waiting* threads at the time of the unlock must acquire it, and the
scheduling policy can determine that. But the thread that released the
mutex is not one of the waiting threads, and is not eligible for
consideration.

 > > I suppose if pthread_mutex_unlock() actually behaved correctly we could
 > > remove the other sched_yield() hacks that didn't belong there in the
 > > first place and go on our merry way.
 >
 > Generally, needing to implement hacks like this is a sign that there are
 > problems with the synchronization design of the code (like a mutex which
 > has excessive contention). Programs should not rely on the scheduling
 > behavior of the kernel for proper operation when that behavior is not
 > defined.
 >
 > --
 > Robert Hancock      Saskatoon, SK, Canada
 > To email, remove "nospam" from hancockr@nospamshaw.ca
 > Home Page: http://www.roberthancock.com/

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: sched_yield() makes OpenLDAP slow
       [not found] <5uZqb-4fo-15@gated-at.bofh.it>
@ 2006-01-14 22:47 ` Robert Hancock
  0 siblings, 0 replies; 49+ messages in thread
From: Robert Hancock @ 2006-01-14 22:47 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> POSIX requires a reschedule to occur, as noted here:
> http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/

No, it doesn't:

> 
> The relevant SUSv3 text is here
> http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html 

"If there are threads blocked on the mutex object referenced by mutex 
when pthread_mutex_unlock() is called, resulting in the mutex becoming 
available, the scheduling policy shall determine which thread shall 
acquire the mutex."

This says nothing about requiring a reschedule. The "scheduling policy" 
can well decide that the thread which just released the mutex can 
re-acquire it.

> I suppose if pthread_mutex_unlock() actually behaved correctly we could 
> remove the other sched_yield() hacks that didn't belong there in the 
> first place and go on our merry way.

Generally, needing to implement hacks like this is a sign that there are 
problems with the synchronization design of the code (like a mutex which 
has excessive contention). Programs should not rely on the scheduling 
behavior of the kernel for proper operation when that behavior is not 
defined.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: sched_yield() makes OpenLDAP slow
  2005-08-23 12:07               ` Denis Vlasenko
@ 2005-08-24  3:37                 ` Lincoln Dale
  0 siblings, 0 replies; 49+ messages in thread
From: Lincoln Dale @ 2005-08-24  3:37 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-os (Dick Johnson), Robert Hancock, linux-kernel

Denis Vlasenko wrote:

>>>This is what I would expect if run on an otherwise idle machine.
>>>sched_yield just puts you at the back of the line for runnable
>>>processes, it doesn't magically cause you to go to sleep somehow.
>>>      
>>>
>>When a kernel build is occurring??? Plus `top` itself.... It damn
>>well sleep while giving up the CPU. If it doesn't it's broken.
>>    
>>
Unless you have all of the kernel source in the buffer cache, a
concurrent kernel build will spend a fair bit of time in io_wait state;
as such it's perfectly plausible that sched_yield keeps popping back to
the top of 'runnable' processes.


cheers,

lincoln.


* Re: sched_yield() makes OpenLDAP slow
  2005-08-23 11:17             ` linux-os (Dick Johnson)
@ 2005-08-23 12:07               ` Denis Vlasenko
  2005-08-24  3:37                 ` Lincoln Dale
  0 siblings, 1 reply; 49+ messages in thread
From: Denis Vlasenko @ 2005-08-23 12:07 UTC (permalink / raw)
  To: linux-os (Dick Johnson), Robert Hancock; +Cc: linux-kernel

On Tuesday 23 August 2005 14:17, linux-os (Dick Johnson) wrote:
> 
> On Mon, 22 Aug 2005, Robert Hancock wrote:
> 
> > linux-os (Dick Johnson) wrote:
> >> I reported that sched_yield() wasn't working (at least as expected)
> >> back in March of 2004.
> >>
> >>  		for(;;)
> >>                      sched_yield();
> >>
> >> ... takes 100% CPU time as reported by `top`. It should take
> >> practically 0. Somebody said that this was because `top` was
> >> broken, others said that it was because I didn't know how to
> >> code. Nevertheless, the problem was not fixed, even after
> >> scheduler changes were made for the current version.
> >
> > This is what I would expect if run on an otherwise idle machine.
> > sched_yield just puts you at the back of the line for runnable
> > processes, it doesn't magically cause you to go to sleep somehow.
> >
> 
When a kernel build is occurring??? Plus `top` itself.... It damn
well should sleep while giving up the CPU. If it doesn't it's broken.

top doesn't run all the time:

# strace -o top.strace -tt top

14:52:19.407958 write(1, "  758 root      16   0   104   2"..., 79) = 79
14:52:19.408318 write(1, "  759 root      16   0   100   1"..., 79) = 79
14:52:19.408659 write(1, "  760 root      16   0   100   1"..., 79) = 79
14:52:19.409001 write(1, "  761 root      18   0  2604  39"..., 74) = 74
14:52:19.409342 write(1, "  763 daemon    17   0   108   1"..., 78) = 78
14:52:19.409672 write(1, "  773 root      16   0   104   2"..., 79) = 79
14:52:19.410010 write(1, "  774 root      16   0   104   2"..., 79) = 79
14:52:19.410362 write(1, "  775 root      16   0   100   1"..., 79) = 79
14:52:19.410692 write(1, "  776 root      16   0   104   2"..., 79) = 79
14:52:19.411136 write(1, "  777 daemon    17   0   108   1"..., 86) = 86
14:52:19.411505 select(1, [0], NULL, NULL, {5, 0}) = 0 (Timeout)
	hrrr..... psssss.......
14:52:24.411744 time([1124797944])      = 1124797944
14:52:24.411883 lseek(4, 0, SEEK_SET)   = 0
14:52:24.411957 read(4, "24822.01 18801.28\n", 1023) = 18
14:52:24.412082 access("/var/run/utmpx", F_OK) = -1 ENOENT (No such file or directory)
14:52:24.412224 open("/var/run/utmp", O_RDWR) = 8
14:52:24.412328 fcntl64(8, F_GETFD)     = 0
14:52:24.412399 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
14:52:24.412467 _llseek(8, 0, [0], SEEK_SET) = 0
14:52:24.412556 alarm(0)                = 0
14:52:24.412643 rt_sigaction(SIGALRM, {0x4015a57c, [], SA_RESTORER, 0x40094ae8}, {SIG_DFL}, 8) = 0
14:52:24.412747 alarm(1)                = 0

However, kernel compile shouldn't.

I suggest stracing with -tt "for(;;) yield();" test proggy with and without
kernel compile in parallel, and comparing the output...

Hmm... actually, knowing that you will argue to death instead...

# cat t.c
#include <sched.h>

int main() {
    for(;;) sched_yield();
    return 0;
}
# gcc t.c
# strace -tt ./a.out
...
15:03:41.211324 sched_yield()           = 0
15:03:41.211673 sched_yield()           = 0
15:03:41.212034 sched_yield()           = 0
15:03:41.212400 sched_yield()           = 0
15:03:41.212749 sched_yield()           = 0
15:03:41.213126 sched_yield()           = 0
15:03:41.213486 sched_yield()           = 0
15:03:41.213835 sched_yield()           = 0
15:03:41.214220 sched_yield()           = 0
15:03:41.214577 sched_yield()           = 0
15:03:41.214939 sched_yield()           = 0
    I start "while true; do true; done" on another console...
15:03:43.314645 sched_yield()           = 0
15:03:43.847644 sched_yield()           = 0
15:03:43.954635 sched_yield()           = 0
15:03:44.063798 sched_yield()           = 0
15:03:44.171596 sched_yield()           = 0
15:03:44.282624 sched_yield()           = 0
15:03:44.391632 sched_yield()           = 0
15:03:44.498609 sched_yield()           = 0
15:03:44.605584 sched_yield()           = 0
15:03:44.712538 sched_yield()           = 0
15:03:44.819557 sched_yield()           = 0
15:03:44.928594 sched_yield()           = 0
15:03:45.040603 sched_yield()           = 0
15:03:45.148545 sched_yield()           = 0
15:03:45.259311 sched_yield()           = 0
15:03:45.368563 sched_yield()           = 0
15:03:45.476482 sched_yield()           = 0
15:03:45.583568 sched_yield()           = 0
15:03:45.690491 sched_yield()           = 0
15:03:45.797512 sched_yield()           = 0
15:03:45.906534 sched_yield()           = 0
15:03:46.013545 sched_yield()           = 0
15:03:46.120505 sched_yield()           = 0
Ctrl-C

# uname -a
Linux firebird 2.6.12-r4 #1 SMP Sun Jul 17 13:51:47 EEST 2005 i686 unknown unknown GNU/Linux
--
vda



* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 14:26           ` Robert Hancock
@ 2005-08-23 11:17             ` linux-os (Dick Johnson)
  2005-08-23 12:07               ` Denis Vlasenko
  0 siblings, 1 reply; 49+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-23 11:17 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel


On Mon, 22 Aug 2005, Robert Hancock wrote:

> linux-os (Dick Johnson) wrote:
>> I reported that sched_yield() wasn't working (at least as expected)
>> back in March of 2004.
>>
>>  		for(;;)
>>                      sched_yield();
>>
>> ... takes 100% CPU time as reported by `top`. It should take
>> practically 0. Somebody said that this was because `top` was
>> broken, others said that it was because I didn't know how to
>> code. Nevertheless, the problem was not fixed, even after
>> scheduler changes were made for the current version.
>
> This is what I would expect if run on an otherwise idle machine.
> sched_yield just puts you at the back of the line for runnable
> processes, it doesn't magically cause you to go to sleep somehow.
>

When a kernel build is occurring??? Plus `top` itself.... It damn
well should sleep while giving up the CPU. If it doesn't it's broken.

> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from hancockr@nospamshaw.ca
> Home Page: http://www.roberthancock.com/
>


Cheers,
Dick Johnson
Penguin : Linux version 2.6.12.5 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.
.
I apologize for the following. I tried to kill it with the above dot :



* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 13:20           ` Florian Weimer
@ 2005-08-22 23:19             ` Howard Chu
  0 siblings, 0 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-22 23:19 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andi Kleen, linux-kernel

Florian Weimer wrote:
>  * Howard Chu:
> > That's not the complete story. BerkeleyDB provides a
> > db_env_set_func_yield() hook to tell it what yield function it
> > should use when its internal locking routines need such a function.
> > If you don't set a specific hook, it just uses sleep(). The
> > OpenLDAP backend will invoke this hook during some (not necessarily
> > all) init sequences, to tell it to use the thread yield function
> > that we selected in autoconf.

>  And this helps to increase performance substantially?

When the caller is a threaded program, yes, there is a substantial
(measurable and noticeable) difference. Given that sleep() blocks the
entire process, the difference is obvious.

> > Note that (on systems that support inter-process mutexes) a
> > BerkeleyDB database environment may be used by multiple processes
> > concurrently.
>
>  Yes, I know this, and I haven't experienced that much trouble with
>  deadlocks.  Maybe the way you structure and access the database
>  environment can be optimized for deadlock avoidance?

Maybe we already did this deadlock analysis and optimization, years ago 
when we first started developing this backend? Do you think everyone 
else in the world is a total fool?

> > As such, the yield function that is provided must work both for
> > threads within a single process (PTHREAD_SCOPE_PROCESS) as well as
> > between processes (PTHREAD_SCOPE_SYSTEM).

>  If I understand you correctly, what you really need is a syscall
>  along the lines "don't run me again until all threads T that share
>  property X have run, where the Ts aren't necessarily in the same
>  process".  The kernel isn't psychic; it can't really know which
>  processes to schedule to satisfy such a requirement.  I don't even
>  think "has joined the Berkeley DB environment" is the desired
>  property, but something like "is part of this cycle in the wait-for
>  graph" or something similar.

You seem to believe we're looking for special treatment for the 
processes we're concerned with, and that's not true. If the system is 
busy with other processes, so be it, the system is busy. If you want 
better performance, you build a dedicated server and don't let anything 
else make the system busy. This is the way mission-critical services are 
delivered, regardless of the service. If you're not running on a 
dedicated system, then your deployment must not be mission critical, and 
so you shouldn't be surprised if a large gcc run slows down some other 
activities in the meantime. If you have a large nice'd job running 
before your normal priority jobs get their timeslice, then you should 
certainly wonder wtf the scheduler is doing, and why your system even 
claims to support nice() when clearly it doesn't mean anything on that 
system.

>  I would have to check the Berkeley DB internals in order to tell what
>  is feasible to implement.  This code shouldn't be on the fast path,
>  so some kernel-based synchronization is probably sufficient.

pthread_cond_wait() probably would be just fine here, but BerkeleyDB 
doesn't work that way.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 13:06           ` Andi Kleen
@ 2005-08-22 18:47             ` Howard Chu
  0 siblings, 0 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-22 18:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Florian Weimer, linux-kernel

Andi Kleen wrote:
> > processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd
> > only needing to yield within a single process is inaccurate; since
> > we allow slapcat to run concurrently with slapd (to allow hot
> > backups) we need BerkeleyDB's locking/yield functions to work in
> > System scope.

>  That's broken by design - it means you can be arbitrarily starved by
>  other processes running in parallel. You are basically assuming your
>  application is the only thing running on the system which is wrong.
>  Also there are enough synchronization primitives that can synchronize
>  multiple processes without making such broken assumptions.

Again, I think you overstate the problem. "Arbitrarily starved by other 
processes" implies that the process scheduler will do a poor job and 
will allow the slapd process to be starved. We do not assume we're the 
only app on the system, we just assume that eventually we will get the 
CPU back. If that's not a valid assumption, then there is something 
wrong with the underlying system environment.

Something you ought to keep in mind - correctness and compliance are 
well and good, but worthless if the end result isn't useful. Windows NT 
has a POSIX-compliant subsystem but it is utterly useless. That's what 
you wind up with when all you do is conform to the letter of the spec.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 11:44         ` linux-os (Dick Johnson)
@ 2005-08-22 14:26           ` Robert Hancock
  2005-08-23 11:17             ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Hancock @ 2005-08-22 14:26 UTC (permalink / raw)
  To: linux-kernel

linux-os (Dick Johnson) wrote:
> I reported that sched_yield() wasn't working (at least as expected)
> back in March of 2004.
> 
>  		for(;;)
>                      sched_yield();
> 
> ... takes 100% CPU time as reported by `top`. It should take
> practically 0. Somebody said that this was because `top` was
> broken, others said that it was because I didn't know how to
> code. Nevertheless, the problem was not fixed, even after
> scheduler changes were made for the current version.

This is what I would expect if run on an otherwise idle machine. 
sched_yield just puts you at the back of the line for runnable 
processes, it doesn't magically cause you to go to sleep somehow.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* Re: sched_yield() makes OpenLDAP slow
  2005-08-22  5:09         ` Howard Chu
  2005-08-22 13:06           ` Andi Kleen
@ 2005-08-22 13:20           ` Florian Weimer
  2005-08-22 23:19             ` Howard Chu
  1 sibling, 1 reply; 49+ messages in thread
From: Florian Weimer @ 2005-08-22 13:20 UTC (permalink / raw)
  To: Howard Chu; +Cc: Andi Kleen, linux-kernel

* Howard Chu:

>>> Has anybody contacted the Sleepycat people with a description of
>>> the problem yet?

>> Berkeley DB does not call sched_yield, but OpenLDAP does in some
>> wrapper code around the Berkeley DB backend.

> That's not the complete story. BerkeleyDB provides a 
> db_env_set_func_yield() hook to tell it what yield function it should 
> use when its internal locking routines need such a function. If you 
> don't set a specific hook, it just uses sleep(). The OpenLDAP backend 
> will invoke this hook during some (not necessarily all) init sequences, 
> to tell it to use the thread yield function that we selected in autoconf.

And this helps to increase performance substantially?

> Note that (on systems that support inter-process mutexes) a BerkeleyDB 
> database environment may be used by multiple processes concurrently.

Yes, I know this, and I haven't experienced that much trouble with
deadlocks.  Maybe the way you structure and access the database
environment can be optimized for deadlock avoidance?

> As such, the yield function that is provided must work both for
> threads within a single process (PTHREAD_SCOPE_PROCESS) as well as
> between processes (PTHREAD_SCOPE_SYSTEM).

If I understand you correctly, what you really need is a syscall along
the lines of "don't run me again until all threads T that share property
X have run, where the Ts aren't necessarily in the same process".  The
kernel isn't psychic; it can't really know which processes to schedule
to satisfy such a requirement.  I don't even think "has joined the
Berkeley DB environment" is the desired property, but something like
"is part of this cycle in the wait-for graph" or something similar.

I would have to check the Berkeley DB internals in order to tell what
is feasible to implement.  This code shouldn't be on the fast path, so
some kernel-based synchronization is probably sufficient.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22  5:09         ` Howard Chu
@ 2005-08-22 13:06           ` Andi Kleen
  2005-08-22 18:47             ` Howard Chu
  2005-08-22 13:20           ` Florian Weimer
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2005-08-22 13:06 UTC (permalink / raw)
  To: Howard Chu; +Cc: Florian Weimer, Andi Kleen, linux-kernel

> processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd only 
> needing to yield within a single process is inaccurate; since we allow 
> slapcat to run concurrently with slapd (to allow hot backups) we need 
> BerkeleyDB's locking/yield functions to work in System scope.

That's broken by design - it means you can be arbitrarily starved 
by other processes running in parallel. You are basically assuming
your application is the only thing running on the system,
which is wrong. Also, there are enough synchronization primitives
that can synchronize multiple processes without making
such broken assumptions.

-Andi


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21  1:04       ` Robert Hancock
@ 2005-08-22 11:44         ` linux-os (Dick Johnson)
  2005-08-22 14:26           ` Robert Hancock
  0 siblings, 1 reply; 49+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-22 11:44 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel


On Sat, 20 Aug 2005, Robert Hancock wrote:

> Howard Chu wrote:
>> I'll note that we removed a number of the yield calls (that were in
>> OpenLDAP 2.2) for the 2.3 release, because I found that they were
>> redundant and causing unnecessary delays. My own test system is running
>> on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro),
>> and OpenLDAP 2.3 runs perfectly well here, now that those redundant
>> calls have been removed. But I also found that I needed to add a new
>> yield(), to work around yet another unexpected issue on this system - we
>> have a number of threads waiting on a condition variable, and the thread
>> holding the mutex signals the var, unlocks the mutex, and then
>> immediately relocks it. The expectation here is that upon unlocking the
>> mutex, the calling thread would block while some waiting thread (that
>> just got signaled) would get to run. In fact what happened is that the
>> calling thread unlocked and relocked the mutex without allowing any of
>> the waiting threads to run. In this case the only solution was to insert
>> a yield() after the mutex_unlock(). So again, for those of you claiming
>> "oh, all you need to do is use a condition variable or any of the other
>> POSIX synchronization primitives" - yes, that's a nice theory, but
>> reality says otherwise.
>
> I encountered a similar issue with some software that I wrote, and used
> a similar workaround, however this was basically because there wasn't
> enough time available at the time to redesign things to work properly.
> The problem here is essentially caused by the fact that the mutex is
> being locked for an excessively large proportion of the time and not
> letting other threads in. In the case I am thinking of, posting the
> messages to the thread that was hogging the mutex via a signaling queue
> would have been a better solution than using yield and having correct
> operation depend on undefined parts of thread scheduling behavior..
>
> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from hancockr@nospamshaw.ca
> Home Page: http://www.roberthancock.com/
>

I reported that sched_yield() wasn't working (at least not as expected)
back in March of 2004.

 		for(;;)
                     sched_yield();

... takes 100% CPU time as reported by `top`. It should take
practically 0. Somebody said that this was because `top` was
broken, others said that it was because I didn't know how to
code. Nevertheless, the problem was not fixed, even after
scheduler changes were made for the current version.

  One can execute:

 		usleep(0);
... instead of:
 		sched_yield();

... and Linux then performs exactly like other Unixes when
code is waiting on mutexes.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.12.5 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21 11:33           ` Nikita Danilov
@ 2005-08-22  8:06             ` Howard Chu
  0 siblings, 0 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-22  8:06 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Nikita Danilov wrote:
> Howard Chu writes:
>
>  > That's beside the point. Folks are making an assertion that
>  > sched_yield() is meaningless; this example demonstrates that there are
>  > cases where sched_yield() is essential.
>
> It is not essential, it is non-portable.
>
> Code you described is based on non-portable "expectations" about thread
> scheduling. Linux implementation of pthreads fails to satisfy
> them. Perfectly reasonable. Code is then "fixed" by adding sched_yield()
> calls and introducing more non-portable assumptions. Again, there is no
> guarantee this would work on any compliant implementation.
>
> While "intuitive" semantics of sched_yield() is to yield CPU and to give
> other runnable threads their chance to run, this is _not_ what standard
> prescribes (for non-RT threads).
>   
Very well; it is not prescribed in the standard and it is non-portable. 
Our code is broken and we will fix it.

But even Dave Butenhof, Mr. Pthreads himself, has said it is reasonable 
to expect sched_yield to yield the CPU. That's what pthread_yield did in 
Pthreads Draft 4 (DCE threads) and it is common knowledge that 
sched_yield is a direct replacement for pthread_yield; i.e., 
pthread_yield() was deleted from the spec because sched_yield fulfilled 
its purpose. Now you're saying "well, technically, sched_yield doesn't 
have to do anything at all" and the letter of the spec supports your 
position, but anybody who's been programming with pthreads since the DCE 
days "knows" that is not the original intention. I wonder why nobody 
has raised this issue with the IEEE/POSIX group and gotten them 
to issue a correction/clarification in all this time, since the absence 
of specification here really impairs the usefulness of the spec.

Likewise the fact that sched_yield() can now cause the current process 
to be queued behind other processes seems suspect, unless we know for 
sure that the threads are running with PTHREAD_SCOPE_SYSTEM. (I haven't 
checked to see if PTHREAD_SCOPE_PROCESS is still supported in NPTL.)

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21 19:47       ` Florian Weimer
@ 2005-08-22  5:09         ` Howard Chu
  2005-08-22 13:06           ` Andi Kleen
  2005-08-22 13:20           ` Florian Weimer
  0 siblings, 2 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-22  5:09 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andi Kleen, linux-kernel

Florian Weimer wrote:
> * Andi Kleen:
>
>   
>> Has anybody contacted the Sleepycat people with a description of the 
>> problem yet?
>>     
>
> Berkeley DB does not call sched_yield, but OpenLDAP does in some
> wrapper code around the Berkeley DB backend.
That's not the complete story. BerkeleyDB provides a 
db_env_set_func_yield() hook to tell it what yield function it should 
use when its internal locking routines need such a function. If you 
don't set a specific hook, it just uses sleep(). The OpenLDAP backend 
will invoke this hook during some (not necessarily all) init sequences, 
to tell it to use the thread yield function that we selected in autoconf.

Note that (on systems that support inter-process mutexes) a BerkeleyDB 
database environment may be used by multiple processes concurrently. As 
such, the yield function that is provided must work both for threads 
within a single process (PTHREAD_SCOPE_PROCESS) as well as between 
processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd only 
needing to yield within a single process is inaccurate; since we allow 
slapcat to run concurrently with slapd (to allow hot backups) we need 
BerkeleyDB's locking/yield functions to work in System scope.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 13:48     ` Andi Kleen
@ 2005-08-21 19:47       ` Florian Weimer
  2005-08-22  5:09         ` Howard Chu
  0 siblings, 1 reply; 49+ messages in thread
From: Florian Weimer @ 2005-08-21 19:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Howard Chu, linux-kernel

* Andi Kleen:

> Has anybody contacted the Sleepycat people with a description of the 
> problem yet?

Berkeley DB does not call sched_yield, but OpenLDAP does in some
wrapper code around the Berkeley DB backend.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 21:24         ` Howard Chu
  2005-08-21  0:36           ` Nick Piggin
@ 2005-08-21 11:33           ` Nikita Danilov
  2005-08-22  8:06             ` Howard Chu
  1 sibling, 1 reply; 49+ messages in thread
From: Nikita Danilov @ 2005-08-21 11:33 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Howard Chu writes:
 > Lee Revell wrote:
 > >  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
 > > > But I also found that I needed to add a new yield(), to work around
 > > > yet another unexpected issue on this system - we have a number of
 > > > threads waiting on a condition variable, and the thread holding the
 > > > mutex signals the var, unlocks the mutex, and then immediately
 > > > relocks it. The expectation here is that upon unlocking the mutex,
 > > > the calling thread would block while some waiting thread (that just
 > > > got signaled) would get to run. In fact what happened is that the
 > > > calling thread unlocked and relocked the mutex without allowing any
 > > > of the waiting threads to run. In this case the only solution was
 > > > to insert a yield() after the mutex_unlock().
 > >
 > >  That's exactly the behavior I would expect.  Why would you expect
 > >  unlocking a mutex to cause a reschedule, if the calling thread still
 > >  has timeslice left?
 >
 > That's beside the point. Folks are making an assertion that
 > sched_yield() is meaningless; this example demonstrates that there are
 > cases where sched_yield() is essential.

It is not essential, it is non-portable.

The code you described is based on non-portable "expectations" about
thread scheduling. The Linux implementation of pthreads fails to satisfy
them. Perfectly reasonable. The code is then "fixed" by adding
sched_yield() calls and introducing more non-portable assumptions.
Again, there is no guarantee this would work on any compliant
implementation.

While the "intuitive" semantics of sched_yield() are to yield the CPU
and give other runnable threads a chance to run, this is _not_ what the
standard prescribes (for non-RT threads).

 >
 > --
 >   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
  2005-08-20 21:50       ` Lee Revell
@ 2005-08-21  1:04       ` Robert Hancock
  2005-08-22 11:44         ` linux-os (Dick Johnson)
  2 siblings, 1 reply; 49+ messages in thread
From: Robert Hancock @ 2005-08-21  1:04 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> I'll note that we removed a number of the yield calls (that were in 
> OpenLDAP 2.2) for the 2.3 release, because I found that they were 
> redundant and causing unnecessary delays. My own test system is running 
> on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro), 
> and OpenLDAP 2.3 runs perfectly well here, now that those redundant 
> calls have been removed. But I also found that I needed to add a new 
> yield(), to work around yet another unexpected issue on this system - we 
> have a number of threads waiting on a condition variable, and the thread 
> holding the mutex signals the var, unlocks the mutex, and then 
> immediately relocks it. The expectation here is that upon unlocking the 
> mutex, the calling thread would block while some waiting thread (that 
> just got signaled) would get to run. In fact what happened is that the 
> calling thread unlocked and relocked the mutex without allowing any of 
> the waiting threads to run. In this case the only solution was to insert 
> a yield() after the mutex_unlock(). So again, for those of you claiming 
> "oh, all you need to do is use a condition variable or any of the other 
> POSIX synchronization primitives" - yes, that's a nice theory, but 
> reality says otherwise.

I encountered a similar issue with some software that I wrote, and used 
a similar workaround; however, this was basically because there wasn't 
enough time available to redesign things to work properly. 
The problem here is essentially caused by the fact that the mutex is 
being locked for an excessively large proportion of the time, not 
letting other threads in. In the case I am thinking of, posting the 
messages to the thread that was hogging the mutex via a signaling queue 
would have been a better solution than using yield and having correct 
operation depend on undefined parts of thread scheduling behavior.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 21:24         ` Howard Chu
@ 2005-08-21  0:36           ` Nick Piggin
  2005-08-21 11:33           ` Nikita Danilov
  1 sibling, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2005-08-21  0:36 UTC (permalink / raw)
  To: Howard Chu; +Cc: Lee Revell, Robert Hancock, linux-kernel

Howard Chu wrote:
> Lee Revell wrote:
> 
>>  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
>> > But I also found that I needed to add a new yield(), to work around
>> > yet another unexpected issue on this system - we have a number of
>> > threads waiting on a condition variable, and the thread holding the
>> > mutex signals the var, unlocks the mutex, and then immediately
>> > relocks it. The expectation here is that upon unlocking the mutex,
>> > the calling thread would block while some waiting thread (that just
>> > got signaled) would get to run. In fact what happened is that the
>> > calling thread unlocked and relocked the mutex without allowing any
>> > of the waiting threads to run. In this case the only solution was
>> > to insert a yield() after the mutex_unlock().
>>
>>  That's exactly the behavior I would expect.  Why would you expect
>>  unlocking a mutex to cause a reschedule, if the calling thread still
>>  has timeslice left?
> 
> 
> That's beside the point. Folks are making an assertion that 
> sched_yield() is meaningless; this example demonstrates that there are 
> cases where sched_yield() is essential.
> 

The point is, with SCHED_OTHER scheduling, sched_yield() need not
do anything. It may not let any other tasks run.

The fact that it does on Linux is because we do attempt to do
something expected... but the simple matter is that you can't rely
on it to do what you expect.

I'm not sure exactly how you would solve the above problem, but I'm
sure it can be achieved using mutexes (for example, you could have
a queue where every thread waits on its own private mutex)... but I
don't do much userspace C programming, sorry.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 19:49       ` Howard Chu
@ 2005-08-20 22:08         ` Nikita Danilov
  0 siblings, 0 replies; 49+ messages in thread
From: Nikita Danilov @ 2005-08-20 22:08 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu writes:
 > Nikita Danilov wrote:
 > > That returns us to the core of the problem: sched_yield() is used to
 > > implement a synchronization primitive and non-portable assumptions are
 > > made about its behavior: SUS defines that after sched_yield() thread
 > > ceases to run on the CPU "until it again becomes the head of its thread
 > > list", and "thread list" discipline is only defined for real-time
 > > scheduling policies. E.g., 
 > >
 > > int sched_yield(void)
 > > {
 > >        return 0;
 > > }
 > >
 > > and
 > >
 > > int sched_yield(void)
 > > {
 > >        sleep(100);
 > >        return 0;
 > > }
 > >
 > > are both valid sched_yield() implementation for non-rt (SCHED_OTHER)
 > > threads.
 > I think you're mistaken:
 > http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0d4eaf3703131e86/da051ebe58976b00#da051ebe58976b00
 > 
 > sched_yield() is required to be supported even if priority scheduling is 
 > not supported, and it is required to cause the calling thread (not 
 > process) to yield the processor.

Of course sched_yield() is required to be supported; the question is for
how long the CPU is yielded. Here is the quote from the SUS (actually
the complete definition of sched_yield()):

    The sched_yield() function shall force the running thread to
    relinquish the processor until it again becomes the head of its
    thread list.

As far as I can see, the SUS doesn't specify how the "thread list" is
maintained for non-RT scheduling policies, and an implementation that
immediately places a SCHED_OTHER thread that called sched_yield() back
at the head of its thread list is perfectly valid. Also valid is an
implementation that waits for 100 seconds and then places the
sched_yield() caller at the head of the list, etc. Basically, while the
semantics of sched_yield() are well defined for the RT scheduling
policies, for the SCHED_OTHER policy the standard leaves them
implementation-defined.

 > 
 > -- 
 >   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
@ 2005-08-20 21:50       ` Lee Revell
  2005-08-21  1:04       ` Robert Hancock
  2 siblings, 0 replies; 49+ messages in thread
From: Lee Revell @ 2005-08-20 21:50 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> Nick Piggin wrote:
> >  Robert Hancock wrote:
> > > I fail to see how sched_yield is going to be very helpful in this
> > > situation. Since that call can sleep from a range of time ranging
> > > from zero to a long time, it's going to give unpredictable results.
> 
> >  Well, not sleep technically, but yield the CPU for some undefined
> >  amount of time.
> 
> Since the slapd server was not written to run in realtime, nor is it 
> commonly run on realtime operating systems, I don't believe predictable 
> timing here is a criteria we care about. One could say the same of 
> sigsuspend() by the way - it can pause a process for a range of time 
> ranging from zero to a long time. Should we tell application writers not 
> to use this function either, regardless of whether the developer thinks 
> they have a good reason to use it?

Of course not.  We should tell them that if they use sigsuspend() they
cannot assume that the process will not wake up immediately.

Lee



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 20:57       ` Lee Revell
@ 2005-08-20 21:24         ` Howard Chu
  2005-08-21  0:36           ` Nick Piggin
  2005-08-21 11:33           ` Nikita Danilov
  0 siblings, 2 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-20 21:24 UTC (permalink / raw)
  To: Lee Revell; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Lee Revell wrote:
>  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> > But I also found that I needed to add a new yield(), to work around
> > yet another unexpected issue on this system - we have a number of
> > threads waiting on a condition variable, and the thread holding the
> > mutex signals the var, unlocks the mutex, and then immediately
> > relocks it. The expectation here is that upon unlocking the mutex,
> > the calling thread would block while some waiting thread (that just
> > got signaled) would get to run. In fact what happened is that the
> > calling thread unlocked and relocked the mutex without allowing any
> > of the waiting threads to run. In this case the only solution was
> > to insert a yield() after the mutex_unlock().
>
>  That's exactly the behavior I would expect.  Why would you expect
>  unlocking a mutex to cause a reschedule, if the calling thread still
>  has timeslice left?

That's beside the point. Folks are making an assertion that 
sched_yield() is meaningless; this example demonstrates that there are 
cases where sched_yield() is essential.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
@ 2005-08-20 20:57       ` Lee Revell
  2005-08-20 21:24         ` Howard Chu
  2005-08-20 21:50       ` Lee Revell
  2005-08-21  1:04       ` Robert Hancock
  2 siblings, 1 reply; 49+ messages in thread
From: Lee Revell @ 2005-08-20 20:57 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> But I also found that I needed to add a new 
> yield(), to work around yet another unexpected issue on this system -
> we have a number of threads waiting on a condition variable, and the
> thread holding the mutex signals the var, unlocks the mutex, and then 
> immediately relocks it. The expectation here is that upon unlocking
> the mutex, the calling thread would block while some waiting thread
> (that just got signaled) would get to run. In fact what happened is
> that the calling thread unlocked and relocked the mutex without
> allowing any of the waiting threads to run. In this case the only
> solution was to insert a yield() after the mutex_unlock(). 

That's exactly the behavior I would expect.  Why would you expect
unlocking a mutex to cause a reschedule, if the calling thread still has
timeslice left?

Lee


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 13:23     ` Nikita Danilov
@ 2005-08-20 19:49       ` Howard Chu
  2005-08-20 22:08         ` Nikita Danilov
  0 siblings, 1 reply; 49+ messages in thread
From: Howard Chu @ 2005-08-20 19:49 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov wrote:
> That returns us to the core of the problem: sched_yield() is used to
> implement a synchronization primitive and non-portable assumptions are
> made about its behavior: SUS defines that after sched_yield() thread
> ceases to run on the CPU "until it again becomes the head of its thread
> list", and "thread list" discipline is only defined for real-time
> scheduling policies. E.g., 
>
> int sched_yield(void)
> {
>        return 0;
> }
>
> and
>
> int sched_yield(void)
> {
>        sleep(100);
>        return 0;
> }
>
> are both valid sched_yield() implementation for non-rt (SCHED_OTHER)
> threads.
I think you're mistaken:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0d4eaf3703131e86/da051ebe58976b00#da051ebe58976b00

sched_yield() is required to be supported even if priority scheduling is 
not supported, and it is required to cause the calling thread (not 
process) to yield the processor.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20  4:18   ` Nick Piggin
@ 2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-20 18:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Robert Hancock, linux-kernel

Nick Piggin wrote:
>  Robert Hancock wrote:
> > I fail to see how sched_yield is going to be very helpful in this
> > situation. Since that call can sleep from a range of time ranging
> > from zero to a long time, it's going to give unpredictable results.

>  Well, not sleep technically, but yield the CPU for some undefined
>  amount of time.

Since the slapd server was not written to run in realtime, nor is it 
commonly run on realtime operating systems, I don't believe predictable 
timing here is a criterion we care about. One could say the same of 
sigsuspend(), by the way - it can pause a process for a range of time 
ranging from zero to a long time. Should we tell application writers not 
to use this function either, regardless of whether the developer thinks 
they have a good reason to use it?

> > It seems to me that this sort of thing is why we have POSIX pthread
> > synchronization primitives.. sched_yield is basically there for a
> > process to indicate that "what I'm doing doesn't matter much, let
> > other stuff run". Any other use of it generally constitutes some
> > kind of hack.

In terms of transaction recovery, we do an exponential backoff on the 
retries, because our benchmarks showed that under heavy lock contention, 
immediate retries only made things worse. In fact, having arbitrarily 
long backoff delays here was shown to improve transaction throughput. 
(We use select() with an increasing timeval in combination with the 
yield() call. One way or another we get a longer delay as desired.)

sched_yield is there for a *thread* to indicate "what I'm doing doesn't 
matter much, let other stuff run."

I suppose it may be a hack. But then so is TCP congestion control. In 
both cases, empirical evidence indicates the hack is worthwhile. If you 
haven't done the analysis then you're in no position to deny the value 
of the approach.

>  In SCHED_OTHER mode, you're right, sched_yield is basically
>  meaningless.

>  In a realtime system, there is a very well defined and probably
>  useful behaviour.

>  Eg. If 2 SCHED_FIFO processes are running at the same priority, One
>  can call sched_yield to deterministically give the CPU to the other
>  guy.

Well yes, the point of a realtime system is to provide deterministic 
response times to unpredictable input.

I'll note that we removed a number of the yield calls (that were in 
OpenLDAP 2.2) for the 2.3 release, because I found that they were 
redundant and causing unnecessary delays. My own test system is running 
on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro), 
and OpenLDAP 2.3 runs perfectly well here, now that those redundant 
calls have been removed. But I also found that I needed to add a new 
yield(), to work around yet another unexpected issue on this system - we 
have a number of threads waiting on a condition variable, and the thread 
holding the mutex signals the var, unlocks the mutex, and then 
immediately relocks it. The expectation here is that upon unlocking the 
mutex, the calling thread would block while some waiting thread (that 
just got signaled) would get to run. In fact what happened is that the 
calling thread unlocked and relocked the mutex without allowing any of 
the waiting threads to run. In this case the only solution was to insert 
a yield() after the mutex_unlock(). So again, for those of you claiming 
"oh, all you need to do is use a condition variable or any of the other 
POSIX synchronization primitives" - yes, that's a nice theory, but 
reality says otherwise.

To say that sched_yield is basically meaningless is far overstating your 
point.
-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found]   ` <430666DB.70802@symas.com.suse.lists.linux.kernel>
@ 2005-08-20 13:48     ` Andi Kleen
  2005-08-21 19:47       ` Florian Weimer
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2005-08-20 13:48 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

Howard Chu <hyc@symas.com> writes:

> In this specific example, we use whatever
> BerkeleyDB provides and we're certainly not about to write our own
> transactional embedded database engine just for this.

BerkeleyDB is free software after all that comes with source code. 
Surely it can be fixed without rewriting it from scratch.

Has anybody contacted the Sleepycat people with a description of the 
problem yet?

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19 23:10   ` Howard Chu
@ 2005-08-20 13:23     ` Nikita Danilov
  2005-08-20 19:49       ` Howard Chu
  0 siblings, 1 reply; 49+ messages in thread
From: Nikita Danilov @ 2005-08-20 13:23 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu writes:
 > Nikita Danilov wrote:

[...]

 > 
 > >  What prevents transaction monitor from using, say, condition
 > >  variables to "yield cpu"? That would have an additional advantage of
 > >  blocking thread precisely until specific event occurs, instead of
 > >  blocking for some vague indeterminate load and platform dependent
 > >  amount of time.
 > 
 > Condition variables offer no control over which thread is woken up. 

When only one thread waits on a condition variable, which is exactly the
scenario involved (sorry if I wasn't clear enough), signalling the
condition provides precise control over which thread is woken up.

 > We're wandering into the design of the SleepyCat BerkeleyDB library 
 > here, and we don't exert any control over that either. BerkeleyDB 
 > doesn't appear to use pthread condition variables; it seems to construct 
 > its own synchronization mechanisms on top of mutexes (and yield calls). 

That returns us to the core of the problem: sched_yield() is used to
implement a synchronization primitive, and non-portable assumptions are
made about its behavior. SUS defines that after sched_yield() the thread
ceases to run on the CPU "until it again becomes the head of its thread
list", and the "thread list" discipline is only defined for real-time
scheduling policies. E.g., 

int sched_yield(void)
{
       return 0;
}

and

int sched_yield(void)
{
       sleep(100);
       return 0;
}

are both valid sched_yield() implementations for non-rt (SCHED_OTHER)
threads.

Nikita.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20  3:20 ` Robert Hancock
@ 2005-08-20  4:18   ` Nick Piggin
  2005-08-20 18:38     ` Howard Chu
  0 siblings, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2005-08-20  4:18 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel

Robert Hancock wrote:

> 
> I fail to see how sched_yield is going to be very helpful in this 
> situation. Since that call can sleep for anything from zero to a long 
> time, it's going to give unpredictable results.
> 

Well, not sleep technically, but yield the CPU for some undefined
amount of time.

> It seems to me that this sort of thing is why we have POSIX pthread 
> synchronization primitives.. sched_yield is basically there for a 
> process to indicate that "what I'm doing doesn't matter much, let other 
> stuff run". Any other use of it generally constitutes some kind of hack.
> 

In SCHED_OTHER mode, you're right, sched_yield is basically meaningless.

In a realtime system, there is a very well defined and probably useful
behaviour.

E.g. if two SCHED_FIFO processes are running at the same priority, one can
call sched_yield() to deterministically give the CPU to the other one.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found] <4D8eT-4rg-31@gated-at.bofh.it>
@ 2005-08-20  3:20 ` Robert Hancock
  2005-08-20  4:18   ` Nick Piggin
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Hancock @ 2005-08-20  3:20 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> You assume that spinlocks are the only reason a developer may want to 
> yield the processor. This assumption is unfounded. Case in point - the 
> primary backend in OpenLDAP uses a transactional database with 
> page-level locking of its data structures to provide high levels of 
> concurrency. It is the nature of such a system to encounter deadlocks 
> over the normal course of operations. When a deadlock is detected, some 
> thread must be chosen (by one of a variety of algorithms) to abort its 
> transaction, in order to allow other operations to proceed to 
> completion. In this situation, the chosen thread must get control of the 
> CPU long enough to clean itself up, and then it must yield the CPU in 
> order to allow any other competing threads to complete their 
> transaction. The thread with the aborted transaction relinquishes all of 
> its locks and then waits to get another shot at the CPU to try 
> everything over again. Again, this is all fundamental to the nature of 
> transactional programming. If the 2.6 kernel makes this programming 
> model unreasonably slow, then quite simply this kernel is not viable as 
> a database platform.

I fail to see how sched_yield is going to be very helpful in this 
situation. Since that call can sleep for anything from zero to a long 
time, it's going to give unpredictable results.

It seems to me that this sort of thing is why we have POSIX pthread 
synchronization primitives.. sched_yield is basically there for a 
process to indicate that "what I'm doing doesn't matter much, let other 
stuff run". Any other use of it generally constitutes some kind of hack.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19 10:21 ` Nikita Danilov
@ 2005-08-19 23:10   ` Howard Chu
  2005-08-20 13:23     ` Nikita Danilov
  0 siblings, 1 reply; 49+ messages in thread
From: Howard Chu @ 2005-08-19 23:10 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov wrote:
>  Howard Chu <hyc@symas.com> writes:
> > concurrency. It is the nature of such a system to encounter
> > deadlocks over the normal course of operations. When a deadlock is
> > detected, some thread must be chosen (by one of a variety of
> > algorithms) to abort its transaction, in order to allow other
> > operations to proceed to completion. In this situation, the chosen
> > thread must get control of the CPU long enough to clean itself up,

>  What prevents transaction monitor from using, say, condition
>  variables to "yield cpu"? That would have an additional advantage of
>  blocking thread precisely until specific event occurs, instead of
>  blocking for some vague indeterminate load and platform dependent
>  amount of time.

Condition variables offer no control over which thread is woken up. 
We're wandering into the design of the SleepyCat BerkeleyDB library 
here, and we don't exert any control over that either. BerkeleyDB 
doesn't appear to use pthread condition variables; it seems to construct 
its own synchronization mechanisms on top of mutexes (and yield calls). 
In this specific example, we use whatever BerkeleyDB provides and we're 
certainly not about to write our own transactional embedded database 
engine just for this.
-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:59 ` Chris Wedgwood
@ 2005-08-19 22:45   ` Howard Chu
  0 siblings, 0 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-19 22:45 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

Chris Wedgwood wrote:
>  On Thu, Aug 18, 2005 at 11:03:45PM -0700, Howard Chu wrote:
> > If the 2.6 kernel makes this programming model unreasonably slow,
> > then quite simply this kernel is not viable as a database platform.

>  Pretty much everyone else manages to make it work.

And this contributes to the discussion how? Pretty much every other 
Unix-ish operating system manages to make scheduling with nice'd 
processes work. If you really want to get into what "everyone else 
manages to make work" you're in for a rough ride.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
  2005-08-19  6:59 ` Chris Wedgwood
@ 2005-08-19 10:21 ` Nikita Danilov
  2005-08-19 23:10   ` Howard Chu
  2 siblings, 1 reply; 49+ messages in thread
From: Nikita Danilov @ 2005-08-19 10:21 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu <hyc@symas.com> writes:

[...]

> concurrency. It is the nature of such a system to encounter deadlocks
> over the normal course of operations. When a deadlock is detected, some
> thread must be chosen (by one of a variety of algorithms) to abort its
> transaction, in order to allow other operations to proceed to
> completion. In this situation, the chosen thread must get control of the
> CPU long enough to clean itself up,

What prevents the transaction monitor from using, say, condition variables
to "yield cpu"? That would have the additional advantage of blocking the
thread precisely until a specific event occurs, instead of blocking for
some vague, indeterminate, load- and platform-dependent amount of time.

>                                     and then it must yield the CPU in
> order to allow any other competing threads to complete their
> transaction.

Again, this sounds like something doable with standard POSIX
synchronization primitives.

>
> -- 
>   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
@ 2005-08-19  6:59 ` Chris Wedgwood
  2005-08-19 22:45   ` Howard Chu
  2005-08-19 10:21 ` Nikita Danilov
  2 siblings, 1 reply; 49+ messages in thread
From: Chris Wedgwood @ 2005-08-19  6:59 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

On Thu, Aug 18, 2005 at 11:03:45PM -0700, Howard Chu wrote:

> If the 2.6 kernel makes this programming model unreasonably slow,
> then quite simply this kernel is not viable as a database platform.

Pretty much everyone else manages to make it work.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
@ 2005-08-19  6:34 ` Nick Piggin
  2005-08-19  6:59 ` Chris Wedgwood
  2005-08-19 10:21 ` Nikita Danilov
  2 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2005-08-19  6:34 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

Hi Howard,

Thanks for joining the discussion. One request, if I may,
can you retain the CC list on posts please?

Howard Chu wrote:
 >
>>  AFAIKS, sched_yield should only really be used by realtime
>>  applications that know exactly what they're doing.
> 
> 
> pthread_yield() was deleted from the POSIX threads drafts years ago. 
> sched_yield() is the officially supported API, and OpenLDAP is using it 
> for the documented purpose. Anyone who says "applications shouldn't be 
> using sched_yield()" doesn't know what they're talking about.
> 

Linux's SCHED_OTHER policy offers static priorities in the range [0..0].
I think anything else would be a bug, because from my reading of the
standards, a process with a higher static priority shall always preempt
a process with a lower priority.

And SCHED_OTHER simply doesn't work that way.

So sched_yield() from a SCHED_OTHER task is free to basically do anything
at all. Is that the kind of behaviour you had in mind?

>>  It's really more a feature than a bug that it breaks so easily
>>  because they should be really using futexes instead, which have much
>>  better behaviour than any sched_yield ever could (they will directly
>>  wake up another process waiting for the lock and avoid the thundering
>>  herd for contended locks)
> 
> 
> You assume that spinlocks are the only reason a developer may want to 
> yield the processor. This assumption is unfounded. Case in point - the 
> primary backend in OpenLDAP uses a transactional database with 
> page-level locking of its data structures to provide high levels of 
> concurrency. It is the nature of such a system to encounter deadlocks 
> over the normal course of operations. When a deadlock is detected, some 
> thread must be chosen (by one of a variety of algorithms) to abort its 
> transaction, in order to allow other operations to proceed to 
> completion. In this situation, the chosen thread must get control of the 
> CPU long enough to clean itself up, and then it must yield the CPU in 
> order to allow any other competing threads to complete their 
> transaction. The thread with the aborted transaction relinquishes all of 
> its locks and then waits to get another shot at the CPU to try 
> everything over again. Again, this is all fundamental to the nature of 

You didn't explain why you can't use a mutex to do this. From
your brief description, it seems like a mutex might just do
the job nicely.

> transactional programming. If the 2.6 kernel makes this programming 
> model unreasonably slow, then quite simply this kernel is not viable as 
> a database platform.
> 

Actually it should still be fast. It may yield excessive CPU to
other tasks (including those that are reniced). You didn't rely
on sched_yield providing some semantics about not doing such a
thing, did you?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* re: sched_yield() makes OpenLDAP slow
@ 2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Howard Chu @ 2005-08-19  6:03 UTC (permalink / raw)
  To: linux-kernel

Hm, seems there's a great deal of misinformation in this thread.

>  I also think OpenLDAP is wrong. First, it should be calling
>  pthread_yield() because slapd is a multithreading process and it just
>  wants to run the other threads. See:
...
>  AFAIKS, sched_yield should only really be used by realtime
>  applications that know exactly what they're doing.

pthread_yield() was deleted from the POSIX threads drafts years ago. 
sched_yield() is the officially supported API, and OpenLDAP is using it 
for the documented purpose. Anyone who says "applications shouldn't be 
using sched_yield()" doesn't know what they're talking about.

>  It's really more a feature than a bug that it breaks so easily
>  because they should be really using futexes instead, which have much
>  better behaviour than any sched_yield ever could (they will directly
>  wake up another process waiting for the lock and avoid the thundering
>  herd for contended locks)

You assume that spinlocks are the only reason a developer may want to 
yield the processor. This assumption is unfounded. Case in point - the 
primary backend in OpenLDAP uses a transactional database with 
page-level locking of its data structures to provide high levels of 
concurrency. It is the nature of such a system to encounter deadlocks 
over the normal course of operations. When a deadlock is detected, some 
thread must be chosen (by one of a variety of algorithms) to abort its 
transaction, in order to allow other operations to proceed to 
completion. In this situation, the chosen thread must get control of the 
CPU long enough to clean itself up, and then it must yield the CPU in 
order to allow any other competing threads to complete their 
transaction. The thread with the aborted transaction relinquishes all of 
its locks and then waits to get another shot at the CPU to try 
everything over again. Again, this is all fundamental to the nature of 
transactional programming. If the 2.6 kernel makes this programming 
model unreasonably slow, then quite simply this kernel is not viable as 
a database platform.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  3:19       ` Andi Kleen
@ 2005-08-19  3:30         ` Bernardo Innocenti
  0 siblings, 0 replies; 49+ messages in thread
From: Bernardo Innocenti @ 2005-08-19  3:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, nickpiggin, Giovanni Bajo

Andi Kleen wrote:
> Bernardo Innocenti <bernie@develer.com> writes:
> 
> It's really more a feature than a bug that it breaks so easily
> because they should be really using futexes instead, which
> have much better behaviour than any sched_yield ever could
> (they will directly wake up another process waiting for the
> lock and avoid the thundering herd for contended locks) 

Actually, I believe they should be using pthread synchronization
primitives instead of relying on Linux-specific functionality.
Glibc already uses futexes internally, so it's almost as efficient.

I've already suggested this to the OpenLDAP people, but with
my limited knowledge of slapd threading requirements, there
may well be a very good reason for busy-waiting with
sched_yield().  Waiting for their answer.

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found]     ` <43054D9A.7090509@develer.com.suse.lists.linux.kernel>
@ 2005-08-19  3:19       ` Andi Kleen
  2005-08-19  3:30         ` Bernardo Innocenti
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2005-08-19  3:19 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: linux-kernel, nickpiggin

Bernardo Innocenti <bernie@develer.com> writes:

It's really more a feature than a bug that it breaks so easily
because they should be really using futexes instead, which
have much better behaviour than any sched_yield ever could
(they will directly wake up another process waiting for the
lock and avoid the thundering herd for contended locks) 

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  2:58   ` Nick Piggin
@ 2005-08-19  3:10     ` Bernardo Innocenti
  0 siblings, 0 replies; 49+ messages in thread
From: Bernardo Innocenti @ 2005-08-19  3:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Joseph Fannin, lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Nick Piggin wrote:

> We class the SCHED_OTHER policy as having a single priority, which
> I believe is allowed (and even makes good sense, because dynamic
> and even nice priorities aren't really well defined).
> 
> That also makes our sched_yield() behaviour correct.
> 
> AFAIKS, sched_yield should only really be used by realtime
> applications that know exactly what they're doing.

I'm pretty sure this has already been discussed in the
past, but I fail to see why this new behavior of
sched_yield() would be more correct.

In the OpenLDAP bug discussion, one of the developers
considers this a Linux quirk needing a workaround, not
a real bug in OpenLDAP.

As I understand it, the old behavior was to push the
yielding process to the end of the queue for processes
with the same niceness.  This is somewhat closer to
the (vague) definition in the POSIX man pages:

 The sched_yield() function shall force the running
 thread to relinquish the processor until it again
 becomes the head of its thread list. It takes no arguments.

Pushing the process far behind in the queue, even after
niced CPU crunchers, appears a bit extreme.  It seems
most programs expect sched_yield() to only reschedule
the calling thread wrt its sibling threads, to be used
to implement do-it-yourself spinlocks and the like.

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:47 ` Con Kolivas
@ 2005-08-18 10:48   ` Maciej Soltysiak
  0 siblings, 0 replies; 49+ messages in thread
From: Maciej Soltysiak @ 2005-08-18 10:48 UTC (permalink / raw)
  To: lkml

Hello Con,

Thursday, August 18, 2005, 2:47:25 AM, you wrote:
> sched_yield behaviour changed in 2.5 series more than 3 years ago and
> applications that use this as a locking primitive should be updated.
I remember OpenOffice had a problem with excessive use of sched_yield()
during 2.5. I guess they changed it, but I have not checked.
Does anyone know?

Back then OO was having serious latency problems on 2.5.

Regards,
Maciej



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  1:07 ` Joseph Fannin
  2005-08-18  2:25   ` Bernardo Innocenti
@ 2005-08-18  2:58   ` Nick Piggin
  2005-08-19  3:10     ` Bernardo Innocenti
  1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2005-08-18  2:58 UTC (permalink / raw)
  To: Joseph Fannin
  Cc: Bernardo Innocenti, lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Joseph Fannin wrote:

>On Thu, Aug 18, 2005 at 02:50:16AM +0200, Bernardo Innocenti wrote:
>
>
>>The relative timestamp reveals that slapd is spending 50ms
>>after yielding.  Meanwhile, GCC is probably being scheduled
>>for a whole quantum.
>>
>>Reading the man-page of sched_yield() it seems this isn't
>>the correct behavior:
>>
>>   Note: If the current process is the only process in the
>>   highest priority list at that time, this process will
>>   continue to run after a call to sched_yield.
>>
>
>   The behavior of sched_yield changed for 2.6.  I suppose the man
>page didn't get updated.
>
>

We class the SCHED_OTHER policy as having a single priority, which
I believe is allowed (and even makes good sense, because dynamic
and even nice priorities aren't really well defined).

That also makes our sched_yield() behaviour correct.

AFAIKS, sched_yield should only really be used by realtime
applications that know exactly what they're doing.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  1:07 ` Joseph Fannin
@ 2005-08-18  2:25   ` Bernardo Innocenti
  2005-08-18  2:58   ` Nick Piggin
  1 sibling, 0 replies; 49+ messages in thread
From: Bernardo Innocenti @ 2005-08-18  2:25 UTC (permalink / raw)
  To: Joseph Fannin; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Joseph Fannin wrote:

>    The behavior of sched_yield changed for 2.6.  I suppose the man
> page didn't get updated.

Now I remember reading about that on LWN or maybe KernelTraffic.
Thanks!

>>I also think OpenLDAP is wrong.  First, it should be calling
>>pthread_yield() because slapd is a multithreading process
>>and it just wants to run the other threads.  See:
> 
>     Is it possible that this problem has been noticed and fixed
> already?

The OpenLDAP 2.3.5 source still looks like this.
I've filed a report in OpenLDAP's issue tracker:

 http://www.openldap.org/its/index.cgi/Incoming?id=3950;page=2

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:50 Bernardo Innocenti
  2005-08-18  0:47 ` Con Kolivas
@ 2005-08-18  1:07 ` Joseph Fannin
  2005-08-18  2:25   ` Bernardo Innocenti
  2005-08-18  2:58   ` Nick Piggin
  1 sibling, 2 replies; 49+ messages in thread
From: Joseph Fannin @ 2005-08-18  1:07 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

On Thu, Aug 18, 2005 at 02:50:16AM +0200, Bernardo Innocenti wrote:

> The relative timestamp reveals that slapd is spending 50ms
> after yielding.  Meanwhile, GCC is probably being scheduled
> for a whole quantum.
>
> Reading the man-page of sched_yield() it seems this isn't
> the correct behavior:
>
>    Note: If the current process is the only process in the
>    highest priority list at that time, this process will
>    continue to run after a call to sched_yield.

   The behavior of sched_yield changed for 2.6.  I suppose the man
page didn't get updated.

From linux/Documentation/post-halloween.txt:

| - The behavior of sched_yield() changed a lot.  A task that uses
|   this system call should now expect to sleep for possibly a very
|   long time.  Tasks that do not really desire to give up the
|   processor for a while should probably not make heavy use of this
|   function.  Unfortunately, some GUI programs (like Open Office)
|   do make excessive use of this call and under load their
|   performance is poor.  It seems this new 2.6 behavior is optimal
|   but some user-space applications may need fixing.

    This is pretty much all I know about it; I just thought I'd point
it out.

> I also think OpenLDAP is wrong.  First, it should be calling
> pthread_yield() because slapd is a multithreading process
> and it just wants to run the other threads.  See:

    Is it possible that this problem has been noticed and fixed
already?

--
Joseph Fannin
jfannin@gmail.com

 /* So there I am, in the middle of my `netfilter-is-wonderful'
talk in Sydney, and someone asks `What happens if you try
to enlarge a 64k packet here?'. I think I said something
eloquent like `fuck'. - RR */

^ permalink raw reply	[flat|nested] 49+ messages in thread

* sched_yield() makes OpenLDAP slow
@ 2005-08-18  0:50 Bernardo Innocenti
  2005-08-18  0:47 ` Con Kolivas
  2005-08-18  1:07 ` Joseph Fannin
  0 siblings, 2 replies; 49+ messages in thread
From: Bernardo Innocenti @ 2005-08-18  0:50 UTC (permalink / raw)
  To: lkml, OpenLDAP-devel; +Cc: Giovanni Bajo, Simone Zinanni

Hello,

I've been investigating a performance problem on a
server using OpenLDAP 2.2.26 for nss resolution and
running kernel 2.6.12.

When a CPU bound process such as GCC is running in the
background (even at nice 10), many trivial commands such
as "su" or "groups" become extremely slow and take a few
seconds to complete.

strace revealed that data exchange over the slapd socket
was where most of the time was spent.  Looking at the
slapd side, I see several calls to sched_yield() like this:


[pid  8780]      0.000033 stat64("gidNumber.dbb", 0xb7b3ebcc) = -1 EACCES (Permission denied)
[pid  8780]      0.000059 pread(20, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\2\0\344\17\2\3"..., 4096, 4096) = 4096
[pid  8780]      0.000083 pread(20, "\0\0\0\0\1\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\222\0<\7\1\5\370"..., 4096, 16384) = 4096
[pid  8780]      0.000078 time(NULL)    = 1124322520
[pid  8780]      0.000066 pread(11, "\0\0\0\0\1\0\0\0\250\0\0\0\231\0\0\0\235\0\0\0\16\0000"..., 4096, 688128) = 4096
[pid  8780]      0.000241 write(19, "0e\2\1\3d`\4$cn=bernie,ou=group,dc=d"..., 103) = 103
[pid  8780]      0.000137 sched_yield( <unfinished ...>
[pid  8781]      0.050020 <... sched_yield resumed> ) = 0
[pid  8780]      0.000025 <... sched_yield resumed> ) = 0
[pid  8781]      0.000060 futex(0x925ab20, FUTEX_WAIT, 33, NULL <unfinished ...>
[pid  8780]      0.000026 write(19, "0\f\2\1\3e\7\n\1\0\4\0\4\0", 14) = 14
[pid  8774]      0.000774 <... select resumed> ) = 1 (in [19])


The relative timestamp reveals that slapd is spending 50ms
after yielding.  Meanwhile, GCC is probably being scheduled
for a whole quantum.

Reading the man-page of sched_yield() it seems this isn't
the correct behavior:

   Note: If the current process is the only process in the
   highest priority list at that time, this process will
   continue to run after a call to sched_yield.

I also think OpenLDAP is wrong.  First, it should be calling
pthread_yield() because slapd is a multithreading process
and it just wants to run the other threads.  See:

int
ldap_pvt_thread_yield( void )
{
#if HAVE_THR_YIELD
        return thr_yield();

#elif HAVE_PTHREADS == 10
        return sched_yield();

#elif defined(_POSIX_THREAD_IS_GNU_PTH)
        sched_yield();
        return 0;

#elif HAVE_PTHREADS == 6
        pthread_yield(NULL);
        return 0;
#else
        pthread_yield();
        return 0;
#endif
}


-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:50 Bernardo Innocenti
@ 2005-08-18  0:47 ` Con Kolivas
  2005-08-18 10:48   ` Maciej Soltysiak
  2005-08-18  1:07 ` Joseph Fannin
  1 sibling, 1 reply; 49+ messages in thread
From: Con Kolivas @ 2005-08-18  0:47 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

On Thu, 18 Aug 2005 10:50 am, Bernardo Innocenti wrote:
> Hello,
>
> I've been investigating a performance problem on a
> server using OpenLDAP 2.2.26 for nss resolution and
> running kernel 2.6.12.
>
> When a CPU bound process such as GCC is running in the
> background (even at nice 10), many trivial commands such
> as "su" or "groups" become extremely slow and take a few
> seconds to complete.
>
> strace revealed that data exchange over the slapd socket
> was where most of the time was spent.  Looking at the
> slapd side, I see several calls to sched_yield() like this:
>
>
> [pid  8780]      0.000033 stat64("gidNumber.dbb", 0xb7b3ebcc) = -1 EACCES (Permission denied)
> [pid  8780]      0.000059 pread(20, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\2\0\344\17\2\3"..., 4096, 4096) = 4096
> [pid  8780]      0.000083 pread(20, "\0\0\0\0\1\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\222\0<\7\1\5\370"..., 4096, 16384) = 4096
> [pid  8780]      0.000078 time(NULL)    = 1124322520
> [pid  8780]      0.000066 pread(11, "\0\0\0\0\1\0\0\0\250\0\0\0\231\0\0\0\235\0\0\0\16\0000"..., 4096, 688128) = 4096
> [pid  8780]      0.000241 write(19, "0e\2\1\3d`\4$cn=bernie,ou=group,dc=d"..., 103) = 103
> [pid  8780]      0.000137 sched_yield( <unfinished ...>
> [pid  8781]      0.050020 <... sched_yield resumed> ) = 0
> [pid  8780]      0.000025 <... sched_yield resumed> ) = 0
> [pid  8781]      0.000060 futex(0x925ab20, FUTEX_WAIT, 33, NULL <unfinished ...>
> [pid  8780]      0.000026 write(19, "0\f\2\1\3e\7\n\1\0\4\0\4\0", 14) = 14
> [pid  8774]      0.000774 <... select resumed> ) = 1 (in [19])
>
>
> The relative timestamp reveals that slapd is spending 50ms
> after yielding.  Meanwhile, GCC is probably being scheduled
> for a whole quantum.
>
> Reading the man-page of sched_yield() it seems this isn't
> the correct behavior:
>
>    Note: If the current process is the only process in the
>    highest priority list at that time, this process will
>    continue to run after a call to sched_yield.
>
> I also think OpenLDAP is wrong.  First, it should be calling
> pthread_yield() because slapd is a multithreading process
> and it just wants to run the other threads.  See:

sched_yield behaviour changed in 2.5 series more than 3 years ago and 
applications that use this as a locking primitive should be updated.

Cheers,
Con

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2006-01-26 10:50 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-14 19:29 sched_yield() makes OpenLDAP slow Howard Chu
  -- strict thread matches above, loose matches on Subject: below --
2006-01-24 22:59 e100 oops on resume Stefan Seyfried
2006-01-24 23:21 ` Mattia Dongili
2006-01-25  9:02   ` Olaf Kirch
2006-01-25 12:11     ` Olaf Kirch
2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
2006-01-25 14:38         ` Robert Hancock
2006-01-25 17:49         ` Christopher Friesen
2006-01-26  1:07         ` David Schwartz
2006-01-26  8:30           ` Helge Hafting
2006-01-26  9:01             ` Nick Piggin
2006-01-26 10:50             ` Nikita Danilov
     [not found] <5uZqb-4fo-15@gated-at.bofh.it>
2006-01-14 22:47 ` Robert Hancock
     [not found] <43057641.70700@symas.com.suse.lists.linux.kernel>
     [not found] ` <17157.45712.877795.437505@gargle.gargle.HOWL.suse.lists.linux.kernel>
     [not found]   ` <430666DB.70802@symas.com.suse.lists.linux.kernel>
2005-08-20 13:48     ` Andi Kleen
2005-08-21 19:47       ` Florian Weimer
2005-08-22  5:09         ` Howard Chu
2005-08-22 13:06           ` Andi Kleen
2005-08-22 18:47             ` Howard Chu
2005-08-22 13:20           ` Florian Weimer
2005-08-22 23:19             ` Howard Chu
     [not found] <4D8eT-4rg-31@gated-at.bofh.it>
2005-08-20  3:20 ` Robert Hancock
2005-08-20  4:18   ` Nick Piggin
2005-08-20 18:38     ` Howard Chu
2005-08-20 20:57       ` Lee Revell
2005-08-20 21:24         ` Howard Chu
2005-08-21  0:36           ` Nick Piggin
2005-08-21 11:33           ` Nikita Danilov
2005-08-22  8:06             ` Howard Chu
2005-08-20 21:50       ` Lee Revell
2005-08-21  1:04       ` Robert Hancock
2005-08-22 11:44         ` linux-os (Dick Johnson)
2005-08-22 14:26           ` Robert Hancock
2005-08-23 11:17             ` linux-os (Dick Johnson)
2005-08-23 12:07               ` Denis Vlasenko
2005-08-24  3:37                 ` Lincoln Dale
2005-08-19  6:03 Howard Chu
2005-08-19  6:34 ` Nick Piggin
2005-08-19  6:59 ` Chris Wedgwood
2005-08-19 22:45   ` Howard Chu
2005-08-19 10:21 ` Nikita Danilov
2005-08-19 23:10   ` Howard Chu
2005-08-20 13:23     ` Nikita Danilov
2005-08-20 19:49       ` Howard Chu
2005-08-20 22:08         ` Nikita Danilov
     [not found] <4303DB48.8010902@develer.com.suse.lists.linux.kernel>
     [not found] ` <20050818010703.GA13127@nineveh.rivenstone.net.suse.lists.linux.kernel>
     [not found]   ` <4303F967.6000404@yahoo.com.au.suse.lists.linux.kernel>
     [not found]     ` <43054D9A.7090509@develer.com.suse.lists.linux.kernel>
2005-08-19  3:19       ` Andi Kleen
2005-08-19  3:30         ` Bernardo Innocenti
2005-08-18  0:50 Bernardo Innocenti
2005-08-18  0:47 ` Con Kolivas
2005-08-18 10:48   ` Maciej Soltysiak
2005-08-18  1:07 ` Joseph Fannin
2005-08-18  2:25   ` Bernardo Innocenti
2005-08-18  2:58   ` Nick Piggin
2005-08-19  3:10     ` Bernardo Innocenti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).