* e100 oops on resume
@ 2006-01-24 22:59 Stefan Seyfried
  2006-01-24 23:21 ` Mattia Dongili
  0 siblings, 1 reply; 127+ messages in thread
From: Stefan Seyfried @ 2006-01-24 22:59 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: netdev

Hi,
Since 2.6.16-rc1-git3, e100 dies on resume (regardless of whether resuming
from disk, from RAM, or via runtime power management). Unfortunately I only
have a bad photo of the oops right now; it is available from
https://bugzilla.novell.com/attachment.cgi?id=64761&action=view
I have reproduced this on a second e100 machine and can get a serial
console log from this machine tomorrow if needed.
It did resume fine with 2.6.15-git12.
-- 
Stefan Seyfried                  \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices      \ wanted to be paid for what I write."
SUSE LINUX Products GmbH, Nürnberg \                    -- Leonard Cohen


* Re: e100 oops on resume
  2006-01-24 22:59 e100 oops on resume Stefan Seyfried
@ 2006-01-24 23:21 ` Mattia Dongili
  2006-01-25  9:02   ` Olaf Kirch
  0 siblings, 1 reply; 127+ messages in thread
From: Mattia Dongili @ 2006-01-24 23:21 UTC (permalink / raw)
  To: Stefan Seyfried; +Cc: Linux Kernel Mailing List, netdev

On Tue, Jan 24, 2006 at 11:59:19PM +0100, Stefan Seyfried wrote:
> Hi,
> Since 2.6.16-rc1-git3, e100 dies on resume (regardless of whether resuming
> from disk, from RAM, or via runtime power management). Unfortunately I only
> have a bad photo of the oops right now; it is available from
> https://bugzilla.novell.com/attachment.cgi?id=64761&action=view
> I have reproduced this on a second e100 machine and can get a serial
> console log from this machine tomorrow if needed.
> It did resume fine with 2.6.15-git12.

I experienced the same today; I was planning to get a photo tomorrow :)
I'm running 2.6.16-rc1-mm2, and the last working kernel was 2.6.15-mm4
(I didn't try 2.6.16-rc1-mm1, being scared of the reiserfs breakage).

-- 
mattia
:wq!


* Re: e100 oops on resume
  2006-01-24 23:21 ` Mattia Dongili
@ 2006-01-25  9:02   ` Olaf Kirch
  2006-01-25 12:11     ` Olaf Kirch
  0 siblings, 1 reply; 127+ messages in thread
From: Olaf Kirch @ 2006-01-25  9:02 UTC (permalink / raw)
  To: Stefan Seyfried, Linux Kernel Mailing List, netdev

[-- Attachment #1: Type: text/plain, Size: 1478 bytes --]

On Wed, Jan 25, 2006 at 12:21:42AM +0100, Mattia Dongili wrote:
> I experienced the same today; I was planning to get a photo tomorrow :)
> I'm running 2.6.16-rc1-mm2, and the last working kernel was 2.6.15-mm4
> (I didn't try 2.6.16-rc1-mm1, being scared of the reiserfs breakage).

I think that's because the latest driver version wants to wait for
the ucode download, and calls e100_exec_cb_wait before allocating any
control blocks.

static inline int e100_exec_cb_wait(struct nic *nic, struct sk_buff *skb,
        void (*cb_prepare)(struct nic *, struct cb *, struct sk_buff *))
{
        int err = 0, counter = 50;
        struct cb *cb = nic->cb_to_clean;

        if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
                DPRINTK(PROBE,ERR, "ucode cmd failed with error %d\n", err);
	/* NOTE: the oops shows that e100_exec_cb fails with ENOMEM,
  	 * which also means there are no cbs */

	/* ... other stuff...
	 * and then we die here because cb is NULL: */
        while (!(cb->status & cpu_to_le16(cb_complete))) {
                msleep(10);
                if (!--counter) break;
        }

I'm not sure what the right fix would be. e100_resume would probably
have to call e100_alloc_cbs early on, while e100_up should avoid
calling it a second time if nic->cbs_avail != 0. A tentative patch
for testing is attached.

Olaf
-- 
Olaf Kirch   |  --- o --- Nous sommes du soleil we love when we play
okir@suse.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

[-- Attachment #2: e100-resume-fix --]
[-- Type: text/plain, Size: 1830 bytes --]

[PATCH] e100: allocate cbs early on when resuming

Signed-off-by: Olaf Kirch <okir@suse.de>

 drivers/net/e100.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

Index: build/drivers/net/e100.c
===================================================================
--- build.orig/drivers/net/e100.c
+++ build/drivers/net/e100.c
@@ -1298,8 +1298,10 @@ static inline int e100_exec_cb_wait(stru
 	int err = 0, counter = 50;
 	struct cb *cb = nic->cb_to_clean;
 
-	if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
+	if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode))) {
 		DPRINTK(PROBE,ERR, "ucode cmd failed with error %d\n", err);
+		return err;
+	}
 
 	/* must restart cuc */
 	nic->cuc_cmd = cuc_start;
@@ -1721,9 +1723,11 @@ static int e100_alloc_cbs(struct nic *ni
 	struct cb *cb;
 	unsigned int i, count = nic->params.cbs.count;
 
+	/* bail out if we've been here before */
+	if (nic->cbs_avail)
+		return 0;
+
 	nic->cuc_cmd = cuc_start;
-	nic->cb_to_use = nic->cb_to_send = nic->cb_to_clean = NULL;
-	nic->cbs_avail = 0;
 
 	nic->cbs = pci_alloc_consistent(nic->pdev,
 		sizeof(struct cb) * count, &nic->cbs_dma_addr);
@@ -2578,6 +2582,8 @@ static int __devinit e100_probe(struct p
 	nic->pdev = pdev;
 	nic->msg_enable = (1 << debug) - 1;
 	pci_set_drvdata(pdev, netdev);
+	nic->cb_to_use = nic->cb_to_send = nic->cb_to_clean = NULL;
+	nic->cbs_avail = 0;
 
 	if((err = pci_enable_device(pdev))) {
 		DPRINTK(PROBE, ERR, "Cannot enable PCI device, aborting.\n");
@@ -2752,6 +2758,8 @@ static int e100_resume(struct pci_dev *p
 	retval = pci_enable_wake(pdev, 0, 0);
 	if (retval)
 		DPRINTK(PROBE,ERR, "Error clearing wake events\n");
+	if ((retval = e100_alloc_cbs(nic)))
+		DPRINTK(PROBE,ERR, "No memory for cbs\n");
 	if(e100_hw_init(nic))
 		DPRINTK(HW, ERR, "e100_hw_init failed\n");
 


* Re: e100 oops on resume
  2006-01-25  9:02   ` Olaf Kirch
@ 2006-01-25 12:11     ` Olaf Kirch
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
  2006-01-25 19:37       ` e100 oops on resume Jesse Brandeburg
  0 siblings, 2 replies; 127+ messages in thread
From: Olaf Kirch @ 2006-01-25 12:11 UTC (permalink / raw)
  To: Stefan Seyfried, Linux Kernel Mailing List, netdev

On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> I'm not sure what the right fix would be. e100_resume would probably
> have to call e100_alloc_cbs early on, while e100_up should avoid
> calling it a second time if nic->cbs_avail != 0. A tentative patch
> for testing is attached.

Reportedly, the patch fixes the crash on resume.

Olaf
-- 
Olaf Kirch   |  --- o --- Nous sommes du soleil we love when we play
okir@suse.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 12:11     ` Olaf Kirch
@ 2006-01-25 13:51       ` Howard Chu
  2006-01-25 14:38         ` Robert Hancock
                           ` (2 more replies)
  2006-01-25 19:37       ` e100 oops on resume Jesse Brandeburg
  1 sibling, 3 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-25 13:51 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: hancockr


Robert Hancock wrote:
 > Howard Chu wrote:
 > > POSIX requires a reschedule to occur, as noted here:
 > > http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/
 >
 > No, it doesn't:
 >
 > >
 > > The relevant SUSv3 text is here
 > > http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html
 >
 > "If there are threads blocked on the mutex object referenced by mutex
 > when pthread_mutex_unlock() is called, resulting in the mutex becoming
 > available, the scheduling policy shall determine which thread shall
 > acquire the mutex."
 >
 > This says nothing about requiring a reschedule. The "scheduling policy"
 > can well decide that the thread which just released the mutex can
 > re-acquire it.

No, because the thread that just released the mutex is obviously not one 
of the threads blocked on the mutex. When a mutex is unlocked, one of 
the *waiting* threads at the time of the unlock must acquire it, and the 
scheduling policy can determine that. But the thread that released the 
mutex is not one of the waiting threads, and is not eligible for 
consideration.
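
For concreteness, here is a minimal sketch of the disputed case (hypothetical
test code, not from OpenLDAP; the sleep() is only a crude attempt to have the
second thread already blocked at the moment of the unlock):

/* Hypothetical demo: main unlocks a mutex that another thread is
 * (very probably) blocked on, then immediately tries to relock it.
 * Which thread gets the mutex is exactly the point in dispute. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *waiter(void *arg)
{
	pthread_mutex_lock(&m);		/* blocks while main holds m */
	printf("waiter got the mutex\n");
	pthread_mutex_unlock(&m);
	return arg;
}

int main(void)
{
	pthread_t t;

	pthread_mutex_lock(&m);
	pthread_create(&t, NULL, waiter, NULL);
	sleep(1);			/* let the waiter block on m */

	pthread_mutex_unlock(&m);	/* a waiter is (probably) queued */
	if (pthread_mutex_trylock(&m) == 0) {
		/* The releasing thread got the mutex back ahead of the
		 * waiter - the behavior I am calling non-conformant. */
		printf("releasing thread reacquired the mutex\n");
		pthread_mutex_unlock(&m);
	}
	pthread_join(t, NULL);
	return 0;
}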

 > > I suppose if pthread_mutex_unlock() actually behaved correctly we could
 > > remove the other sched_yield() hacks that didn't belong there in the
 > > first place and go on our merry way.
 >
 > Generally, needing to implement hacks like this is a sign that there are
 > problems with the synchronization design of the code (like a mutex which
 > has excessive contention). Programs should not rely on the scheduling
 > behavior of the kernel for proper operation when that behavior is not
 > defined.
 >
 > --
 > Robert Hancock      Saskatoon, SK, Canada
 > To email, remove "nospam" from hancockr@nospamshaw.ca
 > Home Page: http://www.roberthancock.com/

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
@ 2006-01-25 14:38         ` Robert Hancock
  2006-01-25 17:49         ` Christopher Friesen
  2006-01-26  1:07         ` sched_yield() makes OpenLDAP slow David Schwartz
  2 siblings, 0 replies; 127+ messages in thread
From: Robert Hancock @ 2006-01-25 14:38 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu wrote:
> No, because the thread that just released the mutex is obviously not one 
> of the threads blocked on the mutex. When a mutex is unlocked, one of 
> the *waiting* threads at the time of the unlock must acquire it, and the 
> scheduling policy can determine that. But the thread that released the 
> mutex is not one of the waiting threads, and is not eligible for 
> consideration.

That statement does not imply that any reschedule needs to happen at the 
time of the mutex unlock at all, only that the other threads waiting on 
the mutex can attempt to reacquire it when the scheduler allows them to. 
  In all likelihood, what tends to happen is that either the thread that 
had the mutex previously still has time left in its timeslice and is 
allowed to keep running and reacquire the mutex, or another thread is 
woken up (perhaps on another CPU) but doesn't reacquire the mutex before 
the original thread carries on and acquires it, and therefore goes back 
to sleep.

Forcing the mutex to ping-pong between different threads would be quite 
inefficient (especially on SMP machines), and is not something that 
POSIX requires.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
  2006-01-25 14:38         ` Robert Hancock
@ 2006-01-25 17:49         ` Christopher Friesen
  2006-01-25 18:26           ` pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow) Howard Chu
  2006-01-26  1:07         ` sched_yield() makes OpenLDAP slow David Schwartz
  2 siblings, 1 reply; 127+ messages in thread
From: Christopher Friesen @ 2006-01-25 17:49 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> 
> Robert Hancock wrote:

>  > This says nothing about requiring a reschedule. The "scheduling policy"
>  > can well decide that the thread which just released the mutex can
>  > re-acquire it.
> 
> No, because the thread that just released the mutex is obviously not one 
> of the threads blocked on the mutex. When a mutex is unlocked, one of 
> the *waiting* threads at the time of the unlock must acquire it, and the 
> scheduling policy can determine that. But the thread that released the 
> mutex is not one of the waiting threads, and is not eligible for 
> consideration.

Is it *required* that the new owner of the mutex is determined at the 
time of mutex release?

If the kernel doesn't actually determine the new owner of the mutex 
until the currently running thread swaps out, it would be possible for 
the currently running thread to re-acquire the mutex.

Chris


* pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 17:49         ` Christopher Friesen
@ 2006-01-25 18:26           ` Howard Chu
  2006-01-25 18:59             ` Nick Piggin
                               ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-25 18:26 UTC (permalink / raw)
  To: Christopher Friesen; +Cc: Linux Kernel Mailing List, hancockr

Christopher Friesen wrote:
> Howard Chu wrote:
>>
>> Robert Hancock wrote:
>
>>  > This says nothing about requiring a reschedule. The "scheduling 
>> policy"
>>  > can well decide that the thread which just released the mutex can
>>  > re-acquire it.
>>
>> No, because the thread that just released the mutex is obviously not 
>> one of the threads blocked on the mutex. When a mutex is unlocked, 
>> one of the *waiting* threads at the time of the unlock must acquire 
>> it, and the scheduling policy can determine that. But the thread that 
>> released the mutex is not one of the waiting threads, and is not 
>> eligible for consideration.
>
> Is it *required* that the new owner of the mutex is determined at the 
> time of mutex release?
>
> If the kernel doesn't actually determine the new owner of the mutex 
> until the currently running thread swaps out, it would be possible for 
> the currently running thread to re-acquire the mutex.

The SUSv3 text seems pretty clear. It says "WHEN pthread_mutex_unlock() 
is called, ... the scheduling policy SHALL decide ..." It doesn't say 
MAY, and it doesn't say "some undefined time after the call." There is 
nothing optional or implementation-defined here. The only thing that is 
not explicitly stated is what happens when there are no waiting threads; 
in that case obviously the running thread can continue running.

re: forcing the mutex to ping-pong between different threads - if that 
is inefficient, then the thread scheduler needs to be tuned differently. 
Threads and thread context switches are supposed to be cheap, otherwise 
you might as well just program with fork() instead. (And of course, back 
when Unix was first developed, *processes* were lightweight, compared to 
other extant OSs.)

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 18:26           ` pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow) Howard Chu
@ 2006-01-25 18:59             ` Nick Piggin
  2006-01-25 19:32               ` Howard Chu
  2006-01-25 21:06             ` Lee Revell
  2006-01-26  0:08             ` Robert Hancock
  2 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-25 18:59 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> Christopher Friesen wrote:
> 
>> Howard Chu wrote:
>>
>>>
>>> Robert Hancock wrote:
>>
>>
>>>  > This says nothing about requiring a reschedule. The "scheduling 
>>> policy"
>>>  > can well decide that the thread which just released the mutex can
>>>  > re-acquire it.
>>>
>>> No, because the thread that just released the mutex is obviously not 
>>> one of the threads blocked on the mutex. When a mutex is unlocked, 
>>> one of the *waiting* threads at the time of the unlock must acquire 
>>> it, and the scheduling policy can determine that. But the thread that 
>>> released the mutex is not one of the waiting threads, and is not 
>>> eligible for consideration.
>>
>>
>> Is it *required* that the new owner of the mutex is determined at the 
>> time of mutex release?
>>
>> If the kernel doesn't actually determine the new owner of the mutex 
>> until the currently running thread swaps out, it would be possible for 
>> the currently running thread to re-acquire the mutex.
> 
> 
> The SUSv3 text seems pretty clear. It says "WHEN pthread_mutex_unlock() 
> is called, ... the scheduling policy SHALL decide ..." It doesn't say 
> MAY, and it doesn't say "some undefined time after the call." There is 
> nothing optional or implementation-defined here. The only thing that is 
> not explicitly stated is what happens when there are no waiting threads; 
> in that case obviously the running thread can continue running.
> 

But it doesn't say the unlocking thread must yield to the new mutex
owner, only that the scheduling policy shall determine which
thread acquires the lock.

It doesn't say that decision must be made immediately, either (e.g.
it could be made as a by-product of which contender is chosen to run
next).

I think the intention of the wording is that for deterministic policies,
it is clear that the waiting threads are actually woken and reevaluated
for scheduling. In the case of SCHED_OTHER, it means basically nothing,
considering the scheduling policy is arbitrary.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 18:59             ` Nick Piggin
@ 2006-01-25 19:32               ` Howard Chu
  2006-01-26  8:51                 ` Nick Piggin
  2006-01-26 10:38                 ` Nikita Danilov
  0 siblings, 2 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-25 19:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Nick Piggin wrote:
> Howard Chu wrote:
>> The SUSv3 text seems pretty clear. It says "WHEN 
>> pthread_mutex_unlock() is called, ... the scheduling policy SHALL 
>> decide ..." It doesn't say MAY, and it doesn't say "some undefined 
>> time after the call." There is nothing optional or 
>> implementation-defined here. The only thing that is not explicitly 
>> stated is what happens when there are no waiting threads; in that 
>> case obviously the running thread can continue running.
>>
>
> But it doesn't say the unlocking thread must yield to the new mutex
> owner, only that the scheduling policy shall determine which
> thread acquires the lock.

True, the unlocking thread doesn't have to yield to the new mutex owner 
as a direct consequence of the unlock. But logically, if the unlocking 
thread subsequently calls mutex_lock, it must block, because some other 
thread has already been assigned ownership of the mutex.

> It doesn't say that decision must be made immediately, either (e.g.
> it could be made as a by-product of which contender is chosen to run
> next).

A straightforward reading of the language here says the decision happens 
"when pthread_mutex_unlock() is called" and not at any later time. There 
is nothing here to support your interpretation.
>
> I think the intention of the wording is that for deterministic policies,
> it is clear that the waiting threads are actually woken and reevaluated
> for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> considering the scheduling policy is arbitrary.
>
Clearly the point is that one of the waiting threads is woken and gets 
the mutex, and it doesn't matter which thread is chosen. I.e., whatever 
thread the scheduling policy chooses. The fact that SCHED_OTHER can 
choose arbitrarily is immaterial, it still can only choose one of the 
waiting threads.

The fact that SCHED_OTHER's scheduling behavior is undefined is not free 
license to implement whatever you want. Scheduling policies are an 
optional feature; the basic thread behavior must still be consistent 
even on systems that don't implement scheduling policies.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: e100 oops on resume
  2006-01-25 12:11     ` Olaf Kirch
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
@ 2006-01-25 19:37       ` Jesse Brandeburg
  2006-01-25 20:14         ` Olaf Kirch
  2006-01-26  0:28         ` Jesse Brandeburg
  1 sibling, 2 replies; 127+ messages in thread
From: Jesse Brandeburg @ 2006-01-25 19:37 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: Stefan Seyfried, Linux Kernel Mailing List, netdev

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]

On 1/25/06, Olaf Kirch <okir@suse.de> wrote:
> On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> > I'm not sure what the right fix would be. e100_resume would probably
> > have to call e100_alloc_cbs early on, while e100_up should avoid
> > calling it a second time if nic->cbs_avail != 0. A tentative patch
> > for testing is attached.
>
> Reportedly, the patch fixes the crash on resume.

Cool, thanks for the research. I have a concern about this, however.

It's an interesting patch, but it raises the question: why does
e100_init_hw need to be called at all in resume?  I looked back
through our history and that init_hw call has always been there.  I
think it's incorrect, but it's taking me a while to set up a system with
the ability to resume.

Everywhere else in the driver, alloc_cbs is called before init_hw, so it
just seems like a long-standing bug.

Comments?  Anyone want to test? I compile-tested this, but it is otherwise untested.

[-- Attachment #2: e100_resume_no_init.diff --]
[-- Type: application/octet-stream, Size: 818 bytes --]

e100: remove init_hw call to fix panic

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

e100 seems to have had a long-standing bug where e100_init_hw was being
called when it should not have been.  This caused a panic due to recent
changes that rely on correct setup in the driver, and more robust error
paths.
---

 drivers/net/e100.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -2752,8 +2752,6 @@ static int e100_resume(struct pci_dev *p
 	retval = pci_enable_wake(pdev, 0, 0);
 	if (retval)
 		DPRINTK(PROBE,ERR, "Error clearing wake events\n");
-	if(e100_hw_init(nic))
-		DPRINTK(HW, ERR, "e100_hw_init failed\n");
 
 	netif_device_attach(netdev);
 	if(netif_running(netdev))


* Re: e100 oops on resume
  2006-01-25 19:37       ` e100 oops on resume Jesse Brandeburg
@ 2006-01-25 20:14         ` Olaf Kirch
  2006-01-25 22:28           ` Jesse Brandeburg
  2006-01-26  0:28         ` Jesse Brandeburg
  1 sibling, 1 reply; 127+ messages in thread
From: Olaf Kirch @ 2006-01-25 20:14 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Stefan Seyfried, Linux Kernel Mailing List, netdev

On Wed, Jan 25, 2006 at 11:37:40AM -0800, Jesse Brandeburg wrote:
> It's an interesting patch, but it raises the question: why does
> e100_init_hw need to be called at all in resume?  I looked back
> through our history and that init_hw call has always been there.  I
> think it's incorrect, but it's taking me a while to set up a system with
> the ability to resume.

I'll ask the folks here to give it a try tomorrow. But I suspect at
least some of it will be needed. For instance I assume you'll
have to reload the ucode when bringing the NIC back from sleep.

Olaf
-- 
Olaf Kirch   |  --- o --- Nous sommes du soleil we love when we play
okir@suse.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 18:26           ` pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow) Howard Chu
  2006-01-25 18:59             ` Nick Piggin
@ 2006-01-25 21:06             ` Lee Revell
  2006-01-25 22:14               ` Howard Chu
  2006-01-26  0:08             ` Robert Hancock
  2 siblings, 1 reply; 127+ messages in thread
From: Lee Revell @ 2006-01-25 21:06 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

On Wed, 2006-01-25 at 10:26 -0800, Howard Chu wrote:
> The SUSv3 text seems pretty clear. It says "WHEN
> pthread_mutex_unlock() 
> is called, ... the scheduling policy SHALL decide ..." It doesn't say 
> MAY, and it doesn't say "some undefined time after the call."  

This does NOT require pthread_mutex_unlock() to cause the scheduler to
immediately pick a new runnable process.  It only says it's up to the
scheduling POLICY what to do.  The policy could be "let the unlocking
thread finish its timeslice then reschedule".

Lee



* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 21:06             ` Lee Revell
@ 2006-01-25 22:14               ` Howard Chu
  2006-01-26  0:16                 ` Robert Hancock
                                   ` (3 more replies)
  0 siblings, 4 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-25 22:14 UTC (permalink / raw)
  To: Lee Revell; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Lee Revell wrote:
> On Wed, 2006-01-25 at 10:26 -0800, Howard Chu wrote:
>   
>> The SUSv3 text seems pretty clear. It says "WHEN
>> pthread_mutex_unlock() 
>> is called, ... the scheduling policy SHALL decide ..." It doesn't say 
>> MAY, and it doesn't say "some undefined time after the call."  
>>     
>
> This does NOT require pthread_mutex_unlock() to cause the scheduler to
> immediately pick a new runnable process.  It only says it's up to the
> scheduling POLICY what to do.  The policy could be "let the unlocking
> thread finish its timeslice then reschedule".
>   

This is obviously some very old ground.

http://groups.google.com/groups?threadm=etai7.108188%24B37.2381726%40news1.rdc1.bc.home.com

Kaz's post clearly interprets the POSIX spec differently from you. The 
policy can decide *which of the waiting threads* gets the mutex, but the 
releasing thread is totally out of the picture. For good or bad, the 
current pthread_mutex_unlock() is not POSIX-compliant. Now then, if 
we're forced to live with that, for efficiency's sake, that's OK, 
assuming that valid workarounds exist, such as inserting a sched_yield() 
after the unlock.

http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en&

But then we have to deal with you folks' bizarre notion that 
sched_yield() can legitimately be a no-op, which also defies the POSIX 
spec. Again, in SUSv3 "The /sched_yield/() function shall force the 
running thread to relinquish the processor until it again becomes the 
head of its thread list. It takes no arguments." There is no language 
here saying "sched_yield *may* do nothing at all." There are of course 
cases where it will have no effect, such as when called in a 
single-threaded program, but those are the exceptions that define the 
rule. Otherwise, the expectation is that some other runnable thread will 
acquire the CPU. Again, note that sched_yield() is a core function of 
the Threads specification, while scheduling policies are an optional 
feature. The function's core behavior (give up the CPU and make some 
other runnable thread run) is invariant; the current thread gives up the 
CPU regardless of which scheduling policy is in effect or even if 
scheduling policies are implemented at all. The only behavior that's 
open to implementors is which *of the other runnable threads* is chosen 
to take the place of the current thread.
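
The workaround mentioned above amounts to this pattern (an illustrative
sketch, not the actual OpenLDAP code):

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_one_work_unit(void)
{
	pthread_mutex_lock(&big_lock);
	/* ... touch the shared state ... */
	pthread_mutex_unlock(&big_lock);
	sched_yield();	/* intended: give a thread blocked on big_lock
			 * a chance to run before we loop and relock it */
}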

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: e100 oops on resume
  2006-01-25 20:14         ` Olaf Kirch
@ 2006-01-25 22:28           ` Jesse Brandeburg
  0 siblings, 0 replies; 127+ messages in thread
From: Jesse Brandeburg @ 2006-01-25 22:28 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: Stefan Seyfried, Linux Kernel Mailing List, netdev

On 1/25/06, Olaf Kirch <okir@suse.de> wrote:
> On Wed, Jan 25, 2006 at 11:37:40AM -0800, Jesse Brandeburg wrote:
> > It's an interesting patch, but it raises the question: why does
> > e100_init_hw need to be called at all in resume?  I looked back
> > through our history and that init_hw call has always been there.  I
> > think it's incorrect, but it's taking me a while to set up a system with
> > the ability to resume.
>
> I'll ask the folks here to give it a try tomorrow. But I suspect at
> least some of it will be needed. For instance I assume you'll
> have to reload the ucode when bringing the NIC back from sleep.

I totally agree that's what it looks like, but unless I'm missing
something, e100_up will take care of everything, and if the interface
is not up, e100_open->e100_up afterward will take care of it.

We have to be really careful about what might happen when resuming on
a system with an SMBus link to a BMC, as there are some tricky
transitions in the hardware that can be easily violated.

Jesse


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 18:26           ` pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow) Howard Chu
  2006-01-25 18:59             ` Nick Piggin
  2006-01-25 21:06             ` Lee Revell
@ 2006-01-26  0:08             ` Robert Hancock
  2 siblings, 0 replies; 127+ messages in thread
From: Robert Hancock @ 2006-01-26  0:08 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List

Howard Chu wrote:
> The SUSv3 text seems pretty clear. It says "WHEN pthread_mutex_unlock() 
> is called, ... the scheduling policy SHALL decide ..." It doesn't say 
> MAY, and it doesn't say "some undefined time after the call." There is 
> nothing optional or implementation-defined here. The only thing that is 
> not explicitly stated is what happens when there are no waiting threads; 
> in that case obviously the running thread can continue running.

It says the scheduling policy will decide who gets the mutex. It does 
not say that such a decision must be made immediately. That seems rather 
implementation defined to me.

> 
> re: forcing the mutex to ping-pong between different threads - if that 
> is inefficient, then the thread scheduler needs to be tuned differently. 
> Threads and thread context switches are supposed to be cheap, otherwise 
> you might as well just program with fork() instead. (And of course, back 
> when Unix was first developed, *processes* were lightweight, compared to 
> other extant OSs.)

This is nothing to do with the thread scheduler being inefficient. It is 
inherently inefficient to context-switch repeatedly no matter how good 
the kernel is. It trashes the CPU pipeline, at the very least, can cause 
thrashing of the CPU caches, and can cause cache lines to be pushed back 
and forth across the bus on SMP machines which really kills performance.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 22:14               ` Howard Chu
@ 2006-01-26  0:16                 ` Robert Hancock
  2006-01-26  0:49                   ` Howard Chu
  2006-01-26  2:05                 ` David Schwartz
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 127+ messages in thread
From: Robert Hancock @ 2006-01-26  0:16 UTC (permalink / raw)
  To: Howard Chu; +Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List

Howard Chu wrote:
> Kaz's post clearly interprets the POSIX spec differently from you. The 
> policy can decide *which of the waiting threads* gets the mutex, but the 
> releasing thread is totally out of the picture. For good or bad, the 
> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if 
> we're forced to live with that, for efficiency's sake, that's OK, 
> assuming that valid workarounds exist, such as inserting a sched_yield() 
> after the unlock.
> 
> http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en& 

Did you read the rest of this post?

"In any event, all the mutex fairness in the world won't solve the
problem. Consider if this lock/unlock cycle is inside a larger
lock/unlock cycle. Yielding at the unlock or blocking at the lock will
increase the dreadlock over the larger mutex.

The fact is, the threads library can't read the programmer's mind. So
it shouldn't try to, especially if that makes the common cases much
worse for the benefit of excruciatingly rare cases."

And earlier in that thread ("old behavior" referring to an old 
LinuxThreads version which allowed "unfair" locking):

"Notice however that even the old "unfair" behavior is perfectly
acceptable with respect to the POSIX standard: for the default
scheduling policy, POSIX makes no guarantees of fairness, such as "the
thread waiting for the mutex for the longest time always acquires it
first". Properly written multithreaded code avoids that kind of heavy
contention on mutexes, and does not run into fairness problems. If you
need scheduling guarantees, you should consider using the real-time
scheduling policies SCHED_RR and SCHED_FIFO, which have precisely
defined scheduling behaviors. "

If you indeed have some thread which is trying to do an essentially 
infinite amount of work, you really should not have that thread locking 
a mutex, which other threads need to acquire, for a large part of each 
cycle. Correctness aside, this is simply not efficient.
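
To make that concrete, here is a sketch of the restructuring (illustrative
only; expensive_computation() is a hypothetical stand-in for the per-cycle
work):

#include <pthread.h>

extern long expensive_computation(void);	/* hypothetical helper */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;

/* Contended: the expensive work is done while holding the lock. */
static void worker_contended(void)
{
	pthread_mutex_lock(&lock);
	shared_total += expensive_computation();
	pthread_mutex_unlock(&lock);
}

/* Better: compute outside the lock, hold it only for the update. */
static void worker_uncontended(void)
{
	long r = expensive_computation();

	pthread_mutex_lock(&lock);
	shared_total += r;
	pthread_mutex_unlock(&lock);
}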

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


* Re: e100 oops on resume
  2006-01-25 19:37       ` e100 oops on resume Jesse Brandeburg
  2006-01-25 20:14         ` Olaf Kirch
@ 2006-01-26  0:28         ` Jesse Brandeburg
  2006-01-26  9:32           ` Pavel Machek
                             ` (2 more replies)
  1 sibling, 3 replies; 127+ messages in thread
From: Jesse Brandeburg @ 2006-01-26  0:28 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: Stefan Seyfried, Linux Kernel Mailing List, netdev

On 1/25/06, Jesse Brandeburg <jesse.brandeburg@gmail.com> wrote:
> On 1/25/06, Olaf Kirch <okir@suse.de> wrote:
> > On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> > > I'm not sure what the right fix would be. e100_resume would probably
> > > have to call e100_alloc_cbs early on, while e100_up should avoid
> > > calling it a second time if nic->cbs_avail != 0. A tentative patch
> > > for testing is attached.
> >
> > Reportedly, the patch fixes the crash on resume.
>
> Cool, thanks for the research. I have a concern about this, however.
>
> It's an interesting patch, but it raises the question: why does
> e100_init_hw need to be called at all in resume?  I looked back
> through our history and that init_hw call has always been there.  I
> think it's incorrect, but it's taking me a while to set up a system with
> the ability to resume.
>
> Everywhere else in the driver, alloc_cbs is called before init_hw, so it
> just seems like a long-standing bug.
>
> Comments?  Anyone want to test? I compile-tested this, but it is otherwise untested.

Okay, I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
to show that my patch that just removes e100_init_hw works okay for
me.  Let me know how it goes for you; I think this is a good fix.

Jesse


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  0:16                 ` Robert Hancock
@ 2006-01-26  0:49                   ` Howard Chu
  2006-01-26  1:04                     ` Lee Revell
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-26  0:49 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List

Robert Hancock wrote:
> Howard Chu wrote:
>> Kaz's post clearly interprets the POSIX spec differently from you. 
>> The policy can decide *which of the waiting threads* gets the mutex, 
>> but the releasing thread is totally out of the picture. For good or 
>> bad, the current pthread_mutex_unlock() is not POSIX-compliant. Now 
>> then, if we're forced to live with that, for efficiency's sake, 
>> that's OK, assuming that valid workarounds exist, such as inserting a 
>> sched_yield() after the unlock.
>>
>> http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en& 
>
>
> Did you read the rest of this post?
>
> "In any event, all the mutex fairness in the world won't solve the
> problem. Consider if this lock/unlock cycle is inside a larger
> lock/unlock cycle. Yielding at the unlock or blocking at the lock will
> increase the dreadlock over the larger mutex.

Basic "fairness" isn't the issue. Fairness is concerned with which of 
*multiple waiting threads* gets the mutex, and that is certainly 
irrelevant here. The issue is that the releasing thread should not be a 
candidate.

The mutex functions are a core part of the thread specification; they 
have a fundamental behavior, and the definition says if there are 
blocked threads waiting on a mutex when it gets unlocked, one of the 
waiting threads gets the mutex. Which of the waiting threads gets it is 
unspecified in the core spec. On a system that implements the scheduling 
option, the scheduling policy specifies which thread. The scheduling 
policy is an optional feature, it serves only to refine the core 
functionality. A program written to the basic core specification should 
not break when run in an environment that implements optional features.

The spec may be mandating a non-optimal behavior, but that's a 
side-issue - someone should file an objection with the Open Group to get 
it redefined if it's such a bad idea. But for now, the NPTL 
implementation is non-conformant.

Standards aren't just academic exercises. They're meant to be useful. If 
the standard is too thinly specified, is ambiguous, or allows 
nonsensical behavior, it's not useful and should be fixed at the source, 
not just ignored and papered over in implementations.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  0:49                   ` Howard Chu
@ 2006-01-26  1:04                     ` Lee Revell
  2006-01-26  1:31                       ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Lee Revell @ 2006-01-26  1:04 UTC (permalink / raw)
  To: Howard Chu; +Cc: Robert Hancock, Christopher Friesen, Linux Kernel Mailing List

On Wed, 2006-01-25 at 16:49 -0800, Howard Chu wrote:
> Basic "fairness" isn't the issue. Fairness is concerned with which of 
> *multiple waiting threads* gets the mutex, and that is certainly 
> irrelevant here. The issue is that the releasing thread should not be
> a candidate.
> 

You seem to be making 2 controversial assertions:

1. pthread_mutex_unlock must cause an immediate reschedule if other
threads are blocked on the mutex, and 
2. if the unlocking thread immediately tries to relock the mutex,
another thread must get it first

I disagree with #1, which makes #2 irrelevant.  It would lead to
obviously incorrect behavior; pthread_mutex_unlock would no longer be an
RT-safe operation, for example.

Also consider a SCHED_FIFO policy - static priorities and the scheduler
always runs the highest priority runnable thread - under your
interpretation of POSIX a high priority thread unlocking a mutex would
require the scheduler to run a lower priority thread which violates
SCHED_FIFO semantics.

Lee





* RE: sched_yield() makes OpenLDAP slow
  2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
  2006-01-25 14:38         ` Robert Hancock
  2006-01-25 17:49         ` Christopher Friesen
@ 2006-01-26  1:07         ` David Schwartz
  2006-01-26  8:30           ` Helge Hafting
  2 siblings, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-26  1:07 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: hancockr


> Robert Hancock wrote:

>  > "If there are threads blocked on the mutex object referenced by mutex
>  > when pthread_mutex_unlock() is called, resulting in the mutex becoming
>  > available, the scheduling policy shall determine which thread shall
>  > acquire the mutex."
>  >
>  > This says nothing about requiring a reschedule. The "scheduling policy"
>  > can well decide that the thread which just released the mutex can
>  > re-acquire it.

> No, because the thread that just released the mutex is obviously not one
> of the threads blocked on the mutex.

	So what?

> When a mutex is unlocked, one of
> the *waiting* threads at the time of the unlock must acquire it, and the
> scheduling policy can determine that.

	This is false and is nowhere found in the standard.

> But the thread that released the
> mutex is not one of the waiting threads, and is not eligible for
> consideration.

	Where are you getting this from? Nothing requires the scheduler to schedule
any threads when the mutex is released.

	All that must happen is that the mutex must be unlocked. The scheduler is
permitted to allow any thread it wants to run at that point, or no thread.
Nothing says the thread that released the mutex can't continue running and
nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
before any other thread gets around to getting it.

	In general, it is very bad karma for the scheduler to stop a thread before
its timeslice is up if it doesn't have to. Consider one CPU and two threads,
each needing to do 100 quick lock/unlock cycles. Why force 200 context
switches?

	DS




* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  1:04                     ` Lee Revell
@ 2006-01-26  1:31                       ` Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-26  1:31 UTC (permalink / raw)
  To: Lee Revell; +Cc: Robert Hancock, Christopher Friesen, Linux Kernel Mailing List

Lee Revell wrote:
> On Wed, 2006-01-25 at 16:49 -0800, Howard Chu wrote:
>   
>> Basic "fairness" isn't the issue. Fairness is concerned with which of 
>> *multiple waiting threads* gets the mutex, and that is certainly 
>> irrelevant here. The issue is that the releasing thread should not be
>> a candidate.
>>
>>     
>
> You seem to be making 2 controversial assertions:
>
> 1. pthread_mutex_unlock must cause an immediate reschedule if other
> threads are blocked on the mutex, and 
> 2. if the unlocking thread immediately tries to relock the mutex,
> another thread must get it first
>
> I disagree with #1, which makes #2 irrelevant.  It would lead to
> obviously incorrect behavior; pthread_mutex_unlock would no longer be an
> RT-safe operation, for example.
>   

Actually no, I see that #1 is unnecessary, and already acknowledged as such
http://groups.google.com/group/fa.linux.kernel/msg/89da66017d53d496

But #2 still holds.

> Also consider a SCHED_FIFO policy - static priorities and the scheduler
> always runs the highest priority runnable thread - under your
> interpretation of POSIX a high priority thread unlocking a mutex would
> require the scheduler to run a lower priority thread which violates
> SCHED_FIFO semantics

See the Mutex Initialization Scheduling Attributes section which 
specifically addresses priority inversion:
http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09

If point #2 were not true, then there would be no need to bother with 
any of that. Instead that text ends with "it is important that 
IEEE Std 1003.1-2001 provide these interfaces for those cases in which 
it is necessary."

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/



* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 22:14               ` Howard Chu
  2006-01-26  0:16                 ` Robert Hancock
@ 2006-01-26  2:05                 ` David Schwartz
  2006-01-26  2:48                   ` Mark Lord
  2006-01-26  8:54                 ` Nick Piggin
  2006-01-26 10:44                 ` Nikita Danilov
  3 siblings, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-26  2:05 UTC (permalink / raw)
  To: Lee Revell; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr


> Kaz's post clearly interprets the POSIX spec differently from you. The
> policy can decide *which of the waiting threads* gets the mutex, but the
> releasing thread is totally out of the picture. For good or bad, the
> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
> we're forced to live with that, for efficiency's sake, that's OK,
> assuming that valid workarounds exist, such as inserting a sched_yield()
> after the unlock.

	My thanks to David Hopwood for providing me with the definitive refutation
of this position. The response is that the as-if rule allows the
implementation to violate the specification internally provided no compliant
application could tell the difference.

	When you call 'pthread_mutex_lock', there is no guarantee regarding how
long it will or might take until you are actually waiting for the mutex. So
no conforming application can ever tell whether or not it is waiting for the
mutex or about to wait for the mutex.

	So you cannot write an application that can tell the difference.

	His exact quote is, "It could have been the case that the other threads ran
more slowly, so that they didn't reach the point of blocking on the mutex
before the pthread_mutex_unlock()."

	You can find it on comp.programming.threads if you like.

	DS




* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  2:05                 ` David Schwartz
@ 2006-01-26  2:48                   ` Mark Lord
  2006-01-26  3:30                     ` David Schwartz
  0 siblings, 1 reply; 127+ messages in thread
From: Mark Lord @ 2006-01-26  2:48 UTC (permalink / raw)
  To: davids
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

David Schwartz wrote:
>> Kaz's post clearly interprets the POSIX spec differently from you. The
>> policy can decide *which of the waiting threads* gets the mutex, but the
>> releasing thread is totally out of the picture. For good or bad, the
>> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if
>> we're forced to live with that, for efficiency's sake, that's OK,
>> assuming that valid workarounds exist, such as inserting a sched_yield()
>> after the unlock.
> 
> 	My thanks to David Hopwood for providing me with the definitive refutation
> of this position. The response is that the as-if rule allows the
> implementation to violate the specification internally provided no compliant
> application could tell the difference.
> 
> 	When you call 'pthread_mutex_lock', there is no guarantee regarding how
> long it will or might take until you are actually waiting for the mutex. So
> no conforming application can ever tell whether or not it is waiting for the
> mutex or about to wait for the mutex.
> 
> 	So you cannot write an application that can tell the difference.

Not true.  The code for the relinquishing thread could indeed tell the difference.

-ml


* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  2:48                   ` Mark Lord
@ 2006-01-26  3:30                     ` David Schwartz
  2006-01-26  3:49                       ` Samuel Masham
  0 siblings, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-26  3:30 UTC (permalink / raw)
  To: lkml; +Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr


> > 	So you cannot write an application that can tell the difference.

> Not true.  The code for the relinquishing thread could indeed
> tell the difference.
>
> -ml

	It can tell the difference between the other thread getting the mutex first
and it getting the mutex first. But it cannot tell the difference between an
implementation that puts random sleeps before calls to 'pthread_mutex_lock'
and an implementation that has the allegedly non-compliant behavior. That
makes the behavior compliant under the 'as-if' rule.

	If you don't believe me, try to write a program that prints 'non-compliant'
on a system that has the alleged non-compliance but is guaranteed not to do
so on any compliant system. It cannot be done.

	In order to claim the alleged non-compliance, you would have to know that a
thread waiting for a mutex did not get it. But there is no possible way you
can know that another thread is waiting for the mutex (as opposed to being
about to wait for it). So you can never detect the claimed non-compliance,
so it's not non-compliance.

	This is definitive, really. It 100% refutes the claim.

	DS




* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  3:30                     ` David Schwartz
@ 2006-01-26  3:49                       ` Samuel Masham
  2006-01-26  4:02                         ` Samuel Masham
  0 siblings, 1 reply; 127+ messages in thread
From: Samuel Masham @ 2006-01-26  3:49 UTC (permalink / raw)
  To: davids
  Cc: lkml, Lee Revell, Christopher Friesen, Linux Kernel Mailing List,
	hancockr

On 26/01/06, David Schwartz <davids@webmaster.com> wrote:
>
> > >     So you cannot write an application that can tell the difference.
>
> > Not true.  The code for the relinquishing thread could indeed
> > tell the difference.
> >
> > -ml
>
>         It can tell the difference between the other thread getting the mutex first
> and it getting the mutex first. But it cannot tell the difference between an
> implementation that puts random sleeps before calls to 'pthread_mutex_lock'
> and an implementation that has the allegedly non-compliant behavior. That
> makes the behavior compliant under the 'as-if' rule.
>
>         If you don't believe me, try to write a program that prints 'non-compliant'
> on a system that has the alleged non-compliance but is guaranteed not to do
> so on any compliant system. It cannot be done.

Just put priority inheritance on, then in the running thread check
your priority; if it goes up, then the waiting thread is really
waiting.

Then if you can release + get the lock again, it's non-compliant.... no?

i.e. pthread_mutexattr_setprotocol(pthread_mutexattr_t *attr, int
protocol) with PTHREAD_PRIO_INHERIT
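
Roughly like this (a minimal sketch, assuming the platform implements the
_POSIX_THREAD_PRIO_INHERIT option):

#include <pthread.h>

static pthread_mutex_t m;

static int init_pi_mutex(void)
{
	pthread_mutexattr_t attr;
	int err;

	if ((err = pthread_mutexattr_init(&attr)))
		return err;
	/* Boost the holder's priority while a higher-priority thread
	 * is blocked on the mutex. */
	if ((err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)))
		return err;
	err = pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);
	return err;
}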

comment:
As an RT person I don't like the idea of scheduler bounce, so the way
round seems to be to have mutex lock acquisition work on a FIFO-like
basis.


>         In order to claim the alleged non-compliance, you would have to know that a
> thread waiting for a mutex did not get it. But there is no possible way you
> can know that another thread is waiting for the mutex (as opposed to being
> about to wait for it). So you can never detect the claimed non-compliance,
> so it's not non-compliance.
>
>         This is definitive, really. It 100% refutes the claim.
>
>         DS


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  3:49                       ` Samuel Masham
@ 2006-01-26  4:02                         ` Samuel Masham
  2006-01-26  4:53                           ` Lee Revell
  0 siblings, 1 reply; 127+ messages in thread
From: Samuel Masham @ 2006-01-26  4:02 UTC (permalink / raw)
  To: davids
  Cc: lkml, Lee Revell, Christopher Friesen, Linux Kernel Mailing List,
	hancockr

On 26/01/06, Samuel Masham <samuel.masham@gmail.com> wrote:
> comment:
> As an RT person I don't like the idea of scheduler bounce, so the way
> round seems to be to have mutex lock acquisition work on a FIFO-like
> basis.

which is obviously wrong...

However my basic point stands but needs to be clarified a bit:

I think I can print non-compliant if the mutex acquisition doesn't
respect the higher priority of the waiter over the current process
even if the mutex is "available".

OK?


* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  4:02                         ` Samuel Masham
@ 2006-01-26  4:53                           ` Lee Revell
  2006-01-26  6:14                             ` Samuel Masham
  0 siblings, 1 reply; 127+ messages in thread
From: Lee Revell @ 2006-01-26  4:53 UTC (permalink / raw)
  To: Samuel Masham
  Cc: davids, lkml, Christopher Friesen, Linux Kernel Mailing List, hancockr

On Thu, 2006-01-26 at 13:02 +0900, Samuel Masham wrote:
> On 26/01/06, Samuel Masham <samuel.masham@gmail.com> wrote:
> > comment:
> > As an RT person I don't like the idea of scheduler bounce, so the way
> > round seems to be to have mutex lock acquisition work on a FIFO-like
> > basis.
> 
> which is obviously wrong...
> 
> However my basic point stands but needs to be clarified a bit:
> 
> I think I can print non-compliant if the mutex acquisition doesn't
> respect the higher priority of the waiter over the current process
> even if the mutex is "available".
> 
> OK?

I don't think using an optional feature (PI) counts...

Lee



* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  4:53                           ` Lee Revell
@ 2006-01-26  6:14                             ` Samuel Masham
  0 siblings, 0 replies; 127+ messages in thread
From: Samuel Masham @ 2006-01-26  6:14 UTC (permalink / raw)
  To: Lee Revell
  Cc: davids, lkml, Christopher Friesen, Linux Kernel Mailing List, hancockr

On 26/01/06, Lee Revell <rlrevell@joe-job.com> wrote:
> On Thu, 2006-01-26 at 13:02 +0900, Samuel Masham wrote:
> > On 26/01/06, Samuel Masham <samuel.masham@gmail.com> wrote:
> > > comment:
> > > As an RT person I don't like the idea of scheduler bounce, so the way
> > > round seems to be to have mutex lock acquisition work on a FIFO-like
> > > basis.
> >
> > which is obviously wrong...
> >
> > However my basic point stands but needs to be clarified a bit:
> >
> > I think I can print non-compliant if the mutex acquisition doesn't
> > respect the higher priority of the waiter over the current process
> > even if the mutex is "available".
> >
> > OK?
>
> I don't think using an optional feature (PI) counts...
>
> Lee

So acquiring a mutex with PI enabled must involve the scheduler...

... and you can skip that bit with PI disabled, as one can argue that
the user can't tell if the timeslice hit between the call to acquire
the mutex and the actual mutex wait itself?

Sounds a bit of a fudge to me....

I assume that mutexes must never support the wchan (proc)
interface or the like?

On the other hand, the basic point that relying on heavy contention
around mutexes is a bad idea is fine by me.

Samuel


* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  1:07         ` sched_yield() makes OpenLDAP slow David Schwartz
@ 2006-01-26  8:30           ` Helge Hafting
  2006-01-26  9:01             ` Nick Piggin
  2006-01-26 10:50             ` Nikita Danilov
  0 siblings, 2 replies; 127+ messages in thread
From: Helge Hafting @ 2006-01-26  8:30 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List, hancockr

David Schwartz wrote:

>>Robert Hancock wrote:
>>    
>>
>>But the thread that released the
>>mutex is not one of the waiting threads, and is not eligible for
>>consideration.
>>    
>>
>
>	Where are you getting this from? Nothing requires the scheduler to schedule
>any threads when the mutex is released.
>  
>
Correct.

>	All that must happen is that the mutex must be unlocked. The scheduler is
>permitted to allow any thread it wants to run at that point, or no thread.
>Nothing says the thread that released the mutex can't continue running and
>  
>
Correct. The releasing thread may keep running.

>nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
>before any other thread gets around to getting it.
>  
>
Wrong.
The spec says that the mutex must be given to a waiter (if any) at the
moment of release.  The waiter doesn't have to be scheduled at that
point, it may keep sleeping while holding its freshly acquired mutex.  So the
unlocking thread may continue - but if it tries to reacquire the mutex
it will find the mutex taken and go to sleep at that point. Then other
threads will be scheduled, and at some time the one now owning the mutex
will wake up and do its work.
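
To make the two readings concrete, here is a hypothetical test - my
sketch, not code from any implementation; the sleep() is only there to
give the waiter time to block, so it is racy but good enough for
illustration.  Under the handoff reading the trylock must fail; under
the other reading it may well succeed:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *waiter(void *arg)
{
        pthread_mutex_lock(&m);         /* blocks until main() unlocks */
        pthread_mutex_unlock(&m);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_mutex_lock(&m);
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);                       /* let the waiter block on m */
        pthread_mutex_unlock(&m);       /* the moment of release */
        if (pthread_mutex_trylock(&m) == 0) {
                printf("re-acquired immediately - no direct handoff\n");
                pthread_mutex_unlock(&m);
        } else {
                printf("mutex already given to the waiter\n");
        }
        pthread_join(t, NULL);
        return 0;
}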

>	In general, it is very bad karma for the scheduler to stop a thread before
>its timeslice is up if it doesn't have to. Consider one CPU and two threads,
>each needing to do 100 quick lock/unlock cycles. Why force 200 context
>switches?
>
Good point, except it is a strange program that does this.  Locking the
mutex once, doing 100 operations, then unlocking is the better way. :-)
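
Concretely (a hypothetical fragment; counter stands in for the real
shared state):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* One lock/unlock pair per operation: under the handoff reading,
 * up to one context switch each time around the loop. */
void naive(void)
{
        int i;

        for (i = 0; i < 100; i++) {
                pthread_mutex_lock(&m);
                counter++;
                pthread_mutex_unlock(&m);
        }
}

/* Lock once, do the 100 operations, unlock: at most one switch. */
void batched(void)
{
        int i;

        pthread_mutex_lock(&m);
        for (i = 0; i < 100; i++)
                counter++;
        pthread_mutex_unlock(&m);
}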

Helge Hafting

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 19:32               ` Howard Chu
@ 2006-01-26  8:51                 ` Nick Piggin
  2006-01-26 14:15                   ` Kyle Moffett
  2006-01-26 10:38                 ` Nikita Danilov
  1 sibling, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26  8:51 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> Nick Piggin wrote:
> 
>> Howard Chu wrote:
>>
>>> The SUSv3 text seems pretty clear. It says "WHEN 
>>> pthread_mutex_unlock() is called, ... the scheduling policy SHALL 
>>> decide ..." It doesn't say MAY, and it doesn't say "some undefined 
>>> time after the call." There is nothing optional or 
>>> implementation-defined here. The only thing that is not explicitly 
>>> stated is what happens when there are no waiting threads; in that 
>>> case obviously the running thread can continue running.
>>>
>>
>> But it doesn't say the unlocking thread must yield to the new mutex
>> owner, only that the scheduling policy shall determine which
>> thread acquires the lock.
> 
> 
> True, the unlocking thread doesn't have to yield to the new mutex owner 
> as a direct consequence of the unlock. But logically, if the unlocking 
> thread subsequently calls mutex_lock, it must block, because some other 
> thread has already been assigned ownership of the mutex.
> 
>> It doesn't say that decision must be made immediately, either (eg.
>> it could be made as a by product of which contender is chosen to run
>> next).
> 
> 
> A straightforward reading of the language here says the decision happens 
> "when pthread_mutex_unlock() is called" and not at any later time. There 
> is nothing here to support your interpretation.
> 

OK, so what happens if my scheduling policy decides _right then_ that
the next _running_ thread that was blocked on the mutex, or tries to
acquire it, is the next owner?

This is the logical way for a *scheduling* policy to determine which
thread gets the mutex. I don't know any other way that the scheduling
policy could determine the next thread to get the mutex.

>>
>> I think the intention of the wording is that for deterministic policies,
>> it is clear that the waiting threads are actually woken and reevaluated
>> for scheduling. In the case of SCHED_OTHER, it means basically nothing,
>> considering the scheduling policy is arbitrary.
>>
> Clearly the point is that one of the waiting threads is woken and gets 
> the mutex, and it doesn't matter which thread is chosen. I.e., whatever 
> thread the scheduling policy chooses. The fact that SCHED_OTHER can 
> choose arbitrarily is immaterial, it still can only choose one of the 
> waiting threads.
> 

I don't know that it exactly says one of the waiting threads must get the
mutex.

> The fact that SCHED_OTHER's scheduling behavior is undefined is not free 
> license to implement whatever you want. Scheduling policies are an 
> optional feature; the basic thread behavior must still be consistent 
> even on systems that don't implement scheduling policies.
> 

It just so happens that normal tasks in Linux run in SCHED_OTHER. It
is irrelevant whether it might be an optional feature or not.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 22:14               ` Howard Chu
  2006-01-26  0:16                 ` Robert Hancock
  2006-01-26  2:05                 ` David Schwartz
@ 2006-01-26  8:54                 ` Nick Piggin
  2006-01-26 14:24                   ` Howard Chu
  2006-01-26 10:44                 ` Nikita Danilov
  3 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26  8:54 UTC (permalink / raw)
  To: Howard Chu
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> Lee Revell wrote:
> 
>> On Wed, 2006-01-25 at 10:26 -0800, Howard Chu wrote:
>>  
>>
>>> The SUSv3 text seems pretty clear. It says "WHEN
>>> pthread_mutex_unlock() is called, ... the scheduling policy SHALL 
>>> decide ..." It doesn't say MAY, and it doesn't say "some undefined 
>>> time after the call."      
>>
>>
>> This does NOT require pthread_mutex_unlock() to cause the scheduler to
immediately pick a new runnable process.  It only says it's up to the
>> scheduling POLICY what to do.  The policy could be "let the unlocking
>> thread finish its timeslice then reschedule".
>>   
> 
> 
> This is obviously some very old ground.
> 
> http://groups.google.com/groups?threadm=etai7.108188%24B37.2381726%40news1.rdc1.bc.home.com 
> 
> 
> Kaz's post clearly interprets the POSIX spec differently from you. The 
> policy can decide *which of the waiting threads* gets the mutex, but the 
> releasing thread is totally out of the picture. For good or bad, the 
> current pthread_mutex_unlock() is not POSIX-compliant. Now then, if 
> we're forced to live with that, for efficiency's sake, that's OK, 
> assuming that valid workarounds exist, such as inserting a sched_yield() 
> after the unlock.
> 
> http://groups.google.com/group/comp.programming.threads/msg/16c01eac398a1139?hl=en& 
> 
> 
> But then we have to deal with you folks' bizarre notion that 
> sched_yield() can legitimately be a no-op, which also defies the POSIX 
> spec. Again, in SUSv3 "The /sched_yield/() function shall force the 
> running thread to relinquish the processor until it again becomes the 
> head of its thread list. It takes no arguments." There is no language 

How many times have we been over this? What do you think the "head of
its thread list" might mean?

> here saying "sched_yield *may* do nothing at all." There are of course 

There is language saying SCHED_OTHER is arbitrary, including how the
thread list is implemented and how a task might come to be at the head
of it.

They obviously don't need to redefine exactly what sched_yield may do
under each scheduling policy, do they?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  8:30           ` Helge Hafting
@ 2006-01-26  9:01             ` Nick Piggin
  2006-01-26 10:50             ` Nikita Danilov
  1 sibling, 0 replies; 127+ messages in thread
From: Nick Piggin @ 2006-01-26  9:01 UTC (permalink / raw)
  To: Helge Hafting; +Cc: davids, Linux Kernel Mailing List, hancockr

Helge Hafting wrote:
> David Schwartz wrote:

>> nothing says that it can't call pthread_mutex_lock and re-acquire the 
>> mutex
>> before any other thread gets around to getting it.
>>  
>>
> Wrong.
> The spec says that the mutex must be given to a waiter (if any) at the
> moment of release.

Repeating myself here...

To me it says that the scheduling policy decides at the moment of release.
What if the scheduling policy decides *right then* to give the mutex to
the next running thread that tries to acquire it?

That would be the logical way for a scheduling policy to decide the next
owner of the mutex.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-26  0:28         ` Jesse Brandeburg
@ 2006-01-26  9:32           ` Pavel Machek
  2006-01-26 19:02           ` Stefan Seyfried
       [not found]           ` <BAY108-DAV111F6EF46F6682FEECCC1593140@phx.gbl>
  2 siblings, 0 replies; 127+ messages in thread
From: Pavel Machek @ 2006-01-26  9:32 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Olaf Kirch, Stefan Seyfried, Linux Kernel Mailing List, netdev

On Wed 25-01-06 16:28:48, Jesse Brandeburg wrote:
> On 1/25/06, Jesse Brandeburg <jesse.brandeburg@gmail.com> wrote:
> > On 1/25/06, Olaf Kirch <okir@suse.de> wrote:
> > > On Wed, Jan 25, 2006 at 10:02:40AM +0100, Olaf Kirch wrote:
> > > > I'm not sure what the right fix would be. e100_resume would probably
> > > > have to call e100_alloc_cbs early on, while e100_up should avoid
> > > > calling it a second time if nic->cbs_avail != 0. A tentative patch
> > > > for testing is attached.
> > >
> > > Reportedly, the patch fixes the crash on resume.
> >
> > Cool, thanks for the research, I have a concern about this however.
> >
> > it's an interesting patch, but it raises the question why does
> > e100_init_hw need to be called at all in resume?  I looked back
> > through our history and that init_hw call has always been there.  I
> > think it's incorrect, but it's taking me a while to set up a system with
> > the ability to resume.
> >
> > everywhere else in the driver alloc_cbs is called before init_hw so it
> > just seems like a long-standing bug.
> >
> > comments?  anyone want to test? i compile tested this, but it is untested.
> 
> Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> to show that my patch that just removes e100_init_hw works okay for
> me.  Let me know how it goes for you, I think this is a good fix.

S1 preserves hardware state, .suspend/.resume routines can be NULL for
S1. Try with swsusp or S3.
								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 19:32               ` Howard Chu
  2006-01-26  8:51                 ` Nick Piggin
@ 2006-01-26 10:38                 ` Nikita Danilov
  2006-01-30  8:35                   ` Helge Hafting
  1 sibling, 1 reply; 127+ messages in thread
From: Nikita Danilov @ 2006-01-26 10:38 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu writes:

[...]

 > 
 > A straightforward reading of the language here says the decision happens 
 > "when pthread_mutex_unlock() is called" and not at any later time. There 
 > is nothing here to support your interpretation.
 > >
 > > I think the intention of the wording is that for deterministic policies,
 > > it is clear that the waiting threads are actually woken and reevaluated
 > > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
 > > considering the scheduling policy is arbitrary.
 > >
 > Clearly the point is that one of the waiting threads is woken and gets 
 > the mutex, and it doesn't matter which thread is chosen. I.e., whatever 

Note that this behavior directly leads to "convoy formation": if that
woken thread T0 does not immediately run (e.g., because there are higher
priority threads) but still already owns the mutex, then other running
threads contending for this mutex will block waiting for T0, forming a
convoy.

 > thread the scheduling policy chooses. The fact that SCHED_OTHER can 
 > choose arbitrarily is immaterial, it still can only choose one of the 
 > waiting threads.

Looks like a good time to submit a Defect Report to the Open Group.

 > 
 > The fact that SCHED_OTHER's scheduling behavior is undefined is not free 
 > license to implement whatever you want. Scheduling policies are an 
 > optional feature; the basic thread behavior must still be consistent 
 > even on systems that don't implement scheduling policies.
 > 
 > -- 
 >   -- Howard Chu

Nikita.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-25 22:14               ` Howard Chu
                                   ` (2 preceding siblings ...)
  2006-01-26  8:54                 ` Nick Piggin
@ 2006-01-26 10:44                 ` Nikita Danilov
  3 siblings, 0 replies; 127+ messages in thread
From: Nikita Danilov @ 2006-01-26 10:44 UTC (permalink / raw)
  To: Howard Chu; +Cc: Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu writes:

[...]

 > 
 > But then we have to deal with you folks' bizarre notion that 
 > sched_yield() can legitimately be a no-op, which also defies the POSIX 
 > spec. Again, in SUSv3 "The /sched_yield/() function shall force the 
 > running thread to relinquish the processor until it again becomes the 
 > head of its thread list. It takes no arguments." There is no language 
 > here saying "sched_yield *may* do nothing at all." There are of course 

As has been pointed out to you already, while there is no such language,
the effect may be the same if --for example-- the scheduling policy decides
to put the current thread back at "the head of its thread list" immediately
after sched_yield(). Which is a valid behavior for SCHED_OTHER.

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2006-01-26  8:30           ` Helge Hafting
  2006-01-26  9:01             ` Nick Piggin
@ 2006-01-26 10:50             ` Nikita Danilov
  1 sibling, 0 replies; 127+ messages in thread
From: Nikita Danilov @ 2006-01-26 10:50 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Linux Kernel Mailing List, hancockr

Helge Hafting writes:

[...]

 > 
 > >nothing says that it can't call pthread_mutex_lock and re-acquire the mutex
 > >before any other thread gets around to getting it.
 > >  
 > >
 > Wrong.
 > The spec says that the mutex must be given to a waiter (if any) at the
 > moment of release.  The waiter doesn't have to be scheduled at that
 > point, it may keep sleeping while holding its freshly acquired mutex.  So the
 > unlocking thread may continue - but if it tries to reacquire the mutex
 > it will find the mutex taken and go to sleep at that point. Then other

You just described convoy formation: a phenomenon that all reasonable
mutex implementations try to avoid at all costs. If that's what the
standard prescribes---the standard has to be amended.

 > 
 > Helge Hafting

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  8:51                 ` Nick Piggin
@ 2006-01-26 14:15                   ` Kyle Moffett
  2006-01-26 14:43                     ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Kyle Moffett @ 2006-01-26 14:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hancockr

Haven't you OpenLDAP guys realized that the pthread model you're
actually looking for is this?  POSIX mutexes are deliberately not
designed to mandate scheduling requirements, *precisely* because a
model like this achieves your scheduling goals by explicitly stating
what they are.

s: pthread_mutex_lock(&mutex);
s: pthread_cond_wait(&wake_slave, &mutex);

m: [do some work]
m: pthread_cond_signal(&wake_slave);
m: pthread_cond_wait(&wake_master, &mutex);

s: [return from pthread_cond_wait]
s: [do some work]
s: pthread_cond_signal(&wake_master);
s: pthread_cond_wait(&wake_slave, &mutex);

Of course, if that's the model you're looking for, you could always  
do this instead:

void master_func() {
	while (1) {
		[do some work]
		slave_func();
	}
}

void slave_func() {
	[do some work]
}

The semantics are effectively the same.
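
For completeness, a compilable version of the same ping-pong (a sketch;
the turn variable is the predicate that the s:/m: shorthand leaves
implicit, and it is what guards against missed wakeups):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake_master = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  wake_slave  = PTHREAD_COND_INITIALIZER;
static enum { MASTER, SLAVE } turn = MASTER;

static void *slave_fn(void *arg)
{
        int i;

        pthread_mutex_lock(&mutex);
        for (i = 0; i < 3; i++) {
                while (turn != SLAVE)   /* wait on the predicate */
                        pthread_cond_wait(&wake_slave, &mutex);
                printf("slave:  work %d\n", i);
                turn = MASTER;
                pthread_cond_signal(&wake_master);
        }
        pthread_mutex_unlock(&mutex);
        return NULL;
}

int main(void)
{
        pthread_t t;
        int i;

        pthread_create(&t, NULL, slave_fn, NULL);
        pthread_mutex_lock(&mutex);
        for (i = 0; i < 3; i++) {
                printf("master: work %d\n", i);
                turn = SLAVE;
                pthread_cond_signal(&wake_slave);
                while (turn != MASTER)
                        pthread_cond_wait(&wake_master, &mutex);
        }
        pthread_mutex_unlock(&mutex);
        pthread_join(t, NULL);
        return 0;
}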

Cheers,
Kyle Moffett

--
Premature optimization is the root of all evil in programming
   -- C.A.R. Hoare




^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26  8:54                 ` Nick Piggin
@ 2006-01-26 14:24                   ` Howard Chu
  2006-01-26 14:54                     ` Nick Piggin
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-26 14:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Nick Piggin wrote:
> Howard Chu wrote:
>> But then we have to deal with you folks' bizarre notion that 
>> sched_yield() can legitimately be a no-op, which also defies the 
>> POSIX spec. Again, in SUSv3 "The /sched_yield/() function shall force 
>> the running thread to relinquish the processor until it again becomes 
>> the head of its thread list. It takes no arguments." There is no 
>> language 
>
> How many times have we been over this? What do you think the "head of
> its thread list" might mean?
>
>> here saying "sched_yield *may* do nothing at all." There are of course 
>
> There is language saying SCHED_OTHER is arbitrary, including how the
> thread list is implemented and how a task might come to be at the head
> of it.
>
> They obviously don't need to redefine exactly what sched_yield may do
> under each scheduling policy, do they?
>
As Dave Butenhof says so often, threading is a cooperative programming 
model, not a competitive one. The sched_yield function exists for a 
specific purpose, to let one thread decide to allow some other thread to 
run. No matter what the scheduling policy, or even if there is no 
scheduling policy at all, the expectation is that the current thread 
will not continue to run unless there are no other runnable threads in 
the same process. The other important point here is that the yielding 
thread is only cooperating with other threads in its process. The 2.6 
kernel behavior effectively causes the entire process to give up its 
time slice, since the yielding thread has to wait for other processes in 
the system before it can run again. Again, if folks wanted process 
scheduling behavior they would have used fork().

By the way, I've already raised an objection with the Open Group asking 
for more clarification here.
http://www.opengroup.org/austin/aardvark/latest/xshbug2.txt   request 
number 120.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 14:15                   ` Kyle Moffett
@ 2006-01-26 14:43                     ` Howard Chu
  2006-01-26 19:57                       ` David Schwartz
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-26 14:43 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Nick Piggin, Christopher Friesen, Linux Kernel Mailing List, hancockr

Kyle Moffett wrote:
> Haven't you OpenLDAP guys realized that the pthread model you're 
> actually looking for is this?  POSIX mutexes are not designed to 
> mandate scheduling requirements *precisely* because this achieves your 
> scheduling goals by explicitly stating what they are.

This isn't about OpenLDAP. Yes, we had a lot of yield() calls scattered 
through the code, leftovers from when we only supported non-preemptive 
threading. Those calls have been removed. There are a few remaining, 
but only in code paths for unusual errors, so what they do has no 
real performance impact.

The point of this discussion is that the POSIX spec says one thing and 
you guys say another; one way or another that should be resolved. The 
2.6 kernel behavior is a noticeable departure from previous releases. The 
2.4/LinuxThreads guys believed their implementation was correct. If you 
believe the 2.6 implementation is correct, then you should get the spec 
amended or state up front that the "P" in "NPTL" doesn't really mean 
anything.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 14:24                   ` Howard Chu
@ 2006-01-26 14:54                     ` Nick Piggin
  2006-01-26 15:23                       ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 14:54 UTC (permalink / raw)
  To: Howard Chu
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> Nick Piggin wrote:

>> They obviously don't need to redefine exactly what sched_yield may do
>> under each scheduling policy, do they?
>>
> As Dave Butenhof says so often, threading is a cooperative programming 
> model, not a competitive one. The sched_yield function exists for a 
> specific purpose, to let one thread decide to allow some other thread to 
> run. No matter what the scheduling policy, or even if there is no 

Yes, and even SCHED_OTHER in Linux attempts to do this as part of
the principle of least surprise. That it doesn't _exactly_ match
what you want it to do just means you need to be using something
else.

> scheduling policy at all, the expectation is that the current thread 
> will not continue to run unless there are no other runnable threads in 
> the same process. The other important point here is that the yielding 
> thread is only cooperating with other threads in its process. The 2.6 

No, I don't think so. POSIX.1b, where sched_yield is defined, is part
of the realtime extensions, is it not?

sched_yield explicitly makes reference to the realtime priority system
of thread lists does it not? It is pretty clear that it is used for
realtime processes to deterministically give up their timeslices to
others of the same priority level.

Linux's SCHED_OTHER behaviour is arguably the best interpretation,
considering SCHED_OTHER is defined to have a single priority level.

> kernel behavior effectively causes the entire process to give up its 
> time slice, since the yielding thread has to wait for other processes in 
> the system before it can run again. Again, if folks wanted process 

It yields to all other SCHED_OTHER processes (which are all on the
same thread priority list) and not to any other processes of higher
realtime priority.

> scheduling behavior they would have used fork().
> 

It so happens that processes and threads use the same scheduling
policy in Linux. Is that forbidden somewhere?

> By the way, I've already raised an objection with the Open Group asking 
> for more clarification here.
> http://www.opengroup.org/austin/aardvark/latest/xshbug2.txt   request 
> number 120.
> 


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 14:54                     ` Nick Piggin
@ 2006-01-26 15:23                       ` Howard Chu
  2006-01-26 15:51                         ` Nick Piggin
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-26 15:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Nick Piggin wrote:
> Howard Chu wrote:
>> scheduling policy at all, the expectation is that the current thread 
>> will not continue to run unless there are no other runnable threads 
>> in the same process. The other important point here is that the 
>> yielding thread is only cooperating with other threads in its 
>> process. The 2.6 
>
> No, I don't think so. POSIX.1b, where sched_yield is defined, is part
> of the realtime extensions, is it not?
>
> sched_yield explicitly makes reference to the realtime priority system
> of thread lists does it not? It is pretty clear that it is used for
> realtime processes to deterministically give up their timeslices to
> others of the same priority level.

The fact that sched_yield came originally from the realtime extensions 
is just a historical artifact. There was a pthread_yield() function 
specifically for threads and it was merged with sched_yield(). Today 
sched_yield() is a core part of the basic Threads specification, 
independent of the realtime extensions. The fact that it is defined 
solely in the language of the realtime priorities is an obvious flaw in 
the spec, since the function itself exists independently of realtime 
priorities. The objection I raised with the Open Group specifically 
addresses this flaw.

> Linux's SCHED_OTHER behaviour is arguably the best interpretation,
> considering SCHED_OTHER is defined to have a single priority level.

It appears that you just read the spec and blindly followed it without 
thinking about what it really said and failed to say. The best 
interpretation would come from saying "hey, this spec is only defined 
for realtime behavior, WTF is it supposed to do for the default 
non-realtime case?" and getting a clear definition in the spec.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 15:23                       ` Howard Chu
@ 2006-01-26 15:51                         ` Nick Piggin
  2006-01-26 16:44                           ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 15:51 UTC (permalink / raw)
  To: Howard Chu
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Howard Chu wrote:
> Nick Piggin wrote:
> 
>> Howard Chu wrote:
>>
>>> scheduling policy at all, the expectation is that the current thread 
>>> will not continue to run unless there are no other runnable threads 
>>> in the same process. The other important point here is that the 
>>> yielding thread is only cooperating with other threads in its 
>>> process. The 2.6 
>>
>>
>> No, I don't think so. POSIX.1b, where sched_yield is defined, is part
>> of the realtime extensions, is it not?
>>
>> sched_yield explicitly makes reference to the realtime priority system
>> of thread lists does it not? It is pretty clear that it is used for
>> realtime processes to deterministically give up their timeslices to
>> others of the same priority level.
> 
> 
> The fact that sched_yield came originally from the realtime extensions 
> is just a historical artifact. There was a pthread_yield() function 
> specifically for threads and it was merged with sched_yield(). Today 
> sched_yield() is a core part of the basic Threads specification, 
> independent of the realtime extensions. The fact that it is defined 
> solely in the language of the realtime priorities is an obvious flaw in 
> the spec, since the function itself exists independently of realtime 
> priorities. The objection I raised with the Open Group specifically 
> addresses this flaw.
> 

Either way, it by no means says anything about yielding to other
threads in the process but to nobody else. Where did you get that
from?

>> Linux's SCHED_OTHER behaviour is arguably the best interpretation,
>> considering SCHED_OTHER is defined to have a single priority level.
> 
> 
> It appears that you just read the spec and blindly followed it without 
> thinking about what it really said and failed to say. The best 

No, a spec is something that is written unambiguously, and generally
the wording leads me to believe they attempted to make it so (it
definitely isn't perfect - your mutex unlock example is one that could
be interpreted either way). If they failed to say something that should
be there then the spec needs to be corrected -- however in this case
I don't think you've shown what's missing.

And actually, your reading into the spec things that "they failed to say"
is, I believe, wrong (in the above sched_yield example).

> interpretation would come from saying "hey, this spec is only defined 
> for realtime behavior, WTF is it supposed to do for the default 
> non-realtime case?" and getting a clear definition in the spec.
> 

However they do not omit to say that. They quite explicitly say that
SCHED_OTHER is considered a single priority class in relation to its
interactions with other realtime classes, and is otherwise free to
be implemented in any way.

I can't see how you still have a problem with that...


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 15:51                         ` Nick Piggin
@ 2006-01-26 16:44                           ` Howard Chu
  2006-01-26 17:34                             ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-26 16:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Lee Revell, Christopher Friesen, Linux Kernel Mailing List, hancockr

Nick Piggin wrote:
> No, a spec is something that is written unambiguously, and generally
> the wording leads me to believe they attempted to make it so (it
> definitely isn't perfect - your mutex unlock example is one that could
> be interpreted either way). If they failed to say something that should
> be there then the spec needs to be corrected -- however in this case
> I don't think you've shown what's missing.

What is missing: sched_yield is a core threads function but it's defined 
using language that only has meaning in the presence of an optional 
feature (Process Scheduling.) Since the function must exist even in the 
absence of these options, the definition must be changed to use language 
that has meaning even in the absence of these options.

> And actually your reading things into the spec that "they failed to say"
> is wrong I believe (in the above sched_yield example).
>
>> interpretation would come from saying "hey, this spec is only defined 
>> for realtime behavior, WTF is it supposed to do for the default 
>> non-realtime case?" and getting a clear definition in the spec.
>
> However they do not omit to say that. They quite explicitly say that
> SCHED_OTHER is considered a single priority class in relation to its
> interactions with other realtime classes, and is otherwise free to
> be implemented in any way.
>
> I can't see how you still have a problem with that...
>
I may be missing the obvious, but I couldn't find this explicit 
statement in the SUS docs. Also, it would not address the core 
complaint, that sched_yield's definition has no meaning when the Process 
Scheduling option doesn't exist.

The current Open Group response to my objection reads:
 >>>

Add to APPLICATION USAGE
Since there may not be more than one thread runnable in a process
a call to sched_yield() might not relinquish the processor at all.
In a single-threaded application this will always be the case.

<<<
The interesting point one can draw from this response is that 
sched_yield is only intended to yield to other runnable threads within a 
single process. This response is also problematic, because restricting 
it to threads within a process makes it useless for Process Scheduling. 
E.g., the Process Scheduling language would imply that a single-threaded 
app could yield the processor to some other process. As such, I think 
this response is also flawed, and the definition still needs more work.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 16:44                           ` Howard Chu
@ 2006-01-26 17:34                             ` linux-os (Dick Johnson)
  2006-01-26 19:00                               ` Nick Piggin
  2006-01-30  8:44                               ` Helge Hafting
  0 siblings, 2 replies; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2006-01-26 17:34 UTC (permalink / raw)
  To: Howard Chu
  Cc: Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr


On Thu, 26 Jan 2006, Howard Chu wrote:

> Nick Piggin wrote:
>> No, a spec is something that is written unambiguously, and generally
>> the wording leads me to believe they attempted to make it so (it
>> definitely isn't perfect - your mutex unlock example is one that could
>> be interpreted either way). If they failed to say something that should
>> be there then the spec needs to be corrected -- however in this case
>> I don't think you've shown what's missing.
>
> What is missing: sched_yield is a core threads function but it's defined
> using language that only has meaning in the presence of an optional
> feature (Process Scheduling.) Since the function must exist even in the
> absence of these options, the definition must be changed to use language
> that has meaning even in the absence of these options.
>
>> And actually your reading things into the spec that "they failed to say"
>> is wrong I believe (in the above sched_yield example).
>>
>>> interpretation would come from saying "hey, this spec is only defined
>>> for realtime behavior, WTF is it supposed to do for the default
>>> non-realtime case?" and getting a clear definition in the spec.
>>
>> However they do not omit to say that. They quite explicitly say that
>> SCHED_OTHER is considered a single priority class in relation to its
>> interactions with other realtime classes, and is otherwise free to
>> be implemented in any way.
>>
>> I can't see how you still have a problem with that...
>>
> I may be missing the obvious, but I couldn't find this explicit
> statement in the SUS docs. Also, it would not address the core
> complaint, that sched_yield's definition has no meaning when the Process
> Scheduling option doesn't exist.
>
> The current Open Group response to my objection reads:
> >>>
>
> Add to APPLICATION USAGE
> Since there may not be more than one thread runnable in a process
> a call to sched_yield() might not relinquish the processor at all.
> In a single-threaded application this will always be the case.
>
> <<<
> The interesting point one can draw from this response is that
> sched_yield is only intended to yield to other runnable threads within a
> single process. This response is also problematic, because restricting
> it to threads within a process makes it useless for Process Scheduling.
> E.g., the Process Scheduling language would imply that a single-threaded
> app could yield the processor to some other process. As such, I think
> this response is also flawed, and the definition still needs more work.
>
> --
>  -- Howard Chu
>  Chief Architect, Symas Corp.  http://www.symas.com
>  Director, Highland Sun        http://highlandsun.com/hyc
>  OpenLDAP Core Team            http://www.openldap.org/project/
>

To fix the current problem, you can substitute usleep(0); it will
give the CPU to somebody if anyone is runnable, then give it back to
you. It seems to work in every case that sched_yield() has
mucked up (perhaps 20 to 30 here).
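
In other words (a hypothetical fragment; data_ready stands in for
whatever the spinning thread polls, and whether usleep(0) really
sleeps is implementation-defined):

#include <sched.h>              /* sched_yield() */
#include <unistd.h>             /* usleep() */

static volatile int data_ready; /* set by another thread (hypothetical) */

void wait_yield(void)
{
        while (!data_ready)
                sched_yield();  /* on 2.6 may spin without letting
                                 * the starved thread run */
}

void wait_usleep(void)
{
        while (!data_ready)
                usleep(0);      /* the quick fix described above */
}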

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 17:34                             ` linux-os (Dick Johnson)
@ 2006-01-26 19:00                               ` Nick Piggin
  2006-01-26 19:14                                 ` linux-os (Dick Johnson)
  2006-01-30  8:44                               ` Helge Hafting
  1 sibling, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 19:00 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Howard Chu, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

linux-os (Dick Johnson) wrote:
> 
> To fix the current problem, you can substitute usleep(0); it will
> give the CPU to somebody if anyone is runnable, then give it back to
> you. It seems to work in every case that sched_yield() has
> mucked up (perhaps 20 to 30 here).
> 

That sounds like a terrible hack.

What cases has sched_yield mucked up for you, and why do you
think the problem is sched_yield mucking up? Can you solve it
using mutexes?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-26  0:28         ` Jesse Brandeburg
  2006-01-26  9:32           ` Pavel Machek
@ 2006-01-26 19:02           ` Stefan Seyfried
  2006-01-26 19:09             ` Olaf Kirch
  2006-01-28 11:53             ` Mattia Dongili
       [not found]           ` <BAY108-DAV111F6EF46F6682FEECCC1593140@phx.gbl>
  2 siblings, 2 replies; 127+ messages in thread
From: Stefan Seyfried @ 2006-01-26 19:02 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Olaf Kirch, Linux Kernel Mailing List, netdev

On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
 
> Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> to show that my patch that just removes e100_init_hw works okay for
> me.  Let me know how it goes for you, I think this is a good fix.

Worked for me on the Compaq Armada e500 and reportedly also fixed the
Sony that originally uncovered it.

Will be in the next SUSE betas, so if anything breaks, we'll notice
it.

Thanks.
-- 
Stefan Seyfried                  \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices      \ wanted to be paid for what I write."
SUSE LINUX Products GmbH, Nürnberg \                    -- Leonard Cohen

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-26 19:02           ` Stefan Seyfried
@ 2006-01-26 19:09             ` Olaf Kirch
  2006-01-28 11:53             ` Mattia Dongili
  1 sibling, 0 replies; 127+ messages in thread
From: Olaf Kirch @ 2006-01-26 19:09 UTC (permalink / raw)
  To: Stefan Seyfried; +Cc: Jesse Brandeburg, Linux Kernel Mailing List, netdev

On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
> Will be in the next SUSE betas, so if anything breaks, we'll notice
> it.

I doubt it. As Jesse mentioned, e100_init_hw is called from e100_up,
so the call from e100_resume was really superfluous.

Olaf
-- 
Olaf Kirch   |  --- o --- Nous sommes du soleil we love when we play
okir@suse.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 19:00                               ` Nick Piggin
@ 2006-01-26 19:14                                 ` linux-os (Dick Johnson)
  2006-01-26 21:12                                   ` Nick Piggin
  0 siblings, 1 reply; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2006-01-26 19:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Howard Chu, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr


On Thu, 26 Jan 2006, Nick Piggin wrote:

> linux-os (Dick Johnson) wrote:
>>
>> To fix the current problem, you can substitute usleep(0); it will
>> give the CPU to somebody if anyone is runnable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>
>
> That sounds like a terrible hack.
>
> What cases has sched_yield mucked up for you, and why do you
> think the problem is sched_yield mucking up? Can you solve it
> using mutexes?
>
> Thanks,
> Nick

Somebody wrote code that used LinuxThreads. We didn't know
why it was so slow so I was asked to investigate. It was
a user-interface where high-speed image data gets put into
a buffer (using DMA) and one thread manipulates it. Another
thread copies and crunches the data, then displays it. The
writer insisted that he was doing the correct thing, however
the response sucked big time. I ran top and found that the
threaded processes were always grabbing big chunks of
CPU time. Searching for every instance of sched_yield(), I
was going to replace it with a diagnostic. However, the code
ran beautifully when the fprintf(stderr, "Message\n") call was
in the code! The call to write() sleeps. That gave the
CPU to somebody who was starving. The "quick fix" was
to replace sched_yield() with usleep(0).

The permanent fix was to not use threads at all.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 14:43                     ` Howard Chu
@ 2006-01-26 19:57                       ` David Schwartz
  2006-01-26 20:27                         ` Howard Chu
  2006-01-30  8:28                         ` Helge Hafting
  0 siblings, 2 replies; 127+ messages in thread
From: David Schwartz @ 2006-01-26 19:57 UTC (permalink / raw)
  To: hyc; +Cc: Linux Kernel Mailing List


> The point of this discussion is that the POSIX spec says one thing and
> you guys say another; one way or another that should be resolved. The
> 2.6 kernel behavior is a noticeable departure from previous releases. The
> 2.4/LinuxThreads guys believed their implementation was correct. If you
> believe the 2.6 implementation is correct, then you should get the spec
> amended or state up front that the "P" in "NPTL" doesn't really mean
> anything.

	There is disagreement over what the POSIX specification says. You have
already seen three arguments against your interpretation, any one of which
is, IMO, sufficient to demolish it.

	First, there's the as-if issue. You cannot write a program that prints
"non-compliant" under the behavior you claim is non-compliant but that the
standard guarantees will never do so, because there is no way to know that
another thread is blocked on the mutex (except for PI mutexes).

	Second, there's the plain language of the standard. It says "If X is so at
time T, then Y". This does not require Y to happen at time T. It is X
happening at time T that requires Y, but the time for Y is not specified.

	If a law says, for example, "if there are two or more bids with the same
price lower than all other bids at the close of bidding, the first such bid
to be received shall be accepted". The phrase "at the close of bidding"
refers to the time the rule is determined to apply to the situation, not
the time at which the decision as to which bid to accept is made.

	Third, there's the ambiguity of the standard. It says the "scheduling
policy" shall decide, not that the scheduler shall decide. If the policy is
to make a conditional or delayed decision, that is still perfectly valid
policy. "Whichever thread requests it first" is a valid scheduler policy.

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 19:57                       ` David Schwartz
@ 2006-01-26 20:27                         ` Howard Chu
  2006-01-26 20:46                           ` Nick Piggin
  2006-01-27  2:16                           ` David Schwartz
  2006-01-30  8:28                         ` Helge Hafting
  1 sibling, 2 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-26 20:27 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List

David Schwartz wrote:
>> The point of this discussion is that the POSIX spec says one thing and
>> you guys say another; one way or another that should be resolved. The
>> 2.6 kernel behavior is a noticeable departure from previous releases. The
>> 2.4/LinuxThreads guys believed their implementation was correct. If you
>> believe the 2.6 implementation is correct, then you should get the spec
>> amended or state up front that the "P" in "NPTL" doesn't really mean
>> anything.
>>     
>
> 	There is disagreement over what the POSIX specification says. You have
> already seen three arguments against your interpretation, any one of which
> is, IMO, sufficient to demolish it.
>   

> 	First, there's the as-if issue. You cannot write a program that can print
> "non-compliant" with the behavior you claim is non-compliant that is
> guaranteed not to do so by the standard because there is no way to know that
> another thread is blocked on the mutex (except for PI mutexes).
>   

The exception here demolishes this argument, IMO. Moreover, if the 
unlocker was a lower priority thread and there are higher priority 
threads blocked on the mutex, you really want the higher priority thread 
to run.

> 	Second, there's the plain language of the standard. It says "If X is so at
> time T, then Y". This does not require Y to happen at time T. It is X
> happening at time T that requires Y, but the time for Y is not specified.
>
> 	If a law says, for example, "if there are two or more bids with the same
> price lower than all other bids at the close of bidding, the first such bid
> to be received shall be accepted". The phrase "at the close of bidding"
> refers to the time the rule is determined to apply to the situation, not
> the time at which the decision as to which bid to accept is made.
>   

The time at which the decision takes effect is immaterial; the point is 
that the decision can only be made from the set of options available at 
time T.

Per your analogy, if a new bid comes in at time T+1, it can't have any 
effect on which of the bids shall be accepted.

> 	Third, there's the ambiguity of the standard. It says the "scheduling
> policy" shall decide, not that the scheduler shall decide. If the policy is
> to make a conditional or delayed decision, that is still perfectly valid
> policy. "Whichever thread requests it first" is a valid scheduler policy.

I am not debating what the policy can decide. Merely the set of choices 
from which it may decide.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 20:27                         ` Howard Chu
@ 2006-01-26 20:46                           ` Nick Piggin
  2006-01-26 21:32                             ` Howard Chu
  2006-01-27  2:16                           ` David Schwartz
  1 sibling, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 20:46 UTC (permalink / raw)
  To: Howard Chu; +Cc: davids, Linux Kernel Mailing List

Howard Chu wrote:
> David Schwartz wrote:
> 

> The time at which the decision takes effect is immaterial; the point is 
> that the decision can only be made from the set of options available at 
> time T.
> 
> Per your analogy, if a new bid comes in at time T+1, it can't have any 
> effect on which of the bids shall be accepted.
> 
>>     Third, there's the ambiguity of the standard. It says the "scheduling
>> policy" shall decide, not that the scheduler shall decide. If the 
>> policy is
>> to make a conditional or delayed decision, that is still perfectly valid
>> policy. "Whichever thread requests it first" is a valid scheduler policy.
> 
> 
> I am not debating what the policy can decide. Merely the set of choices 
> from which it may decide.
> 

OK, you believe that the mutex *must* be granted to a blocking thread
at the time of the unlock. I don't think this is unreasonable from the
wording (because it does not seem to be completely unambiguous english),
however think about this -

A realtime system with tasks A and B, A has an RT scheduling priority of
1, and B is 2. A and B are both runnable, so A is running. A takes a mutex
then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
some point it drops the mutex and then tries to take it again.

What happens?

I haven't programmed realtime systems of any complexity, but I'd think it
would be undesirable if A were to block and allow B to run at this point.

Now this has nothing to do with PI or SCHED_OTHER, so behaviour is exactly
determined by our respective interpretations of what it means for "the
scheduling policy to decide which task gets the mutex".

What have I proven? Nothing ;) but perhaps my question could be answered
by someone who knows a lot more about RT systems than I.

Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 19:14                                 ` linux-os (Dick Johnson)
@ 2006-01-26 21:12                                   ` Nick Piggin
  2006-01-26 21:31                                     ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 21:12 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Howard Chu, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

linux-os (Dick Johnson) wrote:
> On Thu, 26 Jan 2006, Nick Piggin wrote:

>>What cases has sched_yield mucked up for you, and why do you
>>think the problem is sched_yield mucking up? Can you solve it
>>using mutexes?
>>
>>Thanks,
>>Nick
> 
> 
> Somebody wrote code that used Linux Threads. We didn't know
> why it was so slow so I was asked to investigate. It was
> a user-interface where high-speed image data gets put into
> a buffer (using DMA) and one thread manipulates it. Another
> thread copies and crunches the data, then displays it. The
> writer insisted that he was doing the correct thing, however
> the response sucked big time. I ran top and found that the
> threaded processes were always grabbing big chunks of
> CPU time. Searching for every instance of sched_yield(), I
> was going to replace it with a diagnostic. However, the code
> ran beautifully when the 'fprintf(stderr, "Message\n"' was
> in the code! The call to write() sleeps. That gave the
> CPU to somebody who was starving. The 'quick-fix" was
> to replace sched_yield() with usleep(0).
> 
> The permanent fix was to not use threads at all.
> 

This sounds like a trivial producer-consumer problem that you
would find in any basic book on synchronisation, threading, or
operating systems.

If it was not a realtime system, then I can't believe it has any
usages of sched_yield in there at all. If it is a realtime system,
then replacing them with something else could easily have broken
it.

Also, I'm not sure that you can rely on write() or usleep(0)
to sleep.
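
For reference, the textbook shape of that solution (a sketch with a
hypothetical one-slot buffer; real code would also have to deal with
overruns):

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static void *slot;              /* one-slot buffer, hypothetical */

/* Called by the DMA/producer thread when a frame is ready. */
void produce(void *frame)
{
        pthread_mutex_lock(&lock);
        slot = frame;
        pthread_cond_signal(&cond);     /* wake the consumer; no yield games */
        pthread_mutex_unlock(&lock);
}

/* Called by the crunch/display thread; sleeps until work arrives. */
void *consume(void)
{
        void *frame;

        pthread_mutex_lock(&lock);
        while (slot == NULL)
                pthread_cond_wait(&cond, &lock);
        frame = slot;
        slot = NULL;
        pthread_mutex_unlock(&lock);
        return frame;
}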

> Cheers,
> Dick Johnson
> Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
> Warning : 98.36% of all statistics are fiction.
> .
> 
> ****************************************************************
> The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.
> 
> Thank you.
> 

Any chance you can get rid of that crazy disclaimer when posting
to lkml, please?

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:12                                   ` Nick Piggin
@ 2006-01-26 21:31                                     ` linux-os (Dick Johnson)
  2006-01-27  7:06                                       ` Valdis.Kletnieks
  0 siblings, 1 reply; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2006-01-26 21:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Howard Chu, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr


On Thu, 26 Jan 2006, Nick Piggin wrote:
[SNIPPED...]

> Any chance you can get rid of that crazy disclaimer when posting
> to lkml, please?
>
> Thanks,
> Nick
> --
> SUSE Labs, Novell Inc.
>

I tried. The "!@#(*$%^~!" IT/Legal Department(s) don't have a clue.
I asked the "mail-filter" guy on linux-kernel if he could just
exclude everything after a "." in the first column, just like
/bin/mail and, for that matter, sendmail. I was just told that
"It doesn't...." even though I can run sendmail by hand, using
telnet port 25, over the network, and know that the "." in the
first column is the way it knows the end-of-message after it
receives the "DATA" command.

Hoping that somebody, sometime, will implement my suggestion,
I continue to put a dot in the first column after my signature.
I know that if I send my mail around the lab without going through
the "*(_!@#&%" MicroWorm mail-grinder, the dot gets rid of
everything thereafter.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 20:46                           ` Nick Piggin
@ 2006-01-26 21:32                             ` Howard Chu
  2006-01-26 21:41                               ` Nick Piggin
                                                 ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-26 21:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: davids, Linux Kernel Mailing List

Nick Piggin wrote:
> OK, you believe that the mutex *must* be granted to a blocking thread
> at the time of the unlock. I don't think this is unreasonable from the
> wording (because it does not seem to be completely unambiguous english),
> however think about this -
>
> A realtime system with tasks A and B, A has an RT scheduling priority of
> 1, and B is 2. A and B are both runnable, so A is running. A takes a 
> mutex
> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> some point it drops the mutex and then tries to take it again.
>
> What happens?
>
> I haven't programmed realtime systems of any complexity, but I'd think it
> would be undesirable if A were to block and allow B to run at this point.

But why does A take the mutex in the first place? Presumably because it 
is about to execute a critical section. And also presumably, A will not 
release the mutex until it no longer has anything critical to do; 
certainly it could hold it longer if it needed to.

If A still needed the mutex, why release it and reacquire it, why not 
just hold onto it? The fact that it is being released is significant.

> Now this has nothing to do with PI or SCHED_OTHER, so behaviour is 
> exactly
> determined by our respective interpretations of what it means for "the
> scheduling policy to decide which task gets the mutex".
>
> What have I proven? Nothing ;) but perhaps my question could be answered
> by someone who knows a lot more about RT systems than I.

In the last RT work I did 12-13 years ago, there was only one high 
priority producer task and it was never allowed to block. The consumers 
just kept up as best they could (multi-proc machine of course). I've 
seldom seen a need for many priority levels. Probably not much you can 
generalize from this though.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:32                             ` Howard Chu
@ 2006-01-26 21:41                               ` Nick Piggin
  2006-01-26 21:56                                 ` Howard Chu
  2006-01-26 21:58                               ` Christopher Friesen
  2006-01-27  4:13                               ` Steven Rostedt
  2 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 21:41 UTC (permalink / raw)
  To: Howard Chu; +Cc: davids, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:
> 
>> OK, you believe that the mutex *must* be granted to a blocking thread
>> at the time of the unlock. I don't think this is unreasonable from the
>> wording (because it does not seem to be completely unambiguous english),
>> however think about this -
>>
>> A realtime system with tasks A and B, A has an RT scheduling priority of
>> 1, and B is 2. A and B are both runnable, so A is running. A takes a 
>> mutex
>> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
>> some point it drops the mutex and then tries to take it again.
>>
>> What happens?
>>
>> I haven't programmed realtime systems of any complexity, but I'd think it
>> would be undesirable if A were to block and allow B to run at this point.
> 
> 
> But why does A take the mutex in the first place? Presumably because it 
> is about to execute a critical section. And also presumably, A will not 
> release the mutex until it no longer has anything critical to do; 
> certainly it could hold it longer if it needed to.
> 
> If A still needed the mutex, why release it and reacquire it, why not 
> just hold onto it? The fact that it is being released is significant.
> 

Regardless of why, that is just the simplest scenario I could think
of that would give us a test case. However...

Why not hold onto it? We sometimes do this in the kernel if we need
to take a lock that is incompatible with the lock already being held,
or if we discover we need to take a mutex which nests outside our
currently held lock in other paths. Ie to prevent deadlock.

Another reason might be because we will be running for a very long
time without requiring the lock. Or we might like to release it because
we expect a higher priority process to take it.
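
A minimal pthreads sketch of that last "drop it for a long stretch"
pattern (illustrative names only, not from any real code base):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_state;

static void long_lock_free_work(void) { /* touches no shared state */ }

static void worker(void)
{
	pthread_mutex_lock(&m);
	shared_state++;			/* needs the exclusion */
	pthread_mutex_unlock(&m);	/* won't need m for a long time */

	long_lock_free_work();		/* waiters can make progress here */

	pthread_mutex_lock(&m);		/* reacquire only when needed again */
	shared_state--;
	pthread_mutex_unlock(&m);
}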

>> Now this has nothing to do with PI or SCHED_OTHER, so behaviour is 
>> exactly
>> determined by our respective interpretations of what it means for "the
>> scheduling policy to decide which task gets the mutex".
>>
>> What have I proven? Nothing ;) but perhaps my question could be answered
>> by someone who knows a lot more about RT systems than I.
> 
> 
> In the last RT work I did 12-13 years ago, there was only one high 
> priority producer task and it was never allowed to block. The consumers 
> just kept up as best they could (multi-proc machine of course). I've 
> seldom seen a need for many priority levels. Probably not much you can 
> generalize from this though.
> 

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:41                               ` Nick Piggin
@ 2006-01-26 21:56                                 ` Howard Chu
  2006-01-26 22:24                                   ` Nick Piggin
  2006-01-27  4:27                                   ` Steven Rostedt
  0 siblings, 2 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-26 21:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: davids, Linux Kernel Mailing List

Nick Piggin wrote:
> Howard Chu wrote:
>> Nick Piggin wrote:
>>
>>> OK, you believe that the mutex *must* be granted to a blocking thread
>>> at the time of the unlock. I don't think this is unreasonable from the
>>> wording (because it does not seem to be completely unambiguous 
>>> english),
>>> however think about this -
>>>
>>> A realtime system with tasks A and B, A has an RT scheduling 
>>> priority of
>>> 1, and B is 2. A and B are both runnable, so A is running. A takes a 
>>> mutex
>>> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
>>> some point it drops the mutex and then tries to take it again.
>>>
>>> What happens?
>>>
>>> I haven't programmed realtime systems of any complexity, but I'd 
>>> think it
>>> would be undesirable if A were to block and allow B to run at this 
>>> point.
>>
>>
>> But why does A take the mutex in the first place? Presumably because 
>> it is about to execute a critical section. And also presumably, A 
>> will not release the mutex until it no longer has anything critical 
>> to do; certainly it could hold it longer if it needed to.
>>
>> If A still needed the mutex, why release it and reacquire it, why not 
>> just hold onto it? The fact that it is being released is significant.
>>
>
> Regardless of why, that is just the simplest scenario I could think
> of that would give us a test case. However...
>
> Why not hold onto it? We sometimes do this in the kernel if we need
> to take a lock that is incompatible with the lock already being held,
> or if we discover we need to take a mutex which nests outside our
> currently held lock in other paths. Ie to prevent deadlock.

In those cases, A cannot retake the mutex anyway. I.e., you just said 
that you released the first mutex because you want to acquire a 
different one. So those cases don't fit this example very well.

> Another reason might be because we will be running for a very long
> time without requiring the lock.

And again in this case, A should not be immediately reacquiring the lock 
if it doesn't actually need it.

> Or we might like to release it because
> we expect a higher priority process to take it.

And in this case, the expected behavior is the same as I've been pursuing.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:32                             ` Howard Chu
  2006-01-26 21:41                               ` Nick Piggin
@ 2006-01-26 21:58                               ` Christopher Friesen
  2006-01-27  4:13                               ` Steven Rostedt
  2 siblings, 0 replies; 127+ messages in thread
From: Christopher Friesen @ 2006-01-26 21:58 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, davids, Linux Kernel Mailing List

Howard Chu wrote:

> But why does A take the mutex in the first place? Presumably because it 
> is about to execute a critical section. And also presumably, A will not 
> release the mutex until it no longer has anything critical to do; 
> certainly it could hold it longer if it needed to.

Suppose A is pulling job requests off a queue.

A takes the mutex because it is going to modify data protected by the 
mutex.  It then gives up the mutex when it's done modifying the data.

> If A still needed the mutex, why release it and reacquire it, why not 
> just hold onto it? The fact that it is being released is significant.

Suppose A then pulls another job request off the queue.  It just so 
happens that this job requires touching some data protected by the same 
mutex.  It would need to take the mutex again.

A doesn't necessarily know what data the various jobs will require it to 
access, so it doesn't know a priori what mutexes will be required.
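
In code, that consumer loop might look roughly like this (the queue and
job types are hypothetical stand-ins):

#include <pthread.h>
#include <stddef.h>

struct job { int needs_shared_data; };

static pthread_mutex_t data_mutex = PTHREAD_MUTEX_INITIALIZER;

/* stand-ins for the real queue pop and job body */
static struct job *pull_job(void)         { return NULL; }
static void        process(struct job *j) { (void)j; }

static void consumer(void)
{
	struct job *j;

	while ((j = pull_job()) != NULL) {
		if (j->needs_shared_data) {
			/* this job happens to touch protected data,
			 * so the same mutex is taken yet again */
			pthread_mutex_lock(&data_mutex);
			process(j);
			pthread_mutex_unlock(&data_mutex);
		} else {
			process(j);	/* no lock needed at all */
		}
	}
}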

Chris

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:56                                 ` Howard Chu
@ 2006-01-26 22:24                                   ` Nick Piggin
  2006-01-27  8:08                                     ` Howard Chu
  2006-01-27  4:27                                   ` Steven Rostedt
  1 sibling, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2006-01-26 22:24 UTC (permalink / raw)
  To: Howard Chu; +Cc: davids, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:

>> Regardless of why, that is just the simplest scenario I could think
>> of that would give us a test case. However...
>>
>> Why not hold onto it? We sometimes do this in the kernel if we need
>> to take a lock that is incompatible with the lock already being held,
>> or if we discover we need to take a mutex which nests outside our
>> currently held lock in other paths. Ie to prevent deadlock.
> 
> 
> In those cases, A cannot retake the mutex anyway. I.e., you just said 
> that you released the first mutex because you want to acquire a 
> different one. So those cases don't fit this example very well.
> 

Umm yes, then *after* acquiring the different one, A would like to
retake the original mutex.

>> Another reason might be because we will be running for a very long
>> time without requiring the lock.
> 
> 
> And again in this case, A should not be immediately reacquiring the lock 
> if it doesn't actually need it.
> 

No, not immediately, I said "for a very long time". As in: A does not
need the exclusion provided by the lock for a very long time so it
drops it to avoid needless contention, then reacquires it when it finally
does need the lock.

>> Or we might like to release it because
>> we expect a higher priority process to take it.
> 
> 
> And in this case, the expected behavior is the same as I've been pursuing.
> 

No, we're talking about what happens when A tries to acquire it again.

Just accept that my described scenario is legitimate then consider it in
isolation rather than getting caught up in the superfluous details of how
such a situation might come about.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 20:27                         ` Howard Chu
  2006-01-26 20:46                           ` Nick Piggin
@ 2006-01-27  2:16                           ` David Schwartz
  2006-01-27  8:19                             ` Howard Chu
  1 sibling, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-27  2:16 UTC (permalink / raw)
  To: hyc; +Cc: Linux Kernel Mailing List


> David Schwartz wrote:

> > First, there's the as-if issue. You cannot write a program
> > that can print
> > "non-compliant" with the behavior you claim is non-compliant that is
> > guaranteed not to do so by the standard because there is no way
> > to know that
> > another thread is blocked on the mutex (except for PI mutexes).

> The exception here demolishes this argument, IMO.

	You're saying the authors of the standard intended that clause to be read
in light of the possibility of PI mutexes?! That's just nuts.

> Moreover, if the
> unlocker was a lower priority thread and there are higher priority
> threads blocked on the mutex, you really want the higher priority thread
> to run.

	Yes, I agree.

> > 	Second, there's the plain language of the standard. It says
> > "If X is so at
> > time T, then Y". This does not require Y to happen at time T. It is X
> > happening at time T that requires Y, but the time for Y is not
> specified.

> > 	If a law says, for example, "if there are two or more bids
> > with the same
> > price lower than all other bids at the close of bidding, the
> > first such bid
> > to be received shall be accepted". The phrase "at the close of bidding"
> > refers to the time the rule is determined to apply to the
> > situation, not
> > the time at which the decision as to which bid to accept is made.

> The time at which the decision takes effect is immaterial; the point is
> that the decision can only be made from the set of options available at
> time T.
>
> Per your analogy, if a new bid comes in at time T+1, it can't have any
> effect on which of the bids shall be accepted.

	Only because of the specifics of this analogy. If the rule said "if there
are two or more such bids with the same price at the close of bidding, the
winning bid shall be determined by the board of directors' policy", nothing
prevents the board of directors from having a policy of going back to the
bidders and asking if they can lower their bids further.

	Nothing prevents them from rebidding the project if they want. In other
words, it doesn't place any restrictions on what the board can do.

> > 	Third, there's the ambiguity of the standard. It says the "scheduling
> > policy" shall decide, not that the scheduler shall decide. If
> > the policy is
> > to make a conditional or delayed decision, that is still perfectly valid
> > policy. "Whichever thread requests it first" is a valid
> > scheduler policy.

> I am not debating what the policy can decide. Merely the set of choices
> from which it may decide.

	Which is a restriction not found in the standard. A "policy" is a way of
deciding, not a decision. Scheduling policy can be to let whoever asks first
get it.

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:32                             ` Howard Chu
  2006-01-26 21:41                               ` Nick Piggin
  2006-01-26 21:58                               ` Christopher Friesen
@ 2006-01-27  4:13                               ` Steven Rostedt
  2 siblings, 0 replies; 127+ messages in thread
From: Steven Rostedt @ 2006-01-27  4:13 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List, davids, Nick Piggin

On Thu, 2006-01-26 at 13:32 -0800, Howard Chu wrote:
> Nick Piggin wrote:
> > OK, you believe that the mutex *must* be granted to a blocking thread
> > at the time of the unlock. I don't think this is unreasonable from the
> > wording (because it does not seem to be completely unambiguous english),
> > however think about this -
> >
> > A realtime system with tasks A and B, A has an RT scheduling priority of
> > 1, and B is 2. A and B are both runnable, so A is running. A takes a 
> > mutex
> > then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> > some point it drops the mutex and then tries to take it again.
> >
> > What happens?
> >
> > I haven't programmed realtime systems of any complexity, but I'd think it
> > would be undesirable if A were to block and allow B to run at this point.
> 
> But why does A take the mutex in the first place? Presumably because it 
> is about to execute a critical section. And also presumably, A will not 
> release the mutex until it no longer has anything critical to do; 
> certainly it could hold it longer if it needed to.

A while back I discovered that the -rt patch did just this with the
spin_lock-to-rt_mutex conversion. Here's the scenario, which happened
amazingly often.

Three tasks A, B, C:  A with the highest prio (say 3), B in the middle (say 2)
and C the lowest (say 1).  And all this with PI (although without PI it
can happen even more easily; see my explanation here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111165425915947&w=4 )

C grabs mutex X
B preempts C and tries to grab mutex X and blocks (C inherits from B)
A comes along and preempts C and blocks on X (C now inherits from A)
C lets go of mutex X and gives it to A.
A does some work then releases mutex X (B, although not running, acquires
it).
A needs to grab X again but B owns it. Since B has the lock, the high
priority task A must give up the CPU to the lower priority task B.

I implemented "lock stealing" for this very case and cut down
unnecessary schedules and latencies tremendously.  If A goes to grab X
again, but B has it (and hasn't woken up yet), A can "steal" it from B
and continue.
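
As a toy illustration of the steal check (this is not the actual -rt
patch code; the types and pending-owner bookkeeping here are made up):

#include <stdbool.h>

struct task { int prio; };		/* lower value = higher priority */

struct rt_lock {
	struct task *pending_owner;	/* was granted the lock while blocked */
	bool	     owner_has_run;	/* has the pending owner woken up yet? */
};

/* A may steal the lock iff the owner is only pending (it never actually
 * ran while holding the lock) and A outranks that pending owner. */
static bool can_steal(const struct rt_lock *l, const struct task *me)
{
	return l->pending_owner && !l->owner_has_run &&
	       me->prio < l->pending_owner->prio;
}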

Hmm, this may still be within POSIX, if what you say is that a
"waiting" process must get the lock.  If A comes back before B wakes up,
A is now a waiting process and may take it. OK, maybe I'm stretching it a
little, but that's what RT wants.

> 
> If A still needed the mutex, why release it and reacquire it, why not 
> just hold onto it? The fact that it is being released is significant.

There are several reasons.  Why hold a mutex when you don't need to? This
could be an SMP machine, and B could grab the mutex in the small window in
which A releases it.  Also, locks are released and reacquired a lot to
prevent deadlocks.

It's good practice to always release a mutex (or any lock) when it's not
needed, even if you plan on grabbing it again right away. For all you know,
a higher priority process may be waiting to get it.

> 
> > Now this has nothing to do with PI or SCHED_OTHER, so behaviour is 
> > exactly
> > determined by our respective interpretations of what it means for "the
> > scheduling policy to decide which task gets the mutex".
> >
> > What have I proven? Nothing ;) but perhaps my question could be answered
> > by someone who knows a lot more about RT systems than I.
> 
> In the last RT work I did 12-13 years ago, there was only one high 
> priority producer task and it was never allowed to block. The consumers 
> just kept up as best they could (multi-proc machine of course). I've 
> seldom seen a need for many priority levels. Probably not much you can 
> generalize from this though.

That seems to be a very simple system.  I usually deal with 4 or 5
priority levels and that can easily create headaches.

-- Steve




^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:56                                 ` Howard Chu
  2006-01-26 22:24                                   ` Nick Piggin
@ 2006-01-27  4:27                                   ` Steven Rostedt
  1 sibling, 0 replies; 127+ messages in thread
From: Steven Rostedt @ 2006-01-27  4:27 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List, davids, Nick Piggin

On Thu, 2006-01-26 at 13:56 -0800, Howard Chu wrote:
> Nick Piggin wrote:

> >>
> >> But why does A take the mutex in the first place? Presumably because 
> >> it is about to execute a critical section. And also presumably, A 
> >> will not release the mutex until it no longer has anything critical 
> >> to do; certainly it could hold it longer if it needed to.
> >>
> >> If A still needed the mutex, why release it and reacquire it, why not 
> >> just hold onto it? The fact that it is being released is significant.
> >>
> >
> > Regardless of why, that is just the simplest scenario I could think
> > of that would give us a test case. However...
> >
> > Why not hold onto it? We sometimes do this in the kernel if we need
> > to take a lock that is incompatible with the lock already being held,
> > or if we discover we need to take a mutex which nests outside our
> > currently held lock in other paths. Ie to prevent deadlock.
> 
> In those cases, A cannot retake the mutex anyway. I.e., you just said 
> that you released the first mutex because you want to acquire a 
> different one. So those cases don't fit this example very well.

Let's say you have two locks X and Y.  Y nests inside of X. To do block1
you need to hold lock Y, and to do block2 you need to hold both locks X
and Y, and block1 must be done first without holding lock X.

func()
{
again:
	mutex_lock(Y);
	block1();
	if (!mutex_trylock(X)) {		/* blocking on X under Y could deadlock */
		mutex_unlock(Y);		/* back out of Y... */
		mutex_lock(X);			/* ...and retake both in X-then-Y order */
		mutex_lock(Y);
		if (block1_has_changed()) {	/* Y was briefly free: recheck block1 */
			mutex_unlock(Y);
			mutex_unlock(X);
			goto again;		/* redo block1 from scratch */
		}
	}
	block2();
	mutex_unlock(X);
	mutex_unlock(Y);
}

Stuff like the above actually is done (it's done in the kernel). So you
can see here that Y can be released and reacquired right away.  If
another task (of lower priority) was waiting on Y, we don't want to give
up the lock, since we would then block and the chance of
block1_has_changed() returning true goes up even more.

> 
> > Another reason might be because we will be running for a very long
> > time without requiring the lock.
> 
> And again in this case, A should not be immediately reacquiring the lock 
> if it doesn't actually need it.

I'm not sure what Nick means here, but I'm sure he didn't mean it to
come out that way ;)

> 
> > Or we might like to release it because
> > we expect a higher priority process to take it.
> 
> And in this case, the expected behavior is the same as I've been pursuing.

But you can't know whether a higher or lower priority process is waiting.
Sure, it works the way you say when a higher priority process is
waiting, but it doesn't when a lower priority process is waiting.

-- Steve



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 21:31                                     ` linux-os (Dick Johnson)
@ 2006-01-27  7:06                                       ` Valdis.Kletnieks
  0 siblings, 0 replies; 127+ messages in thread
From: Valdis.Kletnieks @ 2006-01-27  7:06 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Nick Piggin, Howard Chu, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

[-- Attachment #1: Type: text/plain, Size: 1590 bytes --]

On Thu, 26 Jan 2006 16:31:28 EST, "linux-os (Dick Johnson)" said:

> "It doesn't...." even though I can run sendmail by hand, using
> telnet port 25, over the network, and know that the "." in the
> first column is the way it knows the end-of-message after it
> receives the "DATA" command.

Right. That's how an MTA talks to another MTA.  However, your mail
needs to be properly escaped.  RFC821, section 4.5.2:

     4.5.2.  TRANSPARENCY

         Without some provision for data transparency the character
         sequence "<CRLF>.<CRLF>" ends the mail text and cannot be sent
         by the user.  In general, users are not aware of such
         "forbidden" sequences.  To allow all user composed text to be
         transmitted transparently the following procedures are used.

            1. Before sending a line of mail text the sender-SMTP checks
            the first character of the line.  If it is a period, one
            additional period is inserted at the beginning of the line.

            2. When a line of mail text is received by the receiver-SMTP
            it checks the line.  If the line is composed of a single
            period it is the end of mail.  If the first character is a
            period and there are other characters on the line, the first
            character is deleted.

In other words, the on-the-wire protocol is specifically designed so that
you *can't* accidentally lose the rest of the message by sending a bare '.'.
The fact that some programs implement it when talking to the user is
merely a convenience hack on the program's part.
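
For what it's worth, the sender side of that rule is tiny; a minimal
sketch (my code, not taken from any real MTA):

#include <stdio.h>

/* Emit one line of message text with RFC 821 transparency applied:
 * a leading '.' is doubled on the wire and the receiver strips it,
 * so a lone "." can only ever mean end-of-message. */
static void smtp_send_line(FILE *wire, const char *line)
{
	if (line[0] == '.')
		fputc('.', wire);	/* dot-stuffing */
	fputs(line, wire);
	fputs("\r\n", wire);		/* SMTP line ending */
}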

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 22:24                                   ` Nick Piggin
@ 2006-01-27  8:08                                     ` Howard Chu
  2006-01-27 19:25                                       ` Philipp Matthias Hahn
  2006-02-01 12:31                                       ` Nick Piggin
  0 siblings, 2 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-27  8:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: davids, Linux Kernel Mailing List

Nick Piggin wrote:
> Howard Chu wrote:
>
>>> Another reason might be because we will be running for a very long
>>> time without requiring the lock.
>>
>>
>> And again in this case, A should not be immediately reacquiring the 
>> lock if it doesn't actually need it.
>>
>
> No, not immediately, I said "for a very long time". As in: A does not
> need the exclusion provided by the lock for a very long time so it
> drops it to avoid needless contention, then reacquires it when it finally
> does need the lock.

OK. I think this is really a separate situation. Just to recap: A takes 
lock, does some work, releases lock, a very long time passes, then A 
takes the lock again. In the "time passes" part, that mutex could be 
locked and unlocked any number of times by other threads and A won't 
know or care. Particularly on an SMP machine, other threads that were 
blocked on that mutex could do useful work in the interim without 
impacting A's progress at all. So here, when A leaves the mutex unlocked 
for a long time, it's desirable to give the mutex to one of the waiters 
ASAP.

>>> Or we might like to release it because
>>> we expect a higher priority process to take it.
>>
>>
>> And in this case, the expected behavior is the same as I've been 
>> pursuing.
>>
>
> No, we're talking about what happens when A tries to acquire it again.
>
> Just accept that my described scenario is legitimate then consider it in
> isolation rather than getting caught up in the superfluous details of how
> such a situation might come about.

OK. I'm not trying to be difficult here. In much of life, context is 
everything; very little can be understood in isolation.

Back to the scenario:

> A realtime system with tasks A and B, A has an RT scheduling priority of
> 1, and B is 2. A and B are both runnable, so A is running. A takes a 
> mutex
> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
> some point it drops the mutex and then tries to take it again.
>
> What happens?

As I understand the spec, A must block because B has acquired the mutex. 
Once again, the SUS discussion of priority inheritance would never need 
to have been written if this were not the case:

 >>>
In a priority-driven environment, a direct use of traditional primitives 
like mutexes and condition variables can lead to unbounded priority 
inversion, where a higher priority thread can be blocked by a lower 
priority thread, or set of threads, for an unbounded duration of time. 
As a result, it becomes impossible to guarantee thread deadlines. 
Priority inversion can be bounded and minimized by the use of priority 
inheritance protocols. This allows thread deadlines to be guaranteed 
even in the presence of synchronization requirements.
<<<

The very first sentence indicates that a higher priority thread can be 
blocked by a lower priority thread. If your interpretation of the spec 
were correct, then such an instance would never occur. Since your 
scenario is using realtime threads, we can assume that the Priority
Ceiling feature is present and you can use it if needed. ( 
http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09_06 
Realtime Threads option group )

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27  2:16                           ` David Schwartz
@ 2006-01-27  8:19                             ` Howard Chu
  2006-01-27 19:50                               ` David Schwartz
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-27  8:19 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List

David Schwartz wrote:
>>> 	Third, there's the ambiguity of the standard. It says the "scheduling
>>> policy" shall decide, not that the scheduler shall decide. If
>>> the policy is
>>> to make a conditional or delayed decision, that is still perfectly valid
>>> policy. "Whichever thread requests it first" is a valid
>>> scheduler policy.
>>>       

>> I am not debating what the policy can decide. Merely the set of choices
>> from which it may decide.
>>     
>
> 	Which is a restriction not found in the standard. A "policy" is a way of
> deciding, not a decision. Scheduling policy can be to let whoever asks first
> get it.
>   

If we just went with "whoever asks first" then clearly one of the 
blocked threads asked before the unlocker made its new request. You're 
arguing for my point, then.

Other ambiguities aside, one thing is clear - a decision is triggered by 
the unlock. What you seem to be arguing is the equivalent of saying that 
the decision is made based on the next lock operation. The spec doesn't 
say that mutex_lock is to behave this way. Why do you suppose that is? 
Perhaps you should raise this question with the Open Group.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27  8:08                                     ` Howard Chu
@ 2006-01-27 19:25                                       ` Philipp Matthias Hahn
  2006-02-01 12:31                                       ` Nick Piggin
  1 sibling, 0 replies; 127+ messages in thread
From: Philipp Matthias Hahn @ 2006-01-27 19:25 UTC (permalink / raw)
  To: Howard Chu; +Cc: davids, Linux Kernel Mailing List

Hello!

On Fri, Jan 27, 2006 at 12:08:13AM -0800, Howard Chu wrote:
> >No, not immediately, I said "for a very long time". As in: A does not
> >need the exclusion provided by the lock for a very long time so it
> >drops it to avoid needless contention, then reacquires it when it finally
> >does need the lock.
> 
> OK. I think this is really a separate situation. Just to recap: A takes 
> lock, does some work, releases lock, a very long time passes, then A 
> takes the lock again. In the "time passes" part, that mutex could be 
> locked and unlocked any number of times by other threads and A won't 
> know or care. Particularly on an SMP machine, other threads that were 
> blocked on that mutex could do useful work in the interim without 
> impacting A's progress at all. So here, when A leaves the mutex unlocked 
> for a long time, it's desirable to give the mutex to one of the waiters 
> ASAP.

When you release a lock, you unblock at most one thread that is
waiting for that lock, and put that released thread in the runnable
state.
Then it's up to the scheduler what happens next:
- if you have multiple processors, you _can_ run the released thread on
  another processor, so both threads run.
- if you are on a single processor, or don't want to schedule the released
  thread on a second CPU, you must decide to
  - either _continue running the releasing thread_ and let the released
    thread stay some more time in the runnable queue,
  - or _preempt the releasing thread_ to the runnable queue and make the
    released thread the running one.
If you have different priorities, your decision is easy: run the most
important thread.
But if you don't have priorities, you base your decision on other
metrics: since it takes more time to switch threads (save/restore
state) than to continue running the same thread, from a throughput
perspective you'll prefer not to change threads.

Similar thinking applies to yield(): you put the running thread back on
the runnable queue and choose one thread from it as the new running
thread. Note that you might choose the old thread as the new thread
again; with SCHED_OTHER this is perfectly fine, if you decided to honor
throughput more than fairness.
It's different with SCHED_FIFO/RR, since there you are forced to put the
old thread at the end of the runnable queue and choose the new one from
the front of the queue, so all other threads with the same priority will
run before your yielding thread gets the CPU again.

Summary: yield() only makes sense with a SCHED_FIFO/RR policy, because
with SCHED_OTHER you know too little about the exact policy to make any
use of it.
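
A minimal example of the one case where yield() is well-defined,
assuming you can get realtime privileges (run as root):

#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
		perror("sched_setscheduler");	/* needs CAP_SYS_NICE */
		return 1;
	}
	/* well-defined here: go to the back of the prio-10 FIFO queue,
	 * so every other runnable prio-10 thread runs before us */
	sched_yield();
	return 0;
}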

BYtE
Philipp
-- 
      Dipl.-Inform. Philipp.Hahn@informatik.uni-oldenburg.de
      Abteilung Systemsoftware und verteilte Systeme, Fk. II
Carl von Ossietzky Universitaet Oldenburg, 26111 Oldenburg, Germany
    http://www.svs.informatik.uni-oldenburg.de/contact/pmhahn/
      Telefon: +49 441 798-2866    Telefax: +49 441 798-2756

^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27  8:19                             ` Howard Chu
@ 2006-01-27 19:50                               ` David Schwartz
  2006-01-27 20:13                                 ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-27 19:50 UTC (permalink / raw)
  To: hyc; +Cc: Linux Kernel Mailing List


> If we just went with "whoever asks first" then clearly one of the
> blocked threads asked before the unlocker made its new request. You're
> arguing for my point, then.

	Huh? I am saying the policy can be anything at all. We could just go with
"whoever asks first", but we are not required to. And, in any event, I meant
whoever asks for the mutex first, not whoever blocks first. (Note that I
didn't say "whoever asked first" which would mean something totally
different.)

> Other ambiguities aside, one thing is clear - a decision is triggered by
> the unlock. What you seem to be arguing is the equivalent of saying that
> the decision is made based on the next lock operation.

	The spec says that the decision is triggered by a particular condition that
exists at the time of the unlock. That does not mean the decision is made at
the time of the unlock.

> The spec doesn't
> say that mutex_lock is to behave this way.

	We don't agree on what the specification says.

> Why do you suppose that is?

	Why do I suppose what? I find the specification perfectly clear and your
reading of it incredibly strained for the three reasons I stated.

> Perhaps you should raise this question with the Open Group.

	I don't think it's unclear.

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27 19:50                               ` David Schwartz
@ 2006-01-27 20:13                                 ` Howard Chu
  2006-01-27 21:05                                   ` David Schwartz
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-27 20:13 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List

David Schwartz wrote:
> 	We don't agree on what the specification says.
>
>   
>> Why do you suppose that is?
>>     
>
> 	Why do I suppose what? I find the specification perfectly clear and your
> reading of it incredibly strained for the three reasons I stated.
>   

Oddly enough, you said 
http://groups.google.com/group/comp.programming.threads/msg/28b58e91886a3602?hl=en&
"Unfortunately, it sounds reasonable"  so I can't lend credence to your 
stating that my reading is incredibly strained. The fact that 
LinuxThreads historically adhered to my reading of it lends more weight 
to my argument. The fact that people accepted this interpretation for so 
many years lends further weight. In light of this, it is your current 
interpretation that is incredibly strained, and I would say, broken.

You have essentially created a tri-state mutex. (Locked, unlocked, and 
sort-of-unlocked-but-really-reserved.) That may be a good and useful 
thing in its own right, but it should not be the default behavior.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27 20:13                                 ` Howard Chu
@ 2006-01-27 21:05                                   ` David Schwartz
  2006-01-27 21:23                                     ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: David Schwartz @ 2006-01-27 21:05 UTC (permalink / raw)
  To: hyc; +Cc: Linux Kernel Mailing List


> David Schwartz wrote:
> > 	We don't agree on what the specification says.
> >
> >
> >> Why do you suppose that is?
> >>
> >
> > 	Why do I suppose what? I find the specification perfectly
> clear and your
> > reading of it incredibly strained for the three reasons I stated.
> >

> Oddly enough, you said
> http://groups.google.com/group/comp.programming.threads/msg/28b58e91886a3602?hl=en&
> "Unfortunately, it sounds reasonable"  so I can't lend credence to your
> stating that my reading is incredibly strained. The fact that
> LinuxThreads historically adhered to my reading of it lends more weight
> to my argument. The fact that people accepted this interpretation for so
> many years lends further weight. In light of this, it is your current
> interpretation that is incredibly strained, and I would say, broken.

	After collecting other opinions from comp.programming.threads, and being
unable to find other people who considered it reasonable, I've changed my
opinion. I was far too generous and deferential before.

	The more I consider it, the more absurd I find it. POSIX and SuS were so
careful not to dictate scheduler policy (or even hint at any notion of
fairness) that to argue that they intended to prohibit a thread from
releasing and reacquiring a mutex while another thread was blocked on it is
not tenable.

	You are essentially arguing that they intended to prohibit the most natural
and highest performing implementation. This is totally inconsistent with
POSIX's overall design intention to provide the lightest and
highest-performing primitives and allow users to add features with overhead
if they needed those features and could tolerate the overhead.

> You have essentially created a tri-state mutex. (Locked, unlocked, and
> sort-of-unlocked-but-really-reserved.) That may be a good and useful
> thing in its own right, but it should not be the default behavior.

	Huh?

	I'm suggesting the most natural implementation: when a thread releases a
mutex, the highest-priority thread waiting for it is woken, but it is not
necessarily guaranteed the mutex; the mutex is simply marked available.
When a thread tries to acquire a mutex, it loops until it acquires it: on
each iteration it blocks if the mutex is taken or a higher-priority thread
is already registered as wanting it, and otherwise it takes the mutex.
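
	Ignoring the priority checks, a toy userspace rendition of that "mark
available and let callers compete" unlock could look like this
(illustrative names; certainly not glibc's actual code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct compete_mutex {
	atomic_bool	locked;	/* true while some thread owns it */
	pthread_mutex_t	gate;	/* protects only the condvar below */
	pthread_cond_t	avail;	/* signalled when the mutex is freed */
};
/* initialise as { false, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER } */

static void compete_lock(struct compete_mutex *m)
{
	bool expected = false;

	/* loop until we win the flag: a running thread that asks first
	 * simply wins; a woken waiter must compete like everyone else */
	while (!atomic_compare_exchange_weak(&m->locked, &expected, true)) {
		expected = false;
		pthread_mutex_lock(&m->gate);
		if (atomic_load(&m->locked))
			pthread_cond_wait(&m->avail, &m->gate);
		pthread_mutex_unlock(&m->gate);
	}
}

static void compete_unlock(struct compete_mutex *m)
{
	atomic_store(&m->locked, false);	/* mark available... */
	pthread_mutex_lock(&m->gate);
	pthread_cond_signal(&m->avail);		/* ...and wake one waiter, who
						 * is NOT guaranteed the mutex */
	pthread_mutex_unlock(&m->gate);
}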

	A thread that is descheduled should never get priority over a thread that
is already running (unless a scheduling priority mechanism requires it).

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27 21:05                                   ` David Schwartz
@ 2006-01-27 21:23                                     ` Howard Chu
  2006-01-27 23:31                                       ` David Schwartz
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-27 21:23 UTC (permalink / raw)
  To: davids; +Cc: Linux Kernel Mailing List

David Schwartz wrote:
> 	After collecting other opinions from comp.programming.threads, and being
> unable to find other people who considered it reasonable, I've changed my
> opinion. I was far too generous and deferential before.
>   

David, you specifically have been faced with this question before:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/2184ba84f911d9dd/a6e4f7cf13bbec2d#a6e4f7cf13bbec2d
and you didn't dispute the interpretation then. The wording for 
pthread_mutex_unlock hasn't changed between 2001 and now.

And here:
http://groups.google.com/group/comp.programming.threads/msg/89cc5d600e34e88a?hl=en&

If those statements were incorrect, I have a feeling someone would have 
corrected them at the time. Certainly you can attest to that.
http://groups.google.com/group/comp.programming.threads/msg/d5b2231ca57bb102?hl=en&

Clearly at this point there's nothing to be gained from pursuing this 
any further. The 2.6 kernel has been out for too long; if it were to be 
"fixed" again it would just make life ugly for another group of people, 
and I don't want to write the autoconf tests to detect the 
flavor-of-the-week. We've wasted enough time arguing futilely over it, 
I'll stop.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27 21:23                                     ` Howard Chu
@ 2006-01-27 23:31                                       ` David Schwartz
  0 siblings, 0 replies; 127+ messages in thread
From: David Schwartz @ 2006-01-27 23:31 UTC (permalink / raw)
  To: hyc; +Cc: Linux Kernel Mailing List


> David, you specifically have been faced with this question before:
> http://groups.google.com/group/comp.programming.threads/browse_frm/thread/2184ba84f911d9dd/a6e4f7cf13bbec2d#a6e4f7cf13bbec2d
> and you didn't dispute the interpretation then. The wording for
> pthread_mutex_unlock hasn't changed between 2001 and now.

	This was a totally different question. This was about the implementation,
not the interpretation. You'll note that I objected to the implementation.

> And here:
> http://groups.google.com/group/comp.programming.threads/msg/89cc5d600e34e88a?hl=en&

	Again, I don't see that I commented on the interpretation. This was an
unfortunate missed opportunity. Kaz is incorrect here.

> If those statements were incorrect, I have a feeling someone would have
> corrected them at the time. Certainly you can attest to that.

	Obviously not, since they are incorrect and nobody did.

> http://groups.google.com/group/comp.programming.threads/msg/d5b2231ca57bb102?hl=en&

	Again, this had nothing whatsoever to do with whether the interpretation is
correct or not.

> Clearly at this point there's nothing to be gained from pursuing this
> any further. The 2.6 kernel has been out for too long; if it were to be
> "fixed" again it would just make life ugly for another group of people,
> and I don't want to write the autoconf tests to detect the
> flavor-of-the-week. We've wasted enough time arguing futilely over it,
> I'll stop.

	The problem is that this interpretation is simply incorrect and results in
maximally inefficient implementations.

	David Butenhof recently posted to comp.programming.threads and indicated
that he disagreed with this implementation. That's about as close to
authoritative as you're likely to get.

	POSIX had no intention to constrain the scheduler to compel inefficient
behavior. In fact, they went out of their way to create the lightest
possible primitives.

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-26 19:02           ` Stefan Seyfried
  2006-01-26 19:09             ` Olaf Kirch
@ 2006-01-28 11:53             ` Mattia Dongili
  2006-01-28 19:53               ` Jesse Brandeburg
  1 sibling, 1 reply; 127+ messages in thread
From: Mattia Dongili @ 2006-01-28 11:53 UTC (permalink / raw)
  To: Stefan Seyfried
  Cc: Jesse Brandeburg, Olaf Kirch, Linux Kernel Mailing List, netdev

On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
> On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
>  
> > Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> > to show that my patch that just removes e100_init_hw works okay for
> > me.  Let me know how it goes for you, I think this is a good fix.
> 
> worked for me in the Compaq Armada e500 and reportedly also fixed the
> SONY that originally uncovered it.

confirmed here too. The patch fixes S3 resume on this Sony (GR7/K)
running 2.6.16-rc1-mm3.

0000:02:08.0 Ethernet controller: Intel Corporation 82801CAM (ICH3) PRO/100 VE (LOM) Ethernet Controller (rev 41)
	Subsystem: Sony Corporation Vaio PCG-GR214EP/GR214MP/GR215MP/GR314MP/GR315MP
	Flags: bus master, medium devsel, latency 66, IRQ 9
	Memory at d0204000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 4000 [size=64]
	Capabilities: <available only to root>

thanks
-- 
mattia
:wq!

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-28 11:53             ` Mattia Dongili
@ 2006-01-28 19:53               ` Jesse Brandeburg
  2006-02-07  6:57                 ` Jeff Garzik
  0 siblings, 1 reply; 127+ messages in thread
From: Jesse Brandeburg @ 2006-01-28 19:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Stefan Seyfried, Jesse Brandeburg, Olaf Kirch,
	Linux Kernel Mailing List, netdev, Jesse Brandeburg,
	Jeff Kirsher

[-- Attachment #1: Type: text/plain, Size: 737 bytes --]

On 1/28/06, Mattia Dongili <malattia@linux.it> wrote:
> On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
> > On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
> >
> > > Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
> > > to show that my patch that just removes e100_init_hw works okay for
> > > me.  Let me know how it goes for you, I think this is a good fix.
> >
> > worked for me in the Compaq Armada e500 and reportedly also fixed the
> > SONY that originally uncovered it.
>
> confirmed here too. The patch fixes S3 resume on this Sony (GR7/K)
> running 2.6.16-rc1-mm3.

excellent news! thanks for testing.

Jeff, could you please apply to 2.6.16-rcX

Jesse

[-- Attachment #2: e100_resume_no_init.patch --]
[-- Type: application/octet-stream, Size: 818 bytes --]

e100: remove init_hw call to fix panic

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

e100 seems to have had a long-standing bug where e100_hw_init was being
called when it should not have been.  This caused a panic due to recent
changes that rely on correct setup in the driver, and more robust error
paths.
---

 drivers/net/e100.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -2752,8 +2752,6 @@ static int e100_resume(struct pci_dev *p
 	retval = pci_enable_wake(pdev, 0, 0);
 	if (retval)
 		DPRINTK(PROBE,ERR, "Error clearing wake events\n");
-	if(e100_hw_init(nic))
-		DPRINTK(HW, ERR, "e100_hw_init failed\n");
 
 	netif_device_attach(netdev);
 	if(netif_running(netdev))

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 19:57                       ` David Schwartz
  2006-01-26 20:27                         ` Howard Chu
@ 2006-01-30  8:28                         ` Helge Hafting
  1 sibling, 0 replies; 127+ messages in thread
From: Helge Hafting @ 2006-01-30  8:28 UTC (permalink / raw)
  To: davids; +Cc: hyc, Linux Kernel Mailing List

David Schwartz wrote:

>	Third, there's the ambiguity of the standard. It says the "scheduling
>policy" shall decide, not that the scheduler shall decide. If the policy is
>to make a conditional or delayed decision, that is still perfectly valid
>policy. "Whichever thread requests it first" is a valid scheduler policy.
>  
>
Sure.  And with a "whichever thread acquires it first" policy, then
it is obvious what happens when a mutex is released when someone
is blocked on it:  Whoever blocked on it first is then the one
who requested it first - that cannot change as the request was made
before the mutex even was released.  So then, the releasing thread has
no chance of getting the mutex back until the others have had a
go at it - no matter which threads actually get scheduled.

Helge Hafting


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 10:38                 ` Nikita Danilov
@ 2006-01-30  8:35                   ` Helge Hafting
  2006-01-30 11:13                     ` Nikita Danilov
  2006-01-31 23:18                     ` David Schwartz
  0 siblings, 2 replies; 127+ messages in thread
From: Helge Hafting @ 2006-01-30  8:35 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Howard Chu, Christopher Friesen, Linux Kernel Mailing List, hancockr

Nikita Danilov wrote:

>Howard Chu writes:
>
>[...]
>
> > 
> > A straightforward reading of the language here says the decision happens 
> > "when pthread_mutex_unlock() is called" and not at any later time. There 
> > is nothing here to support your interpretation.
> > >
> > > I think the intention of the wording is that for deterministic policies,
 > > > it is clear that the waiting threads are actually woken and reevaluated
> > > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
> > > considering the scheduling policy is arbitrary.
> > >
> > Clearly the point is that one of the waiting threads is waken and gets 
> > the mutex, and it doesn't matter which thread is chosen. I.e., whatever 
>
>Note that this behavior directly leads to "convoy formation": if that
>woken thread T0 does not immediately run (e.g., because there are higher
>priority threads) but still already owns the mutex, then other running
>threads contending for this mutex will block waiting for T0, forming a
>convoy.
>
I just wonder - what is the problem with this convoy formation?
It can only happen when the cpu is overloaded, and in that case
someone has to wait.  In this case, the mutex waiters. 

Aggressively handing the cpu to whoever holds a mutex will mean the
mutexes are free more of the time - but it will *not* mean less waiting in
the system.  You just change who waits.

Helge Hafting


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-26 17:34                             ` linux-os (Dick Johnson)
  2006-01-26 19:00                               ` Nick Piggin
@ 2006-01-30  8:44                               ` Helge Hafting
  2006-01-30  8:50                                 ` Howard Chu
  2006-01-30 13:28                                 ` linux-os (Dick Johnson)
  1 sibling, 2 replies; 127+ messages in thread
From: Helge Hafting @ 2006-01-30  8:44 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

linux-os (Dick Johnson) wrote:

>To fix the current problem, you can substitute usleep(0); It will
>give the CPU to somebody if it's computable, then give it back to
>you. It seems to work in every case that sched_yield() has
>mucked up (perhaps 20 to 30 here).
>  
>
Isn't that dangerous?  Someday, someone working on linux (or some
other unixish os) might come up with an usleep implementation where
usleep(0) just returns and becomes a no-op.  Which probably is ok
with the usleep spec - it did sleep for zero time . . .

Helge Hafting

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30  8:44                               ` Helge Hafting
@ 2006-01-30  8:50                                 ` Howard Chu
  2006-01-30 15:33                                   ` Kyle Moffett
  2006-01-30 13:28                                 ` linux-os (Dick Johnson)
  1 sibling, 1 reply; 127+ messages in thread
From: Howard Chu @ 2006-01-30  8:50 UTC (permalink / raw)
  To: Helge Hafting
  Cc: linux-os (Dick Johnson),
	Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

Helge Hafting wrote:
> linux-os (Dick Johnson) wrote:
>
>> To fix the current problem, you can substitute usleep(0); It will
>> give the CPU to somebody if it's computable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>  
>>
> Isn't that dangerous?  Someday, someone working on linux (or some
> other unixish os) might come up with an usleep implementation where
> usleep(0) just returns and becomes a no-op.  Which probably is ok
> with the usleep spec - it did sleep for zero time . . .
>
We actually experimented with usleep(0) and select(...) with a zeroed 
timeval. Both of these approaches performed worse than just using 
sched_yield(), depending on the system and some other conditions. 
Dual-core AMD64 vs single-CPU had quite different behaviors. Also, if 
the slapd main event loop was using epoll() instead of select(), the
select() calls used for yields slowed down by a couple of orders of
magnitude. (A test that normally took ~30 seconds took as long as 45
minutes in one case; it was quite erratic.)

It turned out that most of those yields were leftovers inherited from
when we only supported non-preemptive threads, and simply deleting them 
was the best approach.
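
For reference, the two stand-in "yields" in question were along these
lines (a sketch, not the actual slapd code):

#include <sys/select.h>
#include <unistd.h>

static void yield_via_usleep(void)
{
	usleep(0);			/* may or may not really sleep */
}

static void yield_via_select(void)
{
	struct timeval tv = { 0, 0 };	/* zeroed timeval */

	select(0, NULL, NULL, NULL, &tv);	/* no fds: a scheduler poke */
}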

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30  8:35                   ` Helge Hafting
@ 2006-01-30 11:13                     ` Nikita Danilov
  2006-01-31 23:18                     ` David Schwartz
  1 sibling, 0 replies; 127+ messages in thread
From: Nikita Danilov @ 2006-01-30 11:13 UTC (permalink / raw)
  To: Helge Hafting
  Cc: Nikita Danilov, Howard Chu, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

Helge Hafting writes:
 > Nikita Danilov wrote:
 > 
 > >Howard Chu writes:
 > >
 > >[...]
 > >
 > > > 
 > > > A straightforward reading of the language here says the decision happens 
 > > > "when pthread_mutex_unlock() is called" and not at any later time. There 
 > > > is nothing here to support your interpretation.
 > > > >
 > > > > I think the intention of the wording is that for deterministic policies,
 > > > > it is clear that the waiting threads are actually woken and reevaluated
 > > > > for scheduling. In the case of SCHED_OTHER, it means basically nothing,
 > > > > considering the scheduling policy is arbitrary.
 > > > >
 > > > Clearly the point is that one of the waiting threads is woken and gets 
 > > > the mutex, and it doesn't matter which thread is chosen. I.e., whatever 
 > >
 > >Note that this behavior directly leads to "convoy formation": if that
 > >woken thread T0 does not immediately run (e.g., because there are higher
 > >priority threads) but still already owns the mutex, then other running
 > >threads contending for this mutex will block waiting for T0, forming a
 > >convoy.
 > >
 > I just wonder - what is the problem with this convoy formation?
 > It can only happen when the cpu is overloaded, and in that case
 > someone has to wait.  In this case, the mutex waiters. 

The obvious problem is the extra context switch: if the mutex is left
unlocked, then the first thread (say, T0) that tries to acquire it
succeeds and continues to run, whereas if the mutex is directly handed to
the runnable (but not yet running) thread T1, T0 has to block until T1
runs.

What's worse, convoys tend to grow once formed: every thread that touches
the mutex while its woken-but-not-yet-running owner waits for the CPU
joins the queue behind it.

 > 
 > Aggressively handing the cpu to whoever holds a mutex will mean the
 > mutexes are free more of the time - but it will *not* mean less waiting in
 > the system.  You just change who waits.
 > 
 > Helge Hafting

Nikita.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30  8:44                               ` Helge Hafting
  2006-01-30  8:50                                 ` Howard Chu
@ 2006-01-30 13:28                                 ` linux-os (Dick Johnson)
  2006-01-30 15:15                                   ` Helge Hafting
  1 sibling, 1 reply; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2006-01-30 13:28 UTC (permalink / raw)
  To: Helge Hafting
  Cc: Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr


On Mon, 30 Jan 2006, Helge Hafting wrote:

> linux-os (Dick Johnson) wrote:
>
>> To fix the current problem, you can substitute usleep(0); It will
>> give the CPU to somebody if it's computable, then give it back to
>> you. It seems to work in every case that sched_yield() has
>> mucked up (perhaps 20 to 30 here).
>>
>>
> Isn't that dangerous?  Someday, someone working on linux (or some
> other unixish os) might come up with an usleep implementation where
> usleep(0) just returns and becomes a no-op.  Which probably is ok
> with the usleep spec - it did sleep for zero time . . .
>
> Helge Hafting

Dangerous?? You have a product that needs to ship. You can make
it work by adding a hack. You add a hack. I don't see danger at
all. I see getting the management off the back of the software
engineers so that they can fix the code. Further, you __test__ the
stuff before you ship. If usleep(0) just spins, then you use
usleep(1).

Also, I don't think any Engineer would use threads for anything
that could be potentially dangerous anyway. You create step-by-step
ordered procedures with explicit state-machines for things that
really need to happen as written. You use threads for things that
must occur, but you don't give a damn when they occur (like updating
a window on the screen or sorting keys in a database).


Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.66 BogoMips).
Warning : 98.36% of all statistics are fiction.
.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30 13:28                                 ` linux-os (Dick Johnson)
@ 2006-01-30 15:15                                   ` Helge Hafting
  0 siblings, 0 replies; 127+ messages in thread
From: Helge Hafting @ 2006-01-30 15:15 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Howard Chu, Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

linux-os (Dick Johnson) wrote:

>On Mon, 30 Jan 2006, Helge Hafting wrote:
>
>  
>
>>linux-os (Dick Johnson) wrote:
>>
>>    
>>
>>>To fix the current problem, you can substitute usleep(0); It will
>>>give the CPU to somebody if it's computable, then give it back to
>>>you. It seems to work in every case that sched_yield() has
>>>mucked up (perhaps 20 to 30 here).
>>>
>>>
>>>      
>>>
>>Isn't that dangerous?  Someday, someone working on linux (or some
>>other unixish os) might come up with an usleep implementation where
>>usleep(0) just returns and becomes a no-op.  Which probably is ok
>>with the usleep spec - it did sleep for zero time . . .
>>    
>>
>
>Dangerous?? You have a product that needs to ship. You can make
>it work by adding a hack. You add a hack. I don't see danger at
>all. I see getting the management off the back of the software
>engineers so that they can fix the code. Further, you __test__ the
>stuff before you ship. If usleep(0) just spins, then you use
>usleep(1).
>  
>
The dangerous part was that usleep(0) works as a "yield"
today, as your testing will confirm before you ship the product.
But it may break next year if someone changes this part of
the kernel.  Then your customers suddenly have a broken product.

Helge Hafting

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30  8:50                                 ` Howard Chu
@ 2006-01-30 15:33                                   ` Kyle Moffett
  0 siblings, 0 replies; 127+ messages in thread
From: Kyle Moffett @ 2006-01-30 15:33 UTC (permalink / raw)
  To: Howard Chu
  Cc: Helge Hafting, linux-os (Dick Johnson),
	Nick Piggin, Lee Revell, Christopher Friesen,
	Linux Kernel Mailing List, hancockr

On Jan 30, 2006, at 03:50, Howard Chu wrote:
> Helge Hafting wrote:
>> linux-os (Dick Johnson) wrote:
>>> To fix the current problem, you can substitute usleep(0); It will  
>>> give the CPU to somebody if it's computable, then give it back to  
>>> you. It seems to work in every case that sched_yield() has mucked  
>>> up (perhaps 20 to 30 here).
>>
>> Isn't that dangerous?  Someday, someone working on linux (or some  
>> other unixish os) might come up with an usleep implementation  
>> where usleep(0) just returns and becomes a no-op.  Which probably  
>> is ok with the usleep spec - it did sleep for zero time . . .
>
> We actually experimented with usleep(0) and select(...) with a  
> zeroed timeval. Both of these approaches performed worse than just  
> using sched_yield(), depending on the system and some other  
> conditions. Dual-core AMD64 vs single-CPU had quite different  
> behaviors. Also, if the slapd main event loop was using epoll()  
> instead of select(), the selects used for yields slowed down by a  
> couple orders of magnitude. (A test that normally took ~30 seconds  
> took as long as 45 minutes in one case; it was quite erratic.)
>
> It turned out that most of those yields were leftovers inherited  
> from when we only supported non-preemptive threads, and simply  
> deleting them was the best approach.

I would argue that in a non-realtime environment sched_yield() is not  
useful at all.  When you want to wait for another process, you wait  
explicitly for that process using one of the various POSIX-defined  
methods, such as mutexes, condition variables, etc.  There are very  
clearly and thoroughly defined ways to wait for other processes to  
complete work; why rely on usleep(0) giving the CPU to some other task  
when you can explicitly tell the scheduler "I am waiting for task foo  
to release this mutex" or "I can't run until somebody signals this  
condition variable"?
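
For example, a minimal sketch (names invented, not from any real 
program) of the explicit-wait pattern:

        #include <pthread.h>

        static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
        static int work_done;

        /* blocks until signaled, instead of spinning on
         * sched_yield()/usleep(0) */
        static void *waiter(void *arg)
        {
                pthread_mutex_lock(&lock);
                while (!work_done)
                        pthread_cond_wait(&ready, &lock);
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        /* publishes the result and wakes the waiter */
        static void *worker(void *arg)
        {
                pthread_mutex_lock(&lock);
                work_done = 1;
                pthread_cond_signal(&ready);
                pthread_mutex_unlock(&lock);
                return NULL;
        }

The waiter burns no CPU until the worker actually signals it, and no 
yield is needed anywhere.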

Cheers,
Kyle Moffett

--
Unix was not designed to stop people from doing stupid things,  
because that would also stop them from doing clever things.
   -- Doug Gwyn



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Can I do a regular read to simulate prefetch instruction?
       [not found]             ` <4807377b0601271404w6dbfcff6s4de1c3f785dded9f@mail.gmail.com>
@ 2006-01-30 17:25               ` John Smith
  0 siblings, 0 replies; 127+ messages in thread
From: John Smith @ 2006-01-30 17:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi,

I've found that some network card drivers (e.g. the e1000 driver) use the
prefetch instruction to reduce memory access latency and speed up data
operations. My question is: suppose we want to pre-read an skb buffer
into the cache, what is the difference between the following two methods,
i.e. how does using prefetch differ from using a regular read operation?
1. use the prefetch instruction to trigger a pre-fetch of the skb address,
    e.g. prefetch(skb);
2. use an assignment statement to trigger a pre-fetch of the skb address,
    e.g. skb1 = skb;

I was told the data will be prefetched into a so-called prefetching queue
only when the prefetch instruction is used. Is this true?
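
To make the two methods concrete, here is a simplified sketch (not taken
from any real driver):

        #include <linux/prefetch.h>
        #include <linux/skbuff.h>

        static void example(struct sk_buff *skb)
        {
                struct sk_buff *skb1;

                /* method 1: emits a prefetch hint; the CPU starts
                 * pulling the cache line at skb into cache without
                 * stalling execution */
                prefetch(skb);

                /* method 2: copies only the pointer value; as written
                 * this never dereferences skb, so no data is fetched */
                skb1 = skb;
                (void)skb1;
        }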

Thanks,

John



^ permalink raw reply	[flat|nested] 127+ messages in thread

* RE: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-30  8:35                   ` Helge Hafting
  2006-01-30 11:13                     ` Nikita Danilov
@ 2006-01-31 23:18                     ` David Schwartz
  1 sibling, 0 replies; 127+ messages in thread
From: David Schwartz @ 2006-01-31 23:18 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Howard Chu, Linux Kernel Mailing List


> I just wonder - what is the problem with this convoy formation?
> It can only happen when the cpu is overloaded, and in that case
> someone has to wait.  In this case, the mutex waiters. 

	The problem is that you need to become more efficient as load increases, not less. If you instead get less efficient as load increases, you can get into a situation where, even though the load is at a level you could normally handle, you never catch up on the backlog that built up during the spike.
 
> Aggressively handing the cpu to whoever holds a mutex will mean the
> mutexes are free more of the time - but it will *not* mean less waiting in
> the system.  You just change who waits.

	It will mean fewer context switches and more effective use of caches as load increases. Even a very small amount of "gets more efficient as load goes up" can mean the difference between a system that handles load spikes smoothly (with a temporary reduction in responsiveness) and a system that backs up in a load spike and never recovers (with a permanently increasing reduction in responsiveness even with load that's normally tolerable).

	As load goes up, you need your threads to use more of their timeslice. This means not descheduling a running thread unless it is unavoidable.

	DS



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow)
  2006-01-27  8:08                                     ` Howard Chu
  2006-01-27 19:25                                       ` Philipp Matthias Hahn
@ 2006-02-01 12:31                                       ` Nick Piggin
  1 sibling, 0 replies; 127+ messages in thread
From: Nick Piggin @ 2006-02-01 12:31 UTC (permalink / raw)
  To: Howard Chu; +Cc: davids, Linux Kernel Mailing List

Howard Chu wrote:
> Nick Piggin wrote:
>> Howard Chu wrote:
>>
>>>
>>> And again in this case, A should not be immediately reacquiring the 
>>> lock if it doesn't actually need it.
>>>
>>
>> No, not immediately, I said "for a very long time". As in: A does not
>> need the exclusion provided by the lock for a very long time so it
>> drops it to avoid needless contention, then reacquires it when it finally
>> does need the lock.
> 
> 
> OK. I think this is really a separate situation. Just to recap: A takes 
> lock, does some work, releases lock, a very long time passes, then A 
> takes the lock again. In the "time passes" part, that mutex could be 
> locked and unlocked any number of times by other threads and A won't 
> know or care. Particularly on an SMP machine, other threads that were 
> blocked on that mutex could do useful work in the interim without 
> impacting A's progress at all. So here, when A leaves the mutex unlocked 
> for a long time, it's desirable to give the mutex to one of the waiters 
> ASAP.
> 

But how do you quantify "a long time"? And what happens if process A has
a very high priority, such that nothing else is allowed to run?

>> Just accept that my described scenario is legitimate then consider it in
>> isolation rather than getting caught up in the superfluous details of how
>> such a situation might come about.
> 
> 
> OK. I'm not trying to be difficult here. In much of life, context is 
> everything; very little can be understood in isolation.
> 

OK, but other valid examples were offered up - lock inversion avoidance,
and externally driven systems (ie. where it is not known which lock will
be taken next).

> Back to the scenario:
> 
>> A realtime system with tasks A and B, A has an RT scheduling priority of
>> 1, and B is 2. A and B are both runnable, so A is running. A takes a 
>> mutex
>> then sleeps, B runs and ends up blocked on the mutex. A wakes up and at
>> some point it drops the mutex and then tries to take it again.
>>
>> What happens?
> 
> 
> As I understand the spec, A must block because B has acquired the mutex. 
> Once again, the SUS discussion of priority inheritance would never need 
> to have been written if this were not the case:
> 
>  >>>
> In a priority-driven environment, a direct use of traditional primitives 
> like mutexes and condition variables can lead to unbounded priority 
> inversion, where a higher priority thread can be blocked by a lower 
> priority thread, or set of threads, for an unbounded duration of time. 
> As a result, it becomes impossible to guarantee thread deadlines. 
> Priority inversion can be bounded and minimized by the use of priority 
> inheritance protocols. This allows thread deadlines to be guaranteed 
> even in the presence of synchronization requirements.
> <<<
> 
> The very first sentence indicates that a higher priority thread can be 
> blocked by a lower priority thread. If your interpretation of the spec 
> were correct, then such an instance would never occur. Since your 

Wrong. It will obviously occur if the lower priority process is able
to take a lock before a higher priority process.

The situation will not exist in "the scenario" though, if we follow
my reading of the spec, because *the scheduler* determines the next
process to gain the mutex. This makes perfect sense to me.

> scenario is using realtime threads, then we can assume that the Priority 
> Ceiling feature is present and you can use it if needed. ( 
> http://www.opengroup.org/onlinepubs/000095399/xrat/xsh_chap02.html#tag_03_02_09_06 
> Realtime Threads option group )
> 

Any kind of priority boost / inheritance like this is orthogonal to
the issue. They still do not prevent B from acquiring the mutex and
thereby blocking the execution of the higher priority A. I think this
is against the spirit of the spec, especially the part where it says
*the scheduler* will choose which process to gain the lock.
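
(For reference, a minimal sketch of requesting such a protocol with the
POSIX API, assuming the implementation actually supports the priority
inheritance / protection options:)

        #include <pthread.h>

        static int make_pi_mutex(pthread_mutex_t *m)
        {
                pthread_mutexattr_t attr;
                int err;

                pthread_mutexattr_init(&attr);
                /* the owner inherits the priority of its highest-priority
                 * waiter; the ceiling variant would use PTHREAD_PRIO_PROTECT
                 * plus pthread_mutexattr_setprioceiling() instead */
                err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
                if (!err)
                        err = pthread_mutex_init(m, &attr);
                pthread_mutexattr_destroy(&attr);
                return err;
        }

Either way, this only bounds the inversion once B owns the mutex; it
does not stop B from acquiring it first.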

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: e100 oops on resume
  2006-01-28 19:53               ` Jesse Brandeburg
@ 2006-02-07  6:57                 ` Jeff Garzik
  0 siblings, 0 replies; 127+ messages in thread
From: Jeff Garzik @ 2006-02-07  6:57 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Stefan Seyfried, Olaf Kirch, Linux Kernel Mailing List, netdev,
	Jesse Brandeburg, Jeff Kirsher

Jesse Brandeburg wrote:
> On 1/28/06, Mattia Dongili <malattia@linux.it> wrote:
> 
>>On Thu, Jan 26, 2006 at 08:02:37PM +0100, Stefan Seyfried wrote:
>>
>>>On Wed, Jan 25, 2006 at 04:28:48PM -0800, Jesse Brandeburg wrote:
>>>
>>>
>>>>Okay I reproduced the issue on 2.6.15.1 (with S1 sleep) and was able
>>>>to show that my patch that just removes e100_init_hw works okay for
>>>>me.  Let me know how it goes for you, I think this is a good fix.
>>>
>>>worked for me in the Compaq Armada e500 and reportedly also fixed the
>>>SONY that originally uncovered it.
>>
>>confirmed here too. The patch fixes S3 resume on this Sony (GR7/K)
>>running 2.6.16-rc1-mm3.
> 
> 
> excellent news! thanks for testing.
> 
> Jeff, could you please apply to 2.6.16-rcX
> 
> Jesse

SIGH.  In your last patch submission you had it right, but Intel has yet 
again regressed in patch submission form.

Your fixes will be expedited if they can be applied by script, and then 
quickly whisked upstream to Linus/Andrew.  This one had to be applied by 
hand (so yes, it's applied) for several reasons:

* Unreviewable in mail reader, due to MIME type application/octet-stream.

* In general, never use MIME (attachments), they decrease the audience 
that can easily review your patch.

* Your patch's description and signed-off-by were buried inside the 
octet-stream attachment.

* Please review http://linux.yyz.us/patch-format.html  (I probably 
should add MIME admonitions to that)

	Jeff



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found] <5uZqb-4fo-15@gated-at.bofh.it>
@ 2006-01-14 22:47 ` Robert Hancock
  0 siblings, 0 replies; 127+ messages in thread
From: Robert Hancock @ 2006-01-14 22:47 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> POSIX requires a reschedule to occur, as noted here:
> http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/

No, it doesn't:

> 
> The relevant SUSv3 text is here
> http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html 

"If there are threads blocked on the mutex object referenced by mutex 
when pthread_mutex_unlock() is called, resulting in the mutex becoming 
available, the scheduling policy shall determine which thread shall 
acquire the mutex."

This says nothing about requiring a reschedule. The "scheduling policy" 
can well decide that the thread which just released the mutex can 
re-acquire it.

> I suppose if pthread_mutex_unlock() actually behaved correctly we could 
> remove the other sched_yield() hacks that didn't belong there in the 
> first place and go on our merry way.

Generally, needing to implement hacks like this is a sign that there are 
problems with the synchronization design of the code (like a mutex which 
has excessive contention). Programs should not rely on the scheduling 
behavior of the kernel for proper operation when that behavior is not 
defined.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
@ 2006-01-14 19:29 Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2006-01-14 19:29 UTC (permalink / raw)
  To: linux-kernel

Resurrecting a dead horse...
> From: Lee Revell
> Date: Sat Aug 20 2005 - 15:57:36 EST
>
> ------------------------------------------------------------------------
> On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> > But I also found that I needed to add a new
> > yield(), to work around yet another unexpected issue on this system -
> > we have a number of threads waiting on a condition variable, and the
> > thread holding the mutex signals the var, unlocks the mutex, and then
> > immediately relocks it. The expectation here is that upon unlocking
> > the mutex, the calling thread would block while some waiting thread
> > (that just got signaled) would get to run. In fact what happened is
> > that the calling thread unlocked and relocked the mutex without
> > allowing any of the waiting threads to run. In this case the only
> > solution was to insert a yield() after the mutex_unlock().
>
> That's exactly the behavior I would expect. Why would you expect
> unlocking a mutex to cause a reschedule, if the calling thread still has
> timeslice left?
>
> Lee

POSIX requires a reschedule to occur, as noted here:
http://blog.firetree.net/2005/06/22/thread-yield-after-mutex-unlock/

The relevant SUSv3 text is here
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_mutex_unlock.html

I suppose if pthread_mutex_unlock() actually behaved correctly we could 
remove the other sched_yield() hacks that didn't belong there in the 
first place and go on our merry way.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-23 12:07               ` Denis Vlasenko
@ 2005-08-24  3:37                 ` Lincoln Dale
  0 siblings, 0 replies; 127+ messages in thread
From: Lincoln Dale @ 2005-08-24  3:37 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-os (Dick Johnson), Robert Hancock, linux-kernel

Denis Vlasenko wrote:

>>>This is what I would expect if run on an otherwise idle machine.
>>>sched_yield just puts you at the back of the line for runnable
>>>processes, it doesn't magically cause you to go to sleep somehow.
>>>      
>>>
>>When a kernel build is occurring??? Plus `top` itself.... It damn
>>well sleep while giving up the CPU. If it doesn't it's broken.
>>    
>>
Unless you have all of the kernel source in the buffer cache, a 
concurrent kernel build will spend a fair bit of time in io_wait state ..
as such it's perfectly plausible that sched_yield keeps popping back to 
the top of 'runnable' processes . . .


cheers,

lincoln.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-23 11:17             ` linux-os (Dick Johnson)
@ 2005-08-23 12:07               ` Denis Vlasenko
  2005-08-24  3:37                 ` Lincoln Dale
  0 siblings, 1 reply; 127+ messages in thread
From: Denis Vlasenko @ 2005-08-23 12:07 UTC (permalink / raw)
  To: linux-os (Dick Johnson), Robert Hancock; +Cc: linux-kernel

On Tuesday 23 August 2005 14:17, linux-os (Dick Johnson) wrote:
> 
> On Mon, 22 Aug 2005, Robert Hancock wrote:
> 
> > linux-os (Dick Johnson) wrote:
> >> I reported that sched_yield() wasn't working (at least as expected)
> >> back in March of 2004.
> >>
> >>  		for(;;)
> >>                      sched_yield();
> >>
> >> ... takes 100% CPU time as reported by `top`. It should take
> >> practically 0. Somebody said that this was because `top` was
> >> broken, others said that it was because I didn't know how to
> >> code. Nevertheless, the problem was not fixed, even after
> >> scheduler changes were made for the current version.
> >
> > This is what I would expect if run on an otherwise idle machine.
> > sched_yield just puts you at the back of the line for runnable
> > processes, it doesn't magically cause you to go to sleep somehow.
> >
> 
> When a kernel build is occurring??? Plus `top` itself.... It damn
> well should sleep while giving up the CPU. If it doesn't, it's broken.

top doesn't run all the time:

# strace -o top.strace -tt top

14:52:19.407958 write(1, "  758 root      16   0   104   2"..., 79) = 79
14:52:19.408318 write(1, "  759 root      16   0   100   1"..., 79) = 79
14:52:19.408659 write(1, "  760 root      16   0   100   1"..., 79) = 79
14:52:19.409001 write(1, "  761 root      18   0  2604  39"..., 74) = 74
14:52:19.409342 write(1, "  763 daemon    17   0   108   1"..., 78) = 78
14:52:19.409672 write(1, "  773 root      16   0   104   2"..., 79) = 79
14:52:19.410010 write(1, "  774 root      16   0   104   2"..., 79) = 79
14:52:19.410362 write(1, "  775 root      16   0   100   1"..., 79) = 79
14:52:19.410692 write(1, "  776 root      16   0   104   2"..., 79) = 79
14:52:19.411136 write(1, "  777 daemon    17   0   108   1"..., 86) = 86
14:52:19.411505 select(1, [0], NULL, NULL, {5, 0}) = 0 (Timeout)
	hrrr..... psssss.......
14:52:24.411744 time([1124797944])      = 1124797944
14:52:24.411883 lseek(4, 0, SEEK_SET)   = 0
14:52:24.411957 read(4, "24822.01 18801.28\n", 1023) = 18
14:52:24.412082 access("/var/run/utmpx", F_OK) = -1 ENOENT (No such file or directory)
14:52:24.412224 open("/var/run/utmp", O_RDWR) = 8
14:52:24.412328 fcntl64(8, F_GETFD)     = 0
14:52:24.412399 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
14:52:24.412467 _llseek(8, 0, [0], SEEK_SET) = 0
14:52:24.412556 alarm(0)                = 0
14:52:24.412643 rt_sigaction(SIGALRM, {0x4015a57c, [], SA_RESTORER, 0x40094ae8}, {SIG_DFL}, 8) = 0
14:52:24.412747 alarm(1)                = 0

However, a kernel compile shouldn't.

I suggest stracing a "for(;;) yield();" test proggy with -tt, with and
without a kernel compile in parallel, and comparing the output...

Hmm... actually, knowing that you will argue to death instead...

# cat t.c
#include <sched.h>

int main() {
    for(;;) sched_yield();
    return 0;
}
# gcc t.c
# strace -tt ./a.out
...
15:03:41.211324 sched_yield()           = 0
15:03:41.211673 sched_yield()           = 0
15:03:41.212034 sched_yield()           = 0
15:03:41.212400 sched_yield()           = 0
15:03:41.212749 sched_yield()           = 0
15:03:41.213126 sched_yield()           = 0
15:03:41.213486 sched_yield()           = 0
15:03:41.213835 sched_yield()           = 0
15:03:41.214220 sched_yield()           = 0
15:03:41.214577 sched_yield()           = 0
15:03:41.214939 sched_yield()           = 0
    I start "while true; do true; done" on another console...
15:03:43.314645 sched_yield()           = 0
15:03:43.847644 sched_yield()           = 0
15:03:43.954635 sched_yield()           = 0
15:03:44.063798 sched_yield()           = 0
15:03:44.171596 sched_yield()           = 0
15:03:44.282624 sched_yield()           = 0
15:03:44.391632 sched_yield()           = 0
15:03:44.498609 sched_yield()           = 0
15:03:44.605584 sched_yield()           = 0
15:03:44.712538 sched_yield()           = 0
15:03:44.819557 sched_yield()           = 0
15:03:44.928594 sched_yield()           = 0
15:03:45.040603 sched_yield()           = 0
15:03:45.148545 sched_yield()           = 0
15:03:45.259311 sched_yield()           = 0
15:03:45.368563 sched_yield()           = 0
15:03:45.476482 sched_yield()           = 0
15:03:45.583568 sched_yield()           = 0
15:03:45.690491 sched_yield()           = 0
15:03:45.797512 sched_yield()           = 0
15:03:45.906534 sched_yield()           = 0
15:03:46.013545 sched_yield()           = 0
15:03:46.120505 sched_yield()           = 0
Ctrl-C

# uname -a
Linux firebird 2.6.12-r4 #1 SMP Sun Jul 17 13:51:47 EEST 2005 i686 unknown unknown GNU/Linux
--
vda


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 14:26           ` Robert Hancock
@ 2005-08-23 11:17             ` linux-os (Dick Johnson)
  2005-08-23 12:07               ` Denis Vlasenko
  0 siblings, 1 reply; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-23 11:17 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel


On Mon, 22 Aug 2005, Robert Hancock wrote:

> linux-os (Dick Johnson) wrote:
>> I reported that sched_yield() wasn't working (at least as expected)
>> back in March of 2004.
>>
>>  		for(;;)
>>                      sched_yield();
>>
>> ... takes 100% CPU time as reported by `top`. It should take
>> practically 0. Somebody said that this was because `top` was
>> broken, others said that it was because I didn't know how to
>> code. Nevertheless, the problem was not fixed, even after
>> scheduler changes were made for the current version.
>
> This is what I would expect if run on an otherwise idle machine.
> sched_yield just puts you at the back of the line for runnable
> processes, it doesn't magically cause you to go to sleep somehow.
>

When a kernel build is occurring??? Plus `top` itself.... It damn
well should sleep while giving up the CPU. If it doesn't, it's broken.

> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from hancockr@nospamshaw.ca
> Home Page: http://www.roberthancock.com/
>


Cheers,
Dick Johnson
Penguin : Linux version 2.6.12.5 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 13:20           ` Florian Weimer
@ 2005-08-22 23:19             ` Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-22 23:19 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andi Kleen, linux-kernel

Florian Weimer wrote:
>  * Howard Chu:
> > That's not the complete story. BerkeleyDB provides a
> > db_env_set_func_yield() hook to tell it what yield function it
> > should use when its internal locking routines need such a function.
> > If you don't set a specific hook, it just uses sleep(). The
> > OpenLDAP backend will invoke this hook during some (not necessarily
> > all) init sequences, to tell it to use the thread yield function
> > that we selected in autoconf.

>  And this helps to increase performance substantially?

When the caller is a threaded program, yes, there is a substantial 
(measurable and noticeable) difference. Given that sleep() blocks the 
entire process, the difference is obvious.

> > Note that (on systems that support inter-process mutexes) a
> > BerkeleyDB database environment may be used by multiple processes
> > concurrently.
>
>  Yes, I know this, and I haven't experienced that much trouble with
>  deadlocks.  Maybe the way you structure and access the database
>  environment can be optimized for deadlock avoidance?

Maybe we already did this deadlock analysis and optimization, years ago 
when we first started developing this backend? Do you think everyone 
else in the world is a total fool?

> > As such, the yield function that is provided must work both for
> > threads within a single process (PTHREAD_SCOPE_PROCESS) as well as
> > between processes (PTHREAD_SCOPE_SYSTEM).

>  If I understand you correctly, what you really need is a syscall
>  along the lines "don't run me again until all threads T that share
>  property X have run, where the Ts aren't necessarily in the same
>  process".  The kernel is psychic, it can't really know which
>  processes to schedule to satisfy such a requirement.  I don't even
>  think "has joined the Berkeley DB environment" is the desired
>  property, but something like "is part of this cycle in the wait-for
>  graph" or something similar.

You seem to believe we're looking for special treatment for the 
processes we're concerned with, and that's not true. If the system is 
busy with other processes, so be it, the system is busy. If you want 
better performance, you build a dedicated server and don't let anything 
else make the system busy. This is the way mission-critical services are 
delivered, regardless of the service. If you're not running on a 
dedicated system, then your deployment must not be mission critical, and 
so you shouldn't be surprised if a large gcc run slows down some other 
activities in the meantime. If you have a large nice'd job running 
before your normal priority jobs get their timeslice, then you should 
certainly wonder wtf the scheduler is doing, and why your system even 
claims to support nice() when clearly it doesn't mean anything on that 
system.

>  I would have to check the Berkeley DB internals in order to tell what
>  is feasible to implement.  This code shouldn't be on the fast path,
>  so some kernel-based synchronization is probably sufficient.

pthread_cond_wait() probably would be just fine here, but BerkeleyDB 
doesn't work that way.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 13:06           ` Andi Kleen
@ 2005-08-22 18:47             ` Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-22 18:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Florian Weimer, linux-kernel

Andi Kleen wrote:
> > processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd
> > only needing to yield within a single process is inaccurate; since
> > we allow slapcat to run concurrently with slapd (to allow hot
> > backups) we need BerkeleyDB's locking/yield functions to work in
> > System scope.

>  That's broken by design - it means you can be arbitrarily starved by
>  other processes running in parallel. You are basically assuming your
>  application is the only thing running on the system which is wrong.
>  Also there are enough synchronization primitives that can synchronize
>  multiple processes without making such broken assumptions.

Again, I think you overstate the problem. "Arbitrarily starved by other 
processes" implies that the process scheduler will do a poor job and 
will allow the slapd process to be starved. We do not assume we're the 
only app on the system, we just assume that eventually we will get the 
CPU back. If that's not a valid assumption, then there is something 
wrong with the underlying system environment.

Something you ought to keep in mind - correctness and compliance are 
well and good, but worthless if the end result isn't useful. Windows NT 
has a POSIX-compliant subsystem but it is utterly useless. That's what 
you wind up with when all you do is conform to the letter of the spec.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22 11:44         ` linux-os (Dick Johnson)
@ 2005-08-22 14:26           ` Robert Hancock
  2005-08-23 11:17             ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 127+ messages in thread
From: Robert Hancock @ 2005-08-22 14:26 UTC (permalink / raw)
  To: linux-kernel

linux-os (Dick Johnson) wrote:
> I reported that sched_yield() wasn't working (at least as expected)
> back in March of 2004.
> 
>  		for(;;)
>                      sched_yield();
> 
> ... takes 100% CPU time as reported by `top`. It should take
> practically 0. Somebody said that this was because `top` was
> broken, others said that it was because I didn't know how to
> code. Nevertheless, the problem was not fixed, even after
> scheduler changes were made for the current version.

This is what I would expect if run on an otherwise idle machine. 
sched_yield just puts you at the back of the line for runnable 
processes, it doesn't magically cause you to go to sleep somehow.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22  5:09         ` Howard Chu
  2005-08-22 13:06           ` Andi Kleen
@ 2005-08-22 13:20           ` Florian Weimer
  2005-08-22 23:19             ` Howard Chu
  1 sibling, 1 reply; 127+ messages in thread
From: Florian Weimer @ 2005-08-22 13:20 UTC (permalink / raw)
  To: Howard Chu; +Cc: Andi Kleen, linux-kernel

* Howard Chu:

>>> Has anybody contacted the Sleepycat people with a description of
>>> the problem yet?

>> Berkeley DB does not call sched_yield, but OpenLDAP does in some
>> wrapper code around the Berkeley DB backend.

> That's not the complete story. BerkeleyDB provides a 
> db_env_set_func_yield() hook to tell it what yield function it should 
> use when its internal locking routines need such a function. If you 
> don't set a specific hook, it just uses sleep(). The OpenLDAP backend 
> will invoke this hook during some (not necessarily all) init sequences, 
> to tell it to use the thread yield function that we selected in autoconf.

And this helps to increase performance substantially?

> Note that (on systems that support inter-process mutexes) a BerkeleyDB 
> database environment may be used by multiple processes concurrently.

Yes, I know this, and I haven't experienced that much trouble with
deadlocks.  Maybe the way you structure and access the database
environment can be optimized for deadlock avoidance?

> As such, the yield function that is provided must work both for
> threads within a single process (PTHREAD_SCOPE_PROCESS) as well as
> between processes (PTHREAD_SCOPE_SYSTEM).

If I understand you correctly, what you really need is a syscall along
the lines "don't run me again until all threads T that share property
X have run, where the Ts aren't necessarily in the same process".  The
kernel isn't psychic, it can't really know which processes to schedule to
satisfy such a requirement.  I don't even think "has joined the
Berkeley DB environment" is the desired property, but something like
"is part of this cycle in the wait-for graph" or something similar.

I would have to check the Berkeley DB internals in order to tell what
is feasible to implement.  This code shouldn't be on the fast path, so
some kernel-based synchronization is probably sufficient.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-22  5:09         ` Howard Chu
@ 2005-08-22 13:06           ` Andi Kleen
  2005-08-22 18:47             ` Howard Chu
  2005-08-22 13:20           ` Florian Weimer
  1 sibling, 1 reply; 127+ messages in thread
From: Andi Kleen @ 2005-08-22 13:06 UTC (permalink / raw)
  To: Howard Chu; +Cc: Florian Weimer, Andi Kleen, linux-kernel

> processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd only 
> needing to yield within a single process is inaccurate; since we allow 
> slapcat to run concurrently with slapd (to allow hot backups) we need 
> BerkeleyDB's locking/yield functions to work in System scope.

That's broken by design - it means you can be arbitrarily starved 
by other processes running in parallel. You are basically assuming
your application is the only thing running on the system
which is wrong. Also there are enough synchronization primitives
that can synchronize multiple processes without making
such broken assumptions.

-Andi


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21  1:04       ` Robert Hancock
@ 2005-08-22 11:44         ` linux-os (Dick Johnson)
  2005-08-22 14:26           ` Robert Hancock
  0 siblings, 1 reply; 127+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-22 11:44 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel


On Sat, 20 Aug 2005, Robert Hancock wrote:

> Howard Chu wrote:
>> I'll note that we removed a number of the yield calls (that were in
>> OpenLDAP 2.2) for the 2.3 release, because I found that they were
>> redundant and causing unnecessary delays. My own test system is running
>> on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro),
>> and OpenLDAP 2.3 runs perfectly well here, now that those redundant
>> calls have been removed. But I also found that I needed to add a new
>> yield(), to work around yet another unexpected issue on this system - we
>> have a number of threads waiting on a condition variable, and the thread
>> holding the mutex signals the var, unlocks the mutex, and then
>> immediately relocks it. The expectation here is that upon unlocking the
>> mutex, the calling thread would block while some waiting thread (that
>> just got signaled) would get to run. In fact what happened is that the
>> calling thread unlocked and relocked the mutex without allowing any of
>> the waiting threads to run. In this case the only solution was to insert
>> a yield() after the mutex_unlock(). So again, for those of you claiming
>> "oh, all you need to do is use a condition variable or any of the other
>> POSIX synchronization primitives" - yes, that's a nice theory, but
>> reality says otherwise.
>
> I encountered a similar issue with some software that I wrote, and used
> a similar workaround, however this was basically because there wasn't
> enough time available at the time to redesign things to work properly.
> The problem here is essentially caused by the fact that the mutex is
> being locked for an excessively large proportion of the time and not
> letting other threads in. In the case I am thinking of, posting the
> messages to the thread that was hogging the mutex via a signaling queue
> would have been a better solution than using yield and having correct
> operation depend on undefined parts of thread scheduling behavior..
>
> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from hancockr@nospamshaw.ca
> Home Page: http://www.roberthancock.com/
>

I reported that sched_yield() wasn't working (at least as expected)
back in March of 2004.

 		for(;;)
                     sched_yield();

... takes 100% CPU time as reported by `top`. It should take
practically 0. Somebody said that this was because `top` was
broken, others said that it was because I didn't know how to
code. Nevertheless, the problem was not fixed, even after
scheduler changes were made for the current version.

  One can execute:

 		usleep(0);
... instead of:
 		sched_yield();

... and Linux then performs exactly like other Unixes when
code is waiting on mutexes.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.12.5 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21 11:33           ` Nikita Danilov
@ 2005-08-22  8:06             ` Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-22  8:06 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Nikita Danilov wrote:
> Howard Chu writes:
>
>  > That's beside the point. Folks are making an assertion that
>  > sched_yield() is meaningless; this example demonstrates that there are
>  > cases where sched_yield() is essential.
>
> It is not essential, it is non-portable.
>
> Code you described is based on non-portable "expectations" about thread
> scheduling. Linux implementation of pthreads fails to satisfy
> them. Perfectly reasonable. Code is then "fixed" by adding sched_yield()
> calls and introducing more non-portable assumptions. Again, there is no
> guarantee this would work on any compliant implementation.
>
> While "intuitive" semantics of sched_yield() is to yield CPU and to give
> other runnable threads their chance to run, this is _not_ what standard
> prescribes (for non-RT threads).
>   
Very well; it is not prescribed in the standard and it is non-portable. 
Our code is broken and we will fix it.

But even Dave Butenhof, Mr. Pthreads himself, has said it is reasonable 
to expect sched_yield to yield the CPU. That's what pthread_yield did in 
Pthreads Draft 4 (DCE threads) and it is common knowledge that 
sched_yield is a direct replacement for pthread_yield; i.e., 
pthread_yield() was deleted from the spec because sched_yield fulfilled 
its purpose. Now you're saying "well, technically, sched_yield doesn't 
have to do anything at all" and the letter of the spec supports your 
position, but anybody who's been programming with pthreads since the DCE 
days "knows" that is not the original intention. I wonder that nobody 
has decided to raise this issue with the IEEE/POSIX group and get them 
to issue a correction/clarification in all this time, since the absence 
of specification here really impairs the usefulness of the spec.

Likewise the fact that sched_yield() can now cause the current process 
to be queued behind other processes seems suspect, unless we know for 
sure that the threads are running with PTHREAD_SCOPE_SYSTEM. (I haven't 
checked to see if PTHREAD_SCOPE_PROCESS is still supported in NPTL.)

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-21 19:47       ` Florian Weimer
@ 2005-08-22  5:09         ` Howard Chu
  2005-08-22 13:06           ` Andi Kleen
  2005-08-22 13:20           ` Florian Weimer
  0 siblings, 2 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-22  5:09 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andi Kleen, linux-kernel

Florian Weimer wrote:
> * Andi Kleen:
>
>   
>> Has anybody contacted the Sleepycat people with a description of the 
>> problem yet?
>>     
>
> Berkeley DB does not call sched_yield, but OpenLDAP does in some
> wrapper code around the Berkeley DB backend.
That's not the complete story. BerkeleyDB provides a 
db_env_set_func_yield() hook to tell it what yield function it should 
use when its internal locking routines need such a function. If you 
don't set a specific hook, it just uses sleep(). The OpenLDAP backend 
will invoke this hook during some (not necessarily all) init sequences, 
to tell it to use the thread yield function that we selected in autoconf.

Note that (on systems that support inter-process mutexes) a BerkeleyDB 
database environment may be used by multiple processes concurrently. As 
such, the yield function that is provided must work both for threads 
within a single process (PTHREAD_SCOPE_PROCESS) as well as between 
processes (PTHREAD_SCOPE_SYSTEM). The previous comment about slapd only 
needing to yield within a single process is inaccurate; since we allow 
slapcat to run concurrently with slapd (to allow hot backups) we need 
BerkeleyDB's locking/yield functions to work in System scope.
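
(For illustration: the kernel-backed way to share a single pthread mutex 
between unrelated processes is a process-shared mutex in shared memory. 
A minimal sketch, with the shm name invented; BerkeleyDB's hook 
mechanism is a different interface, this is just the general idiom:)

        #include <fcntl.h>
        #include <pthread.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* map one pthread_mutex_t into named shared memory; error
         * handling omitted, and only the first process should run
         * the init steps */
        static pthread_mutex_t *shared_mutex_create(void)
        {
                pthread_mutexattr_t attr;
                pthread_mutex_t *m;
                int fd;

                fd = shm_open("/example-lock", O_CREAT | O_RDWR, 0600);
                ftruncate(fd, sizeof(*m));
                m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

                pthread_mutexattr_init(&attr);
                pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
                pthread_mutex_init(m, &attr);
                pthread_mutexattr_destroy(&attr);
                return m;
        }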

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 13:48     ` Andi Kleen
@ 2005-08-21 19:47       ` Florian Weimer
  2005-08-22  5:09         ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Florian Weimer @ 2005-08-21 19:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Howard Chu, linux-kernel

* Andi Kleen:

> Has anybody contacted the Sleepycat people with a description of the 
> problem yet?

Berkeley DB does not call sched_yield, but OpenLDAP does in some
wrapper code around the Berkeley DB backend.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 21:24         ` Howard Chu
  2005-08-21  0:36           ` Nick Piggin
@ 2005-08-21 11:33           ` Nikita Danilov
  2005-08-22  8:06             ` Howard Chu
  1 sibling, 1 reply; 127+ messages in thread
From: Nikita Danilov @ 2005-08-21 11:33 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Howard Chu writes:
 > Lee Revell wrote:
 > >  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
 > > > But I also found that I needed to add a new yield(), to work around
 > > > yet another unexpected issue on this system - we have a number of
 > > > threads waiting on a condition variable, and the thread holding the
 > > > mutex signals the var, unlocks the mutex, and then immediately
 > > > relocks it. The expectation here is that upon unlocking the mutex,
 > > > the calling thread would block while some waiting thread (that just
 > > > got signaled) would get to run. In fact what happened is that the
 > > > calling thread unlocked and relocked the mutex without allowing any
 > > > of the waiting threads to run. In this case the only solution was
 > > > to insert a yield() after the mutex_unlock().
 > >
 > >  That's exactly the behavior I would expect.  Why would you expect
 > >  unlocking a mutex to cause a reschedule, if the calling thread still
 > >  has timeslice left?
 >
 > That's beside the point. Folks are making an assertion that
 > sched_yield() is meaningless; this example demonstrates that there are
 > cases where sched_yield() is essential.

It is not essential, it is non-portable.

Code you described is based on non-portable "expectations" about thread
scheduling. Linux implementation of pthreads fails to satisfy
them. Perfectly reasonable. Code is then "fixed" by adding sched_yield()
calls and introducing more non-portable assumptions. Again, there is no
guarantee this would work on any compliant implementation.

While "intuitive" semantics of sched_yield() is to yield CPU and to give
other runnable threads their chance to run, this is _not_ what standard
prescribes (for non-RT threads).

 >
 > --
 >   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
  2005-08-20 21:50       ` Lee Revell
@ 2005-08-21  1:04       ` Robert Hancock
  2005-08-22 11:44         ` linux-os (Dick Johnson)
  2 siblings, 1 reply; 127+ messages in thread
From: Robert Hancock @ 2005-08-21  1:04 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> I'll note that we removed a number of the yield calls (that were in 
> OpenLDAP 2.2) for the 2.3 release, because I found that they were 
> redundant and causing unnecessary delays. My own test system is running 
> on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro), 
> and OpenLDAP 2.3 runs perfectly well here, now that those redundant 
> calls have been removed. But I also found that I needed to add a new 
> yield(), to work around yet another unexpected issue on this system - we 
> have a number of threads waiting on a condition variable, and the thread 
> holding the mutex signals the var, unlocks the mutex, and then 
> immediately relocks it. The expectation here is that upon unlocking the 
> mutex, the calling thread would block while some waiting thread (that 
> just got signaled) would get to run. In fact what happened is that the 
> calling thread unlocked and relocked the mutex without allowing any of 
> the waiting threads to run. In this case the only solution was to insert 
> a yield() after the mutex_unlock(). So again, for those of you claiming 
> "oh, all you need to do is use a condition variable or any of the other 
> POSIX synchronization primitives" - yes, that's a nice theory, but 
> reality says otherwise.

I encountered a similar issue with some software that I wrote, and used 
a similar workaround, however this was basically because there wasn't 
enough time available at the time to redesign things to work properly. 
The problem here is essentially caused by the fact that the mutex is 
being locked for an excessively large proportion of the time and not 
letting other threads in. In the case I am thinking of, posting the 
messages to the thread that was hogging the mutex via a signaling queue 
would have been a better solution than using yield and having correct 
operation depend on undefined parts of thread scheduling behavior.
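
(To make the problematic pattern concrete, a minimal sketch; simplified, 
and not actual slapd code:)

        #include <pthread.h>

        static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

        static void signal_unlock_relock(void)
        {
                pthread_mutex_lock(&m);
                /* ... produce work for the waiters ... */
                pthread_cond_signal(&cv);  /* wake one waiter */
                pthread_mutex_unlock(&m);
                pthread_mutex_lock(&m);    /* nothing forces a reschedule
                                            * here; under SCHED_OTHER the
                                            * caller may relock before any
                                            * signaled waiter ever runs */
        }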

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 21:24         ` Howard Chu
@ 2005-08-21  0:36           ` Nick Piggin
  2005-08-21 11:33           ` Nikita Danilov
  1 sibling, 0 replies; 127+ messages in thread
From: Nick Piggin @ 2005-08-21  0:36 UTC (permalink / raw)
  To: Howard Chu; +Cc: Lee Revell, Robert Hancock, linux-kernel

Howard Chu wrote:
> Lee Revell wrote:
> 
>>  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
>> > But I also found that I needed to add a new yield(), to work around
>> > yet another unexpected issue on this system - we have a number of
>> > threads waiting on a condition variable, and the thread holding the
>> > mutex signals the var, unlocks the mutex, and then immediately
>> > relocks it. The expectation here is that upon unlocking the mutex,
>> > the calling thread would block while some waiting thread (that just
>> > got signaled) would get to run. In fact what happened is that the
>> > calling thread unlocked and relocked the mutex without allowing any
>> > of the waiting threads to run. In this case the only solution was
>> > to insert a yield() after the mutex_unlock().
>>
>>  That's exactly the behavior I would expect.  Why would you expect
>>  unlocking a mutex to cause a reschedule, if the calling thread still
>>  has timeslice left?
> 
> 
> That's beside the point. Folks are making an assertion that 
> sched_yield() is meaningless; this example demonstrates that there are 
> cases where sched_yield() is essential.
> 

The point is, with SCHED_OTHER scheduling, sched_yield() need not
do anything. It may not let any other tasks run.

The fact that it does on Linux is because we do attempt to do
something expected... but the simple matter is that you can't rely
on it to do what you expect.

I'm not sure exactly how you would solve the above problem, but I'm
sure it can be achieved using mutexes (for example, you could have
a queue where every thread waits on its own private mutex).... but I
don't do much userspace C programming, sorry.
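
(A rough sketch of that idea, with invented names: waiters queue up in 
FIFO order and each blocks on its own private lock and condvar, so the 
releaser hands off to exactly one chosen thread:)

        #include <pthread.h>
        #include <stddef.h>

        /* one node per waiting thread; the caller initializes
         * w->lock and w->cv before first use */
        struct waiter {
                pthread_mutex_t lock;
                pthread_cond_t  cv;
                int             granted;
                struct waiter  *next;
        };

        static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
        static struct waiter *head, *tail;

        /* join the queue, then sleep on our own condvar */
        static void wait_my_turn(struct waiter *w)
        {
                w->granted = 0;
                w->next = NULL;

                pthread_mutex_lock(&qlock);
                if (tail)
                        tail->next = w;
                else
                        head = w;
                tail = w;
                pthread_mutex_unlock(&qlock);

                pthread_mutex_lock(&w->lock);
                while (!w->granted)
                        pthread_cond_wait(&w->cv, &w->lock);
                pthread_mutex_unlock(&w->lock);
        }

        /* wake the oldest waiter, and only that waiter */
        static void grant_next(void)
        {
                struct waiter *w;

                pthread_mutex_lock(&qlock);
                w = head;
                if (w) {
                        head = w->next;
                        if (!head)
                                tail = NULL;
                }
                pthread_mutex_unlock(&qlock);

                if (w) {
                        pthread_mutex_lock(&w->lock);
                        w->granted = 1;
                        pthread_cond_signal(&w->cv);
                        pthread_mutex_unlock(&w->lock);
                }
        }

This guarantees FIFO handoff no matter what sched_yield() happens to do, 
at the cost of one queue node per thread.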


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 19:49       ` Howard Chu
@ 2005-08-20 22:08         ` Nikita Danilov
  0 siblings, 0 replies; 127+ messages in thread
From: Nikita Danilov @ 2005-08-20 22:08 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu writes:
 > Nikita Danilov wrote:
 > > That returns us to the core of the problem: sched_yield() is used to
 > > implement a synchronization primitive and non-portable assumptions are
 > > made about its behavior: SUS defines that after sched_yield() thread
 > > ceases to run on the CPU "until it again becomes the head of its thread
 > > list", and "thread list" discipline is only defined for real-time
 > > scheduling policies. E.g., 
 > >
 > > int sched_yield(void)
 > > {
 > >        return 0;
 > > }
 > >
 > > and
 > >
 > > int sched_yield(void)
 > > {
 > >        sleep(100);
 > >        return 0;
 > > }
 > >
 > > are both valid sched_yield() implementation for non-rt (SCHED_OTHER)
 > > threads.
 > I think you're mistaken:
 > http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0d4eaf3703131e86/da051ebe58976b00#da051ebe58976b00
 > 
 > sched_yield() is required to be supported even if priority scheduling is 
 > not supported, and it is required to cause the calling thread (not 
 > process) to yield the processor.

Of course sched_yield() is required to be supported; the question is for
how long the CPU is yielded. Here is the quote from the SUS (actually the
complete definition of sched_yield()):

    The sched_yield() function shall force the running thread to
    relinquish the processor until it again becomes the head of its
    thread list.

As far as I can see, SUS doesn't specify how the "thread list" is maintained
for non-RT scheduling policies, and an implementation that immediately places
a SCHED_OTHER thread that called sched_yield() back at the head of its
thread list is perfectly valid. Also valid is an implementation that
waits for 100 seconds and then places the sched_yield() caller at the head
of the list, etc. Basically, while the semantics of sched_yield() are well
defined for the RT scheduling policies, for the SCHED_OTHER policy the
standard leaves them implementation-defined.

 > 
 > -- 
 >   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
@ 2005-08-20 21:50       ` Lee Revell
  2005-08-21  1:04       ` Robert Hancock
  2 siblings, 0 replies; 127+ messages in thread
From: Lee Revell @ 2005-08-20 21:50 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> Nick Piggin wrote:
> >  Robert Hancock wrote:
> > > I fail to see how sched_yield is going to be very helpful in this
> > > situation. Since that call can sleep for a time ranging
> > > from zero to a long time, it's going to give unpredictable results.
> 
> >  Well, not sleep technically, but yield the CPU for some undefined
> >  amount of time.
> 
> Since the slapd server was not written to run in realtime, nor is it 
> commonly run on realtime operating systems, I don't believe predictable 
> timing here is a criterion we care about. One could say the same of 
> sigsuspend() by the way - it can pause a process for a time ranging 
> from zero to a long time. Should we tell application writers not 
> to use this function either, regardless of whether the developer thinks 
> they have a good reason to use it?

Of course not.  We should tell them that if they use sigsuspend() they
cannot assume that the process will not wake up immediately.

Lee



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 20:57       ` Lee Revell
@ 2005-08-20 21:24         ` Howard Chu
  2005-08-21  0:36           ` Nick Piggin
  2005-08-21 11:33           ` Nikita Danilov
  0 siblings, 2 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-20 21:24 UTC (permalink / raw)
  To: Lee Revell; +Cc: Nick Piggin, Robert Hancock, linux-kernel

Lee Revell wrote:
>  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> > But I also found that I needed to add a new yield(), to work around
> > yet another unexpected issue on this system - we have a number of
> > threads waiting on a condition variable, and the thread holding the
> > mutex signals the var, unlocks the mutex, and then immediately
> > relocks it. The expectation here is that upon unlocking the mutex,
> > the calling thread would block while some waiting thread (that just
> > got signaled) would get to run. In fact what happened is that the
> > calling thread unlocked and relocked the mutex without allowing any
> > of the waiting threads to run. In this case the only solution was
> > to insert a yield() after the mutex_unlock().
>
>  That's exactly the behavior I would expect.  Why would you expect
>  unlocking a mutex to cause a reschedule, if the calling thread still
>  has timeslice left?

That's beside the point. Folks are making an assertion that 
sched_yield() is meaningless; this example demonstrates that there are 
cases where sched_yield() is essential.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 18:38     ` Howard Chu
@ 2005-08-20 20:57       ` Lee Revell
  2005-08-20 21:24         ` Howard Chu
  2005-08-20 21:50       ` Lee Revell
  2005-08-21  1:04       ` Robert Hancock
  2 siblings, 1 reply; 127+ messages in thread
From: Lee Revell @ 2005-08-20 20:57 UTC (permalink / raw)
  To: Howard Chu; +Cc: Nick Piggin, Robert Hancock, linux-kernel

On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
> But I also found that I needed to add a new 
> yield(), to work around yet another unexpected issue on this system -
> we have a number of threads waiting on a condition variable, and the
> thread holding the mutex signals the var, unlocks the mutex, and then 
> immediately relocks it. The expectation here is that upon unlocking
> the mutex, the calling thread would block while some waiting thread
> (that just got signaled) would get to run. In fact what happened is
> that the calling thread unlocked and relocked the mutex without
> allowing any of the waiting threads to run. In this case the only
> solution was to insert a yield() after the mutex_unlock(). 

That's exactly the behavior I would expect.  Why would you expect
unlocking a mutex to cause a reschedule, if the calling thread still has
timeslice left?

Lee


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20 13:23     ` Nikita Danilov
@ 2005-08-20 19:49       ` Howard Chu
  2005-08-20 22:08         ` Nikita Danilov
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2005-08-20 19:49 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov wrote:
> That returns us to the core of the problem: sched_yield() is used to
> implement a synchronization primitive and non-portable assumptions are
> made about its behavior: SUS defines that after sched_yield() a thread
> ceases to run on the CPU "until it again becomes the head of its thread
> list", and "thread list" discipline is only defined for real-time
> scheduling policies. E.g., 
>
> int sched_yield(void)
> {
>        return 0;
> }
>
> and
>
> int sched_yield(void)
> {
>        sleep(100);
>        return 0;
> }
>
> are both valid sched_yield() implementations for non-rt (SCHED_OTHER)
> threads.
I think you're mistaken:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0d4eaf3703131e86/da051ebe58976b00#da051ebe58976b00

sched_yield() is required to be supported even if priority scheduling is 
not supported, and it is required to cause the calling thread (not 
process) to yield the processor.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20  4:18   ` Nick Piggin
@ 2005-08-20 18:38     ` Howard Chu
  2005-08-20 20:57       ` Lee Revell
                         ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-20 18:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Robert Hancock, linux-kernel

Nick Piggin wrote:
>  Robert Hancock wrote:
> > I fail to see how sched_yield is going to be very helpful in this
> > situation. Since that call can sleep for a time ranging
> > from zero to a long time, it's going to give unpredictable results.

>  Well, not sleep technically, but yield the CPU for some undefined
>  amount of time.

Since the slapd server was not written to run in realtime, nor is it 
commonly run on realtime operating systems, I don't believe predictable 
timing here is a criterion we care about. One could say the same of 
sigsuspend() by the way - it can pause a process for a time ranging 
from zero to a long time. Should we tell application writers not 
to use this function either, regardless of whether the developer thinks 
they have a good reason to use it?

> > It seems to me that this sort of thing is why we have POSIX pthread
> > synchronization primitives.. sched_yield is basically there for a
> > process to indicate that "what I'm doing doesn't matter much, let
> > other stuff run". Any other use of it generally constitutes some
> > kind of hack.

In terms of transaction recovery, we do an exponential backoff on the 
retries, because our benchmarks showed that under heavy lock contention, 
immediate retries only made things worse. In fact, having arbitrarily 
long backoff delays here was shown to improve transaction throughput. 
(We use select() with an increasing timeval in combination with the 
yield() call. One way or another we get a longer delay as desired.)
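
For illustration, the delay part looks roughly like this -- a simplified
sketch, not our actual code; backoff_sleep() and its cap are invented:

#include <sys/select.h>

static void backoff_sleep(int attempt)
{
        struct timeval tv;
        long usec = 1000L << (attempt < 10 ? attempt : 10);

        tv.tv_sec = usec / 1000000;     /* 1ms, 2ms, 4ms, ... capped near 1s */
        tv.tv_usec = usec % 1000000;
        select(0, NULL, NULL, NULL, &tv);       /* select() with no fds is a
                                                   portable sub-second sleep */
}

Each retry passes a larger attempt count, so the delay grows exponentially
as described.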

sched_yield is there for a *thread* to indicate "what I'm doing doesn't 
matter much, let other stuff run."

I suppose it may be a hack. But then so is TCP congestion control. In 
both cases, empirical evidence indicates the hack is worthwhile. If you 
haven't done the analysis then you're in no position to deny the value 
of the approach.

>  In SCHED_OTHER mode, you're right, sched_yield is basically
>  meaningless.

>  In a realtime system, there is a very well defined and probably
>  useful behaviour.

>  E.g. if 2 SCHED_FIFO processes are running at the same priority, one
>  can call sched_yield to deterministically give the CPU to the other
>  guy.

Well yes, the point of a realtime system is to provide deterministic 
response times to unpredictable input.

I'll note that we removed a number of the yield calls (that were in 
OpenLDAP 2.2) for the 2.3 release, because I found that they were 
redundant and causing unnecessary delays. My own test system is running 
on a Linux 2.6.12.3 kernel (installed over a SuSE 9.2 x86_64 distro), 
and OpenLDAP 2.3 runs perfectly well here, now that those redundant 
calls have been removed. But I also found that I needed to add a new 
yield(), to work around yet another unexpected issue on this system - we 
have a number of threads waiting on a condition variable, and the thread 
holding the mutex signals the var, unlocks the mutex, and then 
immediately relocks it. The expectation here is that upon unlocking the 
mutex, the calling thread would block while some waiting thread (that 
just got signaled) would get to run. In fact what happened is that the 
calling thread unlocked and relocked the mutex without allowing any of 
the waiting threads to run. In this case the only solution was to insert 
a yield() after the mutex_unlock(). So again, for those of you claiming 
"oh, all you need to do is use a condition variable or any of the other 
POSIX synchronization primitives" - yes, that's a nice theory, but 
reality says otherwise.
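
Schematically, the pattern is this (a stripped-down sketch, not the actual
slapd code; m and cv stand in for the real objects, assumed initialized
elsewhere):

#include <pthread.h>
#include <sched.h>

extern pthread_mutex_t m;
extern pthread_cond_t cv;

void hand_off_and_continue(void)
{
        pthread_mutex_lock(&m);
        /* ... make work available to a waiter ... */
        pthread_cond_signal(&cv);       /* wake one waiting thread */
        pthread_mutex_unlock(&m);
        sched_yield();                  /* the workaround: without it, the
                                           caller often relocks m before any
                                           signaled waiter has run */
        pthread_mutex_lock(&m);
        /* ... continue with the next unit of work ... */
        pthread_mutex_unlock(&m);
}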

To say that sched_yield is basically meaningless far overstates your 
point.
-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found]   ` <430666DB.70802@symas.com.suse.lists.linux.kernel>
@ 2005-08-20 13:48     ` Andi Kleen
  2005-08-21 19:47       ` Florian Weimer
  0 siblings, 1 reply; 127+ messages in thread
From: Andi Kleen @ 2005-08-20 13:48 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

Howard Chu <hyc@symas.com> writes:

> In this specific example, we use whatever
> BerkeleyDB provides and we're certainly not about to write our own
> transactional embedded database engine just for this.

BerkeleyDB is, after all, free software that comes with source code. 
Surely it can be fixed without rewriting it from scratch.

Has anybody contacted the Sleepycat people with a description of the 
problem yet?

-Andi

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19 23:10   ` Howard Chu
@ 2005-08-20 13:23     ` Nikita Danilov
  2005-08-20 19:49       ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Nikita Danilov @ 2005-08-20 13:23 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu writes:
 > Nikita Danilov wrote:

[...]

 > 
 > >  What prevents the transaction monitor from using, say, condition
 > >  variables to "yield the cpu"? That would have the additional advantage of
 > >  blocking the thread precisely until a specific event occurs, instead of
 > >  blocking for some vague, indeterminate, load- and platform-dependent
 > >  amount of time.
 > 
 > Condition variables offer no control over which thread is woken up. 

When only one thread waits on a condition variable, which is exactly the
scenario involved --sorry if I wasn't clear enough-- signaling the
condition provides precise control over which thread is woken up.

 > We're wandering into the design of the SleepyCat BerkeleyDB library 
 > here, and we don't exert any control over that either. BerkeleyDB 
 > doesn't appear to use pthread condition variables; it seems to construct 
 > its own synchronization mechanisms on top of mutexes (and yield calls). 

That returns us to the core of the problem: sched_yield() is used to
implement a synchronization primitive and non-portable assumptions are
made about its behavior: SUS defines that after sched_yield() a thread
ceases to run on the CPU "until it again becomes the head of its thread
list", and "thread list" discipline is only defined for real-time
scheduling policies. E.g., 

int sched_yield(void)
{
       return 0;
}

and

int sched_yield(void)
{
       sleep(100);
       return 0;
}

are both valid sched_yield() implementations for non-rt (SCHED_OTHER)
threads.

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-20  3:20 ` Robert Hancock
@ 2005-08-20  4:18   ` Nick Piggin
  2005-08-20 18:38     ` Howard Chu
  0 siblings, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2005-08-20  4:18 UTC (permalink / raw)
  To: Robert Hancock; +Cc: linux-kernel

Robert Hancock wrote:

> 
> I fail to see how sched_yield is going to be very helpful in this 
> situation. Since that call can sleep for a time ranging from 
> zero to a long time, it's going to give unpredictable results.
> 

Well, not sleep technically, but yield the CPU for some undefined
amount of time.

> It seems to me that this sort of thing is why we have POSIX pthread 
> synchronization primitives.. sched_yield is basically there for a 
> process to indicate that "what I'm doing doesn't matter much, let other 
> stuff run". Any other use of it generally constitutes some kind of hack.
> 

In SCHED_OTHER mode, you're right, sched_yield is basically meaningless.

In a realtime system, there is a very well defined and probably useful
behaviour.

E.g. if 2 SCHED_FIFO processes are running at the same priority, one can
call sched_yield to deterministically give the CPU to the other guy.
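
Concretely, something like this (an illustrative sketch only: SCHED_FIFO
needs root privileges, error checking is omitted, and the strict
alternation is only meaningful on a single CPU):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
        int i;

        for (i = 0; i < 5; i++) {
                printf("thread %ld, pass %d\n", (long)arg, i);
                sched_yield();          /* hands the CPU to the other
                                           equal-priority SCHED_FIFO thread */
        }
        return NULL;
}

int main(void)
{
        struct sched_param sp;
        pthread_attr_t attr;
        pthread_t t1, t2;

        sp.sched_priority = 10;
        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);

        pthread_create(&t1, &attr, worker, (void *)1L);
        pthread_create(&t2, &attr, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}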

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found] <4D8eT-4rg-31@gated-at.bofh.it>
@ 2005-08-20  3:20 ` Robert Hancock
  2005-08-20  4:18   ` Nick Piggin
  0 siblings, 1 reply; 127+ messages in thread
From: Robert Hancock @ 2005-08-20  3:20 UTC (permalink / raw)
  To: linux-kernel

Howard Chu wrote:
> You assume that spinlocks are the only reason a developer may want to 
> yield the processor. This assumption is unfounded. Case in point - the 
> primary backend in OpenLDAP uses a transactional database with 
> page-level locking of its data structures to provide high levels of 
> concurrency. It is the nature of such a system to encounter deadlocks 
> over the normal course of operations. When a deadlock is detected, some 
> thread must be chosen (by one of a variety of algorithms) to abort its 
> transaction, in order to allow other operations to proceed to 
> completion. In this situation, the chosen thread must get control of the 
> CPU long enough to clean itself up, and then it must yield the CPU in 
> order to allow any other competing threads to complete their 
> transaction. The thread with the aborted transaction relinquishes all of 
> its locks and then waits to get another shot at the CPU to try 
> everything over again. Again, this is all fundamental to the nature of 
> transactional programming. If the 2.6 kernel makes this programming 
> model unreasonably slow, then quite simply this kernel is not viable as 
> a database platform.

I fail to see how sched_yield is going to be very helpful in this 
situation. Since that call can sleep for a time ranging from 
zero to a long time, it's going to give unpredictable results.

It seems to me that this sort of thing is why we have POSIX pthread 
synchronization primitives.. sched_yield is basically there for a 
process to indicate that "what I'm doing doesn't matter much, let other 
stuff run". Any other use of it generally constitutes some kind of hack.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19 10:21 ` Nikita Danilov
@ 2005-08-19 23:10   ` Howard Chu
  2005-08-20 13:23     ` Nikita Danilov
  0 siblings, 1 reply; 127+ messages in thread
From: Howard Chu @ 2005-08-19 23:10 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Linux Kernel Mailing List

Nikita Danilov wrote:
>  Howard Chu <hyc@symas.com> writes:
> > concurrency. It is the nature of such a system to encounter
> > deadlocks over the normal course of operations. When a deadlock is
> > detected, some thread must be chosen (by one of a variety of
> > algorithms) to abort its transaction, in order to allow other
> > operations to proceed to completion. In this situation, the chosen
> > thread must get control of the CPU long enough to clean itself up,

>  What prevents the transaction monitor from using, say, condition
>  variables to "yield the cpu"? That would have the additional advantage of
>  blocking the thread precisely until a specific event occurs, instead of
>  blocking for some vague, indeterminate, load- and platform-dependent
>  amount of time.

Condition variables offer no control over which thread is woken up. 
We're wandering into the design of the SleepyCat BerkeleyDB library 
here, and we don't exert any control over that either. BerkeleyDB 
doesn't appear to use pthread condition variables; it seems to construct 
its own synchronization mechanisms on top of mutexes (and yield calls). 
In this specific example, we use whatever BerkeleyDB provides and we're 
certainly not about to write our own transactional embedded database 
engine just for this.
-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:59 ` Chris Wedgwood
@ 2005-08-19 22:45   ` Howard Chu
  0 siblings, 0 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-19 22:45 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

Chris Wedgwood wrote:
>  On Thu, Aug 18, 2005 at 11:03:45PM -0700, Howard Chu wrote:
> > If the 2.6 kernel makes this programming model unreasonably slow,
> > then quite simply this kernel is not viable as a database platform.

>  Pretty much everyone else manages to make it work.

And this contributes to the discussion how? Pretty much every other 
Unix-ish operating system manages to make scheduling with nice'd 
processes work. If you really want to get into what "everyone else 
manages to make work" you're in for a rough ride.

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
  2005-08-19  6:59 ` Chris Wedgwood
@ 2005-08-19 10:21 ` Nikita Danilov
  2005-08-19 23:10   ` Howard Chu
  2 siblings, 1 reply; 127+ messages in thread
From: Nikita Danilov @ 2005-08-19 10:21 UTC (permalink / raw)
  To: Howard Chu; +Cc: Linux Kernel Mailing List

Howard Chu <hyc@symas.com> writes:

[...]

> concurrency. It is the nature of such a system to encounter deadlocks
> over the normal course of operations. When a deadlock is detected, some
> thread must be chosen (by one of a variety of algorithms) to abort its
> transaction, in order to allow other operations to proceed to
> completion. In this situation, the chosen thread must get control of the
> CPU long enough to clean itself up,

What prevents the transaction monitor from using, say, condition variables
to "yield the cpu"? That would have the additional advantage of blocking the
thread precisely until a specific event occurs, instead of blocking for
some vague, indeterminate, load- and platform-dependent amount of time.

>                                     and then it must yield the CPU in
> order to allow any other competing threads to complete their
> transaction.

Again, this sounds like something doable with standard POSIX synchronization
primitives.
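
For instance, a minimal sketch (all names invented) that blocks until the
specific event "some competing transaction completed", instead of yielding:

#include <pthread.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t txn_done = PTHREAD_COND_INITIALIZER;
static int completed_txns;

/* Called by the aborted thread instead of sched_yield(): sleep
 * until at least one competing transaction has completed. */
void wait_for_progress(int seen)
{
        pthread_mutex_lock(&lk);
        while (completed_txns == seen)
                pthread_cond_wait(&txn_done, &lk);
        pthread_mutex_unlock(&lk);
}

/* Called by every thread that commits a transaction. */
void announce_progress(void)
{
        pthread_mutex_lock(&lk);
        completed_txns++;
        pthread_cond_broadcast(&txn_done);
        pthread_mutex_unlock(&lk);
}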

>
> -- 
>   -- Howard Chu

Nikita.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
@ 2005-08-19  6:59 ` Chris Wedgwood
  2005-08-19 22:45   ` Howard Chu
  2005-08-19 10:21 ` Nikita Danilov
  2 siblings, 1 reply; 127+ messages in thread
From: Chris Wedgwood @ 2005-08-19  6:59 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

On Thu, Aug 18, 2005 at 11:03:45PM -0700, Howard Chu wrote:

> If the 2.6 kernel makes this programming model unreasonably slow,
> then quite simply this kernel is not viable as a database platform.

Pretty much everyone else manages to make it work.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  6:03 Howard Chu
@ 2005-08-19  6:34 ` Nick Piggin
  2005-08-19  6:59 ` Chris Wedgwood
  2005-08-19 10:21 ` Nikita Danilov
  2 siblings, 0 replies; 127+ messages in thread
From: Nick Piggin @ 2005-08-19  6:34 UTC (permalink / raw)
  To: Howard Chu; +Cc: linux-kernel

Hi Howard,

Thanks for joining the discussion. One request, if I may,
can you retain the CC list on posts please?

Howard Chu wrote:
 >
>>  AFAIKS, sched_yield should only really be used by realtime
>>  applications that know exactly what they're doing.
> 
> 
> pthread_yield() was deleted from the POSIX threads drafts years ago. 
> sched_yield() is the officially supported API, and OpenLDAP is using it 
> for the documented purpose. Anyone who says "applications shouldn't be 
> using sched_yield()" doesn't know what they're talking about.
> 

Linux's SCHED_OTHER policy offers static priorities in the range [0..0].
I think anything else would be a bug, because from my reading of the
standards, a process with a higher static priority shall always preempt
a process with a lower priority.

And SCHED_OTHER simply doesn't work that way.

So sched_yield() from a SCHED_OTHER task is free to basically do anything
at all. Is that the kind of behaviour you had in mind?

>>  It's really more a feature than a bug that it breaks so easily
>>  because they should really be using futexes instead, which have much
>>  better behaviour than any sched_yield ever could (they will directly
>>  wake up another process waiting for the lock and avoid the thundering
>>  herd for contended locks)
> 
> 
> You assume that spinlocks are the only reason a developer may want to 
> yield the processor. This assumption is unfounded. Case in point - the 
> primary backend in OpenLDAP uses a transactional database with 
> page-level locking of its data structures to provide high levels of 
> concurrency. It is the nature of such a system to encounter deadlocks 
> over the normal course of operations. When a deadlock is detected, some 
> thread must be chosen (by one of a variety of algorithms) to abort its 
> transaction, in order to allow other operations to proceed to 
> completion. In this situation, the chosen thread must get control of the 
> CPU long enough to clean itself up, and then it must yield the CPU in 
> order to allow any other competing threads to complete their 
> transaction. The thread with the aborted transaction relinquishes all of 
> its locks and then waits to get another shot at the CPU to try 
> everything over again. Again, this is all fundamental to the nature of 

You didn't explain why you can't use a mutex to do this. From
your brief description, it seems like a mutex might just do
the job nicely.

> transactional programming. If the 2.6 kernel makes this programming 
> model unreasonably slow, then quite simply this kernel is not viable as 
> a database platform.
> 

Actually it should still be fast. It may yield excessive CPU to
other tasks (including those that are reniced). You didn't rely
on sched_yield providing some semantics about not doing such a
thing, did you?


^ permalink raw reply	[flat|nested] 127+ messages in thread

* re: sched_yield() makes OpenLDAP slow
@ 2005-08-19  6:03 Howard Chu
  2005-08-19  6:34 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 127+ messages in thread
From: Howard Chu @ 2005-08-19  6:03 UTC (permalink / raw)
  To: linux-kernel

Hm, seems there's a great deal of misinformation in this thread.

>  I also think OpenLDAP is wrong. First, it should be calling
>  pthread_yield() because slapd is a multithreaded process and it just
>  wants to run the other threads. See:
...
>  AFAIKS, sched_yield should only really be used by realtime
>  applications that know exactly what they're doing.

pthread_yield() was deleted from the POSIX threads drafts years ago. 
sched_yield() is the officially supported API, and OpenLDAP is using it 
for the documented purpose. Anyone who says "applications shouldn't be 
using sched_yield()" doesn't know what they're talking about.

>  It's really more a feature than a bug that it breaks so easily
>  because they should really be using futexes instead, which have much
>  better behaviour than any sched_yield ever could (they will directly
>  wake up another process waiting for the lock and avoid the thundering
>  herd for contended locks)

You assume that spinlocks are the only reason a developer may want to 
yield the processor. This assumption is unfounded. Case in point - the 
primary backend in OpenLDAP uses a transactional database with 
page-level locking of its data structures to provide high levels of 
concurrency. It is the nature of such a system to encounter deadlocks 
over the normal course of operations. When a deadlock is detected, some 
thread must be chosen (by one of a variety of algorithms) to abort its 
transaction, in order to allow other operations to proceed to 
completion. In this situation, the chosen thread must get control of the 
CPU long enough to clean itself up, and then it must yield the CPU in 
order to allow any other competing threads to complete their 
transaction. The thread with the aborted transaction relinquishes all of 
its locks and then waits to get another shot at the CPU to try 
everything over again. Again, this is all fundamental to the nature of 
transactional programming. If the 2.6 kernel makes this programming 
model unreasonably slow, then quite simply this kernel is not viable as 
a database platform.
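
In outline, the retry loop is something like this (a schematic sketch:
do_txn(), abort_txn() and ERR_DEADLOCK are invented stand-ins here, not
any real database API):

#include <sched.h>

extern int do_txn(void);        /* runs one transaction attempt */
extern void abort_txn(void);    /* aborts it, dropping all its locks */
#define ERR_DEADLOCK (-2)       /* "chosen as the deadlock victim" */

int run_txn_with_retry(void)
{
        int err;

        for (;;) {
                err = do_txn();
                if (err != ERR_DEADLOCK)
                        return err;     /* success, or a hard failure */
                abort_txn();            /* relinquish every lock held */
                sched_yield();          /* let competing threads finish */
        }
}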

-- 
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-19  3:19       ` Andi Kleen
@ 2005-08-19  3:30         ` Bernardo Innocenti
  0 siblings, 0 replies; 127+ messages in thread
From: Bernardo Innocenti @ 2005-08-19  3:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, nickpiggin, Giovanni Bajo

Andi Kleen wrote:
> Bernardo Innocenti <bernie@develer.com> writes:
> 
> It's really more a feature than a bug that it breaks so easily
> because they should really be using futexes instead, which
> have much better behaviour than any sched_yield ever could
> (they will directly wake up another process waiting for the
> lock and avoid the thundering herd for contended locks) 

Actually, I believe they should be using pthread synchronization
primitives instead of relying on Linux-specific functionality.
Glibc already uses futexes internally, so it's almost as efficient.
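
Concretely (a trivial sketch with invented names): instead of a
trylock-and-yield loop, the thread just blocks on the mutex, and glibc
parks it on a futex until the lock is released:

#include <pthread.h>

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;

void take_resource(void)
{
        pthread_mutex_lock(&resource_lock);     /* sleeps on a futex in the
                                                   kernel while contended */
}

void drop_resource(void)
{
        pthread_mutex_unlock(&resource_lock);
}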

I've already suggested this to the OpenLDAP people, but with
my limited knowledge of slapd threading requirements, there
may well be a very good reason for busy-waiting with
sched_yield().  Waiting for their answer.

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
       [not found]     ` <43054D9A.7090509@develer.com.suse.lists.linux.kernel>
@ 2005-08-19  3:19       ` Andi Kleen
  2005-08-19  3:30         ` Bernardo Innocenti
  0 siblings, 1 reply; 127+ messages in thread
From: Andi Kleen @ 2005-08-19  3:19 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: linux-kernel, nickpiggin

Bernardo Innocenti <bernie@develer.com> writes:

It's really more a feature than a bug that it breaks so easily
because they should really be using futexes instead, which
have much better behaviour than any sched_yield ever could
(they will directly wake up another process waiting for the
lock and avoid the thundering herd for contended locks).

-Andi

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  2:58   ` Nick Piggin
@ 2005-08-19  3:10     ` Bernardo Innocenti
  0 siblings, 0 replies; 127+ messages in thread
From: Bernardo Innocenti @ 2005-08-19  3:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Joseph Fannin, lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Nick Piggin wrote:

> We class the SCHED_OTHER policy as having a single priority, which
> I believe is allowed (and even makes good sense, because dynamic
> and even nice priorities aren't really well defined).
> 
> That also makes our sched_yield() behaviour correct.
> 
> AFAIKS, sched_yield should only really be used by realtime
> applications that know exactly what they're doing.

I'm pretty sure this has already been discussed in the
past, but I fail to see why this new behavior of
sched_yield() would be more correct.

In the OpenLDAP bug discussion, one of the developers
considers this a Linux quirk needing a workaround, not
a real bug in OpenLDAP.

As I understand it, the old behavior was to push the
yielding process to the end of the queue for processes
with the same niceness.  This is somewhat closer to
the (vague) definition in the POSIX man pages:

 The sched_yield() function shall force the running
 thread to relinquish the processor until it again
 becomes the head of its thread list. It takes no arguments.

Pushing the process far behind in the queue, even after
niced CPU crunchers, appears a bit extreme.  It seems
most programs expect sched_yield() to only reschedule
the calling thread wrt its sibling threads, to be used
to implement do-it-yourself spinlocks and the like.
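
The kind of do-it-yourself spinlock meant here is roughly the classic
trylock-plus-yield loop (an illustrative sketch, not code from any
particular program):

#include <pthread.h>
#include <sched.h>

void diy_spin_lock(pthread_mutex_t *m)
{
        while (pthread_mutex_trylock(m) != 0)
                sched_yield();          /* contended: yield and hope the
                                           holder runs next and unlocks */
}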

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:47 ` Con Kolivas
@ 2005-08-18 10:48   ` Maciej Soltysiak
  0 siblings, 0 replies; 127+ messages in thread
From: Maciej Soltysiak @ 2005-08-18 10:48 UTC (permalink / raw)
  To: lkml

Hello Con,

Thursday, August 18, 2005, 2:47:25 AM, you wrote:
> sched_yield behaviour changed in 2.5 series more than 3 years ago and
> applications that use this as a locking primitive should be updated.
I remember OpenOffice had a problem with excessive use of sched_yield()
during 2.5. I guess they changed it but I have not checked.
Does anyone know?

Back then OpenOffice was having serious latency problems on 2.5.

Regards,
Maciej



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  1:07 ` Joseph Fannin
  2005-08-18  2:25   ` Bernardo Innocenti
@ 2005-08-18  2:58   ` Nick Piggin
  2005-08-19  3:10     ` Bernardo Innocenti
  1 sibling, 1 reply; 127+ messages in thread
From: Nick Piggin @ 2005-08-18  2:58 UTC (permalink / raw)
  To: Joseph Fannin
  Cc: Bernardo Innocenti, lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Joseph Fannin wrote:

>On Thu, Aug 18, 2005 at 02:50:16AM +0200, Bernardo Innocenti wrote:
>
>
>>The relative timestamp reveals that slapd is spending 50ms
>>after yielding.  Meanwhile, GCC is probably being scheduled
>>for a whole quantum.
>>
>>Reading the man-page of sched_yield() it seems this isn't
>>the correct behavior:
>>
>>   Note: If the current process is the only process in the
>>   highest priority list at that time, this process will
>>   continue to run after a call to sched_yield.
>>
>
>   The behavior of sched_yield changed for 2.6.  I suppose the man
>page didn't get updated.
>
>

We class the SCHED_OTHER policy as having a single priority, which
I believe is allowed (and even makes good sense, because dynamic
and even nice priorities aren't really well defined).

That also makes our sched_yield() behaviour correct.

AFAIKS, sched_yield should only really be used by realtime
applications that know exactly what they're doing.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  1:07 ` Joseph Fannin
@ 2005-08-18  2:25   ` Bernardo Innocenti
  2005-08-18  2:58   ` Nick Piggin
  1 sibling, 0 replies; 127+ messages in thread
From: Bernardo Innocenti @ 2005-08-18  2:25 UTC (permalink / raw)
  To: Joseph Fannin; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

Joseph Fannin wrote:

>    The behavior of sched_yield changed for 2.6.  I suppose the man
> page didn't get updated.

Now I remember reading about that on LWN or maybe KernelTraffic.
Thanks!

>>I also think OpenLDAP is wrong.  First, it should be calling
>>pthread_yield() because slapd is a multithreaded process
>>and it just wants to run the other threads.  See:
> 
>     Is it possible that this problem has been noticed and fixed
> already?

The OpenLDAP 2.3.5 source still looks like this.
I've filed a report in OpenLDAP's issue tracker:

 http://www.openldap.org/its/index.cgi/Incoming?id=3950;page=2

-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:50 Bernardo Innocenti
  2005-08-18  0:47 ` Con Kolivas
@ 2005-08-18  1:07 ` Joseph Fannin
  2005-08-18  2:25   ` Bernardo Innocenti
  2005-08-18  2:58   ` Nick Piggin
  1 sibling, 2 replies; 127+ messages in thread
From: Joseph Fannin @ 2005-08-18  1:07 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

On Thu, Aug 18, 2005 at 02:50:16AM +0200, Bernardo Innocenti wrote:

> The relative timestamp reveals that slapd is spending 50ms
> after yielding.  Meanwhile, GCC is probably being scheduled
> for a whole quantum.
>
> Reading the man-page of sched_yield() it seems this isn't
> the correct behavior:
>
>    Note: If the current process is the only process in the
>    highest priority list at that time, this process will
>    continue to run after a call to sched_yield.

   The behavior of sched_yield changed for 2.6.  I suppose the man
page didn't get updated.

From linux/Documentation/post-halloween.txt:

| - The behavior of sched_yield() changed a lot.  A task that uses
|   this system call should now expect to sleep for possibly a very
|   long time.  Tasks that do not really desire to give up the
|   processor for a while should probably not make heavy use of this
|   function.  Unfortunately, some GUI programs (like Open Office)
|   do make excessive use of this call and under load their
|   performance is poor.  It seems this new 2.6 behavior is optimal
|   but some user-space applications may need fixing.

    This is pretty much all I know about it; I just thought I'd point
it out.

> I also think OpenLDAP is wrong.  First, it should be calling
> pthread_yield() because slapd is a multithreaded process
> and it just wants to run the other threads.  See:

    Is it possible that this problem has been noticed and fixed
already?

--
Joseph Fannin
jfannin@gmail.com

 /* So there I am, in the middle of my `netfilter-is-wonderful'
talk in Sydney, and someone asks `What happens if you try
to enlarge a 64k packet here?'. I think I said something
eloquent like `fuck'. - RR */

^ permalink raw reply	[flat|nested] 127+ messages in thread

* sched_yield() makes OpenLDAP slow
@ 2005-08-18  0:50 Bernardo Innocenti
  2005-08-18  0:47 ` Con Kolivas
  2005-08-18  1:07 ` Joseph Fannin
  0 siblings, 2 replies; 127+ messages in thread
From: Bernardo Innocenti @ 2005-08-18  0:50 UTC (permalink / raw)
  To: lkml, OpenLDAP-devel; +Cc: Giovanni Bajo, Simone Zinanni

Hello,

I've been investigating a performance problem on a
server using OpenLDAP 2.2.26 for nss resolution and
running kernel 2.6.12.

When a CPU bound process such as GCC is running in the
background (even at nice 10), many trivial commands such
as "su" or "groups" become extremely slow and take a few
seconds to complete.

strace revealed that data exchange over the slapd socket
was where most of the time was spent.  Looking at the
slapd side, I see several calls to sched_yield() like this:


[pid  8780]      0.000033 stat64("gidNumber.dbb", 0xb7b3ebcc) = -1 EACCES (Permission denied)
[pid  8780]      0.000059 pread(20, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\2\0\344\17\2\3"..., 4096, 4096) = 4096
[pid  8780]      0.000083 pread(20, "\0\0\0\0\1\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\222\0<\7\1\5\370"..., 4096, 16384) = 4096
[pid  8780]      0.000078 time(NULL)    = 1124322520
[pid  8780]      0.000066 pread(11, "\0\0\0\0\1\0\0\0\250\0\0\0\231\0\0\0\235\0\0\0\16\0000"..., 4096, 688128) = 4096
[pid  8780]      0.000241 write(19, "0e\2\1\3d`\4$cn=bernie,ou=group,dc=d"..., 103) = 103
[pid  8780]      0.000137 sched_yield( <unfinished ...>
[pid  8781]      0.050020 <... sched_yield resumed> ) = 0
[pid  8780]      0.000025 <... sched_yield resumed> ) = 0
[pid  8781]      0.000060 futex(0x925ab20, FUTEX_WAIT, 33, NULL <unfinished ...>
[pid  8780]      0.000026 write(19, "0\f\2\1\3e\7\n\1\0\4\0\4\0", 14) = 14
[pid  8774]      0.000774 <... select resumed> ) = 1 (in [19])


The relative timestamp reveals that slapd is spending 50ms
after yielding.  Meanwhile, GCC is probably being scheduled
for a whole quantum.

Reading the man-page of sched_yield() it seems this isn't
the correct behavior:

   Note: If the current process is the only process in the
   highest priority list at that time, this process will
   continue to run after a call to sched_yield.

I also think OpenLDAP is wrong.  First, it should be calling
pthread_yield() because slapd is a multithreaded process
and it just wants to run the other threads.  See:

int
ldap_pvt_thread_yield( void )
{
#if HAVE_THR_YIELD
        return thr_yield();             /* Solaris/UI threads */

#elif HAVE_PTHREADS == 10
        return sched_yield();           /* final POSIX.1c pthreads */

#elif defined(_POSIX_THREAD_IS_GNU_PTH)
        sched_yield();                  /* GNU Pth */
        return 0;

#elif HAVE_PTHREADS == 6
        pthread_yield(NULL);            /* draft-6 pthreads: takes an arg */
        return 0;
#else
        pthread_yield();                /* older pthreads drafts */
        return 0;
#endif
}


-- 
  // Bernardo Innocenti - Develer S.r.l., R&D dept.
\X/  http://www.develer.com/


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: sched_yield() makes OpenLDAP slow
  2005-08-18  0:50 Bernardo Innocenti
@ 2005-08-18  0:47 ` Con Kolivas
  2005-08-18 10:48   ` Maciej Soltysiak
  2005-08-18  1:07 ` Joseph Fannin
  1 sibling, 1 reply; 127+ messages in thread
From: Con Kolivas @ 2005-08-18  0:47 UTC (permalink / raw)
  To: Bernardo Innocenti; +Cc: lkml, OpenLDAP-devel, Giovanni Bajo, Simone Zinanni

On Thu, 18 Aug 2005 10:50 am, Bernardo Innocenti wrote:
> Hello,
>
> I've been investigating a performance problem on a
> server using OpenLDAP 2.2.26 for nss resolution and
> running kernel 2.6.12.
>
> When a CPU bound process such as GCC is running in the
> background (even at nice 10), many trivial commands such
> as "su" or "groups" become extremely slow and take a few
> seconds to complete.
>
> strace revealed that data exchange over the slapd socket
> was where most of the time was spent.  Looking at the
> slapd side, I see several calls to sched_yield() like this:
>
>
> [pid  8780]      0.000033 stat64("gidNumber.dbb", 0xb7b3ebcc) = -1 EACCES (Permission denied)
> [pid  8780]      0.000059 pread(20, "\0\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\2\0\344\17\2\3"..., 4096, 4096) = 4096
> [pid  8780]      0.000083 pread(20, "\0\0\0\0\1\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\222\0<\7\1\5\370"..., 4096, 16384) = 4096
> [pid  8780]      0.000078 time(NULL)    = 1124322520
> [pid  8780]      0.000066 pread(11, "\0\0\0\0\1\0\0\0\250\0\0\0\231\0\0\0\235\0\0\0\16\0000"..., 4096, 688128) = 4096
> [pid  8780]      0.000241 write(19, "0e\2\1\3d`\4$cn=bernie,ou=group,dc=d"..., 103) = 103
> [pid  8780]      0.000137 sched_yield( <unfinished ...>
> [pid  8781]      0.050020 <... sched_yield resumed> ) = 0
> [pid  8780]      0.000025 <... sched_yield resumed> ) = 0
> [pid  8781]      0.000060 futex(0x925ab20, FUTEX_WAIT, 33, NULL <unfinished ...>
> [pid  8780]      0.000026 write(19, "0\f\2\1\3e\7\n\1\0\4\0\4\0", 14) = 14
> [pid  8774]      0.000774 <... select resumed> ) = 1 (in [19])
>
>
> The relative timestamp reveals that slapd is spending 50ms
> after yielding.  Meanwhile, GCC is probably being scheduled
> for a whole quantum.
>
> Reading the man-page of sched_yield() it seems this isn't
> the correct behavior:
>
>    Note: If the current process is the only process in the
>    highest priority list at that time, this process will
>    continue to run after a call to sched_yield.
>
> I also think OpenLDAP is wrong.  First, it should be calling
> pthread_yield() because slapd is a multithreaded process
> and it just wants to run the other threads.  See:

sched_yield behaviour changed in the 2.5 series more than 3 years ago, and 
applications that use it as a locking primitive should be updated.

Cheers,
Con

^ permalink raw reply	[flat|nested] 127+ messages in thread

end of thread, other threads:[~2006-02-07  6:57 UTC | newest]

Thread overview: 127+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-24 22:59 e100 oops on resume Stefan Seyfried
2006-01-24 23:21 ` Mattia Dongili
2006-01-25  9:02   ` Olaf Kirch
2006-01-25 12:11     ` Olaf Kirch
2006-01-25 13:51       ` sched_yield() makes OpenLDAP slow Howard Chu
2006-01-25 14:38         ` Robert Hancock
2006-01-25 17:49         ` Christopher Friesen
2006-01-25 18:26           ` pthread_mutex_unlock (was Re: sched_yield() makes OpenLDAP slow) Howard Chu
2006-01-25 18:59             ` Nick Piggin
2006-01-25 19:32               ` Howard Chu
2006-01-26  8:51                 ` Nick Piggin
2006-01-26 14:15                   ` Kyle Moffett
2006-01-26 14:43                     ` Howard Chu
2006-01-26 19:57                       ` David Schwartz
2006-01-26 20:27                         ` Howard Chu
2006-01-26 20:46                           ` Nick Piggin
2006-01-26 21:32                             ` Howard Chu
2006-01-26 21:41                               ` Nick Piggin
2006-01-26 21:56                                 ` Howard Chu
2006-01-26 22:24                                   ` Nick Piggin
2006-01-27  8:08                                     ` Howard Chu
2006-01-27 19:25                                       ` Philipp Matthias Hahn
2006-02-01 12:31                                       ` Nick Piggin
2006-01-27  4:27                                   ` Steven Rostedt
2006-01-26 21:58                               ` Christopher Friesen
2006-01-27  4:13                               ` Steven Rostedt
2006-01-27  2:16                           ` David Schwartz
2006-01-27  8:19                             ` Howard Chu
2006-01-27 19:50                               ` David Schwartz
2006-01-27 20:13                                 ` Howard Chu
2006-01-27 21:05                                   ` David Schwartz
2006-01-27 21:23                                     ` Howard Chu
2006-01-27 23:31                                       ` David Schwartz
2006-01-30  8:28                         ` Helge Hafting
2006-01-26 10:38                 ` Nikita Danilov
2006-01-30  8:35                   ` Helge Hafting
2006-01-30 11:13                     ` Nikita Danilov
2006-01-31 23:18                     ` David Schwartz
2006-01-25 21:06             ` Lee Revell
2006-01-25 22:14               ` Howard Chu
2006-01-26  0:16                 ` Robert Hancock
2006-01-26  0:49                   ` Howard Chu
2006-01-26  1:04                     ` Lee Revell
2006-01-26  1:31                       ` Howard Chu
2006-01-26  2:05                 ` David Schwartz
2006-01-26  2:48                   ` Mark Lord
2006-01-26  3:30                     ` David Schwartz
2006-01-26  3:49                       ` Samuel Masham
2006-01-26  4:02                         ` Samuel Masham
2006-01-26  4:53                           ` Lee Revell
2006-01-26  6:14                             ` Samuel Masham
2006-01-26  8:54                 ` Nick Piggin
2006-01-26 14:24                   ` Howard Chu
2006-01-26 14:54                     ` Nick Piggin
2006-01-26 15:23                       ` Howard Chu
2006-01-26 15:51                         ` Nick Piggin
2006-01-26 16:44                           ` Howard Chu
2006-01-26 17:34                             ` linux-os (Dick Johnson)
2006-01-26 19:00                               ` Nick Piggin
2006-01-26 19:14                                 ` linux-os (Dick Johnson)
2006-01-26 21:12                                   ` Nick Piggin
2006-01-26 21:31                                     ` linux-os (Dick Johnson)
2006-01-27  7:06                                       ` Valdis.Kletnieks
2006-01-30  8:44                               ` Helge Hafting
2006-01-30  8:50                                 ` Howard Chu
2006-01-30 15:33                                   ` Kyle Moffett
2006-01-30 13:28                                 ` linux-os (Dick Johnson)
2006-01-30 15:15                                   ` Helge Hafting
2006-01-26 10:44                 ` Nikita Danilov
2006-01-26  0:08             ` Robert Hancock
2006-01-26  1:07         ` sched_yield() makes OpenLDAP slow David Schwartz
2006-01-26  8:30           ` Helge Hafting
2006-01-26  9:01             ` Nick Piggin
2006-01-26 10:50             ` Nikita Danilov
2006-01-25 19:37       ` e100 oops on resume Jesse Brandeburg
2006-01-25 20:14         ` Olaf Kirch
2006-01-25 22:28           ` Jesse Brandeburg
2006-01-26  0:28         ` Jesse Brandeburg
2006-01-26  9:32           ` Pavel Machek
2006-01-26 19:02           ` Stefan Seyfried
2006-01-26 19:09             ` Olaf Kirch
2006-01-28 11:53             ` Mattia Dongili
2006-01-28 19:53               ` Jesse Brandeburg
2006-02-07  6:57                 ` Jeff Garzik
     [not found]           ` <BAY108-DAV111F6EF46F6682FEECCC1593140@phx.gbl>
     [not found]             ` <4807377b0601271404w6dbfcff6s4de1c3f785dded9f@mail.gmail.com>
2006-01-30 17:25               ` Can I do a regular read to simulate prefetch instruction? John Smith
     [not found] <5uZqb-4fo-15@gated-at.bofh.it>
2006-01-14 22:47 ` sched_yield() makes OpenLDAP slow Robert Hancock
  -- strict thread matches above, loose matches on Subject: below --
2006-01-14 19:29 Howard Chu
     [not found] <43057641.70700@symas.com.suse.lists.linux.kernel>
     [not found] ` <17157.45712.877795.437505@gargle.gargle.HOWL.suse.lists.linux.kernel>
     [not found]   ` <430666DB.70802@symas.com.suse.lists.linux.kernel>
2005-08-20 13:48     ` Andi Kleen
2005-08-21 19:47       ` Florian Weimer
2005-08-22  5:09         ` Howard Chu
2005-08-22 13:06           ` Andi Kleen
2005-08-22 18:47             ` Howard Chu
2005-08-22 13:20           ` Florian Weimer
2005-08-22 23:19             ` Howard Chu
     [not found] <4D8eT-4rg-31@gated-at.bofh.it>
2005-08-20  3:20 ` Robert Hancock
2005-08-20  4:18   ` Nick Piggin
2005-08-20 18:38     ` Howard Chu
2005-08-20 20:57       ` Lee Revell
2005-08-20 21:24         ` Howard Chu
2005-08-21  0:36           ` Nick Piggin
2005-08-21 11:33           ` Nikita Danilov
2005-08-22  8:06             ` Howard Chu
2005-08-20 21:50       ` Lee Revell
2005-08-21  1:04       ` Robert Hancock
2005-08-22 11:44         ` linux-os (Dick Johnson)
2005-08-22 14:26           ` Robert Hancock
2005-08-23 11:17             ` linux-os (Dick Johnson)
2005-08-23 12:07               ` Denis Vlasenko
2005-08-24  3:37                 ` Lincoln Dale
2005-08-19  6:03 Howard Chu
2005-08-19  6:34 ` Nick Piggin
2005-08-19  6:59 ` Chris Wedgwood
2005-08-19 22:45   ` Howard Chu
2005-08-19 10:21 ` Nikita Danilov
2005-08-19 23:10   ` Howard Chu
2005-08-20 13:23     ` Nikita Danilov
2005-08-20 19:49       ` Howard Chu
2005-08-20 22:08         ` Nikita Danilov
     [not found] <4303DB48.8010902@develer.com.suse.lists.linux.kernel>
     [not found] ` <20050818010703.GA13127@nineveh.rivenstone.net.suse.lists.linux.kernel>
     [not found]   ` <4303F967.6000404@yahoo.com.au.suse.lists.linux.kernel>
     [not found]     ` <43054D9A.7090509@develer.com.suse.lists.linux.kernel>
2005-08-19  3:19       ` Andi Kleen
2005-08-19  3:30         ` Bernardo Innocenti
2005-08-18  0:50 Bernardo Innocenti
2005-08-18  0:47 ` Con Kolivas
2005-08-18 10:48   ` Maciej Soltysiak
2005-08-18  1:07 ` Joseph Fannin
2005-08-18  2:25   ` Bernardo Innocenti
2005-08-18  2:58   ` Nick Piggin
2005-08-19  3:10     ` Bernardo Innocenti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).