linux-kernel.vger.kernel.org archive mirror
* web page on O(1) scheduler
@ 2003-05-21  6:49 David Mosberger
  2003-05-21  9:01 ` Arjan van de Ven
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: David Mosberger @ 2003-05-21  6:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ia64

Recently, I started to look into some odd performance behaviors of the
O(1) scheduler.  I decided to document what I found in a web page
at:

	http://www.hpl.hp.com/research/linux/kernel/o1.php

(it may take another couple of hours before the pages show up outside
the HP firewall, so if you get "page not found" at the moment, don't
be surprised).

I should say that I have no direct stake in the CPU scheduler (never
worked on it, not sure I ever would want to), but I feel that it's
worthwhile to at least document the O(1) scheduler a bit better.
Also, I do feel that the O(1) scheduler currently has rather poor
"long-term" properties.  It would be nice if some of those properties
could be improved without hurting the other excellent properties of
the current O(1) scheduler.

I think the web pages should be most relevant to the HPTC (high
performance technical computing) community, since this is the
community that is most likely affected by some of the performance
oddities of the O(1) scheduler.  Certainly anyone using OpenMP on
Intel platforms (x86 and ia64) may want to take a look.

Comments welcome.

	--david

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  6:49 web page on O(1) scheduler David Mosberger
@ 2003-05-21  9:01 ` Arjan van de Ven
  2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
                     ` (2 more replies)
  2003-05-21  9:26 ` Mike Galbraith
  2003-05-21 18:31 ` [Linux-ia64] " David Mosberger
  2 siblings, 3 replies; 47+ messages in thread
From: Arjan van de Ven @ 2003-05-21  9:01 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 675 bytes --]

On Wed, 2003-05-21 at 08:49, David Mosberger wrote:

> 
> I think the web pages should be most relevant to the HPTC (high
> performance technical computing) community, since this is the
> community that is most likely affected by some of the performance
> oddities of the O(1) scheduler.  Certainly anyone using OpenMP on
> Intel platforms (x86 and ia64) may want to take a look.

Oh, you mean the broken OpenMP behavior of calling sched_yield() in a
tight loop to implement spinlocks?

I'd guess that instead of second-guessing the runtime, they should use
the pthreads primitives, which one would hope are the fastest for the
platform (e.g., futexes nowadays).
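
The pattern under discussion looks roughly like this (a hypothetical
sketch for illustration only, not code from any actual OpenMP runtime;
the GCC __sync builtins stand in for whatever atomics the runtime
really uses):

#include <sched.h>

/* Naive userspace spinlock: atomically grab a flag, and call
 * sched_yield() on every failed attempt.  Simple, and "correct" in
 * the sense that it never deadlocks, but the waiter's progress is
 * left entirely at the mercy of the scheduler's yield semantics. */
typedef struct { volatile int locked; } naive_lock_t;

static void naive_lock(naive_lock_t *l)
{
    /* atomic exchange with acquire semantics */
    while (__sync_lock_test_and_set(&l->locked, 1))
        sched_yield();  /* second-guess the scheduler */
}

static void naive_unlock(naive_lock_t *l)
{
    __sync_lock_release(&l->locked);  /* release barrier */
}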

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  6:49 web page on O(1) scheduler David Mosberger
  2003-05-21  9:01 ` Arjan van de Ven
@ 2003-05-21  9:26 ` Mike Galbraith
  2003-05-21  9:30   ` Mike Galbraith
                     ` (3 more replies)
  2003-05-21 18:31 ` [Linux-ia64] " David Mosberger
  2 siblings, 4 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-21  9:26 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, linux-ia64

At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
>Recently, I started to look into some odd performance behaviors of the
>O(1) scheduler.  I decided to document what I found in a web page
>at:
>
>         http://www.hpl.hp.com/research/linux/kernel/o1.php

<snip>

>Comments welcome.

The page mentions persistent starvation.  My own explorations of this issue 
indicate that the primary source is always selecting the highest priority 
queue.  Combine that with the round-robin, and you have a good chance of 
being grossly unfair with some workloads.  I know for certain that lock 
holders in the active array can be starved for very long periods by tasks 
entering higher priority queues, thereby causing even more starvation when 
they finally get the cpu and can release the lock (sleepers go through the 
roof).
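
In outline, the selection that causes this (quoting the 2.5.69
kernel/sched.c lines that the attached diff replaces):

	idx = sched_find_first_bit(array->bitmap);
	queue = array->queue + idx;
	next = list_entry(queue->next, task_t, run_list);

Lower-priority queues are never even examined while a higher-priority
queue is non-empty.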

Try the attached overly simplistic (KISS:) diff, and watch your starvation 
issues be very noticeably reduced.

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  9:26 ` Mike Galbraith
@ 2003-05-21  9:30   ` Mike Galbraith
  2003-05-21 17:56   ` David Mosberger
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-21  9:30 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 210 bytes --]

At 11:26 AM 5/21/2003 +0200, Mike Galbraith wrote:

>Try the attached overly simplistic (KISS:) diff, and watch your starvation 
>issues be very noticeably reduced.

Ahem ;-)

Now even attached.

         -Mike 

[-- Attachment #2: xx.diff --]
[-- Type: application/octet-stream, Size: 929 bytes --]

--- linux-2.5.69/kernel/sched.c.org	Wed May 21 07:45:00 2003
+++ linux-2.5.69/kernel/sched.c	Wed May 21 08:27:09 2003
@@ -1264,7 +1264,7 @@
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
+	struct list_head *head, *curr;
 	int idx;
 
 	/*
@@ -1331,8 +1331,22 @@
 	}
 
 	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
-	next = list_entry(queue->next, task_t, run_list);
+next_queue:
+	head = array->queue + idx;
+	curr = head->next;
+	next = list_entry(curr, task_t, run_list);
+	curr = curr->next;
+	/*
+	 * If we are about to wrap back to the head of the queue,
+	 * give a lower priority queue a chance to sneak one in.
+	 */
+	if (idx == prev->prio && curr == head && array->nr_active > 1) {
+		int tmp = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+		if (tmp < MAX_PRIO) {
+			idx = tmp;
+			goto next_queue;
+		}
+	}
 
 switch_tasks:
 	prefetch(next);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21  9:01 ` Arjan van de Ven
@ 2003-05-21 10:40   ` Duraid Madina
  2003-05-21 10:43     ` Christoph Hellwig
  2003-05-21 13:12     ` Arjan van de Ven
  2003-05-21 15:18   ` David Mosberger
  2003-05-23  1:07   ` [Linux-ia64] " Hans Boehm
  2 siblings, 2 replies; 47+ messages in thread
From: Duraid Madina @ 2003-05-21 10:40 UTC (permalink / raw)
  To: arjanv; +Cc: davidm, linux-kernel, linux-ia64

Dear Arjan,


       ///////
       //    O
      //      >                                    This is a graduate
       / \__ ~                                     student, laboratory
         ||                            /////       assistant, automotive
       (\ \)   (~)                    //  o   <--- engineer or other
       ( \ \  / /                    //    >       unfortunate soul
       (  \ \/ /         ____________/ \__O        attempting to get
       (   \__/         /  ___ ______\//           performance out of a
       /   | /@        (  /  / ______)/            machine running Linux
      (    |//          \ \ / /   (_)              by writing a simple
       \   ()            \ \O/                     and correct
        \  |              ) )                      multithreaded program.
         ) )             / /
        (  |_           / /_
        (____>         (____>

           ^
           |
           |
           |
           |

      This is you.



	(with apologies to the haxor brothers,)

	Duraid.


Arjan van de Ven wrote:
 > Oh, you mean the broken OpenMP behavior of calling sched_yield() in a
 > tight loop to implement spinlocks?



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
@ 2003-05-21 10:43     ` Christoph Hellwig
  2003-05-21 13:12     ` Arjan van de Ven
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2003-05-21 10:43 UTC (permalink / raw)
  To: Duraid Madina; +Cc: linux-kernel, linux-ia64


*plonk*


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
  2003-05-21 10:43     ` Christoph Hellwig
@ 2003-05-21 13:12     ` Arjan van de Ven
  2003-05-21 13:51       ` Olivier Galibert
  2003-05-21 19:18       ` Duraid Madina
  1 sibling, 2 replies; 47+ messages in thread
From: Arjan van de Ven @ 2003-05-21 13:12 UTC (permalink / raw)
  To: Duraid Madina; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

On Wed, 2003-05-21 at 12:40, Duraid Madina wrote:
> Dear Arjan,
> 
> 
>        ///////
>        //    O
>       //      >                                    This is a graduate
>        / \__ ~                                     student, laboratory
>          ||                            /////       assistant, automotive
>        (\ \)   (~)                    //  o   <--- engineer or other
>        ( \ \  / /                    //    >       unfortunate soul
>        (  \ \/ /         ____________/ \__O        attempting to get
>        (   \__/         /  ___ ______\//           performance out of a
>        /   | /@        (  /  / ______)/            machine running Linux
>       (    |//          \ \ / /   (_)              by writing a simple
>        \   ()            \ \O/                     and correct
>         \  |              ) )                      multithreaded program.
>          ) )             / /
>         (  |_           / /_
>         (____>         (____>
> 
>            ^
>            |
>            |
>            |
>            |
> 
>       This is you.
> 
> 

if you had spent the time you spent on this colorful graphic on reading
SUS or Posix about what sched_yield() means, you would actually have
learned something. sched_yield() means "I'm the least important thing in
the system". It's the wrong thing for cross-cpu spinlocks; futexes are
optimal for that. For letting higher-priority tasks run, a sleep(0) is
also far closer to the right behavior than a sched_yield().
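
For the curious, "use a futex" means roughly the following (a minimal
sketch assuming the Linux futex(2) syscall and GCC atomic builtins;
error handling omitted, and a real implementation such as NPTL's is
considerably more refined):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked.  A contended waiter sleeps in the kernel
 * and is woken by the unlocker; no spinning, no sched_yield(). */
static void futex_lock(int *f)
{
    while (__sync_lock_test_and_set(f, 1))
        /* sleep until *f is no longer 1; returns immediately if it
         * already changed, and the loop re-checks either way */
        syscall(SYS_futex, f, FUTEX_WAIT, 1, NULL, NULL, 0);
}

static void futex_unlock(int *f)
{
    __sync_lock_release(f);  /* *f = 0, with release semantics */
    syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}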


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 13:12     ` Arjan van de Ven
@ 2003-05-21 13:51       ` Olivier Galibert
  2003-05-28 22:12         ` Bill Davidsen
       [not found]         ` <Pine.LNX.3.96.1030528180909.21414B-100000@gatekeeper.tmr.com>
  2003-05-21 19:18       ` Duraid Madina
  1 sibling, 2 replies; 47+ messages in thread
From: Olivier Galibert @ 2003-05-21 13:51 UTC (permalink / raw)
  To: linux-kernel

On Wed, May 21, 2003 at 03:12:12PM +0200, Arjan van de Ven wrote:
> if you had spent the time you spent on this colorful graphic on reading
> SUS or Posix about what sched_yield() means, you would actually have
> learned something. sched_yield() means "I'm the least important thing in
> the system".

SUSv2:

DESCRIPTION

 The sched_yield() function forces the running thread to relinquish
 the processor until it again becomes the head of its thread list. It
 takes no arguments.


Aka "I skip the rest of my turn, try the others again once", not "I'm
unimportant" nor "please rerun me immediatly".

What is it with you people wanting to make sched_yield() unusable for
anything that makes sense?

  OG.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  9:01 ` Arjan van de Ven
  2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
@ 2003-05-21 15:18   ` David Mosberger
  2003-05-23  1:07   ` [Linux-ia64] " Hans Boehm
  2 siblings, 0 replies; 47+ messages in thread
From: David Mosberger @ 2003-05-21 15:18 UTC (permalink / raw)
  To: arjanv; +Cc: davidm, linux-kernel, linux-ia64

>>>>> On 21 May 2003 11:01:33 +0200, Arjan van de Ven <arjanv@redhat.com> said:

  Arjan> On Wed, 2003-05-21 at 08:49, David Mosberger wrote:
  >>  I think the web pages should be most relevant to the HPTC (high
  >> performance technical computing) community, since this is the
  >> community that is most likely affected by some of the performance
  >> oddities of the O(1) scheduler.  Certainly anyone using OpenMP on
  >> Intel platforms (x86 and ia64) may want to take a look.

  Arjan> Oh, you mean the broken OpenMP behavior of calling
  Arjan> sched_yield() in a tight loop to implement spinlocks?

Please have the courtesy of reading the web page before jumping to
conclusions.

	--david

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  9:26 ` Mike Galbraith
  2003-05-21  9:30   ` Mike Galbraith
@ 2003-05-21 17:56   ` David Mosberger
  2003-05-21 20:46     ` Mike Galbraith
  2003-05-22  9:52     ` Mike Galbraith
  2003-05-22  0:38   ` web page on O(1) scheduler Rik van Riel
       [not found]   ` <Pine.LNX.4.50L.0305212038120.5425-100000@imladris.surriel.com>
  3 siblings, 2 replies; 47+ messages in thread
From: David Mosberger @ 2003-05-21 17:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: davidm, linux-kernel, linux-ia64

>>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de> said:

  Mike> The page mentions persistent starvation.  My own explorations
  Mike> of this issue indicate that the primary source is always
  Mike> selecting the highest priority queue.

My working assumption is that the problem is a bug with the dynamic
prioritization.  The task receiving the signals calls sleep() after
handling a signal and hence its dynamic priority should end up higher
than the priority of the task sending signals (since the sender never
relinquishes the CPU voluntarily).

However, I haven't actually had time to look at the relevant code, so
I may be missing something.  If you understand the issue better,
please explain to me why this isn't a dynamic priority issue.

	--david

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] web page on O(1) scheduler
  2003-05-21  6:49 web page on O(1) scheduler David Mosberger
  2003-05-21  9:01 ` Arjan van de Ven
  2003-05-21  9:26 ` Mike Galbraith
@ 2003-05-21 18:31 ` David Mosberger
  2003-05-21 20:00   ` Cyclades Cyclom-Y ISA on 2.5.69 John Stoffel
  2 siblings, 1 reply; 47+ messages in thread
From: David Mosberger @ 2003-05-21 18:31 UTC (permalink / raw)
  To: linux-kernel, linux-ia64

It has been pointed out that on the main web page
(http://www.hpl.hp.com/research/linux/kernel/o1.php) it is very
difficult to see that the section entitled "Some problems with the
O(1) scheduler" contains links to the full problem descriptions (the
links aren't underlined and the color looks almost black).

So, in case you missed that, please do click on those links to get the
full description.

I'll see what we can do to improve the layout of the page.

Thanks,

	--david


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 13:12     ` Arjan van de Ven
  2003-05-21 13:51       ` Olivier Galibert
@ 2003-05-21 19:18       ` Duraid Madina
  2003-05-21 20:03         ` Helge Hafting
  2003-05-21 22:59         ` Alan Cox
  1 sibling, 2 replies; 47+ messages in thread
From: Duraid Madina @ 2003-05-21 19:18 UTC (permalink / raw)
  To: arjanv; +Cc: linux-kernel

Arjan van de Ven wrote:

> if you had spent the time you spent on this colorful graphic on reading
> SUS or Posix about what sched_yield() means

Quoth the man page,

"A process can relinquish the processor voluntarily without blocking by 
calling sched_yield. The process will then be moved to the end of the 
queue for its static priority and a new process gets to run."

How you get from there to "I'm the least important thing in the system" 
is, once again, beyond me. And even if that were a reasonable 
interpretation of the word 'yield', you would still hope that more than 
one CPU would get something to do if there was enough work to go around. 
Agreed, "spinning" on sched_yield is a very naive way of doing 
spinlocks. But that doesn't change the fact that it's a simple and 
correct way. One would have hoped that calling sched_yield every few 
million cycles wouldn't break the scheduler.

	Duraid


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Cyclades Cyclom-Y ISA on 2.5.69
  2003-05-21 18:31 ` [Linux-ia64] " David Mosberger
@ 2003-05-21 20:00   ` John Stoffel
  0 siblings, 0 replies; 47+ messages in thread
From: John Stoffel @ 2003-05-21 20:00 UTC (permalink / raw)
  To: linux-kernel


Hi all,

Has anyone else run into a problem compiling the Cyclades serial board
driver under 2.5.x when you have ISA defined as well?  I've done a
quick hack patch to make it compile.  I've been running 2.4.21-rc*
lately, so it's time I started to actually test this patch below and
see how it works.  

I read the file Documentation/cli-sti-removal.txt and I think I've
done the right things here.

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-399-0479



*** drivers/char/cyclades.c.org Wed May 21 11:45:48 2003
--- drivers/char/cyclades.c     Wed May 21 12:07:53 2003
***************
*** 872,877 ****
--- 872,878 ----
  static int cyz_issue_cmd(struct cyclades_card *, uclong, ucchar, uclong);
  #ifdef CONFIG_ISA
  static unsigned detect_isa_irq (volatile ucchar *);
+ static spinlock_t isa_card_lock = SPIN_LOCK_UNLOCKED;
  #endif /* CONFIG_ISA */
  
  static int cyclades_get_proc_info(char *, char **, off_t, int, int *, void *);
***************
*** 1056,1069 ****
      udelay(5000L);
  
      /* Enable the Tx interrupts on the CD1400 */
!     save_flags(flags); cli();
        cy_writeb((u_long)address + (CyCAR<<index), 0);
        cyy_issue_cmd(address, CyCHAN_CTL|CyENB_XMTR, index);
  
        cy_writeb((u_long)address + (CyCAR<<index), 0);
        cy_writeb((u_long)address + (CySRER<<index), 
                cy_readb(address + (CySRER<<index)) | CyTxRdy);
!     restore_flags(flags);
  
      /* Wait ... */
      udelay(5000L);
--- 1057,1070 ----
      udelay(5000L);
  
      /* Enable the Tx interrupts on the CD1400 */
!     spin_lock_irqsave(&isa_card_lock,flags);
        cy_writeb((u_long)address + (CyCAR<<index), 0);
        cyy_issue_cmd(address, CyCHAN_CTL|CyENB_XMTR, index);
  
        cy_writeb((u_long)address + (CyCAR<<index), 0);
        cy_writeb((u_long)address + (CySRER<<index), 
                cy_readb(address + (CySRER<<index)) | CyTxRdy);
!     spin_unlock_irqrestore(&isa_card_lock, flags);
  
      /* Wait ... */
      udelay(5000L);
***************
*** 5762,5768 ****
      }
  #endif /* CONFIG_CYZ_INTR */
  
!     save_flags(flags); cli();
  
      if ((e1 = tty_unregister_driver(&cy_serial_driver)))
              printk("cyc: failed to unregister Cyclades serial
              driver(%d)\n",
--- 5763,5769 ----
      }
  #endif /* CONFIG_CYZ_INTR */
  
!     spin_lock_irqsave(&isa_card_lock, flags);
  
      if ((e1 = tty_unregister_driver(&cy_serial_driver)))
              printk("cyc: failed to unregister Cyclades serial
              driver(%d)\n",
***************
*** 5771,5777 ****
              printk("cyc: failed to unregister Cyclades callout
              driver (%d)\n", 
                e2);
  
!     restore_flags(flags);
  
      for (i = 0; i < NR_CARDS; i++) {
          if (cy_card[i].base_addr != 0) {
--- 5772,5778 ----
              printk("cyc: failed to unregister Cyclades callout
          driver (%d)\n", 
                e2);
  
!     spin_unlock_irqrestore(&isa_card_lock, flags);
  
      for (i = 0; i < NR_CARDS; i++) {
          if (cy_card[i].base_addr != 0) {

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 19:18       ` Duraid Madina
@ 2003-05-21 20:03         ` Helge Hafting
  2003-05-21 22:59         ` Alan Cox
  1 sibling, 0 replies; 47+ messages in thread
From: Helge Hafting @ 2003-05-21 20:03 UTC (permalink / raw)
  To: Duraid Madina; +Cc: arjanv, linux-kernel

On Thu, May 22, 2003 at 05:18:02AM +1000, Duraid Madina wrote:
> Arjan van de Ven wrote:
> 
> >if you had spent the time you spent on this colorful graphic on reading
> >SUS or Posix about what sched_yield() means
> 
> Quoth the man page,
> 
> "A process can relinquish the processor voluntarily without blocking by 
> calling sched_yield. The process will then be moved to the end of the 
> queue for its static priority and a new process gets to run."
>
This assumes the implementation uses queues, one per priority level.
And even if it does, the process may be the only one with that
priority, making this a useless way of giving up "some time".
It'll still get rescheduled over and over and prevent
lower-priority processes from running.
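
A concrete (hypothetical) illustration of that: the parent below
yields continuously at nice 0 while a nice +19 hog runs underneath it.
If yielding only requeues within the caller's own priority level, the
hog makes little or no progress until the parent's timeslice runs out
(exact behavior depends on the scheduler version; sketch only):

#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* child: lower-priority cpu hog, counts work done in 5s */
        volatile unsigned long n = 0;
        time_t start;

        setpriority(PRIO_PROCESS, 0, 19);
        start = time(NULL);
        while (time(NULL) - start < 5)
            n++;
        printf("nice +19 hog: %lu iterations in 5s\n",
               (unsigned long) n);
        _exit(0);
    }
    /* parent: yields continuously, but stays in its own queue */
    while (waitpid(pid, NULL, WNOHANG) == 0)
        sched_yield();
    return 0;
}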

 
> How you get from there to "I'm the least important thing in the system" 
> is, once again, beyond me. And even if that were a reasonable 
> interpretation of the word 'yield', you would still hope that more than 
> one CPU would get something to do if there was enough work to go around. 
> Agreed, "spinning" on sched_yield is a very naive way of doing 
> spinlocks. But that doesn't change the fact that it's a simple and 
> correct way. One would have hoped that calling sched_yield every few 
> million cycles wouldn't break the scheduler.

The way I understand it, the scheduler doesn't "break".  You just
get a lot of useless busy waiting.

Helge Hafting

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21 17:56   ` David Mosberger
@ 2003-05-21 20:46     ` Mike Galbraith
  2003-05-22  9:52     ` Mike Galbraith
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-21 20:46 UTC (permalink / raw)
  To: davidm; +Cc: davidm, linux-kernel, linux-ia64

At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de> 
> said:
>
>   Mike> The page mentions persistent starvation.  My own explorations
>   Mike> of this issue indicate that the primary source is always
>   Mike> selecting the highest priority queue.
>
>My working assumption is that the problem is a bug with the dynamic
>prioritization.  The task receiving the signals calls sleep() after
>handling a signal and hence its dynamic priority should end up higher
>than the priority of the task sending signals (since the sender never
>relinquishes the CPU voluntarily).

The only thing that matters is how much you sleep versus run, so yes, it 
should have a higher priority unless that signal handling is heavy on cpu.  
If it doesn't, there must be a different problem, because the dynamic 
priority portion of the scheduler is dead simple.  The only ways I can 
imagine priority ending up lower than expected are heavyweight interrupt 
load or spinning out of control.

>However, I haven't actually had time to look at the relevant code, so
>I may be missing something.  If you understand the issue better,
>please explain to me why this isn't a dynamic priority issue.

I just saw your other post regarding the web page.  Now that I know that 
there's a detailed description in there somewhere, I'll go read it and see 
if any of what I've gleaned from crawling around the scheduler code is 
useful.  I thought you might be encountering the same kind of generic 
starvation I've seen.  Ergo, the simple diag patch.

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 19:18       ` Duraid Madina
  2003-05-21 20:03         ` Helge Hafting
@ 2003-05-21 22:59         ` Alan Cox
  1 sibling, 0 replies; 47+ messages in thread
From: Alan Cox @ 2003-05-21 22:59 UTC (permalink / raw)
  To: Duraid Madina; +Cc: arjanv, Linux Kernel Mailing List

On Mer, 2003-05-21 at 20:18, Duraid Madina wrote:
> "A process can relinquish the processor voluntarily without blocking by 
> calling sched_yield. The process will then be moved to the end of the 
> queue for its static priority and a new process gets to run."
> 
> How you get from there to "I'm the least important thing in the system" 
> is, once again, beyond me. And even if that were a reasonable 


Assuming nobody is niced, the two say the same thing.  Note "for its
_static_ priority", not for its dynamic priority.  So sched_yield puts you
at the back of a pile of cpu hogs because that's what the spec says it does.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21  9:26 ` Mike Galbraith
  2003-05-21  9:30   ` Mike Galbraith
  2003-05-21 17:56   ` David Mosberger
@ 2003-05-22  0:38   ` Rik van Riel
       [not found]   ` <Pine.LNX.4.50L.0305212038120.5425-100000@imladris.surriel.com>
  3 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2003-05-22  0:38 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: davidm, linux-kernel, linux-ia64

On Wed, 21 May 2003, Mike Galbraith wrote:
> At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
> >Recently, I started to look into some odd performance behaviors of the
> >O(1) scheduler.  I decided to document what I found in a web page
> >at:
> >
> >         http://www.hpl.hp.com/research/linux/kernel/o1.php
>
> The page mentions persistent starvation.  My own explorations of this
> issue indicate that the primary source is always selecting the highest
> priority queue.

It's deeper than that.  The O(1) scheduler doesn't consider
actual CPU usage as a factor of CPU priority.


Rik
-- 
Engineers don't grow up, they grow sideways.
http://www.surriel.com/		http://kernelnewbies.org/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
       [not found]   ` <Pine.LNX.4.50L.0305212038120.5425-100000@imladris.surriel.com>
@ 2003-05-22  5:52     ` Mike Galbraith
  2003-05-22 14:47       ` Valdis.Kletnieks
  2003-05-25  9:17       ` Mike Galbraith
  0 siblings, 2 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-22  5:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: davidm, linux-kernel, linux-ia64

At 08:38 PM 5/21/2003 -0400, Rik van Riel wrote:
>On Wed, 21 May 2003, Mike Galbraith wrote:
> > At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
> > >Recently, I started to look into some odd performance behaviors of the
> > >O(1) scheduler.  I decided to document what I found in a web page
> > >at:
> > >
> > >         http://www.hpl.hp.com/research/linux/kernel/o1.php
> >
> > The page mentions persistent starvation.  My own explorations of this
> > issue indicate that the primary source is always selecting the highest
> > priority queue.
>
>It's deeper than that.  The O(1) scheduler doesn't consider
>actual CPU usage as a factor of CPU priority.

Oh, there's no doubt in my mind that it's _much_ deeper than my little 
surface scratchings ;-)

It does consider cpu usage though.  Your run history is right there in your 
accumulated sleep_avg.  Unfortunately (in some ways, fortunate in others.. 
conflict) that information can be diluted down to nothing instantly by new 
input from one wakeup.
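
For concreteness, a userspace model of that calculation (formula and
constants paraphrased from 2.5.69's kernel/sched.c as I remember them;
the kernel source is authoritative):

#include <stdio.h>

#define HZ               1000
#define MAX_SLEEP_AVG    (10 * HZ)  /* sleep history cap, in ticks */
#define PRIO_BONUS_RATIO 25
#define MAX_USER_PRIO    40

/* maps sleep_avg in [0, MAX_SLEEP_AVG] to a bonus in [-5, +5] */
static int prio_bonus(long sleep_avg)
{
    return (int) (MAX_USER_PRIO * PRIO_BONUS_RATIO * sleep_avg /
                  MAX_SLEEP_AVG / 100) -
           MAX_USER_PRIO * PRIO_BONUS_RATIO / 2 / 100;
}

int main(void)
{
    /* activate_task() credits a whole sleep in one lump (capped at
     * MAX_SLEEP_AVG), so a single long sleep saturates sleep_avg
     * and instantly dilutes the accumulated run history. */
    printf("sleep_avg = 0     -> bonus %d\n", prio_bonus(0));
    printf("sleep_avg = 5*HZ  -> bonus %d\n", prio_bonus(5 * HZ));
    printf("sleep_avg = 10*HZ -> bonus %d\n", prio_bonus(10 * HZ));
    return 0;
}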

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-21 17:56   ` David Mosberger
  2003-05-21 20:46     ` Mike Galbraith
@ 2003-05-22  9:52     ` Mike Galbraith
  2003-05-22 16:25       ` Mike Galbraith
  2003-06-03 20:59       ` sched.c gives ICE [Was: Re: web page on O(1) scheduler] J.A. Magallon
  1 sibling, 2 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-22  9:52 UTC (permalink / raw)
  To: davidm; +Cc: davidm, linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 951 bytes --]

At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de> 
> said:
>
>   Mike> The page mentions persistent starvation.  My own explorations
>   Mike> of this issue indicate that the primary source is always
>   Mike> selecting the highest priority queue.
>
>My working assumption is that the problem is a bug with the dynamic
>prioritization.  The task receiving the signals calls sleep() after
>handling a signal and hence its dynamic priority should end up higher
>than the priority of the task sending signals (since the sender never
>relinquishes the CPU voluntarily).
>
>However, I haven't actually had time to look at the relevant code, so
>I may be missing something.  If you understand the issue better,
>please explain to me why this isn't a dynamic priority issue.

You're right, it looks like a corner case.  It works fine here with the 
attached diff.

         -Mike 

[-- Attachment #2: xx.diff --]
[-- Type: application/octet-stream, Size: 1553 bytes --]

--- linux-2.5.69.virgin/kernel/sched.c.org	Wed May 21 07:45:00 2003
+++ linux-2.5.69.virgin/kernel/sched.c	Thu May 22 11:06:12 2003
@@ -1264,7 +1264,7 @@
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
+	struct list_head *head, *curr;
 	int idx;
 
 	/*
@@ -1286,7 +1286,6 @@
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1303,6 +1302,9 @@
 			break;
 		}
 	default:
+		/* One sleep credit for releasing the cpu immediately. */
+		if (prev->last_run == jiffies && prev->sleep_avg < MAX_SLEEP_AVG)
+			prev->sleep_avg++;
 		deactivate_task(prev, rq);
 	case TASK_RUNNING:
 		;
@@ -1331,8 +1333,22 @@
 	}
 
 	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
-	next = list_entry(queue->next, task_t, run_list);
+next_queue:
+	head = array->queue + idx;
+	curr = head->next;
+	next = list_entry(curr, task_t, run_list);
+	curr = curr->next;
+	/*
+	 * If we are about to wrap back to the head of the queue,
+	 * give a lower priority queue a chance to sneak one in.
+	 */
+	if (idx == prev->prio && curr == head && array->nr_active > 1) {
+		int tmp = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+		if (tmp < MAX_PRIO) {
+			idx = tmp;
+			goto next_queue;
+		}
+	}
 
 switch_tasks:
 	prefetch(next);
@@ -1342,6 +1358,7 @@
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
+		prev->last_run = next->last_run = jiffies;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-22  5:52     ` Mike Galbraith
@ 2003-05-22 14:47       ` Valdis.Kletnieks
  2003-05-22 16:12         ` Mike Galbraith
  2003-05-25  9:17       ` Mike Galbraith
  1 sibling, 1 reply; 47+ messages in thread
From: Valdis.Kletnieks @ 2003-05-22 14:47 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

On Thu, 22 May 2003 07:52:44 +0200, Mike Galbraith said:

> It does consider cpu usage though.  Your run history is right there in your 
> accumulated sleep_avg.  Unfortunately (in some ways, fortunate in others.. 
> conflict) that information can be diluted down to nothing instantly by new 
> input from one wakeup.

Maybe there should be a scheduler tunable that says how much it should be
diluted?

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-22 14:47       ` Valdis.Kletnieks
@ 2003-05-22 16:12         ` Mike Galbraith
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-22 16:12 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

At 10:47 AM 5/22/2003 -0400, Valdis.Kletnieks@vt.edu wrote:
>On Thu, 22 May 2003 07:52:44 +0200, Mike Galbraith said:
>
> > It does consider cpu usage though.  Your run history is right there in 
> your
> > accumulated sleep_avg.  Unfortunately (in some ways, fortunate in others..
> > conflict) that information can be diluted down to nothing instantly by new
> > input from one wakeup.
>
>Maybe there should be a scheduler tunable that says how much it should be
>diluted?

I hope not.  Once you use a knob, you _have_ to use it again when you 
change workloads.

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-22  9:52     ` Mike Galbraith
@ 2003-05-22 16:25       ` Mike Galbraith
  2003-05-22 17:58         ` David Mosberger
  2003-05-27 15:16         ` [case closed] " Mike Galbraith
  2003-06-03 20:59       ` sched.c gives ICE [Was: Re: web page on O(1) scheduler] J.A. Magallon
  1 sibling, 2 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-22 16:25 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, linux-ia64

At 11:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
>> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de> 
>> said:
>>
>>   Mike> The page mentions persistent starvation.  My own explorations
>>   Mike> of this issue indicate that the primary source is always
>>   Mike> selecting the highest priority queue.
>>
>>My working assumption is that the problem is a bug with the dynamic
>>prioritization.  The task receiving the signals calls sleep() after
>>handling a signal and hence its dynamic priority should end up higher
>>than the priority of the task sending signals (since the sender never
>>relinquishes the CPU voluntarily).
>>
>>However, I haven't actually had time to look at the relevant code, so
>>I may be missing something.  If you understand the issue better,
>>please explain to me why this isn't a dynamic priority issue.
>
>You're right, it looks like a corner case.

Out of curiosity, is someone hitting that with a real program?

         -Mike

aside:
If so, I suppose nano-ticks may be needed.  Rounding up gave us too many 
"nano-ticks", and was the first problem with irman, which brought round-down 
into activate_task().  Now test-starve.c appears, and it turns out to 
be too many nano-ticks _missing_.  (Rounding up doesn't "fix" that one btw 
[too fast], but what I did to demonstrate the problem does re-break irman 
rather nicely:) 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-22 16:25       ` Mike Galbraith
@ 2003-05-22 17:58         ` David Mosberger
  2003-05-27 15:16         ` [case closed] " Mike Galbraith
  1 sibling, 0 replies; 47+ messages in thread
From: David Mosberger @ 2003-05-22 17:58 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: davidm, linux-kernel, linux-ia64

>>>>> On Thu, 22 May 2003 18:25:54 +0200, Mike Galbraith <efault@gmx.de> said:

  Mike> Out of curiosity, is someone hitting that with a real program?

Yes, it caused a failure in a validation program.  The test case is a
stripped-down version of the original program and, of course, doesn't
make much sense other than to demonstrate the scheduling problem.

	--david

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21  9:01 ` Arjan van de Ven
  2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
  2003-05-21 15:18   ` David Mosberger
@ 2003-05-23  1:07   ` Hans Boehm
  2003-05-23  8:30     ` Arjan van de Ven
  2 siblings, 1 reply; 47+ messages in thread
From: Hans Boehm @ 2003-05-23  1:07 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: davidm, linux-kernel, linux-ia64

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2114 bytes --]

On Wed, 21 May 2003, Arjan van de Ven wrote:

> Oh, you mean the broken OpenMP behavior of calling sched_yield() in a
> tight loop to implement spinlocks?
> 
> I'd guess that instead of second-guessing the runtime, they should use
> the pthreads primitives, which one would hope are the fastest for the
> platform (e.g., futexes nowadays).
> 

That really depends on the circumstances.  I think there will always
be applications that use custom synchronization because they need the
last bit of performance in their specific environment.

I just ran a quick test to compare

a) 10,000,000 lock/unlock pairs with a rudimentary custom user-level spinlock
implementation, and
b) 10,000,000 pthread_mutex_lock/pthread_mutex_unlock pairs.

All locking is done by a single thread; this is the completely contention-free
case.

On a 1GHz Itanium 2 I get

Custom lock: 180 msecs
Pthread lock: 1382 msecs

On a 2GHz Xeon, I get

Custom lock: 646 msecs
Pthread lock: 1659 msecs

There are good reasons for the differences:

1) The pthread implementation needs an atomic operation on lock exit to
check whether waiters need to be awoken.  The spin lock just needs a
release barrier, which is cheap on both platforms.

2) The custom lock can be inlined.  The pthread one normally involves a
dynamic library call.  That has to be the case if you want to be able
to plug in different thread implementations.

3) Pthread locks come in various flavors, and the interface is designed
such that you have to check at runtime which flavor you have.

In the contention case there are other interesting issues, since it's
often far more efficient to spin before attempting to yield, and pthreads
implementations don't always do that.
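
A sketch of that spin-then-yield-then-sleep policy, in terms of the
atomic_ops primitives used in the attached test (a hypothetical
out-of-line path; time_lock.c itself deliberately punts on this since
its measurement is contention-free):

#include <sched.h>
#include <unistd.h>
#include "atomic_ops.h"

void spin_lock_slow(AO_TS_T *lock)
{
    unsigned backoff_us = 1;
    int i;

    for (;;) {
        /* short waits: spin, cheapest when the holder is about to
         * release on another cpu */
        for (i = 0; i < 1000; i++)
            if (AO_test_and_set_acquire(lock) == AO_TS_CLEAR)
                return;
        /* medium waits: give up the rest of this timeslice once */
        sched_yield();
        if (AO_test_and_set_acquire(lock) == AO_TS_CLEAR)
            return;
        /* long waits: sleep, with exponential backoff */
        usleep(backoff_us);
        if (backoff_us < 1024)
            backoff_us <<= 1;
    }
}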

The original case may also have involved barrier synchronization instead
of locks, in which case there is probably at least as much motivation to
"roll your own".

To reproduce my results from attached files (ao is my current attempt at
a portable atomic operations library):

tar xvfz ao-0.2.tar.gz
cp time_lock.c ao-0.2
cd ao-0.2
gcc -O2 time_lock.c -lpthread
./a.out

-- 
Hans Boehm
(hboehm@hpl.hp.com)

[-- Attachment #2: ao-0.2.tar.gz --]
[-- Type: APPLICATION/x-gzip, Size: 13274 bytes --]

[-- Attachment #3: time_lock.c --]
[-- Type: TEXT/PLAIN, Size: 1759 bytes --]

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>	/* for abort() */
#include <sys/time.h>
#include "atomic_ops.h"

/* Timing code stolen from Ellis/Kovac/Boehm GCBench.			*/
#define currentTime() stats_rtclock()
#define elapsedTime(x) (x)

unsigned long
stats_rtclock( void )
{
  struct timeval t;
  struct timezone tz;

  if (gettimeofday( &t, &tz ) == -1)
    return 0;
  return (t.tv_sec * 1000 + t.tv_usec / 1000);
}

AO_TS_T my_spin_lock = AO_TS_INITIALIZER;

pthread_mutex_t my_pthread_lock = PTHREAD_MUTEX_INITIALIZER;

void spin_lock_ool(AO_TS_T *lock)
{
  /* Should repeatedly retry the AO_test_and_set_acquire, perhaps	*/
  /* after trying a plain read.  Should "exponentially" back off	*/
  /* between tries.  For short time periods it should spin, for 	*/
  /* medium ones it should use sched_yield, and for longer ones usleep. */

  /* For now we punt, since this is a contention-free test.		*/
      abort();
}

inline void spin_lock(AO_TS_T *lock)
{
  if (__builtin_expect(AO_test_and_set_acquire(lock) != AO_TS_CLEAR, 0))
    spin_lock_ool(lock);
}

inline void spin_unlock(AO_TS_T *lock)
{
  AO_CLEAR(lock);
}

int main()
{
  unsigned long start_time, end_time;
  int i;
  
  start_time = currentTime();
  for (i = 0; i < 10000000; ++i) {
    spin_lock(&my_spin_lock);
    spin_unlock(&my_spin_lock);
  }
  end_time = currentTime();
  fprintf(stderr, "Custom lock: %lu msecs\n",
	  elapsedTime(end_time - start_time));
  start_time = currentTime();
  for (i = 0; i < 10000000; ++i) {
    pthread_mutex_lock(&my_pthread_lock);
    pthread_mutex_unlock(&my_pthread_lock);
  }
  end_time = currentTime();
  fprintf(stderr, "Pthread lock: %lu msecs\n",
	  elapsedTime(end_time - start_time));
  return 0;
}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-23  1:07   ` [Linux-ia64] " Hans Boehm
@ 2003-05-23  8:30     ` Arjan van de Ven
  0 siblings, 0 replies; 47+ messages in thread
From: Arjan van de Ven @ 2003-05-23  8:30 UTC (permalink / raw)
  To: Hans Boehm; +Cc: Arjan van de Ven, davidm, linux-kernel, linux-ia64

On Thu, May 22, 2003 at 06:07:46PM -0700, Hans Boehm wrote:
> case.
> 
> On a 1GHz Itanium 2 I get
> 
> Custom lock: 180 msecs
> Pthread lock: 1382 msecs
> 
> On a 2GHz Xeon, I get
> 
> Custom lock: 646 msecs
> Pthread lock: 1659 msecs

Is the pthreads one with NPTL?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: web page on O(1) scheduler
  2003-05-22  5:52     ` Mike Galbraith
  2003-05-22 14:47       ` Valdis.Kletnieks
@ 2003-05-25  9:17       ` Mike Galbraith
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-25  9:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: davidm, linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

At 07:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>At 08:38 PM 5/21/2003 -0400, Rik van Riel wrote:
>>On Wed, 21 May 2003, Mike Galbraith wrote:
>> > At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
>> > >Recently, I started to look into some odd performance behaviors of the
>> > >O(1) scheduler.  I decided to document what I found in a web page
>> > >at:
>> > >
>> > >         http://www.hpl.hp.com/research/linux/kernel/o1.php
>> >
>> > The page mentions persistent starvation.  My own explorations of this
>> > issue indicate that the primary source is always selecting the highest
>> > priority queue.
>>
>>It's deeper than that.  The O(1) scheduler doesn't consider
>>actual CPU usage as a factor of CPU priority.
>
>Oh, there's no doubt in my mind that it's _much_ deeper than my little 
>surface scratchings ;-)
>
>It does consider cpu usage though.  Your run history is right there in 
>your accumulated sleep_avg.  Unfortunately (in some ways, fortunate in 
>others.. conflict) that information can be diluted down to nothing 
>instantly by new input from one wakeup.

Or did you mean that it misses a bunch of cpu usage?  I went looking at cpu 
usage, and...

Unless there's something seriously b0rked in the attached (don't _think_ 
so, but...;), trusting an irq that happens 1/ms to tick tasks isn't 100% 
effective, even if you aren't context switching a zillion times 
faster.  The attached diff produced the attached log while running 
test-starve.c.  I get some pretty serious thievery even when doing ho-hum 
stuff.  We can deactivate tasks at _really_ high frequency, and besides, if 
the timer interrupt doesn't fire while the last runnable task is active, he 
can be missed for a while and have accumulated up nearly a full ms.  It 
seems to me that with the right timing, you could end up stealing a bunch 
of cpu (my logs say it's happening a _lot_ under load.  again, diff might 
be b0rked)

         -Mike 

[-- Attachment #2: xx.log --]
[-- Type: application/octet-stream, Size: 7260 bytes --]

COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
ROADRUNNER! 308:19 stole 2 msecs.
COYOTE! 309:22 not ticked for 27 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 4 msecs.
COYOTE! 309:22 not ticked for 5 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 36 msecs.
COYOTE! 309:22 not ticked for 37 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 74 msecs.
COYOTE! 309:22 not ticked for 75 msecs.
COYOTE! 309:22 not ticked for 76 msecs.
COYOTE! 309:22 not ticked for 77 msecs.
COYOTE! 309:22 not ticked for 78 msecs.
COYOTE! 309:22 not ticked for 79 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 65 msecs.
COYOTE! 309:22 not ticked for 66 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 4 msecs.
COYOTE! 309:22 not ticked for 5 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 38 msecs.
COYOTE! 309:22 not ticked for 39 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 47 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
ROADRUNNER! 309:23 stole 30 msecs.
ROADRUNNER! 309:23 stole 1 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 68 msecs.
COYOTE! 309:23 not ticked for 69 msecs.
COYOTE! 309:23 not ticked for 70 msecs.
COYOTE! 309:23 not ticked for 71 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 60 msecs.
COYOTE! 309:23 not ticked for 61 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 308:20 not ticked for 1 msecs.

[-- Attachment #3: xx.diff --]
[-- Type: application/octet-stream, Size: 6575 bytes --]

--- linux-2.5.69.virgin/include/linux/sched.h.org	Fri May 23 07:14:23 2003
+++ linux-2.5.69.virgin/include/linux/sched.h	Sat May 24 08:03:40 2003
@@ -328,7 +328,8 @@
 	prio_array_t *array;
 
 	unsigned long sleep_avg;
-	unsigned long last_run;
+	unsigned long long last_run;
+	unsigned int used_nsecs;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
--- linux-2.5.69.virgin/kernel/sched.c.org	Sun May 25 06:05:42 2003
+++ linux-2.5.69.virgin/kernel/sched.c	Sun May 25 10:04:14 2003
@@ -74,6 +74,10 @@
 #define MAX_SLEEP_AVG		(10*HZ)
 #define STARVATION_LIMIT	(10*HZ)
 #define NODE_THRESHOLD		125
+#define SCHED_NANOSECOND	1
+#define SCHED_MILLISECOND	(1000000 * SCHED_NANOSECOND)
+
+extern unsigned long long monotonic_clock(void);
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -342,7 +346,8 @@
  */
 static inline int activate_task(task_t *p, runqueue_t *rq)
 {
-	long sleep_time = jiffies - p->last_run - 1;
+	unsigned long long now = monotonic_clock();
+	long sleep_time =  (long)(now - p->last_run) / SCHED_MILLISECOND;
 	int requeue_waker = 0;
 
 	if (sleep_time > 0) {
@@ -514,7 +519,9 @@
 				__activate_task(p, rq);
 			else {
 				requeue_waker = activate_task(p, rq);
-				if (p->prio < rq->curr->prio)
+				if (p->prio < rq->curr->prio ||
+						(0 && p->prio == rq->curr->prio &&
+						p->time_slice > rq->curr->time_slice))
 					resched_task(rq->curr);
 			}
 			success = 1;
@@ -1182,11 +1189,12 @@
 	int cpu = smp_processor_id();
 	runqueue_t *rq = this_rq();
 	task_t *p = current;
+	int idle = p == rq->idle, used_msecs;
 
 	if (rcu_pending(cpu))
 		rcu_check_callbacks(cpu, user_ticks);
 
-	if (p == rq->idle) {
+	if (idle) {
 		/* note: this timer irq context must be accounted for as well */
 		if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
 			kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1194,8 +1202,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-		rebalance_tick(rq, 1);
-		return;
+		goto out;
 	}
 	if (TASK_NICE(p) > 0)
 		kstat_cpu(cpu).cpustat.nice += user_ticks;
@@ -1203,12 +1210,12 @@
 		kstat_cpu(cpu).cpustat.user += user_ticks;
 	kstat_cpu(cpu).cpustat.system += sys_ticks;
 
+	spin_lock(&rq->lock);
 	/* Task might have expired already, but not scheduled off yet */
 	if (p->array != rq->active) {
 		set_tsk_need_resched(p);
-		return;
+		goto out_unlock;
 	}
-	spin_lock(&rq->lock);
 	/*
 	 * The task was running during this tick - update the
 	 * time slice counter and the sleep average. Note: we
@@ -1217,8 +1224,15 @@
 	 * it possible for interactive tasks to use up their
 	 * timeslices at their highest priority levels.
 	 */
-	if (p->sleep_avg)
-		p->sleep_avg--;
+	if ((used_msecs = p->used_nsecs / SCHED_MILLISECOND)) {
+		p->used_nsecs -= used_msecs * SCHED_MILLISECOND;
+		if (p->sleep_avg >= used_msecs)
+			p->sleep_avg -= used_msecs;
+		else p->sleep_avg = 0;
+		if (p->policy != SCHED_FIFO && used_msecs > p->time_slice)
+			printk(KERN_DEBUG "ROADRUNNER! %d:%d stole %d msecs.\n",
+				p->pid, p->prio-100, used_msecs - p->time_slice);
+	}
 	if (unlikely(rt_task(p))) {
 		/*
 		 * RR tasks need a special form of timeslice management.
@@ -1233,7 +1247,7 @@
 			dequeue_task(p, rq->active);
 			enqueue_task(p, rq->active);
 		}
-		goto out;
+		goto out_unlock;
 	}
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
@@ -1249,9 +1263,10 @@
 		} else
 			enqueue_task(p, rq->active);
 	}
-out:
+out_unlock:
 	spin_unlock(&rq->lock);
-	rebalance_tick(rq, 0);
+out:
+	rebalance_tick(rq, idle);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1264,8 +1279,9 @@
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
+	struct list_head *head, *curr;
 	int idx;
+	static int last_pid, last_msused;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1286,7 +1302,6 @@
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1331,8 +1346,32 @@
 	}
 
 	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
-	next = list_entry(queue->next, task_t, run_list);
+next_queue:
+	head = array->queue + idx;
+	curr = head->next;
+	next = list_entry(curr, task_t, run_list);
+	curr = curr->next;
+	/*
+	 * If we are about to wrap back to the head of the queue,
+	 * give a lower priority queue a chance to sneak one in.
+	 */
+	if (0 && idx == prev->prio && curr == head && array->nr_active > 1) {
+		int tmp = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+		if (tmp < MAX_PRIO) {
+			idx = tmp;
+			goto next_queue;
+		}
+	}
+	if (prev != rq->idle && prev->array == rq->active &&
+			prev->used_nsecs / SCHED_MILLISECOND >= 1) {
+		if (last_pid != prev->pid ||
+				last_msused != prev->used_nsecs / SCHED_MILLISECOND)
+		printk(KERN_DEBUG "COYOTE! %d:%d not ticked for %d msecs.\n",
+			prev->pid, prev->prio-100,
+			prev->used_nsecs / SCHED_MILLISECOND);
+		last_pid = prev->pid;
+		last_msused = prev->used_nsecs / SCHED_MILLISECOND;
+	}
 
 switch_tasks:
 	prefetch(next);
@@ -1340,8 +1379,11 @@
 	RCU_qsctr(prev->thread_info->cpu)++;
 
 	if (likely(prev != next)) {
+		unsigned long long now = monotonic_clock();
 		rq->nr_switches++;
 		rq->curr = next;
+		prev->used_nsecs += now - prev->last_run;
+		prev->last_run = next->last_run = now;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);
--- linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c.org	Sat May 24 08:55:05 2003
+++ linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c	Sat May 24 09:26:09 2003
@@ -102,12 +102,13 @@
 static unsigned long long monotonic_clock_tsc(void)
 {
 	unsigned long long last_offset, this_offset, base;
+	unsigned long flags;
 	
 	/* atomically read monotonic base & last_offset */
-	read_lock_irq(&monotonic_lock);
+	read_lock_irqsave(&monotonic_lock, flags);
 	last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
 	base = monotonic_base;
-	read_unlock_irq(&monotonic_lock);
+	read_unlock_irqrestore(&monotonic_lock, flags);
 
 	/* Read the Time Stamp Counter */
 	rdtscll(this_offset);
--- linux-2.5.69.virgin/kernel/printk.c.org	Sun May 25 09:10:38 2003
+++ linux-2.5.69.virgin/kernel/printk.c	Sun May 25 09:11:04 2003
@@ -510,8 +510,10 @@
 	console_may_schedule = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
+#if 0
 	if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
 		wake_up_interruptible(&log_wait);
+#endif
 }
 
 /** console_conditional_schedule - yield the CPU if required

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [case closed] Re: web page on O(1) scheduler
  2003-05-22 16:25       ` Mike Galbraith
  2003-05-22 17:58         ` David Mosberger
@ 2003-05-27 15:16         ` Mike Galbraith
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-27 15:16 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, linux-ia64, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 2102 bytes --]

At 06:25 PM 5/22/2003 +0200, Mike Galbraith wrote:
>At 11:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>>At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
>>> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith 
>>> <efault@gmx.de> said:
>>>
>>>   Mike> The page mentions persistent starvation.  My own explorations
>>>   Mike> of this issue indicate that the primary source is always
>>>   Mike> selecting the highest priority queue.
>>>
>>>My working assumption is that the problem is a bug with the dynamic
>>>prioritization.  The task receiving the signals calls sleep() after
>>>handling a signal and hence it's dynamic priority should end up higher
>>>than the priority of the task sending signals (since the sender never
>>>relinquishes the CPU voluntarily).
>>>
>>>However, I haven't actually had time to look at the relevant code, so
>>>I may be missing something.  If you understand the issue better,
>>>please explain to me why this isn't a dynamic priority issue.
>>
>>You're right, it looks like a corner case.
>
>Out of curiosity, is someone hitting that with a real program?
>
>         -Mike
>
>aside:
>if so, I suppose nano-ticks may be needed....

Since I spent a bunch of time fumbling around with this problem, I may as 
well post one last time and put it to bed.  I bet Ingo has a _lot_ better 
way to cure that problem, but...

Yes indeed, pedantic tracking of cpu usage does fix the problem.  It also 
makes for some interesting changes in contest numbers (well, _I_ find them 
interesting;)  If anyone wants to see the numbers, they're attached.  If 
anyone wants to try the diff, it's attached too, have fun... the standard 
"you get to keep the pieces" lkml warranty applies :)  FWIW, it 
works just fine here, and the only time someone now manages to steal ticks 
(one) is (apparently) once in a long while when someone (whose initials 
appear to be IDE:) keeps interrupts disabled for too long.  Nobody has yet 
managed to steal more than one tick.

         The End,

         -Mike

P.S.  ok, so they're micro-ticks... what's a few orders of magnitude :) 

[-- Attachment #2: xx.diff --]
[-- Type: application/octet-stream, Size: 7534 bytes --]

--- linux-2.5.69.virgin/include/linux/sched.h.org	Fri May 23 07:14:23 2003
+++ linux-2.5.69.virgin/include/linux/sched.h	Tue May 27 09:30:51 2003
@@ -328,7 +328,9 @@
 	prio_array_t *array;
 
 	unsigned long sleep_avg;
-	unsigned long last_run;
+	unsigned long long last_run;
+	unsigned int run_nsecs;
+	unsigned int sleep_nsecs;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
--- linux-2.5.69.virgin/kernel/sched.c.org	Sun May 25 06:05:42 2003
+++ linux-2.5.69.virgin/kernel/sched.c	Tue May 27 12:58:02 2003
@@ -74,6 +74,12 @@
 #define MAX_SLEEP_AVG		(10*HZ)
 #define STARVATION_LIMIT	(10*HZ)
 #define NODE_THRESHOLD		125
+#define SCHED_NANOSECOND	1
+#define SCHED_SECOND		(1000000000 * SCHED_NANOSECOND)
+#define SCHED_TICK		(SCHED_SECOND / HZ)
+#define TICKS_PER_SECOND	(SCHED_SECOND / SCHED_TICK)
+
+extern unsigned long long monotonic_clock(void);
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -342,10 +348,23 @@
  */
 static inline int activate_task(task_t *p, runqueue_t *rq)
 {
-	long sleep_time = jiffies - p->last_run - 1;
-	int requeue_waker = 0;
+	unsigned long long now = monotonic_clock();
+	long long sleep =  now - p->last_run + p->sleep_nsecs;
+	int ticks = 0, requeue_waker = 0;
+
+	if (sleep >= SCHED_TICK) {
+		while (sleep >= SCHED_SECOND) {
+			sleep -= SCHED_SECOND;
+			ticks += TICKS_PER_SECOND;
+		}
+		while (sleep >= SCHED_TICK) {
+			sleep -= SCHED_TICK;
+			ticks++;
+		}
+		p->sleep_nsecs = sleep;
+	} else p->sleep_nsecs += sleep;
 
-	if (sleep_time > 0) {
+	if (ticks > 0) {
 		int sleep_avg;
 
 		/*
@@ -356,7 +375,7 @@
 		 * spends sleeping, the higher the average gets - and the
 		 * higher the priority boost gets as well.
 		 */
-		sleep_avg = p->sleep_avg + sleep_time;
+		sleep_avg = p->sleep_avg + ticks;
 
 		/*
 		 * 'Overflow' bonus ticks go to the waker as well, so the
@@ -364,8 +383,10 @@
 		 * boosting tasks that are related to maximum-interactive
 		 * tasks.
 		 */
-		if (sleep_avg > MAX_SLEEP_AVG)
+		if (sleep_avg > MAX_SLEEP_AVG) {
 			sleep_avg = MAX_SLEEP_AVG;
+			p->sleep_nsecs = 0;
+		}
 		if (p->sleep_avg != sleep_avg) {
 			p->sleep_avg = sleep_avg;
 			p->prio = effective_prio(p);
@@ -571,6 +592,8 @@
 	current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100;
 	p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
 	p->prio = effective_prio(p);
+	p->run_nsecs = 0;
+	p->sleep_nsecs = 0;
 	set_task_cpu(p, smp_processor_id());
 
 	if (unlikely(!current->array))
@@ -1170,6 +1193,49 @@
 		(jiffies - (rq)->expired_timestamp >= \
 			STARVATION_LIMIT * ((rq)->nr_running) + 1)))
 
+inline void __scheduler_tick(runqueue_t *rq, task_t *p)
+{
+	unsigned long long now = monotonic_clock();
+	prio_array_t *array = rq->active;
+	int ticks;
+
+	p->run_nsecs += now - p->last_run;
+	/* Task might have expired already, but not scheduled off yet */
+	if (p->array != array) {
+		set_tsk_need_resched(p);
+		goto abort;
+	}
+	if (p->run_nsecs < SCHED_TICK || p->policy == SCHED_FIFO )
+		goto abort;
+
+	for (ticks = 0; p->run_nsecs >= SCHED_TICK; ticks++)
+		p->run_nsecs -= SCHED_TICK;
+	if (ticks > p->time_slice)
+		show_task(p);
+	if (p->sleep_avg > ticks)
+		p->sleep_avg -= ticks;
+	else
+		p->sleep_avg = 0;
+	p->time_slice -= ticks;
+
+	if (p->time_slice <= 0) {
+		dequeue_task(p, p->array);
+		p->prio = effective_prio(p);
+		p->time_slice = task_timeslice(p);
+		p->first_time_slice = 0;
+		set_tsk_need_resched(p);
+		if ((EXPIRED_STARVING(rq) && !rt_task(p)) ||
+				!TASK_INTERACTIVE(p)) {
+			array = rq->expired;
+			if (!rq->expired_timestamp)
+				rq->expired_timestamp = jiffies;
+		}
+		enqueue_task(p, array);
+	}
+abort:
+	p->last_run = monotonic_clock();
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1182,11 +1248,12 @@
 	int cpu = smp_processor_id();
 	runqueue_t *rq = this_rq();
 	task_t *p = current;
+	int idle = p == rq->idle;
 
 	if (rcu_pending(cpu))
 		rcu_check_callbacks(cpu, user_ticks);
 
-	if (p == rq->idle) {
+	if (idle) {
 		/* note: this timer irq context must be accounted for as well */
 		if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
 			kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1194,8 +1261,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-		rebalance_tick(rq, 1);
-		return;
+		goto out;
 	}
 	if (TASK_NICE(p) > 0)
 		kstat_cpu(cpu).cpustat.nice += user_ticks;
@@ -1203,11 +1269,6 @@
 		kstat_cpu(cpu).cpustat.user += user_ticks;
 	kstat_cpu(cpu).cpustat.system += sys_ticks;
 
-	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != rq->active) {
-		set_tsk_need_resched(p);
-		return;
-	}
 	spin_lock(&rq->lock);
 	/*
 	 * The task was running during this tick - update the
@@ -1217,41 +1278,10 @@
 	 * it possible for interactive tasks to use up their
 	 * timeslices at their highest priority levels.
 	 */
-	if (p->sleep_avg)
-		p->sleep_avg--;
-	if (unlikely(rt_task(p))) {
-		/*
-		 * RR tasks need a special form of timeslice management.
-		 * FIFO tasks have no timeslices.
-		 */
-		if ((p->policy == SCHED_RR) && !--p->time_slice) {
-			p->time_slice = task_timeslice(p);
-			p->first_time_slice = 0;
-			set_tsk_need_resched(p);
-
-			/* put it at the end of the queue: */
-			dequeue_task(p, rq->active);
-			enqueue_task(p, rq->active);
-		}
-		goto out;
-	}
-	if (!--p->time_slice) {
-		dequeue_task(p, rq->active);
-		set_tsk_need_resched(p);
-		p->prio = effective_prio(p);
-		p->time_slice = task_timeslice(p);
-		p->first_time_slice = 0;
-
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			if (!rq->expired_timestamp)
-				rq->expired_timestamp = jiffies;
-			enqueue_task(p, rq->expired);
-		} else
-			enqueue_task(p, rq->active);
-	}
-out:
+	 __scheduler_tick(rq, p);
 	spin_unlock(&rq->lock);
-	rebalance_tick(rq, 0);
+out:
+	rebalance_tick(rq, idle);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1286,8 +1316,8 @@
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
+	__scheduler_tick(rq, prev);
 
 	/*
 	 * if entering off of a kernel preemption go straight
@@ -1342,6 +1372,7 @@
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
+		next->last_run = prev->last_run;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);
--- linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c.org	Sat May 24 08:55:05 2003
+++ linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c	Sat May 24 09:26:09 2003
@@ -102,12 +102,13 @@
 static unsigned long long monotonic_clock_tsc(void)
 {
 	unsigned long long last_offset, this_offset, base;
+	unsigned long flags;
 	
 	/* atomically read monotonic base & last_offset */
-	read_lock_irq(&monotonic_lock);
+	read_lock_irqsave(&monotonic_lock, flags);
 	last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
 	base = monotonic_base;
-	read_unlock_irq(&monotonic_lock);
+	read_unlock_irqrestore(&monotonic_lock, flags);
 
 	/* Read the Time Stamp Counter */
 	rdtscll(this_offset);
--- linux-2.5.69.virgin/kernel/printk.c.org	Sun May 25 09:10:38 2003
+++ linux-2.5.69.virgin/kernel/printk.c	Sun May 25 09:11:04 2003
@@ -510,8 +510,10 @@
 	console_may_schedule = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
+#if 0
 	if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
 		wake_up_interruptible(&log_wait);
+#endif
 }
 
 /** console_conditional_schedule - yield the CPU if required

[-- Attachment #3: contest.log --]
[-- Type: application/octet-stream, Size: 1557 bytes --]

no_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	153	94.8	0.0	0.0	1.00
2.5.69X      1	154	94.2	0.0	0.0	1.00
cacherun:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	146	98.6	0.0	0.0	0.95
2.5.69X      1	147	98.6	0.0	0.0	0.95
process_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	331	43.8	90.0	55.3	2.16
2.5.69X      1	344	41.9	94.0	57.0	2.23
ctar_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	190	77.9	0.0	0.0	1.24
2.5.69X      1	187	79.1	0.0	0.0	1.21
xtar_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	196	75.0	0.0	3.1	1.28
2.5.69X      1	198	74.7	0.0	3.5	1.29
io_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	437	34.6	69.1	15.1	2.86
2.5.69X      1	349	43.3	53.3	14.9	2.27
io_other:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	387	38.8	69.3	17.3	2.53
2.5.69X      1	436	34.9	63.4	14.0	2.83
read_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	221	67.0	9.4	3.6	1.44
2.5.69X      1	208	71.2	9.0	3.4	1.35
list_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	203	71.4	99.0	20.2	1.33
2.5.69X      1	204	71.1	100.0	20.1	1.32
mem_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	256	57.8	34.0	1.2	1.67
2.5.69X      1	244	60.2	36.0	1.2	1.58
dbench_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	517	28.8	5.0	35.6	3.38
2.5.69X      1	444	33.3	3.0	28.8	2.88
ab_load:
Kernel  [runs]	Time	CPU%	Loads	LCPU%	Ratio
2.5.69       1	300	48.3	46.0	21.7	1.96
2.5.69X      1	293	49.5	44.0	21.5	1.90

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-21 13:51       ` Olivier Galibert
@ 2003-05-28 22:12         ` Bill Davidsen
      [not found]         ` <Pine.LNX.3.96.1030528180909.21414B-100000@gatekeeper.tmr.com>
  1 sibling, 0 replies; 47+ messages in thread
From: Bill Davidsen @ 2003-05-28 22:12 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: linux-kernel

On Wed, 21 May 2003, Olivier Galibert wrote:

> On Wed, May 21, 2003 at 03:12:12PM +0200, Arjan van de Ven wrote:
> > if you had spent the time you spent on this colorful graphic on reading
> > SUS or Posix about what sched_yield() means, you would actually have
> > learned something. sched_yield() means "I'm the least important thing in
> > the system".
> 
> Susv2:
> 
> DESCRIPTION
> 
>  The sched_yield() function forces the running thread to relinquish
>  the processor until it again becomes the head of its thread list. It
>  takes no arguments.
> 
> 
> Aka "I skip the rest of my turn, try the others again once", not "I'm
> unimportant" nor "please rerun me immediatly".
> 
> What is it with you people wanting to make sched_yield() unusable for
> anything that makes sense?

Have to agree; I have a context switching benchmark which uses a spinlock
in shared memory for do-it-yourself gating, and it wants sched_yield() to
be useful on uni. The SuS is pretty clear about this, and the useful
behaviour is also the required behaviour, so why are people resisting
doing it that way?

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
      [not found]         ` <Pine.LNX.3.96.1030528180909.21414B-100000@gatekeeper.tmr.com>
@ 2003-05-29  5:59           ` Mike Galbraith
  2003-06-02  8:05             ` Ingo Molnar
                               ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-05-29  5:59 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Olivier Galibert, linux-kernel

At 06:12 PM 5/28/2003 -0400, Bill Davidsen wrote:
>On Wed, 21 May 2003, Olivier Galibert wrote:
>
> > On Wed, May 21, 2003 at 03:12:12PM +0200, Arjan van de Ven wrote:
> > > if you had spent the time you spent on this colorful graphic on reading
> > > SUS or Posix about what sched_yield() means, you would actually have
> > > learned something. sched_yield() means "I'm the least important thing in
> > > the system".
> >
> > Susv2:
> >
> > DESCRIPTION
> >
> >  The sched_yield() function forces the running thread to relinquish
> >  the processor until it again becomes the head of its thread list. It
> >  takes no arguments.
> >
> >
> > Aka "I skip the rest of my turn, try the others again once", not "I'm
> > unimportant" nor "please rerun me immediatly".
> >
> > What is it with you people wanting to make sched_yield() unusable for
> > anything that makes sense?
>
>Have to agree; I have a context switching benchmark which uses a spinlock
>in shared memory for do-it-yourself gating, and it wants sched_yield() to
>be useful on uni. The SuS is pretty clear about this, and the useful
>behaviour is also the required behaviour, so why are people resisting
>doing it that way?

It can be argued that the current implementation does exactly what SuS 
describes.  The thread's list can be considered to be the sum of the active 
queue and the expired queue.  Merely rotating the active queue and 
scheduling away excludes the yielding thread's equally important expired 
peers from participation in the game.  Furthermore, if you are the only 
thread left in the active array, (you are at the head obviously), and all 
of your peers are currently expired, yielding would become meaningless... 
the thing you're trying to yield (the remainder of your slice) sticks to you.

Having said that, it can also be argued that we are violating the SuS 
description in that yielding the remainder of your slice yields to all 
threads in all lists, not only its list.  What makes more sense to me than 
the current implementation is to rotate the entire peer queue when a thread 
expires... ie pull in the head of the expired queue into the tail of the 
active queue at the same time so you always have a player if one 
exists.  (you'd have to select queues based on used cpu time to make that 
work right though)

That would still suck rocks for mutex usage... as it must with any 
implementation of sched_yield() in the presence of peer threads who are not 
playing with the mutex.  Actually, using sched_yield() makes no sense what 
so ever to me, other than what Arjan said.  It refers to yielding your 
turn, but from userland "your turn" has no determinate meaning.  There is 
exactly one case where it has a useable value, and that is  when you're the 
_only_ runnable thread... at which time it means precisely zero. (blech)

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-29  5:59           ` Mike Galbraith
@ 2003-06-02  8:05             ` Ingo Molnar
  2003-06-04  4:07               ` Bill Davidsen
      [not found]             ` <Pine.LNX.4.44.0306020949520.3375-100000@localhost.localdomain>
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2003-06-02  8:05 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Bill Davidsen, Olivier Galibert, linux-kernel


On Thu, 29 May 2003, Mike Galbraith wrote:

> [...] What makes more sense to me than the current implementation is to
> rotate the entire peer queue when a thread expires... ie pull in the
> head of the expired queue into the tail of the active queue at the same
> time so you always have a player if one exists.  (you'd have to select
> queues based on used cpu time to make that work right though)

we have tried all sorts of more complex yield() schemes before - they
sucked for one or another workload. So in 2.5 i took the following path:  
make yield() _simple_ and effective, ie. expire the yielding task (push it
down the runqueue roughly halfway, statistically) and don't try to be too
smart doing it. All the real yield() users (mostly in the kernel) want it
to be an efficient way to avoid livelocks. The old 2.4 yield
implementation had the problem of enabling a ping-pong between two
higher-prio yielding processes, until they use up their full timeslice.

(we could do one more thing that still keeps the thing simple: we could
re-set the yielding task's timeslice instead of the current 'keep the
previous timeslice' logic.)
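
A minimal sketch of that scheme, using the runqueue names visible in the
patches earlier in this thread; locking and preemption details are
elided, so treat this as an illustration rather than a verbatim copy of
kernel/sched.c:

	asmlinkage long sys_sched_yield(void)
	{
		runqueue_t *rq = this_rq_lock();
		task_t *p = current;

		if (likely(!rt_task(p))) {
			/* Expire the yielder: statistically this pushes
			 * it about halfway down the runqueue. */
			dequeue_task(p, rq->active);
			enqueue_task(p, rq->expired);
		} else {
			/* RT tasks only rotate within their own queue. */
			list_del(&p->run_list);
			list_add_tail(&p->run_list, rq->active->queue + p->prio);
		}
		spin_unlock_irq(&rq->lock);
		schedule();
		return 0;
	}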

OpenOffice used to use yield() as a legacy of its 'green threads'
implementation - where userspace threads needed to do periodic yield()s to
get any sort of multitasking behavior.

	Ingo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
      [not found]             ` <Pine.LNX.4.44.0306020949520.3375-100000@localhost.localdomain>
@ 2003-06-02 13:51               ` Mike Galbraith
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-06-02 13:51 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Bill Davidsen, Olivier Galibert, linux-kernel

At 10:05 AM 6/2/2003 +0200, Ingo Molnar wrote:

>On Thu, 29 May 2003, Mike Galbraith wrote:
>
> > [...] What makes more sense to me than the current implementation is to
> > rotate the entire peer queue when a thread expires... ie pull in the
> > head of the expired queue into the tail of the active queue at the same
> > time so you always have a player if one exists.  (you'd have to select
> > queues based on used cpu time to make that work right though)
>
>we have tried all sorts of more complex yield() schemes before - they
>sucked for one or another workload. So in 2.5 i took the following path:
>make yield() _simple_ and effective, ie. expire the yielding task (push it
>down the runqueue roughly halfway, statistically) and dont try to be too
>smart doing it. All the real yield() users (mostly in the kernel) want it
>to be an efficient way to avoid livelocks. The old 2.4 yield
>implementation had the problem of enabling a ping-pong between two
>higher-prio yielding processes, until they use up their full timeslice.

(yeah, i looked at that in ktracer logs.  cpu hot-potato sucks;)

>(we could do one more thing that still keeps the thing simple: we could
>re-set the yielding task's timeslice instead of the current 'keep the
>previous timeslice' logic.)

(more consistent) 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* sched.c gives ICE [Was: Re: web page on O(1) scheduler]
  2003-05-22  9:52     ` Mike Galbraith
  2003-05-22 16:25       ` Mike Galbraith
@ 2003-06-03 20:59       ` J.A. Magallon
  2003-06-03 22:29         ` Mike Galbraith
  2003-06-22 22:03         ` J.A. Magallon
  1 sibling, 2 replies; 47+ messages in thread
From: J.A. Magallon @ 2003-06-03 20:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel


On 05.22, Mike Galbraith wrote:
> At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> > >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de> 
> > said:
> >
> >   Mike> The page mentions persistent starvation.  My own explorations
> >   Mike> of this issue indicate that the primary source is always
> >   Mike> selecting the highest priority queue.
> >
> >My working assumption is that the problem is a bug with the dynamic
> >prioritization.  The task receiving the signals calls sleep() after
> >handling a signal and hence it's dynamic priority should end up higher
> >than the priority of the task sending signals (since the sender never
> >relinquishes the CPU voluntarily).
> >
> >However, I haven't actually had time to look at the relevant code, so
> >I may be missing something.  If you understand the issue better,
> >please explain to me why this isn't a dynamic priority issue.
> 
> You're right, it looks like a corner case.  It works fine here with the 
> attached diff.
> 

Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
and I just get an ICE:

End of search list.
sched.c: In function `do_schedule':
sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:https://qa.mandrakesoft.com/> for instructions.

sched.c::1003 is the closing brace for do_schedule().

Any idea ?

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc6-jam1 (gcc 3.2.3 (Mandrake Linux 9.2 3.2.3-1mdk))

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]
  2003-06-03 20:59       ` sched.c gives ICE [Was: Re: web page on O(1) scheduler] J.A. Magallon
@ 2003-06-03 22:29         ` Mike Galbraith
  2003-06-22 22:03         ` J.A. Magallon
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2003-06-03 22:29 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel

At 10:59 PM 6/3/2003 +0200, J.A. Magallon wrote:

>On 05.22, Mike Galbraith wrote:
> > At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> > > >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <efault@gmx.de>
> > > said:
> > >
> > >   Mike> The page mentions persistent starvation.  My own explorations
> > >   Mike> of this issue indicate that the primary source is always
> > >   Mike> selecting the highest priority queue.
> > >
> > >My working assumption is that the problem is a bug with the dynamic
> > >prioritization.  The task receiving the signals calls sleep() after
> > >handling a signal and hence it's dynamic priority should end up higher
> > >than the priority of the task sending signals (since the sender never
> > >relinquishes the CPU voluntarily).
> > >
> > >However, I haven't actually had time to look at the relevant code, so
> > >I may be missing something.  If you understand the issue better,
> > >please explain to me why this isn't a dynamic priority issue.
> >
> > You're right, it looks like a corner case.  It works fine here with the
> > attached diff.
> >
>
>Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
>and I just get an ICE:

No, I use 2.95.3.

>End of search list.
>sched.c: In function `do_schedule':
>sched.c:1003: internal compiler error: in merge_assigned_reloads, at 
>reload1.c:6134
>Please submit a full bug report,
>with preprocessed source if appropriate.
>See <URL:https://qa.mandrakesoft.com/> for instructions.
>
>sched.c::1003 is the closing brace for do_schedule().
>
>Any idea ?

Nope... gcc can have the blame if it wants it :)

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-29  5:59           ` Mike Galbraith
  2003-06-02  8:05             ` Ingo Molnar
       [not found]             ` <Pine.LNX.4.44.0306020949520.3375-100000@localhost.localdom ain>
@ 2003-06-04  3:52             ` Bill Davidsen
  2003-06-04  4:55               ` David Schwartz
      [not found]             ` <Pine.LNX.3.96.1030603234616.16495B-100000@gatekeeper.tmr.com>
  3 siblings, 1 reply; 47+ messages in thread
From: Bill Davidsen @ 2003-06-04  3:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Olivier Galibert, linux-kernel

On Thu, 29 May 2003, Mike Galbraith wrote:

> That would still suck rocks for mutex usage... as it must with any 
> implementation of sched_yield() in the presence of peer threads who are not 
> playing with the mutex.  Actually, using sched_yield() makes no sense 
> whatsoever to me, other than what Arjan said.  It refers to yielding your 
> turn, but from userland "your turn" has no determinate meaning.  There is 
> exactly one case where it has a usable value, and that is when you're the 
> _only_ runnable thread... at which time it means precisely zero. (blech)

No, it works usefully without threads at all, with processes sharing a
spinlock in shared memory. If the lock is closed, the process does a
sched_yield() to allow whoever has the lock to run. Yes to all comments
WRT order of running; if you care, you don't do this, clearly. But in the
case where a process forks into a feeder and a consumer, it's faster than
semaphores, signals, etc.
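
A userspace sketch of that gating pattern, assuming the lock word lives
in a shared mapping; gate_lock()/gate_unlock() are hypothetical names,
and GCC's __sync builtins stand in for whatever atomic op the benchmark
really uses:

	#include <sched.h>

	static void gate_lock(volatile int *lock)
	{
		/* Lock closed: yield so the holder can run and open it. */
		while (__sync_lock_test_and_set(lock, 1))
			sched_yield();
	}

	static void gate_unlock(volatile int *lock)
	{
		__sync_lock_release(lock);
	}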

All that's needed is to put the yielding process on the end of the
appropriate run queue and reschedule. Doing anything else results in bad
performance and no gain to anything else.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-06-02  8:05             ` Ingo Molnar
@ 2003-06-04  4:07               ` Bill Davidsen
  0 siblings, 0 replies; 47+ messages in thread
From: Bill Davidsen @ 2003-06-04  4:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mike Galbraith, Olivier Galibert, linux-kernel

On Mon, 2 Jun 2003, Ingo Molnar wrote:

> 
> On Thu, 29 May 2003, Mike Galbraith wrote:
> 
> > [...] What makes more sense to me than the current implementation is to
> > rotate the entire peer queue when a thread expires... ie pull in the
> > head of the expired queue into the tail of the active queue at the same
> > time so you always have a player if one exists.  (you'd have to select
> > queues based on used cpu time to make that work right though)
> 
> we have tried all sorts of more complex yield() schemes before - they
> sucked for one or another workload. So in 2.5 i took the following path:  
> make yield() _simple_ and effective, ie. expire the yielding task (push it
> down the runqueue roughly halfway, statistically) and don't try to be too
> smart doing it. All the real yield() users (mostly in the kernel) want it
> to be an efficient way to avoid livelocks. The old 2.4 yield
> implementation had the problem of enabling a ping-pong between two
> higher-prio yielding processes, until they use up their full timeslice.
> 
> (we could do one more thing that still keeps the thing simple: we could
> re-set the yielding task's timeslice instead of the current 'keep the
> previous timeslice' logic.)
> 
> OpenOffice used to use yield() as a legacy of its 'green threads'
> implementation - where userspace threads needed to do periodic yield()s to
> get any sort of multitasking behavior.

The way I use it is in cases when I fork processes which communicate
through a state machine (I'm simplifying) gated by a spinlock, both in
shared memory. On SMP machines the lock time is trivial and the processes
run really well. On uniprocessor you can get hangs for a full timeslice
unless a sched_yield() is used when the lock is not available.

Since the processes have no shared data other than the small bit of shared
memory, it seems that threads give no benefit over fork, and on some OSes
do much worse. Use of a mutex seems slightly slower in Linux, and quite a
bit slower elsewhere.

The method you describe seems far more useful than some other discussion
which seems to advocate making the yielding process the lowest priority thing
in the system. That really doesn't seem to fit SuS, although I think they
had a single queue in mind. Perhaps not; in any case you seem to provide a
workable solution.

I'm not sure if there would be any significant difference between pushing down
halfway, all the way in the same queue, or just down one. The further down
it goes the better chance there is that the blocking process will run, but
the slower the access to the shared resource and the more chance that
another process will grab the resource. Perhaps down one the first time
and then to the end of the queue if there has not been a timeslice end or
i/o block before another yield. That would prevent looping between any
number of competitors delaying dispatch to the holder of the lock.

Feel free to disagree, that just came to me as I typed.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
  2003-06-04  3:52             ` Bill Davidsen
@ 2003-06-04  4:55               ` David Schwartz
  0 siblings, 0 replies; 47+ messages in thread
From: David Schwartz @ 2003-06-04  4:55 UTC (permalink / raw)
  To: Bill Davidsen, Mike Galbraith; +Cc: linux-kernel


> No, it works usefully without threads at all, with processes sharing a
> spinlock in shared memory. If the lock is closed, the process does a
> sched_yield() to allow whoever has the lock to run. Yes to all comments
> WRT order of running, if you care you don't do this, clearly. But in the
> case where a process forks to a feeder and consumer it's faster than
> semaphores, signal, etc.
>
> All that's needed is to put the yielding process on the end of the
> appropriate run queue and reschedule. Doing anything else results in bad
> performance and no gain to anything else.
>
> --
> bill davidsen <davidsen@tmr.com>
>   CTO, TMR Associates, Inc
> Doing interesting things with little computers since 1979.

	I've found sched_yield to be most useful in a case where you can't afford
to wait for a lock, but your processing will be much more efficient if you
have that lock. So you do this:

bool TryYieldLock(lock *f)
{
 if(TryLock(f)) return true;
 sched_yield();
 return TryLock(f);
}

	And you use it like this:

if(TryYieldLock(f))
{ /* great, we got the lock, just do it */
 DoItNow(x);
 Unlock(f);
}
else
{ /* oh well, we didn't get the lock */
 schedule_thread(DoItLater, f);
}

	In other words, you try a bit harder than the usual 'trylock' function but
not as hard as waiting for the lock. If you don't get the lock, you have to
process it a more expensive way (perhaps scheduling another thread to come
around later and wait for the lock).

	Consider a 'free' function for a custom allocator. If there's contention,
you might be willing to do a sched_yield to free it immediately. But you
really don't want to block while some other thread tries to do some
expensive coalescing, in that case you'd rather schedule another thread to
come around later and free it.

	This is much less useful on an MP machine though. It's unlikely that
'sched_yield()' will give the other thread enough time to release the lock
since odds are it's running on the other CPU.

	DS



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
      [not found]             ` <Pine.LNX.3.96.1030603234616.16495B-100000@gatekeeper.tmr.com>
@ 2003-06-04  7:13               ` Mike Galbraith
  2003-06-04 15:30                 ` Jan Harkes
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2003-06-04  7:13 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Olivier Galibert, linux-kernel

At 11:52 PM 6/3/2003 -0400, Bill Davidsen wrote:
>On Thu, 29 May 2003, Mike Galbraith wrote:
>
> > That would still suck rocks for mutex usage... as it must with any
> > implementation of sched_yield() in the presence of peer threads who are 
> not
> > playing with the mutex.  Actually, using sched_yield() makes no sense
> > whatsoever to me, other than what Arjan said.  It refers to yielding your
> > turn, but from userland "your turn" has no determinate meaning.  There is
> > exactly one case where it has a usable value, and that is when you're 
> the
> > _only_ runnable thread... at which time it means precisely zero. (blech)
>
>No, it works usefully without threads at all, with processes sharing a
>spinlock in shared memory. If the lock is closed, the process does a
>sched_yield() to allow whoever has the lock to run. Yes to all comments
>WRT order of running, if you care you don't do this, clearly. But in the
>case where a process forks to a feeder and consumer it's faster than
>semaphores, signal, etc.

It can only be faster when there is no other source of competition for the 
cpu.  If there is no other competition, the current implementation won't 
hurt, because you'll switch arrays at very high speed... the queue rotation 
will be that which I described earlier.

There's only one real difference between going to sleep and waiting for a 
signal and your description of a proper yield.  Both place you at the end 
of your queue.  One immediately, the other upon wakeup.  How much "other 
stuff" is in the system determines task throughput in both cases.

>All that's needed is to put the yielding process on the end of the
>appropriate run queue and reschedule. Doing anything else results in bad
>performance and no gain to anything else.

And if either party has expired its timeslice, who is the active task 
yielding to?  Why do you insist that those who have expired their slice 
should not be considered eligible players?  Now, let's throw a real 
monkey-wrench into the works...

What happens with your version of sched_yield() if one of your producer 
consumer pair is SCHED_RR, or worse - SCHED_FIFO [no timeslice, "turn" 
means until it blocks whether that be voluntary or not], and sched_yield() 
is the only way it can get rid of the cpu (ie it's a high priority pure cpu 
burner)?  If all you do is rotate the active queue and reschedule, you have 
just guaranteed that sched_yield() will have zero meaning.  The same exact 
thing (but less exaggerated) will happen if producer and consumer are not 
at the same dynamic priority.  You can call schedule() all you want if 
you're the highest priority runnable task... you get to keep the cpu 
whether you really want it or not.  I still say that sched_yield() makes 
zero sense for this usage.  There is only one guaranteed way to give up the 
cpu, and that is to sleep.

Using sched_yield() as a mutex is exactly the same as using a party line 
for private conversation.

         -Mike 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Linux-ia64] Re: web page on O(1) scheduler
  2003-06-04  7:13               ` Mike Galbraith
@ 2003-06-04 15:30                 ` Jan Harkes
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Harkes @ 2003-06-04 15:30 UTC (permalink / raw)
  To: linux-kernel

On Wed, Jun 04, 2003 at 09:13:43AM +0200, Mike Galbraith wrote:
> Using sched_yield() as a mutex is exactly the same as using a party line 
> for private conversation.

I actually used sched_yield for the 'inverse' mutex case. One thread was
pretty much exclusively holding a mutex, but other threads sometimes try
to get through the same critical section and block on mutex_lock. 

The first thread then occasionally does,

    mutex_unlock()
    sched_yield()
    mutex_lock()

Without the sched_yield, other threads never get a chance to wake up and
grab the mutex before the mutex hogging thread is back in the critical
section.
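
A self-contained sketch of that pattern (hog() is a hypothetical name;
the unlock/yield/lock sequence periodically opens a window for waiters):

	#include <pthread.h>
	#include <sched.h>

	static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

	static void *hog(void *arg)
	{
		pthread_mutex_lock(&m);
		for (;;) {
			/* ... long-running work under the mutex ... */
			pthread_mutex_unlock(&m);
			sched_yield();	/* let a blocked peer grab it */
			pthread_mutex_lock(&m);
		}
	}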

This is of course the simple explanation; this was actually part of a
wrapper around pthreads that needed to provide non-concurrent
co-routines that in specific situations could run concurrently for a bit
to do DNS lookups and such.

Not sure how different semantics affect this because I went with a
different solution a while ago.

Jan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]
  2003-06-03 20:59       ` sched.c gives ICE [Was: Re: web page on O(1) scheduler] J.A. Magallon
  2003-06-03 22:29         ` Mike Galbraith
@ 2003-06-22 22:03         ` J.A. Magallon
  2003-06-22 22:10           ` J.A. Magallon
  1 sibling, 1 reply; 47+ messages in thread
From: J.A. Magallon @ 2003-06-22 22:03 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: Mike Galbraith, linux-kernel


On 06.03, J.A. Magallon wrote:
> 
[...]
> 
> Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
> and I just get an ICE:
> 
> End of search list.
> sched.c: In function `do_schedule':
> sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <URL:https://qa.mandrakesoft.com/> for instructions.
> 
> sched.c::1003 is the closing brace for do_schedule().
> 
> Any idea ?
> 

FYI, it was a gcc bug, fixed in the latest gcc release in Mandrake:

 * Tue Jun 17 2003 Juan Quintela <quintela@trasno.org> 3.3-2mdk

- Add reload1 patch (fix bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10890).

Now it builds fine with the patch attached.

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-jam1 (gcc 3.3 (Mandrake Linux 9.2 3.3-1mdk))

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]
  2003-06-22 22:03         ` J.A. Magallon
@ 2003-06-22 22:10           ` J.A. Magallon
  0 siblings, 0 replies; 47+ messages in thread
From: J.A. Magallon @ 2003-06-22 22:10 UTC (permalink / raw)
  To: Lista Linux-Kernel


On 06.23, J.A. Magallon wrote:
> 
> On 06.03, J.A. Magallon wrote:
> > 
> [...]
> > 
> > Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
> > and I just get an ICE:
> > 
> > End of search list.
> > sched.c: In function `do_schedule':
> > sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See <URL:https://qa.mandrakesoft.com/> for instructions.
> > 
> > sched.c::1003 is the closing brace for do_schedule().
> > 
> > Any idea ?
> > 
> 
> FYI, it was a gcc bug, fixed in the latest gcc release in Mandrake:
> 
>  * Tue Jun 17 2003 Juan Quintela <quintela@trasno.org> 3.3-2mdk
> 
> - Add reload1 patch (fix bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10890).
> 
> Now it builds fine with the patch attached.
> 

Ups,                                ^^^^^^^ == applied

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-jam1 (gcc 3.3 (Mandrake Linux 9.2 3.3-2mdk))

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-24  5:38 ` Davide Libenzi
@ 2003-05-24 14:43   ` Davide Libenzi
  0 siblings, 0 replies; 47+ messages in thread
From: Davide Libenzi @ 2003-05-24 14:43 UTC (permalink / raw)
  To: Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

On Fri, 23 May 2003, Davide Libenzi wrote:

> You need a write memory barrier even on the unlock. Consider this :
>
> 	spinlock = 1;
> 	...
> 	protected_resource = NEWVAL;
> 	spinlock = 0;
>
> (where 'spinlock = 0/1' is a stripped-down version of the lock
> operation, but the concept holds). If a CPU reorders those writes,
> another CPU might see the lock drop before the protected resource
> assignment. And this is usually bad for obvious reasons.

David made me notice a brain misfire here. You need protection even
for loads crossing the unlock, for the same obvious reasons :) Yes, a
release barrier will be sufficient for ia64, while you'll need an mfence
on P4.
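
In later compiler-builtin terms, a sketch of an unlock with exactly
those semantics (NEWVAL is the placeholder from the example above, made
a constant here only so the fragment compiles):

	enum { NEWVAL = 42 };		/* placeholder value */

	int spinlock;			/* 0 = free, 1 = held */
	int protected_resource;

	void unlock(void)
	{
		protected_resource = NEWVAL;
		/* Release store: neither the store above nor earlier
		 * loads may be reordered past the lock drop. */
		__atomic_store_n(&spinlock, 0, __ATOMIC_RELEASE);
	}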



- Davide


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-24  0:53 Boehm, Hans
@ 2003-05-24  5:38 ` Davide Libenzi
  2003-05-24 14:43   ` Davide Libenzi
  0 siblings, 1 reply; 47+ messages in thread
From: Davide Libenzi @ 2003-05-24  5:38 UTC (permalink / raw)
  To: Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

On Fri, 23 May 2003, Boehm, Hans wrote:

> Thanks for the pointer.  In the statically linked case, I get 200/568/345
> for custom/pthread_mutex/pthread_spin.
>
> I agree that this is not a fair comparison.  That was my point.  An
> implementation with custom yield/sleep code can do things that you can't
> do with blocking pthreads primitives at the same performance.  (Of course
> pthread_mutex_lock will win in other cases.)
>
> Please forget about the abort() in the contention case.  I put that there for
> brevity since it is not exercised by the test.  The intent was to time the
> noncontention performance of a custom lock that first spins, then yields,
> and then sleeps, as was stated in the comment.
>
> You are forgetting two issues in your analysis of what pthreads is/should be doing
> relative to the spin-lock-like code:
>
> 1) The unlock code is different.  If you potentially do a waitforunlocks()
> in the locking code, you need to at least check whether the corresponding
> notification is necessary when you unlock().  For NPTL that requires another
> atomic operation, and hence another dozen to a couple of hundred cycles,
> depending on the processor.  You need to look at both the lock and unlock
> code.

That code was completely independent of what pthread might do. I didn't
look at the code, but I think the new pthread uses futexes for mutexes.
The code was only meant to show that a mutex lock does more than a
spinlock, and this "more" is amplified by your tight loop.



> 2) (I hadn't mentioned this before.) The (standard interpretation of)
> the memory barrier semantics of the pthreads primitives is too strong.
> Arguably they need to be full memory barriers in both directions.
> The pthread_spin_lock code inserts an extra full
> memory barrier on IA64 as a result, instead of just
> using the acquire barrier associated with the cmpxchg.acq instruction.
> (I think the spin unlock code doesn't do this.  One could argue that that's a bug,
> though I would argue that the bug is really in the pthreads spec.)

You need a write memory barrier even on the unlock. Consider this:

	spinlock = 1;
	...
	protected_resource = NEWVAL;
	spinlock = 0;

(where 'spinlock = 0/1' is a stripped-down version of the lock operation,
but the concept holds). If a CPU reorders those writes, another CPU might
see the lock drop before the protected resource assignment. And this is
usually bad for obvious reasons.



- Davide


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
@ 2003-05-24  0:53 Boehm, Hans
  2003-05-24  5:38 ` Davide Libenzi
  0 siblings, 1 reply; 47+ messages in thread
From: Boehm, Hans @ 2003-05-24  0:53 UTC (permalink / raw)
  To: 'Davide Libenzi', Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

Thanks for the pointer.  In the statically linked case, I get 200/568/345
for custom/pthread_mutex/pthread_spin.

I agree that this is not a fair comparison.  That was my point.  An
implementation with custom yield/sleep code can do things that you can't
do with blocking pthreads primitives at the same performance.  (Of course
pthread_mutex_lock will win in other cases.)

Please forget about the abort() in the contention case.  I put that there for
brevity since it is not exercised by the test.  The intent was to time the
noncontention performance of a custom lock that first spins, then yields,
and then sleeps, as was stated in the comment.
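
A sketch of such a spin/yield/sleep slow path, reusing the AO_
primitives from the test program attached further down the thread; the
thresholds are made-up tuning knobs, not measured values:

	#include <sched.h>
	#include <unistd.h>
	#include "atomic_ops.h"

	void spin_lock_ool(AO_TS_T *lock)
	{
		int tries = 0;

		while (AO_test_and_set_acquire(lock) != AO_TS_CLEAR) {
			if (tries < 64)
				;			/* short wait: spin */
			else if (tries < 1024)
				sched_yield();		/* medium: yield */
			else
				usleep(1000);		/* long: sleep */
			tries++;
		}
	}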

You are forgetting two issues in your analysis of what pthreads is/should be doing
relative to the spin-lock-like code:

1) The unlock code is different.  If you potentially do a waitforunlocks()
in the locking code, you need to at least check whether the corresponding
notification is necessary when you unlock().  For NPTL that requires another
atomic operation, and hence another dozen to a couple of hundred cycles,
depending on the processor.  You need to look at both the lock and unlock
code.

2) (I hadn't mentioned this before.) The (standard interpretation of)
the memory barrier semantics of the pthreads primitives is too strong.
Arguably they need to be full memory barriers in both directions.
The pthread_spin_lock code inserts an extra full
memory barrier on IA64 as a result, instead of just
using the acquire barrier associated with the cmpxchg.acq instruction. 
(I think the spin unlock code doesn't do this.  One could argue that that's a bug,
though I would argue that the bug is really in the pthreads spec.)

Hans

> -----Original Message-----
> From: Davide Libenzi [mailto:davidel@xmailserver.org]
> Sent: Friday, May 23, 2003 5:20 PM
> To: Boehm, Hans
> Cc: 'Arjan van de Ven'; Hans Boehm; MOSBERGER, DAVID
> (HP-PaloAlto,unix3); Linux Kernel Mailing List; 
> linux-ia64@linuxia64.org
> Subject: RE: [Linux-ia64] Re: web page on O(1) scheduler
> 
> 
> On Fri, 23 May 2003, Boehm, Hans wrote:
> 
> > Pthread_spin_lock() under the NPTL version in RH9 does 
> basically what my
> > custom locks do in the uncontested case, aside from the 
> function call.
> > But remember that this began with a discussion about whether it was
> > reasonable for user locking code to explicitly yield rather 
> than relying
> > on pthreads to suspend the thread.  I don't think 
> pthread_spin_lock is
> > relevant in this context, for two reasons:
> >
> > 1) At least the RH9 version of pthread_spin_lock in NPTL 
> literally spins
> > and makes no attempt to yield or block.  This only makes 
> sense at user
> > level if you are 100% certain that the processors won't be
> > overcommitted. Otherwise there is little to be lost by 
> blocking once you
> > have spun for sufficiently long.  You could use 
> pthread_spin_trylock and
> > block explicitly, but that gets us back to custom blocking code.
> 
> Yes, that would be a spinlock. Your code was basically a spinlock that
> instead of spinning was doing abort() in the contention case. Again, you
> measured two different things. Even if the pthread mutex does 
> something
> very simple like :
> 
> 	spinlock(mtx->lock);
> 	while (mtx->busy) {
> 		spinunlock(mtx->lock);
> 		waitforunlocks();
> 		spinlock(mtx->lock);
> 	}
> 	mtx->busy++;
> 	spinunlock(mtx->lock);
> 
> Only the fact that this code likely resides inside a deeper call level
> will make you pay in a tight loop like yours.
> 
> 
> > 2) AFAICT, pthread_spin_lock is currently a little too 
> bleeding edge to
> > be widely used.  I tried to time it, but failed.  Pthread.h doesn't
> > include the declaration for pthread_spin_lock_t by default, 
> at least not
> > yet.  It doesn't seem to have a Linux man page, yet.  I 
> tried to define
> > the magic macro to get it declared, but that broke something else.
> 
> $ gcc -D_GNU_SOURCE ...
> 
> 
> 
> - Davide
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-24  0:10 Boehm, Hans
@ 2003-05-24  0:20 ` Davide Libenzi
  0 siblings, 0 replies; 47+ messages in thread
From: Davide Libenzi @ 2003-05-24  0:20 UTC (permalink / raw)
  To: Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

On Fri, 23 May 2003, Boehm, Hans wrote:

> Pthread_spin_lock() under the NPTL version in RH9 does basically what my
> custom locks do in the uncontested case, aside from the function call.
> But remember that this began with a discussion about whether it was
> reasonable for user locking code to explicitly yield rather than relying
> on pthreads to suspend the thread.  I don't think pthread_spin_lock is
> relevant in this context, for two reasons:
>
> 1) At least the RH9 version of pthread_spin_lock in NPTL literally spins
> and makes no attempt to yield or block.  This only makes sense at user
> level if you are 100% certain that the processors won't be
> overcommitted. Otherwise there is little to be lost by blocking once you
> have spun for sufficiently long.  You could use pthread_spin_trylock and
> block explicitly, but that gets us back to custom blocking code.

Yes, that would be a spinlock. Your code was basically a spinlock that
instead of spinning was doing abort() in the contention case. Again, you
measured two different things. Even if the pthread mutex does something
very simple like:

	spinlock(mtx->lock);
	while (mtx->busy) {
		spinunlock(mtx->lock);
		waitforunlocks();
		spinlock(mtx->lock);
	}
	mtx->busy++;
	spinunlock(mtx->lock);

Only the fact that this code likely resides inside a deeper call level
will make you pay in a tight loop like yours.


> 2) AFAICT, pthread_spin_lock is currently a little too bleeding edge to
> be widely used.  I tried to time it, but failed.  Pthread.h doesn't
> include the declaration for pthread_spinlock_t by default, at least not
> yet.  It doesn't seem to have a Linux man page, yet.  I tried to define
> the magic macro to get it declared, but that broke something else.

$ gcc -D_GNU_SOURCE ...



- Davide


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
@ 2003-05-24  0:10 Boehm, Hans
  2003-05-24  0:20 ` Davide Libenzi
  0 siblings, 1 reply; 47+ messages in thread
From: Boehm, Hans @ 2003-05-24  0:10 UTC (permalink / raw)
  To: 'Davide Libenzi', Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

Pthread_spin_lock() under the NPTL version in RH9 does basically what my custom locks do in the uncontested case, aside from the function call.  But remember that this began with a discussion about whether it was reasonable for user locking code to explicitly yield rather than relying on pthreads to suspend the thread.  I don't think pthread_spin_lock is relevant in this context, for two reasons:

1) At least the RH9 version of pthread_spin_lock in NPTL literally spins and makes no attempt to yield or block.  This only makes sense at user level if you are 100% certain that the processors won't be overcommitted.  Otherwise there is little to be lost by blocking once you have spun for sufficiently long.  You could use pthread_spin_trylock and block explicitly, but that gets us back to custom blocking code.

2) AFAICT, pthread_spin_lock is currently a little too bleeding edge to be widely used.  I tried to time it, but failed.  Pthread.h doesn't include the declaration for pthread_spinlock_t by default, at least not yet.  It doesn't seem to have a Linux man page, yet.  I tried to define the magic macro to get it declared, but that broke something else.

Hans

> -----Original Message-----
> From: Davide Libenzi [mailto:davidel@xmailserver.org]
> Sent: Friday, May 23, 2003 11:05 AM
> To: Boehm, Hans
> Cc: 'Arjan van de Ven'; Hans Boehm; MOSBERGER, DAVID
> (HP-PaloAlto,unix3); Linux Kernel Mailing List; 
> linux-ia64@linuxia64.org
> Subject: RE: [Linux-ia64] Re: web page on O(1) scheduler
> 
> 
> On Fri, 23 May 2003, Boehm, Hans wrote:
> 
> > Sorry about the typo and misnaming for the test program.  I 
> attached the correct version that prints the right labels.
> >
> > The results I posted did not use NPTL.  (Presumably OpenMP 
> wasn't targeted at NPTL either.)  I don't think that NPTL has 
> any bearing on the underlying issues that I mentioned, though 
> path lengths are probably a bit shorter.  It should also 
> handle contention substantially better, but that wasn't tested.
> >
> > I did rerun the test case on a 900 MHz Itanium 2 machine 
> with a more recent Debian installation with NPTL.  I get 
> 200msecs (20nsecs/iter) with the custom lock, and 768 for 
> pthreads.  (With static linking that decreases to 658 for 
> pthreads.)  Pthreads (and/or some of the other 
> infrastructure) is clearly getting better, but I don't think 
> the difference will disappear.
> 
> To make things fairer you should test against pthread spinlocks. Also,
> for tight loops like that, even an extra level of call depth (which
> pthread is likely to add) is going to matter.
> 
> 
> 
> - Davide
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
  2003-05-23 17:48 [Linux-ia64] Re: web page on O(1) scheduler Boehm, Hans
@ 2003-05-23 18:04 ` Davide Libenzi
  0 siblings, 0 replies; 47+ messages in thread
From: Davide Libenzi @ 2003-05-23 18:04 UTC (permalink / raw)
  To: Boehm, Hans
  Cc: 'Arjan van de Ven',
	Hans Boehm, MOSBERGER, DAVID (HP-PaloAlto,unix3),
	Linux Kernel Mailing List, linux-ia64

On Fri, 23 May 2003, Boehm, Hans wrote:

> Sorry about the typo and misnaming for the test program.  I attached the correct version that prints the right labels.
>
> The results I posted did not use NPTL.  (Presumably OpenMP wasn't targeted at NPTL either.)  I don't think that NPTL has any bearing on the underlying issues that I mentioned, though path lengths are probably a bit shorter.  It should also handle contention substantially better, but that wasn't tested.
>
> I did rerun the test case on a 900 MHz Itanium 2 machine with a more recent Debian installation with NPTL.  I get 200msecs (20nsecs/iter) with the custom lock, and 768 for pthreads.  (With static linking that decreases to 658 for pthreads.)  Pthreads (and/or some of the other infrastructure) is clearly getting better, but I don't think the difference will disappear.

To make things fairer you should test against pthread spinlocks. Also,
for tight loops like that, even an extra level of call depth (which
pthread is likely to add) is going to matter.



- Davide


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [Linux-ia64] Re: web page on O(1) scheduler
@ 2003-05-23 17:48 Boehm, Hans
  2003-05-23 18:04 ` Davide Libenzi
  0 siblings, 1 reply; 47+ messages in thread
From: Boehm, Hans @ 2003-05-23 17:48 UTC (permalink / raw)
  To: 'Arjan van de Ven', Hans Boehm
  Cc: MOSBERGER, DAVID (HP-PaloAlto,unix3), linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1526 bytes --]

Sorry about the typo and misnaming for the test program.  I attached the correct version that prints the right labels.

The results I posted did not use NPTL.  (Presumably OpenMP wasn't targeted at NPTL either.)  I don't think that NPTL has any bearing on the underlying issues that I mentioned, though path lengths are probably a bit shorter.  It should also handle contention substantially better, but that wasn't tested.

I did rerun the test case on a 900 MHz Itanium 2 machine with a more recent Debian installation with NPTL.  I get 200msecs (20nsecs/iter) with the custom lock, and 768 for pthreads.  (With static linking that decreases to 658 for pthreads.)  Pthreads (and/or some of the other infrastructure) is clearly getting better, but I don't think the difference will disappear.

I don't have a Xeon with NPTL handy.  On an old PPro, the results were 1133 and 4379 msecs for custom and NPTL respectively.

Hans

> -----Original Message-----
> From: Arjan van de Ven [mailto:arjanv@redhat.com]
> Sent: Friday, May 23, 2003 1:31 AM
> To: Hans Boehm
> Cc: Arjan van de Ven; davidm@hpl.hp.com; linux-kernel@vger.kernel.org;
> linux-ia64@linuxia64.org
> Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler
> 
> 
> On Thu, May 22, 2003 at 06:07:46PM -0700, Hans Boehm wrote:
> > case.
> > 
> > On a 1GHz Itanium 2 I get
> > 
> > Custom lock: 180 msecs
> > Custom lock: 1382 msecs
> > 
> > On a 2GHz Xeon, I get
> > 
> > Custom lock: 646 msecs
> > Custom lock: 1659 msecs
> 
> is the pthreads one with nptl ?
> 


[-- Attachment #2: time_lock.c --]
[-- Type: application/octet-stream, Size: 1760 bytes --]

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>	/* for abort() in spin_lock_ool */
#include <sys/time.h>
#include "atomic_ops.h"

/* Timing code stolen from Ellis/Kovac/Boehm GCBench.			*/
#define currentTime() stats_rtclock()
#define elapsedTime(x) (x)

unsigned long
stats_rtclock( void )
{
  struct timeval t;
  struct timezone tz;

  if (gettimeofday( &t, &tz ) == -1)
    return 0;
  /* Cast before scaling: on 32-bit platforms tv_sec * 1000 would	*/
  /* overflow a signed long; unsigned arithmetic keeps differences	*/
  /* between two readings correct even if the value wraps.		*/
  return ((unsigned long)t.tv_sec * 1000 + t.tv_usec / 1000);
}

AO_TS_T my_spin_lock = AO_TS_INITIALIZER;

pthread_mutex_t my_pthread_lock = PTHREAD_MUTEX_INITIALIZER;

void spin_lock_ool(AO_TS_T *lock)
{
  /* Should repeatedly retry the AO_test_and_set_acquire, perhaps	*/
  /* after trying a plain read.  Should "exponentially" back off	*/
  /* between tries.  For short time periods it should spin, for 	*/
  /* medium ones it should use sched_yield, and for longer ones usleep. */

  /* For now we punt, since this is a contention-free test.		*/
  abort();
}

static inline void spin_lock(AO_TS_T *lock)
{
  /* Fast path: one test-and-set with acquire semantics; fall back to	*/
  /* the out-of-line path only if the lock is already held.		*/
  if (__builtin_expect(AO_test_and_set_acquire(lock) != AO_TS_CLEAR, 0))
    spin_lock_ool(lock);
}

static inline void spin_unlock(AO_TS_T *lock)
{
  AO_CLEAR(lock);
}

int main()
{
  unsigned long start_time, end_time;
  int i;
  
  start_time = currentTime();
  for (i = 0; i < 10000000; ++i) {
    spin_lock(&my_spin_lock);
    spin_unlock(&my_spin_lock);
  }
  end_time = currentTime();
  fprintf(stderr, "Custom lock: %lu msecs\n",
	  elapsedTime(end_time - start_time));
  start_time = currentTime();
  for (i = 0; i < 10000000; ++i) {
    pthread_mutex_lock(&my_pthread_lock);
    pthread_mutex_unlock(&my_pthread_lock);
  }
  end_time = currentTime();
  fprintf(stderr, "Pthread lock: %lu msecs\n",
	  elapsedTime(end_time - start_time));
  return 0;
}
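
The comment in spin_lock_ool above describes a staged backoff but punts
with abort().  A sketch of how it might be filled in, reusing the same
atomic_ops primitives; the thresholds, the plain volatile read of the
lock word, and the loop structure are illustrative guesses, not from
the original:

#include <sched.h>
#include <unistd.h>
#include "atomic_ops.h"

void spin_lock_ool(AO_TS_T *lock)
{
  int tries = 0;
  for (;;) {
    /* Plain read first, so we spin on a cached value instead of	*/
    /* bouncing the cache line with atomic test-and-set operations.	*/
    while (*(volatile AO_TS_T *)lock != AO_TS_CLEAR) {
      ++tries;
      if (tries < 100)
        ;			/* short wait: just spin */
      else if (tries < 1000)
        sched_yield();		/* medium wait: give up the CPU slice */
      else
        usleep(1000);		/* long wait: sleep for a millisecond */
    }
    if (AO_test_and_set_acquire(lock) == AO_TS_CLEAR)
      return;
  }
}

The thresholds would need tuning per platform; the point of the plain
read is that waiters spin locally in cache until the holder's AO_CLEAR
makes the lock word visible again.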


Thread overview: 47+ messages
2003-05-21  6:49 web page on O(1) scheduler David Mosberger
2003-05-21  9:01 ` Arjan van de Ven
2003-05-21 10:40   ` [Linux-ia64] " Duraid Madina
2003-05-21 10:43     ` Christoph Hellwig
2003-05-21 13:12     ` Arjan van de Ven
2003-05-21 13:51       ` Olivier Galibert
2003-05-28 22:12         ` Bill Davidsen
     [not found]         ` <Pine.LNX.3.96.1030528180909.21414B-100000@gatekeeper.tmr.com>
2003-05-29  5:59           ` Mike Galbraith
2003-06-02  8:05             ` Ingo Molnar
2003-06-04  4:07               ` Bill Davidsen
     [not found]             ` <Pine.LNX.4.44.0306020949520.3375-100000@localhost.localdomain>
2003-06-02 13:51               ` Mike Galbraith
2003-06-04  3:52             ` Bill Davidsen
2003-06-04  4:55               ` David Schwartz
     [not found]             ` <Pine.LNX.3.96.1030603234616.16495B-100000@gatekeeper.tmr.com>
2003-06-04  7:13               ` Mike Galbraith
2003-06-04 15:30                 ` Jan Harkes
2003-05-21 19:18       ` Duraid Madina
2003-05-21 20:03         ` Helge Hafting
2003-05-21 22:59         ` Alan Cox
2003-05-21 15:18   ` David Mosberger
2003-05-23  1:07   ` [Linux-ia64] " Hans Boehm
2003-05-23  8:30     ` Arjan van de Ven
2003-05-21  9:26 ` Mike Galbraith
2003-05-21  9:30   ` Mike Galbraith
2003-05-21 17:56   ` David Mosberger
2003-05-21 20:46     ` Mike Galbraith
2003-05-22  9:52     ` Mike Galbraith
2003-05-22 16:25       ` Mike Galbraith
2003-05-22 17:58         ` David Mosberger
2003-05-27 15:16         ` [case closed] " Mike Galbraith
2003-06-03 20:59       ` sched.c gives ICE [Was: Re: web page on O(1) scheduler] J.A. Magallon
2003-06-03 22:29         ` Mike Galbraith
2003-06-22 22:03         ` J.A. Magallon
2003-06-22 22:10           ` J.A. Magallon
2003-05-22  0:38   ` web page on O(1) scheduler Rik van Riel
     [not found]   ` <Pine.LNX.4.50L.0305212038120.5425-100000@imladris.surriel.com>
2003-05-22  5:52     ` Mike Galbraith
2003-05-22 14:47       ` Valdis.Kletnieks
2003-05-22 16:12         ` Mike Galbraith
2003-05-25  9:17       ` Mike Galbraith
2003-05-21 18:31 ` [Linux-ia64] " David Mosberger
2003-05-21 20:00   ` Cyclades Cyclom-Y ISA on 2.5.69 John Stoffel
2003-05-23 17:48 [Linux-ia64] Re: web page on O(1) scheduler Boehm, Hans
2003-05-23 18:04 ` Davide Libenzi
2003-05-24  0:10 Boehm, Hans
2003-05-24  0:20 ` Davide Libenzi
2003-05-24  0:53 Boehm, Hans
2003-05-24  5:38 ` Davide Libenzi
2003-05-24 14:43   ` Davide Libenzi
