linux-kernel.vger.kernel.org archive mirror
* sched_yield: delete sysctl_sched_compat_yield
@ 2007-11-27  9:33 Zhang, Yanmin
  2007-11-27 11:17 ` Ingo Molnar
  2007-11-27 22:57 ` Arjan van de Ven
  0 siblings, 2 replies; 38+ messages in thread
From: Zhang, Yanmin @ 2007-11-27  9:33 UTC (permalink / raw)
  To: mingo; +Cc: LKML

If I echo "1" > /proc/sys/kernel/sched_compat_yield before starting the
volanoMark test, the result is very good with kernel 2.6.24-rc3 on my 16-core tigerton.

1) With /proc/sys/kernel/sched_compat_yield=1, compared with 2.6.22,
2.6.24-rc3 shows more than a 70% improvement;
2) With /proc/sys/kernel/sched_compat_yield=0, compared with 2.6.22,
2.6.24-rc3 shows more than an 80% regression.

On other machines, the volanoMark result also improves a lot with
/proc/sys/kernel/sched_compat_yield=1.

Would you like to change yield_task_fair to delete the code around
sysctl_sched_compat_yield, or just initialize it to 1?

Thanks,
Yanmin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-27  9:33 sched_yield: delete sysctl_sched_compat_yield Zhang, Yanmin
@ 2007-11-27 11:17 ` Ingo Molnar
  2007-11-27 22:57 ` Arjan van de Ven
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-11-27 11:17 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> If echo "1">/proc/sys/kernel/sched_compat_yield before starting 
> volanoMark testing, the result is very good with kernel 2.6.24-rc3 on 
> my 16-core tigerton.

yep, that's known and has been discussed in detail on lkml. Java should 
use something more suitable than sched_yield for its locking. Yield will 
always depend on scheduling details, and _some_ category of apps will 
always be hurt. That's why we offer the sched_compat_yield flag.

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-27  9:33 sched_yield: delete sysctl_sched_compat_yield Zhang, Yanmin
  2007-11-27 11:17 ` Ingo Molnar
@ 2007-11-27 22:57 ` Arjan van de Ven
  2007-11-30  2:46   ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Arjan van de Ven @ 2007-11-27 22:57 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: mingo, LKML

On Tue, 27 Nov 2007 17:33:05 +0800
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> my 16-core tigerton.
> 
> 1) If /proc/sys/kernel/sched_compat_yield=1, comparing with 2.6.22,
> 2.6.24-rc3 has more than 70% improvement;
> 2) If /proc/sys/kernel/sched_compat_yield=0, comparing with 2.6.22,
> 2.6.24-rc3 has more than 80% regression;
> 
> On other machines, the volanoMark result also has much improvement if
> /proc/sys/kernel/sched_compat_yield=1.
> 
> Would you like to change function yield_task_fair to delete codes
> around sysctl_sched_compat_yield, or just initiate it to 1?
> 

sounds like a bad idea; volanomark (well, technically the jvm behind
it) is abusing sched_yield() by assuming it does something it really
doesn't do, and as it happens some of the earlier 2.6 schedulers
accidentally happened to behave in a way that was nice for this
benchmark.

Today's kernel has a somewhat different behavior (and before people
scream "regression": sched_yield() behavior isn't really specified and
doesn't make any sense at all; whatever you get is what you get.
It's pretty much an insane de facto behavior that is incredibly tied to
which decisions the scheduler makes, and no app can depend on that
in any way). In fact, I've proposed to make sched_yield() just do an
msleep(1)... that'd be closer to what sched_yield() is supposed to do
standards-wise than any of the current behaviors... ;-)


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-27 22:57 ` Arjan van de Ven
@ 2007-11-30  2:46   ` Nick Piggin
  2007-11-30  2:51     ` Arjan van de Ven
  2007-11-30  3:15     ` Zhang, Yanmin
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-11-30  2:46 UTC (permalink / raw)
  To: Arjan van de Ven, Andrew Morton; +Cc: Zhang, Yanmin, mingo, LKML

On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> On Tue, 27 Nov 2007 17:33:05 +0800
>
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > my 16-core tigerton.
> >
> > 1) If /proc/sys/kernel/sched_compat_yield=1, comparing with 2.6.22,
> > 2.6.24-rc3 has more than 70% improvement;
> > 2) If /proc/sys/kernel/sched_compat_yield=0, comparing with 2.6.22,
> > 2.6.24-rc3 has more than 80% regression;
> >
> > On other machines, the volanoMark result also has much improvement if
> > /proc/sys/kernel/sched_compat_yield=1.
> >
> > Would you like to change function yield_task_fair to delete codes
> > around sysctl_sched_compat_yield, or just initiate it to 1?
>
> sounds like a bad idea; volanomark (well, technically the jvm behind
> it) is abusing sched_yield() by assuming it does something it really
> doesn't do, and as it happens some of the earlier 2.6 schedulers
> accidentally happened to behave in a way that was nice for this
> benchmark.

OK, why is this still happening? Haven't we been asking JVMs to use
futexes or posix locking for years and years now? Are there any sane
jvms that _don't_ use yield?


> Todays kernel has a different behavior somewhat (and before people
> scream "regression"; sched_yield() behavior isn't really specified and
> doesn't make any sense at all, whatever you get is what you get....
> it's pretty much an insane defacto behavior that is incredibly tied to
> which decisions the scheduler makes how, and no app can depend on that

It is a performance regression. Is there any reason *not* to use the
"compat" yield by default? As you say, for SCHED_OTHER tasks, yield
can do almost anything. We may as well do something that isn't a
regression...


> in any way. In fact, I've proposed to make sched_yield() just do an
> msleep(1)... that'd be closer to what sched_yield is supposed to do
> standard wise than any of the current behaviors .... ;_

What makes you say that? IIRC, of all the things that sched_yield can
do, it is not allowed to block. So this is about the only thing that
would break the standard...


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  2:46   ` Nick Piggin
@ 2007-11-30  2:51     ` Arjan van de Ven
  2007-11-30  3:02       ` Nick Piggin
  2007-11-30  3:15     ` Zhang, Yanmin
  1 sibling, 1 reply; 38+ messages in thread
From: Arjan van de Ven @ 2007-11-30  2:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Zhang, Yanmin, mingo, LKML

On Fri, 30 Nov 2007 13:46:22 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > Todays kernel has a different behavior somewhat (and before people
> > scream "regression"; sched_yield() behavior isn't really specified
> > and doesn't make any sense at all, whatever you get is what you
> > get.... it's pretty much an insane defacto behavior that is
> > incredibly tied to which decisions the scheduler makes how, and no
> > app can depend on that
> 
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression..

it just makes OTHER tests/benchmarks regress.... this is one of those
things where you just can't win.

> 
> 
> > in any way. In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standard wise than any of the current behaviors .... ;_
> 
> What makes you say that? IIRC of all the things that sched_yeild can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...

sched_yield OF COURSE can block.. it's a schedule call after all!



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  2:51     ` Arjan van de Ven
@ 2007-11-30  3:02       ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-11-30  3:02 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, Zhang, Yanmin, mingo, LKML

On Friday 30 November 2007 13:51, Arjan van de Ven wrote:
> On Fri, 30 Nov 2007 13:46:22 +1100
>
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Todays kernel has a different behavior somewhat (and before people
> > > scream "regression"; sched_yield() behavior isn't really specified
> > > and doesn't make any sense at all, whatever you get is what you
> > > get.... it's pretty much an insane defacto behavior that is
> > > incredibly tied to which decisions the scheduler makes how, and no
> > > app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression..
>
> it just makes OTHER tests/benchmarks regress.... this is one of those
> things where you just can't win.

OK, which ones? Because java is slightly important...


> > > in any way. In fact, I've proposed to make sched_yield() just do an
> > > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > > standard wise than any of the current behaviors .... ;_
> >
> > What makes you say that? IIRC of all the things that sched_yeild can
> > do, it is not allowed to block. So this is about the only thing that
> > will break the standard...
>
> sched_yield OF COURSE can block.. it's a schedule call after all!

In unix, blocking ~= removed from runqueue, no?

OF COURSE it is allowed to cooperatively schedule another task, but
I don't see why you think it should so obviously be allowed to block
/ sleep.

It breaks basically the only invariant of sched_yield, in that the
task would no longer run when there is nothing else running.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  2:46   ` Nick Piggin
  2007-11-30  2:51     ` Arjan van de Ven
@ 2007-11-30  3:15     ` Zhang, Yanmin
  2007-11-30  3:29       ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Zhang, Yanmin @ 2007-11-30  3:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Arjan van de Ven, Andrew Morton, mingo, LKML

On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> > On Tue, 27 Nov 2007 17:33:05 +0800
> >
> > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > > If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > > my 16-core tigerton.
> > >
> > > 1) If /proc/sys/kernel/sched_compat_yield=1, comparing with 2.6.22,
> > > 2.6.24-rc3 has more than 70% improvement;
> > > 2) If /proc/sys/kernel/sched_compat_yield=0, comparing with 2.6.22,
> > > 2.6.24-rc3 has more than 80% regression;
> > >
> > > On other machines, the volanoMark result also has much improvement if
> > > /proc/sys/kernel/sched_compat_yield=1.
> > >
> > > Would you like to change function yield_task_fair to delete codes
> > > around sysctl_sched_compat_yield, or just initiate it to 1?
> >
> > sounds like a bad idea; volanomark (well, technically the jvm behind
> > it) is abusing sched_yield() by assuming it does something it really
> > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > accidentally happened to behave in a way that was nice for this
> > benchmark.
> 
> OK, why is this still happening? Haven't we been asking JVMs to use
> futexes or posix locking for years and years now? Are there any sane
> jvms that _don't_ use yield?
I think it's an issue with volanomark (a java application) rather than with the JVM.

> 
> 
> > Todays kernel has a different behavior somewhat (and before people
> > scream "regression"; sched_yield() behavior isn't really specified and
> > doesn't make any sense at all, whatever you get is what you get....
> > it's pretty much an insane defacto behavior that is incredibly tied to
> > which decisions the scheduler makes how, and no app can depend on that
> 
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default?
There is none, so I suggest setting sched_compat_yield=1 by default.
With sched_compat_yield=0, the kernel does almost nothing but return. With
sched_compat_yield=1, the behavior is closer to what the sched_yield man page describes.

> As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression...
I just found SCHED_OTHER in man sched_setscheduler. Is it the same as
SCHED_NORMAL in the latest kernel?

> 
> 
> > in any way. In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standard wise than any of the current behaviors .... ;_
> 
> What makes you say that? IIRC of all the things that sched_yeild can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  3:15     ` Zhang, Yanmin
@ 2007-11-30  3:29       ` Nick Piggin
  2007-11-30  4:32         ` Zhang, Yanmin
  2007-11-30 10:08         ` Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-11-30  3:29 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Arjan van de Ven, Andrew Morton, mingo, LKML

On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:

> > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > it) is abusing sched_yield() by assuming it does something it really
> > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > accidentally happened to behave in a way that was nice for this
> > > benchmark.
> >
> > OK, why is this still happening? Haven't we been asking JVMs to use
> > futexes or posix locking for years and years now? Are there any sane
> > jvms that _don't_ use yield?
>
> I think it's an issue of volanomark (a kind of java application) instead of
> JVM.

volanomark itself and not the jvm is calling sched_yield()? Do we have
any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)


> > > Todays kernel has a different behavior somewhat (and before people
> > > scream "regression"; sched_yield() behavior isn't really specified and
> > > doesn't make any sense at all, whatever you get is what you get....
> > > it's pretty much an insane defacto behavior that is incredibly tied to
> > > which decisions the scheduler makes how, and no app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default?
>
> There is no, so I suggest to set sched_compat_yield=1 by default.
> If sched_compat_yield=0, kernel almost does nothing but returns. When
> sched_compat_yield=1, it is closer to the meaning of sched_yield man page.

sched_yield() is really only defined for posix realtime scheduling
AFAIK, which talks about priority lists. 

SCHED_OTHER is defined to be a single priority, below the rest of the
realtime priorities. So at first you *might* say that the process
should then be made to run only after all other SCHED_OTHER processes,
however there is no such ordering requirement for SCHED_OTHER
scheduling. The SCHED_OTHER scheduler can run any task at any time.

That said, I think people would *expect* that call to be much closer to
the compat behaviour than the current default. And that's definitely
what Linux has done in the past. So there really does need to be a
good reason to change it like this, IMO.


> > As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression...
>
> I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> the latest kernel?

Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  3:29       ` Nick Piggin
@ 2007-11-30  4:32         ` Zhang, Yanmin
  2007-11-30 10:08         ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Zhang, Yanmin @ 2007-11-30  4:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Arjan van de Ven, Andrew Morton, mingo, LKML

On Fri, 2007-11-30 at 14:29 +1100, Nick Piggin wrote:
> On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> > On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> 
> > > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > > it) is abusing sched_yield() by assuming it does something it really
> > > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > > accidentally happened to behave in a way that was nice for this
> > > > benchmark.
> > >
> > > OK, why is this still happening? Haven't we been asking JVMs to use
> > > futexes or posix locking for years and years now? Are there any sane
> > > jvms that _don't_ use yield?
> >
> > I think it's an issue of volanomark (a kind of java application) instead of
> > JVM.
> 
> volanomark itself and not the jvm is calling sched_yield()? Do we have
> any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)
I ran lots of well-known benchmarks, and volanoMark is the one that gets the largest
impact from sched_yield.

As for real applications that use sched_yield: mostly, they are not open source.
Yesterday I learned that someone was using sched_yield in his network C programs,
but he didn't want to share the sources with me.

> 
> 
> > > > Todays kernel has a different behavior somewhat (and before people
> > > > scream "regression"; sched_yield() behavior isn't really specified and
> > > > doesn't make any sense at all, whatever you get is what you get....
> > > > it's pretty much an insane defacto behavior that is incredibly tied to
> > > > which decisions the scheduler makes how, and no app can depend on that
> > >
> > > It is a performance regression. Is there any reason *not* to use the
> > > "compat" yield by default?
> >
> > There is no, so I suggest to set sched_compat_yield=1 by default.
> > If sched_compat_yield=0, kernel almost does nothing but returns. When
> > sched_compat_yield=1, it is closer to the meaning of sched_yield man page.
> 
> sched_yield() is really only defined for posix realtime scheduling
> AFAIK, which talks about priority lists. 
> 
> SCHED_OTHER is defined to be a single priority, below the rest of the
> realtime priorities. So at first you *might* say that the process
> should then be made to run only after all other SCHED_OTHER processes,
> however there is no such ordering requirement for SCHED_OTHER
> scheduling. The SCHED_OTHER scheduler can run any task at any time.
> 
> That said, I think people would *expect* that call be much closer to
> the compat behaviour than the current default. And that's definitely
> what Linux has done in the past. So there really does need to be a
> good reason to change it like this IMO.
That's indeed what I am thinking.

I am running many tests (SPECjbb/SPECjbb2005/cpu2000/iozone/dbench/tbench...) to 
see if there is any regression with sched_compat_yield=1. I expect there is no
regression; the testing is just to double-check.

> 
> 
> > > As you say, for SCHED_OTHER tasks, yield
> > > can do almost anything. We may as well do something that isn't a
> > > regression...
> >
> > I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> > the latest kernel?
> 
> Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...
Thanks.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30  3:29       ` Nick Piggin
  2007-11-30  4:32         ` Zhang, Yanmin
@ 2007-11-30 10:08         ` Ingo Molnar
  2007-12-03  4:27           ` Nick Piggin
  2007-12-03  9:29           ` Zhang, Yanmin
  1 sibling, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-11-30 10:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Haven't we been asking JVMs to use futexes or posix locking for years 
> and years now? [...]

i'm curious, with what JVM was it tested and where's the source so i can 
fix their locking for them? Can the problem be reproduced with:

  http://download.fedora.redhat.com/pub/fedora/linux/development/source/SRPMS/java-1.7.0-icedtea-1.7.0.0-0.20.b23.snapshot.fc9.src.rpm

?

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30 10:08         ` Ingo Molnar
@ 2007-12-03  4:27           ` Nick Piggin
  2007-12-03  8:45             ` Ingo Molnar
  2007-12-03  9:29           ` Zhang, Yanmin
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-12-03  4:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML

On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Haven't we been asking JVMs to use futexes or posix locking for years
> > and years now? [...]
>
> i'm curious, with what JVM was it tested and where's the source so i can
> fix their locking for them? Can the problem be reproduced with:

Sure, but why shouldn't the compat behaviour be the default, and the
sysctl go away?

It makes older JVMs work better, it is slightly closer to the old
behaviour, and it is arguably a less surprising result.



* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  4:27           ` Nick Piggin
@ 2007-12-03  8:45             ` Ingo Molnar
  2007-12-03  9:17               ` Nick Piggin
  2007-12-03  9:41               ` Zhang, Yanmin
  0 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03  8:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > and years now? [...]
> >
> > i'm curious, with what JVM was it tested and where's the source so i 
> > can fix their locking for them? Can the problem be reproduced with:
> 
> Sure, but why shouldn't the compat behaviour be the default, and the 
> sysctl go away?
> 
> It makes older JVMs work better, it is slightly closer to the old 
> behaviour, and it is arguably a less surprising result.

as far as desktop apps such as firefox go, the exact opposite is true. 
We had two choices basically: either a "more aggressive" yield than 
before or a "less aggressive" yield. Desktop apps were reported to be hurt 
by a "more aggressive" yield (firefox for example gets some pretty bad 
delays), so we defaulted to the less aggressive method (and we defaulted 
to that in v2.6.23 already). Really, in this sense volanomark is another 
test like dbench: we care about it, but not unconditionally, and in this 
case a really silly use of the API is at the center of the problem. 
Talking about the default alone will not bring us forward, but we can 
certainly add helpers to identify SCHED_OTHER::yield tasks; a once-per-bootup 
warning perhaps?

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  8:45             ` Ingo Molnar
@ 2007-12-03  9:17               ` Nick Piggin
  2007-12-03  9:35                 ` Zhang, Yanmin
  2007-12-03  9:57                 ` Ingo Molnar
  2007-12-03  9:41               ` Zhang, Yanmin
  1 sibling, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-12-03  9:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML

On Monday 03 December 2007 19:45, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > and years now? [...]
> > >
> > > i'm curious, with what JVM was it tested and where's the source so i
> > > can fix their locking for them? Can the problem be reproduced with:
> >
> > Sure, but why shouldn't the compat behaviour be the default, and the
> > sysctl go away?
> >
> > It makes older JVMs work better, it is slightly closer to the old
> > behaviour, and it is arguably a less surprising result.
>
> as far as desktop apps such as firefox goes, the exact opposite is true.
> We had two choices basically: either a "more agressive" yield than
> before or a "less agressive" yield. Desktop apps were reported to hurt
> from a "more agressive" yield (firefox for example gets some pretty bad
> delays), so we defaulted to the less agressive method. (and we defaulted
> to that in v2.6.23 already)

Yeah, I doubt the 2.6.23 scheduler will be usable for distros though...


> Really, in this sense volanomark is another 
> test like dbench - we care about it but not unconditionally and in this
> case it's a really silly API use that is at the center of the problem.

Sure, but do you know whether _real_ java server applications are OK? Is it
possible to reduce the aggressiveness of yield to a mid-way point? Were the
firefox tests also like dbench (ie. were they done with make -j huge or
some other insane scheduler load)?


> Talking about the default alone will not bring us forward, but we can
> certainly add helpers to identify SCHED_OTHER::yield tasks - a once per
> bootup warning perhaps?

I don't care about keeping the behaviour for future apps. But for older
code out there, it is very important to still work well.

I was just talking about the default because I didn't know the reason
for the way it was set -- now that I do, we should talk about trying to
improve the actual code so we don't need 2 defaults.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-11-30 10:08         ` Ingo Molnar
  2007-12-03  4:27           ` Nick Piggin
@ 2007-12-03  9:29           ` Zhang, Yanmin
  2007-12-03 10:05             ` Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Zhang, Yanmin @ 2007-12-03  9:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Arjan van de Ven, Andrew Morton, LKML

On Fri, 2007-11-30 at 11:08 +0100, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > Haven't we been asking JVMs to use futexes or posix locking for years 
> > and years now? [...]
> 
> i'm curious, with what JVM was it tested and where's the source so i can 
> fix their locking for them? Can the problem be reproduced with:
> 
>   http://download.fedora.redhat.com/pub/fedora/linux/development/source/SRPMS/java-1.7.0-icedtea-1.7.0.0-0.20.b23.snapshot.fc9.src.rpm
I used BEA JRockit to run volanoMark. Since the JRockit sources are not available,
I retested volanoMark with the jre-1.7.0-icedtea.x86_64 java of Fedora Core 8 on my
stoakley (8-core) machine with kernel 2.6.24-rc3.

1) JRockit: the sched_compat_yield=0 result is less than 15% of the sched_compat_yield=1 result.
2) jre-1.7.0-icedtea: the sched_compat_yield=0 result is less than 89% of the sched_compat_yield=1 result.

So the JVM really has a big impact on the size of the regression.

I checked the openjdk sources and found that Thread.yield is implemented as a native
sched_yield. If a java application calls Thread.yield, it just calls sched_yield. Garbage
collection and other JVM threads also call Thread.yield. That's why the two JVMs show
different regression percentages.

Although the volanoMark sources aren't available, I suspect it calls Thread.yield. volanoMark
is a kind of chatroom benchmark: when a client sends out a message, the server sends the
message to all clients. I suspect the client calls Thread.yield after sending out a couple
of messages.

Both JVMs regress if sched_compat_yield=0.

I ran some other tests, such as iozone/specjbb/tbench/dbench/sysbench, and didn't see
any regression.

-yanmin



* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  9:17               ` Nick Piggin
@ 2007-12-03  9:35                 ` Zhang, Yanmin
  2007-12-03  9:57                 ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Zhang, Yanmin @ 2007-12-03  9:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Arjan van de Ven, Andrew Morton, LKML

On Mon, 2007-12-03 at 20:17 +1100, Nick Piggin wrote:
> On Monday 03 December 2007 19:45, Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > > and years now? [...]
> > > >
> > > > i'm curious, with what JVM was it tested and where's the source so i
> > > > can fix their locking for them? Can the problem be reproduced with:
> > >
> > > Sure, but why shouldn't the compat behaviour be the default, and the
> > > sysctl go away?
> > >
> > > It makes older JVMs work better, it is slightly closer to the old
> > > behaviour, and it is arguably a less surprising result.
> >
> > as far as desktop apps such as firefox goes, the exact opposite is true.
> > We had two choices basically: either a "more agressive" yield than
> > before or a "less agressive" yield. Desktop apps were reported to hurt
> > from a "more agressive" yield (firefox for example gets some pretty bad
> > delays), so we defaulted to the less agressive method. (and we defaulted
> > to that in v2.6.23 already)
> 
> Yeah, I doubt the 2.6.23 scheduler will be usable for distros though...
> 
> 
> > Really, in this sense volanomark is another 
> > test like dbench - we care about it but not unconditionally and in this
> > case it's a really silly API use that is at the center of the problem.
> 
> Sure, but do you whether _real_ java server applications are OK?
I did a simple check of the openjdk sources, and the garbage collector calls
Thread.yield. It really has a big impact on both JRockit and openjdk, although
the regression percentage is different.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  8:45             ` Ingo Molnar
  2007-12-03  9:17               ` Nick Piggin
@ 2007-12-03  9:41               ` Zhang, Yanmin
  2007-12-03 10:17                 ` Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Zhang, Yanmin @ 2007-12-03  9:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Arjan van de Ven, Andrew Morton, LKML

On Mon, 2007-12-03 at 09:45 +0100, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > and years now? [...]
> > >
> > > i'm curious, with what JVM was it tested and where's the source so i 
> > > can fix their locking for them? Can the problem be reproduced with:
> > 
> > Sure, but why shouldn't the compat behaviour be the default, and the 
> > sysctl go away?
> > 
> > It makes older JVMs work better, it is slightly closer to the old 
> > behaviour, and it is arguably a less surprising result.
> 
> as far as desktop apps such as firefox go, the exact opposite is true.
> We had two choices basically: either a "more aggressive" yield than
> before or a "less aggressive" yield. Desktop apps were reported to hurt
> from a "more aggressive" yield (firefox for example gets some pretty bad
> delays),
Why not change the firefox source code? If sched_compat_yield=0,
sys_sched_yield does almost nothing but return, so firefox could simply
not call sched_yield at all. I assume 'sched_compat_yield=0' ~ no_call_to_sched_yield.

It's easier to delete sched_yield calls from applications than to tune
them.

>  so we defaulted to the less aggressive method. (and we defaulted 
> to that in v2.6.23 already) Really, in this sense volanomark is another 
> test like dbench - we care about it but not unconditionally and in this 
> case it's a really silly API use that is at the center of the problem. 
> Talking about the default alone will not bring us forward, but we can 
> certainly add helpers to identify SCHED_OTHER::yield tasks - a once per 
> bootup warning perhaps?
> 
> 	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  9:17               ` Nick Piggin
  2007-12-03  9:35                 ` Zhang, Yanmin
@ 2007-12-03  9:57                 ` Ingo Molnar
  2007-12-03 10:15                   ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03  9:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > as far as desktop apps such as firefox go, the exact opposite is
> > true. We had two choices basically: either a "more aggressive" yield
> > than before or a "less aggressive" yield. Desktop apps were reported
> > to hurt from a "more aggressive" yield (firefox for example gets some
> > pretty bad delays), so we defaulted to the less aggressive method.
> > (and we defaulted to that in v2.6.23 already)
> 
> Yeah, I doubt the 2.6.23 scheduler will be usable for distros 
> though...

... which is a pretty gross exaggeration belied by distros already 
running v2.6.23. Sure, "enterprise" distros might not run .23 (or .22 or 
.21 or .24) because those are slow to adopt and pick _one_ upstream 
kernel every 10 releases without bothering much about anything 
in between. So the enterprise distros might in fact want to see 1-2 
iterations of the scheduler before they switch to it. (But by that 
argument 80% of the other upstream kernels were not used by enterprise 
distros either, so this is nothing new.)

> I was just talking about the default because I didn't know the reason 
> for the way it was set -- now that I do, we should talk about trying 
> to improve the actual code so we don't need 2 defaults.

I've got the patch below queued up: it uses the more aggressive yield 
implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural 
differentiator: it's an "I don't care about latency, it's all about 
throughput for me" signal from the application.

But first and foremost, do you realize that there will be no easy 
solutions to this topic, that it's not just about 'flipping a default'?

	Ingo

-------------->
Subject: sched: default to more aggressive yield for SCHED_BATCH tasks
From: Ingo Molnar <mingo@elte.hu>

do a more aggressive yield for SCHED_BATCH tasks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -824,8 +824,9 @@ static void dequeue_task_fair(struct rq 
  */
 static void yield_task_fair(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(rq->curr);
-	struct sched_entity *rightmost, *se = &rq->curr->se;
+	struct task_struct *curr = rq->curr;
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct sched_entity *rightmost, *se = &curr->se;
 
 	/*
 	 * Are we the only task in the tree?
@@ -833,7 +834,7 @@ static void yield_task_fair(struct rq *r
 	if (unlikely(cfs_rq->nr_running == 1))
 		return;
 
-	if (likely(!sysctl_sched_compat_yield)) {
+	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
 		__update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  9:29           ` Zhang, Yanmin
@ 2007-12-03 10:05             ` Ingo Molnar
  2007-12-04  6:40               ` Zhang, Yanmin
  0 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 10:05 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Nick Piggin, Arjan van de Ven, Andrew Morton, LKML


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> Although we have no source code for volanoMark, I suspect it calls 
> Thread.yield. volanoMark is a kind of chatroom benchmark. When a 
> client sends out a message, the server sends the message to all 
> clients. I suspect the client calls Thread.yield after sending out a 
> couple of messages.

yeah, so far only volanomark seems to be affected by this, and if it 
indeed calls Thread.yield artificially it's a pretty stupid benchmark 
and it's not the fault of the JDK. If we had the source to volanomark we 
could fix this easily.

> Both JVMs show a regression if sched_compat_yield=0.
> 
> I ran some testing, such like iozone/specjbb/tbench/dbench/sysbench, 
> and didn't see regression.

which JVM was utilized by the specjbb (Java Business Benchmark) tests?

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  9:57                 ` Ingo Molnar
@ 2007-12-03 10:15                   ` Nick Piggin
  2007-12-03 10:33                     ` Ingo Molnar
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-12-03 10:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML

On Monday 03 December 2007 20:57, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > as far as desktop apps such as firefox go, the exact opposite is
> > > true. We had two choices basically: either a "more aggressive" yield
> > > than before or a "less aggressive" yield. Desktop apps were reported
> > > to hurt from a "more aggressive" yield (firefox for example gets some
> > > pretty bad delays), so we defaulted to the less aggressive method.
> > > (and we defaulted to that in v2.6.23 already)
> >
> > Yeah, I doubt the 2.6.23 scheduler will be usable for distros
> > though...
>
> ... which is a pretty gross exaggeration belied by distros already
> running v2.6.23. Sure, "enterprise" distros might not run .23 (or .22 or

Yeah, that's what I mean of course. And it's because of the performance
and immediate upstream divergence issues with 2.6.23. Specifically I'm
talking about the scheduler: they may run a base 2.6.23, but it would
likely have most or all subsequent scheduler patches.


> > I was just talking about the default because I didn't know the reason
> > for the way it was set -- now that I do, we should talk about trying
> > to improve the actual code so we don't need 2 defaults.
>
> I've got the patch below queued up: it uses the more aggressive yield
> implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
> differentiator: it's an "I don't care about latency, it's all about
> throughput for me" signal from the application.

First and foremost, do you realize that I'm talking about existing
userspace working well on future kernels right? (ie. backwards
compatibility).


> But first and foremost, do you realize that there will be no easy
> solutions to this topic, that it's not just about 'flipping a default'?

Of course ;) I already answered that in the email that you're replying
to:

> > I was just talking about the default because I didn't know the reason
> > for the way it was set -- now that I do, we should talk about trying
> > to improve the actual code so we don't need 2 defaults.

Anyway, I'd hope it can actually be improved and even the sysctl
removed completely.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03  9:41               ` Zhang, Yanmin
@ 2007-12-03 10:17                 ` Ingo Molnar
  0 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 10:17 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Nick Piggin, Arjan van de Ven, Andrew Morton, LKML


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> > as far as desktop apps such as firefox go, the exact opposite is
> > true. We had two choices basically: either a "more aggressive" yield
> > than before or a "less aggressive" yield. Desktop apps were reported
> > to hurt from a "more aggressive" yield (firefox for example gets some
> > pretty bad delays),
>
> Why not change the firefox source code? [...]

because we care a heck of a lot more about a widely used open-source 
package's default "user experience" than we care about closed-source 
volanomark scores...

do you realize the absurdity of that suggestion: in essence we'd punish 
firefox _because it is open-source and can be changed_. So basically 
firefox would get a more preferential treatment if it was closed-source 
and could not be changed? That's totally backwards.

> If sched_compat_yield=0, sys_sched_yield does almost nothing 
> but return, so firefox could simply not call sched_yield at all. I assume 
> 'sched_compat_yield=0' ~ no_call_to_sched_yield.
> 
> It's easier to delete sched_yield calls from applications than to 
> tune them.

We are not at all worried about punishing silly benchmark behavior - and 
volanomark's call to Thread.yield (if that's indeed what is happening - 
could you try to trace it to make sure?) is outright silly. There are 
other chatroom benchmarks such as hackbench.c and hackbench_pth.c that i 
test frequently, and they are not affected by any yield details. (and 
even then it's still taken with a grain of salt - remember dbench)

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 10:15                   ` Nick Piggin
@ 2007-12-03 10:33                     ` Ingo Molnar
  2007-12-03 11:02                       ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 10:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > > I was just talking about the default because I didn't know the 
> > > reason for the way it was set -- now that I do, we should talk 
> > > about trying to improve the actual code so we don't need 2 
> > > defaults.
> >
> > I've got the patch below queued up: it uses the more aggressive yield 
> > implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural 
> > differentiator: it's an "I don't care about latency, it's all about 
> > throughput for me" signal from the application.
> 
> First and foremost, do you realize that I'm talking about existing 
> userspace working well on future kernels right? (ie. backwards 
> compatibility).

given how poorly sched_yield() is/was defined the only "compatible" 
solution would be to go back to the old yield code. (And note that you 
are rehashing arguments that were covered on lkml months ago already.)

> > But first and foremost, do you realize that there will be no easy 
> > solutions to this topic, that it's not just about 'flipping a 
> > default'?
> 
> Of course ;) I already answered that in the email that you're replying 
> to:
> 
> > > I was just talking about the default because I didn't know the 
> > > reason for the way it was set -- now that I do, we should talk 
> > > about trying to improve the actual code so we don't need 2 
> > > defaults.

well, in case you were wondering why i was a bit pointy about this, this 
topic of yield has been covered on lkml quite extensively a couple of 
months ago. I assumed you knew about that already, but perhaps not?

> Anyway, I'd hope it can actually be improved and even the sysctl 
> removed completely.

i think the sanest long-term solution is to strongly discourage the use 
of SCHED_OTHER::yield, because there's just no sane definition for yield 
that apps could rely upon. (well Linus suggested a pretty sane 
definition but that would necessitate burdening the scheduler 
fastpath - we don't want to do that.) New ideas are welcome of course.

[ also, actual technical feedback on the SCHED_BATCH patch i sent (which
  was the only "forward looking" moment in this thread so far ;-) would
  be nice too. ]

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 10:33                     ` Ingo Molnar
@ 2007-12-03 11:02                       ` Nick Piggin
  2007-12-03 11:37                         ` Ingo Molnar
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-12-03 11:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML

On Monday 03 December 2007 21:33, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > I was just talking about the default because I didn't know the
> > > > reason for the way it was set -- now that I do, we should talk
> > > > about trying to improve the actual code so we don't need 2
> > > > defaults.
> > >
> > > I've got the patch below queued up: it uses the more aggressive yield
> > > implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
> > > differentiator: it's an "I don't care about latency, it's all about
> > > throughput for me" signal from the application.
> >
> > First and foremost, do you realize that I'm talking about existing
> > userspace working well on future kernels right? (ie. backwards
> > compatibility).
>
> given how poorly sched_yield() is/was defined the only "compatible"
> solution would be to go back to the old yield code.

While the scheduler is technically allowed to do anything with the
SCHED_OTHER class, putting the thread at the back of the runnable tasks,
or at least having it give up _some_ priority (like the old scheduler
did), is less surprising than having it do _nothing_.

I mean, if firefox really works best if sched_yield does nothing, it
surely shouldn't be calling it at all (nothing to do with it being open
source or not).

Whereas JVMs (e.g. those whose garbage collectors call yield) presumably
get quite a lot of tuning, and that was probably done with the less
surprising (and more common) sched_yield behaviour.


> (And note that you 
> are rehashing arguments that were covered on lkml months ago already.)

I'm just wondering whether the outcome was the right one.


> > > But first and foremost, do you realize that there will be no easy
> > > solutions to this topic, that it's not just about 'flipping a
> > > default'?
> >
> > Of course ;) I already answered that in the email that you're replying
> >
> > to:
> > > > I was just talking about the default because I didn't know the
> > > > reason for the way it was set -- now that I do, we should talk
> > > > about trying to improve the actual code so we don't need 2
> > > > defaults.
>
> well, in case you were wondering why i was a bit pointy about this, this
> topic of yield has been covered on lkml quite extensively a couple of
> months ago. I assumed you knew about that already, but perhaps not?

I did, but I haven't always followed the scheduler discussions closely
recently. I was surprised to find it hasn't changed much.

I appreciate you can never do exactly the right thing for everyone and
you can't (and don't want, by definition) to make behaviour exactly the
same.

Clearly the current default is far less aggressive (almost noop), and the
compat behaviour is probably more aggressive in most cases than the old
scheduler. I would have thought looking for a middle ground might be a
good idea.

Or just ignore firefox and get them to fix it, if the occasional stalls
are during really high scheduler stressing workloads (do you have a pointer
to that thread, btw?).


> > Anyway, I'd hope it can actually be improved and even the sysctl
> > removed completely.
>
> i think the sanest long-term solution is to strongly discourage the use
> of SCHED_OTHER::yield, because there's just no sane definition for yield
> that apps could rely upon. (well Linus suggested a pretty sane
> definition but that would necessitate burdening the scheduler
> fastpath - we don't want to do that.) New ideas are welcome of course.

sched_yield is defined to put the calling task at the end of the queue for
the given priority level as you know (ie. at the end of all other priority
0 tasks, for SCHED_OTHER).

So, while SCHED_OTHER technically allows _any_ task to be picked, I think
it would be least surprising to have the calling task go to the end of the
queue, rather than not doing very much at all...


> [ also, actual technical feedback on the SCHED_BATCH patch i sent (which
>   was the only "forward looking" moment in this thread so far ;-) would
>   be nice too. ]

I dislike a wholesale change in behaviour like that. Especially when it
is changing behaviour of yield among SCHED_BATCH tasks versus yield among
SCHED_OTHER tasks.



* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 11:02                       ` Nick Piggin
@ 2007-12-03 11:37                         ` Ingo Molnar
  2007-12-03 17:04                           ` David Schwartz
  2007-12-04  1:02                           ` Nick Piggin
  0 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 11:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > given how poorly sched_yield() is/was defined the only "compatible" 
> > solution would be to go back to the old yield code.
> 
> While it is technically allowed to do anything with SCHED_OTHER class, 
> putting the thread to the back of the runnable tasks, or at least 
> having it give up _some_ priority (like the old scheduler) is less 
> surprising than having it do _nothing_.

wrong: it's not "nothing" that the new code does - run two yielding 
loops and they'll happily switch to each other, at a rate of a few 
million context switches per second.

( Note that the old scheduler's yield code did not actually change a 
  task's priority - so if an interactive task (such as firefox) yielded, 
  it got different behavior than CPU hogs. )

> Whereas JVMs (e.g. those whose garbage collectors call yield) presumably 
> get quite a lot of tuning, and that was probably done with the less 
> surprising (and more common) sched_yield behaviour.

i disagree. To some of them, having a _more_ aggressive yield than 2.6.22 
might increase latencies and jitter - which can be seen as a regression 
as well. All tests i've seen so far show dramatically lower jitter in 
v2.6.23 and upwards kernels.

anyway, right now what we have is a closed-source benchmark (which is a 
quite silly one as well) against a popular open-source desktop app and 
in that case the choice is obvious. Actual Java app server benchmarks 
did not show any regression so maybe Java's use of yield for locking is 
not that significant after all and it's only Volanomark that is doing 
extra (unnecessary) yields. (and java benchmarks are part of the 
upstream kernel test grid anyway so we'd have noticed any serious 
regression)

if you insist on flipping the default then that just shows a blatant 
disregard for desktop performance - i personally care quite a bit about 
desktop performance. (and deterministic scheduling in particular)

> > i think the sanest long-term solution is to strongly discourage the 
> > use of SCHED_OTHER::yield, because there's just no sane definition 
> > for yield that apps could rely upon. (well Linus suggested a pretty 
> > sane definition but that would necessitate burdening the 
> > scheduler fastpath - we don't want to do that.) New ideas are welcome 
> > of course.
> 
> sched_yield is defined to put the calling task at the end of the queue 
> for the given priority level as you know (ie. at the end of all other 
> priority 0 tasks, for SCHED_OTHER).

almost: substitute "priority" with "nice level". One problem is, that's 
not what the old scheduler did.

> > [ also, actual technical feedback on the SCHED_BATCH patch i sent
> >   (which was the only "forward looking" moment in this thread so far
> >    ;-) would be nice too. ]
> 
> I dislike a wholesale change in behaviour like that. Especially when 
> it is changing behaviour of yield among SCHED_BATCH tasks versus yield 
> among SCHED_OTHER tasks.

There's no wholesale change in behavior: SCHED_BATCH tasks have clearly 
signalled that they value throughput over latency, hence the patch makes 
quite a bit of sense to me. YMMV.

	Ingo


* RE: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 11:37                         ` Ingo Molnar
@ 2007-12-03 17:04                           ` David Schwartz
  2007-12-03 17:37                             ` Chris Friesen
  2007-12-04  1:02                           ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: David Schwartz @ 2007-12-03 17:04 UTC (permalink / raw)
  To: Nick Piggin, Ingo Molnar
  Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML


	I've asked versions of this question at least three times and never gotten
anything approaching a straight answer:

	1) What is the current default 'sched_yield' behavior?

	2) What is the current alternate 'sched_yield' behavior?

	3) Are either of them sensible? Simply acting as if the current thread's
timeslice was up should be sufficient.

	The implication I keep getting is that neither the default behavior nor the
alternate behavior is sensible. What is so hard about simply scheduling the
next thread?

	We don't need perfection, but it sounds like we have two alternatives of
which neither is sensible.

	DS




* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 17:04                           ` David Schwartz
@ 2007-12-03 17:37                             ` Chris Friesen
  2007-12-03 19:12                               ` David Schwartz
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Friesen @ 2007-12-03 17:37 UTC (permalink / raw)
  To: davids
  Cc: Nick Piggin, Ingo Molnar, Zhang, Yanmin, Arjan van de Ven,
	Andrew Morton, LKML

David Schwartz wrote:
> 	I've asked versions of this question at least three times and never gotten
> anything approaching a straight answer:
> 
> 	1) What is the current default 'sched_yield' behavior?
> 
> 	2) What is the current alternate 'sched_yield' behavior?

I'm pretty sure I've seen responses from Ingo describing this multiple 
times in various threads.  Google should have them.

If I remember right, the default is to simply recalculate the task's 
position in the tree and reinsert it, and the alternate is to yield to 
everything currently runnable.

> 	3) Are either of them sensible? Simply acting as if the current thread's
> timeslice was up should be sufficient.

The new scheduler doesn't really have a concept of "timeslice".  This is 
one of the core problems with determining what to do on sched_yield().

> 	The implication I keep getting is that neither the default behavior nor the
> alternate behavior are sensible. What is so hard about simply scheduling the
> next thread?

The problem is where do we insert the task that is yielding?  CFS is 
based around a tree structure ordered by time.

The old scheduler was priority-based, so you could essentially yield to 
everyone of the same niceness level.

With the new scheduler, this would be possible, but would involve extra 
work tracking the position of the rightmost task at each priority level. 
This additional overhead is what Ingo is trying to avoid.

> 	We don't need perfection, but it sounds like we have two alternatives of
> which neither is sensible.

sched_yield() isn't a great API.  It just says to delay the task, 
without specifying how long or what the task is waiting *for*.  Other 
constructs are much more useful because they give the scheduler more 
information with which to make a decision.

Chris


* RE: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 17:37                             ` Chris Friesen
@ 2007-12-03 19:12                               ` David Schwartz
  2007-12-03 19:56                                 ` Chris Friesen
  0 siblings, 1 reply; 38+ messages in thread
From: David Schwartz @ 2007-12-03 19:12 UTC (permalink / raw)
  To: Christopher Friesen
  Cc: Nick Piggin, Ingo Molnar, Zhang, Yanmin, Arjan van de Ven,
	Andrew Morton, LKML


Chris Friesen wrote:

> David Schwartz wrote:

> > 	I've asked versions of this question at least three times
> > and never gotten
> > anything approaching a straight answer:
> >
> > 	1) What is the current default 'sched_yield' behavior?
> >
> > 	2) What is the current alternate 'sched_yield' behavior?

> I'm pretty sure I've seen responses from Ingo describing this multiple
> times in various threads.  Google should have them.

> If I remember right, the default is to simply recalculate the task's
> position in the tree and reinsert it, and the alternate is to yield to
> everything currently runnable.

The meaning of the default behavior then depends upon where in the tree it
reinserts it.

> > 	3) Are either of them sensible? Simply acting as if the
> > current thread's
> > timeslice was up should be sufficient.

> The new scheduler doesn't really have a concept of "timeslice".  This is
> one of the core problems with determining what to do on sched_yield().

Then it should probably just not support 'sched_yield' and return ENOSYS.
Applications should work around an ENOSYS reply (since some versions of
Solaris return this, among other reasons). Perhaps for compatibility, it
could also yield 'lightly' just in case applications ignore the return
value.

It could also handle it the way it handles the smallest sleep time that it
supports. This is sub-optimal if no other tasks are ready to run at the same
static priority level and that might be an expensive check.

If CFS really can't support sched_yield's semantics, then it should just
not, and that's that. Return ENOSYS and admit that the behavior sched_yield
is documented to have simply can't be supported by the scheduler.

> > The implication I keep getting is that neither the default
> > behavior nor the
> > alternate behavior are sensible. What is so hard about simply
> > scheduling the
> > next thread?

> The problem is where do we insert the task that is yielding?  CFS is
> based around a tree structure ordered by time.

We put it exactly where we would have when its timeslice ran out. If we can
reward it a little bit, that's great. But if not, we can live with that.
Just imagine that the timer interrupt fired to indicate the end of the
thread's run time when the thread called 'sched_yield'.

> The old scheduler was priority-based, so you could essentially yield to
> everyone of the same niceness level.
>
> With the new scheduler, this would be possible, but would involve extra
> work tracking the position of the rightmost task at each priority level.
> This additional overhead is what Ingo is trying to avoid.

Then what does he do when the task runs out of run time? It's hard to
imagine we can't do that when the task calls sched_yield.

> > 	We don't need perfection, but it sounds like we have two
> > alternatives of
> > which neither is sensible.

> sched_yield() isn't a great API.

I agree.

> It just says to delay the task,
> without specifying how long or what the task is waiting *for*.

That is not true. The task is waiting for something that will be done by
another thread that is ready-to-run and at the same priority level. The task
does not need to wait until the thing is guaranteed done but wishes to wait
until it is more likely to be done. This is an often-misused but sometimes
sensible thing to do.

I think the API gets blamed for two things that are not its fault:

1) It's often misunderstood and misused.

2) It was often chosen as a "best available" solution because no truly good
solutions were available.

> Other
> constructs are much more useful because they give the scheduler more
> information with which to make a decision.

Sure, if there is more information. But if all you really want to do is wait
until other threads at the same static priority level have had a chance to
run, then sched_yield is the right API.

DS




* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 19:12                               ` David Schwartz
@ 2007-12-03 19:56                                 ` Chris Friesen
  2007-12-03 21:39                                   ` Mark Lord
  0 siblings, 1 reply; 38+ messages in thread
From: Chris Friesen @ 2007-12-03 19:56 UTC (permalink / raw)
  To: davids
  Cc: Nick Piggin, Ingo Molnar, Zhang, Yanmin, Arjan van de Ven,
	Andrew Morton, LKML

David Schwartz wrote:
> Chris Friesen wrote:


> If CFS really can't support sched_yield's semantics, then it should just
> not, and that's that. Return ENOSYS and admit that the behavior sched_yield
> is documented to have simply can't be supported by the scheduler.

That's just it though... sched_yield() with SCHED_OTHER doesn't have 
well-defined semantics, so we can do just about anything we want.

The issue is mostly how to work around existing apps that (invalidly) 
expect certain behaviour from sched_yield().

>>The problem is where do we insert the task that is yielding?  CFS is
>>based around a tree structure ordered by time.

> We put it exactly where we would have when its timeslice ran out. If we can
> reward it a little bit, that's great. But if not, we can live with that.
> Just imagine that the timer interrupt fired to indicate the end of the
> thread's run time when the thread called 'sched_yield'.

CFS doesn't really do "timeslice".  But in essence what you are 
describing is the default behaviour currently...it simply removes the 
task from the tree and reinserts it based on how much cpu time it used up.

> Then what does he do when the task runs out of run time? It's hard to
> imagine we can't do that when the task calls sched_yield.

It gets reinserted into the tree at a position based on how much cpu 
time it used.  This is exactly the current sched_yield() behaviour.

>>It just says to delay the task,
>>without specifying how long or what the task is waiting *for*.

> That is not true. The task is waiting for something that will be done by
> another thread that is ready-to-run and at the same priority level. The task
> does not need to wait until the thing is guaranteed done but wishes to wait
> until it is more likely to be done. This is an often-misused but sometimes
> sensible thing to do.

The scheduler still doesn't know specifically what the task is waiting for.

> Sure, if there is more information. But if all you really want to do is wait
> until other threads at the same static priority level have had a chance to
> run, then sched_yield is the right API.

Technically, all of SCHED_OTHER has static priority level zero.  Thus 
the "right" thing to do is to allow all SCHED_OTHER tasks to run, 
including the ones with the highest possible nice level.

This is the alternate implementation in the current code, but it has 
latency implications that may be unexpected by applications written for 
the previous 2.6 behaviour.

Chris


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 19:56                                 ` Chris Friesen
@ 2007-12-03 21:39                                   ` Mark Lord
  2007-12-03 21:48                                     ` Ingo Molnar
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Lord @ 2007-12-03 21:39 UTC (permalink / raw)
  To: Chris Friesen
  Cc: davids, Nick Piggin, Ingo Molnar, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML

Chris Friesen wrote:
> David Schwartz wrote:
>> Chris Friesen wrote:
..
>>> The problem is where do we insert the task that is yielding?  CFS is
>>> based around a tree structure ordered by time.
> 
>> We put it exactly where we would have when its timeslice ran out. If 
>> we can
>> reward it a little bit, that's great. But if not, we can live with that.
>> Just imagine that the timer interrupt fired to indicate the end of the
>> thread's run time when the thread called 'sched_yield'.
> 
> CFS doesn't really do "timeslice".  But in essence what you are 
> describing is the default behaviour currently...it simply removes the 
> task from the tree and reinserts it based on how much cpu time it used up.
> 
>> Then what does he do when the task runs out of run time? It's hard to
>> imagine we can't do that when the task calls sched_yield.
> 
> It gets reinserted into the tree at a position based on how much cpu 
> time it used.  This is exactly the current sched_yield() behaviour.
..

That's not the same thing at all.
I think that David is suggesting that the reinsertion logic
should pretend that the task used up all of the CPU time it
was offered in the slot leading up to the sched_yield() call.

If it did that, then the task would be far more likely not to
end up as the next task chosen to run.

Without doing that, the task is highly likely to be chosen
to run again immediately, as it will appear to have done
nothing since it was previously chosen -- and so the same 
criteria will result in it being chosen again, and again,
and again, until it finally wastes enough cycles to not
be reinserted into the "currently active" slot of the tree.

Cheers


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 21:39                                   ` Mark Lord
@ 2007-12-03 21:48                                     ` Ingo Molnar
  2007-12-03 21:57                                       ` Mark Lord
  0 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 21:48 UTC (permalink / raw)
  To: Mark Lord
  Cc: Chris Friesen, davids, Nick Piggin, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML


* Mark Lord <lkml@rtr.ca> wrote:

> That's not the same thing at all. I think that David is suggesting 
> that the reinsertion logic should pretend that the task used up all of 
> the CPU time it was offered in the slot leading up to the 
> sched_yield() call.

we have tried this too, and this has problems too (artificial inflation 
of the vruntime metric and domino effects on other portions of the 
scheduler). So this is a worse solution than what we have now. (and this 
has all been pointed out in past discussions in which David 
participated. I'll certainly reply to any genuinely new idea.)

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 21:48                                     ` Ingo Molnar
@ 2007-12-03 21:57                                       ` Mark Lord
  2007-12-03 22:05                                         ` Ingo Molnar
  0 siblings, 1 reply; 38+ messages in thread
From: Mark Lord @ 2007-12-03 21:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Friesen, davids, Nick Piggin, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML

Ingo Molnar wrote:
> * Mark Lord <lkml@rtr.ca> wrote:
> 
>> That's not the same thing at all. I think that David is suggesting 
>> that the reinsertion logic should pretend that the task used up all of 
>> the CPU time it was offered in the slot leading up to the 
>> sched_yield() call.
> 
> we have tried this too, and this has problems too (artificial inflation 
> of the vruntime metric and domino effects on other portions of the 
> scheduler). So this is a worse solution than what we have now. (and this 
> has all been pointed out in past discussions in which David 
> participated. I'll certainly reply to any genuinely new idea.)
..

Ack.  And what of the suggestion to try to ensure that a yielding task
simply not end up as the very next one chosen to run?  Maybe by swapping
it with another (adjacent?) task in the tree if it comes out on top again?

(I really don't know the proper terminology to use here,
 but hopefully Ingo can translate that).

That's probably already been covered too, but are the prior conclusions still valid?

Thanks Ingo -- I *really* like this scheduler!


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 21:57                                       ` Mark Lord
@ 2007-12-03 22:05                                         ` Ingo Molnar
  2007-12-03 22:18                                           ` Mark Lord
  2007-12-04  0:30                                           ` David Schwartz
  0 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 22:05 UTC (permalink / raw)
  To: Mark Lord
  Cc: Chris Friesen, davids, Nick Piggin, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML


* Mark Lord <lkml@rtr.ca> wrote:

> Ack.  And what of the suggestion to try to ensure that a yielding task 
> simply not end up as the very next one chosen to run?  Maybe by 
> swapping it with another (adjacent?) task in the tree if it comes out 
> on top again?

we did that too for quite some time in CFS - it was found to be "not 
aggressive enough" by some folks and "too aggressive" by others. Then when 
people started bickering over this we added these two simple corner 
cases - switchable via a flag. (minimum aggression and maximum aggression)

> (I really don't know the proper terminology to use here, but hopefully 
> Ingo can translate that).

the terminology you used is perfectly fine.

> Thanks Ingo -- I *really* like this scheduler!

heh, thanks :) For which workload does it make the biggest difference 
for you? (and compared to what other scheduler you used before? 2.6.22?)

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 22:05                                         ` Ingo Molnar
@ 2007-12-03 22:18                                           ` Mark Lord
  2007-12-03 22:33                                             ` Ingo Molnar
  2007-12-04  0:30                                           ` David Schwartz
  1 sibling, 1 reply; 38+ messages in thread
From: Mark Lord @ 2007-12-03 22:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Friesen, davids, Nick Piggin, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML

Ingo Molnar wrote:
> * Mark Lord <lkml@rtr.ca> wrote:
> 
>> Ack.  And what of the suggestion to try to ensure that a yielding task 
>> simply not end up as the very next one chosen to run?  Maybe by 
>> swapping it with another (adjacent?) task in the tree if it comes out 
>> on top again?
> 
> we did that too for quite some time in CFS - it was found to be "not 
> aggressive enough" by some folks and "too aggressive" by others. Then when 
> people started bickering over this we added these two simple corner 
> cases - switchable via a flag. (minimum aggression and maximum aggression)
> 
>> (I really don't know the proper terminology to use here, but hopefully 
>> Ingo can translate that).
> 
> the terminology you used is perfectly fine.
> 
>> Thanks Ingo -- I *really* like this scheduler!
> 
> heh, thanks :) For which workload does it make the biggest difference 
> for you? (and compared to what other scheduler you used before? 2.6.22?)
..

Heh.. I'm just a very unsophisticated desktop user, and I like it when
Thunderbird and Firefox are unaffected by the "make -j3" kernel builds
that are often running in another window.  BIG difference there.

And on the cool side, the Swarm game (swarm.swf) is a great example of
something that used to get jerky really fast whenever anything else was
running, and now it really doesn't seem to be affected by anything.
(I don't really play computer games, but this one has a very retro feel..).

Cheers


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 22:18                                           ` Mark Lord
@ 2007-12-03 22:33                                             ` Ingo Molnar
  2007-12-04  0:18                                               ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-12-03 22:33 UTC (permalink / raw)
  To: Mark Lord
  Cc: Chris Friesen, davids, Nick Piggin, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML


* Mark Lord <lkml@rtr.ca> wrote:

>> heh, thanks :) For which workload does it make the biggest difference 
>> for you? (and compared to what other scheduler you used before? 
>> 2.6.22?)
> ..
>
> Heh.. I'm just a very unsophisticated desktop user, and I like it when 
> Thunderbird and Firefox are unaffected by the "make -j3" kernel builds 
> that are often running in another window.  BIG difference there.
>
> And on the cool side, the Swarm game (swarm.swf) is a great example of 
> something that used to get jerky really fast whenever anything else 
> was running, and now it really doesn't seem to be affected by 
> anything. (I don't really play computer games, but this one has a 
> very retro feel..).

nice! Do you feel any difference between 2.6.23 and 2.6.24-rc for these 
workloads? (if you've tried .24 already)

	Ingo


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 22:33                                             ` Ingo Molnar
@ 2007-12-04  0:18                                               ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-12-04  0:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark Lord, Chris Friesen, davids, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML

On Tuesday 04 December 2007 09:33, Ingo Molnar wrote:
> * Mark Lord <lkml@rtr.ca> wrote:
> >> heh, thanks :) For which workload does it make the biggest difference
> >> for you? (and compared to what other scheduler you used before?
> >> 2.6.22?)
> >
> > ..
> >
> > Heh.. I'm just a very unsophisticated desktop user, and I like it when
> > Thunderbird and Firefox are unaffected by the "make -j3" kernel builds
> > that are often running in another window.  BIG difference there.
> >
> > And on the cool side, the Swarm game (swarm.swf) is a great example of
> > something that used to get jerky really fast whenever anything else
> > was running, and now it really doesn't seem to be affected by
> > anything. (I don't really play computer games, but this one is has a
> > very retro feel..).
>
> nice! Do you feel any difference between 2.6.23 and 2.6.24-rc for these
> workloads? (if you've tried .24 already)

And also, I wonder what the average timeslice and number of context
switches is between 2.6.22 and 2.6.23-4. Would be interesting to see.


* RE: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 22:05                                         ` Ingo Molnar
  2007-12-03 22:18                                           ` Mark Lord
@ 2007-12-04  0:30                                           ` David Schwartz
  2007-12-04  2:09                                             ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: David Schwartz @ 2007-12-04  0:30 UTC (permalink / raw)
  To: Mark Lord, Ingo Molnar
  Cc: Chris Friesen, Nick Piggin, Zhang, Yanmin, Arjan van de Ven,
	Andrew Morton, LKML


> * Mark Lord <lkml@rtr.ca> wrote:

> > Ack.  And what of the suggestion to try to ensure that a yielding task
> > simply not end up as the very next one chosen to run?  Maybe by
> > swapping it with another (adjacent?) task in the tree if it comes out
> > on top again?

> we did that too for quite some time in CFS - it was found to be "not
> aggressive enough" by some folks and "too aggressive" by others. Then when
> people started bickering over this we added these two simple corner
> cases - switchable via a flag. (minimum aggression and maximum aggression)

They are both correct. It is not aggressive enough if there are tasks other
than those two that are at the same static priority level and ready to run.
It is too aggressive if the task it is swapped with is at a lower static
priority level.

Perhaps it might be possible to scan for the task at the same static
priority level that is ready-to-run but last in line among other
ready-to-run tasks and put it after that task? I think that's about as close
as we can get to the POSIX-specified behavior.

> > Thanks Ingo -- I *really* like this scheduler!

Just in case this isn't clear, I like CFS too and sincerely appreciate the
work Ingo, Con, and others have done on it.

DS




* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 11:37                         ` Ingo Molnar
  2007-12-03 17:04                           ` David Schwartz
@ 2007-12-04  1:02                           ` Nick Piggin
  1 sibling, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-12-04  1:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Zhang, Yanmin, Arjan van de Ven, Andrew Morton, LKML

On Monday 03 December 2007 22:37, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > given how poorly sched_yield() is/was defined the only "compatible"
> > > solution would be to go back to the old yield code.
> >
> > While it is technically allowed to do anything with SCHED_OTHER class,
> > putting the thread to the back of the runnable tasks, or at least
> > having it give up _some_ priority (like the old scheduler) is less
> > surprising than having it do _nothing_.
>
> wrong: it's not "nothing" that the new code does - run two yield-ing
> loops and they'll happily switch to each other, at a rate of a few
> million context switches per second.

OK, it's not nothing, it interacts with the quantisation of the
update granularity and wakeup granularity... It's definitely not
what would be expected if you didn't look at the implementation
though.


> > Wheras JVMs (eg. that have garbage collectors call yield), presumably
> > get quite a lot of tuning, and that was probably done with the less
> > surprising (and more common) sched_yield behaviour.
>
> i disagree. To some of them, having a _more_ agressive yield than 2.6.22
> might increase latencies and jitter - which can be seen as a regression
> as well. All tests i've seen so far show dramatically lower jitter in
> v2.6.23 and upwards kernels.

Right, so we should have one that is about the _same_ aggressiveness.
Doesn't that make sense?


> anyway, right now what we have is a closed-source benchmark (which is a
> quite silly one as well) against a popular open-source desktop app and
> in that case the choice is obvious. Actual Java app server benchmarks
> did not show any regression so maybe Java's use of yield for locking is
> not that significant after all and it's only Volanomark that is doing
> extra (unnecessary) yields. (and java benchmarks are part of the
> upstream kernel test grid anyway so we'd have noticed any serious
> regression)

Sure I'm not basing this purely on volanomark at all. If you've tested
a reasonable range of actual java app server benchmarks with a range of
jvms then fine.


> if you insist on flipping the default then that just shows a blatant
> disregard to desktop performance

That statement is true. But you know I'm not insisting on flipping
the default, so I don't see how it is relevant.

BTW. can you answer what workload did firefox see the sched_yield
pauses with, and/or where that thread is archived? I still think
firefox should not call sched_yield at all.


> > > i think the sanest long-term solution is to strongly discourage the
> > > use of SCHED_OTHER::yield, because there's just no sane definition
> > > for yield that apps could rely upon. (well Linus suggested a pretty
> > > sane definition but that would necessitate the burdening of the
> > > scheduler fastpath - we dont want to do that.) New ideas are welcome
> > > of course.
> >
> > sched_yield is defined to put the calling task at the end of the queue
> > for the given priority level as you know (ie. at the end of all other
> > priority 0 tasks, for SCHED_OTHER).
>
> almost: substitute "priority" with "nice level". One problem is, that's
> not what the old scheduler did.

I'm not sure if that's right. Posix realtime scheduling says that all
SCHED_OTHER tasks are priority 0. But I'm not much of a standards reader.
And even if it were just applied to a given nice level, that would be
more intuitive than the current default.


> > > [ also, actual technical feedback on the SCHED_BATCH patch i sent
> > >   (which was the only "forward looking" moment in this thread so far
> > >    ;-) would be nice too. ]
> >
> > I dislike a wholesale change in behaviour like that. Especially when
> > it is changing behaviour of yield among SCHED_BATCH tasks versus yield
> > among SCHED_OTHER tasks.
>
> There's no wholesale change in behavior, SCHED_BATCH tasks have clear
> expectations of being throughput versus latency, hence the patch makes
> quite a bit of sense to me. YMMV.

sched_yield semantics are totally different depending on whether the
process is SCHED_BATCH or not. That's what I was calling a change in
behaviour, so arguing otherwise is just arguing semantics.

I just would think it isn't such a good thing if you suddenly got a
500% speedup by making your jvm SCHED_BATCH, only to find that it
stops working when your batch cron jobs or something start running...
But if there are no real jvm workloads that would see such a speedup,
then I guess the point is moot ;)



* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-04  0:30                                           ` David Schwartz
@ 2007-12-04  2:09                                             ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-12-04  2:09 UTC (permalink / raw)
  To: davids
  Cc: Mark Lord, Ingo Molnar, Chris Friesen, Zhang, Yanmin,
	Arjan van de Ven, Andrew Morton, LKML

On Tuesday 04 December 2007 11:30, David Schwartz wrote:

> Perhaps it might be possible to scan for the task at the same static
> priority level that is ready-to-run but last in line among other
> ready-to-run tasks and put it after that task?

Nice level versus posix static priority level debate aside, this
is the exact behaviour which the compat mode does now basically,
when you have all tasks running at nice 0 (which I assume is
essentially the case in both the jvm and firefox tests) (some
things, eg. kernel threads or X server could run at a higher prio,
but these are not the ones calling yield anyway...)


> I think that's about as 
> close as we can get to the POSIX-specified behavior.

I don't think it is a question of POSIX being a bit fuzzy, or some
problem we have implementing it. It is explicitly specified to
allow any behaviour.

So the current default is not wrong, any more than the compat mode
is right.


* Re: sched_yield: delete sysctl_sched_compat_yield
  2007-12-03 10:05             ` Ingo Molnar
@ 2007-12-04  6:40               ` Zhang, Yanmin
  0 siblings, 0 replies; 38+ messages in thread
From: Zhang, Yanmin @ 2007-12-04  6:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Arjan van de Ven, Andrew Morton, LKML

On Mon, 2007-12-03 at 11:05 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > Although no source codes of volanoMark, I suspect it calls 
> > Thread.sched. volanoMark is a kind of chatroom benchmark. When a 
> > client sends out a message, server will send the message to all 
> > clients. I suspect the client calls Thread.yield after sending out a 
> > couple of messages.
> 
> yeah, so far only volanomark seems to be affected by this, and if it 
> indeed calls Thread.yield artificially it's a pretty stupid benchmark 
> and it's not the fault of the JDK. If we had the source to volanomark we 
> could fix this easily.
> 
> > 2 JVM all have regression if sched_compat_yield=0.
> > 
> > I ran some testing, such like iozone/specjbb/tbench/dbench/sysbench, 
> > and didn't see regression.
> 
> which JVM was utilized by the specjbb (Java Business Benchmark) tests?
BEA Jrockit. It supports huge pages, which improve performance by about 8%~10%.

-yanmin

