linux-kernel.vger.kernel.org archive mirror
* [PATCH] 2.6.0 batch scheduling, HT aware
@ 2003-12-23  0:38 Con Kolivas
  2003-12-23  1:11 ` Nick Piggin
  2003-12-26 22:56 ` Pavel Machek
  0 siblings, 2 replies; 31+ messages in thread
From: Con Kolivas @ 2003-12-23  0:38 UTC (permalink / raw)
  To: linux kernel mailing list; +Cc: Nick Piggin

I've done a resync and update of my batch scheduling that is also hyper-thread 
aware.

What is batch scheduling? Specifying a task as batch allows it to only use cpu 
time if there is idle time available, rather than having a proportion of the 
cpu time based on niceness.

Why do I need hyper-thread aware batch scheduling?

If you have a hyperthreaded (P4HT) processor and run it as two logical cpus,
a very low priority task can consume 50% of your physical cpu's capacity no
matter how high the priority of the other tasks you are running. For example,
if you run the distributed computing client setiathome, you will effectively
be running at half your cpu's speed even if you run setiathome at nice 20.
Batch scheduling on normal cpus allows batch tasks to use only idle time; on
HT cpus it allows them to run only when both logical cpus are otherwise idle.
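
A minimal sketch of the rule described above, for illustration only. The
struct and function names are hypothetical and are not taken from the patch;
the sketch also assumes that "idle" means "no non-batch task is runnable".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-logical-cpu state; not the patch's data structure. */
struct logical_cpu {
    bool has_normal_task;   /* a runnable non-batch task is present here */
};

/*
 * Return true if a batch task may run on logical cpu `self`.
 * `sibling` is the other logical cpu of the same physical package,
 * or NULL on a cpu without hyperthreading.
 */
bool batch_may_run(const struct logical_cpu *self,
                   const struct logical_cpu *sibling)
{
    if (self->has_normal_task)
        return false;       /* batch tasks only get otherwise idle time */
    if (sibling != NULL && sibling->has_normal_task)
        return false;       /* HT: the sibling must have nothing normal to run */
    return true;
}
```

On a non-HT cpu the second check is skipped, so the rule reduces to plain
idle-time scheduling.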

This is not being pushed for mainline kernel inclusion, but the issue of how 
to prevent low priority tasks slowing down HT cpus needs to be considered for 
the mainline HT scheduler if it ever gets included. This patch provides a 
temporising measure for those with HT processors, and a demonstrative way to 
handle them in mainline.

Patch available here:
http://ck.kolivas.org/patches/2.6/2.6.0/

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  0:38 [PATCH] 2.6.0 batch scheduling, HT aware Con Kolivas
@ 2003-12-23  1:11 ` Nick Piggin
  2003-12-23  1:24   ` Con Kolivas
  2003-12-26 22:56 ` Pavel Machek
  1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2003-12-23  1:11 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Con Kolivas, linux kernel mailing list



Con Kolivas wrote:

>I've done a resync and update of my batch scheduling that is also hyper-thread 
>aware.
>
>What is batch scheduling? Specifying a task as batch allows it to only use cpu 
>time if there is idle time available, rather than having a proportion of the 
>cpu time based on niceness.
>
>Why do I need hyper-thread aware batch scheduling?
>
>If you have a hyperthread (P4HT) processor and run it as two logical cpus you 
>can have a very low priority task running that can consume 50% of your 
>physical cpu's capacity no matter how high priority tasks you are running. 
>For example if you use the distributed computing client setiathome you will 
>be effectively running at half your cpu's speed even if you run setiathome 
>at nice 20. Batch scheduling for normal cpus allows only idle time to be used 
>for batch tasks, and for HT cpus only allows idle time when both logical cpus 
>are idle.
>
>This is not being pushed for mainline kernel inclusion, but the issue of how 
>to prevent low priority tasks slowing down HT cpus needs to be considered for 
>the mainline HT scheduler if it ever gets included. This patch provides a 
>temporising measure for those with HT processors, and a demonstrative way to 
>handle them in mainline.
>

I wonder how Intel suggests we handle this problem. Batch scheduling
aside, I wonder how to do any sort of priorities at all. I think POWER5
can do priorities in hardware; that is the only sane way I can think of
doing it.

I think this patch is much too ugly to get into such an elegant scheduler.
No fault of yours, Con, because it's an ugly problem.

How about this: if a task is "delta" priority points below a task running
on another sibling, move it to that sibling (so priorities via timeslice
start working). I call it active unbalancing! I might be able to make it
fit if there is interest. Other suggestions?




* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  1:11 ` Nick Piggin
@ 2003-12-23  1:24   ` Con Kolivas
  2003-12-23  1:36     ` Nick Piggin
  2004-01-02 20:10     ` Bill Davidsen
  0 siblings, 2 replies; 31+ messages in thread
From: Con Kolivas @ 2003-12-23  1:24 UTC (permalink / raw)
  To: Nick Piggin, Nakajima, Jun; +Cc: linux kernel mailing list

On Tue, 23 Dec 2003 12:11, Nick Piggin wrote:
> I think this patch is much too ugly to get into such an elegant scheduler.
> No fault to you Con because its an ugly problem.

You're too kind. No, it's ugly because of my code, but it works for now.

> How about this: if a task is "delta" priority points below a task running
> on another sibling, move it to that sibling (so priorities via timeslice
> start working). I call it active unbalancing! I might be able to make it
> fit if there is interest. Other suggestions?

I discussed this with Ingo and that's the sort of thing we thought of. Perhaps 
a relative crossover of 10 dynamic priorities and an absolute crossover of 5 
static priorities before things got queued together. This is really only 
required for the UP HT case.
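
A rough sketch of the crossover test discussed above. Only the two numbers
(10 dynamic, 5 static) come from this message; the struct, the function name,
and the choice that exceeding either threshold is enough are assumptions made
for illustration. Priorities follow the kernel convention that a numerically
larger value means a lower priority.

```c
#include <stdbool.h>

#define DYNAMIC_CROSSOVER 10    /* "relative crossover of 10 dynamic priorities" */
#define STATIC_CROSSOVER   5    /* "absolute crossover of 5 static priorities" */

/* Hypothetical view of a task's priorities. */
struct task_prio {
    int static_prio;            /* derived from nice */
    int dynamic_prio;           /* static_prio adjusted by the scheduler */
};

/*
 * Return true if `low` is far enough below `high` that the two tasks
 * should be queued on the same sibling, so that timeslice-based
 * priorities take effect between them.
 */
bool queue_together(const struct task_prio *low, const struct task_prio *high)
{
    return low->dynamic_prio - high->dynamic_prio >= DYNAMIC_CROSSOVER ||
           low->static_prio - high->static_prio >= STATIC_CROSSOVER;
}
```

Tasks of similar priority fail both tests and stay on separate siblings, which
keeps the throughput benefit of running two logical cpus.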

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  1:24   ` Con Kolivas
@ 2003-12-23  1:36     ` Nick Piggin
  2003-12-23  2:42       ` Con Kolivas
  2004-01-02 20:10     ` Bill Davidsen
  1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2003-12-23  1:36 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Nakajima, Jun, linux kernel mailing list



Con Kolivas wrote:

>On Tue, 23 Dec 2003 12:11, Nick Piggin wrote:
>
>>I think this patch is much too ugly to get into such an elegant scheduler.
>>No fault to you Con because its an ugly problem.
>>
>
>You're too kind. No it's ugly because of my code but it works for now.
>

Well, it's all the special cases for batch scheduling that I don't like;
the idea of not running batch tasks on a package running non-batch processes
is sound. I thought the batch scheduling code was Ingo's, but I could
be mistaken. Anyway...

>
>>How about this: if a task is "delta" priority points below a task running
>>on another sibling, move it to that sibling (so priorities via timeslice
>>start working). I call it active unbalancing! I might be able to make it
>>fit if there is interest. Other suggestions?
>>
>
>I discussed this with Ingo and that's the sort of thing we thought of. Perhaps 
>a relative crossover of 10 dynamic priorities and an absolute crossover of 5 
>static priorities before things got queued together. This is really only 
>required for the UP HT case.
>

Well I guess it would still be nice for "SMP HT" as well. Hopefully the code
can be generic enough that it would just carry over nicely. It does have
complications though because the load balancer would have to be taught about
it, and those architectures that do hardware priorities probably don't even
want it.




* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  1:36     ` Nick Piggin
@ 2003-12-23  2:42       ` Con Kolivas
  2003-12-23  2:57         ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2003-12-23  2:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nakajima, Jun, linux kernel mailing list

On Tue, 23 Dec 2003 12:36, Nick Piggin wrote:
> Con Kolivas wrote:
> >On Tue, 23 Dec 2003 12:11, Nick Piggin wrote:
> >>I think this patch is much too ugly to get into such an elegant
> >> scheduler. No fault to you Con because its an ugly problem.
> >
> >You're too kind. No it's ugly because of my code but it works for now.
>
> Well its all the special cases for batch scheduling that I don't like,
> the idea to not run batch tasks on a package running non batch processes
> is sound. I thought the batch scheduling code is Ingo's, but I could
> be mistaken. Anyway...

I realise the special cases suck. Code for one setting in a spot where it 
affects everyone is bad. Regarding the batch scheduling: no, that's my own 
special flavour, coded ugly from the ground up. Ingo's is much smarter than 
this, but once again I needed something that works now without too much effort.

>
> >>How about this: if a task is "delta" priority points below a task running
> >>on another sibling, move it to that sibling (so priorities via timeslice
> >>start working). I call it active unbalancing! I might be able to make it
> >>fit if there is interest. Other suggestions?
> >
> >I discussed this with Ingo and that's the sort of thing we thought of.
> > Perhaps a relative crossover of 10 dynamic priorities and an absolute
> > crossover of 5 static priorities before things got queued together. This
> > is really only required for the UP HT case.
>
> Well I guess it would still be nice for "SMP HT" as well. Hopefully the
> code can be generic enough that it would just carry over nicely. 

I disagree. I can't think of a real world scenario where 2+ physical cpus 
would benefit from this.

> It does 
> have complications though because the load balancer would have to be taught
> about it, and those architectures that do hardware priorities probably
> don't even want it.

Probably the simple relative/absolute crossover will have to suffice. However, 
it still doesn't help the fact that running something cpu bound at nice 0 
concurrently with something interactive at nice 0 is actually slower if you 
use a UP HT processor in SMP mode instead of UP mode.

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  2:42       ` Con Kolivas
@ 2003-12-23  2:57         ` Nick Piggin
  2003-12-23  3:15           ` Con Kolivas
  2003-12-23 15:51           ` bill davidsen
  0 siblings, 2 replies; 31+ messages in thread
From: Nick Piggin @ 2003-12-23  2:57 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Nakajima, Jun, linux kernel mailing list



Con Kolivas wrote:

>On Tue, 23 Dec 2003 12:36, Nick Piggin wrote:
>
>>Con Kolivas wrote:
>>
>>>I discussed this with Ingo and that's the sort of thing we thought of.
>>>Perhaps a relative crossover of 10 dynamic priorities and an absolute
>>>crossover of 5 static priorities before things got queued together. This
>>>is really only required for the UP HT case.
>>>
>>Well I guess it would still be nice for "SMP HT" as well. Hopefully the
>>code can be generic enough that it would just carry over nicely. 
>>
>
>I disagree. I can't think of a real world scenario where 2+ physical cpus 
>would benefit from this.
>

Well, it's the same problem. A nice -20 process can still lose 40-55% of its
performance to a nice 19 process. Even a figure of 10% is probably too high;
we'd really want it <= 5%, as happens with a single logical processor.
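
The figure of roughly 5% on a single logical processor can be checked with a
little arithmetic. The sketch below assumes the 2.6.0 timeslice range of 10ms
to 200ms scaled linearly over static priority; these constants are quoted from
memory and should be treated as assumptions. It also ignores interactivity
bonuses and preemption, so it is only a rough model of two cpu-bound tasks
taking turns.

```c
/* Assumed 2.6.0 scheduler constants, recalled rather than verified. */
#define MIN_TIMESLICE_MS  10
#define MAX_TIMESLICE_MS 200
#define MAX_PRIO         140
#define MAX_USER_PRIO     40

/* Map a nice value (-20..19) to a static priority (100..139). */
int nice_to_static_prio(int nice)
{
    return 120 + nice;
}

/* Timeslice in milliseconds, scaled linearly from static priority. */
int timeslice_ms(int nice)
{
    int static_prio = nice_to_static_prio(nice);

    return MIN_TIMESLICE_MS +
           (MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}

/* Fraction of one cpu that task A gets when alternating with task B. */
double cpu_share(int nice_a, int nice_b)
{
    int a = timeslice_ms(nice_a);
    int b = timeslice_ms(nice_b);

    return (double)a / (double)(a + b);
}
```

Under these assumptions a nice 19 task gets 10ms for every 200ms given to a
nice -20 task, i.e. 10/210 or a little under 5% of the cpu.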

>
>>It does 
>>have complications though because the load balancer would have to be taught
>>about it, and those architectures that do hardware priorities probably
>>don't even want it.
>>
>
>Probably the simple relative/absolute will have to suffice. However it still 
>doesn't help the fact that running something cpu bound concurrently at nice 0 
>with something interactive nice 0 is actually slower if you use a UP HT 
>processor in SMP mode instead of UP.
>

It will be based on dynamic priorities, possibly with some feedback from
nice as well, but it probably still won't be perfect and it will probably
be very complex *cough* hardware priorities *cough* ;)

I might try to fit it into a more general priority balancing system because
we currently have similar sorts of failings on regular SMP as well.




* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  2:57         ` Nick Piggin
@ 2003-12-23  3:15           ` Con Kolivas
  2003-12-23  3:16             ` Con Kolivas
  2003-12-23 15:51           ` bill davidsen
  1 sibling, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2003-12-23  3:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nakajima, Jun, linux kernel mailing list

On Tue, 23 Dec 2003 13:57, Nick Piggin wrote:
> Con Kolivas wrote:
> >On Tue, 23 Dec 2003 12:36, Nick Piggin wrote:
> >>Con Kolivas wrote:
> >>>I discussed this with Ingo and that's the sort of thing we thought of.
> >>>Perhaps a relative crossover of 10 dynamic priorities and an absolute
> >>>crossover of 5 static priorities before things got queued together. This
> >>>is really only required for the UP HT case.
> >>
> >>Well I guess it would still be nice for "SMP HT" as well. Hopefully the
> >>code can be generic enough that it would just carry over nicely.
> >
> >I disagree. I can't think of a real world scenario where 2+ physical cpus
> >would benefit from this.
>
> Well its the same problem. A nice -20 process can still lose 40-55% of its
> performance to a nice 19 process, a figure of 10% is probably too high and
> we'd really want it <= 5% like what happens with a single logical
> processor.

I changed my mind just after I sent that mail. 4 physical cores running three 
nice 20 and one nice -20 task gives the nice -20 task only 25% of the total 
cpu and 25% to each of the nice 20 tasks.

> >>It does
> >>have complications though because the load balancer would have to be
> >> taught about it, and those architectures that do hardware priorities
> >> probably don't even want it.
> >
> >Probably the simple relative/absolute will have to suffice. However it
> > still doesn't help the fact that running something cpu bound concurrently
> > at nice 0 with something interactive nice 0 is actually slower if you use
> > a UP HT processor in SMP mode instead of UP.
>
> It will be based on dynamic priorities, possibly with some feedback from
> nice as well, but it probably still won't be perfect and it will probably
> be very complex *cough* hardware priorities *cough* ;)
>
> I might try to fit it into a more general priority balancing system because
> we currently have similar sorts of failings on regular SMP as well.

I'll keep my eyes peeled. Meanwhile I'll use my ugly patch ;-)

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  3:15           ` Con Kolivas
@ 2003-12-23  3:16             ` Con Kolivas
  2003-12-26 23:03               ` Pavel Machek
  0 siblings, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2003-12-23  3:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nakajima, Jun, linux kernel mailing list

On Tue, 23 Dec 2003 14:15, Con Kolivas wrote:
> On Tue, 23 Dec 2003 13:57, Nick Piggin wrote:
> > Con Kolivas wrote:
> > >On Tue, 23 Dec 2003 12:36, Nick Piggin wrote:
> > >>Con Kolivas wrote:
> > >>>I discussed this with Ingo and that's the sort of thing we thought of.
> > >>>Perhaps a relative crossover of 10 dynamic priorities and an absolute
> > >>>crossover of 5 static priorities before things got queued together.
> > >>> This is really only required for the UP HT case.
> > >>
> > >>Well I guess it would still be nice for "SMP HT" as well. Hopefully the
> > >>code can be generic enough that it would just carry over nicely.
> > >
> > >I disagree. I can't think of a real world scenario where 2+ physical
> > > cpus would benefit from this.
> >
> > Well its the same problem. A nice -20 process can still lose 40-55% of
> > its performance to a nice 19 process, a figure of 10% is probably too
> > high and we'd really want it <= 5% like what happens with a single
> > logical processor.
>
> I changed my mind just after I sent that mail. 4 physical cores running
> three nice 20 and one nice -20 task gives the nice -20 task only 25% of the
> total cpu and 25% to each of the nice 20 tasks.

Err that should read 4 logical cores.

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  2:57         ` Nick Piggin
  2003-12-23  3:15           ` Con Kolivas
@ 2003-12-23 15:51           ` bill davidsen
  2003-12-23 22:09             ` Con Kolivas
  1 sibling, 1 reply; 31+ messages in thread
From: bill davidsen @ 2003-12-23 15:51 UTC (permalink / raw)
  To: linux-kernel

In article <3FE7AF24.40600@cyberone.com.au>,
Nick Piggin  <piggin@cyberone.com.au> wrote:
| 
| 
| Con Kolivas wrote:
| 
| >On Tue, 23 Dec 2003 12:36, Nick Piggin wrote:
| >
| >>Con Kolivas wrote:
| >>
| >>>I discussed this with Ingo and that's the sort of thing we thought of.
| >>>Perhaps a relative crossover of 10 dynamic priorities and an absolute
| >>>crossover of 5 static priorities before things got queued together. This
| >>>is really only required for the UP HT case.

There are two goals here. Not having a batch process on one sibling makes
sense, and I'm going to try Con's patch after I try Nick's latest.
Actually, if they play nicely I would use both, batch would be very
useful for nightly report generation on servers.

But WRT the whole HT scheduling, it would seem that ideally you want to
schedule the two (or N) processes which have the lowest aggregate cache
thrash, if you had a way to determine that. I suspect that a process
which had a small iterative inner loop with a code+data footprint of
2-3k would coexist well with almost anything else. Minimizing the FPU
contention also would improve performance, no doubt. I don't know that
there are the tools at the moment to get this information, but it seems
as though until it's available any scheduling will be working in the
dark to some extent.

Feel free to tell me I misread this problem.

| >>>
| >>Well I guess it would still be nice for "SMP HT" as well. Hopefully the
| >>code can be generic enough that it would just carry over nicely. 
| >>
| >
| >I disagree. I can't think of a real world scenario where 2+ physical cpus 
| >would benefit from this.
| >
| 
| Well its the same problem. A nice -20 process can still lose 40-55% of its
| performance to a nice 19 process, a figure of 10% is probably too high and
| we'd really want it <= 5% like what happens with a single logical processor.
| 
| >
| >>It does 
| >>have complications though because the load balancer would have to be taught
| >>about it, and those architectures that do hardware priorities probably
| >>don't even want it.
| >>
| >
| >Probably the simple relative/absolute will have to suffice. However it still 
| >doesn't help the fact that running something cpu bound concurrently at nice 0 
| >with something interactive nice 0 is actually slower if you use a UP HT 
| >processor in SMP mode instead of UP.
| >
| 
| It will be based on dynamic priorities, possibly with some feedback from
| nice as well, but it probably still won't be perfect and it will probably
| be very complex *cough* hardware priorities *cough* ;)
| 
| I might try to fit it into a more general priority balancing system because
| we currently have similar sorts of failings on regular SMP as well.

In my experience, on servers it's more important to avoid really bad
behaviour all of the time than to have perfect behaviour most of the
time. All of the recent scheduler work from Nick, Con and Ingo has
avoided "jackpot cases" quite well, for which I thank you and encourage
you to continue. If server response goes from 20ms to 100ms Saturday
night, we discuss it at a status meeting Monday morning and make
suggestions to management. If response goes to 2sec we discuss it with
management at 2am and they make suggestions :-(

So far 2.6.0 has been quite good at "bend but do not break" under load.
Great job!
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23 15:51           ` bill davidsen
@ 2003-12-23 22:09             ` Con Kolivas
  2003-12-30  0:35               ` bill davidsen
  0 siblings, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2003-12-23 22:09 UTC (permalink / raw)
  To: bill davidsen, linux-kernel

On Wed, 24 Dec 2003 02:51, bill davidsen wrote:
> There are two goals here. Not having a batch process on one sibling makes
> sense, and I'm going to try Con's patch after I try Nick's latest.
> Actually, if they play nicely I would use both, batch would be very
> useful for nightly report generation on servers.

No hope of them playing nicely, but at some later stage I might resync on top 
of Nick's work if I like the direction it takes (which looks likely!)

> But WRT the whole HT scheduling, it would seem that ideally you want to
> schedule the two (or N) processes which have the lowest aggregate cache
> thrash, if you had a way to determine that. I suspect that a process
> which had a small iterative inner loop with a code+data footprint of
> 2-3k would coexist well with almost anything else. Minimizing the FPU
> contention also would improve performance, no doubt. I don't know that
> there are the tools at the moment to get this information, but it seems
> as though until it's available any scheduling will be working in the
> dark to some extent.

Impossible with current tools. Only userspace would have a chance of 
predicting this and the simple rule we work off is that userspace can't be 
trusted so this does not appear doable in the foreseeable future.

> Feel free to tell me I misread this problem.

> In my experience, on servers it's more important to avoid really bad
> behaviour all of the time than to have perfect behaviour most of the
> time. All of the recent scheduler work from Nick, Con and Ingo has
> avoided "jackpot cases" quite well, for which I thank you and encourage
> you to continue. If server response goes from 20ms to 100ms Saturday
> night, we discuss it at a status meeting Monday morning and make
> suggestions to management. If response goes to 2sec we discuss it with
> management at 2am and they make suggestions :-(
>
> So far 2.6.0 has been quite good at "bend but do not break" under load.
> Great job!

Excellent! I'm sure we'll hear from you when you turn the knob up to 11/10.

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  0:38 [PATCH] 2.6.0 batch scheduling, HT aware Con Kolivas
  2003-12-23  1:11 ` Nick Piggin
@ 2003-12-26 22:56 ` Pavel Machek
  2003-12-26 23:42   ` Con Kolivas
                     ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Pavel Machek @ 2003-12-26 22:56 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, Nick Piggin

Hi!

> I've done a resync and update of my batch scheduling that is also hyper-thread 
> aware.
> 
> What is batch scheduling? Specifying a task as batch allows it to only use cpu 
> time if there is idle time available, rather than having a proportion of the 
> cpu time based on niceness.
> 
> Why do I need hyper-thread aware batch scheduling?
> 
> If you have a hyperthread (P4HT) processor and run it as two logical cpus you 
> can have a very low priority task running that can consume 50% of your 
> physical cpu's capacity no matter how high priority tasks you are running. 
> For example if you use the distributed computing client setiathome you will 
> be effectively running at half your cpu's speed even if you run setiathome 
> at nice 20. Batch scheduling for normal cpus allows only idle time to be used 
> for batch tasks, and for HT cpus only allows idle time when both logical cpus 
> are idle.

BTW this is going to be an issue even on normal (non-HT)
systems. Imagine a memory-bound scientific task on CPU0 and a nice -20
memory-bound seti&home on CPU1. Even without hyperthreading, your
scientific task is going to run at 50% of its speed and seti&home is going
to get the other half. Oops.

Something similar can happen with disk, but we are moving out of
cpu-scheduler arena with that.

[I do not have SMP nearby to demonstrate it, anybody wanting to
benchmark a bit?] 
									Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  3:16             ` Con Kolivas
@ 2003-12-26 23:03               ` Pavel Machek
  0 siblings, 0 replies; 31+ messages in thread
From: Pavel Machek @ 2003-12-26 23:03 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Nick Piggin, Nakajima, Jun, linux kernel mailing list

Hi!

> > > >>>I discussed this with Ingo and that's the sort of thing we thought of.
> > > >>>Perhaps a relative crossover of 10 dynamic priorities and an absolute
> > > >>>crossover of 5 static priorities before things got queued together.
> > > >>> This is really only required for the UP HT case.
> > > >>
> > > >>Well I guess it would still be nice for "SMP HT" as well. Hopefully the
> > > >>code can be generic enough that it would just carry over nicely.
> > > >
> > > >I disagree. I can't think of a real world scenario where 2+ physical
> > > > cpus would benefit from this.
> > >
> > > Well its the same problem. A nice -20 process can still lose 40-55% of
> > > its performance to a nice 19 process, a figure of 10% is probably too
> > > high and we'd really want it <= 5% like what happens with a single
> > > logical processor.
> >
> > I changed my mind just after I sent that mail. 4 physical cores running
> > three nice 20 and one nice -20 task gives the nice -20 task only 25% of the
> > total cpu and 25% to each of the nice 20 tasks.
> 
> Err that should read 4 logical cores.

Actually, for 4 physical cores it is going to be true, too. And if you
are memory-bound, stopping those 3 tasks can speed up your important
task, too. It's really the same.
									Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-26 22:56 ` Pavel Machek
@ 2003-12-26 23:42   ` Con Kolivas
  2003-12-26 23:49     ` Con Kolivas
  2003-12-27 11:09     ` Pavel Machek
  2003-12-27  8:52   ` Mika Penttilä
  2004-01-02 20:05   ` Bill Davidsen
  2 siblings, 2 replies; 31+ messages in thread
From: Con Kolivas @ 2003-12-26 23:42 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux kernel mailing list, Nick Piggin

On Sat, 27 Dec 2003 09:56, Pavel Machek wrote:
> Hi!
>
> > I've done a resync and update of my batch scheduling that is also
> > hyper-thread aware.
> >
> > What is batch scheduling? Specifying a task as batch allows it to only
> > use cpu time if there is idle time available, rather than having a
> > proportion of the cpu time based on niceness.
> >
> > Why do I need hyper-thread aware batch scheduling?
> >
> > If you have a hyperthread (P4HT) processor and run it as two logical cpus
> > you can have a very low priority task running that can consume 50% of
> > your physical cpu's capacity no matter how high priority tasks you are
> > running. For example if you use the distributed computing client
> > setiathome you will be effectively running at half your cpu's speed
> > even if you run setiathome at nice 20. Batch scheduling for normal cpus
> > allows only idle time to be used for batch tasks, and for HT cpus only
> > allows idle time when both logical cpus are idle.
>
> BTW this is going to be an issue even on normal (non-HT)
> systems. Imagine memory-bound scientific task on CPU0 and nice -20
> memory-bound seti&home at CPU1. Even without hyperthreading, your
> scientific task is going to run at 50% of speed and seti&home is going
> to get second half. Oops.
>
> Something similar can happen with disk, but we are moving out of
> cpu-scheduler arena with that.
>
> [I do not have SMP nearby to demonstrate it, anybody wanting to
> benchmark a bit?]

This is definitely the case but there is one huge difference. If you have 
2x1Ghz non HT processors then the fastest a single threaded task can run is 
at 1Ghz. If you have 1x2Ghz HT processor the fastest a single threaded task 
can run is 2Ghz. 

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-26 23:42   ` Con Kolivas
@ 2003-12-26 23:49     ` Con Kolivas
  2003-12-27 11:09     ` Pavel Machek
  1 sibling, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2003-12-26 23:49 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux kernel mailing list, Nick Piggin

On Sat, 27 Dec 2003 10:42, Con Kolivas wrote:
> On Sat, 27 Dec 2003 09:56, Pavel Machek wrote:
> > Hi!
> >
> > > I've done a resync and update of my batch scheduling that is also
> > > hyper-thread aware.
> > >
> > > What is batch scheduling? Specifying a task as batch allows it to only
> > > use cpu time if there is idle time available, rather than having a
> > > proportion of the cpu time based on niceness.
> > >
> > > Why do I need hyper-thread aware batch scheduling?
> > >
> > > If you have a hyperthread (P4HT) processor and run it as two logical
> > > cpus you can have a very low priority task running that can consume 50%
> > > of your physical cpu's capacity no matter how high priority tasks you
> > > are running. For example if you use the distributed computing client
> > > setiathome you will be effectively running at half your cpu's speed
> > > even if you run setiathome at nice 20. Batch scheduling for normal cpus
> > > allows only idle time to be used for batch tasks, and for HT cpus only
> > > allows idle time when both logical cpus are idle.
> >
> > BTW this is going to be an issue even on normal (non-HT)
> > systems. Imagine memory-bound scientific task on CPU0 and nice -20
> > memory-bound seti&home at CPU1. Even without hyperthreading, your
> > scientific task is going to run at 50% of speed and seti&home is going
> > to get second half. Oops.
> >
> > Something similar can happen with disk, but we are moving out of
> > cpu-scheduler arena with that.
> >
> > [I do not have SMP nearby to demonstrate it, anybody wanting to
> > benchmark a bit?]
>
> This is definitely the case but there is one huge difference. If you have
> 2x1Ghz non HT processors then the fastest a single threaded task can run is
> at 1Ghz. If you have 1x2Ghz HT processor the fastest a single threaded task
> can run is 2Ghz.

Or even if you have 1024x1Ghz cpus. One thread can still only run at 1Ghz. 
Sure there are other things that can run on other cpus which will allow the 
single thread to run unabated at 1Ghz, but no faster.
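
The comparison can be put as a toy model. This is only a sketch: the function
name is made up, the assumption that a busy HT sibling halves the speed is the
50% figure from the start of this thread, and memory contention between cpus
is ignored entirely.

```c
#include <stdbool.h>

/*
 * Rough upper bound, in MHz, on the speed of one single-threaded task.
 * On non-HT SMP pass ht_sibling_busy = false: other physical cpus do not
 * change what one thread can get from its own cpu in this model.
 */
int single_thread_mhz(int clock_mhz, bool ht_sibling_busy)
{
    /* Assumption: a busy HT sibling takes about half of the core. */
    return ht_sibling_busy ? clock_mhz / 2 : clock_mhz;
}
```

With these numbers a thread on the 1x2GHz HT cpu drops from 2000 to 1000 once
the sibling is busy, while a thread on any of the 1GHz cpus stays at 1000
whatever the other cpus are doing.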

Con



* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-26 22:56 ` Pavel Machek
  2003-12-26 23:42   ` Con Kolivas
@ 2003-12-27  8:52   ` Mika Penttilä
  2003-12-30  0:32     ` bill davidsen
  2004-01-02 20:05   ` Bill Davidsen
  2 siblings, 1 reply; 31+ messages in thread
From: Mika Penttilä @ 2003-12-27  8:52 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Con Kolivas, linux kernel mailing list, Nick Piggin



Pavel Machek wrote:

>Hi!
>
>  
>
>>I've done a resync and update of my batch scheduling that is also hyper-thread 
>>aware.
>>
>>What is batch scheduling? Specifying a task as batch allows it to only use cpu 
>>time if there is idle time available, rather than having a proportion of the 
>>cpu time based on niceness.
>>
>>Why do I need hyper-thread aware batch scheduling?
>>
>>If you have a hyperthread (P4HT) processor and run it as two logical cpus you 
>>can have a very low priority task running that can consume 50% of your 
>>physical cpu's capacity no matter how high priority tasks you are running. 
>>For example if you use the distributed computing client setiathome you will 
>>be effectively running at half your cpu's speed even if you run setiathome 
>>at nice 20. Batch scheduling for normal cpus allows only idle time to be used 
>>for batch tasks, and for HT cpus only allows idle time when both logical cpus 
>>are idle.
>>    
>>
>
>BTW this is going to be an issue even on normal (non-HT)
>systems. Imagine memory-bound scientific task on CPU0 and nice -20
>memory-bound seti&home at CPU1. Even without hyperthreading, your
>scientific task is going to run at 50% of speed and seti&home is going
>to get second half. Oops.
>
>Something similar can happen with disk, but we are moving out of
>cpu-scheduler arena with that.
>
>[I do not have SMP nearby to demonstrate it, anybody wanting to
>benchmark a bit?] 
>									Pavel
>
heh...and the situation gets even worse when you add cpus, with 16way 
you get only 1/16 of the speed ;)

--Mika



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-26 23:42   ` Con Kolivas
  2003-12-26 23:49     ` Con Kolivas
@ 2003-12-27 11:09     ` Pavel Machek
  2003-12-27 11:15       ` Con Kolivas
  2003-12-29  7:02       ` Nick Piggin
  1 sibling, 2 replies; 31+ messages in thread
From: Pavel Machek @ 2003-12-27 11:09 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, Nick Piggin

Hi!

> > > I've done a resync and update of my batch scheduling that is also
> > > hyper-thread aware.
> > >
> > > What is batch scheduling? Specifying a task as batch allows it to only
> > > use cpu time if there is idle time available, rather than having a
> > > proportion of the cpu time based on niceness.
> > >
> > > Why do I need hyper-thread aware batch scheduling?
> > >
> > > If you have a hyperthread (P4HT) processor and run it as two logical cpus
> > > you can have a very low priority task running that can consume 50% of
> > > your physical cpu's capacity no matter how high priority tasks you are
> > > running. For example if you use the distributed computing client
> > > setiathome you will effectively be running at half your cpu's speed
> > > even if you run setiathome at nice 20. Batch scheduling for normal cpus
> > > allows only idle time to be used for batch tasks, and for HT cpus only
> > > allows idle time when both logical cpus are idle.
> >
> > BTW this is going to be an issue even on normal (non-HT)
> > systems. Imagine memory-bound scientific task on CPU0 and nice -20
> > memory-bound seti&home at CPU1. Even without hyperthreading, your
> > scientific task is going to run at 50% of speed and seti&home is going
> > to get second half. Oops.
> >
> > Something similar can happen with disk, but we are moving out of
> > cpu-scheduler arena with that.
> >
> > [I do not have SMP nearby to demonstrate it, anybody wanting to
> > benchmark a bit?]
> 
> This is definitely the case but there is one huge difference. If you have 
> 2x1Ghz non HT processors then the fastest a single threaded task can run is 
> at 1Ghz. If you have 1x2Ghz HT processor the fastest a single threaded task 
> can run is 2Ghz. 

Well, gigahertz is not the *only* important thing.

On 2x1GHz, 2GB/sec RAM bandwidth, fastest a single threaded task can
run is 1GHz, 2GB/sec. If you run two of them, it is 1GHz,
*1*GB/sec. So you still have effect similar to hyperthreading. And
yes, it can be measured.

stress runs two tasks walking over 10MB of memory, just for fun. Look:

[Lefik is dual-p3; according to you two mem stressers should run about
same speed as one of them. That's not the case:]

machek@lefik:~/misc$ ./stress tenmega
Process 1665 started at 1072522582.
machek@lefik:~/misc$ Process 1665 done at 1072522695 (113 sec).

machek@lefik:~/misc$ ./stress tenmega tenmega
Process 1669 started at 1072522722.
Process 1670 started at 1072522722.
machek@lefik:~/misc$ Process 1670 done at 1072522895 (173 sec).
Process 1669 done at 1072522903 (181 sec).

machek@lefik:~/misc$

And yes, that machine does have two cpus:

machek@lefik:~/misc$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 801.828
cache size      : 256 KB
physical id     : 0
siblings        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1599.07

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 801.828
cache size      : 256 KB
physical id     : 0
siblings        : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1602.35

machek@lefik:~/misc$

So... even on normal SMP,
"task-on-other-cpu-slows-down-task-on-this-cpu" effect exists. Okay,
it is not as visible as on HT machine (50% slowdown), but it's
definitely there.
								Pavel
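
As a rough check on the lefik timings above, here is a minimal sketch that
turns them into slowdown ratios and also evaluates the even-split bandwidth
model from the 2x1GHz, 2GB/sec example. The function names are made up for
illustration, and the even split is only an assumed model, not something
measured on this machine.

```c
#include <stdio.h>

/* Ratio of contended to solo wall-clock time; 1.0 would mean no slowdown. */
double slowdown(double solo_sec, double contended_sec)
{
    return contended_sec / solo_sec;
}

/* Assumed model: ntasks memory-bound tasks split the bus bandwidth evenly. */
double per_task_bandwidth(double total_gb_per_sec, int ntasks)
{
    return total_gb_per_sec / ntasks;
}

/* Print the figures for the lefik run: 113 sec alone, 173 and 181 sec paired. */
void print_lefik_report(void)
{
    printf("slowdown, pid 1670: %.2f\n", slowdown(113.0, 173.0));
    printf("slowdown, pid 1669: %.2f\n", slowdown(113.0, 181.0));
    printf("per-task bandwidth, 2 tasks on 2 GB/sec: %.1f GB/sec\n",
           per_task_bandwidth(2.0, 2));
}
```

The two paired runs come out at roughly 1.53 and 1.60 times the solo time.
A task limited purely by an evenly split bus would be expected to take about
twice as long, so the measured effect is real but smaller than that simple
model suggests.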

/* Copyright 1999-2003 Pavel Machek, distribute under GPLv2 */

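#define MEM (20*1024)		/* number of 1 KB blocks rewritten by "write" */
#define RAMSIZE (8*1024*1024)	/* buffer size for "memread" and "mem" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>		/* strcmp() */
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <time.h>

int
main( int argc, char *argv[] )
{
unsigned long i;

/* argv[0] is the program name, so no mode was given if argc < 2 */
if (argc < 2) 
  {
  printf( "stress loop|memread|tenmega|mem|write|eatmem ...\n" );
  return 1;
  }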
for (i=0; i<argc; i++)
  {
  if (!strcmp( argv[i], "loop" ))
    if (!fork())
      while (1);
  if (!strcmp( argv[i], "eatmem" ))
    if (!fork())
      while(1) {
	char *c = malloc(4096);
	if (c) *c='a';
      }
  if (!strcmp( argv[i], "memread" ))
    if (!fork())
      { 
      char *p = malloc( RAMSIZE ); 
      for( i=0; i<RAMSIZE; i++ ) p[i]=1;
      while( 1 ) 
	{ 
	volatile int a = 0;	/* volatile keeps the read loop from being optimised away */
	for( i=0; i<RAMSIZE; i++ ) a+=p[i];
	} 
      }
  if (!strcmp( argv[i], "tenmega" ))
    if (!fork())
      { 
      char *p = malloc( 10*1024*1024 ); 
      int i, j, start;
      printf( "Process %d started at %d.\n", getpid(), start = time(NULL));
      for( i=0; i<10*1024*1024; i++ ) p[i]=1;
      for( j=1; j<1000; j++ )
	{ 
	volatile int a = 0;
	for( i=0; i<10*1024*1024; i++ ) a+=p[i];
	} 
      printf( "Process %d done at %ld (%ld sec).\n", getpid(), (long) time(NULL), (long) (time(NULL)-start));
      exit(0);
      }
  if (!strcmp( argv[i], "mem" ))
    if (!fork())
      { 
      char *p = malloc( RAMSIZE ); 
      while( 1 ) 
	{ 
	for( i=0; i<RAMSIZE; i++ ) p[i]=1;
	sleep( 60 ); 
	} 
      }
  if (!strcmp( argv[i], "write" ))	
    if (!fork())
      {
      char namebuf[1024];
      int h;
      char buf[1024]="Signature of something rather strange ;-)";
      sprintf( namebuf, "/tmp/stresstest.delme.%d", getpid() );
      h = creat( namebuf, 0666 );
      if (h<0) { printf( "Creat failed: %m\n" ); exit(0); }
      while( 1 )
	for( i=0; i<MEM; i++ )
	  {
	  if (lseek( h, i*1024, SEEK_SET )<0) { printf( "Seek failed: %m\n" ); exit(0); }
	  if (write( h, buf, 1024 )<0) { printf( "Write failed: %m\n" ); exit(0); }
	  }
      }
  }
}


-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-27 11:09     ` Pavel Machek
@ 2003-12-27 11:15       ` Con Kolivas
  2003-12-30  0:29         ` bill davidsen
  2003-12-29  7:02       ` Nick Piggin
  1 sibling, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2003-12-27 11:15 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux kernel mailing list, Nick Piggin

On Sat, 27 Dec 2003 22:09, Pavel Machek wrote:
> So... even on normal SMP,
> "task-on-other-cpu-slows-down-task-on-this-cpu" effect exists. Okay,
> it is not as visible as on HT machine (50% slowdown), but it's
> definitely there.

Sure but I think we're getting pedantic here. The problem is really simple - a 
uniprocessor HT desktop booted in SMP mode feels half the speed while running 
setiathome (or video encoding or whatever cpu bound task) compared to booting 
it in UP mode. So, ironically, enabling the HT makes the machine feel slower 
when running multiple tasks. And there will be a heck of a lot of these in 
the future.

Con
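
For reference, the rule from the patch announcement (batch tasks get only
idle time, and on HT cpus only when both logical cpus are idle) can be
sketched roughly as below. This is an illustration under assumptions, not
code from the patch: `struct package`, `nr_normal` and `batch_may_run` are
hypothetical names, and "idle" is read here as "no runnable non-batch task".

```c
#include <stdbool.h>

#define MAX_SIBLINGS 2	/* assumption: at most two logical cpus per package */

struct package {
    int nr_siblings;			/* 1 on a non-HT cpu, 2 on a P4HT */
    int nr_normal[MAX_SIBLINGS];	/* runnable non-batch tasks per logical cpu */
};

/*
 * A batch task may use a logical cpu only when no sibling in the same
 * package has a normal (non-batch) task to run. With a single sibling
 * this reduces to "only use idle time".
 */
bool batch_may_run(const struct package *pkg)
{
    for (int i = 0; i < pkg->nr_siblings; i++)
        if (pkg->nr_normal[i] > 0)
            return false;
    return true;
}
```

With one normal task on either sibling the batch task stays off the whole
package, which is what keeps setiathome from taking half of the physical cpu.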


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-27 11:09     ` Pavel Machek
  2003-12-27 11:15       ` Con Kolivas
@ 2003-12-29  7:02       ` Nick Piggin
  2003-12-29 12:49         ` Pavel Machek
  1 sibling, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2003-12-29  7:02 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Con Kolivas, linux kernel mailing list



Pavel Machek wrote:

>Hi!
>
>
>>>BTW this is going to be an issue even on normal (non-HT)
>>>systems. Imagine memory-bound scientific task on CPU0 and nice -20
>>>memory-bound seti&home at CPU1. Even without hyperthreading, your
>>>scientific task is going to run at 50% of speed and seti&home is going
>>>to get second half. Oops.
>>>
>>>Something similar can happen with disk, but we are moving out of
>>>cpu-scheduler arena with that.
>>>
>>>[I do not have SMP nearby to demonstrate it, anybody wanting to
>>>benchmark a bit?]
>>>
>>This is definitely the case but there is one huge difference. If you have 
>>2x1Ghz non HT processors then the fastest a single threaded task can run is 
>>at 1Ghz. If you have 1x2Ghz HT processor the fastest a single threaded task 
>>can run is 2Ghz. 
>>
>
>Well, gigahertz is not the *only* important thing.
>
>On 2x1GHz, 2GB/sec RAM bandwidth, fastest a single threaded task can
>run is 1GHz, 2GB/sec. If you run two of them, it is 1GHz,
>*1*GB/sec. So you still have effect similar to hyperthreading. And
>yes, it can be measured.
>

Hi Pavel,
Sure this might be a real problem sometimes, but I don't see the
CPU scheduler ever handling it unless we want to add a few kitchen
sinks to its nice lean code as well.

If the need really arises, then probably a userspace daemon could
do it.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-29  7:02       ` Nick Piggin
@ 2003-12-29 12:49         ` Pavel Machek
  0 siblings, 0 replies; 31+ messages in thread
From: Pavel Machek @ 2003-12-29 12:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Con Kolivas, linux kernel mailing list

Hi!

> >>>BTW this is going to be an issue even on normal (non-HT)
> >>>systems. Imagine memory-bound scientific task on CPU0 and nice -20
> >>>memory-bound seti&home at CPU1. Even without hyperthreading, your
> >>>scientific task is going to run at 50% of speed and seti&home is going
> >>>to get second half. Oops.
> >>>
> >>>Something similar can happen with disk, but we are moving out of
> >>>cpu-scheduler arena with that.
> >>>
> >>>[I do not have SMP nearby to demonstrate it, anybody wanting to
> >>>benchmark a bit?]
> >>>
> >>This is definitely the case but there is one huge difference. If you have 
> >>2x1Ghz non HT processors then the fastest a single threaded task can run 
> >>is at 1Ghz. If you have 1x2Ghz HT processor the fastest a single threaded 
> >>task can run is 2Ghz. 
> >>
> >
> >Well, gigahertz is not the *only* important thing.
> >
> >On 2x1GHz, 2GB/sec RAM bandwidth, fastest a single threaded task can
> >run is 1GHz, 2GB/sec. If you run two of them, it is 1GHz,
> >*1*GB/sec. So you still have effect similar to hyperthreading. And
> >yes, it can be measured.
> >
> 
> Hi Pavel,
> Sure this might be a real problem sometimes, but I don't see the
> CPU scheduler ever handling it unless we want to add a few kitchen
> sinks to its nice lean code as well.

Why is it a problem? If you are handling HT case, anyway, it should be
fairly easy to say "imagine it is HT system, not SMP one", and poof,
problem magically goes away.
								Pavel

/*
 *  .----~~|
 *  \      |
 *   ~~~~~~
 */

[Ready-made kitchen-sink for scheduler :-)))]
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-27 11:15       ` Con Kolivas
@ 2003-12-30  0:29         ` bill davidsen
  0 siblings, 0 replies; 31+ messages in thread
From: bill davidsen @ 2003-12-30  0:29 UTC (permalink / raw)
  To: linux-kernel

In article <200312272215.01563.kernel@kolivas.org>,
Con Kolivas  <kernel@kolivas.org> wrote:
| On Sat, 27 Dec 2003 22:09, Pavel Machek wrote:
| > So... even on normal SMP,
| > "task-on-other-cpu-slows-down-task-on-this-cpu" effect exists. Okay,
| > it is not as visible as on HT machine (50% slowdown), but it's
| > definitely there.
| 
| Sure but I think we're getting pedantic here. The problem is really simple - a 
| uniprocessor HT desktop booted in SMP mode feels half the speed while running 
| setiathome (or video encoding or whatever cpu bound task) compared to booting 
| it in UP mode. So, ironically, enabling the HT makes the machine feel slower 
| when running multiple tasks. And there will be a heck of a lot of these in 
| the future.

Let me put forth a thought, without a solution. In the case you
describe, what is needed, and not provided in hardware, is a way to do
priority within the CPU, so in the case of a contested resource there is
a way to ensure the process we wish wins.

Since that seems unavailable in Intel, do other CPUs do better (or
different)?
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-27  8:52   ` Mika Penttilä
@ 2003-12-30  0:32     ` bill davidsen
  0 siblings, 0 replies; 31+ messages in thread
From: bill davidsen @ 2003-12-30  0:32 UTC (permalink / raw)
  To: linux-kernel

In article <3FED4838.6050908@kolumbus.fi>,
Mika Penttilä  <mika.penttila@kolumbus.fi> wrote:

| heh...and the situation gets even worse when you add cpus, with 16way 
| you get only 1/16 of the speed ;)

No, not when you add CPUs, when you add siblings. There's a big
difference, since sibs compete for cache on the chip, and some execution
units (FPU?).
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23 22:09             ` Con Kolivas
@ 2003-12-30  0:35               ` bill davidsen
  0 siblings, 0 replies; 31+ messages in thread
From: bill davidsen @ 2003-12-30  0:35 UTC (permalink / raw)
  To: linux-kernel

In article <200312240909.19006.kernel@kolivas.org>,
Con Kolivas  <kernel@kolivas.org> wrote:
| On Wed, 24 Dec 2003 02:51, bill davidsen wrote:
| > There are two goals here. Not having a batch process on one sibling makes
| > sense, and I'm going to try Con's patch after I try Nick's latest.
| > Actually, if they play nicely I would use both, batch would be very
| > useful for nightly report generation on servers.
| 
| No hope of them playing nicely, but at some later stage I might resync on top 
| of Nick's work if I like the direction it takes (which looks likely!)
| 
| > But WRT the whole HT scheduling, it would seem that ideally you want to
| > schedule the two (or N) processes which have the lowest aggregate cache
| > thrash, if you had a way to determine that. I suspect that a process
| > which had a small iterative inner loop with a code+data footprint of
| > 2-3k would coexist well with almost anything else. Minimizing the FPU
| > contention also would improve performance, no doubt. I don't know that
| > there are the tools at the moment to get this information, but it seems
| > as though until it's available any scheduling will be working in the
| > dark to some extent.
| 
| Impossible with current tools. Only userspace would have a chance of 
| predicting this and the simple rule we work off is that userspace can't be 
| trusted so this does not appear doable in the foreseeable future.

Glad you agree, but this makes improvement difficult.


-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-26 22:56 ` Pavel Machek
  2003-12-26 23:42   ` Con Kolivas
  2003-12-27  8:52   ` Mika Penttilä
@ 2004-01-02 20:05   ` Bill Davidsen
  2004-01-02 20:56     ` Davide Libenzi
  2 siblings, 1 reply; 31+ messages in thread
From: Bill Davidsen @ 2004-01-02 20:05 UTC (permalink / raw)
  To: linux-kernel

Pavel Machek wrote:

>>Why do I need hyper-thread aware batch scheduling?
>>
>>If you have a hyperthread (P4HT) processor and run it as two logical cpus you 
>>can have a very low priority task running that can consume 50% of your 
>>physical cpu's capacity no matter how high priority tasks you are running. 
>>For example if you use the distributed computing client setiathome you will 
>>effectively be running at half your cpu's speed even if you run setiathome 
>>at nice 20. Batch scheduling for normal cpus allows only idle time to be used 
>>for batch tasks, and for HT cpus only allows idle time when both logical cpus 
>>are idle.
> 
> 
> BTW this is going to be an issue even on normal (non-HT)
> systems. Imagine memory-bound scientific task on CPU0 and nice -20
> memory-bound seti&home at CPU1. Even without hyperthreading, your
> scientific task is going to run at 50% of speed and seti&home is going
> to get second half. Oops.

Yes and even worse, if you stop running setiathome the scientific task 
*still* only gets half the available CPU!

The difference is that with HT running a task on one sibling actually 
does (or can) slow the other. That's not true with true SMP, at least 
not directly, since the resources shared (memory and disk) are much 
farther away from the CPU.

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  1:24   ` Con Kolivas
  2003-12-23  1:36     ` Nick Piggin
@ 2004-01-02 20:10     ` Bill Davidsen
  1 sibling, 0 replies; 31+ messages in thread
From: Bill Davidsen @ 2004-01-02 20:10 UTC (permalink / raw)
  To: linux-kernel

Con Kolivas wrote:

> I discussed this with Ingo and that's the sort of thing we thought of. Perhaps 
> a relative crossover of 10 dynamic priorities and an absolute crossover of 5 
> static priorities before things got queued together. This is really only 
> required for the UP HT case.

What? Do siblings in Xeons not compete for cache and memory bandwidth, 
execution units, and the like?

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2004-01-02 20:05   ` Bill Davidsen
@ 2004-01-02 20:56     ` Davide Libenzi
  2004-01-02 21:10       ` Valdis.Kletnieks
  0 siblings, 1 reply; 31+ messages in thread
From: Davide Libenzi @ 2004-01-02 20:56 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Fri, 2 Jan 2004, Bill Davidsen wrote:

> Yes and even worse, if you stop running setiathome the scientific task 
> *still* only gets half the available CPU!

Note that this is not true. If one core is not running any task, the idle 
task (if not polling) does "hlt" and the "what they call Fetch And 
Deliver" engine will be dedicated to the other core. Also, because the 
halted core does not issue any op to the execution engine, full resources 
will be available for the running task. There are many docs available 
on the Intel developer web site that explain this.




- Davide



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2004-01-02 20:56     ` Davide Libenzi
@ 2004-01-02 21:10       ` Valdis.Kletnieks
  2004-01-02 23:34         ` Davide Libenzi
  0 siblings, 1 reply; 31+ messages in thread
From: Valdis.Kletnieks @ 2004-01-02 21:10 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Bill Davidsen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

On Fri, 02 Jan 2004 12:56:16 PST, Davide Libenzi said:
> On Fri, 2 Jan 2004, Bill Davidsen wrote:
> 
> > Yes and even worse, if you stop running setiathome the scientific task 
> > *still* only gets half the available CPU!
> 
> Note that this is not true. If one core is not running any task, the idle 
> task (if not polling) does "hlt" and the "what they call Fetch And 

What Bill said was:

>> memory-bound seti&home at CPU1. Even without hyperthreading, your
>> scientific task is going to run at 50% of speed and seti&home is going
>> to get second half. Oops.

> Yes and even worse, if you stop running setiathome the scientific task 
> *still* only gets half the available CPU!

So Bill is pointing out that on a *normal* SMP, you get 50% whether or
not the other processor is busy.

> The difference is that with HT running a task on one sibling actually 
> does (or can) slow the other. That's not true with true SMP, at least 
> not directly, since the resources shared (memory and disk) are much 
> farther away from the CPU.

And this is where Bill talks about issues like the one you mentioned about
sharing the dispatch engine.

So I think you and Bill are actually saying the same exact thing.


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2004-01-02 21:10       ` Valdis.Kletnieks
@ 2004-01-02 23:34         ` Davide Libenzi
  0 siblings, 0 replies; 31+ messages in thread
From: Davide Libenzi @ 2004-01-02 23:34 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Bill Davidsen, Linux Kernel Mailing List

On Fri, 2 Jan 2004 Valdis.Kletnieks@vt.edu wrote:

> On Fri, 02 Jan 2004 12:56:16 PST, Davide Libenzi said:
> > On Fri, 2 Jan 2004, Bill Davidsen wrote:
> > 
> > > Yes and even worse, if you stop running setiathome the scientific task 
> > > *still* only gets half the available CPU!
> > 
> > Note that this is not true. If one core is not running any task, the idle 
> > task (if not polling) does "hlt" and the "what they call Fetch And 
> 
> What Bill said was:
> 
> >> memory-bound seti&home at CPU1. Even without hyperthreading, your
> >> scientific task is going to run at 50% of speed and seti&home is going
> >> to get second half. Oops.
> 
> > Yes and even worse, if you stop running setiathome the scientific task 
> > *still* only gets half the available CPU!
> 
> So Bill is pointing out that on a *normal* SMP, you get 50% whether or
> not the other processor is busy.

Define 50% ;) In case you are talking about 50% of all the available 
resources, in case of a 2 way SMP this is pretty obvious since a 
single thread cannot run simultaneously on two CPUs. In case of an HT 
core, this is not true since the single thread will expand using the whole 
core resources. Note though that it won't be an expansion to 100%, and 
this is the reason for the existence of the HT technology. That is, even 
with monster length pipelines, the CPU is not able to keep exec units 100% 
full using a single dispatch unit.



- Davide



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  5:33 Nakajima, Jun
@ 2003-12-23 10:13 ` Nick Piggin
  0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2003-12-23 10:13 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Con Kolivas, linux kernel mailing list



Nakajima, Jun wrote:

>BTW, Nick, does your SMT scheduler have "idle package prioritization"
>which chooses an idle logical processor with the other local processor
>idle if any (rather than just an idle processor with other local
>processor running at full speed), when the scheduler requires an idle
>local processor? That would prevent situations where two logical
>processors run at full speed in the same processor package while the
>other processor package(s) sit idle. I haven't reviewed your latest
>patch closely, and that is one of the things I want to do during the
>holidays.
>

Yep,
sched_balance_wake wakes to idle siblings if your domain has SD_FLAG_WAKE
and idle_balance tries pulling tasks from any domain with SD_FLAG_NEWIDLE
set if we're just about to become idle.

>
>One question. Why did you remove SD_FLAG_IDLE flag from cpu_domain
>initialization in the w27 patch? We've been seeing some performance
>degradation with w27, compared to w26.
>

I reworked things to not require this hopefully. w26 was quite broken
with respect to the active balancing stuff. One thing I did in w27 was
accidentally release the code with cache_hot_time for the SMT domain set
to 1ms instead of 0 in w26, so SD_FLAG_NEWIDLE is sometimes not allowed
to pull a ready-to-run task off a sibling...

I haven't been able to do a great deal of performance tuning though,
there is probably quite a bit of room for improvement.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH] 2.6.0 batch scheduling, HT aware
@ 2003-12-23  5:33 Nakajima, Jun
  2003-12-23 10:13 ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Nakajima, Jun @ 2003-12-23  5:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Con Kolivas, linux kernel mailing list

BTW, Nick, does your SMT scheduler have "idle package prioritization"
which chooses an idle logical processor with the other local processor
idle if any (rather than just an idle processor with other local
processor running at full speed), when the scheduler requires an idle
local processor? That would prevent situations where two logical
processors run at full speed in the same processor package while the
other processor package(s) sit idle. I haven't reviewed your latest
patch closely, and that is one of the things I want to do during the
holidays.

One question. Why did you remove SD_FLAG_IDLE flag from cpu_domain
initialization in the w27 patch? We've been seeing some performance
degradation with w27, compared to w26.

Jun
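
The "idle package prioritization" asked about in the first paragraph can be
sketched as follows. This is only an illustration of the selection rule, not
code from Nick's sched domains patch: `pick_idle_cpu`, the `idle[]` array and
the `sibling_of[]` map are hypothetical, and two logical cpus per package
are assumed.

```c
#include <stdbool.h>

/*
 * Return an idle logical cpu, preferring one whose sibling is idle too
 * (a fully idle package). Fall back to the first idle cpu with a busy
 * sibling, or -1 if nothing is idle.
 *
 * idle[cpu]       true if that logical cpu is idle
 * sibling_of[cpu] the other logical cpu in the same package (assumed layout)
 */
int pick_idle_cpu(const bool idle[], const int sibling_of[], int ncpus)
{
    int fallback = -1;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (!idle[cpu])
            continue;
        if (idle[sibling_of[cpu]])
            return cpu;		/* whole package idle: best choice */
        if (fallback < 0)
            fallback = cpu;	/* idle, but shares a busy package */
    }
    return fallback;
}
```

On a two-package machine with cpu 0 busy and cpus 1, 2 and 3 idle, this picks
cpu 2 rather than cpu 1, so the new task does not end up sharing a package
with the task already running on cpu 0.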

> -----Original Message-----
> From: Nick Piggin [mailto:piggin@cyberone.com.au]
> Sent: Monday, December 22, 2003 6:41 PM
> To: Nakajima, Jun
> Cc: Con Kolivas; linux kernel mailing list
> Subject: Re: [PATCH] 2.6.0 batch scheduling, HT aware
> 
> 
> 
> Nakajima, Jun wrote:
> 
> >Today utilization of execution resources of a logical processor is
> >around 60% as you can find in public papers, and it's dependent on the
> >processor implementation and the workload. It could be higher in the
> >future, and their relative priority could be much higher then. So I
> >don't think it's a good idea to hard code such an implementation-specific
> >factor into the generic scheduler code.
> >
> 
> No. The mechanism would be generic, but the parameters would be
> arch specific as part of my sched domains patch (if I have anything
> to do with it!)
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] 2.6.0 batch scheduling, HT aware
  2003-12-23  1:59 Nakajima, Jun
@ 2003-12-23  2:40 ` Nick Piggin
  0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2003-12-23  2:40 UTC (permalink / raw)
  To: Nakajima, Jun; +Cc: Con Kolivas, linux kernel mailing list



Nakajima, Jun wrote:

>Today utilization of execution resources of a logical processor is
>around 60% as you can find in public papers, and it's dependent on the
>processor implementation and the workload. It could be higher in the
>future, and their relative priority could be much higher then. So I
>don't think it's a good idea to hard code such an implementation-specific
>factor into the generic scheduler code.
>

No. The mechanism would be generic, but the parameters would be
arch specific as part of my sched domains patch (if I have anything
to do with it!)

>
>Regarding H/W-based priority, I'm not sure it's very useful especially
>because so many events happen inside the processor and a set of the
>execution resources required changes very rapidly at runtime, i.e. the
>H/W knows what it should do to run faster at runtime, and imposing
>priority on those logical processor could make them run slower.
>
>I think a software priority-based solution like the below would be more
>generic and work better.
>

I wouldn't pretend to know about hardware, but it seems much nicer
than doing it in software. Anyway, if there is hardware out there without
priorities then it would be a good idea to code for it.

Nick


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH] 2.6.0 batch scheduling, HT aware
@ 2003-12-23  1:59 Nakajima, Jun
  2003-12-23  2:40 ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Nakajima, Jun @ 2003-12-23  1:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Con Kolivas, linux kernel mailing list

Today utilization of execution resources of a logical processor is
around 60% as you can find in public papers, and it's dependent on the
processor implementation and the workload. It could be higher in the
future, and their relative priority could be much higher then. So I
don't think it's a good idea to hard code such an implementation-specific
factor into the generic scheduler code.

Regarding H/W-based priority, I'm not sure it's very useful especially
because so many events happen inside the processor and a set of the
execution resources required changes very rapidly at runtime, i.e. the
H/W knows what it should do to run faster at runtime, and imposing
priority on those logical processor could make them run slower.

I think a software priority-based solution like the below would be more
generic and work better.
> How about this: if a task is "delta" priority points below a task running
> on another sibling, move it to that sibling (so priorities via timeslice
> start working). I call it active unbalancing! I might be able to make it
> fit if there is interest. Other suggestions?

Jun
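
The "active unbalancing" test quoted above might look something like the
sketch below. It is a guess at the comparison only, not code from any posted
patch: `should_move_to_sibling` and its parameters are hypothetical, and it
assumes the 2.6 convention that a numerically larger priority value means a
lower priority.

```c
#include <stdbool.h>

/*
 * Return true if the task on this logical cpu is at least `delta`
 * priority points below the task running on the sibling, in which case
 * it would be moved to the sibling's runqueue so that ordinary
 * timeslice-based priorities apply between the two tasks.
 *
 * Assumed convention: larger value = lower priority.
 */
bool should_move_to_sibling(int task_prio, int sibling_task_prio, int delta)
{
    return task_prio - sibling_task_prio >= delta;
}
```

For example, with a delta of 10 (an arbitrary value chosen for illustration),
a nice 19 task at static priority 139 next to a nice 0 task at 120 would be
moved, while two tasks five points apart would be left where they are.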


> -----Original Message-----
> From: Nick Piggin [mailto:piggin@cyberone.com.au]
> Sent: Monday, December 22, 2003 5:11 PM
> To: Nakajima, Jun
> Cc: Con Kolivas; linux kernel mailing list
> Subject: Re: [PATCH] 2.6.0 batch scheduling, HT aware
> 
> 
> 
> Con Kolivas wrote:
> 
> >I've done a resync and update of my batch scheduling that is also
hyper-
> thread
> >aware.
> >
> >What is batch scheduling? Specifying a task as batch allows it to
only
> use cpu
> >time if there is idle time available, rather than having a proportion
of
> the
> >cpu time based on niceness.
> >
> >Why do I need hyper-thread aware batch scheduling?
> >
> >If you have a hyperthread (P4HT) processor and run it as two logical
cpus
> you
> >can have a very low priority task running that can consume 50% of
your
> >physical cpu's capacity no matter how high priority tasks you are
running.
> >For example if you use the distributed computing client setiathome
you
> will
> >be effectively be running at half your cpu's speed even if you run
> setiathome
> >at nice 20. Batch scheduling for normal cpus allows only idle time to
be
> used
> >for batch tasks, and for HT cpus only allows idle time when both
logical
> cpus
> >are idle.
> >
> >This is not being pushed for mainline kernel inclusion, but the issue
of
> how
> >to prevent low priority tasks slowing down HT cpus needs to be
considered
> for
> >the mainline HT scheduler if it ever gets included. This patch
provides a
> >temporising measure for those with HT processors, and a demonstrative
way
> to
> >handle them in mainline.
> >
> 
> I wonder how does Intel suggest we handle this problem? Batch scheduling
> aside, I wonder how to do any sort of priorities at all? I think POWER5
> can do priorities in hardware, that is the only sane way I can think of
> doing it.
> 
> I think this patch is much too ugly to get into such an elegant scheduler.
> No fault to you Con because its an ugly problem.
> 
> How about this: if a task is "delta" priority points below a task running
> on another sibling, move it to that sibling (so priorities via timeslice
> start working). I call it active unbalancing! I might be able to make it
> fit if there is interest. Other suggestions?
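
The two rules quoted above (batch tasks run only when every logical cpu is
idle, and "active unbalancing" of a much lower priority task onto its
sibling) can be sketched in C roughly as follows. This is only an
illustration and is not taken from Con's patch or from any kernel source:
struct sibling, batch_may_run(), should_unbalance() and the
SIBLING_PRIO_DELTA value are all hypothetical names and numbers, and "idle"
here is assumed to mean "no non-batch task is running". Priorities follow
the kernel convention that a lower number means a higher priority.

```c
#include <stdbool.h>

/* Hypothetical threshold: how many priority points count as "delta". */
#define SIBLING_PRIO_DELTA 5

/* Hypothetical, simplified view of one logical (sibling) cpu. */
struct sibling {
	bool idle;      /* assumed: true if no non-batch task is running */
	int  curr_prio; /* priority of the running task; lower = more important */
};

/*
 * Batch rule as described in the quoted text: a batch task may use cpu
 * time only while every logical cpu of the physical package is idle.
 */
static bool batch_may_run(const struct sibling *sibs, int nr_sibs)
{
	for (int i = 0; i < nr_sibs; i++)
		if (!sibs[i].idle)
			return false;
	return true;
}

/*
 * "Active unbalancing" as suggested in the quoted text: if the local task
 * is at least SIBLING_PRIO_DELTA points lower in priority than the task
 * running on the other sibling, it should be moved to that sibling so that
 * ordinary timeslice-based priorities apply between the two tasks.
 * Using ">=" for the comparison is an assumption.
 */
static bool should_unbalance(int local_prio, const struct sibling *other)
{
	if (other->idle)
		return false; /* nothing to compete with on that sibling */
	return local_prio - other->curr_prio >= SIBLING_PRIO_DELTA;
}
```

With these definitions, a nice 19 task (priority 139) sharing a package with
a nice 0 task (priority 120) would be flagged for unbalancing, while two
nice 0 tasks would be left on separate siblings.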
> 


Thread overview: 31+ messages
2003-12-23  0:38 [PATCH] 2.6.0 batch scheduling, HT aware Con Kolivas
2003-12-23  1:11 ` Nick Piggin
2003-12-23  1:24   ` Con Kolivas
2003-12-23  1:36     ` Nick Piggin
2003-12-23  2:42       ` Con Kolivas
2003-12-23  2:57         ` Nick Piggin
2003-12-23  3:15           ` Con Kolivas
2003-12-23  3:16             ` Con Kolivas
2003-12-26 23:03               ` Pavel Machek
2003-12-23 15:51           ` bill davidsen
2003-12-23 22:09             ` Con Kolivas
2003-12-30  0:35               ` bill davidsen
2004-01-02 20:10     ` Bill Davidsen
2003-12-26 22:56 ` Pavel Machek
2003-12-26 23:42   ` Con Kolivas
2003-12-26 23:49     ` Con Kolivas
2003-12-27 11:09     ` Pavel Machek
2003-12-27 11:15       ` Con Kolivas
2003-12-30  0:29         ` bill davidsen
2003-12-29  7:02       ` Nick Piggin
2003-12-29 12:49         ` Pavel Machek
2003-12-27  8:52   ` Mika Penttilä
2003-12-30  0:32     ` bill davidsen
2004-01-02 20:05   ` Bill Davidsen
2004-01-02 20:56     ` Davide Libenzi
2004-01-02 21:10       ` Valdis.Kletnieks
2004-01-02 23:34         ` Davide Libenzi
2003-12-23  1:59 Nakajima, Jun
2003-12-23  2:40 ` Nick Piggin
2003-12-23  5:33 Nakajima, Jun
2003-12-23 10:13 ` Nick Piggin
