* [BUG] rebuild_sched_domains considered dangerous
@ 2011-03-09  2:58 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2011-03-09  2:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

So I've been experiencing hangs shortly after boot with recent kernels
on a Power7 machine. I was testing with PREEMPT & HZ=1024 which might
increase the frequency of the problem but I don't think they are
necessary to expose it.

From what I've figured out, when the machine hangs, it's essentially
looping forever in update_sd_lb_stats(), due to a corrupted sd->groups
list (in my case, the list contains a loop that doesn't loop back to
the first element).
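
(To make the failure mode concrete, here's a small self-contained userspace
model of the iteration shape used by update_sd_lb_stats() -- simplified names,
not the actual kernel code: the walk only terminates when the ring leads back
to the group it started from, so a re-linked ring that bypasses the starting
group spins forever.)

#include <stdio.h>

struct group { int id; struct group *next; };

static void walk(struct group *first)
{
	struct group *g = first;
	long iters = 0;

	do {
		iters++;			/* per-group load stats would be read here */
		g = g->next;
	} while (g != first && iters < 100);	/* cap is only so the demo terminates */

	printf("group %d: stopped after %ld steps\n", first->id, iters);
}

int main(void)
{
	struct group a = { 0 }, b = { 1 }, c = { 2 };

	a.next = &b; b.next = &c; c.next = &a;	/* healthy ring: a -> b -> c -> a */
	walk(&a);				/* terminates after 3 steps */

	c.next = &b;				/* "rebuilt" ring no longer passes through a */
	walk(&a);				/* without the cap this would never return */
	return 0;
}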

It appears that this corresponds to one CPU deciding to rebuild the
sched domains. There's various reasons why that can happen, the typical
one in our case is the new VPNH feature where the hypervisor informs us
of a change in node affinity of our virtual processors. s390 has a
similar feature and should be affected as well.

I suspect the problem could also be reproduced on x86, by hammering the
sysfs file that can be used to trigger a rebuild on a sufficiently large
machine.

From what I can tell, there's some missing locking here between
rebuilding the domains and find_busiest_group. I haven't quite got my
head around how that -should- be done, though, as I am really not very
familiar with that code. For example, I don't quite get when domains are
attached to an rq, and whether code like build_numa_sched_groups(), which
allocates groups and attaches them to a sched domain's sd->groups, does so
on a "live" domain or not (if it does, there's a problem, since it kmallocs
the groups and attaches the uninitialized result immediately).
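
(To illustrate the hazard I mean -- stand-in types, not the actual
build_numa_sched_groups() code: the problem would be publishing a kmalloc'ed
object to concurrent readers before initializing it, rather than setting it
up fully and only then making it visible.)

#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/errno.h>

struct sg_sketch { struct sg_sketch *next; unsigned int power; };

/* Buggy shape: a reader walking the list can see 'g' with garbage fields. */
static int attach_uninitialised(struct sg_sketch **live, gfp_t gfp)
{
	struct sg_sketch *g = kmalloc(sizeof(*g), gfp);

	if (!g)
		return -ENOMEM;
	*live = g;			/* visible to readers... */
	g->next = g;			/* ...before these stores happen */
	g->power = 0;
	return 0;
}

/* Safe shape: initialise off to the side, then publish with ordering. */
static int attach_initialised(struct sg_sketch **live, gfp_t gfp)
{
	struct sg_sketch *g = kzalloc(sizeof(*g), gfp);

	if (!g)
		return -ENOMEM;
	g->next = g;
	g->power = 0;
	rcu_assign_pointer(*live, g);	/* readers only ever see a fully set up object */
	return 0;
}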

I don't believe I understand enough of the scheduler to fix that quickly
and I'm really bogged down with some other urgent stuff, so I would very
much appreciate if you could provide some assistance here, even if it's
just in the form of suggestions/hints.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09  2:58 ` Benjamin Herrenschmidt
@ 2011-03-09 10:19   ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 10:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 13:58 +1100, Benjamin Herrenschmidt wrote:
> So I've been experiencing hangs shortly after boot with recent kernels
> on a Power7 machine. I was testing with PREEMPT & HZ=1024 which might
> increase the frequency of the problem but I don't think they are
> necessary to expose it.
> 
> From what I've figured out, when the machine hangs, it's essentially
> looping forever in update_sd_lb_stats(), due to a corrupted sd->groups
> list (in my case, the list contains a loop that doesn't loop back to
> the first element).
> 
> It appears that this corresponds to one CPU deciding to rebuild the
> sched domains. There's various reasons why that can happen, the typical
> one in our case is the new VPNH feature where the hypervisor informs us
> of a change in node affinity of our virtual processors. s390 has a
> similar feature and should be affected as well.

Ahh, so that's triggering it :-), just curious, how often does the HV do
that to you?

> I suspect the problem could also be reproduced on x86, by hammering the
> sysfs file that can be used to trigger a rebuild on a sufficiently large
> machine.

Should, yeah, regular hotplug is racy too.

> From what I can tell, there's some missing locking here between
> rebuilding the domains and find_busiest_group. 

init_sched_build_groups() races against pretty much all sched_group
iterations, like the one in update_sd_lb_stats() which is the most
common one and the one you're getting stuck in.

> I haven't quite got my
> head around how that -should- be done, though, as I am really not very
> familiar with that code.

:-)

> For example, I don't quite get when domains are
> attached to an rq, and whether code like build_numa_sched_groups(), which
> allocates groups and attaches them to a sched domain's sd->groups, does so
> on a "live" domain or not (if it does, there's a problem, since it kmallocs
> the groups and attaches the uninitialized result immediately).

No, the domain stuff is good, we allocate new domains and have a
synchronize_sched() between us installing the new ones and freeing the
old ones.
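
(In other words, roughly this shape -- stand-in rq/domain types, not the
exact scheduler code: publish the new pointer, wait out every CPU that might
still be traversing the old structure, and only then free it.)

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct sd_sketch;				/* opaque stand-in for struct sched_domain */
struct rq_sketch { struct sd_sketch *sd; };	/* stand-in for the per-cpu runqueue */

static void install_domain(struct rq_sketch *rq, struct sd_sketch *new_sd)
{
	struct sd_sketch *old_sd = rq->sd;

	rcu_assign_pointer(rq->sd, new_sd);	/* new traversals see the new domain   */
	synchronize_sched();			/* wait for traversals that saw old_sd */
	kfree(old_sd);				/* nothing can reference it any more    */
}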

But the sched_group list is as said rather icky.

> I don't believe I understand enough of the scheduler to fix that quickly
> and I'm really bogged down with some other urgent stuff, so I would very
> much appreciate if you could provide some assistance here, even if it's
> just in the form of suggestions/hints.

Yeah, sched_group rebuild is racy as hell, I haven't really managed to
come up with a sane fix yet, will poke at it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 10:19   ` Peter Zijlstra
@ 2011-03-09 11:33     ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 11:33 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> > It appears that this corresponds to one CPU deciding to rebuild the
> > sched domains. There's various reasons why that can happen, the typical
> > one in our case is the new VPNH feature where the hypervisor informs us
> > of a change in node affinity of our virtual processors. s390 has a
> > similar feature and should be affected as well.
> 
> Ahh, so that's triggering it :-), just curious, how often does the HV do
> that to you? 

OK, so Ben told me on IRC this can happen quite frequently, to which I
must ask WTF were you guys smoking? Flipping the CPU topology every time
the HV scheduler does something funny is quite insane. And you did that
without ever talking to the scheduler folks, not cool.

That is of course aside from the fact that we have a real bug there that
needs fixing, but really guys, WTF!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 10:19   ` Peter Zijlstra
@ 2011-03-09 13:01     ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 13:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> No, the domain stuff is good, we allocate new domains and have a
> synchronize_sched() between us installing the new ones and freeing the
> old ones. 

Gah, if only..

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 11:33     ` Peter Zijlstra
@ 2011-03-09 13:15       ` Martin Schwidefsky
  -1 siblings, 0 replies; 35+ messages in thread
From: Martin Schwidefsky @ 2011-03-09 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 09 Mar 2011 12:33:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> > > It appears that this corresponds to one CPU deciding to rebuild the
> > > sched domains. There's various reasons why that can happen, the typical
> > > one in our case is the new VPNH feature where the hypervisor informs us
> > > of a change in node affinity of our virtual processors. s390 has a
> > > similar feature and should be affected as well.
> > 
> > Ahh, so that's triggering it :-), just curious, how often does the HV do
> > that to you? 
> 
> OK, so Ben told me on IRC this can happen quite frequently, to which I
> must ask WTF were you guys smoking? Flipping the CPU topology every time
> the HV scheduler does something funny is quite insane. And you did that
> without ever talking to the scheduler folks, not cool.
> 
> That is of course aside from the fact that we have a real bug there that
> needs fixing, but really guys, WTF!

Just for info, on s390 the topology change events are rather infrequent.
They do happen e.g. after an LPAR has been activated and the LPAR
hypervisor needs to reshuffle the CPUs of the different nodes.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:15       ` Martin Schwidefsky
@ 2011-03-09 13:19         ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 13:19 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 14:15 +0100, Martin Schwidefsky wrote:
> On Wed, 09 Mar 2011 12:33:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> > > > It appears that this corresponds to one CPU deciding to rebuild the
> > > > sched domains. There's various reasons why that can happen, the typical
> > > > one in our case is the new VPNH feature where the hypervisor informs us
> > > > of a change in node affinity of our virtual processors. s390 has a
> > > > similar feature and should be affected as well.
> > > 
> > > Ahh, so that's triggering it :-), just curious, how often does the HV do
> > > that to you? 
> > 
> > OK, so Ben told me on IRC this can happen quite frequently, to which I
> > must ask WTF were you guys smoking? Flipping the CPU topology every time
> > the HV scheduler does something funny is quite insane. And you did that
> > without ever talking to the scheduler folks, not cool.
> > 
> > That is of course aside from the fact that we have a real bug there that
> > needs fixing, but really guys, WTF!
> 
> Just for info, on s390 the topology change events are rather infrequent.
> They do happen e.g. after an LPAR has been activated and the LPAR
> hypervisor needs to reshuffle the CPUs of the different nodes.

But if you don't also update the cpu->node memory mappings (which I
think is near impossible), what good is it to change the scheduler
topology?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:19         ` Peter Zijlstra
@ 2011-03-09 13:31           ` Martin Schwidefsky
  -1 siblings, 0 replies; 35+ messages in thread
From: Martin Schwidefsky @ 2011-03-09 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 09 Mar 2011 14:19:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2011-03-09 at 14:15 +0100, Martin Schwidefsky wrote:
> > On Wed, 09 Mar 2011 12:33:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> > > > > It appears that this corresponds to one CPU deciding to rebuild the
> > > > > sched domains. There's various reasons why that can happen, the typical
> > > > > one in our case is the new VPNH feature where the hypervisor informs us
> > > > > of a change in node affinity of our virtual processors. s390 has a
> > > > > similar feature and should be affected as well.
> > > > 
> > > > Ahh, so that's triggering it :-), just curious, how often does the HV do
> > > > that to you? 
> > > 
> > > OK, so Ben told me on IRC this can happen quite frequently, to which I
> > > must ask WTF were you guys smoking? Flipping the CPU topology every time
> > > the HV scheduler does something funny is quite insane. And you did that
> > > without ever talking to the scheduler folks, not cool.
> > > 
> > > That is of course aside from the fact that we have a real bug there that
> > > needs fixing, but really guys, WTF!
> > 
> > Just for info, on s390 the topology change events are rather infrequent.
> > They do happen e.g. after an LPAR has been activated and the LPAR
> > hypervisor needs to reshuffle the CPUs of the different nodes.
> 
> > But if you don't also update the cpu->node memory mappings (which I
> > think is near impossible), what good is it to change the scheduler
> > topology?

The memory for the different LPARs is striped over all nodes (or books as we
call them). We heavily rely on the large shared cache between the books to hide
the different memory access latencies.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:31           ` Martin Schwidefsky
@ 2011-03-09 13:33             ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 13:33 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 14:31 +0100, Martin Schwidefsky wrote:
> > But if you don't also update the cpu->node memory mappings (which I
> > think is near impossible), what good is it to change the scheduler
> > topology?
> 
> The memory for the different LPARs is striped over all nodes (or books as we
> call them). We heavily rely on the large shared cache between the books to hide
> the different memory access latencies. 

Right, so effectively you don't have NUMA due to that striping. So why
then change the CPU topology? Simply create a topology without NUMA and
keep it static; that accurately reflects the memory topology.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:33             ` Peter Zijlstra
@ 2011-03-09 13:46               ` Martin Schwidefsky
  -1 siblings, 0 replies; 35+ messages in thread
From: Martin Schwidefsky @ 2011-03-09 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 09 Mar 2011 14:33:56 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2011-03-09 at 14:31 +0100, Martin Schwidefsky wrote:
> > > But if you don't also update the cpu->node memory mappings (which I
> > > think is near impossible), what good is it to change the scheduler
> > > topology?
> > 
> > The memory for the different LPARs is striped over all nodes (or books as we
> > call them). We heavily rely on the large shared cache between the books to hide
> > the different memory access latencies. 
> 
> Right, so effectively you don't have NUMA due to that striping. So why
> then change the CPU topology? Simply create a topology without NUMA and
> keep it static, that accurately reflects the memory topology.

Well, the CPU topology can change due to different groupings of logical CPUs
depending on which LPARs are activated. And we effectively do not have a
memory topology, only a CPU topology. It's basically all about caches: we want
to reflect the distance between CPUs over the up to 4 cache levels.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:46               ` Martin Schwidefsky
@ 2011-03-09 13:54                 ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-09 13:54 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Benjamin Herrenschmidt, linux-kernel, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 14:46 +0100, Martin Schwidefsky wrote:
> On Wed, 09 Mar 2011 14:33:56 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2011-03-09 at 14:31 +0100, Martin Schwidefsky wrote:
> > > > But if you don't also update the cpu->node memory mappings (which I
> > > > think is near impossible), what good is it to change the scheduler
> > > > topology?
> > > 
> > > The memory for the different LPARs is striped over all nodes (or books as we
> > > call them). We heavily rely on the large shared cache between the books to hide
> > > the different memory access latencies. 
> > 
> > Right, so effectively you don't have NUMA due to that striping. So why
> > then change the CPU topology? Simply create a topology without NUMA and
> > keep it static, that accurately reflects the memory topology.
> 
> Well, the CPU topology can change due to different groupings of logical CPUs
> depending on which LPARs are activated. And we effectively do not have a
> memory topology, only a CPU topology. It's basically all about caches: we want
> to reflect the distance between CPUs over the up to 4 cache levels.

Right, so I consider caches to be part of the memory topology anyway.
If this is all very rare then yeah, that works out.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 11:33     ` Peter Zijlstra
@ 2011-03-09 15:26       ` Steven Rostedt
  -1 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2011-03-09 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, linux-kernel, Martin Schwidefsky,
	linuxppc-dev, Jesse Larrew

On Wed, Mar 09, 2011 at 12:33:49PM +0100, Peter Zijlstra wrote:
> 
> That is of course aside from the fact that we have a real bug there that
> needs fixing, but really guys, WTF!

They just wanted to give you a very nice reproducer for that bug ;)

-- Steve


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-09 13:01     ` Peter Zijlstra
@ 2011-03-10 14:10       ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-03-10 14:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Wed, 2011-03-09 at 14:01 +0100, Peter Zijlstra wrote:
> On Wed, 2011-03-09 at 11:19 +0100, Peter Zijlstra wrote:
> > No, the domain stuff is good, we allocate new domains and have a
> > synchronize_sched() between us installing the new ones and freeing the
> > old ones. 
> 
> Gah, if only..

OK, so for hotplug and cpusets it works because they change the doms_cur
set: when the old and the new set don't match, it destroys the current
sched_domain/sched_group sets for the relevant cpus and then calls
synchronize_sched() to wait for any current activity to go away.

Only then does it rebuild stuff for the new set, reusing the statically
allocated sched_domain and sched_group data.

Now, supposedly when your new and old domain set is the same it should
be a nop, unless arch_update_cpu_topology() returns true in which case
it will do a full destroy and rebuild.
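
(Roughly this shape, loosely modelled on partition_sched_domains() with
stand-in helpers -- not a copy of the real code: unchanged domain masks are
left alone, everything else is torn down and rebuilt, and a 'true' return
from arch_update_cpu_topology() forces the full destroy/rebuild.)

#include <linux/cpumask.h>
#include <linux/types.h>

static void destroy_domains(const struct cpumask *map) { (void)map; /* would end in synchronize_sched() */ }
static void build_domains(const struct cpumask *map)   { (void)map; /* would rebuild sd/sg for this mask */ }

static void repartition_sketch(const struct cpumask **doms_new, int ndoms_new,
			       const struct cpumask **doms_cur, int ndoms_cur,
			       bool new_topology)	/* arch_update_cpu_topology() result */
{
	int i, j;

	for (i = 0; i < ndoms_cur; i++) {		/* tear down what is no longer wanted */
		if (!new_topology)
			for (j = 0; j < ndoms_new; j++)
				if (cpumask_equal(doms_cur[i], doms_new[j]))
					goto match_cur;	/* same mask: keep it as-is */
		destroy_domains(doms_cur[i]);
match_cur:	;
	}

	for (i = 0; i < ndoms_new; i++) {		/* build what did not exist before */
		if (!new_topology)
			for (j = 0; j < ndoms_cur; j++)
				if (cpumask_equal(doms_new[i], doms_cur[j]))
					goto match_new;	/* already built: nothing to do */
		build_domains(doms_new[i]);
match_new:	;
	}
}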

So I'm not quite sure what power does to make it go bang..

Anyway, I'm now rewriting the sched_domain creation stuff because I've
utterly had it with that code..

Also, still waiting to hear from the Power7 folks on how often they
think to rebuild the topology and how they think that makes sense,
afaict Power7 does have actual NUMA nodes unlike s390, so I'm still not
seeing how that's going to work properly at all.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-03-10 14:10       ` Peter Zijlstra
@ 2011-04-20 10:07         ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-04-20 10:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Thu, 2011-03-10 at 15:10 +0100, Peter Zijlstra wrote:
> 
> Also, still waiting to hear from the Power7 folks on how often they
> think to rebuild the topology and how they think that makes sense,
> afaict Power7 does have actual NUMA nodes unlike s390, so I'm still not
> seeing how that's going to work properly at all. 

Jesse care to answer? I hear from Ben you're responsible for that mess.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-04-20 10:07         ` Peter Zijlstra
@ 2011-04-20 22:01           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2011-04-20 22:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Martin Schwidefsky, linuxppc-dev, Jesse Larrew

On Wed, 2011-04-20 at 12:07 +0200, Peter Zijlstra wrote:
> On Thu, 2011-03-10 at 15:10 +0100, Peter Zijlstra wrote:
> > 
> > Also, still waiting to hear from the Power7 folks on how often they
> > think to rebuild the topology and how they think that makes sense,
> > afaict Power7 does have actual NUMA nodes unlike s390, so I'm still not
> > seeing how that's going to work properly at all. 
> 
> Jesse care to answer? I hear from Ben you're responsible for that mess.

"responsible for this mess" is a big word :-)

But he's the one who last played with that code ... Jesse?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-04-20 22:01           ` Benjamin Herrenschmidt
  (?)
@ 2011-05-09 21:26           ` Jesse Larrew
  2011-05-10 14:09               ` Peter Zijlstra
  -1 siblings, 1 reply; 35+ messages in thread
From: Jesse Larrew @ 2011-05-09 21:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Martin Schwidefsky, linuxppc-dev, linux-kernel

On 04/20/2011 05:01 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2011-04-20 at 12:07 +0200, Peter Zijlstra wrote:
>> On Thu, 2011-03-10 at 15:10 +0100, Peter Zijlstra wrote:
>>>
>>> Also, still waiting to hear from the Power7 folks on how often
>>> they think to rebuild the topology and how they think that makes
>>> sense, afaict Power7 does have actual NUMA nodes unlike s390, so
>>> I'm still not seeing how that's going to work properly at all.
>>
>> Jesse care to answer? I hear from Ben you're responsible for that
>> mess.
>
> "responsible for this mess" is a big word :-)
>
> But he's the one to last play with that code ... Jesse ?
>

Hi Peter!

According to the Power firmware folks, updating the home node of a virtual cpu happens rather infrequently. The VPHN code currently checks for topology updates every 60 seconds, but we can poll less frequently if it helps. I chose 60-second intervals simply because that's how often they check the topology on s390. ;-)
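
(For reference, the polling scheme is essentially a periodic delayed work
item along the lines of this sketch -- the vphn_poll_* names are illustrative
stand-ins, not the actual arch/powerpc code.)

#include <linux/init.h>
#include <linux/workqueue.h>
#include <linux/printk.h>
#include <linux/jiffies.h>
#include <linux/types.h>

#define VPHN_POLL_INTERVAL	(60 * HZ)	/* the 60-second interval discussed above */

static struct delayed_work vphn_poll_work;

static bool vphn_topology_changed(void)
{
	/* stand-in: ask the hypervisor whether any vcpu's home node moved */
	return false;
}

static void vphn_poll_fn(struct work_struct *work)
{
	if (vphn_topology_changed()) {
		/* this is where the expensive sched-domain rebuild gets requested */
		pr_info("home node changed, sched domains need a rebuild\n");
	}
	schedule_delayed_work(&vphn_poll_work, VPHN_POLL_INTERVAL);
}

static int __init vphn_poll_init(void)
{
	INIT_DELAYED_WORK(&vphn_poll_work, vphn_poll_fn);
	schedule_delayed_work(&vphn_poll_work, VPHN_POLL_INTERVAL);
	return 0;
}
late_initcall(vphn_poll_init);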

As for updating the memory topology, there are cases where changing the home node of a virtual cpu doesn't affect the memory topology. If it does, there is a separate notification system for memory topology updates that is independent from the cpu updates. I plan to start working on a patch set to enable memory topology updates in the kernel in the coming weeks, but I wanted to get the cpu patches out on the list so we could start having these debates. :)

Sincerely,

Jesse Larrew
Software Engineer, Linux on Power Kernel Team
IBM Linux Technology Center
Phone: (512) 973-2052 (T/L: 363-2052)
jlarrew@linux.vnet.ibm.com


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-05-09 21:26           ` Jesse Larrew
@ 2011-05-10 14:09               ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-05-10 14:09 UTC (permalink / raw)
  To: Jesse Larrew
  Cc: Benjamin Herrenschmidt, linux-kernel, Martin Schwidefsky,
	linuxppc-dev, nfont

On Mon, 2011-05-09 at 16:26 -0500, Jesse Larrew wrote:
> 
> According to the Power firmware folks, updating the home node of a
> virtual cpu happens rather infrequently. The VPHN code currently
> checks for topology updates every 60 seconds, but we can poll less
> frequently if it helps. I chose 60 second intervals simply because
> that's how often they check the topology on s390. ;-)

This just makes me shudder, so you poll the state? Meaning that the vcpu
can actually run 99% of the time on another node?

What's the point of this if the vcpu scheduler can move the vcpu around
much faster?

> As for updating the memory topology, there are cases where changing
> the home node of a virtual cpu doesn't affect the memory topology. If
> it does, there is a separate notification system for memory topology
> updates that is independent from the cpu updates. I plan to start
> working on a patch set to enable memory topology updates in the kernel
> in the coming weeks, but I wanted to get the cpu patches out on the
> list so we could start having these debates. :) 

Well, they weren't put out on a list (well maybe on the ppc list but
that's the same as not posting them from my pov); they were merged (and
thus declared done), which is not how you normally start a debate.

I would really like to see both patch-sets together. Also, I'm not at
all convinced it's a sane thing to do. Pretty much all NUMA aware
software I know of assumes that CPU<->NODE relations are static;
breaking that in the kernel renders all existing software broken.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-05-10 14:09               ` Peter Zijlstra
@ 2011-05-11 16:17                 ` Jesse Larrew
  -1 siblings, 0 replies; 35+ messages in thread
From: Jesse Larrew @ 2011-05-11 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, linux-kernel, Martin Schwidefsky,
	linuxppc-dev, nfont

On 05/10/2011 09:09 AM, Peter Zijlstra wrote:
> On Mon, 2011-05-09 at 16:26 -0500, Jesse Larrew wrote:
>>
>> According to the Power firmware folks, updating the home node of a
>> virtual cpu happens rather infrequently. The VPHN code currently
>> checks for topology updates every 60 seconds, but we can poll less
>> frequently if it helps. I chose 60 second intervals simply because
>> that's how often they check the topology on s390. ;-)
> 
> This just makes me shudder, so you poll the state? Meaning that the vcpu
> can actually run 99% of the time on another node?
> 
> What's the point of this if the vcpu scheduler can move the vcpu around
> much faster?
> 

Based on my discussion with the firmware folks, it sounds like the hypervisor will never automatically move vcpus around on its own. The firmware is designed to set the cpu home node at partition boot, then wait for the customer to run a tool to rebalance the affinity. Moving vcpus around costs performance, so they want to let the customer decide when to shuffle the vcpus. 

From the kernel's perspective, we can expect to see occasional batches of vcpus updating at once, after which the topology should remain fixed until the tool is run again.

>> As for updating the memory topology, there are cases where changing
>> the home node of a virtual cpu doesn't affect the memory topology. If
>> it does, there is a separate notification system for memory topology
>> updates that is independent from the cpu updates. I plan to start
>> working on a patch set to enable memory topology updates in the kernel
>> in the coming weeks, but I wanted to get the cpu patches out on the
>> list so we could start having these debates. :) 
> 
> Well, they weren't put out on a list (well maybe on the ppc list but
> that's the same as not posting them from my pov), they were merged (and
> thus declared done) that's not how you normally start a debate.
> 

That's a fair point. At the time, I didn't expect anyone outside of the PPC community to care much about a PPC-specific patch set, but I see now why it's important to keep everyone in the loop. Sorry about that. I'll be sure to send any future patches to LKML as well.

> I would really like to see both patch-sets together. Also, I'm not at
> all convinced it's a sane thing to do. Pretty much all NUMA aware
> software I know of assumes that CPU<->NODE relations are static;
> breaking that in the kernel renders all existing software broken.
> 

I suspect that's true. Then again, shouldn't it be the capabilities of the hardware that dictate what the software does, rather than the other way around?

-- 

Jesse Larrew
Software Engineer, Linux on Power Kernel Team
IBM Linux Technology Center
Phone: (512) 973-2052 (T/L: 363-2052)
jlarrew@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [BUG] rebuild_sched_domains considered dangerous
  2011-05-11 16:17                 ` Jesse Larrew
@ 2011-06-03 14:47                   ` Peter Zijlstra
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-06-03 14:47 UTC (permalink / raw)
  To: Jesse Larrew
  Cc: Benjamin Herrenschmidt, linux-kernel, Martin Schwidefsky,
	linuxppc-dev, nfont

On Wed, 2011-05-11 at 11:17 -0500, Jesse Larrew wrote:
> > I would really like to see both patch-sets together. Also, I'm not at
> > all convinced it's a sane thing to do. Pretty much all NUMA aware
> > software I know of assumes that CPU<->NODE relations are static;
> > breaking that in the kernel renders all existing software broken.
> > 
> 
> I suspect that's true. Then again, shouldn't it be the capabilities of
> the hardware that dictates what the software does, rather than the
> other way around? 

Wish that were true; we wouldn't all be constrained by all this legacy
software.. ;-)

Anyway, there are plenty of CPU<->NODE assumptions in the kernel as well;
fixing those will be 'interesting' at best. As for userspace, since it's
a user-driven tool revamping the topology, the user gets to keep the
pieces when he runs it while some NUMA-aware proglet is running.

^ permalink raw reply	[flat|nested] 35+ messages in thread

Thread overview: 18 messages (newest: 2011-06-03 14:47 UTC)

2011-03-09  2:58 [BUG] rebuild_sched_domains considered dangerous Benjamin Herrenschmidt
2011-03-09 10:19 ` Peter Zijlstra
2011-03-09 11:33   ` Peter Zijlstra
2011-03-09 13:15     ` Martin Schwidefsky
2011-03-09 13:19       ` Peter Zijlstra
2011-03-09 13:31         ` Martin Schwidefsky
2011-03-09 13:33           ` Peter Zijlstra
2011-03-09 13:46             ` Martin Schwidefsky
2011-03-09 13:54               ` Peter Zijlstra
2011-03-09 15:26     ` Steven Rostedt
2011-03-09 13:01   ` Peter Zijlstra
2011-03-10 14:10     ` Peter Zijlstra
2011-04-20 10:07       ` Peter Zijlstra
2011-04-20 22:01         ` Benjamin Herrenschmidt
2011-05-09 21:26           ` Jesse Larrew
2011-05-10 14:09             ` Peter Zijlstra
2011-05-11 16:17               ` Jesse Larrew
2011-06-03 14:47                 ` Peter Zijlstra
