* NUMA and SMP
@ 2007-01-14 11:55 David Pilger
  2007-01-14 19:00 ` Ryan Harper
  2007-01-15 17:21 ` Anthony Liguori
  0 siblings, 2 replies; 35+ messages in thread
From: David Pilger @ 2007-01-14 11:55 UTC (permalink / raw)
  To: xen-devel, ryanh

Hi all,

1. Do desktop computers, such as Intel dual-core machines, really benefit from NUMA?
2. Does it have a real effect on the performance of Xen?
3. Can't we let the guest OS manage NUMA instead of Xen? What is the
difference, and why is it implemented in Xen?

Thanks,
David.


* Re: NUMA and SMP
  2007-01-14 11:55 NUMA and SMP David Pilger
@ 2007-01-14 19:00 ` Ryan Harper
  2007-01-15 17:21 ` Anthony Liguori
  1 sibling, 0 replies; 35+ messages in thread
From: Ryan Harper @ 2007-01-14 19:00 UTC (permalink / raw)
  To: David Pilger; +Cc: xen-devel, ryanh

* David Pilger <pilger.david@gmail.com> [2007-01-14 06:04]:
> Hi all,
> 
> 1. Does desktop computers, such as intel dual core really benefit from NUMA?

Desktop computers with AMD chips, which put the memory controller on the
CPU, have NUMA characteristics and can benefit from keeping memory close
to the CPU.

> 2. Does it have a real effect on the performance of Xen?

I've posted previously [1] to the list about the performance benefit for
NUMA systems, and shown that there is no regression for non-NUMA systems.

> 3. Can't we let the guest OS manage NUMA instead of Xen? what is the
> difference? and why is it implemented in Xen?

Xen owns all of the system memory, controls its allocation, and therefore
determines which memory and which processors are in use for a guest.  If
we are to create a guest with memory close to the physical processors in
use, then we must understand the topology of the system when we allocate
memory for the guest.

I'm not sure I entirely understand what you mean by letting the guest OS
manage NUMA instead.  The current Xen NUMA implementation does not export
the domain's NUMA-ness to the guest kernel, but that is the next logical
step: not only allocate memory to the guest in a NUMA-aware fashion, but
also, when we are required to give a guest memory from multiple NUMA
nodes, export the guest topology so that a NUMA-aware guest OS can make
NUMA-aware decisions.


1. http://lists.xensource.com/archives/html/xen-devel/2006-09/msg00958.html

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com


* Re: NUMA and SMP
  2007-01-14 11:55 NUMA and SMP David Pilger
  2007-01-14 19:00 ` Ryan Harper
@ 2007-01-15 17:21 ` Anthony Liguori
  2007-01-16 10:47   ` Petersson, Mats
  2007-01-16 14:51   ` Re: NUMA and SMP ron minnich
  1 sibling, 2 replies; 35+ messages in thread
From: Anthony Liguori @ 2007-01-15 17:21 UTC (permalink / raw)
  To: David Pilger; +Cc: xen-devel, Ryan Harper

David Pilger wrote:
> Hi all,
> 
> 1. Does desktop computers, such as intel dual core really benefit from 
> NUMA?

No.  NUMA stands for Non-Uniform Memory Architecture.  It's basically a 
system where you have nodes (which are essentially independent 
computers) that are connected via a high-speed bus.  Each node has its 
own memory, but through the magic of NUMA, every node can access the 
other nodes' memory as if it were its own.  Most NUMA systems (if not 
all) are very high-end servers.

> 2. Does it have a real effect on the performance of Xen?

On a NUMA system, absolutely.  If you have a domain running on a 
particular node, you want to make sure that it's using memory that's on 
its node if at all possible.  Accessing memory on the local node is 
considerably faster than accessing memory on other nodes.  Prior to 
Ryan's NUMA work, Xen would just blindly allocate memory to a domain 
without taking into account memory locality.
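
To make the local-versus-remote gap concrete, here is a minimal
user-space sketch using libnuma (my illustration, not part of the
original mail); it assumes a Linux host with at least two NUMA nodes and
libnuma installed, and is built with -lnuma:

/* Touch memory allocated on the local node vs. a remote node and
 * compare the time taken. */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static double touch_ms(char *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 64)        /* stride one cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    size_t len = 256UL << 20;                   /* larger than the caches */

    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);                        /* keep this thread on node 0 */

    char *local  = numa_alloc_onnode(len, 0);               /* same node  */
    char *remote = numa_alloc_onnode(len, numa_max_node()); /* other node */
    memset(local, 0, len);                      /* fault the pages in */
    memset(remote, 0, len);

    printf("local : %.1f ms\n", touch_ms(local, len));
    printf("remote: %.1f ms\n", touch_ms(remote, len));

    numa_free(local, len);
    numa_free(remote, len);
    return 0;
}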

> 3. Can't we let the guest OS manage NUMA instead of Xen? what is the
> difference? and why is it implemented in Xen?

If a guest OS spans multiple nodes, then you would want it to be NUMA 
aware.  However, you always want Xen to, at least, be NUMA aware so that 
it allocates memory appropriately.

Regards,

Anthony Liguori

> Thanks,
> David.


* RE: Re: NUMA and SMP
  2007-01-15 17:21 ` Anthony Liguori
@ 2007-01-16 10:47   ` Petersson, Mats
  2007-01-16 13:55     ` Emmanuel Ackaouy
  2007-01-16 14:51   ` Re: NUMA and SMP ron minnich
  1 sibling, 1 reply; 35+ messages in thread
From: Petersson, Mats @ 2007-01-16 10:47 UTC (permalink / raw)
  To: Anthony Liguori, David Pilger; +Cc: xen-devel, Ryan Harper

 

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> Anthony Liguori
> Sent: 15 January 2007 17:22
> To: David Pilger
> Cc: xen-devel; Ryan Harper
> Subject: [Xen-devel] Re: NUMA and SMP
> 
> David Pilger wrote:
> > Hi all,
> > 
> > 1. Does desktop computers, such as intel dual core really 
> benefit from 
> > NUMA?
> 
> No.  NUMA standards Non-Uniform Memory Architecture.  It's 
> basically a 
> system where you have nodes (which are essentially independent 
> computers) that are connected via a high speed bus.  Each 
> node has it's 
> own memory but through the magic of NUMA, every node can access the 
> other nodes memory as if it's own.  Most NUMA systems (if not 
> all) are 
> very high end servers.

Good description, but you have to agree that AMD has a NUMA-style
architecture in the Opteron-class systems. However, this is sometimes
also called "SUMO" (Sufficiently Uniform Memory Organization), which
means that non-NUMA-aware software will operate correctly on the system,
although not optimally (because the software will allocate memory
without regard to its locality, and thus potentially incur penalties
that aren't necessary). It's "sufficiently uniform" because the penalty
(compared with "true NUMA") for "bad" memory allocation is on the same
order as a normal memory fetch (but of course, that means about 2X to 3X
a local memory fetch). 

On other NUMA systems, the penalty for accessing out-of-node memory can
be 10-100x the local memory access time, which is obviously a much more
noticeable effect.
> 
> > 2. Does it have a real effect on the performance of Xen?
> 
> On a NUMA system, absolutely.  If you have a domain running on a 
> particular node, you want to make sure that it's using memory 
> that's in 
> it's node if at all possible.  Accessing memory on a local node is 
> considerably faster than access memory on other nodes.  Prior 
> to Ryan's 
> NUMA work, Xen would just blindly allocate memory to a domain without 
> taking into account memory locality.

Absolutely, there's a noticeable benefit. 
> 
> > 3. Can't we let the guest OS manage NUMA instead of Xen? what is the
> > difference? and why is it implemented in Xen?
> 
> If a guest OS spans multiple nodes, then you would want it to be NUMA 
> aware.  However, you always want Xen to, at least, be NUMA 
> aware so that 
> it allocates memory appropriately.

Ideally, we'd want the NUMA information exported to the guest, but at
least if Xen knows that memory allocated for a particular guest is local
to the same (group of) processor(s), there's a benefit.

You can't "just" leave it to the guest OS though, because the guest has
no control over which bits of memory it actually gets - Xen doles that
out - and if the OS is NUMA-aware but gets memory from node 1 and a
processor on node 0, then there's not much the OS can do to make things
better, right?

--
Mats
> 
> Regards,
> 
> Anthony Liguori
> 
> > Thanks,
> > David.
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-01-16 10:47   ` Petersson, Mats
@ 2007-01-16 13:55     ` Emmanuel Ackaouy
  2007-01-16 14:19       ` Petersson, Mats
  2007-03-20 13:10       ` tgh
  0 siblings, 2 replies; 35+ messages in thread
From: Emmanuel Ackaouy @ 2007-01-16 13:55 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: Anthony Liguori, xen-devel, David Pilger, Ryan Harper

On the topic of NUMA:

I'd like to dispute the assumption that a NUMA-aware OS can actually
make good decisions about the initial placement of memory in a
reasonable hardware ccNUMA system.

How does the OS know on which node a particular chunk of memory
will be most accessed? The truth is that unless the application or
person running the application is herself NUMA-aware and can provide
placement hints or directives, the OS will seldom beat a round-robin /
interleave or random placement strategy.

To illustrate, consider an app which lays out a bunch of data in memory
in a single thread and then spawns worker threads to process it.

Is the OS to place memory close to the initial thread? How can it
possibly know how many threads will eventually process the data?

Even if the OS knew how many threads will eventually crunch the data,
it cannot possibly know at placement time if each thread will work on an
assigned data subset (and if so, which one) or if it will act as a
pipeline stage with all the data being passed from one thread to the
next.

If you go beyond initial memory placement or start considering memory
migration, then it's even harder to win because you have to pay copy
and stall penalties during migrations. So you have to be real smart
about predicting the future to do better than your ~10-40% memory
bandwidth and latency hit associated with doing simple memory
interleaving on a modern hardware-ccNUMA system.
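
For reference (my addition, not Emmanuel's), the interleaved default
placement described here is what libnuma exposes as
numa_alloc_interleaved(); a minimal sketch, assuming Linux with libnuma,
built with -lnuma:

/* Spread an allocation round-robin across all NUMA nodes - the
 * "interleave by default" policy argued for above. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 64UL << 20;            /* 64 MB of shared data */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Pages are placed on node 0, 1, 2, ... in turn, so no single node
     * ends up hosting (and bottlenecking) the whole data set. */
    void *data = numa_alloc_interleaved(len);
    memset(data, 0, len);               /* fault the pages in */

    /* ... worker threads on different nodes would process 'data' ... */

    numa_free(data, len);
    return 0;
}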

And it gets worse for you when your app is successfully taking advantage
of the memory cache hierarchy because its performance is less impacted
by raw memory latency and bandwidth.

Things also get more difficult on a time-sharing host with competing
apps.

There is a strong argument for making hypervisors and OSes NUMA
aware in the sense that:
1- They know about system topology
2- They can export this information up the stack to applications and
     users
3- They can take in directives from users and applications to partition
     the host and place some threads and memory in specific partitions.
4- They use an interleaved (or random) initial memory placement strategy
     by default.

The argument that the OS on its own -- without user or application
directives -- can make better placement decisions than round-robin or
random placement is -- in my opinion -- flawed.

I also am skeptical that the complexity associated with page migration
strategies would be worthwhile: If you got it wrong the first time, what
makes you think you'll do better this time?

Emmanuel.


* RE: Re: NUMA and SMP
  2007-01-16 13:55     ` Emmanuel Ackaouy
@ 2007-01-16 14:19       ` Petersson, Mats
  2007-01-16 16:13         ` Emmanuel Ackaouy
  2007-03-20 13:10       ` tgh
  1 sibling, 1 reply; 35+ messages in thread
From: Petersson, Mats @ 2007-01-16 14:19 UTC (permalink / raw)
  To: Emmanuel Ackaouy; +Cc: Anthony Liguori, xen-devel, David Pilger, Ryan Harper

> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xensource.com] 
> Sent: 16 January 2007 13:56
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> On the topic of NUMA:
> 
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.

I'm not saying that it ALWAYS can make good decisions, but it's got a
better chance than software that just places things in a "first
available" way. 

> 
> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.

I don't disagree with that. 
> 
> To illustrate, consider an app which lays out a bunch of data 
> in memory
> in a single thread and then spawns worker threads to process it.

That's a good example of a hard nut to crack. Not easily solved in the
OS, that's for sure. 
> 
> Is the OS to place memory close to the initial thread? How can it 
> possibly
> know how many threads will eventually process the data?
> 
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will 
> work on an
> assigned data subset (and if so, which one) or if it will act as a 
> pipeline
> stage with all the data being passed from one thread to the next.
> 
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.

Sure, I certainly wasn't suggesting memory migration. 

However, there is a case where NUMA information COULD be helpful: when
the system is paging in, it could try to find a page on the local node
rather than a "random" one [although without knowing what the future
holds, this could be wrong - as any non-future-knowing strategy would
be]. Of course, I wouldn't disagree if you said "the system probably has
too little memory if it's paging"! 

> 
> And it gets worse for you when your app is successfully 
> taking advantage
> of the memory cache hierarchy because its performance is less impacted
> by raw memory latency and bandwidth.

Indeed. 
> 
> Things also get more difficult on a time-sharing host with competing
> apps.

Agreed.
> 
> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications and 
> users
> 3- They can take in directives from users and applications to 
> partition 
> the
>      host and place some threads and memory in specific partitions.
> 4- They use an interleaved (or random) initial memory 
> placement strategy
>      by default.
> 
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.

Debatable - it depends a lot on WHAT applications you expect to run, and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single-threaded
process (say, a compiler), then allocating memory on the local node
should be the "first choice". 

Multithreaded apps can use a similar approach: if a thread is allocating
memory, there's often a good chance that the memory will be used by that
thread too [although this doesn't work for message passing between
threads, obviously; this is again a case where "knowledge from the app"
will be the only better solution than "random"].

This approach is far from perfect, but if you consider that applications
often do short-term allocations, it makes sense to allocate on the local
node if possible. 
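
A per-thread "local node first" preference of this kind can be requested
from libnuma; a small sketch (mine, not Mats's), assuming Linux with
libnuma, built with -lnuma -lpthread:

/* A worker thread asks for node-local placement for its own
 * short-lived allocations. */
#include <numa.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static void *worker(void *arg)
{
    (void)arg;

    /* Ask the kernel to satisfy this thread's page faults from the node
     * it is currently running on (falling back to other nodes if full). */
    numa_set_localalloc();

    /* Short-lived scratch buffer: with luck it is used and freed before
     * the scheduler ever moves the thread to another node. */
    char *scratch = malloc(1 << 20);
    memset(scratch, 1, 1 << 20);
    free(scratch);
    return NULL;
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}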
> 
> I also am skeptical that the complexity associated with page migration
> strategies would be worthwhile: If you got it wrong the first 
> time, what
> makes you think you'll do better this time?

I'm not advocating any page migration, with the possible exception that
page faults resolved by paging in should prefer the local node. 

However, supporting NUMA in the hypervisor and forwarding the
architecture information to the guest would make sense. At the least,
the very basic principle of "if the guest is to run on a limited set of
processors (nodes), allocate memory for the guest from that (those)
node(s)" would make a lot of sense. 

[Note that I'm by no means a NUMA expert - I just happen to work for
AMD, which happens to have a ccNUMA architecture]. 

--
Mats
> 
> Emmanuel.
> 
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-01-15 17:21 ` Anthony Liguori
  2007-01-16 10:47   ` Petersson, Mats
@ 2007-01-16 14:51   ` ron minnich
  1 sibling, 0 replies; 35+ messages in thread
From: ron minnich @ 2007-01-16 14:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: xen-devel, David Pilger, Ryan Harper

On 1/15/07, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:

> No.  NUMA standards Non-Uniform Memory Architecture.  It's basically a
> system where you have nodes (which are essentially independent
> computers) that are connected via a high speed bus.  Each node has it's
> own memory but through the magic of NUMA, every node can access the
> other nodes memory as if it's own.  Most NUMA systems (if not all) are
> very high end servers.

no, at this point, most NUMA servers are probably Alienware desktops
for gamers :-) Especially now that Dell is selling them. Opteron
brought NUMA into the mainstream in a big way. And desktop unit sales
trump all supercomputer sales :-) Us poor supercomputer types are,
once again, in the noise where dollar volume is concerned.

And, Linux has known for some time how to exploit the NUMA-ness of
these Opteron systems. There is even an ACPI table entry, SRAT, to
describe the NUMA-ness of a machine.
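
As an aside (my addition, not Ron's): the topology Linux builds from
SRAT is visible from user space through libnuma, which can report the
node count and the relative access distances. A minimal sketch, assuming
a Linux box with libnuma, built with -lnuma:

/* Print the node count and the ACPI-derived node distance matrix. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("kernel reports no NUMA support\n");
        return 0;
    }

    int nodes = numa_max_node() + 1;
    printf("%d NUMA node(s)\n", nodes);

    /* numa_distance() returns the ACPI-style relative cost: 10 means
     * local, larger values mean farther away. */
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}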


>
> > 2. Does it have a real effect on the performance of Xen?
>
> On a NUMA system, absolutely.  If you have a domain running on a
> particular node, you want to make sure that it's using memory that's in
> it's node if at all possible.  Accessing memory on a local node is
> considerably faster than access memory on other nodes.

Right, but it's not even close to a factor of two on a desktop machine
like a dual Opteron. It's still worth being NUMA-aware, however.

thanks

ron


* Re: Re: NUMA and SMP
  2007-01-16 14:19       ` Petersson, Mats
@ 2007-01-16 16:13         ` Emmanuel Ackaouy
  2007-01-16 16:30           ` Petersson, Mats
  0 siblings, 1 reply; 35+ messages in thread
From: Emmanuel Ackaouy @ 2007-01-16 16:13 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: Anthony Liguori, xen-devel, David Pilger, Ryan Harper

On Jan 16, 2007, at 15:19, Petersson, Mats wrote:
>> There is a strong argument for making hypervisors and OSes NUMA
>> aware in the sense that:
>> 1- They know about system topology
>> 2- They can export this information up the stack to applications and
>> users
>> 3- They can take in directives from users and applications to
>> partition
>> the
>>      host and place some threads and memory in specific partitions.
>> 4- They use an interleaved (or random) initial memory
>> placement strategy
>>      by default.
>>
>> The argument that the OS on its own -- without user or application
>> directives -- can make better placement decisions than round-robin or
>> random placement is -- in my opinion -- flawed.
>
> Debatable - it depends a lot on WHAT applications you expect to run, 
> and
> how they behave. If you consider an application that frequently
> allocates and de-allocates memory dynamically in a single threaded
> process (say compiler), then allocating memory in the local node should
> be the "first choice".
>
> Multithreaded apps can use a similar approach, if a thread is 
> allocating
> memory, it's often a good chance that the memory is being used by that
> thread too [although this doesn't work for message passing between
> threads, obviously, this is again a case where "knowledge from the app"
> will be the only better solution than "random"].
>
> This approach is by far not perfect, but if you consider that
> applications often do short term allocations, it makes sense to 
> allocate
> on the local node if possible.

I do not agree.

Just because a thread happens to run on processor X when
it first faults in a page off the process' heap doesn't give you
a good indication that the memory will be used mostly by
this thread or that the thread will continue running on the
same processor. There are at least as many cases where
this assumption is invalid as where it is valid. Without any
solid indication that something else will work better, round-robin
allocation has to be the default strategy.

Also, if you allow one process to consume a large percentage
of one node's memory, you are indirectly hurting all competing
multi-threaded apps which benefit from higher total memory
bandwidth when they spread their data across nodes.

I understand your point that if a single-threaded process quickly
shrinks its heap after growing it, it is less likely that it will
migrate to a different processor while it is using this memory. I'm
not sure how you predict at allocation time that memory will be
quickly released, though. Even if you could, I maintain you would
still need safeguards in place to balance that process' needs
with those of competing multi-threaded apps benefiting from the
memory bandwidth scaling with the number of hosting nodes.

You could try to compromise and allocate round-robin starting
locally, perhaps with diminishing strides as the total allocation
grows (i.e. allocate locally and progressively move towards a page
round-robin scheme as more memory is requested). I'm not sure
this would do any better than plain old dumb round-robin in the
average case, but it's worth a thought.
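
For what it's worth, here is a toy sketch (my own, purely illustrative,
with made-up thresholds) of that "start local, fade towards round-robin"
idea, expressed as the node-selection function an allocator might call
for each new page:

/* Early pages go to the local node; as the process's footprint grows,
 * the choice fades into plain round-robin across all nodes. */
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE    4096UL
#define LOCAL_BYTES  (16UL << 20)   /* first 16 MB: stay local            */
#define BLEND_BYTES  (64UL << 20)   /* up to 64 MB: every 2nd page local  */

static int pick_node(size_t allocated_so_far, int local_node, int nr_nodes)
{
    size_t page_idx = allocated_so_far / PAGE_SIZE;

    if (allocated_so_far < LOCAL_BYTES)
        return local_node;                  /* pure local placement   */

    if (allocated_so_far < BLEND_BYTES && (page_idx & 1))
        return local_node;                  /* diminishing local bias */

    return (int)(page_idx % nr_nodes);      /* plain round-robin      */
}

int main(void)
{
    /* Show which node would host each 8 MB step of a growing heap,
     * assuming local node 0 on a 4-node box. */
    for (size_t bytes = 0; bytes < (96UL << 20); bytes += (8UL << 20))
        printf("%3zu MB -> node %d\n", bytes >> 20, pick_node(bytes, 0, 4));
    return 0;
}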


> However, supporting NUMA in the Hypervisor and forwarding arch-info to
> the guest would make sense. At the least the very basic principle of: 
> If
> the guest is to run on a limited set of processors (nodes), allocate
> memory from that (those) node(s) for the guest would make a lot of
> sense.

I suspect there is widespread agreement on this point.


* RE: Re: NUMA and SMP
  2007-01-16 16:13         ` Emmanuel Ackaouy
@ 2007-01-16 16:30           ` Petersson, Mats
  0 siblings, 0 replies; 35+ messages in thread
From: Petersson, Mats @ 2007-01-16 16:30 UTC (permalink / raw)
  To: Emmanuel Ackaouy; +Cc: Anthony Liguori, xen-devel, David Pilger, Ryan Harper

> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xensource.com] 
> Sent: 16 January 2007 16:14
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> On Jan 16, 2007, at 15:19, Petersson, Mats wrote:
> >> There is a strong argument for making hypervisors and OSes NUMA
> >> aware in the sense that:
> >> 1- They know about system topology
> >> 2- They can export this information up the stack to 
> applications and
> >> users
> >> 3- They can take in directives from users and applications to
> >> partition
> >> the
> >>      host and place some threads and memory in specific partitions.
> >> 4- They use an interleaved (or random) initial memory
> >> placement strategy
> >>      by default.
> >>
> >> The argument that the OS on its own -- without user or application
> >> directives -- can make better placement decisions than 
> round-robin or
> >> random placement is -- in my opinion -- flawed.
> >
> > Debatable - it depends a lot on WHAT applications you 
> expect to run, 
> > and
> > how they behave. If you consider an application that frequently
> > allocates and de-allocates memory dynamically in a single threaded
> > process (say compiler), then allocating memory in the local 
> node should
> > be the "first choice".
> >
> > Multithreaded apps can use a similar approach, if a thread is 
> > allocating
> > memory, it's often a good chance that the memory is being 
> used by that
> > thread too [although this doesn't work for message passing between
> > threads, obviously, this is again a case where "knowledge 
> from the app"
> > will be the only better solution than "random"].
> >
> > This approach is by far not perfect, but if you consider that
> > applications often do short term allocations, it makes sense to 
> > allocate
> > on the local node if possible.
> 
> I do not agree.
> 
> Just because a thread happens to run on processor X when
> it first faults in a page off the process' heap doesn't give you
> a good indication that the memory will be used mostly by
> this thread or that the thread will continue running on the
> same processor. There are at least as many cases when
> this assumption is invalid than when it is valid. Without any
> solid indication that something else will work better, round
> robin allocation has to be the default strategy.

My guess would be that noticeably more than 50% of all (user-mode)
memory allocations are released within a shorter time than the time
quantum used by the scheduler - which in itself means the thread is most
likely not going to move from one processor to another in between
(although an interrupt may, of course, reschedule and move the thread to
another processor). These memory allocations are also usually small, but
there may be many of them in any second of the machine's runtime. Note
that I haven't made any effort to verify this guess, so if you have data
that contradicts my view, then by all means disregard my thoughts!

> 
> Also, if you allow one process to consume a large percentage
> of one node's memory, you are indirectly hurting all competing
> multi-threaded apps which benefit from higher total memory
> bandwidth when they spread their data across nodes.

Yes. That's definitely one of the drawbacks of this method. 

> 
> I understand your point that if a single threaded process quickly
> shrinks its heap after growing it, it makes it less likely 
> that it will
> migrate to a different processor while it is using this memory. I'm
> not sure how you predict that memory will be quickly released at
> allocation time though. Even if you could, I maintain you would
> still need safeguards in place to balance that process' needs
> with that of competing multi-threaded apps benefiting from the
> memory bandwidth scaling with number of hosting nodes.

See above "guesswork". 
> 
> You could try and compromise and allocate round robin starting
> locally and perhaps with diminishing strides as the total allocation
> grows (ie allocate local and progressively move towards a page
> round robin scheme as more memory is requested). I'm not sure
> this would do any better than plain old dumb round robin in the
> average case but it's worth a thought.

That's definitely not a bad idea.

Also, it's probably not a bad idea to have at least two choices:
"Allocate on closest processor" and "Round robin" (or "random" -
apparently random is a better approach than LRU for cache-line
replacement, where LRU tends to work very badly in some cases, so it may
be a better approach than round-robin for the same reason). 
> 
> 
> > However, supporting NUMA in the Hypervisor and forwarding 
> arch-info to
> > the guest would make sense. At the least the very basic 
> principle of: 
> > If
> > the guest is to run on a limited set of processors (nodes), allocate
> > memory from that (those) node(s) for the guest would make a lot of
> > sense.
> 
> I suspect there is widespread agreement on this point.
> 
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-01-16 13:55     ` Emmanuel Ackaouy
  2007-01-16 14:19       ` Petersson, Mats
@ 2007-03-20 13:10       ` tgh
  2007-03-20 13:19         ` Petersson, Mats
  2007-03-20 13:51         ` Daniel Stodden
  1 sibling, 2 replies; 35+ messages in thread
From: tgh @ 2007-03-20 13:10 UTC (permalink / raw)
  To: Emmanuel Ackaouy
  Cc: Ryan Harper, Petersson, Mats, xen-devel, David Pilger, Anthony Liguori

I am puzzled: what is page migration?
Thank you in advance


Emmanuel Ackaouy wrote:
> On the topic of NUMA:
>
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.
>
> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.
>
> To illustrate, consider an app which lays out a bunch of data in memory
> in a single thread and then spawns worker threads to process it.
>
> Is the OS to place memory close to the initial thread? How can it 
> possibly
> know how many threads will eventually process the data?
>
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will work on an
> assigned data subset (and if so, which one) or if it will act as a 
> pipeline
> stage with all the data being passed from one thread to the next.
>
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.
>
> And it gets worse for you when your app is successfully taking advantage
> of the memory cache hierarchy because its performance is less impacted
> by raw memory latency and bandwidth.
>
> Things also get more difficult on a time-sharing host with competing
> apps.
>
> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications and 
> users
> 3- They can take in directives from users and applications to 
> partition the
> host and place some threads and memory in specific partitions.
> 4- They use an interleaved (or random) initial memory placement strategy
> by default.
>
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.
>
> I also am skeptical that the complexity associated with page migration
> strategies would be worthwhile: If you got it wrong the first time, what
> makes you think you'll do better this time?
>
> Emmanuel.
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
>


* RE: Re: NUMA and SMP
  2007-03-20 13:10       ` tgh
@ 2007-03-20 13:19         ` Petersson, Mats
  2007-03-20 13:49           ` tgh
  2007-03-20 13:51         ` Daniel Stodden
  1 sibling, 1 reply; 35+ messages in thread
From: Petersson, Mats @ 2007-03-20 13:19 UTC (permalink / raw)
  To: tgh, Emmanuel Ackaouy
  Cc: Anthony Liguori, xen-devel, David Pilger, Ryan Harper

> -----Original Message-----
> From: tgh [mailto:tianguanhua@ncic.ac.cn] 
> Sent: 20 March 2007 13:10
> To: Emmanuel Ackaouy
> Cc: Petersson, Mats; Anthony Liguori; xen-devel; David 
> Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> I am puzzled ,what is the page migration?
> Thank you in advance

I'm not entirely sure it's the correct term, but I used it to mean that if you allocate some memory local to processor X, and then later on the page is used by processor Y, one could consider "moving" the page from the memory region of X to the memory region of Y. So you "migrate" the page from one processor to another. This is of course not a "free" operation, and it's only really helpful if the memory is accessed many times (and not cached each time it's accessed). 

A case where this can be done "almost for free" is when a page is swapped out: on its return, allocate the page on the node of the processor that made the access. But of course, if you're looking for ultimate performance, swapping is a terrible idea - so making small optimizations in memory management when you're losing tons of cycles to swapping is meaningless as an overall performance gain. 

--
Mats
> 
> 
> Emmanuel Ackaouy wrote:
> > On the topic of NUMA:
> >
> > I'd like to dispute the assumption that a NUMA-aware OS can actually
> > make good decisions about the initial placement of memory in a
> > reasonable hardware ccNUMA system.
> >
> > How does the OS know on which node a particular chunk of memory
> > will be most accessed? The truth is that unless the application or
> > person running the application is herself NUMA-aware and can provide
> > placement hints or directives, the OS will seldom beat a 
> round-robin /
> > interleave or random placement strategy.
> >
> > To illustrate, consider an app which lays out a bunch of 
> data in memory
> > in a single thread and then spawns worker threads to process it.
> >
> > Is the OS to place memory close to the initial thread? How can it 
> > possibly
> > know how many threads will eventually process the data?
> >
> > Even if the OS knew how many threads will eventually crunch 
> the data,
> > it cannot possibly know at placement time if each thread 
> will work on an
> > assigned data subset (and if so, which one) or if it will act as a 
> > pipeline
> > stage with all the data being passed from one thread to the next.
> >
> > If you go beyond initial memory placement or start 
> considering memory
> > migration, then it's even harder to win because you have to pay copy
> > and stall penalties during migrations. So you have to be real smart
> > about predicting the future to do better than your ~10-40% memory
> > bandwidth and latency hit associated with doing simple memory
> > interleaving on a modern hardware-ccNUMA system.
> >
> > And it gets worse for you when your app is successfully 
> taking advantage
> > of the memory cache hierarchy because its performance is 
> less impacted
> > by raw memory latency and bandwidth.
> >
> > Things also get more difficult on a time-sharing host with competing
> > apps.
> >
> > There is a strong argument for making hypervisors and OSes NUMA
> > aware in the sense that:
> > 1- They know about system topology
> > 2- They can export this information up the stack to 
> applications and 
> > users
> > 3- They can take in directives from users and applications to 
> > partition the
> > host and place some threads and memory in specific partitions.
> > 4- They use an interleaved (or random) initial memory 
> placement strategy
> > by default.
> >
> > The argument that the OS on its own -- without user or application
> > directives -- can make better placement decisions than 
> round-robin or
> > random placement is -- in my opinion -- flawed.
> >
> > I also am skeptical that the complexity associated with 
> page migration
> > strategies would be worthwhile: If you got it wrong the 
> first time, what
> > makes you think you'll do better this time?
> >
> > Emmanuel.
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
> >
> >
> 
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-03-20 13:19         ` Petersson, Mats
@ 2007-03-20 13:49           ` tgh
  2007-03-20 15:50             ` Petersson, Mats
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-20 13:49 UTC (permalink / raw)
  To: Petersson, Mats
  Cc: Anthony Liguori, xen-devel, Emmanuel Ackaouy, David Pilger, Ryan Harper

Thank you for your reply.

I see. And does Xen support NUMA-aware guest Linux now, or will it in
the future?

Another question, which maybe should be another topic:
what is the function of xc_map_foreign_range() in /tools/libxc/xc_linux.c?
Does xc_map_foreign_range() mmap shared memory with another domain, or
with domain0, or something else?

Could you help me?
Thanks in advance

Petersson, Mats wrote:
>> -----Original Message-----
>> From: tgh [mailto:tianguanhua@ncic.ac.cn] 
>> Sent: 20 March 2007 13:10
>> To: Emmanuel Ackaouy
>> Cc: Petersson, Mats; Anthony Liguori; xen-devel; David 
>> Pilger; Ryan Harper
>> Subject: Re: [Xen-devel] Re: NUMA and SMP
>>
>> I am puzzled ,what is the page migration?
>> Thank you in advance
>>     
>
> I'm not entirely sure it's the correct term, but I used to indicate that if you allocate some memory local to processor no X, and then later on, the page is used by processor Y, then one could consider "moving" the page from the memory region of X to the memory region of Y. So you "migrate" the page from one processor to another. This is of course not a "free" operation, and it's only really helpful if the memory is accessed many times (and not cached each time it's accessed). 
>
> A case where this can be done "almost for free" is when a page is swapped out, and on return, allocate the page from the processor that made the access. But of course, if you're looking for ultimate performance, swapping is a terrible idea - so making small optimizations in memory management when you're loosing tons of cycles by swapping is meaningless as a overall performance gain. 
>
> --
> Mats
>   
>> Emmanuel Ackaouy wrote:
>>     
>>> On the topic of NUMA:
>>>
>>> I'd like to dispute the assumption that a NUMA-aware OS can actually
>>> make good decisions about the initial placement of memory in a
>>> reasonable hardware ccNUMA system.
>>>
>>> How does the OS know on which node a particular chunk of memory
>>> will be most accessed? The truth is that unless the application or
>>> person running the application is herself NUMA-aware and can provide
>>> placement hints or directives, the OS will seldom beat a 
>>>       
>> round-robin /
>>     
>>> interleave or random placement strategy.
>>>
>>> To illustrate, consider an app which lays out a bunch of 
>>>       
>> data in memory
>>     
>>> in a single thread and then spawns worker threads to process it.
>>>
>>> Is the OS to place memory close to the initial thread? How can it 
>>> possibly
>>> know how many threads will eventually process the data?
>>>
>>> Even if the OS knew how many threads will eventually crunch 
>>>       
>> the data,
>>     
>>> it cannot possibly know at placement time if each thread 
>>>       
>> will work on an
>>     
>>> assigned data subset (and if so, which one) or if it will act as a 
>>> pipeline
>>> stage with all the data being passed from one thread to the next.
>>>
>>> If you go beyond initial memory placement or start 
>>>       
>> considering memory
>>     
>>> migration, then it's even harder to win because you have to pay copy
>>> and stall penalties during migrations. So you have to be real smart
>>> about predicting the future to do better than your ~10-40% memory
>>> bandwidth and latency hit associated with doing simple memory
>>> interleaving on a modern hardware-ccNUMA system.
>>>
>>> And it gets worse for you when your app is successfully 
>>>       
>> taking advantage
>>     
>>> of the memory cache hierarchy because its performance is 
>>>       
>> less impacted
>>     
>>> by raw memory latency and bandwidth.
>>>
>>> Things also get more difficult on a time-sharing host with competing
>>> apps.
>>>
>>> There is a strong argument for making hypervisors and OSes NUMA
>>> aware in the sense that:
>>> 1- They know about system topology
>>> 2- They can export this information up the stack to 
>>>       
>> applications and 
>>     
>>> users
>>> 3- They can take in directives from users and applications to 
>>> partition the
>>> host and place some threads and memory in specific partitions.
>>> 4- They use an interleaved (or random) initial memory 
>>>       
>> placement strategy
>>     
>>> by default.
>>>
>>> The argument that the OS on its own -- without user or application
>>> directives -- can make better placement decisions than 
>>>       
>> round-robin or
>>     
>>> random placement is -- in my opinion -- flawed.
>>>
>>> I also am skeptical that the complexity associated with 
>>>       
>> page migration
>>     
>>> strategies would be worthwhile: If you got it wrong the 
>>>       
>> first time, what
>>     
>>> makes you think you'll do better this time?
>>>
>>> Emmanuel.
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>>       
>>
>>
>>     
>
>
>
>
>   


* Re: Re: NUMA and SMP
  2007-03-20 13:10       ` tgh
  2007-03-20 13:19         ` Petersson, Mats
@ 2007-03-20 13:51         ` Daniel Stodden
  2007-03-21  1:08           ` tgh
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-20 13:51 UTC (permalink / raw)
  To: tgh; +Cc: Xen Developers

On Tue, 2007-03-20 at 21:10 +0800, tgh wrote:
> I am puzzled ,what is the page migration?
> Thank you in advance

NUMA is clear? NUMA distributes main memory across multiple memory
interfaces.

This used to be a feature reserved for high-end multiprocessor
architectures, but in servers it is becoming sort of a commodity these
days, in part due to AMD multiprocessor systems being NUMA systems.
AMD64 processors carry an integrated memory controller, so if you buy an
SMP machine with AMD processors today, you'll find each slice of the
total memory connected to a different processor inside.

Note that this doesn't break the 'symmetric' in 'SMP': it still remains
a global, flat physical address space. The processors have interconnects
by which memory can be read from remote processors as well, and will do
so transparently to system and application software.

[The alternative is the 'classic' model: multiple processors
interconnected to make SMP, but a single memory interface in a single
northbridge (Intel would call it the "MCH") on the front-side bus,
connecting all processors to main memory. Obviously, that single memory
interface will easily become a bottleneck if all processors try to
access memory simultaneously.]

NUMA *may* help here: accessing local memory is very fast. Accessing
remote memory is still pretty fast, but not as fast as it could be:
hence 'NUMA' - non-uniform memory access.

So, in order to take advantage of such a memory topology, memory data
would ideally be always at the CPU where the processing happens. But
processes (or domains, regarding xen) may migrate between different
processors. Whether this happens depends on scheduling decisions.
There's a cost involved in migration itself, so schedulers will do it
ideally only if it really-makes-sense(TM).

In order to keep a NUMA system happy, pages once allocated could be
moved as well, to where the current CPU is. This is page migration.
As you may imagine, it is even more costly, and unfortunately completely
useless if CPU migration needs to happen on a regular basis. Therefore
it's difficult to get it right. Getting it right depends on how much the
scheduler and memory management know about where the memory asked for
will be needed -- in advance. This is the hardest part: most software
won't tell, because the programming models employed today do not even
recognize the fact that it may matter. Even if they did, in many
cases it would be difficult to predict at all.
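
To make "page migration" concrete (my addition, not Daniel's): Linux
exposes the mechanism through the move_pages(2) interface, which libnuma
wraps as numa_move_pages(). A minimal sketch, assuming Linux with
libnuma and at least two NUMA nodes, built with -lnuma:

/* Migrate one page of the calling process from node 0 to node 1, then
 * ask the kernel where it actually ended up. */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1)
        return 1;

    /* Allocate and fault in one page on node 0. */
    char *page = numa_alloc_onnode(4096, 0);
    memset(page, 0, 4096);

    void *pages[1]  = { page };
    int   target[1] = { 1 };    /* desired destination node          */
    int   status[1] = { -1 };   /* filled in with the resulting node */

    /* pid 0 = current process; MPOL_MF_MOVE only moves pages that are
     * not shared with another process. */
    if (numa_move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) != 0)
        perror("numa_move_pages");
    else
        printf("page now on node %d\n", status[0]);

    numa_free(page, 4096);
    return 0;
}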

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B


* RE: Re: NUMA and SMP
  2007-03-20 13:49           ` tgh
@ 2007-03-20 15:50             ` Petersson, Mats
  2007-03-20 16:45               ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: Petersson, Mats @ 2007-03-20 15:50 UTC (permalink / raw)
  To: tgh; +Cc: xen-devel

> -----Original Message-----
> From: tgh [mailto:tianguanhua@ncic.ac.cn] 
> Sent: 20 March 2007 13:50
> To: Petersson, Mats
> Cc: Emmanuel Ackaouy; Anthony Liguori; xen-devel; David 
> Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> Thank you for your reply
> 
> I see
> and does xen support the numa-aware guestlinux now or in the future?

There is no support in current Xen for NUMA-awareness, and for the guest to understand NUMA-ness in the system, Xen must have sufficient understanding to forward the relevant information to the guest. 

> 
> another question maybe should be another topic
> what is the function of xc_map_foreign_range()in 
> /tools/libxc/xc_linux.c?
> does xc_map_foreign_range() mmap the shared memory with another domain
> ,or with domain0 ,orsomething?

It maps a shared memory region with the domain specified by "domid". 
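
For orientation (my addition; the exact prototypes changed across Xen
releases, so treat the calls below as an assumption based on the 3.x-era
libxc headers and check your own xenctrl.h): a privileged process in
dom0 typically opens the control interface and maps a frame of the
target domain roughly like this, with the domid and machine frame number
purely hypothetical:

/* Hedged sketch: map one machine frame of another domain into this
 * process.  Requires dom0 privileges; link against libxenctrl. */
#include <xenctrl.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t domid = 1;             /* hypothetical target domain  */
    unsigned long mfn = 0x12345;    /* hypothetical machine frame  */

    int xc_handle = xc_interface_open();
    if (xc_handle < 0) {
        perror("xc_interface_open");
        return 1;
    }

    void *map = xc_map_foreign_range(xc_handle, domid, 4096,
                                     PROT_READ | PROT_WRITE, mfn);
    if (map == NULL) {
        fprintf(stderr, "xc_map_foreign_range failed\n");
    } else {
        /* ... read or write the guest frame through 'map' ... */
        munmap(map, 4096);
    }

    xc_interface_close(xc_handle);
    return 0;
}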

--
Mats

> 
> could you help me
> Thanks in advance
> 
Petersson, Mats wrote:
> >> -----Original Message-----
> >> From: tgh [mailto:tianguanhua@ncic.ac.cn] 
> >> Sent: 20 March 2007 13:10
> >> To: Emmanuel Ackaouy
> >> Cc: Petersson, Mats; Anthony Liguori; xen-devel; David 
> >> Pilger; Ryan Harper
> >> Subject: Re: [Xen-devel] Re: NUMA and SMP
> >>
> >> I am puzzled ,what is the page migration?
> >> Thank you in advance
> >>     
> >
> > I'm not entirely sure it's the correct term, but I used to 
> indicate that if you allocate some memory local to processor 
> no X, and then later on, the page is used by processor Y, 
> then one could consider "moving" the page from the memory 
> region of X to the memory region of Y. So you "migrate" the 
> page from one processor to another. This is of course not a 
> "free" operation, and it's only really helpful if the memory 
> is accessed many times (and not cached each time it's accessed). 
> >
> > A case where this can be done "almost for free" is when a 
> page is swapped out, and on return, allocate the page from 
> the processor that made the access. But of course, if you're 
> looking for ultimate performance, swapping is a terrible idea 
> - so making small optimizations in memory management when 
> you're loosing tons of cycles by swapping is meaningless as a 
> overall performance gain. 
> >
> > --
> > Mats
> >   
> >> Emmanuel Ackaouy wrote:
> >>     
> >>> On the topic of NUMA:
> >>>
> >>> I'd like to dispute the assumption that a NUMA-aware OS 
> can actually
> >>> make good decisions about the initial placement of memory in a
> >>> reasonable hardware ccNUMA system.
> >>>
> >>> How does the OS know on which node a particular chunk of memory
> >>> will be most accessed? The truth is that unless the application or
> >>> person running the application is herself NUMA-aware and 
> can provide
> >>> placement hints or directives, the OS will seldom beat a 
> >>>       
> >> round-robin /
> >>     
> >>> interleave or random placement strategy.
> >>>
> >>> To illustrate, consider an app which lays out a bunch of 
> >>>       
> >> data in memory
> >>     
> >>> in a single thread and then spawns worker threads to process it.
> >>>
> >>> Is the OS to place memory close to the initial thread? How can it 
> >>> possibly
> >>> know how many threads will eventually process the data?
> >>>
> >>> Even if the OS knew how many threads will eventually crunch 
> >>>       
> >> the data,
> >>     
> >>> it cannot possibly know at placement time if each thread 
> >>>       
> >> will work on an
> >>     
> >>> assigned data subset (and if so, which one) or if it will 
> act as a 
> >>> pipeline
> >>> stage with all the data being passed from one thread to the next.
> >>>
> >>> If you go beyond initial memory placement or start 
> >>>       
> >> considering memory
> >>     
> >>> migration, then it's even harder to win because you have 
> to pay copy
> >>> and stall penalties during migrations. So you have to be 
> real smart
> >>> about predicting the future to do better than your ~10-40% memory
> >>> bandwidth and latency hit associated with doing simple memory
> >>> interleaving on a modern hardware-ccNUMA system.
> >>>
> >>> And it gets worse for you when your app is successfully 
> >>>       
> >> taking advantage
> >>     
> >>> of the memory cache hierarchy because its performance is 
> >>>       
> >> less impacted
> >>     
> >>> by raw memory latency and bandwidth.
> >>>
> >>> Things also get more difficult on a time-sharing host 
> with competing
> >>> apps.
> >>>
> >>> There is a strong argument for making hypervisors and OSes NUMA
> >>> aware in the sense that:
> >>> 1- They know about system topology
> >>> 2- They can export this information up the stack to 
> >>>       
> >> applications and 
> >>     
> >>> users
> >>> 3- They can take in directives from users and applications to 
> >>> partition the
> >>> host and place some threads and memory in specific partitions.
> >>> 4- They use an interleaved (or random) initial memory 
> >>>       
> >> placement strategy
> >>     
> >>> by default.
> >>>
> >>> The argument that the OS on its own -- without user or application
> >>> directives -- can make better placement decisions than 
> >>>       
> >> round-robin or
> >>     
> >>> random placement is -- in my opinion -- flawed.
> >>>
> >>> I also am skeptical that the complexity associated with 
> >>>       
> >> page migration
> >>     
> >>> strategies would be worthwhile: If you got it wrong the 
> >>>       
> >> first time, what
> >>     
> >>> makes you think you'll do better this time?
> >>>
> >>> Emmanuel.
> >>>
> >>>
> >>> _______________________________________________
> >>> Xen-devel mailing list
> >>> Xen-devel@lists.xensource.com
> >>> http://lists.xensource.com/xen-devel
> >>>
> >>>
> >>>       
> >>
> >>
> >>     
> >
> >
> >
> >
> >   
> 
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-03-20 15:50             ` Petersson, Mats
@ 2007-03-20 16:45               ` Ryan Harper
  2007-03-20 16:47                 ` Petersson, Mats
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-20 16:45 UTC (permalink / raw)
  To: Petersson, Mats; +Cc: xen-devel, tgh

* Petersson, Mats <Mats.Petersson@amd.com> [2007-03-20 11:33]:
> > -----Original Message-----
> > From: tgh [mailto:tianguanhua@ncic.ac.cn] 
> > Sent: 20 March 2007 13:50
> > To: Petersson, Mats
> > Cc: Emmanuel Ackaouy; Anthony Liguori; xen-devel; David 
> > Pilger; Ryan Harper
> > Subject: Re: [Xen-devel] Re: NUMA and SMP
> > 
> > Thank you for your reply
> > 
> > I see
> > and does xen support the numa-aware guestlinux now or in the future?
> 
> There is no support in current Xen for NUMA-awareness, and for the
> guest to understand NUMA-ness in the system, Xen must have sufficient
> understanding to forward the relevant information to the guest. 

As of Xen 3.0.4, Xen has support for detecting NUMA systems, for parsing
the SRAT tables which indicate how memory and CPUs are split up between
the system's NUMA nodes, and for allocating memory local to a particular
CPU.  To use NUMA, one must pass numa=on on the Xen command line.

Xen still lacks a NUMA-aware scheduler, so one must be sure to pin vcpus
and keep the guest within a NUMA node.  This is done using the cpus=""
parameter in the guest config file.
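
For illustration (my addition; the CPU numbering below is made up and
must match your machine's actual node-to-CPU layout), a guest config
fragment of the sort described here might look like:

    # Xen command line (e.g. in the grub entry): enable NUMA support
    #   kernel /boot/xen.gz numa=on ...

    # Guest config: keep the domain on node 0, assuming node 0 owns
    # physical cpus 0-3.
    name   = "numa-guest"
    memory = 1024
    vcpus  = 2
    cpus   = "0-3"     # pin the guest's vcpus to node 0's cpus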

Xen doesn't export any of the topology information it gleans from the
SRAT table at the moment.


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com


* RE: Re: NUMA and SMP
  2007-03-20 16:45               ` Ryan Harper
@ 2007-03-20 16:47                 ` Petersson, Mats
  0 siblings, 0 replies; 35+ messages in thread
From: Petersson, Mats @ 2007-03-20 16:47 UTC (permalink / raw)
  To: Ryan Harper; +Cc: xen-devel, tgh

 

> -----Original Message-----
> From: Ryan Harper [mailto:ryanh@us.ibm.com] 
> Sent: 20 March 2007 16:46
> To: Petersson, Mats
> Cc: tgh; xen-devel
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> * Petersson, Mats <Mats.Petersson@amd.com> [2007-03-20 11:33]:
> > > -----Original Message-----
> > > From: tgh [mailto:tianguanhua@ncic.ac.cn] 
> > > Sent: 20 March 2007 13:50
> > > To: Petersson, Mats
> > > Cc: Emmanuel Ackaouy; Anthony Liguori; xen-devel; David 
> > > Pilger; Ryan Harper
> > > Subject: Re: [Xen-devel] Re: NUMA and SMP
> > > 
> > > Thank you for your reply
> > > 
> > > I see
> > > and does xen support the numa-aware guestlinux now or in 
> the future?
> > 
> > There is no support in current Xen for NUMA-awareness, and for the
> > guest to understand NUMA-ness in the system, Xen must have 
> sufficient
> > understanding to forward the relevant information to the guest. 
> 
> As of Xen 3.0.4, Xen has support for detecting NUMA systems, parsing
> SRAT tables which indicate how memory and cpu are split up between the
> system NUMA nodes, support for allocating memory local to a particular
> cpu.  To use NUMA, one must pass numa=on on the xen command line.
> 
> Xen still lacks a NUMA-aware scheduler, so one must be sure 
> to pin vcpus
> and keep your guest within a NUMA node.  This is done using 
> the cpus=""
> parameter in the guest config file.
> 
> Xen doesn't export any of the topology information is gleans from the
> SRAT table at the moment.

Thanks for the update - I must have missed that it went in. 

--
Mats
> 
> 
> -- 
> Ryan Harper
> Software Engineer; Linux Technology Center
> IBM Corp., Austin, Tx
> (512) 838-9253   T/L: 678-9253
> ryanh@us.ibm.com
> 
> 
> 


* Re: Re: NUMA and SMP
  2007-03-20 13:51         ` Daniel Stodden
@ 2007-03-21  1:08           ` tgh
  2007-03-21  2:45             ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-21  1:08 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen Developers

Thank you for your reply


Daniel Stodden wrote:
> On Tue, 2007-03-20 at 21:10 +0800, tgh wrote:
>   
>> I am puzzled ,what is the page migration?
>> Thank you in advance
>>     
>
> NUMA is clear? NUMA distributes main memory across multiple memory
> interfaces.
>
> This used to be a feature reserved to high-end multiprocessor
> architectures, but in servers it is becoming sort of a commodity these
> days, in part due AMD multiprocessor systems being NUMA systems these
> days. AMD64 processors carry an integrated memory controller. So, if you
> buy an SMP machine with AMD processors today, you'd find each slice of
> the total memory being connected to a different processor inside.
>
> Note that this doesn't break the 'symmetric' in 'SMP': it still remains
> a global, flat physical address space. The processors have interconnects
> by which memory can be read from remote processors as well, and will do
> so transparently to system and application software.
>   
That is, in an SMP system with AMD64, it is NUMA in the hardware
architecture while SMP in the system software - is that right?

Thank you in advance

> [The alternative is rather the 'classic' model: Multiple processors
> interconnected making SMP, but  single memory interface in a single
> northbridge (Intel would call it the "MCH") connecting to the front-side
> bus, connecting all processors them to main memory. Obviously, that
> single memory interface will easily become a bottleneck, if all
> processors try to access memory simultaneously.]
>
> NUMA *may* help here: accessing local memory is very fast. Acessing
> remote memory is still pretty fast, but not as fast as it could be:
> hence 'NUMA' - non-uniform memory access.
>
> So, in order to take advantage of such a memory topology, memory data
> would ideally always be at the CPU where the processing happens. But
> processes (or domains, in the case of xen) may migrate between different
> processors. Whether this happens depends on scheduling decisions.
> There's a cost involved in migration itself, so schedulers will do it
> ideally only if it really-makes-sense(TM).
>
> In order to keep a NUMA system happy, pages once allocated could be
> moved as well, to where the current CPU is. This is page migration.
> As you may imagine, this is even more costly, and unfortunately completely
> useless if cpu migration needs to happen on a regular basis. Therefore
> it's difficult to get it right. Getting it right depends on how much the
> scheduler and memory management know about where the memory asked for
> will be needed -- in advance. This is the hardest part: Most software
> won't tell, because the programming models employed today do not even
> recognize the fact that it may matter. Even if they did, in many
> cases it would be difficult to predict at all.
>
> regards,
> daniel
>
>   
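The last point above -- that most software never tells the kernel where
its memory will be needed -- can be made concrete with libnuma on Linux.
A minimal sketch, assuming libnuma is installed and node 0 exists (the
node number and allocation size are arbitrary example values):

/* build with: gcc -o onnode onnode.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
        size_t len = 16UL << 20;            /* 16 MB, arbitrary */
        char *buf;

        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support on this kernel\n");
                return 1;
        }

        numa_run_on_node(0);                /* run this process on node 0's cpus */
        buf = numa_alloc_onnode(len, 0);    /* and take its memory from node 0   */
        if (!buf)
                return 1;

        buf[0] = 42;                        /* touch it so the pages really land */
        numa_free(buf, len);
        return 0;
}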

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-21  1:08           ` tgh
@ 2007-03-21  2:45             ` Daniel Stodden
  2007-03-22  1:16               ` tgh
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-21  2:45 UTC (permalink / raw)
  To: tgh; +Cc: Xen Developers

On Wed, 2007-03-21 at 09:08 +0800, tgh wrote:
> Thank you for your reply
> 
> 
> Daniel Stodden wrote:
> > On Tue, 2007-03-20 at 21:10 +0800, tgh wrote:
> >   
> >> I am puzzled ,what is the page migration?
> >> Thank you in advance
> >>     
> >
> > NUMA is clear? NUMA distributes main memory across multiple memory
> > interfaces.
> >
> > This used to be a feature reserved to high-end multiprocessor
> > architectures, but in servers it is becoming sort of a commodity these
> > days, in part due AMD multiprocessor systems being NUMA systems these
> > days. AMD64 processors carry an integrated memory controller. So, if you
> > buy an SMP machine with AMD processors today, you'd find each slice of
> > the total memory being connected to a different processor inside.
> >
> > Note that this doesn't break the 'symmetric' in 'SMP': it still remains
> > a global, flat physical address space. The processors have interconnects
> > by which memory can be read from remote processors as well, and will do
> > so transparently to system and application software.
> >   
> that is ,in the smp with adm64,it is a numa in the hardware 
> architecture,while a smp in the system software,is it right?

%}
i believe you mean the right thing. it remains a regular smp
architecture. system software remains smp.

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-21  2:45             ` Daniel Stodden
@ 2007-03-22  1:16               ` tgh
  2007-03-22 10:42                 ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-22  1:16 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen Developers

Thank you for your reply

>>>   
>>>       
>> that is ,in the smp with adm64,it is a numa in the hardware 
>> architecture,while a smp in the system software,is it right?
>>     
>
> %}
> i believe you mean the right thing. it remains a regular smp
> architecture. system software remains smp.
>
>   
In Linux, one node (struct pglist_data) has many zones (struct
zone_struct), and a zone has many pages (struct page), is that right?
On an SMP system with AMD64, with NUMA hardware, if Linux runs as an SMP
OS, is there only one node (struct pglist_data) at runtime, or are there
as many nodes as cpus in the system? Does SMP Linux support two or more
nodes at runtime, or does Linux support the NUMA feature in this case?

I am confused

could you help me
Thanks in advance


> regards,
> daniel
>
>   

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22  1:16               ` tgh
@ 2007-03-22 10:42                 ` Daniel Stodden
  2007-03-22 12:13                   ` tgh
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-22 10:42 UTC (permalink / raw)
  To: tgh; +Cc: Xen Developers

On Thu, 2007-03-22 at 09:16 +0800, tgh wrote:
> Thank you for reply
> 
> >>>   
> >>>       
> >> that is ,in the smp with adm64,it is a numa in the hardware 
> >> architecture,while a smp in the system software,is it right?
> >>     
> >
> > %}
> > i believe you mean the right thing. it remains a regular smp
> > architecture. system software remains smp.
> >
> >   
> in the linux ,a one node (struct pglist_data) has many zone(struct 
> zone_struct),a zone has many page(struct page),is it right?

right.

> in the smp with adm64 with the hardware of numa, its linux is a smp os 
> ,then there is only one node (struct pglist_data) in the os when running 
> or  there are as many nodes as cpus in the system,  does linux smp 
> support two or more nodes when running? or in this case linux support 
> numa feature?

the number of nodes corresponds to the number of memory areas which
allocators need to distinguish. in the case of integrated memory
controllers like amd64, expect to find as many nodes as there are
processors. 

yes, linux has numa support. see linux/Documentation/vm/
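to make the node/zone/page hierarchy concrete, here is a rough sketch of
kernel-side code that walks it (2.6-era names like NODE_DATA, node_zones
and present_pages; an illustration, not code from any particular tree):

/* toy module: print how many pages each populated zone of each node has */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>

static int __init dump_nodes_init(void)
{
        int nid, i;

        for_each_online_node(nid) {                  /* one struct pglist_data per node */
                pg_data_t *pgdat = NODE_DATA(nid);

                for (i = 0; i < MAX_NR_ZONES; i++) { /* each node holds several zones */
                        struct zone *z = &pgdat->node_zones[i];

                        if (z->present_pages)        /* zones in turn manage struct pages */
                                printk(KERN_INFO "node %d, zone %d: %lu pages\n",
                                       nid, i, z->present_pages);
                }
        }
        return 0;
}

static void __exit dump_nodes_exit(void) { }

module_init(dump_nodes_init);
module_exit(dump_nodes_exit);
MODULE_LICENSE("GPL");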

daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 10:42                 ` Daniel Stodden
@ 2007-03-22 12:13                   ` tgh
  2007-03-22 12:28                     ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-22 12:13 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen Developers

Thank you for your reply



>   
>> in the smp with adm64 with the hardware of numa, its linux is a smp os 
>> ,then there is only one node (struct pglist_data) in the os when running 
>> or  there are as many nodes as cpus in the system,  does linux smp 
>> support two or more nodes when running? or in this case linux support 
>> numa feature?
>>     
>
> the number of nodes corresponds to the number of memory areas which
> allocators need distinguish. in the case of integrated memory
> controllers like amd64, expect to find as many nodes as there are
> processors. 
>
> yes, linux has numa support. see linux/Documentation/vm/
>   

Linux has NUMA support, and in the case of integrated memory controllers
like amd64, CONFIG_NUMA should be selected so that Linux supports NUMA,
is that right? If CONFIG_NUMA is not selected, then Linux could not work
well in the case of integrated memory controllers, even for amd64, is
that right? And the paravirt xen-guest-linux does not support
NUMA-awareness now, or does it support it if CONFIG_NUMA is selected?

Thanks in advance
> daniel
>
>   

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 12:13                   ` tgh
@ 2007-03-22 12:28                     ` Daniel Stodden
  2007-03-22 13:02                       ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-22 12:28 UTC (permalink / raw)
  To: tgh; +Cc: Xen Developers

On Thu, 2007-03-22 at 20:13 +0800, tgh wrote:
> Thank you for your reply
> 
> 
> 
> >   
> >> in the smp with adm64 with the hardware of numa, its linux is a smp os 
> >> ,then there is only one node (struct pglist_data) in the os when running 
> >> or  there are as many nodes as cpus in the system,  does linux smp 
> >> support two or more nodes when running? or in this case linux support 
> >> numa feature?
> >>     
> >
> > the number of nodes corresponds to the number of memory areas which
> > allocators need distinguish. in the case of integrated memory
> > controllers like amd64, expect to find as many nodes as there are
> > processors. 
> >
> > yes, linux has numa support. see linux/Documentation/vm/
> >   
> 
> linux has numa support,and in the case of integrated memory controllers 
> like amd64, CONFIG_NUMA should be choice and linux support numa,is it 
> right? if CONFIG_NUMA is not choice ,then linux could not work well in 
> the case of integrated memory controllers even for amd64,is it right? 


> and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> does it  support  numa-aware  if CONFIG_NUMA choiced?

no. there is numa support in xen, as far as inspection of the memory
topology and inclusion in the memory management is concerned. so
basically, you can add the desired node number to get_free_pages().

there is numa support in linux to a somewhat larger degree.
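for instance, the node-aware allocator entry points on the linux side
look roughly like this (a sketch of the in-kernel calls, not the xen
hypervisor interface mentioned above; the node id and sizes are
arbitrary example values):

/* sketch: ask the allocators for memory near node 'nid' */
#include <linux/gfp.h>
#include <linux/slab.h>

static int numa_alloc_example(int nid)
{
        /* one page taken preferably from node nid's zones */
        struct page *pg = alloc_pages_node(nid, GFP_KERNEL, 0);
        /* a slab object with the same placement hint */
        void *obj = kmalloc_node(256, GFP_KERNEL, nid);

        if (pg)
                __free_pages(pg, 0);
        kfree(obj);                      /* kfree(NULL) is a no-op */
        return (pg && obj) ? 0 : -1;
}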

but...

config NUMA
       bool "Non Uniform Memory Access (NUMA) Support"
       depends on SMP && !X86_64_XEN

...there is no numa-support for paravirtual kernels at this point in
time.

see, you can't just switch it on and expect anything to improve. the vm
may typically see a subset of the cpus/nodes physically available, with
no reflection on their mapping to physical nodes. page migration between
logical cpus is pointless if logical cpus migrate across physical ones,
right?

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 12:28                     ` Daniel Stodden
@ 2007-03-22 13:02                       ` Ryan Harper
  2007-03-22 14:56                         ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-22 13:02 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen Developers, tgh

* Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 07:29]:
> On Thu, 2007-03-22 at 20:13 +0800, tgh wrote:
> > Thank you for your reply
> > 
> > 
> > 
> > >   
> > >> in the smp with adm64 with the hardware of numa, its linux is a smp os 
> > >> ,then there is only one node (struct pglist_data) in the os when running 
> > >> or  there are as many nodes as cpus in the system,  does linux smp 
> > >> support two or more nodes when running? or in this case linux support 
> > >> numa feature?
> > >>     
> > >
> > > the number of nodes corresponds to the number of memory areas which
> > > allocators need distinguish. in the case of integrated memory
> > > controllers like amd64, expect to find as many nodes as there are
> > > processors. 
> > >
> > > yes, linux has numa support. see linux/Documentation/vm/
> > >   
> > 
> > linux has numa support,and in the case of integrated memory controllers 
> > like amd64, CONFIG_NUMA should be choice and linux support numa,is it 
> > right? if CONFIG_NUMA is not choice ,then linux could not work well in 
> > the case of integrated memory controllers even for amd64,is it right? 
> 
> 
> > and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> > does it  support  numa-aware  if CONFIG_NUMA choiced?
> 
> no. there is num-support in xen, as far as inspection of the memory
> topology and inclusion in the memory management is concerned. so
> basically, you can add the desired node number to get_free_pages().

There has been NUMA support in the Xen hypervisor since 3.0.4, and we have
the capability to ensure a guest's memory is local to the processors being
used.  The topology of the system is not exported to the guest, so
CONFIG_NUMA in the guest kernel config will be of no value.


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 13:02                       ` Ryan Harper
@ 2007-03-22 14:56                         ` Daniel Stodden
  2007-03-22 15:12                           ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-22 14:56 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Xen Developers

On Thu, 2007-03-22 at 08:02 -0500, Ryan Harper wrote:

> > > and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> > > does it  support  numa-aware  if CONFIG_NUMA choiced?
> > 
> > no. there is num-support in xen, as far as inspection of the memory
> > topology and inclusion in the memory management is concerned. so
> > basically, you can add the desired node number to get_free_pages().
> 
> There is NUMA support in Xen since 3.0.4 in the hypervisor, and we have
> the capability to ensure a guest memory is local to the processors being
> used.  The topology of the system is not exported to the guest so
> CONFIG_NUMA in the guest kernel config will be of no value.

oops, that's more than i've noticed. thanks for the correction. so now
it seems up to me to ask questions. :} i don't see that path taken along
vcpu_migrate. where is it happening?

cheers,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 14:56                         ` Daniel Stodden
@ 2007-03-22 15:12                           ` Ryan Harper
  2007-03-22 15:38                             ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-22 15:12 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ryan Harper, Xen Developers

* Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 10:03]:
> On Thu, 2007-03-22 at 08:02 -0500, Ryan Harper wrote:
> 
> > > > and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> > > > does it  support  numa-aware  if CONFIG_NUMA choiced?
> > > 
> > > no. there is num-support in xen, as far as inspection of the memory
> > > topology and inclusion in the memory management is concerned. so
> > > basically, you can add the desired node number to get_free_pages().
> > 
> > There is NUMA support in Xen since 3.0.4 in the hypervisor, and we have
> > the capability to ensure a guest memory is local to the processors being
> > used.  The topology of the system is not exported to the guest so
> > CONFIG_NUMA in the guest kernel config will be of no value.
> 
> oops, that's more than i've noticed. thanks for the correction. so now
> it seems up to me to ask questions. :} i don't see that path taken along
> vcpu_migrate. where is it happening?

The credit scheduler is not NUMA-aware.  So to ensure that the initial
allocation for the guest remains local, the domain uses a cpumask
(generated from the cpus="" config file option) to keep the scheduler from
migrating vcpus to off-node cpus.
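The same restriction can also be applied or checked at runtime with xm
(domain name and cpu numbers are hypothetical example values):

# pin each vcpu of the guest to node 0's physical cpus
xm vcpu-pin guest1 0 0-3
xm vcpu-pin guest1 1 0-3
# show which cpus each vcpu is currently allowed to run on
xm vcpu-list guest1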

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 15:12                           ` Ryan Harper
@ 2007-03-22 15:38                             ` Daniel Stodden
  2007-03-22 16:01                               ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: Daniel Stodden @ 2007-03-22 15:38 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Xen Developers

On Thu, 2007-03-22 at 10:12 -0500, Ryan Harper wrote:
> * Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 10:03]:
> > On Thu, 2007-03-22 at 08:02 -0500, Ryan Harper wrote:
> > 
> > > > > and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> > > > > does it  support  numa-aware  if CONFIG_NUMA choiced?
> > > > 
> > > > no. there is num-support in xen, as far as inspection of the memory
> > > > topology and inclusion in the memory management is concerned. so
> > > > basically, you can add the desired node number to get_free_pages().
> > > 
> > > There is NUMA support in Xen since 3.0.4 in the hypervisor, and we have
> > > the capability to ensure a guest memory is local to the processors being
> > > used.  The topology of the system is not exported to the guest so
> > > CONFIG_NUMA in the guest kernel config will be of no value.
> > 
> > oops, that's more than i've noticed. thanks for the correction. so now
> > it seems up to me to ask questions. :} i don't see that path taken along
> > vcpu_migrate. where is it happening?
> 
> The credit scheduler is not NUMA aware.  So to ensure that the initial
> allocation for the guest remains local, the domain uses a cpumask
> (generated from cpus="" config file option)  to keep the scheduler from
> migrating vcpus to off-node cpus.

i see. not like i'm too deep in the srat, but methinks there may be sane
default values to be generated from the srat. any work happening or
having happened on that front?

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 15:38                             ` Daniel Stodden
@ 2007-03-22 16:01                               ` Ryan Harper
  2007-03-22 16:22                                 ` Daniel Stodden
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-22 16:01 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ryan Harper, Xen Developers

* Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 10:41]:
> On Thu, 2007-03-22 at 10:12 -0500, Ryan Harper wrote:
> > * Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 10:03]:
> > > On Thu, 2007-03-22 at 08:02 -0500, Ryan Harper wrote:
> > > 
> > > > > > and for the paravirt xen-guest-linux do not support  numa-aware  now ,or 
> > > > > > does it  support  numa-aware  if CONFIG_NUMA choiced?
> > > > > 
> > > > > no. there is num-support in xen, as far as inspection of the memory
> > > > > topology and inclusion in the memory management is concerned. so
> > > > > basically, you can add the desired node number to get_free_pages().
> > > > 
> > > > There is NUMA support in Xen since 3.0.4 in the hypervisor, and we have
> > > > the capability to ensure a guest memory is local to the processors being
> > > > used.  The topology of the system is not exported to the guest so
> > > > CONFIG_NUMA in the guest kernel config will be of no value.
> > > 
> > > oops, that's more than i've noticed. thanks for the correction. so now
> > > it seems up to me to ask questions. :} i don't see that path taken along
> > > vcpu_migrate. where is it happening?
> > 
> > The credit scheduler is not NUMA aware.  So to ensure that the initial
> > allocation for the guest remains local, the domain uses a cpumask
> > (generated from cpus="" config file option)  to keep the scheduler from
> > migrating vcpus to off-node cpus.
> 
> i see. not like i'm too deep in the srat, but methinks there may be sane
> default values to be generated from the srat. any work happening or
> having happened on that front?

Xen understands the topology but does not make any direct use of the
information, either in the initial placement of VCPUs for the guest or
in the scheduler when making migration decisions.  I'm not aware of any
work to address that at the moment.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 16:01                               ` Ryan Harper
@ 2007-03-22 16:22                                 ` Daniel Stodden
  2007-03-22 17:02                                   ` Ryan Harper
  2007-03-28 21:25                                   ` The context switch overhead comparison between vmexit/vmentry and hypercall Liang Yang
  0 siblings, 2 replies; 35+ messages in thread
From: Daniel Stodden @ 2007-03-22 16:22 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Xen Developers

On Thu, 2007-03-22 at 11:01 -0500, Ryan Harper wrote:

> > i see. not like i'm too deep in the srat, but methinks there may be sane
> > default values to be generated from the srat. any work happening or
> > having happened on that front?
> 
> Xen understands the topology but does not make any direct use of the
> information in either the initial placement of VCPUs for the guest, nor
> in the scheduler when making migration decisions.  I'm not aware of any
> work to address that at the moment.

thanks. do you continue work on xen and numa, or proceed elsewhere?

regards,
daniel

-- 
Daniel Stodden
LRR     -      Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München             D-85748 Garching
http://www.lrr.in.tum.de/~stodden         mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33  3D80 457E 82AE B0D8 735B

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 16:22                                 ` Daniel Stodden
@ 2007-03-22 17:02                                   ` Ryan Harper
  2007-03-23  5:47                                     ` tgh
  2007-03-28 21:25                                   ` The context switch overhead comparison between vmexit/vmentry and hypercall Liang Yang
  1 sibling, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-22 17:02 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen Developers

* Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 11:25]:
> On Thu, 2007-03-22 at 11:01 -0500, Ryan Harper wrote:
> 
> > > i see. not like i'm too deep in the srat, but methinks there may be sane
> > > default values to be generated from the srat. any work happening or
> > > having happened on that front?
> > 
> > Xen understands the topology but does not make any direct use of the
> > information in either the initial placement of VCPUs for the guest, nor
> > in the scheduler when making migration decisions.  I'm not aware of any
> > work to address that at the moment.
> 
> thanks. do you continue work on xen and numa, or proceed elsewhere?

I continue to work on xen and keep an eye on the NUMA support.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-22 17:02                                   ` Ryan Harper
@ 2007-03-23  5:47                                     ` tgh
  2007-03-23 14:42                                       ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-23  5:47 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Xen Developers, Daniel Stodden

hi
How many NUMA nodes does Xen support on amd64 at present?

Thank you



Ryan Harper wrote:
> * Daniel Stodden <stodden@cs.tum.edu> [2007-03-22 11:25]:
>   
>> On Thu, 2007-03-22 at 11:01 -0500, Ryan Harper wrote:
>>
>>     
>>>> i see. not like i'm too deep in the srat, but methinks there may be sane
>>>> default values to be generated from the srat. any work happening or
>>>> having happened on that front?
>>>>         
>>> Xen understands the topology but does not make any direct use of the
>>> information in either the initial placement of VCPUs for the guest, nor
>>> in the scheduler when making migration decisions.  I'm not aware of any
>>> work to address that at the moment.
>>>       
>> thanks. do you continue work on xen and numa, or proceed elsewhere?
>>     
>
> I continue to work on xen and keep and eye on the NUMA support.
>
>   

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-23  5:47                                     ` tgh
@ 2007-03-23 14:42                                       ` Ryan Harper
  2007-03-23 14:48                                         ` Petersson, Mats
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Harper @ 2007-03-23 14:42 UTC (permalink / raw)
  To: tgh; +Cc: Xen Developers, Daniel Stodden

* tgh <tianguanhua@ncic.ac.cn> [2007-03-23 00:48]:
> hi
> how many nodes in the numa with adm64 does xen support at present?

in xen/include/asm-x86/numa.h:
#define NODE_SHIFT 6

in xen/include/xen/numa.h:
#define MAX_NUMNODES (1 << NODE_SHIFT)

which works out to 64 nodes.  I don't know if anyone has tested more
than an 8 node system.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: Re: NUMA and SMP
  2007-03-23 14:42                                       ` Ryan Harper
@ 2007-03-23 14:48                                         ` Petersson, Mats
  2007-03-28  1:50                                           ` tgh
  0 siblings, 1 reply; 35+ messages in thread
From: Petersson, Mats @ 2007-03-23 14:48 UTC (permalink / raw)
  To: Ryan Harper, tgh; +Cc: Xen Developers, Daniel Stodden

 

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> Ryan Harper
> Sent: 23 March 2007 14:43
> To: tgh
> Cc: Xen Developers; Daniel Stodden
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> * tgh <tianguanhua@ncic.ac.cn> [2007-03-23 00:48]:
> > hi
> > how many nodes in the numa with adm64 does xen support at present?
> 
> in xen/include/asm-x86/numa.h:
> #define NODE_SHIFT=6
> 
> #in xen/include/xen/numa.h:
> #define MAX_NUMNODES = (1 << NODE_SHIFT);
> 
> which works out to 64 nodes.  I don't know if anyone has tested more
> than an 8 node system.

Of course, if we're talking AMD64 systems where a NODE is a socket, the
currently available architecture supports 8 NODES, so there's plenty of
space to grow such a system. I think there are plans to grow this, but I
doubt that the limit above will be reached anytime soon. 


Even if a node is a core within a CPU, the current limit of 8 sockets
will limit the number of cores in a system to 32 cores when the
quad-core processors become available. So still sufficient to support
any current architecture.

--
Mats
> 
> -- 
> Ryan Harper
> Software Engineer; Linux Technology Center
> IBM Corp., Austin, Tx
> (512) 838-9253   T/L: 678-9253
> ryanh@us.ibm.com
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
> 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-23 14:48                                         ` Petersson, Mats
@ 2007-03-28  1:50                                           ` tgh
  2007-03-28  2:01                                             ` Ryan Harper
  0 siblings, 1 reply; 35+ messages in thread
From: tgh @ 2007-03-28  1:50 UTC (permalink / raw)
  To: Xen Developers; +Cc: Petersson, Mats, Ryan Harper, Daniel Stodden

hi
Xen does not support NUMA-aware guest Linux, is that right?
And there are memory-hotplug.c and migration.c in Linux 2.6.20; does
that mean that Linux supports memory hotplug or not?
If it does, does Linux have to be NUMA-aware to support memory hotplug,
or could an SMP Linux support memory hotplug?

I am confused about it.

Could you help me?
Thanks in advance




Petersson, Mats wrote:
>  
>
>   
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com 
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
>> Ryan Harper
>> Sent: 23 March 2007 14:43
>> To: tgh
>> Cc: Xen Developers; Daniel Stodden
>> Subject: Re: [Xen-devel] Re: NUMA and SMP
>>
>> * tgh <tianguanhua@ncic.ac.cn> [2007-03-23 00:48]:
>>     
>>> hi
>>> how many nodes in the numa with adm64 does xen support at present?
>>>       
>> in xen/include/asm-x86/numa.h:
>> #define NODE_SHIFT=6
>>
>> #in xen/include/xen/numa.h:
>> #define MAX_NUMNODES = (1 << NODE_SHIFT);
>>
>> which works out to 64 nodes.  I don't know if anyone has tested more
>> than an 8 node system.
>>     
>
> Of course, if we're talking AMD64 systems, if a NODE is a socket, the
> currently available architecture supports 8 NODES, so there's plenty of
> space to grow such a system. I think there's plans to grow this, but I
> doubt that the limit above will be reached anytime soon. 
>
>
> Even if a node is a core within a CPU, the current limit of 8 sockets
> will limit the number of cores in a system to 32 cores when the
> quad-core processors become available. So still sufficient to support
> any current architecture.
>
> --
> Mats
>   
>> -- 
>> Ryan Harper
>> Software Engineer; Linux Technology Center
>> IBM Corp., Austin, Tx
>> (512) 838-9253   T/L: 678-9253
>> ryanh@us.ibm.com
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>
>>
>>     
>
>
>
>
>   

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Re: NUMA and SMP
  2007-03-28  1:50                                           ` tgh
@ 2007-03-28  2:01                                             ` Ryan Harper
  0 siblings, 0 replies; 35+ messages in thread
From: Ryan Harper @ 2007-03-28  2:01 UTC (permalink / raw)
  To: tgh; +Cc: Petersson, Mats, Xen Developers, Daniel Stodden, Ryan Harper

* tgh <tianguanhua@ncic.ac.cn> [2007-03-27 20:51]:
> hi
> xen does not support numa-aware guest linux, is it right?

You can have NUMA enabled in your guest, but Xen does not export
something like a virtual SRAT table that your NUMA-aware guest could use
to determine if its memory and cpu were in two different nodes.

> and there are memory-hotplug.c and migration.c in the linux2.6.20, does 
> it means that linux could support the hotplug memory or not ?

I don't know the current state of memory hotplug in Linux.

> if it could ,does linux have to be numa-aware to support memory hotplug 

I don't believe supporting memory hotplug is related to NUMA.

> or a smp linux could support memory hotplug?

SMP linux isn't related to memory hotplug either.  

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* The context switch overhead comparison between vmexit/vmentry and hypercall.
  2007-03-22 16:22                                 ` Daniel Stodden
  2007-03-22 17:02                                   ` Ryan Harper
@ 2007-03-28 21:25                                   ` Liang Yang
  1 sibling, 0 replies; 35+ messages in thread
From: Liang Yang @ 2007-03-28 21:25 UTC (permalink / raw)
  To: Xen Developers

Hi,

If I am just considering the pure context switch overhead, which one is
bigger: using HW vmexit/vmentry to do root and non-root mode switches by
programming the VT-x vectors, or using a SW hypercall to inject an
interrupt to switch from ring 1 to ring 0 (or ring 3 to ring 0 for a
64-bit OS)? Does the switch between ring 1 and ring 0 have the same
overhead as the switch between ring 3 and ring 0?

BTW, both root and non-root mode have four rings. If ring 0 and ring 3 in
non-root mode are used for the guest OS kernel and user applications, which
ring level in root mode will be used when doing a vmexit?

Thanks,

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2007-03-28 21:25 UTC | newest]

Thread overview: 35+ messages
2007-01-14 11:55 NUMA and SMP David Pilger
2007-01-14 19:00 ` Ryan Harper
2007-01-15 17:21 ` Anthony Liguori
2007-01-16 10:47   ` Petersson, Mats
2007-01-16 13:55     ` Emmanuel Ackaouy
2007-01-16 14:19       ` Petersson, Mats
2007-01-16 16:13         ` Emmanuel Ackaouy
2007-01-16 16:30           ` Petersson, Mats
2007-03-20 13:10       ` tgh
2007-03-20 13:19         ` Petersson, Mats
2007-03-20 13:49           ` tgh
2007-03-20 15:50             ` Petersson, Mats
2007-03-20 16:45               ` Ryan Harper
2007-03-20 16:47                 ` Petersson, Mats
2007-03-20 13:51         ` Daniel Stodden
2007-03-21  1:08           ` tgh
2007-03-21  2:45             ` Daniel Stodden
2007-03-22  1:16               ` tgh
2007-03-22 10:42                 ` Daniel Stodden
2007-03-22 12:13                   ` tgh
2007-03-22 12:28                     ` Daniel Stodden
2007-03-22 13:02                       ` Ryan Harper
2007-03-22 14:56                         ` Daniel Stodden
2007-03-22 15:12                           ` Ryan Harper
2007-03-22 15:38                             ` Daniel Stodden
2007-03-22 16:01                               ` Ryan Harper
2007-03-22 16:22                                 ` Daniel Stodden
2007-03-22 17:02                                   ` Ryan Harper
2007-03-23  5:47                                     ` tgh
2007-03-23 14:42                                       ` Ryan Harper
2007-03-23 14:48                                         ` Petersson, Mats
2007-03-28  1:50                                           ` tgh
2007-03-28  2:01                                             ` Ryan Harper
2007-03-28 21:25                                   ` The context switch overhead comparison between vmexit/vmentry and hypercall Liang Yang
2007-01-16 14:51   ` Re: NUMA and SMP ron minnich
