* Xen 3.4.1 NUMA support
@ 2009-11-04 12:02 Papagiannis Anastasios
  2009-11-04 12:32 ` Keir Fraser
  0 siblings, 1 reply; 30+ messages in thread
From: Papagiannis Anastasios @ 2009-11-04 12:02 UTC (permalink / raw)
  To: xen-devel

Hello,

does the latest version of Xen (3.4.1) support NUMA machines? Is there a PDF
or a link that can give me some more details about that? I am working on a
project on Xen performance on NUMA machines, and with Xen 3.3.0 this
performance isn't good. Has anything changed in the latest version?

Thanks in advance,
Papagiannis Anastasios


* Re: Xen 3.4.1 NUMA support
  2009-11-04 12:02 Xen 3.4.1 NUMA support Papagiannis Anastasios
@ 2009-11-04 12:32 ` Keir Fraser
  2009-11-06 18:07   ` Dan Magenheimer
  0 siblings, 1 reply; 30+ messages in thread
From: Keir Fraser @ 2009-11-04 12:32 UTC (permalink / raw)
  To: Papagiannis Anastasios, xen-devel

Add Xen boot parameter 'numa=on' to enable NUMA detection. Then it's up to
you to, for example, pin domains to specific nodes, using the 'cpus=...'
option in the domain config file. See /etc/xen/xmexample1 for an example of
its usage.
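
For example (a minimal sketch; the paths, memory size, VCPU count and CPU
range below are just placeholders -- only numa=on and cpus= are the
relevant bits):

  # Xen boot line in GRUB, enabling NUMA detection:
  #   kernel /boot/xen.gz numa=on dom0_mem=512M
  # Domain config file (same format as /etc/xen/xmexample1), restricting
  # the guest's vCPUs to pCPUs 0-3, e.g. the CPUs of one node:
  vcpus = 2
  cpus = "0-3"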

 -- Keir

On 04/11/2009 12:02, "Papagiannis Anastasios" <apapag@ics.forth.gr> wrote:

> Hello,
> 
> does the last version of Xen(3.4.1) support NUMA machines? Is there a .pdf
> or a link that can give me some more details about that? I work on a
> project for xen performace in numa machines. And in xen 3.3.0 this
> performance isn't good. Have something changed in last version?
> 
> Thanks in advance,
> Papagiannis Anastasios
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


* RE: Xen 3.4.1 NUMA support
  2009-11-04 12:32 ` Keir Fraser
@ 2009-11-06 18:07   ` Dan Magenheimer
  2009-11-09 11:33     ` George Dunlap
  2009-11-09 15:02     ` Andre Przywara
  0 siblings, 2 replies; 30+ messages in thread
From: Dan Magenheimer @ 2009-11-06 18:07 UTC (permalink / raw)
  To: Keir Fraser, Papagiannis Anastasios, xen-devel; +Cc: George Dunlap

VMware has the notion of a "cell", where VMs can be
scheduled only within a cell, not across cells.
Cell boundaries are determined by VMware by
default, though certain settings can override them.

An interesting project might be to implement
"numa=cell" for Xen.... or maybe something similar
is already in George Dunlap's scheduler plans?

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Wednesday, November 04, 2009 5:33 AM
> To: Papagiannis Anastasios; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Xen 3.4.1 NUMA support
> 
> 
> Add Xen boot parameter 'numa=on' to enable NUMA detection. 
> Then it's up to
> you to, for example, pin domains to specific nodes, using the 
> 'cpus=...'
> option in the domain config file. See /etc/xen/xmexample1 for 
> an example of
> its usage.
> 
>  -- Keir
> 
> On 04/11/2009 12:02, "Papagiannis Anastasios" 
> <apapag@ics.forth.gr> wrote:
> 
> > Hello,
> > 
> > does the last version of Xen(3.4.1) support NUMA machines? 
> Is there a .pdf
> > or a link that can give me some more details about that? I work on a
> > project for xen performace in numa machines. And in xen 3.3.0 this
> > performance isn't good. Have something changed in last version?
> > 
> > Thanks in advance,
> > Papagiannis Anastasios
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


* Re: Xen 3.4.1 NUMA support
  2009-11-06 18:07   ` Dan Magenheimer
@ 2009-11-09 11:33     ` George Dunlap
  2009-11-09 11:39       ` Dulloor
  2009-11-09 11:44       ` Juergen Gross
  2009-11-09 15:02     ` Andre Przywara
  1 sibling, 2 replies; 30+ messages in thread
From: George Dunlap @ 2009-11-09 11:33 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel, Keir Fraser, Papagiannis Anastasios

I haven't had time to look at NUMA stuff at all.  I probably will look 
at it eventually, if no one else does, but I'd be happy if someone else 
could pursue it.

 -George

Dan Magenheimer wrote:
> VMware has the notion of a "cell" where VMs can be
> scheduled only within a cell, not across cells.
> Cell boundaries are determined by VMware by
> default, though certains settings can override them.
>
> An interesting project might be to implement
> "numa=cell" for Xen.... or maybe something similar
> is already in George Dunlap's scheduler plans?
>
>   
>> -----Original Message-----
>> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>> Sent: Wednesday, November 04, 2009 5:33 AM
>> To: Papagiannis Anastasios; xen-devel@lists.xensource.com
>> Subject: Re: [Xen-devel] Xen 3.4.1 NUMA support
>>
>>
>> Add Xen boot parameter 'numa=on' to enable NUMA detection. 
>> Then it's up to
>> you to, for example, pin domains to specific nodes, using the 
>> 'cpus=...'
>> option in the domain config file. See /etc/xen/xmexample1 for 
>> an example of
>> its usage.
>>
>>  -- Keir
>>
>> On 04/11/2009 12:02, "Papagiannis Anastasios" 
>> <apapag@ics.forth.gr> wrote:
>>
>>     
>>> Hello,
>>>
>>> does the last version of Xen(3.4.1) support NUMA machines? 
>>>       
>> Is there a .pdf
>>     
>>> or a link that can give me some more details about that? I work on a
>>> project for xen performace in numa machines. And in xen 3.3.0 this
>>> performance isn't good. Have something changed in last version?
>>>
>>> Thanks in advance,
>>> Papagiannis Anastasios
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>       
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>     


* Re: Xen 3.4.1 NUMA support
  2009-11-09 11:33     ` George Dunlap
@ 2009-11-09 11:39       ` Dulloor
  2009-11-09 12:29         ` George Dunlap
  2009-11-09 11:44       ` Juergen Gross
  1 sibling, 1 reply; 30+ messages in thread
From: Dulloor @ 2009-11-09 11:39 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

George,

What's the current scope and status of your scheduler work? Is it
going to look similar to the Linux scheduler (with scheduling domains,
et al.)? In that case, topology is already accounted for, to a large
extent. It would be good to know, so that I can work on something that
doesn't overlap.

-dulloor

On Mon, Nov 9, 2009 at 6:33 AM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> I haven't had time to look at NUMA stuff at all.  I probably will look at it
> eventually, if no one else does, but I'd be happy if someone else could
> pursue it.
>
> -George
>
> Dan Magenheimer wrote:
>>
>> VMware has the notion of a "cell" where VMs can be
>> scheduled only within a cell, not across cells.
>> Cell boundaries are determined by VMware by
>> default, though certains settings can override them.
>>
>> An interesting project might be to implement
>> "numa=cell" for Xen.... or maybe something similar
>> is already in George Dunlap's scheduler plans?
>>
>>
>>>
>>> -----Original Message-----
>>> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>>> Sent: Wednesday, November 04, 2009 5:33 AM
>>> To: Papagiannis Anastasios; xen-devel@lists.xensource.com
>>> Subject: Re: [Xen-devel] Xen 3.4.1 NUMA support
>>>
>>>
>>> Add Xen boot parameter 'numa=on' to enable NUMA detection. Then it's up
>>> to
>>> you to, for example, pin domains to specific nodes, using the 'cpus=...'
>>> option in the domain config file. See /etc/xen/xmexample1 for an example
>>> of
>>> its usage.
>>>
>>>  -- Keir
>>>
>>> On 04/11/2009 12:02, "Papagiannis Anastasios" <apapag@ics.forth.gr>
>>> wrote:
>>>
>>>
>>>>
>>>> Hello,
>>>>
>>>> does the last version of Xen(3.4.1) support NUMA machines?
>>>
>>> Is there a .pdf
>>>
>>>>
>>>> or a link that can give me some more details about that? I work on a
>>>> project for xen performace in numa machines. And in xen 3.3.0 this
>>>> performance isn't good. Have something changed in last version?
>>>>
>>>> Thanks in advance,
>>>> Papagiannis Anastasios
>>>>
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


* Re: Xen 3.4.1 NUMA support
  2009-11-09 11:33     ` George Dunlap
  2009-11-09 11:39       ` Dulloor
@ 2009-11-09 11:44       ` Juergen Gross
  2009-11-09 12:07         ` George Dunlap
  2009-11-09 12:40         ` Keir Fraser
  1 sibling, 2 replies; 30+ messages in thread
From: Juergen Gross @ 2009-11-09 11:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

Cpupools? :-)

NUMA was a topic I wanted to look at as soon as cpupools are officially
accepted. Keir wanted to propose a way to get rid of the function
continue_hypercall_on_cpu(), which caused most of the objections to
cpupools.
I guess Keir had some higher-priority jobs. :-)
So I will try a new patch for cpupools without continue_hypercall_on_cpu()
and perhaps with NUMA support.
George, would this be okay for you? I think your scheduler will still have
problems with domain weights as long as domains are restricted to some
processors, right?

Juergen

George Dunlap wrote:
> I haven't had time to look at NUMA stuff at all.  I probably will look
> at it eventually, if no one else does, but I'd be happy if someone else
> could pursue it.
> 
> -George
> 
> Dan Magenheimer wrote:
>> VMware has the notion of a "cell" where VMs can be
>> scheduled only within a cell, not across cells.
>> Cell boundaries are determined by VMware by
>> default, though certains settings can override them.
>>
>> An interesting project might be to implement
>> "numa=cell" for Xen.... or maybe something similar
>> is already in George Dunlap's scheduler plans?
>>
>>  
>>> -----Original Message-----
>>> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>>> Sent: Wednesday, November 04, 2009 5:33 AM
>>> To: Papagiannis Anastasios; xen-devel@lists.xensource.com
>>> Subject: Re: [Xen-devel] Xen 3.4.1 NUMA support
>>>
>>>
>>> Add Xen boot parameter 'numa=on' to enable NUMA detection. Then it's
>>> up to
>>> you to, for example, pin domains to specific nodes, using the 'cpus=...'
>>> option in the domain config file. See /etc/xen/xmexample1 for an
>>> example of
>>> its usage.
>>>
>>>  -- Keir
>>>
>>> On 04/11/2009 12:02, "Papagiannis Anastasios" <apapag@ics.forth.gr>
>>> wrote:
>>>
>>>    
>>>> Hello,
>>>>
>>>> does the last version of Xen(3.4.1) support NUMA machines?       
>>> Is there a .pdf
>>>    
>>>> or a link that can give me some more details about that? I work on a
>>>> project for xen performace in numa machines. And in xen 3.3.0 this
>>>> performance isn't good. Have something changed in last version?
>>>>
>>>> Thanks in advance,
>>>> Papagiannis Anastasios
>>>>
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>>>       
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>     
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
> 
> 


-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html


* Re: Xen 3.4.1 NUMA support
  2009-11-09 11:44       ` Juergen Gross
@ 2009-11-09 12:07         ` George Dunlap
  2009-11-09 12:40         ` Keir Fraser
  1 sibling, 0 replies; 30+ messages in thread
From: George Dunlap @ 2009-11-09 12:07 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

On Mon, Nov 9, 2009 at 11:44 AM, Juergen Gross
<juergen.gross@ts.fujitsu.com> wrote:
> George, would this be okay for you? I think your scheduler still will have
> problems with domain weights as long as domains are restricted to some
> processors, right?

Hmm, this may be a point of discussion at some point.

My plan was actually to have one runqueue per L2 processor cache.
Thus as many as 4 cores (and possibly 8 hyperthreads) would be sharing
the same runqueue; doing CPU pinning within the same runqueue would be
problematic.

I was planning on having credits work mainly within one runqueue, and
then doing load balancing between runqueues.  In that case pinning to a
specific runqueue shouldn't cause a problem, because the credits of one
runqueue wouldn't affect the credits of another one.

However, I haven't implemented or tested this idea yet; it's possible
that keeping credits distinct and doing load balancing between
runqueues will cause unacceptable levels of unfairness.  I expect it
to be fine (especially since Linux's scheduler does this kind of load
balancing, but doesn't share runqueues between logical processors),
but without implementation and testing I can't say for sure.

Thoughts are welcome at this point, but it will probably be better to
have a real discussion once I've posted some patches.

 -George


* Re: Xen 3.4.1 NUMA support
  2009-11-09 11:39       ` Dulloor
@ 2009-11-09 12:29         ` George Dunlap
  2009-11-09 12:51           ` Dulloor
  0 siblings, 1 reply; 30+ messages in thread
From: George Dunlap @ 2009-11-09 12:29 UTC (permalink / raw)
  To: Dulloor; +Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

On Mon, Nov 9, 2009 at 11:39 AM, Dulloor <dulloor@gmail.com> wrote:
> What's the current scope and status of your scheduler work ? Is it
> going to look similar to the Linux scheduler (with scheduling domains,
> et al). In that case, topology is already accounted for, to a large
> extent. It would be good to know so that I can work on something that
> doesn't overlap.

My plan was to do something similar to Linux, but with this
difference: instead of having one runqueue per logical processor (as
both Xen and Linux currently do), and having "domains" all the way up
(as Linux currently does), I had planned on having one runqueue per L2
processor cache.  The main reason to avoid migration is to preserve a
warm cache; but since L1s are replaced so quickly, there should be
little impact on a VM migrating between different threads and cores
which share the same L2.

Above the L2s I was planning on having something similar to the Linux
"domains" (although obviously it would need a different name to avoid
confusion), and doing explicit load balancing between them.  But as I
have not had a chance to test this kind of load balancing yet, the
plan may change somewhat before then.

The problem to solve wrt NUMA, as I understand it, is to balance the
performance cost of sharing a busy local CPU against the performance
cost of non-local memory accesses.  This would involve adding NUMA
logic to the load-balancing algorithm.  Which I guess would depend in
part on having a load-balancing algorithm to begin with. :-)
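
To make that concrete, here is a rough sketch of the structure I have in
mind (illustrative only -- the load metric and the fixed remote-memory
penalty are made-up placeholders, not real Xen code):

class RunQueue(object):
    def __init__(self, l2_id, node_id):
        self.l2_id = l2_id        # one runqueue per shared L2 cache
        self.node_id = node_id    # NUMA node this L2 belongs to
        self.vcpus = []           # credits would be accounted within this queue

    def load(self):
        return len(self.vcpus)    # stand-in for a real load metric

def pick_runqueue(runqueues, home_node, remote_penalty=0.25):
    # Balance the cost of sharing a busy local CPU against the cost of
    # non-local memory accesses: remote runqueues get a fixed penalty.
    def cost(rq):
        if rq.node_id == home_node:
            return rq.load()
        return rq.load() + remote_penalty
    return min(runqueues, key=cost)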

Once I have the basic credit patches in working order, would you be
interested in working on the load-balancing between runqueues?  I can
then work on further testing of the credit algorithm.  My ultimate
goal would be to have a basic regression test that people could use to
measure how their changes to the scheduler affect a wide variety of
workloads.

 -George


* Re: Xen 3.4.1 NUMA support
  2009-11-09 11:44       ` Juergen Gross
  2009-11-09 12:07         ` George Dunlap
@ 2009-11-09 12:40         ` Keir Fraser
  1 sibling, 0 replies; 30+ messages in thread
From: Keir Fraser @ 2009-11-09 12:40 UTC (permalink / raw)
  To: Juergen Gross, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Papagiannis Anastasios

On 09/11/2009 11:44, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:

> NUMA was a topic I wanted to look at as soon as cpupools are officially
> accepted. Keir wanted to propose a way to get rid of the function
> continue_hypercall_on_cpu() which was causing most of the stuff leading
> to the objection of cpupools.
> I guess Keir had some higher priority jobs. :-)

Well, I forgot about it. I think the plan was to perhaps keep something like
continue_hypercall_on_cpu(), but without needing to actually run the vcpu
itself 'over there': instead, schedule a tasklet or some such and sleep on its
completion. That would get rid of the skanky affinity hacks you had to do to
support continue_hypercall_on_cpu(). I'll have a look back at what we
discussed.

 -- Keir


* Re: Xen 3.4.1 NUMA support
  2009-11-09 12:29         ` George Dunlap
@ 2009-11-09 12:51           ` Dulloor
  0 siblings, 0 replies; 30+ messages in thread
From: Dulloor @ 2009-11-09 12:51 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

Sure! Let me know when you have the patches ready. Also, that might be a
good time to see whether runq-per-L2 works better.

-dulloor

On Mon, Nov 9, 2009 at 7:29 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Mon, Nov 9, 2009 at 11:39 AM, Dulloor <dulloor@gmail.com> wrote:
>> What's the current scope and status of your scheduler work ? Is it
>> going to look similar to the Linux scheduler (with scheduling domains,
>> et al). In that case, topology is already accounted for, to a large
>> extent. It would be good to know so that I can work on something that
>> doesn't overlap.
>
> My plan was to do something similar to Linux, but with this
> difference: Instead of having one runqueue per logical processor (as
> both Xen and Linux currently do), and having "domains" all the way up
> (as Linux currently does), I had planned on having one runqueue per L2
> processor cache.  The main reason to avoid migration is to preserve a
> warm cache; but since L1's are replaced so quickly, there should be
> little impact to a VM migrating between different threads and cores
> which share the same L2.
>
> Above the L2s I was planning on having an idea similar to the Linux
> "domains" (although obviously it would need a different name to avoid
> confusion), and doing explicit load-balancing between them.  But as I
> have not had a chance to test this kind of load balancing yet, the
> plan may change somewhate before then.
>
> Problems to solve wrt NUMA, as I understand it, are to balance the
> performance cost of sharing a busy local CPU, vs the performance cost
> of non-local memory accesses.  This would involve adding the NUMA
> logic to the load balancing algorithm.  Which I guess would depend in
> part on having a load balancing algorithm to begin with. :-)
>
> Once I have the basic credit patches in working order, would you be
> interested in working on the load-balancing between runqueues?  I can
> then work on further testing of the credit algorithm.  My ultimate
> goal would be to have a basic regression test that people could use to
> measure how their changes to the scheduler affect a wide variety of
> workloads.
>
>  -George
>


* Re: Xen 3.4.1 NUMA support
  2009-11-06 18:07   ` Dan Magenheimer
  2009-11-09 11:33     ` George Dunlap
@ 2009-11-09 15:02     ` Andre Przywara
  2009-11-09 15:06       ` George Dunlap
  2009-11-09 15:19       ` Jan Beulich
  1 sibling, 2 replies; 30+ messages in thread
From: Andre Przywara @ 2009-11-09 15:02 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: George Dunlap, xen-devel, Keir Fraser, Papagiannis Anastasios

Dan Magenheimer wrote:
>> Add Xen boot parameter 'numa=on' to enable NUMA detection. 
>> Then it's up to you to, for example, pin domains to specific nodes, 
>> using the 'cpus=...' option in the domain config file. See
>> /etc/xen/xmexample1 for an example of its usage.
> VMware has the notion of a "cell" where VMs can be
> scheduled only within a cell, not across cells.
> Cell boundaries are determined by VMware by
> default, though certains settings can override them.
Well, if I got this right, then you are describing the current behaviour
of Xen. It has had a similar feature for some time now (since 3.3, I guess).
When you launch a domain on a numa=on machine, it will pick the least
busy node (which can hold the requested memory) and restrict the
domain to that node (by allowing only CPUs of that node).
This is in XendDomainInfo.py (c/s 17131, 17247, 17709).
It looks like this (Xen booted with:
kernel xen.gz numa=on dom0_mem=6144M dom0_max_vcpus=6 dom0_vcpus_pin):
# xm create opensuse.hvm
# xm create opensuse2.hvm
# xm vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
001-LTP                              1     0     6   -b-      17.8 6-11
001-LTP                              1     1     7   -b-       6.3 6-11
002-LTP                              2     0    12   -b-      19.0 12-17
002-LTP                              2     1    16   -b-       1.6 12-17
002-LTP                              2     2    17   -b-       1.7 12-17
002-LTP                              2     3    14   -b-       1.6 12-17
002-LTP                              2     4    16   -b-       1.6 12-17
002-LTP                              2     5    15   -b-       1.5 12-17
002-LTP                              2     6    12   -b-       1.3 12-17
002-LTP                              2     7    13   -b-       1.8 12-17
Domain-0                             0     0     0   -b-      12.6 0
Domain-0                             0     1     1   -b-       7.6 1
Domain-0                             0     2     2   -b-       8.0 2
Domain-0                             0     3     3   -b-      14.6 3
Domain-0                             0     4     4   r--       1.4 4
Domain-0                             0     5     5   -b-       0.9 5
# xm debug-keys U
(XEN) Domain 0 (total: 2097152):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 0
(XEN)     Node 2: 0
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0
(XEN) Domain 1 (total: 394219):
(XEN)     Node 0: 0
(XEN)     Node 1: 394219
(XEN)     Node 2: 0
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0
(XEN) Domain 2 (total: 394219):
(XEN)     Node 0: 0
(XEN)     Node 1: 0
(XEN)     Node 2: 394219
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0

Note that there were no cpus= lines in the config files; Xen did that
automatically.

Domains can be localhost-migrated to another node:
# xm migrate --node=4 1 localhost
The only issue is with domains larger than a node.
If someone has a useful use case, I can start rebasing my old patches
for NUMA-aware HVM domains onto xen-unstable.

Regards,
Andre.

BTW: Shouldn't we set finally numa=on as the default value?

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: Xen 3.4.1 NUMA support
  2009-11-09 15:02     ` Andre Przywara
@ 2009-11-09 15:06       ` George Dunlap
  2009-11-09 22:51         ` Andre Przywara
  2009-11-13 14:14         ` Andre Przywara
  2009-11-09 15:19       ` Jan Beulich
  1 sibling, 2 replies; 30+ messages in thread
From: George Dunlap @ 2009-11-09 15:06 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

Andre Przywara wrote:
> BTW: Shouldn't we set finally numa=on as the default value?
>   
Is there any data to support the idea that this helps significantly on 
common systems?

 -George


* Re: Xen 3.4.1 NUMA support
  2009-11-09 15:02     ` Andre Przywara
  2009-11-09 15:06       ` George Dunlap
@ 2009-11-09 15:19       ` Jan Beulich
  2009-11-10  1:46         ` Ian Pratt
                           ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Jan Beulich @ 2009-11-09 15:19 UTC (permalink / raw)
  To: Andre Przywara, Dan Magenheimer
  Cc: George Dunlap, xen-devel, Keir Fraser, Papagiannis Anastasios

>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
>BTW: Shouldn't we set finally numa=on as the default value?

I'd say no, at least until the default confinement of a guest to a single
node gets fixed to properly deal with guests having more vCPU-s than
a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
not overcommitting CPUs outweigh the drawbacks of cross-node memory
accesses at the very least for CPU-bound workloads).

Jan


* Re: Xen 3.4.1 NUMA support
  2009-11-09 15:06       ` George Dunlap
@ 2009-11-09 22:51         ` Andre Przywara
  2009-11-10  6:56           ` Dulloor
  2009-11-13 14:14         ` Andre Przywara
  1 sibling, 1 reply; 30+ messages in thread
From: Andre Przywara @ 2009-11-09 22:51 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

George Dunlap wrote:
> Andre Przywara wrote:
>> BTW: Shouldn't we set finally numa=on as the default value?
>>   
> Is there any data to support the idea that this helps significantly on 
> common systems?
I don't have any numbers handy, but I will try if I can generate some.

Looking at it from a high-level perspective, it is a shame that it's not the
default: with numa=off the Xen domain loader will allocate physical
memory from some node (maybe even from several nodes) and will schedule
the guest on some other (even rapidly changing) nodes. According to
Murphy's law you will end up with _all_ of a guest's memory accesses
being remote. But in fact a NUMA architecture is really beneficial for
virtualization: as there are close to zero cross-domain memory accesses
(except for Dom0), each node is more or less self-contained and each
guest can use the node's memory controller almost exclusively.
But this is all spoiled because most people don't know about Xen's NUMA
capabilities and don't set numa=on. Making it the default would solve
this.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* RE: Xen 3.4.1 NUMA support
  2009-11-09 15:19       ` Jan Beulich
@ 2009-11-10  1:46         ` Ian Pratt
  2009-11-10  8:51           ` Jan Beulich
  2009-11-12 16:09         ` Keir Fraser
  2009-11-30 15:40         ` [PATCH] tools: avoid over-commitment if numa=on Andre Przywara
  2 siblings, 1 reply; 30+ messages in thread
From: Ian Pratt @ 2009-11-10  1:46 UTC (permalink / raw)
  To: Jan Beulich, Andre Przywara, Dan Magenheimer
  Cc: George Dunlap, Ian Pratt, xen-devel, Keir Fraser, Papagiannis Anastasios

> >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
> >BTW: Shouldn't we set finally numa=on as the default value?
> 
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPU-s than
> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
> not overcommitting CPUs outweigh the drawbacks of cross-node memory
> accesses at the very least for CPU-bound workloads).

What default confinement? I thought guests had an all-pCPUs affinity mask by default?

I suspect we will get benefits from enabling NUMA even if all the guests have all-pCPUs affinity masks: all guests will have memory striped across all nodes, which is likely better than allocating from one node and then the next. Obviously assigning VMs to node(s) and allocating memory accordingly is the best plan.

Ian


* Re: Xen 3.4.1 NUMA support
  2009-11-09 22:51         ` Andre Przywara
@ 2009-11-10  6:56           ` Dulloor
  2009-11-10  7:49             ` Andre Przywara
  0 siblings, 1 reply; 30+ messages in thread
From: Dulloor @ 2009-11-10  6:56 UTC (permalink / raw)
  To: Andre Przywara
  Cc: George Dunlap, Dan Magenheimer, xen-devel, Keir Fraser,
	Papagiannis Anastasios

I can't find this. Can you please point to the code?

numa=on/off is only for setting up NUMA in Xen (similar to the Linux
knob, but turned off by default). The allocation of memory from a
single node (that you observe) could be because of the way
alloc_heap_pages is implemented (it tries to allocate from all the heaps
of a node before trying the next one) - try looking at the dump_numa
output. And affinities are not set anywhere based on the node from
which the allocation happens.

-dulloor

On Mon, Nov 9, 2009 at 5:51 PM, Andre Przywara <andre.przywara@amd.com> wrote:
> George Dunlap wrote:
>>
>> Andre Przywara wrote:
>>>
>>> BTW: Shouldn't we set finally numa=on as the default value?
>>>
>>
>> Is there any data to support the idea that this helps significantly on
>> common systems?
>
> I don't have any numbers handy, but I will try if I can generate some.
>
> Looking from a high level perspective it is a shame that it's not the
> default: With numa=off the Xen domain loader will allocate physical memory
> from some node (maybe even from several nodes) and will schedule the guest
> on some other (even rapidly changing) nodes. According to Murphy's law you
> will end up with _all_ the memory access of a guest to be remote. But in
> fact a NUMA architecture is really beneficial for virtualization: As there
> are close to zero cross domain memory accesses (except for Dom0), each node
> is more or less self contained and each guest can use the node's memory
> controller almost exclusively.
> But this is all spoiled as most people don't know about Xen's NUMA
> capabilities and don't set numa=on. Using this as a default would solve
> this.
>
> Regards,
> Andre.
>
> --
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
> Tel: +49 351 488-3567-12
> ----to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster; Thomas M. McCoy; Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


* Re: Xen 3.4.1 NUMA support
  2009-11-10  6:56           ` Dulloor
@ 2009-11-10  7:49             ` Andre Przywara
  0 siblings, 0 replies; 30+ messages in thread
From: Andre Przywara @ 2009-11-10  7:49 UTC (permalink / raw)
  To: Dulloor
  Cc: George Dunlap, Dan Magenheimer, xen-devel, Keir Fraser,
	Papagiannis Anastasios

Dulloor wrote:
> I am not finding this. Can you please point to the code ?
tools/python/xen/xend/XendDomainInfo.py (around line 2600)
with the core code being:
-------------
    index = nodeload.index(min(nodeload))
    cpumask = info['node_to_cpu'][index]
    for v in range(0, self.info['VCPUs_max']):
        xc.vcpu_setaffinity(self.domid, v, cpumask)
--------------
The code got introduced with c/s 17131 and later got refined with c/s 
17247 and c/s 17709.
> 
> numa=on/off is only for setting up numa in xen (similar to the linux
> knob, but turned off by default). The allocation of memory from a
> single node (that you observe) could be because of the way
> alloc_heap_pages is implemented (trying to allocate from all the heaps
> from a node, before trying the next one)
Yes, but if the domain is pinned before it allocates its memory, then
the natural behavior of Xen is to take memory from that local node.

> - try looking at dump_numa
> output. And, affinities are not set anywhere based on the node from
> which allocation happens.
It is the other way round: first the domain is pinned, and later the memory
is allocated (based on the node to which the currently scheduled CPU
belongs).

Regards,
Andre.

> 
> -dulloor
> 
> On Mon, Nov 9, 2009 at 5:51 PM, Andre Przywara <andre.przywara@amd.com> wrote:
>> George Dunlap wrote:
>>> Andre Przywara wrote:
>>>> BTW: Shouldn't we set finally numa=on as the default value?
>>>>
>>> Is there any data to support the idea that this helps significantly on
>>> common systems?
>> I don't have any numbers handy, but I will try if I can generate some.
>>
>> Looking from a high level perspective it is a shame that it's not the
>> default: With numa=off the Xen domain loader will allocate physical memory
>> from some node (maybe even from several nodes) and will schedule the guest
>> on some other (even rapidly changing) nodes. According to Murphy's law you
>> will end up with _all_ the memory access of a guest to be remote. But in
>> fact a NUMA architecture is really beneficial for virtualization: As there
>> are close to zero cross domain memory accesses (except for Dom0), each node
>> is more or less self contained and each guest can use the node's memory
>> controller almost exclusively.
>> But this is all spoiled as most people don't know about Xen's NUMA
>> capabilities and don't set numa=on. Using this as a default would solve
>> this.
>>
>> Regards,
>> Andre.
>>
-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* RE: Xen 3.4.1 NUMA support
  2009-11-10  1:46         ` Ian Pratt
@ 2009-11-10  8:51           ` Jan Beulich
  2009-11-10  8:57             ` Keir Fraser
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2009-11-10  8:51 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Andre Przywara, Dan Magenheimer, xen-devel, George Dunlap,
	Keir Fraser, Papagiannis Anastasios

>>> Ian Pratt <Ian.Pratt@eu.citrix.com> 10.11.09 02:46 >>>
>> >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
>> >BTW: Shouldn't we set finally numa=on as the default value?
>> 
>> I'd say no, at least until the default confinement of a guest to a single
>> node gets fixed to properly deal with guests having more vCPU-s than
>> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
>> not overcommitting CPUs outweigh the drawbacks of cross-node memory
>> accesses at the very least for CPU-bound workloads).
>
>What default confinement? I thought guests had an all-pCPUs affinity mask be default?

Not with numa=on (see also Andre's post to this effect): The guest will
get assigned to a node, and its affinity set to that node's CPUs.

Jan


* Re: Xen 3.4.1 NUMA support
  2009-11-10  8:51           ` Jan Beulich
@ 2009-11-10  8:57             ` Keir Fraser
  0 siblings, 0 replies; 30+ messages in thread
From: Keir Fraser @ 2009-11-10  8:57 UTC (permalink / raw)
  To: Jan Beulich, Ian Pratt
  Cc: George Dunlap, Andre Przywara, Dan Magenheimer, xen-devel,
	Papagiannis Anastasios

On 10/11/2009 08:51, "Jan Beulich" <JBeulich@novell.com> wrote:

>> What default confinement? I thought guests had an all-pCPUs affinity mask be
>> default?
> 
> Not with numa=on (see also Andre's post to this effect): The guest will
> get assigned to a node, and its affinity set to that node's CPUs.

...And if it didn't, striping would not happen. In fact iirc the default
NUMA allocation policy for an all-pcpus domain is in some respects pessimal:
vcpu0's initial node gets drained of memory first. I.e., you get *less*
'striping' than you could with numa=off where you might at least get lucky.

 -- Keir


* Re: Xen 3.4.1 NUMA support
  2009-11-09 15:19       ` Jan Beulich
  2009-11-10  1:46         ` Ian Pratt
@ 2009-11-12 16:09         ` Keir Fraser
  2009-11-30 15:40         ` [PATCH] tools: avoid over-commitment if numa=on Andre Przywara
  2 siblings, 0 replies; 30+ messages in thread
From: Keir Fraser @ 2009-11-12 16:09 UTC (permalink / raw)
  To: Jan Beulich, Andre Przywara, Dan Magenheimer
  Cc: George Dunlap, xen-devel, Papagiannis Anastasios

On 09/11/2009 15:19, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
>> BTW: Shouldn't we set finally numa=on as the default value?
> 
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPU-s than
> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
> not overcommitting CPUs outweigh the drawbacks of cross-node memory
> accesses at the very least for CPU-bound workloads).

If this were fixed (e.g., turning off node locality entirely by default for
domains which will not fit into a single node), then I think we could
consider numa=on by default.

 -- Keir


* Re: Xen 3.4.1 NUMA support
  2009-11-09 15:06       ` George Dunlap
  2009-11-09 22:51         ` Andre Przywara
@ 2009-11-13 14:14         ` Andre Przywara
  2009-11-13 14:29           ` Ian Pratt
  2009-11-13 14:31           ` Keir Fraser
  1 sibling, 2 replies; 30+ messages in thread
From: Andre Przywara @ 2009-11-13 14:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Dan Magenheimer, xen-devel, Keir Fraser, Papagiannis Anastasios

George Dunlap wrote:
> Andre Przywara wrote:
>> BTW: Shouldn't we set finally numa=on as the default value?
>>   
> Is there any data to support the idea that this helps significantly on 
> common systems?
I did some tests on an 8-node machine. I will retry this later on
4-node and 2-node systems, but I assume similar numbers. I used
multiple guests in parallel, each running lmbench's bw_mem, which is
admittedly quite NUMA-sensitive. I cannot publish real numbers (yet?),
but the results were dramatic:
with numa=on I got the same result for each guest (the same as the
native result) when the number of guests was smaller than or equal to the
number of nodes (since each guest got its own memory controller).
If I disabled NUMA-aware placement by explicitly specifying cpus="0-31"
in the config file, or booted with numa=off, the values dropped by a
factor of 3-5 (!) (even for a few guests), with some variance due to the
random nature of the core-to-memory mapping.
Overcommitting the nodes (letting multiple guests use each node) lowered
the values to about 80% for two guests and 60% for three guests per
node, but they never got anywhere close to the numa=off values.
So these results encourage me again to opt for numa=on as the default value.
Keir, I will check whether dropping the node containment in the CPU
overcommitment case is an option, but what would be the right strategy
in that case?
Warn the user?
Don't contain at all?
Contain to more than one node?

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* RE: Xen 3.4.1 NUMA support
  2009-11-13 14:14         ` Andre Przywara
@ 2009-11-13 14:29           ` Ian Pratt
  2009-11-13 15:25             ` George Dunlap
  2009-11-13 15:27             ` Keir Fraser
  2009-11-13 14:31           ` Keir Fraser
  1 sibling, 2 replies; 30+ messages in thread
From: Ian Pratt @ 2009-11-13 14:29 UTC (permalink / raw)
  To: Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Ian Pratt, Keir Fraser,
	Papagiannis Anastasios

> Overcommitting the nodes (letting multiple guests use each node) lowered
> the values to about 80% for two guests and 60% for three guests per
> node, but it never got anywhere close to the numa=off values.
> So these results encourage me again to opt for numa=on as the default
> value.
> Keir, I will check if dropping the node containment in the CPU
> overcommitment case is an option, but what would be the right strategy
> in that case?
> Warn the user?
> Don't contain at all?
> Contain to more than onde node?

In the case where a VM is asking for more vCPUs than there are pCPUs in a node, we should contain the guest to multiple nodes. (I presume we favour nodes according to the number of vCPUs already committed to them?)

We should turn off automatic node containment of any kind if the total number of pCPUs in the system is <= 8 -- on such systems the statistical multiplexing gain of having access to more pCPUs likely outweighs the NUMA placement benefit, and memory striping will be a better strategy.
I'm inclined to believe that may be true for 2-node systems with <=16 pCPUs too, under many workloads.
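
As a sketch of the policy I have in mind (the 8/16 pCPU thresholds are the
ones above; the function itself is illustrative, not real toolstack code):

def nodes_to_spread_over(total_pcpus, nr_nodes, pcpus_per_node, guest_vcpus):
    # Small boxes: skip containment entirely and let memory stripe.
    if total_pcpus <= 8 or (nr_nodes == 2 and total_pcpus <= 16):
        return nr_nodes
    # Otherwise contain, but to enough nodes that the guest's vCPUs
    # are not overcommitted on the nodes' pCPUs.
    needed = (guest_vcpus + pcpus_per_node - 1) // pcpus_per_node
    return min(max(needed, 1), nr_nodes)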

I'd really like to see us enumerate pCPUs in a sensible order so that it's easier to see the topology.  It should be nodes.sockets.cores{.threads}, leaving gaps for missing execution units due to hot-plug or non-power-of-two packing.
Right now we're inconsistent in the enumeration order, depending on how the BIOS has set things up. It would be great if someone could volunteer to fix this...

Ian


* Re: Xen 3.4.1 NUMA support
  2009-11-13 14:14         ` Andre Przywara
  2009-11-13 14:29           ` Ian Pratt
@ 2009-11-13 14:31           ` Keir Fraser
  2009-11-13 15:38             ` Ian Pratt
  1 sibling, 1 reply; 30+ messages in thread
From: Keir Fraser @ 2009-11-13 14:31 UTC (permalink / raw)
  To: Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Papagiannis Anastasios

On 13/11/2009 14:14, "Andre Przywara" <andre.przywara@amd.com> wrote:

> Keir, I will check if dropping the node containment in the CPU
> overcommitment case is an option, but what would be the right strategy
> in that case?
> Warn the user?
> Don't contain at all?
> Contain to more than onde node?

I would suggest simply don't contain at all (i.e., keep equivalent numa=off
behaviour) would be safest.

 -- Keir


* Re: Xen 3.4.1 NUMA support
  2009-11-13 14:29           ` Ian Pratt
@ 2009-11-13 15:25             ` George Dunlap
  2009-11-13 15:35               ` Ian Pratt
  2009-11-13 15:27             ` Keir Fraser
  1 sibling, 1 reply; 30+ messages in thread
From: George Dunlap @ 2009-11-13 15:25 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Andre Przywara, Dan Magenheimer, xen-devel, Keir Fraser,
	Papagiannis Anastasios

Ian Pratt wrote:
> In the case where a VM is asking for more vCPUs there are pCPUs in a node we should contain the guest to multiple nodes. (I presume we favour nodes according to the number of vCPUs they already have committed to them?)

Seems like CPU load might be a better measure.  Xen doesn't calculate 
load currently, but it's on my list of things to do.

  -George


* Re: Xen 3.4.1 NUMA support
  2009-11-13 14:29           ` Ian Pratt
  2009-11-13 15:25             ` George Dunlap
@ 2009-11-13 15:27             ` Keir Fraser
  2009-11-13 15:40               ` Ian Pratt
  1 sibling, 1 reply; 30+ messages in thread
From: Keir Fraser @ 2009-11-13 15:27 UTC (permalink / raw)
  To: Ian Pratt, Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Papagiannis Anastasios

On 13/11/2009 14:29, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote:

> I'd really like to see us enumerate pCPUs in a sensible order so that it's
> easier to see the topology.  It should be nodes.sockets.cores{.threads},
> leaving gaps for missing execution units due to hot plug or non power of two
> packing. 
> Right now we're inconsistent in the enumeration order depending on how the
> BIOS has set things up. It would be great if someone could volunteer to fix
> this...

Even better would be to have pCPUs addressable and listable explicitly as
dotted tuples. That can be implemented entirely within the toolstack, and
could even allow wildcarding of tuple components to efficiently express
cpumasks.
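
Something along these lines in the toolstack, say (a sketch only: just the
node and cpu-within-node levels are shown, using the node_to_cpu list that
xc.physinfo() already returns; the exact syntax and the deeper
socket/core/thread levels are assumptions here):

def expand_cpuspec(spec, node_to_cpu):
    # Expand "node[.cpu]"-style specifiers, with '*' wildcards, into a
    # flat list of pCPU ids, e.g. "1.*" -> all pCPUs of node 1.
    cpus = []
    for part in spec.split(','):
        fields = part.split('.')
        if fields[0] == '*':
            nodes = range(len(node_to_cpu))
        else:
            nodes = [int(fields[0])]
        for n in nodes:
            if len(fields) == 1 or fields[1] == '*':
                cpus.extend(node_to_cpu[n])
            else:
                cpus.append(node_to_cpu[n][int(fields[1])])
    return cpus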

 -- Keir


* RE: Xen 3.4.1 NUMA support
  2009-11-13 15:25             ` George Dunlap
@ 2009-11-13 15:35               ` Ian Pratt
  0 siblings, 0 replies; 30+ messages in thread
From: Ian Pratt @ 2009-11-13 15:35 UTC (permalink / raw)
  To: George Dunlap
  Cc: Andre Przywara, Dan Magenheimer, xen-devel, Ian Pratt,
	Keir Fraser, Papagiannis Anastasios

> Ian Pratt wrote:
> > In the case where a VM is asking for more vCPUs there are pCPUs in a
> node we should contain the guest to multiple nodes. (I presume we favour
> nodes according to the number of vCPUs they already have committed to
> them?)
> 
> Seems like CPU load might be a better measure.  Xen doesn't calculate
> load currently, but it's on my list of things to do.

I'd rather get this stuff fixed now than wait for the new scheduler.

It's not clear that instantaneous CPU load is any better than just counting the number of vCPUs. The XCP xapi stack also records good historical data, and would be in a better position to do the placement. Further work.

Ian 


* RE: Xen 3.4.1 NUMA support
  2009-11-13 14:31           ` Keir Fraser
@ 2009-11-13 15:38             ` Ian Pratt
  0 siblings, 0 replies; 30+ messages in thread
From: Ian Pratt @ 2009-11-13 15:38 UTC (permalink / raw)
  To: Keir Fraser, Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Ian Pratt, Papagiannis Anastasios

> > Keir, I will check if dropping the node containment in the CPU
> > overcommitment case is an option, but what would be the right strategy
> > in that case?
> > Warn the user?
> > Don't contain at all?
> > Contain to more than onde node?
> 
> I would suggest simply don't contain at all (i.e., keep equivalent
> numa=off
> behaviour) would be safest.

I disagree. In systems with 2 nodes it will use all nodes, which is the same as what you propose[*]. In systems with more nodes it will do placement to some subset. Note that systems with >2 nodes generally have stronger NUMA effects, and these are exactly the systems where node placement is a good thing.

[*] Note that numa=off is quite different from just disabling node placement. If node placement is disabled we still get the benefit of memory striping across nodes, which at least avoids some performance cliffs.

Ian
  


* RE: Xen 3.4.1 NUMA support
  2009-11-13 15:27             ` Keir Fraser
@ 2009-11-13 15:40               ` Ian Pratt
  2009-11-13 16:02                 ` Keir Fraser
  0 siblings, 1 reply; 30+ messages in thread
From: Ian Pratt @ 2009-11-13 15:40 UTC (permalink / raw)
  To: Keir Fraser, Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Ian Pratt, Papagiannis Anastasios

> > I'd really like to see us enumerate pCPUs in a sensible order so that
> it's
> > easier to see the topology.  It should be nodes.sockets.cores{.threads},
> > leaving gaps for missing execution units due to hot plug or non power of
> two
> > packing.
> > Right now we're inconsistent in the enumeration order depending on how
> the
> > BIOS has set things up. It would be great if someone could volunteer to
> fix
> > this...
> 
> Even better would be to have pCPUs addressable and listable explicitly as
> dotted tuples. That can be implemented entirely within the toolstack, and
> could even allow wildcarding of tuple components to efficiently express
> cpumasks.

Yes, I'd certainly like to see the toolstack support dotted tuple notation.

However, I just don't trust the toolstack to get this right unless Xen has already set it up nicely for it with a sensible enumeration and defined sockets-per-node, cores-per-socket and threads-per-core parameters. Xen should provide a clean interface to the toolstack in this respect.

Ian


* Re: Xen 3.4.1 NUMA support
  2009-11-13 15:40               ` Ian Pratt
@ 2009-11-13 16:02                 ` Keir Fraser
  0 siblings, 0 replies; 30+ messages in thread
From: Keir Fraser @ 2009-11-13 16:02 UTC (permalink / raw)
  To: Ian Pratt, Andre Przywara, George Dunlap
  Cc: Dan Magenheimer, xen-devel, Papagiannis Anastasios

On 13/11/2009 15:40, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote:

>> Even better would be to have pCPUs addressable and listable explicitly as
>> dotted tuples. That can be implemented entirely within the toolstack, and
>> could even allow wildcarding of tuple components to efficiently express
>> cpumasks.
> 
> Yes, I'd certainly like to see the toolstack support dotted tuple notation.
> 
> However, I just don't trust the toolstack to get this right unless xen has
> already set it up nicely for it with a sensible enumeration and defined
> sockets-per-node, cores-per-socket and threads-per-core parameters. Xen should
> provide a clean interface to the toolstack in this respect.

Xen provides a topology-interrogation hypercall which should suffice for
tools to build up a {node,socket,core,thread}<->cpuid mapping table.

 -- Keir


* [PATCH] tools: avoid over-commitment if numa=on
  2009-11-09 15:19       ` Jan Beulich
  2009-11-10  1:46         ` Ian Pratt
  2009-11-12 16:09         ` Keir Fraser
@ 2009-11-30 15:40         ` Andre Przywara
  2 siblings, 0 replies; 30+ messages in thread
From: Andre Przywara @ 2009-11-30 15:40 UTC (permalink / raw)
  To: Keir Fraser
  Cc: George Dunlap, Dan Magenheimer, xen-devel,
	Papagiannis Anastasios, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 1271 bytes --]

Jan Beulich wrote:
>>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
>> BTW: Shouldn't we set finally numa=on as the default value?
> 
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPU-s than
> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
> not overcommitting CPUs outweigh the drawbacks of cross-node memory
> accesses at the very least for CPU-bound workloads).
That sounds reasonable.
Attached is a patch to lift the restriction of one node per guest if the
number of VCPUs is greater than the number of cores per node.
This isn't optimal (the best way would be to inform the guest about it,
but that is another patchset ;-), but it should address the above concerns.
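
To give a concrete (made-up) example of what the patch does: on the 8-node
box from my earlier mail, with 6 cores per node (so 48 in total), a
hypothetical guest with 16 VCPUs would get

    cores_per_node = 48 / 8           = 6
    nodes_required = (16 + 6 - 1) / 6 = 3

so its cpumask is built from the three least-loaded nodes instead of just one.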

Please apply,
Andre.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

[-- Attachment #2: more_NUMA_nodes.patch --]
[-- Type: text/x-patch, Size: 2389 bytes --]

# HG changeset patch
# User Andre Przywara <andre.przywara@amd.com>
# Date 1259594006 -3600
# Node ID bdf4109edffbcc0cbac605a19d2fd7a7459f1117
# Parent  abc6183f486e66b5721dbf0313ee0d3460613a99
allocate enough NUMA nodes for all VCPUs

If numa=on, we constrain a guest to one node to keep its memory
accesses local. This will hurt performance if the number of VCPUs
is greater than the number of cores per node. We detect this case
now and allocate further NUMA nodes to allow all VCPUs to run
simultaneously.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

diff -r abc6183f486e -r bdf4109edffb tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Mon Nov 30 10:58:23 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Mon Nov 30 16:13:26 2009 +0100
@@ -2637,8 +2637,7 @@
                         nodeload[i] = int(nodeload[i] * 16 / len(info['node_to_cpu'][i]))
                     else:
                         nodeload[i] = sys.maxint
-                index = nodeload.index( min(nodeload) )    
-                return index
+                return map(lambda x: x[0], sorted(enumerate(nodeload), key=lambda x:x[1]))
 
             info = xc.physinfo()
             if info['nr_nodes'] > 1:
@@ -2648,8 +2647,15 @@
                 for i in range(0, info['nr_nodes']):
                     if node_memory_list[i] >= needmem and len(info['node_to_cpu'][i]) > 0:
                         candidate_node_list.append(i)
-                index = find_relaxed_node(candidate_node_list)
-                cpumask = info['node_to_cpu'][index]
+                best_node = find_relaxed_node(candidate_node_list)[0]
+                cpumask = info['node_to_cpu'][best_node]
+                cores_per_node = info['nr_cpus'] / info['nr_nodes']
+                nodes_required = (self.info['VCPUs_max'] + cores_per_node - 1) / cores_per_node
+                if nodes_required > 1:
+                    log.debug("allocating %d NUMA nodes", nodes_required)
+                    best_nodes = find_relaxed_node(filter(lambda x: x != best_node, range(0,info['nr_nodes'])))
+                    for i in best_nodes[:nodes_required - 1]:
+                        cpumask = cpumask + info['node_to_cpu'][i]
                 for v in range(0, self.info['VCPUs_max']):
                     xc.vcpu_setaffinity(self.domid, v, cpumask)
         return index

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


