* NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-01 16:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Jan Beulich,
	Andrew Cooper, Zhang, Yang Z


Hi everyone,

With automatic placement finally landing in xen-unstable, I started
thinking about what I could work on next, still in the field of
improving Xen's NUMA support. Well, it turned out that running out of
things to do is not an option! :-O

In fact, I can think of quite a few open issues in that area, which I'm
just braindumping here. If anyone has thoughts, ideas, feedback or
whatever, I'd be happy to serve as a collector of them. I've already
created a Wiki page to help with the tracking. You can see it here
(for now it basically replicates this e-mail):

 http://wiki.xen.org/wiki/Xen_NUMA_Roadmap

I'm putting a [D] (standing for Dario) near the points I've started
working on or looking at, and again, I'd be happy to try tracking this
too, i.e., keeping the list of "who-is-doing-what" updated, in order to
ease collaboration.

So, let's cut the talking:

    - Automatic placement at guest creation time. Basics are there and
      will be shipping with 4.2. However, a lot of other things are
      missing and/or can be improved, for instance:
[D]    * automated verification and testing of the placement;
       * benchmarks and improvements of the placement heuristic;
[D]    * choosing/building up some measure of node load (more accurate
         than just counting vcpus) on which to rely during placement
         (see the toy sketch after this list);
       * consider IONUMA during placement;
       * automatic placement of Dom0, if possible (my current series
         only affects DomU);
       * having internal Xen data structures honour the placement (e.g.,
         I've been told that right now vcpu stacks are always allocated
         on node 0... Andrew?).

[D] - NUMA-aware scheduling in Xen. Don't pin vcpus to nodes' pcpus,
      just have them _prefer_ running on the nodes where their memory
      is.

[D] - Dynamic memory migration between different nodes of the host. As
      the counterpart of the NUMA-aware scheduler.

    - Virtual NUMA topology exposure to guests (a.k.a. guest-NUMA). If a
      guest ends up on more than one node, make sure it knows it's
      running on a NUMA platform (smaller than the actual host, but
      still NUMA). This interacts with some of the above points:
       * consider this during automatic placement for
         resuming/migrating domains (if they have a virtual topology,
         better not to change it);
       * consider this during memory migration (it can change the
         actual topology; should we update it on-line or disable memory
         migration?).

    - NUMA, ballooning and memory sharing. In some more detail:
       * page sharing on NUMA boxes: it's probably sane to make it
         possible to disable sharing pages across nodes;
       * ballooning and its interaction with placement (races, amount of
         memory needed and reported being different at different times,
         etc.).

    - Inter-VM dependencies and communication issues. If a workload is
      made up of more than just one VM and they all share the same (NUMA)
      host, it might be best to have them share the nodes as much as
      possible, or perhaps do just the opposite, depending on the
      specific characteristics of the workload itself. This might be
      considered during placement, memory migration and perhaps
      scheduling.

    - Benchmarking and performance evaluation in general. Meaning both
      agreeing on a (set of) relevant workload(s) and on how to extract
      meaningful performance data from there (and maybe how to do that
      automatically?).
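
To make the node-load point above a bit more concrete, here is a toy,
self-contained sketch of the kind of metric I have in mind. None of this
is Xen or libxl code: the node_info structure, the weights and the numbers
are purely illustrative assumptions.

/*
 * Toy sketch (NOT Xen/libxl code) of a node "load" score that goes
 * beyond just counting vcpus, by mixing CPU and memory pressure.
 * Structures, weights and numbers are illustrative only.
 */
#include <stdio.h>

struct node_info {
    unsigned int nr_vcpus_placed;  /* vcpus already placed on the node */
    unsigned int nr_pcpus;         /* physical cpus in the node */
    unsigned long free_kb;         /* free memory on the node, in KiB */
    unsigned long total_kb;        /* total memory on the node, in KiB */
};

/* Lower score == better candidate for hosting (part of) a new guest. */
static double node_load_score(const struct node_info *n)
{
    double cpu_pressure = (double)n->nr_vcpus_placed / n->nr_pcpus;
    double mem_pressure = 1.0 - (double)n->free_kb / n->total_kb;

    /* The weights are a pure guess; tuning them is exactly the open item. */
    return 0.6 * cpu_pressure + 0.4 * mem_pressure;
}

int main(void)
{
    struct node_info nodes[] = {
        { .nr_vcpus_placed = 6, .nr_pcpus = 8,
          .free_kb =  4UL << 20, .total_kb = 16UL << 20 },
        { .nr_vcpus_placed = 2, .nr_pcpus = 8,
          .free_kb = 12UL << 20, .total_kb = 16UL << 20 },
    };

    for (unsigned int i = 0; i < 2; i++)
        printf("node%u: score %.3f\n", i, node_load_score(&nodes[i]));

    return 0;
}

(Whether something like this actually beats plain vcpu counting is, of
course, part of the benchmarking item above.)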

So, what do you think?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-01 16:24 UTC (permalink / raw)
  To: xen-devel
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Jan Beulich,
	Andrew Cooper, Zhang, Yang Z


On Wed, 2012-08-01 at 18:16 +0200, Dario Faggioli wrote:
> Hi everyone,
>
Quite a bad subject... I put it there just as a placeholder and then
forgot to change it into something sensible. :-(

Sorry for that. I hope the content can still get some attention. :-P

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Andrew Cooper @ 2012-08-01 16:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Jan Beulich, Zhang, Yang Z


On 01/08/12 17:16, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing into xen-unstable, I stated
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a bit of open issues in that area, that I'm
> just braindumping here. If anyone has thoughts or idea or feedback or
> whatever, I'd be happy to serve as a collector of them. I've already
> created a Wiki page to help with the tracking. You can see it here
> (for now it basically replicates this e-mail):
>
> http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
>
> I'm putting a [D] (standing for Dario) near the points I've started
> working on or looking at, and again, I'd be happy to try tracking this
> too, i.e., keeping the list of "who-is-doing-what" updated, in order to
> ease collaboration.
>
> So, let's cut the talking:
>
> - Automatic placement at guest creation time. Basics are there and
> will be shipping with 4.2. However, a lot of other things are
> missing and/or can be improved, for instance:
> [D] * automated verification and testing of the placement;
> * benchmarks and improvements of the placement heuristic;
> [D] * choosing/building up some measure of node load (more accurate
> than just counting vcpus) onto which to rely during placement;
> * consider IONUMA during placement;
> * automatic placement of Dom0, if possible (my current series is
> only affecting DomU)
> * having internal xen data structure honour the placement (e.g.,
> I've been told that right now vcpu stacks are always allocated
> on node 0... Andrew?).
>
> [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
> just have them _prefer_ running on the nodes where their memory
> is.
>
> [D] - Dynamic memory migration between different nodes of the host. As
> the counter-part of the NUMA-aware scheduler.
>
> - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> guest ends up on more than one nodes, make sure it knows it's
> running on a NUMA platform (smaller than the actual host, but
> still NUMA). This interacts with some of the above points:
> * consider this during automatic placement for
> resuming/migrating domains (if they have a virtual topology,
> better not to change it);
> * consider this during memory migration (it can change the
> actual topology, should we update it on-line or disable memory
> migration?)
>
> - NUMA and ballooning and memory sharing. In some more details:
> * page sharing on NUMA boxes: it's probably sane to make it
> possible disabling sharing pages across nodes;
> * ballooning and its interaction with placement (races, amount of
> memory needed and reported being different at different time,
> etc.).
>
> - Inter-VM dependencies and communication issues. If a workload is
> made up of more than just a VM and they all share the same (NUMA)
> host, it might be best to have them sharing the nodes as much as
> possible, or perhaps do right the opposite, depending on the
> specific characteristics of he workload itself, and this might be
> considered during placement, memory migration and perhaps
> scheduling.
>
> - Benchmarking and performances evaluation in general. Meaning both
> agreeing on a (set of) relevant workload(s) and on how to extract
> meaningful performances data from there (and maybe how to do that
> automatically?).

- Xen NUMA internals.  Placing items such as the per-cpu stacks and data
area on the local NUMA node, rather than unconditionally on node 0 as is
done at the moment.  As part of this, there will be changes to
alloc_{dom,xen}heap_page() to allow specification of which node(s) to
allocate memory from.

~Andrew

>
>
> So, what do you think?
>
> Thanks and Regards,
> Dario
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com



* Re: NUMA TODO-list for xen-devel
From: Anil Madhavapeddy @ 2012-08-01 16:32 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Steven Smith, George Dunlap, Malte Schwarzkopf,
	xen-devel, Jan Beulich, Andrew Cooper, Zhang, Yang Z

On 1 Aug 2012, at 17:16, Dario Faggioli <raistlin@linux.it> wrote:

>    - Inter-VM dependencies and communication issues. If a workload is
>      made up of more than just a VM and they all share the same (NUMA)
>      host, it might be best to have them sharing the nodes as much as
>      possible, or perhaps do right the opposite, depending on the
>      specific characteristics of he workload itself, and this might be
>      considered during placement, memory migration and perhaps
>      scheduling.
> 
>    - Benchmarking and performances evaluation in general. Meaning both
>      agreeing on a (set of) relevant workload(s) and on how to extract
>      meaningful performances data from there (and maybe how to do that
>      automatically?).

I haven't tried out the latest Xen NUMA features yet, but we've been
keeping track of the IPC benchmarks as we get newer machines here:

http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html

The newer chipsets (Sandy Bridge and AMD Valencia) both have quite
different inter-core/socket/MPM performance characteristics from their
respective previous generations; e.g.

http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpfCBrYh.html
http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmppI61nX.html

Happy to share the raw data if you have cycles to figure out the best
way to auto-place multiple VMs so they are near each other from a memory
latency perspective.  We haven't run many macro-benchmarks though, so
in practice it might not matter; it would be nice to settle on a good
set of benchmarks to determine that for sure.

-anil


* Re: NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-01 16:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Jan Beulich, Zhang, Yang Z


On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
> On 01/08/12 17:16, Dario Faggioli wrote:
>
> ...
>
> > - Automatic placement at guest creation time. Basics are there and
> > will be shipping with 4.2. However, a lot of other things are
> > missing and/or can be improved, for instance:
> > [D] * automated verification and testing of the placement;
> > * benchmarks and improvements of the placement heuristic;
> > [D] * choosing/building up some measure of node load (more accurate
> > than just counting vcpus) onto which to rely during placement;
> > * consider IONUMA during placement;
> > * automatic placement of Dom0, if possible (my current series is
> > only affecting DomU)
> > * having internal xen data structure honour the placement (e.g., 
> > I've been told that right now vcpu stacks are always allocated
> > on node 0... Andrew?).
> >
> 
> - Xen NUMA internals.  Placing items such as the per-cpu stacks and
> data area on the local NUMA node, rather than unconditionally on node
> 0 at the moment.  As part of this, there will be changes to
> alloc_{dom,xen}heap_page() to allow specification of which node(s) to
> allocate memory from.

As you see, I already tried to consider that (as you told me about it a
couple of weeks ago :-) ). I'll add your wording of it (much better
than mine) to the wiki... I understand you're working on this, aren't
you? Can I put you down for it?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Andrew Cooper @ 2012-08-01 16:53 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Jan Beulich, Zhang, Yang Z


On 01/08/12 17:47, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
>> On 01/08/12 17:16, Dario Faggioli wrote:
>>
>> ...
>>
>>> - Automatic placement at guest creation time. Basics are there and
>>> will be shipping with 4.2. However, a lot of other things are
>>> missing and/or can be improved, for instance:
>>> [D] * automated verification and testing of the placement;
>>> * benchmarks and improvements of the placement heuristic;
>>> [D] * choosing/building up some measure of node load (more accurate
>>> than just counting vcpus) onto which to rely during placement;
>>> * consider IONUMA during placement;
>>> * automatic placement of Dom0, if possible (my current series is
>>> only affecting DomU)
>>> * having internal xen data structure honour the placement (e.g.,
>>> I've been told that right now vcpu stacks are always allocated
>>> on node 0... Andrew?).
>>>
>>
>> - Xen NUMA internals. Placing items such as the per-cpu stacks and
>> data area on the local NUMA node, rather than unconditionally on node
>> 0 at the moment. As part of this, there will be changes to
>> alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>> allocate memory from.
>
> As you see, I already tried to consider that (as you told me it does
> that couple of weeks ago :-) ). I'll add your wording of it (much better
> than mine) to the wiki... I understand you're working on this, aren't
> you? Can I put that down to?
>
> Thanks and Regards,
> Dario
>

Wow - I completely managed to miss that while reading.  Someone will be
working on it for XS.next, and that someone will probably be me - put me
down for it.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com



* Re: NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-01 16:58 UTC (permalink / raw)
  To: Anil Madhavapeddy
  Cc: Andre Przywara, Steven Smith, George Dunlap, Malte Schwarzkopf,
	xen-devel, Jan Beulich, Andrew Cooper, Zhang, Yang Z


On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
> On 1 Aug 2012, at 17:16, Dario Faggioli <raistlin@linux.it> wrote:
> 
> >    - Inter-VM dependencies and communication issues. If a workload is
> >      made up of more than just a VM and they all share the same (NUMA)
> >      host, it might be best to have them sharing the nodes as much as
> >      possible, or perhaps do right the opposite, depending on the
> >      specific characteristics of he workload itself, and this might be
> >      considered during placement, memory migration and perhaps
> >      scheduling.
> > 
> >    - Benchmarking and performances evaluation in general. Meaning both
> >      agreeing on a (set of) relevant workload(s) and on how to extract
> >      meaningful performances data from there (and maybe how to do that
> >      automatically?).
> 
> I haven't tried out the latest Xen NUMA features yet, but we've been
> keeping track of the IPC benchmarks as we get newer machines here:
> 

> http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
> 
Wow... That's really cool. I'll definitely take a deep look at all these
data! I'm also adding the link to the wiki, if you're fine with that...

> Happy to share the raw data if you have cycles to figure out the best
> way to auto-place multiple VMs so they are near each other from a memory
> latency perspective.  
>
I don't have anything precise in mind yet, but we need to think about
this.

> We haven't run many macro-benchmarks though, so
> in practise it might not matter, so it would be nice to settle on a good
> set of benchmarks to determine that for sure.
> 
Yes, that's what we need. I'm open and available to try figuring this
out anytime... I seem to recall you're going to be in San Diego for
XenSummit, am I right? If so, we can discuss this more there.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Malte Schwarzkopf @ 2012-08-02  0:04 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Jan Beulich, Andrew Cooper, Zhang, Yang Z, Steven Smith

On 01/08/12 17:58, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
>> On 1 Aug 2012, at 17:16, Dario Faggioli <raistlin@linux.it> wrote:
>>
>>>    - Inter-VM dependencies and communication issues. If a workload is
>>>      made up of more than just a VM and they all share the same (NUMA)
>>>      host, it might be best to have them sharing the nodes as much as
>>>      possible, or perhaps do right the opposite, depending on the
>>>      specific characteristics of he workload itself, and this might be
>>>      considered during placement, memory migration and perhaps
>>>      scheduling.
>>>
>>>    - Benchmarking and performances evaluation in general. Meaning both
>>>      agreeing on a (set of) relevant workload(s) and on how to extract
>>>      meaningful performances data from there (and maybe how to do that
>>>      automatically?).
>>
>> I haven't tried out the latest Xen NUMA features yet, but we've been
>> keeping track of the IPC benchmarks as we get newer machines here:
>>
> 
>> http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
>>
> Wow... That's really cool. I'll definitely take a deep look at all these
> data! I'm also adding the link to the wiki, if you're fine with that...

No problem with adding a link, as this is public data :) If possible,
it'd be splendid to put a note next to this link encouraging people to
submit their own results -- doing so is very simple, and helps us extend
the database. Instructions are at
http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short
link, http://fable.io).

>> Happy to share the raw data if you have cycles to figure out the best
>> way to auto-place multiple VMs so they are near each other from a memory
>> latency perspective.  
>>
> I don't have anything precise in mind yet, but we need to think about
> this.

While there has been plenty of work on optimizing co-location of
different kinds of workloads, there's relatively little work (that I am
aware of) on VM scheduling in this environment. One (sadly somewhat
lacking) paper at HotCloud this year [1] looked at NUMA-aware VM
migration to balance memory accesses. Of greater interest is possibly
the Google ISCA paper on the detrimental effect of sharing
micro-architectural resources between different kinds of workloads,
although it is not explicitly focused on NUMA, and the metrics are
defined with regard to specific classes of latency-sensitive jobs [2].

One interesting thing to look at (that we haven't looked at yet) is what
memory allocators do about NUMA these days; there is an AMD whitepaper
from 2009 discussing the performance benefits of a NUMA-aware version of
tcmalloc [3], but I have found it hard to reproduce their results on
modern hardware. Of course, being virtualized may complicate matters
here, since the memory allocator can no longer freely pick and choose
where to allocate from.

Scheduling, notably, is key here, since the CPU a process is scheduled
on may determine where its memory is allocated -- frequent migrations
are likely to be bad for performance due to remote memory accesses,
although we have been unable to quantify a significant difference on
non-synthetic macrobenchmarks; that said, we have not tried very hard so far.

Cheers,
Malte

[1] - Ahn et al., "Dynamic Virtual Machine Scheduling in Clouds for
Architectural Shared Resources", in Proceedings of HotCloud 2012,
https://www.usenix.org/conference/hotcloud12/dynamic-virtual-machine-scheduling-clouds-architectural-shared-resources

[2] - Tang et al., "The impact of memory subsystem resource sharing on
datacenter applications", in Proceedings of ISCA 2011,
http://dl.acm.org/citation.cfm?id=2000099

[3] -
http://developer.amd.com/Assets/NUMA_aware_heap_memory_manager_article_final.pdf


* Re: NUMA TODO-list for xen-devel
From: Zhang, Yang Z @ 2012-08-02  1:04 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Jan Beulich,
	Andrew Cooper

Dario Faggioli wrote on 2012-08-02:
> Hi everyone,
> 
> With automatic placement finally landing into xen-unstable, I stated
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
> 
> In fact, I can think of quite a bit of open issues in that area, that I'm
> just braindumping here. If anyone has thoughts or idea or feedback or
> whatever, I'd be happy to serve as a collector of them. I've already
> created a Wiki page to help with the tracking. You can see it here
> (for now it basically replicates this e-mail):
> 
>  http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
> I'm putting a [D] (standing for Dario) near the points I've started
> working on or looking at, and again, I'd be happy to try tracking this
> too, i.e., keeping the list of "who-is-doing-what" updated, in order to
> ease collaboration.
> 
> So, let's cut the talking:
> 
>     - Automatic placement at guest creation time. Basics are there and
>       will be shipping with 4.2. However, a lot of other things are
>       missing and/or can be improved, for instance:
> [D]    * automated verification and testing of the placement;
>        * benchmarks and improvements of the placement heuristic;
> [D]    * choosing/building up some measure of node load (more accurate
>          than just counting vcpus) onto which to rely during placement;
>        * consider IONUMA during placement;
We should consider two things:
1. Dom0 IONUMA: devices used by Dom0 should get their DMA buffers from the node on which they reside. Currently, Dom0 allocates DMA buffers without providing the node info to the hypercall.
2. Guest IONUMA: when a guest boots up with a pass-through device, we need to allocate its memory from the node where the device resides, for further DMA buffer allocation, and let the guest know the IONUMA topology. This relies on guest NUMA.
This topic was mentioned at Xen Summit 2011:
http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf


>        * automatic placement of Dom0, if possible (my current series is
>          only affecting DomU) * having internal xen data structure
>          honour the placement (e.g., I've been told that right now vcpu
>          stacks are always allocated on node 0... Andrew?).
> [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>       just have them _prefer_ running on the nodes where their memory
>       is.
> [D] - Dynamic memory migration between different nodes of the host. As
>       the counter-part of the NUMA-aware scheduler.
>     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>       guest ends up on more than one nodes, make sure it knows it's
>       running on a NUMA platform (smaller than the actual host, but
>       still NUMA). This interacts with some of the above points:
>        * consider this during automatic placement for
>          resuming/migrating domains (if they have a virtual topology,
>          better not to change it); * consider this during memory
>          migration (it can change the actual topology, should we update
>          it on-line or disable memory migration?)
>     - NUMA and ballooning and memory sharing. In some more details:
>        * page sharing on NUMA boxes: it's probably sane to make it
>          possible disabling sharing pages across nodes; * ballooning and
>          its interaction with placement (races, amount of memory needed
>          and reported being different at different time, etc.).
>     - Inter-VM dependencies and communication issues. If a workload is
>       made up of more than just a VM and they all share the same (NUMA)
>       host, it might be best to have them sharing the nodes as much as
>       possible, or perhaps do right the opposite, depending on the
>       specific characteristics of he workload itself, and this might be
>       considered during placement, memory migration and perhaps
>       scheduling.
>     - Benchmarking and performances evaluation in general. Meaning both
>       agreeing on a (set of) relevant workload(s) and on how to extract
>       meaningful performances data from there (and maybe how to do that
>       automatically?).
> So, what do you think?
> 
> Thanks and Regards,
> Dario
> 
> -- <<This happens because I choose it to happen!>> (Raistlin Majere)
> ----------------------------------------------------------------- Dario
> Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software
> Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 
> 
> -- <<This happens because I choose it to happen!>> (Raistlin Majere)
> ----------------------------------------------------------------- Dario
> Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software
> Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


Best regards,
Yang


* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-02  9:40 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Yang Z Zhang

>>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> - Xen NUMA internals.  Placing items such as the per-cpu stacks and data
> area on the local NUMA node, rather than unconditionally on node 0 at
> the moment.  As part of this, there will be changes to
> alloc_{dom,xen}heap_page() to allow specification of which node(s) to
> allocate memory from.

Those interfaces already support flags to be passed, including a
node ID. It just needs to be made use of in more places.
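
Roughly, the kind of call-site change meant here is something like the
following (just a sketch: alloc_xenheap_pages(), MEMF_node(), cpu_to_node()
and STACK_ORDER are quoted from memory of the tree, so treat the exact
signatures as approximate rather than authoritative):

/*
 * Rough sketch of the call-site change meant above; not a tested patch.
 */
static void *alloc_cpu_stack(unsigned int cpu)
{
    /*
     * Instead of alloc_xenheap_pages(STACK_ORDER, 0), which today tends
     * to land on node 0, pass the target cpu's node via the memflags.
     */
    return alloc_xenheap_pages(STACK_ORDER, MEMF_node(cpu_to_node(cpu)));
}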

Jan


* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-02  9:43 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang Z Zhang

>>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>       guest ends up on more than one nodes, make sure it knows it's
>       running on a NUMA platform (smaller than the actual host, but
>       still NUMA). This interacts with some of the above points:

The question is whether this is really useful beyond the (I would
suppose) relatively small set of cases where migration isn't
needed.

>        * consider this during automatic placement for
>          resuming/migrating domains (if they have a virtual topology,
>          better not to change it);
>        * consider this during memory migration (it can change the
>          actual topology, should we update it on-line or disable memory
>          migration?)

The question is whether trading functionality for performance
is an acceptable choice.

Jan


* Re: NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-02 13:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andre Przywara, Anil Madhavapeddy, Andrew Cooper, xen-devel,
	George Dunlap, Yang Z Zhang


On Thu, 2012-08-02 at 10:40 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > - Xen NUMA internals.  Placing items such as the per-cpu stacks and data
> > area on the local NUMA node, rather than unconditionally on node 0 at
> > the moment.  As part of this, there will be changes to
> > alloc_{dom,xen}heap_page() to allow specification of which node(s) to
> > allocate memory from.
> 
> Those interfaces already support flags to be passed, including a
> node ID. It just needs to be made use of in more places.
> 
Yes, I also remember it being node_affinity-conscious already, and I think
it's more a matter of how it is called. I'll update the wiki accordingly
(it doesn't need to contain this sort of detail anyway).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Dario Faggioli @ 2012-08-02 13:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang Z Zhang


On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> >     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >       guest ends up on more than one nodes, make sure it knows it's
> >       running on a NUMA platform (smaller than the actual host, but
> >       still NUMA). This interacts with some of the above points:
> 
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
> 
Mmm... Not sure I'm getting what you're saying here, sorry. Are you
suggesting that exposing a virtual topology is not a good idea as it
constrains or prevents live migration?

If yes, well, I mostly agree that this is a huge issue, and that's why
I think we need some bright ideas on how to deal with it. I mean, it's
easy to make it optional and let it automatically disable migration,
giving users the choice of what they prefer, but I think this is more
dodging the problem than dealing with it! :-P

> >        * consider this during automatic placement for
> >          resuming/migrating domains (if they have a virtual topology,
> >          better not to change it);
> >        * consider this during memory migration (it can change the
> >          actual topology, should we update it on-line or disable memory
> >          migration?)
> 
> The question is whether trading functionality for performance
> is an acceptable choice.
> 
Indeed. Again, I think it is possible to implement things flexibly
enough, but then we need to come up with a sane default, so we can't
avoid discussing and deciding on this.

One can argue that it is an issue only for big-enough guests (and/or
nearly overcommitted hosts) that don't fit in only one node (as, if they
do, there is no virtual topology to export), but I'm not sure we can
neglect them on this basis.

Thanks for the feedback,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-02 14:07 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang Z Zhang

>>> On 02.08.12 at 15:34, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>> >     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>> >       guest ends up on more than one nodes, make sure it knows it's
>> >       running on a NUMA platform (smaller than the actual host, but
>> >       still NUMA). This interacts with some of the above points:
>> 
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>> 
> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> suggesting that exposing a virtual topology is not a good idea as it
> poses constraints/prevents live migration?

Yes.

> If yes, well, I mostly agree that this is an huge issue, and that's why
> I think wee need some bright idea on how to deal with it. I mean, it's
> easy to make it optional and let it automatically disable migration,
> giving users the choice what they prefer, but I think this is more
> dodging the problem than dealing with it! :-P

Indeed.

>> >        * consider this during automatic placement for
>> >          resuming/migrating domains (if they have a virtual topology,
>> >          better not to change it);
>> >        * consider this during memory migration (it can change the
>> >          actual topology, should we update it on-line or disable memory
>> >          migration?)
>> 
>> The question is whether trading functionality for performance
>> is an acceptable choice.
>> 
> Indeed. Again, I think it is possible to implement things flexibly
> enough, but then we need to come out with a sane default, so we're not
> allowed to avoid discussing and deciding on this.
> 
> One can argue that it is an issue only for big-enough guests (and/or
> nearly overcommitted hosts) that don't fit in only one node (as, if they
> do, there is no virtual topology to export), but I'm not sure we can
> neglect them on this basis.

We certainly can't, especially since the "big enough" case may not
be that infrequent going forward.

Jan


* Re: NUMA TODO-list for xen-devel
From: George Dunlap @ 2012-08-02 16:36 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, Andrew Cooper, xen-devel,
	Jan Beulich, Yang Z Zhang

On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>> >     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>> >       guest ends up on more than one nodes, make sure it knows it's
>> >       running on a NUMA platform (smaller than the actual host, but
>> >       still NUMA). This interacts with some of the above points:
>>
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>>
> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> suggesting that exposing a virtual topology is not a good idea as it
> poses constraints/prevents live migration?
>
> If yes, well, I mostly agree that this is an huge issue, and that's why
> I think wee need some bright idea on how to deal with it. I mean, it's
> easy to make it optional and let it automatically disable migration,
> giving users the choice what they prefer, but I think this is more
> dodging the problem than dealing with it! :-P
>
>> >        * consider this during automatic placement for
>> >          resuming/migrating domains (if they have a virtual topology,
>> >          better not to change it);
>> >        * consider this during memory migration (it can change the
>> >          actual topology, should we update it on-line or disable memory
>> >          migration?)

I think we could use cpu hot-plug to change the "virtual topology" of
VMs, couldn't we?  We could probably even do that on a running guest
if we really needed to.

 -George


* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-03  9:23 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, Andrew Cooper, xen-devel,
	Yang Z Zhang

>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>>> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>>> >     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>>> >       guest ends up on more than one nodes, make sure it knows it's
>>> >       running on a NUMA platform (smaller than the actual host, but
>>> >       still NUMA). This interacts with some of the above points:
>>>
>>> The question is whether this is really useful beyond the (I would
>>> suppose) relatively small set of cases where migration isn't
>>> needed.
>>>
>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
>> suggesting that exposing a virtual topology is not a good idea as it
>> poses constraints/prevents live migration?
>>
>> If yes, well, I mostly agree that this is an huge issue, and that's why
>> I think wee need some bright idea on how to deal with it. I mean, it's
>> easy to make it optional and let it automatically disable migration,
>> giving users the choice what they prefer, but I think this is more
>> dodging the problem than dealing with it! :-P
>>
>>> >        * consider this during automatic placement for
>>> >          resuming/migrating domains (if they have a virtual topology,
>>> >          better not to change it);
>>> >        * consider this during memory migration (it can change the
>>> >          actual topology, should we update it on-line or disable memory
>>> >          migration?)
> 
> I think we could use cpu hot-plug to change the "virtual topology" of
> VMs, couldn't we?  We could probably even do that on a running guest
> if we really needed to.

Hmm, not sure - using hotplug behind the back of the guest might
be possible, but you'd first need to hot-unplug the vCPU. That's
something that I don't think you can do on HVM guests (and for
PV guests, guest visible NUMA support makes even less sense
than for HVM ones).

Jan


* Re: NUMA TODO-list for xen-devel
From: Andre Przywara @ 2012-08-03  9:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anil Madhavapeddy, George Dunlap, Andrew Cooper, xen-devel,
	Dario Faggioli, Yang Z Zhang

On 08/03/2012 11:23 AM, Jan Beulich wrote:
>>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>>> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>>>>>>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>>>>>      - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>>>>>        guest ends up on more than one nodes, make sure it knows it's
>>>>>        running on a NUMA platform (smaller than the actual host, but
>>>>>        still NUMA). This interacts with some of the above points:
>>>>
>>>> The question is whether this is really useful beyond the (I would
>>>> suppose) relatively small set of cases where migration isn't
>>>> needed.
>>>>
>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
>>> suggesting that exposing a virtual topology is not a good idea as it
>>> poses constraints/prevents live migration?

Honestly, what would the problems with migration be? NUMA awareness is
actually a software optimization, so we will not really break anything
if the advertised topology isn't the real one. This is especially true
if we lower the number of NUMA nodes. Say the guest starts with two
nodes and then gets migrated to a machine where it can happily live in
one node. There would be some extra effort by the guest OS to obey the
virtual NUMA topology, but if there isn't actually a NUMA penalty
anymore this shouldn't really hurt, right?
Even if we needed to go to a machine where we have more nodes for a
certain guest than before, this is actually what we have today: guest
NUMA unawareness. I am not sure this is really a migration
showstopper, and it is certainly not a NUMA guest showstopper.

But we could make it a config file option, so we leave this decision to
the admin. I have talked to people with huge guests; they keep asking me
about this feature.

>>>
>>> If yes, well, I mostly agree that this is an huge issue, and that's why
>>> I think wee need some bright idea on how to deal with it. I mean, it's
>>> easy to make it optional and let it automatically disable migration,
>>> giving users the choice what they prefer, but I think this is more
>>> dodging the problem than dealing with it! :-P
>>>
>>>>>         * consider this during automatic placement for
>>>>>           resuming/migrating domains (if they have a virtual topology,
>>>>>           better not to change it);
>>>>>         * consider this during memory migration (it can change the
>>>>>           actual topology, should we update it on-line or disable memory
>>>>>           migration?)
>>
>> I think we could use cpu hot-plug to change the "virtual topology" of
>> VMs, couldn't we?  We could probably even do that on a running guest
>> if we really needed to.
>
> Hmm, not sure - using hotplug behind the back of the guest might
> be possible, but you'd first need to hot-unplug the vCPU. That's
> something that I don't think you can do on HVM guests (and for
> PV guests, guest visible NUMA support makes even less sense
> than for HVM ones).

I don't think that hotplug would really work. I checked this some
time ago; at least the Linux NUMA code cannot really be fooled by this.
The SRAT table is firmware-defined and static by nature, so there is no
code in Linux to change the NUMA topology at runtime. This is especially
true for the memory layout.

But as said above, I don't really buy this as an argument against guest
NUMA. At least provide it as an option to people who know what they are doing.

Regards,
Andre.


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


* Re: NUMA TODO-list for xen-devel
From: Andre Przywara @ 2012-08-03 10:02 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Anil Madhavapeddy, George Dunlap, xen-devel, Jan Beulich,
	Andrew Cooper, Zhang, Yang Z

On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing into xen-unstable, I stated
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a bit of open issues in that area, that I'm
> just braindumping here.

> ...
>
>         * automatic placement of Dom0, if possible (my current series is
>           only affecting DomU)

I think Dom0 NUMA awareness should be one of the top priorities. If I
boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
actually has memory from all 8 nodes and thinks its memory is flat.
There are some tricks to confine it to node 0 (dom0_mem=<memory of
node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
intimate knowledge of the system's parameters and is error-prone (see
the sketch below). Also, this does not work well with ballooning.
Actually we could improve the NUMA placement with that: by asking the
Dom0 explicitly for memory from a certain node.
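
Just to make the "tricks" concrete, here is a hypothetical example of such
a boot line for a box whose node 0 has 8 cores and 16 GB (the values are
system-specific, and the vcpus option is actually spelled dom0_max_vcpus):

  # Hypothetical Xen command line in the bootloader entry; adjust to the box
  xen.gz dom0_mem=16384M dom0_max_vcpus=8 dom0_vcpus_pin

Having to work those numbers out by hand for every machine is exactly the
error-prone part.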

>         * having internal xen data structure honour the placement (e.g.,
>           I've been told that right now vcpu stacks are always allocated
>           on node 0... Andrew?).
>
> [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>        just have them _prefer_ running on the nodes where their memory
>        is.

This would be really cool. I once thought about something like a
home node. We start with placement to allocate memory from one node.
Then we relax the VCPU pinning, but mark this node as special for this
guest, so that, if possible, it gets run there. But in times of
CPU pressure we are happy to let it run on other nodes: CPU starvation is
much worse than the NUMA penalty (see the toy sketch below).
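
To illustrate the idea (just a toy sketch, not Xen scheduler code: the
toy_pcpu type, the idle flag and the pick_pcpu() helper are all made up
for the example):

/*
 * Toy sketch of the "home node" idea: prefer a pcpu on the guest's home
 * node, but fall back to any idle pcpu rather than starving the vcpu.
 */
struct toy_pcpu {
    unsigned int id;
    unsigned int node;
    int idle;                        /* 1 if currently idle */
};

int pick_pcpu(const struct toy_pcpu *pcpus, unsigned int nr_pcpus,
              unsigned int home_node)
{
    int fallback = -1;

    for (unsigned int i = 0; i < nr_pcpus; i++) {
        if (!pcpus[i].idle)
            continue;
        if (pcpus[i].node == home_node)
            return pcpus[i].id;      /* best case: idle pcpu at home */
        if (fallback < 0)
            fallback = pcpus[i].id;  /* remember an idle pcpu elsewhere */
    }

    /*
     * No idle pcpu at home: running remotely beats waiting (CPU
     * starvation is worse than the NUMA penalty).  -1 means "nothing
     * idle at all, let the load balancer decide".
     */
    return fallback;
}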

>
> [D] - Dynamic memory migration between different nodes of the host. As
>        the counter-part of the NUMA-aware scheduler.

I once read about a VMware feature: bandwidth-limited migration in the
background, hot pages first. So we get flexibility and avoid CPU
starvation, but still don't hog the system with memory copying.
Sounds quite ambitious, though.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany


* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-03 10:03 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Anil Madhavapeddy, George Dunlap, Andrew Cooper, xen-devel,
	Dario Faggioli, Yang Z Zhang

>>> On 03.08.12 at 11:48, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 11:23 AM, Jan Beulich wrote:
>>>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>>> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>>>> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>>>>>>>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>>>>>>      - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>>>>>>        guest ends up on more than one nodes, make sure it knows it's
>>>>>>        running on a NUMA platform (smaller than the actual host, but
>>>>>>        still NUMA). This interacts with some of the above points:
>>>>>
>>>>> The question is whether this is really useful beyond the (I would
>>>>> suppose) relatively small set of cases where migration isn't
>>>>> needed.
>>>>>
>>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
>>>> suggesting that exposing a virtual topology is not a good idea as it
>>>> poses constraints/prevents live migration?
> 
> Honestly, what would be the problems with migration? NUMA awareness is 
> actually a software optimization, so we will not really break something 
> if the advertised topology isn't the real one.

Sure, nothing would break, but the purpose of the whole feature
is improving performance, and that benefit might get entirely lost (or
even turn into a penalty) after migration to a host with a different topology.

Jan


* Re: NUMA TODO-list for xen-devel
From: Jan Beulich @ 2012-08-03 10:40 UTC (permalink / raw)
  To: Andre Przywara, Dario Faggioli
  Cc: Anil Madhavapeddy, Andrew Cooper, xen-devel, George Dunlap, Yang Z Zhang

>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
>> Hi everyone,
>>
>> With automatic placement finally landing into xen-unstable, I stated
>> thinking about what I could work on next, still in the field of
>> improving Xen's NUMA support. Well, it turned out that running out of
>> things to do is not an option! :-O
>>
>> In fact, I can think of quite a bit of open issues in that area, that I'm
>> just braindumping here.
> 
>> ...
>>
>>         * automatic placement of Dom0, if possible (my current series is
>>           only affecting DomU)
> 
> I think Dom0 NUMA awareness should be one of the top priorities. If I 
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which 
> actually has memory from all 8 nodes and thinks it's memory is flat.
> There are some tricks to confine it to node 0 (dom0_mem=<memory of 
> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires 
> intimate knowledge of the systems parameters and is error-prone.

How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
an extension to the current options?

> Also this does not work well with ballooning.
> Actually we could improve the NUMA placement with that: By asking the 
> Dom0 explicitly for memory from a certain node.

Yes, passing sideband information to the balloon driver was
always a missing item, not only for NUMA support, but also
for address-restricted memory (e.g. such as is needed to start
32-bit PV guests on big systems).

Jan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03  9:48         ` Andre Przywara
  2012-08-03 10:03           ` Jan Beulich
@ 2012-08-03 11:00           ` George Dunlap
  1 sibling, 0 replies; 36+ messages in thread
From: George Dunlap @ 2012-08-03 11:00 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Anil Madhavapeddy, Andrew Cooper, xen-devel, Dario Faggioli,
	Jan Beulich, Yang Z Zhang

On 03/08/12 10:48, Andre Przywara wrote:
>>> I think we could use cpu hot-plug to change the "virtual topology" of
>>> VMs, couldn't we?  We could probably even do that on a running guest
>>> if we really needed to.
>> Hmm, not sure - using hotplug behind the back of the guest might
>> be possible, but you'd first need to hot-unplug the vCPU. That's
>> something that I don't think you can do on HVM guests (and for
>> PV guests, guest visible NUMA support makes even less sense
>> than for HVM ones).
> I don't think that hotplug would really work. I checked this some
> time ago; at least the Linux NUMA code cannot really be fooled by this.
> The SRAT table is firmware defined and static by nature, so there is no
> code in Linux to change the NUMA topology at runtime. This is especially
> true for the memory layout.
I was more thinking of giving a VM the biggest topology you would want 
at boot, and then asking Linux to online or offline vcpus; for example, 
giving it a 4x2 topology (4 vcores x 2 vnodes).  When running on a 
system with 2 cores per node, you offline 2 vcpus per vnode, giving it 
an effective layout of 2x2.  When running on a system with 4 cores per 
node, you could offline all of the cores on one node, giving it an 
effective topology of 4x1.

Unfortunately, I just realized that you could change the number of vcpus 
in a given node, but you couldn't move the memory around very easily.   
Unless you have memory hotplug? Hmm..... :-)

  -George
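(A back-of-the-envelope sketch of that online/offline arithmetic, for illustration
only: the 4x2 virtual topology and the host's cores-per-node value are made-up
numbers, and this is plain demo code, not toolstack code.)

#include <stdio.h>

int main(void)
{
    const int vnodes = 2, vcores_per_vnode = 4;   /* fixed 4x2 topology at boot */
    const int host_cores_per_node = 2;            /* discovered at placement time */

    /* Online only as many vcpus per vnode as the host has cores per node. */
    int online_per_vnode = host_cores_per_node < vcores_per_vnode
                               ? host_cores_per_node : vcores_per_vnode;

    for (int vnode = 0; vnode < vnodes; vnode++)
        for (int vcpu = 0; vcpu < vcores_per_vnode; vcpu++) {
            int id = vnode * vcores_per_vnode + vcpu;
            printf("vcpu %d (vnode %d): %s\n", id, vnode,
                   vcpu < online_per_vnode ? "online" : "offline");
        }
    /* Effective layout here: 2x2, out of the 4x2 the guest was booted with. */
    return 0;
}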

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 10:40   ` Jan Beulich
@ 2012-08-03 11:26     ` Andre Przywara
  2012-08-03 11:38       ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Andre Przywara @ 2012-08-03 11:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anil Madhavapeddy, Andrew Cooper, xen-devel, Dario Faggioli,
	George Dunlap, Yang Z Zhang

On 08/03/2012 12:40 PM, Jan Beulich wrote:
>>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
>> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
>>> Hi everyone,
>>>
>>> With automatic placement finally landing into xen-unstable, I started
>>> thinking about what I could work on next, still in the field of
>>> improving Xen's NUMA support. Well, it turned out that running out of
>>> things to do is not an option! :-O
>>>
>>> In fact, I can think of quite a bit of open issues in that area, that I'm
>>> just braindumping here.
>>
>>> ...
>>>
>>>          * automatic placement of Dom0, if possible (my current series is
>>>            only affecting DomU)
>>
>> I think Dom0 NUMA awareness should be one of the top priorities. If I
>> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
>> actually has memory from all 8 nodes and thinks its memory is flat.
>> There are some tricks to confine it to node 0 (dom0_mem=<memory of
>> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
>> intimate knowledge of the system's parameters and is error-prone.
>
> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> an extension to the current options?

Yes, that sounds like a good idea. And relatively easy to implement.
Maybe a list or a number of nodes (to make it more complicated ;-)

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 11:26     ` Andre Przywara
@ 2012-08-03 11:38       ` Jan Beulich
  2012-08-03 13:14         ` Dario Faggioli
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2012-08-03 11:38 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Anil Madhavapeddy, Andrew Cooper, xen-devel, Dario Faggioli,
	George Dunlap, Yang Z Zhang

>>> On 03.08.12 at 13:26, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 12:40 PM, Jan Beulich wrote:
>>>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
>>> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
>>>> Hi everyone,
>>>>
>>>> With automatic placement finally landing into xen-unstable, I started
>>>> thinking about what I could work on next, still in the field of
>>>> improving Xen's NUMA support. Well, it turned out that running out of
>>>> things to do is not an option! :-O
>>>>
>>>> In fact, I can think of quite a bit of open issues in that area, that I'm
>>>> just braindumping here.
>>>
>>>> ...
>>>>
>>>>          * automatic placement of Dom0, if possible (my current series is
>>>>            only affecting DomU)
>>>
>>> I think Dom0 NUMA awareness should be one of the top priorities. If I
>>> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
>>> actually has memory from all 8 nodes and thinks its memory is flat.
>>> There are some tricks to confine it to node 0 (dom0_mem=<memory of
>>> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
>>> intimate knowledge of the system's parameters and is error-prone.
>>
>> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> an extension to the current options?
> 
> Yes, that sounds like a good idea. And relatively easy to implement.
> Maybe a list or a number of nodes (to make it more complicated ;-)

Oh yes, of course I implied this flexibility. Just wanted to give
an easy to read example.

Jan
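(As a rough illustration of the syntax being discussed here: "dom0_mem=node<n>"
is only a proposal at this stage, and the parser below is a standalone sketch
with made-up limits, not actual Xen command-line code. It assumes a 64-bit
unsigned long for the node bitmap.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64   /* assumption, purely for the demo */

/* Parse "node<a>[,node<b>,...]" into a bitmap of node IDs; returns 0 on success. */
static int parse_node_list(const char *s, unsigned long *nodemask)
{
    *nodemask = 0;
    while (*s) {
        if (strncmp(s, "node", 4) != 0)
            return -1;
        char *end;
        unsigned long node = strtoul(s + 4, &end, 10);
        if (end == s + 4 || node >= MAX_NODES)
            return -1;
        *nodemask |= 1UL << node;
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1;
        s = end;
    }
    return *nodemask ? 0 : -1;
}

int main(void)
{
    unsigned long mask;
    if (parse_node_list("node0,node2", &mask) == 0)
        printf("dom0 confined to node mask 0x%lx\n", mask);   /* prints 0x5 */
    return 0;
}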

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 11:38       ` Jan Beulich
@ 2012-08-03 13:14         ` Dario Faggioli
  2012-08-03 13:52           ` Jan Beulich
  0 siblings, 1 reply; 36+ messages in thread
From: Dario Faggioli @ 2012-08-03 13:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Andrew Cooper,
	xen-devel, Yang Z Zhang


[-- Attachment #1.1: Type: text/plain, Size: 1094 bytes --]

On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote: 
> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> >> an extension to the current options?
> > 
> > Yes, that sounds like a good idea. And relatively easy to implement.
> > Maybe a list or a number of nodes (to make it more complicated ;-)
> 
> Oh yes, of course I implied this flexibility. Just wanted to give
> an easy to read example.
> 
Yep, I agree it sounds nice and should not be too hard. I'll update the
Wiki page.

I only have one question: should we try to take IONUMA into account here
as well? I mean, if it turns out that I/O hubs are connected to some
specific node(s), shouldn't we consider pinning/"affining" Dom0 to those
node(s), as it most likely will be responsible for some/most DomUs' I/O?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 13:14         ` Dario Faggioli
@ 2012-08-03 13:52           ` Jan Beulich
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2012-08-03 13:52 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Andrew Cooper,
	xen-devel, Yang Z Zhang

>>> On 03.08.12 at 15:14, Dario Faggioli <raistlin@linux.it> wrote:
> On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote: 
>> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> >> an extension to the current options?
>> > 
>> > Yes, that sounds like a good idea. And relatively easy to implement.
>> > Maybe a list or a number of nodes (to make it more complicated ;-)
>> 
>> Oh yes, of course I implied this flexibility. Just wanted to give
>> an easy to read example.
>> 
> Yep, I agree it sounds nice and should not be too hard. I'll update the
> Wiki page.
> 
> I only have one question: should we try to take IONUMA into account here
> as well? I mean, if it turns out that I/O hubs are connected to some
> specific node(s), shouldn't we consider pinning/"affining" Dom0 to those
> node(s), as it most likely will be responsible for some/most DomUs' I/O?

I don't think the necessary information is available at the time
when Dom0 gets constructed.

Jan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-01 16:16 NUMA TODO-list for xen-devel Dario Faggioli
                   ` (5 preceding siblings ...)
  2012-08-03 10:02 ` Andre Przywara
@ 2012-08-03 22:22 ` Dan Magenheimer
  2012-08-07 23:49   ` Dario Faggioli
  6 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2012-08-03 22:22 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Konrad Wilk,
	Kurt Hackel, Jan Beulich, Andrew Cooper, Zhang, Yang Z

> From: Dario Faggioli [mailto:raistlin@linux.it]
> Subject: [Xen-devel] NUMA TODO-list for xen-devel
> 
> Hi everyone,

Hi Dario --

Thanks for your great work on NUMA... an area of interest of
mine, but one that, sadly, I haven't been able to give much time to,
so I'm glad you've taken this bull by the horns.

I've been sitting on an idea for some time that probably
deserves some exposure on your list.  Naturally, it involves
my favorite topic tmem (readers, please don't tune out yet :-).

It has occurred to me that a fundamental tenet of NUMA
is to put infrequently used data on "other" nodes, while
pulling frequently used data onto a "local" node.

Tmem very nicely separates infrequently-used data from
frequently-used data with an API/ABI that is now fully
implemented in upstream Linux.

If Xen had a "alloc_page_on_any_node_but_the_current_one()"
(or "any_node_except_this_guests_node_set" for multinode guests)
and Xen's tmem implementation were to use it, especially
in combination with selfballooning (also upstream), this
could solve a significant part of the NUMA problem when running
tmem-enabled guests.  The most frequently used data
stays in the guest (thus in the guest's "current node")
and the less frequently used data lives in tmem in the
hypervisor (on the complement of the guest's node set).

Naturally, this doesn't solve any NUMA problems at all for
tmem-ignorant or tmem-disabled guests, but if it works
sufficiently well for tmem-enabled guests, that might
encourage other OS's to do a simple implementation of tmem.
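(To make the page-allocation idea above concrete -- purely illustrative: the
allocator names are hypothetical, and this standalone demo only shows the
nodemask arithmetic such a helper would need, i.e. "every online node except
the ones the guest occupies".)

#include <stdio.h>

typedef unsigned long nodemask_t;   /* assumption: <= 64 nodes, one bit per node */

/* Nodes eligible for this guest's tmem pages: online nodes minus its own. */
static nodemask_t tmem_candidate_nodes(nodemask_t online, nodemask_t guest_nodes)
{
    nodemask_t mask = online & ~guest_nodes;
    return mask ? mask : online;    /* fall back to any node if nothing is left */
}

int main(void)
{
    nodemask_t online = 0xFUL;      /* nodes 0-3 online (made-up example) */
    nodemask_t guest  = 0x1UL;      /* single-node guest on node 0 */
    printf("tmem pages go to nodes in mask 0x%lx\n",
           tmem_candidate_nodes(online, guest));   /* prints 0xe */
    return 0;
}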

Sadly, I'm not able to invest much time in this idea,
but the combination of tmem and NUMA might interest some
developers and/or grad students, in which case I'd be happy
to spend a little time assisting.

I'll be at Xen Summit for at least the first day, so we
can chat more if you are interested.  George/Jan, I suspect
you have the best knowledge of tmem outside of Oracle as well
as being NUMA-fluent, so I'd appreciate your thoughts as well!

Thanks,
Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-02  9:43 ` Jan Beulich
  2012-08-02 13:34   ` Dario Faggioli
@ 2012-08-03 22:34   ` Dan Magenheimer
  2012-08-06  7:15     ` Jan Beulich
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2012-08-03 22:34 UTC (permalink / raw)
  To: Jan Beulich, Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang Z Zhang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, August 02, 2012 3:43 AM
> To: Dario Faggioli
> Cc: Andre Przywara; Anil Madhavapeddy; George Dunlap; xen-devel; Andrew Cooper; Yang Z Zhang
> Subject: Re: [Xen-devel] NUMA TODO-list for xen-devel
> 
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> >     - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >       guest ends up on more than one nodes, make sure it knows it's
> >       running on a NUMA platform (smaller than the actual host, but
> >       still NUMA). This interacts with some of the above points:
> 
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
> 
> >        * consider this during automatic placement for
> >          resuming/migrating domains (if they have a virtual topology,
> >          better not to change it);
> >        * consider this during memory migration (it can change the
> >          actual topology, should we update it on-line or disable memory
> >          migration?)
> 
> The question is whether trading functionality for performance
> is an acceptable choice.

If there were a lwn.net equivalent for Xen, I'd be pushing to get
quoted on the following:

"Virtualization: You can have flexibility or you can have performance.
Pick one."

A couple of years ago when NUMA was first being extensively discussed
for Xen, I suggested that this should really be a "top level" flag
that a sysadmin should be able to select: Either optimize for
performance or optimize for flexibility.  Then Xen and the Xen tools
should "do the right thing" depending on the selection.

I still think this is a good way to surface the tradeoffs for
a very complex problem to the vast majority of users/admins.
Clearly they will want "both" but forcing the choice will
provoke more thought about their use model, as well as provide
important guidance to the underlying implementations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 10:03           ` Jan Beulich
@ 2012-08-03 22:40             ` Dan Magenheimer
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2012-08-03 22:40 UTC (permalink / raw)
  To: Jan Beulich, Andre Przywara
  Cc: Anil Madhavapeddy, George Dunlap, Andrew Cooper, xen-devel,
	Dario Faggioli, Yang Z Zhang

> >>>>> The question is whether this is really useful beyond the (I would
> >>>>> suppose) relatively small set of cases where migration isn't
> >>>>> needed.
> >>>>>
> >>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> >>>> suggesting that exposing a virtual topology is not a good idea as it
> >>>> poses constraints/prevents live migration?
> >
> > Honestly, what would be the problems with migration? NUMA awareness is
> > actually a software optimization, so we will not really break anything
> > if the advertised topology isn't the real one.
> 
> Sure, nothing would break, but the purpose of the whole feature
> is improving performance, and that might get entirely lost (or
> even worse) after a migration to a different topology host.

+1

In the end, customers who care about getting 99.9% of native performance
should use physical hardware.  Live migration means that someone/something
is trying to do resource optimization and so performance optimization is
secondary.  But claiming great performance before migration and getting
sucky performance after migration is, IMHO, a disaster, especially when
future "cloud users" won't have a clue whether their environment has
migrated or not.

Just my two cents...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 10:02 ` Andre Przywara
  2012-08-03 10:40   ` Jan Beulich
@ 2012-08-03 22:42   ` Dan Magenheimer
  2012-08-08  7:07     ` Dario Faggioli
  2012-08-08  7:43   ` Dario Faggioli
  2 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2012-08-03 22:42 UTC (permalink / raw)
  To: Andre Przywara, Dario Faggioli
  Cc: Anil Madhavapeddy, George Dunlap, xen-devel, Jan Beulich,
	Andrew Cooper, Zhang, Yang Z

> > [D] - Dynamic memory migration between different nodes of the host. As
> >        the counter-part of the NUMA-aware scheduler.
> 
> I once read about a VMware feature: bandwith-limited migration in the
> background, hot pages first. So we get flexibility and avoid CPU
> starving, but still don't hog the system with memory copying.
> Sounds quite ambitious, though.

Something like this, but between NUMA nodes instead of physical systems?

http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 22:34   ` Dan Magenheimer
@ 2012-08-06  7:15     ` Jan Beulich
  2012-08-06 16:28       ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2012-08-06  7:15 UTC (permalink / raw)
  To: Dario Faggioli, Dan Magenheimer
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang ZZhang

>>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> The question is whether trading functionality for performance
>> is an acceptable choice.
> 
> If there were a lwn.net equivalent for Xen, I'd be pushing to get
> quoted on the following:
> 
> "Virtualization: You can have flexibility or you can have performance.
> Pick one."
> 
> A couple of years ago when NUMA was first being extensively discussed
> for Xen, I suggested that this should really be a "top level" flag
> that a sysadmin should be able to select: Either optimize for
> performance or optimize for flexibility.  Then Xen and the Xen tools
> should "do the right thing" depending on the selection.
> 
> I still think this is a good way to surface the tradeoffs for
> a very complex problem to the vast majority of users/admins.
> Clearly they will want "both" but forcing the choice will
> provoke more thought about their use model, as well as provide
> important guidance to the underlying implementations.

I would expect a good part to pick performance, and then
go whine about something not working in an emergency. On
xen-devel one could respond with this-is-what-you-get, but
you can't necessarily do so to paying customers...

Jan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-06  7:15     ` Jan Beulich
@ 2012-08-06 16:28       ` Dan Magenheimer
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2012-08-06 16:28 UTC (permalink / raw)
  To: Jan Beulich, Dario Faggioli
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Andrew Cooper, Yang ZZhang

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] NUMA TODO-list for xen-devel
> 
> >>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> The question is whether trading functionality for performance
> >> is an acceptable choice.
> >
> > If there were a lwn.net equivalent for Xen, I'd be pushing to get
> > quoted on the following:
> >
> > "Virtualization: You can have flexibility or you can have performance.
> > Pick one."
> >
> > A couple of years ago when NUMA was first being extensively discussed
> > for Xen, I suggested that this should really be a "top level" flag
> > that a sysadmin should be able to select: Either optimize for
> > performance or optimize for flexibility.  Then Xen and the Xen tools
> > should "do the right thing" depending on the selection.
> >
> > I still think this is a good way to surface the tradeoffs for
> > a very complex problem to the vast majority of users/admins.
> > Clearly they will want "both" but forcing the choice will
> > provoke more thought about their use model, as well as provide
> > important guidance to the underlying implementations.
> 
> I would expect a good part to pick performance, and then
> go whine about something not working in an emergency. On
> xen-devel one could respond with this-is-what-you-get, but
> you can't necessarily do so to paying customers...

Well, you can, but you have to first convince marketing that
virtualization doesn't solve all problems for all users all the
time. :-)

The two options would have to be clearly documented as:

"flexibility-is-my-highest-priority-and-performance-is-second-priority"

and

"performance-is-my-highest-priority-and-flexibility-is-second-priority"

and when a user selects the latter, they should be prompted with

"Are you really sure you want to use virtualization instead of bare metal?"

Sigh. We can only wish.
Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-02  1:04 ` Zhang, Yang Z
@ 2012-08-07 22:56   ` Dario Faggioli
  0 siblings, 0 replies; 36+ messages in thread
From: Dario Faggioli @ 2012-08-07 22:56 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel,
	Jan Beulich, Andrew Cooper


[-- Attachment #1.1: Type: text/plain, Size: 1687 bytes --]

On Thu, 2012-08-02 at 01:04 +0000, Zhang, Yang Z wrote:
> >     - Automatic placement at guest creation time. Basics are there and
> >       will be shipping with 4.2. However, a lot of other things are
> >       missing and/or can be improved, for instance:
> > [D]    * automated verification and testing of the placement;
> >        * benchmarks and improvements of the placement heuristic;
> > [D]    * choosing/building up some measure of node load (more accurate
> >          than just counting vcpus) onto which to rely during placement;
> >        * consider IONUMA during placement;
> We should consider two things:
> 1. Dom0 IONUMA: devices used by dom0 should get their DMA buffers from the node on which they reside. Currently, Dom0 allocates DMA buffers without providing the node info to the hypercall.
> 2. Guest IONUMA: when a guest boots up with a pass-through device, we need to allocate memory from the node where the device resides for further DMA buffer allocation, and let the guest know the IONUMA topology. This relies on guest NUMA.
> This topic was mentioned in xen summit 2011:
> http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf
> 
Seems fine; I knew about that presentation, and I've added these details to the
Wiki page (sorry for the delay). Are you (or someone from your group)
perhaps working or planning to work on it?
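(For the Dom0 IONUMA point above, the missing piece is essentially a node hint
carried in the memory-allocation flags that the ballooning/DMA path passes down.
A minimal sketch of one way such a hint could be encoded -- the macros and field
layout below are made up for illustration and are not meant to describe Xen's
actual memory-op ABI.)

#include <stdio.h>

#define MEMF_NODE_SHIFT   8
#define MEMF_node(n)      (((unsigned int)(n) + 1) << MEMF_NODE_SHIFT)  /* 0 == no preference */
#define MEMF_get_node(f)  ((int)((((f) >> MEMF_NODE_SHIFT) & 0xff) - 1))
#define MEMF_exact_node   (1u << 16)   /* fail rather than fall back to another node */

/* What a node-aware DMA-buffer request from dom0 could pass down. */
static unsigned int dma_alloc_flags(int device_node, int strict)
{
    unsigned int flags = 0;
    if (device_node >= 0)
        flags |= MEMF_node(device_node) | (strict ? MEMF_exact_node : 0);
    return flags;
}

int main(void)
{
    unsigned int f = dma_alloc_flags(2, 1);   /* device hangs off node 2 */
    printf("flags 0x%x -> node %d\n", f, MEMF_get_node(f));
    return 0;
}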

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 22:22 ` Dan Magenheimer
@ 2012-08-07 23:49   ` Dario Faggioli
  0 siblings, 0 replies; 36+ messages in thread
From: Dario Faggioli @ 2012-08-07 23:49 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Andrew Cooper,
	Konrad Wilk, xen-devel, Jan Beulich, Kurt Hackel, Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 2675 bytes --]

On Fri, 2012-08-03 at 15:22 -0700, Dan Magenheimer wrote:
> Hi Dario --
> 
Hello Dan,

> Thanks for your great work on NUMA... an interest area of
> mine but one, sadly, I haven't been able to give much time to,
> so I'm glad you've taken this bull by the horns.
> 
Trying to... Let's see! :-P

> I've been sitting on an idea for some time that probably
> deserves some exposure on your list.  Naturally, it involves
> my favorite topic tmem (readers, please don't tune out yet :-).
> 
It sure does! I've already put something quite generic about "memory
sharing" there, because I know that it has all but trivial interactions
with the improved NUMA support I am/we are trying to envision.

The fact that it is, as I said, generic, is due to my ignorance (let's
say for now) of the whole tmem thing, so thanks for the contribution,
it's very useful to hear your point of view on this!

> It has occurred to me that a fundamental tenet of NUMA
> is to put infrequently used data on "other" nodes, while
> pulling frequently used data onto a "local" node.
> 
> Tmem very nicely separates infrequently-used data from
> frequently-used data with an API/ABI that is now fully
> implemented in upstream Linux.
> 
I see, and it seems nice.

> [..]
>
> Naturally, this doesn't solve any NUMA problems at all for
> tmem-ignorant or tmem-disabled guests, but if it works
> sufficiently well for tmem-enabled guests, that might
> encourage other OS's to do a simple implementation of tmem.
> 
Sure. In my opinion, this is not an area where we could aim at "solving
every problem for everyone". However, we should definitely target having
a sensible solution for default and/or most common use cases and
scenarios.

> Sadly, I'm not able to invest much time in this idea,
> but the combination of tmem and NUMA might interest some
> developers and/or grad students, in which case I'd be happy
> to spend a little time assisting.
> 
That's definitely the case. I've tried to put a summary of what you said
in this mail to the Wiki (http://wiki.xen.org/wiki/Xen_NUMA_Roadmap) and
also put your contact next to it. Feel free to update/correct if you find
something wrong. :-P

> I'll be at Xen Summit for at least the first day, so we
> can chat more if you are interested.
>
I indeed am interested, so let's make that happen! :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-02  0:04     ` Malte Schwarzkopf
@ 2012-08-07 23:53       ` Dario Faggioli
  0 siblings, 0 replies; 36+ messages in thread
From: Dario Faggioli @ 2012-08-07 23:53 UTC (permalink / raw)
  To: Malte Schwarzkopf
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Zhang, Yang Z, Steven Smith


[-- Attachment #1.1: Type: text/plain, Size: 2697 bytes --]

On Thu, 2012-08-02 at 01:04 +0100, Malte Schwarzkopf wrote:
> > Wow... That's really cool. I'll definitely take a deep look at all these
> > data! I'm also adding the link to the wiki, if you're fine with that...
> 
> No problem with adding a link, as this is public data :) If possible,
> it'd be splendid to put a note next to this link encouraging people to
> submit their own results -- doing so is very simple, and helps us extend
> the database. Instructions are at
> http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short
> link, http://fable.io).
> 
Ok, I've tried doing this; here is how it looks:
 http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
 http://wiki.xen.org/wiki/Xen_NUMA_Roadmap#Inter-VM_dependencies_and_communication_issues

Thanks also for the references, I'll definitely take a look at them. :-)

> One interesting thing to look at (that we haven't looked at yet) is what
> memory allocators do about NUMA these days; there is an AMD whitepaper
> from 2009 discussing the performance benefits of a NUMA-aware version of
> tcmalloc [3], but I have found it hard to reproduce their results on
> modern hardware. Of course, being virtualized may complicate matters
> here, since the memory allocator can no longer freely pick and choose
> where to allocate from.
> 
> Scheduling, notably, is key here, since the CPU a process is scheduled
> on may determine where its memory is allocated -- frequent migrations
> are likely to be bad for performance due to remote memory accesses,
>
That might be true for Linux, but it's not so much true
(fortunately :-P) for Xen. However, I also think scheduling is a very
important aspect of this whole NUMA thing... I'll repost my NUMA aware
credit scheduler patches soon.

> although we have been unable to quantify a significant difference on
> non-synthetic macrobenchmarks; that said, we did not try very hard so far.
> 
I think both kinds of benchmarks are interesting. I tried to concentrate
a bit on macrobenchmarks (SPECjbb, I'll let you decide if that's
synthetic or not :-D).

Another issue, if we want to tackle the problem of communicating/cooperating
VMs, pops up at the interface level, i.e., how do we want the user to
tell us that 2 (or more) VMs are "related"? Up to what level of detail?
Should this "relationship" be permanent or might it change over time?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 22:42   ` Dan Magenheimer
@ 2012-08-08  7:07     ` Dario Faggioli
  0 siblings, 0 replies; 36+ messages in thread
From: Dario Faggioli @ 2012-08-08  7:07 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 1224 bytes --]

On Fri, 2012-08-03 at 15:42 -0700, Dan Magenheimer wrote:
> > > [D] - Dynamic memory migration between different nodes of the host. As
> > >        the counter-part of the NUMA-aware scheduler.
> > 
> > I once read about a VMware feature: bandwith-limited migration in the
> > background, hot pages first. So we get flexibility and avoid CPU
> > starving, but still don't hog the system with memory copying.
> > Sounds quite ambitious, though.
> 
> Something like this, but between NUMA nodes instead of physical systems?
> 
> http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf 
>
Likely. The analogy between this kind of "memory migration" and the
actual live migration we already have is indeed something I want to take
advantage of. The fact that we support that small thing called
_paravirtualization_ is complicating it all quite a bit, but I'm looking
into it... Thanks for the reference. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NUMA TODO-list for xen-devel
  2012-08-03 10:02 ` Andre Przywara
  2012-08-03 10:40   ` Jan Beulich
  2012-08-03 22:42   ` Dan Magenheimer
@ 2012-08-08  7:43   ` Dario Faggioli
  2 siblings, 0 replies; 36+ messages in thread
From: Dario Faggioli @ 2012-08-08  7:43 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Anil Madhavapeddy, George Dunlap, Andrew Cooper, xen-devel,
	Jan Beulich, Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 1614 bytes --]

On Fri, 2012-08-03 at 12:02 +0200, Andre Przywara wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> > ...
> >
> >         * automatic placement of Dom0, if possible (my current series is
> >           only affecting DomU)
> 
> I think Dom0 NUMA awareness should be one of the top priorities. If I 
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which 
> actually has memory from all 8 nodes and thinks it's memory is flat.
>
Ok, I updated the Wiki page with a link to this (sub)thread --- more
specifically, the mails where we agree about the new syntax. I can work
on it, but not in the next few days, so let's see if anyone steps up
before I get to look at it. :-)

> > [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
> >        just have them _prefer_ running on the nodes where their memory
> >        is.
> 
> This would be really cool. I once thought about something like a 
> home-node. We start with placement to allocate memory from one node. 
> Then we relax the VCPU-pinning, but mark this node as special for this 
> guest, so that, if possible, it gets run there. But in times of 
> CPU pressure we are happy to let it run on other nodes: CPU starvation is 
> much worse than the NUMA penalty.
> 
Yep. Patches coming soon.
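(A minimal sketch of the "home node" idea quoted above, for illustration only:
the structures and numbers are made up, and this is not the actual credit
scheduler code. The point is simply "prefer an idle pcpu on the vcpu's home
node, but fall back to any idle pcpu rather than let it starve".)

#include <stdio.h>

#define NR_PCPUS 8

struct pcpu { int node; int idle; };

static int pick_pcpu(const struct pcpu *p, int n, int home_node)
{
    int fallback = -1;
    for (int i = 0; i < n; i++) {
        if (!p[i].idle)
            continue;
        if (p[i].node == home_node)
            return i;               /* best case: idle cpu on the home node */
        if (fallback < 0)
            fallback = i;           /* remember any idle cpu as a fallback */
    }
    return fallback;                /* may be -1: nothing idle at all */
}

int main(void)
{
    /* Two-node box: pcpus 0-3 on node 0, 4-7 on node 1; node 0 is fully busy. */
    struct pcpu pcpus[NR_PCPUS];
    for (int i = 0; i < NR_PCPUS; i++) {
        pcpus[i].node = i / 4;
        pcpus[i].idle = (i >= 4);
    }
    printf("vcpu with home node 0 runs on pcpu %d\n",
           pick_pcpu(pcpus, NR_PCPUS, 0));   /* prints 4: off-node, but not starved */
    return 0;
}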

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2012-08-08  7:43 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-01 16:16 NUMA TODO-list for xen-devel Dario Faggioli
2012-08-01 16:24 ` Dario Faggioli
2012-08-01 16:30 ` Andrew Cooper
2012-08-01 16:47   ` Dario Faggioli
2012-08-01 16:53     ` Andrew Cooper
2012-08-02  9:40   ` Jan Beulich
2012-08-02 13:21     ` Dario Faggioli
2012-08-01 16:32 ` Anil Madhavapeddy
2012-08-01 16:58   ` Dario Faggioli
2012-08-02  0:04     ` Malte Schwarzkopf
2012-08-07 23:53       ` Dario Faggioli
2012-08-02  1:04 ` Zhang, Yang Z
2012-08-07 22:56   ` Dario Faggioli
2012-08-02  9:43 ` Jan Beulich
2012-08-02 13:34   ` Dario Faggioli
2012-08-02 14:07     ` Jan Beulich
2012-08-02 16:36     ` George Dunlap
2012-08-03  9:23       ` Jan Beulich
2012-08-03  9:48         ` Andre Przywara
2012-08-03 10:03           ` Jan Beulich
2012-08-03 22:40             ` Dan Magenheimer
2012-08-03 11:00           ` George Dunlap
2012-08-03 22:34   ` Dan Magenheimer
2012-08-06  7:15     ` Jan Beulich
2012-08-06 16:28       ` Dan Magenheimer
2012-08-03 10:02 ` Andre Przywara
2012-08-03 10:40   ` Jan Beulich
2012-08-03 11:26     ` Andre Przywara
2012-08-03 11:38       ` Jan Beulich
2012-08-03 13:14         ` Dario Faggioli
2012-08-03 13:52           ` Jan Beulich
2012-08-03 22:42   ` Dan Magenheimer
2012-08-08  7:07     ` Dario Faggioli
2012-08-08  7:43   ` Dario Faggioli
2012-08-03 22:22 ` Dan Magenheimer
2012-08-07 23:49   ` Dario Faggioli
