* [PATCH 00/11] PV NUMA Guests
@ 2010-04-04 19:30 Dulloor
  2010-04-05  6:29 ` Keir Fraser
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dulloor @ 2010-04-04 19:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser

This set of patches implements virtual NUMA enlightenment to support
NUMA-aware PV guests. In more detail, the patches implement the
following:

* For NUMA systems, the following memory allocation strategies are
implemented:
- CONFINE : Confine the VM memory allocation to a single node. As
opposed to the current method of doing this in python, the patch
implements this in libxc (along with the other strategies) and with
assurance that the memory actually comes from the selected node.
- STRIPE : If the VM memory doesn't fit in a single node and the VM
is not compiled with guest NUMA support, the memory is allocated
striped across a selected max-set of nodes.
- SPLIT : If the VM memory doesn't fit in a single node and the VM
is compiled with guest NUMA support, the memory is allocated split
(equally for now) across the min-set of nodes. The VM is then made
aware of this NUMA allocation (virtual NUMA enlightenment).
- DEFAULT : This is the existing allocation scheme.

* If numa-guest support is compiled into the PV guest, we add
numa-guest-support to the Xen features ELF note. The Xen tools use this
to determine whether the SPLIT strategy can be applied.

* The PV guest uses the virtual NUMA enlightenment to set up its NUMA
layout (at the time of initmem_init).
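The strategy selection described above can be sketched as follows. This is an illustrative Python model only; the function and parameter names are invented and do not reflect the actual libxc implementation:

```python
def pick_strategy(vm_mem, guest_numa_capable, node_free):
    """Choose an allocation strategy for a VM needing vm_mem MB.

    node_free maps node id -> free MB. This mirrors the decision
    order in the cover letter, not the real libxc code.
    """
    # CONFINE: the whole VM fits in one node; pick the tightest fit.
    fits = [n for n, free in node_free.items() if free >= vm_mem]
    if fits:
        best = min(fits, key=lambda n: node_free[n] - vm_mem)
        return ("CONFINE", [best])

    # Otherwise gather nodes from most-free downward until the VM fits:
    # a min-set for SPLIT (guest is NUMA-capable), else the stripe set.
    chosen, total = [], 0
    for n in sorted(node_free, key=node_free.get, reverse=True):
        chosen.append(n)
        total += node_free[n]
        if total >= vm_mem:
            return ("SPLIT" if guest_numa_capable else "STRIPE", chosen)

    return ("DEFAULT", [])  # cannot satisfy; fall back to existing scheme
```

For example, a 4 GB VM on a machine with 8 GB free on node 0 is CONFINEd there, while a 12 GB VM spanning two 8 GB nodes is SPLIT or STRIPEd depending on guest capability.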

Please comment.

-dulloor

Signed-off-by: Dulloor Rao <dulloor@gatech.edu>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-04 19:30 [PATCH 00/11] PV NUMA Guests Dulloor
@ 2010-04-05  6:29 ` Keir Fraser
  2010-04-07  7:57   ` Cui, Dexuan
  2010-04-05 14:52 ` Dan Magenheimer
  2010-04-09 11:34 ` Ian Pratt
  2 siblings, 1 reply; 12+ messages in thread
From: Keir Fraser @ 2010-04-05  6:29 UTC (permalink / raw)
  To: Dulloor, xen-devel; +Cc: Andre Przywara, Cui, Dexuan

I would like Acks from the people working on HVM NUMA for this patch series.
At the very least it would be nice to have a single user interface for
setting this up, regardless of whether for a PV or HVM guest. Hopefully code
in the toolstack also can be shared. So I'm cc'ing Dexuan and Andre, as I
know they are involved in the HVM NUMA work.

 Thanks,
 Keir

On 04/04/2010 20:30, "Dulloor" <dulloor@gmail.com> wrote:

> The set of patches implements virtual NUMA-enlightenment to support
> NUMA-aware PV guests. In more detail, the patch implements the
> following :
> 
> * For the NUMA systems, the following memory allocation strategies are
> implemented :
>  - CONFINE : Confine the VM memory allocation to a single node. As
> opposed to the current method of doing this in python, the patch
> implements this in libxc(along with other strategies) and with
> assurance that the memory actually comes from the selected node.
> - STRIPE : If the VM memory doesn't fit in a single node and if the VM
> is not compiled with guest-numa-support, the memory is allocated
> striped across a selected max-set of nodes.
> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
> is compiled with guest-numa-support, the memory is allocated split
> (equally for now) from the min-set of nodes. The  VM is then made
> aware of this NUMA allocation (virtual NUMA enlightenment).
> -DEFAULT : This is the existing allocation scheme.
> 
> * If the numa-guest support is compiled into the PV guest, we add
> numa-guest-support to xen features elfnote. The xen tools use this to
> determine if SPLIT strategy can be applied.
> 
> * The PV guest uses the virtual NUMA enlightenment to setup its NUMA
> layout (at the time of initmem_init)
> 
> Please comment.
> 
> -dulloor
> 
> Signed-off-by: Dulloor Rao <dulloor@gatech.edu>


* RE: [PATCH 00/11] PV NUMA Guests
  2010-04-04 19:30 [PATCH 00/11] PV NUMA Guests Dulloor
  2010-04-05  6:29 ` Keir Fraser
@ 2010-04-05 14:52 ` Dan Magenheimer
  2010-04-06  3:51   ` Dulloor
  2010-04-09 11:34 ` Ian Pratt
  2 siblings, 1 reply; 12+ messages in thread
From: Dan Magenheimer @ 2010-04-05 14:52 UTC (permalink / raw)
  To: Dulloor, xen-devel; +Cc: Keir Fraser

Could you comment on if/how these work when memory is more
dynamically allocated (e.g. via an active balloon driver
in a guest)?  Specifically, I'm wondering if you are running
multiple domains, all are actively ballooning, and there
is a mix of guest NUMA policies, how do you ensure that
non-CONFINE'd domains don't starve a CONFINE'd domain?

Thanks,
Dan

> -----Original Message-----
> From: Dulloor [mailto:dulloor@gmail.com]
> Sent: Sunday, April 04, 2010 1:30 PM
> To: xen-devel@lists.xensource.com
> Cc: Keir Fraser
> Subject: [Xen-devel] [PATCH 00/11] PV NUMA Guests
> 
> The set of patches implements virtual NUMA-enlightenment to support
> NUMA-aware PV guests. In more detail, the patch implements the
> following :
> 
> * For the NUMA systems, the following memory allocation strategies are
> implemented :
>  - CONFINE : Confine the VM memory allocation to a single node. As
> opposed to the current method of doing this in python, the patch
> implements this in libxc(along with other strategies) and with
> assurance that the memory actually comes from the selected node.
> - STRIPE : If the VM memory doesn't fit in a single node and if the VM
> is not compiled with guest-numa-support, the memory is allocated
> striped across a selected max-set of nodes.
> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
> is compiled with guest-numa-support, the memory is allocated split
> (equally for now) from the min-set of nodes. The  VM is then made
> aware of this NUMA allocation (virtual NUMA enlightenment).
> -DEFAULT : This is the existing allocation scheme.
> 
> * If the numa-guest support is compiled into the PV guest, we add
> numa-guest-support to xen features elfnote. The xen tools use this to
> determine if SPLIT strategy can be applied.
> 
> * The PV guest uses the virtual NUMA enlightenment to setup its NUMA
> layout (at the time of initmem_init)
> 
> Please comment.
> 
> -dulloor
> 
> Signed-off-by: Dulloor Rao <dulloor@gatech.edu>
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-05 14:52 ` Dan Magenheimer
@ 2010-04-06  3:51   ` Dulloor
  2010-04-06 17:18     ` Dan Magenheimer
  0 siblings, 1 reply; 12+ messages in thread
From: Dulloor @ 2010-04-06  3:51 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel, Keir Fraser

Dan, sorry, I missed one of your previous mails on this topic, so I
have included answers to that one as well.

> Could you comment on if/how these work when memory is more
> dynamically allocated (e.g. via an active balloon driver
> in a guest)?
The balloon driver is also made numa-aware and uses (the same)
enlightenment to derive the guest-node to physical-node mapping.
Please refer to my previously submitted patch for this
(http://old.nabble.com/Xen-devel--XEN-PATCH---Linux-PVOPS--ballooning-on-numa-domains-td26262334.html).
I intend to send out a refreshed patch once the basic guest numa is
checked in.

> Specifically, I'm wondering if you are running
> multiple domains, all are actively ballooning, and there
> is a mix of guest NUMA policies, how do you ensure that
> non-CONFINE'd domains don't starve a CONFINE'd domain?
We first try to CONFINE a domain and only then proceed to STRIPE or
SPLIT (if the guest is capable) it. So, in this (automatic) global
domain memory allocation scheme, there is no possibility of starvation
from a memory point of view. I hope I got your question right.

> I'd be interested in your thoughts on numa-aware tmem
> as well as the other dynamic memory mechanisms in Xen 4.0.
> Tmem is special in that it uses primarily full-page copies
> from/to tmem-space to/from guest-space so, assuming the
> interconnect can pipeline/stream a memcpy, overhead of
> off-node memory vs on-node memory should be less
> noticeable.  However tmem uses large data structures
> (rbtrees and radix-trees) and the lookup process might
> benefit from being NUMA-aware.
For tmem, I was thinking of the ability to specify a set of preferred
nodes for tmem-space memory, which could be derived from the domain's
NUMA enlightenment. But, as you mentioned, the full-page copy overhead
is less noticeable (at least on my smaller NUMA machine). The rate
would determine whether doing this to reduce inter-node traffic is
worthwhile. What do you suggest? I was looking at the data structures
too.

> Also, I will be looking into adding some page-sharing
> techniques into tmem in the near future.  This (and the
> existing page sharing feature just added to 4.0) may
> create some other interesting challenges for NUMA-awareness.
I have just started reading up on Xen's memory-sharing feature. I
would be glad to get your input on the NUMA challenges there.

thanks
dulloor


On Mon, Apr 5, 2010 at 10:52 AM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> Could you comment on if/how these work when memory is more
> dynamically allocated (e.g. via an active balloon driver
> in a guest)?  Specifically, I'm wondering if you are running
> multiple domains, all are actively ballooning, and there
> is a mix of guest NUMA policies, how do you ensure that
> non-CONFINE'd domains don't starve a CONFINE'd domain?
>
> Thanks,
> Dan
>
>> -----Original Message-----
>> From: Dulloor [mailto:dulloor@gmail.com]
>> Sent: Sunday, April 04, 2010 1:30 PM
>> To: xen-devel@lists.xensource.com
>> Cc: Keir Fraser
>> Subject: [Xen-devel] [PATCH 00/11] PV NUMA Guests
>>
>> The set of patches implements virtual NUMA-enlightenment to support
>> NUMA-aware PV guests. In more detail, the patch implements the
>> following :
>>
>> * For the NUMA systems, the following memory allocation strategies are
>> implemented :
>>  - CONFINE : Confine the VM memory allocation to a single node. As
>> opposed to the current method of doing this in python, the patch
>> implements this in libxc(along with other strategies) and with
>> assurance that the memory actually comes from the selected node.
>> - STRIPE : If the VM memory doesn't fit in a single node and if the VM
>> is not compiled with guest-numa-support, the memory is allocated
>> striped across a selected max-set of nodes.
>> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
>> is compiled with guest-numa-support, the memory is allocated split
>> (equally for now) from the min-set of nodes. The  VM is then made
>> aware of this NUMA allocation (virtual NUMA enlightenment).
>> -DEFAULT : This is the existing allocation scheme.
>>
>> * If the numa-guest support is compiled into the PV guest, we add
>> numa-guest-support to xen features elfnote. The xen tools use this to
>> determine if SPLIT strategy can be applied.
>>
>> * The PV guest uses the virtual NUMA enlightenment to setup its NUMA
>> layout (at the time of initmem_init)
>>
>> Please comment.
>>
>> -dulloor
>>
>> Signed-off-by: Dulloor Rao <dulloor@gatech.edu>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>


* RE: [PATCH 00/11] PV NUMA Guests
  2010-04-06  3:51   ` Dulloor
@ 2010-04-06 17:18     ` Dan Magenheimer
  2010-04-09  4:16       ` Dulloor
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Magenheimer @ 2010-04-06 17:18 UTC (permalink / raw)
  To: Dulloor; +Cc: xen-devel, Keir Fraser

In general, I am of the opinion that in a virtualized world,
one gets best flexibility or best performance, but not both.
There may be a couple of reasonable points on this "slider
selector", but I'm not sure in general if it will be worth
a huge time investment as real users will not understand the
subtleties of their workloads well enough to choose from
a large number of (perhaps more than two) points on the
performance/flexibility spectrum.

So customers that want highest performance should be prepared
to pin their guests and not use ballooning.  And those that
want the flexibility of migration and ballooning etc should
expect to see a performance hit (including NUMA consequences).

But since I don't get to make that decision, let's look
at the combination of NUMA + dynamic memory utilization...

> Please refer to my previously submitted patch for this
> (http://old.nabble.com/Xen-devel--XEN-PATCH---Linux-PVOPS--ballooning-
> on-numa-domains-td26262334.html).
> I intend to send out a refreshed patch once the basic guest numa is
> checked in.

OK, will wait and take a look at that later.
 
> We first try to CONFINE a domain and only then proceed to STRIPE or
> SPLIT(if capable) the domain. So, in this (automatic) global domain
> memory allocation scheme, there is no possibility of starvation from
> memory pov. Hope I got your question right.

The example I'm concerned with is:
1) Domain A is CONFINE'd to node A and domain B/C/D/etc are not
   CONFINE'd
2) Domain A uses less than the total memory on node A and/or
   balloons down so it uses even less than when launched.
3) Domains B/C/D have an increasing memory need, and semi-randomly
   absorb memory from all nodes, including node A.

After (3), free memory is somewhat randomly distributed across
all nodes.  Then:

4) Domain A suddenly has an increasing memory need... but there's
   not enough free memory remaining on node A (in fact possibly
   there is none at all) to serve its need.   But by definition
   of CONFINE, domain A is not allowed to use memory other than
   on node A.

What happens now?  It appears to me that other domains have
(perhaps even maliciously) starved domain A.

I think this is a dynamic bin-packing problem which is unsolvable
in general form.  So the choice of heuristics is going to be
important.
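The starvation scenario in steps 1-4 can be replayed with a toy model (all sizes and helper names invented; this is not Xen code):

```python
# Two nodes with 4 GB free each; domain A is CONFINEd to node 0,
# domains B and C may allocate anywhere.
node_free = {0: 4096, 1: 4096}

def alloc(mb, allowed):
    """Grab mb MB from the allowed nodes, fullest node first;
    return how much was actually obtained."""
    got = 0
    for n in sorted(allowed, key=lambda k: node_free[k], reverse=True):
        take = min(mb - got, node_free[n])
        node_free[n] -= take
        got += take
        if got == mb:
            break
    return got

alloc(1024, [0])           # 1/2) A launches small: node 0 -> 3072 free
alloc(2048, [0, 1])        # 3) B grows, lands on node 1: node 1 -> 2048
alloc(3072, [0, 1])        # 3) C grows, drains node 0: node 0 -> 0
starved = alloc(512, [0])  # 4) A needs more, but node 0 is empty
```

After step 3, node 0 has no free memory left, so A's confined request returns nothing: exactly the starvation Dan describes.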
 
> For the tmem, I was thinking of the ability to specify a set of nodes
> from which the tmem-space memory is preferred which could be derived
> from the domain's numa enlightenment, but as you mentioned the
> full-page copy overhead is less noticeable (at least on my smaller
> NUMA machine). But, the rate would determine if we should do this to
> reduce inter-node traffic. What do you suggest ?  I was looking at the
> data structures too.

Since tmem allocates individual xmalloc-tlsf memory pools per domain,
it should be possible to inform tmem of node preferences, but I don't
know that it will be feasible to truly CONFINE a domain's tmem.
On the other hand, because of the page copying, affinity by itself
may be sufficient.

> > Also, I will be looking into adding some page-sharing
> > techniques into tmem in the near future.  This (and the
> > existing page sharing feature just added to 4.0) may
> > create some other interesting challenges for NUMA-awareness.
> I have just started reading up on the memsharing feature of Xen. I
> would be glad to get your input on NUMA challenges over there.

Note that the tmem patch that does sharing (tmem calls it "page
deduplication") was just accepted into xen-unstable.  Basically,
some memory may belong to more than one domain, so NUMA effects
and performance/memory tradeoffs may get very complicated.

Dan


* RE: [PATCH 00/11] PV NUMA Guests
  2010-04-05  6:29 ` Keir Fraser
@ 2010-04-07  7:57   ` Cui, Dexuan
  2010-04-09  4:47     ` Dulloor
  0 siblings, 1 reply; 12+ messages in thread
From: Cui, Dexuan @ 2010-04-07  7:57 UTC (permalink / raw)
  To: Keir Fraser, Dulloor, xen-devel; +Cc: Andre Przywara

Keir Fraser wrote:
> I would like Acks from the people working on HVM NUMA for this patch
> series. At the very least it would be nice to have a single user
> interface for setting this up, regardless of whether for a PV or HVM
> guest. Hopefully code in the toolstack also can be shared. So I'm
Yes, I strongly agree we should share one interface. E.g., the XENMEM_numa_op hypercalls implemented by Dulloor could be re-used in the HVM NUMA case, and some parts of the toolstack could be shared, I think. I also replied in another thread and pointed out some similarities I found in Andre's/Dulloor's patches.

> cc'ing Dexuan and Andre, as I know they are involved in the HVM NUMA
> work. 
> 
>  Thanks,
>  Keir
> 
> On 04/04/2010 20:30, "Dulloor" <dulloor@gmail.com> wrote:
> 
>> The set of patches implements virtual NUMA-enlightenment to support
>> NUMA-aware PV guests. In more detail, the patch implements the
>> following : 
>> 
>> * For the NUMA systems, the following memory allocation strategies
>>  are implemented :
>> - CONFINE : Confine the VM memory allocation to a
>> single node. As opposed to the current method of doing this in
>> python, the patch implements this in libxc(along with other
>> strategies) and with assurance that the memory actually comes from
>> the selected node. 
> > - STRIPE : If the VM memory doesn't fit in a
>> single node and if the VM is not compiled with guest-numa-support,
>> the memory is allocated striped across a selected max-set of nodes.
>> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
>> is compiled with guest-numa-support, the memory is allocated split
>> (equally for now) from the min-set of nodes. The  VM is then made
>> aware of this NUMA allocation (virtual NUMA enlightenment).
>> -DEFAULT : This is the existing allocation scheme.
>> 
>> * If the numa-guest support is compiled into the PV guest, we add
>> numa-guest-support to xen features elfnote. The xen tools use this to
>> determine if SPLIT strategy can be applied.
>> 
I think this looks too complex for a real user to easily determine which one to use...
About the CONFINE strategy -- this doesn't look like a useful usage model to me -- do we really think it's a typical usage model to ensure a VM's memory can only be allocated on a specified node?
The definitions of STRIPE and SPLIT also don't sound like typical usage models to me.
Why must the tools know whether the PV kernel is built with guest numa support or not?
If a user configures guest numa to "on" for a pv guest, the tools can supply the numa info to the PV kernel even if the pv kernel is not built with guest numa support -- the pv kernel will safely ignore the info.
If a user configures guest numa to "off" for a pv guest and the tools don't supply the numa info to the PV kernel, and if the pv kernel is built with guest numa support, the pv kernel can easily detect this via your new hypercall and will not enable numa.

When a user finds that the computing capability of a single node can't satisfy the actual need and hence wants to use guest numa: since the user has already specified the amount of guest memory and the number of vcpus in the guest config file, I think the user only needs to specify how many guest nodes (the "guestnodes" option in Andre's patch) the guest will see, and the tools and the hypervisor should co-operate to distribute guest memory and vcpus uniformly among the guest nodes (I think we may not want to support non-uniform nodes, as that doesn't look like a typical usage model). Of course, a specified node may not have the expected amount of memory -- in that case, the guest can continue to run at a slower speed (we can print a warning message to the user); or, if the user does care about predictable guest performance, the guest creation should fail.

How do you like this? My thought is we can make things simple in the first step. :-)

Thanks,
-- Dexuan


* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-06 17:18     ` Dan Magenheimer
@ 2010-04-09  4:16       ` Dulloor
  0 siblings, 0 replies; 12+ messages in thread
From: Dulloor @ 2010-04-09  4:16 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel, Keir Fraser

On Tue, Apr 6, 2010 at 1:18 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> In general, I am of the opinion that in a virtualized world,
> one gets best flexibility or best performance, but not both.
> There may be a couple of reasonable points on this "slider
> selector", but I'm not sure in general if it will be worth
> a huge time investment as real users will not understand the
> subtleties of their workloads well enough to choose from
> a large number of (perhaps more than two) points on the
> performance/flexibility spectrum.
>
> So customers that want highest performance should be prepared
> to pin their guests and not use ballooning.  And those that
> want the flexibility of migration and ballooning etc should
> expect to see a performance hit (including NUMA consequences).
In principle, I agree with you. For the same reason, in this first
version, I have tried to keep the configurable parameters to a
minimum. Wrt ballooning, we could work out simple solutions that
work; migration would be problematic, though.
>
> But since I don't get to make that decision, let's look
> at the combination of NUMA + dynamic memory utilization...
>
>> Please refer to my previously submitted patch for this
>> (http://old.nabble.com/Xen-devel--XEN-PATCH---Linux-PVOPS--ballooning-
>> on-numa-domains-td26262334.html).
>> I intend to send out a refreshed patch once the basic guest numa is
>> checked in.
>
> OK, will wait and take a look at that later.
>
>> We first try to CONFINE a domain and only then proceed to STRIPE or
>> SPLIT(if capable) the domain. So, in this (automatic) global domain
>> memory allocation scheme, there is no possibility of starvation from
>> memory pov. Hope I got your question right.
>
> The example I'm concerned with is:
> 1) Domain A is CONFINE'd to node A and domain B/C/D/etc are not
>   CONFINE'd
> 2) Domain A uses less than the total memory on node A and/or
>   balloons down so it uses even less than when launched.
> 3) Domains B/C/D have an increasing memory need, and semi-randomly
>   absorb memory from all nodes, including node A.
>
> After (3), free memory is somewhat randomly distributed across
> all nodes.  Then:
>
> 4) Domain A suddenly has an increasing memory need... but there's
>   not enough free memory remaining on node A (in fact possibly
>   there is none at all) to serve its need.   But by definition
>   of CONFINE, domain A is not allowed to use memory other than
>   on node A.
>
> What happens now?  It appears to me that other domains have
> (perhaps even maliciously) starved domain A.
>
> I think this is a dynamic bin-packing problem which is unsolvable
> in general form.  So the choice of heuristics is going to be
> important.
>
In the proposed solution, the domain could be either CONFINED, SPLIT
(NUMA-aware), or STRIPED. But, in each case, the domain is aware of
how much memory is allocated from each of the nodes at the time of
start-up. The enlightened ballooning attempts to keep the state
similar to that during startup. But, we might have to allocate from
any node under extreme memory pressure. For that (hopefully less
likely) case, we can implement dynamic mechanisms to converge to the
original state, by sweeping through the memory and exchanging memory
reservations whenever possible. I already have the means of doing
this, as part of the ballooning changes.

For CONFINED/SPLIT domains, I am using the Best-Fit-Decreasing
heuristic, whereas for STRIPED domains I am using the
First-Fit-Increasing strategy (as a means to reduce the fragmentation
of free node memory).
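As an illustration only (invented node layout and function names, not the patch's code), the node-selection step of the two heuristics might look like:

```python
def best_fit_node(chunk, node_free):
    """Node that fits `chunk` MB with minimal leftover, or None.
    Applying this to chunks sorted largest-first gives Best-Fit-Decreasing."""
    fits = [n for n, f in node_free.items() if f >= chunk]
    return min(fits, key=lambda n: node_free[n] - chunk) if fits else None

def first_fit_increasing(chunk, node_free):
    """First node, scanning from least free memory upward, that fits."""
    for n in sorted(node_free, key=node_free.get):
        if node_free[n] >= chunk:
            return n
    return None
```

Both favor tightly packed nodes, leaving larger contiguous free regions on the remaining nodes for later domains, which is the fragmentation argument made above.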

>> For the tmem, I was thinking of the ability to specify a set of nodes
>> from which the tmem-space memory is preferred which could be derived
>> from the domain's numa enlightenment, but as you mentioned the
>> full-page copy overhead is less noticeable (at least on my smaller
>> NUMA machine). But, the rate would determine if we should do this to
>> reduce inter-node traffic. What do you suggest ?  I was looking at the
>> data structures too.
>
> Since tmem allocates individual xmalloc-tlsf memory pools per domain,
> it should be possible to inform tmem of node preferences, but I don't
> know that it will be feasible to truly CONFINE a domain's tmem.
> On the other hand, because of the page copying, affinity by itself
> may be sufficient.
>
Yeah, I guess affinity should suffice for the CONFINED domains, but
I was thinking of node preferences for the NUMA (SPLIT) guests.

>> > Also, I will be looking into adding some page-sharing
>> > techniques into tmem in the near future.  This (and the
>> > existing page sharing feature just added to 4.0) may
>> > create some other interesting challenges for NUMA-awareness.
>> I have just started reading up on the memsharing feature of Xen. I
>> would be glad to get your input on NUMA challenges over there.
>
> Note that the tmem patch that does sharing (tmem calls it "page
> deduplication") was just accepted into xen-unstable.  Basically
> some memory may belong to more than one domain, so NUMA affects
> and performance/memory tradeoffs may get very complicated.
>
Thanks for sharing. I will read this very soon.
> Dan
>


* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-07  7:57   ` Cui, Dexuan
@ 2010-04-09  4:47     ` Dulloor
  2010-04-14  5:18       ` Cui, Dexuan
  0 siblings, 1 reply; 12+ messages in thread
From: Dulloor @ 2010-04-09  4:47 UTC (permalink / raw)
  To: Cui, Dexuan; +Cc: Andre Przywara, xen-devel, Keir Fraser

On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@intel.com> wrote:
> Keir Fraser wrote:
>> I would like Acks from the people working on HVM NUMA for this patch
>> series. At the very least it would be nice to have a single user
>> interface for setting this up, regardless of whether for a PV or HVM
>> guest. Hopefully code in the toolstack also can be shared. So I'm
> Yes, I strongly agree we should share one interterface, e.g., The
> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used in
> the hvm numa case and some parts of the toolstack could be shared, I
> think. I also replied in another thead and supplied some similarity I
> found in Andre/Dulloor's patches.
>
IMO PV NUMA guests and HVM NUMA guests could share most of the code
in the toolstack - for instance, getting the current state of the
machine, deciding on a strategy for domain memory allocation,
selection of nodes, etc. They diverge only at the actual point of
domain construction: PV NUMA uses enlightenments, whereas HVM would
need to work with hvmloader to export SLIT/SRAT ACPI tables. So, I
agree that we need to converge.

>> cc'ing Dexuan and Andre, as I know they are involved in the HVM NUMA
>> work.
>>
>>  Thanks,
>>  Keir
>>
>> On 04/04/2010 20:30, "Dulloor" <dulloor@gmail.com> wrote:
>>
>>> The set of patches implements virtual NUMA-enlightenment to support
>>> NUMA-aware PV guests. In more detail, the patch implements the
>>> following :
>>>
>>> * For the NUMA systems, the following memory allocation strategies
>>>  are implemented :
>>> - CONFINE : Confine the VM memory allocation to a
>>> single node. As opposed to the current method of doing this in
>>> python, the patch implements this in libxc(along with other
>>> strategies) and with assurance that the memory actually comes from
>>> the selected node.
>> > - STRIPE : If the VM memory doesn't fit in a
>>> single node and if the VM is not compiled with guest-numa-support,
>>> the memory is allocated striped across a selected max-set of nodes.
>>> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
>>> is compiled with guest-numa-support, the memory is allocated split
>>> (equally for now) from the min-set of nodes. The  VM is then made
>>> aware of this NUMA allocation (virtual NUMA enlightenment).
>>> -DEFAULT : This is the existing allocation scheme.
>>>
>>> * If the numa-guest support is compiled into the PV guest, we add
>>> numa-guest-support to xen features elfnote. The xen tools use this to
>>> determine if SPLIT strategy can be applied.
>>>
> I think this looks too complex to allow a real user to easily determine which one to use...
I think you misunderstood this. For the first version, I have
implemented an automatic global domain memory allocation scheme,
which (when enabled) applies to all domains on a NUMA machine. I am
of the opinion that users are seldom in a position to determine which
strategy to use. They would want the best possible performance for
their VM at any point in time, and we can only guarantee the best
possible performance given the current state of the system (how the
free memory is scattered across nodes, the distances between those
nodes, etc.). In that regard, this solution is the simplest.

> About the CONFINE stragegy -- looks this is not a useful usage model to me -- do we really think it's a typical usage model to
> ensure a VM's memory can only be allocated on a specified node?
Not all VMs are too large to fit into a single node (note that the
user doesn't specify a node). And, if a VM can fit into a single
node, that is obviously the best possible option for the VM.

> The definitions of STRIPE and SPLIT also doesn't sound like typical usage models to me.
There are only two possibilities: either the VM fits in a single node
or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
optimize the solution when the VM doesn't fit in a single node. The
aim is to reduce the number of inter-node accesses (SPLIT) and/or
provide more predictable performance (STRIPE).

> Why must tools know if the PV kernel is built with guest numa support or not?
What is the point of arranging the memory to be amenable to the
construction of nodes in the guest if the guest itself is not
compiled to make use of them?

> If a user configures guest numa to "on" for a pv guest, the tools can
> supply the numa info to PV kernel even if the pv kernel is not built
> with guest numa support -- the pv kernel will neglect the info safely;
> If a user configures guest numa to "off" for a pv guest and the tools
> don't supply the numa info to PV kernel, and if the pv kernel is built
> with guest numa support, the pv kernel can easily detect this by your
> new hypercall and will not enable numa.
These error checks are done even now. But, by checking whether the PV
kernel is built with guest numa support, we don't require the user to
configure yet another parameter. Wasn't that your concern too in your
very first point?
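The tools-side check being discussed might look like this. This is a hypothetical sketch: the cover letter says a numa-guest-support feature is added to the Xen features ELF note, but the exact feature-string format and any parsing helpers here are assumed, not taken from the patches:

```python
def guest_supports_numa(features_note):
    """Return True if the kernel's Xen features ELF note string
    (assumed '|'-separated, e.g. 'writable_page_tables|numa-guest-support')
    advertises numa-guest-support, so the tools may choose SPLIT."""
    return "numa-guest-support" in features_note.split("|")
```

With a check like this, the user never configures a separate "guest numa" knob: the tools fall back to STRIPE automatically for kernels that don't advertise the feature.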

>
> When a user finds the computing capability of a single node can't satisfy the actual need and hence wants to use guest numa,
> since the user has specified the amount of guest memory and the number of vcpus in guest config file, I think the user only needs
>to specify how many guest nodes (the "guestnodes" option in Andre's patch) the guest will see, and the tools and the hypervisor
>should co-work to distribute guest memory and vcpus uniformly among the guest nodes(I think we may not want to support non-
>uniform nodes as that doesn't look like a typical usage model) -- of course, maybe a specified node doesn't have the expected
>amount of memory -- in this case, the guest can continue to run with a slower speed (we can print a warning message to the
>user); or, if the user does care about predictable guest performance, the guest creation should fail.

Please observe that the patch does all of these things, plus some more.
For one, the "guestnodes" option doesn't make sense, since, as you
observe, it requires the user to carefully read the state of the system
when starting the domain, and the user also needs to make sure that the
guest itself is compiled with numa support. The aim should be to
automate this part and provide the best performance given the current
state. The patch attempts to do that. Secondly, even when guests are
not compiled with numa support, they would still want a more
predictable (albeit average) performance. By striping the memory
across the nodes and pinning the domain vcpus to the union of those
nodes' processors, applications (of substantial sizes) can be
expected to see more predictable performance.
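To illustrate, the striping-plus-pinning described above could be sketched roughly as follows (Python, with hypothetical helper names and an assumed node-to-cpus mapping -- not the actual libxc code):

```python
# Rough sketch of the STRIPE idea: interleave the guest's memory chunks
# round-robin across a chosen set of host nodes, and pin all vcpus to the
# union of those nodes' processors. Names and data layout are illustrative.

def stripe_allocation(nr_chunks, nodes):
    """Return, for each memory chunk, the host node it is taken from."""
    return [nodes[i % len(nodes)] for i in range(nr_chunks)]

def vcpu_cpumask(nodes, node_to_cpus):
    """Union of the processors belonging to the selected nodes."""
    cpus = set()
    for n in nodes:
        cpus |= set(node_to_cpus[n])
    return sorted(cpus)
```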
>
> How do you like this? My thought is we can make things simple in the first step. :-)
Please let me know if my comments are not clear. I agree that we
should shoot for simplicity and also for a common interface. Hope we
will get there :)
>
> Thanks,
> -- Dexuan
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH 00/11] PV NUMA Guests
  2010-04-04 19:30 [PATCH 00/11] PV NUMA Guests Dulloor
  2010-04-05  6:29 ` Keir Fraser
  2010-04-05 14:52 ` Dan Magenheimer
@ 2010-04-09 11:34 ` Ian Pratt
  2010-04-11  3:06   ` Dulloor
  2 siblings, 1 reply; 12+ messages in thread
From: Ian Pratt @ 2010-04-09 11:34 UTC (permalink / raw)
  To: Dulloor, xen-devel; +Cc: Ian Pratt, Keir Fraser

> The set of patches implements virtual NUMA-enlightenment to support
> NUMA-aware PV guests. In more detail, the patch implements the
> following :
> 
> * For the NUMA systems, the following memory allocation strategies are
> implemented :
>  - CONFINE : Confine the VM memory allocation to a single node. As
> opposed to the current method of doing this in python, the patch
> implements this in libxc(along with other strategies) and with
> assurance that the memory actually comes from the selected node.

Do you use the VCPU affinity masks to determine the node in question, or is there another parameter to specify this?

Thanks,
Ian


> - STRIPE : If the VM memory doesn't fit in a single node and if the VM
> is not compiled with guest-numa-support, the memory is allocated
> striped across a selected max-set of nodes.
> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
> is compiled with guest-numa-support, the memory is allocated split
> (equally for now) from the min-set of nodes. The  VM is then made
> aware of this NUMA allocation (virtual NUMA enlightenment).
> -DEFAULT : This is the existing allocation scheme.
> 
> * If the numa-guest support is compiled into the PV guest, we add
> numa-guest-support to xen features elfnote. The xen tools use this to
> determine if SPLIT strategy can be applied.
> 
> * The PV guest uses the virtual NUMA enlightenment to setup its NUMA
> layout (at the time of initmem_init)
> 
> Please comment.
> 
> -dulloor
> 
> Signed-off-by: Dulloor Rao <dulloor@gatech.edu>
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-09 11:34 ` Ian Pratt
@ 2010-04-11  3:06   ` Dulloor
  0 siblings, 0 replies; 12+ messages in thread
From: Dulloor @ 2010-04-11  3:06 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Keir Fraser

On Fri, Apr 9, 2010 at 7:34 AM, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
>> The set of patches implements virtual NUMA-enlightenment to support
>> NUMA-aware PV guests. In more detail, the patch implements the
>> following :
>>
>> * For the NUMA systems, the following memory allocation strategies are
>> implemented :
>>  - CONFINE : Confine the VM memory allocation to a single node. As
>> opposed to the current method of doing this in python, the patch
>> implements this in libxc(along with other strategies) and with
>> assurance that the memory actually comes from the selected node.
>
> Do you use the VCPU affinity masks to determine the node in question, or is there another parameter to specify this?
As of now, the node is selected solely based on the distribution of
free memory across nodes.
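For illustration, such a free-memory-based selection might look like this (a sketch with an assumed node-to-free-pages mapping, not the actual libxc implementation):

```python
# Hypothetical sketch of CONFINE node selection based only on free memory,
# as described above. free_pages_per_node maps node id -> free pages.

def pick_confine_node(vm_pages, free_pages_per_node):
    """Among nodes with enough free memory for the whole VM, pick the one
    with the most free pages; return None if no single node fits."""
    candidates = [(free, node)
                  for node, free in free_pages_per_node.items()
                  if free >= vm_pages]
    if not candidates:
        return None          # CONFINE impossible; fall back to SPLIT/STRIPE
    return max(candidates)[1]
```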

>
> Thanks,
> Ian
>
>
>> - STRIPE : If the VM memory doesn't fit in a single node and if the VM
>> is not compiled with guest-numa-support, the memory is allocated
>> striped across a selected max-set of nodes.
>> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
>> is compiled with guest-numa-support, the memory is allocated split
>> (equally for now) from the min-set of nodes. The  VM is then made
>> aware of this NUMA allocation (virtual NUMA enlightenment).
>> -DEFAULT : This is the existing allocation scheme.
>>
>> * If the numa-guest support is compiled into the PV guest, we add
>> numa-guest-support to xen features elfnote. The xen tools use this to
>> determine if SPLIT strategy can be applied.
>>
>> * The PV guest uses the virtual NUMA enlightenment to setup its NUMA
>> layout (at the time of initmem_init)
>>
>> Please comment.
>>
>> -dulloor
>>
>> Signed-off-by: Dulloor Rao <dulloor@gatech.edu>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>


* RE: [PATCH 00/11] PV NUMA Guests
  2010-04-09  4:47     ` Dulloor
@ 2010-04-14  5:18       ` Cui, Dexuan
  2010-04-15 17:19         ` Dulloor
  0 siblings, 1 reply; 12+ messages in thread
From: Cui, Dexuan @ 2010-04-14  5:18 UTC (permalink / raw)
  To: Dulloor; +Cc: Andre Przywara, xen-devel, Keir Fraser

Dulloor wrote:
> On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@intel.com>
> wrote: 
>> Keir Fraser wrote:
>>> I would like Acks from the people working on HVM NUMA for this patch
>>> series. At the very least it would be nice to have a single user
>>> interface for setting this up, regardless of whether for a PV or HVM
>>> guest. Hopefully code in the toolstack also can be shared. So I'm
>> Yes, I strongly agree we should share one interface, e.g., the
>> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used
>> in the hvm numa case and some parts of the toolstack could be
>> shared, I think. I also replied in another thread and supplied some
>> similarity I found in Andre/Dulloor's patches.
>> 
> IMO PV NUMA guests and HVM NUMA guests could share most of the code
> from toolstack - for instance, getting the current state of machine,
> deciding on a strategy for domain memory allocation, selection of
> nodes, etc. They diverge only at the actual point of domain
> construction. PV NUMA uses enlightenments, whereas HVM would need
> working with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
> that we need to converge.
Hi Dulloor,
In your patches, the toolstack tries to figure out the "best fit nodes" for a PV guest and invokes a hypercall set_domain_numa_layout to tell the hypervisor to remember the info; later, the PV guest invokes a hypercall get_domain_numa_layout to retrieve the info from the hypervisor.
Can this be changed so that the toolstack writes the guest numa info directly into a new field in the start_info (or the shared_info), maybe in the standard format of the SRAT/SLIT, and the PV guest later reads the info and uses acpi_numa_init() to parse it? I think in this way the new hypercalls can be avoided and the pv numa enlightenment code in the guest kernel can be minimized.
I'm asking this because this is how Andre's HVM numa patches work (the toolstack passes the info to hvmloader, and the latter builds the SRAT/SLIT for the guest).

xc_select_best_fit_nodes() decides the "min-set" of host nodes that will be used for the guest. It only considers the current memory usage of the system. Maybe we should also consider the cpu load? And must the number of nodes be 2^n? And how do we handle the case where #vcpu < #vnode?
And it looks like your patches only consider the guest's memory requirement -- the guest's vcpu requirement is neglected? E.g., a guest may not need a very large amount of memory while it needs many vcpus. xc_select_best_fit_nodes() should consider this when determining the number of vnodes.

>>> On 04/04/2010 20:30, "Dulloor" <dulloor@gmail.com> wrote:
>>> 
>>>> The set of patches implements virtual NUMA-enlightenment to support
>>>> NUMA-aware PV guests. In more detail, the patch implements the
>>>> following : 
>>>> 
>>>> * For the NUMA systems, the following memory allocation strategies
>>>> are implemented : - CONFINE : Confine the VM memory allocation to a
>>>> single node. As opposed to the current method of doing this in
>>>> python, the patch implements this in libxc(along with other
>>>> strategies) and with assurance that the memory actually comes from
>>>> the selected node. - STRIPE : If the VM memory doesn't fit in a
>>>> single node and if the VM is not compiled with guest-numa-support,
>>>> the memory is allocated striped across a selected max-set of nodes.
>>>> - SPLIT : If the VM memory doesn't fit in a single node and if the
>>>> VM is compiled with guest-numa-support, the memory is allocated
>>>> split (equally for now) from the min-set of nodes. The  VM is then
>>>> made aware of this NUMA allocation (virtual NUMA enlightenment).
>>>> -DEFAULT : This is the existing allocation scheme.
>>>> 
>>>> * If the numa-guest support is compiled into the PV guest, we add
>>>> numa-guest-support to xen features elfnote. The xen tools use this
>>>> to determine if SPLIT strategy can be applied.
>>>> 
>> I think this looks too complex to allow a real user to easily
>> determine which one to use... 
> I think you misunderstood this. For the first version, I have
> implemented an automatic global domain memory allocation scheme, which
> (when enabled) applies to all domains on a NUMA machine. I am of the
> opinion that users are seldom in a position to determine which strategy
> to use. They want the best possible performance for their VM at
> any point in time, and we can only guarantee the best possible
> performance given the current state of the system (how the free
> memory is scattered across nodes, the distance between those nodes, etc.).
> In that regard, this solution is the simplest.
Ok, I see.
BTW: I think Xen can actually handle the CONFINE case pretty well already; e.g., when no vcpu affinity is explicitly specified, the toolstack tries to choose a "best" host node for the guest and pins all vcpus of the guest to that node.

>> About the CONFINE strategy -- looks this is not a useful usage model
>> to me -- do we really think it's a typical usage model to
>> ensure a VM's memory can only be allocated on a specified node?
> Not all VMs are large enough not to fit into a single node (note that
> user doesn't specify a node). And, if a VM can be fit into a single
> node, that is obviously the best possible option for a VM.
> 
>> The definitions of STRIPE and SPLIT also doesn't sound like typical
>> usage models to me. 
> There are only two possibilities. Either the VM fits in a single node
> or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
> optimize the solution when the VM doesn't fit in a single node. The
> aim is to reduce the number of inter-node accesses(SPLIT) and/or
> provide a more predictable performance(STRIPE).
> 
>> Why must tools know if the PV kernel is built with guest numa
>> support or not? 
> What is the point of arranging the memory amenable for construction of
> nodes in guest if the guest itself is not compiled to do so.
I meant: to simplify the implementation, the toolstack can always supply the numa config info to the guest *if necessary*, no matter whether the guest kernel is numa-enabled or not (even if the guest kernel isn't numa-enabled, the guest performance may be better if the toolstack decides to supply a numa config to the guest).
About the "*if necessary*": Andre and I think the user should supply an option "guestnode" in the guest config file, while you think the toolstack should be able to automatically determine a "best" value. I raised some questions about xc_select_best_fit_nodes() in the paragraph above.
Hi Andre, would you like to comment on this?

> 
>> If a user configures guest numa to "on" for a pv guest, the tools
>> can supply the numa info to PV kernel even if the pv kernel is not
>> built with guest numa support -- the pv kernel will neglect the info
>> safely;
>> If a user configures guest numa to "off" for a pv guest and the
>> tools don't supply the numa info to PV kernel, and if the pv kernel
>> is built with guest numa support, the pv kernel can easily detect
>> this by your new hypercall and will not enable numa.
> These error checks are done even now. But, by checking if the PV
> kernel is built with guest numa support, we don't require the user to
> configure yet another parameter. Wasn't that your concern too in the
> very first point ?
> 
>> 
>> When a user finds the computing capability of a single node can't
>> satisfy the actual need and hence wants to use guest numa, 
>> since the user has specified the amount of guest memory and the
>> number of vcpus in guest config file, I think the user only needs 
>> to specify how many guest nodes (the "guestnodes" option in Andre's
>> patch) the guest will see, and the tools and the hypervisor 
>> should co-work to distribute guest memory and vcpus uniformly among
>> the guest nodes(I think we may not want to support non- 
>> uniform nodes as that doesn't look like a typical usage model) -- of
>> course, maybe a specified node doesn't have the expected 
>> amount of memory -- in this case, the guest can continue to run with
>> a slower speed (we can print a warning message to the 
>> user); or, if the user does care about predictable guest
>> performance, the guest creation should fail. 
> 
> Please observe that the patch does all these things plus some more.
> For one, "guestnodes" option doesn't make sense, since as you observe,
> it needs the user to carefully read the state of the system when
> starting the domain and also the user needs to make sure that the
> guest itself is compiled with numa support. The aim should be to
I think it's not difficult for a user to specify "guestnodes" and to check whether a PV/HVM guest kernel is numa-enabled or not (anyway, a user needs to ensure that to achieve optimal performance). "xm info/list/vcpu-list" should already supply enough info. I think it's reasonable to assume a numa user has more knowledge than a preliminary user. :-)

I suppose Andre would argue more for the "guestnodes" option.

A PV guest can use the ELFnote as a hint to the toolstack. This may be used as a kind of optimization.
An HVM guest can't use this.

> automate this part and provide the best performance, given the current
> state. The patch attempts to do that. Secondly, when the guests are
> not compiled with numa support, they would still want a more
> predictable (albeit average) performance. And, by striping the memory
> across the nodes and by pinning the domain vcpus to the union of those
> nodes' processors, applications (of substantial sizes) could be
> expected to see more predictable performance.
>> 
>> How do you like this? My thought is we can make things simple in the
>> first step. :-) 
> Please let me know if my comments are not clear. I agree that we
> should shoot for simplicity and also for a common interface. Hope we
> will get there :)
Thanks a lot for all the explanation and discussion.
Yes, we need to agree on a common interface to avoid confusion.
And I still think the "guestnodes/uniform_nodes" idea is more straightforward and the implementation is simpler. :-)

Thanks,
 -- Dexuan


* Re: [PATCH 00/11] PV NUMA Guests
  2010-04-14  5:18       ` Cui, Dexuan
@ 2010-04-15 17:19         ` Dulloor
  0 siblings, 0 replies; 12+ messages in thread
From: Dulloor @ 2010-04-15 17:19 UTC (permalink / raw)
  To: Cui, Dexuan; +Cc: Andre Przywara, xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 12901 bytes --]

On Wed, Apr 14, 2010 at 1:18 AM, Cui, Dexuan <dexuan.cui@intel.com> wrote:
> Dulloor wrote:
>> On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@intel.com>
>> wrote:
>>> Keir Fraser wrote:
>>>> I would like Acks from the people working on HVM NUMA for this patch
>>>> series. At the very least it would be nice to have a single user
>>>> interface for setting this up, regardless of whether for a PV or HVM
>>>> guest. Hopefully code in the toolstack also can be shared. So I'm
>>> Yes, I strongly agree we should share one interface, e.g., the
>>> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used
>>> in the hvm numa case and some parts of the toolstack could be
>>> shared, I think. I also replied in another thread and supplied some
>>> similarity I found in Andre/Dulloor's patches.
>>>
>> IMO PV NUMA guests and HVM NUMA guests could share most of the code
>> from toolstack - for instance, getting the current state of machine,
>> deciding on a strategy for domain memory allocation, selection of
>> nodes, etc. They diverge only at the actual point of domain
>> construction. PV NUMA uses enlightenments, whereas HVM would need
>> working with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
>> that we need to converge.
> Hi Dulloor,
> In your patches, the toolstack tries to figure out the "best fit nodes" for a PV guest and
> invokes a hypercall set_domain_numa_layout to tell the hypervisor to remember the
> info, and later the PV guest invokes a hypercall get_domain_numa_layout to retrieve the
> info from the hypervisor.
> Can this be changed to: the toolstack writes the guest numa info directly into a new
> field in the start_info (or the shared_info), maybe in the standard format of the SRAT/SLIT,
> and later the PV guest reads the info and uses acpi_numa_init() to parse the info? I think in
> this way the new hypercalls can be avoided and the pv numa enlightenment code in
> guest kernel can be minimized.
> I'm asking this because this is the way Andre's HVM numa patches do it (the
> toolstack passes the info to hvmloader and the latter builds SRAT/SLIT for guest)
Hi Cui,

In my first version of patches (for making dom0 a numa guest), I had
put this information into start_info
(http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00630.html).
But after that I decided this new approach is better (for pv numa and
maybe even hvm numa) for the following reasons:

- For PV NUMA guests, there are more places where the enlightenment
might be useful. For instance, in the attached (refreshed) patch, I
have used the enlightenment to support ballooning (without changing
node mappings) for PV NUMA guests. Similarly, there are
other places within the hypervisor, as well as in the VM, where I plan
to use the domain_numa_layout. That's the main reason for choosing
this approach. Although I am not sure, I think this could be useful
for HVM too (maybe with PV on HVM).

- Using the hypercall interface is equally simple. And with
start_info, I wasn't sure it would look clean to add feature-specific
variables (useful only with PV NUMA guests) to start_info (or even
shared_info), changing the xen-vm interface, adding (unnecessary)
changes for compat, etc.

Please let me know your thoughts.


>
> xc_select_best_fit_nodes() decides the "min-set" of host nodes that will be used for the
> guest. It only considers the current memory usage of the system. Maybe we should also
> consider the cpu load? And the number of the nodes must be 2^n? And how to handle the case #vcpu is < #vnode?
> And looks your patches only consider the guest's memory requirement -- guest's vcpu
> requirement is neglected? e.g., a guest may not need a very large amount of memory
> while it needs many vcpus. xc_select_best_fit_nodes() should consider this when
> determining the number of vnode.

I agree with you. I was planning to consider vcpu load as the next
step. Also, I am looking for a good heuristic. I looked at the nodeload
heuristic (currently in xen), but found it too naive. But if you/Andre
think it is a good heuristic, I will add the support. Actually, I think
in the future we should do away with strict vcpu-affinities and rely
more on a scheduler with the necessary NUMA support to complement our
placement strategies.

As of now, we don't SPLIT if #vcpu < #vnode. We use STRIPING in that case.
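Putting the pieces from this thread together, the placement decision could be sketched like this (hypothetical, not the actual xc_select_best_fit_nodes() code; the node-to-free-pages dict is assumed for illustration):

```python
# Sketch of the CONFINE/SPLIT/STRIPE/DEFAULT decision discussed in this
# thread. NOT the real libxc code; data layout is illustrative only.

def choose_strategy(vm_pages, nr_vcpus, guest_numa_capable, free_pages):
    # CONFINE: the VM fits entirely in a single node.
    if any(free >= vm_pages for free in free_pages.values()):
        return "CONFINE"
    # Otherwise find the min-set: fewest nodes whose free memory covers the VM.
    total, min_set = 0, 0
    for free in sorted(free_pages.values(), reverse=True):
        total += free
        min_set += 1
        if total >= vm_pages:
            break
    if total < vm_pages:
        return "DEFAULT"  # no set of nodes can satisfy the request
    # SPLIT needs a numa-capable guest and at least one vcpu per vnode.
    if guest_numa_capable and nr_vcpus >= min_set:
        return "SPLIT"
    return "STRIPE"       # also the fallback when #vcpu < #vnode
```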

>
>>>> On 04/04/2010 20:30, "Dulloor" <dulloor@gmail.com> wrote:
>>>>
>>>>> The set of patches implements virtual NUMA-enlightenment to support
>>>>> NUMA-aware PV guests. In more detail, the patch implements the
>>>>> following :
>>>>>
>>>>> * For the NUMA systems, the following memory allocation strategies
>>>>> are implemented : - CONFINE : Confine the VM memory allocation to a
>>>>> single node. As opposed to the current method of doing this in
>>>>> python, the patch implements this in libxc(along with other
>>>>> strategies) and with assurance that the memory actually comes from
>>>>> the selected node. - STRIPE : If the VM memory doesn't fit in a
>>>>> single node and if the VM is not compiled with guest-numa-support,
>>>>> the memory is allocated striped across a selected max-set of nodes.
>>>>> - SPLIT : If the VM memory doesn't fit in a single node and if the
>>>>> VM is compiled with guest-numa-support, the memory is allocated
>>>>> split (equally for now) from the min-set of nodes. The  VM is then
>>>>> made aware of this NUMA allocation (virtual NUMA enlightenment).
>>>>> -DEFAULT : This is the existing allocation scheme.
>>>>>
>>>>> * If the numa-guest support is compiled into the PV guest, we add
>>>>> numa-guest-support to xen features elfnote. The xen tools use this
>>>>> to determine if SPLIT strategy can be applied.
>>>>>
>>> I think this looks too complex to allow a real user to easily
>>> determine which one to use...
>> I think you misunderstood this. For the first version, I have
>> implemented an automatic global domain memory allocation scheme, which
>> (when enabled) applies to all domains on a NUMA machine. I am of
>> opinion that users are seldom in a state to determine which strategy
>> to use. They would want the best possible performance for their VM at
>> any point of time, and we can only guarantee the best possible
>> performance, given the current state of the system (how the free
>> memory is scattered across nodes, distance between those nodes, etc).
>> In that regard, this solution is the simplest.
> Ok, I see.
> BTW: I think actually currently Xen can handle the case CONFINE pretty well, e.g., when
> no vcpu affinity is explicitly specified, the toolstack tries to choose a "best" host node
> for the guest and pins all vcpus of the guest to the host node.
But currently it is done in python code, and it also doesn't use the
exact_node interface. I added this to the libxc toolstack for the sake
of completeness (CONFINE is just a special case of SPLIT). Also, with
libxl catching up, we might anyway want to do these things in libxc,
where they are accessible to both xm and xl.

>
>>> About the CONFINE strategy -- looks this is not a useful usage model
>>> to me -- do we really think it's a typical usage model to
>>> ensure a VM's memory can only be allocated on a specified node?
>> Not all VMs are large enough not to fit into a single node (note that
>> user doesn't specify a node). And, if a VM can be fit into a single
>> node, that is obviously the best possible option for a VM.
>>
>>> The definitions of STRIPE and SPLIT also doesn't sound like typical
>>> usage models to me.
>> There are only two possibilities. Either the VM fits in a single node
>> or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
>> optimize the solution when the VM doesn't fit in a single node. The
>> aim is to reduce the number of inter-node accesses(SPLIT) and/or
>> provide a more predictable performance(STRIPE).
>>
>>> Why must tools know if the PV kernel is built with guest numa
>>> support or not?
>> What is the point of arranging the memory amenable for construction of
>> nodes in guest if the guest itself is not compiled to do so.
> I meant: to simplify the implementation, the toolstack can always supply the numa
> config info to the guest *if necessary*, no matter if the guest kernel is numa-enabled or
> not (even if the guest kernel isn't numa-enabled, the guest performance may be better
> if the toolstack decides to supply a numa config to the guest)
> About the "*if necessary*": Andre and I think the user should supply an option
> "guestnode" in the guest config file, and you think the toolstack should be able to
> automatically determine a "best" value. I raised some questions about
> xc_select_best_fit_nodes() in the above paragraph.
> Hi Andre, would you like to comment on this?
How about an "automatic" global option along with a VM-level
"guestnode" option? These options could work independently or with
each other ("guestnode" would take precedence over the global
"automatic" option). We can work out finer details.
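For illustration, the precedence between the two options could be as simple as the following sketch (the function and parameter names are made up; only the option names come from the thread):

```python
# Hypothetical sketch of how a global "automatic" option and a per-VM
# "guestnode" option could interact, as proposed above.

def resolve_numa_placement(automatic_on, vm_guestnodes, auto_guess):
    """Per-VM "guestnode" (if set) takes precedence over the global
    "automatic" placement; with neither, no NUMA placement is done."""
    if vm_guestnodes is not None:
        return vm_guestnodes
    if automatic_on:
        return auto_guess
    return None
```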

>
>>
>>> If a user configures guest numa to "on" for a pv guest, the tools
>>> can supply the numa info to PV kernel even if the pv kernel is not >
>>> built with guest numa support -- the pv kernel will neglect the info
>>> safely;
>>> If a user configures guest numa to "off" for a pv guest and the
>>> tools don't supply the numa info to PV kernel, and if the pv kernel
>>> > is built with guest numa support, the pv kernel can easily detect
>>> this by your new hypercall and will not enable numa.
>> These error checks are done even now. But, by checking if the PV
>> kernel is built with guest numa support, we don't require the user to
>> configure yet another parameter. Wasn't that your concern too in the
>> very first point ?
>>
>>>
>>> When a user finds the computing capability of a single node can't
>>> satisfy the actual need and hence wants to use guest numa,
>>> since the user has specified the amount of guest memory and the
>>> number of vcpus in guest config file, I think the user only needs
>>> to specify how many guest nodes (the "guestnodes" option in Andre's
>>> patch) the guest will see, and the tools and the hypervisor
>>> should co-work to distribute guest memory and vcpus uniformly among
>>> the guest nodes(I think we may not want to support non-
>>> uniform nodes as that doesn't look like a typical usage model) -- of
>>> course, maybe a specified node doesn't have the expected
>>> amount of memory -- in this case, the guest can continue to run with
>>> a slower speed (we can print a warning message to the
>>> user); or, if the user does care about predictable guest
>>> performance, the guest creation should fail.
>>
>> Please observe that the patch does all these things plus some more.
>> For one, "guestnodes" option doesn't make sense, since as you observe,
>> it needs the user to carefully read the state of the system when
>> starting the domain and also the user needs to make sure that the
>> guest itself is compiled with numa support. The aim should be to
> I think it's not difficult for a user to specify "guestnodes" and to check if a PV/HVM guest
> kernel is numa-enabled or not (anyway, a user needs to ensure that to achieve the
> optimal performance). "xm info/list/vcpu-list" should already supply enough info. I think
> it's reasonable to assume a numa user has more knowledge than a preliminary user. :-)
>
> I suppose Andre would argue more for the "guestnodes" option.
>
> PV guest can use the ELFnote as a hint to the toolstack. This may be used as a kind of optimization.
> HVM guest can't use this.
As mentioned above, I think we have a good case for both global and
VM-level options. What do you think?

>
>> automate this part and provide the best performance, given the current
>> state. The patch attempts to do that. Secondly, when the guests are
>> not compiled with numa support, they would still want a more
>> predictable (albeit average) performance. And, by striping the memory
>> across the nodes and by pinning the domain vcpus to the union of those
>> nodes' processors, applications (of substantial sizes) could be
>> expected to see more predictable performance.
>>>
>>> How do you like this? My thought is we can make things simple in the
>>> first step. :-)
>> Please let me know if my comments are not clear. I agree that we
>> should shoot for simplicity and also for a common interface. Hope we
>> will get there :)
> Thanks a lot for all the explanation and discussion.
> Yes, we need to agree on a common interface to avoid confusion.
> And I still think the "guestnodes/uniform_nodes" idea is more straightforward and the
> implementation is simpler. :-)
>
> Thanks,
>  -- Dexuan

thanks
dulloor

[-- Attachment #2: numa-ballooning.patch --]
[-- Type: text/x-patch, Size: 10351 bytes --]

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index c47b9fa..d214c67 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -44,7 +44,7 @@
 	} while (0)
 #elif defined(__x86_64__)
 #define set_xen_guest_handle(hnd, val)	do { (hnd) = val; } while (0)
-#define get_xen_guest_handle(val, hnd)  do { val = (hnd).p; } while (0)
+#define get_xen_guest_handle(val, hnd)  do { val = (hnd); } while (0)
 #endif
 #endif
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index bd7a398..f510ee0 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -44,6 +44,8 @@
 #include <linux/list.h>
 #include <linux/sysdev.h>
 #include <linux/swap.h>
+#include <linux/nodemask.h>
+#include <linux/numa.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -53,6 +55,7 @@
 
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
+#include <asm/xen/interface.h>
 
 #include <xen/xen.h>
 #include <xen/interface/xen.h>
@@ -107,7 +110,7 @@ static unsigned long frame_list[PAGE_SIZE / sizeof(unsigned long)];
 #endif
 
 /* List of ballooned pages, threaded through the mem_map array. */
-static LIST_HEAD(ballooned_pages);
+static struct list_head ballooned_pages[MAX_NUMNODES];
 
 /* Main work function, always executed in process context. */
 static void balloon_process(struct work_struct *work);
@@ -160,13 +163,14 @@ static unsigned long shrink_frame(unsigned long nr_pages)
 /* balloon_append: add the given page to the balloon. */
 static void balloon_append(struct page *page)
 {
+	int node = page_to_nid(page);
 	/* Lowmem is re-populated first, so highmem pages go at list tail. */
 	if (PageHighMem(page)) {
-		list_add_tail(&page->lru, &ballooned_pages);
+		list_add_tail(&page->lru, &ballooned_pages[node]);
 		balloon_stats.balloon_high++;
 		dec_totalhigh_pages();
 	} else {
-		list_add(&page->lru, &ballooned_pages);
+		list_add(&page->lru, &ballooned_pages[node]);
 		balloon_stats.balloon_low++;
 	}
 
@@ -174,14 +178,14 @@ static void balloon_append(struct page *page)
 }
 
 /* balloon_retrieve: rescue a page from the balloon, if it is not empty. */
-static struct page *balloon_retrieve(void)
+static struct page *balloon_retrieve(int node)
 {
 	struct page *page;
 
-	if (list_empty(&ballooned_pages))
+	if (list_empty(&ballooned_pages[node]))
 		return NULL;
 
-	page = list_entry(ballooned_pages.next, struct page, lru);
+	page = list_entry(ballooned_pages[node].next, struct page, lru);
 	list_del(&page->lru);
 
 	if (PageHighMem(page)) {
@@ -196,17 +200,17 @@ static struct page *balloon_retrieve(void)
 	return page;
 }
 
-static struct page *balloon_first_page(void)
+static struct page *balloon_first_page(int node)
 {
-	if (list_empty(&ballooned_pages))
+	if (list_empty(&ballooned_pages[node]))
 		return NULL;
-	return list_entry(ballooned_pages.next, struct page, lru);
+	return list_entry(ballooned_pages[node].next, struct page, lru);
 }
 
-static struct page *balloon_next_page(struct page *page)
+static struct page *balloon_next_page(int node, struct page *page)
 {
 	struct list_head *next = page->lru.next;
-	if (next == &ballooned_pages)
+	if (next == &ballooned_pages[node])
 		return NULL;
 	return list_entry(next, struct page, lru);
 }
@@ -228,13 +232,26 @@ static unsigned long current_target(void)
 	return target;
 }
 
-static int increase_reservation(unsigned long nr_pages)
+static inline unsigned int xenmemf_vnode_to_mnode(int vnode)
+{
+#ifdef CONFIG_XEN_NUMA_GUEST
+	extern struct xen_domain_numa_layout  HYPERVISOR_pv_numa_layout;
+	int mnid;
+	mnid = HYPERVISOR_pv_numa_layout.vnode_data[vnode].mnode_id;
+	return XENMEMF_exact_node(mnid);
+#else
+	return 0;
+#endif
+}
+
+static int __increase_node_reservation(int node, unsigned long nr_pages)
 {
 	unsigned long  pfn, mfn, i, j, flags;
 	struct page   *page;
-	long           rc;
+	long           rc = 0;
+
 	struct xen_memory_reservation reservation = {
-		.mem_flags = 0,
+		.mem_flags = xenmemf_vnode_to_mnode(node),
 		.domid        = DOMID_SELF
 	};
 
@@ -243,13 +260,15 @@ static int increase_reservation(unsigned long nr_pages)
 
 	spin_lock_irqsave(&xen_reservation_lock, flags);
 
-	page = balloon_first_page();
-	for (i = 0; i < nr_pages; i++) {
-		BUG_ON(page == NULL);
+	if (!(page = balloon_first_page(node)))
+		goto out;
+
+	for (i = 0; page && i < nr_pages; i++) {
 		frame_list[i] = page_to_pfn(page);
-		page = balloon_next_page(page);
+		page = balloon_next_page(node, page);
 	}
-
+	nr_pages = i;
+
 	set_xen_guest_handle(reservation.extent_start, frame_list);
 	reservation.nr_extents = nr_pages;
 	reservation.extent_order = balloon_order;
@@ -259,7 +278,7 @@ static int increase_reservation(unsigned long nr_pages)
 		goto out;
 
 	for (i = 0; i < rc; i++) {
-		page = balloon_retrieve();
+		page = balloon_retrieve(node);
 		BUG_ON(page == NULL);
 
 		pfn = page_to_pfn(page);
@@ -295,6 +314,23 @@ static int increase_reservation(unsigned long nr_pages)
 	return rc < 0 ? rc : rc != nr_pages;
 }
 
+static int increase_reservation(unsigned long nr_pages)
+{
+	static int node;
+	long rc;
+
+	if (nr_pages > ARRAY_SIZE(frame_list))
+		nr_pages = ARRAY_SIZE(frame_list);
+
+	node = next_node(node, node_online_map);
+	if (node == MAX_NUMNODES)
+		node = first_node(node_online_map);
+
+	rc = __increase_node_reservation(node, nr_pages);
+
+	return rc;
+}
+
 static int decrease_reservation(unsigned long nr_pages)
 {
 	unsigned long  pfn, lpfn, mfn, i, j, flags;
@@ -302,6 +338,9 @@ static int decrease_reservation(unsigned long nr_pages)
 	int            need_sleep = 0;
 	int		discontig, discontig_free;
 	int		ret;
+
+	static int node;
+
 	struct xen_memory_reservation reservation = {
 		.mem_flags = 0,
 		.domid        = DOMID_SELF
@@ -311,7 +350,7 @@ static int decrease_reservation(unsigned long nr_pages)
 		nr_pages = ARRAY_SIZE(frame_list);
 
 	for (i = 0; i < nr_pages; i++) {
-		if ((page = alloc_pages(GFP_BALLOON, balloon_order)) == NULL) {
+		if (!(page = alloc_pages_node(node, GFP_BALLOON, balloon_order))) {
 			nr_pages = i;
 			need_sleep = 1;
 			break;
@@ -366,9 +405,20 @@ static int decrease_reservation(unsigned long nr_pages)
 
 	spin_unlock_irqrestore(&xen_reservation_lock, flags);
 
+	/* balloon from all nodes. */
+	node = next_node(node, node_online_map);
+	if (node == MAX_NUMNODES)
+		node = first_node(node_online_map);
+
 	return need_sleep;
 }
 
+#ifdef CONFIG_XEN_NUMA_GUEST
+static void nodemem_distribution(void);
+#else
+/* Stub so the unconditional call in balloon_process() still links. */
+static inline void nodemem_distribution(void) { }
+#endif
 /*
  * We avoid multiple worker processes conflicting via the balloon mutex.
  * We may of course race updates of the target counts (which are protected
@@ -400,6 +445,7 @@ static void balloon_process(struct work_struct *work)
 		mod_timer(&balloon_timer, jiffies + HZ);
 
 	mutex_unlock(&balloon_mutex);
+	nodemem_distribution();
 }
 
 /* Resets the Xen limit, sets new target, and kicks off processing. */
@@ -453,6 +499,7 @@ static int __init balloon_init(void)
 {
 	unsigned long pfn;
 	struct page *page;
+	int node;
 
 	if (!xen_pv_domain())
 		return -ENODEV;
@@ -460,6 +507,9 @@ static int __init balloon_init(void)
 	pr_info("xen_balloon: Initialising balloon driver with page order %d.\n",
 		balloon_order);
 
+	for_each_node(node)
+		INIT_LIST_HEAD(&ballooned_pages[node]);
+
 	balloon_npages = 1 << balloon_order;
 
 	balloon_stats.current_pages = (min(xen_start_info->nr_pages, max_pfn)) >> balloon_order;
@@ -745,4 +795,113 @@ static int register_balloon(struct sys_device *sysdev)
 	return error;
 }
 
+/************************************************************************/
+/* NUMA Guest memory distribution stats */
+#ifdef CONFIG_XEN_NUMA_GUEST
+
+#define MEMNODE_BUFSIZE (PAGE_SIZE)
+#define INVALID_NID (-1)
+static uint8_t memnode_buf[MEMNODE_BUFSIZE];
+static struct xenmem_numa_op __xen_numa_memop;
+#define ___memnode (__xen_numa_memop.u.mnodemap)
+static int xen_memnodemap_initialized;
+
+static inline int xen_mfn_to_nid(unsigned long mfn)
+{
+	uint8_t *memnode_map;
+	unsigned long addr;
+
+	addr = mfn << PAGE_SHIFT;
+	if ((addr >> ___memnode.shift) >= ___memnode.mapsize)
+		return INVALID_NID;
+	get_xen_guest_handle(memnode_map, ___memnode.map);
+	return memnode_map[addr >> ___memnode.shift];
+}
+
+static inline int xen_memnodemap(void)
+{
+	int rc;
+
+	printk(KERN_INFO "xen_memnodemap called\n");
+
+	__xen_numa_memop.cmd = XENMEM_machine_nodemap;
+	___memnode.bufsize = MEMNODE_BUFSIZE;
+	memset(memnode_buf, 0xFF, MEMNODE_BUFSIZE);
+	set_xen_guest_handle(___memnode.map, memnode_buf);
+
+	if ((rc = HYPERVISOR_memory_op(XENMEM_numa_op, &__xen_numa_memop))) {
+		xen_memnodemap_initialized = 0;
+		printk(KERN_ERR "XENMEM_machine_nodemap failed\n");
+	} else {
+		xen_memnodemap_initialized = 1;
+		printk(KERN_INFO "XENMEM_machine_nodemap done\n");
+	}
+
+	return rc;
+}
+
+unsigned int node_match_counts[MAX_NUMNODES][MAX_NUMNODES];
+
+static void nodemem_distribution(void)
+{
+	int gnid, mnid;
+	unsigned int oob_mfns, invalid_p2ms;
+
+	if (!xen_memnodemap_initialized && xen_memnodemap())
+		return;
+
+	if (xen_feature(XENFEAT_auto_translated_physmap)) {
+		printk(KERN_INFO "Enlightened ballooning disabled (auto-translated physmap)\n");
+		return;
+	}
+
+	printk(KERN_INFO "Domain nodemem distribution:\n");
+
+	for_each_node(gnid)
+		for_each_node(mnid)
+			node_match_counts[gnid][mnid] = 0;
+
+	oob_mfns = 0;
+	invalid_p2ms = 0;
+
+	for_each_online_node(gnid)
+	{
+		unsigned long pfn, mfn, start_pfn, end_pfn;
+		start_pfn = node_start_pfn(gnid);
+		end_pfn = node_end_pfn(gnid);
+		printk(KERN_INFO "vnode[%d] : start(%lX), end(%lX)\n",
+		       gnid, start_pfn, end_pfn);
+		for (pfn = start_pfn; pfn < end_pfn; pfn++)
+		{
+			if (!phys_to_machine_mapping_valid(pfn))
+			{
+				invalid_p2ms++;
+				continue;
+			}
+			mfn = pfn_to_mfn(pfn);
+			if ((mnid = xen_mfn_to_nid(mfn)) == INVALID_NID)
+			{
+				oob_mfns++;
+				continue;
+			}
+			node_match_counts[gnid][mnid]++;
+		}
+	}
+
+	for_each_online_node(gnid)
+		for_each_online_node(mnid)
+			printk(KERN_INFO "node[%d][%d]:%u\n",
+			       gnid, mnid, node_match_counts[gnid][mnid]);
+
+	if (invalid_p2ms)
+		printk(KERN_INFO "invalid p2ms : %u\n", invalid_p2ms);
+	if (oob_mfns)
+		printk(KERN_INFO "out-of-bound mfns : %u\n", oob_mfns);
+
+	return;
+}
+
+#endif /* CONFIG_XEN_NUMA_GUEST */
+/************************************************************************/
+
 MODULE_LICENSE("GPL");

