From: Dulloor <dulloor@gmail.com>
To: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: xen-devel@lists.xensource.com, Keir Fraser <keir.fraser@eu.citrix.com>
Subject: Re: [PATCH 00/11] PV NUMA Guests
Date: Fri, 9 Apr 2010 00:16:51 -0400
Message-ID: <z2v940bcfd21004082116vfc819649zebcb007500293beb@mail.gmail.com>
In-Reply-To: <7a461573-c606-4f3b-989d-30626655362d@default>

On Tue, Apr 6, 2010 at 1:18 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> In general, I am of the opinion that in a virtualized world,
> one gets best flexibility or best performance, but not both.
> There may be a couple of reasonable points on this "slider
> selector", but I'm not sure in general if it will be worth
> a huge time investment as real users will not understand the
> subtleties of their workloads well enough to choose from
> a large number of (perhaps more than two) points on the
> performance/flexibility spectrum.
>
> So customers that want highest performance should be prepared
> to pin their guests and not use ballooning.  And those that
> want the flexibility of migration and ballooning etc should
> expect to see a performance hit (including NUMA consequences).
In principle, I agree with you. For the same reason, in this first
version, I have tried to keep the configurable parameters to a
minimum. With respect to ballooning, we should be able to work out
simple solutions that work; migration would be more problematic,
though.
>
> But since I don't get to make that decision, let's look
> at the combination of NUMA + dynamic memory utilization...
>
>> Please refer to my previously submitted patch for this
>> (http://old.nabble.com/Xen-devel--XEN-PATCH---Linux-PVOPS--ballooning-
>> on-numa-domains-td26262334.html).
>> I intend to send out a refreshed patch once the basic guest numa is
>> checked in.
>
> OK, will wait and take a look at that later.
>
>> We first try to CONFINE a domain and only then proceed to STRIPE or
>> SPLIT (if capable) the domain. So, in this (automatic) global domain
>> memory allocation scheme, there is no possibility of starvation from
>> memory pov. Hope I got your question right.
>
> The example I'm concerned with is:
> 1) Domain A is CONFINE'd to node A and domain B/C/D/etc are not
>   CONFINE'd
> 2) Domain A uses less than the total memory on node A and/or
>   balloons down so it uses even less than when launched.
> 3) Domains B/C/D have an increasing memory need, and semi-randomly
>   absorb memory from all nodes, including node A.
>
> After (3), free memory is somewhat randomly distributed across
> all nodes.  Then:
>
> 4) Domain A suddenly has an increasing memory need... but there's
>   not enough free memory remaining on node A (in fact possibly
>   there is none at all) to serve its need.   But by definition
>   of CONFINE, domain A is not allowed to use memory other than
>   on node A.
>
> What happens now?  It appears to me that other domains have
> (perhaps even maliciously) starved domain A.
>
> I think this is a dynamic bin-packing problem which is unsolvable
> in general form.  So the choice of heuristics is going to be
> important.
>
In the proposed solution, a domain is either CONFINED, SPLIT
(NUMA-aware), or STRIPED. In each case, the domain knows how much
memory was allocated from each node at start-up, and the enlightened
ballooning tries to keep the allocation close to that initial state.
Under extreme memory pressure we might still have to allocate from
any node; for that (hopefully less likely) case, we can implement
dynamic mechanisms that converge back to the original distribution by
sweeping through memory and exchanging memory reservations whenever
possible. I already have the means of doing this as part of the
ballooning changes; the sketch below illustrates the idea.
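
As a rough illustration, here is a minimal sketch of how such a
convergence sweep might pick its next rebalancing step. The names
(struct numa_balloon_state, pick_rebalance_pair(), etc.) are invented
for this sketch and are not the actual patch code:

    #define MAX_NUMNODES 64 /* illustrative only */

    /* Hypothetical sketch: find the most over-allocated and the most
     * under-allocated node, relative to the start-up distribution. */
    struct numa_balloon_state {
        unsigned int nr_nodes;
        unsigned long startup[MAX_NUMNODES]; /* pages/node at start-up */
        unsigned long now[MAX_NUMNODES];     /* pages/node currently */
    };

    static int pick_rebalance_pair(const struct numa_balloon_state *s,
                                   unsigned int *from, unsigned int *to)
    {
        long excess = 0, deficit = 0;
        unsigned int i;

        for (i = 0; i < s->nr_nodes; i++) {
            long diff = (long)s->now[i] - (long)s->startup[i];
            if (diff > excess)  { excess = diff;  *from = i; }
            if (diff < deficit) { deficit = diff; *to = i; }
        }
        /* Nothing left to do once the distribution matches start-up. */
        return (excess > 0 && deficit < 0) ? 0 : -1;
    }

Each sweep would then exchange up to min(excess, -deficit) pages
between *from and *to (e.g. via XENMEM_exchange) and repeat until
pick_rebalance_pair() finds no such pair.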

For CONFINED/SPLIT domains, I am using the Best-Fit-Decreasing
heuristic, whereas for STRIPED domains I am using a
First-Fit-Increasing strategy (as a means to reduce the fragmentation
of free node memory).
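
To make the distinction concrete, here is a sketch of the two
selection strategies under stated assumptions (per-node free_pages[]
is given; none of these helpers exist in the patch as posted):

    /* Best-Fit-Decreasing (CONFINED/SPLIT): place each chunk, largest
     * first, on the node whose free memory fits it most tightly. */
    static int best_fit_node(const unsigned long *free_pages,
                             unsigned int nr_nodes, unsigned long need)
    {
        unsigned long best_slack = ~0UL;
        int best = -1;
        unsigned int i;

        for (i = 0; i < nr_nodes; i++)
            if (free_pages[i] >= need &&
                free_pages[i] - need < best_slack) {
                best_slack = free_pages[i] - need;
                best = (int)i;
            }
        return best; /* -1: no single node can hold this chunk */
    }

    /* First-Fit-Increasing (STRIPED): take from the nodes with the
     * least free memory first, draining small fragments so the larger
     * free extents on the remaining nodes stay intact. */
    static void stripe_nodes(unsigned long *free_pages,
                             const unsigned int *asc_order, /* by free */
                             unsigned int nr_nodes, unsigned long need,
                             unsigned long *alloc /* out: pages/node */)
    {
        unsigned int i;

        for (i = 0; i < nr_nodes && need; i++) {
            unsigned int n = asc_order[i];
            unsigned long take =
                free_pages[n] < need ? free_pages[n] : need;
            alloc[n] = take;
            free_pages[n] -= take;
            need -= take;
        }
    }

The intent is that best-fit keeps a CONFINED/SPLIT domain's chunks
tight on few nodes, while first-fit-increasing consumes the small free
pools first so that large contiguous free regions survive for future
CONFINE/SPLIT placements.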

>> For the tmem, I was thinking of the ability to specify a set of nodes
>> from which the tmem-space memory is preferred which could be derived
>> from the domain's numa enlightenment, but as you mentioned the
>> full-page copy overhead is less noticeable (at least on my smaller
>> NUMA machine). But, the rate would determine if we should do this to
>> reduce inter-node traffic. What do you suggest ?  I was looking at the
>> data structures too.
>
> Since tmem allocates individual xmalloc-tlsf memory pools per domain,
> it should be possible to inform tmem of node preferences, but I don't
> know that it will be feasible to truly CONFINE a domain's tmem.
> On the other hand, because of the page copying, affinity by itself
> may be sufficient.
>
Yeah, I guess affinity should suffice for the CONFINED domains, but I
was thinking of node preferences for the NUMA (SPLIT) guests.
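
For illustration, a minimal sketch of what such a node preference
might look like for a SPLIT guest's tmem pages; tmem_alloc_page_pref()
is invented for this sketch, and it leans on Xen's
alloc_domheap_pages()/MEMF_node(), treating the node as a preference
with fallback rather than a hard constraint:

    /* Hypothetical: prefer the nodes in the SPLIT guest's mask, but
     * fall back to any node -- affinity, not confinement.  MEMF_node()
     * is taken as a placement hint here. */
    static struct page_info *tmem_alloc_page_pref(struct domain *d,
                                                  nodemask_t *pref)
    {
        struct page_info *pg;
        unsigned int node;

        for_each_node_mask ( node, *pref )
        {
            pg = alloc_domheap_pages(d, 0, MEMF_node(node));
            if ( pg != NULL )
                return pg; /* satisfied from a preferred node */
        }

        return alloc_domheap_pages(d, 0, 0); /* any node will do */
    }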

>> > Also, I will be looking into adding some page-sharing
>> > techniques into tmem in the near future.  This (and the
>> > existing page sharing feature just added to 4.0) may
>> > create some other interesting challenges for NUMA-awareness.
>> I have just started reading up on the memsharing feature of Xen. I
>> would be glad to get your input on NUMA challenges over there.
>
> Note that the tmem patch that does sharing (tmem calls it "page
> deduplication") was just accepted into xen-unstable.  Basically
> some memory may belong to more than one domain, so NUMA affects
> and performance/memory tradeoffs may get very complicated.
>
Thanks for sharing. I will read this very soon.
> Dan
>
