From: Dario Faggioli <raistlin@linux.it>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
Date: Mon, 23 Jul 2012 16:39:39 +0200	[thread overview]
Message-ID: <1343054379.4998.33.camel@Solace> (raw)
In-Reply-To: <20493.14245.392052.643802@mariner.uk.xensource.com>


On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> > If a domain does not have a VCPU affinity, try to pin it automatically
> > to some PCPUs. This is done taking into account the NUMA characteristics
> > of the host. In fact, we look for a combination of host's NUMA nodes
> > with enough free memory and number of PCPUs for the new domain, and pin
> > it to the VCPUs of those nodes.
> 
> I'm afraid your new algorithm appears to rely on an inconsistent
> comparison function:
> 
It's not that new, at least the comparison function isn't; it's been
there for a couple of versions now... Anyway...

> > +static int numa_cmpf(const libxl__numa_candidate *c1,
> > +                     const libxl__numa_candidate *c2)
> > +{
> > +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> > +    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
> > +
> > +    return SIGN(freememkb_diff + 3*nrvcpus_diff);
> > +}
> 
> I don't think this comparison function necessarily defines a partial
> order.  Can you provide a proof that it does ?  I think you probably
> can't.  If you think it _is_ a partial order please let me know and I
> will try to construct a counterexample.
> 
Nhaa, don't bother, I'll change it into something simpler.

>
>[..]
>
>
> That clearly defines a partial order.  The important point is that the
> functions score() must each be calculated only from the node in
> question.  score() must not take the other candidate as an argument.
> Your normalized_diff function violates this.
> 
I think it should be fine, as I'm only using the other candidate's
characteristics for normalization. But again, I'm happy to turn this
into something simpler than it is now and to adhere to the criterion
above, which will make things even clearer. Thanks.
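
Just to check we mean the same thing, here is a rough, untested sketch of
the direction I'd go in (the score weights, and the lack of any real
normalisation, are just placeholders, not the final thing):

static double candidate_score(const libxl__numa_candidate *c)
{
    /* Illustrative only: reward free memory, penalise the number of
     * vcpus already runnable on the candidate. A real version would
     * normalise against host-wide totals, not against the other
     * candidate. */
    return (double)c->free_memkb - 3.0 * c->nr_vcpus;
}

static int numa_cmpf(const libxl__numa_candidate *c1,
                     const libxl__numa_candidate *c2)
{
    double s1 = candidate_score(c1);
    double s2 = candidate_score(c2);

    /* Same sign convention as the current code: negative when c1 is
     * preferable. Since each score depends only on its own candidate,
     * the resulting ordering is trivially consistent. */
    if (s1 > s2)
        return -1;
    if (s1 < s2)
        return 1;
    return 0;
}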

> > +    /* Not even a suitable placement candidate! Let's just don't touch the
> > +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> > +    if (found == 0)
> > +        goto out;
> 
> This probably deserves a log message.
> 
It's there: it's printed by libxl__get_numa_candidate(). I did it like
that to avoid printing more than one message, which would be confusing,
e.g., something like this:

 libxl: ERROR Too many nodes
 libxl: ERROR No placement found

Is that acceptable?


> > @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
> >      uint32_t rtc_timeoffset;
> > 
> >      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> > +
> > +    /*
> > +     * Check if the domain has any CPU affinity. If not, try to build up one.
> > +     * In case numa_place_domain() find at least a suitable candidate, it will
> 
> This line is 78 characters, leading to wrap damage in my email reply
> to you.  The official coding style says you may use up to 80 which is
> tolerable in the odd line of code but it would be nice if wordwrapped
> comments were wrapped to a shorter distance.
> 
Ok, will take a look at this, here and everywhere else. Thanks.

> > +     * affect info->cpumap accordingly; if it does not, it just leaves it
> > +     * as it is. This means (unless some weird error manifests) the subsequent
> > +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> > +     * whatever that turns out to be.
> > +     */
> > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +        int rc = numa_place_domain(gc, info);
> > +        if (rc)
> > +            return rc;
> > +    }
> >      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> > +
> 
> As discussed, this is wrong; the fix is in your 2/4 but needs to be
> folded into this patch I think.  Patches should not introduce new
> bugs even if they're about to be fixed in the next commit.
> 
Ok, I'll merge the two patches.
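
FWIW, once merged, I'd expect the hunk in libxl__build_pre() to look
roughly like this (sketch only; "numa_placement" is just a placeholder
name for whatever field 2/4 ends up adding to record whether the config
file provided a cpumap):

    /*
     * Only attempt automatic placement if the user did not ask for a
     * specific affinity in the config file (hypothetical flag, see
     * above), rather than inferring that from the cpumap being full.
     */
    if (info->numa_placement) {
        int rc = numa_place_domain(gc, info);
        if (rc)
            return rc;
    }
    libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);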

> > +    /*
> > +     * The good thing about this solution is that it is based on heuristics
> > +     * (implemented in numa_cmpf() ), but we at least can evaluate it on
> > +     * all the possible placement candidates. That can happen because the
> > +     * number of nodes present in current NUMA systems is quite small.
> > +     * In fact, even if a sum of binomials is involved, if the system has
> > +     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
> > +     * sensible value, as it is exactly the number of nodes the biggest
> > +     * NUMA systems provide at the time of this writing (and will probably
> > +     * continue to do so for a while). However, computanional complexity
> > +     * would explode on systems much bigger than that. 16 nodes system would
> > +     * still be fine (it will be 65536 steps), but it's really importante we
>                                                                   important
> 
> Isn't this an argument that the limit should be 16 ?  (Also 65535
> steps since placement on 0 nodes is typically not an option.)
> 
Well, it is for me, but as we agreed on 8, I went for that. What I
wanted was just to make sure we (or whoever else) know/remember that 16
could work too, in case that turns out to be useful in the future. That
being said, if you agree on raising the cap to 16 right now, I'll be
glad to do that. :-D
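
Just for reference, the counting behind those numbers: every non-empty
subset of the n nodes is a placement candidate, so the number of
candidates is

  sum_{k=1..n} C(n,k) = 2^n - 1

i.e. 255 for n=8 and 65535 for n=16 (the 256/65536 in the comment also
count the empty set, which is where your "minus one" comes from).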

> > +     * avoid trying to run this on monsters with 32, 64 or more nodes (if
> > +     * they ever pop into being). Therefore, here it comes a safety catch
> > +     * that disables the algorithm for the cases when it wouldn't work well.
> > +     */
> > +    if (nr_nodes > 8) {
> > +        /* Log we did nothing and return 0, as no real error occurred */
> > +        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
> > +                  "placement algorithm to work effectively. Skipping it",
> > +                  nr_nodes);
> 
> It would be nice if this message suggested pinning the vcpus.
>
> > +    if (min_nodes > nr_nodes)
> > +        min_nodes = nr_nodes;
> > +    if (!max_nodes || max_nodes > nr_nodes)
> > +        max_nodes = nr_nodes;
> > +    if (min_nodes > max_nodes) {
> > +        rc = ERROR_INVAL;
> > +        goto out;
> 
> This should be accompanied by a log message.
> 
Ok to both.
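
Concretely, for these two I'm thinking of something along the lines of
the (untested) snippets below; the exact wording is of course up for
discussion:

    if (nr_nodes > 8) {
        /* Log we did nothing and return 0, as no real error occurred */
        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
                  "placement algorithm to work effectively. Skipping it. "
                  "Consider manually pinning the domain's vcpus (\"cpus=\" "
                  "in the config file) for NUMA-friendly behaviour",
                  nr_nodes);
        ...
    }

and

    if (min_nodes > max_nodes) {
        LOG(ERROR, "Inconsistent minimum (%d) and maximum (%d) number of "
                   "nodes for the placement candidates", min_nodes, max_nodes);
        rc = ERROR_INVAL;
        goto out;
    }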

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

