Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes

From: Ian Jackson <Ian.Jackson@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
Date: Tue, 17 Jul 2012 16:55:50 +0100	[thread overview]
Message-ID: <20485.35590.105351.434937@mariner.uk.xensource.com> (raw)
In-Reply-To: <0411b2cebd725b193465.1341932614@Solace>

Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> If a domain does not have a VCPU affinity, try to pin it automatically to some
> PCPUs. This is done taking into account the NUMA characteristics of the host.
> In fact, we look for a combination of host's NUMA nodes with enough free memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.

Thanks for this admirably clear patch.

Can you please rewrap your commit messages to around 70 lines ?  Many
VCSs indent them in the log in some situations, and as you see here
mail programs indent them when quoting too.

Indeed I would generally prefer to wrap code to ~75 rather than 80 but
I know that's a bit controversial.

> +/* Subtract two values and translate the result in [0, 1] */
> +static double normalized_diff(double a, double b)
> +{
> +#define max(a, b) (a > b ? a : b)
> +    if (!a && a == b)
> +        return 0.0;
> +    return (a - b) / max(a, b);
> +}

1. This macro max() should be in libxl_internal.h.
2. It should be MAX so people are warned it's a macro
3. It should have all the necessary ()s for macro precedence safety

> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = v1;
> +    const libxl__numa_candidate *c2 = v2;
> +#define sign(a) a > 0 ? 1 : a < 0 ? -1 : 0

Again.

> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    return sign(3*freememkb_diff + nrdomains_diff);

The reason you need what sign() does is that you need to convert from
double to int, I guess.

> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up one.
> +     * In case numa_place_domain() find at least a suitable candidate, it will
> +     * affect info->cpumap accordingly; if it does not, it just leaves it
> +     * as it is. This means (unless some weird error manifests) the subsequent
> +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }

I guess it would be preferable to do this only if the bitmap was full
by default, so that setting the bitmap explicitly to all cpus still
works.

I'm not sure that that's essential to have, though.

> +            /*
> +             * Conditions are met, we can add this combination to the
> +             * NUMA placement candidates list. We first make sure there
> +             * is enough space in there, and then we initialize the new
> +             * candidate element with the node map corresponding to the
> +             * combination we are dealing with. Memory allocation for
> +             * expanding the array that hosts the list happens in chunks
> +             * equal to the number of NUMA nodes in the system (to
> +             * avoid allocating memory each and every time we find a
> +             * new candidate).
> +             */
> +            if (*nr_cndts == array_size)
> +                array_size += nr_nodes;
> +            GCREALLOC_ARRAY(new_cndts, array_size);

This part of the algorithm is quadratic in the number of combinations
divided by the number of nodes.  So the algorithm is
   O( (  C( nr_nodes, min_nodes ) / min_nodes  )^2 )
which is quite bad really.

At the very least this needs to be an exponential allocation, eg
  +                array_size += nr_nodes + array_size;

But if you didn't insist on creating the whole list and sorting it,
you would avoid this allocation entirely, wouldn't you ?

Should we bve worried that this algorithm will be too slow even if it
involves just
  O( C(nr_nodes,min_nodes) )
iterations ?

Thanks,
Ian.