From: George Dunlap
Subject: Re: [PATCH 08 of 10 v3] libxl: enable automatic placement of guests on NUMA nodes
Date: Fri, 6 Jul 2012 12:30:36 +0100
Message-ID: <4FF6CC5C.1060905@eu.citrix.com>
References: <7087d3622ee2051654c9.1341418687@Solace>
In-Reply-To: <7087d3622ee2051654c9.1341418687@Solace>
To: Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross,
 Ian Jackson, "xen-devel@lists.xen.org", Roger Pau Monne
List-Id: xen-devel@lists.xenproject.org

On 04/07/12 17:18, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli
> # Date 1341416324 -7200
> # Node ID 7087d3622ee2051654c9e78fe4829da10c2d46f1
> # Parent  6fd693e7f3bc8b4d9bd20befff2c13de5591a7c5
> libxl: enable automatic placement of guests on NUMA nodes
>
> If a domain does not have a VCPU affinity, try to pin it automatically to some
> PCPUs. This is done taking into account the NUMA characteristics of the host.
> In fact, we look for a combination of host's NUMA nodes with enough free memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.
>
> Once we know which ones, among all the possible combinations, represent valid
> placement candidates for a domain, use some heuristics for deciding which is
> the best. For instance, smaller candidates are considered to be better, both
> from the domain's point of view (less spreading of its memory among nodes) and
> from the point of view of the system as a whole (less memory fragmentation).
> In case of candidates of equal size (i.e., with the same number of nodes), the
> amount of free memory and the number of domains already assigned to their
> nodes are considered. Very often, candidates with a greater amount of free
> memory are the ones we want, as this is also good for keeping memory
> fragmentation under control. However, if the difference in how much free
> memory two candidates have is small, the number of assigned domains might be
> what decides which candidate wins.
>
> This all happens internally to libxl, and no API for driving the mechanism is
> provided for now. This matches what xend already does.
>
> Signed-off-by: Dario Faggioli
> Acked-by: George Dunlap

One question I have: Is there any particular reason to sort the whole
list, rather than just finding the maximum based on the comparison
function?

But I think it's been a long time and it looks good enough to me:

Acked-by: George Dunlap
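(For illustration only, not a request to change the patch: a single O(n)
scan along these lines would do it, assuming numa_cmpf() keeps its
qsort-style contract where a negative return means the first argument
sorts earlier, i.e. is the better candidate. numa_best_candidate() is a
hypothetical name, not anything in the patch.)

static libxl__numa_candidate *numa_best_candidate(libxl__numa_candidate *cndts,
                                                  int nr_cndts)
{
    libxl__numa_candidate *best = &cndts[0];
    int i;

    /* Caller guarantees nr_cndts >= 1 */
    for (i = 1; i < nr_cndts; i++) {
        /* numa_cmpf() < 0: first argument is the better of the two */
        if (numa_cmpf(&cndts[i], best) < 0)
            best = &cndts[i];
    }
    return best;
}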
>
> ---
> Changes from v2:
>  * lots of typos.
>  * Clarified some comments, as requested during (ijc's) review.
>  * Added some more information/references for the combination generation
>    algorithm.
>  * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
>  * libxl_bitmap_init() used to make sure we do not try to free random
>    memory on failure paths of functions that allocate a libxl_bitmap.
>  * Always invoke libxl__sort_numa_candidates(), even if we get there
>    with just 1 candidate, as requested during review.
>  * Simplified the if-s that check for input parameter consistency in
>    libxl__get_numa_candidates() as requested during (gwd's) review.
>  * Comparison function for candidates changed so that it now provides
>    total ordering, as requested during review. It is still using FP
>    arithmetic, though. Also, I think that just putting the difference
>    between the amount of free memory and between the number of assigned
>    domains of two candidates in a single formula (after normalizing and
>    weighting them) is both clear and effective enough.
>  * Function definitions moved to a NUMA specific source file (libxl_numa.c),
>    as suggested during review.
>
> Changes from v1:
>  * This patch incorporates the changes from both "libxl, xl: enable automatic
>    placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering
>    NUMA placement candidates" from v1.
>  * The logic of the algorithm is basically the same as in v1, but the splitting
>    of it in the various functions has been completely redesigned from scratch.
>  * No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review.
>  * The relevant documentation has been moved near the actual functions and
>    features. Also, the amount and (hopefully!) the quality of the documentation
>    has been improved a lot, as requested.
>  * All the comments about using the proper libxl facilities and helpers for
>    allocations, etc., have been considered and applied.
>  * This patch still bails out from NUMA optimizations if it finds out cpupools
>    are being used. It is the next patch that makes the two things interact
>    properly, as suggested during review.
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
>
>  =item B<cpus="CPU-LIST">
>
> -List of which cpus the guest is allowed to use. Default behavior is
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
>
>  =over 4
>
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
>
>  =back
>
> +If this option is not specified, libxl automatically tries to place the new
> +domain on the host's NUMA nodes (provided the host has more than one NUMA
> +node) by pinning it to the cpus of those nodes. A heuristic approach is
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host's CPUs and RAM.
> +
>  =item B<cpu_weight=WEIGHT>
>
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm
>  LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>  		libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>  		libxl_internal.o libxl_utils.o libxl_uuid.o \
> -		libxl_json.o libxl_aoutils.o \
> +		libxl_json.o libxl_aoutils.o libxl_numa.o \
>  		libxl_save_callout.o _libxl_save_msgs_callout.o \
>  		libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
>  LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -98,6 +98,106 @@ out:
>      return sched;
>  }
>
> +/* Subtract two values and translate the result into [0, 1] */
> +static double normalized_diff(double a, double b)
> +{
> +#define max(a, b) (a > b ? a : b)
> +    if (!a && a == b)
> +        return 0.0;
> +    return (a - b) / max(a, b);
> +}
> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - the amount of free memory and the number of domains assigned to the
> + *    candidates are considered. In doing that, candidates with a greater
> + *    amount of free memory and fewer domains assigned to them are preferred,
> + *    with free memory "weighting" three times as much as the number of
> + *    domains.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = v1;
> +    const libxl__numa_candidate *c2 = v2;
> +#define sign(a) ((a) > 0 ? 1 : (a) < 0 ? -1 : 0)
> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    return sign(3*freememkb_diff + nrdomains_diff);
> +}
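As a sanity check of the 3x weighting, here is a quick standalone worked
example (made-up numbers, not part of the patch): candidate A has twice
the free memory of B but one more domain assigned; the memory weighting
should make A win, i.e. sort earlier.

#include <stdio.h>

static double normalized_diff(double a, double b)
{
#define max(a, b) (a > b ? a : b)
    if (!a && a == b)
        return 0.0;
    return (a - b) / max(a, b);
}

int main(void)
{
    /* A: 4GB free, 2 domains; B: 2GB free, 1 domain (same nr_nodes) */
    double freememkb_diff = normalized_diff(2048.0 * 1024, 4096.0 * 1024);
    double nrdomains_diff = normalized_diff(2, 1);
    /* 3*(-0.5) + 0.5 = -1.0 < 0, so A sorts first */
    double res = 3*freememkb_diff + nrdomains_diff;

    printf("3*%.2f + %.2f = %.2f => %s sorts first\n",
           freememkb_diff, nrdomains_diff, res, res < 0 ? "A" : "B");
    return 0;
}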
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> +    int nr_candidates = 0;
> +    libxl__numa_candidate *candidates = NULL;
> +    libxl_bitmap candidate_nodemap;
> +    libxl_cpupoolinfo *pinfo;
> +    int nr_pools, rc = 0;
> +    uint32_t memkb;
> +
> +    libxl_bitmap_init(&candidate_nodemap);
> +
> +    /* First of all, if cpupools are in use, better not to mess with them */
> +    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> +    if (!pinfo)
> +        return ERROR_FAIL;
> +    if (nr_pools > 1) {
> +        LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
> +        goto out;
> +    }
> +
> +    rc = libxl_domain_need_memory(CTX, info, &memkb);
> +    if (rc)
> +        goto out;
> +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    /* Find all the candidates with enough free memory and at least
> +     * as many pcpus as the domain has vcpus. */
> +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> +                                    &candidates, &nr_candidates);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
> +
> +    /* No suitable placement candidates. We just return without touching the
> +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> +    if (nr_candidates == 0) {
> +        LOG(NOTICE, "NUMA placement failed, performance might be affected");
> +        goto out;
> +    }
> +
> +    /* Bring the best candidate in front of the list --> candidates[0] */
> +    libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
> +
> +    /*
> +     * At this point, the first candidate in the array is the one we want.
> +     * Go for it by mapping its node map to the domain's info->cpumap.
> +     */
> +    libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap);
> +    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> +        "%"PRIu32" KB free selected", candidates[0].nr_nodes,
> +        candidates[0].nr_cpus, candidates[0].free_memkb);
> +
> + out:
> +    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
> +    return rc;
> +}
> +
>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_build_info *info, libxl__domain_build_state *state)
>  {
> @@ -107,7 +207,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
>
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up one.
> +     * In case numa_place_domain() finds at least one suitable candidate, it
> +     * will update info->cpumap accordingly; if it does not, it just leaves
> +     * it as it is. This means (unless some weird error manifests) the
> +     * subsequent call to libxl_set_vcpuaffinity_all() will do the actual
> +     * placement, whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> +
>      xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
>      if (info->type == LIBXL_DOMAIN_TYPE_PV)
>          xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2216,6 +2216,140 @@ static inline void libxl__ctx_unlock(lib
>  #define CTX_LOCK (libxl__ctx_lock(CTX))
>  #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
>
> +/*
> + * Automatic NUMA placement
> + *
> + * These functions and data structures deal with the initial placement of a
> + * domain onto the host NUMA nodes.
> + *
> + * The key concept here is that of a "NUMA placement candidate", which is
> + * basically a set of nodes whose characteristics have been successfully
> + * checked against some specific requirements. More precisely, a candidate is
> + * the nodemap associated with one of the possible subsets of the host NUMA
> + * nodes providing a certain amount of free memory, or a given number of cpus,
> + * or even both (depending on what the caller wants). For convenience of use,
> + * some of this information is stored within the candidate itself, instead of
> + * always being dynamically computed. A single node is a valid placement
> + * candidate, but it is also possible for a candidate to contain all the nodes
> + * of the host. The fewer nodes there are in a candidate, the better
> + * performance a domain placed onto it should get. For instance, looking for
> + * NUMA candidates with 2GB of free memory means we want all the possible
> + * subsets of the host NUMA nodes with, cumulatively, at least 2GB of free
> + * memory. That could be satisfied by just one particular node, or may
> + * require more nodes, depending on the characteristics of the host, on how
> + * many domains have been created already, on how big they are, etc.
> + *
> + * The intended usage is as follows:
> + *  1. first of all, by calling libxl__get_numa_candidates(), and specifying
> + *     the proper constraints to it (e.g., the amount of memory a domain
> + *     needs as the minimum amount of free memory for the candidates), one
> + *     can build up a whole set of suitable placement alternatives for a
> + *     domain;
> + *  2. after that, one specific candidate should be chosen. That can happen
> + *     by looking at their various characteristics;
> + *  3. the chosen candidate's nodemap should be utilized for computing the
> + *     actual affinity of the domain which, given the current NUMA support
> + *     in the hypervisor, is what determines the placement of the domain's
> + *     vcpus and memory.
> + *
> + * To make phase 2 even easier, a sorting helper function for the list of
> + * candidates is provided in the form of libxl__sort_numa_candidates(). The
> + * only thing that is needed is defining a comparison function, containing
> + * the criteria for deciding, given two candidates, which one is 'better'.
> + * Depending on how the comparison function is defined, the best candidate
> + * (where, of course, best is defined with respect to the heuristics
> + * implemented in the comparison function itself,
> + * libxl__numa_candidate_cmpf()) could become the first or the last element
> + * of the list.
> + *
> + * Summarizing, achieving automatic NUMA placement is just a matter of
> + * obtaining the list of suitable placement candidates, perhaps asking for
> + * each of them to provide at least the amount of memory the domain needs.
> + * After that, just implement a comparison function by means of the various
> + * helpers retrieving the relevant information about the candidates
> + * themselves. Finally, call the sorting helper function and use the
> + * candidate that became (typically) the first element of the list for
> + * determining the domain's affinity.
> + */
> +
> +typedef struct {
> +    int nr_cpus, nr_nodes;
> +    int nr_domains;
> +    uint32_t free_memkb;
> +    libxl_bitmap nodemap;
> +} libxl__numa_candidate;
> +
> +/*
> + * This generates the list of NUMA placement candidates satisfying some
> + * specific conditions. If min_nodes and/or max_nodes are not 0, their values
> + * are used to determine the minimum and maximum number of nodes that are
> + * allowed to be present in each candidate. If min_nodes and/or max_nodes are
> + * 0, the minimum and maximum number of nodes to be used are automatically
> + * selected by the implementation (and that will likely be just 1 node for
> + * the minimum and the total number of existing nodes for the maximum). As
> + * for min_free_memkb and min_cpus, if not 0, they mean the caller only wants
> + * candidates with at least that amount of free memory and that number of
> + * cpus, respectively. If min_free_memkb and/or min_cpus are 0, the
> + * candidates' free memory and number of cpus won't be checked at all, which
> + * means a candidate will always be considered suitable wrt the specific
> + * constraint. cndts is where the list of exactly nr_cndts candidates is
> + * returned. Note that, in case no candidates are found at all, the function
> + * returns successfully, but with nr_cndts equal to zero.
> + */
> +_hidden int libxl__get_numa_candidates(libxl__gc *gc,
> +                                       uint32_t min_free_memkb, int min_cpus,
> +                                       int min_nodes, int max_nodes,
> +                                       libxl__numa_candidate *cndts[],
> +                                       int *nr_cndts);
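As an illustration of the "just define a comparison function" point in the
comment above, a hypothetical policy (not part of the patch) that only cares
about free memory would be as simple as:

/* Hypothetical alternative policy: prefer candidates with the most
 * free memory, ignoring node count and number of assigned domains. */
static int cmpf_most_free_memory(const void *v1, const void *v2)
{
    const libxl__numa_candidate *c1 = v1;
    const libxl__numa_candidate *c2 = v2;

    /* qsort() sorts ascending: return < 0 to put c1 first, so
     * candidates with more free memory sort earlier. */
    if (c1->free_memkb > c2->free_memkb)
        return -1;
    if (c1->free_memkb < c2->free_memkb)
        return 1;
    return 0;
}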
> +
> +/* Initialization, allocation and deallocation for placement candidates */
> +static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt)
> +{
> +    cndt->free_memkb = 0;
> +    cndt->nr_cpus = cndt->nr_nodes = cndt->nr_domains = 0;
> +    libxl_bitmap_init(&cndt->nodemap);
> +}
> +
> +static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
> +                                              libxl__numa_candidate *cndt)
> +{
> +    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0);
> +}
> +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
> +{
> +    libxl_bitmap_dispose(&cndt->nodemap);
> +}
> +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts,
> +                                                  int nr_cndts)
> +{
> +    int i;
> +
> +    for (i = 0; i < nr_cndts; i++)
> +        libxl__numa_candidate_dispose(&cndts[i]);
> +    free(cndts);
> +}
> +
> +/* Retrieve (in nodemap) the node map associated to placement candidate cndt */
> +static inline
> +void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
> +                                       const libxl__numa_candidate *cndt,
> +                                       libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
> +}
> +/* Set the node map of placement candidate cndt to match nodemap */
> +static inline
> +void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> +                                       libxl__numa_candidate *cndt,
> +                                       const libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> +}
> +
> +/* Signature for the comparison function between two candidates c1 and c2 */
> +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
> +/* Sort the list of candidates in cndts (an array with nr_cndts elements in
> + * it) using cmpf for comparing two candidates. Uses libc's qsort(). */
> +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
> +                                         int nr_cndts,
> +                                         libxl__numa_candidate_cmpf cmpf);
>
>  /*
>   * Inserts "elm_new" into the sorted list "head".
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/libxl/libxl_numa.c
> @@ -0,0 +1,382 @@
> +/*
> + * Copyright (C) 2012      Citrix Ltd.
> + * Author Dario Faggioli
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as published
> + * by the Free Software Foundation; version 2.1 only. with the special
> + * exception on linking described in file LICENSE.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + */
> +
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#include <stdlib.h>
> +
> +#include "libxl_internal.h"
> +
> +/*
> + * What follows are helpers for generating all the k-combinations
> + * without repetitions of a set S with n elements in it. Formally
> + * speaking, they are subsets of k distinct elements of S and, if
> + * S is n elements big, the number of k-combinations is equal to
> + * the binomial coefficient (n k), which means we get exactly
> + * n!/(k! * (n - k)!) subsets, all of them with k elements.
> + *
> + * The various subsets are generated one after the other by calling
> + * comb_init() first, and, after that, comb_next()
> + * (n k)-1 times. An iterator is used to store the current status
> + * of the whole generation operation (i.e., basically, the last
> + * combination that has been generated). As soon as all
> + * combinations have been generated, comb_next() will
> + * start returning 0 instead of 1. It is of course important that
> + * the same instance of the iterator and the same values for
> + * n and k are used for each call. If that doesn't happen, the
> + * result is unspecified.
> + *
> + * The algorithm is a well known one (see, for example, D. Knuth's "The
> + * Art of Computer Programming - Volume 4, Fascicle 3") and it produces
> + * the combinations in such a way that they (well, more precisely,
> + * their indexes in the array/map representing the set) come in
> + * lexicographic order.
> + *
> + * For example, with n = 5 and k = 3, calling comb_init()
> + * will generate { 0, 1, 2 }, while subsequent valid calls to
> + * comb_next() will produce the following:
> + * { { 0, 1, 3 }, { 0, 1, 4 },
> + *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
> + *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
> + *   { 2, 3, 4 } }
> + *
> + * This is used by the automatic NUMA placement logic below.
> + */
> +typedef int* comb_iter_t;
> +
> +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
> +{
> +    comb_iter_t new_iter;
> +    int i;
> +
> +    if (n < k)
> +        return 0;
> +
> +    /* First set is always { 0, 1, 2, ..., k-1 } */
> +    GCNEW_ARRAY(new_iter, k);
> +    for (i = 0; i < k; i++)
> +        new_iter[i] = i;
> +
> +    *it = new_iter;
> +    return 1;
> +}
> +
> +static int comb_next(comb_iter_t it, int n, int k)
> +{
> +    int i;
> +
> +    /*
> +     * The idea here is to find the leftmost element from where
> +     * we should start incrementing the indexes of the iterator.
> +     * This means looking for the highest index that can be increased
> +     * while still producing a value smaller than n-1. In the example
> +     * above, when dealing with { 0, 1, 4 }, such an element is the
> +     * second one, as the third is already equal to 4 (which actually
> +     * is n-1).
> +     * Once we have found where to start, we increment that element
> +     * and overwrite the right-hand rest of the iterator with its
> +     * successors, thus achieving lexicographic ordering.
> +     *
> +     * Regarding the termination of the generation process, when we
> +     * manage to bring n-k to the very first position of the iterator,
> +     * we know that is the last valid combination ({ 2, 3, 4 }, with
> +     * n - k = 5 - 3 = 2, in the example above), and thus we start
> +     * returning 0 as soon as we cross that border.
> +     */
> +    for (i = k - 1; it[i] == n - k + i; i--) {
> +        if (i <= 0)
> +            return 0;
> +    }
> +    for (it[i]++, i++; i < k; i++)
> +        it[i] = it[i - 1] + 1;
> +    return 1;
> +}
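For anyone wanting to play with the iterator outside of libxl, a throwaway
harness along these lines (plain malloc() instead of GCNEW_ARRAY(), otherwise
the same logic) reproduces the n = 5, k = 3 sequence from the comment above:

#include <stdio.h>
#include <stdlib.h>

typedef int *comb_iter_t;

static int comb_init(comb_iter_t *it, int n, int k)
{
    int i;

    if (n < k)
        return 0;
    *it = malloc(k * sizeof(int));
    for (i = 0; i < k; i++)
        (*it)[i] = i;        /* first set: { 0, 1, ..., k-1 } */
    return 1;
}

static int comb_next(comb_iter_t it, int n, int k)
{
    int i;

    for (i = k - 1; it[i] == n - k + i; i--)
        if (i <= 0)
            return 0;        /* { n-k, ..., n-1 } was the last one */
    for (it[i]++, i++; i < k; i++)
        it[i] = it[i - 1] + 1;
    return 1;
}

int main(void)
{
    int n = 5, k = 3, i, ok;
    comb_iter_t it;

    /* Prints: 0 1 2 / 0 1 3 / ... / 2 3 4 (ten lines in total) */
    for (ok = comb_init(&it, n, k); ok; ok = comb_next(it, n, k)) {
        for (i = 0; i < k; i++)
            printf("%d ", it[i]);
        printf("\n");
    }
    free(it);
    return 0;
}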
> +
> +/* NUMA automatic placement (see libxl_internal.h for details) */
> +
> +/*
> + * This function turns a k-combination iterator into a node map.
> + * This means the bits in the node map corresponding to the indexes
> + * of the given combination are the ones that will be set.
> + * For example, if the iterator represents the combination { 0, 2, 4 },
> + * the node map will have bits #0, #2 and #4 set.
> + */
> +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
> +{
> +    int i;
> +
> +    libxl_bitmap_set_none(nodemap);
> +    for (i = 0; i < k; i++)
> +        libxl_bitmap_set(nodemap, it[i]);
> +}
> +
> +/* Retrieve the number of cpus that the nodes that are part of the nodemap
> + * span. */
> +static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
> +                              const libxl_bitmap *nodemap)
> +{
> +    int i, nodes_cpus = 0;
> +
> +    for (i = 0; i < nr_cpus; i++) {
> +        if (libxl_bitmap_test(nodemap, tinfo[i].node))
> +            nodes_cpus++;
> +    }
> +    return nodes_cpus;
> +}
> +
> +/* Retrieve the amount of free memory within the nodemap */
> +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
> +                                      libxl_bitmap *nodemap)
> +{
> +    uint32_t free_memkb = 0;
> +    int i;
> +
> +    libxl_for_each_set_bit(i, *nodemap)
> +        free_memkb += ninfo[i].free / 1024;
> +
> +    return free_memkb;
> +}
> +
> +/* Retrieve the number of domains that can potentially run on the cpus
> + * of the nodes that are part of the nodemap. */
> +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
> +                                 const libxl_bitmap *nodemap)
> +{
> +    libxl_dominfo *dinfo = NULL;
> +    libxl_bitmap dom_nodemap;
> +    int nr_doms, nr_cpus;
> +    int nr_domains = 0;
> +    int i, j, k;
> +
> +    dinfo = libxl_list_domain(CTX, &nr_doms);
> +    if (dinfo == NULL)
> +        return ERROR_FAIL;
> +
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
> +        libxl_dominfo_list_free(dinfo, nr_doms);
> +        return ERROR_FAIL;
> +    }
> +
> +    for (i = 0; i < nr_doms; i++) {
> +        libxl_vcpuinfo *vinfo;
> +        int nr_vcpus;
> +
> +        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus);
> +        if (vinfo == NULL)
> +            continue;
> +
> +        /* Compute the set of nodes this domain's vcpus can run on ... */
> +        libxl_bitmap_set_none(&dom_nodemap);
> +        for (j = 0; j < nr_vcpus; j++) {
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> +                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> +        }
> +
> +        /* ... and count the domain if that set intersects nodemap */
> +        libxl_for_each_set_bit(j, dom_nodemap) {
> +            if (libxl_bitmap_test(nodemap, j)) {
> +                nr_domains++;
> +                break;
> +            }
> +        }
> +
> +        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> +    }
> +
> +    libxl_bitmap_dispose(&dom_nodemap);
> +    libxl_dominfo_list_free(dinfo, nr_doms);
> +    return nr_domains;
> +}
> +
> +/*
> + * This function tries to figure out if the host has a consistent number
> + * of cpus along all its NUMA nodes. In fact, if that is the case, we can
> + * calculate the minimum number of nodes needed for a domain by just
> + * dividing its total number of vcpus by this value computed here.
> + * However, we are not allowed to assume that all the nodes have the
> + * same number of cpus. Therefore, in case discrepancies among different
> + * nodes are found, this function just returns 0, for the caller to know
> + * it shouldn't rely on this 'optimization', and sort things out in some
> + * other way (most likely, by just starting to try candidates with just
> + * one node).
> + */
> +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
> +                               libxl_numainfo *ninfo, int nr_nodes)
> +{
> +    int cpus_per_node = 0;
> +    int j, i;
> +
> +    /* This makes sense iff # of PCPUs is the same for all nodes */
> +    for (j = 0; j < nr_nodes; j++) {
> +        int curr_cpus = 0;
> +
> +        for (i = 0; i < nr_cpus; i++) {
> +            if (tinfo[i].node == j)
> +                curr_cpus++;
> +        }
> +        /* So, if the above does not hold, turn the whole thing off! */
> +        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
> +        if (cpus_per_node != curr_cpus)
> +            return 0;
> +    }
> +    return cpus_per_node;
> +}
> +
> +/* Get all the placement candidates satisfying some specific conditions */
> +int libxl__get_numa_candidates(libxl__gc *gc,
> +                               uint32_t min_free_memkb, int min_cpus,
> +                               int min_nodes, int max_nodes,
> +                               libxl__numa_candidate *cndts[], int *nr_cndts)
> +{
> +    libxl__numa_candidate *new_cndts = NULL;
> +    libxl_cputopology *tinfo = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +    int nr_nodes = 0, nr_cpus = 0;
> +    libxl_bitmap nodemap;
> +    int array_size, rc;
> +
> +    libxl_bitmap_init(&nodemap);
> +
> +    /* Get platform info and prepare the map for testing the combinations */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL)
> +        return ERROR_FAIL;
> +    /* If we don't have at least 2 nodes, it is useless to proceed */
> +    if (nr_nodes < 2) {
> +        rc = 0;
> +        goto out;
> +    }
> +
> +    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
> +    if (tinfo == NULL) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0);
> +    if (rc)
> +        goto out;
> +
> +    /*
> +     * If the minimum number of NUMA nodes is not explicitly specified
> +     * (i.e., min_nodes == 0), we try to figure out a sensible number of
> +     * nodes from which to start generating candidates, if possible (or
> +     * just start from 1 otherwise). The maximum number of nodes should
> +     * not exceed the number of existing NUMA nodes on the host, or the
> +     * candidate generation won't work properly.
> +     */
> +    if (!min_nodes) {
> +        int cpus_per_node;
> +
> +        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
> +        if (cpus_per_node == 0)
> +            min_nodes = 1;
> +        else
> +            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
> +    }
> +    if (min_nodes > nr_nodes)
> +        min_nodes = nr_nodes;
> +    if (!max_nodes || max_nodes > nr_nodes)
> +        max_nodes = nr_nodes;
> +    if (min_nodes > max_nodes) {
> +        rc = ERROR_INVAL;
> +        goto out;
> +    }
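(Just to spell out the ceiling division above with made-up numbers: on a
host with 4 cpus per node, a domain with min_cpus = 6 gets
min_nodes = (6 + 4 - 1) / 4 = 2, so the generation below starts directly
from pairs of nodes; a single node could never satisfy the cpu constraint
anyway.)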
> +
> +    /* Initialize the local storage for the combinations */
> +    *nr_cndts = 0;
> +    array_size = nr_nodes;
> +    GCNEW_ARRAY(new_cndts, array_size);
> +
> +    /* Generate all the combinations of any size from min_nodes to
> +     * max_nodes (see comb_init() and comb_next()). */
> +    while (min_nodes <= max_nodes) {
> +        comb_iter_t comb_iter;
> +        int comb_ok;
> +
> +        /*
> +         * And here it is. Each step of this cycle generates a combination of
> +         * nodes as big as min_nodes mandates. Each of these combinations is
> +         * checked against the constraints provided by the caller (namely,
> +         * amount of free memory and number of cpus) and it becomes an actual
> +         * placement candidate iff it passes the check.
> +         */
> +        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
> +             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
> +            uint32_t nodes_free_memkb;
> +            int nodes_cpus;
> +
> +            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
> +
> +            /* If there is not enough memory in this combination, skip it
> +             * and go generating the next one... */
> +            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
> +            if (min_free_memkb && nodes_free_memkb < min_free_memkb)
> +                continue;
> +
> +            /* And the same applies if this combination is short in cpus */
> +            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
> +            if (min_cpus && nodes_cpus < min_cpus)
> +                continue;
> +
> +            /*
> +             * Conditions are met, we can add this combination to the
> +             * NUMA placement candidates list. We first make sure there
> +             * is enough space in there, and then we initialize the new
> +             * candidate element with the node map corresponding to the
> +             * combination we are dealing with. Memory allocation for
> +             * expanding the array that hosts the list happens in chunks
> +             * equal to the number of NUMA nodes in the system (to
> +             * avoid allocating memory each and every time we find a
> +             * new candidate).
> +             */
> +            if (*nr_cndts == array_size)
> +                array_size += nr_nodes;
> +            GCREALLOC_ARRAY(new_cndts, array_size);
> +
> +            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
> +            libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
> +                                              &nodemap);
> +            new_cndts[*nr_cndts].nr_domains =
> +                                    nodemap_to_nr_domains(gc, tinfo, &nodemap);
> +            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
> +            new_cndts[*nr_cndts].nr_nodes = min_nodes;
> +            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
> +
> +            LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
> +                "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
> +                min_nodes, new_cndts[*nr_cndts].nr_cpus,
> +                new_cndts[*nr_cndts].free_memkb);
> +
> +            (*nr_cndts)++;
> +        }
> +        min_nodes++;
> +    }
> +
> +    *cndts = new_cndts;
> + out:
> +    libxl_bitmap_dispose(&nodemap);
> +    libxl_cputopology_list_free(tinfo, nr_cpus);
> +    libxl_numainfo_list_free(ninfo, nr_nodes);
> +    return rc;
> +}
> +
> +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
> +                                 libxl__numa_candidate_cmpf cmpf)
> +{
> +    /* Reorder candidates (see the comparison function for
> +     * the details on the heuristics) */
> +    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
> +}
> +