From: George Dunlap
Subject: Re: [PATCH 08 of 10 v3] libxl: enable automatic placement of guests on NUMA nodes
Date: Fri, 6 Jul 2012 12:30:36 +0100
Message-ID: <4FF6CC5C.1060905@eu.citrix.com>
References: <7087d3622ee2051654c9.1341418687@Solace>
In-Reply-To: <7087d3622ee2051654c9.1341418687@Solace>
To: Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross,
 Ian Jackson, "xen-devel@lists.xen.org", Roger Pau Monne
List-Id: xen-devel@lists.xenproject.org

On 04/07/12 17:18, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli
> # Date 1341416324 -7200
> # Node ID 7087d3622ee2051654c9e78fe4829da10c2d46f1
> # Parent  6fd693e7f3bc8b4d9bd20befff2c13de5591a7c5
> libxl: enable automatic placement of guests on NUMA nodes
>
> If a domain does not have a VCPU affinity, try to pin it automatically to some
> PCPUs. This is done taking into account the NUMA characteristics of the host.
> In fact, we look for a combination of host's NUMA nodes with enough free memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.
>
> Once we know which ones, among all the possible combinations, represent valid
> placement candidates for a domain, use some heuristics for deciding which is
> the best. For instance, smaller candidates are considered to be better, both
> from the domain's point of view (less spreading of its memory among nodes) and
> from the point of view of the system as a whole (less memory fragmentation).
> In case of candidates of equal size (i.e., with the same number of nodes), the
> amount of free memory and the number of domains already assigned to their
> nodes are considered. Very often, candidates with a greater amount of free
> memory are the ones we want, as this is also good for keeping memory
> fragmentation under control. However, if the difference in how much free
> memory two candidates have is small, the number of assigned domains might be
> what decides which candidate wins.
>
> This all happens internally to libxl, and no API for driving the mechanism is
> provided for now. This matches what xend already does.
>
> Signed-off-by: Dario Faggioli
> Acked-by: George Dunlap

One question I have: Is there any particular reason to sort the whole
list, rather than just finding the maximum based on the comparison
function?

But I think it's been a long time and it looks good enough to me:

Acked-by: George Dunlap
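(For illustration only, not a request to change the patch: a single O(n)
scan along these lines would do it, assuming numa_cmpf() keeps its
qsort-style contract where a negative return means the first argument
sorts earlier, i.e. is the better candidate. numa_best_candidate() is a
hypothetical name, not anything in the patch.)

static libxl__numa_candidate *numa_best_candidate(libxl__numa_candidate *cndts,
                                                  int nr_cndts)
{
    libxl__numa_candidate *best = &cndts[0];
    int i;

    /* Caller guarantees nr_cndts >= 1 */
    for (i = 1; i < nr_cndts; i++) {
        /* numa_cmpf() < 0: first argument is the better of the two */
        if (numa_cmpf(&cndts[i], best) < 0)
            best = &cndts[i];
    }
    return best;
}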
>
> ---
> Changes from v2:
>  * lots of typos.
>  * Clarified some comments, as requested during (ijc's) review.
>  * Added some more information/references for the combination generation
>    algorithm.
>  * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
>  * libxl_bitmap_init() used to make sure we do not try to free random
>    memory on failure paths of functions that allocate a libxl_bitmap.
>  * Always invoke libxl__sort_numa_candidates(), even if we get there
>    with just 1 candidate, as requested during review.
>  * Simplified the if-s that check for input parameter consistency in
>    libxl__get_numa_candidates() as requested during (gwd's) review.
>  * Comparison function for candidates changed so that it now provides
>    total ordering, as requested during review. It is still using FP
>    arithmetic, though. Also, I think that just putting the difference
>    between the amount of free memory and between the number of assigned
>    domains of two candidates in a single formula (after normalizing and
>    weighting them) is both clear and effective enough.
>  * Function definitions moved to a NUMA specific source file (libxl_numa.c),
>    as suggested during review.
>
> Changes from v1:
>  * This patch incorporates the changes from both "libxl, xl: enable automatic
>    placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering
>    NUMA placement candidates" from v1.
>  * The logic of the algorithm is basically the same as in v1, but the splitting
>    of it in the various functions has been completely redesigned from scratch.
>  * No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review.
>  * The relevant documentation has been moved near the actual functions and
>    features. Also, the amount and (hopefully!) the quality of the documentation
>    has been improved a lot, as requested.
>  * All the comments about using the proper libxl facilities and helpers for
>    allocations, etc., have been considered and applied.
>  * This patch still bails out from NUMA optimizations if it finds out cpupools
>    are being used. It is the next patch that makes the two things interact
>    properly, as suggested during review.
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
>
>  =item B<cpus="CPU-LIST">
>
> -List of which cpus the guest is allowed to use. Default behavior is
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
>
>  =over 4
>
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
>
>  =back
>
> +If this option is not specified, libxl automatically tries to place the new
> +domain on the host's NUMA nodes (provided the host has more than one NUMA
> +node) by pinning it to the cpus of those nodes. A heuristic approach is
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host's CPUs and RAM.
> +
>  =item B<cpu_weight=WEIGHT>
>
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm
>  LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>  		libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>  		libxl_internal.o libxl_utils.o libxl_uuid.o \
> -		libxl_json.o libxl_aoutils.o \
> +		libxl_json.o libxl_aoutils.o libxl_numa.o \
>  		libxl_save_callout.o _libxl_save_msgs_callout.o \
>  		libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
>  LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -98,6 +98,106 @@ out:
>      return sched;
>  }
>
> +/* Subtract two values and translate the result into [0, 1] */
> +static double normalized_diff(double a, double b)
> +{
> +#define max(a, b) (a > b ? a : b)
> +    if (!a && a == b)
> +        return 0.0;
> +    return (a - b) / max(a, b);
> +}
> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - the amount of free memory and the number of domains assigned to the
> + *    candidates are considered. In doing that, candidates with a greater
> + *    amount of free memory and fewer domains assigned to them are preferred,
> + *    with free memory "weighting" three times as much as the number of
> + *    domains.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = v1;
> +    const libxl__numa_candidate *c2 = v2;
> +#define sign(a) ((a) > 0 ? 1 : (a) < 0 ? -1 : 0)
> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    return sign(3*freememkb_diff + nrdomains_diff);
> +}
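As a sanity check of the 3x weighting, here is a quick standalone worked
example (made-up numbers, not part of the patch): candidate A has twice
the free memory of B but one more domain assigned; the memory weighting
should make A win, i.e. sort earlier.

#include <stdio.h>

static double normalized_diff(double a, double b)
{
#define max(a, b) (a > b ? a : b)
    if (!a && a == b)
        return 0.0;
    return (a - b) / max(a, b);
}

int main(void)
{
    /* A: 4GB free, 2 domains; B: 2GB free, 1 domain (same nr_nodes) */
    double freememkb_diff = normalized_diff(2048.0 * 1024, 4096.0 * 1024);
    double nrdomains_diff = normalized_diff(2, 1);
    /* 3*(-0.5) + 0.5 = -1.0 < 0, so A sorts first */
    double res = 3*freememkb_diff + nrdomains_diff;

    printf("3*%.2f + %.2f = %.2f => %s sorts first\n",
           freememkb_diff, nrdomains_diff, res, res < 0 ? "A" : "B");
    return 0;
}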
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> +    int nr_candidates = 0;
> +    libxl__numa_candidate *candidates = NULL;
> +    libxl_bitmap candidate_nodemap;
> +    libxl_cpupoolinfo *pinfo;
> +    int nr_pools, rc = 0;
> +    uint32_t memkb;
> +
> +    libxl_bitmap_init(&candidate_nodemap);
> +
> +    /* First of all, if cpupools are in use, better not to mess with them */
> +    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> +    if (!pinfo)
> +        return ERROR_FAIL;
> +    if (nr_pools > 1) {
> +        LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
> +        goto out;
> +    }
> +
> +    rc = libxl_domain_need_memory(CTX, info, &memkb);
> +    if (rc)
> +        goto out;
> +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    /* Find all the candidates with enough free memory and at least
> +     * as many pcpus as the domain has vcpus. */
> +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> +                                    &candidates, &nr_candidates);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
> +
> +    /* No suitable placement candidates. We just return without touching the
> +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> +    if (nr_candidates == 0) {
> +        LOG(NOTICE, "NUMA placement failed, performance might be affected");
> +        goto out;
> +    }
> +
> +    /* Bring the best candidate in front of the list --> candidates[0] */
> +    libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
> +
> +    /*
> +     * At this point, the first candidate in the array is the one we want.
> +     * Go for it by mapping its node map to the domain's info->cpumap.
> +     */
> +    libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap);
> +    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> +        "%"PRIu32" KB free selected", candidates[0].nr_nodes,
> +        candidates[0].nr_cpus, candidates[0].free_memkb);
> +
> + out:
> +    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
> +    return rc;
> +}
> +
>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_build_info *info, libxl__domain_build_state *state)
>  {
> @@ -107,7 +207,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
>
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up one.
> +     * In case numa_place_domain() finds at least one suitable candidate, it
> +     * will update info->cpumap accordingly; if it does not, it just leaves
> +     * it as it is. This means (unless some weird error manifests) the
> +     * subsequent call to libxl_set_vcpuaffinity_all() will do the actual
> +     * placement, whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> +
>      xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
>      if (info->type == LIBXL_DOMAIN_TYPE_PV)
>          xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2216,6 +2216,140 @@ static inline void libxl__ctx_unlock(lib
>  #define CTX_LOCK (libxl__ctx_lock(CTX))
>  #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
>
> +/*
> + * Automatic NUMA placement
> + *
> + * These functions and data structures deal with the initial placement of a
> + * domain onto the host NUMA nodes.
> + *
> + * The key concept here is that of a "NUMA placement candidate", which is
> + * basically a set of nodes whose characteristics have been successfully
> + * checked against some specific requirements. More precisely, a candidate is
> + * the nodemap associated with one of the possible subsets of the host NUMA
> + * nodes providing a certain amount of free memory, or a given number of cpus,
> + * or even both (depending on what the caller wants). For convenience of use,
> + * some of this information is stored within the candidate itself, instead of
> + * always being dynamically computed. A single node is a valid placement
> + * candidate, but it is also possible for a candidate to contain all the nodes
> + * of the host. The fewer nodes there are in a candidate, the better
> + * performance a domain placed onto it should get. For instance, looking for
> + * NUMA candidates with 2GB of free memory means we want all the possible
> + * subsets of the host NUMA nodes with, cumulatively, at least 2GB of free
> + * memory. That could be satisfied by just one particular node, or may
> + * require more nodes, depending on the characteristics of the host, on how
> + * many domains have been created already, on how big they are, etc.
> + *
> + * The intended usage is as follows:
> + *  1. first of all, by calling libxl__get_numa_candidates(), and specifying
> + *     the proper constraints to it (e.g., the amount of memory a domain
> + *     needs as the minimum amount of free memory for the candidates), one
> + *     can build up a whole set of suitable placement alternatives for a
> + *     domain;
> + *  2. after that, one specific candidate should be chosen. That can happen
> + *     by looking at their various characteristics;
> + *  3. the chosen candidate's nodemap should be utilized for computing the
> + *     actual affinity of the domain which, given the current NUMA support
> + *     in the hypervisor, is what determines the placement of the domain's
> + *     vcpus and memory.
> + *
> + * To make phase 2 even easier, a sorting helper function for the list of
> + * candidates is provided in the form of libxl__sort_numa_candidates(). The
> + * only thing that is needed is defining a comparison function, containing
> + * the criteria for deciding, given two candidates, which one is 'better'.
> + * Depending on how the comparison function is defined, the best candidate
> + * (where, of course, best is defined with respect to the heuristics
> + * implemented in the comparison function itself,
> + * libxl__numa_candidate_cmpf()) could become the first or the last element
> + * of the list.
> + *
> + * Summarizing, achieving automatic NUMA placement is just a matter of
> + * obtaining the list of suitable placement candidates, perhaps asking for
> + * each of them to provide at least the amount of memory the domain needs.
> + * After that, just implement a comparison function by means of the various
> + * helpers retrieving the relevant information about the candidates
> + * themselves. Finally, call the sorting helper function and use the
> + * candidate that became (typically) the first element of the list for
> + * determining the domain's affinity.
> + */
> +
> +typedef struct {
> +    int nr_cpus, nr_nodes;
> +    int nr_domains;
> +    uint32_t free_memkb;
> +    libxl_bitmap nodemap;
> +} libxl__numa_candidate;
> +
> +/*
> + * This generates the list of NUMA placement candidates satisfying some
> + * specific conditions. If min_nodes and/or max_nodes are not 0, their values
> + * are used to determine the minimum and maximum number of nodes that are
> + * allowed to be present in each candidate. If min_nodes and/or max_nodes are
> + * 0, the minimum and maximum number of nodes to be used are automatically
> + * selected by the implementation (and that will likely be just 1 node for
> + * the minimum and the total number of existing nodes for the maximum). As
> + * for min_free_memkb and min_cpus, if not 0, they mean the caller only wants
> + * candidates with at least that amount of free memory and that number of
> + * cpus, respectively. If min_free_memkb and/or min_cpus are 0, the
> + * candidates' free memory and number of cpus won't be checked at all, which
> + * means a candidate will always be considered suitable wrt the specific
> + * constraint. cndts is where the list of exactly nr_cndts candidates is
> + * returned. Note that, in case no candidates are found at all, the function
> + * returns successfully, but with nr_cndts equal to zero.
> + */
> +_hidden int libxl__get_numa_candidates(libxl__gc *gc,
> +                                       uint32_t min_free_memkb, int min_cpus,
> +                                       int min_nodes, int max_nodes,
> +                                       libxl__numa_candidate *cndts[],
> +                                       int *nr_cndts);
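As an illustration of the "just define a comparison function" point in the
comment above, a hypothetical policy (not part of the patch) that only cares
about free memory would be as simple as:

/* Hypothetical alternative policy: prefer candidates with the most
 * free memory, ignoring node count and number of assigned domains. */
static int cmpf_most_free_memory(const void *v1, const void *v2)
{
    const libxl__numa_candidate *c1 = v1;
    const libxl__numa_candidate *c2 = v2;

    /* qsort() sorts ascending: return < 0 to put c1 first, so
     * candidates with more free memory sort earlier. */
    if (c1->free_memkb > c2->free_memkb)
        return -1;
    if (c1->free_memkb < c2->free_memkb)
        return 1;
    return 0;
}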
> +
> +/* Initialization, allocation and deallocation for placement candidates */
> +static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt)
> +{
> +    cndt->free_memkb = 0;
> +    cndt->nr_cpus = cndt->nr_nodes = cndt->nr_domains = 0;
> +    libxl_bitmap_init(&cndt->nodemap);
> +}
> +
> +static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
> +                                              libxl__numa_candidate *cndt)
> +{
> +    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0);
> +}
> +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
> +{
> +    libxl_bitmap_dispose(&cndt->nodemap);
> +}
> +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts,
> +                                                  int nr_cndts)
> +{
> +    int i;
> +
> +    for (i = 0; i < nr_cndts; i++)
> +        libxl__numa_candidate_dispose(&cndts[i]);
> +    free(cndts);
> +}
> +
> +/* Retrieve (in nodemap) the node map associated to placement candidate cndt */
> +static inline
> +void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
> +                                       const libxl__numa_candidate *cndt,
> +                                       libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
> +}
> +/* Set the node map of placement candidate cndt to match nodemap */
> +static inline
> +void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> +                                       libxl__numa_candidate *cndt,
> +                                       const libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> +}
> +
> +/* Signature for the comparison function between two candidates c1 and c2 */
> +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
> +/* Sort the list of candidates in cndts (an array with nr_cndts elements in
> + * it) using cmpf for comparing two candidates. Uses libc's qsort(). */
> +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
> +                                         int nr_cndts,
> +                                         libxl__numa_candidate_cmpf cmpf);
>
>  /*
>   * Inserts "elm_new" into the sorted list "head".
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/libxl/libxl_numa.c
> @@ -0,0 +1,382 @@
> +/*
> + * Copyright (C) 2012      Citrix Ltd.
> + * Author Dario Faggioli
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as published
> + * by the Free Software Foundation; version 2.1 only. with the special
> + * exception on linking described in file LICENSE.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + */
> +
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#include <stdlib.h>
> +
> +#include "libxl_internal.h"
> +
> +/*
> + * What follows are helpers for generating all the k-combinations
> + * without repetitions of a set S with n elements in it. Formally
> + * speaking, they are subsets of k distinct elements of S and, if
> + * S is n elements big, the number of k-combinations is equal to
> + * the binomial coefficient (n k), which means we get exactly
> + * n!/(k! * (n - k)!) subsets, all of them with k elements.
> + *
> + * The various subsets are generated one after the other by calling
> + * comb_init() first, and, after that, comb_next()
> + * (n k)-1 times. An iterator is used to store the current status
> + * of the whole generation operation (i.e., basically, the last
> + * combination that has been generated). As soon as all
> + * combinations have been generated, comb_next() will
> + * start returning 0 instead of 1. It is of course important that
> + * the same instance of the iterator and the same values for
> + * n and k are used for each call. If that doesn't happen, the
> + * result is unspecified.
> + *
> + * The algorithm is a well known one (see, for example, D. Knuth's "The
> + * Art of Computer Programming - Volume 4, Fascicle 3") and it produces
> + * the combinations in such a way that they (well, more precisely,
> + * their indexes in the array/map representing the set) come in
> + * lexicographic order.
> + *
> + * For example, with n = 5 and k = 3, calling comb_init()
> + * will generate { 0, 1, 2 }, while subsequent valid calls to
> + * comb_next() will produce the following:
> + * { { 0, 1, 3 }, { 0, 1, 4 },
> + *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
> + *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
> + *   { 2, 3, 4 } }
> + *
> + * This is used by the automatic NUMA placement logic below.
> + */
> +typedef int* comb_iter_t;
> +
> +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
> +{
> +    comb_iter_t new_iter;
> +    int i;
> +
> +    if (n < k)
> +        return 0;
> +
> +    /* First set is always { 0, 1, 2, ..., k-1 } */
> +    GCNEW_ARRAY(new_iter, k);
> +    for (i = 0; i < k; i++)
> +        new_iter[i] = i;
> +
> +    *it = new_iter;
> +    return 1;
> +}
> +
> +static int comb_next(comb_iter_t it, int n, int k)
> +{
> +    int i;
> +
> +    /*
> +     * The idea here is to find the leftmost element from where
> +     * we should start incrementing the indexes of the iterator.
> +     * This means looking for the highest index that can be increased
> +     * while still producing a value smaller than n-1. In the example
> +     * above, when dealing with { 0, 1, 4 }, such an element is the
> +     * second one, as the third is already equal to 4 (which actually
> +     * is n-1).
> +     * Once we have found where to start, we increment that element
> +     * and overwrite the right-hand rest of the iterator with its
> +     * successors, thus achieving lexicographic ordering.
> +     *
> +     * Regarding the termination of the generation process, when we
> +     * manage to bring n-k to the very first position of the iterator,
> +     * we know that is the last valid combination ({ 2, 3, 4 }, with
> +     * n - k = 5 - 3 = 2, in the example above), and thus we start
> +     * returning 0 as soon as we cross that border.
> +     */
> +    for (i = k - 1; it[i] == n - k + i; i--) {
> +        if (i <= 0)
> +            return 0;
> +    }
> +    for (it[i]++, i++; i < k; i++)
> +        it[i] = it[i - 1] + 1;
> +    return 1;
> +}
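For anyone wanting to play with the iterator outside of libxl, a throwaway
harness along these lines (plain malloc() instead of GCNEW_ARRAY(), otherwise
the same logic) reproduces the n = 5, k = 3 sequence from the comment above:

#include <stdio.h>
#include <stdlib.h>

typedef int *comb_iter_t;

static int comb_init(comb_iter_t *it, int n, int k)
{
    int i;

    if (n < k)
        return 0;
    *it = malloc(k * sizeof(int));
    for (i = 0; i < k; i++)
        (*it)[i] = i;        /* first set: { 0, 1, ..., k-1 } */
    return 1;
}

static int comb_next(comb_iter_t it, int n, int k)
{
    int i;

    for (i = k - 1; it[i] == n - k + i; i--)
        if (i <= 0)
            return 0;        /* { n-k, ..., n-1 } was the last one */
    for (it[i]++, i++; i < k; i++)
        it[i] = it[i - 1] + 1;
    return 1;
}

int main(void)
{
    int n = 5, k = 3, i, ok;
    comb_iter_t it;

    /* Prints: 0 1 2 / 0 1 3 / ... / 2 3 4 (ten lines in total) */
    for (ok = comb_init(&it, n, k); ok; ok = comb_next(it, n, k)) {
        for (i = 0; i < k; i++)
            printf("%d ", it[i]);
        printf("\n");
    }
    free(it);
    return 0;
}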
> +
> +/* NUMA automatic placement (see libxl_internal.h for details) */
> +
> +/*
> + * This function turns a k-combination iterator into a node map.
> + * This means the bits in the node map corresponding to the indexes
> + * of the given combination are the ones that will be set.
> + * For example, if the iterator represents the combination { 0, 2, 4 },
> + * the node map will have bits #0, #2 and #4 set.
> + */
> +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
> +{
> +    int i;
> +
> +    libxl_bitmap_set_none(nodemap);
> +    for (i = 0; i < k; i++)
> +        libxl_bitmap_set(nodemap, it[i]);
> +}
> +
> +/* Retrieve the number of cpus that the nodes that are part of the nodemap
> + * span. */
> +static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
> +                              const libxl_bitmap *nodemap)
> +{
> +    int i, nodes_cpus = 0;
> +
> +    for (i = 0; i < nr_cpus; i++) {
> +        if (libxl_bitmap_test(nodemap, tinfo[i].node))
> +            nodes_cpus++;
> +    }
> +    return nodes_cpus;
> +}
> +
> +/* Retrieve the amount of free memory within the nodemap */
> +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
> +                                      libxl_bitmap *nodemap)
> +{
> +    uint32_t free_memkb = 0;
> +    int i;
> +
> +    libxl_for_each_set_bit(i, *nodemap)
> +        free_memkb += ninfo[i].free / 1024;
> +
> +    return free_memkb;
> +}
> +
> +/* Retrieve the number of domains that can potentially run on the cpus
> + * of the nodes that are part of the nodemap. */
> +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
> +                                 const libxl_bitmap *nodemap)
> +{
> +    libxl_dominfo *dinfo = NULL;
> +    libxl_bitmap dom_nodemap;
> +    int nr_doms, nr_cpus;
> +    int nr_domains = 0;
> +    int i, j, k;
> +
> +    dinfo = libxl_list_domain(CTX, &nr_doms);
> +    if (dinfo == NULL)
> +        return ERROR_FAIL;
> +
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
> +        libxl_dominfo_list_free(dinfo, nr_doms);
> +        return ERROR_FAIL;
> +    }
> +
> +    for (i = 0; i < nr_doms; i++) {
> +        libxl_vcpuinfo *vinfo;
> +        int nr_vcpus;
> +
> +        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus);
> +        if (vinfo == NULL)
> +            continue;
> +
> +        /* Compute the set of nodes this domain's vcpus can run on ... */
> +        libxl_bitmap_set_none(&dom_nodemap);
> +        for (j = 0; j < nr_vcpus; j++) {
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> +                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> +        }
> +
> +        /* ... and count the domain if that set intersects nodemap */
> +        libxl_for_each_set_bit(j, dom_nodemap) {
> +            if (libxl_bitmap_test(nodemap, j)) {
> +                nr_domains++;
> +                break;
> +            }
> +        }
> +
> +        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> +    }
> +
> +    libxl_bitmap_dispose(&dom_nodemap);
> +    libxl_dominfo_list_free(dinfo, nr_doms);
> +    return nr_domains;
> +}
> +
> +/*
> + * This function tries to figure out if the host has a consistent number
> + * of cpus along all its NUMA nodes. In fact, if that is the case, we can
> + * calculate the minimum number of nodes needed for a domain by just
> + * dividing its total number of vcpus by this value computed here.
> + * However, we are not allowed to assume that all the nodes have the
> + * same number of cpus. Therefore, in case discrepancies among different
> + * nodes are found, this function just returns 0, for the caller to know
> + * it shouldn't rely on this 'optimization', and sort things out in some
> + * other way (most likely, by just starting to try candidates with just
> + * one node).
> + */
> +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
> +                               libxl_numainfo *ninfo, int nr_nodes)
> +{
> +    int cpus_per_node = 0;
> +    int j, i;
> +
> +    /* This makes sense iff # of PCPUs is the same for all nodes */
> +    for (j = 0; j < nr_nodes; j++) {
> +        int curr_cpus = 0;
> +
> +        for (i = 0; i < nr_cpus; i++) {
> +            if (tinfo[i].node == j)
> +                curr_cpus++;
> +        }
> +        /* So, if the above does not hold, turn the whole thing off! */
> +        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
> +        if (cpus_per_node != curr_cpus)
> +            return 0;
> +    }
> +    return cpus_per_node;
> +}
> +
> +/* Get all the placement candidates satisfying some specific conditions */
> +int libxl__get_numa_candidates(libxl__gc *gc,
> +                               uint32_t min_free_memkb, int min_cpus,
> +                               int min_nodes, int max_nodes,
> +                               libxl__numa_candidate *cndts[], int *nr_cndts)
> +{
> +    libxl__numa_candidate *new_cndts = NULL;
> +    libxl_cputopology *tinfo = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +    int nr_nodes = 0, nr_cpus = 0;
> +    libxl_bitmap nodemap;
> +    int array_size, rc;
> +
> +    libxl_bitmap_init(&nodemap);
> +
> +    /* Get platform info and prepare the map for testing the combinations */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL)
> +        return ERROR_FAIL;
> +    /* If we don't have at least 2 nodes, it is useless to proceed */
> +    if (nr_nodes < 2) {
> +        rc = 0;
> +        goto out;
> +    }
> +
> +    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
> +    if (tinfo == NULL) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0);
> +    if (rc)
> +        goto out;
> +
> +    /*
> +     * If the minimum number of NUMA nodes is not explicitly specified
> +     * (i.e., min_nodes == 0), we try to figure out a sensible number of
> +     * nodes from which to start generating candidates, if possible (or
> +     * just start from 1 otherwise). The maximum number of nodes should
> +     * not exceed the number of existing NUMA nodes on the host, or the
> +     * candidate generation won't work properly.
> +     */
> +    if (!min_nodes) {
> +        int cpus_per_node;
> +
> +        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
> +        if (cpus_per_node == 0)
> +            min_nodes = 1;
> +        else
> +            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
> +    }
> +    if (min_nodes > nr_nodes)
> +        min_nodes = nr_nodes;
> +    if (!max_nodes || max_nodes > nr_nodes)
> +        max_nodes = nr_nodes;
> +    if (min_nodes > max_nodes) {
> +        rc = ERROR_INVAL;
> +        goto out;
> +    }
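(Just to spell out the ceiling division above with made-up numbers: on a
host with 4 cpus per node, a domain with min_cpus = 6 gets
min_nodes = (6 + 4 - 1) / 4 = 2, so the generation below starts directly
from pairs of nodes; a single node could never satisfy the cpu constraint
anyway.)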
> +
> +    /* Initialize the local storage for the combinations */
> +    *nr_cndts = 0;
> +    array_size = nr_nodes;
> +    GCNEW_ARRAY(new_cndts, array_size);
> +
> +    /* Generate all the combinations of any size from min_nodes to
> +     * max_nodes (see comb_init() and comb_next()). */
> +    while (min_nodes <= max_nodes) {
> +        comb_iter_t comb_iter;
> +        int comb_ok;
> +
> +        /*
> +         * And here it is. Each step of this cycle generates a combination of
> +         * nodes as big as min_nodes mandates. Each of these combinations is
> +         * checked against the constraints provided by the caller (namely,
> +         * amount of free memory and number of cpus) and it becomes an actual
> +         * placement candidate iff it passes the check.
> +         */
> +        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
> +             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
> +            uint32_t nodes_free_memkb;
> +            int nodes_cpus;
> +
> +            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
> +
> +            /* If there is not enough memory in this combination, skip it
> +             * and go generating the next one... */
> +            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
> +            if (min_free_memkb && nodes_free_memkb < min_free_memkb)
> +                continue;
> +
> +            /* And the same applies if this combination is short in cpus */
> +            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
> +            if (min_cpus && nodes_cpus < min_cpus)
> +                continue;
> +
> +            /*
> +             * Conditions are met, we can add this combination to the
> +             * NUMA placement candidates list. We first make sure there
> +             * is enough space in there, and then we initialize the new
> +             * candidate element with the node map corresponding to the
> +             * combination we are dealing with. Memory allocation for
> +             * expanding the array that hosts the list happens in chunks
> +             * equal to the number of NUMA nodes in the system (to
> +             * avoid allocating memory each and every time we find a
> +             * new candidate).
> +             */
> +            if (*nr_cndts == array_size)
> +                array_size += nr_nodes;
> +            GCREALLOC_ARRAY(new_cndts, array_size);
> +
> +            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
> +            libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
> +                                              &nodemap);
> +            new_cndts[*nr_cndts].nr_domains =
> +                                    nodemap_to_nr_domains(gc, tinfo, &nodemap);
> +            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
> +            new_cndts[*nr_cndts].nr_nodes = min_nodes;
> +            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
> +
> +            LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
> +                "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
> +                min_nodes, new_cndts[*nr_cndts].nr_cpus,
> +                new_cndts[*nr_cndts].free_memkb);
> +
> +            (*nr_cndts)++;
> +        }
> +        min_nodes++;
> +    }
> +
> +    *cndts = new_cndts;
> + out:
> +    libxl_bitmap_dispose(&nodemap);
> +    libxl_cputopology_list_free(tinfo, nr_cpus);
> +    libxl_numainfo_list_free(ninfo, nr_nodes);
> +    return rc;
> +}
> +
> +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
> +                                 libxl__numa_candidate_cmpf cmpf)
> +{
> +    /* Reorder candidates (see the comparison function for
> +     * the details on the heuristics) */
> +    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
> +}
> +