* [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl @ 2012-07-10 15:03 Dario Faggioli 2012-07-10 15:03 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli ` (3 more replies) 0 siblings, 4 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-10 15:03 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson Hello again, This is a repost of what remained uncommitted of my NUMA series. The splat on 1-node-only systems has been resolved as per Ian Campbell's suggestion (thanks!), and that was the only thing I'm aware of that was preventing these last patches from being checked in. While at it, I also addressed all the other observations made during review, namely, typos and rewording of some sentences, as well as an improved filtering solution for avoiding generating a lot of potential duplicate placement candidates when cpupools are being used. All the patches have already been acked, although with some minor nits that should be gone now. If you have any further comments, I'm here to answer. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-10 15:03 [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli @ 2012-07-10 15:03 ` Dario Faggioli 2012-07-17 15:55 ` Ian Jackson 2012-07-10 15:03 ` [PATCH 2 of 3 v4/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli ` (2 subsequent siblings) 3 siblings, 1 reply; 60+ messages in thread From: Dario Faggioli @ 2012-07-10 15:03 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson If a domain does not have a VCPU affinity, try to pin it automatically to some PCPUs. This is done taking into account the NUMA characteristics of the host. In fact, we look for a combination of host's NUMA nodes with enough free memory and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. Once we know which ones, among all the possible combinations, represent valid placement candidates for a domain, use some heuristics for deciding which is the best. For instance, smaller candidates are considered to be better, both from the domain's point of view (less spreading of memory among nodes) and from the point of view of the system as a whole (less memory fragmentation). In case of candidates of equal sizes (i.e., with the same number of nodes), the amount of free memory and the number of domains already assigned to their nodes are considered. Very often, candidates with a greater amount of memory are the ones we want, as this is also good for keeping memory fragmentation under control. However, if the difference in how much free memory two candidates have is small, the number of assigned domains might be what decides which candidate wins. This all happens internally to libxl, and no API for driving the mechanism is provided for now. This matches what xend already does. 
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> --- Changes from v3: * fixed a double free on non-NUMA systems (namely, when just 1 node exists). * Reword an ambiguous sentence in xl's man page, as requested during review. * Do not special-case single-node systems in libxl__get_numa_candidates(), to avoid generating a spurious warning on them. Changes from v2: * lots of typos. * Clarified some comments, as requested during (ijc's) review. * Added some more information/reference for the combination generation algorithm. * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus(). * libxl_bitmap_init() used to make sure we do not try to free random memory on failure paths of functions that allocate a libxl_bitmap. * Always invoke libxl__sort_numa_candidates(), even if we get there with just 1 candidate, as requested during review. * Simplified the if-s that check for input parameter consistency in libxl__get_numa_candidates() as requested during (gwd's) review. * Comparison function for candidates changed so that it now provides total ordering, as requested during review. It is still using FP arithmetic, though. Also I think that just putting the difference between the amounts of free memory and the difference between the numbers of assigned domains of two candidates into a single formula (after normalizing and weighting them) is both clear and effective enough. * Function definitions moved to a numa specific source file (libxl_numa.c), as suggested during review. Changes from v1: * This patch incorporates the changes from both "libxl, xl: enable automatic placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering NUMA placement candidates" from v1. * The logic of the algorithm is basically the same as in v1, but the splitting of it in the various functions has been completely redesigned from scratch. 
* No public API for placement or candidate generation is now exposed, everything happens within libxl, as agreed during v1 review. * The relevant documentation has been moved near the actual functions and features. Also, the amount and (hopefully!) the quality of the documentation has been improved a lot, as requested. * All the comments about using the proper libxl facilities and helpers for allocations, etc., have been considered and applied. * This patch still bails out from NUMA optimizations if it finds out cpupools are being utilized. It is the next patch that makes the two things interact properly, as suggested during review. diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -111,8 +111,8 @@ created online and the remainder will be =item B<cpus="CPU-LIST"> -List of which cpus the guest is allowed to use. Default behavior is -`all cpus`. A C<CPU-LIST> may be specified as follows: +List of which cpus the guest is allowed to use. By default xl will (via +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows: =over 4 @@ -132,6 +132,12 @@ run on cpu #3 of the host. =back +If this option is not specified, libxl automatically tries to place the new +domain on the host's NUMA nodes (provided the host has more than one NUMA +node) by pinning it to the cpus of those nodes. A heuristic approach is +utilized with the goals of maximizing performance for the domain and, at +the same time, achieving efficient utilization of the host's CPUs and RAM. 
+ =item B<cpu_weight=WEIGHT> A domain with a weight of 512 will get twice as much CPU as a domain diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \ libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \ libxl_internal.o libxl_utils.o libxl_uuid.o \ - libxl_json.o libxl_aoutils.o \ + libxl_json.o libxl_aoutils.o libxl_numa.o \ libxl_save_callout.o _libxl_save_msgs_callout.o \ libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y) LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -98,6 +98,106 @@ out: return sched; } +/* Subtract two values and translate the result in [0, 1] */ +static double normalized_diff(double a, double b) +{ +#define max(a, b) (a > b ? a : b) + if (!a && a == b) + return 0.0; + return (a - b) / max(a, b); +} + +/* + * The NUMA placement candidates are reordered according to the following + * heuristics: + * - candidates involving fewer nodes come first. In case two (or + * more) candidates span the same number of nodes, + * - the amount of free memory and the number of domains assigned to the + * candidates are considered. In doing that, candidates with greater + * amount of free memory and fewer domains assigned to them are preferred, + * with free memory "weighting" three times as much as number of domains. + */ +static int numa_cmpf(const void *v1, const void *v2) +{ + const libxl__numa_candidate *c1 = v1; + const libxl__numa_candidate *c2 = v2; +#define sign(a) a > 0 ? 1 : a < 0 ? 
-1 : 0 + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); + + if (c1->nr_nodes != c2->nr_nodes) + return c1->nr_nodes - c2->nr_nodes; + + return sign(3*freememkb_diff + nrdomains_diff); +} + +/* The actual automatic NUMA placement routine */ +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +{ + int nr_candidates = 0; + libxl__numa_candidate *candidates = NULL; + libxl_bitmap candidate_nodemap; + libxl_cpupoolinfo *pinfo; + int nr_pools, rc = 0; + uint32_t memkb; + + libxl_bitmap_init(&candidate_nodemap); + + /* First of all, if cpupools are in use, better not to mess with them */ + pinfo = libxl_list_cpupool(CTX, &nr_pools); + if (!pinfo) + return ERROR_FAIL; + if (nr_pools > 1) { + LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); + goto out; + } + + rc = libxl_domain_need_memory(CTX, info, &memkb); + if (rc) + goto out; + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { + rc = ERROR_FAIL; + goto out; + } + + /* Find all the candidates with enough free memory and at least + * as much pcpus as the domain has vcpus. */ + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + &candidates, &nr_candidates); + if (rc) + goto out; + + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); + + /* No suitable placement candidates. We just return without touching the + * domain's info->cpumap. It will have affinity with all nodes/cpus. */ + if (nr_candidates == 0) { + LOG(NOTICE, "NUMA placement failed, performance might be affected"); + goto out; + } + + /* Bring the best candidate in front of the list --> candidates[0] */ + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); + + /* + * At this point, the first candidate in the array is the one we want. + * Go for it by mapping its node map to the domain's info->cpumap. 
+ */ + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap); + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + if (rc) + goto out; + + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " + "%"PRIu32" KB free selected", candidates[0].nr_nodes, + candidates[0].nr_cpus, candidates[0].free_memkb / 1024); + + out: + libxl_bitmap_dispose(&candidate_nodemap); + libxl_cpupoolinfo_list_free(pinfo, nr_pools); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state) { @@ -107,7 +207,22 @@ int libxl__build_pre(libxl__gc *gc, uint uint32_t rtc_timeoffset; xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); + + /* + * Check if the domain has any CPU affinity. If not, try to build up one. + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->cpumap accordingly; if it does not, it just leaves it + * as it is. This means (unless some weird error manifests) the subsequent + * call to libxl_set_vcpuaffinity_all() will do the actual placement, + * whatever that turns out to be. + */ + if (libxl_bitmap_is_full(&info->cpumap)) { + int rc = numa_place_domain(gc, info); + if (rc) + return rc; + } libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); if (info->type == LIBXL_DOMAIN_TYPE_PV) xc_domain_set_memmap_limit(ctx->xch, domid, diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2216,6 +2216,140 @@ static inline void libxl__ctx_unlock(lib #define CTX_LOCK (libxl__ctx_lock(CTX)) #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) +/* + * Automatic NUMA placement + * + * These functions and data structures deal with the initial placement of a + * domain onto the host NUMA nodes. 
+ * + * The key concept here is that of "NUMA placement candidate", which is + * basically a set of nodes whose characteristics have been successfully + * checked against some specific requirements. More precisely, a candidate is + * the nodemap associated with one of the possible subsets of the host NUMA + * nodes providing a certain amount of free memory, or a given number of cpus, + * or even both (depending on what the caller wants). For convenience of use, + * some of this information is stored within the candidate itself, instead of + * always being dynamically computed. A single node is a valid placement + * candidate, but it is also possible for a candidate to contain all the nodes + * of the host. The fewer nodes there are in a candidate, the better + * performance a domain placed onto it should get. For instance, looking for + * NUMA candidates with 2GB of free memory means we want all the possible + * subsets of the host NUMA nodes with, cumulatively, at least 2GB of free + * memory. That could be possible by just using one particular node, or may + * require more nodes, depending on the characteristics of the host, on how + * many domains have been created already, on how big they are, etc. + * + * The intended usage is as follows: + * 1. by, first of all, calling libxl__get_numa_candidates(), and specifying + * the proper constraints to it (e.g., the amount of memory a domain needs + * as the minimum amount of free memory for the candidates) one can build + * up a whole set of suitable placement alternatives for a domain; + * 2. after that, one specific candidate should be chosen. That can happen + * by looking at their various characteristics; + * 3. the chosen candidate's nodemap should be utilized for computing the + * actual affinity of the domain which, given the current NUMA support + * in the hypervisor, is what determines the placement of the domain's + * vcpus and memory. 
+ * + * To make phase 2 even easier, a sorting helper function for the list of + * candidates is provided in the form of libxl__sort_numa_candidates(). The + * only thing that is needed is defining a comparison function, containing the + * criteria for deciding, given two candidates, which one is 'better'. + * Depending on how the comparison function is defined, the best candidate + * (where, of course, best is defined with respect to the heuristics + * implemented in the comparison function itself, libxl__numa_candidate_cmpf()) + * could become the first or the last element of the list. + * + * Summarizing, achieving automatic NUMA placement is just a matter of + * obtaining the list of suitable placement candidates, perhaps asking for each + * of them to provide at least the amount of memory the domain needs. After + * that just implement a comparison function by means of the various helpers + * retrieving the relevant information about the candidates themselves. + * Finally, call the sorting helper function and use the candidate that became + * (typically) the first element of the list for determining the domain's + * affinity. + */ + +typedef struct { + int nr_cpus, nr_nodes; + int nr_domains; + uint32_t free_memkb; + libxl_bitmap nodemap; +} libxl__numa_candidate; + +/* + * This generates the list of NUMA placement candidates satisfying some + * specific conditions. If min_nodes and/or max_nodes are not 0, their value is + * used to determine the minimum and maximum number of nodes that are allowed + * to be present in each candidate. If min_nodes and/or max_nodes are 0, the + * minimum and maximum number of nodes to be used are automatically selected by + * the implementation (and that will likely be just 1 node for the minimum and + * the total number of existing nodes for the maximum). As for min_free_memkb + * and min_cpus, if not 0, it means the caller only wants candidates with at + * least that amount of free memory and that number of cpus, respectively. 
If + * min_free_memkb and/or min_cpus are 0, the candidates' free memory and number + * of cpus won't be checked at all, which means a candidate will always be + * considered suitable wrt the specific constraint. cndts is where the list of + * exactly nr_cndts candidates is returned. Note that, in case no candidates + * are found at all, the function returns successfully, but with nr_cndts equal + * to zero. + */ +_hidden int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts); + +/* Initialization, allocation and deallocation for placement candidates */ +static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt) +{ + cndt->free_memkb = 0; + cndt->nr_cpus = cndt->nr_nodes = cndt->nr_domains = 0; + libxl_bitmap_init(&cndt->nodemap); +} + +static inline int libxl__numa_candidate_alloc(libxl__gc *gc, + libxl__numa_candidate *cndt) +{ + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0); +} +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) +{ + libxl_bitmap_dispose(&cndt->nodemap); +} +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts, + int nr_cndts) +{ + int i; + + for (i = 0; i < nr_cndts; i++) + libxl__numa_candidate_dispose(&cndts[i]); + free(cndts); +} + +/* Retrieve (in nodemap) the node map associated to placement candidate cndt */ +static inline +void libxl__numa_candidate_get_nodemap(libxl__gc *gc, + const libxl__numa_candidate *cndt, + libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); +} +/* Set the node map of placement candidate cndt to match nodemap */ +static inline +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, + libxl__numa_candidate *cndt, + const libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); +} + +/* Signature for the comparison function between two candidates c1 and c2 */ +typedef int 
(*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); +/* Sort the list of candidates in cndts (an array with nr_cndts elements in + * it) using cmpf for comparing two candidates. Uses libc's qsort(). */ +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], + int nr_cndts, + libxl__numa_candidate_cmpf cmpf); /* * Inserts "elm_new" into the sorted list "head". diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c new file mode 100644 --- /dev/null +++ b/tools/libxl/libxl_numa.c @@ -0,0 +1,377 @@ +/* + * Copyright (C) 2012 Citrix Ltd. + * Author Dario Faggioli <dario.faggioli@citrix.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + */ + +#include "libxl_osdeps.h" /* must come before any other headers */ + +#include <glob.h> + +#include "libxl_internal.h" + +/* + * What follows are helpers for generating all the k-combinations + * without repetitions of a set S with n elements in it. Formally + * speaking, they are subsets of k distinct elements of S and, if + * S is n elements big, the number of k-combinations is equal to + * the binomial coefficient (n k), which means we get exactly + * n!/(k! * (n - k)!) subsets, all of them with k elements. + * + * The various subsets are generated one after the other by calling + * comb_init() first, and, after that, comb_next() + * (n k)-1 times. An iterator is used to store the current status + * of the whole generation operation (i.e., basically, the last + * combination that has been generated). 
As soon as all + * combinations have been generated, comb_next() will + * start returning 0 instead of 1. It is of course important that + * the same instance of the iterator and the same values for + * n and k are used for each call. If that doesn't happen, the + * result is unspecified. + * + * The algorithm is a well known one (see, for example, D. Knuth's "The + * Art of Computer Programming - Volume 4, Fascicle 3") and it produces + * the combinations in such a way that they (well, more precisely, + * their indexes in the array/map representing the set) come with + * lexicographic ordering. + * + * For example, with n = 5 and k = 3, calling comb_init() + * will generate { 0, 1, 2 }, while subsequent valid calls to + * comb_next() will produce the following: + * { { 0, 1, 3 }, { 0, 1, 4 }, + * { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 }, + * { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 }, + * { 2, 3, 4 } } + * + * This is used by the automatic NUMA placement logic below. + */ +typedef int* comb_iter_t; + +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k) +{ + comb_iter_t new_iter; + int i; + + if (n < k) + return 0; + + /* First set is always { 0, 1, 2, ..., k-1 } */ + GCNEW_ARRAY(new_iter, k); + for (i = 0; i < k; i++) + new_iter[i] = i; + + *it = new_iter; + return 1; +} + +static int comb_next(comb_iter_t it, int n, int k) +{ + int i; + + /* + * The idea here is to find the leftmost element from where + * we should start incrementing the indexes of the iterator. + * This means looking for the highest index that can be increased + * while still producing a value smaller than n-1. In the example + * above, when dealing with { 0, 1, 4 }, such an element is the + * second one, as the third is already equal to 4 (which actually + * is n-1). + * Once we have found where to start, we increment that element + * and override the right-hand rest of the iterator with its + * successors, thus achieving lexicographic ordering. 
+ * + * Regarding the termination of the generation process, when we + * manage to bring n-k to the very first position of the iterator, + * we know that it is the last valid combination ( { 2, 3, 4 }, with + * n - k = 5 - 3 = 2, in the example above), and thus we start + * returning 0 as soon as we cross that border. + */ + for (i = k - 1; it[i] == n - k + i; i--) { + if (i <= 0) + return 0; + } + for (it[i]++, i++; i < k; i++) + it[i] = it[i - 1] + 1; + return 1; +} + +/* NUMA automatic placement (see libxl_internal.h for details) */ + +/* + * This function turns a k-combination iterator into a node map. + * This means the bits in the node map corresponding to the indexes + * of the given combination are the ones that will be set. + * For example, if the iterator represents the combination { 0, 2, 4 }, + * the node map will have bits #0, #2 and #4 set. + */ +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) +{ + int i; + + libxl_bitmap_set_none(nodemap); + for (i = 0; i < k; i++) + libxl_bitmap_set(nodemap, it[i]); +} + +/* Retrieve the number of cpus that the nodes that are part of the nodemap + * span. */ +static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *nodemap) +{ + int i, nodes_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (libxl_bitmap_test(nodemap, tinfo[i].node)) + nodes_cpus++; + } + return nodes_cpus; +} + +/* Retrieve the amount of free memory within the nodemap */ +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo, + libxl_bitmap *nodemap) +{ + uint32_t free_memkb = 0; + int i; + + libxl_for_each_set_bit(i, *nodemap) + free_memkb += ninfo[i].free / 1024; + + return free_memkb; +} + +/* Retrieve the number of domains that can potentially run on the cpus of + * the nodes that are part of the nodemap. 
*/ +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, + const libxl_bitmap *nodemap) +{ + libxl_dominfo *dinfo = NULL; + libxl_bitmap dom_nodemap; + int nr_doms, nr_cpus; + int nr_domains = 0; + int i, j, k; + + dinfo = libxl_list_domain(CTX, &nr_doms); + if (dinfo == NULL) + return ERROR_FAIL; + + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { + libxl_dominfo_list_free(dinfo, nr_doms); + return ERROR_FAIL; + } + + for (i = 0; i < nr_doms; i++) { + libxl_vcpuinfo *vinfo; + int nr_vcpus; + + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); + if (vinfo == NULL) + continue; + + libxl_bitmap_set_none(&dom_nodemap); + for (j = 0; j < nr_vcpus; j++) { + libxl_for_each_set_bit(k, vinfo[j].cpumap) + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); + } + + libxl_for_each_set_bit(j, dom_nodemap) { + if (libxl_bitmap_test(nodemap, j)) { + nr_domains++; + break; + } + } + + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); + } + + libxl_bitmap_dispose(&dom_nodemap); + libxl_dominfo_list_free(dinfo, nr_doms); + return nr_domains; +} + +/* + * This function tries to figure out if the host has a consistent number + * of cpus along all its NUMA nodes. In fact, if that is the case, we can + * calculate the minimum number of nodes needed for a domain by just + * dividing its total number of vcpus by this value computed here. + * However, we are not allowed to assume that all the nodes have the + * same number of cpus. Therefore, in case discrepancies among different + * nodes are found, this function just returns 0, for the caller to know + * it shouldn't rely on this 'optimization', and sort out things in some + * other way (most likely, just start trying with candidates with just + * one node). 
+ */ +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus, + libxl_numainfo *ninfo, int nr_nodes) +{ + int cpus_per_node = 0; + int j, i; + + /* This makes sense iff # of PCPUs is the same for all nodes */ + for (j = 0; j < nr_nodes; j++) { + int curr_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (tinfo[i].node == j) + curr_cpus++; + } + /* So, if the above does not hold, turn the whole thing off! */ + cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node; + if (cpus_per_node != curr_cpus) + return 0; + } + return cpus_per_node; +} + +/* Get all the placement candidates satisfying some specific conditions */ +int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts) +{ + libxl__numa_candidate *new_cndts = NULL; + libxl_cputopology *tinfo = NULL; + libxl_numainfo *ninfo = NULL; + int nr_nodes = 0, nr_cpus = 0; + libxl_bitmap nodemap; + int array_size, rc; + + libxl_bitmap_init(&nodemap); + + /* Get platform info and prepare the map for testing the combinations */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) + return ERROR_FAIL; + + tinfo = libxl_get_cpu_topology(CTX, &nr_cpus); + if (tinfo == NULL) { + rc = ERROR_FAIL; + goto out; + } + + rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0); + if (rc) + goto out; + + /* + * If the minimum number of NUMA nodes is not explicitly specified + * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes + * from where to start generating candidates, if possible (or just start + * from 1 otherwise). The maximum number of nodes should not exceed the + * number of existent NUMA nodes on the host, or the candidate generation + * won't work properly. 
+ */ + if (!min_nodes) { + int cpus_per_node; + + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes); + if (cpus_per_node == 0) + min_nodes = 1; + else + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node; + } + if (min_nodes > nr_nodes) + min_nodes = nr_nodes; + if (!max_nodes || max_nodes > nr_nodes) + max_nodes = nr_nodes; + if (min_nodes > max_nodes) { + rc = ERROR_INVAL; + goto out; + } + + /* Initialize the local storage for the combinations */ + *nr_cndts = 0; + array_size = nr_nodes; + GCNEW_ARRAY(new_cndts, array_size); + + /* Generate all the combinations of any size from min_nodes to + * max_nodes (see comb_init() and comb_next()). */ + while (min_nodes <= max_nodes) { + comb_iter_t comb_iter; + int comb_ok; + + /* + * And here it is. Each step of this cycle generates a combination of + * nodes as big as min_nodes mandates. Each of these combinations is + * checked against the constraints provided by the caller (namely, + * amount of free memory and number of cpus) and it becomes an actual + * placement candidate iff it passes the check. + */ + for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; + comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { + uint32_t nodes_free_memkb; + int nodes_cpus; + + comb_get_nodemap(comb_iter, &nodemap, min_nodes); + + /* If there is not enough memory in this combination, skip it + * and go generating the next one... */ + nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap); + if (min_free_memkb && nodes_free_memkb < min_free_memkb) + continue; + + /* And the same applies if this combination is short in cpus */ + nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap); + if (min_cpus && nodes_cpus < min_cpus) + continue; + + /* + * Conditions are met, we can add this combination to the + * NUMA placement candidates list. 
We first make sure there + * is enough space in there, and then we initialize the new + * candidate element with the node map corresponding to the + * combination we are dealing with. Memory allocation for + * expanding the array that hosts the list happens in chunks + * equal to the number of NUMA nodes in the system (to + * avoid allocating memory each and every time we find a + * new candidate). + */ + if (*nr_cndts == array_size) + array_size += nr_nodes; + GCREALLOC_ARRAY(new_cndts, array_size); + + libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]); + libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts], + &nodemap); + new_cndts[*nr_cndts].nr_domains = + nodemap_to_nr_domains(gc, tinfo, &nodemap); + new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; + new_cndts[*nr_cndts].nr_nodes = min_nodes; + new_cndts[*nr_cndts].nr_cpus = nodes_cpus; + + LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " + "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, + min_nodes, new_cndts[*nr_cndts].nr_cpus, + new_cndts[*nr_cndts].free_memkb / 1024); + + (*nr_cndts)++; + } + min_nodes++; + } + + *cndts = new_cndts; + out: + libxl_bitmap_dispose(&nodemap); + libxl_cputopology_list_free(tinfo, nr_cpus); + libxl_numainfo_list_free(ninfo, nr_nodes); + return rc; +} + +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts, + libxl__numa_candidate_cmpf cmpf) +{ + /* Reorder candidates (see the comparison function for + * the details on the heuristics) */ + qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf); +} + + ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-10 15:03 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli @ 2012-07-17 15:55 ` Ian Jackson 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Ian Jackson @ 2012-07-17 15:55 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > If a domain does not have a VCPU affinity, try to pin it automatically to some > PCPUs. This is done taking into account the NUMA characteristics of the host. > In fact, we look for a combination of host's NUMA nodes with enough free memory > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. Thanks for this admirably clear patch. Can you please rewrap your commit messages to around 70 columns ? Many VCSs indent them in the log in some situations, and as you see here mail programs indent them when quoting too. Indeed I would generally prefer to wrap code to ~75 rather than 80 but I know that's a bit controversial. > +/* Subtract two values and translate the result in [0, 1] */ > +static double normalized_diff(double a, double b) > +{ > +#define max(a, b) (a > b ? a : b) > + if (!a && a == b) > + return 0.0; > + return (a - b) / max(a, b); > +} 1. This macro max() should be in libxl_internal.h. 2. It should be MAX so people are warned it's a macro 3. It should have all the necessary ()s for macro precedence safety > +static int numa_cmpf(const void *v1, const void *v2) > +{ > + const libxl__numa_candidate *c1 = v1; > + const libxl__numa_candidate *c2 = v2; > +#define sign(a) a > 0 ? 1 : a < 0 ? -1 : 0 Again. 
> + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); > + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); > + > + if (c1->nr_nodes != c2->nr_nodes) > + return c1->nr_nodes - c2->nr_nodes; > + > + return sign(3*freememkb_diff + nrdomains_diff); The reason you need what sign() does is that you need to convert from double to int, I guess. > + > + /* > + * Check if the domain has any CPU affinity. If not, try to build up one. > + * In case numa_place_domain() find at least a suitable candidate, it will > + * affect info->cpumap accordingly; if it does not, it just leaves it > + * as it is. This means (unless some weird error manifests) the subsequent > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > + * whatever that turns out to be. > + */ > + if (libxl_bitmap_is_full(&info->cpumap)) { > + int rc = numa_place_domain(gc, info); > + if (rc) > + return rc; > + } I guess it would be preferable to do this only if the bitmap was full by default, so that setting the bitmap explicitly to all cpus still works. I'm not sure that that's essential to have, though. > + /* > + * Conditions are met, we can add this combination to the > + * NUMA placement candidates list. We first make sure there > + * is enough space in there, and then we initialize the new > + * candidate element with the node map corresponding to the > + * combination we are dealing with. Memory allocation for > + * expanding the array that hosts the list happens in chunks > + * equal to the number of NUMA nodes in the system (to > + * avoid allocating memory each and every time we find a > + * new candidate). > + */ > + if (*nr_cndts == array_size) > + array_size += nr_nodes; > + GCREALLOC_ARRAY(new_cndts, array_size); This part of the algorithm is quadratic in the number of combinations divided by the number of nodes. So the algorithm is O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) which is quite bad really. 
At the very least this needs to be an exponential allocation, eg
  + array_size += nr_nodes + array_size;

But if you didn't insist on creating the whole list and sorting it, you would avoid this allocation entirely, wouldn't you ?

Should we be worried that this algorithm will be too slow even if it involves just
  O( C(nr_nodes,min_nodes) )
iterations ?

Thanks, Ian.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl
@ 2012-07-16 17:13 ` Dario Faggioli
  2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
  ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Dario Faggioli @ 2012-07-16 17:13 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

Hello again,

This is a new version (fixing a small bug) of this series: <patchbomb.1341932613@Solace>, which in turn was a repost of what remained un-committed of my NUMA placement series.

As already stated, the splat on 1-node-only systems has been resolved as per Ian Campbell's suggestion (thanks!), and that was the only thing I was aware of that was preventing these last patches from being checked in.

While at it, I also addressed all the other observations made during review, namely typos and the rewording of some sentences, as well as an improved filtering solution for avoiding the generation of a lot of potentially duplicate placement candidates when cpupools are being used.

All the patches have already been acked, although with some minor nits that should be gone now. If you have any further comments, I'm here to answer. :-)

Thanks and Regards, Dario

^ permalink raw reply	[flat|nested] 60+ messages in thread
* [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli
@ 2012-07-16 17:13 ` Dario Faggioli
  2012-07-17 18:04 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Ian Jackson
  2012-07-19 12:21 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Andre Przywara
  2012-07-16 17:13 ` [PATCH 2 of 3 v5/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 60+ messages in thread
From: Dario Faggioli @ 2012-07-16 17:13 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

If a domain does not have a VCPU affinity, try to pin it automatically to some PCPUs. This is done taking into account the NUMA characteristics of the host. In fact, we look for a combination of the host's NUMA nodes with enough free memory and enough PCPUs for the new domain, and pin it to the VCPUs of those nodes.

Once we know which ones, among all the possible combinations, represent valid placement candidates for a domain, we use some heuristics to decide which one is the best. For instance, smaller candidates are considered to be better, both from the domain's point of view (less spreading of its memory among nodes) and from the point of view of the system as a whole (less memory fragmentation). In the case of candidates of equal size (i.e., with the same number of nodes), the amount of free memory and the number of domains already assigned to their nodes are considered. Very often, candidates with a greater amount of free memory are the ones we want, as this is also good for keeping memory fragmentation under control. However, if the difference in how much free memory two candidates have is small, the number of assigned domains might be what decides which candidate wins.
This all happens internally to libxl, and no API for driving the mechanism is provided for now. This matches what xend already does.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

---
Changes from v3:
 * fixed a double free on non-NUMA systems (namely, when just 1 node exists).
 * Reworded an ambiguous sentence in xl's man page, as requested during review.
 * Do not special-case single-node systems in libxl__get_numa_candidates(), to avoid generating a spurious warning on them.

Changes from v2:
 * lots of typos.
 * Clarified some comments, as requested during (ijc's) review.
 * Added some more information/references for the combination generation algorithm.
 * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
 * libxl_bitmap_init() used to make sure we do not try to free random memory on failure paths of functions that allocate a libxl_bitmap.
 * Always invoke libxl__sort_numa_candidates(), even if we get there with just 1 candidate, as requested during review.
 * Simplified the if-s that check for input parameter consistency in libxl__get_numa_candidates(), as requested during (gwd's) review.
 * Comparison function for candidates changed so that it now provides total ordering, as requested during review. It is still using FP arithmetic, though. Also, I think that just putting the difference between the amounts of free memory and the difference between the numbers of assigned domains of two candidates into a single formula (after normalizing and weighting them) is both clear and effective enough.
 * Function definitions moved to a NUMA-specific source file (libxl_numa.c), as suggested during review.

Changes from v1:
 * This patch incorporates the changes from both "libxl, xl: enable automatic placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering NUMA placement candidates" from v1.
 * The logic of the algorithm is basically the same as in v1, but the splitting of it into the various functions has been completely redesigned from scratch.
 * No public API for placement or candidate generation is now exposed; everything happens within libxl, as agreed during v1 review.
 * The relevant documentation has been moved near the actual functions and features. Also, the amount and (hopefully!) the quality of the documentation have been improved a lot, as requested.
 * All the comments about using the proper libxl facilities and helpers for allocations, etc., have been considered and applied.
 * This patch still bails out of the NUMA optimizations if it finds out cpupools are being utilized. It is the next patch that makes the two things interact properly, as suggested during review.

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -111,8 +111,8 @@ created online and the remainder will be

 =item B<cpus="CPU-LIST">

-List of which cpus the guest is allowed to use. Default behavior is
-`all cpus`. A C<CPU-LIST> may be specified as follows:
+List of which cpus the guest is allowed to use. By default xl will (via
+libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:

 =over 4

@@ -132,6 +132,12 @@ run on cpu #3 of the host.

 =back

+If this option is not specified, libxl automatically tries to place the new
+domain on the host's NUMA nodes (provided the host has more than one NUMA
+node) by pinning it to the cpus of those nodes. A heuristic approach is
+utilized with the goals of maximizing performance for the domain and, at
+the same time, achieving efficient utilization of the host's CPUs and RAM.
+ =item B<cpu_weight=WEIGHT> A domain with a weight of 512 will get twice as much CPU as a domain diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \ libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \ libxl_internal.o libxl_utils.o libxl_uuid.o \ - libxl_json.o libxl_aoutils.o \ + libxl_json.o libxl_aoutils.o libxl_numa.o \ libxl_save_callout.o _libxl_save_msgs_callout.o \ libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y) LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -98,6 +98,106 @@ out: return sched; } +/* Subtract two values and translate the result in [0, 1] */ +static double normalized_diff(double a, double b) +{ +#define max(a, b) (a > b ? a : b) + if (!a && a == b) + return 0.0; + return (a - b) / max(a, b); +} + +/* + * The NUMA placement candidates are reordered according to the following + * heuristics: + * - candidates involving fewer nodes come first. In case two (or + * more) candidates span the same number of nodes, + * - the amount of free memory and the number of domains assigned to the + * candidates are considered. In doing that, candidates with greater + * amount of free memory and fewer domains assigned to them are preferred, + * with free memory "weighting" three times as much as number of domains. + */ +static int numa_cmpf(const void *v1, const void *v2) +{ + const libxl__numa_candidate *c1 = v1; + const libxl__numa_candidate *c2 = v2; +#define sign(a) a > 0 ? 1 : a < 0 ? 
-1 : 0 + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); + + if (c1->nr_nodes != c2->nr_nodes) + return c1->nr_nodes - c2->nr_nodes; + + return sign(3*freememkb_diff + nrdomains_diff); +} + +/* The actual automatic NUMA placement routine */ +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +{ + int nr_candidates = 0; + libxl__numa_candidate *candidates = NULL; + libxl_bitmap candidate_nodemap; + libxl_cpupoolinfo *pinfo; + int nr_pools, rc = 0; + uint32_t memkb; + + libxl_bitmap_init(&candidate_nodemap); + + /* First of all, if cpupools are in use, better not to mess with them */ + pinfo = libxl_list_cpupool(CTX, &nr_pools); + if (!pinfo) + return ERROR_FAIL; + if (nr_pools > 1) { + LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); + goto out; + } + + rc = libxl_domain_need_memory(CTX, info, &memkb); + if (rc) + goto out; + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { + rc = ERROR_FAIL; + goto out; + } + + /* Find all the candidates with enough free memory and at least + * as much pcpus as the domain has vcpus. */ + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + &candidates, &nr_candidates); + if (rc) + goto out; + + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); + + /* No suitable placement candidates. We just return without touching the + * domain's info->cpumap. It will have affinity with all nodes/cpus. */ + if (nr_candidates == 0) { + LOG(NOTICE, "NUMA placement failed, performance might be affected"); + goto out; + } + + /* Bring the best candidate in front of the list --> candidates[0] */ + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); + + /* + * At this point, the first candidate in the array is the one we want. + * Go for it by mapping its node map to the domain's info->cpumap. 
+ */ + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap); + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + if (rc) + goto out; + + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " + "%"PRIu32" KB free selected", candidates[0].nr_nodes, + candidates[0].nr_cpus, candidates[0].free_memkb / 1024); + + out: + libxl_bitmap_dispose(&candidate_nodemap); + libxl_cpupoolinfo_list_free(pinfo, nr_pools); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state) { @@ -107,7 +207,22 @@ int libxl__build_pre(libxl__gc *gc, uint uint32_t rtc_timeoffset; xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); + + /* + * Check if the domain has any CPU affinity. If not, try to build up one. + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->cpumap accordingly; if it does not, it just leaves it + * as it is. This means (unless some weird error manifests) the subsequent + * call to libxl_set_vcpuaffinity_all() will do the actual placement, + * whatever that turns out to be. + */ + if (libxl_bitmap_is_full(&info->cpumap)) { + int rc = numa_place_domain(gc, info); + if (rc) + return rc; + } libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); if (info->type == LIBXL_DOMAIN_TYPE_PV) xc_domain_set_memmap_limit(ctx->xch, domid, diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2216,6 +2216,140 @@ static inline void libxl__ctx_unlock(lib #define CTX_LOCK (libxl__ctx_lock(CTX)) #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) +/* + * Automatic NUMA placement + * + * These functions and data structures deal with the initial placement of a + * domain onto the host NUMA nodes. 
+ *
+ * The key concept here is the one of "NUMA placement candidate", which is
+ * basically a set of nodes whose characteristics have been successfully
+ * checked against some specific requirements. More precisely, a candidate is
+ * the nodemap associated with one of the possible subsets of the host NUMA
+ * nodes providing a certain amount of free memory, or a given number of cpus,
+ * or even both (depending on what the caller wants). For convenience of use,
+ * some of this information is stored within the candidate itself, instead of
+ * always being dynamically computed. A single node is a valid placement
+ * candidate, but it is also possible for a candidate to contain all the nodes
+ * of the host. The fewer nodes there are in a candidate, the better
+ * performance a domain placed onto it should get. For instance, looking for
+ * NUMA candidates with 2GB of free memory means we want all the possible
+ * subsets of the host NUMA nodes with, cumulatively, at least 2GB of free
+ * memory. That could be possible by just using one particular node, or may
+ * require more nodes, depending on the characteristics of the host, on how
+ * many domains have been created already, on how big they are, etc.
+ *
+ * The intended usage is as follows:
+ *  1. by, first of all, calling libxl__get_numa_candidates(), and specifying
+ *     the proper constraints to it (e.g., the amount of memory a domain needs
+ *     as the minimum amount of free memory for the candidates), one can build
+ *     up a whole set of suitable placement alternatives for a domain;
+ *  2. after that, one specific candidate should be chosen. That can happen
+ *     by looking at their various characteristics;
+ *  3. the chosen candidate's nodemap should be utilized for computing the
+ *     actual affinity of the domain which, given the current NUMA support
+ *     in the hypervisor, is what determines the placement of the domain's
+ *     vcpus and memory.
+ *
+ * To make phase 2 even easier, a sorting helper function for the list of
+ * candidates is provided in the form of libxl__sort_numa_candidates(). The
+ * only thing that is needed is defining a comparison function, containing
+ * the criteria for deciding, given two candidates, which one is 'better'.
+ * Depending on how the comparison function is defined, the best candidate
+ * (where, of course, best is defined with respect to the heuristics
+ * implemented in the comparison function itself, libxl__numa_candidate_cmpf())
+ * could become the first or the last element of the list.
+ *
+ * Summarizing, achieving automatic NUMA placement is just a matter of
+ * obtaining the list of suitable placement candidates, perhaps asking for each
+ * of them to provide at least the amount of memory the domain needs. After
+ * that, just implement a comparison function by means of the various helpers
+ * retrieving the relevant information about the candidates themselves.
+ * Finally, call the sorting helper function and use the candidate that became
+ * (typically) the first element of the list for determining the domain's
+ * affinity.
+ */
+
+typedef struct {
+    int nr_cpus, nr_nodes;
+    int nr_domains;
+    uint32_t free_memkb;
+    libxl_bitmap nodemap;
+} libxl__numa_candidate;
+
+/*
+ * This generates the list of NUMA placement candidates satisfying some
+ * specific conditions. If min_nodes and/or max_nodes are not 0, their value is
+ * used to determine the minimum and maximum number of nodes that are allowed
+ * to be present in each candidate. If min_nodes and/or max_nodes are 0, the
+ * minimum and maximum number of nodes to be used are automatically selected by
+ * the implementation (and that will likely be just 1 node for the minimum and
+ * the total number of existing nodes for the maximum). Re min_free_memkb and
+ * min_cpus, if not 0, it means the caller only wants candidates with at
+ * least that amount of free memory and that number of cpus, respectively.
If + * min_free_memkb and/or min_cpus are 0, the candidates' free memory and number + * of cpus won't be checked at all, which means a candidate will always be + * considered suitable wrt the specific constraint. cndts is where the list of + * exactly nr_cndts candidates is returned. Note that, in case no candidates + * are found at all, the function returns successfully, but with nr_cndts equal + * to zero. + */ +_hidden int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts); + +/* Initialization, allocation and deallocation for placement candidates */ +static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt) +{ + cndt->free_memkb = 0; + cndt->nr_cpus = cndt->nr_nodes = cndt->nr_domains = 0; + libxl_bitmap_init(&cndt->nodemap); +} + +static inline int libxl__numa_candidate_alloc(libxl__gc *gc, + libxl__numa_candidate *cndt) +{ + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0); +} +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) +{ + libxl_bitmap_dispose(&cndt->nodemap); +} +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts, + int nr_cndts) +{ + int i; + + for (i = 0; i < nr_cndts; i++) + libxl__numa_candidate_dispose(&cndts[i]); + free(cndts); +} + +/* Retrieve (in nodemap) the node map associated to placement candidate cndt */ +static inline +void libxl__numa_candidate_get_nodemap(libxl__gc *gc, + const libxl__numa_candidate *cndt, + libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); +} +/* Set the node map of placement candidate cndt to match nodemap */ +static inline +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, + libxl__numa_candidate *cndt, + const libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); +} + +/* Signature for the comparison function between two candidates c1 and c2 */ +typedef int 
(*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
+/* Sort the list of candidates in cndts (an array with nr_cndts elements in
+ * it) using cmpf for comparing two candidates. Uses libc's qsort(). */
+_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
+                                         int nr_cndts,
+                                         libxl__numa_candidate_cmpf cmpf);

 /*
  * Inserts "elm_new" into the sorted list "head".

diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
new file mode 100644
--- /dev/null
+++ b/tools/libxl/libxl_numa.c
@@ -0,0 +1,377 @@
+/*
+ * Copyright (C) 2012 Citrix Ltd.
+ * Author Dario Faggioli <dario.faggioli@citrix.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include <glob.h>
+
+#include "libxl_internal.h"
+
+/*
+ * What follows are helpers for generating all the k-combinations
+ * without repetitions of a set S with n elements in it. Formally
+ * speaking, they are subsets of k distinct elements of S and, if
+ * S is n elements big, the number of k-combinations is equal to
+ * the binomial coefficient (n k), which means we get exactly
+ * n!/(k! * (n - k)!) subsets, all of them with k elements.
+ *
+ * The various subsets are generated one after the other by calling
+ * comb_init() first, and, after that, comb_next()
+ * (n k)-1 times. An iterator is used to store the current status
+ * of the whole generation operation (i.e., basically, the last
+ * combination that has been generated).
As soon as all
+ * combinations have been generated, comb_next() will
+ * start returning 0 instead of 1. It is of course important that
+ * the same instance of the iterator and the same values for
+ * n and k are used for each call. If that doesn't happen, the
+ * result is unspecified.
+ *
+ * The algorithm is a well known one (see, for example, D. Knuth's "The
+ * Art of Computer Programming - Volume 4, Fascicle 3") and it produces
+ * the combinations in such a way that they (well, more precisely,
+ * their indexes in the array/map representing the set) come with
+ * lexicographic ordering.
+ *
+ * For example, with n = 5 and k = 3, calling comb_init()
+ * will generate { 0, 1, 2 }, while subsequent valid calls to
+ * comb_next() will produce the following:
+ * { { 0, 1, 3 }, { 0, 1, 4 },
+ *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
+ *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
+ *   { 2, 3, 4 } }
+ *
+ * This is used by the automatic NUMA placement logic below.
+ */
+typedef int* comb_iter_t;
+
+static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
+{
+    comb_iter_t new_iter;
+    int i;
+
+    if (n < k)
+        return 0;
+
+    /* First set is always { 0, 1, 2, ..., k-1 } */
+    GCNEW_ARRAY(new_iter, k);
+    for (i = 0; i < k; i++)
+        new_iter[i] = i;
+
+    *it = new_iter;
+    return 1;
+}
+
+static int comb_next(comb_iter_t it, int n, int k)
+{
+    int i;
+
+    /*
+     * The idea here is to find the leftmost element from where
+     * we should start incrementing the indexes of the iterator.
+     * This means looking for the highest index that can be increased
+     * while still producing a value smaller than n-1. In the example
+     * above, when dealing with { 0, 1, 4 }, such an element is the
+     * second one, as the third is already equal to 4 (which actually
+     * is n-1).
+     * Once we have found where to start, we increment that element
+     * and override the right-hand rest of the iterator with its
+     * successors, thus achieving lexicographic ordering.
+     *
+     * Regarding the termination of the generation process, when we
+     * manage to bring n-k to the very first position of the iterator,
+     * we know that it is the last valid combination ( { 2, 3, 4 }, with
+     * n - k = 5 - 3 = 2, in the example above), and thus we start
+     * returning 0 as soon as we cross that border.
+     */
+    for (i = k - 1; it[i] == n - k + i; i--) {
+        if (i <= 0)
+            return 0;
+    }
+    for (it[i]++, i++; i < k; i++)
+        it[i] = it[i - 1] + 1;
+    return 1;
+}
+
+/* NUMA automatic placement (see libxl_internal.h for details) */
+
+/*
+ * This function turns a k-combination iterator into a node map.
+ * This means the bits in the node map corresponding to the indexes
+ * of the given combination are the ones that will be set.
+ * For example, if the iterator represents the combination { 0, 2, 4 },
+ * the node map will have bits #0, #2 and #4 set.
+ */
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+{
+    int i;
+
+    libxl_bitmap_set_none(nodemap);
+    for (i = 0; i < k; i++)
+        libxl_bitmap_set(nodemap, it[i]);
+}
+
+/* Retrieve the number of cpus that the nodes that are part of the nodemap
+ * span. */
+static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                              const libxl_bitmap *nodemap)
+{
+    int i, nodes_cpus = 0;
+
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            nodes_cpus++;
+    }
+    return nodes_cpus;
+}
+
+/* Retrieve the amount of free memory within the nodemap */
+static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
+                                      libxl_bitmap *nodemap)
+{
+    uint32_t free_memkb = 0;
+    int i;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        free_memkb += ninfo[i].free / 1024;
+
+    return free_memkb;
+}
+
+/* Retrieve the number of domains that can potentially run on the cpus of
*/ +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, + const libxl_bitmap *nodemap) +{ + libxl_dominfo *dinfo = NULL; + libxl_bitmap dom_nodemap; + int nr_doms, nr_cpus; + int nr_domains = 0; + int i, j, k; + + dinfo = libxl_list_domain(CTX, &nr_doms); + if (dinfo == NULL) + return ERROR_FAIL; + + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { + libxl_dominfo_list_free(dinfo, nr_doms); + return ERROR_FAIL; + } + + for (i = 0; i < nr_doms; i++) { + libxl_vcpuinfo *vinfo; + int nr_vcpus; + + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); + if (vinfo == NULL) + continue; + + libxl_bitmap_set_none(&dom_nodemap); + for (j = 0; j < nr_vcpus; j++) { + libxl_for_each_set_bit(k, vinfo[j].cpumap) + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); + } + + libxl_for_each_set_bit(j, dom_nodemap) { + if (libxl_bitmap_test(nodemap, j)) { + nr_domains++; + break; + } + } + + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); + } + + libxl_bitmap_dispose(&dom_nodemap); + libxl_dominfo_list_free(dinfo, nr_doms); + return nr_domains; +} + +/* + * This function tries to figure out if the host has a consistent number + * of cpus along all its NUMA nodes. In fact, if that is the case, we can + * calculate the minimum number of nodes needed for a domain by just + * dividing its total number of vcpus by this value computed here. + * However, we are not allowed to assume that all the nodes have the + * same number of cpus. Therefore, in case discrepancies among different + * nodes are found, this function just returns 0, for the caller to know + * it shouldn't rely on this 'optimization', and sort out things in some + * other way (most likely, just start trying with candidates with just + * one node). 
+ */
+static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
+                               libxl_numainfo *ninfo, int nr_nodes)
+{
+    int cpus_per_node = 0;
+    int j, i;
+
+    /* This makes sense iff # of PCPUs is the same for all nodes */
+    for (j = 0; j < nr_nodes; j++) {
+        int curr_cpus = 0;
+
+        for (i = 0; i < nr_cpus; i++) {
+            if (tinfo[i].node == j)
+                curr_cpus++;
+        }
+        /* So, if the above does not hold, turn the whole thing off! */
+        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
+        if (cpus_per_node != curr_cpus)
+            return 0;
+    }
+    return cpus_per_node;
+}
+
+/* Get all the placement candidates satisfying some specific conditions */
+int libxl__get_numa_candidates(libxl__gc *gc,
+                               uint32_t min_free_memkb, int min_cpus,
+                               int min_nodes, int max_nodes,
+                               libxl__numa_candidate *cndts[], int *nr_cndts)
+{
+    libxl__numa_candidate *new_cndts = NULL;
+    libxl_cputopology *tinfo = NULL;
+    libxl_numainfo *ninfo = NULL;
+    int nr_nodes = 0, nr_cpus = 0;
+    libxl_bitmap nodemap;
+    int array_size, rc;
+
+    libxl_bitmap_init(&nodemap);
+
+    /* Get platform info and prepare the map for testing the combinations */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+
+    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0);
+    if (rc)
+        goto out;
+
+    /*
+     * If the minimum number of NUMA nodes is not explicitly specified
+     * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
+     * from where to start generating candidates, if possible (or just start
+     * from 1 otherwise). The maximum number of nodes should not exceed the
+     * number of existent NUMA nodes on the host, or the candidate generation
+     * won't work properly.
+     */
+    if (!min_nodes) {
+        int cpus_per_node;
+
+        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
+        if (cpus_per_node == 0)
+            min_nodes = 1;
+        else
+            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
+    }
+    if (min_nodes > nr_nodes)
+        min_nodes = nr_nodes;
+    if (!max_nodes || max_nodes > nr_nodes)
+        max_nodes = nr_nodes;
+    if (min_nodes > max_nodes) {
+        rc = ERROR_INVAL;
+        goto out;
+    }
+
+    /* Initialize the local storage for the combinations */
+    *nr_cndts = 0;
+    array_size = nr_nodes;
+    GCNEW_ARRAY(new_cndts, array_size);
+
+    /* Generate all the combinations of any size from min_nodes to
+     * max_nodes (see comb_init() and comb_next()). */
+    while (min_nodes <= max_nodes) {
+        comb_iter_t comb_iter;
+        int comb_ok;
+
+        /*
+         * And here it is. Each step of this cycle generates a combination of
+         * nodes as big as min_nodes mandates. Each of these combinations is
+         * checked against the constraints provided by the caller (namely,
+         * amount of free memory and number of cpus) and it becomes an actual
+         * placement candidate iff it passes the check.
+         */
+        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
+             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+            uint32_t nodes_free_memkb;
+            int nodes_cpus;
+
+            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+
+            /* If there is not enough memory in this combination, skip it
+             * and go generating the next one... */
+            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
+            if (min_free_memkb && nodes_free_memkb < min_free_memkb)
+                continue;
+
+            /* And the same applies if this combination is short in cpus */
+            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
+            if (min_cpus && nodes_cpus < min_cpus)
+                continue;
+
+            /*
+             * Conditions are met, we can add this combination to the
+             * NUMA placement candidates list. We first make sure there
+             * is enough space in there, and then we initialize the new
+             * candidate element with the node map corresponding to the
+             * combination we are dealing with. Memory allocation for
+             * expanding the array that hosts the list happens in chunks
+             * equal to the number of NUMA nodes in the system (to
+             * avoid allocating memory each and every time we find a
+             * new candidate).
+             */
+            if (*nr_cndts == array_size)
+                array_size += nr_nodes;
+            GCREALLOC_ARRAY(new_cndts, array_size);
+
+            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
+            libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
+                                              &nodemap);
+            new_cndts[*nr_cndts].nr_domains =
+                nodemap_to_nr_domains(gc, tinfo, &nodemap);
+            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
+            new_cndts[*nr_cndts].nr_nodes = min_nodes;
+            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
+
+            LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
+                "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
+                min_nodes, new_cndts[*nr_cndts].nr_cpus,
+                new_cndts[*nr_cndts].free_memkb / 1024);
+
+            (*nr_cndts)++;
+        }
+        min_nodes++;
+    }
+
+    *cndts = new_cndts;
+ out:
+    libxl_bitmap_dispose(&nodemap);
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    libxl_numainfo_list_free(ninfo, nr_nodes);
+    return rc;
+}
+
+void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
+                                 libxl__numa_candidate_cmpf cmpf)
+{
+    /* Reorder candidates (see the comparison function for
+     * the details on the heuristics) */
+    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
+}
+

^ permalink raw reply	[flat|nested] 60+ messages in thread
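To make the shape of this generate-and-filter search easy to play with outside libxl, here is a minimal Python model of the same loop. The node data, the function name and the dict-based candidates are made up for illustration; the real code iterates with comb_init()/comb_next() over libxl bitmaps and filters on nodemap_to_free_memkb()/nodemap_to_nr_cpus().

```python
from itertools import combinations

def numa_candidates(free_memkb, cpus_per_node, min_free_memkb=0, min_cpus=0,
                    min_nodes=1, max_nodes=None):
    """Enumerate every combination of host nodes, smallest sizes first,
    and keep those with enough free memory and enough PCPUs."""
    nr_nodes = len(free_memkb)
    if max_nodes is None:
        max_nodes = nr_nodes
    cndts = []
    for size in range(min_nodes, max_nodes + 1):
        for nodes in combinations(range(nr_nodes), size):
            memkb = sum(free_memkb[n] for n in nodes)
            cpus = sum(cpus_per_node[n] for n in nodes)
            if min_free_memkb and memkb < min_free_memkb:
                continue   # not enough free memory: skip this combination
            if min_cpus and cpus < min_cpus:
                continue   # short in cpus: skip it as well
            cndts.append({'nodes': nodes, 'free_memkb': memkb,
                          'nr_cpus': cpus})
    return cndts

# 4-node host where node 3 is nearly full (values are KiB): a guest
# needing 4 GiB and 2 vcpus gets every single node except node 3 as a
# 1-node candidate, plus all the larger combinations.
cands = numa_candidates([8 << 20, 8 << 20, 8 << 20, 1 << 20],
                        [4, 4, 4, 4],
                        min_free_memkb=4 << 20, min_cpus=2)
```

Note how the cost is dominated by the inner `combinations()` enumeration, which is exactly the complexity concern raised in the review below.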
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli @ 2012-07-17 18:04 ` Ian Jackson 2012-07-17 20:23 ` Ian Campbell 2012-07-18 0:22 ` Dario Faggioli 2012-07-19 12:21 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Andre Przywara 1 sibling, 2 replies; 60+ messages in thread From: Ian Jackson @ 2012-07-17 18:04 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel Dario Faggioli writes ("[PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > If a domain does not have a VCPU affinity, try to pin it > automatically to some PCPUs. This is done taking into account the > NUMA characteristics of the host. In fact, we look for a > combination of host's NUMA nodes with enough free memory and number > of PCPUs for the new domain, and pin it to the VCPUs of those nodes. Ian Jackson writes ("Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): ... > This part of the algorithm is quadratic in the number of combinations > divided by the number of nodes. So the algorithm is > O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) > which is quite bad really. > > At the very least this needs to be an exponential allocation, eg > + array_size += nr_nodes + array_size; > > But if you didn't insist on creating the whole list and sorting it, > you would avoid this allocation entirely, wouldn't you ? > > Should we bve worried that this algorithm will be too slow even if it > involves just > O( C(nr_nodes,min_nodes) ) > iterations ? I've been told that boxes with 16 NUMA nodes are coming out about now. This algorithm will start to have unreasonable performance with anything bigger than that. 
Eg choosing 16 NUMA nodes out of 32 would involve searching 601,080,390 combinations. 32 out of 64 gives 10^18. So I think we need a different approach :-(. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
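The figures quoted here are easy to verify (a quick sanity check using Python's math.comb, nothing libxl-specific):

```python
from math import comb

# Ways to pick the guest's nodes out of the host's nodes.
print(comb(32, 16))   # 601080390, as quoted above
print(comb(64, 32))   # ~1.8 * 10^18
```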
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-17 18:04 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Ian Jackson @ 2012-07-17 20:23 ` Ian Campbell 2012-07-18 0:31 ` Dario Faggioli 2012-07-18 10:44 ` Ian Jackson 2012-07-18 0:22 ` Dario Faggioli 1 sibling, 2 replies; 60+ messages in thread From: Ian Campbell @ 2012-07-17 20:23 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On Tue, 2012-07-17 at 19:04 +0100, Ian Jackson wrote: > Dario Faggioli writes ("[PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > > If a domain does not have a VCPU affinity, try to pin it > > automatically to some PCPUs. This is done taking into account the > > NUMA characteristics of the host. In fact, we look for a > > combination of host's NUMA nodes with enough free memory and number > > of PCPUs for the new domain, and pin it to the VCPUs of those nodes. > > Ian Jackson writes ("Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > ... > > This part of the algorithm is quadratic in the number of combinations > > divided by the number of nodes. So the algorithm is > > O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) > > which is quite bad really. > > > > At the very least this needs to be an exponential allocation, eg > > + array_size += nr_nodes + array_size; > > > > But if you didn't insist on creating the whole list and sorting it, > > you would avoid this allocation entirely, wouldn't you ? > > > > Should we bve worried that this algorithm will be too slow even if it > > involves just > > O( C(nr_nodes,min_nodes) ) > > iterations ? > > I've been told that boxes with 16 NUMA nodes are coming out about > now. 
> > This algorithm will start to have unreasonable performance with > anything bigger than that. Eg choosing 16 NUMA nodes out of 32 would > involve searching 601,080,390 combinations. 32 out of 64 gives 10^18. > > So I think we need a different approach :-(. If 16 are coming out now then 32 nodes are at least a little in the future, and 16 won't be common just yet. The 4.2 freeze is starting to drag on and reworking this significantly at this stage seems likely to take quite a while -- can we perhaps go with the existing patches (with the known scalability limitation and workaround of setting the explicit cpu affinity documented in the release notes) and revisit for 4.3 and/or 4.2.1? Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-17 20:23 ` Ian Campbell @ 2012-07-18 0:31 ` Dario Faggioli 2012-07-18 10:44 ` Ian Jackson 1 sibling, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 0:31 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 2250 bytes --] On Tue, 2012-07-17 at 21:23 +0100, Ian Campbell wrote: > > I've been told that boxes with 16 NUMA nodes are coming out about > > now. > > > > This algorithm will start to have unreasonable performance with > > anything bigger than that. Eg choosing 16 NUMA nodes out of 32 would > > involve searching 601,080,390 combinations. 32 out of 64 gives 10^18. > > > > So I think we need a different approach :-(. > > If 16 are coming out now then 32 nodes are at least a little in the > future, and 16 won't be common just yet. > FWIW, I agree. Consider that 16 node systems exist right now only because AMD has an architecture where they're putting 2 nodes on one socket, which is something pretty special that might require some special thinking and consideration already. The point is, assuming future 16, 32 and 64 node systems will just be like the machines we see now, only bigger, is not something that would let me sleep at night... :-D > The 4.2 freeze is starting to drag on and reworking this significantly > at this stage seems likely to take quite a while -- can we perhaps go > with the existing patches (with the known scalability limitation and > workaround of setting the explicit cpu affinity documented in the > release notes) and revisit for 4.3 and/or 4.2.1? > That would be fine by me. I already said in the other e-mail how I think we can and should address the potential scalability issue, at whatever time you think it's best.
I'm fine with putting a limit on the maximum number of nodes a guest should span (and that being more a feature than a workaround, which can definitely go in the release notes) wherever you like. If it's just a matter of putting a LIBXL__MAX_GUEST_NODES (or whatever) macro in libxl_internal.h and enforcing that, I'm fine doing it right now, as I'm reposting these last three patches either way. We just need to agree on a value to write beside it. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-17 20:23 ` Ian Campbell 2012-07-18 0:31 ` Dario Faggioli @ 2012-07-18 10:44 ` Ian Jackson 1 sibling, 0 replies; 60+ messages in thread From: Ian Jackson @ 2012-07-18 10:44 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli Ian Campbell writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): > The 4.2 freeze is starting to drag on and reworking this significantly > at this stage seems likely to take quite a while Yes. > -- can we perhaps go > with the existing patches (with the known scalability limitation and > workaround of setting the explicit cpu affinity documented in the > release notes) and revisit for 4.3 and/or 4.2.1? I think the existing behaviour of these patches is unreleasable. In a couple of years' time people will try to run Xen 4.2 on larger machines and the result will be a hang (perhaps of their entire management stack) when they try to create a large guest. At the very least it should have a safety catch so that it disables itself, or something. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-17 18:04 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Ian Jackson 2012-07-17 20:23 ` Ian Campbell @ 2012-07-18 0:22 ` Dario Faggioli 2012-07-18 8:27 ` Dario Faggioli ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 0:22 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 3856 bytes --] On Tue, 2012-07-17 at 19:04 +0100, Ian Jackson wrote: > I've been told that boxes with 16 NUMA nodes are coming out about > now. > > This algorithm will start to have unreasonable performance with > anything bigger than that. Eg choosing 16 NUMA nodes out of 32 would > involve searching 601,080,390 combinations. 32 out of 64 gives 10^18. > Ok, when designing this thing, I knew it would turn out to be the most computationally complex part of the whole NUMA thing (hopefully!). That was on purpose, mainly because this is the *first* (from the guest's lifetime POV) of the various other NUMA-aware heuristics that will come, and the only one that will run in userspace, far from critical paths. That's why, when the suggestion to consider "all the possible combination of ..." came, during one round of the review, I liked it very much and went for it. _That_all_being_said_, the algorithm takes some input parameters, the most important of which is probably max_nodes. It should contain the maximum number of nodes one wants his guest to span. I worked really hard for it not to be just hardcoded to 1, as that would mean we were giving up trying to improve performance for guests not fitting in just one node. However, it is not written in stone that it must be free to range from 1 to #_of_nodes.
That is surely possible with current NUMA systems, and here are the numbers of combinations we get for 4, 8 and 16 nodes (n), respectively:

$ echo "combs=0; n=4; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 15
$ echo "combs=0; n=8; for k=1:8 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 255
$ echo "combs=0; n=16; for k=1:16 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 65535

However, it might well be decided that it is sane to limit max_nodes to, say, 4, which would mean we'll be trying our best to boost the performance of all guests that fit within 4 nodes (which is a lot better than just one!), which would result in these numbers, for nodes (n) equal to 4, 8, 16, 32 and 64:

$ echo "combs=0; n=4; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 15
$ echo "combs=0; n=8; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 162
$ echo "combs=0; n=16; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 2516
$ echo "combs=0; n=32; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 41448
$ echo "combs=0; n=64; for k=1:4 combs=combs+bincoeff(n,k); endfor; combs" | octave -q
combs = 679120

Moreover, as another input parameter is a nodemask, telling the algorithm not only how many but also _which_ nodes it should consider, one could imagine to have some multi-tier form of the algorithm itself, or, perhaps, implementing a more abstract and coarse partitioning heuristics among group of nodes (which will better be done as soon as we'll see how 32 and 64 nodes system will actually look like :-D) and then run the current algorithm only on these groups, with numbers that would then look just like above. So, to summarize, I can't be farther than saying the current algorithm is perfect or fast.
Rather, I'm definitely saying it has the advantage of not branching out the problem space too much aggressively at this early stage, and it is flexible enough for avoiding getting unreasonable performances, or at least, it looks like that to me, I guess. :-) Thoughts? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
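The octave one-liners in the message above condense into a few lines of Python, which confirm both series of figures and make the contrast explicit: unbounded max_nodes gives 2^n - 1 candidates, while capping at 4 keeps growth polynomial (this is just a check of the arithmetic, not libxl code):

```python
from math import comb

def n_combs(n, kmax):
    """Candidates examined: all combinations of 1..kmax nodes out of n."""
    return sum(comb(n, k) for k in range(1, kmax + 1))

# max_nodes unbounded (kmax == n): count is 2^n - 1.
print([n_combs(n, n) for n in (4, 8, 16)])
# [15, 255, 65535]

# max_nodes capped at 4: O(n^4), tractable even for 64 nodes.
print([n_combs(n, 4) for n in (4, 8, 16, 32, 64)])
# [15, 162, 2516, 41448, 679120]
```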
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 0:22 ` Dario Faggioli @ 2012-07-18 8:27 ` Dario Faggioli 2012-07-18 9:13 ` Ian Campbell 2012-07-18 9:47 ` Dario Faggioli 2 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 8:27 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel On Wed, 2012-07-18 at 02:22 +0200, Dario Faggioli wrote: > Moreover, as another input parameter is a nodemask, telling the > algorithm not only how many but also _which_ nodes it should consider, > one could imagine to have some multi-tier form of the algorithm itself, > or, perhaps, implementing a more abstract and coarse partitioning > heuristics among group of nodes (which will better be done as soon as > we'll see how 32 and 64 nodes system will actually look like :-D) and > then run the current algorithm only on these groups, with numbers that > would then look just like above. > Alternatively (or maybe not even that much 'alternatively' :-)), we can try turning it into something a bit more greedy, e.g., searching for available placement solutions just as big as min_nodes says (typically 1) as a first step (complexity linear with the number of nodes). After that, if we did not find anything, we could have some sort of config switch telling us if we should continue, and maybe investigate candidates of sizes in the range [min_nodes+1,LIBXL__MAX_GUEST_NODES]. At which point, if we haven't found anything yet, we could just give up or, perhaps, think about and implement, in the near future, a coarser and less precise mechanism for proceeding further. Implementing something like that on top of what I have now (which I'll resend asap, as soon as I've addressed the minor comments I got) during 4.3 should be pretty easy.
Also, if we decide to just statically limit max_nodes<=LIBXL__MAX_GUEST_NODES (or whatever we could call it) to something (I was proposing 4 in the other e-mail) for 4.2, backporting to it will be a piece of cake as well. :-) Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) ^ permalink raw reply [flat|nested] 60+ messages in thread
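The escalation scheme described here (try candidates as small as min_nodes first, widen only if allowed, then give up or fall back) could look roughly like the following Python sketch. MAX_GUEST_NODES stands in for the proposed LIBXL__MAX_GUEST_NODES, and the memory-only fitness test is a deliberate simplification of libxl's memory-plus-cpus check:

```python
from itertools import combinations

MAX_GUEST_NODES = 4   # hypothetical stand-in for LIBXL__MAX_GUEST_NODES

def fits(nodes, free_memkb, need_memkb):
    return sum(free_memkb[n] for n in nodes) >= need_memkb

def place(free_memkb, need_memkb, min_nodes=1, may_widen=True):
    """Greedy escalation: try candidates of exactly min_nodes first
    (linear in nr_nodes when min_nodes == 1); widen only if allowed."""
    nr_nodes = len(free_memkb)
    hi = min(MAX_GUEST_NODES, nr_nodes) if may_widen else min_nodes
    for size in range(min_nodes, hi + 1):
        sols = [c for c in combinations(range(nr_nodes), size)
                if fits(c, free_memkb, need_memkb)]
        if sols:
            # among equal-sized candidates, prefer the most free memory,
            # as in the heuristics of the series
            return max(sols, key=lambda c: sum(free_memkb[n] for n in c))
    return None   # give up (or hand over to a coarser mechanism)

# KiB values: a 2 GiB guest fits in node 1 alone; an 8 GiB guest
# forces the search to widen to two nodes.
small = place([3 << 20, 6 << 20, 2 << 20], 2 << 20)
big = place([3 << 20, 6 << 20, 2 << 20], 8 << 20)
```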
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 0:22 ` Dario Faggioli 2012-07-18 8:27 ` Dario Faggioli @ 2012-07-18 9:13 ` Ian Campbell 2012-07-18 9:43 ` Dario Faggioli 2012-07-18 9:47 ` Dario Faggioli 2 siblings, 1 reply; 60+ messages in thread From: Ian Campbell @ 2012-07-18 9:13 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel On Wed, 2012-07-18 at 01:22 +0100, Dario Faggioli wrote: > So, to summarize, I can't be farther than saying the current algorithm > is perfect or fast. Rather, I'm definitely saying it has the advantage > of not branching out the problem space too much aggressively at this > early stage, and it is flexible enough for avoiding getting unreasonable > performances, or at least, it looks like that to me, I guess. :-) I'm not quite following what you are trying to say with all this. Are you saying that the code as most recently submitted does not have the O( C(nr_nodes,min_nodes) ) property which Ian J suggests, which leads to needing to process hundreds of millions of combinations for a 16 / 32 node guest? Or are you saying that it does have this property but it doesn't matter? Or are you saying it does have this property but you intend to ameliorate it somehow? If the latter then how and for which release(s) (4.2, 4.3 or 4.2.1)? Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 9:13 ` Ian Campbell @ 2012-07-18 9:43 ` Dario Faggioli 2012-07-18 9:53 ` Ian Campbell 2012-07-18 10:53 ` Ian Jackson 0 siblings, 2 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 9:43 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 2651 bytes --] On Wed, 2012-07-18 at 10:13 +0100, Ian Campbell wrote: > On Wed, 2012-07-18 at 01:22 +0100, Dario Faggioli wrote: > > > So, to summarize, I can't be farther than saying the current algorithm > > is perfect or fast. Rather, I'm definitely saying it has the advantage > > of not branching out the problem space too much aggressively at this > > early stage, and it is flexible enough for avoiding getting unreasonable > > performances, or at least, it looks like that to me, I guess. :-) > > I'm not quite following what you are trying to say with all this. > Oh, I see, sorry. :-( > Are you saying that the code as most recently submitted does not have > the O( C(nr_nodes,min_nodes) ) property which Ian J suggests, which > leads to needing to process hundreds of millions of combinations for a > 16 / 32 node guest? > No, I'm not. If you want to place a 16 (32) node guest on a 32 (64) node host, it is as bad as he says. However, I'm also saying that on <=16 node hosts it is able to provide very good placement at a reasonable cost and that, if we limit the number of guest nodes (e.g., up to 4), it is able to do the same even on 32 or 64 node hosts. > Or are you saying that it does have this property > but it doesn't matter? > A lot of people think that trying to address the problem of placing guests that are bigger than (or anyway, guests that do not fit in) just 1 node should not even be considered.
I'm saying that it should, and am also proposing an algorithm that is able to provide good placement performance for guests that span up to 4 nodes, even on future 32 or 64 node systems. So basically I'm saying that it does have that property, but that it is bringing benefits right now, and we have all the knobs to limit its potential nasty effects on future (bigger) systems at any time. > Or are you saying it does have this property but > you intend to ameliorate it somehow? If the latter then how and for > which release(s) (4.2, 4.3 or 4.2.1)? > Yes, I guess I'm sort of saying something like that. What could be done is restricting automatic placement to guests that fits on 4 or 8 nodes for 4.2. Then, for 4.3 (and with a very small backport effort to 4.2.1), implement a lighter solution for dealing with the ones that do not. Hope I clarified at least a bit. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 9:43 ` Dario Faggioli @ 2012-07-18 9:53 ` Ian Campbell 2012-07-18 10:08 ` Dario Faggioli 2012-07-18 11:00 ` Ian Jackson 2012-07-18 10:53 ` Ian Jackson 1 sibling, 2 replies; 60+ messages in thread From: Ian Campbell @ 2012-07-18 9:53 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel On Wed, 2012-07-18 at 10:43 +0100, Dario Faggioli wrote: > What could be done is restricting automatic placement to guests that > fits on 4 or 8 nodes for 4.2. 8 would mean on a 32 node system considering 10,518,300 combinations? 4 would mean on a 32 node system considering 35,960 combinations? On a 64 node system it would mean 635,376 combinations. If that's the case then lets go with 4 as the limit for 4.2.0. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 9:53 ` Ian Campbell @ 2012-07-18 10:08 ` Dario Faggioli 2012-07-18 11:00 ` Ian Jackson 0 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 10:08 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1073 bytes --] On Wed, 2012-07-18 at 10:53 +0100, Ian Campbell wrote: > On Wed, 2012-07-18 at 10:43 +0100, Dario Faggioli wrote: > > What could be done is restricting automatic placement to guests that > > fits on 4 or 8 nodes for 4.2. > > 8 would mean on a 32 node system considering 10,518,300 combinations? > Yes, I was considering that only because I think we'll have something ready well before 32 node systems become common. But fine, let's not bet on what hardware manufacturers will come up with and in what timeframe... :-) > 4 would mean on a 32 node system considering 35,960 combinations? On a > 64 node system it would mean 635,376 combinations. > Exactly. > If that's the case then lets go with 4 as the limit for 4.2.0. > That is fine with me. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 9:53 ` Ian Campbell 2012-07-18 10:08 ` Dario Faggioli @ 2012-07-18 11:00 ` Ian Jackson 2012-07-18 13:14 ` Ian Campbell 2012-07-18 13:40 ` Andre Przywara 1 sibling, 2 replies; 60+ messages in thread From: Ian Jackson @ 2012-07-18 11:00 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli Ian Campbell writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): > On Wed, 2012-07-18 at 10:43 +0100, Dario Faggioli wrote: > > What could be done is restricting automatic placement to guests that > > fits on 4 or 8 nodes for 4.2. > > 8 would mean on a 32 node system considering 10,518,300 combinations? > > 4 would mean on a 32 node system considering 35,960 combinations? On a > 64 node system it would mean 635,376 combinations. > > If that's the case then lets go with 4 as the limit for 4.2.0. What is the maximum number of NUMA nodes we might expect to see on a single system in the next five years? I would argue that 32 is too optimistic. 128 or 256 seem like more reasonable upper bounds. 4 of 256 is 174,792,640 combinations - ie too many. So there needs to be a limit on the host size too. But if you pick an upper host size limit of 64 then in a hypothetical 128-node system you'd fail to do the trivial search for a 1-node guest. An upper bound on log(number_of_combinations) is log(number_of_nodes_on_the_host) * number_of_nodes_for_the_guest. This fact could be used to determine more accurately whether the algorithm is going to terminate in a reasonable time. Also when this algorithm would be used, but would take too long, we should print a warning which tells the user they should use cpupools to assign nodes directly. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
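The bound Ian describes follows from C(n,k) <= n^k, so log C(n,k) <= k * log n. A sketch of the kind of cheap up-front feasibility check this enables; the step budget is an arbitrary example for illustration, not anything libxl defines:

```python
from math import comb, log

MAX_COMBS = 1 << 18   # arbitrary example budget, not a libxl constant

def search_is_feasible(host_nodes, guest_nodes):
    """Upper-bound test: k*log(n) >= log(C(n, k)), so if the bound is
    under budget, the exact combination count certainly is too."""
    return guest_nodes * log(host_nodes) <= log(MAX_COMBS)

# The check is conservative but catches the quoted case: C(256, 4) is
# 174,792,640, far over budget, and the estimate agrees.
print(comb(256, 4))
print(search_is_feasible(256, 4), search_is_feasible(64, 2))
```

Being an upper bound, the test can reject searches that would actually have fit the budget, but it never approves one that would not.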
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 11:00 ` Ian Jackson @ 2012-07-18 13:14 ` Ian Campbell 2012-07-18 13:35 ` Dario Faggioli 2012-07-19 12:47 ` Dario Faggioli 2012-07-18 13:40 ` Andre Przywara 1 sibling, 2 replies; 60+ messages in thread From: Ian Campbell @ 2012-07-18 13:14 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On Wed, 2012-07-18 at 12:00 +0100, Ian Jackson wrote: > An upper bound on log(number_of_combinations) is > log(number_of_nodes_on_the_host) * number_of_nodes_for_the_guest. > This fact could be used to determine more accurately whether the > algorithm is going to terminate in a reasonable time. This seems like something which could be done for 4.2.0 and is better than the more static limit and better than no NUMA at all. Dario can you do something for this soon, as in this week or not later than next? > Also when this algorithm would be used, but would take too long, we > should print a warning which tells the user they should use cpupools > to assign nodes directly. Agreed. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 13:14 ` Ian Campbell @ 2012-07-18 13:35 ` Dario Faggioli 2012-07-19 12:47 ` Dario Faggioli 1 sibling, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 13:35 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1043 bytes --] On Wed, 2012-07-18 at 14:14 +0100, Ian Campbell wrote: > On Wed, 2012-07-18 at 12:00 +0100, Ian Jackson wrote: > > An upper bound on log(number_of_combinations) is > > log(number_of_nodes_on_the_host) * number_of_nodes_for_the_guest. > > This fact could be used to determine more accurately whether the > > algorithm is going to terminate in a reasonable time. > > This seems like something which could be done for 4.2.0 and is better > than the more static limit and better than no NUMA at all. > I think that too. > Dario can you > do something for this soon, as in this week or not later than next? > I believe it's feasible for me to release a new version with something like the above in by the end of this week. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 13:14 ` Ian Campbell 2012-07-18 13:35 ` Dario Faggioli @ 2012-07-19 12:47 ` Dario Faggioli 1 sibling, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-19 12:47 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 3645 bytes --] On Wed, 2012-07-18 at 14:14 +0100, Ian Campbell wrote: > On Wed, 2012-07-18 at 12:00 +0100, Ian Jackson wrote: > > An upper bound on log(number_of_combinations) is > > log(number_of_nodes_on_the_host) * number_of_nodes_for_the_guest. > > This fact could be used to determine more accurately whether the > > algorithm is going to terminate in a reasonable time. > > This seems like something which could be done for 4.2.0 and is better > than the more static limit and better than no NUMA at all. Dario can you > do something for this soon, as in this week or not later than next? > Ok, I think I can do something like the below. Remember there is an external cycle (the while{}) with i:=[min_nodes...max_nodes] (with, typically, min_nodes=1 and max_nodes=nr_nodes). Nested in it, there is another cycle (the for{}), performing exactly C(nr_nodes, i) steps. So: 1) I'll kill the need to store the full list of combinations and I'll run the comparisons on-line, so that I always have the best placement candidate so far, and I can return it anytime; 2) at the beginning of each step of the for{} (the inner cycle) I check if the overall number of steps exceeds a given threshold (I was thinking of 2^18=262144) and if it does I just stop, no matter whether or not I have a solution; The --unlikely but-- worst that can happen is the algorithm runs for a while and at some point it terminates with a suboptimal candidate or with nothing at all.
However, please consider that placement on an 8-node host requires 255 steps, and on 16 nodes it takes 65535 steps. Thus, bad things would only happen on systems with nr_nodes>16, which we now know will probably never exist! :-) As an alternative, after thinking quite a bit about it, I found a way to produce a reasonably good estimation of (an upper bound of) the total number of steps (as the log(n)*p suggested above is nice, but very far from being precise enough, I'm afraid). So, using that to know in advance whether or not we should run is possible, but calculating it requires some quite advanced math that I'm not very keen on implementing here, as it would be ugly and not so easy to understand and maintain (although I'd do my best in providing good comments :-)). OTOH, doing as I described above enables the possibility of finding a (maybe suboptimal) placement in a reasonable amount of time even on those theoretical large systems, which is something good, I think. > > Also when this algorithm would be used, but would take too long, we > > should print a warning which tells the user they should use cpupools > > to assign nodes directly. > > Agreed. > This is definitely possible, although doing it _in_advance_ and being precise enough would involve the advanced math I was talking about before. I can print the warning on-line, as soon as the estimated number of steps reaches some threshold (lower than the one used to stop the algorithm; I was thinking of 2^16=65536), would that be reasonable? I can of course also print the warning in advance based on some rough estimation, but I fear it would not be that useful then... I'm implementing this right now. If you have any feedback, please shout, and the sooner the better.
:-D Thanks a lot and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 11:00 ` Ian Jackson 2012-07-18 13:14 ` Ian Campbell @ 2012-07-18 13:40 ` Andre Przywara 2012-07-18 13:54 ` Juergen Gross ` (2 more replies) 1 sibling, 3 replies; 60+ messages in thread From: Andre Przywara @ 2012-07-18 13:40 UTC (permalink / raw) To: Ian Jackson Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On 07/18/2012 01:00 PM, Ian Jackson wrote: > Ian Campbell writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): >> On Wed, 2012-07-18 at 10:43 +0100, Dario Faggioli wrote: >>> What could be done is restricting automatic placement to guests that >>> fit on 4 or 8 nodes for 4.2. >> >> 8 would mean on a 32 node system considering 10,518,300 combinations? >> >> 4 would mean on a 32 node system considering 35,960 combinations? On a >> 64 node system it would mean 635,376 combinations. >> >> If that's the case then let's go with 4 as the limit for 4.2.0. > > What is the maximum number of NUMA nodes we might expect to see on a > single system in the next five years? I would argue that 32 is too > optimistic. 128 or 256 seem like more reasonable upper bounds. Wow, what are you talking about? To calm this down from the AMD side: The current Opteron NUMA architecture is limited to exactly 8 nodes. This has always been the case since the release of Opteron, and changing it is not trivial and will not happen in the near future. In general I don't think we will see much bigger NUMA systems, but more cluster-like architectures. Maybe Juergen can comment on the Fujitsu side. So my suggestion: Get this NUMA placement in if at all possible for 4.2.0. We have much bigger problems without this algorithm than the theoretical gigantic machine you are talking about.
Since it is an internal algorithm, we can fix this later easily in 4.2.1 and 4.3 and nobody will ever notice this problem. Just my 2 cents. Andre. > 4 of 256 is 174,792,640 combinations - ie too many. So there needs to be a > limit on the host size too. But if you pick an upper host size limit > of 64 then in a hypothetical 128-node system you'd fail to do the > trivial search for a 1-node guest. > > An upper bound on log(number_of_combinations) is > log(number_of_nodes_on_the_host) * number_of_nodes_for_the_guest. > This fact could be used to determine more accurately whether the > algorithm is going to terminate in a reasonable time. > > Also when this algorithm would be used, but would take too long, we > should print a warning which tells the user they should use cpupools > to assign nodes directly. > > Ian. > -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 13:40 ` Andre Przywara @ 2012-07-18 13:54 ` Juergen Gross 2012-07-18 14:00 ` Dario Faggioli 2012-07-19 14:43 ` Ian Jackson 2 siblings, 0 replies; 60+ messages in thread From: Juergen Gross @ 2012-07-18 13:54 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Ian Jackson, xen-devel, Dario Faggioli Am 18.07.2012 15:40, schrieb Andre Przywara: > On 07/18/2012 01:00 PM, Ian Jackson wrote: >> Ian Campbell writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable >> automatic placement of guests on NUMA nodes [and 1 more messages]"): >>> On Wed, 2012-07-18 at 10:43 +0100, Dario Faggioli wrote: >>>> What could be done is restricting automatic placement to guests that >>>> fits on 4 or 8 nodes for 4.2. >>> >>> 8 would mean on a 32 node system considering 10,518,300 combinations? >>> >>> 4 would mean on a 32 node system considering 35,960 combinations? On a >>> 64 node system it would mean 635,376 combinations. >>> >>> If that's the case then lets go with 4 as the limit for 4.2.0. >> >> What is the maximum number of NUMA nodes we might expect to see on a >> single system in the next five years? I would argue that 32 is too >> optimistic. 128 or 256 seem like more reasonable upper bounds. > > Wow, what are you talking about? > To calm this down from the AMD side: > The current Opteron NUMA architecture is limited to exactly 8 nodes. > This has ever been the case since the release of Opteron and changing > this is not trivial and will not happen in any near future. In general I > don't think we will see much bigger NUMA systems, but more cluster like > architectures. > > Maybe Juergen can comment on the Fujitsu side. I'm not aware of anything with more than 8 nodes. OTOH I'm in the SW department. I'm not sure I would hear about a larger machine long before it is announced in public. 
:-) Juergen -- Juergen Gross Principal Developer Operating Systems PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 13:40 ` Andre Przywara 2012-07-18 13:54 ` Juergen Gross @ 2012-07-18 14:00 ` Dario Faggioli 2012-07-19 14:43 ` Ian Jackson 2 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 14:00 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1955 bytes --] Hi Andre, On Wed, 2012-07-18 at 15:40 +0200, Andre Przywara wrote: > > What is the maximum number of NUMA nodes we might expect to see on a > > single system in the next five years? I would argue that 32 is too > > optimistic. 128 or 256 seem like more reasonable upper bounds. > > Wow, what are you talking about? > To calm this down from the AMD side: > The current Opteron NUMA architecture is limited to exactly 8 nodes. > This has ever been the case since the release of Opteron and changing > this is not trivial and will not happen in any near future. > Cool, thanks for bringing your view into this discussion! :-) > In general I > don't think we will see much bigger NUMA systems, but more cluster like > architectures. > Yep, I think that too, but hearing this from you is much more important than hearing it from me, as, given your position, your crystal ball is quite a bit more trained for this kind of foresight than ours! :-P > Maybe Juergen can comment on the Fujitsu side. > That would be very interesting. > So my suggestion: Get this NUMA placement in if anyhow possible for > 4.2.0. We have much bigger problems without this algorithm than the > theoretic gigantic machine you are talking about. Since it is an > internal algorithm, we can fix this later easily in 4.2.1 and 4.3 and > nobody will ever notice this problem.
> While I basically agree, I think the solution would benefit from some kind of "safety stop condition" similar to the one IanJ suggested. I'm just trying to fine-tune it a bit, to make sure that as many systems as possible can benefit from the placement. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 13:40 ` Andre Przywara 2012-07-18 13:54 ` Juergen Gross 2012-07-18 14:00 ` Dario Faggioli @ 2012-07-19 14:43 ` Ian Jackson 2012-07-19 18:37 ` Andre Przywara 2 siblings, 1 reply; 60+ messages in thread From: Ian Jackson @ 2012-07-19 14:43 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli Andre Przywara writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): > On 07/18/2012 01:00 PM, Ian Jackson wrote: > > What is the maximum number of NUMA nodes we might expect to see on a > > single system in the next five years? I would argue that 32 is too > > optimistic. 128 or 256 seem like more reasonable upper bounds. > > Wow, what are you talking about? > To calm this down from the AMD side: > The current Opteron NUMA architecture is limited to exactly 8 nodes. > This has ever been the case since the release of Opteron and changing > this is not trivial and will not happen in any near future. In general I > don't think we will see much bigger NUMA systems, but more cluster like > architectures. One of these `more cluster like architectures', if it has shared memory at all, may well end up being most easily represented in the model supported by Xen as a NUMA host with a large number of nodes. We need to plan for the software that we are writing today to work for at least the next 5 years /even if we intend to replace the algorithm in the next release/. It is very difficult to foresee what might happen to hardware in that time. But it's OK, we don't need to panic. We just need a safety catch which stops this algorithm running in situations where it won't work. Following a discussion with Dario, AIUI he plans to implement a rule that it will bail on systems with more than 8 NUMA nodes. 
That is releasable in 4.2 and ought to be satisfactory for you? Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-19 14:43 ` Ian Jackson @ 2012-07-19 18:37 ` Andre Przywara 2012-07-21 1:46 ` Dario Faggioli 0 siblings, 1 reply; 60+ messages in thread From: Andre Przywara @ 2012-07-19 18:37 UTC (permalink / raw) To: Ian Jackson Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On 07/19/2012 04:43 PM, Ian Jackson wrote: > Andre Przywara writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): >> On 07/18/2012 01:00 PM, Ian Jackson wrote: >>> What is the maximum number of NUMA nodes we might expect to see on a >>> single system in the next five years? I would argue that 32 is too >>> optimistic. 128 or 256 seem like more reasonable upper bounds. >> >> Wow, what are you talking about? >> To calm this down from the AMD side: >> The current Opteron NUMA architecture is limited to exactly 8 nodes. >> This has ever been the case since the release of Opteron and changing >> this is not trivial and will not happen in any near future. In general I >> don't think we will see much bigger NUMA systems, but more cluster like >> architectures. > > One of these `more cluster like architectures', if it has shared > memory at all, may well end up being most easily represented in the > model supported by Xen as a NUMA host with a large number of nodes. > > We need to plan for the software that we are writing today to work for > at least the next 5 years /even if we intend to replace the algorithm > in the next release/. It is very difficult to foresee what might > happen to hardware in that time. > > But it's OK, we don't need to panic. We just need a safety catch > which stops this algorithm running in situations where it won't work. > Following a discussion with Dario, AIUI he plans to implement a rule > that it will bail on systems with more than 8 NUMA nodes. 
> > That is releasable in 4.2 and ought to be satisfactory for you ? Absolutely, that's fine. I just wanted to make sure that we don't drop the patches at all because of the theoretical fear of a large NUMA box lingering somewhere around the corner... Thanks, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-19 18:37 ` Andre Przywara @ 2012-07-21 1:46 ` Dario Faggioli 0 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-21 1:46 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1500 bytes --] On Thu, 2012-07-19 at 20:37 +0200, Andre Przywara wrote: > > But it's OK, we don't need to panic. We just need a safety catch > > which stops this algorithm running in situations where it won't work. > > Following a discussion with Dario, AIUI he plans to implement a rule > > that it will bail on systems with more than 8 NUMA nodes. > > > > That is releasable in 4.2 and ought to be satisfactory for you ? > I did just that. Honestly, I computed the exact number of steps the algorithm needs in the worst case (i.e., searching all the combinations of all the sizes) for 8 and 16 nodes. 8 nodes means 255 steps, 16 nodes means 65535, so I think both could work, especially after the optimization I put in the version I just released (it no longer uses that large array of combinations, and no longer always needs to perform all the steps). However, as agreed, I put an "if (nr_nodes <= 8)" there... If a 16-node machine ever appears, we'll decide what to do. > Absolutely, that's fine. I just wanted to make sure that we don't drop > the patches at all because of the theoretical fear of a large NUMA box > lingering somewhere around the corner...
> :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 9:43 ` Dario Faggioli 2012-07-18 9:53 ` Ian Campbell @ 2012-07-18 10:53 ` Ian Jackson 2012-07-18 13:12 ` Ian Campbell 1 sibling, 1 reply; 60+ messages in thread From: Ian Jackson @ 2012-07-18 10:53 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel Dario Faggioli writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): ... > So basically I'm saying that it does have that property but that is > bringing benefits right now, and we have all the knobs to limit its > potential nasty effects for future (bigger) systems at any time. I think that it is essential that we do not have a situation where the default behaviour is ever to hang while it searches zillions of combinations. That would be a release critical bug IMO. So I don't think "we will fix this after the series goes in" is acceptable. We should not be introducing new RC bugs during the freeze. > What could be done is restricting automatic placement to guests that > fit on 4 or 8 nodes for 4.2. Then, for 4.3 (and with a very small > backport effort to 4.2.1), implement a lighter solution for dealing with > the ones that do not. That would be acceptable. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 10:53 ` Ian Jackson @ 2012-07-18 13:12 ` Ian Campbell 0 siblings, 0 replies; 60+ messages in thread From: Ian Campbell @ 2012-07-18 13:12 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On Wed, 2012-07-18 at 11:53 +0100, Ian Jackson wrote: > Dario Faggioli writes ("Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages]"): > ... > > So basically I'm saying that it does have that property but that is > > bringing benefits right now, and we have all the knobs to limit its > > potential nasty effects for future (bigger) system at any time. > > I think that it is essential that we do not have a situation where the > default behaviour is ever to hang while it searches zillions of > combinations. That would be a release critical bug IMO. > > So I don't think "we will fix this after the series goes in" is > acceptable. We should not be introducing new RC bugs during the > freeze. If this is the case then we need to consider that we may need to release 4.2.0 without this NUMA awareness unless we can find a simple fix/bandaid soon. I don't want this freeze to drag on for weeks or months while we sort this out. I'd like to see a solution agreed and implemented ASAP or else we need to seriously consider deferring the whole thing to 4.3 (perhaps with a 4.2.1 backport) Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] 2012-07-18 0:22 ` Dario Faggioli 2012-07-18 8:27 ` Dario Faggioli 2012-07-18 9:13 ` Ian Campbell @ 2012-07-18 9:47 ` Dario Faggioli 2 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-18 9:47 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 2277 bytes --] On Wed, 2012-07-18 at 02:22 +0200, Dario Faggioli wrote: > Moreover, as another input parameter is a nodemask, telling the > algorithm not only how many but also _which_ nodes it should consider, > one could imagine having some multi-tier form of the algorithm itself, > or, perhaps, implementing a more abstract and coarse partitioning > heuristic among groups of nodes (which will be better done as soon as > we see what 32- and 64-node systems will actually look like :-D) and > then run the current algorithm only on these groups, with numbers that > would then look just like above. > Alternatively (or maybe not even that much 'alternatively' :-)), we can try turning it into something a bit more greedy, e.g., searching for available placement solutions just as big as min_nodes says (typically 1) as a first step (complexity linear with the number of nodes). After that, if we did not find anything, we could have some sort of config switch telling us whether we should continue, and maybe investigate candidates of sizes in the range [min_nodes+1,LIBXL__MAX_GUEST_NODES]. At which point, if we haven't found anything yet, we could just give up or, perhaps, think about and implement in the near future a more coarse and less precise mechanism for proceeding further. Implementing something like that on top of what I have now (which I'll resend ASAP, as soon as I've addressed the minor comments I got) during 4.3 should be pretty easy.
Also, if we decide to just statically limit max_nodes<=LIBXL__MAX_GUEST_NODES (or whatever we could call it) to something (I was proposing 4 in the other e-mail) for 4.2, backporting to it will be a piece of cake as well. :-) Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli 2012-07-17 18:04 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Ian Jackson @ 2012-07-19 12:21 ` Andre Przywara 2012-07-19 14:22 ` Dario Faggioli 1 sibling, 1 reply; 60+ messages in thread From: Andre Przywara @ 2012-07-19 12:21 UTC (permalink / raw) To: Dario Faggioli Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel Dario, sorry for joining the discussion so late, but I was busy with other things and saw the project in good hands. Finally I managed to get some testing on these patches. I took my 8-node machine, alternately equipped with 16GB and 8GB per node. Each node has 8 pCPUs. As a special(i)ty I removed the DIMMs from node 2 and 3 to test Andrew's memory-less node patches, leading to this configuration:

node: memsize memfree distances
   0:   17280    4518 10,16,16,22,16,22,16,22
   1:    8192    3639 16,10,16,22,22,16,22,16
   2:       0       0 16,16,10,16,16,16,16,22
   3:       0       0 22,22,16,10,16,16,22,16
   4:   16384    4766 16,22,16,16,10,16,16,16
   5:    8192    2882 22,16,16,16,16,10,22,22
   6:   16384    4866 16,22,16,22,16,22,10,16
   7:    8176    2799 22,16,22,16,16,22,16,10

Then I started 32 guests, each 4 vCPUs and 1 GB of RAM. Now since the code prefers free memory so much over free CPUs, the placement was the following:

node0: guests 2,5,8,11,14,17,20,25,30
node1: guests 21,27
node2: none
node3: none
node4: guests 1,4,7,10,13,16,19,23,29
node5: guests 24,31
node6: guests 3,6,9,12,15,18,22,28
node7: guests 26,32

As you can see, the nodes with more memory are _way_ overloaded, while the lower memory ones are underutilized. In fact the first 20 guests didn't use the other nodes at all.
I don't care so much about the two memory-less nodes, but I'd like to know how you came to the magic "3" in the formula:

> +
> +    return sign(3*freememkb_diff + nrdomains_diff);
> +}

I haven't done any measurements on this, but I guess scheduling 36 vCPUs on 8 pCPUs has a much bigger performance penalty than any remote NUMA access, which nowadays is much better than a few years ago, with big L3 caches, better predictors and faster interconnects. I will now put in the memory again and also try to play a bit with the amount of memory and VCPUs per guest. Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-19 12:21 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Andre Przywara @ 2012-07-19 14:22 ` Dario Faggioli 2012-07-20 8:19 ` Andre Przywara 2012-07-20 8:20 ` Dario Faggioli 0 siblings, 2 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-19 14:22 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 5310 bytes --] On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote: > Dario, > Hi Andre! > sorry for joining the discussion so late, but I was busy with other > things and saw the project in good hands. > Well, thanks. It's been a big deal (and we're just at the beginning!), but I'm enjoying working on it so I won't give up easily! :-P > Finally I managed to get some testing on these patches. > That's very cool, thanks. > I took my 8-node machine, alternately equipped with 16GB and 8GB per > node. Each node has 8 pCPUs. > As a special(i)ty I removed the DIMMs from node 2 and 3 to test Andrew's > memory-less node patches, leading to this configuration: > node: memsize memfree distances > 0: 17280 4518 10,16,16,22,16,22,16,22 > 1: 8192 3639 16,10,16,22,22,16,22,16 > 2: 0 0 16,16,10,16,16,16,16,22 > 3: 0 0 22,22,16,10,16,16,22,16 > 4: 16384 4766 16,22,16,16,10,16,16,16 > 5: 8192 2882 22,16,16,16,16,10,22,22 > 6: 16384 4866 16,22,16,22,16,22,10,16 > 7: 8176 2799 22,16,22,16,16,22,16,10 > Interesting. That's really the kind of testing we need in order to fine-tune the details. Thanks for doing this. > Then I started 32 guests, each 4 vCPUs and 1 GB of RAM.
> Now since the code prefers free memory so much over free CPUs, the > placement was the following: > node0: guests 2,5,8,11,14,17,20,25,30 > node1: guests 21,27 > node2: none > node3: none > node4: guests 1,4,7,10,13,16,19,23,29 > node5: guests 24,31 > node6: guests 3,6,9,12,15,18,22,28 > node7: guests 26,32 > > As you can see, the nodes with more memory are _way_ overloaded, while > the lower memory ones are underutilized. In fact the first 20 guests > didn't use the other nodes at all. > I don't care so much about the two memory-less nodes, but I'd like to > know how you came to the magic "3" in the formula: > > > + > > + return sign(3*freememkb_diff + nrdomains_diff); > > +} > Ok. The idea behind the current implementation of that heuristic is to prefer nodes with more free memory. In fact, this leaves larger "holes", maximizing the probability of being able to put more domains there. Of course that means more domains exploiting local accesses, but it introduces the risk of overloading large (from a memory POV) nodes with a lot of domains (which seems to be exactly what's happening to you! :-P). Therefore, I wanted to balance that by putting something related to the current load on a node into the equation. Unfortunately, I really am not sure yet what a reasonable estimation of the actual "load on a node" could be. Even worse, Xen does not report anything even close to that, at least not right now. That's why I went for a quite dumb count of the number of domains for now, waiting to find the time to implement something more clever. So, that is basically why I thought it could be a good idea to overweight the differences in free memory wrt the differences in number of assigned domains. The first implementation was only considering the number of assigned domains to decide which was the best candidate between two that were less than 10% different in their amount of free memory.
However, that didn't produce a good comparison function, and thus I rewrote it like above, with the magic 3 selected via trial and error to mimic something similar to the old 10% rule. That all being said, this is the first time the patchset had the chance to run on such a big system, so I'm definitely open to suggestions on how to make that formula better at reflecting what we think is The Right Thing! > I haven't done any measurements on this, but I guess scheduling 36 vCPUs > on 8 pCPUs has a much bigger performance penalty than any remote NUMA > access, which nowadays is much better than a few years ago, with big L3 > caches, better predictors and faster interconnects. > Definitely. Consider that the guests are being pinned because that is what XenD does and because there has been no time to properly refine the implementation of a more NUMA-aware scheduler for 4.2. In future, as soon as I have it ready, the vcpus from the overloaded nodes would get some runtime on the otherwise idle ones, even if they're remote. Nevertheless, this is what Xen 4.2 will have, and I really think initial placement is a very important step, and we must get the most out of being able to do it well (as opposed to other technologies, where something like that has to happen in the kernel/hypervisor, which entails a lot of limitations we don't have!), and am therefore happy to try as hard as I can. > I will now put in the memory again and also try to play a bit with the > amount of memory and VCPUs per guest. > Great, keep me super-posted about these things and feel free to ask anything that comes to your mind!
:-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-19 14:22 ` Dario Faggioli @ 2012-07-20 8:19 ` Andre Przywara 2012-07-20 9:39 ` Dario Faggioli 1 sibling, 1 reply; 60+ messages in thread From: Andre Przywara @ 2012-07-20 8:19 UTC (permalink / raw) To: Dario Faggioli Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel On 07/19/2012 04:22 PM, Dario Faggioli wrote: > On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote: >> Dario, thanks for the warm welcome. >> ... >> As you can see, the nodes with more memory are _way_ overloaded, while >> the lower memory ones are underutilized. In fact the first 20 guests >> didn't use the other nodes at all. >> I don't care so much about the two memory-less nodes, but I'd like to >> know how you came to the magic "3" in the formula: >> >>> + >>> + return sign(3*freememkb_diff + nrdomains_diff); >>> +} >> > Ok. The idea behind the current implementation of that heuristics is to > prefer nodes with more free memory. In fact, this leaves larger "holes", > maximizing the probability of being able to put more domain there. Of > course that means more domains exploiting local accesses, but introduces > the risk of overloading large (from a memory POV) nodes with a lot of > domains (which seems right what's happening to you! :-P). I always assumed the vast majority of actual users/customers use comparatively small domains, something like 1-4 VCPUs and like 4 GB of RAM. So these domains are much smaller than a usual node. I'd consider a node size of 16 GB the lower boundary, with up to 128GB covering the common scenarios. Sure there are bigger or smaller machines, but I'd consider this the sweet spot. > Therefore, I wanted to balance that by putting something related to the > current load on a node into the equation.
> Unfortunately, I really am not
> sure yet what a reasonable estimation of the actual "load on a node"
> could be. Even worse, Xen does not report anything even close to that,
> at least not right now. That's why I went for a quite dumb count of the
> number of domains for now, waiting to find the time to implement
> something more clever.

Right. So we just use the number of already pinned vCPUs as the metric. Let me look if I can change the code to really use the number of vCPUs instead of the number of domains. A domain could be UP or 8-way SMP, which really makes much difference wrt load on a node.
In the long run we need something like a per-node (or per-pCPU) load average. We cannot foresee the future, but we just assume that the past is a good indicator for it. xl top generates such numbers on demand already. But that surely is something for 4.3, just wanted to mention it.

> So, that is basically why I thought it could be a good idea to
> overweight the differences in free memory wrt the differences in the number
> of assigned domains. The first implementation was only considering the
> number of assigned domains to decide which was the best candidate between
> two that were less than 10% different in their amount of free memory.
> However, that didn't produce a good comparison function, and thus I
> rewrote it like above, with the magic 3 selected via trial and error to
> mimic something similar to the old 10% rule.

OK, I see. I thought about this a bit more and agree a single heuristic formula isn't easy to find. After reading the code I consider this a bit over-engineered, but I cannot possibly complain about this after having remained silent for such a long time.
So let's see what we can make out of this code, just firing up some ideas, feel free to just ignore them in case they are dumb ;-)
So if you agree to the small-domain assumption, then domains easily fitting into a node are the rule, not the exception. We should handle it that way.
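[For concreteness, the comparison heuristic being debated could be sketched as below. This is a hypothetical reconstruction, not the actual libxl patch: the field names, the sign conventions, and in particular the normalization of the two differences into [-1, 1] (so that kilobytes of memory and domain counts become comparable) are all assumptions.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the placement comparison heuristic discussed in
 * this thread -- NOT the actual libxl code.  Returns a qsort(3)-style
 * value: negative means candidate a is the better placement.  The two
 * differences are normalized into [-1, 1] so free memory (in kB) and
 * domain counts can be combined; memory is then weighted by the
 * "magic 3". */

struct placement_candidate {
    uint64_t free_memkb;   /* free memory summed over the candidate's nodes */
    int nr_domains;        /* domains already assigned to those nodes */
};

static int sign(double v)
{
    return (v > 0.0) - (v < 0.0);
}

/* Difference of a and b, scaled by the larger of the two, in [-1, 1]. */
static double normalized_diff(double a, double b)
{
    double m = a > b ? a : b;
    return m == 0.0 ? 0.0 : (a - b) / m;
}

int candidate_cmp(const struct placement_candidate *a,
                  const struct placement_candidate *b)
{
    /* Positive when b has more free memory (b better on memory), and
     * positive when a has more domains (a worse on load). */
    double freememkb_diff = normalized_diff(b->free_memkb, a->free_memkb);
    double nrdomains_diff = normalized_diff(a->nr_domains, b->nr_domains);

    return sign(3 * freememkb_diff + nrdomains_diff);
}
```

[Note how, with memory weighted by 3, a candidate with a sizeable free-memory advantage wins even against a completely idle competitor -- one way to read the overloading of the big-memory nodes reported above.]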
Maybe we can also solve the complexity problem by only generating single-node candidates in the first place, and only look at alternatives if these don't fit?
I really admire that lazy comb_next generation function, so why don't we use it in a really lazy way? I think there was already a discussion about this, I just don't remember what its outcome was.
Some debugging code showed that on the above (empty) machine a 2 VCPUs/2GB domain already generated 255 candidates. That really looks like overkill, especially as we should actually focus on the 8 single-node ones.
Maybe we can use a two-step approach? First use a simple heuristic similar to the xend one: we only consider nodes with enough free memory. Then we look for the least utilized ones: simply calculate the difference between the number of currently pinned vCPUs and the number of pCPUs. So any node with free (aka non-overcommitted) CPUs should really be considered first. After all we don't need to care about memory latency if the domains starve for compute time and only get a fraction of each pCPU. If you don't want to believe this, I can run some benchmarks to prove it.
If we somehow determine that this approach doesn't work (no nodes with enough free memory, or more vCPUs than CPUs-per-node) we should use the sophisticated algorithm.
Also consider the following: with really big machines or with odd configurations people will probably do their pinning/placement themselves (or by external mgmt applications). What this automatic placement algorithm is good for is more the What-is-this-NUMA-thingie-anyways people.
> >> I haven't done any measurements on this, but I guess scheduling 36 vCPUs
>> on 8 pCPUs has a much bigger performance penalty than any remote NUMA
>> access, which nowadays is much better than a few years ago, with big L3
>> caches, better predictors and faster interconnects.
>>
> Definitely. Consider that the guests are being pinned because that is
> what XenD does and because there has been no time to properly refine the
> implementation of a more NUMA-aware scheduler for 4.2. In future, as
> soon as I have it ready, the vcpus from the overloaded nodes would
> get some runtime on the otherwise idle ones, even if they're remote.

Right, that sounds good. If you have any good (read: meaningful to customers) benchmarks I can do some experiments on my machine to fine-tune this.

> Nevertheless, this is what Xen 4.2 will have, and I really think initial
> placement is a very important step, and we must get the most out of
> being able to do it well (as opposed to other technologies, where
> something like that has to happen in the kernel/hypervisor, which
> entails a lot of limitations we don't have!), and am therefore happy
> about trying to do so as hard as I can.

Right. We definitely need some placement for 4.2. Let's push this in if at all possible.

>> I will now put in the memory again and also try to play a bit with the
>> amount of memory and VCPUs per guest.
>>
> Great, keep me super-posted about these things and feel free to ask
> anything that comes to your mind! :-)

So changing the VCPUs and memory config didn't make any real difference. I think that is because the number of domains is considered, not the number of vCPUs. This should be fixed.
Second, I inserted the memory again. I only have 24 DIMMs for 32 sockets (we have easy access to boards and CPUs, but memory we have to buy like everyone else ;-), so I have to go with this alternating setup, having four 16GB nodes and four 8GB nodes.
This didn't change much, so the 16 guests ended up in a 4-1-0-0-4-0-4-3 setup (domains per node). The 0's and 1's were the 8GB nodes. The guests were 2P/2GB ones. So far.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 8:19 ` Andre Przywara @ 2012-07-20 9:39 ` Dario Faggioli 2012-07-20 10:01 ` Dario Faggioli 0 siblings, 1 reply; 60+ messages in thread From: Dario Faggioli @ 2012-07-20 9:39 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 5602 bytes --] On Fri, 2012-07-20 at 10:19 +0200, Andre Przywara wrote: > thanks for the warm welcome. > :-) > I always assumed the vast majority of actual users/customers use > comparably small domains, something like 1-4 VCPUs and like 4 GB of RAM. > So these domains are much smaller than a usual node. I'd consider a node > size of 16 GB the lower boundary, with up to 128GB as the common > scenarios. Sure there are bigger or smaller machines, but I'd consider > this the sweet spot. > Yep, I agree this should be our target, at least for the default behaviour. > Right. So we just use the number of already pinned vCPUs as the metric. > Let me look if I can change the code to really use number of vCPUs > instead of number of domains. A domain could be UP or 8-way SMP, which > really makes much difference wrt to load on a node. > As we both are saying in the other e-mail, this is probably something good to do. I'm working on a new version of the patchset that I'm going to release later today (addressing the comments and the outcome of the long discussion the last round generated). If you can go as far as having a patch, that would be great, and I guess it could make it in even in early -rc days. > In the long run we need something like a per-node (or per-pCPU) load > average. We cannot foresee the future, but we just assume that the past > is a good indicator for it. xl top generates such numbers on demand > already. But that surely is something for 4.3, just wanted to mention it. > Definitely. 
> After reading the code I consider this a bit
> over-engineered, but I cannot possibly complain about this after having
> remained silent for such a long time.
>
:-)

> So let's see what we can make out of this code, just firing up some
> ideas, feel free to just ignore them in case they are dumb ;-)
>
I bet they are not!

> So if you agree to the small-domain assumption, then domains easily
> fitting into a node are the rule, not the exception. We should handle it
> that way. Maybe we can also solve the complexity problem by only
> generating single-node candidates in the first place, and only look at
> alternatives if these don't fit?
>
That's exactly what I'm doing right now. You'll see the code later but, basically, at the i-th step, I now compare all the candidates with i nodes and, if I find at least one, I quit the whole thing and avoid proceeding to step i+1. Of course, if I find more than one, I return the one that is best according to the heuristics. Given that the very first step is very likely going to be looking at candidates with 1 node in them, there you have exactly what you were talking about.

> I really admire that lazy comb_next generation function, so why don't we
> use it in a really lazy way? I think there was already a discussion
> about this, I just don't remember what its outcome was.
> Some debugging code showed that on the above (empty) machine a 2
> VCPUs/2GB domain already generated 255 candidates. That really looks
> like overkill, especially as we should actually focus on the 8
> single-node ones.
>
Yep, and in fact 255 is what it takes on 8 nodes. As said above, I sort-of read your mind and started implementing what you write here yesterday. :-P

> Maybe we can use a two-step approach? First use a simple heuristic
> similar to the xend one:
> We only consider nodes with enough free memory. Then we look for the
> least utilized ones: simply calculate the difference between the number
> of currently pinned vCPUs and the number of pCPUs.
So any node with free (aka non-overcommitted) CPUs should really be considered first.
>
Again, with the change above, this thing you're saying here can be achieved just by removing the memfree_diff from the comparison function (which is no longer used during a proper sort; rather, it is being called on-line as soon as a new candidate is found, to compare it with the current cached best). And yes, of course turning the domain count into a vcpu count, as said above.

> If we somehow determine that this approach doesn't work (no nodes with
> enough free memory or more vCPUs than CPUs-per-node) we should use the
> sophisticated algorithm.
>
And again, you'll see the new code and will tell me what you think later, but I really think I turned it into something like that. The only thing

> Right, that sounds good. If you have any good (read: meaningful to
> customers) benchmarks I can do some experiments on my machine to
> fine-tune this.
>
Nothing that goes that far. I tried to run specjbb2005 concurrently on some VMs with and without placement, but I only have a very small testbox. :-(

> > Nevertheless, this is what Xen 4.2 will have, and I really think initial
> > placement is a very important step, and we must get the most out of
> > being able to do it well (as opposed to other technologies, where
> > something like that has to happen in the kernel/hypervisor, which
> > entails a lot of limitations we don't have!), and am therefore happy
> > about trying to do so as hard as I can.
>
> Right. We definitely need some placement for 4.2. Let's push this in if
> at all possible.
>
That's what I'm working quite hard for.
:-) Thanks and Regards, Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
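[The incremental, smallest-first search Dario describes above -- stop at the first candidate size that yields a hit, instead of generating all 2^n - 1 node subsets up front (255 on an 8-node box) -- can be sketched as follows. The feasibility test is a deliberately toy stand-in, not the real comb_next combination generator:]

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 8

/* Toy feasibility model (an assumption, not the libxl check): every node
 * contributes per_node_memkb of free memory and per_node_pcpus CPUs, so
 * a candidate spanning n nodes fits iff it covers both the domain's
 * memory footprint and its vCPU count. */
static bool fits(int n, long per_node_memkb, int per_node_pcpus,
                 long need_memkb, int need_vcpus)
{
    return n * per_node_memkb >= need_memkb &&
           n * per_node_pcpus >= need_vcpus;
}

/* Smallest-first search: candidates spanning n+1 nodes are considered
 * only if no n-node candidate fits.  Returns the smallest number of
 * nodes able to host the domain, or -1 if even all of them cannot. */
int smallest_candidate(long per_node_memkb, int per_node_pcpus,
                       long need_memkb, int need_vcpus)
{
    for (int n = 1; n <= NR_NODES; n++)
        if (fits(n, per_node_memkb, per_node_pcpus, need_memkb, need_vcpus))
            return n;   /* larger candidates are never even examined */
    return -1;
}
```

[In the real series the per-size step would still enumerate actual node subsets and check real per-node free memory and CPU counts; the uniform per-node model here is only for illustration of the early-exit structure.]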
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
2012-07-20 9:39 ` Dario Faggioli
@ 2012-07-20 10:01 ` Dario Faggioli
0 siblings, 0 replies; 60+ messages in thread
From: Dario Faggioli @ 2012-07-20 10:01 UTC (permalink / raw)
To: Andre Przywara
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel

On Fri, 2012-07-20 at 11:39 +0200, Dario Faggioli wrote:
> > If we somehow determine that this approach doesn't work (no nodes with
> > enough free memory or more vCPUs than CPUs-per-node) we should use the
> > sophisticated algorithm.
> >
> And again, you'll see the new code and will tell me what you think
> later, but I really think I turned it into something like that. The only
> thing
>
Some text got cut here. What I wanted to ask is what you think we should do with guests that have more vcpus than the pcpus of a node. Right now, as I do for memory, I'm using it as a reason to consider candidates with more than one node. XenD did something similar, as it adds nodes (although it does that afterwards) until the above stops being true.
Should we ignore that and stick to the number of nodes that gives us enough memory? Should we do that but only up to a certain extent (e.g., by asking for a candidate to have at least half as many pcpus as the domain has vcpus)? After all, even if we give the domain enough pcpus, it's not going to have exclusive access to them anyway...
Again, just random thoughts seeking a confirmation that is very hard to find!
:-) Thanks and Regards, Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
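[The "at least half as many pcpus as the domain has vcpus" relaxation Dario floats above reduces, as a sketch, to a one-line predicate. The helper name is hypothetical and not part of the patch; the integer form avoids division-rounding surprises:]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical relaxed CPU criterion: instead of requiring as many pCPUs
 * as the domain has vCPUs, accept a candidate offering at least half as
 * many, on the grounds that the domain never gets exclusive access to
 * the pCPUs anyway.  Written multiplication-side to avoid rounding. */
static bool enough_pcpus_relaxed(int cand_pcpus, int domain_vcpus)
{
    return 2 * cand_pcpus >= domain_vcpus;
}
```

[Under this rule a single 8-pCPU node would still be accepted for a 16-vCPU guest, but not for a 17-vCPU one.]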
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
2012-07-19 14:22 ` Dario Faggioli
2012-07-20 8:19 ` Andre Przywara
@ 2012-07-20 8:20 ` Dario Faggioli
2012-07-20 8:26 ` Andre Przywara
1 sibling, 1 reply; 60+ messages in thread
From: Dario Faggioli @ 2012-07-20 8:20 UTC (permalink / raw)
To: Andre Przywara
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel

On Thu, 2012-07-19 at 16:22 +0200, Dario Faggioli wrote:
> Interesting. That's really the kind of testing we need in order to
> fine-tune the details. Thanks for doing this.
>
> > Then I started 32 guests, each 4 vCPUs and 1 GB of RAM.
> > Now since the code prefers free memory so much over free CPUs, the
> > placement was the following:
> > node0: guests 2,5,8,11,14,17,20,25,30
> > node1: guests 21,27
> > node2: none
> > node3: none
> > node4: guests 1,4,7,10,13,16,19,23,29
> > node5: guests 24,31
> > node6: guests 3,6,9,12,15,18,22,28
> > node7: guests 26,32
> >
> > As you can see, the nodes with more memory are _way_ overloaded, while
> > the lower memory ones are underutilized. In fact the first 20 guests
> > didn't use the other nodes at all.
> > I don't care so much about the two memory-less nodes, but I'd like to
> > know how you came to the magic "3" in the formula:
> >
> > > +
> > > + return sign(3*freememkb_diff + nrdomains_diff);
> > > +}
>
> That all being said, this is the first time the patchset had the chance
> to run on such a big system, so I'm definitely open to suggestions on how
> to make that formula better in reflecting what we think is The Right
> Thing!

Thinking more about this, I realize that I was implicitly assuming some symmetry in the amount of memory each node comes with, which is probably something I shouldn't have done...
I really am not sure what to do here, perhaps treating the two metrics more evenly?
Or maybe even reverse the logic and give nr_domains more weight?
I was also thinking whether it could be worthwhile to consider the total number of vcpus on a node instead of the number of domains, but again, that's not guaranteed to be any more meaningful (suppose there are a lot of idle vcpus)...

Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 8:20 ` Dario Faggioli @ 2012-07-20 8:26 ` Andre Przywara 2012-07-20 8:38 ` Juergen Gross 2012-07-20 9:44 ` Dario Faggioli 0 siblings, 2 replies; 60+ messages in thread From: Andre Przywara @ 2012-07-20 8:26 UTC (permalink / raw) To: Dario Faggioli Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel On 07/20/2012 10:20 AM, Dario Faggioli wrote: > On Thu, 2012-07-19 at 16:22 +0200, Dario Faggioli wrote: >> Interesting. That's really the kind of testing we need in order to >> fine-tune the details. Thanks for doing this. >> >>> Then I started 32 guests, each 4 vCPUs and 1 GB of RAM. >>> Now since the code prefers free memory so much over free CPUs, the >>> placement was the following: >>> node0: guests 2,5,8,11,14,17,20,25,30 >>> node1: guests 21,27 >>> node2: none >>> node3: none >>> node4: guests 1,4,7,10,13,16,19,23,29 >>> node5: guests 24,31 >>> node6: guests 3,6,9,12,15,18,22,28 >>> node7: guests 26,32 >>> >>> As you can see, the nodes with more memory are _way_ overloaded, while >>> the lower memory ones are underutilized. In fact the first 20 guests >>> didn't use the other nodes at all. >>> I don't care so much about the two memory-less nodes, but I'd like to >>> know how you came to the magic "3" in the formula: >>> >>>> + >>>> + return sign(3*freememkb_diff + nrdomains_diff); >>>> +} >>> >> >> That all being said, this is the first time the patchset had the chance >> to run on such a big system, so I'm definitely open to suggestion on how >> to make that formula better in reflecting what we think it's The Right >> Thing! >> > Thinking more about this, I realize that I was implicitly assuming some > symmetry in the amount of memory each nodes comes with, which is > probably something I shouldn't have done... > > I really am not sure what to do here, perhaps treating the two metrics > more evenly? 
Or maybe even reverse the logic and give nr_domains more > weight? I replaced the 3 with 1 already, that didn't change so much. I think we should kind of reverse the importance of node load, since starving for CPU time is much worse than bad memory latency. I will do some experiments... > I was also thinking whether it could be worthwhile to consider the total > number of vcpus on a node instead than the number of domain, but again, > that's not guaranteed to be any more meaningful (suppose there are a lot > of idle vcpus)... Right, that was my thinking on the ride to work also ;-) What about this: 1P and 2P guests really use their vCPUs, but for bigger guests we assume only a fractional usage? Andre -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany Tel: +49 351 448-3567-12 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 8:26 ` Andre Przywara @ 2012-07-20 8:38 ` Juergen Gross 2012-07-20 9:52 ` Dario Faggioli 2012-07-20 9:44 ` Dario Faggioli 1 sibling, 1 reply; 60+ messages in thread From: Juergen Gross @ 2012-07-20 8:38 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel, Dario Faggioli Am 20.07.2012 10:26, schrieb Andre Przywara: > On 07/20/2012 10:20 AM, Dario Faggioli wrote: >> On Thu, 2012-07-19 at 16:22 +0200, Dario Faggioli wrote: >>> Interesting. That's really the kind of testing we need in order to >>> fine-tune the details. Thanks for doing this. >>> >>>> Then I started 32 guests, each 4 vCPUs and 1 GB of RAM. >>>> Now since the code prefers free memory so much over free CPUs, the >>>> placement was the following: >>>> node0: guests 2,5,8,11,14,17,20,25,30 >>>> node1: guests 21,27 >>>> node2: none >>>> node3: none >>>> node4: guests 1,4,7,10,13,16,19,23,29 >>>> node5: guests 24,31 >>>> node6: guests 3,6,9,12,15,18,22,28 >>>> node7: guests 26,32 >>>> >>>> As you can see, the nodes with more memory are _way_ overloaded, while >>>> the lower memory ones are underutilized. In fact the first 20 guests >>>> didn't use the other nodes at all. >>>> I don't care so much about the two memory-less nodes, but I'd like to >>>> know how you came to the magic "3" in the formula: >>>> >>>>> + >>>>> + return sign(3*freememkb_diff + nrdomains_diff); >>>>> +} >>>> >>> >>> That all being said, this is the first time the patchset had the chance >>> to run on such a big system, so I'm definitely open to suggestion on how >>> to make that formula better in reflecting what we think it's The Right >>> Thing! >>> >> Thinking more about this, I realize that I was implicitly assuming some >> symmetry in the amount of memory each nodes comes with, which is >> probably something I shouldn't have done... 
>>
>> I really am not sure what to do here, perhaps treating the two metrics
>> more evenly? Or maybe even reverse the logic and give nr_domains more
>> weight?
>
> I replaced the 3 with 1 already, that didn't change so much. I think we
> should kind of reverse the importance of node load, since starving for
> CPU time is much worse than bad memory latency. I will do some
> experiments...
>
>> I was also thinking whether it could be worthwhile to consider the total
>> number of vcpus on a node instead of the number of domains, but again,
>> that's not guaranteed to be any more meaningful (suppose there are a lot
>> of idle vcpus)...
>
> Right, that was my thinking on the ride to work also ;-)
> What about this: 1P and 2P guests really use their vCPUs, but for bigger
> guests we assume only a fractional usage?

Hmm. I wouldn't be sure about this. I would guess there is a reason a guest has more than 1 vcpu. Normally this is because the guest needs more power. Having only 1 vcpu means this is enough; it probably would need only 0.1 vcpus.
A guest with many vcpus suffers much more from a heavily loaded node, as lock contention in the guest increases with the number of vcpus, as does time spent waiting for a vcpu that holds a lock but has been preempted.

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
2012-07-20 8:38 ` Juergen Gross
@ 2012-07-20 9:52 ` Dario Faggioli
2012-07-20 9:56 ` Juergen Gross
0 siblings, 1 reply; 60+ messages in thread
From: Dario Faggioli @ 2012-07-20 9:52 UTC (permalink / raw)
To: Juergen Gross
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel

On Fri, 2012-07-20 at 10:38 +0200, Juergen Gross wrote:
> > Right, that was my thinking on the ride to work also ;-)
> > What about this: 1P and 2P guests really use their vCPUs, but for bigger
> > guests we assume only a fractional usage?
>
> Hmm. I wouldn't be sure about this. I would guess there is a reason a guest has
> more than 1 vcpu. Normally this is because the guest needs more power.
>
Hehe, I agree with both! :-P
Maybe it's better to avoid this kind of mangling for now, and just consider all the vcpus.
When we're tracking the actual load better, we'll see if it could be worthwhile to do something along those lines as well.
What do you think?

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 9:52 ` Dario Faggioli @ 2012-07-20 9:56 ` Juergen Gross 0 siblings, 0 replies; 60+ messages in thread From: Juergen Gross @ 2012-07-20 9:56 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel Am 20.07.2012 11:52, schrieb Dario Faggioli: > On Fri, 2012-07-20 at 10:38 +0200, Juergen Gross wrote: >>> Right, that was my thinking on the ride to work also ;-) >>> What about this: 1P and 2P guests really use their vCPUs, but for bigger >>> guests we assume only a fractional usage? >> >> Hmm. I wouldn't be sure about this. I would guess there is a reason a guest has >> more than 1 vcpu. Normally this is because the guest needs more power. >> > Hehe, I agree with both! :-P > > Maybe it's better to avoid this kind of mangling for now, and just > consider all the vcpus. > > When we'll be tracking the actual load better, we'll see if it could be > worthwhile to do something along that line either. > > What do you think? Yep. Juergen -- Juergen Gross Principal Developer Operating Systems PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 8:26 ` Andre Przywara 2012-07-20 8:38 ` Juergen Gross @ 2012-07-20 9:44 ` Dario Faggioli 2012-07-20 11:47 ` Andre Przywara 1 sibling, 1 reply; 60+ messages in thread From: Dario Faggioli @ 2012-07-20 9:44 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1076 bytes --] On Fri, 2012-07-20 at 10:26 +0200, Andre Przywara wrote: > > I really am not sure what to do here, perhaps treating the two metrics > > more evenly? Or maybe even reverse the logic and give nr_domains more > > weight? > > I replaced the 3 with 1 already, that didn't change so much. I think we > should kind of reverse the importance of node load, since starving for > CPU time is much worse than bad memory latency. I will do some > experiments... > That would be nice. If you happen to have time to put something like "3*nrdomains_diff+memfree_diff" and see how it goes, I'll be happy to include at least that change, even in next posting. After that, let's see what it takes to turn the domain counting into vcpu counting (it shouldn't be too hard). Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
2012-07-20 9:44 ` Dario Faggioli
@ 2012-07-20 11:47 ` Andre Przywara
2012-07-20 12:54 ` Dario Faggioli
0 siblings, 1 reply; 60+ messages in thread
From: Andre Przywara @ 2012-07-20 11:47 UTC (permalink / raw)
To: Dario Faggioli
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel

On 07/20/2012 11:44 AM, Dario Faggioli wrote:
> On Fri, 2012-07-20 at 10:26 +0200, Andre Przywara wrote:
>>> I really am not sure what to do here, perhaps treating the two metrics
>>> more evenly? Or maybe even reverse the logic and give nr_domains more
>>> weight?
>>
>> I replaced the 3 with 1 already, that didn't change so much. I think we
>> should kind of reverse the importance of node load, since starving for
>> CPU time is much worse than bad memory latency. I will do some
>> experiments...
>>
> That would be nice. If you happen to have time to put something like
> "3*nrdomains_diff+memfree_diff" and see how it goes, I'll be happy to
> include at least that change, even in next posting.

I did that. Guests are 2 VCPUs/2GB RAM.
The results looked much better. After 16 guests I get:
# xl vcpu-list | sed -e 1d | sort -n -k 7 | tr -s \ | cut -d\ -f7 | uniq -c
  16 any        (Dom0 had max_vcpus=16)
   4 0-7
   4 8-15
   4 16-23
   4 24-31
   4 32-39
   4 40-47
   4 48-55
   4 56-63
This is the number of VCPUs per node. Also distributes equally.
Memory looked like this:
0: 17280 9969
1: 8192 2268
2: 8192 1690
3: 8192 1754
4: 16384 10049
5: 8192 1879
6: 16384 10947
7: 16368 9267
After another 8 guests:
  16 any
   8 0-7
   4 8-15
   4 16-23
   4 24-31
   8 32-39
   4 40-47
   8 48-55
   8 56-63
still not over-committed.
0: 17280 5846
1: 8192 2266
2: 8192 1690
3: 8192 1752
4: 16384 5925
5: 8192 1877
6: 16384 6824
7: 16368 5142
Finally with all 32 guests:
  12 0-7
   6 8-15
   4 16-23
   4 24-31
  12 32-39
   4 40-47
  12 48-55
  10 56-63
The bigger nodes are overcommitted, while the others have free pCPUs (8 cores per node).
But memory dictates this:
0: 17280 3130
1: 8192 0
2: 8192 485
3: 8192 1747
4: 16384 1803
5: 8192 1876
6: 16384 2701
7: 16368 3081
And the last domain already took memory from multiple nodes:
(XEN) Domain 105 (total: 523255):
(XEN)     Node 0: 162740
(XEN)     Node 1: 52616
(XEN)     Node 2: 307899
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0
All other domains had all their 523255 pages from a single node.
# xl vcpu-list 105
Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
Guest32   105     0    5    -b-      3.1  0-7
Guest32   105     1    2    -b-      0.7  0-7
So without any deeper thinking this looks much better than the original version and possibly good enough for Xen 4.2.
Maybe the automated testing finds some leftovers.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
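[The reversed weighting tested above (load counting three times as much as free memory) is, in the same sketch style as before, essentially a one-line change to the comparison function. As before this is a hypothetical reconstruction -- the field names and the normalization of both differences into [-1, 1] are assumptions, and the thread also discusses replacing the domain count with a vCPU count:]

```c
#include <assert.h>
#include <stdint.h>

struct placement_candidate {
    uint64_t free_memkb;   /* free memory summed over the candidate's nodes */
    int nr_domains;        /* domains already assigned to those nodes */
};

static int sign(double v)
{
    return (v > 0.0) - (v < 0.0);
}

/* Difference of a and b, scaled by the larger of the two, in [-1, 1]. */
static double normalized_diff(double a, double b)
{
    double m = a > b ? a : b;
    return m == 0.0 ? 0.0 : (a - b) / m;
}

/* Load-first comparison: a lightly loaded candidate wins even against a
 * noticeably bigger-memory one, so CPU overcommit is avoided before
 * memory fragmentation is optimised.  Negative means a is preferred. */
int candidate_cmp_load_first(const struct placement_candidate *a,
                             const struct placement_candidate *b)
{
    double freememkb_diff = normalized_diff(b->free_memkb, a->free_memkb);
    double nrdomains_diff = normalized_diff(a->nr_domains, b->nr_domains);

    return sign(3 * nrdomains_diff + freememkb_diff);
}
```

[With these weights, a half-empty node still loses to an idle one, while free memory only breaks ties between equally loaded candidates -- which matches the more even vCPU spread in the results above.]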
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes
2012-07-20 11:47 ` Andre Przywara
@ 2012-07-20 12:54 ` Dario Faggioli
2012-07-20 13:07 ` Andre Przywara
0 siblings, 1 reply; 60+ messages in thread
From: Dario Faggioli @ 2012-07-20 12:54 UTC (permalink / raw)
To: Andre Przywara
Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel

On Fri, 2012-07-20 at 13:47 +0200, Andre Przywara wrote:
> > That would be nice. If you happen to have time to put something like
> > "3*nrdomains_diff+memfree_diff" and see how it goes, I'll be happy to
> > include at least that change, even in next posting.
>
> I did that. Guests are 2 VCPUs/2GB RAM.
>
Thanks!

> The results looked much better.
>
That's very good to hear. :-)

> [snip]
>
> So without any deeper thinking this looks much better than the original
> version and possibly good enough for Xen 4.2.
> Maybe the automated testing finds some leftovers.
>
So, here's what we can do:

1) I can resend the series with this change in it (i.e., the number of domains weighing three times more than free memory);

2) I can resend the series like in 1) _but_ also converting the domain counting into vcpu counting (I checked that, and it's trivial);

3) I can resend the series like in 1) and also send a separate patch turning the domain counting into vcpu counting, and we can apply that after it gets some testing.

Which one do you like better?
Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 12:54 ` Dario Faggioli @ 2012-07-20 13:07 ` Andre Przywara 2012-07-21 1:44 ` Dario Faggioli 0 siblings, 1 reply; 60+ messages in thread From: Andre Przywara @ 2012-07-20 13:07 UTC (permalink / raw) To: Dario Faggioli Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel On 07/20/2012 02:54 PM, Dario Faggioli wrote: > On Fri, 2012-07-20 at 13:47 +0200, Andre Przywara wrote: >>> That would be nice. If you happen to have time to put something like >>> "3*nrdomains_diff+memfree_diff" and see how it goes, I'll be happy to >>> include at least that change, even in next posting. >> >> I did that. Guests are 2 VCPUs/2GB RAM. >> > Thanks! > >> The results looked much better. >> > That's very good to hear. :-) > >> >> [snip] >> >> So without any deeper thinking this looks much better than the original >> version and possibly good enough for Xen 4.2. >> Maybe the automated testing finds some leftovers. >> > So, here's what we can do: > > 1) I can resend the series with this change in it (i.e., number of > domains weighting 3 times more than free memory); > > 2) I can resend the series like in 1) _but_ while also converting the > domain counting in vcpu counting (I checked that and it's trivial); > > 3) I can resend the series like in 1) and also send a separate patch > turning the domain counting in vcpu counting, and we can apply that > after it get some testing; > > Which one you like better? I'd prefer 2). To me it makes more sense and I can test it over the weekend and can send my acked-by on it. Regards, Andre. -- Andre Przywara AMD-OSRC (Dresden) Tel: x29712 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-20 13:07 ` Andre Przywara @ 2012-07-21 1:44 ` Dario Faggioli 0 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-21 1:44 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 1077 bytes --] On Fri, 2012-07-20 at 15:07 +0200, Andre Przywara wrote: > > So, here's what we can do: > > > > 1) I can resend the series with this change in it (i.e., number of > > domains weighting 3 times more than free memory); > > > > 2) I can resend the series like in 1) _but_ while also converting the > > domain counting in vcpu counting (I checked that and it's trivial); > > > > 3) I can resend the series like in 1) and also send a separate patch > > turning the domain counting in vcpu counting, and we can apply that > > after it get some testing; > > > > Which one you like better? > > I'd prefer 2). To me it makes more sense and I can test it over the > weekend and can send my acked-by on it. > Ok, done. Sorry if I am a little bit late. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 2 of 3 v5/leftover] libxl: have NUMA placement deal with cpupools 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli 2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli @ 2012-07-16 17:13 ` Dario Faggioli 2012-07-16 17:13 ` [PATCH 3 of 3 v5/leftover] Some automatic NUMA placement documentation Dario Faggioli 2012-07-20 11:07 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl David Vrabel 3 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-16 17:13 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson

In such a way that only the cpus belonging to the cpupool of the domain being placed are considered for the placement itself. This happens by filtering out of the placement candidates all the nodes in which the cpupool does not have any cpu. After that -- as cpupools do not necessarily follow NUMA node boundaries -- we also make sure only the actual cpus that are part of the pool are considered when counting how many processors a placement candidate provides.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v4:
 * fixed a bug in the cpupool filtering logic (also the number of nodes should be affected by that)

Changes from v3:
 * fixed a double free on non-NUMA systems (namely, when just 1 node exists).
 * Changed the filtering logic: now only the meaningful combinations and nodemaps are being considered.

Changes from v2:
 * fixed typos in comments.
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -132,25 +132,29 @@ static int numa_cmpf(const void *v1, con
 }

 /* The actual automatic NUMA placement routine */
-static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+static int numa_place_domain(libxl__gc *gc, uint32_t domid,
+                             libxl_domain_build_info *info)
 {
     int nr_candidates = 0;
     libxl__numa_candidate *candidates = NULL;
     libxl_bitmap candidate_nodemap;
-    libxl_cpupoolinfo *pinfo;
-    int nr_pools, rc = 0;
+    libxl_cpupoolinfo cpupool_info;
+    int i, cpupool, rc = 0;
     uint32_t memkb;

     libxl_bitmap_init(&candidate_nodemap);

-    /* First of all, if cpupools are in use, better not to mess with them */
-    pinfo = libxl_list_cpupool(CTX, &nr_pools);
-    if (!pinfo)
-        return ERROR_FAIL;
-    if (nr_pools > 1) {
-        LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
-        goto out;
-    }
+    /*
+     * Extract the cpumap from the cpupool the domain belongs to. In fact,
+     * it only makes sense to consider the cpus/nodes that are in there
+     * for placement.
+     */
+    rc = cpupool = libxl__domain_cpupool(gc, domid);
+    if (rc < 0)
+        return rc;
+    rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool);
+    if (rc)
+        return rc;

     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
@@ -162,7 +166,8 @@ static int numa_place_domain(libxl__gc *
     /* Find all the candidates with enough free memory and at least
      * as much pcpus as the domain has vcpus. */
-    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
+    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus,
+                                    0, 0, &cpupool_info.cpumap,
                                     &candidates, &nr_candidates);
     if (rc)
         goto out;
@@ -188,13 +193,20 @@ static int numa_place_domain(libxl__gc *
     if (rc)
         goto out;

+    /* Avoid trying to set the affinity to cpus that might be in the
+     * nodemap but not in our cpupool. */
+    libxl_for_each_set_bit(i, info->cpumap) {
+        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
+            libxl_bitmap_reset(&info->cpumap, i);
+    }
+
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
                 "%"PRIu32" KB free selected", candidates[0].nr_nodes,
                 candidates[0].nr_cpus, candidates[0].free_memkb / 1024);

 out:
     libxl_bitmap_dispose(&candidate_nodemap);
-    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }

@@ -217,7 +229,7 @@ int libxl__build_pre(libxl__gc *gc, uint
      * whatever that turns out to be. */
     if (libxl_bitmap_is_full(&info->cpumap)) {
-        int rc = numa_place_domain(gc, info);
+        int rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
     }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2289,14 +2289,17 @@ typedef struct {
 * least that amount of free memory and that number of cpus, respectively. If
 * min_free_memkb and/or min_cpus are 0, the candidates' free memory and number
 * of cpus won't be checked at all, which means a candidate will always be
- * considered suitable wrt the specific constraint. cndts is where the list of
- * exactly nr_cndts candidates is returned. Note that, in case no candidates
- * are found at all, the function returns successfully, but with nr_cndts equal
- * to zero.
+ * considered suitable wrt the specific constraint. suitable_cpumap is useful
+ * for specifying that we want only the cpus in that mask to be considered
+ * while generating placement candidates (for example because of cpupools).
+ * cndts is where the list of exactly nr_cndts candidates is returned. Note
+ * that, in case no candidates are found at all, the function returns
+ * successfully, but with nr_cndts equal to zero.
 */
 _hidden int libxl__get_numa_candidates(libxl__gc *gc,
                                        uint32_t min_free_memkb, int min_cpus,
                                        int min_nodes, int max_nodes,
+                                       const libxl_bitmap *suitable_cpumap,
                                        libxl__numa_candidate *cndts[],
                                        int *nr_cndts);

 /* Initialization, allocation and deallocation for placement candidates */
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -107,30 +107,48 @@ static int comb_next(comb_iter_t it, int
 /* NUMA automatic placement (see libxl_internal.h for details) */

 /*
- * This function turns a k-combination iterator into a node map.
- * This means the bits in the node map corresponding to the indexes
- * of the given combination are the ones that will be set.
- * For example, if the iterator represents the combination { 0, 2, 4},
- * the node map will have bits #0, #2 and #4 set.
+ * This function turns a k-combination iterator into a node map,
+ * given another map, telling us which nodes should be considered.
+ *
+ * This means the bits that are set in suitable_nodemap and that
+ * correspond to the indexes of the given combination are the ones
+ * that will be set in nodemap.
+ *
+ * For example, given a fully set suitable_nodemap, if the iterator
+ * represents the combination { 0, 2, 4}, nodemap will have bits #0,
+ * #2 and #4 set.
+ * On the other hand, if, say, suitable_nodemap=01011011, the same
+ * iterator will cause bits #1, #4 and #7 of nodemap to be set.
 */
-static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *suitable_nodemap,
+                             libxl_bitmap *nodemap, int k)
 {
-    int i;
+    int i, m = 0, n = 0;

     libxl_bitmap_set_none(nodemap);
-    for (i = 0; i < k; i++)
-        libxl_bitmap_set(nodemap, it[i]);
+    libxl_for_each_set_bit(i, *suitable_nodemap) {
+        /* Check whether the n-th set bit of suitable_nodemap
+         * matches with the m-th element of the iterator (and,
+         * only if it does, advance to the next one) */
+        if (m < k && n == it[m]) {
+            libxl_bitmap_set(nodemap, i);
+            m++;
+        }
+        n++;
+    }
 }

 /* Retrieve the number of cpus that the nodes that are part of the nodemap
- * span. */
+ * span and are also set in suitable_cpumap. */
 static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                              const libxl_bitmap *suitable_cpumap,
                               const libxl_bitmap *nodemap)
 {
     int i, nodes_cpus = 0;

     for (i = 0; i < nr_cpus; i++) {
-        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+        if (libxl_bitmap_test(suitable_cpumap, i) &&
+            libxl_bitmap_test(nodemap, tinfo[i].node))
             nodes_cpus++;
     }
     return nodes_cpus;
@@ -236,15 +254,17 @@ static int cpus_per_node_count(libxl_cpu
 int libxl__get_numa_candidates(libxl__gc *gc,
                                uint32_t min_free_memkb, int min_cpus,
                                int min_nodes, int max_nodes,
+                               const libxl_bitmap *suitable_cpumap,
                                libxl__numa_candidate *cndts[], int *nr_cndts)
 {
     libxl__numa_candidate *new_cndts = NULL;
     libxl_cputopology *tinfo = NULL;
     libxl_numainfo *ninfo = NULL;
-    int nr_nodes = 0, nr_cpus = 0;
-    libxl_bitmap nodemap;
+    int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
+    libxl_bitmap suitable_nodemap, nodemap;
     int array_size, rc;

+    libxl_bitmap_init(&suitable_nodemap);
     libxl_bitmap_init(&nodemap);

     /* Get platform info and prepare the map for testing the combinations */
@@ -262,6 +282,15 @@ int libxl__get_numa_candidates(libxl__gc
     if (rc)
         goto out;

+    /* Allocate and prepare the map of the nodes that can be utilized for
+     * placement, based on the map of suitable cpus. */
+    rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap, 0);
+    if (rc)
+        goto out;
+    rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap);
+    if (rc)
+        goto out;
+
     /*
      * If the minimum number of NUMA nodes is not explicitly specified
      * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
@@ -279,10 +308,14 @@ int libxl__get_numa_candidates(libxl__gc
     else
         min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
     }
-    if (min_nodes > nr_nodes)
-        min_nodes = nr_nodes;
-    if (!max_nodes || max_nodes > nr_nodes)
-        max_nodes = nr_nodes;
+    /* We also need to be sure we do not exceed the number of
+     * nodes we are allowed to use. */
+    nr_suit_nodes = libxl_bitmap_count_set(&suitable_nodemap);
+
+    if (min_nodes > nr_suit_nodes)
+        min_nodes = nr_suit_nodes;
+    if (!max_nodes || max_nodes > nr_suit_nodes)
+        max_nodes = nr_suit_nodes;
     if (min_nodes > max_nodes) {
         rc = ERROR_INVAL;
         goto out;
@@ -290,7 +323,7 @@ int libxl__get_numa_candidates(libxl__gc
     /* Initialize the local storage for the combinations */
     *nr_cndts = 0;
-    array_size = nr_nodes;
+    array_size = nr_suit_nodes;
     GCNEW_ARRAY(new_cndts, array_size);

     /* Generate all the combinations of any size from min_nodes to
@@ -306,12 +339,16 @@ int libxl__get_numa_candidates(libxl__gc
      * amount of free memory and number of cpus) and it becomes an actual
      * placement candidate iff it passes the check. */
-    for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
-         comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+    for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes);
+         comb_ok;
+         comb_ok = comb_next(comb_iter, nr_suit_nodes, min_nodes)) {
         uint32_t nodes_free_memkb;
         int nodes_cpus;

-        comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+        /* Get the nodemap for the combination, only considering
+         * suitable nodes. */
+        comb_get_nodemap(comb_iter, &suitable_nodemap,
+                         &nodemap, min_nodes);

         /* If there is not enough memory in this combination, skip it
          * and go generating the next one... */
@@ -320,7 +357,8 @@ int libxl__get_numa_candidates(libxl__gc
             continue;

         /* And the same applies if this combination is short in cpus */
-        nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
+        nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, suitable_cpumap,
+                                        &nodemap);
         if (min_cpus && nodes_cpus < min_cpus)
             continue;

@@ -336,7 +374,7 @@ int libxl__get_numa_candidates(libxl__gc
          * new candidate). */
         if (*nr_cndts == array_size)
-            array_size += nr_nodes;
+            array_size += nr_suit_nodes;
         GCREALLOC_ARRAY(new_cndts, array_size);

         libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
@@ -345,12 +383,13 @@ int libxl__get_numa_candidates(libxl__gc
         new_cndts[*nr_cndts].nr_domains =
                                     nodemap_to_nr_domains(gc, tinfo, &nodemap);
         new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
-        new_cndts[*nr_cndts].nr_nodes = min_nodes;
+        new_cndts[*nr_cndts].nr_nodes = libxl_bitmap_count_set(&nodemap);
         new_cndts[*nr_cndts].nr_cpus = nodes_cpus;

         LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
             "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
-            min_nodes, new_cndts[*nr_cndts].nr_cpus,
+            new_cndts[*nr_cndts].nr_nodes,
+            new_cndts[*nr_cndts].nr_cpus,
             new_cndts[*nr_cndts].free_memkb / 1024);

         (*nr_cndts)++;
@@ -360,6 +399,7 @@ int libxl__get_numa_candidates(libxl__gc
     *cndts = new_cndts;

 out:
+    libxl_bitmap_dispose(&suitable_nodemap);
     libxl_bitmap_dispose(&nodemap);
     libxl_cputopology_list_free(tinfo, nr_cpus);
     libxl_numainfo_list_free(ninfo, nr_nodes);

^ permalink raw reply	[flat|nested] 60+ messages in thread
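An aside on the comb_get_nodemap() change in the patch above: the remapping it performs can be seen in isolation in the following standalone sketch, which uses plain unsigned masks instead of libxl_bitmap (the names here are illustrative, not libxl's). The point is that the k-combination indices now select among the set bits of the suitable map, rather than among raw node numbers.

```c
#include <assert.h>

/* Map combination indices {it[0], ..., it[k-1]} onto the positions of
 * the set bits of 'suitable', mirroring what comb_get_nodemap() does
 * with libxl bitmaps. */
static unsigned comb_to_nodemap(const int *it, int k, unsigned suitable)
{
    unsigned nodemap = 0;
    int i, m = 0, n = 0;

    for (i = 0; i < (int)(sizeof(suitable) * 8); i++) {
        if (!(suitable & (1u << i)))
            continue;
        /* i is the n-th set bit of 'suitable': take it if it matches
         * the m-th element of the combination */
        if (m < k && n == it[m]) {
            nodemap |= 1u << i;
            m++;
        }
        n++;
    }
    return nodemap;
}
```

With all nodes suitable, the combination { 0, 2, 4 } selects nodes 0, 2 and 4; with only nodes 1, 3, 4, 6 and 7 suitable, the same combination selects nodes 1, 4 and 7, matching the example in the patch's comment.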
* [PATCH 3 of 3 v5/leftover] Some automatic NUMA placement documentation 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli 2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli 2012-07-16 17:13 ` [PATCH 2 of 3 v5/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli @ 2012-07-16 17:13 ` Dario Faggioli 2012-07-20 11:07 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl David Vrabel 3 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-16 17:13 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,89 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA means that the memory access times of a program running on a CPU depend
+on the relative distance between that CPU and that memory. In fact, most
+NUMA systems are built in such a way that each processor has its local
+memory, on which it can operate very fast. On the other hand, getting and
+storing data from and on remote memory (that is, memory local to some other
+processor) is quite a bit more complex and slower. On these machines, a NUMA
+node is usually defined as a set of processor cores (typically a physical
+CPU package) and the memory directly attached to the set of cores.
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by
+assigning to each of its domains a "node affinity", i.e., a set of NUMA
+nodes of the host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non node-local memory locations is very high, and the performance
+degradation is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing, the only differences being in
+the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify, directly in a
+domain's config file, the cpus on which the domain should be created and
+scheduled. This affects NUMA placement and memory accesses, as the
+hypervisor constructs the node affinity of a VM based directly on its CPU
+affinity when it is created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won't be able to guarantee the locality of their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools (available in the upcoming release of Xen, 4.2). Again, this could
+be "The Right Answer" for many needs and occasions, but has to be carefully
+considered and set up by hand.
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries to figure
+out on its own on which node(s) the domain could fit best. It is worthwhile
+noting that optimally fitting a set of VMs on the NUMA nodes of a host is
+an incarnation of the Bin Packing Problem. In fact, the various VMs with
+different memory sizes are the items to be packed, and the host nodes are
+the bins. That problem is known to be NP-hard, so we resort to heuristics.
+
+The first thing to do is to find a node, or even a set of nodes, with
+enough free memory and enough physical CPUs to accommodate the new domain.
+The idea is to find a spot for the domain with at least as much free memory
+as it has configured, and as many pCPUs as it has vCPUs. After that, the
+actual decision on which solution to go for happens according to the
+following heuristics:
+
+  * candidates involving fewer nodes come first. In case two (or more)
+    candidates span the same number of nodes,
+  * the amount of free memory and the number of domains assigned to the
+    candidates are considered. In doing that, candidates with a greater
+    amount of free memory and fewer assigned domains are preferred, with
+    free memory "weighting" three times as much as the number of domains.
+
+Giving preference to candidates with fewer nodes ensures better performance
+for the guest, as it avoids spreading its memory among different nodes.
+Favouring the nodes that have the largest amounts of free memory helps
+keep memory fragmentation small, from a system-wide perspective. However,
+if more candidates fulfil these criteria to roughly the same extent,
+considering the number of domains the candidates are "hosting" helps
+balance the load on the various nodes.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement simply because libxl does it
+internally. No API is provided (yet) for interacting with this feature and
+modifying the library's behaviour regarding automatic placement; it just
+happens by default if no affinity is specified (as it is with xm/xend).
+
+For actually looking at, and maybe tweaking, the mechanism and the
+algorithms it uses, everything is implemented as a set of libxl internal
+interfaces and facilities. Look at the comment "Automatic NUMA placement"
+in libxl\_internal.h.
+
+Note this may change in future versions of Xen/libxl.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli ` (2 preceding siblings ...) 2012-07-16 17:13 ` [PATCH 3 of 3 v5/leftover] Some automatic NUMA placement documentation Dario Faggioli @ 2012-07-20 11:07 ` David Vrabel 2012-07-20 11:43 ` Andre Przywara 3 siblings, 1 reply; 60+ messages in thread From: David Vrabel @ 2012-07-20 11:07 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel On 16/07/12 18:13, Dario Faggioli wrote: > Hello again, > > This is a new version (fixing a small bug) of this series: > <patchbomb.1341932613@Solace>, which in turn was a repost of what remained > un-committed of my NUMA placement series. Whilst I don't think this should prevent the merging of this series now, I think there needs to be some sort of unit tests for the placement algorithm before the 4.2 release. I think the tests should test a representative sample of common system configurations, available memory and VM memory requirements. I'd suggest you'd be looking at 100s of test cases here for reasonable coverage. One method would be to start with various 'empty' systems and pile as many differently sized VMs as will fit. You may want both a fixed set of reproducible tests and random ones. Some means of automatically verifying the quality of the result at each stage would be essential. Without a comprehensive set of tests, tweaking the algorithm/heuristics has a high risk of introducing regressions. Having the ability to specify specific setups so you can (e.g.) run tests when someone reports sub-optimal performance with their particular setup would also be useful. David ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 11:07 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl David Vrabel @ 2012-07-20 11:43 ` Andre Przywara 2012-07-20 12:00 ` Ian Campbell 2012-07-20 12:14 ` David Vrabel 0 siblings, 2 replies; 60+ messages in thread From: Andre Przywara @ 2012-07-20 11:43 UTC (permalink / raw) To: David Vrabel Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, Dario Faggioli On 07/20/2012 01:07 PM, David Vrabel wrote: > On 16/07/12 18:13, Dario Faggioli wrote: >> Hello again, >> >> This is a new version (fixing a small bug) of this series: >> <patchbomb.1341932613@Solace>, which in turn was a repost of what remained >> un-committed of my NUMA placement series. > > Whilst I don't think this should prevent the merging of this series now, > I think there needs to be some sort of unit tests for the placement > algorithm before the 4.2 release. > > I think the tests should test a representative sample of common system > configurations, available memory and VM memory requirements. I'd > suggest you'd be looking at 100s of test cases here for reasonable coverage. > > One method would be to start with various 'empty' systems and pile as > many differently sized VMs as will fit. You may want both fixed test of > reproducible tests and random ones. That sounds good. Though I don't have much automated testing experience with Xen I could chime in with two things: 1. If we focus on placement only, I have good experience with ttylinux.iso. Those live distros can be killed easily at any time and you just need one instance of the .iso file on the disk. 2. # xl vcpu-list | sed -e 1d | sort -n -k 7 | tr -s \ | cut -d\ -f7 | uniq -c This gives the number of VCPUs per node (sort of ;-) > Some means of automatically verifying the quality of the result at each > stage would be essential. This "automatically verifying the quality of the result" doesn't sound trivial. 
If we knew this exactly, we could just code this into the algorithm, right? Also it seems to depend on the setup. Maybe we just collect some test output first and then try to come up with quality algorithms? Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 11:43 ` Andre Przywara @ 2012-07-20 12:00 ` Ian Campbell 2012-07-20 12:08 ` Ian Campbell 2012-07-23 10:23 ` Dario Faggioli 2012-07-20 12:14 ` David Vrabel 1 sibling, 2 replies; 60+ messages in thread From: Ian Campbell @ 2012-07-20 12:00 UTC (permalink / raw) To: Andre Przywara Cc: Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel, Dario Faggioli On Fri, 2012-07-20 at 12:43 +0100, Andre Przywara wrote: > On 07/20/2012 01:07 PM, David Vrabel wrote: > > On 16/07/12 18:13, Dario Faggioli wrote: > >> Hello again, > >> > >> This is a new version (fixing a small bug) of this series: > >> <patchbomb.1341932613@Solace>, which in turn was a repost of what remained > >> un-committed of my NUMA placement series. > > > > Whilst I don't think this should prevent the merging of this series now, > > I think there needs to be some sort of unit tests for the placement > > algorithm before the 4.2 release. > > > > I think the tests should test a representative sample of common system > > configurations, available memory and VM memory requirements. I'd > > suggest you'd be looking at 100s of test cases here for reasonable coverage. > > > > One method would be to start with various 'empty' systems and pile as > > many differently sized VMs as will fit. You may want both fixed test of > > reproducible tests and random ones. > > That sounds good. Though I don't have much automated testing experience > with Xen I could chime in with two things: > > 1. If we focus on placement only, I have good experience with > ttylinux.iso. Those live distros can be killed easily at any time and > you just need one instance of the .iso file on the disk. > 2.
# xl vcpu-list | sed -e 1d | sort -n -k 7 | tr -s \ | cut -d\ -f7 | > uniq -c > This gives the number of VCPUs per node (sort of ;-) Ideally you wouldn't need a xen system at all for this, you just want a database of input configurations (host NUMA setup, existing guest layout) and hypothetical new guests and their mapping to the expected output. You can then feed these offline to the algorithm and validate that the output is the expected one. > > Some means of automatically verifying the quality of the result at each > > stage would be essential. > > This "automatically verifying the quality of the result" doesn't sound > trivial. If we knew this exactly, we could just code this into the > algorithm, right? Well, just knowing something has changed is useful, even if you can't decide in an automated way if it is the before or after case which is the good/best one. Even if we just blindly store the current result today as the expected one then when someone makes a tweak we can manually inspect the differences in the output and say "yes, that seems sane" or "no, that's mad". The need to do this will be encoded in the diff which makes the change (since you'd have to patch the test suite...) > Also it seems to depend on the setup. Maybe we just collect some test > output first and then try to come up with quality algorithms? > > Regards, > Andre. > > ^ permalink raw reply [flat|nested] 60+ messages in thread
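The offline database idea sketched here could look something like the following. Everything in this snippet is hypothetical — a real harness would call into the actual libxl placement code, whereas the toy chooser below just picks the single node with the most free memory — but it shows the shape of a table of inputs paired with stored expected outputs:

```c
#include <assert.h>

#define MAX_NODES 8

/* One entry of the hypothetical test database: a host state, a new
 * guest, and the expected placement recorded for regression checking. */
struct testcase {
    int nr_nodes;
    long free_memkb[MAX_NODES]; /* host state: free memory per node */
    long need_memkb;            /* memory the hypothetical new guest needs */
    int expected_node;          /* stored expected answer, -1 if none fits */
};

/* Toy stand-in for the real placement algorithm: the node with the
 * most free memory among those that fit, or -1 if none fits. */
static int toy_place(const struct testcase *t)
{
    int i, best = -1;

    for (i = 0; i < t->nr_nodes; i++)
        if (t->free_memkb[i] >= t->need_memkb &&
            (best < 0 || t->free_memkb[i] > t->free_memkb[best]))
            best = i;
    return best;
}

/* Run every case and report how many diverge from the stored answers */
static int run_cases(const struct testcase *cases, int n)
{
    int i, failures = 0;

    for (i = 0; i < n; i++)
        if (toy_place(&cases[i]) != cases[i].expected_node)
            failures++;
    return failures;
}
```

When the heuristics are tweaked, a non-zero failure count flags exactly which stored expectations changed, which can then be inspected by hand, as suggested above.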
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 12:00 ` Ian Campbell @ 2012-07-20 12:08 ` Ian Campbell 2012-07-23 10:38 ` Dario Faggioli 2012-07-23 10:23 ` Dario Faggioli 1 sibling, 1 reply; 60+ messages in thread From: Ian Campbell @ 2012-07-20 12:08 UTC (permalink / raw) To: Andre Przywara Cc: Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel, Dario Faggioli On Fri, 2012-07-20 at 13:00 +0100, Ian Campbell wrote: > Even if we just blindly store the current result today as the expected > one then when someone makes a tweak we can manually inspect the > differences in the output and say "yes, that seems sane" or "no, that's > mad". The need to do this will be encoded in the diff which makes the > change (since you'd have to patch the test suite...) BTW, this is mostly what we do for tools/libxl/check-xl-{disk,vif}-parse. and what we should do for tools/pygrub/examples/* but so far we only collect examples... Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 12:08 ` Ian Campbell @ 2012-07-23 10:38 ` Dario Faggioli 2012-07-23 10:42 ` Ian Campbell 0 siblings, 1 reply; 60+ messages in thread From: Dario Faggioli @ 2012-07-23 10:38 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel [-- Attachment #1.1: Type: text/plain, Size: 1793 bytes --] On Fri, 2012-07-20 at 13:08 +0100, Ian Campbell wrote: > On Fri, 2012-07-20 at 13:00 +0100, Ian Campbell wrote: > > Even if we just blindly store the current result today as the expected > > one then when someone makes a tweak we can manually inspect the > > differences in the output and say "yes, that seems sane" or "no, that's > > mad". The need to do this will be encoded in the diff which makes the > > change (since you'd have to patch the test suite...) > > BTW, this is mostly what we do for > tools/libxl/check-xl-{disk,vif}-parse. > That is very nice, although our case here is a bit different. I can put together a similar script, and I can also think about some test cases and methods to check how things are going. The tricky part is I _can't_ run xl just like we do in that file, as there's not a specific command for running just the placement, and I can't invoke `xl create', can I? :-P So, it seems I should write a sort of C program from which to trigger the libxl domain creation, after taking care of intercepting and properly replacing quite a bit of functions to either nops or (e.g., in case of libxl_get_cpu_topology(), etc) proper reads of the test database. Also, this involves quite a bit of libxl internal functions. Will I be able to intercept them too? Not sure, I need to check... How does that sound? I'll think about it and try to implement this in the next few days; any thoughts that may come to your mind, do feel free to share.
:-D Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-23 10:38 ` Dario Faggioli @ 2012-07-23 10:42 ` Ian Campbell 2012-07-23 15:31 ` Dario Faggioli 0 siblings, 1 reply; 60+ messages in thread From: Ian Campbell @ 2012-07-23 10:42 UTC (permalink / raw) To: Dario Faggioli Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel On Mon, 2012-07-23 at 11:38 +0100, Dario Faggioli wrote: > On Fri, 2012-07-20 at 13:08 +0100, Ian Campbell wrote: > > On Fri, 2012-07-20 at 13:00 +0100, Ian Campbell wrote: > > > Even if we just blindly store the current result today as the expected > > > one then when someone makes a tweak we can manually inspect the > > > differences in the output and say "yes, that seems sane" or "no, that's > > > mad". The need to do this will be encoded in the diff which makes the > > > change (since you'd have to patch the test suite...) > > > > BTW, this is mostly what we do for > > tools/libxl/check-xl-{disk,vif}-parse. > > > That is very nice, although our case here is a bit different. I can put > together a similar script, and I can also think about some test cases > and methods to check how things are going. > > The tricky part is I _can't_ run xl just like we do in that file, as > there's not a specific command for running just the placement, and I > can't invoke `cl create', can I? :-P > > So, it seems I should write a sort of C program from which to trigger > the libxl domain creation, after taking care of intercepting and > properly replacing quite a bit of functions to either nops or (e.g., in > case of libxl_get_cpu_topology(), etc) proper reads of the test > database. Also, this involves quite a bit of libxl internal function. > Will I be able to intercept them too? Not sure, I need to check... > > How does that sound? I'll think about and try to implement this in next > days, any thoughts that may come to your mind, do feel free to > share. 
:-D I think it would be fine for a test program such as this to make use of libxl internals and "look behind the curtain" as it were and call the core of the placement algorithm directly with falsified inputs. However I wouldn't prioritise this over any outstanding repost of the series. Ian. -- Ian Campbell Current Noise: Judas Priest - Painkiller After they got rid of capital punishment, they had to hang twice as many people as before. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-23 10:42 ` Ian Campbell @ 2012-07-23 15:31 ` Dario Faggioli 0 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-23 15:31 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel [-- Attachment #1.1: Type: text/plain, Size: 1055 bytes --] On Mon, 2012-07-23 at 11:42 +0100, Ian Campbell wrote: > > How does that sound? I'll think about and try to implement this in next > > days, any thoughts that may come to your mind, do feel free to > > share. :-D > > I think it would be fine for a test program such as this to make use of > libxl internals and "look behind the curtain" as it were and call the > core of the placement algorithm directly with falsified inputs. > Ok, so, IIUC, I'll do something like "#include <libxl_internal.h>" and call my functions directly. Sounds reasonable. > However I wouldn't prioritise this over any outstanding repost of the > series. > Me neither, and I won't do that... I was just hoping for that not to be necessary, I guess! :-P :-P Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 12:00 ` Ian Campbell 2012-07-20 12:08 ` Ian Campbell @ 2012-07-23 10:23 ` Dario Faggioli 1 sibling, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-23 10:23 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, David Vrabel [-- Attachment #1.1: Type: text/plain, Size: 1760 bytes --] On Fri, 2012-07-20 at 13:00 +0100, Ian Campbell wrote: > On Fri, 2012-07-20 at 12:43 +0100, Andre Przywara wrote: > > On 07/20/2012 01:07 PM, David Vrabel wrote: > > > On 16/07/12 18:13, Dario Faggioli wrote: > > >> Hello again, > > >> > > > I think the tests should test a representative sample of common system > > > configurations, available memory and VM memory requirements. I'd > > > suggest you'd be looking at 100s of test cases here for reasonable coverage. > > > > > > One method would be to start with various 'empty' systems and pile as > > > many differently sized VMs as will fit. You may want both fixed test of > > > reproducible tests and random ones. > > > > 1. If we focus on placement only, I have good experience with > > ttylinux.iso. Those live distros can be killed easily at any time and > > you just need one instance of the .iso file on the disk. > > 2. # xl vcpu-list | sed -e 1d | sort -n -k 7 | tr -s \ | cut -d\ -f7 | > > uniq -c > > This gives the number of VCPUs per node (sort of ;-) > > Ideally you wouldn't need a xen system at all for this, you just want a > database of input configurations (host NUMA setup, existing guest > layout) and hypothetical new guests and their mapping to the expected > output. You can then feed these offline to the algorithmn and validate > that the output is the expected one. > Ok, I see. I like this and I think it should be done. I'll look into it. 
Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl 2012-07-20 11:43 ` Andre Przywara 2012-07-20 12:00 ` Ian Campbell @ 2012-07-20 12:14 ` David Vrabel 1 sibling, 0 replies; 60+ messages in thread From: David Vrabel @ 2012-07-20 12:14 UTC (permalink / raw) To: Andre Przywara Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson, xen-devel, Dario Faggioli On 20/07/12 12:43, Andre Przywara wrote: > On 07/20/2012 01:07 PM, David Vrabel wrote: >> On 16/07/12 18:13, Dario Faggioli wrote: >>> Hello again, >>> >>> This is a new version (fixing a small bug) of this series: >>> <patchbomb.1341932613@Solace>, which in turn was a repost of what >>> remained >>> un-committed of my NUMA placement series. >> >> Whilst I don't think this should prevent the merging of this series now, >> I think there needs to be some sort of unit tests for the placement >> algorithm before the 4.2 release. >> >> I think the tests should test a representative sample of common system >> configurations, available memory and VM memory requirements. I'd >> suggest you'd be looking at 100s of test cases here for reasonable >> coverage. >> >> One method would be to start with various 'empty' systems and pile as >> many differently sized VMs as will fit. You may want both fixed test of >> reproducible tests and random ones. > > That sounds good. Though I don't have much automated testing experience > with Xen I could chime in with two things: > > 1. If we focus on placement only, I have good experience with > ttylinux.iso. Those live distros can be killed easily at any time and > you just need one instance of the .iso file on the disk. > 2. # xl vcpu-list | sed -e 1d | sort -n -k 7 | tr -s \ | cut -d\ -f7 | > uniq -c > This gives the number of VCPUs per node (sort of ;-) I'm talking about unit tests of the algorithm, not system tests on real hardware. i.e., some sort of wrapper around the C functions that call them with various sets of input data.
>> Some means of automatically verifying the quality of the result at each >> stage would be essential. > > This "automatically verifying the quality of the result" doesn't sound > trivial. If we knew this exactly, we could just code this into the > algorithm, right? I think it is much easier to verify the result than generate the solution. The quality score is basically the percentage of memory that ended up on the expected number of nodes -- this is easy to calculate. If the algorithm is tweaked and the resulting quality score takes a nose dive this would give a very quick indication that the tweak may be broken. Conversely, if the quality goes up, the tweak is more likely to be good. David ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-17 15:55 ` Ian Jackson 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli @ 2012-07-17 15:59 ` Ian Campbell 2012-07-17 18:01 ` Ian Jackson 2012-07-17 22:15 ` Dario Faggioli 2 siblings, 1 reply; 60+ messages in thread From: Ian Campbell @ 2012-07-17 15:59 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote: > > +/* Subtract two values and translate the result in [0, 1] */ > > +static double normalized_diff(double a, double b) > > +{ > > +#define max(a, b) (a > b ? a : b) > > + if (!a && a == b) > > + return 0.0; > > + return (a - b) / max(a, b); > > +} > > 1. This macro max() should be in libxl_internal.h. > 2. It should be MAX so people are warned it's a macro > 3. It should have all the necessary ()s for macro precedence safety You probably also want to do the necessary tricks to avoid multiple evaluation of the arguments? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-17 15:59 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Ian Campbell @ 2012-07-17 18:01 ` Ian Jackson 0 siblings, 0 replies; 60+ messages in thread From: Ian Jackson @ 2012-07-17 18:01 UTC (permalink / raw) To: Ian Campbell Cc: Andre Przywara, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel, Dario Faggioli Ian Campbell writes ("Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote: > > 1. This macro max() should be in libxl_internal.h. > > 2. It should be MAX so people are warned it's a macro > > 3. It should have all the necessary ()s for macro precedence safety > > You probably also want to do the necessary tricks to avoid multiple > evaluation of the arguments? Those tricks are a pain. I think in general macros are `allowed' to do this provided their names are SHOUTY. That's why it should be called `MAX'. Ian. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes 2012-07-17 15:55 ` Ian Jackson 2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli 2012-07-17 15:59 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Ian Campbell @ 2012-07-17 22:15 ` Dario Faggioli 2 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-17 22:15 UTC (permalink / raw) To: Ian Jackson Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 5699 bytes --] On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote: > Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > > If a domain does not have a VCPU affinity, try to pin it automatically to some > > PCPUs. This is done taking into account the NUMA characteristics of the host. > > In fact, we look for a combination of host's NUMA nodes with enough free memory > > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. > > Thanks for this admirably clear patch. > Thanks to you for looking at it. > Can you please rewrap your commit messages to around 70 lines ? Many > VCSs indent them in the log in some situations, and as you see here > mail programs indent them when quoting too. > That should have happened in the first place. As I'm reposting, I'll take extra attention to that, sorry. > > +/* Subtract two values and translate the result in [0, 1] */ > > +static double normalized_diff(double a, double b) > > +{ > > +#define max(a, b) (a > b ? a : b) > > + if (!a && a == b) > > + return 0.0; > > + return (a - b) / max(a, b); > > +} > > 1. This macro max() should be in libxl_internal.h. > 2. It should be MAX so people are warned it's a macro > 3. It should have all the necessary ()s for macro precedence safety > Ok, will do that. 
> > + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); > > + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); > > + > > + if (c1->nr_nodes != c2->nr_nodes) > > + return c1->nr_nodes - c2->nr_nodes; > > + > > + return sign(3*freememkb_diff + nrdomains_diff); > > The reason you need what sign() does is that you need to convert from > double to int, I guess. > Mostly for that and to make what's happening even more clear. > > + > > + /* > > + * Check if the domain has any CPU affinity. If not, try to build up one. > > + * In case numa_place_domain() find at least a suitable candidate, it will > > + * affect info->cpumap accordingly; if it does not, it just leaves it > > + * as it is. This means (unless some weird error manifests) the subsequent > > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > > + * whatever that turns out to be. > > + */ > > + if (libxl_bitmap_is_full(&info->cpumap)) { > > + int rc = numa_place_domain(gc, info); > > + if (rc) > > + return rc; > > + } > > I guess it would be preferable to do this only if the bitmap was full > by default, so that setting the bitmap explicitly to all cpus still > works. > > I'm not sure that that's essential to have, though. > I was thinking about this right in these days, and I think it should be exactly as you say, as one needs a mechanism for disabling this thing as a whole. I really don't think it should take too much to put something together, even as a separate, future, patch, if these get checked-in. Thanks. > > + /* > > + * Conditions are met, we can add this combination to the > > + * NUMA placement candidates list. We first make sure there > > + * is enough space in there, and then we initialize the new > > + * candidate element with the node map corresponding to the > > + * combination we are dealing with. 
Memory allocation for > > + * expanding the array that hosts the list happens in chunks > > + * equal to the number of NUMA nodes in the system (to > > + * avoid allocating memory each and every time we find a > > + * new candidate). > > + */ > > + if (*nr_cndts == array_size) > > + array_size += nr_nodes; > > + GCREALLOC_ARRAY(new_cndts, array_size); > > This part of the algorithm is quadratic in the number of combinations > divided by the number of nodes. So the algorithm is > O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) > which is quite bad really. > I might well be wrong, but I was thinking to it as something like this: O( C(nr_nodes,(nr_nodes/2)) * nr_nodes ) That's because the external while() is repeated, at most, nr_nodes times (if min_nodes=1 and max_nodes=nr_nodes). Each of these steps hosts a for() which visits all the combinations, the maximum number of which ISTR to be C(nr_nodes,(nr_nodes/2)). I'm not sure what you meant when putting min_nodes up there in your formula (min_nodes is likely to be 1 most of the cases...), so I can't get numbers and compare them, but it looked (looks?) a bit less bad to me... Or did I make some obvious mistake I'm not seeing right now? :-( > At the very least this needs to be an exponential allocation, eg > + array_size += nr_nodes + array_size; > > But if you didn't insist on creating the whole list and sorting it, > you would avoid this allocation entirely, wouldn't you ? > I'll kill the separation between candidate identification and sorting: that's easy, quick, and will make George happy as well. :-) > Should we bve worried that this algorithm will be too slow even if it > involves just > O( C(nr_nodes,min_nodes) ) > iterations ? > I'm commenting about this in the other thread you opened... 
Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 2 of 3 v4/leftover] libxl: have NUMA placement deal with cpupools 2012-07-10 15:03 [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli 2012-07-10 15:03 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli @ 2012-07-10 15:03 ` Dario Faggioli 2012-07-10 15:03 ` [PATCH 3 of 3 v4/leftover] Some automatic NUMA placement documentation Dario Faggioli 2012-07-16 17:03 ` [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli 3 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-10 15:03 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson In such a way that only the cpus belonging to the cpupool of the domain being placed are considered for the placement itself. This happens by filtering out of the placement candidates all the nodes in which the cpupool has no cpu. After that -- as cpu pooling does not necessarily happen at NUMA node boundaries -- we also make sure only the actual cpus that are part of the pool are considered when counting how many processors a placement candidate provides. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- Changes from v3: * fixed a double free on non-NUMA systems (namely, when just 1 node exists). * Changed the filtering logic: now only the meaningful combinations and nodemap are being considered. Changes from v2: * fixed typos in comments. 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -132,25 +132,29 @@ static int numa_cmpf(const void *v1, con } /* The actual automatic NUMA placement routine */ -static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +static int numa_place_domain(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) { int nr_candidates = 0; libxl__numa_candidate *candidates = NULL; libxl_bitmap candidate_nodemap; - libxl_cpupoolinfo *pinfo; - int nr_pools, rc = 0; + libxl_cpupoolinfo cpupool_info; + int i, cpupool, rc = 0; uint32_t memkb; libxl_bitmap_init(&candidate_nodemap); - /* First of all, if cpupools are in use, better not to mess with them */ - pinfo = libxl_list_cpupool(CTX, &nr_pools); - if (!pinfo) - return ERROR_FAIL; - if (nr_pools > 1) { - LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); - goto out; - } + /* + * Extract the cpumap from the cpupool the domain belong to. In fact, + * it only makes sense to consider the cpus/nodes that are in there + * for placement. + */ + rc = cpupool = libxl__domain_cpupool(gc, domid); + if (rc < 0) + return rc; + rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool); + if (rc) + return rc; rc = libxl_domain_need_memory(CTX, info, &memkb); if (rc) @@ -162,7 +166,8 @@ static int numa_place_domain(libxl__gc * /* Find all the candidates with enough free memory and at least * as much pcpus as the domain has vcpus. */ - rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, + 0, 0, &cpupool_info.cpumap, &candidates, &nr_candidates); if (rc) goto out; @@ -188,13 +193,20 @@ static int numa_place_domain(libxl__gc * if (rc) goto out; + /* Avoid trying to set the affinity to cpus that might be in the + * nodemap but not in our cpupool. 
*/ + libxl_for_each_set_bit(i, info->cpumap) { + if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) + libxl_bitmap_reset(&info->cpumap, i); + } + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " "%"PRIu32" KB free selected", candidates[0].nr_nodes, candidates[0].nr_cpus, candidates[0].free_memkb / 1024); out: libxl_bitmap_dispose(&candidate_nodemap); - libxl_cpupoolinfo_list_free(pinfo, nr_pools); + libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } @@ -217,7 +229,7 @@ int libxl__build_pre(libxl__gc *gc, uint * whatever that turns out to be. */ if (libxl_bitmap_is_full(&info->cpumap)) { - int rc = numa_place_domain(gc, info); + int rc = numa_place_domain(gc, domid, info); if (rc) return rc; } diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2289,14 +2289,17 @@ typedef struct { * least that amount of free memory and that number of cpus, respectively. If * min_free_memkb and/or min_cpus are 0, the candidates' free memory and number * of cpus won't be checked at all, which means a candidate will always be - * considered suitable wrt the specific constraint. cndts is where the list of - * exactly nr_cndts candidates is returned. Note that, in case no candidates - * are found at all, the function returns successfully, but with nr_cndts equal - * to zero. + * considered suitable wrt the specific constraint. suitable_cpumap is useful + * for specifying we want only the cpus in that mask to be considered while + * generating placement candidates (for example because of cpupools). cndts is + * where the list of exactly nr_cndts candidates is returned. Note that, in + * case no candidates are found at all, the function returns successfully, but + * with nr_cndts equal to zero. 
*/ _hidden int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts); /* Initialization, allocation and deallocation for placement candidates */ diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -107,30 +107,48 @@ static int comb_next(comb_iter_t it, int /* NUMA automatic placement (see libxl_internal.h for details) */ /* - * This function turns a k-combination iterator into a node map. - * This means the bits in the node map corresponding to the indexes - * of the given combination are the ones that will be set. - * For example, if the iterator represents the combination { 0, 2, 4}, - * the node map will have bits #0, #2 and #4 set. + * This function turns a k-combination iterator into a node map, + * given another map, telling us which nodes should be considered. + * + * This means the bits that are set in suitable_nodemap and that + * correspond to the indexes of the given combination are the ones + * that will be set in nodemap. + * + * For example, given a fully set suitable_nodemap, if the iterator + * represents the combination { 0, 2, 4}, nodemap will have bits #0, + * #2 and #4 set. + * On the other hand, if, say, suitable_nodemap=01011011, the same + * iterator will cause bits #1, #4 and #7 of nodemap to be set. 
*/ -static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *suitable_nodemap, + libxl_bitmap *nodemap, int k) { - int i; + int i, m = 0, n = 0; libxl_bitmap_set_none(nodemap); - for (i = 0; i < k; i++) - libxl_bitmap_set(nodemap, it[i]); + libxl_for_each_set_bit(i, *suitable_nodemap) { + /* Check whether the n-th set bit of suitable_nodemap + * matches with the m-th element of the iterator (and, + * only if it does, advance to the next one) */ + if (m < k && n == it[m]) { + libxl_bitmap_set(nodemap, i); + m++; + } + n++; + } } /* Retrieve the number of cpus that the nodes that are part of the nodemap - * span. */ + * span and are also set in suitable_cpumap. */ static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *suitable_cpumap, const libxl_bitmap *nodemap) { int i, nodes_cpus = 0; for (i = 0; i < nr_cpus; i++) { - if (libxl_bitmap_test(nodemap, tinfo[i].node)) + if (libxl_bitmap_test(suitable_cpumap, i) && + libxl_bitmap_test(nodemap, tinfo[i].node)) nodes_cpus++; } return nodes_cpus; @@ -236,15 +254,17 @@ static int cpus_per_node_count(libxl_cpu int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts) { libxl__numa_candidate *new_cndts = NULL; libxl_cputopology *tinfo = NULL; libxl_numainfo *ninfo = NULL; int nr_nodes = 0, nr_cpus = 0; - libxl_bitmap nodemap; + libxl_bitmap suitable_nodemap, nodemap; int array_size, rc; + libxl_bitmap_init(&suitable_nodemap); libxl_bitmap_init(&nodemap); /* Get platform info and prepare the map for testing the combinations */ @@ -262,6 +282,15 @@ int libxl__get_numa_candidates(libxl__gc if (rc) goto out; + /* Allocate and prepare the map of the nodes that can be utilized for + * placement, based on the map of suitable cpus. 
*/ + rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap, 0); + if (rc) + goto out; + rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap); + if (rc) + goto out; + /* * If the minimum number of NUMA nodes is not explicitly specified * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes @@ -287,6 +316,13 @@ int libxl__get_numa_candidates(libxl__gc rc = ERROR_INVAL; goto out; } + /* We also need to be sure we do not exceed the number of + * nodes we are allowed to use. */ + if (libxl_bitmap_count_set(&suitable_nodemap) < max_nodes) { + max_nodes = libxl_bitmap_count_set(&suitable_nodemap); + if (min_nodes > max_nodes) + min_nodes = max_nodes; + } /* Initialize the local storage for the combinations */ *nr_cndts = 0; @@ -311,7 +347,10 @@ int libxl__get_numa_candidates(libxl__gc uint32_t nodes_free_memkb; int nodes_cpus; - comb_get_nodemap(comb_iter, &nodemap, min_nodes); + /* Get the nodemap for the combination, only considering + * suitable nodes. */ + comb_get_nodemap(comb_iter, &suitable_nodemap, + &nodemap, min_nodes); /* If there is not enough memory in this combination, skip it * and go generating the next one... 
*/ @@ -320,7 +359,8 @@ int libxl__get_numa_candidates(libxl__gc continue; /* And the same applies if this combination is short in cpus */ - nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap); + nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, suitable_cpumap, + &nodemap); if (min_cpus && nodes_cpus < min_cpus) continue; @@ -345,12 +385,13 @@ int libxl__get_numa_candidates(libxl__gc new_cndts[*nr_cndts].nr_domains = nodemap_to_nr_domains(gc, tinfo, &nodemap); new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; - new_cndts[*nr_cndts].nr_nodes = min_nodes; + new_cndts[*nr_cndts].nr_nodes = libxl_bitmap_count_set(&nodemap); new_cndts[*nr_cndts].nr_cpus = nodes_cpus; LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, - min_nodes, new_cndts[*nr_cndts].nr_cpus, + new_cndts[*nr_cndts].nr_nodes, + new_cndts[*nr_cndts].nr_cpus, new_cndts[*nr_cndts].free_memkb / 1024); (*nr_cndts)++; @@ -360,6 +401,7 @@ int libxl__get_numa_candidates(libxl__gc *cndts = new_cndts; out: + libxl_bitmap_dispose(&suitable_nodemap); libxl_bitmap_dispose(&nodemap); libxl_cputopology_list_free(tinfo, nr_cpus); libxl_numainfo_list_free(ninfo, nr_nodes); ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 3 of 3 v4/leftover] Some automatic NUMA placement documentation 2012-07-10 15:03 [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli 2012-07-10 15:03 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli 2012-07-10 15:03 ` [PATCH 2 of 3 v4/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli @ 2012-07-10 15:03 ` Dario Faggioli 2012-07-16 17:03 ` [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli 3 siblings, 0 replies; 60+ messages in thread From: Dario Faggioli @ 2012-07-10 15:03 UTC (permalink / raw) To: xen-devel, Dario Faggioli Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson About rationale, usage and (some small bits of) API. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- Changes from v3: * typos and rewording of some sentences, as suggested during review. Changes from v1: * API documentation moved close to the actual functions. diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown new file mode 100644 --- /dev/null +++ b/docs/misc/xl-numa-placement.markdown @@ -0,0 +1,89 @@ +# Guest Automatic NUMA Placement in libxl and xl # + +## Rationale ## + +NUMA means the memory access times of a program running on a CPU depend on +the relative distance between that CPU and that memory. In fact, most +NUMA systems are built in such a way that each processor has its local memory, +on which it can operate very fast. On the other hand, getting and storing data +from and on remote memory (that is, memory local to some other processor) is +much more complex and slow. On these machines, a NUMA node is usually defined +as a set of processor cores (typically a physical CPU package) and the memory +directly attached to the set of cores. 
+ +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by +assigning to its domains a "node affinity", i.e., a set of NUMA nodes of the +host from which they get their memory allocated. + +NUMA awareness becomes very important as soon as many domains start running +memory-intensive workloads on a shared host. In fact, the cost of accessing non +node-local memory locations is very high, and the performance degradation is +likely to be noticeable. + +## Guest Placement in xl ## + +If using xl for creating and managing guests, it is very easy to ask for either +manual or automatic placement of them across the host's NUMA nodes. + +Note that xm/xend does the very same thing, the only differences residing in +the details of the heuristics adopted for the placement (see below). + +### Manual Guest Placement with xl ### + +Thanks to the "cpus=" option, it is possible to specify where a domain should +be created and scheduled, directly in its config file. This affects NUMA +placement and memory accesses as the hypervisor constructs the node affinity of +a VM based right on its CPU affinity when it is created. + +This is very simple and effective, but requires the user/system administrator +to explicitly specify affinities for each and every domain, or Xen won't be +able to guarantee the locality for their memory accesses. + +It is also possible to deal with NUMA by partitioning the system using cpupools +(available in the upcoming release of Xen, 4.2). Again, this could be "The +Right Answer" for many needs and occasions, but has to be carefully +considered and manually set up. + +### Automatic Guest Placement with xl ### + +If no "cpus=" option is specified in the config file, libxl tries to figure out +on its own on which node(s) the domain could fit best. It is worthwhile noting +that optimally fitting a set of VMs on the NUMA nodes of a host is an +incarnation of the Bin Packing Problem. 
+In fact, the various VMs with different memory sizes are the items to be
+packed, and the host nodes are the bins. That problem is known to be
+NP-hard, so some heuristics are used.
+
+The first step is to find a node, or a set of nodes, with enough free
+memory and enough physical CPUs to accommodate the new domain. The idea is
+to find a spot for the domain with at least as much free memory as it has
+configured, and as many pCPUs as it has vCPUs. After that, the actual
+decision on which solution to go for happens according to the following
+heuristics:
+
+ * candidates involving fewer nodes come first. In case two (or more)
+   candidates span the same number of nodes,
+ * the amount of free memory and the number of domains assigned to the
+   candidates are considered. In doing that, candidates with a greater
+   amount of free memory and fewer assigned domains are preferred, with
+   free memory "weighting" three times as much as the number of domains.
+
+Giving preference to candidates with fewer nodes ensures better performance
+for the guest, as it avoids spreading its memory among different nodes.
+Favouring the nodes that have the largest amounts of free memory helps keep
+memory fragmentation small, from a system-wide perspective. However, if
+several candidates fulfil these criteria to roughly the same extent,
+considering the number of domains the candidates are "hosting" helps
+balance the load on the various nodes.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement simply because libxl does it
+internally. No API is provided (yet) for interacting with this feature and
+modifying the library's behaviour regarding automatic placement; it just
+happens by default if no affinity is specified (as is the case with
+xm/xend).
+
+For actually looking at, and maybe tweaking, the mechanism and the
+algorithms it uses, everything is implemented as a set of libxl internal
+interfaces and facilities.
+Look at the comment "Automatic NUMA placement" in libxl\_internal.h.
+
+Note that this may change in future versions of Xen/libxl.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl
  2012-07-10 15:03 [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli
  ` (2 preceding siblings ...)
  2012-07-10 15:03 ` [PATCH 3 of 3 v4/leftover] Some automatic NUMA placement documentation Dario Faggioli
@ 2012-07-16 17:03 ` Dario Faggioli
  3 siblings, 0 replies; 60+ messages in thread
From: Dario Faggioli @ 2012-07-16 17:03 UTC (permalink / raw)
To: xen-devel
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

[-- Attachment #1.1: Type: text/plain, Size: 687 bytes --]

On Tue, 2012-07-10 at 17:03 +0200, Dario Faggioli wrote:
> Hello again,
>
Hi,

This seems not to have been considered yet... If that is so, I'd like to
post a new version, in which I fix a small bug that is present in this
version (patch 2/3).

Please consider that repost for review or inclusion (its subject will be
"[... v5/leftover] Automatic NUMA placement for xl").

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread
end of thread, other threads:[~2012-07-23 15:31 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-10 15:03 [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli
2012-07-10 15:03 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
2012-07-17 15:55 ` Ian Jackson
2012-07-16 17:13 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl Dario Faggioli
2012-07-16 17:13 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
2012-07-17 18:04 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes [and 1 more messages] Ian Jackson
2012-07-17 20:23 ` Ian Campbell
2012-07-18  0:31 ` Dario Faggioli
2012-07-18 10:44 ` Ian Jackson
2012-07-18  0:22 ` Dario Faggioli
2012-07-18  8:27 ` Dario Faggioli
2012-07-18  9:13 ` Ian Campbell
2012-07-18  9:43 ` Dario Faggioli
2012-07-18  9:53 ` Ian Campbell
2012-07-18 10:08 ` Dario Faggioli
2012-07-18 11:00 ` Ian Jackson
2012-07-18 13:14 ` Ian Campbell
2012-07-18 13:35 ` Dario Faggioli
2012-07-19 12:47 ` Dario Faggioli
2012-07-18 13:40 ` Andre Przywara
2012-07-18 13:54 ` Juergen Gross
2012-07-18 14:00 ` Dario Faggioli
2012-07-19 14:43 ` Ian Jackson
2012-07-19 18:37 ` Andre Przywara
2012-07-21  1:46 ` Dario Faggioli
2012-07-18 10:53 ` Ian Jackson
2012-07-18 13:12 ` Ian Campbell
2012-07-18  9:47 ` Dario Faggioli
2012-07-19 12:21 ` [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes Andre Przywara
2012-07-19 14:22 ` Dario Faggioli
2012-07-20  8:19 ` Andre Przywara
2012-07-20  9:39 ` Dario Faggioli
2012-07-20 10:01 ` Dario Faggioli
2012-07-20  8:20 ` Dario Faggioli
2012-07-20  8:26 ` Andre Przywara
2012-07-20  8:38 ` Juergen Gross
2012-07-20  9:52 ` Dario Faggioli
2012-07-20  9:56 ` Juergen Gross
2012-07-20  9:44 ` Dario Faggioli
2012-07-20 11:47 ` Andre Przywara
2012-07-20 12:54 ` Dario Faggioli
2012-07-20 13:07 ` Andre Przywara
2012-07-21  1:44 ` Dario Faggioli
2012-07-16 17:13 ` [PATCH 2 of 3 v5/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
2012-07-16 17:13 ` [PATCH 3 of 3 v5/leftover] Some automatic NUMA placement documentation Dario Faggioli
2012-07-20 11:07 ` [PATCH 0 of 3 v5/leftover] Automatic NUMA placement for xl David Vrabel
2012-07-20 11:43 ` Andre Przywara
2012-07-20 12:00 ` Ian Campbell
2012-07-20 12:08 ` Ian Campbell
2012-07-23 10:38 ` Dario Faggioli
2012-07-23 10:42 ` Ian Campbell
2012-07-23 15:31 ` Dario Faggioli
2012-07-23 10:23 ` Dario Faggioli
2012-07-20 12:14 ` David Vrabel
2012-07-17 15:59 ` [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Ian Campbell
2012-07-17 18:01 ` Ian Jackson
2012-07-17 22:15 ` Dario Faggioli
2012-07-10 15:03 ` [PATCH 2 of 3 v4/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
2012-07-10 15:03 ` [PATCH 3 of 3 v4/leftover] Some automatic NUMA placement documentation Dario Faggioli
2012-07-16 17:03 ` [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl Dario Faggioli