* [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl
@ 2012-07-21  1:22 Dario Faggioli
  2012-07-21  1:22 ` [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-21  1:22 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

Hi,

Here is another version of what remains uncommitted from my placement series.

All the minor and major comments that were raised during review have been
addressed, in particular the observations about the memory overhead and the
computational complexity of the placement algorithm.

The heuristic has also been modified according to the experimental results
provided by Andre.

More details are provided in the mails containing the individual patches.

Thanks to everyone and Regards,
Dario


* [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-21  1:22 [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl Dario Faggioli
@ 2012-07-21  1:22 ` Dario Faggioli
  2012-07-23 11:38   ` Ian Jackson
  2012-07-21  1:22 ` [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file Dario Faggioli
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-21  1:22 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

If a domain does not have a VCPU affinity, try to pin it automatically
to some PCPUs. This is done taking into account the NUMA characteristics
of the host: we look for a combination of the host's NUMA nodes with
enough free memory and enough PCPUs for the new domain, and pin it to
the PCPUs of those nodes.

Deciding which placement is best happens by means of some heuristics.
For instance, smaller candidates are better, both from the domain's
perspective (less memory spreading among nodes) and from the whole system's
perspective (less memory fragmentation). For candidates of equal size
(i.e., with the same number of nodes), both the amount of free memory and
the number of vCPUs already pinned to the candidates' nodes are considered.
Very often, candidates with a greater amount of free memory are the ones we
want, as this is good for keeping memory fragmentation under control.
However, we do not want to overcommit a node too much just because it has
a lot of memory, and that is why the number of vCPUs must be accounted for.
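
As an illustration only (hypothetical, simplified names; the actual code is in
the numa_cmpf() hunk below), the comparison between two candidates boils down
to something like this:

static double norm_diff(double a, double b)
{
    /* normalized difference of two non-negative values, in [-1, 1] */
    return (a == 0 && b == 0) ? 0.0 : (a - b) / (a > b ? a : b);
}

/* < 0 means candidate 1 is better, > 0 means candidate 2 is, 0 is a tie */
static int compare_candidates(double free1_kb, int vcpus1,
                              double free2_kb, int vcpus2)
{
    /* more free memory is better, fewer already runnable vcpus are better,
     * and the vcpu term weighs three times as much as the memory term */
    double d = norm_diff(free2_kb, free1_kb) + 3 * norm_diff(vcpus1, vcpus2);

    return (d > 0) - (d < 0);
}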

This all happens internally to libxl, and no API for driving the
mechanism is provided for now. This matches what xend already does.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

---
Changes from v5:
 * macros moved from libxl_numa.c to libxl_internal.h.
 * Changed the placement heuristics that now considers how many
   vCPUs are already pinned to a candidate, instead of how many
   domains, as suggested during testing/review (thanks a lot Andre!).
 * Changed the weights in the placement heuristics so that, now,
   CPU load counts more than free memory, as suggested during
   testing/review (thanks a lot again Andre! :-D).
 * Removed the separation between generating the list of possible
   candidates and sorting that list. Now the best candidate is selected
   on-line, without the need of keeping track of more than 2 candidates
   at a time.
 * Thanks to the above, the placement can now end earlier, i.e., as soon as
   a placement is found and all the other candidates with the same number
   of nodes have been checked (as bigger candidates would be worse than
   that anyway).
 * Avoid running any placement if the system has more than 8 nodes,
   as agreed during review.
 * Avoid running any placement if the config file has a "cpus='all'"
   directive, as suggested during review.

Changes from v3:
 * fixed a double free on non-NUMA systems (namely, when just 1 node exists).
 * Reword an ambiguous sentence in xl's man page, as requested during review.
 * Do no special case single node systems in libxl__get_numa_candidates(), to
   avoid generating a spurious warning on them.

Changes from v2:
 * lots of typos.
 * Clarified some comments, as requested during (ijc's) review.
 * Added some more information/reference for the combination generation
   algorithm.
 * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
 * libxl_bitmap_init() used to make sure we do not try to free random
   memory on failure paths of functions that allocate a libxl_bitmap.
 * Always invoke libxl__sort_numa_candidates(), even if we get there
   with just 1 candidate, as requested during review.
 * Simplified the if-s that check for input parameter consistency in
   libxl__get_numa_candidates() as requested during (gwd's) review.
 * Comparison function for candidates changed so that it now provides
   total ordering, as requested during review. It is still using FP
   arithmetic, though. Also, I think that just putting the difference in
   the amount of free memory and the difference in the number of assigned
   domains of two candidates into a single formula (after normalizing and
   weighting them) is both clear and effective enough.
 * Function definitions moved to a numa specific source file (libxl_numa.c),
   as suggested during review.


Changes from v1:
 * This patch incorporates the changes from both "libxl, xl: enable automatic
   placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering
   NUMA placement candidates" from v1.
 * The logic of the algorithm is basically the same as in v1, but the splitting
   of it in the various functions has been completely redesigned from scratch.
 * No public API for placement or candidate generation is now exposed,
   everything happens within libxl, as agreed during v1 review.
 * The relevant documentation has been moved near the actual functions and
   features. Also, the amount and (hopefully!) the quality of the documentation
   have been improved a lot, as requested.
 * All the comments about using the proper libxl facilities and helpers for
   allocations, etc., have been considered and applied.
 * This patch still bails out from NUMA optimizations if it finds out cpupools
   are being utilized. It is the next patch that makes the two things interact
   properly, as suggested during review.

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -111,8 +111,8 @@ created online and the remainder will be
 
 =item B<cpus="CPU-LIST">
 
-List of which cpus the guest is allowed to use. Default behavior is
-`all cpus`. A C<CPU-LIST> may be specified as follows:
+List of which cpus the guest is allowed to use. By default xl will pick
+some cpus on its own (see below). A C<CPU-LIST> may be specified as follows:
 
 =over 4
 
@@ -132,6 +132,12 @@ run on cpu #3 of the host.
 
 =back
 
+If this option is not specified, libxl automatically tries to place the new
+domain on the host's NUMA nodes (provided the host has more than one NUMA
+node) by pinning it to the cpus of those nodes. A heuristic approach is
+utilized with the goals of maximizing performance for the domain and, at
+the same time, achieving efficient utilization of the host's CPUs and RAM.
+
 =item B<cpu_weight=WEIGHT>
 
 A domain with a weight of 512 will get twice as much CPU as a domain
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm
 LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
-			libxl_json.o libxl_aoutils.o \
+			libxl_json.o libxl_aoutils.o libxl_numa.o \
 			libxl_save_callout.o _libxl_save_msgs_callout.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -98,6 +98,102 @@ out:
     return sched;
 }
 
+/* Subtract two values and normalize the result into [-1, 1] */
+static double normalized_diff(double a, double b)
+{
+    if (!a && a == b)
+        return 0.0;
+    return (a - b) / MAX(a, b);
+}
+
+/*
+ * Two NUMA placement candidates are compared by means of the following
+ * heuristics:
+ *
+ *  - the amount of free memory and the number of vcpus runnable on the
+ *    candidates are considered. In doing that, candidates with a greater
+ *    amount of free memory and fewer runnable vcpus are preferred. Also,
+ *    the difference in the number of vcpus "weighs" three times as much as
+ *    the amount of free memory.
+ *
+ * In fact, leaving larger memory holes maximizes the probability of being
+ * able to put other domains on the node. That hopefully means many domains
+ * will benefit from local memory accesses, but it also introduces the risk
+ * of overloading large (from a memory POV) nodes. That is exactly the effect
+ * that counting the vcpus able to run on the nodes tries to prevent.
+ *
+ * Note that this completely ignores the number of nodes each candidate spans,
+ * as the fact that fewer nodes is better is already accounted for in the
+ * algorithm.
+ */
+static int numa_cmpf(const libxl__numa_candidate *c1,
+                     const libxl__numa_candidate *c2)
+{
+    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
+    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
+
+    return SIGN(freememkb_diff + 3*nrvcpus_diff);
+}
+
+/* The actual automatic NUMA placement routine */
+static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+{
+    int found;
+    libxl__numa_candidate candidate;
+    libxl_bitmap candidate_nodemap;
+    libxl_cpupoolinfo *pinfo;
+    int nr_pools, rc = 0;
+    uint32_t memkb;
+
+    libxl__numa_candidate_init(&candidate);
+    libxl_bitmap_init(&candidate_nodemap);
+
+    /* First of all, if cpupools are in use, better not to mess with them */
+    pinfo = libxl_list_cpupool(CTX, &nr_pools);
+    if (!pinfo)
+        return ERROR_FAIL;
+    if (nr_pools > 1) {
+        LOG(NOTICE, "Skipping NUMA placement as cpupools are in use");
+        goto out;
+    }
+
+    rc = libxl_domain_need_memory(CTX, info, &memkb);
+    if (rc)
+        goto out;
+    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* Find the best candidate with enough free memory and at least
+     * as many pcpus as the domain has vcpus.  */
+    rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus, 0, 0,
+                                   numa_cmpf, &candidate, &found);
+    if (rc)
+        goto out;
+
+    /* Not even a suitable placement candidate! Let's just not touch the
+     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
+    if (found == 0)
+        goto out;
+
+    /* Map the candidate's node map to the domain's info->cpumap */
+    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
+    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
+    if (rc)
+        goto out;
+
+    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
+                "%"PRIu32" KB free selected", candidate.nr_nodes,
+                candidate.nr_cpus, candidate.free_memkb / 1024);
+
+ out:
+    libxl__numa_candidate_dispose(&candidate);
+    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    return rc;
+}
+
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
               libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
     uint32_t rtc_timeoffset;
 
     xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
+
+    /*
+     * Check if the domain has any CPU affinity. If not, try to build up one.
+     * In case numa_place_domain() finds at least one suitable candidate, it
+     * will set info->cpumap accordingly; if it does not, it just leaves it
+     * as it is. This means (unless some weird error manifests) the subsequent
+     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
+     * whatever that turns out to be.
+     */
+    if (libxl_bitmap_is_full(&info->cpumap)) {
+        int rc = numa_place_domain(gc, info);
+        if (rc)
+            return rc;
+    }
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
+
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
     if (info->type == LIBXL_DOMAIN_TYPE_PV)
         xc_domain_set_memmap_limit(ctx->xch, domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2216,6 +2216,125 @@ static inline void libxl__ctx_unlock(lib
 #define CTX_LOCK (libxl__ctx_lock(CTX))
 #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
 
+#define SIGN(x) ((x) > 0 ? 1 : (x) < 0 ? -1 : 0)
+#define MAX(x, y) ((x) > (y) ? (x) : (y))
+
+/*
+ * Automatic NUMA placement
+ *
+ * These functions and data structures deal with the initial placement of a
+ * domain onto the host NUMA nodes.
+ *
+ * The key concept here is the one of "NUMA placement candidate", which is
+ * basically a set of nodes whose characteristics have been successfully
+ * checked against some specific requirements. More precisely, a candidate
+ * is the nodemap associated with one of the possible subsets of the host
+ * NUMA nodes providing a certain amount of free memory, or a given number
+ * of cpus, or even both (depending on what the caller wants). For
+ * convenience of use, some of this information is stored within the
+ * candidate itself, instead of always being dynamically computed. A single
+ * node can be a valid placement candidate, and it is also possible for a
+ * candidate to contain all the nodes of the host. The fewer nodes there
+ * are in a candidate, the better performance a domain placed onto it
+ * should get (at least from a NUMA point of view). For instance, looking
+ * for a NUMA candidate with 2GB of free memory means we want the subsets
+ * of the host NUMA nodes with, cumulatively, at least 2GB of free memory.
+ * This condition can be satisfied by just one particular node, or it may
+ * require more nodes, depending on the characteristics of the host, on how
+ * many domains have been created already, on how big they are, etc.
+ *
+ * The intended usage is as follows:
+ *  1. first of all, call libxl__get_numa_candidate(), and specify the proper
+ *     constraints to it (e.g., the amount of memory a domain needs as the
+ *     minimum amount of free memory for the candidates). If a candidate
+ *     comparison function is provided, the candidate with the fewest nodes
+ *     that is found to be best according to such a function is returned.
+ *     If no comparison function is passed, the very first candidate is.
+ *  2. The chosen candidate's nodemap should be utilized for computing the
+ *     actual affinity of the domain which, given the current NUMA support
+ *     in the hypervisor, is what determines the placement of the domain's
+ *     vcpus and memory.
+ */
+
+typedef struct {
+    int nr_cpus, nr_nodes;
+    int nr_vcpus;
+    uint32_t free_memkb;
+    libxl_bitmap nodemap;
+} libxl__numa_candidate;
+
+/* Signature for the comparison function between two candidates */
+typedef int (*libxl__numa_candidate_cmpf)(const libxl__numa_candidate *c1,
+                                          const libxl__numa_candidate *c2);
+
+/*
+ * This looks for the best NUMA placement candidate satisfying some
+ * specific conditions. If min_nodes and/or max_nodes are not 0, their
+ * value is used to determine the minimum and maximum number of nodes the
+ * candidate can have. If they are 0, it means the candidate can contain
+ * from 1 node (min_nodes=0) to the total number of nodes of the host
+ * (max_nodes=0). If min_free_memkb and/or min_cpus are not 0, the caller
+ * only wants candidates with at least the amount of free memory and the
+ * number of cpus they specify, respectively. If they are 0, the
+ * candidates' free memory and/or number of cpus won't be checked at all.
+ *
+ * Candidates are compared with each other by calling numa_cmpf(), which
+ * is where the heuristic for determining which candidate is the best
+ * one is actually implemented. The only bit of it that is hardcoded in
+ * this function is the fact that candidates with fewer nodes are always
+ * preferable.
+ *
+ * If at least one suitable candidate is found, it is returned in cndt_out,
+ * cndt_found is set to one, and the function returns successfully. On the
+ * other hand, if not even one single candidate can be found, the function
+ * still returns successfully but cndt_found will be zero.
+ *
+ * It is up to the function to properly allocate cndt_out (by calling
+ * libxl__numa_candidate_alloc()), while it is the caller that should init
+ * (libxl__numa_candidate_init()) and free (libxl__numa_candidate_dispose())
+ * it.
+ */
+_hidden int libxl__get_numa_candidate(libxl__gc *gc,
+                                      uint32_t min_free_memkb, int min_cpus,
+                                      int min_nodes, int max_nodes,
+                                      libxl__numa_candidate_cmpf numa_cmpf,
+                                      libxl__numa_candidate *cndt_out,
+                                      int *cndt_found);
+
+/* Initialization, allocation and deallocation for placement candidates */
+static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt)
+{
+    cndt->free_memkb = 0;
+    cndt->nr_cpus = cndt->nr_nodes = cndt->nr_vcpus = 0;
+    libxl_bitmap_init(&cndt->nodemap);
+}
+
+static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
+                                              libxl__numa_candidate *cndt)
+{
+    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0);
+}
+static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
+{
+    libxl_bitmap_dispose(&cndt->nodemap);
+}
+
+/* Retrieve (in nodemap) the node map associated to placement candidate cndt */
+static inline
+void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
+                                       const libxl__numa_candidate *cndt,
+                                       libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
+}
+/* Set the node map of placement candidate cndt to match nodemap */
+static inline
+void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
+                                       libxl__numa_candidate *cndt,
+                                       const libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
+}
 
 /*
  * Inserts "elm_new" into the sorted list "head".
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
new file mode 100644
--- /dev/null
+++ b/tools/libxl/libxl_numa.c
@@ -0,0 +1,426 @@
+/*
+ * Copyright (C) 2012      Citrix Ltd.
+ * Author Dario Faggioli <dario.faggioli@citrix.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include <glob.h>
+
+#include "libxl_internal.h"
+
+/*
+ * What follows are helpers for generating all the k-combinations
+ * without repetitions of a set S with n elements in it. Formally
+ * speaking, they are subsets of k distinct elements of S and, if
+ * S is n elements big, the number of k-combinations is equal to the
+ * binomial coefficient C(n k)=n!/(k! * (n - k)!).
+ *
+ * The various subsets are generated one after the other by calling
+ * comb_init() first, and, after that, comb_next()
+ * C(n k)-1 times. An iterator is used to store the current status
+ * of the whole generation operation (i.e., basically, the last
+ * combination that has been generated). As soon as all combinations
+ * have been generated, comb_next() will start returning 0 instead of
+ * 1. The same instance of the iterator and the same values for
+ * n and k _must_ be used for each call (if that doesn't happen, the
+ * result is unspecified).
+ *
+ * The algorithm is a well known one (see, for example, D. Knuth's "The
+ * Art of Computer Programming - Volume 4, Fascicle 3") and it produces
+ * the combinations in such a way that they (well, more precisely, their
+ * indexes in the array/map representing the set) come with lexicographic
+ * ordering.
+ *
+ * For example, with n = 5 and k = 3, calling comb_init()
+ * will generate { 0, 1, 2 }, while subsequent valid calls to
+ * comb_next() will produce the following:
+ * { { 0, 1, 3 }, { 0, 1, 4 },
+ *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
+ *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
+ *   { 2, 3, 4 } }
+ *
+ * This is used by the automatic NUMA placement logic below.
+ */
+typedef int* comb_iter_t;
+
+static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
+{
+    comb_iter_t new_iter;
+    int i;
+
+    if (n < k)
+        return 0;
+
+    /* First set is always { 0, 1, 2, ..., k-1 } */
+    GCNEW_ARRAY(new_iter, k);
+    for (i = 0; i < k; i++)
+        new_iter[i] = i;
+
+    *it = new_iter;
+    return 1;
+}
+
+static int comb_next(comb_iter_t it, int n, int k)
+{
+    int i;
+
+    /*
+     * The idea here is to find the leftmost element from where
+     * we should start incrementing the indexes of the iterator.
+     * This means looking for the highest index that can be increased
+     * while still producing value smaller than n-1. In the example
+     * above, when dealing with { 0, 1, 4 }, such an element is the
+     * second one, as the third is already equal to 4 (which actually
+     * is n-1).
+     * Once we found from where to start, we increment that element
+     * and override the right-hand rest of the iterator with its
+     * successors, thus achieving lexicographic ordering.
+     *
+     * Regarding the termination of the generation process, when we
+     * manage in bringing n-k at the very first position of the iterator,
+     * we know that is the last valid combination ( { 2, 3, 4 }, with
+     * n - k = 5 - 2 = 2, in the example above), and thus we start
+     * returning 0 as soon as we cross that border.
+     */
+    for (i = k - 1; it[i] == n - k + i; i--) {
+        if (i <= 0)
+            return 0;
+    }
+    for (it[i]++, i++; i < k; i++)
+        it[i] = it[i - 1] + 1;
+    return 1;
+}
+
+/* NUMA automatic placement (see libxl_internal.h for details) */
+
+/*
+ * This function turns a k-combination iterator into a node map.
+ * This means the bits in the node map corresponding to the indexes
+ * of the given combination are the ones that will be set.
+ * For example, if the iterator represents the combination { 0, 2, 4},
+ * the node map will have bits #0, #2 and #4 set.
+ */
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+{
+    int i;
+
+    libxl_bitmap_set_none(nodemap);
+    for (i = 0; i < k; i++)
+        libxl_bitmap_set(nodemap, it[i]);
+}
+
+/* Retrieve the number of cpus that the nodes that are part of the nodemap
+ * span. */
+static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                              const libxl_bitmap *nodemap)
+{
+    int i, nodes_cpus = 0;
+
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            nodes_cpus++;
+    }
+    return nodes_cpus;
+}
+
+/* Retrieve the amount of free memory within the nodemap */
+static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
+                                      libxl_bitmap *nodemap)
+{
+    uint32_t free_memkb = 0;
+    int i;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        free_memkb += ninfo[i].free / 1024;
+
+    return free_memkb;
+}
+
+/* Retrieve the number of vcpus able to run on the cpus of the nodes
+ * that are part of the nodemap. */
+static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo,
+                               const libxl_bitmap *nodemap)
+{
+    libxl_dominfo *dinfo = NULL;
+    libxl_bitmap vcpu_nodemap;
+    int nr_doms, nr_cpus;
+    int nr_vcpus = 0;
+    int i, j, k;
+
+    dinfo = libxl_list_domain(CTX, &nr_doms);
+    if (dinfo == NULL)
+        return ERROR_FAIL;
+
+    if (libxl_node_bitmap_alloc(CTX, &vcpu_nodemap, 0) < 0) {
+        libxl_dominfo_list_free(dinfo, nr_doms);
+        return ERROR_FAIL;
+    }
+
+    for (i = 0; i < nr_doms; i++) {
+        libxl_vcpuinfo *vinfo;
+        int nr_dom_vcpus;
+
+        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_dom_vcpus, &nr_cpus);
+        if (vinfo == NULL)
+            continue;
+
+        /* For each vcpu of each domain ... */
+        for (j = 0; j < nr_dom_vcpus; j++) {
+
+            /* Build up a map telling on which nodes the vcpu is runnable */
+            libxl_bitmap_set_none(&vcpu_nodemap);
+            libxl_for_each_set_bit(k, vinfo[j].cpumap)
+                libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
+
+            /* And check if that map has any intersection with our nodemap */
+            libxl_for_each_set_bit(k, vcpu_nodemap) {
+                if (libxl_bitmap_test(nodemap, k)) {
+                    nr_vcpus++;
+                    break;
+                }
+            }
+        }
+
+        libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
+    }
+
+    libxl_bitmap_dispose(&vcpu_nodemap);
+    libxl_dominfo_list_free(dinfo, nr_doms);
+    return nr_vcpus;
+}
+
+/*
+ * This function tries to figure out if the host has a consistent number
+ * of cpus along all its NUMA nodes. In fact, if that is the case, we can
+ * calculate the minimum number of nodes needed for a domain by just
+ * dividing its total number of vcpus by this value computed here.
+ * However, we are not allowed to assume that all the nodes have the
+ * same number of cpus. Therefore, in case discrepancies among different
+ * nodes are found, this function just returns 0, for the caller to know
+ * it shouldn't rely on this 'optimization', and sort things out in some
+ * other way (by doing something basic, like starting by trying candidates
+ * with just one node).
+ */
+static int count_cpus_per_node(libxl_cputopology *tinfo, int nr_cpus,
+                               libxl_numainfo *ninfo, int nr_nodes)
+{
+    int cpus_per_node = 0;
+    int j, i;
+
+    /* This makes sense iff # of PCPUs is the same for all nodes */
+    for (j = 0; j < nr_nodes; j++) {
+        int curr_cpus = 0;
+
+        for (i = 0; i < nr_cpus; i++) {
+            if (tinfo[i].node == j)
+                curr_cpus++;
+        }
+        /* So, if the above does not hold, turn the whole thing off! */
+        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
+        if (cpus_per_node != curr_cpus)
+            return 0;
+    }
+    return cpus_per_node;
+}
+
+/*
+ * Looks for the placement candidates that satisfy some specific
+ * conditions and returns the best one according to the provided
+ * comparison function.
+ */
+int libxl__get_numa_candidate(libxl__gc *gc,
+                              uint32_t min_free_memkb, int min_cpus,
+                              int min_nodes, int max_nodes,
+                              libxl__numa_candidate_cmpf numa_cmpf,
+                              libxl__numa_candidate *cndt_out,
+                              int *cndt_found)
+{
+    libxl__numa_candidate new_cndt;
+    libxl_cputopology *tinfo = NULL;
+    libxl_numainfo *ninfo = NULL;
+    int nr_nodes = 0, nr_cpus = 0;
+    libxl_bitmap nodemap;
+    int rc = 0;
+
+    libxl_bitmap_init(&nodemap);
+    libxl__numa_candidate_init(&new_cndt);
+
+    /* Get platform info and prepare the map for testing the combinations */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+
+    /*
+     * The good thing about this solution is that it is based on heuristics
+     * (implemented in numa_cmpf()), but we can at least evaluate it against
+     * all the possible placement candidates. That is feasible because the
+     * number of nodes present in current NUMA systems is quite small.
+     * In fact, even if a sum of binomials is involved, if the system has
+     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
+     * sensible value, as it is exactly the number of nodes the biggest
+     * NUMA systems provide at the time of this writing (and will probably
+     * continue to do so for a while). However, computational complexity
+     * would explode on systems much bigger than that. A 16-node system would
+     * still be fine (it would take 65536 steps), but it is really important
+     * that we avoid trying to run this on monsters with 32, 64 or more nodes
+     * (if they ever come into being). Therefore, here comes a safety catch
+     * that disables the algorithm for the cases in which it wouldn't work well.
+     */
+    if (nr_nodes > 8) {
+        /* Log we did nothing and return 0, as no real error occurred */
+        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
+                  "placement algorithm to work effectively. Skipping it",
+                  nr_nodes);
+        *cndt_found = 0;
+        goto out;
+    }
+
+    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0);
+    if (rc)
+        goto out;
+    rc = libxl__numa_candidate_alloc(gc, &new_cndt);
+    if (rc)
+        goto out;
+
+    /*
+     * If the minimum number of NUMA nodes is not explicitly specified
+     * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
+     * from where to start generating candidates, if possible (or just start
+     * from 1 otherwise). The maximum number of nodes should not exceed the
+     * number of existent NUMA nodes on the host, or the candidate generation
+     * won't work properly.
+     */
+    if (!min_nodes) {
+        int cpus_per_node;
+
+        cpus_per_node = count_cpus_per_node(tinfo, nr_cpus, ninfo, nr_nodes);
+        if (cpus_per_node == 0)
+            min_nodes = 1;
+        else
+            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
+    }
+    if (min_nodes > nr_nodes)
+        min_nodes = nr_nodes;
+    if (!max_nodes || max_nodes > nr_nodes)
+        max_nodes = nr_nodes;
+    if (min_nodes > max_nodes) {
+        rc = ERROR_INVAL;
+        goto out;
+    }
+
+    /* It is up to the caller to dispose of this */
+    rc = libxl__numa_candidate_alloc(gc, cndt_out);
+    if (rc)
+        goto out;
+
+    /*
+     * Consider all the combinations with sizes in [min_nodes, max_nodes]
+     * (see comb_init() and comb_next()). Note that, since fewer nodes are
+     * better, it is guaranteed that any candidate found during the i-th
+     * step will be better than any other one we could find during the
+     * (i+1)-th and all the subsequent steps (they will all have more
+     * nodes). It is thus pointless to keep going if we have already
+     * found something.
+     */
+    *cndt_found = 0;
+    while (min_nodes <= max_nodes && *cndt_found == 0) {
+        comb_iter_t comb_iter;
+        int comb_ok;
+
+        /*
+	 * And here it is. Each step of this cycle generates a combination of
+	 * nodes as big as min_nodes mandates.  Each of these combinations is
+	 * checked against the constraints provided by the caller (namely,
+	 * amount of free memory and number of cpus) and it can concur to
+         * become our best placement iff it passes the check.
+         */
+        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
+             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+            uint32_t nodes_free_memkb;
+            int nodes_cpus;
+
+            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+
+            /* If there is not enough memory in this combination, skip it
+             * and go generating the next one... */
+            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
+            if (min_free_memkb && nodes_free_memkb < min_free_memkb)
+                continue;
+
+            /* And the same applies if this combination is short in cpus */
+            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
+            if (min_cpus && nodes_cpus < min_cpus)
+                continue;
+
+            /*
+             * Conditions are met, we can compare this candidate with the
+             * current best one (if any).
+             */
+            libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
+            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+            new_cndt.free_memkb = nodes_free_memkb;
+            new_cndt.nr_nodes = min_nodes;
+            new_cndt.nr_cpus = nodes_cpus;
+
+            /*
+             * Check if the new candidate is better than what we found up
+             * to now by means of the comparison function. If no comparison
+             * function is provided, just return as soon as we find our first
+             * candidate.
+             */
+            if (*cndt_found == 0 || numa_cmpf(&new_cndt, cndt_out) < 0) {
+                *cndt_found = 1;
+
+                LOG(DEBUG, "New best NUMA placement candidate found: "
+                           "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
+                           "free_memkb=%"PRIu32"", min_nodes, new_cndt.nr_cpus,
+                           new_cndt.nr_vcpus, new_cndt.free_memkb / 1024);
+
+                libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
+                cndt_out->nr_vcpus = new_cndt.nr_vcpus;
+                cndt_out->free_memkb = new_cndt.free_memkb;
+                cndt_out->nr_nodes = new_cndt.nr_nodes;
+                cndt_out->nr_cpus = new_cndt.nr_cpus;
+
+                if (numa_cmpf == NULL)
+                    break;
+            }
+        }
+        min_nodes++;
+    }
+
+    if (*cndt_found == 0)
+        LOG(NOTICE, "NUMA placement failed, performance might be affected");
+
+ out:
+    libxl_bitmap_dispose(&nodemap);
+    libxl__numa_candidate_dispose(&new_cndt);
+    libxl_numainfo_list_free(ninfo, nr_nodes);
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */


* [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file
  2012-07-21  1:22 [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl Dario Faggioli
  2012-07-21  1:22 ` [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
@ 2012-07-21  1:22 ` Dario Faggioli
  2012-07-23 11:04   ` Ian Jackson
  2012-07-21  1:22 ` [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
  2012-07-21  1:22 ` [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation Dario Faggioli
  3 siblings, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-21  1:22 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

So that we attempt automatic placement only if that was not the case.
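
For what it is worth, a libxl caller other than xl could use the new field in
the same way; a hypothetical sketch (not part of the patch, only the field
name is the one introduced here):

libxl_domain_build_info b_info;

libxl_domain_build_info_init(&b_info);
/* the caller fills in its own b_info.cpumap here, and then ... */
/* ... tells libxl not to override it with automatic placement */
libxl_defbool_set(&b_info.numa_placement, false);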

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -215,6 +215,8 @@ int libxl__domain_build_info_setdefault(
         libxl_bitmap_set_any(&b_info->cpumap);
     }
 
+    libxl_defbool_setdefault(&b_info->numa_placement, true);
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -212,7 +212,8 @@ int libxl__build_pre(libxl__gc *gc, uint
      * call to libxl_set_vcpuaffinity_all() will do the actual placement,
      * whatever that turns out to be.
      */
-    if (libxl_bitmap_is_full(&info->cpumap)) {
+    if (libxl_defbool_val(info->numa_placement) &&
+        libxl_bitmap_is_full(&info->cpumap)) {
         int rc = numa_place_domain(gc, info);
         if (rc)
             return rc;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -249,6 +249,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
     ("cpumap",          libxl_bitmap),
+    ("numa_placement",  libxl_defbool),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),
     ("target_memkb",    MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -696,6 +696,9 @@ static void parse_config_data(const char
                 vcpu_to_pcpu[n_cpus] = i;
             n_cpus++;
         }
+
+        /* We have a cpumap, disable automatic placement */
+        libxl_defbool_set(&b_info->numa_placement, false);
     }
     else if (!xlu_cfg_get_string (config, "cpus", &buf, 0)) {
         char *buf2 = strdup(buf);
@@ -709,6 +712,8 @@ static void parse_config_data(const char
         if (vcpupin_parse(buf2, &b_info->cpumap))
             exit(1);
         free(buf2);
+
+        libxl_defbool_set(&b_info->numa_placement, false);
     }
 
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {


* [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools
  2012-07-21  1:22 [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl Dario Faggioli
  2012-07-21  1:22 ` [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
  2012-07-21  1:22 ` [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file Dario Faggioli
@ 2012-07-21  1:22 ` Dario Faggioli
  2012-07-23 11:38   ` Ian Jackson
  2012-07-21  1:22 ` [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation Dario Faggioli
  3 siblings, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-21  1:22 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

In such a way that only the cpus belonging to the cpupool of the
domain being placed are considered for the placement itself.

This happens by filtering out of the placement candidates all the nodes
in which the cpupool does not have any cpu. After that ---as cpu pooling
does not necessarily happen at NUMA node boundaries--- we also make sure
that only the cpus that are actually part of the pool are considered when
counting how many processors a placement candidate provides.
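
To illustrate the idea (with hypothetical names and plain arrays instead of
the libxl bitmaps the patch actually uses), mapping a combination of indexes
onto only the "suitable" nodes works roughly like this:

/* With suitable_nodes = {1, 4, 5, 7} (i.e., the nodes where the cpupool
 * has cpus), the combination {0, 2} selects host nodes 1 and 5. */
static void indexes_to_nodes(const int *comb, int k,
                             const int *suitable_nodes, int *out_nodes)
{
    int i;

    for (i = 0; i < k; i++)
        out_nodes[i] = suitable_nodes[comb[i]];
}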

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

---
Changes from v4:
 * fixed a bug in the cpupool filtering logic (the number of nodes should
   also be affected by the filtering)

Changes from v3:
 * fixed a double free on non-NUMA systems (namely, when just 1 node exists).
 * Changed the filtering logic: now only the meaningful combinations and
   nodemap are being considered.

Changes from v2:
 * fixed typos in comments.

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -136,26 +136,30 @@ static int numa_cmpf(const libxl__numa_c
 }
 
 /* The actual automatic NUMA placement routine */
-static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+static int numa_place_domain(libxl__gc *gc, uint32_t domid,
+                             libxl_domain_build_info *info)
 {
     int found;
     libxl__numa_candidate candidate;
     libxl_bitmap candidate_nodemap;
-    libxl_cpupoolinfo *pinfo;
-    int nr_pools, rc = 0;
+    libxl_cpupoolinfo cpupool_info;
+    int i, cpupool, rc = 0;
     uint32_t memkb;
 
     libxl__numa_candidate_init(&candidate);
     libxl_bitmap_init(&candidate_nodemap);
 
-    /* First of all, if cpupools are in use, better not to mess with them */
-    pinfo = libxl_list_cpupool(CTX, &nr_pools);
-    if (!pinfo)
-        return ERROR_FAIL;
-    if (nr_pools > 1) {
-        LOG(NOTICE, "Skipping NUMA placement as cpupools are in use");
-        goto out;
-    }
+    /*
+     * Extract the cpumap from the cpupool the domain belongs to. In fact,
+     * it only makes sense to consider the cpus/nodes that are in there
+     * for placement.
+     */
+    rc = cpupool = libxl__domain_cpupool(gc, domid);
+    if (rc < 0)
+        return rc;
+    rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool);
+    if (rc)
+        return rc;
 
     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
@@ -167,7 +171,8 @@ static int numa_place_domain(libxl__gc *
 
     /* Find the best candidate with enough free memory and at least
      * as many pcpus as the domain has vcpus.  */
-    rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus, 0, 0,
+    rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus,
+                                   0, 0, &cpupool_info.cpumap,
                                    numa_cmpf, &candidate, &found);
     if (rc)
         goto out;
@@ -183,6 +188,13 @@ static int numa_place_domain(libxl__gc *
     if (rc)
         goto out;
 
+    /* Avoid trying to set the affinity to cpus that might be in the
+     * nodemap but not in our cpupool. */
+    libxl_for_each_set_bit(i, info->cpumap) {
+        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
+            libxl_bitmap_reset(&info->cpumap, i);
+    }
+
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
                 "%"PRIu32" KB free selected", candidate.nr_nodes,
                 candidate.nr_cpus, candidate.free_memkb / 1024);
@@ -190,7 +202,7 @@ static int numa_place_domain(libxl__gc *
  out:
     libxl__numa_candidate_dispose(&candidate);
     libxl_bitmap_dispose(&candidate_nodemap);
-    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }
 
@@ -214,7 +226,7 @@ int libxl__build_pre(libxl__gc *gc, uint
      */
     if (libxl_defbool_val(info->numa_placement) &&
         libxl_bitmap_is_full(&info->cpumap)) {
-        int rc = numa_place_domain(gc, info);
+        int rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
     }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2289,6 +2289,10 @@ typedef int (*libxl__numa_candidate_cmpf
  * other hand, if not even one single candidate can be found, the function
  * still returns successfully but cndt_found will be zero.
  *
+ * Finally, suitable_cpumap is useful for telling that only the cpus in that
+ * mask should be considered when generating placement candidates (for
+ * example because of cpupools).
+ *
  * It is up to the function to properly allocate cndt_out (by calling
  * libxl__numa_candidate_alloc()), while it is the caller that should init
  * (libxl__numa_candidate_init()) and free (libxl__numa_candidate_dispose())
@@ -2297,6 +2301,7 @@ typedef int (*libxl__numa_candidate_cmpf
 _hidden int libxl__get_numa_candidate(libxl__gc *gc,
                                       uint32_t min_free_memkb, int min_cpus,
                                       int min_nodes, int max_nodes,
+                                      const libxl_bitmap *suitable_cpumap,
                                       libxl__numa_candidate_cmpf numa_cmpf,
                                       libxl__numa_candidate *cndt_out,
                                       int *cndt_found);
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -105,30 +105,48 @@ static int comb_next(comb_iter_t it, int
 /* NUMA automatic placement (see libxl_internal.h for details) */
 
 /*
- * This function turns a k-combination iterator into a node map.
- * This means the bits in the node map corresponding to the indexes
- * of the given combination are the ones that will be set.
- * For example, if the iterator represents the combination { 0, 2, 4},
- * the node map will have bits #0, #2 and #4 set.
+ * This function turns a k-combination iterator into a node map,
+ * given another map, telling us which nodes should be considered.
+ *
+ * This means the bits that are set in suitable_nodemap and that
+ * correspond to the indexes of the given combination are the ones
+ * that will be set in nodemap.
+ *
+ * For example, given a fully set suitable_nodemap, if the iterator
+ * represents the combination { 0, 2, 4 }, nodemap will have bits #0,
+ * #2 and #4 set.
+ * On the other hand, if, say, suitable_nodemap=01011011, the same
+ * iterator will cause bits #1, #4 and #7 of nodemap to be set.
  */
-static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *suitable_nodemap,
+                             libxl_bitmap *nodemap, int k)
 {
-    int i;
+    int i, m = 0, n = 0;
 
     libxl_bitmap_set_none(nodemap);
-    for (i = 0; i < k; i++)
-        libxl_bitmap_set(nodemap, it[i]);
+    libxl_for_each_set_bit(i, *suitable_nodemap) {
+        /* Check whether the n-th set bit of suitable_nodemap
+         * matches the m-th element of the iterator (and,
+         * only if it does, advance to the next one) */
+        if (m < k && n == it[m]) {
+            libxl_bitmap_set(nodemap, i);
+            m++;
+        }
+        n++;
+    }
 }
 
 /* Retrieve the number of cpus that the nodes that are part of the nodemap
- * span. */
+ * span and are also set in suitable_cpumap. */
 static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                              const libxl_bitmap *suitable_cpumap,
                               const libxl_bitmap *nodemap)
 {
     int i, nodes_cpus = 0;
 
     for (i = 0; i < nr_cpus; i++) {
-        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+        if (libxl_bitmap_test(suitable_cpumap, i) &&
+            libxl_bitmap_test(nodemap, tinfo[i].node))
             nodes_cpus++;
     }
     return nodes_cpus;
@@ -242,6 +260,7 @@ static int count_cpus_per_node(libxl_cpu
 int libxl__get_numa_candidate(libxl__gc *gc,
                               uint32_t min_free_memkb, int min_cpus,
                               int min_nodes, int max_nodes,
+                              const libxl_bitmap *suitable_cpumap,
                               libxl__numa_candidate_cmpf numa_cmpf,
                               libxl__numa_candidate *cndt_out,
                               int *cndt_found)
@@ -249,11 +268,12 @@ int libxl__get_numa_candidate(libxl__gc 
     libxl__numa_candidate new_cndt;
     libxl_cputopology *tinfo = NULL;
     libxl_numainfo *ninfo = NULL;
-    int nr_nodes = 0, nr_cpus = 0;
-    libxl_bitmap nodemap;
+    int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
+    libxl_bitmap suitable_nodemap, nodemap;
     int rc = 0;
 
     libxl_bitmap_init(&nodemap);
+    libxl_bitmap_init(&suitable_nodemap);
     libxl__numa_candidate_init(&new_cndt);
 
     /* Get platform info and prepare the map for testing the combinations */
@@ -299,6 +319,15 @@ int libxl__get_numa_candidate(libxl__gc 
     if (rc)
         goto out;
 
+    /* Allocate and prepare the map of the nodes that can be utilized for
+     * placement, based on the map of suitable cpus. */
+    rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap, 0);
+    if (rc)
+        goto out;
+    rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap);
+    if (rc)
+        goto out;
+
     /*
      * If the minimum number of NUMA nodes is not explicitly specified
      * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
@@ -316,10 +345,14 @@ int libxl__get_numa_candidate(libxl__gc 
         else
             min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
     }
-    if (min_nodes > nr_nodes)
-        min_nodes = nr_nodes;
-    if (!max_nodes || max_nodes > nr_nodes)
-        max_nodes = nr_nodes;
+    /* We also need to be sure we do not exceed the number of
+     * nodes we are allowed to use. */
+    nr_suit_nodes = libxl_bitmap_count_set(&suitable_nodemap);
+
+    if (min_nodes > nr_suit_nodes)
+        min_nodes = nr_suit_nodes;
+    if (!max_nodes || max_nodes > nr_suit_nodes)
+        max_nodes = nr_suit_nodes;
     if (min_nodes > max_nodes) {
         rc = ERROR_INVAL;
         goto out;
@@ -351,12 +384,16 @@ int libxl__get_numa_candidate(libxl__gc 
 	 * amount of free memory and number of cpus) and it can concur to
          * become our best placement iff it passes the check.
          */
-        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
-             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+        for (comb_ok = comb_init(gc, &comb_iter, nr_suit_nodes, min_nodes);
+             comb_ok;
+             comb_ok = comb_next(comb_iter, nr_suit_nodes, min_nodes)) {
             uint32_t nodes_free_memkb;
             int nodes_cpus;
 
-            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+            /* Get the nodemap for the combination, only considering
+             * suitable nodes. */
+            comb_get_nodemap(comb_iter, &suitable_nodemap,
+                             &nodemap, min_nodes);
 
             /* If there is not enough memory in this combination, skip it
              * and go generating the next one... */
@@ -365,7 +402,8 @@ int libxl__get_numa_candidate(libxl__gc 
                 continue;
 
             /* And the same applies if this combination is short in cpus */
-            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap);
+            nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, suitable_cpumap,
+                                            &nodemap);
             if (min_cpus && nodes_cpus < min_cpus)
                 continue;
 
@@ -376,7 +414,7 @@ int libxl__get_numa_candidate(libxl__gc 
             libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
             new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
             new_cndt.free_memkb = nodes_free_memkb;
-            new_cndt.nr_nodes = min_nodes;
+            new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
             new_cndt.nr_cpus = nodes_cpus;
 
             /*
@@ -390,8 +428,9 @@ int libxl__get_numa_candidate(libxl__gc 
 
                 LOG(DEBUG, "New best NUMA placement candidate found: "
                            "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
-                           "free_memkb=%"PRIu32"", min_nodes, new_cndt.nr_cpus,
-                           new_cndt.nr_vcpus, new_cndt.free_memkb / 1024);
+                           "free_memkb=%"PRIu32"", new_cndt.nr_nodes,
+                           new_cndt.nr_cpus, new_cndt.nr_vcpus,
+                           new_cndt.free_memkb / 1024);
 
                 libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
                 cndt_out->nr_vcpus = new_cndt.nr_vcpus;
@@ -411,6 +450,7 @@ int libxl__get_numa_candidate(libxl__gc 
 
  out:
     libxl_bitmap_dispose(&nodemap);
+    libxl_bitmap_dispose(&suitable_nodemap);
     libxl__numa_candidate_dispose(&new_cndt);
     libxl_numainfo_list_free(ninfo, nr_nodes);
     libxl_cputopology_list_free(tinfo, nr_cpus);


* [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation
  2012-07-21  1:22 [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl Dario Faggioli
                   ` (2 preceding siblings ...)
  2012-07-21  1:22 ` [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
@ 2012-07-21  1:22 ` Dario Faggioli
  3 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-21  1:22 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, Ian Jackson

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

---
Changes from v5:
 * text updated to reflect the modified behaviour.

Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,90 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA means that the memory access times of a program running on a CPU depend
+on the relative distance between that CPU and that memory. In fact, most NUMA
+systems are built in such a way that each processor has its own local memory,
+on which it can operate very fast. On the other hand, getting and storing data
+from and to remote memory (that is, memory local to some other processor) is
+much more complex and slow. On these machines, a NUMA node is usually defined
+as a set of processor cores (typically a physical CPU package) and the memory
+directly attached to the set of cores.
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by
+assigning to each domain a "node affinity", i.e., a set of NUMA nodes of the
+host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing non
+node-local memory locations is very high, and the performance degradation is
+likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing, the only differences residing in
+the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify where a domain should
+be created and scheduled, directly in its config file. This affects NUMA
+placement and memory accesses, as the hypervisor constructs the node affinity
+of a VM based on its CPU affinity when it is created.
+
+This is very simple and effective, but requires the user/system administrator
+to explicitly specify affinities for each and every domain, or Xen won't be
+able to guarantee the locality for their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools (available in the upcoming release of Xen, 4.2). Again, this could be
+"The Right Answer" for many needs and occasions, but it has to be carefully
+considered and set up by hand.
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries to figure out
+on its own on which node(s) the domain could fit best.  It is worthwhile noting
+that optimally fitting a set of VMs on the NUMA nodes of a host is an
+incarnation of the Bin Packing Problem. In fact, the various VMs with different
+memory sizes are the items to be packed, and the host nodes are the bins. As
+this problem is known to be NP-hard, we will be using some heuristics.
+
+The first thing to do is find the nodes, or the sets of nodes (from now on
+referred to as 'candidates'), that have enough free memory and enough physical
+CPUs to accommodate the new domain. The idea is to find a spot for the domain
+with at least as much free memory as it is configured to have, and as many
+pCPUs as it has vCPUs. After that, the actual decision on which candidate to
+pick happens according to the following heuristics:
+
+  *  candidates involving fewer nodes are always the best. In case two (or
+     more) candidates span the same number of nodes,
+  *  the number of vCPUs currently able to run on the candidates, and how much
+     free memory they have, are both considered. In doing that, candidates
+     with a smaller number of runnable vCPUs and a greater amount of free
+     memory are preferred, with the number of vCPUs "weighing" three times as
+     much as free memory.
+
+Giving preference to candidates with fewer nodes ensures better performance
+for the guest, as it avoids spreading its memory among different nodes.
+Favoring candidates with fewer vCPUs already runnable there ensures a good
+balance of the overall host load. Finally, if more candidates fulfil these
+criteria to roughly the same extent, prioritizing the nodes that have the
+largest amounts of free memory helps keep memory fragmentation small, and
+maximizes the probability of being able to put more domains there.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement simply because libxl does it internally.
+No API is provided (yet) for interacting with this feature or for modifying
+the library's behaviour regarding automatic placement; it just happens by
+default if no affinity is specified (as is the case with xm/xend).
+
+For actually looking at, and maybe tweaking, the mechanism and the algorithms
+it uses, everything is implemented as a set of libxl internal interfaces and
+facilities. Look for the comment "Automatic NUMA placement" in
+libxl\_internal.h.
+
+Note that this may change in future versions of Xen/libxl.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file
  2012-07-21  1:22 ` [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file Dario Faggioli
@ 2012-07-23 11:04   ` Ian Jackson
  2012-07-23 13:35     ` Dario Faggioli
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Jackson @ 2012-07-23 11:04 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel

Dario Faggioli writes ("[PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file"):
> So that we attempt automatic placement only if that was not the case.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -215,6 +215,8 @@ int libxl__domain_build_info_setdefault(
>          libxl_bitmap_set_any(&b_info->cpumap);
>      }
>  
> +    libxl_defbool_setdefault(&b_info->numa_placement, true);
> +
>      if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>          b_info->max_memkb = 32 * 1024;
>      if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -212,7 +212,8 @@ int libxl__build_pre(libxl__gc *gc, uint
>       * call to libxl_set_vcpuaffinity_all() will do the actual placement,
>       * whatever that turns out to be.
>       */
> -    if (libxl_bitmap_is_full(&info->cpumap)) {
> +    if (libxl_defbool_val(info->numa_placement) &&
> +        libxl_bitmap_is_full(&info->cpumap)) {
>          int rc = numa_place_domain(gc, info);

Shouldn't this be

  -    if (libxl_bitmap_is_full(&info->cpumap)) {
  +    if (libxl_defbool_val(info->numa_placement)) {
  +        if (!libxl_bitmap_is_full(&info->cpumap)) {
  +            rc = ERROR_INVAL;
  +            LOG blah blah invalid not supported
  +            goto out;
  +        }

or the equivalent ?
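I.e., in full, something along these lines (just a sketch, error message
wording and exact error handling up to you):

      if (libxl_defbool_val(info->numa_placement)) {
          if (!libxl_bitmap_is_full(&info->cpumap)) {
              LOG(ERROR, "explicit cpu affinity is not compatible with "
                         "automatic NUMA placement");
              rc = ERROR_INVAL;
              goto out;
          }
          rc = numa_place_domain(gc, info);
          if (rc)
              goto out;
      }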

Apart from that it looks good to me.

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-21  1:22 ` [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
@ 2012-07-23 11:38   ` Ian Jackson
  2012-07-23 14:39     ` Dario Faggioli
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Jackson @ 2012-07-23 11:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel

Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> If a domain does not have a VCPU affinity, try to pin it automatically
> to some PCPUs. This is done taking into account the NUMA characteristics
> of the host. In fact, we look for a combination of host's NUMA nodes
> with enough free memory and number of PCPUs for the new domain, and pin
> it to the VCPUs of those nodes.

I'm afraid your new algorithm appears to rely on an inconsistent
comparison function:

> +static int numa_cmpf(const libxl__numa_candidate *c1,
> +                     const libxl__numa_candidate *c2)
> +{
> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
> +
> +    return SIGN(freememkb_diff + 3*nrvcpus_diff);
> +}

I don't think this comparison function necessarily defines a partial
order.  Can you provide a proof that it does ?  I think you probably
can't.  If you think it _is_ a partial order please let me know and I
will try to construct a counterexample.

I think your comparison function should be of the form:

   static int numa_cmpf(const libxl__numa_candidate *a,
                        const libxl__numa_candidate *b)
   {
       scorea = score(a);  scoreb = score(b);
       return SIGN(scorea - scoreb);
       /* if scores are smalllish ints, can just return scorea-scoreb */
   }

or more generally:

   static int numa_cmpf(const libxl__numa_candidate *a,
                        const libxl__numa_candidate *b)
   {
       score1a = score1(a);  score1b = score1(b);
       if (score1a != score1b) return SIGN(score1a - score1b);
          /* if scores are smalllish ints, can just return score1a-score1b */

       score2a = score2(a);  score2b = score2(b);
       if (score2a != score2b) return SIGN(score2a - score2b);
       /* etc. */

       return 0;
   }

That clearly defines a partial order.  The important point is that the
functions score() must each be calculated only from the node in
question.  score() must not take the other candidate as an argument.
Your normalized_diff function violates this.

If you like you could calculate the scores once for each node and
store them in libxl__numa_candidate rather than the ingredients for
the score (max_memkb) etc., in which case the comparison function is a
simple lexical comparison of the scores.
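For concreteness, here is a minimal sketch of that last option, using the
nr_vcpus and free_memkb values the candidates already carry.  The sign
convention would of course have to match what the caller of numa_cmpf
expects, and note that this turns your 3-to-1 weighting into a strict
priority of vcpus over memory:

   static int numa_cmpf(const libxl__numa_candidate *a,
                        const libxl__numa_candidate *b)
   {
       /* Primary criterion: fewer vcpus already runnable on the
        * candidate is better. */
       if (a->nr_vcpus != b->nr_vcpus)
           return a->nr_vcpus - b->nr_vcpus;

       /* Tie-break: more free memory is better.  Compare rather than
        * subtract, so unsigned wraparound cannot bite us. */
       if (a->free_memkb != b->free_memkb)
           return a->free_memkb > b->free_memkb ? -1 : 1;

       return 0;
   }

Only per-candidate values are looked at, so transitivity and consistency
are immediate.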

> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
...
> +    /* Not even a suitable placement candidate! Let's just don't touch the
> +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> +    if (found == 0)
> +        goto out;

This probably deserves a log message.

> @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
> 
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up one.
> +     * In case numa_place_domain() find at least a suitable candidate, it will

This line is 78 characters, leading to wrap damage in my email reply
to you.  The official coding style says you may use up to 80 which is
tolerable in the odd line of code but it would be nice if wordwrapped
comments were wrapped to a shorter distance.

> +     * affect info->cpumap accordingly; if it does not, it just leaves it
> +     * as it is. This means (unless some weird error manifests) the subsequent
> +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> +

As discussed, this is wrong; the fix is in your 2/4 but needs to be
folded into this patch I think.  Patches should not introduce new
bugs even if they're about to be fixed in the next commit.

> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2216,6 +2216,125 @@ static inline void libxl__ctx_unlock(lib
...
> + * The intended usage is as follows:
> + *  1. fist of all, call libxl__get_numa_candidates(), and specify the proper
          first


> +/*
> + * Looks for the placement candidates that satisfyies some specific
> + * conditions and return the best one according to the provided
> + * comparison function.
> + */
> +int libxl__get_numa_candidate(libxl__gc *gc,
> +                              uint32_t min_free_memkb, int min_cpus,
> +                              int min_nodes, int max_nodes,
> +                              libxl__numa_candidate_cmpf numa_cmpf,
> +                              libxl__numa_candidate *cndt_out,
> +                              int *cndt_found)
> +{
...
> +    /*
> +     * The good thing about this solution is that it is based on heuristics
> +     * (implemented in numa_cmpf() ), but we at least can evaluate it on
> +     * all the possible placement candidates. That can happen because the
> +     * number of nodes present in current NUMA systems is quite small.
> +     * In fact, even if a sum of binomials is involved, if the system has
> +     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
> +     * sensible value, as it is exactly the number of nodes the biggest
> +     * NUMA systems provide at the time of this writing (and will probably
> +     * continue to do so for a while). However, computanional complexity
> +     * would explode on systems much bigger than that. 16 nodes system would
> +     * still be fine (it will be 65536 steps), but it's really importante we
                                                                  important

Isn't this an argument that the limit should be 16 ?  (Also 65535
steps since placement on 0 nodes is typically not an option.)
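(For completeness: with n nodes there are at most
sum_{k=1..n} C(n,k) = 2^n - 1 candidates to evaluate, i.e. 255 for 8 nodes
and 65535 for 16.)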

> +     * avoid trying to run this on monsters with 32, 64 or more nodes (if
> +     * they ever pop into being). Therefore, here it comes a safety catch
> +     * that disables the algorithm for the cases when it wouldn't work well.
> +     */
> +    if (nr_nodes > 8) {
> +        /* Log we did nothing and return 0, as no real error occurred */
> +        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
> +                  "placement algorithm to work effectively. Skipping it",
> +                  nr_nodes);

It would be nice if this message suggested pinning the vcpus.
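Something along these lines, perhaps (just a sketch, exact wording up to
you):

    LOG(WARN, "System has %d NUMA nodes, which is too many for the "
              "placement algorithm to work effectively. Skipping it; "
              "consider manually pinning the domain's vCPUs (e.g. via "
              "the \"cpus=\" config option) if memory locality matters",
              nr_nodes);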


> +    if (min_nodes > nr_nodes)
> +        min_nodes = nr_nodes;
> +    if (!max_nodes || max_nodes > nr_nodes)
> +        max_nodes = nr_nodes;
> +    if (min_nodes > max_nodes) {
> +        rc = ERROR_INVAL;
> +        goto out;

This should be accompanied by a log message.
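For instance (again just a sketch):

    if (min_nodes > max_nodes) {
        LOG(ERROR, "Inconsistent minimum (%d) and maximum (%d) number of "
                   "nodes for the placement candidates", min_nodes, max_nodes);
        rc = ERROR_INVAL;
        goto out;
    }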

> +    while (min_nodes <= max_nodes && *cndt_found == 0) {
> +        comb_iter_t comb_iter;
> +        int comb_ok;
> +
> +        /*
> +        * And here it is. Each step of this cycle generates a combination of
> +        * nodes as big as min_nodes mandates.  Each of these combinations is
> +        * checked against the constraints provided by the caller (namely,
> +        * amount of free memory and number of cpus) and it can concur to
> +         * become our best placement iff it passes the check.
> +         */

Indentation error.


Thanks,
Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools
  2012-07-21  1:22 ` [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
@ 2012-07-23 11:38   ` Ian Jackson
  2012-07-23 13:21     ` Dario Faggioli
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Jackson @ 2012-07-23 11:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel

Dario Faggioli writes ("[PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools"):
> In such a way that only the cpus belonging to the cpupool of the
> domain being placed are considered for the placement itself.
> 
> This happens by filtering out all the nodes in which the cpupool
> has no cpu from the placement candidates. After that ---as
> cpu pooling does not necessarily happen at NUMA node boundaries--- we
> also make sure only the actual cpus that are part of the pool are
> considered when counting how many processors a placement candidate
> provides.

Is this patch essential for the 4.2 release ?

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools
  2012-07-23 11:38   ` Ian Jackson
@ 2012-07-23 13:21     ` Dario Faggioli
  0 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-23 13:21 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1207 bytes --]

On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools"):
> > In such a way that only the cpus belonging to the cpupool of the
> > domain being placed are considered for the placement itself.
> > 
> > This happens by filtering out all the nodes in which the cpupool
> > has no cpu from the placement candidates. After that ---as
> > cpu pooling does not necessarily happen at NUMA node boundaries--- we
> > also make sure only the actual cpus that are part of the pool are
> > considered when counting how many processors a placement candidate
> > provides.
> 
> Is this patch essential for the 4.2 release ?
> 
I think it is, as cpupools are officially part of 4.2, IIRC, so ignoring
them doesn't look like the right thing to do.

Of course that is debatable; it's just my opinion. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file
  2012-07-23 11:04   ` Ian Jackson
@ 2012-07-23 13:35     ` Dario Faggioli
  0 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-23 13:35 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1475 bytes --]

On Mon, 2012-07-23 at 12:04 +0100, Ian Jackson wrote:
> > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> > --- a/tools/libxl/libxl_dom.c
> > +++ b/tools/libxl/libxl_dom.c
> > @@ -212,7 +212,8 @@ int libxl__build_pre(libxl__gc *gc, uint
> >       * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> >       * whatever that turns out to be.
> >       */
> > -    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +    if (libxl_defbool_val(info->numa_placement) &&
> > +        libxl_bitmap_is_full(&info->cpumap)) {
> >          int rc = numa_place_domain(gc, info);
> 
> Shouldn't this be
> 
>   -    if (libxl_bitmap_is_full(&info->cpumap)) {
>   +    if (libxl_defbool_val(info->numa_placement)) {
>   +        if (!libxl_bitmap_is_full(&info->cpumap)) {
>   +            rc = ERROR_INVAL;
>   +            LOG blah blah invalid not supported
>   +            goto out;
>   +        }
> 
> or the equivalent ?
> 
Mmm... So that if one manages to set numa_placement to true but also
specifies a cpumap, we bail out, right? Yes, I think I can go for this.

> Apart from that it looks good to me.
> 
Cool. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-23 11:38   ` Ian Jackson
@ 2012-07-23 14:39     ` Dario Faggioli
  2012-07-23 15:50       ` Ian Jackson
  0 siblings, 1 reply; 13+ messages in thread
From: Dario Faggioli @ 2012-07-23 14:39 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 6669 bytes --]

On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> > If a domain does not have a VCPU affinity, try to pin it automatically
> > to some PCPUs. This is done taking into account the NUMA characteristics
> > of the host. In fact, we look for a combination of host's NUMA nodes
> > with enough free memory and number of PCPUs for the new domain, and pin
> > it to the VCPUs of those nodes.
> 
> I'm afraid your new algorithm appears to rely on an inconsistent
> comparison function:
> 
It's not that new, at least not the comparison function, it has been
there for a couple of versions... Anyway...

> > +static int numa_cmpf(const libxl__numa_candidate *c1,
> > +                     const libxl__numa_candidate *c2)
> > +{
> > +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> > +    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
> > +
> > +    return SIGN(freememkb_diff + 3*nrvcpus_diff);
> > +}
> 
> I don't think this comparison function necessarily defines a partial
> order.  Can you provide a proof that it does ?  I think you probably
> can't.  If you think it _is_ a partial order please let me know and I
> will try to construct a counterexample.
> 
Nhaa, don't bother, I'll change it into something simpler.

>
>[..]
>
>
> That clearly defines a partial order.  The important point is that the
> functions score() must each be calculated only from the node in
> question.  score() must not take the other candidate as an argument.
> Your normalized_diff function violates this.
> 
I think it should be fine, as I'm only using the other candidate's
characteristics for normalization. But again, I'm fine with turning this
into something simpler than it is now, and adhering to the criterion
above, which will make things even clearer. Thanks.

> > +    /* Not even a suitable placement candidate! Let's just don't touch the
> > +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> > +    if (found == 0)
> > +        goto out;
> 
> This probably deserves a log message.
> 
It's there: it's being printed by libxl__get_numa_candidate(). It's like
that to avoid printing more of them, which would be confusing, e.g.,
something like this:

 libxl: ERROR Too many nodes
 libxl: ERROR No placement found

Is that acceptable?


> > @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
> >      uint32_t rtc_timeoffset;
> > 
> >      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> > +
> > +    /*
> > +     * Check if the domain has any CPU affinity. If not, try to build up one.
> > +     * In case numa_place_domain() find at least a suitable candidate, it will
> 
> This line is 78 characters, leading to wrap damage in my email reply
> to you.  The official coding style says you may use up to 80 which is
> tolerable in the odd line of code but it would be nice if wordwrapped
> comments were wrapped to a shorter distance.
> 
Ok, will take a look at this, here and everywhere else. Thanks.

> > +     * affect info->cpumap accordingly; if it does not, it just leaves it
> > +     * as it is. This means (unless some weird error manifests) the subsequent
> > +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> > +     * whatever that turns out to be.
> > +     */
> > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +        int rc = numa_place_domain(gc, info);
> > +        if (rc)
> > +            return rc;
> > +    }
> >      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> > +
> 
> As discussed, this is wrong; the fix is in your 2/4 but needs to be
> folded into this patch I think.  Patches should not introduce new
> bugs even if they're about to be fixed in the next commit.
> 
Ok, I'll merge the two patches.

> > +    /*
> > +     * The good thing about this solution is that it is based on heuristics
> > +     * (implemented in numa_cmpf() ), but we at least can evaluate it on
> > +     * all the possible placement candidates. That can happen because the
> > +     * number of nodes present in current NUMA systems is quite small.
> > +     * In fact, even if a sum of binomials is involved, if the system has
> > +     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
> > +     * sensible value, as it is exactly the number of nodes the biggest
> > +     * NUMA systems provide at the time of this writing (and will probably
> > +     * continue to do so for a while). However, computanional complexity
> > +     * would explode on systems much bigger than that. 16 nodes system would
> > +     * still be fine (it will be 65536 steps), but it's really importante we
>                                                                   important
> 
> Isn't this an argument that the limit should be 16 ?  (Also 65535
> steps since placement on 0 nodes is typically not an option.)
> 
Well, it is for me, but as we agreed on 8, I went for that. What I
wanted was just to make sure we (or whoever else) know/remember that 16
could work too, in case that turns out to be useful in the future. That
being said, if you agree on raising the cap to 16 right now, I'll be
glad to do that. :-D

> > +     * avoid trying to run this on monsters with 32, 64 or more nodes (if
> > +     * they ever pop into being). Therefore, here it comes a safety catch
> > +     * that disables the algorithm for the cases when it wouldn't work well.
> > +     */
> > +    if (nr_nodes > 8) {
> > +        /* Log we did nothing and return 0, as no real error occurred */
> > +        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
> > +                  "placement algorithm to work effectively. Skipping it",
> > +                  nr_nodes);
> 
> It would be nice if this message suggested pinning the vcpus.
>
> > +    if (min_nodes > nr_nodes)
> > +        min_nodes = nr_nodes;
> > +    if (!max_nodes || max_nodes > nr_nodes)
> > +        max_nodes = nr_nodes;
> > +    if (min_nodes > max_nodes) {
> > +        rc = ERROR_INVAL;
> > +        goto out;
> 
> This should be accompanied by a log message.
> 
Ok to both.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-23 14:39     ` Dario Faggioli
@ 2012-07-23 15:50       ` Ian Jackson
  2012-07-23 16:16         ` Dario Faggioli
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Jackson @ 2012-07-23 15:50 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel

Dario Faggioli writes ("Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> > This probably deserves a log message.
> 
> It's there: it's being printed by libxl__get_numa_candidate(). It's like
> that to avoid printing more of them, which would be confusing, e.g.,
> something like this:
> 
>  libxl: ERROR Too many nodes
>  libxl: ERROR No placement found
> 
> Is that acceptable?

Two messages are better than one vague one.  One message would be
better, but then you have to make sure, of course, that every path
prints exactly one message.

...
> > Isn't this an argument that the limit should be 16 ?  (Also 65535
> > steps since placement on 0 nodes is typically not an option.)
> 
> Well, it is for me, but as we agreed on 8, I went for that. What I
> wanted was just to make sure we (or whoever else) know/remember that 16
> could work too, in case that turns out to be useful in the future. That
> being said, if you agree on raising the cap to 16 right now, I'll be
> glad to do that. :-D

I don't have a particular opinion on exactly what the cap should be.
It should be sufficiently tight to prevent runaways.  A 2^16 worst
case computation on domain start is certainly arguably acceptable.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes
  2012-07-23 15:50       ` Ian Jackson
@ 2012-07-23 16:16         ` Dario Faggioli
  0 siblings, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2012-07-23 16:16 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap,
	Juergen Gross, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2490 bytes --]

On Mon, 2012-07-23 at 16:50 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> > On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> > > This probably deserves a log message.
> > 
> > It's there: it's being printed by libxl__get_numa_candidate(). It's like
> > that to avoid printing more of them, which would be confusing, e.g.,
> > something like this:
> > 
> >  libxl: ERROR Too many nodes
> >  libxl: ERROR No placement found
> > 
> > Is that acceptable?
> 
> Two messages are better than one vague one.  
>
I agree, but in this specific case, it looked particularly ugly, as we
were telling the user that we decided not to run placement _and_ also
that he should be worried because placement did not succeed! :-O

> One message would be
> better but then you have to make sure of course that every path
> prints exactly one message.
>
What I want is no path producing two (or more) conflicting indications. 

Basically, by putting the message saying that we haven't found any
placement in the function that actually looks for the placement,
rather than in its callers, I ensure that either an error happens
(and is logged) before the placement itself can take place, or the
placement manages to run, does not find anything, and logs that...
And this all looks reasonable to me.

Also, I tested the various paths (by creating fake nodes, etc.), and it
seems to behave as you ask. I'll double check that; if it turns out to
be that way, are you fine with it?

> > Well, it is for me, but as we agreed on 8, I went for that. What I
> > wanted was just to make sure we (or whoever else) know/remember that 16
> > could work too, in case that turns out to be useful in the future. That
> > being said, if you agree on raising the cap to 16 right now, I'll be
> > glad to do that. :-D
> 
> I don't have a particular opinion on exactly what the cap should be.
> It should be sufficiently tight to prevent runaways.  A 2^16 worst
> case computation on domain start is certainly arguably acceptable.
> 
Ok, I'll go for 16 then (and will fix the 65536-->65535).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-07-23 16:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-21  1:22 [PATCH 0 of 4 v6/leftover] Automatic NUMA placement for xl Dario Faggioli
2012-07-21  1:22 ` [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
2012-07-23 11:38   ` Ian Jackson
2012-07-23 14:39     ` Dario Faggioli
2012-07-23 15:50       ` Ian Jackson
2012-07-23 16:16         ` Dario Faggioli
2012-07-21  1:22 ` [PATCH 2 of 4 v6/leftover] xl: inform libxl if there was a cpumap in the config file Dario Faggioli
2012-07-23 11:04   ` Ian Jackson
2012-07-23 13:35     ` Dario Faggioli
2012-07-21  1:22 ` [PATCH 3 of 4 v6/leftover] libxl: have NUMA placement deal with cpupools Dario Faggioli
2012-07-23 11:38   ` Ian Jackson
2012-07-23 13:21     ` Dario Faggioli
2012-07-21  1:22 ` [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation Dario Faggioli
