From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dario Faggioli
Subject: [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation
Date: Sat, 21 Jul 2012 03:22:24 +0200
Message-ID: <902a6d6eb2100d9f40d3.1342833744@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: xen-devel, Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson
List-Id: xen-devel@lists.xenproject.org

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli
Acked-by: Ian Campbell
---
Changes from v5:
 * text updated to reflect the modified behaviour.

Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,90 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA means that the memory access times of a program running on a CPU depend
+on the relative distance between that CPU and that memory. In fact, most NUMA
+systems are built in such a way that each processor has its local memory, on
+which it can operate very fast. On the other hand, getting and storing data
+from and on remote memory (that is, memory local to some other processor) is
+quite a bit more complex and slower. On these machines, a NUMA node is usually
+defined as a set of processor cores (typically a physical CPU package) and the
+memory directly attached to that set of cores.
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by
+assigning to each domain a "node affinity", i.e., a set of NUMA nodes of the
+host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non node-local memory locations is very high, and the performance degradation
+is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing; the only differences reside in
+the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify, directly in the
+domain's config file, where the domain should be created and scheduled. This
+affects NUMA placement and memory accesses, as the hypervisor constructs the
+node affinity of a VM based on its CPU affinity when it is created.
+
+This is very simple and effective, but requires the user/system administrator
+to explicitly specify affinities for each and every domain, or Xen won't be
+able to guarantee locality for their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools (available in the upcoming release of Xen, 4.2). Again, this could be
+"The Right Answer" for many needs and occasions, but it has to be carefully
+considered and set up by hand.
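+
+For instance, assuming pCPUs 0-3 all belong to the same NUMA node of the host
+at hand (the actual mapping can be checked with the "xl info -n" command), a
+minimal sketch of the relevant bits of a guest config file could be:
+
+    # Hypothetical guest config snippet: pinning the domain to pCPUs 0-3
+    # keeps its vCPUs, and hence its memory, on the node those pCPUs belong to
+    name   = "numa-guest"
+    memory = 1024
+    vcpus  = 2
+    cpus   = "0-3"
+
+When partitioning the host with cpupools instead, a possible starting point
+is the "xl cpupool-numa-split" command, which splits the host into one pool
+per NUMA node; guests can then be created in (or moved to) the pool of
+choice.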
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries to figure
+out on its own on which node(s) the domain could fit best. It is worthwhile
+noting that optimally fitting a set of VMs on the NUMA nodes of a host is an
+incarnation of the Bin Packing Problem. In fact, the various VMs with
+different memory sizes are the items to be packed, and the host nodes are the
+bins. As this problem is known to be NP-hard, heuristics are used.
+
+The first thing to do is find the nodes, or the sets of nodes (from now on
+referred to as 'candidates'), that have enough free memory and enough physical
+CPUs for accommodating the new domain. The idea is to find a spot for the
+domain with at least as much free memory as it is configured to have, and as
+many pCPUs as it has vCPUs. After that, the actual decision on which candidate
+to pick happens according to the following heuristics (an illustrative sketch
+of this comparison is given at the end of this document):
+
+  * candidates involving fewer nodes are always the best. In case two (or
+    more) candidates span the same number of nodes,
+  * the number of vCPUs currently able to run on the candidates, and how much
+    free memory they have, are both considered. In doing that, candidates
+    with a smaller number of runnable vCPUs and a greater amount of free
+    memory are preferred, with the number of vCPUs "weighing" three times as
+    much as free memory.
+
+Giving preference to candidates with fewer nodes ensures better performance
+for the guest, as it avoids spreading its memory among different nodes.
+Favoring candidates with fewer vCPUs already runnable there ensures a good
+balance of the overall host load. Finally, if more candidates fulfil these
+criteria to roughly the same extent, prioritizing the ones with the largest
+amounts of free memory helps keep memory fragmentation small, and maximizes
+the probability of being able to put more domains there.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement just because libxl does it internally.
+No API is provided (yet) for interacting with this feature or for modifying
+the library's behaviour regarding automatic placement; it just happens by
+default if no affinity is specified (as is the case with xm/xend).
+
+For actually looking at, and maybe tweaking, the mechanism and the algorithms
+it uses, everything is implemented as a set of libxl internal interfaces and
+facilities. Look for the comment "Automatic NUMA placement" in
+libxl\_internal.h.
+
+Note that this may change in future versions of Xen/libxl.
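+
+As an illustration of the comparison step in the heuristics above, here is a
+minimal C sketch. It is only a sketch under simple assumptions (the structure
+and the normalization against host-wide totals are made up for the example),
+not the actual libxl code, which lives behind the internal interfaces
+mentioned in the previous section:
+
+    #include <stdbool.h>
+
+    /* Hypothetical summary of a placement candidate (illustration only). */
+    struct candidate {
+        int nr_nodes;              /* NUMA nodes spanned by the candidate */
+        int nr_vcpus;              /* vCPUs of existing domains runnable there */
+        unsigned long free_memkb;  /* free memory, in KiB */
+    };
+
+    /*
+     * Return true if candidate 'a' should be preferred to candidate 'b'.
+     * host_free_memkb and host_nr_vcpus are host-wide totals (assumed > 0),
+     * used only to bring the two metrics onto a comparable scale.
+     */
+    static bool candidate_is_better(const struct candidate *a,
+                                    const struct candidate *b,
+                                    unsigned long host_free_memkb,
+                                    int host_nr_vcpus)
+    {
+        double freemem_diff, nrvcpus_diff;
+
+        /* Candidates spanning fewer nodes always win. */
+        if (a->nr_nodes != b->nr_nodes)
+            return a->nr_nodes < b->nr_nodes;
+
+        /* More free memory is better... */
+        freemem_diff = ((double)a->free_memkb - (double)b->free_memkb) /
+                       host_free_memkb;
+        /* ...fewer already-runnable vCPUs is better... */
+        nrvcpus_diff = ((double)b->nr_vcpus - (double)a->nr_vcpus) /
+                       host_nr_vcpus;
+
+        /* ...and the vCPU count weighs three times as much as free memory. */
+        return 3.0 * nrvcpus_diff + freemem_diff > 0;
+    }
+
+Picking the best candidate is then just a matter of scanning the list of
+suitable candidates and keeping the one that wins all such comparisons.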