From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dario Faggioli
Subject: [PATCH 4 of 4 v6/leftover] Some automatic NUMA placement documentation
Date: Sat, 21 Jul 2012 03:22:24 +0200
Message-ID: <902a6d6eb2100d9f40d3.1342833744@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: xen-devel, Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, George Dunlap, Juergen Gross, Ian Jackson
List-Id: xen-devel@lists.xenproject.org

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli
Acked-by: Ian Campbell
---
Changes from v5:
 * text updated to reflect the modified behaviour.

Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,90 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA means that the memory access times of a program running on a CPU depend
+on the relative distance between that CPU and that memory. In fact, most NUMA
+systems are built in such a way that each processor has its local memory, on
+which it can operate very fast. On the other hand, getting and storing data
+from and on remote memory (that is, memory local to some other processor) is
+quite a bit more complex and slower. On these machines, a NUMA node is usually
+defined as a set of processor cores (typically a physical CPU package) and the
+memory directly attached to that set of cores.
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by
+assigning to each domain a "node affinity", i.e., a set of NUMA nodes of the
+host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non node-local memory locations is very high, and the performance degradation
+is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing; the only differences reside in
+the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify, directly in the
+domain's config file, where the domain should be created and scheduled. This
+affects NUMA placement and memory accesses, as the hypervisor constructs the
+node affinity of a VM based on its CPU affinity when it is created.
+
+This is very simple and effective, but requires the user/system administrator
+to explicitly specify affinities for each and every domain, or Xen won't be
+able to guarantee locality for their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools (available in the upcoming release of Xen, 4.2). Again, this could be
+"The Right Answer" for many needs and occasions, but it has to be carefully
+considered and set up by hand.
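+
+For instance, assuming pCPUs 0-3 all belong to the same NUMA node of the host
+at hand (the actual mapping can be checked with the "xl info -n" command), a
+minimal sketch of the relevant bits of a guest config file could be:
+
+    # Hypothetical guest config snippet: pinning the domain to pCPUs 0-3
+    # keeps its vCPUs, and hence its memory, on the node those pCPUs belong to
+    name   = "numa-guest"
+    memory = 1024
+    vcpus  = 2
+    cpus   = "0-3"
+
+When partitioning the host with cpupools instead, a possible starting point
+is the "xl cpupool-numa-split" command, which splits the host into one pool
+per NUMA node; guests can then be created in (or moved to) the pool of
+choice.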
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries to figure
+out on its own on which node(s) the domain could fit best. It is worthwhile
+noting that optimally fitting a set of VMs on the NUMA nodes of a host is an
+incarnation of the Bin Packing Problem. In fact, the various VMs with
+different memory sizes are the items to be packed, and the host nodes are the
+bins. As this problem is known to be NP-hard, heuristics are used.
+
+The first thing to do is find the nodes, or the sets of nodes (from now on
+referred to as 'candidates'), that have enough free memory and enough physical
+CPUs for accommodating the new domain. The idea is to find a spot for the
+domain with at least as much free memory as it is configured to have, and as
+many pCPUs as it has vCPUs. After that, the actual decision on which candidate
+to pick happens according to the following heuristics (an illustrative sketch
+of this comparison is given at the end of this document):
+
+  * candidates involving fewer nodes are always the best. In case two (or
+    more) candidates span the same number of nodes,
+  * the number of vCPUs currently able to run on the candidates, and how much
+    free memory they have, are both considered. In doing that, candidates
+    with a smaller number of runnable vCPUs and a greater amount of free
+    memory are preferred, with the number of vCPUs "weighing" three times as
+    much as free memory.
+
+Giving preference to candidates with fewer nodes ensures better performance
+for the guest, as it avoids spreading its memory among different nodes.
+Favoring candidates with fewer vCPUs already runnable there ensures a good
+balance of the overall host load. Finally, if more candidates fulfil these
+criteria to roughly the same extent, prioritizing the ones with the largest
+amounts of free memory helps keep memory fragmentation small, and maximizes
+the probability of being able to put more domains there.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement just because libxl does it internally.
+No API is provided (yet) for interacting with this feature or for modifying
+the library's behaviour regarding automatic placement; it just happens by
+default if no affinity is specified (as is the case with xm/xend).
+
+For actually looking at, and maybe tweaking, the mechanism and the algorithms
+it uses, everything is implemented as a set of libxl internal interfaces and
+facilities. Look for the comment "Automatic NUMA placement" in
+libxl\_internal.h.
+
+Note that this may change in future versions of Xen/libxl.
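+
+As an illustration of the comparison step in the heuristics above, here is a
+minimal C sketch. It is only a sketch under simple assumptions (the structure
+and the normalization against host-wide totals are made up for the example),
+not the actual libxl code, which lives behind the internal interfaces
+mentioned in the previous section:
+
+    #include <stdbool.h>
+
+    /* Hypothetical summary of a placement candidate (illustration only). */
+    struct candidate {
+        int nr_nodes;              /* NUMA nodes spanned by the candidate */
+        int nr_vcpus;              /* vCPUs of existing domains runnable there */
+        unsigned long free_memkb;  /* free memory, in KiB */
+    };
+
+    /*
+     * Return true if candidate 'a' should be preferred to candidate 'b'.
+     * host_free_memkb and host_nr_vcpus are host-wide totals (assumed > 0),
+     * used only to bring the two metrics onto a comparable scale.
+     */
+    static bool candidate_is_better(const struct candidate *a,
+                                    const struct candidate *b,
+                                    unsigned long host_free_memkb,
+                                    int host_nr_vcpus)
+    {
+        double freemem_diff, nrvcpus_diff;
+
+        /* Candidates spanning fewer nodes always win. */
+        if (a->nr_nodes != b->nr_nodes)
+            return a->nr_nodes < b->nr_nodes;
+
+        /* More free memory is better... */
+        freemem_diff = ((double)a->free_memkb - (double)b->free_memkb) /
+                       host_free_memkb;
+        /* ...fewer already-runnable vCPUs is better... */
+        nrvcpus_diff = ((double)b->nr_vcpus - (double)a->nr_vcpus) /
+                       host_nr_vcpus;
+
+        /* ...and the vCPU count weighs three times as much as free memory. */
+        return 3.0 * nrvcpus_diff + freemem_diff > 0;
+    }
+
+Picking the best candidate is then just a matter of scanning the list of
+suitable candidates and keeping the one that wins all such comparisons.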