From: Ian Campbell <Ian.Campbell@citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation
Date: Thu, 12 Apr 2012 10:11:42 +0100
Message-ID: <1334221902.16387.45.camel@zakaz.uk.xensource.com>
In-Reply-To: <ec0abe6e2de3d35d626a.1334150277@Solace>

On Wed, 2012-04-11 at 14:17 +0100, Dario Faggioli wrote:
> Add some rationale and usage documentation for the new automatic
> NUMA placement feature of xl.
> 
> TODO: * Decide whether we want to have things like "Future Steps/Roadmap"
>         and/or "Performances/Benchmarks Results" here as well.

I think these would be better in the list archives and on the wiki
respectively.

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.txt

It looks like you are using something approximating markdown syntax
here, so you might as well name this xl-numa-placement.markdown and get
a .html version etc almost for free.

> @@ -0,0 +1,205 @@
> +               -------------------------------------
> +               NUMA Guest Placement Design and Usage
> +               -------------------------------------
> +
> +Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
> +example each domain has its own "node affinity", i.e., a set of NUMA nodes
> +of the host from which memory for that domain is allocated. That becomes
> +very important as soon as many domains start running memory-intensive
> +workloads on a shared host. In fact, accessing non node-local memory
> +locations costs much more than node-local ones, to the point that the
> +degradation in performance is likely to be noticeable.
> +
> +It is then quite obvious that any mechanism enabling most of the memory
> +accesses of most of the guest domains to stay local is very important
> +when dealing with NUMA platforms.
> +
> +
> +Node Affinity and CPU Affinity
> +------------------------------
> +
> +There is another very popular 'affinity', besides node affinity we are
> +discussing here, which is '(v)cpu affinity'. Moreover, to make things
> +even worse, the two are different but somehow related things. In fact,
> +in both Xen and Linux worlds, 'cpu affinity' is the set of CPUs a domain
> +(that would be a task, when talking about Linux) can be scheduled on.
> +This seems to have few to do with memory accesses, but it does, as the

                      ^little

> +CPU where a domain runs is also from where it tries to access its memory,
> +i.e., that is one half of what decides whether a memory access is remote
> +or local --- the other half being where the location it wants to access
> +is stored.
> +
> +Of course, if a domain is known to only run on a subset of the physical
> +CPUs of the host, it is very easy to turn all its memory accesses into
> +local ones, by just constructing it's node affinity (in Xen) basing on

                                                                ^based

> +what nodes these CPUs belong to. Actually, that is exactly what is being
> +done by the hypervisor by default, as soon as it finds out a domain (or
> +better, the vcpus of a domain, but let's avoid getting into too much
> +detail here) has a cpu affinity.
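
Just to make the above concrete: the derivation basically amounts to
OR-ing together the nodes which the CPUs in the affinity mask belong to.
A minimal standalone sketch (topology, helper names and masks are all
made up for illustration, this is not the actual hypervisor code):

    /* Sketch: derive a node affinity mask from a cpu affinity mask,
     * given an assumed cpu-to-node map. */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Assumed topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
    static const int cpu_to_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

    static uint64_t nodes_from_cpus(uint64_t cpu_affinity)
    {
        uint64_t node_affinity = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu_affinity & (1ULL << cpu))
                node_affinity |= 1ULL << cpu_to_node[cpu];
        return node_affinity;
    }

    int main(void)
    {
        /* cpus = "0, 1"  -->  node affinity is just node 0 (mask 0x1) */
        printf("node mask: %#llx\n",
               (unsigned long long)nodes_from_cpus(0x3));
        return 0;
    }
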
> +
> +This works quite well, but it requires the user/system administrator
> +to explicitly specify such a property --- the cpu affinity --- while the
> +domain is being created, or Xen won't be able to exploit that for ensuring
> +access locality.
> +
> +On the other hand, as node affinity directly affects where a domain's memory
> +lives, it makes a lot of sense for it to be involved in scheduling decisions,
> +as it would be great if the hypervisor managed to schedule all the
> +vcpus of all the domains on CPUs attached to the various domains' local
> +memory. That is why the node affinity of a domain is treated by the scheduler
> +as the set of nodes on which it would be preferable to run it, although
> +not at the cost of violating the scheduling algorithm's behavior and
> +invariants. This means Xen will check whether a vcpu of a domain can run
> +on one of the CPUs belonging to the nodes of the domain's node affinity,
> +but would rather run it somewhere else --- even on another, remote, CPU ---
> +than violate the priority ordering (e.g., by kicking out another running
> +vcpu with higher priority) it is designed to enforce.
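
In other words, node affinity acts as a soft preference layered on top of
the hard cpu affinity and the priority rules. A simplified, self-contained
sketch of that "prefer, but never violate priorities" idea (all names and
data are invented, this is not the credit scheduler code):

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4

    /* Toy per-CPU state: priority of whatever runs there (0 = idle). */
    static const int  cur_prio[NR_CPUS]     = { 5, 0, 0, 2 };
    /* CPUs within the vcpu's node affinity. */
    static const bool node_affine[NR_CPUS]  = { true, true, false, false };
    /* CPUs the vcpu affinity allows at all. */
    static const bool cpu_allowed[NR_CPUS]  = { true, true, true, true };

    static int pick_cpu(int vcpu_prio)
    {
        int c;

        /* First pass: an allowed, node-affine CPU that does not require
         * kicking out a higher-priority vcpu. */
        for (c = 0; c < NR_CPUS; c++)
            if (cpu_allowed[c] && node_affine[c] && cur_prio[c] < vcpu_prio)
                return c;
        /* Second pass: any allowed CPU, even a remote one, rather than
         * violating the priority ordering. */
        for (c = 0; c < NR_CPUS; c++)
            if (cpu_allowed[c] && cur_prio[c] < vcpu_prio)
                return c;
        return -1; /* nothing suitable right now */
    }

    int main(void)
    {
        printf("vcpu of prio 3 goes to CPU %d\n", pick_cpu(3)); /* CPU 1 */
        return 0;
    }
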
> +
> +So, last but not least, what if a domain has both vcpu and node affinity, and
> +they only partially match or they do not match at all (to understand how that
> +can happen, see the following sections)? Well, in such a case, all the
> +domain's memory will be allocated reflecting its node affinity, while
> +scheduling will happen according to its vcpu affinities, meaning that it is
> +easy enough to construct optimal, sub-optimal, neutral and even bad and awful
> +configurations (which is something nice, e.g., for benchmarking purposes).
> +The remainder of this document explains how to do so.
> +
> +
> +Specifying Node Affinity
> +------------------------
> +
> +Besides being automatically computed from the vcpu affinities of a domain
> +(or also from it being part of a cpupool) within Xen, it might make sense
> +for the user to specify the node affinity of its domains by hand, while
> +editing their config files, as another form of partitioning the host
> +resources. If that is the case, this is where the "nodes" option of the xl
> +config file becomes useful. In fact, specifying something like the below
> +
> +        nodes = [ '0', '1', '3', '4' ]
> +
> +in a domain configuration file would result in Xen assigning host NUMA nodes
> +number 0, 1, 3 and 4 to the domain's node affinity, regardless of any vcpu
> +affinity setting for the same domain. The idea is, yes, the to things are

                                                               two

> +related, and if only one is present, it makes sense to use the other for
> +inferring it, but it is always possible to explicitly specify both of them,
> +regardless of how good or awful it could end up being.
> +
> +Therefore, this is what one should expect when using "nodes", perhaps in
> +conjunction with "cpus" in a domain configuration file:
> +
> + * `cpus = "0, 1"` and no `nodes=` at all
> +   (i.e., only vcpu affinity specified):
> +     domain's vcpus can and will run only on host CPUs 0 and 1. Also, as the
> +     domain's node affinity will be computed by Xen and set to whatever
> +     nodes host CPUs 0 and 1 belong to, all the domain's memory accesses will
> +     be local accesses;
> +
> + * `nodes = [ '0', '1' ]` and no `cpus=` at all
> +   (i.e., only node affinity present):
> +     domain's vcpus can run on any of the host CPUs, but the scheduler (at
> +     least if credit is used, as it is the only scheduler supporting this
> +     right now) will try running them on the CPUs that are part of host
> +     NUMA nodes 0 and 1. Memory-wise, all the domain's memory will be
> +     allocated on host NUMA nodes 0 and 1. This means most of
> +     the memory accesses of the domain should be local, but that will
> +     depend on the on-line load, behavior and actual scheduling of both
> +     the domain in question and all the other domains on the same host;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0"`, with CPU 0 within node 0:
> +   (i.e., cpu affinity subset of node affinity):
> +     domain's vcpus can and will only run on host CPU 0. As node affinity
> +     is being explicitly set to host NUMA nodes 0 and 1 --- which includes
> +     CPU 0 --- all the memory accesses of the domain will be local;

In this case won't some of (half?) the memory come from node 1 and
therefore be non-local to cpu 0?

> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0, 4"`, with CPU 0 in node 0 but
> +   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node affinity):
> +     domain's vcpus can run on host CPUs 0 and 4, with CPU 4 not being within
> +     the node affinity (explicitly set to host NUMA nodes 0 and 1). The
> +     (credit) scheduler will try to keep memory accesses local by scheduling
> +     the domain's vcpus on CPU 0, but it may not achieve 100% success;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "4"`, with CPU 4 within, say, node 2

These examples might be a little clearer if you defined up front what
the nodes and cpus were and then used that for all of them?

> +   (i.e., cpu affinity disjoint from node affinity):
> +     domain's vcpus can and will run only on host CPU 4, i.e., completely
> +     "outside" of the chosen node affinity. That necessarily means all the
> +     domain's memory accesses will be remote.
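
For concreteness, the first two cases above would correspond to config
file fragments along these lines (name/memory/vcpus values are invented):

    # hypothetical guest config: vcpu affinity only; the node affinity is
    # then derived by Xen from the nodes CPUs 0 and 1 belong to
    name   = "numa-guest"
    memory = 2048
    vcpus  = 2
    cpus   = "0, 1"

    # alternatively, node affinity only (second case above):
    # nodes = [ '0', '1' ]
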
> +
> +
> +Automatic NUMA Placement
> +------------------------
> +
> +Just in case one does not want to take the burden of manually specifying
> +all the node (and, perhaps, CPU) affinities for all its domains, xl implements
> +some automatic placement logic. This basically means the user can ask the
> +toolstack to try sorting things out in the best possible way for him.
> +This is instead of specifying manually a domain's node affinity and can be
> +paired or not with any vcpu affinity (in case it is, the relationship between
> +vcpu and node affinities just stays as stated above). To serve this purpose,
> +a new domain config switch has been introduced, i.e., the "nodes_policy"
> +option. As the name suggests, it allows for specifying a policy to be used
> +while attempting automatic placement of the new domain. Available policies
> +at the time of writing are:

A bunch of what follows would be good to have in the xl or xl.cfg man
pages too/instead. (I started with this docs patch so I haven't actually
looked at the earlier ones yet, perhaps this is already the case)

> +
> + * "auto": automatic placement by means of a not better specified (xl
> +           implementation dependant) algorithm. It is basically for those
> +           who do want automatic placement, but have no idea what policy
> +           or algorithm would be better... <<Just give me a sane default!>>
> +
> + * "ffit": automatic placement via the First Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "bfit": automatic placement via the Best Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "wfit": automatic placement via the Worst Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
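
In the config file, this is just a single line using whichever of the
four policy strings above is wanted, for example:

    nodes_policy = "ffit"
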
> +
> +The various algorithms have been implemented as they offer different behavior
> +and performance (with respect to different metrics). For instance, First Fit
> +is known to be efficient and quick, and it generally works better than Best
> +Fit wrt memory fragmentation, although it tends to occupy "early" nodes more
> +than "late" ones. On the other hand, Best Fit aims at optimizing memory usage,
> +although it introduces quite a bit of fragmentation, by leaving large amounts
> +of small free memory areas. Finally, the idea behind Worst Fit is that it will
> +leave big enough free memory chunks to limit the amount of fragmentation, but
> +it (as well as Best Fit does) is more expensive in terms of execution time, as
> +it needs the "list" of free memory areas to be kept sorted.
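
Since the three fits are described rather tersely above, here is how the
difference plays out when applied to per-node free memory, as a purely
illustrative sketch (not the xl implementation; node sizes are invented):

    /* Pick a host node for a domain needing `need` MB, given per-node
     * free memory, using first/best/worst fit. */
    #include <stdio.h>

    #define NR_NODES 4

    enum policy { FFIT, BFIT, WFIT };

    static int pick_node(const long free_mem[NR_NODES], long need,
                         enum policy p)
    {
        int n, pick = -1;

        for (n = 0; n < NR_NODES; n++) {
            if (free_mem[n] < need)
                continue;                   /* domain does not fit here */
            if (p == FFIT)
                return n;                   /* first node that fits wins */
            if (pick == -1 ||
                (p == BFIT && free_mem[n] < free_mem[pick]) || /* tightest */
                (p == WFIT && free_mem[n] > free_mem[pick]))   /* roomiest */
                pick = n;
        }
        return pick;                        /* -1 if nothing fits */
    }

    int main(void)
    {
        const long free_mem[NR_NODES] = { 4096, 1024, 8192, 2048 }; /* MB */
        const long need = 1000;                                     /* MB */

        printf("ffit -> node %d\n", pick_node(free_mem, need, FFIT)); /* 0 */
        printf("bfit -> node %d\n", pick_node(free_mem, need, BFIT)); /* 1 */
        printf("wfit -> node %d\n", pick_node(free_mem, need, WFIT)); /* 2 */
        return 0;
    }
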
> +
> +Therefore, achieving automatic placement actually happens by properly using
> +the "nodes" and "nodes_config" configuration options as follows:
> +
> + * `nodes="auto` or `nodes_policy="auto"`:
> +     xl will try fitting the domain on the host NUMA nodes by using its
> +     own default placing algorithm, with default parameters. Most likely,
> +     all nodes will be considered suitable for the domain (unless a vcpu
> +     affinity is specified, see the last entry of this list);
> +
> + * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`) and no `nodes=` at all:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy. All nodes will be considered suitable for the
> +     domain, and consecutive fitting attempts will be performed while
> +     increasing the number of nodes on which to put the domain itself
> +     (unless a vcpu affinity is specified, see the last entry of this list);
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `nodes=2`:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy and only the number of nodes specified in `nodes=`
> +     (2 in this example).

Number of nodes rather than specifically node 2? This is different to
the examples in the preceding section?

>  All the nodes will be considered suitable for
> +     the domain, and consecutive attempts will be performed while
> +     increasing such a value;
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `cpus="0-6":
> +     xl will try fitting the domain on the host NUMA nodes to which the CPUs
> +     specified as vcpu affinity (0 to 6 in this example) belong, by using the
> +     requested policy. In case it fails, consecutive fitting attempts will
> +     be performed with both a reduced (first) and an increased (next) number
> +     of nodes.
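
So, concretely, the last case above corresponds to config file lines along
these lines (values invented):

    # hypothetical example: first-fit placement, restricted to the nodes
    # which CPUs 0-6 belong to
    nodes_policy = "ffit"
    cpus = "0-6"
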
> +
> +Different usage patterns --- like specifying both a policy and a list of
> +nodes --- are accepted, but do not make much sense after all. Therefore,
> +although xl will try its best to interpret the user's will, the resulting
> +behavior is somewhat unspecified.
