From: Ian Campbell <Ian.Campbell@citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation
Date: Thu, 12 Apr 2012 10:11:42 +0100
Message-ID: <1334221902.16387.45.camel@zakaz.uk.xensource.com>
In-Reply-To: <ec0abe6e2de3d35d626a.1334150277@Solace>

On Wed, 2012-04-11 at 14:17 +0100, Dario Faggioli wrote:
> Add some rationale and usage documentation for the new automatic
> NUMA placement feature of xl.
> 
> TODO: * Decide whether we want to have things like "Future Steps/Roadmap"
>         and/or "Performances/Benchmarks Results" here as well.

I think these would be better in the list archives and on the wiki
respectively.

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.txt

It looks like you are using something approximating markdown syntax
here, so you might as well name this xl-numa-placement.markdown and get
a .html version etc almost for free.

> @@ -0,0 +1,205 @@
> +               -------------------------------------
> +               NUMA Guest Placement Design and Usage
> +               -------------------------------------
> +
> +Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
> +example each domain has its own "node affinity", i.e., a set of NUMA nodes
> +of the host from which memory for that domain is allocated. That becomes
> +very important as soon as many domains start running memory-intensive
> +workloads on a shared host. In fact, accessing non node-local memory
> +locations costs much more than node-local ones, to the point that the
> +degradation in performance is likely to be noticeable.
> +
> +It is then quite obvious that any mechanism enabling most of the memory
> +accesses of most of the guest domains to stay local is very important
> +when dealing with NUMA platforms.
> +
> +
> +Node Affinity and CPU Affinity
> +------------------------------
> +
> +There is another very popular 'affinity', besides node affinity we are
> +discussing here, which is '(v)cpu affinity'. Moreover, to make things
> +even worse, the two are different but somehow related things. In fact,
> +in both Xen and Linux worlds, 'cpu affinity' is the set of CPUs a domain
> +(that would be a task, when talking about Linux) can be scheduled on.
> +This seems to have few to do with memory accesses, but it does, as the

                      ^little

> +CPU where a domain runs is also from where it tries to access its memory,
> +i.e., that is one half of what decides whether a memory access is remote
> +or local --- the other half being where the location it wants to access
> +is stored.
> +
> +Of course, if a domain is known to only run on a subset of the physical
> +CPUs of the host, it is very easy to turn all its memory accesses into
> +local ones, by just constructing it's node affinity (in Xen) basing on

                                                                ^based

> +what nodes these CPUs belong to. Actually, that is exactly what is being
> +done by the hypervisor by default, as soon as it finds out a domain (or
> +better, the vcpus of a domain, but let's avoid getting into too much
> +detail here) has a cpu affinity.
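
Just to make the above concrete: the derivation basically amounts to
OR-ing together the nodes which the CPUs in the affinity mask belong to.
A minimal standalone sketch (topology, helper names and masks are all
made up for illustration, this is not the actual hypervisor code):

    /* Sketch: derive a node affinity mask from a cpu affinity mask,
     * given an assumed cpu-to-node map. */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS 8

    /* Assumed topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
    static const int cpu_to_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

    static uint64_t nodes_from_cpus(uint64_t cpu_affinity)
    {
        uint64_t node_affinity = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu_affinity & (1ULL << cpu))
                node_affinity |= 1ULL << cpu_to_node[cpu];
        return node_affinity;
    }

    int main(void)
    {
        /* cpus = "0, 1"  -->  node affinity is just node 0 (mask 0x1) */
        printf("node mask: %#llx\n",
               (unsigned long long)nodes_from_cpus(0x3));
        return 0;
    }
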
> +
> +This works quite well, but it requires the user/system administrator
> +to explicitly specify such a property --- the cpu affinity --- while the
> +domain is being created, or Xen won't be able to exploit that for ensuring
> +access locality.
> +
> +On the other hand, as node affinity directly affects where a domain's memory
> +lives, it makes a lot of sense for it to be involved in scheduling decisions,
> +as it would be great if the hypervisor managed to schedule all the
> +vcpus of all the domains on CPUs attached to the various domains' local
> +memory. That is why the node affinity of a domain is treated by the scheduler
> +as the set of nodes on which it would be preferable to run it, although
> +not at the cost of violating the scheduling algorithm's behavior and
> +invariants. This means Xen will check whether a vcpu of a domain can run
> +on one of the CPUs belonging to the nodes of the domain's node affinity,
> +but would rather run it somewhere else --- even on another, remote, CPU ---
> +than violate the priority ordering (e.g., by kicking out another running
> +vcpu with higher priority) it is designed to enforce.
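
In other words, node affinity acts as a soft preference layered on top of
the hard cpu affinity and the priority rules. A simplified, self-contained
sketch of that "prefer, but never violate priorities" idea (all names and
data are invented, this is not the credit scheduler code):

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4

    /* Toy per-CPU state: priority of whatever runs there (0 = idle). */
    static const int  cur_prio[NR_CPUS]     = { 5, 0, 0, 2 };
    /* CPUs within the vcpu's node affinity. */
    static const bool node_affine[NR_CPUS]  = { true, true, false, false };
    /* CPUs the vcpu affinity allows at all. */
    static const bool cpu_allowed[NR_CPUS]  = { true, true, true, true };

    static int pick_cpu(int vcpu_prio)
    {
        int c;

        /* First pass: an allowed, node-affine CPU that does not require
         * kicking out a higher-priority vcpu. */
        for (c = 0; c < NR_CPUS; c++)
            if (cpu_allowed[c] && node_affine[c] && cur_prio[c] < vcpu_prio)
                return c;
        /* Second pass: any allowed CPU, even a remote one, rather than
         * violating the priority ordering. */
        for (c = 0; c < NR_CPUS; c++)
            if (cpu_allowed[c] && cur_prio[c] < vcpu_prio)
                return c;
        return -1; /* nothing suitable right now */
    }

    int main(void)
    {
        printf("vcpu of prio 3 goes to CPU %d\n", pick_cpu(3)); /* CPU 1 */
        return 0;
    }
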
> +
> +So, last but not least, what if a domain has both vcpu and node affinity, and
> +they only partially match or they do not match at all (to understand how that
> +can happen, see the following sections)? Well, in such a case, all the
> +domain's memory will be allocated reflecting its node affinity, while
> +scheduling will happen according to its vcpu affinities, meaning that it is
> +easy enough to construct optimal, sub-optimal, neutral and even bad and awful
> +configurations (which is something nice, e.g., for benchmarking purposes).
> +The remainder of this document explains how to do so.
> +
> +
> +Specifying Node Affinity
> +------------------------
> +
> +Besides being automatically computed from the vcpu affinities of a domain
> +(or also from it being part of a cpupool) within Xen, it might make sense
> +for the user to specify the node affinity of its domains by hand, while
> +editing their config files, as another form of partitioning the host
> +resources. If that is the case, this is where the "nodes" option of the xl
> +config file becomes useful. In fact, specifying something like the below
> +
> +        nodes = [ '0', '1', '3', '4' ]
> +
> +in a domain configuration file would result in Xen assigning host NUMA nodes
> +number 0, 1, 3 and 4 to the domain's node affinity, regardless of any vcpu
> +affinity setting for the same domain. The idea is, yes, the to things are

                                                               two

> +related, and if only one is present, it makes sense to use the other for
> +inferring it, but it is always possible to explicitly specify both of them,
> +regardless of how good or awful it could end up being.
> +
> +Therefore, this is what one should expect when using "nodes", perhaps in
> +conjunction with "cpus" in a domain configuration file:
> +
> + * `cpus = "0, 1"` and no `nodes=` at all
> +   (i.e., only vcpu affinity specified):
> +     domain's vcpus can and will run only on host CPUs 0 and 1. Also, as the
> +     domain's node affinity will be computed by Xen and set to whatever
> +     nodes host CPUs 0 and 1 belong to, all the domain's memory accesses will
> +     be local accesses;
> +
> + * `nodes = [ '0', '1' ]` and no `cpus=` at all
> +   (i.e., only node affinity present):
> +     domain's vcpus can run on any of the host CPUs, but the scheduler (at
> +     least if credit is used, as it is the only scheduler supporting this
> +     right now) will try running them on the CPUs that are part of host
> +     NUMA nodes 0 and 1. Memory-wise, all the domain's memory will be
> +     allocated on host NUMA nodes 0 and 1. This means most of
> +     the memory accesses of the domain should be local, but that will
> +     depend on the on-line load, behavior and actual scheduling of both
> +     the domain in question and all the other domains on the same host;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0"`, with CPU 0 within node 0:
> +   (i.e., cpu affinity subset of node affinity):
> +     domain's vcpus can and will only run on host CPU 0. As node affinity
> +     is being explicitly set to host NUMA nodes 0 and 1 --- which includes
> +     CPU 0 --- all the memory accesses of the domain will be local;

In this case won't some of (half?) the memory come from node 1 and
therefore be non-local to cpu 0?

> +
> + * `nodes = [ '0', '1' ]` and `cpus = "0, 4"`, with CPU 0 in node 0 but
> +   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node affinity):
> +     domain's vcpus can run on host CPUs 0 and 4, with CPU 4 not being within
> +     the node affinity (explicitly set to host NUMA nodes 0 and 1). The
> +     (credit) scheduler will try to keep memory accesses local by scheduling
> +     the domain's vcpus on CPU 0, but it may not achieve 100% success;
> +
> + * `nodes = [ '0', '1' ]` and `cpus = "4"`, with CPU 4 within, say, node 2

These examples might be a little clearer if you defined up front what
the nodes and cpus were and then used that for all of them?

> +   (i.e., cpu affinity disjoint from node affinity):
> +     domain's vcpus can and will run only on host CPU 4, i.e., completely
> +     "outside" of the chosen node affinity. That necessarily means all the
> +     domain's memory accesses will be remote.
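
For concreteness, the first two cases above would correspond to config
file fragments along these lines (name/memory/vcpus values are invented):

    # hypothetical guest config: vcpu affinity only; the node affinity is
    # then derived by Xen from the nodes CPUs 0 and 1 belong to
    name   = "numa-guest"
    memory = 2048
    vcpus  = 2
    cpus   = "0, 1"

    # alternatively, node affinity only (second case above):
    # nodes = [ '0', '1' ]
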
> +
> +
> +Automatic NUMA Placement
> +------------------------
> +
> +Just in case one does not want to take the burden of manually specifying
> +all the node (and, perhaps, CPU) affinities for all its domains, xl implements
> +some automatic placement logic. This basically means the user can ask the
> +toolstack to try sorting things out in the best possible way for him.
> +This is instead of specifying manually a domain's node affinity and can be
> +paired or not with any vcpu affinity (in case it is, the relationship between
> +vcpu and node affinities just stays as stated above). To serve this purpose,
> +a new domain config switch has been introduced, i.e., the "nodes_policy"
> +option. As the name suggests, it allows for specifying a policy to be used
> +while attempting automatic placement of the new domain. Available policies
> +at the time of writing are:

A bunch of what follows would be good to have in the xl or xl.cfg man
pages too/instead. (I started with this docs patch so I haven't actually
looked at the earlier ones yet, perhaps this is already the case)

> +
> + * "auto": automatic placement by means of a not better specified (xl
> +           implementation dependant) algorithm. It is basically for those
> +           who do want automatic placement, but have no idea what policy
> +           or algorithm would be better... <<Just give me a sane default!>>
> +
> + * "ffit": automatic placement via the First Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "bfit": automatic placement via the Best Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "wfit": automatic placement via the Worst Fit algorithm, applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
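
In the config file, this is just a single line using whichever of the
four policy strings above is wanted, for example:

    nodes_policy = "ffit"
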
> +
> +The various algorithms have been implemented as they offer different behavior
> +and performance (with respect to different metrics). For instance, First Fit
> +is known to be efficient and quick, and it generally works better than Best
> +Fit wrt memory fragmentation, although it tends to occupy "early" nodes more
> +than "late" ones. On the other hand, Best Fit aims at optimizing memory usage,
> +although it introduces quite a bit of fragmentation, by leaving large amounts
> +of small free memory areas. Finally, the idea behind Worst Fit is that it will
> +leave big enough free memory chunks to limit the amount of fragmentation, but
> +it (as well as Best Fit does) is more expensive in terms of execution time, as
> +it needs the "list" of free memory areas to be kept sorted.
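
Since the three fits are described rather tersely above, here is how the
difference plays out when applied to per-node free memory, as a purely
illustrative sketch (not the xl implementation; node sizes are invented):

    /* Pick a host node for a domain needing `need` MB, given per-node
     * free memory, using first/best/worst fit. */
    #include <stdio.h>

    #define NR_NODES 4

    enum policy { FFIT, BFIT, WFIT };

    static int pick_node(const long free_mem[NR_NODES], long need,
                         enum policy p)
    {
        int n, pick = -1;

        for (n = 0; n < NR_NODES; n++) {
            if (free_mem[n] < need)
                continue;                   /* domain does not fit here */
            if (p == FFIT)
                return n;                   /* first node that fits wins */
            if (pick == -1 ||
                (p == BFIT && free_mem[n] < free_mem[pick]) || /* tightest */
                (p == WFIT && free_mem[n] > free_mem[pick]))   /* roomiest */
                pick = n;
        }
        return pick;                        /* -1 if nothing fits */
    }

    int main(void)
    {
        const long free_mem[NR_NODES] = { 4096, 1024, 8192, 2048 }; /* MB */
        const long need = 1000;                                     /* MB */

        printf("ffit -> node %d\n", pick_node(free_mem, need, FFIT)); /* 0 */
        printf("bfit -> node %d\n", pick_node(free_mem, need, BFIT)); /* 1 */
        printf("wfit -> node %d\n", pick_node(free_mem, need, WFIT)); /* 2 */
        return 0;
    }
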
> +
> +Therefore, achieving automatic placement actually happens by properly using
> +the "nodes" and "nodes_config" configuration options as follows:
> +
> + * `nodes="auto` or `nodes_policy="auto"`:
> +     xl will try fitting the domain on the host NUMA nodes by using its
> +     own default placing algorithm, with default parameters. Most likely,
> +     all nodes will be considered suitable for the domain (unless a vcpu
> +     affinity is specified, see the last entry of this list);
> +
> + * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`) and no `nodes=` at all:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy. All nodes will be considered suitable for the
> +     domain, and consecutive fitting attempts will be performed while
> +     increasing the number of nodes on which to put the domain itself
> +     (unless a vcpu affinity is specified, see the last entry of this list);
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `nodes=2`:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy and only the number of nodes specified in `nodes=`
> +     (2 in this example).

Number of nodes rather than specifically node 2? This is different to
the examples in the preceding section?

>  All the nodes will be considered suitable for
> +     the domain, and consecutive attempts will be performed while
> +     increasing such a value;
> +
> + * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`, `"wfit"`) and `cpus="0-6":
> +     xl will try fitting the domain on the host NUMA nodes to which the CPUs
> +     specified as vcpu affinity (0 to 6 in this example) belong, by using the
> +     requested policy. In case it fails, consecutive fitting attempts will
> +     be performed with both a reduced (first) and an increased (next) number
> +     of nodes.
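
So, concretely, the last case above corresponds to config file lines along
these lines (values invented):

    # hypothetical example: first-fit placement, restricted to the nodes
    # which CPUs 0-6 belong to
    nodes_policy = "ffit"
    cpus = "0-6"
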
> +
> +Different usage patterns --- like specifying both a policy and a list of
> +nodes --- are accepted, but do not make much sense after all. Therefore,
> +although xl will try its best to interpret the user's will, the resulting
> +behavior is somewhat unspecified.
