All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: peter.maydell@linaro.org
Cc: dbarboza@redhat.com,
	Daniel Henrique Barboza <danielhb413@gmail.com>,
	qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nonngu.org,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PULL 20/20] specs/ppc-spapr-numa: update with new NUMA support
Date: Fri,  9 Oct 2020 21:19:51 +1100	[thread overview]
Message-ID: <20201009101951.1569252-21-david@gibson.dropbear.id.au> (raw)
In-Reply-To: <20201009101951.1569252-1-david@gibson.dropbear.id.au>

From: Daniel Henrique Barboza <danielhb413@gmail.com>

This update provides more in depth information about the
choices and drawbacks of the new NUMA support for the
spapr machine.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20201007172849.302240-6-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 docs/specs/ppc-spapr-numa.rst | 235 ++++++++++++++++++++++++++++++++--
 1 file changed, 227 insertions(+), 8 deletions(-)

diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
index e762038022..5fca2bdd8e 100644
--- a/docs/specs/ppc-spapr-numa.rst
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -158,9 +158,235 @@ kernel tree). This results in the following distances:
 * resources four NUMA levels apart: 160
 
 
-Consequences for QEMU NUMA tuning
+pseries NUMA mechanics
+======================
+
+Starting in QEMU 5.2, the pseries machine considers user input when setting NUMA
+topology of the guest. The overall design is:
+
+* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
+  for 4 distinct NUMA distance values based on the NUMA levels
+
+* ibm,max-associativity-domains supports multiple associativity domains in all
+  NUMA levels, granting user flexibility
+
+* ibm,associativity for all resources varies with user input
+
+These changes are only effective for pseries-5.2 and newer machines that are
+created with more than one NUMA node (disconsidering NUMA nodes created by
+the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been
+around for such a long time, with users seeing NUMA distances 10 and 40
+(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
+existing experience of those guests.
+
+To bring the user experience x86 users have when tuning up NUMA, we had
+to operate under the current pseries Linux kernel logic described in
+`How the pseries Linux guest calculates NUMA distances`_. The result
+is that we needed to translate NUMA distance user input to pseries
+Linux kernel input.
+
+Translating user distance to kernel distance
+--------------------------------------------
+
+User input for NUMA distance can vary from 10 to 254. We need to translate
+that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
+This is how it is being done:
+
+* user distance 11 to 30 will be interpreted as 20
+* user distance 31 to 60 will be interpreted as 40
+* user distance 61 to 120 will be interpreted as 80
+* user distance 121 and beyond will be interpreted as 160
+* user distance 10 stays 10
+
+The reasoning behind this aproximation is to avoid any round up to the local
+distance (10), keeping it exclusive to the 4th NUMA level (which is still
+exclusive to the node_id). All other ranges were chosen under the developer
+discretion of what would be (somewhat) sensible considering the user input.
+Any other strategy can be used here, but in the end the reality is that we'll
+have to accept that a large array of values will be translated to the same
+NUMA topology in the guest, e.g. this user input:
+
+::
+
+      0   1   2
+  0  10  31 120
+  1  31  10  30
+  2 120  30  10
+
+And this other user input:
+
+::
+
+      0   1   2
+  0  10  60  61
+  1  60  10  11
+  2  61  11  10
+
+Will both be translated to the same values internally:
+
+::
+
+      0   1   2
+  0  10  40  80
+  1  40  10  20
+  2  80  20  10
+
+Users are encouraged to use only the kernel values in the NUMA definition to
+avoid being taken by surprise with that the guest is actually seeing in the
+topology. There are enough potential surprises that are inherent to the
+associativity domain assignment process, discussed below.
+
+
+How associativity domains are assigned
+--------------------------------------
+
+LOPAPR allows more than one associativity array (or 'string') per allocated
+resource. This would be used to represent that the resource has multiple
+connections with the board, and then the operational system, when deciding
+NUMA distancing, should consider the associativity information that provides
+the shortest distance.
+
+The spapr implementation does not support multiple associativity arrays per
+resource, neither does the pseries Linux kernel. We'll have to represent the
+NUMA topology using one associativity per resource, which means that choices
+and compromises are going to be made.
+
+Consider the following NUMA topology entered by user input:
+
+::
+
+      0   1   2   3
+  0  10  40  20  40
+  1  40  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+All the associativity arrays are initialized with NUMA id in all associativity
+domains:
+
+* node 0: 0 0 0 0
+* node 1: 1 1 1 1
+* node 2: 2 2 2 2
+* node 3: 3 3 3 3
+
+
+Honoring just the relative distances of node 0 to every other node, we find the
+NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for
+each distance:
+
+* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
+* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+
+We'll copy the associativity domains of node 0 to all other nodes, based on
+the NUMA level matches. Between 0 and 1, a match in 0x2, we'll also copy
+the domains 0x2 and 0x1 from 0 to 1 as well. This will give us:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+
+Doing the same to node 2 and node 3, these are the associativity arrays
+after considering all matches with node 0:
+
+* node 0: 0 0 0 0
+* node 1: 0 0 1 1
+* node 2: 0 0 0 2
+* node 3: 0 0 3 3
+
+The distances related to node 0 are accounted for. For node 1, and keeping
+in mind that we don't need to revisit node 0 again, the distance from
+node 1 to 2 is 80, matching at 0x1, and distance from 1 to 3 is 40,
+match in 0x2. Repeating the same logic of copying all domains up to
+the NUMA level match:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 3 3
+
+In the last step we will analyze just nodes 2 and 3. The desired distance
+between 2 and 3 is 20, i.e. a match in 0x3:
+
+* node 0: 0 0 0 0
+* node 1: 1 0 1 1
+* node 2: 1 0 0 2
+* node 3: 1 0 0 3
+
+
+The kernel will read these arrays and will calculate the following NUMA topology for
+the guest:
+
+::
+
+      0   1   2   3
+  0  10  40  20  20
+  1  40  10  40  40
+  2  20  40  10  20
+  3  20  40  20  10
+
+Note that this is not what the user wanted - the desired distance between
+0 and 3 is 40, we calculated it as 20. This is what the current logic and
+implementation constraints of the kernel and QEMU will provide inside the
+LOPAPR specification.
+
+Users are welcome to use this knowledge and experiment with the input to get
+the NUMA topology they want, or as closer as they want. The important thing
+is to keep expectations up to par with what we are capable of provide at this
+moment: an approximation.
+
+Limitations of the implementation
 ---------------------------------
 
+As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate
+user choice. The Linux kernel, and PAPR itself, does not provide QEMU with the ways
+to fully map user input to actual NUMA distance the guest will use. These limitations
+creates two notable limitations in our support:
+
+* Asymmetrical topologies aren't supported. We only support NUMA topologies where
+  the distance from node A to B is always the same as B to A. We do not support
+  any A-B pair where the distance back and forth is asymmetric. For example, the
+  following topology isn't supported and the pSeries guest will not boot with this
+  user input:
+
+::
+
+      0   1
+  0  10  40
+  1  20  10
+
+
+* 'non-transitive' topologies will be poorly translated to the guest. This is the
+  kind of topology where the distance from a node A to B is X, B to C is X, but
+  the distance A to C is not X. E.g.:
+
+::
+
+      0   1   2   3
+  0  10  20  20  40
+  1  20  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+  In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40.
+  The kernel will always match with the shortest associativity domain possible,
+  and we're attempting to retain the previous established relations between the
+  nodes. This means that a distance equal to 20 between nodes 0 and 2 and the
+  same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3
+  to also be 20.
+
+
+Legacy (5.1 and older) pseries NUMA mechanics
+=============================================
+
+In short, we can summarize the NUMA distances seem in pseries Linux guests, using
+QEMU up to 5.1, as follows:
+
+* local distance, i.e. the distance of the resource to its own NUMA node: 10
+* if it's a NVLink GPU device, distance: 80
+* every other resource, distance: 40
+
 The way the pseries Linux guest calculates NUMA distances has a direct effect
 on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
 the default ibm,associativity-reference-points being used in the pseries
@@ -180,12 +406,5 @@ as far as NUMA distance goes:
   to the same third NUMA level, having distance = 40
 * for NVLink GPUs, distance = 80 from everything else
 
-In short, we can summarize the NUMA distances seem in pseries Linux guests, using
-QEMU up to 5.1, as follows:
-
-* local distance, i.e. the distance of the resource to its own NUMA node: 10
-* if it's a NVLink GPU device, distance: 80
-* every other resource, distance: 40
-
 This also means that user input in QEMU command line does not change the
 NUMA distancing inside the guest for the pseries machine.
-- 
2.26.2



  parent reply	other threads:[~2020-10-09 10:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-09 10:19 [PULL 00/20] ppc-for-5.2 queue 20201009 David Gibson
2020-10-09 10:19 ` [PULL 01/20] spapr: Handle HPT allocation failure in nested guest David Gibson
2020-10-09 10:19 ` [PULL 02/20] spapr: Fix error leak in spapr_realize_vcpu() David Gibson
2020-10-09 10:19 ` [PULL 03/20] ppc: Add a return value to ppc_set_compat() and ppc_set_compat_all() David Gibson
2020-10-09 10:19 ` [PULL 04/20] ppc: Fix return value in cpu_post_load() error path David Gibson
2020-10-09 10:19 ` [PULL 05/20] spapr: Simplify error handling in callers of ppc_set_compat() David Gibson
2020-10-09 10:19 ` [PULL 06/20] spapr: Get rid of cas_check_pvr() error reporting David Gibson
2020-10-09 10:19 ` [PULL 07/20] spapr: Simplify error handling in do_client_architecture_support() David Gibson
2020-10-09 10:19 ` [PULL 08/20] spapr: Simplify error handling in spapr_vio_busdev_realize() David Gibson
2020-10-09 10:19 ` [PULL 09/20] spapr: Add a return value to spapr_drc_attach() David Gibson
2020-10-09 10:19 ` [PULL 10/20] spapr: Simplify error handling in prop_get_fdt() David Gibson
2020-10-09 10:19 ` [PULL 11/20] spapr: Add a return value to spapr_set_vcpu_id() David Gibson
2020-10-09 10:19 ` [PULL 12/20] spapr: Simplify error handling in spapr_cpu_core_realize() David Gibson
2020-10-09 10:19 ` [PULL 13/20] spapr: Add a return value to spapr_nvdimm_validate() David Gibson
2020-10-09 10:19 ` [PULL 14/20] spapr: Add a return value to spapr_check_pagesize() David Gibson
2020-10-09 10:19 ` [PULL 15/20] ppc/pnv: Increase max firmware size David Gibson
2020-10-09 10:19 ` [PULL 16/20] spapr: add spapr_machine_using_legacy_numa() helper David Gibson
2020-10-09 10:19 ` [PULL 17/20] spapr_numa: forbid asymmetrical NUMA setups David Gibson
2020-10-09 10:19 ` [PULL 18/20] spapr_numa: change reference-points and maxdomain settings David Gibson
2020-10-09 10:19 ` [PULL 19/20] spapr_numa: consider user input when defining associativity David Gibson
2020-10-09 10:19 ` David Gibson [this message]
2020-10-09 16:22 ` [PULL 00/20] ppc-for-5.2 queue 20201009 Peter Maydell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201009101951.1569252-21-david@gibson.dropbear.id.au \
    --to=david@gibson.dropbear.id.au \
    --cc=danielhb413@gmail.com \
    --cc=dbarboza@redhat.com \
    --cc=groug@kaod.org \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nonngu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.