From: David Gibson <david@gibson.dropbear.id.au>
To: peter.maydell@linaro.org
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>,
	mark.cave-ayland@ilande.co.uk, qemu-devel@nongnu.org,
	groug@kaod.org, hpoussin@reactos.org, clg@kaod.org,
	qemu-ppc@nongnu.org, philmd@redhat.com,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PULL 30/44] spapr_numa.c: FORM2 NUMA affinity support
Date: Thu, 30 Sep 2021 15:44:12 +1000
Message-ID: <20210930054426.357344-31-david@gibson.dropbear.id.au>
In-Reply-To: <20210930054426.357344-1-david@gibson.dropbear.id.au>

From: Daniel Henrique Barboza <danielhb413@gmail.com>

The main feature of FORM2 affinity support is the separation of NUMA
distances from ibm,associativity information. This allows for a more
flexible and straightforward NUMA distance assignment without relying on
complex associations between several levels of NUMA via
ibm,associativity matches. Another feature is its extensibility. This base
support contains the facilities for NUMA distance assignment, but in the
future more facilities will be added for latency, performance, bandwidth
and so on.

This patch implements the base FORM2 affinity support as follows:

- the use of FORM2 associativity is indicated by using bit 2 of byte 5
of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
or FORM2 affinity. Setting both forms will default to FORM2. We're not
advertising FORM2 for pseries-6.1 and older machine versions to prevent
guest visible changes in those;

- ibm,associativity-reference-points has a new semantic. Instead of
being used to calculate distances via NUMA levels, it's now used to
indicate the primary domain index in the ibm,associativity domain of
each resource. In our case it's set to {0x1}, matching the position
where FORM2 places the numa_id in ibm,associativity;

- two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
and ibm,numa-distance-table. The index table is used to list all the
NUMA logical domains of the platform, in ascending order, and allows for
sparse NUMA configurations (although QEMU ATM doesn't support that).
ibm,numa-distance-table is an array that contains all the distances from
the first NUMA node to all other nodes, then the second NUMA node
distances to all other nodes and so on;

- get_max_dist_ref_points(), get_numa_assoc_size() and get_associativity()
now check for OV5_FORM2_AFFINITY and return FORM2 values if the guest
selected FORM2 affinity during CAS.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210920174947.556324-7-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c              |   8 ++
 hw/ppc/spapr_numa.c         | 146 ++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h      |   9 +++
 include/hw/ppc/spapr_ovec.h |   1 +
 4 files changed, 164 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 524951def1..b7bee5f4ff 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2753,6 +2753,11 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
 
+    /* Do not advertise FORM2 NUMA support for pseries-6.1 and older */
+    if (!smc->pre_6_2_numa_affinity) {
+        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
+    }
+
     /* advertise support for dedicated HP event source to guests */
     if (spapr->use_hotplug_event_source) {
         spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
@@ -4675,8 +4680,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
  */
 static void spapr_machine_6_1_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_2_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
+    smc->pre_6_2_numa_affinity = true;
 }
 
 DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 6718c0fdd1..13db321997 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -24,6 +24,10 @@
  */
 static int get_max_dist_ref_points(SpaprMachineState *spapr)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return FORM2_DIST_REF_POINTS;
+    }
+
     return FORM1_DIST_REF_POINTS;
 }
 
@@ -32,6 +36,10 @@ static int get_max_dist_ref_points(SpaprMachineState *spapr)
  */
 static int get_numa_assoc_size(SpaprMachineState *spapr)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return FORM2_NUMA_ASSOC_SIZE;
+    }
+
     return FORM1_NUMA_ASSOC_SIZE;
 }
 
@@ -52,6 +60,9 @@ static int get_vcpu_assoc_size(SpaprMachineState *spapr)
  */
 static const uint32_t *get_associativity(SpaprMachineState *spapr, int node_id)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return spapr->FORM2_assoc_array[node_id];
+    }
     return spapr->FORM1_assoc_array[node_id];
 }
 
@@ -295,14 +306,50 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     spapr_numa_define_FORM1_domains(spapr);
 }
 
+/*
+ * Init NUMA FORM2 machine state data
+ */
+static void spapr_numa_FORM2_affinity_init(SpaprMachineState *spapr)
+{
+    int i;
+
+    /*
+     * For all resources but CPUs, FORM2 associativity arrays will
+     * be a size 2 array with the following format:
+     *
+     * ibm,associativity = {1, numa_id}
+     *
+     * CPUs will write an additional 'vcpu_id' on top of the arrays
+     * being initialized here. 'numa_id' is represented by the
+     * index 'i' of the loop.
+     *
+     * Given that this initialization is also valid for GPU associativity
+     * arrays, handle everything in one single step by populating the
+     * arrays up to NUMA_NODES_MAX_NUM.
+     */
+    for (i = 0; i < NUMA_NODES_MAX_NUM; i++) {
+        spapr->FORM2_assoc_array[i][0] = cpu_to_be32(1);
+        spapr->FORM2_assoc_array[i][1] = cpu_to_be32(i);
+    }
+}
+
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine)
 {
     spapr_numa_FORM1_affinity_init(spapr, machine);
+    spapr_numa_FORM2_affinity_init(spapr);
 }
 
 void spapr_numa_associativity_check(SpaprMachineState *spapr)
 {
+    /*
+     * FORM2 does not have any restrictions we need to handle
+     * at CAS time, for now.
+     */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return;
+    }
+
     spapr_numa_FORM1_affinity_check(MACHINE(spapr));
 }
 
@@ -447,6 +494,100 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
                      maxdomains, sizeof(maxdomains)));
 }
 
+static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
+                                               void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
+    g_autofree uint32_t *lookup_index_table = NULL;
+    g_autofree uint32_t *distance_table = NULL;
+    int src, dst, i, distance_table_size;
+    uint8_t *node_distances;
+
+    /*
+     * ibm,numa-lookup-index-table: array with length and a
+     * list of NUMA ids present in the guest.
+     */
+    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
+    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        lookup_index_table[i + 1] = cpu_to_be32(i);
+    }
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
+                     lookup_index_table,
+                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
+
+    /*
+     * ibm,numa-distance-table: contains all node distances. First
+     * element is the size of the table as uint32, followed up
+     * by all the uint8 distances from the first NUMA node, then all
+     * distances from the second NUMA node and so on.
+     *
+     * ibm,numa-lookup-index-table is used by guest to navigate this
+     * array because NUMA ids can be sparse (node 0 is the first,
+     * node 8 is the second ...).
+     */
+    distance_table = g_new0(uint32_t, distance_table_entries + 1);
+    distance_table[0] = cpu_to_be32(distance_table_entries);
+
+    node_distances = (uint8_t *)&distance_table[1];
+    i = 0;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            node_distances[i++] = numa_info[src].distance[dst];
+        }
+    }
+
+    distance_table_size = distance_table_entries * sizeof(uint8_t) +
+                          sizeof(uint32_t);
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
+                     distance_table, distance_table_size));
+}
+
+/*
+ * This helper could be compressed in a single function with
+ * FORM1 logic since we're setting the same DT values, with the
+ * difference being a call to spapr_numa_FORM2_write_rtas_tables()
+ * in the end. The separation was made to avoid clogging FORM1 code
+ * which already has to deal with compat modes from previous
+ * QEMU machine types.
+ */
+static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
+                                   spapr_numa_initial_nvgpu_numa_id(ms);
+
+    /*
+     * In FORM2, ibm,associativity-reference-points will point to
+     * the element in the ibm,associativity array that contains the
+     * primary domain index (for FORM2, the first element).
+     *
+     * This value (in our case, the numa-id) is then used as an index
+     * to retrieve all other attributes of the node (distance,
+     * bandwidth, latency) via ibm,numa-lookup-index-table and other
+     * ibm,numa-*-table properties.
+     */
+    uint32_t refpoints[] = { cpu_to_be32(1) };
+
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
+    uint32_t maxdomains[] = { cpu_to_be32(1), cpu_to_be32(maxdomain) };
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
+                     refpoints, sizeof(refpoints)));
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
+                     maxdomains, sizeof(maxdomains)));
+
+    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
+}
+
 /*
  * Helper that writes ibm,associativity-reference-points and
  * max-associativity-domains in the RTAS pointed by @rtas
@@ -454,6 +595,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
  */
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 6b3dfc5dc2..ee7504b976 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -118,6 +118,13 @@ typedef enum {
 #define FORM1_DIST_REF_POINTS            4
 #define FORM1_NUMA_ASSOC_SIZE            (FORM1_DIST_REF_POINTS + 1)
 
+/*
+ * FORM2 NUMA affinity has a single associativity domain, giving
+ * us an assoc size of 2.
+ */
+#define FORM2_DIST_REF_POINTS            1
+#define FORM2_NUMA_ASSOC_SIZE            (FORM2_DIST_REF_POINTS + 1)
+
 typedef struct SpaprCapabilities SpaprCapabilities;
 struct SpaprCapabilities {
     uint8_t caps[SPAPR_CAP_NUM];
@@ -145,6 +152,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_2_numa_affinity;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
@@ -250,6 +258,7 @@ struct SpaprMachineState {
     SpaprTpmProxy *tpm_proxy;
 
     uint32_t FORM1_assoc_array[NUMA_NODES_MAX_NUM][FORM1_NUMA_ASSOC_SIZE];
+    uint32_t FORM2_assoc_array[NUMA_NODES_MAX_NUM][FORM2_NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
 };
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 48b716a060..c3e8b98e7e 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
 /* option vector 5 */
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
+#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
 #define OV5_HP_EVT              OV_BIT(6, 5)
 #define OV5_HPT_RESIZE          OV_BIT(6, 7)
 #define OV5_DRMEM_V2            OV_BIT(22, 0)
-- 
2.31.1


