qemu-devel.nongnu.org archive mirror
* [PATCH v4 0/5]  pSeries FORM2 affinity support
@ 2021-08-27  9:24 Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine Daniel Henrique Barboza
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

This new version contains changes suggested by Greg from his v3 review.


Changes from v3:

- patch 1: added because we need a MachineClass attribute for the
yet-to-be-added 6.2 machine type. Can be dropped if someone has already
done this in a yet-to-be-merged tree. Can also be enhanced to add the
6.2 machine types to all guests if no one else has already done it.

- patch 2 (former 1):
  * added Greg's R-b

- former patch 2 (post-cas DT changes): dropped

- v3 link: https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04394.html

Daniel Henrique Barboza (5):
  hw, spapr: add 6.2 compat pseries machine
  spapr_numa.c: split FORM1 code into helpers
  spapr_numa.c: base FORM2 NUMA affinity support
  spapr: simplify spapr_numa_associativity_init params
  spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init()

 hw/core/machine.c           |   3 +
 hw/ppc/spapr.c              |  75 +++++++------
 hw/ppc/spapr_hcall.c        |   4 +
 hw/ppc/spapr_numa.c         | 213 ++++++++++++++++++++++++++++++++----
 include/hw/boards.h         |   3 +
 include/hw/ppc/spapr.h      |   1 +
 include/hw/ppc/spapr_numa.h |   3 +-
 include/hw/ppc/spapr_ovec.h |   1 +
 8 files changed, 244 insertions(+), 59 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine
  2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
@ 2021-08-27  9:24 ` Daniel Henrique Barboza
  2021-08-30  7:34   ` Greg Kurz
  2021-08-27  9:24 ` [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/core/machine.c   |  3 +++
 hw/ppc/spapr.c      | 15 +++++++++++++--
 include/hw/boards.h |  3 +++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 54e040587d..067f42b528 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -37,6 +37,9 @@
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/virtio-pci.h"
 
+GlobalProperty hw_compat_6_1[] = {};
+const size_t hw_compat_6_1_len = G_N_ELEMENTS(hw_compat_6_1);
+
 GlobalProperty hw_compat_6_0[] = {
     { "gpex-pcihost", "allow-unmapped-accesses", "false" },
     { "i8042", "extended-state", "false"},
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 81699d4f8b..d39fd4e644 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4685,15 +4685,26 @@ static void spapr_machine_latest_class_options(MachineClass *mc)
     }                                                                \
     type_init(spapr_machine_register_##suffix)
 
+/*
+ * pseries-6.2
+ */
+static void spapr_machine_6_2_class_options(MachineClass *mc)
+{
+    /* Defaults for the latest behaviour inherited from the base class */
+}
+
+DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
+
 /*
  * pseries-6.1
  */
 static void spapr_machine_6_1_class_options(MachineClass *mc)
 {
-    /* Defaults for the latest behaviour inherited from the base class */
+    spapr_machine_6_2_class_options(mc);
+    compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
 }
 
-DEFINE_SPAPR_MACHINE(6_1, "6.1", true);
+DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
 
 /*
  * pseries-6.0
diff --git a/include/hw/boards.h b/include/hw/boards.h
index accd6eff35..463a5514f9 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -353,6 +353,9 @@ struct MachineState {
     } \
     type_init(machine_initfn##_register_types)
 
+extern GlobalProperty hw_compat_6_1[];
+extern const size_t hw_compat_6_1_len;
+
 extern GlobalProperty hw_compat_6_0[];
 extern const size_t hw_compat_6_0_len;
 
-- 
2.31.1




* [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers
  2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine Daniel Henrique Barboza
@ 2021-08-27  9:24 ` Daniel Henrique Barboza
  2021-09-07  0:30   ` David Gibson
  2021-08-27  9:24 ` [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
and doesn't need to be concerned with all the legacy support for older
pseries FORM1 guests.

We're also not going to calculate associativity domains based on numa
distance (via spapr_numa_define_associativity_domains) since the
distances will be written directly into new DT properties.

Let's split FORM1 code into its own functions to allow for easier
insertion of FORM2 logic later on.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..04a86f9b5b 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
+/*
+ * Set NUMA machine state data based on FORM1 affinity semantics.
+ */
+static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
+                                           MachineState *machine)
+{
+    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
+
+    /*
+     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
+     * 1 NUMA node) will not benefit from anything we're going to do
+     * after this point.
+     */
+    if (using_legacy_numa) {
+        return;
+    }
+
+    if (!spapr_numa_is_symmetrical(machine)) {
+        error_report("Asymmetrical NUMA topologies aren't supported "
+                     "in the pSeries machine");
+        exit(EXIT_FAILURE);
+    }
+
+    spapr_numa_define_associativity_domains(spapr);
+}
+
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine)
 {
@@ -210,22 +236,7 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    /*
-     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-     * 1 NUMA node) will not benefit from anything we're going to do
-     * after this point.
-     */
-    if (using_legacy_numa) {
-        return;
-    }
-
-    if (!spapr_numa_is_symmetrical(machine)) {
-        error_report("Asymmetrical NUMA topologies aren't supported "
-                     "in the pSeries machine");
-        exit(EXIT_FAILURE);
-    }
-
-    spapr_numa_define_associativity_domains(spapr);
+    spapr_numa_FORM1_affinity_init(spapr, machine);
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
-/*
- * Helper that writes ibm,associativity-reference-points and
- * max-associativity-domains in the RTAS pointed by @rtas
- * in the DT @fdt.
- */
-void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
 {
     MachineState *ms = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
@@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
                      maxdomains, sizeof(maxdomains)));
 }
 
+/*
+ * Helper that writes ibm,associativity-reference-points and
+ * max-associativity-domains in the RTAS pointed by @rtas
+ * in the DT @fdt.
+ */
+void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+{
+    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
-- 
2.31.1




* [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support
  2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-08-27  9:24 ` Daniel Henrique Barboza
  2021-08-31 13:40   ` Greg Kurz
  2021-08-27  9:24 ` [PATCH v4 4/5] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 5/5] spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza
  4 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The main feature of FORM2 affinity support is the separation of NUMA
distances from ibm,associativity information. This allows for a more
flexible and straightforward NUMA distance assignment without relying on
complex associations between several levels of NUMA via
ibm,associativity matches. Another feature is its extensibility. This base
support contains the facilities for NUMA distance assignment, but in the
future more facilities will be added for latency, performance, bandwidth
and so on.

This patch implements the base FORM2 affinity support as follows:

- the use of FORM2 associativity is indicated by using bit 2 of byte 5
of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
or FORM2 affinity. Setting both forms will default to FORM2. We're not
advertising FORM2 for pseries-6.1 and older machine versions to prevent
guest-visible changes in those;

- move spapr_numa_associativity_init() from spapr_machine_init() to
do_client_architecture_support(), when we already know which NUMA
affinity the guest will use. This avoids initializing the FORM1
spapr->numa_assoc_array only to overwrite it shortly afterwards if
FORM2 is chosen;

- ibm,associativity-reference-points has new semantics. Instead of
being used to calculate distances via NUMA levels, it's now used to
indicate the primary domain index in the ibm,associativity domain of
each resource. In our case it's set to {0x4}, matching the position
where we already place logical_domain_id;

- two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
and ibm,numa-distance-table. The index table is used to list all the
NUMA logical domains of the platform, in ascending order, and allows for
sparse NUMA configurations (although QEMU doesn't support that at the
moment). ibm,numa-distance-table is an array that contains all the
distances from the first NUMA node to all other nodes, then the second
NUMA node's distances to all other nodes, and so on;

- spapr_post_load changes: since we're adding a new NUMA affinity that
isn't compatible with the existing one, migration must be handled
accordingly because we can't be certain whether the guest went through
CAS in the source. The solution chosen is to initialize the NUMA
associativity data in spapr_post_load() unconditionally. The worst case
would be to write the DT twice if the guest is in the pre-CAS stage.
In return, we're making sure that a FORM1 guest will have
spapr->numa_assoc_array initialized with the proper information based
on user-provided distances, something that FORM2 doesn't require.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              |  27 +++++++-
 hw/ppc/spapr_hcall.c        |   4 ++
 hw/ppc/spapr_numa.c         | 127 +++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr.h      |   1 +
 include/hw/ppc/spapr_ovec.h |   1 +
 5 files changed, 156 insertions(+), 4 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d39fd4e644..a3eb33764d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1786,6 +1786,22 @@ static int spapr_post_load(void *opaque, int version_id)
         return err;
     }
 
+    /*
+     * NUMA data init is made in CAS time. There is no reliable
+     * way of telling whether the guest already went through CAS
+     * in the source due to how spapr_ov5_cas_needed works: a
+     * FORM1 guest can be migrated with ov5_cas empty regardless
+     * of going through CAS first.
+     *
+     * One solution is to always call numa_associativity_init. The
+     * downside is that a guest migrated before CAS will run
+     * numa_associativity_init again when going through it, but
+     * at least we're making sure spapr->numa_assoc_array will be
+     * initialized and hotplug operations won't fail in both before
+     * and after CAS migration cases.
+     */
+     spapr_numa_associativity_init(spapr, MACHINE(spapr));
+
     return err;
 }
 
@@ -2752,6 +2768,11 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
 
+    /* Do not advertise FORM2 support for pseries-6.1 and older */
+    if (!smc->pre_6_2_numa_affinity) {
+        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
+    }
+
     /* advertise support for dedicated HP event source to guests */
     if (spapr->use_hotplug_event_source) {
         spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
@@ -2808,9 +2829,6 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
-    /* Init numa_assoc_array */
-    spapr_numa_associativity_init(spapr, machine);
-
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
@@ -4700,8 +4718,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
  */
 static void spapr_machine_6_1_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_2_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
+    smc->pre_6_2_numa_affinity = true;
 }
 
 DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 0e9a5b2e40..1cc88716c1 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -11,6 +11,7 @@
 #include "helper_regs.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_cpu_core.h"
+#include "hw/ppc/spapr_numa.h"
 #include "mmu-hash64.h"
 #include "cpu-models.h"
 #include "trace.h"
@@ -1197,6 +1198,9 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
     spapr_ovec_cleanup(ov1_guest);
 
+    /* Init numa_assoc_array */
+    spapr_numa_associativity_init(spapr, MACHINE(spapr));
+
     /*
      * Ensure the guest asks for an interrupt mode we support;
      * otherwise terminate the boot.
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 04a86f9b5b..0a5fa8101e 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -202,6 +202,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
 
+        /*
+         * For FORM2 affinity the initialization above is enough. No
+         * need to fill non-zero NUMA nodes with node_id because
+         * there is no associativity domain match to calculate
+         * NUMA distances in FORM2.
+         */
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+            continue;
+        }
+
         /*
          * Fill all associativity domains of non-zero NUMA nodes with
          * node_id. This is required because the default value (0) is
@@ -236,7 +246,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    spapr_numa_FORM1_affinity_init(spapr, machine);
+    /*
+     * We test for !FORM2 instead of testing for FORM1 because,
+     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
+     * to get ov5_cas migrated, but setting FORM2 is. Since we're
+     * dealing with either FORM1 or FORM2, test for the option
+     * that is guaranteed to be set after a migration.
+     */
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM1_affinity_init(spapr, machine);
+    }
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -313,6 +332,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
+static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
+                                               void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
+    g_autofree uint32_t *lookup_index_table = NULL;
+    g_autofree uint32_t *distance_table = NULL;
+    int src, dst, i, distance_table_size;
+    uint8_t *node_distances;
+
+    /*
+     * ibm,numa-lookup-index-table: array with length and a
+     * list of NUMA ids present in the guest.
+     */
+    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
+    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        lookup_index_table[i + 1] = cpu_to_be32(i);
+    }
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
+                     lookup_index_table,
+                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
+
+    /*
+     * ibm,numa-distance-table: contains all node distances. First
+     * element is the size of the table as uint32, followed up
+     * by all the uint8 distances from the first NUMA node, then all
+     * distances from the second NUMA node and so on.
+     *
+     * ibm,numa-lookup-index-table is used by guest to navigate this
+     * array because NUMA ids can be sparse (node 0 is the first,
+     * node 8 is the second ...).
+     */
+    distance_table = g_new0(uint32_t, distance_table_entries + 1);
+    distance_table[0] = cpu_to_be32(distance_table_entries);
+
+    node_distances = (uint8_t *)&distance_table[1];
+    i = 0;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            node_distances[i++] = numa_info[src].distance[dst];
+        }
+    }
+
+    distance_table_size = distance_table_entries * sizeof(uint8_t) +
+                          sizeof(uint32_t);
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
+                     distance_table, distance_table_size));
+}
+
+/*
+ * This helper could be compressed in a single function with
+ * FORM1 logic since we're setting the same DT values, with the
+ * difference being a call to spapr_numa_FORM2_write_rtas_tables()
+ * in the end. The separation was made to avoid clogging FORM1 code
+ * which already has to deal with compat modes from previous
+ * QEMU machine types.
+ */
+static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
+                                   spapr_numa_initial_nvgpu_numa_id(ms);
+
+    /*
+     * In FORM2, ibm,associativity-reference-points will point to
+     * the element in the ibm,associativity array that contains the
+     * primary domain index. This value (in our case, the numa-id) is
+     * then used as an index to retrieve all other attributes of the
+     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
+     * and other ibm,numa-*-table properties.
+     */
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+    };
+
+    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
+    uint32_t maxdomains[] = {
+        cpu_to_be32(4),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain)
+    };
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
+                     refpoints, nr_refpoints * sizeof(refpoints[0])));
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
+                     maxdomains, sizeof(maxdomains)));
+
+    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
+}
+
 static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
                                            void *fdt, int rtas)
 {
@@ -379,6 +499,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
  */
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 637652ad16..21b1cf9ebf 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -145,6 +145,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_2_numa_affinity;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 48b716a060..c3e8b98e7e 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
 /* option vector 5 */
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
+#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
 #define OV5_HP_EVT              OV_BIT(6, 5)
 #define OV5_HPT_RESIZE          OV_BIT(6, 7)
 #define OV5_DRMEM_V2            OV_BIT(22, 0)
-- 
2.31.1




* [PATCH v4 4/5] spapr: simplify spapr_numa_associativity_init params
  2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-08-27  9:24 ` [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-08-27  9:24 ` Daniel Henrique Barboza
  2021-08-27  9:24 ` [PATCH v4 5/5] spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza
  4 siblings, 0 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

When spapr_numa_associativity_init() was introduced it was being
called from spapr_machine_init(), where we have pointers to both a
SpaprMachineState and a MachineState. Having both passed as
parameters spared us from calling a macro to get the MachineState.

Previous patches moved the function away from spapr_machine_init(),
and the new call sites don't have a MachineState pointer at hand.
This resulted in MACHINE(spapr) macro calls as the second argument
in both callers.

Simplify the function by folding the MACHINE(spapr) macro into the
function body, so it now receives only a pointer to
SpaprMachineState.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 2 +-
 hw/ppc/spapr_hcall.c        | 2 +-
 hw/ppc/spapr_numa.c         | 4 ++--
 include/hw/ppc/spapr_numa.h | 3 +--
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a3eb33764d..63c4876415 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1800,7 +1800,7 @@ static int spapr_post_load(void *opaque, int version_id)
      * initialized and hotplug operations won't fail in both before
      * and after CAS migration cases.
      */
-     spapr_numa_associativity_init(spapr, MACHINE(spapr));
+     spapr_numa_associativity_init(spapr);
 
     return err;
 }
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 1cc88716c1..a721d3edce 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1199,7 +1199,7 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr_ovec_cleanup(ov1_guest);
 
     /* Init numa_assoc_array */
-    spapr_numa_associativity_init(spapr, MACHINE(spapr));
+    spapr_numa_associativity_init(spapr);
 
     /*
      * Ensure the guest asks for an interrupt mode we support;
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 0a5fa8101e..2f261be9fd 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -181,9 +181,9 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     spapr_numa_define_associativity_domains(spapr);
 }
 
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine)
+void spapr_numa_associativity_init(SpaprMachineState *spapr)
 {
+    MachineState *machine = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int i, j, max_nodes_with_gpus;
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index 6f9f02d3de..adaec8e163 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -22,8 +22,7 @@
  * function. spapr_machine_init() is the only caller for it, and
  * it has both pointers resolved already.
  */
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine);
+void spapr_numa_associativity_init(SpaprMachineState *spapr);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
-- 
2.31.1




* [PATCH v4 5/5] spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init()
  2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2021-08-27  9:24 ` [PATCH v4 4/5] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
@ 2021-08-27  9:24 ` Daniel Henrique Barboza
  4 siblings, 0 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-27  9:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

FORM2 NUMA affinity is prepared to deal with memory-less/cpu-less
NUMA nodes. This is used by the DAX KMEM driver to locate a
PAPR SCM device that has a different latency than the NUMA node
of the regular memory.

Move this verification to the FORM1 affinity code.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c      | 33 ---------------------------------
 hw/ppc/spapr_numa.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 63c4876415..4105a2504f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2794,39 +2794,6 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
-    /*
-     * check we don't have a memory-less/cpu-less NUMA node
-     * Firmware relies on the existing memory/cpu topology to provide the
-     * NUMA topology to the kernel.
-     * And the linux kernel needs to know the NUMA topology at start
-     * to be able to hotplug CPUs later.
-     */
-    if (machine->numa_state->num_nodes) {
-        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
-            /* check for memory-less node */
-            if (machine->numa_state->nodes[i].node_mem == 0) {
-                CPUState *cs;
-                int found = 0;
-                /* check for cpu-less node */
-                CPU_FOREACH(cs) {
-                    PowerPCCPU *cpu = POWERPC_CPU(cs);
-                    if (cpu->node_id == i) {
-                        found = 1;
-                        break;
-                    }
-                }
-                /* memory-less and cpu-less node */
-                if (!found) {
-                    error_report(
-                       "Memory-less/cpu-less nodes are not supported (node %d)",
-                                 i);
-                    exit(1);
-                }
-            }
-        }
-
-    }
-
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 2f261be9fd..fddac5fcce 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -162,6 +162,39 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
                                            MachineState *machine)
 {
     bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
+    int i;
+
+    /*
+     * check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (machine->numa_state->num_nodes) {
+        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
+            /* check for memory-less node */
+            if (machine->numa_state->nodes[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+                       "Memory-less/cpu-less nodes are not supported (node %d)",
+                                 i);
+                    exit(1);
+                }
+            }
+        }
+    }
 
     /*
      * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-- 
2.31.1




* Re: [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine
  2021-08-27  9:24 ` [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine Daniel Henrique Barboza
@ 2021-08-30  7:34   ` Greg Kurz
  2021-08-30 18:24     ` Daniel Henrique Barboza
  2021-08-30 18:32     ` Peter Maydell
  0 siblings, 2 replies; 16+ messages in thread
From: Greg Kurz @ 2021-08-30  7:34 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 27 Aug 2021 06:24:51 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---

We usually introduce the compat machine types for all archs in a
single patch. One was already posted for 6.2 :

https://patchwork.ozlabs.org/project/qemu-devel/patch/20210823122804.7692-7-wangyanan55@huawei.com/


>  hw/core/machine.c   |  3 +++
>  hw/ppc/spapr.c      | 15 +++++++++++++--
>  include/hw/boards.h |  3 +++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 54e040587d..067f42b528 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -37,6 +37,9 @@
>  #include "hw/virtio/virtio.h"
>  #include "hw/virtio/virtio-pci.h"
>  
> +GlobalProperty hw_compat_6_1[] = {};
> +const size_t hw_compat_6_1_len = G_N_ELEMENTS(hw_compat_6_1);
> +
>  GlobalProperty hw_compat_6_0[] = {
>      { "gpex-pcihost", "allow-unmapped-accesses", "false" },
>      { "i8042", "extended-state", "false"},
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 81699d4f8b..d39fd4e644 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4685,15 +4685,26 @@ static void spapr_machine_latest_class_options(MachineClass *mc)
>      }                                                                \
>      type_init(spapr_machine_register_##suffix)
>  
> +/*
> + * pseries-6.2
> + */
> +static void spapr_machine_6_2_class_options(MachineClass *mc)
> +{
> +    /* Defaults for the latest behaviour inherited from the base class */
> +}
> +
> +DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
> +
>  /*
>   * pseries-6.1
>   */
>  static void spapr_machine_6_1_class_options(MachineClass *mc)
>  {
> -    /* Defaults for the latest behaviour inherited from the base class */
> +    spapr_machine_6_2_class_options(mc);
> +    compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
>  }
>  
> -DEFINE_SPAPR_MACHINE(6_1, "6.1", true);
> +DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
>  
>  /*
>   * pseries-6.0
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index accd6eff35..463a5514f9 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -353,6 +353,9 @@ struct MachineState {
>      } \
>      type_init(machine_initfn##_register_types)
>  
> +extern GlobalProperty hw_compat_6_1[];
> +extern const size_t hw_compat_6_1_len;
> +
>  extern GlobalProperty hw_compat_6_0[];
>  extern const size_t hw_compat_6_0_len;
>  




* Re: [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine
  2021-08-30  7:34   ` Greg Kurz
@ 2021-08-30 18:24     ` Daniel Henrique Barboza
  2021-08-30 18:32     ` Peter Maydell
  1 sibling, 0 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-30 18:24 UTC (permalink / raw)
  To: Greg Kurz; +Cc: qemu-ppc, qemu-devel, david



On 8/30/21 4:34 AM, Greg Kurz wrote:
> On Fri, 27 Aug 2021 06:24:51 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
> 
> We usually introduce the compat machine types for all archs in a
> single patch. One was already posted for 6.2 :
> 
> https://patchwork.ozlabs.org/project/qemu-devel/patch/20210823122804.7692-7-wangyanan55@huawei.com/

We can discard this patch. The rest of the series will play ball with
the official 6.2 machine type patch later on.


Thanks,


Daniel

> 
> 
>>   hw/core/machine.c   |  3 +++
>>   hw/ppc/spapr.c      | 15 +++++++++++++--
>>   include/hw/boards.h |  3 +++
>>   3 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 54e040587d..067f42b528 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -37,6 +37,9 @@
>>   #include "hw/virtio/virtio.h"
>>   #include "hw/virtio/virtio-pci.h"
>>   
>> +GlobalProperty hw_compat_6_1[] = {};
>> +const size_t hw_compat_6_1_len = G_N_ELEMENTS(hw_compat_6_1);
>> +
>>   GlobalProperty hw_compat_6_0[] = {
>>       { "gpex-pcihost", "allow-unmapped-accesses", "false" },
>>       { "i8042", "extended-state", "false"},
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 81699d4f8b..d39fd4e644 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -4685,15 +4685,26 @@ static void spapr_machine_latest_class_options(MachineClass *mc)
>>       }                                                                \
>>       type_init(spapr_machine_register_##suffix)
>>   
>> +/*
>> + * pseries-6.2
>> + */
>> +static void spapr_machine_6_2_class_options(MachineClass *mc)
>> +{
>> +    /* Defaults for the latest behaviour inherited from the base class */
>> +}
>> +
>> +DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>> +
>>   /*
>>    * pseries-6.1
>>    */
>>   static void spapr_machine_6_1_class_options(MachineClass *mc)
>>   {
>> -    /* Defaults for the latest behaviour inherited from the base class */
>> +    spapr_machine_6_2_class_options(mc);
>> +    compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
>>   }
>>   
>> -DEFINE_SPAPR_MACHINE(6_1, "6.1", true);
>> +DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
>>   
>>   /*
>>    * pseries-6.0
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index accd6eff35..463a5514f9 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -353,6 +353,9 @@ struct MachineState {
>>       } \
>>       type_init(machine_initfn##_register_types)
>>   
>> +extern GlobalProperty hw_compat_6_1[];
>> +extern const size_t hw_compat_6_1_len;
>> +
>>   extern GlobalProperty hw_compat_6_0[];
>>   extern const size_t hw_compat_6_0_len;
>>   
> 



* Re: [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine
  2021-08-30  7:34   ` Greg Kurz
  2021-08-30 18:24     ` Daniel Henrique Barboza
@ 2021-08-30 18:32     ` Peter Maydell
  2021-08-31  2:01       ` wangyanan (Y)
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Maydell @ 2021-08-30 18:32 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Daniel Henrique Barboza, qemu-ppc, QEMU Developers, David Gibson

On Mon, 30 Aug 2021 at 08:36, Greg Kurz <groug@kaod.org> wrote:
>
> On Fri, 27 Aug 2021 06:24:51 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > ---
>
> We usually introduce the compat machine types for all archs in a
> single patch. One was already posted for 6.2 :
>
> https://patchwork.ozlabs.org/project/qemu-devel/patch/20210823122804.7692-7-wangyanan55@huawei.com/

Would somebody like to fish that patch out of the unrelated large
series it was posted in so that it can get into the tree sooner?

-- PMM



* Re: [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine
  2021-08-30 18:32     ` Peter Maydell
@ 2021-08-31  2:01       ` wangyanan (Y)
  0 siblings, 0 replies; 16+ messages in thread
From: wangyanan (Y) @ 2021-08-31  2:01 UTC (permalink / raw)
  To: Peter Maydell, Greg Kurz
  Cc: Daniel Henrique Barboza, qemu-ppc, QEMU Developers, David Gibson

Hi,

On 2021/8/31 2:32, Peter Maydell wrote:
> On Mon, 30 Aug 2021 at 08:36, Greg Kurz <groug@kaod.org> wrote:
>> On Fri, 27 Aug 2021 06:24:51 -0300
>> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>>
>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>> ---
>> We usually introduce the compat machine types for all archs in a
>> single patch. One was already posted for 6.2 :
>>
>> https://patchwork.ozlabs.org/project/qemu-devel/patch/20210823122804.7692-7-wangyanan55@huawei.com/
> Would somebody like to fish that patch out of the unrelated large
> series it was posted in so that it can get into the tree sooner?
>
Now I have resent that patch separately, and hopefully someone will take 
it. :)

Thanks,
Yanan




* Re: [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support
  2021-08-27  9:24 ` [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-08-31 13:40   ` Greg Kurz
  2021-08-31 14:13     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Kurz @ 2021-08-31 13:40 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 27 Aug 2021 06:24:53 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The main feature of FORM2 affinity support is the separation of NUMA
> distances from ibm,associativity information. This allows for a more
> flexible and straightforward NUMA distance assignment without relying on
> complex associations between several levels of NUMA via
> ibm,associativity matches. Another feature is its extensibility. This base
> support contains the facilities for NUMA distance assignment, but in the
> future more facilities will be added for latency, performance, bandwidth
> and so on.
> 
> This patch implements the base FORM2 affinity support as follows:
> 
> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> or FORM2 affinity. Setting both forms will default to FORM2. We're not
> advertising FORM2 for pseries-6.1 and older machine versions to prevent
> guest visible changes in those;
> 
> - move spapr_numa_associativity_init() from spapr_machine_init() to
> do_client_architecture_support(), when we already know which NUMA
> affinity the guest will use. This will avoid initializing FORM1
> spapr->numa_assoc_array and overwrite it shortly after if FORM2 is
> chosen;
> 

As I was saying in another mail, I'm not very comfortable with delaying
spapr_numa_associativity_init() to CAS until I'm sure nothing else
assumes it's already done when the machine first resets and boots.

Also, these are slow paths, and I'd rather overwrite the array if
that keeps the code simple. FWIW this is what we already do with the
irq backend: we always reset to XICS and switch to XIVE at CAS.

This also makes me think that spapr_numa_associativity_init() should
be called during machine reset so that spapr->numa_assoc_array is
reset to a known default value, i.e. FORM1 or legacy depending on
the machine version. Maybe rename it to spapr_numa_associativity_reset()
for clarity ?

> - ibm,associativity-reference-points has a new semantic. Instead of
> being used to calculate distances via NUMA levels, it's now used to
> indicate the primary domain index in the ibm,associativity domain of
> each resource. In our case it's set to {0x4}, matching the position
> where we already place logical_domain_id;
> 
> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
> and ibm,numa-distance-table. The index table is used to list all the
> NUMA logical domains of the platform, in ascending order, and allows for
> sparse NUMA configurations (although QEMU ATM doesn't support that).
> ibm,numa-distance-table is an array that contains all the distances from
> the first NUMA node to all other nodes, then the second NUMA node
> distances to all other nodes and so on;
> 
> - spapr_post_load changes: since we're adding a new NUMA affinity that
> isn't compatible with the existing one, migration must be handled
> accordingly because we can't be certain of whether the guest went
> through CAS in the source. The solution chosen is to initialize the NUMA
> associativity data in spapr_post_load() unconditionally. The worst case
> would be to write the DT twice if the guest is in pre-CAS stage.
> Otherwise, we're making sure that a FORM1 guest will have the
> spapr->numa_assoc_array initialized with the proper information based on
> user distance, something that we're not doing with FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              |  27 +++++++-
>  hw/ppc/spapr_hcall.c        |   4 ++
>  hw/ppc/spapr_numa.c         | 127 +++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h      |   1 +
>  include/hw/ppc/spapr_ovec.h |   1 +
>  5 files changed, 156 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d39fd4e644..a3eb33764d 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1786,6 +1786,22 @@ static int spapr_post_load(void *opaque, int version_id)
>          return err;
>      }
>  
> +    /*
> +     * NUMA data init is made in CAS time. There is no reliable
> +     * way of telling whether the guest already went through CAS
> +     * in the source due to how spapr_ov5_cas_needed works: a
> +     * FORM1 guest can be migrated with ov5_cas empty regardless
> +     * of going through CAS first.
> +     *
> +     * One solution is to always call numa_associativity_init. The
> +     * downside is that a guest migrated before CAS will run
> +     * numa_associativity_init again when going through it, but
> +     * at least we're making sure spapr->numa_assoc_array will be
> +     * initialized and hotplug operations won't fail in both before
> +     * and after CAS migration cases.

This seems to confirm my feeling: it is safer to do the init
unconditionally to ensure no other code gets upset.

> +     */
> +     spapr_numa_associativity_init(spapr, MACHINE(spapr));
> +
>      return err;
>  }
>  
> @@ -2752,6 +2768,11 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>  
> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
> +    if (!smc->pre_6_2_numa_affinity) {
> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
> +    }
> +
>      /* advertise support for dedicated HP event source to guests */
>      if (spapr->use_hotplug_event_source) {
>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
> @@ -2808,9 +2829,6 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>  
> -    /* Init numa_assoc_array */
> -    spapr_numa_associativity_init(spapr, machine);
> -
>      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>          ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>                                spapr->max_compat_pvr)) {
> @@ -4700,8 +4718,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>   */
>  static void spapr_machine_6_1_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_6_2_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
> +    smc->pre_6_2_numa_affinity = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 0e9a5b2e40..1cc88716c1 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -11,6 +11,7 @@
>  #include "helper_regs.h"
>  #include "hw/ppc/spapr.h"
>  #include "hw/ppc/spapr_cpu_core.h"
> +#include "hw/ppc/spapr_numa.h"
>  #include "mmu-hash64.h"
>  #include "cpu-models.h"
>  #include "trace.h"
> @@ -1197,6 +1198,9 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>      spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>      spapr_ovec_cleanup(ov1_guest);
>  
> +    /* Init numa_assoc_array */
> +    spapr_numa_associativity_init(spapr, MACHINE(spapr));
> +
>      /*
>       * Ensure the guest asks for an interrupt mode we support;
>       * otherwise terminate the boot.
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 04a86f9b5b..0a5fa8101e 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -202,6 +202,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>          spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>  
> +        /*
> +         * For FORM2 affinity the initialization above is enough. No
> +         * need to fill non-zero NUMA nodes with node_id because
> +         * there is no associativity domain match to calculate
> +         * NUMA distances in FORM2.
> +         */
> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +            continue;
> +        }
> +
>          /*
>           * Fill all associativity domains of non-zero NUMA nodes with
>           * node_id. This is required because the default value (0) is
> @@ -236,7 +246,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
> -    spapr_numa_FORM1_affinity_init(spapr, machine);
> +    /*
> +     * We test for !FORM2 instead of testing for FORM1 because,
> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
> +     * dealing with either FORM1 or FORM2, test for the option
> +     * that is guaranteed to be set after a migration.
> +     */
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM1_affinity_init(spapr, machine);
> +    }
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -313,6 +332,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> +                                               void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
> +    g_autofree uint32_t *lookup_index_table = NULL;
> +    g_autofree uint32_t *distance_table = NULL;
> +    int src, dst, i, distance_table_size;
> +    uint8_t *node_distances;
> +
> +    /*
> +     * ibm,numa-lookup-index-table: array with length and a
> +     * list of NUMA ids present in the guest.
> +     */
> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        lookup_index_table[i + 1] = cpu_to_be32(i);
> +    }
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
> +                     lookup_index_table,
> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
> +
> +    /*
> +     * ibm,numa-distance-table: contains all node distances. First
> +     * element is the size of the table as uint32, followed up
> +     * by all the uint8 distances from the first NUMA node, then all
> +     * distances from the second NUMA node and so on.
> +     *
> +     * ibm,numa-lookup-index-table is used by guest to navigate this
> +     * array because NUMA ids can be sparse (node 0 is the first,
> +     * node 8 is the second ...).
> +     */
> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
> +    distance_table[0] = cpu_to_be32(distance_table_entries);
> +
> +    node_distances = (uint8_t *)&distance_table[1];
> +    i = 0;
> +
> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            node_distances[i++] = numa_info[src].distance[dst];

It looks like you could s/i++/src * nb_numa_nodes + dst/ here.
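For reference, the row-major flattening used in the loop above -- where the distance for (src, dst) lands at index src * nb_numa_nodes + dst in the body, preceded by a uint32 entry count -- can be sketched outside QEMU. This is illustrative only: cpu_to_be32() and the FDT handling are deliberately omitted, and the table is kept in host byte order:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a FORM2-style flattened distance table: a uint32 entry count
 * followed by n*n uint8 distances in row-major order (all distances
 * from node 0, then node 1, and so on).  The entry for (src, dst)
 * sits at offset src * n + dst within the body.  Caller frees.
 */
static uint8_t *build_distance_table(const uint8_t *dist, uint32_t n)
{
    uint32_t entries = n * n;
    uint8_t *buf = malloc(sizeof(uint32_t) + entries);

    memcpy(buf, &entries, sizeof(entries)); /* real code: cpu_to_be32() */
    for (uint32_t src = 0; src < n; src++) {
        for (uint32_t dst = 0; dst < n; dst++) {
            buf[sizeof(uint32_t) + src * n + dst] = dist[src * n + dst];
        }
    }
    return buf;
}
```

Both the running `i++` in the patch and the explicit `src * n + dst` index produce the same layout.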

> +        }
> +    }
> +
> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
> +                          sizeof(uint32_t);
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
> +                     distance_table, distance_table_size));
> +}
> +
> +/*
> + * This helper could be compressed in a single function with
> + * FORM1 logic since we're setting the same DT values, with the
> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
> + * in the end. The separation was made to avoid clogging FORM1 code
> + * which already has to deal with compat modes from previous
> + * QEMU machine types.
> + */
> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
> +
> +    /*
> +     * In FORM2, ibm,associativity-reference-points will point to
> +     * the element in the ibm,associativity array that contains the
> +     * primary domain index. This value (in our case, the numa-id) is
> +     * then used as an index to retrieve all other attributes of the
> +     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
> +     * and other ibm,numa-*-table properties.
> +     */
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +    };
> +
> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
> +    uint32_t maxdomains[] = {
> +        cpu_to_be32(4),

Maybe use the same base for the initializers, especially when they land in
the same hunk :) Either dec or hex, I don't care.

Rest LGTM.

> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain)
> +    };
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> +                     maxdomains, sizeof(maxdomains)));
> +
> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
> +}
> +
>  static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>                                             void *fdt, int rtas)
>  {
> @@ -379,6 +499,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>   */
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
> +        return;
> +    }
> +
>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>  }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 637652ad16..21b1cf9ebf 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    bool pre_6_2_numa_affinity;
>  
>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio,
> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
> index 48b716a060..c3e8b98e7e 100644
> --- a/include/hw/ppc/spapr_ovec.h
> +++ b/include/hw/ppc/spapr_ovec.h
> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>  /* option vector 5 */
>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>  #define OV5_HP_EVT              OV_BIT(6, 5)
>  #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>  #define OV5_DRMEM_V2            OV_BIT(22, 0)



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support
  2021-08-31 13:40   ` Greg Kurz
@ 2021-08-31 14:13     ` Daniel Henrique Barboza
  2021-08-31 14:51       ` Greg Kurz
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-08-31 14:13 UTC (permalink / raw)
  To: Greg Kurz; +Cc: qemu-ppc, qemu-devel, david



On 8/31/21 10:40 AM, Greg Kurz wrote:
> On Fri, 27 Aug 2021 06:24:53 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> The main feature of FORM2 affinity support is the separation of NUMA
>> distances from ibm,associativity information. This allows for a more
>> flexible and straightforward NUMA distance assignment without relying on
>> complex associations between several levels of NUMA via
>> ibm,associativity matches. Another feature is its extensibility. This base
>> support contains the facilities for NUMA distance assignment, but in the
>> future more facilities will be added for latency, performance, bandwidth
>> and so on.
>>
>> This patch implements the base FORM2 affinity support as follows:
>>
>> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
>> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
>> or FORM2 affinity. Setting both forms will default to FORM2. We're not
>> advertising FORM2 for pseries-6.1 and older machine versions to prevent
>> guest visible changes in those;
>>
>> - move spapr_numa_associativity_init() from spapr_machine_init() to
>> do_client_architecture_support(), when we already know which NUMA
>> affinity the guest will use. This will avoid initializing FORM1
>> spapr->numa_assoc_array and overwrite it shortly after if FORM2 is
>> chosen;
>>
> 
> As I was saying in another mail, I'm not very comfortable with delaying
> spapr_numa_associativity_init() to CAS until I'm sure nothing else
> assumes it's already done when the machine first resets and boots.
> 
> Also, these are slow paths, and I'd rather overwrite the array if
> that keeps the code simple. FWIW this is what we already do with the
> irq backend: we always reset to XICS and switch to XIVE at CAS.
> 
> This also makes me think that spapr_numa_associativity_init() should
> be called during machine reset so that spapr->numa_assoc_array is
> reset to a known default value, i.e. FORM1 or legacy depending on
> the machine version. Maybe rename it to spapr_numa_associativity_reset()
> for clarity ?

It is not tragic to delay spapr_numa_associativity_init() up until CAS
because we haven't written any ibm,associativity arrays yet. What we
write before CAS are the common RTAS artifacts such as reference-points
and maxdomains, but as you suggested in the previous review, we're just
writing them again if the guest chooses to use FORM2.

If we really want to be safe I can do the following:

- keep spapr_numa_associativity_init() in machine_init(), like it is today;

- call it again after CAS if the guest chooses FORM2;

- change spapr_numa_associativity_init() to clear the unused FORM1 associativity
domains if FORM2 is chosen. This step is needed because there's no way to know
whether we're before or after CAS, and FORM1 code populates the associativity
domains based on user NUMA distance input. For clarity, I want to keep unused
stuff zeroed when using FORM2.
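A minimal sketch of that third step, on a simplified per-node associativity row. The layout and names below are illustrative, not QEMU's actual numa_assoc_array (which also byte-swaps each element with cpu_to_be32()):

```c
#include <stdint.h>

#define MAX_DISTANCE_REF_POINTS 4

/*
 * Simplified per-node associativity row: [0] holds the entry count,
 * [1..MAX_DISTANCE_REF_POINTS - 1] hold the FORM1-only associativity
 * domains, and the last element holds the node id.  If FORM2 is
 * chosen at CAS, the FORM1-only domains carry no meaning, so they are
 * cleared back to zero regardless of any earlier FORM1 initialization.
 */
static void reset_row_for_form2(uint32_t *row, uint32_t node_id)
{
    row[0] = MAX_DISTANCE_REF_POINTS;
    for (int i = 1; i < MAX_DISTANCE_REF_POINTS; i++) {
        row[i] = 0;               /* drop any FORM1 domain data */
    }
    row[MAX_DISTANCE_REF_POINTS] = node_id;
}
```

Run over every node after CAS, this makes the re-initialization idempotent: a row previously filled by FORM1 code ends up identical to a freshly initialized one.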


Thanks,


Daniel

> 
>> - ibm,associativity-reference-points has a new semantic. Instead of
>> being used to calculate distances via NUMA levels, it's now used to
>> indicate the primary domain index in the ibm,associativity domain of
>> each resource. In our case it's set to {0x4}, matching the position
>> where we already place logical_domain_id;
>>
>> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
>> and ibm,numa-distance-table. The index table is used to list all the
>> NUMA logical domains of the platform, in ascending order, and allows for
>> sparse NUMA configurations (although QEMU ATM doesn't support that).
>> ibm,numa-distance-table is an array that contains all the distances from
>> the first NUMA node to all other nodes, then the second NUMA node
>> distances to all other nodes and so on;
>>
>> - spapr_post_load changes: since we're adding a new NUMA affinity that
>> isn't compatible with the existing one, migration must be handled
>> accordingly because we can't be certain of whether the guest went
>> through CAS in the source. The solution chosen is to initialize the NUMA
>> associativity data in spapr_post_load() unconditionally. The worst case
>> would be to write the DT twice if the guest is in pre-CAS stage.
>> Otherwise, we're making sure that a FORM1 guest will have the
>> spapr->numa_assoc_array initialized with the proper information based on
>> user distance, something that we're not doing with FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c              |  27 +++++++-
>>   hw/ppc/spapr_hcall.c        |   4 ++
>>   hw/ppc/spapr_numa.c         | 127 +++++++++++++++++++++++++++++++++++-
>>   include/hw/ppc/spapr.h      |   1 +
>>   include/hw/ppc/spapr_ovec.h |   1 +
>>   5 files changed, 156 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index d39fd4e644..a3eb33764d 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1786,6 +1786,22 @@ static int spapr_post_load(void *opaque, int version_id)
>>           return err;
>>       }
>>   
>> +    /*
>> +     * NUMA data init is made in CAS time. There is no reliable
>> +     * way of telling whether the guest already went through CAS
>> +     * in the source due to how spapr_ov5_cas_needed works: a
>> +     * FORM1 guest can be migrated with ov5_cas empty regardless
>> +     * of going through CAS first.
>> +     *
>> +     * One solution is to always call numa_associativity_init. The
>> +     * downside is that a guest migrated before CAS will run
>> +     * numa_associativity_init again when going through it, but
>> +     * at least we're making sure spapr->numa_assoc_array will be
>> +     * initialized and hotplug operations won't fail in both before
>> +     * and after CAS migration cases.
> 
> This seems to confirm my feeling: it is safer to do the init
> unconditionally to ensure no other code gets upset.
> 
>> +     */
>> +     spapr_numa_associativity_init(spapr, MACHINE(spapr));
>> +
>>       return err;
>>   }
>>   
>> @@ -2752,6 +2768,11 @@ static void spapr_machine_init(MachineState *machine)
>>   
>>       spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>>   
>> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
>> +    if (!smc->pre_6_2_numa_affinity) {
>> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
>> +    }
>> +
>>       /* advertise support for dedicated HP event source to guests */
>>       if (spapr->use_hotplug_event_source) {
>>           spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
>> @@ -2808,9 +2829,6 @@ static void spapr_machine_init(MachineState *machine)
>>   
>>       spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>>   
>> -    /* Init numa_assoc_array */
>> -    spapr_numa_associativity_init(spapr, machine);
>> -
>>       if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>>           ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>>                                 spapr->max_compat_pvr)) {
>> @@ -4700,8 +4718,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>>    */
>>   static void spapr_machine_6_1_class_options(MachineClass *mc)
>>   {
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>> +
>>       spapr_machine_6_2_class_options(mc);
>>       compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
>> +    smc->pre_6_2_numa_affinity = true;
>>   }
>>   
>>   DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 0e9a5b2e40..1cc88716c1 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -11,6 +11,7 @@
>>   #include "helper_regs.h"
>>   #include "hw/ppc/spapr.h"
>>   #include "hw/ppc/spapr_cpu_core.h"
>> +#include "hw/ppc/spapr_numa.h"
>>   #include "mmu-hash64.h"
>>   #include "cpu-models.h"
>>   #include "trace.h"
>> @@ -1197,6 +1198,9 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>       spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>>       spapr_ovec_cleanup(ov1_guest);
>>   
>> +    /* Init numa_assoc_array */
>> +    spapr_numa_associativity_init(spapr, MACHINE(spapr));
>> +
>>       /*
>>        * Ensure the guest asks for an interrupt mode we support;
>>        * otherwise terminate the boot.
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index 04a86f9b5b..0a5fa8101e 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -202,6 +202,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>           spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>>           spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>   
>> +        /*
>> +         * For FORM2 affinity the initialization above is enough. No
>> +         * need to fill non-zero NUMA nodes with node_id because
>> +         * there is no associativity domain match to calculate
>> +         * NUMA distances in FORM2.
>> +         */
>> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +            continue;
>> +        }
>> +
>>           /*
>>            * Fill all associativity domains of non-zero NUMA nodes with
>>            * node_id. This is required because the default value (0) is
>> @@ -236,7 +246,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>           spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>       }
>>   
>> -    spapr_numa_FORM1_affinity_init(spapr, machine);
>> +    /*
>> +     * We test for !FORM2 instead of testing for FORM1 because,
>> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
>> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
>> +     * dealing with either FORM1 or FORM2, test for the option
>> +     * that is guaranteed to be set after a migration.
>> +     */
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +        spapr_numa_FORM1_affinity_init(spapr, machine);
>> +    }
>>   }
>>   
>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>> @@ -313,6 +332,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>>       return ret;
>>   }
>>   
>> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>> +                                               void *fdt, int rtas)
>> +{
>> +    MachineState *ms = MACHINE(spapr);
>> +    NodeInfo *numa_info = ms->numa_state->nodes;
>> +    int nb_numa_nodes = ms->numa_state->num_nodes;
>> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
>> +    g_autofree uint32_t *lookup_index_table = NULL;
>> +    g_autofree uint32_t *distance_table = NULL;
>> +    int src, dst, i, distance_table_size;
>> +    uint8_t *node_distances;
>> +
>> +    /*
>> +     * ibm,numa-lookup-index-table: array with length and a
>> +     * list of NUMA ids present in the guest.
>> +     */
>> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
>> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
>> +
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        lookup_index_table[i + 1] = cpu_to_be32(i);
>> +    }
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
>> +                     lookup_index_table,
>> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
>> +
>> +    /*
>> +     * ibm,numa-distance-table: contains all node distances. First
>> +     * element is the size of the table as uint32, followed
>> +     * by all the uint8 distances from the first NUMA node, then all
>> +     * distances from the second NUMA node and so on.
>> +     *
>> +     * ibm,numa-lookup-index-table is used by the guest to navigate this
>> +     * array because NUMA ids can be sparse (node 0 is the first,
>> +     * node 8 is the second ...).
>> +     */
>> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
>> +    distance_table[0] = cpu_to_be32(distance_table_entries);
>> +
>> +    node_distances = (uint8_t *)&distance_table[1];
>> +    i = 0;
>> +
>> +    for (src = 0; src < nb_numa_nodes; src++) {
>> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
>> +            node_distances[i++] = numa_info[src].distance[dst];
> 
> It looks like you could s/i++/src + dst/ here.
> 
>> +        }
>> +    }
>> +
>> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
>> +                          sizeof(uint32_t);
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
>> +                     distance_table, distance_table_size));
>> +}
>> +
>> +/*
>> + * This helper could be compressed into a single function with
>> + * FORM1 logic since we're setting the same DT values, with the
>> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
>> + * in the end. The separation was made to avoid clogging FORM1 code
>> + * which already has to deal with compat modes from previous
>> + * QEMU machine types.
>> + */
>> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
>> +                                           void *fdt, int rtas)
>> +{
>> +    MachineState *ms = MACHINE(spapr);
>> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
>> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
>> +
>> +    /*
>> +     * In FORM2, ibm,associativity-reference-points will point to
>> +     * the element in the ibm,associativity array that contains the
>> +     * primary domain index. This value (in our case, the numa-id) is
>> +     * then used as an index to retrieve all other attributes of the
>> +     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
>> +     * and other ibm,numa-*-table properties.
>> +     */
>> +    uint32_t refpoints[] = {
>> +        cpu_to_be32(0x4),
>> +    };
>> +
>> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
>> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
>> +    uint32_t maxdomains[] = {
>> +        cpu_to_be32(4),
> 
> Maybe use same base for initializers, especially when they land in the same
> hunk :) Either dec or hex I don't care.
> 
> Rest LGTM.
> 
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain)
>> +    };
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
>> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
>> +                     maxdomains, sizeof(maxdomains)));
>> +
>> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
>> +}
>> +
>>   static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>>                                              void *fdt, int rtas)
>>   {
>> @@ -379,6 +499,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>>    */
>>   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>>   {
>> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
>> +        return;
>> +    }
>> +
>>       spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>>   }
>>   
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 637652ad16..21b1cf9ebf 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
>>       hwaddr rma_limit;          /* clamp the RMA to this size */
>>       bool pre_5_1_assoc_refpoints;
>>       bool pre_5_2_numa_associativity;
>> +    bool pre_6_2_numa_affinity;
>>   
>>       bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>>                             uint64_t *buid, hwaddr *pio,
>> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
>> index 48b716a060..c3e8b98e7e 100644
>> --- a/include/hw/ppc/spapr_ovec.h
>> +++ b/include/hw/ppc/spapr_ovec.h
>> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>>   /* option vector 5 */
>>   #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>>   #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
>> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>>   #define OV5_HP_EVT              OV_BIT(6, 5)
>>   #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>>   #define OV5_DRMEM_V2            OV_BIT(22, 0)
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread
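The byte layout that spapr_numa_FORM2_write_rtas_tables() emits in the patch above can be reproduced outside QEMU. The sketch below is a Python model, not the actual C code: it builds ibm,numa-lookup-index-table and ibm,numa-distance-table for a hypothetical two-node guest. Each property starts with a big-endian uint32 count; the index table then lists uint32 NUMA ids, and the distance table packs uint8 distances row by row.

```python
import struct

def build_lookup_index_table(numa_ids):
    # uint32 count followed by one uint32 per NUMA id, all big-endian
    return struct.pack(f">I{len(numa_ids)}I", len(numa_ids), *numa_ids)

def build_distance_table(distances):
    # distances: square matrix; uint32 cell count, then uint8 cells
    # row-major: the flat index of (src, dst) is src * n + dst
    n = len(distances)
    cells = [distances[src][dst] for src in range(n) for dst in range(n)]
    return struct.pack(f">I{len(cells)}B", n * n, *cells)

# Hypothetical 2-node guest: local distance 10, remote 20
distances = [[10, 20],
             [20, 10]]

lookup = build_lookup_index_table([0, 1])
dist = build_distance_table(distances)

assert lookup == bytes([0, 0, 0, 2,  0, 0, 0, 0,  0, 0, 0, 1])
assert dist == bytes([0, 0, 0, 4,  10, 20, 20, 10])
```

This also makes the size computation in the patch easy to check: the distance table is `n*n` bytes of distances plus one uint32 of length.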

* Re: [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support
  2021-08-31 14:13     ` Daniel Henrique Barboza
@ 2021-08-31 14:51       ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-08-31 14:51 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Tue, 31 Aug 2021 11:13:45 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> 
> 
> On 8/31/21 10:40 AM, Greg Kurz wrote:
> > On Fri, 27 Aug 2021 06:24:53 -0300
> > Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> > 
> >> The main feature of FORM2 affinity support is the separation of NUMA
> >> distances from ibm,associativity information. This allows for a more
> >> flexible and straightforward NUMA distance assignment without relying on
> >> complex associations between several levels of NUMA via
> >> ibm,associativity matches. Another feature is its extensibility. This base
> >> support contains the facilities for NUMA distance assignment, but in the
> >> future more facilities will be added for latency, performance, bandwidth
> >> and so on.
> >>
> >> This patch implements the base FORM2 affinity support as follows:
> >>
> >> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> >> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> >> or FORM2 affinity. Setting both forms will default to FORM2. We're not
> >> advertising FORM2 for pseries-6.1 and older machine versions to prevent
> >> guest visible changes in those;
> >>
> >> - move spapr_numa_associativity_init() from spapr_machine_init() to
> >> do_client_architecture_support(), when we already know which NUMA
> >> affinity the guest will use. This will avoid initializing FORM1
> >> spapr->numa_assoc_array and overwrite it shortly after if FORM2 is
> >> chosen;
> >>
> > 
> > As I was saying in another mail. I'm not very comfortable with delaying
> > spapr_numa_associativity_init() to CAS until I'm sure nothing else
> > assumes it's already done when the machine first resets and boots.
> > 
> > And also, these are slow paths and I'd rather overwrite the array if
> > that keeps the code simple. FWIW this is what we already do with the
> > irq backend : we always reset to XICS and switch to XIVE at CAS.
> > 
> > This also makes me think that spapr_numa_associativity_init() should
> > be called during machine reset so that spapr->numa_assoc_array is
> > reset to a known default value, i.e. FORM1 or legacy depending on
> > the machine version. Maybe rename it to spapr_numa_associativity_reset()
> > for clarity ?
> 
> It is not tragic to delay spapr_numa_associativity_init() up until CAS
> because we haven't written any ibm,associativity arrays yet. What we
> write before CAS are the common RTAS artifacts such as reference-points
> and maxdomains, but as you suggested in the previous review, we're just
> writing them again if the guest chose to use FORM2.
> 
> If we really want to be safe I can do the following:
> 
> - keep spapr_numa_associativity_init() in machine_init(), like it is today;
> 

I think it should be performed at machine reset actually.

> - call it again after CAS if the guest chooses FORM2;
> 
> - change spapr_numa_associativity_init() to clear the unused FORM1 associativity
> domains if FORM2 is chosen. This step is needed because there's no way to know if
> we're before or after CAS, and FORM1 code populates the associativity domains
> based on user NUMA distance input. For clarity, I want to keep unused stuff
> zeroed when using FORM2.
> 
> 
> Thanks,
> 
> 
> Daniel
> 
> > 
> >> - ibm,associativity-reference-points has a new semantic. Instead of
> >> being used to calculate distances via NUMA levels, it's now used to
> >> indicate the primary domain index in the ibm,associativity domain of
> >> each resource. In our case it's set to {0x4}, matching the position
> >> where we already place logical_domain_id;
> >>
> >> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
> >> and ibm,numa-distance-table. The index table is used to list all the
> >> NUMA logical domains of the platform, in ascending order, and allows for
> >> sparse NUMA configurations (although QEMU ATM doesn't support that).
> >> ibm,numa-distance-table is an array that contains all the distances from
> >> the first NUMA node to all other nodes, then the second NUMA node
> >> distances to all other nodes and so on;
> >>
> >> - spapr_post_load changes: since we're adding a new NUMA affinity that
> >> isn't compatible with the existing one, migration must be handled
> >> accordingly because we can't be certain of whether the guest went
> >> through CAS in the source. The solution chosen is to initiate the NUMA
> >> associativity data in spapr_post_load() unconditionally. The worst case
> >> would be to write the DT twice if the guest is in pre-CAS stage.
> >> Otherwise, we're making sure that a FORM1 guest will have the
> >> spapr->numa_assoc_array initialized with the proper information based on
> >> user distance, something that we're not doing with FORM2.
> >>
> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> ---
> >>   hw/ppc/spapr.c              |  27 +++++++-
> >>   hw/ppc/spapr_hcall.c        |   4 ++
> >>   hw/ppc/spapr_numa.c         | 127 +++++++++++++++++++++++++++++++++++-
> >>   include/hw/ppc/spapr.h      |   1 +
> >>   include/hw/ppc/spapr_ovec.h |   1 +
> >>   5 files changed, 156 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index d39fd4e644..a3eb33764d 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -1786,6 +1786,22 @@ static int spapr_post_load(void *opaque, int version_id)
> >>           return err;
> >>       }
> >>   
> >> +    /*
> >> +     * NUMA data init is made at CAS time. There is no reliable
> >> +     * way of telling whether the guest already went through CAS
> >> +     * in the source due to how spapr_ov5_cas_needed works: a
> >> +     * FORM1 guest can be migrated with ov5_cas empty regardless
> >> +     * of going through CAS first.
> >> +     *
> >> +     * One solution is to always call numa_associativity_init. The
> >> +     * downside is that a guest migrated before CAS will run
> >> +     * numa_associativity_init again when going through it, but
> >> +     * at least we're making sure spapr->numa_assoc_array will be
> >> +     * initialized and hotplug operations won't fail in both before
> >> +     * and after CAS migration cases.
> > 
> > This seems to comfort my feelings : it is safer to do the init
> > unconditionally to ensure no other code gets upset.
> > 
> >> +     */
> >> +     spapr_numa_associativity_init(spapr, MACHINE(spapr));
> >> +
> >>       return err;
> >>   }
> >>   
> [...]


^ permalink raw reply	[flat|nested] 16+ messages in thread
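The init-ordering question debated above (machine reset vs. CAS, plus the post_load concern) can be summarized with a toy model. The sketch below is not QEMU code — the class and method names are hypothetical — it only encodes the rule the thread converges on: the associativity data gets a FORM1-style default before CAS, is rebuilt at CAS according to the negotiated form (testing for FORM2 rather than FORM1, since only FORM2 guarantees ov5_cas is migrated), and post_load rebuilds unconditionally because a source may migrate before ever going through CAS.

```python
FORM1, FORM2 = "FORM1", "FORM2"

class ToySpapr:
    def __init__(self):
        self.ov5_cas = set()          # options negotiated at CAS
        self.numa_assoc_array = None

    def associativity_init(self):
        # Mirrors the patch's "!FORM2 implies FORM1" test
        if FORM2 in self.ov5_cas:
            self.numa_assoc_array = FORM2
        else:
            self.numa_assoc_array = FORM1

    def machine_reset(self):
        self.ov5_cas.clear()          # nothing negotiated yet
        self.associativity_init()     # safe FORM1-style default

    def client_architecture_support(self, guest_choices):
        self.ov5_cas = set(guest_choices)
        self.associativity_init()     # rebuild per negotiated form

    def post_load(self):
        # The source may or may not have gone through CAS; re-init
        # is harmless either way, so do it unconditionally
        self.associativity_init()

s = ToySpapr()
s.machine_reset()
assert s.numa_assoc_array == FORM1            # pre-CAS default
s.client_architecture_support({FORM1, FORM2})
assert s.numa_assoc_array == FORM2            # FORM2 wins when both set
s.post_load()
assert s.numa_assoc_array == FORM2            # migration keeps the choice
```

The worst case in this model, as in the patch, is running the init twice for a guest migrated pre-CAS — wasteful but correct.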

* Re: [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers
  2021-08-27  9:24 ` [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-09-07  0:30   ` David Gibson
  2021-09-07  0:50     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2021-09-07  0:30 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

On Fri, Aug 27, 2021 at 06:24:52AM -0300, Daniel Henrique Barboza wrote:
> The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
> and doesn't need to be concerned with all the legacy support for older
> pseries FORM1 guests.
> 
> We're also not going to calculate associativity domains based on numa
> distance (via spapr_numa_define_associativity_domains) since the
> distances will be written directly into new DT properties.
> 
> Let's split FORM1 code into its own functions to allow for easier
> insertion of FORM2 logic later on.
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 779f18b994..04a86f9b5b 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>  
>  }
>  
> +/*
> + * Set NUMA machine state data based on FORM1 affinity semantics.
> + */
> +static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
> +                                           MachineState *machine)
> +{
> +    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> +
> +    /*
> +     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> +     * 1 NUMA node) will not benefit from anything we're going to do
> +     * after this point.
> +     */
> +    if (using_legacy_numa) {
> +        return;
> +    }

My only concern with this patch is that handling the
"using_legacy_numa" case might logically belong outside the FORM1
code.  I mean, I'm pretty sure using_legacy_numa implies FORM1 in
practice, but conceptually it seems like a more fundamental question
than the DT encoding of the NUMA information.

> +
> +    if (!spapr_numa_is_symmetrical(machine)) {
> +        error_report("Asymmetrical NUMA topologies aren't supported "
> +                     "in the pSeries machine");
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    spapr_numa_define_associativity_domains(spapr);
> +}
> +
>  void spapr_numa_associativity_init(SpaprMachineState *spapr,
>                                     MachineState *machine)
>  {
> @@ -210,22 +236,7 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
> -    /*
> -     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> -     * 1 NUMA node) will not benefit from anything we're going to do
> -     * after this point.
> -     */
> -    if (using_legacy_numa) {
> -        return;
> -    }
> -
> -    if (!spapr_numa_is_symmetrical(machine)) {
> -        error_report("Asymmetrical NUMA topologies aren't supported "
> -                     "in the pSeries machine");
> -        exit(EXIT_FAILURE);
> -    }
> -
> -    spapr_numa_define_associativity_domains(spapr);
> +    spapr_numa_FORM1_affinity_init(spapr, machine);
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> -/*
> - * Helper that writes ibm,associativity-reference-points and
> - * max-associativity-domains in the RTAS pointed by @rtas
> - * in the DT @fdt.
> - */
> -void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
>  {
>      MachineState *ms = MACHINE(spapr);
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> @@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>                       maxdomains, sizeof(maxdomains)));
>  }
>  
> +/*
> + * Helper that writes ibm,associativity-reference-points and
> + * max-associativity-domains in the RTAS pointed by @rtas
> + * in the DT @fdt.
> + */
> +void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +{
> +    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
> +}
> +
>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                SpaprMachineState *spapr,
>                                                target_ulong opcode,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 16+ messages in thread
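David's point above — that the legacy-NUMA check is conceptually separate from the FORM1 encoding — could translate into a dispatcher shaped roughly like the sketch below. This is a hypothetical Python model of the control flow, not the actual QEMU C code: the caller decides between legacy, FORM2, and FORM1 handling before any form-specific helper runs, and the symmetry constraint stays attached to FORM1, which is the only encoding that cannot express asymmetric topologies.

```python
def associativity_init(machine_is_legacy, form2_negotiated, symmetrical=True):
    """Toy dispatcher mirroring the structure discussed above."""
    if machine_is_legacy:
        return "legacy"    # pseries-5.1 or single-node guest: stop early
    if form2_negotiated:
        return "FORM2"     # distances go straight into the new DT tables
    if not symmetrical:
        # FORM1 derives distances from associativity-domain matches,
        # which cannot represent asymmetric topologies
        raise ValueError("asymmetrical NUMA not supported with FORM1")
    return "FORM1"

assert associativity_init(True, True) == "legacy"    # legacy checked first
assert associativity_init(False, True) == "FORM2"
assert associativity_init(False, False) == "FORM1"
```

Whether the legacy check lives in the caller (as here) or inside the FORM1 helper (as in v4 of the patch) does not change behavior, only where the conceptual boundary sits.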

* Re: [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers
  2021-09-07  0:30   ` David Gibson
@ 2021-09-07  0:50     ` Daniel Henrique Barboza
  2021-09-07  1:12       ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:50 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, groug



On 9/6/21 9:30 PM, David Gibson wrote:
> On Fri, Aug 27, 2021 at 06:24:52AM -0300, Daniel Henrique Barboza wrote:
>> The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
>> and doesn't need to be concerned with all the legacy support for older
>> pseries FORM1 guests.
>>
>> We're also not going to calculate associativity domains based on numa
>> distance (via spapr_numa_define_associativity_domains) since the
>> distances will be written directly into new DT properties.
>>
>> Let's split FORM1 code into its own functions to allow for easier
>> insertion of FORM2 logic later on.
>>
>> Reviewed-by: Greg Kurz <groug@kaod.org>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
>>   1 file changed, 39 insertions(+), 22 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index 779f18b994..04a86f9b5b 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>>   
>>   }
>>   
>> +/*
>> + * Set NUMA machine state data based on FORM1 affinity semantics.
>> + */
>> +static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>> +                                           MachineState *machine)
>> +{
>> +    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
>> +
>> +    /*
>> +     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
>> +     * 1 NUMA node) will not benefit from anything we're going to do
>> +     * after this point.
>> +     */
>> +    if (using_legacy_numa) {
>> +        return;
>> +    }
> 
> My only concern with this patch is that handling the
> "using_legacy_numa" case might logically belong outside the FORM1
> code.  I mean, I'm pretty sure using_legacy_numa implies FORM1 in
> practice, but conceptually it seems like a more fundamental question
> than the DT encoding of the NUMA information.

I'll carry this suggestion over to the next spin, v6, given that the v5 I sent
a few minutes ago also verifies legacy NUMA in FORM1 code.


Thanks,


Daniel

> 
>> +
>> +    if (!spapr_numa_is_symmetrical(machine)) {
>> +        error_report("Asymmetrical NUMA topologies aren't supported "
>> +                     "in the pSeries machine");
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    spapr_numa_define_associativity_domains(spapr);
>> +}
>> +
>>   void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>                                      MachineState *machine)
>>   {
>> @@ -210,22 +236,7 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>           spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>       }
>>   
>> -    /*
>> -     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
>> -     * 1 NUMA node) will not benefit from anything we're going to do
>> -     * after this point.
>> -     */
>> -    if (using_legacy_numa) {
>> -        return;
>> -    }
>> -
>> -    if (!spapr_numa_is_symmetrical(machine)) {
>> -        error_report("Asymmetrical NUMA topologies aren't supported "
>> -                     "in the pSeries machine");
>> -        exit(EXIT_FAILURE);
>> -    }
>> -
>> -    spapr_numa_define_associativity_domains(spapr);
>> +    spapr_numa_FORM1_affinity_init(spapr, machine);
>>   }
>>   
>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>> @@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>>       return ret;
>>   }
>>   
>> -/*
>> - * Helper that writes ibm,associativity-reference-points and
>> - * max-associativity-domains in the RTAS pointed by @rtas
>> - * in the DT @fdt.
>> - */
>> -void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>> +static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>> +                                           void *fdt, int rtas)
>>   {
>>       MachineState *ms = MACHINE(spapr);
>>       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>> @@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>>                        maxdomains, sizeof(maxdomains)));
>>   }
>>   
>> +/*
>> + * Helper that writes ibm,associativity-reference-points and
>> + * max-associativity-domains in the RTAS pointed by @rtas
>> + * in the DT @fdt.
>> + */
>> +void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>> +{
>> +    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>> +}
>> +
>>   static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>                                                 SpaprMachineState *spapr,
>>                                                 target_ulong opcode,
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers
  2021-09-07  0:50     ` Daniel Henrique Barboza
@ 2021-09-07  1:12       ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2021-09-07  1:12 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 2828 bytes --]

On Mon, Sep 06, 2021 at 09:50:36PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 9/6/21 9:30 PM, David Gibson wrote:
> > On Fri, Aug 27, 2021 at 06:24:52AM -0300, Daniel Henrique Barboza wrote:
> > > The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
> > > and doesn't need to be concerned with all the legacy support for older
> > > pseries FORM1 guests.
> > > 
> > > We're also not going to calculate associativity domains based on numa
> > > distance (via spapr_numa_define_associativity_domains) since the
> > > distances will be written directly into new DT properties.
> > > 
> > > Let's split FORM1 code into its own functions to allow for easier
> > > insertion of FORM2 logic later on.
> > > 
> > > Reviewed-by: Greg Kurz <groug@kaod.org>
> > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > > ---
> > >   hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
> > >   1 file changed, 39 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > > index 779f18b994..04a86f9b5b 100644
> > > --- a/hw/ppc/spapr_numa.c
> > > +++ b/hw/ppc/spapr_numa.c
> > > @@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
> > >   }
> > > +/*
> > > + * Set NUMA machine state data based on FORM1 affinity semantics.
> > > + */
> > > +static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
> > > +                                           MachineState *machine)
> > > +{
> > > +    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> > > +
> > > +    /*
> > > +     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> > > +     * 1 NUMA node) will not benefit from anything we're going to do
> > > +     * after this point.
> > > +     */
> > > +    if (using_legacy_numa) {
> > > +        return;
> > > +    }
> > 
> > My only concern with this patch is that handling the
> > "using_legacy_numa" case might logically belong outside the FORM1
> > code.  I mean, I'm pretty sure using_legacy_numa implies FORM1 in
> > practice, but conceptually it seems like a more fundamental question
> > than the DT encoding of the NUMA information.
> 
> I'll carry this suggestion over to the next spin, v6, given that the v5 I sent
> a few minutes ago also verifies legacy NUMA in FORM1 code.

Ok.  I should note that I'm not saying what you have now is definitely
wrong; it just looks a bit odd to me.  If you have a rationale for
doing it this way, go ahead and tell me, rather than changing it.
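
For what it's worth, the shape of the refactor under discussion — gating on legacy NUMA in the caller rather than inside the FORM1 helper — can be sketched with stand-in types. None of these names are QEMU's real ones (the real predicate is spapr_machine_using_legacy_numa() and the real init fills spapr->numa_assoc_array):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the machine state. */
typedef struct {
    bool legacy_numa;        /* pseries-5.1 or older, or a single NUMA node */
    bool form1_initialized;
} NumaStub;

/* FORM1-specific setup only; it carries no knowledge of legacy machines. */
static void form1_affinity_init(NumaStub *spapr)
{
    spapr->form1_initialized = true;
}

/* The caller decides whether any affinity setup applies at all, keeping
 * the legacy question out of the FORM-specific code. */
static void associativity_init(NumaStub *spapr)
{
    if (spapr->legacy_numa) {
        return;              /* legacy guests gain nothing from the setup */
    }
    form1_affinity_init(spapr);
}
```

The conceptual gain is that the legacy check answers "does this machine do any affinity setup?", a question that predates and is independent of the FORM1-vs-FORM2 device-tree encoding.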

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-07  1:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-27  9:24 [PATCH v4 0/5] pSeries FORM2 affinity support Daniel Henrique Barboza
2021-08-27  9:24 ` [PATCH v4 1/5] hw, spapr: add 6.2 compat pseries machine Daniel Henrique Barboza
2021-08-30  7:34   ` Greg Kurz
2021-08-30 18:24     ` Daniel Henrique Barboza
2021-08-30 18:32     ` Peter Maydell
2021-08-31  2:01       ` wangyanan (Y)
2021-08-27  9:24 ` [PATCH v4 2/5] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
2021-09-07  0:30   ` David Gibson
2021-09-07  0:50     ` Daniel Henrique Barboza
2021-09-07  1:12       ` David Gibson
2021-08-27  9:24 ` [PATCH v4 3/5] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
2021-08-31 13:40   ` Greg Kurz
2021-08-31 14:13     ` Daniel Henrique Barboza
2021-08-31 14:51       ` Greg Kurz
2021-08-27  9:24 ` [PATCH v4 4/5] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
2021-08-27  9:24 ` [PATCH v4 5/5] spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza
