* [PATCH v5 0/4] pSeries FORM2 affinity support
@ 2021-09-07  0:25 Daniel Henrique Barboza
  2021-09-07  0:25 ` [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset Daniel Henrique Barboza
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

In this new version, the biggest change is that we're now initializing
the NUMA associativity internal data during machine_reset(), instead of
machine_init(), to allow the guest to switch between FORM1 and
FORM2 across guest resets. All other changes are a consequence of this
design change.

Changes from v4:
- former patch 1:
  * dropped, pseries-6.2 machine type is already available
- new patch 1:
  * move numa associativity init to machine reset
- patch 3:
  * avoid resetting associativity data if FORM1 was chosen
- former patch 4:
  * dropped, folded into patch 1
- patch 4 (former 5):
  * move both FORM1 verifications to post-CAS
- v4 link: https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04860.html
 

Daniel Henrique Barboza (4):
  spapr: move NUMA associativity init to machine reset
  spapr_numa.c: split FORM1 code into helpers
  spapr_numa.c: base FORM2 NUMA affinity support
  spapr: move FORM1 verifications to do_client_architecture_support()

 hw/ppc/spapr.c              |  63 +++++-----
 hw/ppc/spapr_hcall.c        |  16 +++
 hw/ppc/spapr_numa.c         | 225 +++++++++++++++++++++++++++++++++---
 include/hw/ppc/spapr.h      |   1 +
 include/hw/ppc/spapr_numa.h |  10 +-
 include/hw/ppc/spapr_ovec.h |   1 +
 6 files changed, 253 insertions(+), 63 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-07  0:25 [PATCH v5 0/4] pSeries FORM2 affinity support Daniel Henrique Barboza
@ 2021-09-07  0:25 ` Daniel Henrique Barboza
  2021-09-07  0:37   ` David Gibson
  2021-09-07  0:25 ` [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

At this moment we only support one form of NUMA affinity, FORM1. This
allows us to init the internal structures during machine_init(), and
given that NUMA distances won't change during the guest lifetime we
don't need to bother with that again.

We're about to introduce FORM2, a new NUMA affinity mode for pSeries
guests. This means that we'll only be certain about the affinity mode
being used after client architecture support. This also means that the
guest can switch affinity modes in machine reset.

Let's prepare the ground for the FORM2 support by moving the NUMA
internal data init from machine_init() to machine_reset(). Change the
name to spapr_numa_associativity_reset() to make it clearer that this is
a function that can be called multiple times during the guest lifecycle.
We're also simplifying its current API since this method will be called
during CAS time (do_client_architecture_support()) later on, where no
MachineState pointer is readily available.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 6 +++---
 hw/ppc/spapr_numa.c         | 4 ++--
 include/hw/ppc/spapr_numa.h | 9 +--------
 3 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d39fd4e644..8e1ff6cd10 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
      */
     spapr_irq_reset(spapr, &error_fatal);
 
+    /* Reset numa_assoc_array */
+    spapr_numa_associativity_reset(spapr);
+
     /*
      * There is no CAS under qtest. Simulate one to please the code that
      * depends on spapr->ov5_cas. This is especially needed to test device
@@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
-    /* Init numa_assoc_array */
-    spapr_numa_associativity_init(spapr, machine);
-
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..9ee4b479fe 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine)
+void spapr_numa_associativity_reset(SpaprMachineState *spapr)
 {
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+    MachineState *machine = MACHINE(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int i, j, max_nodes_with_gpus;
     bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index 6f9f02d3de..0e457bba57 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -16,14 +16,7 @@
 #include "hw/boards.h"
 #include "hw/ppc/spapr.h"
 
-/*
- * Having both SpaprMachineState and MachineState as arguments
- * feels odd, but it will spare a MACHINE() call inside the
- * function. spapr_machine_init() is the only caller for it, and
- * it has both pointers resolved already.
- */
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine);
+void spapr_numa_associativity_reset(SpaprMachineState *spapr);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
-- 
2.31.1




* [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers
  2021-09-07  0:25 [PATCH v5 0/4] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-09-07  0:25 ` [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset Daniel Henrique Barboza
@ 2021-09-07  0:25 ` Daniel Henrique Barboza
  2021-09-07  0:39   ` David Gibson
  2021-09-07  0:25 ` [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
  2021-09-07  0:25 ` [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support() Daniel Henrique Barboza
  3 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
and doesn't need to be concerned with all the legacy support for older
pseries FORM1 guests.

We're also not going to calculate associativity domains based on numa
distance (via spapr_numa_define_associativity_domains) since the
distances will be written directly into new DT properties.

Let's split FORM1 code into its own functions to allow for easier
insertion of FORM2 logic later on.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 9ee4b479fe..84636cb86a 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
+/*
+ * Set NUMA machine state data based on FORM1 affinity semantics.
+ */
+static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
+                                           MachineState *machine)
+{
+    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
+
+    /*
+     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
+     * 1 NUMA node) will not benefit from anything we're going to do
+     * after this point.
+     */
+    if (using_legacy_numa) {
+        return;
+    }
+
+    if (!spapr_numa_is_symmetrical(machine)) {
+        error_report("Asymmetrical NUMA topologies aren't supported "
+                     "in the pSeries machine");
+        exit(EXIT_FAILURE);
+    }
+
+    spapr_numa_define_associativity_domains(spapr);
+}
+
 void spapr_numa_associativity_reset(SpaprMachineState *spapr)
 {
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
@@ -210,22 +236,7 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    /*
-     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-     * 1 NUMA node) will not benefit from anything we're going to do
-     * after this point.
-     */
-    if (using_legacy_numa) {
-        return;
-    }
-
-    if (!spapr_numa_is_symmetrical(machine)) {
-        error_report("Asymmetrical NUMA topologies aren't supported "
-                     "in the pSeries machine");
-        exit(EXIT_FAILURE);
-    }
-
-    spapr_numa_define_associativity_domains(spapr);
+    spapr_numa_FORM1_affinity_init(spapr, machine);
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
-/*
- * Helper that writes ibm,associativity-reference-points and
- * max-associativity-domains in the RTAS pointed by @rtas
- * in the DT @fdt.
- */
-void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
 {
     MachineState *ms = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
@@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
                      maxdomains, sizeof(maxdomains)));
 }
 
+/*
+ * Helper that writes ibm,associativity-reference-points and
+ * max-associativity-domains in the RTAS pointed by @rtas
+ * in the DT @fdt.
+ */
+void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+{
+    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
-- 
2.31.1




* [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support
  2021-09-07  0:25 [PATCH v5 0/4] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-09-07  0:25 ` [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset Daniel Henrique Barboza
  2021-09-07  0:25 ` [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-09-07  0:25 ` Daniel Henrique Barboza
  2021-09-07  1:02   ` David Gibson
  2021-09-07  7:50   ` Greg Kurz
  2021-09-07  0:25 ` [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support() Daniel Henrique Barboza
  3 siblings, 2 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The main feature of FORM2 affinity support is the separation of NUMA
distances from ibm,associativity information. This allows for a more
flexible and straightforward NUMA distance assignment without relying on
complex associations between several levels of NUMA via
ibm,associativity matches. Another feature is its extensibility. This base
support contains the facilities for NUMA distance assignment, but in the
future more facilities will be added for latency, performance, bandwidth
and so on.

This patch implements the base FORM2 affinity support as follows:

- the use of FORM2 associativity is indicated by using bit 2 of byte 5
of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
or FORM2 affinity. Setting both forms will default to FORM2. We're not
advertising FORM2 for pseries-6.1 and older machine versions to prevent
guest-visible changes in those;

- call spapr_numa_associativity_reset() in do_client_architecture_support()
if FORM2 is chosen. This will avoid re-initializing FORM1 artifacts that
were already initialized in spapr_machine_reset();

- ibm,associativity-reference-points has a new semantic. Instead of
being used to calculate distances via NUMA levels, it's now used to
indicate the primary domain index in the ibm,associativity domain of
each resource. In our case it's set to {0x4}, matching the position
where we already place logical_domain_id;

- two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
and ibm,numa-distance-table. The index table is used to list all the
NUMA logical domains of the platform, in ascending order, and allows for
sparse NUMA configurations (although QEMU doesn't support those at the moment).
ibm,numa-distance-table is an array that contains all the distances from
the first NUMA node to all other nodes, then the second NUMA node
distances to all other nodes and so on;

- spapr_post_load changes: since we're adding a new NUMA affinity that
isn't compatible with the existing one, migration must be handled
accordingly because we can't be certain of whether the guest went
through CAS in the source. The solution chosen is to initialize the NUMA
associativity data in spapr_post_load() unconditionally. The worst case
would be to write the DT twice if the guest is in pre-CAS stage.
Otherwise, we're making sure that a FORM1 guest will have the
spapr->numa_assoc_array initialized with the proper information based on
user distance, something that we're not doing with FORM2.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              |  24 +++++++
 hw/ppc/spapr_hcall.c        |  10 +++
 hw/ppc/spapr_numa.c         | 135 +++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr.h      |   1 +
 include/hw/ppc/spapr_ovec.h |   1 +
 5 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8e1ff6cd10..8d98e3b08a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1789,6 +1789,22 @@ static int spapr_post_load(void *opaque, int version_id)
         return err;
     }
 
+    /*
+     * NUMA data init is made in CAS time. There is no reliable
+     * way of telling whether the guest already went through CAS
+     * in the source due to how spapr_ov5_cas_needed works: a
+     * FORM1 guest can be migrated with ov5_cas empty regardless
+     * of going through CAS first.
+     *
+     * One solution is to always call numa_associativity_reset. The
+     * downside is that a guest migrated before CAS will run
+     * numa_associativity_reset again when going through it, but
+     * at least we're making sure spapr->numa_assoc_array will be
+     * initialized and hotplug operations won't fail in both before
+     * and after CAS migration cases.
+     */
+     spapr_numa_associativity_reset(spapr);
+
     return err;
 }
 
@@ -2755,6 +2771,11 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
 
+    /* Do not advertise FORM2 support for pseries-6.1 and older */
+    if (!smc->pre_6_2_numa_affinity) {
+        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
+    }
+
     /* advertise support for dedicated HP event source to guests */
     if (spapr->use_hotplug_event_source) {
         spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
@@ -4700,8 +4721,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
  */
 static void spapr_machine_6_1_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_2_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
+    smc->pre_6_2_numa_affinity = true;
 }
 
 DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 0e9a5b2e40..7efbe93f4b 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -11,6 +11,7 @@
 #include "helper_regs.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_cpu_core.h"
+#include "hw/ppc/spapr_numa.h"
 #include "mmu-hash64.h"
 #include "cpu-models.h"
 #include "trace.h"
@@ -1197,6 +1198,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
     spapr_ovec_cleanup(ov1_guest);
 
+    /*
+     * If the guest chooses FORM2 we need to reset the associativity
+     * information - it is being defaulted to FORM1 during
+     * spapr_machine_reset().
+     */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_associativity_reset(spapr);
+    }
+
     /*
      * Ensure the guest asks for an interrupt mode we support;
      * otherwise terminate the boot.
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 84636cb86a..ca276e16cb 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -202,6 +202,24 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
         spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
 
+        /*
+         * For FORM2 affinity, zero all the non-used associativity
+         * domains (all domains but the last). This step is necessary
+         * because FORM2 will always be populated on top of FORM1
+         * (via spapr_machine_reset()), which populates them.
+         *
+         * The kernel doesn't mind these non-used domains but zeroing
+         * them out makes it clearer to userspace that they are not
+         * being used.
+         */
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
+                spapr->numa_assoc_array[i][j] = 0;
+            }
+
+            continue;
+        }
+
         /*
          * Fill all associativity domains of non-zero NUMA nodes with
          * node_id. This is required because the default value (0) is
@@ -236,7 +254,16 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    spapr_numa_FORM1_affinity_init(spapr, machine);
+    /*
+     * We test for !FORM2 instead of testing for FORM1 because,
+     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
+     * to get ov5_cas migrated, but setting FORM2 is. Since we're
+     * dealing with either FORM1 or FORM2, test for the option
+     * that is guaranteed to be set after a migration.
+     */
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM1_affinity_init(spapr, machine);
+    }
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -313,6 +340,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
+static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
+                                               void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
+    g_autofree uint32_t *lookup_index_table = NULL;
+    g_autofree uint32_t *distance_table = NULL;
+    int src, dst, i, distance_table_size;
+    uint8_t *node_distances;
+
+    /*
+     * ibm,numa-lookup-index-table: array with length and a
+     * list of NUMA ids present in the guest.
+     */
+    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
+    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        lookup_index_table[i + 1] = cpu_to_be32(i);
+    }
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
+                     lookup_index_table,
+                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
+
+    /*
+     * ibm,numa-distance-table: contains all node distances. First
+     * element is the size of the table as uint32, followed up
+     * by all the uint8 distances from the first NUMA node, then all
+     * distances from the second NUMA node and so on.
+     *
+     * ibm,numa-lookup-index-table is used by guest to navigate this
+     * array because NUMA ids can be sparse (node 0 is the first,
+     * node 8 is the second ...).
+     */
+    distance_table = g_new0(uint32_t, distance_table_entries + 1);
+    distance_table[0] = cpu_to_be32(distance_table_entries);
+
+    node_distances = (uint8_t *)&distance_table[1];
+    i = 0;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            node_distances[i++] = numa_info[src].distance[dst];
+        }
+    }
+
+    distance_table_size = distance_table_entries * sizeof(uint8_t) +
+                          sizeof(uint32_t);
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
+                     distance_table, distance_table_size));
+}
+
+/*
+ * This helper could be compressed in a single function with
+ * FORM1 logic since we're setting the same DT values, with the
+ * difference being a call to spapr_numa_FORM2_write_rtas_tables()
+ * in the end. The separation was made to avoid clogging FORM1 code
+ * which already has to deal with compat modes from previous
+ * QEMU machine types.
+ */
+static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
+                                   spapr_numa_initial_nvgpu_numa_id(ms);
+
+    /*
+     * In FORM2, ibm,associativity-reference-points will point to
+     * the element in the ibm,associativity array that contains the
+     * primary domain index. This value (in our case, the numa-id) is
+     * then used as an index to retrieve all other attributes of the
+     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
+     * and other ibm,numa-*-table properties.
+     */
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+    };
+
+    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
+    uint32_t maxdomains[] = {
+        cpu_to_be32(4),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain)
+    };
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
+                     refpoints, nr_refpoints * sizeof(refpoints[0])));
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
+                     maxdomains, sizeof(maxdomains)));
+
+    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
+}
+
 static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
                                            void *fdt, int rtas)
 {
@@ -379,6 +507,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
  */
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 637652ad16..21b1cf9ebf 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -145,6 +145,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_2_numa_affinity;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 48b716a060..c3e8b98e7e 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
 /* option vector 5 */
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
+#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
 #define OV5_HP_EVT              OV_BIT(6, 5)
 #define OV5_HPT_RESIZE          OV_BIT(6, 7)
 #define OV5_DRMEM_V2            OV_BIT(22, 0)
-- 
2.31.1




* [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support()
  2021-09-07  0:25 [PATCH v5 0/4] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-09-07  0:25 ` [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-09-07  0:25 ` Daniel Henrique Barboza
  2021-09-07  1:04   ` David Gibson
  3 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07  0:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

FORM2 NUMA affinity is prepared to deal with empty (memory-less/cpu-less)
NUMA nodes. This is used by the DAX KMEM driver to locate a PAPR SCM
device that has a different latency than the NUMA node of the regular
memory. FORM2 is also able to deal with asymmetric NUMA
distances gracefully, something that our FORM1 implementation doesn't
do.

Move these FORM1 verifications to a new function and wait until after
CAS, when we're sure that we're sticking with FORM1, to enforce them.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 33 -------------------------
 hw/ppc/spapr_hcall.c        |  6 +++++
 hw/ppc/spapr_numa.c         | 49 ++++++++++++++++++++++++++++++++-----
 include/hw/ppc/spapr_numa.h |  1 +
 4 files changed, 50 insertions(+), 39 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8d98e3b08a..c974c07fb8 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2797,39 +2797,6 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
-    /*
-     * check we don't have a memory-less/cpu-less NUMA node
-     * Firmware relies on the existing memory/cpu topology to provide the
-     * NUMA topology to the kernel.
-     * And the linux kernel needs to know the NUMA topology at start
-     * to be able to hotplug CPUs later.
-     */
-    if (machine->numa_state->num_nodes) {
-        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
-            /* check for memory-less node */
-            if (machine->numa_state->nodes[i].node_mem == 0) {
-                CPUState *cs;
-                int found = 0;
-                /* check for cpu-less node */
-                CPU_FOREACH(cs) {
-                    PowerPCCPU *cpu = POWERPC_CPU(cs);
-                    if (cpu->node_id == i) {
-                        found = 1;
-                        break;
-                    }
-                }
-                /* memory-less and cpu-less node */
-                if (!found) {
-                    error_report(
-                       "Memory-less/cpu-less nodes are not supported (node %d)",
-                                 i);
-                    exit(1);
-                }
-            }
-        }
-
-    }
-
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 7efbe93f4b..27ee713600 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1202,9 +1202,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
      * If the guest chooses FORM2 we need to reset the associativity
      * information - it is being defaulted to FORM1 during
      * spapr_machine_reset().
+     *
+     * If we're sure that we'll be using FORM1, verify now if we have
+     * a configuration or condition that is not available for FORM1
+     * (namely asymmetric NUMA topologies and empty NUMA nodes).
      */
     if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
         spapr_numa_associativity_reset(spapr);
+    } else {
+        spapr_numa_check_FORM1_constraints(MACHINE(spapr));
     }
 
     /*
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index ca276e16cb..0c57d03184 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -155,6 +155,49 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
+void spapr_numa_check_FORM1_constraints(MachineState *machine)
+{
+    int i;
+
+    if (!spapr_numa_is_symmetrical(machine)) {
+        error_report("Asymmetrical NUMA topologies aren't supported "
+                     "in the pSeries machine");
+        exit(EXIT_FAILURE);
+    }
+
+    /*
+     * check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (machine->numa_state->num_nodes) {
+        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
+            /* check for memory-less node */
+            if (machine->numa_state->nodes[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+                       "Memory-less/cpu-less nodes are not supported (node %d)",
+                                 i);
+                    exit(EXIT_FAILURE);
+                }
+            }
+        }
+    }
+}
+
 /*
  * Set NUMA machine state data based on FORM1 affinity semantics.
  */
@@ -172,12 +215,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
         return;
     }
 
-    if (!spapr_numa_is_symmetrical(machine)) {
-        error_report("Asymmetrical NUMA topologies aren't supported "
-                     "in the pSeries machine");
-        exit(EXIT_FAILURE);
-    }
-
     spapr_numa_define_associativity_domains(spapr);
 }
 
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index 0e457bba57..b5a19cb3f1 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -25,5 +25,6 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
 int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
                                          int offset);
 unsigned int spapr_numa_initial_nvgpu_numa_id(MachineState *machine);
+void spapr_numa_check_FORM1_constraints(MachineState *machine);
 
 #endif /* HW_SPAPR_NUMA_H */
-- 
2.31.1




* Re: [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-07  0:25 ` [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset Daniel Henrique Barboza
@ 2021-09-07  0:37   ` David Gibson
  2021-09-07  7:10     ` Greg Kurz
  0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2021-09-07  0:37 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 4367 bytes --]

On Mon, Sep 06, 2021 at 09:25:24PM -0300, Daniel Henrique Barboza wrote:
> At this moment we only support one form of NUMA affinity, FORM1. This
> allows us to init the internal structures during machine_init(), and
> given that NUMA distances won't change during the guest lifetime we
> don't need to bother with that again.
> 
> We're about to introduce FORM2, a new NUMA affinity mode for pSeries
> guests. This means that we'll only be certain about the affinity mode
> being used after client architecture support. This also means that the
> guest can switch affinity modes in machine reset.
> 
> Let's prepare the ground for the FORM2 support by moving the NUMA
> internal data init from machine_init() to machine_reset(). Change the
> name to spapr_numa_associativity_reset() to make it clearer that this is
> a function that can be called multiple times during the guest lifecycle.
> We're also simplifying its current API since this method will be called
> during CAS time (do_client_architecture_support()) later on and there's no
> MachineState pointer already solved there.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Applied to ppc-for-6.2, thanks.

> ---
>  hw/ppc/spapr.c              | 6 +++---
>  hw/ppc/spapr_numa.c         | 4 ++--
>  include/hw/ppc/spapr_numa.h | 9 +--------
>  3 files changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d39fd4e644..8e1ff6cd10 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
>       */
>      spapr_irq_reset(spapr, &error_fatal);
>  
> +    /* Reset numa_assoc_array */
> +    spapr_numa_associativity_reset(spapr);
> +
>      /*
>       * There is no CAS under qtest. Simulate one to please the code that
>       * depends on spapr->ov5_cas. This is especially needed to test device
> @@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>  
> -    /* Init numa_assoc_array */
> -    spapr_numa_associativity_init(spapr, machine);
> -
>      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>          ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>                                spapr->max_compat_pvr)) {
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 779f18b994..9ee4b479fe 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>  
>  }
>  
> -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> -                                   MachineState *machine)
> +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>  {
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +    MachineState *machine = MACHINE(spapr);
>      int nb_numa_nodes = machine->numa_state->num_nodes;
>      int i, j, max_nodes_with_gpus;
>      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> index 6f9f02d3de..0e457bba57 100644
> --- a/include/hw/ppc/spapr_numa.h
> +++ b/include/hw/ppc/spapr_numa.h
> @@ -16,14 +16,7 @@
>  #include "hw/boards.h"
>  #include "hw/ppc/spapr.h"
>  
> -/*
> - * Having both SpaprMachineState and MachineState as arguments
> - * feels odd, but it will spare a MACHINE() call inside the
> - * function. spapr_machine_init() is the only caller for it, and
> - * it has both pointers resolved already.
> - */
> -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> -                                   MachineState *machine);

Nice additional cleanup to the signature, thanks.

> +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>                                         int offset, int nodeid);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers
  2021-09-07  0:25 ` [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-09-07  0:39   ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2021-09-07  0:39 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 4819 bytes --]

On Mon, Sep 06, 2021 at 09:25:25PM -0300, Daniel Henrique Barboza wrote:
> The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
> and doesn't need to be concerned with all the legacy support for older
> pseries FORM1 guests.
> 
> We're also not going to calculate associativity domains based on numa
> distance (via spapr_numa_define_associativity_domains) since the
> distances will be written directly into new DT properties.
> 
> Let's split FORM1 code into its own functions to allow for easier
> insertion of FORM2 logic later on.
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 9ee4b479fe..84636cb86a 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>  
>  }
>  
> +/*
> + * Set NUMA machine state data based on FORM1 affinity semantics.
> + */
> +static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
> +                                           MachineState *machine)
> +{
> +    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> +
> +    /*
> +     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> +     * 1 NUMA node) will not benefit from anything we're going to do
> +     * after this point.
> +     */
> +    if (using_legacy_numa) {
> +        return;
> +    }

As noted on the previous version (send moments before seeing the new
spin), I'm just slightly uncomfortable with the logic being

   if (form1) {
       if (!legacy) {
          ....
       }
   }

rather than

   if (!legacy) {
       if (form1) {
           ....
       }
   }

> +
> +    if (!spapr_numa_is_symmetrical(machine)) {
> +        error_report("Asymmetrical NUMA topologies aren't supported "
> +                     "in the pSeries machine");
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    spapr_numa_define_associativity_domains(spapr);
> +}
> +
>  void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>  {
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> @@ -210,22 +236,7 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
> -    /*
> -     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> -     * 1 NUMA node) will not benefit from anything we're going to do
> -     * after this point.
> -     */
> -    if (using_legacy_numa) {
> -        return;
> -    }
> -
> -    if (!spapr_numa_is_symmetrical(machine)) {
> -        error_report("Asymmetrical NUMA topologies aren't supported "
> -                     "in the pSeries machine");
> -        exit(EXIT_FAILURE);
> -    }
> -
> -    spapr_numa_define_associativity_domains(spapr);
> +    spapr_numa_FORM1_affinity_init(spapr, machine);
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> -/*
> - * Helper that writes ibm,associativity-reference-points and
> - * max-associativity-domains in the RTAS pointed by @rtas
> - * in the DT @fdt.
> - */
> -void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
>  {
>      MachineState *ms = MACHINE(spapr);
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> @@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>                       maxdomains, sizeof(maxdomains)));
>  }
>  
> +/*
> + * Helper that writes ibm,associativity-reference-points and
> + * max-associativity-domains in the RTAS pointed by @rtas
> + * in the DT @fdt.
> + */
> +void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +{
> +    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
> +}
> +
>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                SpaprMachineState *spapr,
>                                                target_ulong opcode,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support
  2021-09-07  0:25 ` [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-09-07  1:02   ` David Gibson
  2021-09-07 10:07     ` Daniel Henrique Barboza
  2021-09-07  7:50   ` Greg Kurz
  1 sibling, 1 reply; 16+ messages in thread
From: David Gibson @ 2021-09-07  1:02 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 15704 bytes --]

On Mon, Sep 06, 2021 at 09:25:26PM -0300, Daniel Henrique Barboza wrote:
> The main feature of FORM2 affinity support is the separation of NUMA
> distances from ibm,associativity information. This allows for a more
> flexible and straightforward NUMA distance assignment without relying on
> complex associations between several levels of NUMA via
> ibm,associativity matches. Another feature is its extensibility. This base
> support contains the facilities for NUMA distance assignment, but in the
> future more facilities will be added for latency, performance, bandwidth
> and so on.
> 
> This patch implements the base FORM2 affinity support as follows:
> 
> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> or FORM2 affinity. Setting both forms will default to FORM2. We're not
> advertising FORM2 for pseries-6.1 and older machine versions to prevent
> guest visible changes in those;
> 
> - call spapr_numa_associativity_reset() in do_client_architecture_support()
> if FORM2 is chosen. This will avoid re-initializing FORM1 artifacts that
> were already initialized in spapr_machine_reset();
> 
> - ibm,associativity-reference-points has a new semantic. Instead of
> being used to calculate distances via NUMA levels, it's now used to
> indicate the primary domain index in the ibm,associativity domain of
> each resource. In our case it's set to {0x4}, matching the position
> where we already place logical_domain_id;

Hmm... I'm a bit torn on this.  The whole reason the ibm,associativity
things are arrays rather than just numbers was to enable the FORM1
nonsense. So we have a choice here: keep the associativity arrays in
the same form, for simplicity of the code, or reduce the associativity
arrays to one entry for FORM2, to simplify the overall DT contents in
the "modern" case.

> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table

This doesn't really have anything to do with RTAS.

> and ibm,numa-distance-table. The index table is used to list all the
> NUMA logical domains of the platform, in ascending order, and allows for
> spartial NUMA configurations (although QEMU ATM doesn't support
> that).

s/spartial/partial/

Perhaps more to the point, it allows for sparsely allocated domain
IDs.

> ibm,numa-distance-table is an array that contains all the distances from
> the first NUMA node to all other nodes, then the second NUMA node
> distances to all other nodes and so on;
> 
> - spapr_post_load changes: since we're adding a new NUMA affinity that
> isn't compatible with the existing one, migration must be handled
> accordingly because we can't be certain of whether the guest went
> through CAS in the source. The solution chosen is to initiate the NUMA
> associativity data in spapr_post_load() unconditionally. The worst case
> would be to write the DT twice if the guest is in pre-CAS stage.
> Otherwise, we're making sure that a FORM1 guest will have the
> spapr->numa_assoc_array initialized with the proper information based on
> user distance, something that we're not doing with FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              |  24 +++++++
>  hw/ppc/spapr_hcall.c        |  10 +++
>  hw/ppc/spapr_numa.c         | 135 +++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h      |   1 +
>  include/hw/ppc/spapr_ovec.h |   1 +
>  5 files changed, 170 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 8e1ff6cd10..8d98e3b08a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1789,6 +1789,22 @@ static int spapr_post_load(void *opaque, int version_id)
>          return err;
>      }
>  
> +    /*
> +     * NUMA data init is made in CAS time. There is no reliable
> +     * way of telling whether the guest already went through CAS
> +     * in the source due to how spapr_ov5_cas_needed works: a
> +     * FORM1 guest can be migrated with ov5_cas empty regardless
> +     * of going through CAS first.
> +     *
> +     * One solution is to always call numa_associativity_reset. The
> +     * downside is that a guest migrated before CAS will run
> +     * numa_associativity_reset again when going through it, but
> +     * at least we're making sure spapr->numa_assoc_array will be
> +     * initialized and hotplug operations won't fail in both before
> +     * and after CAS migration cases.
> +     */
> +     spapr_numa_associativity_reset(spapr);
> +
>      return err;
>  }
>  
> @@ -2755,6 +2771,11 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>  
> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
> +    if (!smc->pre_6_2_numa_affinity) {
> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
> +    }
> +
>      /* advertise support for dedicated HP event source to guests */
>      if (spapr->use_hotplug_event_source) {
>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
> @@ -4700,8 +4721,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>   */
>  static void spapr_machine_6_1_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_6_2_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
> +    smc->pre_6_2_numa_affinity = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 0e9a5b2e40..7efbe93f4b 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -11,6 +11,7 @@
>  #include "helper_regs.h"
>  #include "hw/ppc/spapr.h"
>  #include "hw/ppc/spapr_cpu_core.h"
> +#include "hw/ppc/spapr_numa.h"
>  #include "mmu-hash64.h"
>  #include "cpu-models.h"
>  #include "trace.h"
> @@ -1197,6 +1198,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>      spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>      spapr_ovec_cleanup(ov1_guest);
>  
> +    /*
> +     * If the guest chooses FORM2 we need to reset the associativity
> +     * information - it is being defaulted to FORM1 during
> +     * spapr_machine_reset().
> +     */
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_associativity_reset(spapr);
> +    }
> +
>      /*
>       * Ensure the guest asks for an interrupt mode we support;
>       * otherwise terminate the boot.
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 84636cb86a..ca276e16cb 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -202,6 +202,24 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>          spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>  
> +        /*
> +         * For FORM2 affinity, zero all the non-used associativity
> +         * domains (all domains but the last). This step is necessary
> +         * because FORM2 will always be populated on top of FORM1
> +         * (via spapr_machine_reset()), which populates them.
> +         *
> +         * The kernel doesn't mind these non-used domains but zeroing
> +         * them out makes it clearer to userspace that they are not
> +         * being used.

Huh... so you are clearing the entries in any case.  Wouldn't it make
more sense to just make all the arrays 1 entry long for FORM2?

> +         */
> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> +                spapr->numa_assoc_array[i][j] = 0;
> +            }
> +
> +            continue;
> +        }
> +
>          /*
>           * Fill all associativity domains of non-zero NUMA nodes with
>           * node_id. This is required because the default value (0) is
> @@ -236,7 +254,16 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
> -    spapr_numa_FORM1_affinity_init(spapr, machine);
> +    /*
> +     * We test for !FORM2 instead of testing for FORM1 because,
> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
> +     * dealing with either FORM1 or FORM2, test for the option
> +     * that is guaranteed to be set after a migration.
> +     */
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM1_affinity_init(spapr, machine);
> +    }
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -313,6 +340,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> +                                               void *fdt, int rtas)

Again, not really related to RTAS at all... oh except for appearing in
the /rtas node, ok I see your point.  I'd still probably dump the
"rtas" from the function name.

> +{
> +    MachineState *ms = MACHINE(spapr);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
> +    g_autofree uint32_t *lookup_index_table = NULL;
> +    g_autofree uint32_t *distance_table = NULL;
> +    int src, dst, i, distance_table_size;
> +    uint8_t *node_distances;
> +
> +    /*
> +     * ibm,numa-lookup-index-table: array with length and a
> +     * list of NUMA ids present in the guest.
> +     */
> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        lookup_index_table[i + 1] = cpu_to_be32(i);
> +    }
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
> +                     lookup_index_table,
> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
> +
> +    /*
> +     * ibm,numa-distance-table: contains all node distances. First
> +     * element is the size of the table as uint32, followed up
> +     * by all the uint8 distances from the first NUMA node, then all
> +     * distances from the second NUMA node and so on.
> +     *
> +     * ibm,numa-lookup-index-table is used by guest to navigate this
> +     * array because NUMA ids can be sparse (node 0 is the first,
> +     * node 8 is the second ...).

This comment seems a little misleading.  The ids can be sparse in
PAPR, but they're not as we've constructed them above.

> +     */
> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
> +    distance_table[0] = cpu_to_be32(distance_table_entries);
> +
> +    node_distances = (uint8_t *)&distance_table[1];
> +    i = 0;
> +
> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            node_distances[i++] = numa_info[src].distance[dst];
> +        }
> +    }
> +
> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
> +                          sizeof(uint32_t);
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
> +                     distance_table, distance_table_size));
> +}
> +
> +/*
> + * This helper could be compressed in a single function with
> + * FORM1 logic since we're setting the same DT values, with the
> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
> + * in the end. The separation was made to avoid clogging FORM1 code
> + * which already has to deal with compat modes from previous
> + * QEMU machine types.
> + */
> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
> +
> +    /*
> +     * In FORM2, ibm,associativity-reference-points will point to
> +     * the element in the ibm,associativity array that contains the
> +     * primary domain index. This value (in our case, the numa-id) is
> +     * then used as an index to retrieve all other attributes of the
> +     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
> +     * and other ibm,numa-*-table properties.
> +     */
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +    };
> +
> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
> +    uint32_t maxdomains[] = {
> +        cpu_to_be32(4),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain)

This doesn't seem right, since you only ever use 0 for the top 3
domain ID levels.  Again, wouldn't this be simplifed by just making
the first entry 1, along with every associativity array?

> +    };
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> +                     maxdomains, sizeof(maxdomains)));
> +
> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
> +}
> +
>  static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>                                             void *fdt, int rtas)
>  {
> @@ -379,6 +507,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>   */
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
> +        return;
> +    }
> +
>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>  }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 637652ad16..21b1cf9ebf 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    bool pre_6_2_numa_affinity;
>  
>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio,
> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
> index 48b716a060..c3e8b98e7e 100644
> --- a/include/hw/ppc/spapr_ovec.h
> +++ b/include/hw/ppc/spapr_ovec.h
> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>  /* option vector 5 */
>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>  #define OV5_HP_EVT              OV_BIT(6, 5)
>  #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>  #define OV5_DRMEM_V2            OV_BIT(22, 0)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support()
  2021-09-07  0:25 ` [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support() Daniel Henrique Barboza
@ 2021-09-07  1:04   ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2021-09-07  1:04 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 6887 bytes --]

On Mon, Sep 06, 2021 at 09:25:27PM -0300, Daniel Henrique Barboza wrote:
> FORM2 NUMA affinity is prepared to deal with empty (memory/cpu less)
> NUMA nodes. This is used by the DAX KMEM driver to locate a PAPR SCM
> device that has a different latency than the original NUMA node from the
> regular memory. FORM2 is also able to deal with asymmetric NUMA
> distances gracefully, something that our FORM1 implementation doesn't
> do.
> 
> Move these FORM1 verifications to a new function and wait until after
> CAS, when we're sure that we're sticking with FORM1, to enforce them.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              | 33 -------------------------
>  hw/ppc/spapr_hcall.c        |  6 +++++
>  hw/ppc/spapr_numa.c         | 49 ++++++++++++++++++++++++++++++++-----
>  include/hw/ppc/spapr_numa.h |  1 +
>  4 files changed, 50 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 8d98e3b08a..c974c07fb8 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2797,39 +2797,6 @@ static void spapr_machine_init(MachineState *machine)
>      /* init CPUs */
>      spapr_init_cpus(spapr);
>  
> -    /*
> -     * check we don't have a memory-less/cpu-less NUMA node
> -     * Firmware relies on the existing memory/cpu topology to provide the
> -     * NUMA topology to the kernel.
> -     * And the linux kernel needs to know the NUMA topology at start
> -     * to be able to hotplug CPUs later.
> -     */
> -    if (machine->numa_state->num_nodes) {
> -        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
> -            /* check for memory-less node */
> -            if (machine->numa_state->nodes[i].node_mem == 0) {
> -                CPUState *cs;
> -                int found = 0;
> -                /* check for cpu-less node */
> -                CPU_FOREACH(cs) {
> -                    PowerPCCPU *cpu = POWERPC_CPU(cs);
> -                    if (cpu->node_id == i) {
> -                        found = 1;
> -                        break;
> -                    }
> -                }
> -                /* memory-less and cpu-less node */
> -                if (!found) {
> -                    error_report(
> -                       "Memory-less/cpu-less nodes are not supported (node %d)",
> -                                 i);
> -                    exit(1);
> -                }
> -            }
> -        }
> -
> -    }
> -
>      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>  
>      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 7efbe93f4b..27ee713600 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1202,9 +1202,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>       * If the guest chooses FORM2 we need to reset the associativity
>       * information - it is being defaulted to FORM1 during
>       * spapr_machine_reset().
> +     *
> +     * If we're sure that we'll be using FORM1, verify now if we have
> +     * a configuration or condition that is not available for FORM1
> +     * (namely asymmetric NUMA topologies and empty NUMA nodes).
>       */
>      if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>          spapr_numa_associativity_reset(spapr);
> +    } else {
> +        spapr_numa_check_FORM1_constraints(MACHINE(spapr));

Couldn't you put this call into one of the existing FORM1 functions?

>      }
>  
>      /*
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index ca276e16cb..0c57d03184 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -155,6 +155,49 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>  
>  }
>  
> +void spapr_numa_check_FORM1_constraints(MachineState *machine)
> +{
> +    int i;
> +
> +    if (!spapr_numa_is_symmetrical(machine)) {
> +        error_report("Asymmetrical NUMA topologies aren't supported "
> +                     "in the pSeries machine");

Error message needs an update since they are now possible with FORM2.

> +        exit(EXIT_FAILURE);
> +    }
> +
> +    /*
> +     * check we don't have a memory-less/cpu-less NUMA node
> +     * Firmware relies on the existing memory/cpu topology to provide the
> +     * NUMA topology to the kernel.
> +     * And the linux kernel needs to know the NUMA topology at start
> +     * to be able to hotplug CPUs later.
> +     */
> +    if (machine->numa_state->num_nodes) {
> +        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
> +            /* check for memory-less node */
> +            if (machine->numa_state->nodes[i].node_mem == 0) {
> +                CPUState *cs;
> +                int found = 0;
> +                /* check for cpu-less node */
> +                CPU_FOREACH(cs) {
> +                    PowerPCCPU *cpu = POWERPC_CPU(cs);
> +                    if (cpu->node_id == i) {
> +                        found = 1;
> +                        break;
> +                    }
> +                }
> +                /* memory-less and cpu-less node */
> +                if (!found) {
> +                    error_report(
> +                       "Memory-less/cpu-less nodes are not supported (node %d)",
> +                                 i);
> +                    exit(EXIT_FAILURE);
> +                }
> +            }
> +        }
> +    }
> +}
> +
>  /*
>   * Set NUMA machine state data based on FORM1 affinity semantics.
>   */
> @@ -172,12 +215,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>          return;
>      }
>  
> -    if (!spapr_numa_is_symmetrical(machine)) {
> -        error_report("Asymmetrical NUMA topologies aren't supported "
> -                     "in the pSeries machine");
> -        exit(EXIT_FAILURE);
> -    }
> -
>      spapr_numa_define_associativity_domains(spapr);
>  }
>  
> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> index 0e457bba57..b5a19cb3f1 100644
> --- a/include/hw/ppc/spapr_numa.h
> +++ b/include/hw/ppc/spapr_numa.h
> @@ -25,5 +25,6 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
>  int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>                                           int offset);
>  unsigned int spapr_numa_initial_nvgpu_numa_id(MachineState *machine);
> +void spapr_numa_check_FORM1_constraints(MachineState *machine);
>  
>  #endif /* HW_SPAPR_NUMA_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-07  0:37   ` David Gibson
@ 2021-09-07  7:10     ` Greg Kurz
  2021-09-07  9:23       ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Kurz @ 2021-09-07  7:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Daniel Henrique Barboza, qemu-ppc, qemu-devel


On Tue, 7 Sep 2021 10:37:27 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Sep 06, 2021 at 09:25:24PM -0300, Daniel Henrique Barboza wrote:
> > At this moment we only support one form of NUMA affinity, FORM1. This
> > allows us to init the internal structures during machine_init(), and
> > given that NUMA distances won't change during the guest lifetime we
> > don't need to bother with that again.
> > 
> > We're about to introduce FORM2, a new NUMA affinity mode for pSeries
> > guests. This means that we'll only be certain about the affinity mode
> > being used after client architecture support. This also means that the
> > guest can switch affinity modes in machine reset.
> > 
> > Let's prepare the ground for the FORM2 support by moving the NUMA
> > internal data init from machine_init() to machine_reset(). Change the
> > name to spapr_numa_associativity_reset() to make it clearer that this is
> > a function that can be called multiple times during the guest lifecycle.
> > We're also simplifying its current API since this method will be called
> > during CAS time (do_client_architecture_support()) later on and there's no
> > MachineState pointer already solved there.
> > 
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> Applied to ppc-for-6.2, thanks.
> 

Even if already applied:

Reviewed-by: Greg Kurz <groug@kaod.org>

> > ---
> >  hw/ppc/spapr.c              | 6 +++---
> >  hw/ppc/spapr_numa.c         | 4 ++--
> >  include/hw/ppc/spapr_numa.h | 9 +--------
> >  3 files changed, 6 insertions(+), 13 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index d39fd4e644..8e1ff6cd10 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
> >       */
> >      spapr_irq_reset(spapr, &error_fatal);
> >  
> > +    /* Reset numa_assoc_array */
> > +    spapr_numa_associativity_reset(spapr);
> > +
> >      /*
> >       * There is no CAS under qtest. Simulate one to please the code that
> >       * depends on spapr->ov5_cas. This is especially needed to test device
> > @@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
> >  
> >      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
> >  
> > -    /* Init numa_assoc_array */
> > -    spapr_numa_associativity_init(spapr, machine);
> > -
> >      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> >          ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
> >                                spapr->max_compat_pvr)) {
> > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > index 779f18b994..9ee4b479fe 100644
> > --- a/hw/ppc/spapr_numa.c
> > +++ b/hw/ppc/spapr_numa.c
> > @@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
> >  
> >  }
> >  
> > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > -                                   MachineState *machine)
> > +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
> >  {
> >      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > +    MachineState *machine = MACHINE(spapr);
> >      int nb_numa_nodes = machine->numa_state->num_nodes;
> >      int i, j, max_nodes_with_gpus;
> >      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> > diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> > index 6f9f02d3de..0e457bba57 100644
> > --- a/include/hw/ppc/spapr_numa.h
> > +++ b/include/hw/ppc/spapr_numa.h
> > @@ -16,14 +16,7 @@
> >  #include "hw/boards.h"
> >  #include "hw/ppc/spapr.h"
> >  
> > -/*
> > - * Having both SpaprMachineState and MachineState as arguments
> > - * feels odd, but it will spare a MACHINE() call inside the
> > - * function. spapr_machine_init() is the only caller for it, and
> > - * it has both pointers resolved already.
> > - */
> > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > -                                   MachineState *machine);
> 
> Nice additional cleanup to the signature, thanks.
> 
> > +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
> >  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
> >  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> >                                         int offset, int nodeid);
> 



* Re: [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support
  2021-09-07  0:25 ` [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
  2021-09-07  1:02   ` David Gibson
@ 2021-09-07  7:50   ` Greg Kurz
  1 sibling, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-07  7:50 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Mon,  6 Sep 2021 21:25:26 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The main feature of FORM2 affinity support is the separation of NUMA
> distances from ibm,associativity information. This allows for a more
> flexible and straightforward NUMA distance assignment without relying on
> complex associations between several levels of NUMA via
> ibm,associativity matches. Another feature is its extensibility. This base
> support contains the facilities for NUMA distance assignment, but in the
> future more facilities will be added for latency, performance, bandwidth
> and so on.
> 
> This patch implements the base FORM2 affinity support as follows:
> 
> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> or FORM2 affinity. Setting both forms will default to FORM2. We're not
> advertising FORM2 for pseries-6.1 and older machine versions to prevent
> guest visible changes in those;
> 
> - call spapr_numa_associativity_reset() in do_client_architecture_support()
> if FORM2 is chosen. This will avoid re-initializing FORM1 artifacts that
> were already initialized in spapr_machine_reset();
> 
> - ibm,associativity-reference-points has a new semantic. Instead of
> being used to calculate distances via NUMA levels, it's now used to
> indicate the primary domain index in the ibm,associativity domain of
> each resource. In our case it's set to {0x4}, matching the position
> where we already place logical_domain_id;
> 
> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
> and ibm,numa-distance-table. The index table is used to list all the
> NUMA logical domains of the platform, in ascending order, and allows for
> spartial NUMA configurations (although QEMU ATM doesn't support that).
> ibm,numa-distance-table is an array that contains all the distances from
> the first NUMA node to all other nodes, then the second NUMA node
> distances to all other nodes and so on;
> 
> - spapr_post_load changes: since we're adding a new NUMA affinity that
> isn't compatible with the existing one, migration must be handled
> accordingly because we can't be certain of whether the guest went
> through CAS in the source. The solution chosen is to initiate the NUMA
> associativity data in spapr_post_load() unconditionally. The worst case
> would be to write the DT twice if the guest is in pre-CAS stage.
> Otherwise, we're making sure that a FORM1 guest will have the
> spapr->numa_assoc_array initialized with the proper information based on
> user distance, something that we're not doing with FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              |  24 +++++++
>  hw/ppc/spapr_hcall.c        |  10 +++
>  hw/ppc/spapr_numa.c         | 135 +++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h      |   1 +
>  include/hw/ppc/spapr_ovec.h |   1 +
>  5 files changed, 170 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 8e1ff6cd10..8d98e3b08a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1789,6 +1789,22 @@ static int spapr_post_load(void *opaque, int version_id)
>          return err;
>      }
>  
> +    /*
> +     * NUMA data init is made in CAS time. There is no reliable
> +     * way of telling whether the guest already went through CAS
> +     * in the source due to how spapr_ov5_cas_needed works: a
> +     * FORM1 guest can be migrated with ov5_cas empty regardless
> +     * of going through CAS first.
> +     *
> +     * One solution is to always call numa_associativity_reset. The
> +     * downside is that a guest migrated before CAS will run
> +     * numa_associativity_reset again when going through it, but
> +     * at least we're making sure spapr->numa_assoc_array will be
> +     * initialized and hotplug operations won't fail in both before
> +     * and after CAS migration cases.
> +     */
> +     spapr_numa_associativity_reset(spapr);
> +
>      return err;
>  }
>  
> @@ -2755,6 +2771,11 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>  
> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
> +    if (!smc->pre_6_2_numa_affinity) {
> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
> +    }
> +
>      /* advertise support for dedicated HP event source to guests */
>      if (spapr->use_hotplug_event_source) {
>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
> @@ -4700,8 +4721,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>   */
>  static void spapr_machine_6_1_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_6_2_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
> +    smc->pre_6_2_numa_affinity = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 0e9a5b2e40..7efbe93f4b 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -11,6 +11,7 @@
>  #include "helper_regs.h"
>  #include "hw/ppc/spapr.h"
>  #include "hw/ppc/spapr_cpu_core.h"
> +#include "hw/ppc/spapr_numa.h"
>  #include "mmu-hash64.h"
>  #include "cpu-models.h"
>  #include "trace.h"
> @@ -1197,6 +1198,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>      spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>      spapr_ovec_cleanup(ov1_guest);
>  
> +    /*
> +     * If the guest chooses FORM2 we need to reset the associativity
> +     * information - it is being defaulted to FORM1 during
> +     * spapr_machine_reset().
> +     */
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_associativity_reset(spapr);
> +    }
> +
>      /*
>       * Ensure the guest asks for an interrupt mode we support;
>       * otherwise terminate the boot.
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 84636cb86a..ca276e16cb 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -202,6 +202,24 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>          spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>  
> +        /*
> +         * For FORM2 affinity, zero all the non-used associativity
> +         * domains (all domains but the last). This step is necessary
> +         * because FORM2 will always be populated on top of FORM1
> +         * (via spapr_machine_reset()), which populates them.
> +         *
> +         * The kernel doesn't mind these non-used domains but zeroing
> +         * them out makes it clearer to userspace that they are not
> +         * being used.
> +         */
> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> +                spapr->numa_assoc_array[i][j] = 0;
> +            }
> +
> +            continue;
> +        }
> +

Having to do that clearly indicates that spapr->numa_assoc_array as it
is today isn't really adapted to FORM2. It is unfortunate that FORM2 has
to carry an extra quirk because of FORM1.

So I think I'd prefer to have more distinct data structures and
paths between FORM1 and FORM2, so that all the legacy complexity
stays in FORM1 and FORM2 is kept as simple as possible, because
that's what new guests are going to use in practice. That is,
I tend to agree with David's remarks.

>          /*
>           * Fill all associativity domains of non-zero NUMA nodes with
>           * node_id. This is required because the default value (0) is
> @@ -236,7 +254,16 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
> -    spapr_numa_FORM1_affinity_init(spapr, machine);
> +    /*
> +     * We test for !FORM2 instead of testing for FORM1 because,
> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
> +     * dealing with either FORM1 or FORM2, test for the option
> +     * that is guaranteed to be set after a migration.
> +     */
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM1_affinity_init(spapr, machine);
> +    }
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -313,6 +340,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> +                                               void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
> +    g_autofree uint32_t *lookup_index_table = NULL;
> +    g_autofree uint32_t *distance_table = NULL;
> +    int src, dst, i, distance_table_size;
> +    uint8_t *node_distances;
> +
> +    /*
> +     * ibm,numa-lookup-index-table: array with length and a
> +     * list of NUMA ids present in the guest.
> +     */
> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        lookup_index_table[i + 1] = cpu_to_be32(i);
> +    }
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
> +                     lookup_index_table,
> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
> +
> +    /*
> +     * ibm,numa-distance-table: contains all node distances. First
> +     * element is the size of the table as uint32, followed up
> +     * by all the uint8 distances from the first NUMA node, then all
> +     * distances from the second NUMA node and so on.
> +     *
> +     * ibm,numa-lookup-index-table is used by guest to navigate this
> +     * array because NUMA ids can be sparse (node 0 is the first,
> +     * node 8 is the second ...).
> +     */
> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
> +    distance_table[0] = cpu_to_be32(distance_table_entries);
> +
> +    node_distances = (uint8_t *)&distance_table[1];
> +    i = 0;
> +
> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            node_distances[i++] = numa_info[src].distance[dst];
> +        }
> +    }
> +
> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
> +                          sizeof(uint32_t);
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
> +                     distance_table, distance_table_size));
> +}
> +
> +/*
> + * This helper could be compressed in a single function with
> + * FORM1 logic since we're setting the same DT values, with the
> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
> + * in the end. The separation was made to avoid clogging FORM1 code
> + * which already has to deal with compat modes from previous
> + * QEMU machine types.
> + */
> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
> +
> +    /*
> +     * In FORM2, ibm,associativity-reference-points will point to
> +     * the element in the ibm,associativity array that contains the
> +     * primary domain index. This value (in our case, the numa-id) is
> +     * then used as an index to retrieve all other attributes of the
> +     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
> +     * and other ibm,numa-*-table properties.
> +     */
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +    };
> +
> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
> +    uint32_t maxdomains[] = {
> +        cpu_to_be32(4),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain),
> +        cpu_to_be32(maxdomain)
> +    };
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> +                     maxdomains, sizeof(maxdomains)));
> +
> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
> +}
> +
>  static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>                                             void *fdt, int rtas)
>  {
> @@ -379,6 +507,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>   */
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
> +        return;
> +    }
> +
>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>  }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 637652ad16..21b1cf9ebf 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    bool pre_6_2_numa_affinity;
>  
>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio,
> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
> index 48b716a060..c3e8b98e7e 100644
> --- a/include/hw/ppc/spapr_ovec.h
> +++ b/include/hw/ppc/spapr_ovec.h
> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>  /* option vector 5 */
>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>  #define OV5_HP_EVT              OV_BIT(6, 5)
>  #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>  #define OV5_DRMEM_V2            OV_BIT(22, 0)




* Re: [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-07  7:10     ` Greg Kurz
@ 2021-09-07  9:23       ` David Gibson
  2021-09-10 19:57         ` Daniel Henrique Barboza
  0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2021-09-07  9:23 UTC (permalink / raw)
  To: Greg Kurz; +Cc: Daniel Henrique Barboza, qemu-ppc, qemu-devel


On Tue, Sep 07, 2021 at 09:10:13AM +0200, Greg Kurz wrote:
> On Tue, 7 Sep 2021 10:37:27 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Sep 06, 2021 at 09:25:24PM -0300, Daniel Henrique Barboza wrote:
> > > At this moment we only support one form of NUMA affinity, FORM1. This
> > > allows us to init the internal structures during machine_init(), and
> > > given that NUMA distances won't change during the guest lifetime we
> > > don't need to bother with that again.
> > > 
> > > We're about to introduce FORM2, a new NUMA affinity mode for pSeries
> > > guests. This means that we'll only be certain about the affinity mode
> > > being used after client architecture support. This also means that the
> > > guest can switch affinity modes in machine reset.
> > > 
> > > Let's prepare the ground for the FORM2 support by moving the NUMA
> > > internal data init from machine_init() to machine_reset(). Change the
> > > name to spapr_numa_associativity_reset() to make it clearer that this is
> > > a function that can be called multiple times during the guest lifecycle.
> > > We're also simplifying its current API since this method will be called
> > > during CAS time (do_client_architecture_support()) later on and there's no
> > > MachineState pointer already solved there.
> > > 
> > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > 
> > Applied to ppc-for-6.2, thanks.
> > 
> 
> Even if already applied :
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>

Added, thanks.

> > > ---
> > >  hw/ppc/spapr.c              | 6 +++---
> > >  hw/ppc/spapr_numa.c         | 4 ++--
> > >  include/hw/ppc/spapr_numa.h | 9 +--------
> > >  3 files changed, 6 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index d39fd4e644..8e1ff6cd10 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
> > >       */
> > >      spapr_irq_reset(spapr, &error_fatal);
> > >  
> > > +    /* Reset numa_assoc_array */
> > > +    spapr_numa_associativity_reset(spapr);
> > > +
> > >      /*
> > >       * There is no CAS under qtest. Simulate one to please the code that
> > >       * depends on spapr->ov5_cas. This is especially needed to test device
> > > @@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
> > >  
> > >      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
> > >  
> > > -    /* Init numa_assoc_array */
> > > -    spapr_numa_associativity_init(spapr, machine);
> > > -
> > >      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> > >          ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
> > >                                spapr->max_compat_pvr)) {
> > > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > > index 779f18b994..9ee4b479fe 100644
> > > --- a/hw/ppc/spapr_numa.c
> > > +++ b/hw/ppc/spapr_numa.c
> > > @@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
> > >  
> > >  }
> > >  
> > > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > > -                                   MachineState *machine)
> > > +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
> > >  {
> > >      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > > +    MachineState *machine = MACHINE(spapr);
> > >      int nb_numa_nodes = machine->numa_state->num_nodes;
> > >      int i, j, max_nodes_with_gpus;
> > >      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> > > diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> > > index 6f9f02d3de..0e457bba57 100644
> > > --- a/include/hw/ppc/spapr_numa.h
> > > +++ b/include/hw/ppc/spapr_numa.h
> > > @@ -16,14 +16,7 @@
> > >  #include "hw/boards.h"
> > >  #include "hw/ppc/spapr.h"
> > >  
> > > -/*
> > > - * Having both SpaprMachineState and MachineState as arguments
> > > - * feels odd, but it will spare a MACHINE() call inside the
> > > - * function. spapr_machine_init() is the only caller for it, and
> > > - * it has both pointers resolved already.
> > > - */
> > > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > > -                                   MachineState *machine);
> > 
> > Nice additional cleanup to the signature, thanks.
> > 
> > > +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
> > >  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
> > >  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> > >                                         int offset, int nodeid);
> > 
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support
  2021-09-07  1:02   ` David Gibson
@ 2021-09-07 10:07     ` Daniel Henrique Barboza
  2021-09-08  1:54       ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-07 10:07 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, groug



On 9/6/21 10:02 PM, David Gibson wrote:
> On Mon, Sep 06, 2021 at 09:25:26PM -0300, Daniel Henrique Barboza wrote:
>> The main feature of FORM2 affinity support is the separation of NUMA
>> distances from ibm,associativity information. This allows for a more
>> flexible and straightforward NUMA distance assignment without relying on
>> complex associations between several levels of NUMA via
>> ibm,associativity matches. Another feature is its extensibility. This base
>> support contains the facilities for NUMA distance assignment, but in the
>> future more facilities will be added for latency, performance, bandwidth
>> and so on.
>>
>> This patch implements the base FORM2 affinity support as follows:
>>
>> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
>> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
>> or FORM2 affinity. Setting both forms will default to FORM2. We're not
>> advertising FORM2 for pseries-6.1 and older machine versions to prevent
>> guest visible changes in those;
>>
>> - call spapr_numa_associativity_reset() in do_client_architecture_support()
>> if FORM2 is chosen. This will avoid re-initializing FORM1 artifacts that
>> were already initialized in spapr_machine_reset();
>>
>> - ibm,associativity-reference-points has a new semantic. Instead of
>> being used to calculate distances via NUMA levels, it's now used to
>> indicate the primary domain index in the ibm,associativity domain of
>> each resource. In our case it's set to {0x4}, matching the position
>> where we already place logical_domain_id;
> 
> Hmm... I'm a bit torn on this.  The whole reason the ibm,associativity
> things are arrays rather than just numbers was to enable the FORM1
> nonsense. So we have a choice here: keep the associativity arrays in
> the same form, for simplicity of the code, or reduce the associativity
> arrays to one entry for FORM2, to simplify the overall DT contents in
> the "modern" case.

I'm not against making it different for FORM2. I did it this way because
it minimizes the amount of code being changed.

In fact, if we're going to add separated data structures for both FORM1 and
FORM2, might as well start both FORM1 and FORM2 data structures during
machine_init() and then just switch to the chosen affinity after CAS.

Something like a FORM1_assoc_array[N][MAX_DISTANCE_REF_POINTS], that contains
all the initialization already done today and a FORM2_assoc_array[N][2],
where FORM2_assoc_array[node_id] = {1, node_id}, changing reference-points
accordingly of course.

spapr_numa_assoc_array would become a pointer that would point to either
FORM1_assoc_array[][] or FORM2_assoc_array[][] depending on guest choice. I
think this might be enough to make everything we already have just work, although
I need to check how much code is dependant on the MAX_DISTANCE_REF_POINTS
macro and adapt it.

If no one opposes I'll go for this approach.


Thanks,


Daniel


> 
>> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
> 
> This doesn't really have anything to do with RTAS.
> 
>> and ibm,numa-distance-table. The index table is used to list all the
>> NUMA logical domains of the platform, in ascending order, and allows for
>> spartial NUMA configurations (although QEMU ATM doesn't support
>> that).
> 
> s/spartial/partial/
> 
> Perhaps more to the point, it allows for sparsely allocated domain
> IDs.
> 
>> ibm,numa-distance-table is an array that contains all the distances from
>> the first NUMA node to all other nodes, then the second NUMA node
>> distances to all other nodes and so on;
>>
>> - spapr_post_load changes: since we're adding a new NUMA affinity that
>> isn't compatible with the existing one, migration must be handled
>> accordingly because we can't be certain of whether the guest went
>> through CAS in the source. The solution chosen is to initiate the NUMA
>> associativity data in spapr_post_load() unconditionally. The worst case
>> would be to write the DT twice if the guest is in pre-CAS stage.
>> Otherwise, we're making sure that a FORM1 guest will have the
>> spapr->numa_assoc_array initialized with the proper information based on
>> user distance, something that we're not doing with FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c              |  24 +++++++
>>   hw/ppc/spapr_hcall.c        |  10 +++
>>   hw/ppc/spapr_numa.c         | 135 +++++++++++++++++++++++++++++++++++-
>>   include/hw/ppc/spapr.h      |   1 +
>>   include/hw/ppc/spapr_ovec.h |   1 +
>>   5 files changed, 170 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 8e1ff6cd10..8d98e3b08a 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1789,6 +1789,22 @@ static int spapr_post_load(void *opaque, int version_id)
>>           return err;
>>       }
>>   
>> +    /*
>> +     * NUMA data init is made in CAS time. There is no reliable
>> +     * way of telling whether the guest already went through CAS
>> +     * in the source due to how spapr_ov5_cas_needed works: a
>> +     * FORM1 guest can be migrated with ov5_cas empty regardless
>> +     * of going through CAS first.
>> +     *
>> +     * One solution is to always call numa_associativity_reset. The
>> +     * downside is that a guest migrated before CAS will run
>> +     * numa_associativity_reset again when going through it, but
>> +     * at least we're making sure spapr->numa_assoc_array will be
>> +     * initialized and hotplug operations won't fail in both before
>> +     * and after CAS migration cases.
>> +     */
>> +     spapr_numa_associativity_reset(spapr);
>> +
>>       return err;
>>   }
>>   
>> @@ -2755,6 +2771,11 @@ static void spapr_machine_init(MachineState *machine)
>>   
>>       spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>>   
>> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
>> +    if (!smc->pre_6_2_numa_affinity) {
>> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
>> +    }
>> +
>>       /* advertise support for dedicated HP event source to guests */
>>       if (spapr->use_hotplug_event_source) {
>>           spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
>> @@ -4700,8 +4721,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>>    */
>>   static void spapr_machine_6_1_class_options(MachineClass *mc)
>>   {
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>> +
>>       spapr_machine_6_2_class_options(mc);
>>       compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
>> +    smc->pre_6_2_numa_affinity = true;
>>   }
>>   
>>   DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 0e9a5b2e40..7efbe93f4b 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -11,6 +11,7 @@
>>   #include "helper_regs.h"
>>   #include "hw/ppc/spapr.h"
>>   #include "hw/ppc/spapr_cpu_core.h"
>> +#include "hw/ppc/spapr_numa.h"
>>   #include "mmu-hash64.h"
>>   #include "cpu-models.h"
>>   #include "trace.h"
>> @@ -1197,6 +1198,15 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>       spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>>       spapr_ovec_cleanup(ov1_guest);
>>   
>> +    /*
>> +     * If the guest chooses FORM2 we need to reset the associativity
>> +     * information - it is being defaulted to FORM1 during
>> +     * spapr_machine_reset().
>> +     */
>> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +        spapr_numa_associativity_reset(spapr);
>> +    }
>> +
>>       /*
>>        * Ensure the guest asks for an interrupt mode we support;
>>        * otherwise terminate the boot.
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index 84636cb86a..ca276e16cb 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -202,6 +202,24 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>>           spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>>           spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>   
>> +        /*
>> +         * For FORM2 affinity, zero all the non-used associativity
>> +         * domains (all domains but the last). This step is necessary
>> +         * because FORM2 will always be populated on top of FORM1
>> +         * (via spapr_machine_reset()), which populates them.
>> +         *
>> +         * The kernel doesn't mind these non-used domains but zeroing
>> +         * them out makes it clearer to userspace that they are not
>> +         * being used.
> 
> Huh... so you are clearing the entries in any case.  Wouldn't it make
> more sense to just make all the arrays 1 entry long for FORM2?
> 
>> +         */
>> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
>> +                spapr->numa_assoc_array[i][j] = 0;
>> +            }
>> +
>> +            continue;
>> +        }
>> +
>>           /*
>>            * Fill all associativity domains of non-zero NUMA nodes with
>>            * node_id. This is required because the default value (0) is
>> @@ -236,7 +254,16 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>>           spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>       }
>>   
>> -    spapr_numa_FORM1_affinity_init(spapr, machine);
>> +    /*
>> +     * We test for !FORM2 instead of testing for FORM1 because,
>> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
>> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
>> +     * dealing with either FORM1 or FORM2, test for the option
>> +     * that is guaranteed to be set after a migration.
>> +     */
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +        spapr_numa_FORM1_affinity_init(spapr, machine);
>> +    }
>>   }
>>   
>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>> @@ -313,6 +340,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>>       return ret;
>>   }
>>   
>> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
>> +                                               void *fdt, int rtas)
> 
> Again, not really related to RTAS at all... oh except for appearing in
> the /rtas node, ok I see your point.  I'd still probably dump the
> "rtas" from the function name.
> 
>> +{
>> +    MachineState *ms = MACHINE(spapr);
>> +    NodeInfo *numa_info = ms->numa_state->nodes;
>> +    int nb_numa_nodes = ms->numa_state->num_nodes;
>> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
>> +    g_autofree uint32_t *lookup_index_table = NULL;
>> +    g_autofree uint32_t *distance_table = NULL;
>> +    int src, dst, i, distance_table_size;
>> +    uint8_t *node_distances;
>> +
>> +    /*
>> +     * ibm,numa-lookup-index-table: array with length and a
>> +     * list of NUMA ids present in the guest.
>> +     */
>> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
>> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
>> +
>> +    for (i = 0; i < nb_numa_nodes; i++) {
>> +        lookup_index_table[i + 1] = cpu_to_be32(i);
>> +    }
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
>> +                     lookup_index_table,
>> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
>> +
>> +    /*
>> +     * ibm,numa-distance-table: contains all node distances. First
>> +     * element is the size of the table as uint32, followed up
>> +     * by all the uint8 distances from the first NUMA node, then all
>> +     * distances from the second NUMA node and so on.
>> +     *
>> +     * ibm,numa-lookup-index-table is used by guest to navigate this
>> +     * array because NUMA ids can be sparse (node 0 is the first,
>> +     * node 8 is the second ...).
> 
> This comment seems a little misleading.  The ids can be sparse in
> PAPR, but they're not as we've constructed them above.
> 
>> +     */
>> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
>> +    distance_table[0] = cpu_to_be32(distance_table_entries);
>> +
>> +    node_distances = (uint8_t *)&distance_table[1];
>> +    i = 0;
>> +
>> +    for (src = 0; src < nb_numa_nodes; src++) {
>> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
>> +            node_distances[i++] = numa_info[src].distance[dst];
>> +        }
>> +    }
>> +
>> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
>> +                          sizeof(uint32_t);
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
>> +                     distance_table, distance_table_size));
>> +}
>> +
>> +/*
>> + * This helper could be compressed in a single function with
>> + * FORM1 logic since we're setting the same DT values, with the
>> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
>> + * in the end. The separation was made to avoid clogging FORM1 code
>> + * which already has to deal with compat modes from previous
>> + * QEMU machine types.
>> + */
>> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
>> +                                           void *fdt, int rtas)
>> +{
>> +    MachineState *ms = MACHINE(spapr);
>> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
>> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
>> +
>> +    /*
>> +     * In FORM2, ibm,associativity-reference-points will point to
>> +     * the element in the ibm,associativity array that contains the
>> +     * primary domain index. This value (in our case, the numa-id) is
>> +     * then used as an index to retrieve all other attributes of the
>> +     * node (distance, bandwidth, latency) via ibm,numa-lookup-index-table
>> +     * and other ibm,numa-*-table properties.
>> +     */
>> +    uint32_t refpoints[] = {
>> +        cpu_to_be32(0x4),
>> +    };
>> +
>> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
>> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
>> +    uint32_t maxdomains[] = {
>> +        cpu_to_be32(4),
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain),
>> +        cpu_to_be32(maxdomain)
> 
> This doesn't seem right, since you only ever use 0 for the top 3
> domain ID levels.  Again, wouldn't this be simplifed by just making
> the first entry 1, along with every associativity array?
> 
>> +    };
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
>> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
>> +
>> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
>> +                     maxdomains, sizeof(maxdomains)));
>> +
>> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
>> +}
>> +
>>   static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>>                                              void *fdt, int rtas)
>>   {
>> @@ -379,6 +507,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>>    */
>>   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>>   {
>> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
>> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
>> +        return;
>> +    }
>> +
>>       spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>>   }
>>   
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 637652ad16..21b1cf9ebf 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
>>       hwaddr rma_limit;          /* clamp the RMA to this size */
>>       bool pre_5_1_assoc_refpoints;
>>       bool pre_5_2_numa_associativity;
>> +    bool pre_6_2_numa_affinity;
>>   
>>       bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>>                             uint64_t *buid, hwaddr *pio,
>> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
>> index 48b716a060..c3e8b98e7e 100644
>> --- a/include/hw/ppc/spapr_ovec.h
>> +++ b/include/hw/ppc/spapr_ovec.h
>> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>>   /* option vector 5 */
>>   #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>>   #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
>> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>>   #define OV5_HP_EVT              OV_BIT(6, 5)
>>   #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>>   #define OV5_DRMEM_V2            OV_BIT(22, 0)
> 
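As a side note, the byte layout of the two properties built above can be sketched in a few lines of Python (illustrative only, not part of the patch — a hypothetical two-node topology with local distance 10 and remote distance 20 is assumed):

```python
import struct

def build_form2_tables(distances):
    """Build ibm,numa-lookup-index-table and ibm,numa-distance-table
    blobs with the layout used above (big-endian, like the DT)."""
    nb_nodes = len(distances)

    # lookup-index-table: uint32 node count, then node IDs in ascending order
    lookup = struct.pack(">I", nb_nodes)
    for node_id in range(nb_nodes):
        lookup += struct.pack(">I", node_id)

    # distance-table: uint32 entry count, then uint8 distances row by row
    entries = nb_nodes * nb_nodes
    dist = struct.pack(">I", entries)
    for src in range(nb_nodes):
        for dst in range(nb_nodes):
            dist += struct.pack(">B", distances[src][dst])

    return lookup, dist

# Two NUMA nodes: local distance 10, remote distance 20
lookup, dist = build_form2_tables([[10, 20], [20, 10]])
```

Note how the distance table mixes a uint32 length with uint8 entries, which is why the C code above casts `&distance_table[1]` to `uint8_t *`.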


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support
  2021-09-07 10:07     ` Daniel Henrique Barboza
@ 2021-09-08  1:54       ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2021-09-08  1:54 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug

[-- Attachment #1: Type: text/plain, Size: 3480 bytes --]

On Tue, Sep 07, 2021 at 07:07:41AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 9/6/21 10:02 PM, David Gibson wrote:
> > On Mon, Sep 06, 2021 at 09:25:26PM -0300, Daniel Henrique Barboza wrote:
> > > The main feature of FORM2 affinity support is the separation of NUMA
> > > distances from ibm,associativity information. This allows for a more
> > > flexible and straightforward NUMA distance assignment without relying on
> > > complex associations between several levels of NUMA via
> > > ibm,associativity matches. Another feature is its extensibility. This base
> > > support contains the facilities for NUMA distance assignment, but in the
> > > future more facilities will be added for latency, performance, bandwidth
> > > and so on.
> > > 
> > > This patch implements the base FORM2 affinity support as follows:
> > > 
> > > - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> > > of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> > > or FORM2 affinity. Setting both forms will default to FORM2. We're not
> > > advertising FORM2 for pseries-6.1 and older machine versions to prevent
> > > guest visible changes in those;
> > > 
> > > - call spapr_numa_associativity_reset() in do_client_architecture_support()
> > > if FORM2 is chosen. This will avoid re-initializing FORM1 artifacts that
> > > were already initialized in spapr_machine_reset();
> > > 
> > > - ibm,associativity-reference-points has a new semantic. Instead of
> > > being used to calculate distances via NUMA levels, it's now used to
> > > indicate the primary domain index in the ibm,associativity domain of
> > > each resource. In our case it's set to {0x4}, matching the position
> > > where we already place logical_domain_id;
> > 
> > Hmm... I'm a bit torn on this.  The whole reason the ibm,associativity
> > things are arrays rather than just numbers was to enable the FORM1
> > nonsense. So we have a choice here: keep the associativity arrays in
> > the same form, for simplicity of the code, or reduce the associativity
> > arrays to one entry for FORM2, to simplify the overall DT contents in
> > the "modern" case.
> 
> I'm not against making it different for FORM2. I did it this way because
> it minimizes the amount of code being changed.
> 
> In fact, if we're going to add separate data structures for both FORM1 and
> FORM2, might as well start both FORM1 and FORM2 data structures during
> machine_init() and then just switch to the chosen affinity after CAS.
> 
> Something like a FORM1_assoc_array[N][MAX_DISTANCE_REF_POINTS], that contains
> all the initialization already done today and a FORM2_assoc_array[N][2],
> where FORM2_assoc_array[node_id] = {1, node_id}, changing reference-points
> accordingly of course.
> 
> spapr_numa_assoc_array would become a pointer that would point to either
> FORM1_assoc_array[][] or FORM2_assoc_array[][] depending on guest choice. I
> think this might be enough to make everything we already have just work, although
> I need to check how much code is dependent on the MAX_DISTANCE_REF_POINTS
> macro and adapt it.
> 
> If no one opposes I'll go for this approach.

I think that's the way to go.  Thanks for working on this.
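For clarity, the pointer-switch idea could be sketched roughly like this — hypothetical names and sizes, not the actual series:

```c
#include <assert.h>

#define MAX_NODES        4
#define FORM1_ASSOC_SIZE 5   /* e.g. MAX_DISTANCE_REF_POINTS + 1, for illustration */
#define FORM2_ASSOC_SIZE 2   /* just {1, node_id} */

/* Both layouts pre-built at machine_init() time */
static unsigned form1_assoc_array[MAX_NODES][FORM1_ASSOC_SIZE];
static unsigned form2_assoc_array[MAX_NODES][FORM2_ASSOC_SIZE];

/* What the rest of the code dereferences; switched after CAS */
static unsigned *numa_assoc_array[MAX_NODES];

static void init_assoc_arrays(void)
{
    for (int i = 0; i < MAX_NODES; i++) {
        /* FORM1: length word followed by the associativity domains */
        form1_assoc_array[i][0] = FORM1_ASSOC_SIZE - 1;
        for (int j = 1; j < FORM1_ASSOC_SIZE; j++) {
            form1_assoc_array[i][j] = i;   /* placeholder domain IDs */
        }
        /* FORM2: a single-entry associativity array, {1, node_id} */
        form2_assoc_array[i][0] = 1;
        form2_assoc_array[i][1] = i;
    }
}

/* Called once the guest's CAS choice is known */
static void select_affinity_form(int form2_chosen)
{
    for (int i = 0; i < MAX_NODES; i++) {
        numa_assoc_array[i] = form2_chosen ? form2_assoc_array[i]
                                           : form1_assoc_array[i];
    }
}
```

The appeal is that code consuming `numa_assoc_array` stays form-agnostic; only the initialization and the post-CAS switch know about the two layouts.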

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-07  9:23       ` David Gibson
@ 2021-09-10 19:57         ` Daniel Henrique Barboza
  2021-09-11  3:53           ` David Gibson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:57 UTC (permalink / raw)
  To: David Gibson, Greg Kurz; +Cc: qemu-ppc, qemu-devel



On 9/7/21 6:23 AM, David Gibson wrote:
> On Tue, Sep 07, 2021 at 09:10:13AM +0200, Greg Kurz wrote:
>> On Tue, 7 Sep 2021 10:37:27 +1000
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Mon, Sep 06, 2021 at 09:25:24PM -0300, Daniel Henrique Barboza wrote:
>>>> At this moment we only support one form of NUMA affinity, FORM1. This
>>>> allows us to init the internal structures during machine_init(), and
>>>> given that NUMA distances won't change during the guest lifetime we
>>>> don't need to bother with that again.
>>>>
>>>> We're about to introduce FORM2, a new NUMA affinity mode for pSeries
>>>> guests. This means that we'll only be certain about the affinity mode
>>>> being used after client architecture support. This also means that the
>>>> guest can switch affinity modes in machine reset.
>>>>
>>>> Let's prepare the ground for the FORM2 support by moving the NUMA
>>>> internal data init from machine_init() to machine_reset(). Change the
>>>> name to spapr_numa_associativity_reset() to make it clearer that this is
>>>> a function that can be called multiple times during the guest lifecycle.
>>>> We're also simplifying its current API since this method will be called
>>>> during CAS time (do_client_architecture_support()) later on, where there's
>>>> no already-resolved MachineState pointer.
>>>>
>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>>
>>> Applied to ppc-for-6.2, thanks.
>>>
>>
>> Even if already applied:
>>
>> Reviewed-by: Greg Kurz <groug@kaod.org>
> 
> Added, thanks.


I'm afraid this patch was deprecated by the new patch series I just posted.


Thanks,


Daniel

> 
>>>> ---
>>>>   hw/ppc/spapr.c              | 6 +++---
>>>>   hw/ppc/spapr_numa.c         | 4 ++--
>>>>   include/hw/ppc/spapr_numa.h | 9 +--------
>>>>   3 files changed, 6 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index d39fd4e644..8e1ff6cd10 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
>>>>        */
>>>>       spapr_irq_reset(spapr, &error_fatal);
>>>>   
>>>> +    /* Reset numa_assoc_array */
>>>> +    spapr_numa_associativity_reset(spapr);
>>>> +
>>>>       /*
>>>>        * There is no CAS under qtest. Simulate one to please the code that
>>>>        * depends on spapr->ov5_cas. This is especially needed to test device
>>>> @@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
>>>>   
>>>>       spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>>>>   
>>>> -    /* Init numa_assoc_array */
>>>> -    spapr_numa_associativity_init(spapr, machine);
>>>> -
>>>>       if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>>>>           ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>>>>                                 spapr->max_compat_pvr)) {
>>>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>>>> index 779f18b994..9ee4b479fe 100644
>>>> --- a/hw/ppc/spapr_numa.c
>>>> +++ b/hw/ppc/spapr_numa.c
>>>> @@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>>>>   
>>>>   }
>>>>   
>>>> -void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>>> -                                   MachineState *machine)
>>>> +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>>>>   {
>>>>       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>>> +    MachineState *machine = MACHINE(spapr);
>>>>       int nb_numa_nodes = machine->numa_state->num_nodes;
>>>>       int i, j, max_nodes_with_gpus;
>>>>       bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
>>>> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
>>>> index 6f9f02d3de..0e457bba57 100644
>>>> --- a/include/hw/ppc/spapr_numa.h
>>>> +++ b/include/hw/ppc/spapr_numa.h
>>>> @@ -16,14 +16,7 @@
>>>>   #include "hw/boards.h"
>>>>   #include "hw/ppc/spapr.h"
>>>>   
>>>> -/*
>>>> - * Having both SpaprMachineState and MachineState as arguments
>>>> - * feels odd, but it will spare a MACHINE() call inside the
>>>> - * function. spapr_machine_init() is the only caller for it, and
>>>> - * it has both pointers resolved already.
>>>> - */
>>>> -void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>>> -                                   MachineState *machine);
>>>
>>> Nice additional cleanup to the signature, thanks.
>>>
>>>> +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
>>>>   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
>>>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>>>>                                          int offset, int nodeid);
>>>
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset
  2021-09-10 19:57         ` Daniel Henrique Barboza
@ 2021-09-11  3:53           ` David Gibson
  0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2021-09-11  3:53 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, Greg Kurz, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5714 bytes --]

On Fri, Sep 10, 2021 at 04:57:14PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 9/7/21 6:23 AM, David Gibson wrote:
> > On Tue, Sep 07, 2021 at 09:10:13AM +0200, Greg Kurz wrote:
> > > On Tue, 7 Sep 2021 10:37:27 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > On Mon, Sep 06, 2021 at 09:25:24PM -0300, Daniel Henrique Barboza wrote:
> > > > > At this moment we only support one form of NUMA affinity, FORM1. This
> > > > > allows us to init the internal structures during machine_init(), and
> > > > > given that NUMA distances won't change during the guest lifetime we
> > > > > don't need to bother with that again.
> > > > > 
> > > > > We're about to introduce FORM2, a new NUMA affinity mode for pSeries
> > > > > guests. This means that we'll only be certain about the affinity mode
> > > > > being used after client architecture support. This also means that the
> > > > > guest can switch affinity modes in machine reset.
> > > > > 
> > > > > Let's prepare the ground for the FORM2 support by moving the NUMA
> > > > > internal data init from machine_init() to machine_reset(). Change the
> > > > > name to spapr_numa_associativity_reset() to make it clearer that this is
> > > > > a function that can be called multiple times during the guest lifecycle.
> > > > > We're also simplifying its current API since this method will be called
> > > > > during CAS time (do_client_architecture_support()) later on, where there's
> > > > > no already-resolved MachineState pointer.
> > > > > 
> > > > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > > > 
> > > > Applied to ppc-for-6.2, thanks.
> > > > 
> > > 
> > > Even if already applied:
> > > 
> > > Reviewed-by: Greg Kurz <groug@kaod.org>
> > 
> > Added, thanks.
> 
> 
> I'm afraid this patch was deprecated by the new patch series I just
> posted.

Ok, I've removed the old patch from ppc-for-6.2.

> 
> 
> Thanks,
> 
> 
> Daniel
> 
> > 
> > > > > ---
> > > > >   hw/ppc/spapr.c              | 6 +++---
> > > > >   hw/ppc/spapr_numa.c         | 4 ++--
> > > > >   include/hw/ppc/spapr_numa.h | 9 +--------
> > > > >   3 files changed, 6 insertions(+), 13 deletions(-)
> > > > > 
> > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > > > index d39fd4e644..8e1ff6cd10 100644
> > > > > --- a/hw/ppc/spapr.c
> > > > > +++ b/hw/ppc/spapr.c
> > > > > @@ -1621,6 +1621,9 @@ static void spapr_machine_reset(MachineState *machine)
> > > > >        */
> > > > >       spapr_irq_reset(spapr, &error_fatal);
> > > > > +    /* Reset numa_assoc_array */
> > > > > +    spapr_numa_associativity_reset(spapr);
> > > > > +
> > > > >       /*
> > > > >        * There is no CAS under qtest. Simulate one to please the code that
> > > > >        * depends on spapr->ov5_cas. This is especially needed to test device
> > > > > @@ -2808,9 +2811,6 @@ static void spapr_machine_init(MachineState *machine)
> > > > >       spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
> > > > > -    /* Init numa_assoc_array */
> > > > > -    spapr_numa_associativity_init(spapr, machine);
> > > > > -
> > > > >       if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> > > > >           ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
> > > > >                                 spapr->max_compat_pvr)) {
> > > > > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > > > > index 779f18b994..9ee4b479fe 100644
> > > > > --- a/hw/ppc/spapr_numa.c
> > > > > +++ b/hw/ppc/spapr_numa.c
> > > > > @@ -155,10 +155,10 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
> > > > >   }
> > > > > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > > > > -                                   MachineState *machine)
> > > > > +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
> > > > >   {
> > > > >       SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> > > > > +    MachineState *machine = MACHINE(spapr);
> > > > >       int nb_numa_nodes = machine->numa_state->num_nodes;
> > > > >       int i, j, max_nodes_with_gpus;
> > > > >       bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> > > > > diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> > > > > index 6f9f02d3de..0e457bba57 100644
> > > > > --- a/include/hw/ppc/spapr_numa.h
> > > > > +++ b/include/hw/ppc/spapr_numa.h
> > > > > @@ -16,14 +16,7 @@
> > > > >   #include "hw/boards.h"
> > > > >   #include "hw/ppc/spapr.h"
> > > > > -/*
> > > > > - * Having both SpaprMachineState and MachineState as arguments
> > > > > - * feels odd, but it will spare a MACHINE() call inside the
> > > > > - * function. spapr_machine_init() is the only caller for it, and
> > > > > - * it has both pointers resolved already.
> > > > > - */
> > > > > -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> > > > > -                                   MachineState *machine);
> > > > 
> > > > Nice additional cleanup to the signature, thanks.
> > > > 
> > > > > +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
> > > > >   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
> > > > >   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> > > > >                                          int offset, int nodeid);
> > > > 
> > > 
> > 
> > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-11  4:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-07  0:25 [PATCH v5 0/4] pSeries FORM2 affinity support Daniel Henrique Barboza
2021-09-07  0:25 ` [PATCH v5 1/4] spapr: move NUMA associativity init to machine reset Daniel Henrique Barboza
2021-09-07  0:37   ` David Gibson
2021-09-07  7:10     ` Greg Kurz
2021-09-07  9:23       ` David Gibson
2021-09-10 19:57         ` Daniel Henrique Barboza
2021-09-11  3:53           ` David Gibson
2021-09-07  0:25 ` [PATCH v5 2/4] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
2021-09-07  0:39   ` David Gibson
2021-09-07  0:25 ` [PATCH v5 3/4] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
2021-09-07  1:02   ` David Gibson
2021-09-07 10:07     ` Daniel Henrique Barboza
2021-09-08  1:54       ` David Gibson
2021-09-07  7:50   ` Greg Kurz
2021-09-07  0:25 ` [PATCH v5 4/4] spapr: move FORM1 verifications to do_client_architecture_support() Daniel Henrique Barboza
2021-09-07  1:04   ` David Gibson
