* [PATCH v6 0/6] pSeries FORM2 affinity support
@ 2021-09-10 19:55 Daniel Henrique Barboza
  2021-09-10 19:55 ` [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

In this version there were significant design changes after the
v5 review. Only patches 1 and 5 were present in the last version.

changes from v5:
- patch order was changed so that all the preliminary work is done without
adding FORM2 code;
- FORM1 and FORM2 data now co-exist. Both are initialized in
spapr_numa_associativity_init() in two static arrays called
'FORM1_assoc_array' and 'FORM2_assoc_array'. 'numa_assoc_array' is now a
pointer that toggles between those two (see the sketch after this list);
- spapr_numa_associativity_reset() switches the NUMA affinity data to be
used. It is not a replacement for associativity_init() as it was in v5;
- 'legacy_numa' concept was removed;
- FORM2 affinity init() is now completely separated from FORM1;
- FORM2 ibm,associativity array only contains size and numa_id for non-CPU
resources, and an extra vcpu_id for CPUs;
- FORM2 reference-points is { 1 };
- FORM2 maxdomain has size = 2;
- several other changes to accommodate the new design of having to deal with
two different data structures, while minimizing changes in the write_dt()
functions
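
A minimal stand-alone sketch of the layout described above (not QEMU code;
names follow the patches in this series, the node count and domain values are
illustrative, and values are shown host-endian for readability):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdbool.h>

    #define NB_NODES          2
    #define FORM1_REF_POINTS  4                      /* FORM1_DIST_REF_POINTS */
    #define FORM1_ASSOC_SIZE  (FORM1_REF_POINTS + 1) /* { 4, d, d, d, numa_id } */
    #define FORM2_ASSOC_SIZE  2                      /* { 1, numa_id } */

    static uint32_t *form1_assoc[NB_NODES];
    static uint32_t *form2_assoc[NB_NODES];
    static uint32_t **numa_assoc;    /* points at one of the two arrays above */

    int main(void)
    {
        bool guest_selected_form2 = true;    /* assumed outcome of CAS */

        for (int i = 0; i < NB_NODES; i++) {
            form1_assoc[i] = calloc(FORM1_ASSOC_SIZE, sizeof(uint32_t));
            form1_assoc[i][0] = FORM1_REF_POINTS;
            for (int j = 1; j < FORM1_REF_POINTS; j++) {
                form1_assoc[i][j] = i;               /* illustrative domains */
            }
            form1_assoc[i][FORM1_REF_POINTS] = i;    /* numa_id in last cell */

            form2_assoc[i] = calloc(FORM2_ASSOC_SIZE, sizeof(uint32_t));
            form2_assoc[i][0] = 1;                   /* single reference point */
            form2_assoc[i][1] = i;                   /* numa_id */
        }

        /* the 'reset' step is just a pointer switch */
        numa_assoc = guest_selected_form2 ? form2_assoc : form1_assoc;

        printf("node 1: %u cells, numa_id %u\n",
               (unsigned)numa_assoc[1][0],
               (unsigned)numa_assoc[1][numa_assoc[1][0]]);
        return 0;
    }

With FORM2 selected it prints the 2-cell { 1, numa_id } layout; flipping
guest_selected_form2 shows the 5-cell FORM1 layout instead.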


Daniel Henrique Barboza (6):
  spapr_numa.c: split FORM1 code into helpers
  spapr_numa.c: scrap 'legacy_numa' concept
  spapr: introduce spapr_numa_associativity_reset()
  spapr_numa.c: parametrize FORM1 macros
  spapr: move FORM1 verifications to post CAS
  spapr_numa.c: FORM2 NUMA affinity support

 hw/ppc/spapr.c              |  55 +++---
 hw/ppc/spapr_hcall.c        |   7 +
 hw/ppc/spapr_numa.c         | 382 ++++++++++++++++++++++++++++++------
 include/hw/ppc/spapr.h      |  25 +--
 include/hw/ppc/spapr_numa.h |   2 +
 include/hw/ppc/spapr_ovec.h |   1 +
 6 files changed, 362 insertions(+), 110 deletions(-)

-- 
2.31.1




* [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14  8:23   ` Greg Kurz
  2021-09-10 19:55 ` [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept Daniel Henrique Barboza
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
and doesn't need to be concerned with all the legacy support for older
pseries FORM1 guests.

We're also not going to calculate associativity domains based on numa
distance (via spapr_numa_define_associativity_domains) since the
distances will be written directly into new DT properties.

Let's split FORM1 code into its own functions to allow for easier
insertion of FORM2 logic later on.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..786def7c73 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -92,7 +92,7 @@ static uint8_t spapr_numa_get_numa_level(uint8_t distance)
     return 0;
 }
 
-static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
+static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
 {
     MachineState *ms = MACHINE(spapr);
     NodeInfo *numa_info = ms->numa_state->nodes;
@@ -155,8 +155,11 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine)
+/*
+ * Set NUMA machine state data based on FORM1 affinity semantics.
+ */
+static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
+                                           MachineState *machine)
 {
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
@@ -225,7 +228,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         exit(EXIT_FAILURE);
     }
 
-    spapr_numa_define_associativity_domains(spapr);
+    spapr_numa_define_FORM1_domains(spapr);
+}
+
+void spapr_numa_associativity_init(SpaprMachineState *spapr,
+                                   MachineState *machine)
+{
+    spapr_numa_FORM1_affinity_init(spapr, machine);
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -302,12 +311,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
-/*
- * Helper that writes ibm,associativity-reference-points and
- * max-associativity-domains in the RTAS pointed by @rtas
- * in the DT @fdt.
- */
-void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
 {
     MachineState *ms = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
@@ -365,6 +370,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
                      maxdomains, sizeof(maxdomains)));
 }
 
+/*
+ * Helper that writes ibm,associativity-reference-points and
+ * max-associativity-domains in the RTAS pointed by @rtas
+ * in the DT @fdt.
+ */
+void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+{
+    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
-- 
2.31.1




* [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-09-10 19:55 ` [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14  8:34   ` Greg Kurz
  2021-09-10 19:55 ` [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset() Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

When first introduced, 'legacy_numa' was a way to refer to guests that
wouldn't be affected by associativity domain calculations, namely
the ones with only 1 NUMA node, and pre-5.2 guests that shouldn't be
affected by it because that would be a userspace-visible change.
'legacy_numa' was a convenient label for these cases.

We're about to introduce a new NUMA affinity, FORM2, and this concept
of 'legacy_numa' is now a bit misleading because, although it is called
'legacy', it is in fact a FORM1-exclusive constraint.

This patch removes spapr_machine_using_legacy_numa() and open codes the
conditions in each caller. While we're at it, move the chunk inside
spapr_numa_FORM1_affinity_init() that sets all numa_assoc_array domains
to 'node_id' into spapr_numa_define_FORM1_domains(). This chunk was
being executed if !pre_5_2_numa_associativity and num_nodes > 1, the
same conditions under which spapr_numa_define_FORM1_domains() is called
shortly after.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 46 +++++++++++++++++++--------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 786def7c73..fb6059550f 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -19,15 +19,6 @@
 /* Moved from hw/ppc/spapr_pci_nvlink2.c */
 #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
 
-static bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
-{
-    MachineState *machine = MACHINE(spapr);
-    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
-
-    return smc->pre_5_2_numa_associativity ||
-           machine->numa_state->num_nodes <= 1;
-}
-
 static bool spapr_numa_is_symmetrical(MachineState *ms)
 {
     int src, dst;
@@ -97,7 +88,18 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
     MachineState *ms = MACHINE(spapr);
     NodeInfo *numa_info = ms->numa_state->nodes;
     int nb_numa_nodes = ms->numa_state->num_nodes;
-    int src, dst, i;
+    int src, dst, i, j;
+
+    /*
+     * Fill all associativity domains of non-zero NUMA nodes with
+     * node_id. This is required because the default value (0) is
+     * considered a match with associativity domains of node 0.
+     */
+    for (i = 1; i < nb_numa_nodes; i++) {
+        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
+            spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
+        }
+    }
 
     for (src = 0; src < nb_numa_nodes; src++) {
         for (dst = src; dst < nb_numa_nodes; dst++) {
@@ -164,7 +166,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int i, j, max_nodes_with_gpus;
-    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
 
     /*
      * For all associativity arrays: first position is the size,
@@ -178,17 +179,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     for (i = 0; i < nb_numa_nodes; i++) {
         spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
-
-        /*
-         * Fill all associativity domains of non-zero NUMA nodes with
-         * node_id. This is required because the default value (0) is
-         * considered a match with associativity domains of node 0.
-         */
-        if (!using_legacy_numa && i != 0) {
-            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
-                spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
-            }
-        }
     }
 
     /*
@@ -214,11 +204,13 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     }
 
     /*
-     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-     * 1 NUMA node) will not benefit from anything we're going to do
-     * after this point.
+     * Guests pseries-5.1 and older uses zeroed associativity domains,
+     * i.e. no domain definition based on NUMA distance input.
+     *
+     * Same thing with guests that have only one NUMA node.
      */
-    if (using_legacy_numa) {
+    if (smc->pre_5_2_numa_associativity ||
+        machine->numa_state->num_nodes <= 1) {
         return;
     }
 
@@ -334,7 +326,7 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
         cpu_to_be32(maxdomain)
     };
 
-    if (spapr_machine_using_legacy_numa(spapr)) {
+    if (smc->pre_5_2_numa_associativity) {
         uint32_t legacy_refpoints[] = {
             cpu_to_be32(0x4),
             cpu_to_be32(0x4),
-- 
2.31.1




* [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset()
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
  2021-09-10 19:55 ` [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
  2021-09-10 19:55 ` [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14 11:55   ` Greg Kurz
  2021-09-10 19:55 ` [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Introducing a new NUMA affinity, FORM2, requires a new mechanism to
switch between affinity modes after CAS. Also, we want FORM2 data
structures and functions to be completely separated from the existing
FORM1 code, allowing us to avoid adding new code that inherits the
existing complexity of FORM1.

At the same time, it's also desirable to minimize the number of changes
made in write_dt() functions that are used to write ibm,associativity of
the resources, RTAS artifacts and h_home_node_associativity. These
functions can work in the same way in both NUMA affinity modes, as long
as we use a similar data structure and parametrize it properly depending
on the affinity mode selected.

This patch introduces spapr_numa_associativity_reset() to start this
process. This function will be used to switch to the chosen NUMA
affinity after CAS and after migrating the guest. To do that, the
existing 'numa_assoc_array' is renamed to 'FORM1_assoc_array' and will
hold FORM1 data that is populated at associativity_init().
'numa_assoc_array' is now a pointer that can be switched between the
existing affinity arrays. We don't have FORM2 data structures yet, so
'numa_assoc_array' will always point to 'FORM1_assoc_array'.

We also take the precaution of pointing 'numa_assoc_array' to
'FORM1_assoc_array' at associativity_init() time, before CAS, so as not
to change FORM1 availability for existing guests.

A small change in spapr_numa_write_associativity_dt() is made to reflect
the fact that 'numa_assoc_array' is now a pointer and we must be
explicit with the size being written in the DT.
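
Condensed, the flow this patch implements (taken from the hunks below) is:

    machine init:
        spapr_numa_associativity_init()
            -> spapr_numa_FORM1_affinity_init()     /* fills FORM1_assoc_array */
            -> numa_assoc_array = FORM1_assoc_array /* pre-CAS default */

    CAS:
        do_client_architecture_support()
            -> spapr_numa_associativity_reset()     /* pointer switch only */

    incoming migration:
        spapr_post_load()
            -> spapr_numa_associativity_reset()     /* cheap, possibly redundant */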

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 14 +++++++++++++
 hw/ppc/spapr_hcall.c        |  7 +++++++
 hw/ppc/spapr_numa.c         | 42 +++++++++++++++++++++++++++++--------
 include/hw/ppc/spapr.h      |  3 ++-
 include/hw/ppc/spapr_numa.h |  1 +
 5 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d39fd4e644..5afbb76cab 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1786,6 +1786,20 @@ static int spapr_post_load(void *opaque, int version_id)
         return err;
     }
 
+    /*
+     * NUMA affinity selection is made in CAS time. There is no reliable
+     * way of telling whether the guest already went through CAS before
+     * migration due to how spapr_ov5_cas_needed works: a FORM1 guest can
+     * be migrated with ov5_cas empty regardless of going through CAS
+     * first.
+     *
+     * One solution is to call numa_associativity_reset(). The downside
+     * is that a guest migrated before CAS will reset it again when going
+     * through it, but since it's a lightweight operation it's worth being
+     * a little redundant to be safe.
+     */
+     spapr_numa_associativity_reset(spapr);
+
     return err;
 }
 
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 0e9a5b2e40..82ab92ddba 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -17,6 +17,7 @@
 #include "kvm_ppc.h"
 #include "hw/ppc/fdt.h"
 #include "hw/ppc/spapr_ovec.h"
+#include "hw/ppc/spapr_numa.h"
 #include "mmu-book3s-v3.h"
 #include "hw/mem/memory-device.h"
 
@@ -1197,6 +1198,12 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
     spapr_ovec_cleanup(ov1_guest);
 
+    /*
+     * Reset numa_assoc_array now that we know which NUMA affinity
+     * the guest will use.
+     */
+    spapr_numa_associativity_reset(spapr);
+
     /*
      * Ensure the guest asks for an interrupt mode we support;
      * otherwise terminate the boot.
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index fb6059550f..327952ba9e 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -97,7 +97,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
      */
     for (i = 1; i < nb_numa_nodes; i++) {
         for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
-            spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
+            spapr->FORM1_assoc_array[i][j] = cpu_to_be32(i);
         }
     }
 
@@ -149,8 +149,8 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
              * and going up to 0x1.
              */
             for (i = n_level; i > 0; i--) {
-                assoc_src = spapr->numa_assoc_array[src][i];
-                spapr->numa_assoc_array[dst][i] = assoc_src;
+                assoc_src = spapr->FORM1_assoc_array[src][i];
+                spapr->FORM1_assoc_array[dst][i] = assoc_src;
             }
         }
     }
@@ -167,6 +167,11 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int i, j, max_nodes_with_gpus;
 
+    /* init FORM1_assoc_array */
+    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
+        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
+    }
+
     /*
      * For all associativity arrays: first position is the size,
      * position MAX_DISTANCE_REF_POINTS is always the numa_id,
@@ -177,8 +182,8 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
      * 'i' will be a valid node_id set by the user.
      */
     for (i = 0; i < nb_numa_nodes; i++) {
-        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
-        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
+        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
+        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
     /*
@@ -192,15 +197,15 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM;
 
     for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) {
-        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
+        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
 
         for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
             uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ?
                                  SPAPR_GPU_NUMA_ID : cpu_to_be32(i);
-            spapr->numa_assoc_array[i][j] = gpu_assoc;
+            spapr->FORM1_assoc_array[i][j] = gpu_assoc;
         }
 
-        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
+        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
     /*
@@ -227,14 +232,33 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine)
 {
     spapr_numa_FORM1_affinity_init(spapr, machine);
+
+    /*
+     * Default to FORM1 affinity until CAS. We'll call affinity_reset()
+     * during CAS when we're sure about which NUMA affinity the guest
+     * is going to use.
+     *
+     * This step is a failsafe - guests in the wild were able to read
+     * FORM1 affinity info before CAS for a long time. Since affinity_reset()
+     * is just a pointer switch between data that was already populated
+     * here, this is an acceptable overhead to be on the safer side.
+     */
+    spapr->numa_assoc_array = spapr->FORM1_assoc_array;
+}
+
+void spapr_numa_associativity_reset(SpaprMachineState *spapr)
+{
+    /* No FORM2 affinity implemented yet */
+    spapr->numa_assoc_array = spapr->FORM1_assoc_array;
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid)
 {
+    /* Hardcode the size of FORM1 associativity array for now */
     _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
                       spapr->numa_assoc_array[nodeid],
-                      sizeof(spapr->numa_assoc_array[nodeid]))));
+                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
 }
 
 static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 637652ad16..8a9490f0bf 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -249,7 +249,8 @@ struct SpaprMachineState {
     unsigned gpu_numa_id;
     SpaprTpmProxy *tpm_proxy;
 
-    uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
+    uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM];
+    uint32_t **numa_assoc_array;
 
     Error *fwnmi_migration_blocker;
 };
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index 6f9f02d3de..ccf3e4eae8 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -24,6 +24,7 @@
  */
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine);
+void spapr_numa_associativity_reset(SpaprMachineState *spapr);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
-- 
2.31.1




* [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-09-10 19:55 ` [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset() Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14 12:10   ` Greg Kurz
  2021-09-10 19:55 ` [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS Daniel Henrique Barboza
  2021-09-10 19:55 ` [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support Daniel Henrique Barboza
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The next preliminary step to introduce NUMA FORM2 affinity is to make
the existing code independent of FORM1 macros and values, i.e.
MAX_DISTANCE_REF_POINTS, NUMA_ASSOC_SIZE and VCPU_ASSOC_SIZE. This patch
accomplishes that by doing the following:

- move the NUMA-related macros from spapr.h to spapr_numa.c, where they
are used. spapr.h instead gets a 'NUMA_NODES_MAX_NUM' macro that refers
to the maximum number of NUMA nodes, including GPU nodes, that
the machine can support;

- MAX_DISTANCE_REF_POINTS and NUMA_ASSOC_SIZE are renamed to
FORM1_DIST_REF_POINTS and FORM1_NUMA_ASSOC_SIZE. These FORM1 specific
macros are used in FORM1 init functions;

- code that uses MAX_DISTANCE_REF_POINTS now retrieves the
max_dist_ref_points value using get_max_dist_ref_points().
NUMA_ASSOC_SIZE is replaced by get_numa_assoc_size() and VCPU_ASSOC_SIZE
is replaced by get_vcpu_assoc_size(). These functions are used by the
generic device tree functions and h_home_node_associativity() and will
allow them to switch between FORM1 and FORM2 without changing their core
logic.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c    | 93 +++++++++++++++++++++++++++++++-----------
 include/hw/ppc/spapr.h | 22 +++-------
 2 files changed, 74 insertions(+), 41 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 327952ba9e..7ad4b6582b 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -19,6 +19,47 @@
 /* Moved from hw/ppc/spapr_pci_nvlink2.c */
 #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
 
+/*
+ * NUMA FORM1 macros. FORM1_DIST_REF_POINTS was taken from
+ * MAX_DISTANCE_REF_POINTS in arch/powerpc/mm/numa.h from Linux
+ * kernel source. It represents the amount of associativity domains
+ * for non-CPU resources.
+ *
+ * FORM1_NUMA_ASSOC_SIZE is the base array size of an ibm,associativity
+ * array for any non-CPU resource.
+ */
+#define FORM1_DIST_REF_POINTS            4
+#define FORM1_NUMA_ASSOC_SIZE            (FORM1_DIST_REF_POINTS + 1)
+
+/*
+ * Retrieves max_dist_ref_points of the current NUMA affinity.
+ */
+static int get_max_dist_ref_points(SpaprMachineState *spapr)
+{
+    /* No FORM2 affinity implemented yet */
+    return FORM1_DIST_REF_POINTS;
+}
+
+/*
+ * Retrieves numa_assoc_size of the current NUMA affinity.
+ */
+static int get_numa_assoc_size(SpaprMachineState *spapr)
+{
+    /* No FORM2 affinity implemented yet */
+    return FORM1_NUMA_ASSOC_SIZE;
+}
+
+/*
+ * Retrieves vcpu_assoc_size of the current NUMA affinity.
+ *
+ * vcpu_assoc_size is the size of ibm,associativity array
+ * for CPUs, which has an extra element (vcpu_id) in the end.
+ */
+static int get_vcpu_assoc_size(SpaprMachineState *spapr)
+{
+    return get_numa_assoc_size(spapr) + 1;
+}
+
 static bool spapr_numa_is_symmetrical(MachineState *ms)
 {
     int src, dst;
@@ -96,7 +137,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
      * considered a match with associativity domains of node 0.
      */
     for (i = 1; i < nb_numa_nodes; i++) {
-        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
+        for (j = 1; j < FORM1_DIST_REF_POINTS; j++) {
             spapr->FORM1_assoc_array[i][j] = cpu_to_be32(i);
         }
     }
@@ -134,7 +175,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
              *
              * The Linux kernel will assume that the distance between src and
              * dst, in this case of no match, is 10 (local distance) doubled
-             * for each NUMA it didn't match. We have MAX_DISTANCE_REF_POINTS
+             * for each NUMA it didn't match. We have FORM1_DIST_REF_POINTS
              * levels (4), so this gives us 10*2*2*2*2 = 160.
              *
              * This logic can be seen in the Linux kernel source code, as of
@@ -168,13 +209,13 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     int i, j, max_nodes_with_gpus;
 
     /* init FORM1_assoc_array */
-    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
-        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
+    for (i = 0; i < NUMA_NODES_MAX_NUM; i++) {
+        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, FORM1_NUMA_ASSOC_SIZE);
     }
 
     /*
      * For all associativity arrays: first position is the size,
-     * position MAX_DISTANCE_REF_POINTS is always the numa_id,
+     * position FORM1_DIST_REF_POINTS is always the numa_id,
      * represented by the index 'i'.
      *
      * This will break on sparse NUMA setups, when/if QEMU starts
@@ -182,8 +223,8 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
      * 'i' will be a valid node_id set by the user.
      */
     for (i = 0; i < nb_numa_nodes; i++) {
-        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
-        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
+        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(FORM1_DIST_REF_POINTS);
+        spapr->FORM1_assoc_array[i][FORM1_DIST_REF_POINTS] = cpu_to_be32(i);
     }
 
     /*
@@ -197,15 +238,15 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM;
 
     for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) {
-        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
+        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(FORM1_DIST_REF_POINTS);
 
-        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
+        for (j = 1; j < FORM1_DIST_REF_POINTS; j++) {
             uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ?
                                  SPAPR_GPU_NUMA_ID : cpu_to_be32(i);
             spapr->FORM1_assoc_array[i][j] = gpu_assoc;
         }
 
-        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
+        spapr->FORM1_assoc_array[i][FORM1_DIST_REF_POINTS] = cpu_to_be32(i);
     }
 
     /*
@@ -255,16 +296,17 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid)
 {
-    /* Hardcode the size of FORM1 associativity array for now */
     _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
                       spapr->numa_assoc_array[nodeid],
-                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
+                      get_numa_assoc_size(spapr) * sizeof(uint32_t))));
 }
 
 static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
                                            PowerPCCPU *cpu)
 {
-    uint32_t *vcpu_assoc = g_new(uint32_t, VCPU_ASSOC_SIZE);
+    int max_distance_ref_points = get_max_dist_ref_points(spapr);
+    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
+    uint32_t *vcpu_assoc = g_new(uint32_t, vcpu_assoc_size);
     int index = spapr_get_vcpu_id(cpu);
 
     /*
@@ -273,10 +315,10 @@ static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
      * 0, put cpu_id last, then copy the remaining associativity
      * domains.
      */
-    vcpu_assoc[0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS + 1);
-    vcpu_assoc[VCPU_ASSOC_SIZE - 1] = cpu_to_be32(index);
+    vcpu_assoc[0] = cpu_to_be32(max_distance_ref_points + 1);
+    vcpu_assoc[vcpu_assoc_size - 1] = cpu_to_be32(index);
     memcpy(vcpu_assoc + 1, spapr->numa_assoc_array[cpu->node_id] + 1,
-           (VCPU_ASSOC_SIZE - 2) * sizeof(uint32_t));
+           (vcpu_assoc_size - 2) * sizeof(uint32_t));
 
     return vcpu_assoc;
 }
@@ -285,12 +327,13 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
                             int offset, PowerPCCPU *cpu)
 {
     g_autofree uint32_t *vcpu_assoc = NULL;
+    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
 
     vcpu_assoc = spapr_numa_get_vcpu_assoc(spapr, cpu);
 
     /* Advertise NUMA via ibm,associativity */
     return fdt_setprop(fdt, offset, "ibm,associativity", vcpu_assoc,
-                       VCPU_ASSOC_SIZE * sizeof(uint32_t));
+                       vcpu_assoc_size * sizeof(uint32_t));
 }
 
 
@@ -298,17 +341,18 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
                                          int offset)
 {
     MachineState *machine = MACHINE(spapr);
+    int max_distance_ref_points = get_max_dist_ref_points(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
     uint32_t *int_buf, *cur_index, buf_len;
     int ret, i;
 
     /* ibm,associativity-lookup-arrays */
-    buf_len = (nr_nodes * MAX_DISTANCE_REF_POINTS + 2) * sizeof(uint32_t);
+    buf_len = (nr_nodes * max_distance_ref_points + 2) * sizeof(uint32_t);
     cur_index = int_buf = g_malloc0(buf_len);
     int_buf[0] = cpu_to_be32(nr_nodes);
      /* Number of entries per associativity list */
-    int_buf[1] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
+    int_buf[1] = cpu_to_be32(max_distance_ref_points);
     cur_index += 2;
     for (i = 0; i < nr_nodes; i++) {
         /*
@@ -317,8 +361,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
          */
         uint32_t *associativity = spapr->numa_assoc_array[i];
         memcpy(cur_index, ++associativity,
-               sizeof(uint32_t) * MAX_DISTANCE_REF_POINTS);
-        cur_index += MAX_DISTANCE_REF_POINTS;
+               sizeof(uint32_t) * max_distance_ref_points);
+        cur_index += max_distance_ref_points;
     }
     ret = fdt_setprop(fdt, offset, "ibm,associativity-lookup-arrays", int_buf,
                       (cur_index - int_buf) * sizeof(uint32_t));
@@ -406,6 +450,7 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
     target_ulong procno = args[1];
     PowerPCCPU *tcpu;
     int idx, assoc_idx;
+    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
 
     /* only support procno from H_REGISTER_VPA */
     if (flags != 0x1) {
@@ -424,7 +469,7 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
      * 12 associativity domains for vcpus. Assert and bail if that's
      * not the case.
      */
-    G_STATIC_ASSERT((VCPU_ASSOC_SIZE - 1) <= 12);
+    g_assert((vcpu_assoc_size - 1) <= 12);
 
     vcpu_assoc = spapr_numa_get_vcpu_assoc(spapr, tcpu);
     /* assoc_idx starts at 1 to skip associativity size */
@@ -445,9 +490,9 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
          * macro. The ternary will fill the remaining registers with -1
          * after we went through vcpu_assoc[].
          */
-        a = assoc_idx < VCPU_ASSOC_SIZE ?
+        a = assoc_idx < vcpu_assoc_size ?
             be32_to_cpu(vcpu_assoc[assoc_idx++]) : -1;
-        b = assoc_idx < VCPU_ASSOC_SIZE ?
+        b = assoc_idx < vcpu_assoc_size ?
             be32_to_cpu(vcpu_assoc[assoc_idx++]) : -1;
 
         args[idx] = ASSOCIATIVITY(a, b);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 8a9490f0bf..2554928250 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -100,23 +100,11 @@ typedef enum {
 
 #define FDT_MAX_SIZE                    0x200000
 
-/*
- * NUMA related macros. MAX_DISTANCE_REF_POINTS was taken
- * from Linux kernel arch/powerpc/mm/numa.h. It represents the
- * amount of associativity domains for non-CPU resources.
- *
- * NUMA_ASSOC_SIZE is the base array size of an ibm,associativity
- * array for any non-CPU resource.
- *
- * VCPU_ASSOC_SIZE represents the size of ibm,associativity array
- * for CPUs, which has an extra element (vcpu_id) in the end.
- */
-#define MAX_DISTANCE_REF_POINTS    4
-#define NUMA_ASSOC_SIZE            (MAX_DISTANCE_REF_POINTS + 1)
-#define VCPU_ASSOC_SIZE            (NUMA_ASSOC_SIZE + 1)
+/* Max number of GPUs per system */
+#define NVGPU_MAX_NUM              6
 
-/* Max number of these GPUsper a physical box */
-#define NVGPU_MAX_NUM                6
+/* Max number of NUMA nodes */
+#define NUMA_NODES_MAX_NUM         (MAX_NODES + NVGPU_MAX_NUM)
 
 typedef struct SpaprCapabilities SpaprCapabilities;
 struct SpaprCapabilities {
@@ -249,7 +237,7 @@ struct SpaprMachineState {
     unsigned gpu_numa_id;
     SpaprTpmProxy *tpm_proxy;
 
-    uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM];
+    uint32_t *FORM1_assoc_array[NUMA_NODES_MAX_NUM];
     uint32_t **numa_assoc_array;
 
     Error *fwnmi_migration_blocker;
-- 
2.31.1




* [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2021-09-10 19:55 ` [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14 12:26   ` Greg Kurz
  2021-09-10 19:55 ` [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support Daniel Henrique Barboza
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

FORM2 NUMA affinity is prepared to deal with empty (memory-less/CPU-less)
NUMA nodes. This is used by the DAX KMEM driver to place a PAPR SCM
device in its own NUMA node, with a different latency than the nodes
backed by regular memory. FORM2 is also able to deal with asymmetric NUMA
distances gracefully, something that our FORM1 implementation doesn't
do.

Move these FORM1 verifications to a new function and wait until after
CAS, when we're sure that we're sticking with FORM1, to enforce them.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 35 +----------------------
 hw/ppc/spapr_hcall.c        |  2 +-
 hw/ppc/spapr_numa.c         | 55 ++++++++++++++++++++++++++++++++-----
 include/hw/ppc/spapr_numa.h |  3 +-
 4 files changed, 52 insertions(+), 43 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 5afbb76cab..0703a26093 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1798,7 +1798,7 @@ static int spapr_post_load(void *opaque, int version_id)
      * through it, but since it's a lightweight operation it's worth being
      * a little redundant to be safe.
      */
-     spapr_numa_associativity_reset(spapr);
+     spapr_numa_associativity_reset(spapr, false);
 
     return err;
 }
@@ -2787,39 +2787,6 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
-    /*
-     * check we don't have a memory-less/cpu-less NUMA node
-     * Firmware relies on the existing memory/cpu topology to provide the
-     * NUMA topology to the kernel.
-     * And the linux kernel needs to know the NUMA topology at start
-     * to be able to hotplug CPUs later.
-     */
-    if (machine->numa_state->num_nodes) {
-        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
-            /* check for memory-less node */
-            if (machine->numa_state->nodes[i].node_mem == 0) {
-                CPUState *cs;
-                int found = 0;
-                /* check for cpu-less node */
-                CPU_FOREACH(cs) {
-                    PowerPCCPU *cpu = POWERPC_CPU(cs);
-                    if (cpu->node_id == i) {
-                        found = 1;
-                        break;
-                    }
-                }
-                /* memory-less and cpu-less node */
-                if (!found) {
-                    error_report(
-                       "Memory-less/cpu-less nodes are not supported (node %d)",
-                                 i);
-                    exit(1);
-                }
-            }
-        }
-
-    }
-
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
     /* Init numa_assoc_array */
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 82ab92ddba..2dc22e2dc7 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1202,7 +1202,7 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
      * Reset numa_assoc_array now that we know which NUMA affinity
      * the guest will use.
      */
-    spapr_numa_associativity_reset(spapr);
+    spapr_numa_associativity_reset(spapr, true);
 
     /*
      * Ensure the guest asks for an interrupt mode we support;
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 7ad4b6582b..0ade63c2d3 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -198,6 +198,48 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
 
 }
 
+static void spapr_numa_FORM1_affinity_check(MachineState *machine)
+{
+    int i;
+
+    /*
+     * Check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (machine->numa_state->num_nodes) {
+        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
+            /* check for memory-less node */
+            if (machine->numa_state->nodes[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+"Memory-less/cpu-less nodes are not supported with FORM1 NUMA (node %d)", i);
+                    exit(EXIT_FAILURE);
+                }
+            }
+        }
+    }
+
+    if (!spapr_numa_is_symmetrical(machine)) {
+        error_report(
+"Asymmetrical NUMA topologies aren't supported in the pSeries machine using FORM1 NUMA");
+        exit(EXIT_FAILURE);
+    }
+}
+
 /*
  * Set NUMA machine state data based on FORM1 affinity semantics.
  */
@@ -260,12 +302,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
         return;
     }
 
-    if (!spapr_numa_is_symmetrical(machine)) {
-        error_report("Asymmetrical NUMA topologies aren't supported "
-                     "in the pSeries machine");
-        exit(EXIT_FAILURE);
-    }
-
     spapr_numa_define_FORM1_domains(spapr);
 }
 
@@ -287,10 +323,15 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
     spapr->numa_assoc_array = spapr->FORM1_assoc_array;
 }
 
-void spapr_numa_associativity_reset(SpaprMachineState *spapr)
+void spapr_numa_associativity_reset(SpaprMachineState *spapr,
+                                    bool post_CAS_check)
 {
     /* No FORM2 affinity implemented yet */
     spapr->numa_assoc_array = spapr->FORM1_assoc_array;
+
+    if (post_CAS_check) {
+        spapr_numa_FORM1_affinity_check(MACHINE(spapr));
+    }
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index ccf3e4eae8..246767d0a8 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -24,7 +24,8 @@
  */
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine);
-void spapr_numa_associativity_reset(SpaprMachineState *spapr);
+void spapr_numa_associativity_reset(SpaprMachineState *spapr,
+                                    bool post_CAS_check);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
-- 
2.31.1




* [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support
  2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
                   ` (4 preceding siblings ...)
  2021-09-10 19:55 ` [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS Daniel Henrique Barboza
@ 2021-09-10 19:55 ` Daniel Henrique Barboza
  2021-09-14 12:58   ` Greg Kurz
  5 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-10 19:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

The main feature of FORM2 affinity support is the separation of NUMA
distances from ibm,associativity information. This allows for a more
flexible and straightforward NUMA distance assignment without relying on
complex associations between several levels of NUMA via
ibm,associativity matches. Another feature is its extensibility. This base
support contains the facilities for NUMA distance assignment, but in the
future more facilities will be added for latency, performance, bandwidth
and so on.

This patch implements the base FORM2 affinity support as follows:

- the use of FORM2 associativity is indicated by using bit 2 of byte 5
of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
or FORM2 affinity. Setting both forms will default to FORM2. We're not
advertising FORM2 for pseries-6.1 and older machine versions to prevent
guest visible changes in those;

- ibm,associativity-reference-points has a new semantic. Instead of
being used to calculate distances via NUMA levels, it's now used to
indicate the primary domain index in the ibm,associativity array of
each resource. In our case it's set to { 1 }, matching the position
where we place the numa_id (logical_domain_id) in the FORM2 array;

- two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
and ibm,numa-distance-table. The index table is used to list all the
NUMA logical domains of the platform, in ascending order, and allows for
sparse NUMA configurations (although QEMU doesn't currently support that).
ibm,numa-distance-table is an array that contains all the distances from
the first NUMA node to all other nodes, then the second NUMA node
distances to all other nodes and so on (see the sketch after this list);

- get_max_dist_ref_points() and get_numa_assoc_size() now check for
OV5_FORM2_AFFINITY and return FORM2 values if the guest selected FORM2
affinity during CAS.
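
As an illustration of the two tables, here is a stand-alone sketch (not the
QEMU code below; it only mirrors the logic of
spapr_numa_FORM2_write_rtas_tables()) for an assumed two-node guest with a
symmetric node distance of 20. Values are shown host-endian; the patch
stores them big-endian in the DT:

    #include <stdio.h>
    #include <stdint.h>

    #define NB_NODES 2

    static const uint8_t distance[NB_NODES][NB_NODES] = {
        { 10, 20 },
        { 20, 10 },
    };

    int main(void)
    {
        /* ibm,numa-lookup-index-table: entry count, then one NUMA id per node */
        uint32_t lookup_index_table[NB_NODES + 1] = { NB_NODES, 0, 1 };

        /*
         * ibm,numa-distance-table: a 32-bit entry count followed by
         * NB_NODES * NB_NODES one-byte distances, row by row.
         */
        int entries = NB_NODES * NB_NODES;
        uint8_t node_distances[NB_NODES * NB_NODES];
        int i = 0;

        for (int src = 0; src < NB_NODES; src++) {
            for (int dst = 0; dst < NB_NODES; dst++) {
                node_distances[i++] = distance[src][dst];
            }
        }

        printf("lookup-index-table: %u %u %u\n",
               (unsigned)lookup_index_table[0],
               (unsigned)lookup_index_table[1],
               (unsigned)lookup_index_table[2]);

        printf("distance-table: %d |", entries);
        for (i = 0; i < entries; i++) {
            printf(" %u", (unsigned)node_distances[i]);   /* 10 20 20 10 */
        }
        printf("\n");
        return 0;
    }

The guest pairs the NUMA ids from the lookup-index-table with the byte
distances in the distance-table, so node 0->1 and 1->0 both read 20 here.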

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              |   8 ++
 hw/ppc/spapr_numa.c         | 151 +++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr.h      |   2 +
 include/hw/ppc/spapr_ovec.h |   1 +
 4 files changed, 159 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 0703a26093..23aba5ae2d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2766,6 +2766,11 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
 
+    /* Do not advertise FORM2 NUMA support for pseries-6.1 and older */
+    if (!smc->pre_6_2_numa_affinity) {
+        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
+    }
+
     /* advertise support for dedicated HP event source to guests */
     if (spapr->use_hotplug_event_source) {
         spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
@@ -4681,8 +4686,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
  */
 static void spapr_machine_6_1_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_2_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
+    smc->pre_6_2_numa_affinity = true;
 }
 
 DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 0ade63c2d3..dd52b8921c 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -31,12 +31,22 @@
 #define FORM1_DIST_REF_POINTS            4
 #define FORM1_NUMA_ASSOC_SIZE            (FORM1_DIST_REF_POINTS + 1)
 
+/*
+ * FORM2 NUMA affinity has a single associativity domain, giving
+ * us a assoc size of 2.
+ */
+#define FORM2_DIST_REF_POINTS            1
+#define FORM2_NUMA_ASSOC_SIZE            (FORM2_DIST_REF_POINTS + 1)
+
 /*
  * Retrieves max_dist_ref_points of the current NUMA affinity.
  */
 static int get_max_dist_ref_points(SpaprMachineState *spapr)
 {
-    /* No FORM2 affinity implemented yet */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return FORM2_DIST_REF_POINTS;
+    }
+
     return FORM1_DIST_REF_POINTS;
 }
 
@@ -45,7 +55,10 @@ static int get_max_dist_ref_points(SpaprMachineState *spapr)
  */
 static int get_numa_assoc_size(SpaprMachineState *spapr)
 {
-    /* No FORM2 affinity implemented yet */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        return FORM2_NUMA_ASSOC_SIZE;
+    }
+
     return FORM1_NUMA_ASSOC_SIZE;
 }
 
@@ -305,10 +318,39 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     spapr_numa_define_FORM1_domains(spapr);
 }
 
+/*
+ * Init NUMA FORM2 machine state data
+ */
+static void spapr_numa_FORM2_affinity_init(SpaprMachineState *spapr)
+{
+    int i;
+
+    /*
+     * For all resources but CPUs, FORM2 associativity arrays will
+     * be a size 2 array with the following format:
+     *
+     * ibm,associativity = {1, numa_id}
+     *
+     * CPUs will write an additional 'vcpu_id' on top of the arrays
+     * being initialized here. 'numa_id' is represented by the
+     * index 'i' of the loop.
+     *
+     * Given that this initialization is also valid for GPU associativity
+     * arrays, handle everything in one single step by populating the
+     * arrays up to NUMA_NODES_MAX_NUM.
+     */
+    for (i = 0; i < NUMA_NODES_MAX_NUM; i++) {
+        spapr->FORM2_assoc_array[i] = g_new0(uint32_t, 2);
+        spapr->FORM2_assoc_array[i][0] = cpu_to_be32(1);
+        spapr->FORM2_assoc_array[i][1] = cpu_to_be32(i);
+    }
+}
+
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine)
 {
     spapr_numa_FORM1_affinity_init(spapr, machine);
+    spapr_numa_FORM2_affinity_init(spapr);
 
     /*
      * Default to FORM1 affinity until CAS. We'll call affinity_reset()
@@ -326,7 +368,11 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
 void spapr_numa_associativity_reset(SpaprMachineState *spapr,
                                     bool post_CAS_check)
 {
-    /* No FORM2 affinity implemented yet */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr->numa_assoc_array = spapr->FORM2_assoc_array;
+        return;
+    }
+
     spapr->numa_assoc_array = spapr->FORM1_assoc_array;
 
     if (post_CAS_check) {
@@ -471,6 +517,100 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
                      maxdomains, sizeof(maxdomains)));
 }
 
+static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
+                                               void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
+    g_autofree uint32_t *lookup_index_table = NULL;
+    g_autofree uint32_t *distance_table = NULL;
+    int src, dst, i, distance_table_size;
+    uint8_t *node_distances;
+
+    /*
+     * ibm,numa-lookup-index-table: array with length and a
+     * list of NUMA ids present in the guest.
+     */
+    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
+    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        lookup_index_table[i + 1] = cpu_to_be32(i);
+    }
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
+                     lookup_index_table,
+                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
+
+    /*
+     * ibm,numa-distance-table: contains all node distances. First
+     * element is the size of the table as uint32, followed up
+     * by all the uint8 distances from the first NUMA node, then all
+     * distances from the second NUMA node and so on.
+     *
+     * ibm,numa-lookup-index-table is used by guest to navigate this
+     * array because NUMA ids can be sparse (node 0 is the first,
+     * node 8 is the second ...).
+     */
+    distance_table = g_new0(uint32_t, distance_table_entries + 1);
+    distance_table[0] = cpu_to_be32(distance_table_entries);
+
+    node_distances = (uint8_t *)&distance_table[1];
+    i = 0;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            node_distances[i++] = numa_info[src].distance[dst];
+        }
+    }
+
+    distance_table_size = distance_table_entries * sizeof(uint8_t) +
+                          sizeof(uint32_t);
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
+                     distance_table, distance_table_size));
+}
+
+/*
+ * This helper could be compressed in a single function with
+ * FORM1 logic since we're setting the same DT values, with the
+ * difference being a call to spapr_numa_FORM2_write_rtas_tables()
+ * in the end. The separation was made to avoid clogging FORM1 code
+ * which already has to deal with compat modes from previous
+ * QEMU machine types.
+ */
+static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
+                                   spapr_numa_initial_nvgpu_numa_id(ms);
+
+    /*
+     * In FORM2, ibm,associativity-reference-points will point to
+     * the element in the ibm,associativity array that contains the
+     * primary domain index (for FORM2, the first element).
+     *
+     * This value (in our case, the numa-id) is then used as an index
+     * to retrieve all other attributes of the node (distance,
+     * bandwidth, latency) via ibm,numa-lookup-index-table and other
+     * ibm,numa-*-table properties.
+     */
+    uint32_t refpoints[] = { cpu_to_be32(1) };
+
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
+    uint32_t maxdomains[] = { cpu_to_be32(1), cpu_to_be32(maxdomain) };
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
+                     refpoints, sizeof(refpoints)));
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
+                     maxdomains, sizeof(maxdomains)));
+
+    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
+}
+
 /*
  * Helper that writes ibm,associativity-reference-points and
  * max-associativity-domains in the RTAS pointed by @rtas
@@ -478,6 +618,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
  */
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 2554928250..2a4502b65d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -133,6 +133,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_2_numa_affinity;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
@@ -238,6 +239,7 @@ struct SpaprMachineState {
     SpaprTpmProxy *tpm_proxy;
 
     uint32_t *FORM1_assoc_array[NUMA_NODES_MAX_NUM];
+    uint32_t *FORM2_assoc_array[NUMA_NODES_MAX_NUM];
     uint32_t **numa_assoc_array;
 
     Error *fwnmi_migration_blocker;
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 48b716a060..c3e8b98e7e 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
 /* option vector 5 */
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
+#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
 #define OV5_HP_EVT              OV_BIT(6, 5)
 #define OV5_HPT_RESIZE          OV_BIT(6, 7)
 #define OV5_DRMEM_V2            OV_BIT(22, 0)
-- 
2.31.1




* Re: [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers
  2021-09-10 19:55 ` [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-09-14  8:23   ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-14  8:23 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:34 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
> and doesn't need be concerned with all the legacy support for older
> pseries FORM1 guests.
> 
> We're also not going to calculate associativity domains based on numa
> distance (via spapr_numa_define_associativity_domains) since the
> distances will be written directly into new DT properties.
> 
> Let's split FORM1 code into its own functions to allow for easier
> insertion of FORM2 logic later on.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr_numa.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 779f18b994..786def7c73 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -92,7 +92,7 @@ static uint8_t spapr_numa_get_numa_level(uint8_t distance)
>      return 0;
>  }
>  
> -static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
> +static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)

Another option would have been to open-code this function in its
only caller, but your patch is definitely easier to review, so:

Reviewed-by: Greg Kurz <groug@kaod.org>


>  {
>      MachineState *ms = MACHINE(spapr);
>      NodeInfo *numa_info = ms->numa_state->nodes;
> @@ -155,8 +155,11 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
>  
>  }
>  
> -void spapr_numa_associativity_init(SpaprMachineState *spapr,
> -                                   MachineState *machine)
> +/*
> + * Set NUMA machine state data based on FORM1 affinity semantics.
> + */
> +static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
> +                                           MachineState *machine)
>  {
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>      int nb_numa_nodes = machine->numa_state->num_nodes;
> @@ -225,7 +228,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>          exit(EXIT_FAILURE);
>      }
>  
> -    spapr_numa_define_associativity_domains(spapr);
> +    spapr_numa_define_FORM1_domains(spapr);
> +}
> +
> +void spapr_numa_associativity_init(SpaprMachineState *spapr,
> +                                   MachineState *machine)
> +{
> +    spapr_numa_FORM1_affinity_init(spapr, machine);
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> @@ -302,12 +311,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>      return ret;
>  }
>  
> -/*
> - * Helper that writes ibm,associativity-reference-points and
> - * max-associativity-domains in the RTAS pointed by @rtas
> - * in the DT @fdt.
> - */
> -void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
>  {
>      MachineState *ms = MACHINE(spapr);
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> @@ -365,6 +370,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>                       maxdomains, sizeof(maxdomains)));
>  }
>  
> +/*
> + * Helper that writes ibm,associativity-reference-points and
> + * max-associativity-domains in the RTAS pointed by @rtas
> + * in the DT @fdt.
> + */
> +void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> +{
> +    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
> +}
> +
>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                SpaprMachineState *spapr,
>                                                target_ulong opcode,



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept
  2021-09-10 19:55 ` [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept Daniel Henrique Barboza
@ 2021-09-14  8:34   ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-14  8:34 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:35 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> When first introduced, 'legacy_numa' was a way to refer to guests that
> either wouldn't be affected by associativity domain calculations, namely
> the ones with only 1 NUMA node, and pre 5.2 guests that shouldn't be
> affected by it because it would be an userspace change. Calling these
> cases 'legacy_numa' was a convenient way to label these cases.
> 
> We're about to introduce a new NUMA affinity, FORM2, and this concept
> of 'legacy_numa' is now a bit misleading because, although it is called
> 'legacy' it is in fact a FORM1 exclusive contraint.
> 
> This patch removes spapr_machine_using_legacy_numa() and open code the
> conditions in each caller. While we're at it, move the chunk inside
> spapr_numa_FORM1_affinity_init() that sets all numa_assoc_array domains
> with 'node_id' to spapr_numa_define_FORM1_domains(). This chunk was
> being executed if !pre_5_2_numa_associativity and num_nodes => 1, the
> same conditions in which spapr_numa_define_FORM1_domains() is called
> shortly after.
> 

Those are definitely improvements. Just one question, see below.

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr_numa.c | 46 +++++++++++++++++++--------------------------
>  1 file changed, 19 insertions(+), 27 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 786def7c73..fb6059550f 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -19,15 +19,6 @@
>  /* Moved from hw/ppc/spapr_pci_nvlink2.c */
>  #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
>  
> -static bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> -{
> -    MachineState *machine = MACHINE(spapr);
> -    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> -
> -    return smc->pre_5_2_numa_associativity ||
> -           machine->numa_state->num_nodes <= 1;
> -}
> -
>  static bool spapr_numa_is_symmetrical(MachineState *ms)
>  {
>      int src, dst;
> @@ -97,7 +88,18 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>      MachineState *ms = MACHINE(spapr);
>      NodeInfo *numa_info = ms->numa_state->nodes;
>      int nb_numa_nodes = ms->numa_state->num_nodes;
> -    int src, dst, i;
> +    int src, dst, i, j;
> +
> +    /*
> +     * Fill all associativity domains of non-zero NUMA nodes with
> +     * node_id. This is required because the default value (0) is
> +     * considered a match with associativity domains of node 0.
> +     */
> +    for (i = 1; i < nb_numa_nodes; i++) {
> +        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> +            spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
> +        }
> +    }
>  
>      for (src = 0; src < nb_numa_nodes; src++) {
>          for (dst = src; dst < nb_numa_nodes; dst++) {
> @@ -164,7 +166,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>      int nb_numa_nodes = machine->numa_state->num_nodes;
>      int i, j, max_nodes_with_gpus;
> -    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
>  
>      /*
>       * For all associativity arrays: first position is the size,
> @@ -178,17 +179,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      for (i = 0; i < nb_numa_nodes; i++) {
>          spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> -
> -        /*
> -         * Fill all associativity domains of non-zero NUMA nodes with
> -         * node_id. This is required because the default value (0) is
> -         * considered a match with associativity domains of node 0.
> -         */
> -        if (!using_legacy_numa && i != 0) {
> -            for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> -                spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
> -            }
> -        }
>      }
>  
>      /*
> @@ -214,11 +204,13 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      }
>  
>      /*
> -     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
> -     * 1 NUMA node) will not benefit from anything we're going to do
> -     * after this point.
> +     * Guests pseries-5.1 and older uses zeroed associativity domains,
> +     * i.e. no domain definition based on NUMA distance input.
> +     *
> +     * Same thing with guests that have only one NUMA node.
>       */
> -    if (using_legacy_numa) {
> +    if (smc->pre_5_2_numa_associativity ||
> +        machine->numa_state->num_nodes <= 1) {
>          return;
>      }
>  
> @@ -334,7 +326,7 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>          cpu_to_be32(maxdomain)
>      };
>  
> -    if (spapr_machine_using_legacy_numa(spapr)) {
> +    if (smc->pre_5_2_numa_associativity) {

This doesn't check the number of NUMA nodes and thus changes behavior,
or am I missing something ?
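
Something like this (untested, reusing the 'ms' that is already available
in this function) would keep the old behaviour:

    if (smc->pre_5_2_numa_associativity ||
        ms->numa_state->num_nodes <= 1) {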

Rest looks good though so if you add the missing check or provide a
justification, feel free to add:

Reviewed-by: Greg Kurz <groug@kaod.org>

>          uint32_t legacy_refpoints[] = {
>              cpu_to_be32(0x4),
>              cpu_to_be32(0x4),



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset()
  2021-09-10 19:55 ` [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset() Daniel Henrique Barboza
@ 2021-09-14 11:55   ` Greg Kurz
  2021-09-14 19:58     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Kurz @ 2021-09-14 11:55 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:36 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Introducing a new NUMA affinity, FORM2, requires a new mechanism to
> switch between affinity modes after CAS. Also, we want FORM2 data
> structures and functions to be completely separated from the existing
> FORM1 code, allowing us to avoid adding new code that inherits the
> existing complexity of FORM1.
> 
> At the same time, it's also desirable to minimize the amount of changes
> made in write_dt() functions that are used to write ibm,associativity of
> the resources, RTAS artifacts and h_home_node_associativity. These
> functions can work in the same way in both NUMA affinity modes, as long
> as we use a similar data structure and parametrize it properly depending
> on the affinity mode selected.
> 
> This patch introduces spapr_numa_associativity_reset() to start this
> process. This function will be used to switch to the chosen NUMA
> affinity after CAS and after migrating the guest. To do that, the
> existing 'numa_assoc_array' is renamed to 'FORM1_assoc_array' and will
> hold FORM1 data that is populated at associativity_init().
> 'numa_assoc_array' is now a pointer that can be switched between the
> existing affinity arrays. We don't have FORM2 data structures yet, so
> 'numa_assoc_array' will always point to 'FORM1_assoc_array'.
> 
> We also take the precaution of pointing 'numa_assoc_array' to
> 'FORM1_assoc_array' in associativity_init() time, before CAS, to not
> change FORM1 availability for existing guests.
> 
> A small change in spapr_numa_write_associativity_dt() is made to reflect
> the fact that 'numa_assoc_array' is now a pointer and we must be
> explicit with the size being written in the DT.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              | 14 +++++++++++++
>  hw/ppc/spapr_hcall.c        |  7 +++++++
>  hw/ppc/spapr_numa.c         | 42 +++++++++++++++++++++++++++++--------
>  include/hw/ppc/spapr.h      |  3 ++-
>  include/hw/ppc/spapr_numa.h |  1 +
>  5 files changed, 57 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d39fd4e644..5afbb76cab 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1786,6 +1786,20 @@ static int spapr_post_load(void *opaque, int version_id)
>          return err;
>      }
>  
> +    /*
> +     * NUMA affinity selection is made in CAS time. There is no reliable
> +     * way of telling whether the guest already went through CAS before
> +     * migration due to how spapr_ov5_cas_needed works: a FORM1 guest can
> +     * be migrated with ov5_cas empty regardless of going through CAS
> +     * first.
> +     *
> +     * One solution is to call numa_associativity_reset(). The downside
> +     * is that a guest migrated before CAS will reset it again when going
> +     * through it, but since it's a lightweight operation it's worth being
> +     * a little redundant to be safe.

Also this isn't a hot path.

> +     */
> +     spapr_numa_associativity_reset(spapr);
> +
>      return err;
>  }
>  
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 0e9a5b2e40..82ab92ddba 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -17,6 +17,7 @@
>  #include "kvm_ppc.h"
>  #include "hw/ppc/fdt.h"
>  #include "hw/ppc/spapr_ovec.h"
> +#include "hw/ppc/spapr_numa.h"
>  #include "mmu-book3s-v3.h"
>  #include "hw/mem/memory-device.h"
>  
> @@ -1197,6 +1198,12 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>      spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>      spapr_ovec_cleanup(ov1_guest);
>  
> +    /*
> +     * Reset numa_assoc_array now that we know which NUMA affinity
> +     * the guest will use.
> +     */
> +    spapr_numa_associativity_reset(spapr);
> +
>      /*
>       * Ensure the guest asks for an interrupt mode we support;
>       * otherwise terminate the boot.
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index fb6059550f..327952ba9e 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -97,7 +97,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>       */
>      for (i = 1; i < nb_numa_nodes; i++) {
>          for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> -            spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
> +            spapr->FORM1_assoc_array[i][j] = cpu_to_be32(i);
>          }
>      }
>  
> @@ -149,8 +149,8 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>               * and going up to 0x1.
>               */
>              for (i = n_level; i > 0; i--) {
> -                assoc_src = spapr->numa_assoc_array[src][i];
> -                spapr->numa_assoc_array[dst][i] = assoc_src;
> +                assoc_src = spapr->FORM1_assoc_array[src][i];
> +                spapr->FORM1_assoc_array[dst][i] = assoc_src;
>              }
>          }
>      }
> @@ -167,6 +167,11 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      int nb_numa_nodes = machine->numa_state->num_nodes;
>      int i, j, max_nodes_with_gpus;
>  
> +    /* init FORM1_assoc_array */
> +    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
> +        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);

Why dynamic allocation since you have fixed size ?

> +    }
> +
>      /*
>       * For all associativity arrays: first position is the size,
>       * position MAX_DISTANCE_REF_POINTS is always the numa_id,
> @@ -177,8 +182,8 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>       * 'i' will be a valid node_id set by the user.
>       */
>      for (i = 0; i < nb_numa_nodes; i++) {
> -        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> -        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> +        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
>      /*
> @@ -192,15 +197,15 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM;
>  
>      for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) {
> -        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>  
>          for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
>              uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ?
>                                   SPAPR_GPU_NUMA_ID : cpu_to_be32(i);
> -            spapr->numa_assoc_array[i][j] = gpu_assoc;
> +            spapr->FORM1_assoc_array[i][j] = gpu_assoc;
>          }
>  
> -        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> +        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>      }
>  
>      /*
> @@ -227,14 +232,33 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>                                     MachineState *machine)
>  {
>      spapr_numa_FORM1_affinity_init(spapr, machine);
> +
> +    /*
> +     * Default to FORM1 affinity until CAS. We'll call affinity_reset()
> +     * during CAS when we're sure about which NUMA affinity the guest
> +     * is going to use.
> +     *
> +     * This step is a failsafe - guests in the wild were able to read
> +     * FORM1 affinity info before CAS for a long time. Since affinity_reset()
> +     * is just a pointer switch between data that was already populated
> +     * here, this is an acceptable overhead to be on the safer side.
> +     */
> +    spapr->numa_assoc_array = spapr->FORM1_assoc_array;

The right way to do that is to call spapr_numa_associativity_reset() from
spapr_machine_reset() because we want to revert to FORM1 each time the
guest is rebooted.
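
i.e. roughly (untested sketch, assuming the usual SPAPR_MACHINE() cast we
already have at the top of spapr_machine_reset()):

    static void spapr_machine_reset(MachineState *machine)
    {
        SpaprMachineState *spapr = SPAPR_MACHINE(machine);

        /* go back to FORM1 on reset, CAS will switch to FORM2 if negotiated */
        spapr_numa_associativity_reset(spapr);

        /* ... rest of the existing reset code ... */
    }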

> +}
> +
> +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
> +{
> +    /* No FORM2 affinity implemented yet */

This seems quite obvious at this point, not sure the comment helps.

> +    spapr->numa_assoc_array = spapr->FORM1_assoc_array;
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>                                         int offset, int nodeid)
>  {
> +    /* Hardcode the size of FORM1 associativity array for now */
>      _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
>                        spapr->numa_assoc_array[nodeid],
> -                      sizeof(spapr->numa_assoc_array[nodeid]))));
> +                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));

Rather than doing this temporary change that gets undone in
a later patch, I suggest you introduce get_numa_assoc_size()
in a preliminary patch and use it here already :

-                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
+                      get_numa_assoc_size(spapr) * sizeof(uint32_t))));
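
where get_numa_assoc_size() would simply return the FORM1 size until FORM2
shows up, e.g. something like:

    static int get_numa_assoc_size(SpaprMachineState *spapr)
    {
        /* only FORM1 exists at this point */
        return NUMA_ASSOC_SIZE;
    }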

This will simplify the review.

>  }
>  
>  static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 637652ad16..8a9490f0bf 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -249,7 +249,8 @@ struct SpaprMachineState {
>      unsigned gpu_numa_id;
>      SpaprTpmProxy *tpm_proxy;
>  
> -    uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
> +    uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM];

As said above, I really don't see the point in not having :

    uint32_t FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];

> +    uint32_t **numa_assoc_array;
>  
>      Error *fwnmi_migration_blocker;
>  };
> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> index 6f9f02d3de..ccf3e4eae8 100644
> --- a/include/hw/ppc/spapr_numa.h
> +++ b/include/hw/ppc/spapr_numa.h
> @@ -24,6 +24,7 @@
>   */
>  void spapr_numa_associativity_init(SpaprMachineState *spapr,
>                                     MachineState *machine);
> +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>                                         int offset, int nodeid);



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros
  2021-09-10 19:55 ` [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros Daniel Henrique Barboza
@ 2021-09-14 12:10   ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-14 12:10 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:37 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The next preliminary step to introduce NUMA FORM2 affinity is to make
> the existing code independent of FORM1 macros and values, i.e.
> MAX_DISTANCE_REF_POINTS, NUMA_ASSOC_SIZE and VCPU_ASSOC_SIZE. This patch
> accomplishes that by doing the following:
> 
> - move the NUMA related macros from spapr.h to spapr_numa.c where they
> are used. spapr.h gets instead a 'NUMA_NODES_MAX_NUM' macro that is used
> to refer to the maximum number of NUMA nodes, including GPU nodes, that
> the machine can support;
> 
> - MAX_DISTANCE_REF_POINTS and NUMA_ASSOC_SIZE are renamed to
> FORM1_DIST_REF_POINTS and FORM1_NUMA_ASSOC_SIZE. These FORM1 specific
> macros are used in FORM1 init functions;
> 
> - code that uses MAX_DISTANCE_REF_POINTS now retrieves the
> max_dist_ref_points value using get_max_dist_ref_points().
> NUMA_ASSOC_SIZE is replaced by get_numa_assoc_size() and VCPU_ASSOC_SIZE
> is replaced by get_vcpu_assoc_size(). These functions are used by the
> generic device tree functions and h_home_node_associativity() and will
> allow them to switch between FORM1 and FORM2 without changing their core
> logic.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---

I'd prefer this patch to go before patch 3 so that we get a
clear distinction between FORM1 and common code, before we
start adding new paths.

LGTM apart from that.

>  hw/ppc/spapr_numa.c    | 93 +++++++++++++++++++++++++++++++-----------
>  include/hw/ppc/spapr.h | 22 +++-------
>  2 files changed, 74 insertions(+), 41 deletions(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 327952ba9e..7ad4b6582b 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -19,6 +19,47 @@
>  /* Moved from hw/ppc/spapr_pci_nvlink2.c */
>  #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
>  
> +/*
> + * NUMA FORM1 macros. FORM1_DIST_REF_POINTS was taken from
> + * MAX_DISTANCE_REF_POINTS in arch/powerpc/mm/numa.h from Linux
> + * kernel source. It represents the amount of associativity domains
> + * for non-CPU resources.
> + *
> + * FORM1_NUMA_ASSOC_SIZE is the base array size of an ibm,associativity
> + * array for any non-CPU resource.
> + */
> +#define FORM1_DIST_REF_POINTS            4
> +#define FORM1_NUMA_ASSOC_SIZE            (FORM1_DIST_REF_POINTS + 1)
> +
> +/*
> + * Retrieves max_dist_ref_points of the current NUMA affinity.
> + */
> +static int get_max_dist_ref_points(SpaprMachineState *spapr)
> +{
> +    /* No FORM2 affinity implemented yet */
> +    return FORM1_DIST_REF_POINTS;
> +}
> +
> +/*
> + * Retrieves numa_assoc_size of the current NUMA affinity.
> + */
> +static int get_numa_assoc_size(SpaprMachineState *spapr)
> +{
> +    /* No FORM2 affinity implemented yet */
> +    return FORM1_NUMA_ASSOC_SIZE;
> +}
> +
> +/*
> + * Retrieves vcpu_assoc_size of the current NUMA affinity.
> + *
> + * vcpu_assoc_size is the size of ibm,associativity array
> + * for CPUs, which has an extra element (vcpu_id) in the end.
> + */
> +static int get_vcpu_assoc_size(SpaprMachineState *spapr)
> +{
> +    return get_numa_assoc_size(spapr) + 1;
> +}
> +
>  static bool spapr_numa_is_symmetrical(MachineState *ms)
>  {
>      int src, dst;
> @@ -96,7 +137,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>       * considered a match with associativity domains of node 0.
>       */
>      for (i = 1; i < nb_numa_nodes; i++) {
> -        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> +        for (j = 1; j < FORM1_DIST_REF_POINTS; j++) {
>              spapr->FORM1_assoc_array[i][j] = cpu_to_be32(i);
>          }
>      }
> @@ -134,7 +175,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>               *
>               * The Linux kernel will assume that the distance between src and
>               * dst, in this case of no match, is 10 (local distance) doubled
> -             * for each NUMA it didn't match. We have MAX_DISTANCE_REF_POINTS
> +             * for each NUMA it didn't match. We have FORM1_DIST_REF_POINTS
>               * levels (4), so this gives us 10*2*2*2*2 = 160.
>               *
>               * This logic can be seen in the Linux kernel source code, as of
> @@ -168,13 +209,13 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      int i, j, max_nodes_with_gpus;
>  
>      /* init FORM1_assoc_array */
> -    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
> -        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
> +    for (i = 0; i < NUMA_NODES_MAX_NUM; i++) {
> +        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, FORM1_NUMA_ASSOC_SIZE);
>      }
>  
>      /*
>       * For all associativity arrays: first position is the size,
> -     * position MAX_DISTANCE_REF_POINTS is always the numa_id,
> +     * position FORM1_DIST_REF_POINTS is always the numa_id,
>       * represented by the index 'i'.
>       *
>       * This will break on sparse NUMA setups, when/if QEMU starts
> @@ -182,8 +223,8 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>       * 'i' will be a valid node_id set by the user.
>       */
>      for (i = 0; i < nb_numa_nodes; i++) {
> -        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> -        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(FORM1_DIST_REF_POINTS);
> +        spapr->FORM1_assoc_array[i][FORM1_DIST_REF_POINTS] = cpu_to_be32(i);
>      }
>  
>      /*
> @@ -197,15 +238,15 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM;
>  
>      for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) {
> -        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(FORM1_DIST_REF_POINTS);
>  
> -        for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
> +        for (j = 1; j < FORM1_DIST_REF_POINTS; j++) {
>              uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ?
>                                   SPAPR_GPU_NUMA_ID : cpu_to_be32(i);
>              spapr->FORM1_assoc_array[i][j] = gpu_assoc;
>          }
>  
> -        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> +        spapr->FORM1_assoc_array[i][FORM1_DIST_REF_POINTS] = cpu_to_be32(i);
>      }
>  
>      /*
> @@ -255,16 +296,17 @@ void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>                                         int offset, int nodeid)
>  {
> -    /* Hardcode the size of FORM1 associativity array for now */
>      _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
>                        spapr->numa_assoc_array[nodeid],
> -                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
> +                      get_numa_assoc_size(spapr) * sizeof(uint32_t))));
>  }
>  
>  static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
>                                             PowerPCCPU *cpu)
>  {
> -    uint32_t *vcpu_assoc = g_new(uint32_t, VCPU_ASSOC_SIZE);
> +    int max_distance_ref_points = get_max_dist_ref_points(spapr);
> +    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
> +    uint32_t *vcpu_assoc = g_new(uint32_t, vcpu_assoc_size);
>      int index = spapr_get_vcpu_id(cpu);
>  
>      /*
> @@ -273,10 +315,10 @@ static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
>       * 0, put cpu_id last, then copy the remaining associativity
>       * domains.
>       */
> -    vcpu_assoc[0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS + 1);
> -    vcpu_assoc[VCPU_ASSOC_SIZE - 1] = cpu_to_be32(index);
> +    vcpu_assoc[0] = cpu_to_be32(max_distance_ref_points + 1);
> +    vcpu_assoc[vcpu_assoc_size - 1] = cpu_to_be32(index);
>      memcpy(vcpu_assoc + 1, spapr->numa_assoc_array[cpu->node_id] + 1,
> -           (VCPU_ASSOC_SIZE - 2) * sizeof(uint32_t));
> +           (vcpu_assoc_size - 2) * sizeof(uint32_t));
>  
>      return vcpu_assoc;
>  }
> @@ -285,12 +327,13 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
>                              int offset, PowerPCCPU *cpu)
>  {
>      g_autofree uint32_t *vcpu_assoc = NULL;
> +    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
>  
>      vcpu_assoc = spapr_numa_get_vcpu_assoc(spapr, cpu);
>  
>      /* Advertise NUMA via ibm,associativity */
>      return fdt_setprop(fdt, offset, "ibm,associativity", vcpu_assoc,
> -                       VCPU_ASSOC_SIZE * sizeof(uint32_t));
> +                       vcpu_assoc_size * sizeof(uint32_t));
>  }
>  
>  
> @@ -298,17 +341,18 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>                                           int offset)
>  {
>      MachineState *machine = MACHINE(spapr);
> +    int max_distance_ref_points = get_max_dist_ref_points(spapr);
>      int nb_numa_nodes = machine->numa_state->num_nodes;
>      int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
>      uint32_t *int_buf, *cur_index, buf_len;
>      int ret, i;
>  
>      /* ibm,associativity-lookup-arrays */
> -    buf_len = (nr_nodes * MAX_DISTANCE_REF_POINTS + 2) * sizeof(uint32_t);
> +    buf_len = (nr_nodes * max_distance_ref_points + 2) * sizeof(uint32_t);
>      cur_index = int_buf = g_malloc0(buf_len);
>      int_buf[0] = cpu_to_be32(nr_nodes);
>       /* Number of entries per associativity list */
> -    int_buf[1] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> +    int_buf[1] = cpu_to_be32(max_distance_ref_points);
>      cur_index += 2;
>      for (i = 0; i < nr_nodes; i++) {
>          /*
> @@ -317,8 +361,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>           */
>          uint32_t *associativity = spapr->numa_assoc_array[i];
>          memcpy(cur_index, ++associativity,
> -               sizeof(uint32_t) * MAX_DISTANCE_REF_POINTS);
> -        cur_index += MAX_DISTANCE_REF_POINTS;
> +               sizeof(uint32_t) * max_distance_ref_points);
> +        cur_index += max_distance_ref_points;
>      }
>      ret = fdt_setprop(fdt, offset, "ibm,associativity-lookup-arrays", int_buf,
>                        (cur_index - int_buf) * sizeof(uint32_t));
> @@ -406,6 +450,7 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>      target_ulong procno = args[1];
>      PowerPCCPU *tcpu;
>      int idx, assoc_idx;
> +    int vcpu_assoc_size = get_vcpu_assoc_size(spapr);
>  
>      /* only support procno from H_REGISTER_VPA */
>      if (flags != 0x1) {
> @@ -424,7 +469,7 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>       * 12 associativity domains for vcpus. Assert and bail if that's
>       * not the case.
>       */
> -    G_STATIC_ASSERT((VCPU_ASSOC_SIZE - 1) <= 12);
> +    g_assert((vcpu_assoc_size - 1) <= 12);
>  
>      vcpu_assoc = spapr_numa_get_vcpu_assoc(spapr, tcpu);
>      /* assoc_idx starts at 1 to skip associativity size */
> @@ -445,9 +490,9 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>           * macro. The ternary will fill the remaining registers with -1
>           * after we went through vcpu_assoc[].
>           */
> -        a = assoc_idx < VCPU_ASSOC_SIZE ?
> +        a = assoc_idx < vcpu_assoc_size ?
>              be32_to_cpu(vcpu_assoc[assoc_idx++]) : -1;
> -        b = assoc_idx < VCPU_ASSOC_SIZE ?
> +        b = assoc_idx < vcpu_assoc_size ?
>              be32_to_cpu(vcpu_assoc[assoc_idx++]) : -1;
>  
>          args[idx] = ASSOCIATIVITY(a, b);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 8a9490f0bf..2554928250 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -100,23 +100,11 @@ typedef enum {
>  
>  #define FDT_MAX_SIZE                    0x200000
>  
> -/*
> - * NUMA related macros. MAX_DISTANCE_REF_POINTS was taken
> - * from Linux kernel arch/powerpc/mm/numa.h. It represents the
> - * amount of associativity domains for non-CPU resources.
> - *
> - * NUMA_ASSOC_SIZE is the base array size of an ibm,associativity
> - * array for any non-CPU resource.
> - *
> - * VCPU_ASSOC_SIZE represents the size of ibm,associativity array
> - * for CPUs, which has an extra element (vcpu_id) in the end.
> - */
> -#define MAX_DISTANCE_REF_POINTS    4
> -#define NUMA_ASSOC_SIZE            (MAX_DISTANCE_REF_POINTS + 1)
> -#define VCPU_ASSOC_SIZE            (NUMA_ASSOC_SIZE + 1)
> +/* Max number of GPUs per system */
> +#define NVGPU_MAX_NUM              6
>  
> -/* Max number of these GPUsper a physical box */
> -#define NVGPU_MAX_NUM                6
> +/* Max number of NUMA nodes */
> +#define NUMA_NODES_MAX_NUM         (MAX_NODES + NVGPU_MAX_NUM)
>  
>  typedef struct SpaprCapabilities SpaprCapabilities;
>  struct SpaprCapabilities {
> @@ -249,7 +237,7 @@ struct SpaprMachineState {
>      unsigned gpu_numa_id;
>      SpaprTpmProxy *tpm_proxy;
>  
> -    uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM];
> +    uint32_t *FORM1_assoc_array[NUMA_NODES_MAX_NUM];
>      uint32_t **numa_assoc_array;
>  
>      Error *fwnmi_migration_blocker;



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS
  2021-09-10 19:55 ` [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS Daniel Henrique Barboza
@ 2021-09-14 12:26   ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-14 12:26 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:38 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> FORM2 NUMA affinity is prepared to deal with empty (memory/cpu less)
> NUMA nodes. This is used by the DAX KMEM driver to locate a PAPR SCM
> device that has a different latency than the original NUMA node from the
> regular memory. FORM2 is also enable to deal with asymmetric NUMA

s/enable/able

> distances gracefully, something that our FORM1 implementation doesn't
> do.
> 
> Move these FORM1 verifications to a new function and wait until after
> CAS, when we're sure that we're sticking with FORM1, to enforce them.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              | 35 +----------------------
>  hw/ppc/spapr_hcall.c        |  2 +-
>  hw/ppc/spapr_numa.c         | 55 ++++++++++++++++++++++++++++++++-----
>  include/hw/ppc/spapr_numa.h |  3 +-
>  4 files changed, 52 insertions(+), 43 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 5afbb76cab..0703a26093 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1798,7 +1798,7 @@ static int spapr_post_load(void *opaque, int version_id)
>       * through it, but since it's a lightweight operation it's worth being
>       * a little redundant to be safe.
>       */
> -     spapr_numa_associativity_reset(spapr);
> +     spapr_numa_associativity_reset(spapr, false);
>  
>      return err;
>  }
> @@ -2787,39 +2787,6 @@ static void spapr_machine_init(MachineState *machine)
>      /* init CPUs */
>      spapr_init_cpus(spapr);
>  
> -    /*
> -     * check we don't have a memory-less/cpu-less NUMA node
> -     * Firmware relies on the existing memory/cpu topology to provide the
> -     * NUMA topology to the kernel.
> -     * And the linux kernel needs to know the NUMA topology at start
> -     * to be able to hotplug CPUs later.
> -     */
> -    if (machine->numa_state->num_nodes) {
> -        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
> -            /* check for memory-less node */
> -            if (machine->numa_state->nodes[i].node_mem == 0) {
> -                CPUState *cs;
> -                int found = 0;
> -                /* check for cpu-less node */
> -                CPU_FOREACH(cs) {
> -                    PowerPCCPU *cpu = POWERPC_CPU(cs);
> -                    if (cpu->node_id == i) {
> -                        found = 1;
> -                        break;
> -                    }
> -                }
> -                /* memory-less and cpu-less node */
> -                if (!found) {
> -                    error_report(
> -                       "Memory-less/cpu-less nodes are not supported (node %d)",
> -                                 i);
> -                    exit(1);
> -                }
> -            }
> -        }
> -
> -    }
> -
>      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
>  
>      /* Init numa_assoc_array */
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 82ab92ddba..2dc22e2dc7 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1202,7 +1202,7 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>       * Reset numa_assoc_array now that we know which NUMA affinity
>       * the guest will use.
>       */
> -    spapr_numa_associativity_reset(spapr);
> +    spapr_numa_associativity_reset(spapr, true);
>  
>      /*
>       * Ensure the guest asks for an interrupt mode we support;
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 7ad4b6582b..0ade63c2d3 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -198,6 +198,48 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>  
>  }
>  
> +static void spapr_numa_FORM1_affinity_check(MachineState *machine)
> +{
> +    int i;
> +
> +    /*
> +     * Check we don't have a memory-less/cpu-less NUMA node
> +     * Firmware relies on the existing memory/cpu topology to provide the
> +     * NUMA topology to the kernel.
> +     * And the linux kernel needs to know the NUMA topology at start
> +     * to be able to hotplug CPUs later.
> +     */
> +    if (machine->numa_state->num_nodes) {
> +        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
> +            /* check for memory-less node */
> +            if (machine->numa_state->nodes[i].node_mem == 0) {
> +                CPUState *cs;
> +                int found = 0;
> +                /* check for cpu-less node */
> +                CPU_FOREACH(cs) {
> +                    PowerPCCPU *cpu = POWERPC_CPU(cs);
> +                    if (cpu->node_id == i) {
> +                        found = 1;
> +                        break;
> +                    }
> +                }
> +                /* memory-less and cpu-less node */
> +                if (!found) {
> +                    error_report(
> +"Memory-less/cpu-less nodes are not supported with FORM1 NUMA (node %d)", i);
> +                    exit(EXIT_FAILURE);
> +                }
> +            }
> +        }
> +    }
> +
> +    if (!spapr_numa_is_symmetrical(machine)) {
> +        error_report(
> +"Asymmetrical NUMA topologies aren't supported in the pSeries machine using FORM1 NUMA");
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
>  /*
>   * Set NUMA machine state data based on FORM1 affinity semantics.
>   */
> @@ -260,12 +302,6 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>          return;
>      }
>  
> -    if (!spapr_numa_is_symmetrical(machine)) {
> -        error_report("Asymmetrical NUMA topologies aren't supported "
> -                     "in the pSeries machine");
> -        exit(EXIT_FAILURE);
> -    }
> -
>      spapr_numa_define_FORM1_domains(spapr);
>  }
>  
> @@ -287,10 +323,15 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>      spapr->numa_assoc_array = spapr->FORM1_assoc_array;
>  }
>  
> -void spapr_numa_associativity_reset(SpaprMachineState *spapr)
> +void spapr_numa_associativity_reset(SpaprMachineState *spapr,
> +                                    bool post_CAS_check)

Coding style requires lower case for variable names, and that's what
we already do with other CAS-related variables in the rest of the code.
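
i.e. something like :

    void spapr_numa_associativity_reset(SpaprMachineState *spapr,
                                        bool post_cas_check);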

Rest LGTM so with these remarks addressed :

Reviewed-by: Greg Kurz <groug@kaod.org>

>  {
>      /* No FORM2 affinity implemented yet */
>      spapr->numa_assoc_array = spapr->FORM1_assoc_array;
> +
> +    if (post_CAS_check) {
> +        spapr_numa_FORM1_affinity_check(MACHINE(spapr));
> +    }
>  }
>  
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> index ccf3e4eae8..246767d0a8 100644
> --- a/include/hw/ppc/spapr_numa.h
> +++ b/include/hw/ppc/spapr_numa.h
> @@ -24,7 +24,8 @@
>   */
>  void spapr_numa_associativity_init(SpaprMachineState *spapr,
>                                     MachineState *machine);
> -void spapr_numa_associativity_reset(SpaprMachineState *spapr);
> +void spapr_numa_associativity_reset(SpaprMachineState *spapr,
> +                                    bool post_CAS_check);
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>                                         int offset, int nodeid);



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support
  2021-09-10 19:55 ` [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-09-14 12:58   ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-14 12:58 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Fri, 10 Sep 2021 16:55:39 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The main feature of FORM2 affinity support is the separation of NUMA
> distances from ibm,associativity information. This allows for a more
> flexible and straightforward NUMA distance assignment without relying on
> complex associations between several levels of NUMA via
> ibm,associativity matches. Another feature is its extensibility. This base
> support contains the facilities for NUMA distance assignment, but in the
> future more facilities will be added for latency, performance, bandwidth
> and so on.
> 
> This patch implements the base FORM2 affinity support as follows:
> 
> - the use of FORM2 associativity is indicated by using bit 2 of byte 5
> of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
> or FORM2 affinity. Setting both forms will default to FORM2. We're not
> advertising FORM2 for pseries-6.1 and older machine versions to prevent
> guest visible changes in those;
> 
> - ibm,associativity-reference-points has a new semantic. Instead of
> being used to calculate distances via NUMA levels, it's now used to
> indicate the primary domain index in the ibm,associativity domain of
> each resource. In our case it's set to {0x4}, matching the position
> where we already place logical_domain_id;
> 
> - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
> and ibm,numa-distance-table. The index table is used to list all the
> NUMA logical domains of the platform, in ascending order, and allows for
> spartial NUMA configurations (although QEMU ATM doesn't support that).
> ibm,numa-distance-table is an array that contains all the distances from
> the first NUMA node to all other nodes, then the second NUMA node
> distances to all other nodes and so on;
> 
> - get_max_dist_ref_points() and get_numa_assoc_size() now checks for
> OV5_FORM2_AFFINITY and returns FORM2 values if the guest selected FORM2
> affinity during CAS.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              |   8 ++
>  hw/ppc/spapr_numa.c         | 151 +++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h      |   2 +
>  include/hw/ppc/spapr_ovec.h |   1 +
>  4 files changed, 159 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 0703a26093..23aba5ae2d 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2766,6 +2766,11 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
>  
> +    /* Do not advertise FORM2 NUMA support for pseries-6.1 and older */
> +    if (!smc->pre_6_2_numa_affinity) {
> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
> +    }
> +
>      /* advertise support for dedicated HP event source to guests */
>      if (spapr->use_hotplug_event_source) {
>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
> @@ -4681,8 +4686,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
>   */
>  static void spapr_machine_6_1_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_6_2_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
> +    smc->pre_6_2_numa_affinity = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 0ade63c2d3..dd52b8921c 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -31,12 +31,22 @@
>  #define FORM1_DIST_REF_POINTS            4
>  #define FORM1_NUMA_ASSOC_SIZE            (FORM1_DIST_REF_POINTS + 1)
>  
> +/*
> + * FORM2 NUMA affinity has a single associativity domain, giving
> + * us a assoc size of 2.
> + */
> +#define FORM2_DIST_REF_POINTS            1
> +#define FORM2_NUMA_ASSOC_SIZE            (FORM2_DIST_REF_POINTS + 1)
> +
>  /*
>   * Retrieves max_dist_ref_points of the current NUMA affinity.
>   */
>  static int get_max_dist_ref_points(SpaprMachineState *spapr)
>  {
> -    /* No FORM2 affinity implemented yet */
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        return FORM2_DIST_REF_POINTS;
> +    }
> +
>      return FORM1_DIST_REF_POINTS;
>  }
>  
> @@ -45,7 +55,10 @@ static int get_max_dist_ref_points(SpaprMachineState *spapr)
>   */
>  static int get_numa_assoc_size(SpaprMachineState *spapr)
>  {
> -    /* No FORM2 affinity implemented yet */
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        return FORM2_NUMA_ASSOC_SIZE;
> +    }
> +
>      return FORM1_NUMA_ASSOC_SIZE;
>  }
>  
> @@ -305,10 +318,39 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>      spapr_numa_define_FORM1_domains(spapr);
>  }
>  
> +/*
> + * Init NUMA FORM2 machine state data
> + */
> +static void spapr_numa_FORM2_affinity_init(SpaprMachineState *spapr)
> +{
> +    int i;
> +
> +    /*
> +     * For all resources but CPUs, FORM2 associativity arrays will
> +     * be a size 2 array with the following format:
> +     *
> +     * ibm,associativity = {1, numa_id}
> +     *
> +     * CPUs will write an additional 'vcpu_id' on top of the arrays
> +     * being initialized here. 'numa_id' is represented by the
> +     * index 'i' of the loop.
> +     *
> +     * Given that this initialization is also valid for GPU associativity
> +     * arrays, handle everything in one single step by populating the
> +     * arrays up to NUMA_NODES_MAX_NUM.
> +     */
> +    for (i = 0; i < NUMA_NODES_MAX_NUM; i++) {
> +        spapr->FORM2_assoc_array[i] = g_new0(uint32_t, 2);

Same remark as with FORM1 for dynamic allocation.

> +        spapr->FORM2_assoc_array[i][0] = cpu_to_be32(1);
> +        spapr->FORM2_assoc_array[i][1] = cpu_to_be32(i);
> +    }
> +}
> +
>  void spapr_numa_associativity_init(SpaprMachineState *spapr,
>                                     MachineState *machine)
>  {
>      spapr_numa_FORM1_affinity_init(spapr, machine);
> +    spapr_numa_FORM2_affinity_init(spapr);
>  
>      /*
>       * Default to FORM1 affinity until CAS. We'll call affinity_reset()
> @@ -326,7 +368,11 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>  void spapr_numa_associativity_reset(SpaprMachineState *spapr,
>                                      bool post_CAS_check)
>  {
> -    /* No FORM2 affinity implemented yet */
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr->numa_assoc_array = spapr->FORM2_assoc_array;
> +        return;
> +    }
> +
>      spapr->numa_assoc_array = spapr->FORM1_assoc_array;
>  
>      if (post_CAS_check) {
> @@ -471,6 +517,100 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>                       maxdomains, sizeof(maxdomains)));
>  }
>  
> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> +                                               void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
> +    g_autofree uint32_t *lookup_index_table = NULL;
> +    g_autofree uint32_t *distance_table = NULL;
> +    int src, dst, i, distance_table_size;
> +    uint8_t *node_distances;
> +
> +    /*
> +     * ibm,numa-lookup-index-table: array with length and a
> +     * list of NUMA ids present in the guest.
> +     */
> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        lookup_index_table[i + 1] = cpu_to_be32(i);
> +    }
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
> +                     lookup_index_table,
> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
> +
> +    /*
> +     * ibm,numa-distance-table: contains all node distances. First
> +     * element is the size of the table as uint32, followed up
> +     * by all the uint8 distances from the first NUMA node, then all
> +     * distances from the second NUMA node and so on.
> +     *
> +     * ibm,numa-lookup-index-table is used by guest to navigate this
> +     * array because NUMA ids can be sparse (node 0 is the first,
> +     * node 8 is the second ...).
> +     */
> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
> +    distance_table[0] = cpu_to_be32(distance_table_entries);
> +
> +    node_distances = (uint8_t *)&distance_table[1];
> +    i = 0;
> +
> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            node_distances[i++] = numa_info[src].distance[dst];
> +        }
> +    }
> +
> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
> +                          sizeof(uint32_t);
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
> +                     distance_table, distance_table_size));
> +}
> +
> +/*
> + * This helper could be compressed in a single function with
> + * FORM1 logic since we're setting the same DT values, with the
> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
> + * in the end. The separation was made to avoid clogging FORM1 code
> + * which already has to deal with compat modes from previous
> + * QEMU machine types.
> + */

Well... indeed both functions populate the same properties in the DT,
but that's about the only thing FORM1 and FORM2 have in common. The
contents are different and so are the ways they are computed. Those
differences are reason enough to have distinct implementations.

Not sure it is worth adding a comment that basically says "we could
have come up with a works-for-all clogged helper but we decided to
go for separate and easier code paths". This is already clear from
the resulting code.

Also to make this even more obvious you could open-code
spapr_numa_FORM2_write_rtas_tables() in spapr_numa_FORM2_write_rtas_dt()
since it is its only user.

> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
> +                                           void *fdt, int rtas)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
> +
> +    /*
> +     * In FORM2, ibm,associativity-reference-points will point to
> +     * the element in the ibm,associativity array that contains the
> +     * primary domain index (for FORM2, the first element).
> +     *
> +     * This value (in our case, the numa-id) is then used as an index
> +     * to retrieve all other attributes of the node (distance,
> +     * bandwidth, latency) via ibm,numa-lookup-index-table and other
> +     * ibm,numa-*-table properties.
> +     */
> +    uint32_t refpoints[] = { cpu_to_be32(1) };
> +
> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
> +    uint32_t maxdomains[] = { cpu_to_be32(1), cpu_to_be32(maxdomain) };
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> +                     refpoints, sizeof(refpoints)));
> +
> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> +                     maxdomains, sizeof(maxdomains)));
> +
> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
> +}
> +
>  /*
>   * Helper that writes ibm,associativity-reference-points and
>   * max-associativity-domains in the RTAS pointed by @rtas
> @@ -478,6 +618,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>   */
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
> +        return;
> +    }
> +
>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>  }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 2554928250..2a4502b65d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -133,6 +133,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    bool pre_6_2_numa_affinity;
>  
>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio,
> @@ -238,6 +239,7 @@ struct SpaprMachineState {
>      SpaprTpmProxy *tpm_proxy;
>  
>      uint32_t *FORM1_assoc_array[NUMA_NODES_MAX_NUM];
> +    uint32_t *FORM2_assoc_array[NUMA_NODES_MAX_NUM];
>      uint32_t **numa_assoc_array;
>  
>      Error *fwnmi_migration_blocker;
> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
> index 48b716a060..c3e8b98e7e 100644
> --- a/include/hw/ppc/spapr_ovec.h
> +++ b/include/hw/ppc/spapr_ovec.h
> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
>  /* option vector 5 */
>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
>  #define OV5_HP_EVT              OV_BIT(6, 5)
>  #define OV5_HPT_RESIZE          OV_BIT(6, 7)
>  #define OV5_DRMEM_V2            OV_BIT(22, 0)



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset()
  2021-09-14 11:55   ` Greg Kurz
@ 2021-09-14 19:58     ` Daniel Henrique Barboza
  2021-09-16  1:32       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-14 19:58 UTC (permalink / raw)
  To: Greg Kurz; +Cc: qemu-ppc, qemu-devel, david



On 9/14/21 08:55, Greg Kurz wrote:
> On Fri, 10 Sep 2021 16:55:36 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> Introducing a new NUMA affinity, FORM2, requires a new mechanism to
>> switch between affinity modes after CAS. Also, we want FORM2 data
>> structures and functions to be completely separated from the existing
>> FORM1 code, allowing us to avoid adding new code that inherits the
>> existing complexity of FORM1.
>>
>> At the same time, it's also desirable to minimize the amount of changes
>> made in write_dt() functions that are used to write ibm,associativity of
>> the resources, RTAS artifacts and h_home_node_associativity. These
>> functions can work in the same way in both NUMA affinity modes, as long
>> as we use a similar data structure and parametrize it properly depending
>> on the affinity mode selected.
>>
>> This patch introduces spapr_numa_associativity_reset() to start this
>> process. This function will be used to switch to the chosen NUMA
>> affinity after CAS and after migrating the guest. To do that, the
>> existing 'numa_assoc_array' is renamed to 'FORM1_assoc_array' and will
>> hold FORM1 data that is populated at associativity_init().
>> 'numa_assoc_array' is now a pointer that can be switched between the
>> existing affinity arrays. We don't have FORM2 data structures yet, so
>> 'numa_assoc_array' will always point to 'FORM1_assoc_array'.
>>
>> We also take the precaution of pointing 'numa_assoc_array' to
>> 'FORM1_assoc_array' in associativity_init() time, before CAS, to not
>> change FORM1 availability for existing guests.
>>
>> A small change in spapr_numa_write_associativity_dt() is made to reflect
>> the fact that 'numa_assoc_array' is now a pointer and we must be
>> explicit with the size being written in the DT.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c              | 14 +++++++++++++
>>   hw/ppc/spapr_hcall.c        |  7 +++++++
>>   hw/ppc/spapr_numa.c         | 42 +++++++++++++++++++++++++++++--------
>>   include/hw/ppc/spapr.h      |  3 ++-
>>   include/hw/ppc/spapr_numa.h |  1 +
>>   5 files changed, 57 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index d39fd4e644..5afbb76cab 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1786,6 +1786,20 @@ static int spapr_post_load(void *opaque, int version_id)
>>           return err;
>>       }
>>   
>> +    /*
>> +     * NUMA affinity selection is made in CAS time. There is no reliable
>> +     * way of telling whether the guest already went through CAS before
>> +     * migration due to how spapr_ov5_cas_needed works: a FORM1 guest can
>> +     * be migrated with ov5_cas empty regardless of going through CAS
>> +     * first.
>> +     *
>> +     * One solution is to call numa_associativity_reset(). The downside
>> +     * is that a guest migrated before CAS will reset it again when going
>> +     * through it, but since it's a lightweight operation it's worth being
>> +     * a little redundant to be safe.
> 
> Also this isn't a hot path.
> 
>> +     */
>> +     spapr_numa_associativity_reset(spapr);
>> +
>>       return err;
>>   }
>>   
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 0e9a5b2e40..82ab92ddba 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -17,6 +17,7 @@
>>   #include "kvm_ppc.h"
>>   #include "hw/ppc/fdt.h"
>>   #include "hw/ppc/spapr_ovec.h"
>> +#include "hw/ppc/spapr_numa.h"
>>   #include "mmu-book3s-v3.h"
>>   #include "hw/mem/memory-device.h"
>>   
>> @@ -1197,6 +1198,12 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>       spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
>>       spapr_ovec_cleanup(ov1_guest);
>>   
>> +    /*
>> +     * Reset numa_assoc_array now that we know which NUMA affinity
>> +     * the guest will use.
>> +     */
>> +    spapr_numa_associativity_reset(spapr);
>> +
>>       /*
>>        * Ensure the guest asks for an interrupt mode we support;
>>        * otherwise terminate the boot.
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index fb6059550f..327952ba9e 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -97,7 +97,7 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>>        */
>>       for (i = 1; i < nb_numa_nodes; i++) {
>>           for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
>> -            spapr->numa_assoc_array[i][j] = cpu_to_be32(i);
>> +            spapr->FORM1_assoc_array[i][j] = cpu_to_be32(i);
>>           }
>>       }
>>   
>> @@ -149,8 +149,8 @@ static void spapr_numa_define_FORM1_domains(SpaprMachineState *spapr)
>>                * and going up to 0x1.
>>                */
>>               for (i = n_level; i > 0; i--) {
>> -                assoc_src = spapr->numa_assoc_array[src][i];
>> -                spapr->numa_assoc_array[dst][i] = assoc_src;
>> +                assoc_src = spapr->FORM1_assoc_array[src][i];
>> +                spapr->FORM1_assoc_array[dst][i] = assoc_src;
>>               }
>>           }
>>       }
>> @@ -167,6 +167,11 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>>       int nb_numa_nodes = machine->numa_state->num_nodes;
>>       int i, j, max_nodes_with_gpus;
>>   
>> +    /* init FORM1_assoc_array */
>> +    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
>> +        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
> 
> Why dynamic allocation since you have fixed size ?

If I use static allocation, the C compiler complains that a
uint32_t[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] array can't be assigned
to a uint32_t** pointer.

And given that the FORM2 array is a [MAX_NODES + NVGPU_MAX_NUM][2] array, the
way I worked around that here is to use dynamic allocation. Then C considers it
valid to use numa_assoc_array as a uint32_t** pointer for both the FORM1 and
FORM2 2D arrays. There might be a way of doing static allocation while keeping
the uint32_t** pointer as is, but I didn't find one. Tips welcome :D
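
Just to spell out the type clash (a minimal sketch; 'form1' and 'form2' here
are illustrative locals, not the actual struct fields):

    uint32_t form1[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
    uint32_t form2[MAX_NODES + NVGPU_MAX_NUM][2];

    uint32_t **p = form1;     /* error: form1 decays to
                               * uint32_t (*)[NUMA_ASSOC_SIZE],
                               * not to uint32_t ** */
    uint32_t (*q)[NUMA_ASSOC_SIZE] = form1;   /* OK, but the row length is
                                               * now part of q's type... */
    q = form2;                /* ...so this is an error too: form2's rows
                               * only have 2 elements */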

An alternative that I considered, without the need for this dynamic allocation hack,
is to make both the FORM1 and FORM2 data structures the same size (i.e.
a [MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] uint32_t array), and then
numa_assoc_array could be a pointer of the same array type for both. Since we're
controlling the FORM1 and FORM2 sizes separately inside the functions, this would
work. The downside is that the FORM2 2D array would be bigger than necessary.


I don't have strong opinions about which way to do it, so I'm all ears.


Thanks,


Daniel



> 
>> +    }
>> +
>>       /*
>>        * For all associativity arrays: first position is the size,
>>        * position MAX_DISTANCE_REF_POINTS is always the numa_id,
>> @@ -177,8 +182,8 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>>        * 'i' will be a valid node_id set by the user.
>>        */
>>       for (i = 0; i < nb_numa_nodes; i++) {
>> -        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>> -        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>> +        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>       }
>>   
>>       /*
>> @@ -192,15 +197,15 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>>       max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM;
>>   
>>       for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) {
>> -        spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>> +        spapr->FORM1_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
>>   
>>           for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) {
>>               uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ?
>>                                    SPAPR_GPU_NUMA_ID : cpu_to_be32(i);
>> -            spapr->numa_assoc_array[i][j] = gpu_assoc;
>> +            spapr->FORM1_assoc_array[i][j] = gpu_assoc;
>>           }
>>   
>> -        spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>> +        spapr->FORM1_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
>>       }
>>   
>>       /*
>> @@ -227,14 +232,33 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>                                      MachineState *machine)
>>   {
>>       spapr_numa_FORM1_affinity_init(spapr, machine);
>> +
>> +    /*
>> +     * Default to FORM1 affinity until CAS. We'll call affinity_reset()
>> +     * during CAS when we're sure about which NUMA affinity the guest
>> +     * is going to use.
>> +     *
>> +     * This step is a failsafe - guests in the wild were able to read
>> +     * FORM1 affinity info before CAS for a long time. Since affinity_reset()
>> +     * is just a pointer switch between data that was already populated
>> +     * here, this is an acceptable overhead to be on the safer side.
>> +     */
>> +    spapr->numa_assoc_array = spapr->FORM1_assoc_array;
> 
> The right way to do that is to call spapr_numa_associativity_reset() from
> spapr_machine_reset() because we want to revert to FORM1 each time the
> guest is rebooted.
> 
>> +}
>> +
>> +void spapr_numa_associativity_reset(SpaprMachineState *spapr)
>> +{
>> +    /* No FORM2 affinity implemented yet */
> 
> This seems quite obvious at this point, not sure the comment helps.
> 
>> +    spapr->numa_assoc_array = spapr->FORM1_assoc_array;
>>   }
>>   
>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>>                                          int offset, int nodeid)
>>   {
>> +    /* Hardcode the size of FORM1 associativity array for now */
>>       _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
>>                         spapr->numa_assoc_array[nodeid],
>> -                      sizeof(spapr->numa_assoc_array[nodeid]))));
>> +                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
> 
> Rather than doing this temporary change that gets undone in
> a later patch, I suggest you introduce get_numa_assoc_size()
> in a preliminary patch and use it here already :
> 
> -                      NUMA_ASSOC_SIZE * sizeof(uint32_t))));
> +                      get_numa_assoc_size(spapr) * sizeof(uint32_t))));
> 
> This will simplify the reviewing.
> 
>>   }
>>   
>>   static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 637652ad16..8a9490f0bf 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -249,7 +249,8 @@ struct SpaprMachineState {
>>       unsigned gpu_numa_id;
>>       SpaprTpmProxy *tpm_proxy;
>>   
>> -    uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>> +    uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM];
> 
> As said above, I really don't see the point in not having :
> 
>      uint32_t *FORM1_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
> 
>> +    uint32_t **numa_assoc_array;
>>   
>>       Error *fwnmi_migration_blocker;
>>   };
>> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
>> index 6f9f02d3de..ccf3e4eae8 100644
>> --- a/include/hw/ppc/spapr_numa.h
>> +++ b/include/hw/ppc/spapr_numa.h
>> @@ -24,6 +24,7 @@
>>    */
>>   void spapr_numa_associativity_init(SpaprMachineState *spapr,
>>                                      MachineState *machine);
>> +void spapr_numa_associativity_reset(SpaprMachineState *spapr);
>>   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
>>   void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
>>                                          int offset, int nodeid);
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset()
  2021-09-14 19:58     ` Daniel Henrique Barboza
@ 2021-09-16  1:32       ` Daniel Henrique Barboza
  2021-09-16 17:31         ` Greg Kurz
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Henrique Barboza @ 2021-09-16  1:32 UTC (permalink / raw)
  To: Greg Kurz; +Cc: qemu-ppc, qemu-devel, david

Greg,


On 9/14/21 16:58, Daniel Henrique Barboza wrote:
> 
> 
> On 9/14/21 08:55, Greg Kurz wrote:
>> On Fri, 10 Sep 2021 16:55:36 -0300
>> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>>

[...]


>>>       }
>>> @@ -167,6 +167,11 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
>>>       int nb_numa_nodes = machine->numa_state->num_nodes;
>>>       int i, j, max_nodes_with_gpus;
>>> +    /* init FORM1_assoc_array */
>>> +    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
>>> +        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
>>
>> Why dynamic allocation since you have fixed size ?
> 
> If I use static allocation, the C compiler complains that a
> uint32_t[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] array can't be assigned
> to a uint32_t** pointer.
> 
> And given that the FORM2 array is a [MAX_NODES + NVGPU_MAX_NUM][2] array, the
> way I worked around that here is to use dynamic allocation. Then C considers it
> valid to use numa_assoc_array as a uint32_t** pointer for both the FORM1 and
> FORM2 2D arrays. There might be a way of doing static allocation while keeping
> the uint32_t** pointer as is, but I didn't find one. Tips welcome :D
> 
> An alternative that I considered, without the need for this dynamic allocation hack,
> is to make both the FORM1 and FORM2 data structures the same size (i.e.
> a [MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] uint32_t array), and then
> numa_assoc_array could be a pointer of the same array type for both. Since we're
> controlling the FORM1 and FORM2 sizes separately inside the functions, this would
> work. The downside is that the FORM2 2D array would be bigger than necessary.

I took a look and didn't find any neat way of doing a pointer switch
between 2 static arrays without either dynamically allocating the pointer
and then g_memdup'ing the chosen array into it, or making all arrays the
same size (i.e. numa_assoc_array would also be a statically allocated array
with the FORM1 size) and then memcpy()'ing the chosen mode into it.

I just posted a new version in which I'm no longer relying on 'numa_assoc_array'.
Instead, the DT functions use a get_associativity() helper to retrieve the
current associativity array based on the CAS choice, in the same manner as is
already done with the array sizes. This also allowed us to get rid of
associativity_reset().
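
Roughly, the idea is something like this (just a sketch, not the exact code
from the new version; the FORM1_/FORM2_NUMA_ASSOC_SIZE names are placeholders,
and I'm assuming the FORM2 bit is tested in ov5_cas like the other OV5 bits):

    static const uint32_t *get_associativity(SpaprMachineState *spapr,
                                             int node_id)
    {
        /* pick the array that matches what the guest negotiated at CAS */
        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
            return spapr->FORM2_assoc_array[node_id];
        }
        return spapr->FORM1_assoc_array[node_id];
    }

    static int get_numa_assoc_size(SpaprMachineState *spapr)
    {
        /* same check, returning the per-node array length in uint32_t cells */
        return spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY) ?
               FORM2_NUMA_ASSOC_SIZE : FORM1_NUMA_ASSOC_SIZE;
    }

so spapr_numa_write_associativity_dt() would boil down to something like:

    _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
                      get_associativity(spapr, nodeid),
                      get_numa_assoc_size(spapr) * sizeof(uint32_t))));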


Thanks,


Daniel





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset()
  2021-09-16  1:32       ` Daniel Henrique Barboza
@ 2021-09-16 17:31         ` Greg Kurz
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Kurz @ 2021-09-16 17:31 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, david

On Wed, 15 Sep 2021 22:32:13 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Greg,
> 
> 
> On 9/14/21 16:58, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 9/14/21 08:55, Greg Kurz wrote:
> >> On Fri, 10 Sep 2021 16:55:36 -0300
> >> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> >>
> 
> [...]
> 
> 
> >>>       }
> >>> @@ -167,6 +167,11 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
> >>>       int nb_numa_nodes = machine->numa_state->num_nodes;
> >>>       int i, j, max_nodes_with_gpus;
> >>> +    /* init FORM1_assoc_array */
> >>> +    for (i = 0; i < MAX_NODES + NVGPU_MAX_NUM; i++) {
> >>> +        spapr->FORM1_assoc_array[i] = g_new0(uint32_t, NUMA_ASSOC_SIZE);
> >>
> >> Why dynamic allocation since you have fixed size ?
> > 
> > If I use static allocation, the C compiler complains that a
> > uint32_t[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] array can't be assigned
> > to a uint32_t** pointer.
> > 
> > And given that the FORM2 array is a [MAX_NODES + NVGPU_MAX_NUM][2] array, the
> > way I worked around that here is to use dynamic allocation. Then C considers it
> > valid to use numa_assoc_array as a uint32_t** pointer for both the FORM1 and
> > FORM2 2D arrays. There might be a way of doing static allocation while keeping
> > the uint32_t** pointer as is, but I didn't find one. Tips welcome :D
> > 
> > An alternative that I considered, without the need for this dynamic allocation hack,
> > is to make both the FORM1 and FORM2 data structures the same size (i.e.
> > a [MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE] uint32_t array), and then
> > numa_assoc_array could be a pointer of the same array type for both. Since we're
> > controlling the FORM1 and FORM2 sizes separately inside the functions, this would
> > work. The downside is that the FORM2 2D array would be bigger than necessary.
> 
> I took a look and didn't find any neat way of doing a pointer switch
> between 2 static arrays without either dynamically allocating the pointer
> and then g_memdup'ing the chosen array into it, or making all arrays the
> same size (i.e. numa_assoc_array would also be a statically allocated array
> with the FORM1 size) and then memcpy()'ing the chosen mode into it.
> 
> I just posted a new version in which I'm no longer relying on 'numa_assoc_array'.
> Instead, the DT functions use a get_associativity() helper to retrieve the
> current associativity array based on the CAS choice, in the same manner as is
> already done with the array sizes. This also allowed us to get rid of
> associativity_reset().
> 


Hi Daniel,

I've only just started looking at the new version, but it does seem
better indeed. Now that KVM Forum 2021 and all of its captivating talks
are over :-), I'll try to find some time to review.

Cheers,

--
Greg




^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-16 17:33 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-10 19:55 [PATCH v6 0/6] pSeries FORM2 affinity support Daniel Henrique Barboza
2021-09-10 19:55 ` [PATCH v6 1/6] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
2021-09-14  8:23   ` Greg Kurz
2021-09-10 19:55 ` [PATCH v6 2/6] spapr_numa.c: scrap 'legacy_numa' concept Daniel Henrique Barboza
2021-09-14  8:34   ` Greg Kurz
2021-09-10 19:55 ` [PATCH v6 3/6] spapr: introduce spapr_numa_associativity_reset() Daniel Henrique Barboza
2021-09-14 11:55   ` Greg Kurz
2021-09-14 19:58     ` Daniel Henrique Barboza
2021-09-16  1:32       ` Daniel Henrique Barboza
2021-09-16 17:31         ` Greg Kurz
2021-09-10 19:55 ` [PATCH v6 4/6] spapr_numa.c: parametrize FORM1 macros Daniel Henrique Barboza
2021-09-14 12:10   ` Greg Kurz
2021-09-10 19:55 ` [PATCH v6 5/6] spapr: move FORM1 verifications to post CAS Daniel Henrique Barboza
2021-09-14 12:26   ` Greg Kurz
2021-09-10 19:55 ` [PATCH v6 6/6] spapr_numa.c: FORM2 NUMA affinity support Daniel Henrique Barboza
2021-09-14 12:58   ` Greg Kurz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.