qemu-devel.nongnu.org archive mirror
* [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support
@ 2021-06-15  1:33 Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 1/8] spapr: move NUMA data init to do_client_architecture_support() Daniel Henrique Barboza
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

This RFC series implements FORM2 NUMA associativity support in
the pSeries machine. This new associativity format is going to
be added to the LoPAR spec in the near future. For now, a preview
of the specification can be found in Aneesh's kernel-side patches
that implement this support, especially the documentation patch [2].

For QEMU, the most drastic change FORM2 brings is that, at long
last, we're free from the shackles of an overcomplicated and bloated
way of calculating NUMA distances. This new affinity format separates
performance metrics such as distance, latency and bandwidth from the
ibm,associativity arrays of the devices. It also allows for
asymmetric NUMA configurations.

FORM2 is indicated by bit 2 of byte 5 of ibm,architecture-vec-5. This
means that the guest is able to choose between FORM1 and FORM2 during
CAS, and we need to adapt the NUMA internals accordingly based on this
choice. Patches 1 to 5 implement the base FORM2 support in the pSeries
machine.

Patches 6-8 deal with NVDIMM changes. FORM2 allows NVDIMMs to declare
an extra NUMA node, called 'device-node', to support their use as
persistent memory. 'device-node' is locality based and can be different
from the NUMA node that the NVDIMM belongs to when used as regular
memory.


With this series and Aneesh's guest kernel from [1], this is the
'numactl -H' output of this guest:

-----

sudo ppc64-softmmu/qemu-system-ppc64 \
-machine pseries,accel=kvm,usb=off,dump-guest-core=off \
-m size=14G,slots=256,maxmem=256G -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \
(...)
-object memory-backend-ram,id=mem0,size=4G -numa node,memdev=mem0,cpus=0-1,nodeid=0 \
-object memory-backend-ram,id=mem1,size=4G -numa node,memdev=mem1,cpus=2-3,nodeid=1 \
-object memory-backend-ram,id=mem2,size=4G -numa node,memdev=mem2,cpus=4-5,nodeid=2 \
-object memory-backend-ram,id=mem3,size=2G -numa node,memdev=mem3,cpus=6-7,nodeid=3 \
-numa dist,src=0,dst=1,val=22 -numa dist,src=0,dst=2,val=22 -numa dist,src=0,dst=3,val=22 \
-numa dist,src=1,dst=0,val=44 -numa dist,src=1,dst=2,val=44 -numa dist,src=1,dst=3,val=44 \
-numa dist,src=2,dst=0,val=66 -numa dist,src=2,dst=1,val=66 -numa dist,src=2,dst=3,val=66 \
-numa dist,src=3,dst=0,val=88 -numa dist,src=3,dst=1,val=88 -numa dist,src=3,dst=2,val=88 


# numactl -H 
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 3987 MB
node 0 free: 3394 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 4073 MB
node 2 cpus: 4 5
node 2 size: 4090 MB
node 2 free: 4072 MB
node 3 cpus: 6 7
node 3 size: 2027 MB
node 3 free: 2012 MB
node distances:
node   0   1   2   3 
  0:  10  22  22  22 
  1:  44  10  44  44 
  2:  66  66  10  66 
  3:  88  88  88  10 


The exact user-provided NUMA distances are reflected in the kernel,
without the approximations we have to make for FORM1.
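As a sanity check, the matrix printed by 'numactl -H' above can be
reproduced directly from the '-numa dist' arguments in the command line,
since FORM2 forwards the user values unchanged. This is just an
illustrative model of that mapping, not QEMU code:

```python
# Rebuild the guest-visible node distance matrix from the
# '-numa dist,src=...,dst=...,val=...' arguments used above.
# The diagonal is the conventional local distance, 10.
def build_distance_matrix(num_nodes, dist_args):
    matrix = [[10 if s == d else 0 for d in range(num_nodes)]
              for s in range(num_nodes)]
    for src, dst, val in dist_args:
        matrix[src][dst] = val
    return matrix

# The twelve dist arguments from the command line above.
dist_args = ([(0, d, 22) for d in (1, 2, 3)] +
             [(1, d, 44) for d in (0, 2, 3)] +
             [(2, d, 66) for d in (0, 1, 3)] +
             [(3, d, 88) for d in (0, 1, 2)])

matrix = build_distance_matrix(4, dist_args)
for row in matrix:
    print(row)
```

The output matches the asymmetric 'node distances' table reported by the
guest, which FORM1 could not have represented exactly.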


[1] https://lore.kernel.org/linuxppc-dev/20210614164003.196094-1-aneesh.kumar@linux.ibm.com/
[2] https://lore.kernel.org/linuxppc-dev/20210614164003.196094-8-aneesh.kumar@linux.ibm.com/


Daniel Henrique Barboza (8):
  spapr: move NUMA data init to do_client_architecture_support()
  spapr_numa.c: split FORM1 code into helpers
  spapr_numa.c: wait for CAS before writing rtas DT
  spapr_numa.c: base FORM2 NUMA affinity support
  spapr: simplify spapr_numa_associativity_init params
  nvdimm: add PPC64 'device-node' property
  spapr_numa, spapr_nvdimm: write secondary NUMA domain for nvdimms
  spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init()

 hw/mem/nvdimm.c             |  28 ++++
 hw/ppc/spapr.c              |  53 +++-----
 hw/ppc/spapr_hcall.c        |   4 +
 hw/ppc/spapr_numa.c         | 250 +++++++++++++++++++++++++++++++++---
 hw/ppc/spapr_nvdimm.c       |   3 +-
 include/hw/mem/nvdimm.h     |  12 ++
 include/hw/ppc/spapr_numa.h |   6 +-
 include/hw/ppc/spapr_ovec.h |   1 +
 8 files changed, 299 insertions(+), 58 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/8] spapr: move NUMA data init to do_client_architecture_support()
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 2/8] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

The pSeries machine will support a new NUMA affinity form, FORM2.
This new form will be negotiated via ibm,architecture-vec-5 during
CAS. All artifacts and assumptions currently in use for FORM1
affinity will be deprecated in a guest that chooses to use FORM2.

This means that we're going to wait until CAS to determine whether
we're going to use FORM1 or FORM2. This patch starts doing that
by moving spapr_numa_associativity_init() from spapr_machine_init()
to do_client_architecture_support().

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c       | 3 ---
 hw/ppc/spapr_hcall.c | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4dd90b75cc..dd57b30cff 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2790,9 +2790,6 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
-    /* Init numa_assoc_array */
-    spapr_numa_associativity_init(spapr, machine);
-
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index f25014afda..6d6ffcc92b 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -11,6 +11,7 @@
 #include "helper_regs.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_cpu_core.h"
+#include "hw/ppc/spapr_numa.h"
 #include "mmu-hash64.h"
 #include "cpu-models.h"
 #include "trace.h"
@@ -1197,6 +1198,9 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
     spapr_ovec_cleanup(ov1_guest);
 
+    /* Init numa_assoc_array */
+    spapr_numa_associativity_init(spapr, MACHINE(spapr));
+
     /*
      * Ensure the guest asks for an interrupt mode we support;
      * otherwise terminate the boot.
-- 
2.31.1




* [RFC PATCH 2/8] spapr_numa.c: split FORM1 code into helpers
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 1/8] spapr: move NUMA data init to do_client_architecture_support() Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT Daniel Henrique Barboza
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

The upcoming FORM2 NUMA affinity will support asymmetric NUMA topologies
and doesn't need to be concerned with all the legacy support for older
pseries FORM1 guests.

We're also not going to calculate associativity domains based on numa
distance (via spapr_numa_define_associativity_domains) since the
distances will be written directly into new DT properties.

Let's split FORM1 code into its own functions to allow for easier
insertion of FORM2 logic later on.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 61 +++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..04a86f9b5b 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -155,6 +155,32 @@ static void spapr_numa_define_associativity_domains(SpaprMachineState *spapr)
 
 }
 
+/*
+ * Set NUMA machine state data based on FORM1 affinity semantics.
+ */
+static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
+                                           MachineState *machine)
+{
+    bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
+
+    /*
+     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
+     * 1 NUMA node) will not benefit from anything we're going to do
+     * after this point.
+     */
+    if (using_legacy_numa) {
+        return;
+    }
+
+    if (!spapr_numa_is_symmetrical(machine)) {
+        error_report("Asymmetrical NUMA topologies aren't supported "
+                     "in the pSeries machine");
+        exit(EXIT_FAILURE);
+    }
+
+    spapr_numa_define_associativity_domains(spapr);
+}
+
 void spapr_numa_associativity_init(SpaprMachineState *spapr,
                                    MachineState *machine)
 {
@@ -210,22 +236,7 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    /*
-     * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-     * 1 NUMA node) will not benefit from anything we're going to do
-     * after this point.
-     */
-    if (using_legacy_numa) {
-        return;
-    }
-
-    if (!spapr_numa_is_symmetrical(machine)) {
-        error_report("Asymmetrical NUMA topologies aren't supported "
-                     "in the pSeries machine");
-        exit(EXIT_FAILURE);
-    }
-
-    spapr_numa_define_associativity_domains(spapr);
+    spapr_numa_FORM1_affinity_init(spapr, machine);
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -302,12 +313,8 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
-/*
- * Helper that writes ibm,associativity-reference-points and
- * max-associativity-domains in the RTAS pointed by @rtas
- * in the DT @fdt.
- */
-void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
 {
     MachineState *ms = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
@@ -365,6 +372,16 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
                      maxdomains, sizeof(maxdomains)));
 }
 
+/*
+ * Helper that writes ibm,associativity-reference-points and
+ * max-associativity-domains in the RTAS pointed by @rtas
+ * in the DT @fdt.
+ */
+void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
+{
+    spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
-- 
2.31.1




* [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 1/8] spapr: move NUMA data init to do_client_architecture_support() Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 2/8] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  4:02   ` David Gibson
  2021-06-15  1:33 ` [RFC PATCH 4/8] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

spapr_numa_write_rtas_dt() is called from spapr_dt_rtas(), which in
turn is called by spapr_build_fdt(). spapr_build_fdt() is called in
two places: spapr_machine_reset() and do_client_architecture_support().
When called from machine_reset() we're writing RTAS nodes with NUMA
artifacts without going through CAS first.

This is not an issue today because we always write the same thing in the
DT, since we support just FORM1 NUMA affinity. With the upcoming FORM2
support, we're now reliant on the guest's choice to decide what to write.

Instead of taking a guess (e.g. default to FORM1, switch to FORM2 if
guest chooses it), postpone the writing of
ibm,associativity-reference-points and ibm,max-associativity-domains
until we're sure what was negotiated with the guest.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 04a86f9b5b..e1a7f80076 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -379,6 +379,10 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
  */
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
+    if (spapr_ovec_empty(spapr->ov5_cas)) {
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
-- 
2.31.1




* [RFC PATCH 4/8] spapr_numa.c: base FORM2 NUMA affinity support
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-06-15  1:33 ` [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 5/8] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

The main feature of FORM2 affinity support is the separation of NUMA
distances from ibm,associativity information. This allows for a more
flexible and straightforward NUMA distance assignment without relying on
complex associations between several levels of NUMA via
ibm,associativity matches.

Another feature is its extensibility. This base support contains the
facilities for NUMA distance assignment. In the future more facilities
will be added for latency, performance, bandwidth and so on.

This patch implements the base FORM2 affinity support as follows:

- the use of FORM2 associativity is indicated by using bit 2 of byte 5
of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1
or FORM2 affinity. Setting both forms will default to FORM2. We rely on
CAS to decide what to write in the DT, so all writing is postponed until
client architecture support is done;

- ibm,associativity-reference-points has a new semantic. Instead of
being used to calculate distances via NUMA levels, the concept of
subdomain was introduced. The primary domain is the first element of the
array, secondary domain is the second element and so on. At this moment,
the only usage of this semantic is with NVDIMMs. This will be explained
further in the next patches;

- two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table
and ibm,numa-distance-table. The index table is used to list all the
NUMA logical domains of the platform, in ascending order, and allows for
sparse NUMA configurations (although QEMU doesn't support that at the
moment). ibm,numa-distance-table is an array that contains all the
distances from the first NUMA node to all other nodes, then the second
NUMA node's distances to all other nodes, and so on;

- spapr_post_load changes: since we're adding a new NUMA affinity that
isn't compatible with the existing one, migration must be handled
accordingly because we can't be certain whether the guest went
through CAS in the source. The solution chosen is to initialize the NUMA
associativity data in spapr_post_load() unconditionally. The worst case
would be to write the DT twice if the guest is in the pre-CAS stage.
Otherwise, we're making sure that a FORM1 guest will have
spapr->numa_assoc_array initialized with the proper information based on
user distances, something that we're not doing with FORM2.
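The layout of the two new RTAS properties can be sketched as follows.
This is an illustrative model of the encoding described above (a uint32
length prefix for each table, uint32 NUMA ids, uint8 distances in
row-major order, all big endian as in the DT), not the QEMU
implementation itself:

```python
import struct

def encode_lookup_index_table(numa_ids):
    # ibm,numa-lookup-index-table: uint32 count followed by one
    # uint32 per NUMA id, big endian.
    return struct.pack(f">{len(numa_ids) + 1}I", len(numa_ids), *numa_ids)

def encode_distance_table(distances):
    # ibm,numa-distance-table: uint32 number of entries followed by
    # uint8 distances, all distances from node 0 first, then node 1...
    flat = [d for row in distances for d in row]
    return struct.pack(">I", len(flat)) + bytes(flat)

ids = [0, 1, 2, 3]
dist = [[10, 22, 22, 22],
        [44, 10, 44, 44],
        [66, 66, 10, 66],
        [88, 88, 88, 10]]

lookup = encode_lookup_index_table(ids)
table = encode_distance_table(dist)
print(len(lookup), len(table))  # 20 20 for a 4-node topology
```

A guest with sparse node ids would use the lookup table to find the row
and column of a node pair before indexing into the distance bytes.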

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              |  17 +++++
 hw/ppc/spapr_numa.c         | 140 +++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr_ovec.h |   1 +
 3 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index dd57b30cff..c6def3690a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1770,6 +1770,22 @@ static int spapr_post_load(void *opaque, int version_id)
         return err;
     }
 
+    /*
+     * NUMA data init is made in CAS time. There is no reliable
+     * way of telling whether the guest already went through CAS
+     * in the source due to how spapr_ov5_cas_needed works: a
+     * FORM1 guest can be migrated with ov5_cas empty regardless
+     * of going through CAS first.
+     *
+     * One solution is to always call numa_associativity_init. The
+     * downside is that a guest migrated before CAS will run
+     * numa_associativity_init again when going through it, but
+     * at least we're making sure spapr->numa_assoc_array will be
+     * initialized and hotplug operations won't fail in both before
+     * and after CAS migration cases.
+     */
+     spapr_numa_associativity_init(spapr, MACHINE(spapr));
+
     return err;
 }
 
@@ -2733,6 +2749,7 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
+    spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
 
     /* advertise support for dedicated HP event source to guests */
     if (spapr->use_hotplug_event_source) {
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index e1a7f80076..c8fd66b53a 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -202,6 +202,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
 
+        /*
+         * For FORM2 affinity the initialization above is enough. No
+         * need to fill non-zero NUMA nodes with node_id because
+         * there is no associativity domain match to calculate
+         * NUMA distances in FORM2.
+         */
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+            continue;
+        }
+
         /*
          * Fill all associativity domains of non-zero NUMA nodes with
          * node_id. This is required because the default value (0) is
@@ -236,7 +246,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
         spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
     }
 
-    spapr_numa_FORM1_affinity_init(spapr, machine);
+    /*
+     * We test for !FORM2 instead of testing for FORM1 because,
+     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
+     * to get ov5_cas migrated, but setting FORM2 is. Since we're
+     * dealing with either FORM1 or FORM2, test for the option
+     * that is guaranteed to be set after a migration.
+     */
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM1_affinity_init(spapr, machine);
+    }
 }
 
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
@@ -313,6 +332,120 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
     return ret;
 }
 
+static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
+                                               void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
+    g_autofree uint32_t *lookup_index_table = NULL;
+    g_autofree uint32_t *distance_table = NULL;
+    int src, dst, i, distance_table_size;
+    uint8_t *node_distances;
+
+    /*
+     * ibm,numa-lookup-index-table: array with length and a
+     * list of NUMA ids present in the guest.
+     */
+    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
+    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        lookup_index_table[i + 1] = cpu_to_be32(i);
+    }
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
+                     lookup_index_table,
+                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
+
+    /*
+     * ibm,numa-distance-table: contains all node distances. First
+     * element is the size of the table as uint32, followed up
+     * by all the uint8 distances from the first NUMA node, then all
+     * distances from the second NUMA node and so on.
+     *
+     * ibm,numa-lookup-index-table is used by guest to navigate this
+     * array because NUMA ids can be sparse (node 0 is the first,
+     * node 8 is the second ...).
+     */
+    distance_table = g_new0(uint32_t, distance_table_entries + 1);
+    distance_table[0] = cpu_to_be32(distance_table_entries);
+
+    node_distances = (uint8_t *)&distance_table[1];
+    i = 0;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            node_distances[i++] = numa_info[src].distance[dst];
+        }
+    }
+
+    distance_table_size = distance_table_entries * sizeof(uint8_t) +
+                          sizeof(uint32_t);
+    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
+                     distance_table, distance_table_size));
+}
+
+/*
+ * This helper could be compressed in a single function with
+ * FORM1 logic since we're setting the same DT values, with the
+ * difference being a call to spapr_numa_FORM2_write_rtas_tables()
+ * in the end. The separation was made to avoid clogging FORM1 code
+ * which already has to deal with compat modes from previous
+ * QEMU machine types.
+ */
+static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
+                                           void *fdt, int rtas)
+{
+    MachineState *ms = MACHINE(spapr);
+    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
+                                   spapr_numa_initial_nvgpu_numa_id(ms);
+
+    /*
+     * From the kernel docs: "With Form2 the primary domainID and
+     * secondary domainID are used to identify the NUMA nodes
+     * the kernel should use when using persistent memory devices."
+     *
+     * Persistent memory devices, such as spapr-nvdimm, will use the
+     * primary domainID to indicate the NUMA node number the guest
+     * OS should use when using the device as regular memory. The
+     * secondary domainID indicates the numa node number that should
+     * be used when the device is used as persistent memory.
+     *
+     * FORM2 specification allows for further subdomains such as
+     * tertiary and quaternary. To avoid having to update
+     * ibm,associativity-reference-points every time a new subdomain
+     * level is going to be used by the kernel, let's set all available
+     * subdomains QEMU is willing to support without increasing
+     * MAX_DISTANCE_REF_POINTS.
+     */
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x3),
+        cpu_to_be32(0x2),
+        cpu_to_be32(0x1),
+    };
+
+    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
+    uint32_t maxdomains[] = {
+        cpu_to_be32(4),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain),
+        cpu_to_be32(maxdomain)
+    };
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
+                     refpoints, nr_refpoints * sizeof(refpoints[0])));
+
+    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
+                     maxdomains, sizeof(maxdomains)));
+
+    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
+}
+
 static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
                                            void *fdt, int rtas)
 {
@@ -383,6 +516,11 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
         return;
     }
 
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
+        return;
+    }
+
     spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
 }
 
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 48b716a060..c3e8b98e7e 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
 /* option vector 5 */
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
+#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
 #define OV5_HP_EVT              OV_BIT(6, 5)
 #define OV5_HPT_RESIZE          OV_BIT(6, 7)
 #define OV5_DRMEM_V2            OV_BIT(22, 0)
-- 
2.31.1




* [RFC PATCH 5/8] spapr: simplify spapr_numa_associativity_init params
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2021-06-15  1:33 ` [RFC PATCH 4/8] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

When spapr_numa_associativity_init() was introduced it was being
called from spapr_machine_init(), where we have pointers to both a
SpaprMachineState and a MachineState. Having both passed as
parameters spared us from calling a macro to get the MachineState.

Previous patches moved the function away from spapr_machine_init(),
and the new locations don't have a pointer to MachineState ready.
This resulted in MACHINE(spapr) macro calls as the second parameter
in both callers.

Simplify the function by folding the 'MACHINE(spapr)' macro into the
function body, which can now receive only a pointer to
SpaprMachineState.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 2 +-
 hw/ppc/spapr_hcall.c        | 2 +-
 hw/ppc/spapr_numa.c         | 4 ++--
 include/hw/ppc/spapr_numa.h | 3 +--
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c6def3690a..c70b8b2f44 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1784,7 +1784,7 @@ static int spapr_post_load(void *opaque, int version_id)
      * initialized and hotplug operations won't fail in both before
      * and after CAS migration cases.
      */
-     spapr_numa_associativity_init(spapr, MACHINE(spapr));
+     spapr_numa_associativity_init(spapr);
 
     return err;
 }
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 6d6ffcc92b..73aca0aed6 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1199,7 +1199,7 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
     spapr_ovec_cleanup(ov1_guest);
 
     /* Init numa_assoc_array */
-    spapr_numa_associativity_init(spapr, MACHINE(spapr));
+    spapr_numa_associativity_init(spapr);
 
     /*
      * Ensure the guest asks for an interrupt mode we support;
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index c8fd66b53a..b14dd543c8 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -181,9 +181,9 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
     spapr_numa_define_associativity_domains(spapr);
 }
 
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine)
+void spapr_numa_associativity_init(SpaprMachineState *spapr)
 {
+    MachineState *machine = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int i, j, max_nodes_with_gpus;
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index 6f9f02d3de..adaec8e163 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -22,8 +22,7 @@
  * function. spapr_machine_init() is the only caller for it, and
  * it has both pointers resolved already.
  */
-void spapr_numa_associativity_init(SpaprMachineState *spapr,
-                                   MachineState *machine);
+void spapr_numa_associativity_init(SpaprMachineState *spapr);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
-- 
2.31.1




* [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (4 preceding siblings ...)
  2021-06-15  1:33 ` [RFC PATCH 5/8] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  9:33   ` Aneesh Kumar K.V
  2021-06-15  1:33 ` [RFC PATCH 7/8] spapr_numa, spapr_nvdimm: write secondary NUMA domain for nvdimms Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 8/8] spapr: move memory/cpu less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza
  7 siblings, 1 reply; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shivaprasad G Bhat, aneesh.kumar, Daniel Henrique Barboza, groug,
	qemu-ppc, Igor Mammedov, david

The spapr-nvdimm driver is able to operate in two ways. The first
one is as regular memory, in which case the NUMA node of the parent
pc-dimm class is used. The second mode, as persistent memory, allows for
a different NUMA node to be used based on the locality of the device.

At this moment we don't have a way to express this second NUMA node for
the persistent memory mode. This patch introduces a new 'device-node'
property that will be used by the PPC64 spapr-nvdimm driver to set a
second NUMA node for the nvdimm.
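As an illustration, the new property could be used on the command line
roughly like this. This is a hypothetical sketch based on the property
added by this patch; the backend path and sizes are placeholders, and
the exact device options on a given QEMU build may differ:

```shell
# Hypothetical usage sketch: the NVDIMM sits in NUMA node 0 when used
# as regular memory ('node'), while 'device-node=1' would select node 1
# when the device is used as persistent memory.
qemu-system-ppc64 -machine pseries \
    -m 8G,slots=2,maxmem=16G \
    -object memory-backend-file,id=nvmem0,mem-path=/tmp/nv0,size=2G \
    -device nvdimm,id=nvdimm0,memdev=nvmem0,node=0,device-node=1
```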

CC: Shivaprasad G Bhat <sbhat@linux.ibm.com>
CC: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/mem/nvdimm.c         | 28 ++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h | 12 ++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..030ccf7575 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -96,6 +96,29 @@ static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
     g_free(value);
 }
 
+static void nvdimm_get_device_node(Object *obj, Visitor *v,
+                                         const char *name, void *opaque,
+                                         Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    uint8_t value = nvdimm->device_node;
+
+    visit_type_uint8(v, name, &value, errp);
+}
+
+static void nvdimm_set_device_node(Object *obj, Visitor *v,
+                                         const char *name, void *opaque,
+                                         Error **errp)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(obj);
+    uint8_t value;
+
+    if (!visit_type_uint8(v, name, &value, errp)) {
+        return;
+    }
+
+    nvdimm->device_node = value;
+}
 
 static void nvdimm_init(Object *obj)
 {
@@ -105,6 +128,11 @@ static void nvdimm_init(Object *obj)
 
     object_property_add(obj, NVDIMM_UUID_PROP, "QemuUUID", nvdimm_get_uuid,
                         nvdimm_set_uuid, NULL, NULL);
+
+    object_property_add(obj, NVDIMM_DEVICE_NODE, "uint8",
+                        nvdimm_get_device_node,
+                        nvdimm_set_device_node,
+                        NULL, NULL);
 }
 
 static void nvdimm_finalize(Object *obj)
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..430169322f 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -49,6 +49,7 @@
 OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
+#define NVDIMM_DEVICE_NODE  "device-node"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
 
@@ -89,6 +90,17 @@ struct NVDIMMDevice {
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
     QemuUUID uuid;
+
+    /*
+     * The spapr-nvdimm (PPC64 NVDIMM) driver supports two modes of
+     * operation: regular memory and persistent memory. When using the
+     * device as regular memory, the NUMA nodeid that comes from
+     * PC_DIMM_NODEPROP is to be used. When used as persistent memory,
+     * the guest should consider the 'device-node' instead since it
+     * indicates the locality of the device to an established NUMA
+     * node, which is more relevant to this type of usage.
+     */
+    uint8_t device_node;
 };
 
 struct NVDIMMClass {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 7/8] spapr_numa, spapr_nvdimm: write secondary NUMA domain for nvdimms
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (5 preceding siblings ...)
  2021-06-15  1:33 ` [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  2021-06-15  1:33 ` [RFC PATCH 8/8] spapr: move memory/cpu-less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Shivaprasad G Bhat, aneesh.kumar, Daniel Henrique Barboza, groug,
	qemu-ppc, david

Write the new 'device-node' property into the NVDIMM device tree node
to set a secondary NUMA domain for the persistent memory operation mode.

Note that this is only available in FORM2 affinity. FORM1 affinity
NVDIMMs aren't affected by this change.
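Illustratively, assuming QEMU's five-cell associativity array (domain
count in the first cell, the pc-dimm NUMA node in the last cell, and
the secondary domain at index 0x3 as in the hunk below), the resulting
property for a FORM2 guest would look roughly like:

```dts
/* Sketch only: DEVICE_NODE and NODE stand for the respective node ids. */
ibm,associativity = <0x4 0x0 0x0 DEVICE_NODE NODE>;
```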

CC: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c         | 20 ++++++++++++++++++++
 hw/ppc/spapr_nvdimm.c       |  3 ++-
 include/hw/ppc/spapr_numa.h |  3 +++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index b14dd543c8..0cabb6757f 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -266,6 +266,26 @@ void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                       sizeof(spapr->numa_assoc_array[nodeid]))));
 }
 
+void spapr_numa_write_nvdimm_assoc_dt(SpaprMachineState *spapr, void *fdt,
+                                      int offset, int nodeid,
+                                      int device_node)
+{
+    uint32_t *nvdimm_assoc_array = spapr->numa_assoc_array[nodeid];
+
+    /*
+     * 'device-node' is the secondary domain for NVDIMMs when
+     * using FORM2. The secondary domain for FORM2 in QEMU
+     * is 0x3.
+     */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
+        nvdimm_assoc_array[0x3] = cpu_to_be32(device_node);
+    }
+
+    _FDT((fdt_setprop(fdt, offset, "ibm,associativity",
+                      nvdimm_assoc_array,
+                      sizeof(spapr->numa_assoc_array[nodeid]))));
+}
+
 static uint32_t *spapr_numa_get_vcpu_assoc(SpaprMachineState *spapr,
                                            PowerPCCPU *cpu)
 {
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 91de1052f2..49a05736fe 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -142,7 +142,8 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
     _FDT((fdt_setprop_string(fdt, child_offset, "compatible", "ibm,pmemory")));
     _FDT((fdt_setprop_string(fdt, child_offset, "device_type", "ibm,pmemory")));
 
-    spapr_numa_write_associativity_dt(spapr, fdt, child_offset, node);
+    spapr_numa_write_nvdimm_assoc_dt(spapr, fdt, child_offset, node,
+                                     nvdimm->device_node);
 
     buf = qemu_uuid_unparse_strdup(&nvdimm->uuid);
     _FDT((fdt_setprop_string(fdt, child_offset, "ibm,unit-guid", buf)));
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index adaec8e163..af25741e70 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -26,6 +26,9 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr);
 void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas);
 void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
                                        int offset, int nodeid);
+void spapr_numa_write_nvdimm_assoc_dt(SpaprMachineState *spapr, void *fdt,
+                                      int offset, int nodeid,
+                                      int device_node);
 int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
                             int offset, PowerPCCPU *cpu);
 int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 8/8] spapr: move memory/cpu-less check to spapr_numa_FORM1_affinity_init()
  2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
                   ` (6 preceding siblings ...)
  2021-06-15  1:33 ` [RFC PATCH 7/8] spapr_numa, spapr_nvdimm: write secondary NUMA domain for nvdimms Daniel Henrique Barboza
@ 2021-06-15  1:33 ` Daniel Henrique Barboza
  7 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15  1:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: aneesh.kumar, Daniel Henrique Barboza, qemu-ppc, groug, david

FORM2 NUMA affinity is prepared to deal with memory-less/CPU-less
NUMA nodes. This is used by the DAX KMEM driver to locate a
PAPR SCM device whose latency differs from that of the regular
memory NUMA node.

Since this restriction does not apply to FORM2, move the
verification into the FORM1 affinity code.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c      | 33 ---------------------------------
 hw/ppc/spapr_numa.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c70b8b2f44..e42bfd314b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2772,39 +2772,6 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
-    /*
-     * check we don't have a memory-less/cpu-less NUMA node
-     * Firmware relies on the existing memory/cpu topology to provide the
-     * NUMA topology to the kernel.
-     * And the linux kernel needs to know the NUMA topology at start
-     * to be able to hotplug CPUs later.
-     */
-    if (machine->numa_state->num_nodes) {
-        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
-            /* check for memory-less node */
-            if (machine->numa_state->nodes[i].node_mem == 0) {
-                CPUState *cs;
-                int found = 0;
-                /* check for cpu-less node */
-                CPU_FOREACH(cs) {
-                    PowerPCCPU *cpu = POWERPC_CPU(cs);
-                    if (cpu->node_id == i) {
-                        found = 1;
-                        break;
-                    }
-                }
-                /* memory-less and cpu-less node */
-                if (!found) {
-                    error_report(
-                       "Memory-less/cpu-less nodes are not supported (node %d)",
-                                 i);
-                    exit(1);
-                }
-            }
-        }
-
-    }
-
     spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
 
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 0cabb6757f..c7b0745455 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -162,6 +162,39 @@ static void spapr_numa_FORM1_affinity_init(SpaprMachineState *spapr,
                                            MachineState *machine)
 {
     bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
+    int i;
+
+    /*
+     * check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (machine->numa_state->num_nodes) {
+        for (i = 0; i < machine->numa_state->num_nodes; ++i) {
+            /* check for memory-less node */
+            if (machine->numa_state->nodes[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+                       "Memory-less/cpu-less nodes are not supported (node %d)",
+                                 i);
+                    exit(1);
+                }
+            }
+        }
+    }
 
     /*
      * Legacy NUMA guests (pseries-5.1 and older, or guests with only
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT
  2021-06-15  1:33 ` [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT Daniel Henrique Barboza
@ 2021-06-15  4:02   ` David Gibson
  2021-06-15 20:26     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 13+ messages in thread
From: David Gibson @ 2021-06-15  4:02 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: aneesh.kumar, qemu-ppc, qemu-devel, groug


On Mon, Jun 14, 2021 at 10:33:04PM -0300, Daniel Henrique Barboza wrote:
> spapr_numa_write_rtas_dt() is called from spapr_dt_rtas(), which in
> turn is called by spapr_build_fdt(). spapr_build_fdt() is called in
> two places: spapr_machine_reset() and do_client_architecture_support().
> When called in machine_reset() we're writing RTAS nodes with NUMA
> artifacts without going through CAS first.
> 
> This is not an issue because we always write the same thing in DT, since
> we support just FORM1 NUMA affinity. With the upcoming FORM2 support,
> we're now reliant on guest choice to decide what to write.
> 
> Instead of taking a guess (e.g. default to FORM1, switch to FORM2 if
> guest chooses it), postpone the writing of
> ibm,associativity-reference-points and ibm,max-associativity-domains
> until we're sure what was negotiated with the guest.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

I think it makes sense to fold this in with 1/8, moving the calculation
itself to after CAS as well.

This does make a (theoretical) functional change - it means that NUMA
information is not available before CAS, which it was before.  I think
that's very unlikely to break anything, but I wonder if we should make
it dependent on the machine version just to be safe.
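For what it's worth, this kind of machine-version gating usually boils
down to a compat flag in the machine class. A minimal sketch (all names
here are invented for illustration; this is not the actual spapr code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical compat state; older pseries machine classes would set
 * pre_cas_numa = true in their class options. */
typedef struct MachineCompat {
    bool pre_cas_numa;
} MachineCompat;

/* Newer machines defer writing NUMA info until CAS has run; older
 * machines keep the previous behavior of writing it at machine reset. */
static bool should_write_numa_info(const MachineCompat *mc, bool cas_done)
{
    return mc->pre_cas_numa || cas_done;
}
```

With that, the pre-CAS device tree only loses NUMA information on machine
types new enough to have the flag cleared.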

> ---
>  hw/ppc/spapr_numa.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 04a86f9b5b..e1a7f80076 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -379,6 +379,10 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>   */
>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
> +    if (spapr_ovec_empty(spapr->ov5_cas)) {
> +        return;
> +    }
> +
>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property
  2021-06-15  1:33 ` [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property Daniel Henrique Barboza
@ 2021-06-15  9:33   ` Aneesh Kumar K.V
  2021-06-15 20:13     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 13+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-15  9:33 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel
  Cc: Shivaprasad G Bhat, Daniel Henrique Barboza, groug, qemu-ppc,
	Igor Mammedov, david

Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> The spapr-nvdimm driver can operate in two modes. The first is as
> regular memory, in which case the NUMA node of the parent pc-dimm
> class is used. The second mode, as persistent memory, allows a
> different NUMA node to be used based on the locality of the device.
>
> At the moment we have no way to express this second NUMA node for
> the persistent memory mode. This patch introduces a new 'device-node'
> property that will be used by the PPC64 spapr-nvdimm driver to set a
> second NUMA node for the nvdimm.

If 'device-node' is not specified on the command line, should we
default to the same value as the 'node' attribute? Right now I am
finding that it defaults to 0.

-aneesh


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property
  2021-06-15  9:33   ` Aneesh Kumar K.V
@ 2021-06-15 20:13     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15 20:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V, qemu-devel
  Cc: Igor Mammedov, qemu-ppc, groug, Shivaprasad G Bhat, david



On 6/15/21 6:33 AM, Aneesh Kumar K.V wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> 
>> The spapr-nvdimm driver can operate in two modes. The first is as
>> regular memory, in which case the NUMA node of the parent pc-dimm
>> class is used. The second mode, as persistent memory, allows a
>> different NUMA node to be used based on the locality of the device.
>>
>> At the moment we have no way to express this second NUMA node for
>> the persistent memory mode. This patch introduces a new 'device-node'
>> property that will be used by the PPC64 spapr-nvdimm driver to set a
>> second NUMA node for the nvdimm.
> 
> If 'device-node' is not specified on the command line, should we
> default to the same value as the 'node' attribute? Right now I am
> finding that it defaults to 0.


I agree. I fixed it in the next version to default to the 'node'
value when 'device-node' is omitted.
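A minimal sketch of that fallback semantics (the -1 "unset" sentinel and
the function name are illustrative, not the actual QEMU code):

```c
#include <assert.h>
#include <stdint.h>

/* If 'device-node' was not given on the command line (modeled here with
 * a -1 sentinel), fall back to the pc-dimm 'node' value instead of
 * silently defaulting to 0. */
static uint8_t nvdimm_effective_device_node(int device_node_prop, uint8_t node)
{
    return (device_node_prop < 0) ? node : (uint8_t)device_node_prop;
}
```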



Daniel

> 
> -aneesh
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT
  2021-06-15  4:02   ` David Gibson
@ 2021-06-15 20:26     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Henrique Barboza @ 2021-06-15 20:26 UTC (permalink / raw)
  To: David Gibson; +Cc: aneesh.kumar, qemu-ppc, qemu-devel, groug



On 6/15/21 1:02 AM, David Gibson wrote:
> On Mon, Jun 14, 2021 at 10:33:04PM -0300, Daniel Henrique Barboza wrote:
>> spapr_numa_write_rtas_dt() is called from spapr_dt_rtas(), which in
>> turn is called by spapr_build_fdt(). spapr_build_fdt() is called in
>> two places: spapr_machine_reset() and do_client_architecture_support().
>> When called in machine_reset() we're writing RTAS nodes with NUMA
>> artifacts without going through CAS first.
>>
>> This is not an issue because we always write the same thing in DT, since
>> we support just FORM1 NUMA affinity. With the upcoming FORM2 support,
>> we're now reliant on guest choice to decide what to write.
>>
>> Instead of taking a guess (e.g. default to FORM1, switch to FORM2 if
>> guest chooses it), postpone the writing of
>> ibm,associativity-reference-points and ibm,max-associativity-domains
>> until we're sure what was negotiated with the guest.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> I think it makes sense to fold this in with 1/8, moving the calculation
> itself to after CAS as well.

Ok.

> 
> This does make a (theoretical) functional change - it means that NUMA
> information is not available before CAS, which it was before.  I think
> that's very unlikely to break anything, but I wonder if we should make
> it dependent on the machine version just to be safe.

I don't mind making it dependent on the machine version. I'll make this
CAS change (and, as a result, all FORM2 support) available only for the
default machine type.


Thanks,


Daniel

> 
>> ---
>>   hw/ppc/spapr_numa.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
>> index 04a86f9b5b..e1a7f80076 100644
>> --- a/hw/ppc/spapr_numa.c
>> +++ b/hw/ppc/spapr_numa.c
>> @@ -379,6 +379,10 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
>>    */
>>   void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>>   {
>> +    if (spapr_ovec_empty(spapr->ov5_cas)) {
>> +        return;
>> +    }
>> +
>>       spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
>>   }
>>   
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-06-15 20:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  1:33 [RFC PATCH 0/8] pSeries base FORM2 NUMA affinity support Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 1/8] spapr: move NUMA data init to do_client_architecture_support() Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 2/8] spapr_numa.c: split FORM1 code into helpers Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 3/8] spapr_numa.c: wait for CAS before writing rtas DT Daniel Henrique Barboza
2021-06-15  4:02   ` David Gibson
2021-06-15 20:26     ` Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 4/8] spapr_numa.c: base FORM2 NUMA affinity support Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 5/8] spapr: simplify spapr_numa_associativity_init params Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 6/8] nvdimm: add PPC64 'device-node' property Daniel Henrique Barboza
2021-06-15  9:33   ` Aneesh Kumar K.V
2021-06-15 20:13     ` Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 7/8] spapr_numa, spapr_nvdimm: write secondary NUMA domain for nvdimms Daniel Henrique Barboza
2021-06-15  1:33 ` [RFC PATCH 8/8] spapr: move memory/cpu-less check to spapr_numa_FORM1_affinity_init() Daniel Henrique Barboza

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).