* [PATCH 00/10] pseries NUMA distance rework
@ 2020-08-14 20:54 Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 01/10] hw: add compat machines for 5.2 Daniel Henrique Barboza
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

Hi,

This series implements a new approach to the NUMA code in
the spapr machine. We now attempt to take user input into
consideration, instead of ignoring any user input regarding
NUMA distances.

This series was rebased upon David's ppc-for-5.2 tree. More
information about what is being done here can be found in
the docs/specs/ppc-spapr-numa.rst file (only available in David's
branch ATM). This is also available in [1].

The first patch is Cornelia's patch adding the 5.2 machine
types. I included it here because I built the compatibility
code on top of it.

The second patch adds a flag to hw/core/numa.c to forbid asymmetrical
topologies for the spapr machine, since we do not support them.

All other patches are focused on cleanups/code juggling and
the actual implementation inside the ppc/spapr files.

With this series, we're now able to at least attempt to honor
user input. Patch 10 has examples of how the code works and,
most importantly, what to expect and what not to expect.
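
As a hedged illustration (sizes and distances are made up), the kind
of user input this series tries to honor is a command-line fragment
like:

    -smp 4 \
    -object memory-backend-ram,id=ram0,size=2G \
    -object memory-backend-ram,id=ram1,size=2G \
    -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
    -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
    -numa dist,src=0,dst=1,val=40

Until now the 'dist' values were simply ignored by the spapr machine;
the following patches approximate them into PAPR associativity
domains.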

NOTE: for anyone willing to test this series, be warned that there
is a bug in the Linux kernel that causes the associativity information
for node 0 to be discarded. A patch was sent in [2] to try to
fix that.


[1] https://github.com/danielhb/qemu/tree/spapr_numa_v1
[2] https://lore.kernel.org/linuxppc-dev/20200814203413.542050-1-danielhb413@gmail.com/T/#m5c85daff828d8c8156781f9f957ad04903e93151

Cornelia Huck (1):
  hw: add compat machines for 5.2

Daniel Henrique Barboza (9):
  numa: introduce MachineClass::forbid_asymmetrical_numa
  spapr: robustify NVLink2 NUMA node logic
  spapr: add spapr_machine_using_legacy_numa() helper
  spapr: make ibm,max-associativity-domains scale with user input
  spapr: allow 4 NUMA levels in ibm,associativity-reference-points
  spapr: create helper to set ibm,associativity
  spapr: introduce SpaprMachineClass::numa_assoc_domains
  spapr: consider user input when defining spapr guest NUMA
  specs/ppc-spapr-numa: update with new NUMA support

 docs/specs/ppc-spapr-numa.rst | 213 ++++++++++++++++++++++++++++++
 hw/arm/virt.c                 |   9 +-
 hw/core/machine.c             |   3 +
 hw/core/numa.c                |   7 +
 hw/i386/pc.c                  |   3 +
 hw/i386/pc_piix.c             |  14 +-
 hw/i386/pc_q35.c              |  13 +-
 hw/ppc/spapr.c                | 239 +++++++++++++++++++++++++++++-----
 hw/ppc/spapr_hcall.c          |  13 +-
 hw/ppc/spapr_nvdimm.c         |  19 ++-
 hw/ppc/spapr_pci.c            |  42 +++++-
 hw/ppc/spapr_pci_nvlink2.c    |  10 +-
 hw/s390x/s390-virtio-ccw.c    |  14 +-
 include/hw/boards.h           |   4 +
 include/hw/i386/pc.h          |   3 +
 include/hw/pci-host/spapr.h   |   2 +
 include/hw/ppc/spapr.h        |  12 +-
 include/hw/ppc/spapr_nvdimm.h |   5 +-
 18 files changed, 568 insertions(+), 57 deletions(-)

-- 
2.26.2




* [PATCH 01/10] hw: add compat machines for 5.2
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa Daniel Henrique Barboza
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Cornelia Huck, qemu-ppc, david

From: Cornelia Huck <cohuck@redhat.com>

Add 5.2 machine types for arm/i440fx/q35/s390x/spapr.

Signed-off-by: Cornelia Huck <cohuck@redhat.com>
---
 hw/arm/virt.c              |  9 ++++++++-
 hw/core/machine.c          |  3 +++
 hw/i386/pc.c               |  3 +++
 hw/i386/pc_piix.c          | 14 +++++++++++++-
 hw/i386/pc_q35.c           | 13 ++++++++++++-
 hw/ppc/spapr.c             | 15 +++++++++++++--
 hw/s390x/s390-virtio-ccw.c | 14 +++++++++++++-
 include/hw/boards.h        |  3 +++
 include/hw/i386/pc.h       |  3 +++
 9 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ecfee362a1..acf9bfbece 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2546,10 +2546,17 @@ static void machvirt_machine_init(void)
 }
 type_init(machvirt_machine_init);
 
+static void virt_machine_5_2_options(MachineClass *mc)
+{
+}
+DEFINE_VIRT_MACHINE_AS_LATEST(5, 2)
+
 static void virt_machine_5_1_options(MachineClass *mc)
 {
+    virt_machine_5_2_options(mc);
+    compat_props_add(mc->compat_props, hw_compat_5_1, hw_compat_5_1_len);
 }
-DEFINE_VIRT_MACHINE_AS_LATEST(5, 1)
+DEFINE_VIRT_MACHINE(5, 1)
 
 static void virt_machine_5_0_options(MachineClass *mc)
 {
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 8d1a90c6cf..cf5f2dfaeb 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,9 @@
 #include "hw/mem/nvdimm.h"
 #include "migration/vmstate.h"
 
+GlobalProperty hw_compat_5_1[] = {};
+const size_t hw_compat_5_1_len = G_N_ELEMENTS(hw_compat_5_1);
+
 GlobalProperty hw_compat_5_0[] = {
     { "pci-host-bridge", "x-config-reg-migration-enabled", "off" },
     { "virtio-balloon-device", "page-poison", "false" },
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 47c5ca3e34..9aa813949c 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -97,6 +97,9 @@
 #include "fw_cfg.h"
 #include "trace.h"
 
+GlobalProperty pc_compat_5_1[] = {};
+const size_t pc_compat_5_1_len = G_N_ELEMENTS(pc_compat_5_1);
+
 GlobalProperty pc_compat_5_0[] = {
 };
 const size_t pc_compat_5_0_len = G_N_ELEMENTS(pc_compat_5_0);
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index b789e83f9a..c5ba70ca17 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -426,7 +426,7 @@ static void pc_i440fx_machine_options(MachineClass *m)
     machine_class_allow_dynamic_sysbus_dev(m, TYPE_VMBUS_BRIDGE);
 }
 
-static void pc_i440fx_5_1_machine_options(MachineClass *m)
+static void pc_i440fx_5_2_machine_options(MachineClass *m)
 {
     PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_machine_options(m);
@@ -435,6 +435,18 @@ static void pc_i440fx_5_1_machine_options(MachineClass *m)
     pcmc->default_cpu_version = 1;
 }
 
+DEFINE_I440FX_MACHINE(v5_2, "pc-i440fx-5.2", NULL,
+                      pc_i440fx_5_2_machine_options);
+
+static void pc_i440fx_5_1_machine_options(MachineClass *m)
+{
+    pc_i440fx_5_2_machine_options(m);
+    m->alias = NULL;
+    m->is_default = false;
+    compat_props_add(m->compat_props, hw_compat_5_1, hw_compat_5_1_len);
+    compat_props_add(m->compat_props, pc_compat_5_1, pc_compat_5_1_len);
+}
+
 DEFINE_I440FX_MACHINE(v5_1, "pc-i440fx-5.1", NULL,
                       pc_i440fx_5_1_machine_options);
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index a3e607a544..0cb9c18cd4 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -353,7 +353,7 @@ static void pc_q35_machine_options(MachineClass *m)
     m->max_cpus = 288;
 }
 
-static void pc_q35_5_1_machine_options(MachineClass *m)
+static void pc_q35_5_2_machine_options(MachineClass *m)
 {
     PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_machine_options(m);
@@ -361,6 +361,17 @@ static void pc_q35_5_1_machine_options(MachineClass *m)
     pcmc->default_cpu_version = 1;
 }
 
+DEFINE_Q35_MACHINE(v5_2, "pc-q35-5.2", NULL,
+                   pc_q35_5_2_machine_options);
+
+static void pc_q35_5_1_machine_options(MachineClass *m)
+{
+    pc_q35_5_2_machine_options(m);
+    m->alias = NULL;
+    compat_props_add(m->compat_props, hw_compat_5_1, hw_compat_5_1_len);
+    compat_props_add(m->compat_props, pc_compat_5_1, pc_compat_5_1_len);
+}
+
 DEFINE_Q35_MACHINE(v5_1, "pc-q35-5.1", NULL,
                    pc_q35_5_1_machine_options);
 
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a5bb0736e2..dd2fa4826b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4581,15 +4581,26 @@ static void spapr_machine_latest_class_options(MachineClass *mc)
     }                                                                \
     type_init(spapr_machine_register_##suffix)
 
+/*
+ * pseries-5.2
+ */
+static void spapr_machine_5_2_class_options(MachineClass *mc)
+{
+    /* Defaults for the latest behaviour inherited from the base class */
+}
+
+DEFINE_SPAPR_MACHINE(5_2, "5.2", true);
+
 /*
  * pseries-5.1
  */
 static void spapr_machine_5_1_class_options(MachineClass *mc)
 {
-    /* Defaults for the latest behaviour inherited from the base class */
+    spapr_machine_5_2_class_options(mc);
+    compat_props_add(mc->compat_props, hw_compat_5_1, hw_compat_5_1_len);
 }
 
-DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
+DEFINE_SPAPR_MACHINE(5_1, "5.1", false);
 
 /*
  * pseries-5.0
diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index e72c61d2ea..f4ea6a9545 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -801,14 +801,26 @@ bool css_migration_enabled(void)
     }                                                                         \
     type_init(ccw_machine_register_##suffix)
 
+static void ccw_machine_5_2_instance_options(MachineState *machine)
+{
+}
+
+static void ccw_machine_5_2_class_options(MachineClass *mc)
+{
+}
+DEFINE_CCW_MACHINE(5_2, "5.2", true);
+
 static void ccw_machine_5_1_instance_options(MachineState *machine)
 {
+    ccw_machine_5_2_instance_options(machine);
 }
 
 static void ccw_machine_5_1_class_options(MachineClass *mc)
 {
+    ccw_machine_5_2_class_options(mc);
+    compat_props_add(mc->compat_props, hw_compat_5_1, hw_compat_5_1_len);
 }
-DEFINE_CCW_MACHINE(5_1, "5.1", true);
+DEFINE_CCW_MACHINE(5_1, "5.1", false);
 
 static void ccw_machine_5_0_instance_options(MachineState *machine)
 {
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 426ce5f625..bc5b82ad20 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -319,6 +319,9 @@ struct MachineState {
     } \
     type_init(machine_initfn##_register_types)
 
+extern GlobalProperty hw_compat_5_1[];
+extern const size_t hw_compat_5_1_len;
+
 extern GlobalProperty hw_compat_5_0[];
 extern const size_t hw_compat_5_0_len;
 
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 3d7ed3a55e..fe52e165b2 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -193,6 +193,9 @@ void pc_system_firmware_init(PCMachineState *pcms, MemoryRegion *rom_memory);
 void pc_madt_cpu_entry(AcpiDeviceIf *adev, int uid,
                        const CPUArchIdList *apic_ids, GArray *entry);
 
+extern GlobalProperty pc_compat_5_1[];
+extern const size_t pc_compat_5_1_len;
+
 extern GlobalProperty pc_compat_5_0[];
 extern const size_t pc_compat_5_0_len;
 
-- 
2.26.2




* [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 01/10] hw: add compat machines for 5.2 Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  1:17   ` David Gibson
  2020-08-14 20:54 ` [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic Daniel Henrique Barboza
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, Eduardo Habkost, david

The pSeries machine does not support asymmetrical NUMA configurations.
This patch adds a MachineClass::forbid_asymmetrical_numa flag, checked
during NUMA distance validation, and sets it for the pSeries machine.
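
As a hedged example (distance values are made up), a setup such as:

    -numa dist,src=0,dst=1,val=20 -numa dist,src=1,dst=0,val=40

is asymmetrical and, with this patch, makes the pSeries machine exit
with the "This machine type does not support asymmetrical numa
distances" error instead of being silently accepted.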

CC: Eduardo Habkost <ehabkost@redhat.com>
CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/core/numa.c      | 7 +++++++
 hw/ppc/spapr.c      | 1 +
 include/hw/boards.h | 1 +
 3 files changed, 9 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index d1a94a14f8..1e81233c1d 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
  */
 static void validate_numa_distance(MachineState *ms)
 {
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
     int src, dst;
     bool is_asymmetrical = false;
     int nb_numa_nodes = ms->numa_state->num_nodes;
@@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
     }
 
     if (is_asymmetrical) {
+        if (mc->forbid_asymmetrical_numa) {
+            error_report("This machine type does not support "
+                         "asymmetrical numa distances.");
+            exit(EXIT_FAILURE);
+        }
+
         for (src = 0; src < nb_numa_nodes; src++) {
             for (dst = 0; dst < nb_numa_nodes; dst++) {
                 if (src != dst && numa_info[src].distance[dst] == 0) {
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index dd2fa4826b..3b16edaf4c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
      */
     mc->numa_mem_align_shift = 28;
     mc->auto_enable_numa = true;
+    mc->forbid_asymmetrical_numa = true;
 
     smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
     smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
diff --git a/include/hw/boards.h b/include/hw/boards.h
index bc5b82ad20..dc6cdd1c53 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -215,6 +215,7 @@ struct MachineClass {
     bool nvdimm_supported;
     bool numa_mem_supported;
     bool auto_enable_numa;
+    bool forbid_asymmetrical_numa;
     const char *default_ram_id;
 
     HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
-- 
2.26.2




* [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 01/10] hw: add compat machines for 5.2 Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  2:14   ` David Gibson
  2020-08-14 20:54 ` [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper Daniel Henrique Barboza
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

NVLink2 GPUs are allocated in their own NUMA node, at maximum
distance from every other resource on the board. The existing
logic makes some assumptions that don't scale well:

- only NVLink2 GPUs will ever require such a mechanism, meaning
that the GPU logic is tightly coupled with the NUMA setup of
the machine, via how ibm,max-associativity-domains is set.

- the code relies on the lack of support for sparse NUMA
nodes in QEMU. If this support is eventually implemented, the
assumption that spapr->gpu_numa_id represents the total number
of NUMA nodes plus all generated NUMA ids for the GPUs, which
relies on QEMU NUMA nodes not being sparse, becomes a recipe
for disaster.

This patch aims to fix both assumptions by creating a generic
mechanism to get an available NUMA node, regardless of whether
the NUMA setup is sparse or not. The idea is to rename the existing
spapr->gpu_numa_id to spapr->current_numa_id and add a new
spapr->extra_numa_nodes attribute. They are used in a new function
called spapr_pci_get_available_numa_id(), which takes into account
that the NUMA configuration can be sparse, to retrieve an available
NUMA id for the caller. Each consecutive call of
spapr_pci_get_available_numa_id() will generate a new ID, up
to the point where numa_state->num_nodes + spapr->extra_numa_nodes
exceeds MAX_NODES. This is generic code, currently used only by
NVLink2, but available to any other device in the future.
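
As a hedged worked example (the sparse layout is hypothetical): if
the user declares only nodes 0 and 2, the first call to
spapr_pci_get_available_numa_id() returns 1, the next one returns 3,
and so on, until num_nodes + extra_numa_nodes hits MAX_NODES.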

With this new function in place, we can decouple the
ibm,max-associativity-domains logic from the NVLink2 logic by
using the new spapr->extra_numa_nodes to define the maxdomains
of the fourth NUMA level. Instead of defining it as gpu_numa_id,
use num_nodes + extra_numa_nodes. This also makes it resilient
to any future change in the support of sparse NUMA nodes.

Despite all the code juggling, there is no functional change,
because sparse NUMA nodes are not supported and we do not honor
distinct NUMA distances via user input. The next patches will
change that.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 15 ++++++++++-----
 hw/ppc/spapr_pci.c          | 33 +++++++++++++++++++++++++++++++++
 hw/ppc/spapr_pci_nvlink2.c  | 10 ++++++----
 include/hw/pci-host/spapr.h |  2 ++
 include/hw/ppc/spapr.h      |  4 +++-
 5 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3b16edaf4c..22e78cfc84 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -910,13 +910,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
         cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
         cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
     };
-    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
+    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
         maxdomain,
         maxdomain,
         maxdomain,
-        cpu_to_be32(spapr->gpu_numa_id),
+        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
     };
 
     _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
@@ -2824,13 +2824,18 @@ static void spapr_machine_init(MachineState *machine)
     /*
      * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
      * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
-     * called from vPHB reset handler so we initialize the counter here.
+     * called from vPHB reset handler. We have code to generate an extra numa
+     * id to place the GPU via 'extra_numa_nodes' and 'current_numa_id', which
+     * are initialized here.
+     *
      * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
      * must be equally distant from any other node.
-     * The final value of spapr->gpu_numa_id is going to be written to
+     *
+     * The extra NUMA node ids generated for GPU usage will be written to
      * max-associativity-domains in spapr_build_fdt().
      */
-    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
+    spapr->current_numa_id = 0;
+    spapr->extra_numa_nodes = 0;
 
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 0a418f1e67..09ac58fd7f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -2492,3 +2492,36 @@ void spapr_pci_switch_vga(bool big_endian)
                            &big_endian);
     }
 }
+
+unsigned spapr_pci_get_available_numa_id(Error **errp)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
+    NodeInfo *numa_info = machine->numa_state->nodes;
+    unsigned i, start;
+
+    if (machine->numa_state->num_nodes + spapr->extra_numa_nodes >= MAX_NODES) {
+        error_setg(errp,
+                   "Unable to get an extra NUMA node beyond MAX_NODES = %d",
+                   MAX_NODES);
+        return spapr->current_numa_id;
+    }
+
+    if (spapr->extra_numa_nodes == 0) {
+        start = 0;
+    } else {
+        start = spapr->current_numa_id + 1;
+    }
+
+    for (i = start; i < MAX_NODES; i++) {
+        if (!numa_info[i].present) {
+            spapr->extra_numa_nodes++;
+            spapr->current_numa_id = i;
+            return i;
+        }
+    }
+
+    error_setg(errp, "Unable to find a valid NUMA id");
+
+    return spapr->current_numa_id;
+}
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
index 76ae77ebc8..611c8a2957 100644
--- a/hw/ppc/spapr_pci_nvlink2.c
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -87,9 +87,8 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
                                     PCIDevice *pdev, uint64_t tgt,
                                     MemoryRegion *mr, Error **errp)
 {
-    MachineState *machine = MACHINE(qdev_get_machine());
-    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
     SpaprPhbPciNvGpuSlot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
+    Error *local_err = NULL;
 
     if (!nvslot) {
         error_setg(errp, "Found too many GPUs per vPHB");
@@ -100,8 +99,11 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
 
     nvslot->gpa = nvgpus->nv2_ram_current;
     nvgpus->nv2_ram_current += memory_region_size(mr);
-    nvslot->numa_id = spapr->gpu_numa_id;
-    ++spapr->gpu_numa_id;
+
+    nvslot->numa_id = spapr_pci_get_available_numa_id(&local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
 }
 
 static void spapr_pci_collect_nvnpu(SpaprPhbPciNvGpuConfig *nvgpus,
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 600eb55c34..8d93223a76 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -129,6 +129,8 @@ struct SpaprPhbState {
 #define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
                                       64 * KiB)
 
+unsigned spapr_pci_get_available_numa_id(Error **errp);
+
 int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
                  uint32_t intc_phandle, void *fdt, int *node_offset);
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 3134d339e8..739a6a4942 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -227,7 +227,9 @@ struct SpaprMachineState {
     bool cmd_line_caps[SPAPR_CAP_NUM];
     SpaprCapabilities def, eff, mig;
 
-    unsigned gpu_numa_id;
+    unsigned current_numa_id;
+    unsigned extra_numa_nodes;
+
     SpaprTpmProxy *tpm_proxy;
 
     Error *fwnmi_migration_blocker;
-- 
2.26.2




* [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  2:15   ` David Gibson
  2020-08-14 20:54 ` [PATCH 05/10] spapr: make ibm,max-associativity-domains scale with user input Daniel Henrique Barboza
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

The upcoming changes to NUMA support are all guest visible. In
theory we could just create a new 5_1 class option flag to
prevent the changes from cascading to 5.1 and older machines. In
reality, these changes are only relevant if the machine has more
than one NUMA node, and there is no need to needlessly change
guest behavior that has been around for years.

This new helper will be used by the next patches to determine
whether we should retain the (soon to be) legacy NUMA behavior
in the pSeries machine. The new behavior will only be exposed
if:

- machine is pseries-5.2 and newer;
- more than one NUMA node is declared in NUMA state.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c         | 12 ++++++++++++
 include/hw/ppc/spapr.h |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 22e78cfc84..073a59c47d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -308,6 +308,15 @@ static hwaddr spapr_node0_size(MachineState *machine)
     return machine->ram_size;
 }
 
+bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
+{
+    MachineState *machine = MACHINE(spapr);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+
+    return smc->pre_5_2_numa_associativity ||
+           machine->numa_state->num_nodes <= 1;
+}
+
 static void add_str(GString *s, const gchar *s1)
 {
     g_string_append_len(s, s1, strlen(s1) + 1);
@@ -4602,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(5_2, "5.2", true);
  */
 static void spapr_machine_5_1_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_5_2_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_1, hw_compat_5_1_len);
+    smc->pre_5_2_numa_associativity = true;
 }
 
 DEFINE_SPAPR_MACHINE(5_1, "5.1", false);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 739a6a4942..d9f1afa8b2 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -130,6 +130,7 @@ struct SpaprMachineClass {
     bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
+    bool pre_5_2_numa_associativity;
 
     void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio, 
@@ -847,6 +848,7 @@ int spapr_max_server_number(SpaprMachineState *spapr);
 void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
                       uint64_t pte0, uint64_t pte1);
 void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
+bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr);
 
 /* DRC callbacks. */
 void spapr_core_release(DeviceState *dev);
-- 
2.26.2




* [PATCH 05/10] spapr: make ibm,max-associativity-domains scale with user input
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (3 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  2:55   ` [PATCH 05/10] spapr: make ibm,max-associativity-domains " David Gibson
  2020-08-14 20:54 ` [PATCH 06/10] spapr: allow 4 NUMA levels in ibm,associativity-reference-points Daniel Henrique Barboza
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

The ibm,max-associativity-domains property currently assumes that
only a single associativity domain can exist at each NUMA level.
This is true today because we do not support any type of NUMA
distance user customization, and all nodes are at the same distance
from each other.

To enhance NUMA distance support in the pSeries machine we need to
make this limit flexible. This patch rewrites the max-associativity
logic to consider that multiple associativity domains can co-exist
at the same NUMA level. We're using the spapr_machine_using_legacy_numa()
helper to avoid leaking unneeded changes into legacy guests.
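
As a hedged illustration (node count is made up): with 4 user-declared
NUMA nodes, no extra GPU nodes and the new non-legacy path, the
property written by spapr_dt_rtas() should look like:

    ibm,max-associativity-domains = <0x4 0x3 0x3 0x3 0x4>

i.e. up to 3 distinct domains on each of the intermediate levels and
one domain per node (4) on the last level.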

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 073a59c47d..b0c4b80a23 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -919,13 +919,20 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
         cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
         cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
     };
-    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
+
+    /* The maximum domains for a given NUMA level, supposing that every
+     * additional NUMA node belongs to the same domain (aside from the
+     * 4th level, where we must support all available NUMA domains), is
+     * total number of domains - 1. */
+    uint32_t total_nodes_number = ms->numa_state->num_nodes +
+                                  spapr->extra_numa_nodes;
+    uint32_t maxdomain = cpu_to_be32(total_nodes_number - 1);
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
         maxdomain,
         maxdomain,
         maxdomain,
-        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
+        cpu_to_be32(total_nodes_number),
     };
 
     _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
@@ -962,6 +969,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
                      qemu_hypertas->str, qemu_hypertas->len));
     g_string_free(qemu_hypertas, TRUE);
 
+    if (spapr_machine_using_legacy_numa(spapr)) {
+        maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
+        maxdomains[1] = maxdomain;
+        maxdomains[2] = maxdomain;
+        maxdomains[3] = maxdomain;
+    }
+
     if (smc->pre_5_1_assoc_refpoints) {
         nr_refpoints = 2;
     }
-- 
2.26.2




* [PATCH 06/10] spapr: allow 4 NUMA levels in ibm,associativity-reference-points
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (4 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 05/10] spapr: make ibm,max-associativity-domains scale with user input Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 07/10] spapr: create helper to set ibm,associativity Daniel Henrique Barboza
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

The value of ibm,associativity-reference-points is in sync with
what Skiboot does. It's a three-level NUMA configuration where
the first two levels reference the same associativity index
(0x4), meaning that the next distance after the local distance
(10) is two orders away (a '40' value in the Linux kernel). The
third level (0x2) was added to allow NVLink GPUs to be placed
outside of the associativity domain of the regular
NUMA nodes. However, we have to deal with the possibility of
user customization of the NUMA distances, something that
Skiboot doesn't need to do, and the current scheme is too
restrictive for that.

The next step to give users more flexibility is to define 4
distinct NUMA levels, allowing for 5 discrete values of
distances (10, 20, 40, 80 and 160 as seen by the kernel).
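
As a hedged illustration of the end result (assuming a pseries-5.2
guest with more than one NUMA node, i.e. not using the legacy NUMA
path), the guest device tree should now carry:

    ibm,associativity-reference-points = <0x4 0x3 0x2 0x1>

while legacy NUMA guests keep the previous three-entry layout.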

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b0c4b80a23..bc51d2db90 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -906,8 +906,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     GString *qemu_hypertas = g_string_sized_new(256);
     uint32_t refpoints[] = {
         cpu_to_be32(0x4),
-        cpu_to_be32(0x4),
+        cpu_to_be32(0x3),
         cpu_to_be32(0x2),
+        cpu_to_be32(0x1),
     };
     uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
     uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
@@ -970,6 +971,10 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     g_string_free(qemu_hypertas, TRUE);
 
     if (spapr_machine_using_legacy_numa(spapr)) {
+        refpoints[1] = cpu_to_be32(0x4);
+        refpoints[2] = cpu_to_be32(0x2);
+        nr_refpoints = 3;
+
         maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
         maxdomains[1] = maxdomain;
         maxdomains[2] = maxdomain;
-- 
2.26.2




* [PATCH 07/10] spapr: create helper to set ibm,associativity
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (5 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 06/10] spapr: allow 4 NUMA levels in ibm,associativity-reference-points Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  3:00   ` David Gibson
  2020-08-14 20:54 ` [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

We have several places around the hw/ppc files where we use the
same code to set the ibm,associativity array. This patch
creates a helper called spapr_set_associativity() to do
that in a single place. It'll also make it saner to change
the value of ibm,associativity in the next patches.

After this patch, only 2 places are left with open-coded
ibm,associativity assignment:

- spapr_dt_dynamic_reconfiguration_memory()
- h_home_node_associativity() in spapr_hcall.c

The associativity values in these places will be updated
manually later on.
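
For reference, a hedged illustration of the property layout produced
by the helper (node and CPU ids are made up; this is the same layout
as the open-coded versions being replaced):

    /* CPU: node_id = 1, cpu_index = 4 */
    ibm,associativity = <0x5 0x0 0x0 0x0 0x1 0x4>

    /* memory/PHB/NVDIMM: node_id = 1, no cpu_index */
    ibm,associativity = <0x4 0x0 0x0 0x0 0x1>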

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c         | 32 +++++++++++++++++++++-----------
 hw/ppc/spapr_nvdimm.c  |  8 +++-----
 hw/ppc/spapr_pci.c     |  8 +++-----
 include/hw/ppc/spapr.h |  1 +
 4 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index bc51d2db90..b80a6f6936 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -201,15 +201,27 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
     return ret;
 }
 
+void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
+{
+    uint8_t assoc_size = 0x4;
+
+    if (cpu_index >= 0) {
+        assoc_size = 0x5;
+        assoc[5] = cpu_to_be32(cpu_index);
+    }
+
+    assoc[0] = cpu_to_be32(assoc_size);
+    assoc[1] = cpu_to_be32(0x0);
+    assoc[2] = cpu_to_be32(0x0);
+    assoc[3] = cpu_to_be32(0x0);
+    assoc[4] = cpu_to_be32(node_id);
+}
+
 static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
 {
     int index = spapr_get_vcpu_id(cpu);
-    uint32_t associativity[] = {cpu_to_be32(0x5),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(cpu->node_id),
-                                cpu_to_be32(index)};
+    uint32_t associativity[6];
+    spapr_set_associativity(associativity, cpu->node_id, index);
 
     /* Advertise NUMA via ibm,associativity */
     return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
@@ -325,15 +337,13 @@ static void add_str(GString *s, const gchar *s1)
 static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
                                 hwaddr size)
 {
-    uint32_t associativity[] = {
-        cpu_to_be32(0x4), /* length */
-        cpu_to_be32(0x0), cpu_to_be32(0x0),
-        cpu_to_be32(0x0), cpu_to_be32(nodeid)
-    };
+    uint32_t associativity[5];
     char mem_name[32];
     uint64_t mem_reg_property[2];
     int off;
 
+    spapr_set_associativity(associativity, nodeid, -1);
+
     mem_reg_property[0] = cpu_to_be64(start);
     mem_reg_property[1] = cpu_to_be64(size);
 
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 81410aa63f..bd109bfc00 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -115,15 +115,13 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
                                              &error_abort);
     uint64_t slot = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_SLOT_PROP,
                                              &error_abort);
-    uint32_t associativity[] = {
-        cpu_to_be32(0x4), /* length */
-        cpu_to_be32(0x0), cpu_to_be32(0x0),
-        cpu_to_be32(0x0), cpu_to_be32(node)
-    };
+    uint32_t associativity[5];
     uint64_t lsize = nvdimm->label_size;
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
 
+    spapr_set_associativity(associativity, node, -1);
+
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
 
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 09ac58fd7f..c02ace226c 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -2321,11 +2321,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
         cpu_to_be32(1),
         cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
     };
-    uint32_t associativity[] = {cpu_to_be32(0x4),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(0x0),
-                                cpu_to_be32(phb->numa_node)};
+    uint32_t associativity[5];
+
     SpaprTceTable *tcet;
     SpaprDrc *drc;
     Error *err = NULL;
@@ -2358,6 +2355,7 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
 
     /* Advertise NUMA via ibm,associativity */
     if (phb->numa_node != -1) {
+        spapr_set_associativity(associativity, phb->numa_node, -1);
         _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
                          sizeof(associativity)));
     }
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d9f1afa8b2..cd158bf95a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -863,6 +863,7 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
 
 void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
 int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
+void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
 
 #define TYPE_SPAPR_RNG "spapr-rng"
 
-- 
2.26.2




* [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (6 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 07/10] spapr: create helper to set ibm,associativity Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-20  4:26   ` David Gibson
  2020-08-14 20:54 ` [PATCH 09/10] spapr: consider user input when defining spapr guest NUMA Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 10/10] specs/ppc-spapr-numa: update with new NUMA support Daniel Henrique Barboza
  9 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

We can't use the input from machine->numa_state->nodes directly
in the pSeries machine because PAPR does not work with raw distance
values, like ACPI SLIT does. We need to determine common
associativity domains, based on similar performance/distance of the
resources, and set these domains in the associativity array that goes
to the FDT of each resource.

To ease the translation from regular ACPI NUMA distance info
to our PAPR dialect, let's create a matrix called numa_assoc_domains
in the SpaprMachineClass. This matrix will be initialized during
machine init, where we will read NUMA information from user input,
apply a heuristic to determine the associativity domains for each node,
and then populate numa_assoc_domains accordingly.

The changes are mostly centered in the spapr_set_associativity()
helper that will use the values of numa_assoc_domains instead of
using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
h_home_node_associativity() being the exceptions.

To keep the changes under control, we'll plug in the matrix usage
in the existing code first. The actual heuristic to determine
the associativity domains for each NUMA node will come in a follow-up
patch.

Note that the matrix is initialized with zeros, meaning that there
are no guest-visible changes in this patch. We keep these
changes away from legacy NUMA guests by not initializing the matrix
in those cases.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c                | 46 +++++++++++++++++++++++------------
 hw/ppc/spapr_hcall.c          | 13 ++++++++--
 hw/ppc/spapr_nvdimm.c         | 13 +++++-----
 hw/ppc/spapr_pci.c            |  3 ++-
 include/hw/ppc/spapr.h        |  7 +++++-
 include/hw/ppc/spapr_nvdimm.h |  5 ++--
 6 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b80a6f6936..4f50ab21ee 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
     return ret;
 }
 
-void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
+void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
+                             MachineState *machine)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
+    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
+    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
     uint8_t assoc_size = 0x4;
 
     if (cpu_index >= 0) {
@@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
     }
 
     assoc[0] = cpu_to_be32(assoc_size);
-    assoc[1] = cpu_to_be32(0x0);
-    assoc[2] = cpu_to_be32(0x0);
-    assoc[3] = cpu_to_be32(0x0);
+    assoc[1] = cpu_to_be32(assoc_domain1);
+    assoc[2] = cpu_to_be32(assoc_domain2);
+    assoc[3] = cpu_to_be32(assoc_domain3);
     assoc[4] = cpu_to_be32(node_id);
 }
 
-static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
+static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
+                                   MachineState *machine)
 {
     int index = spapr_get_vcpu_id(cpu);
     uint32_t associativity[6];
-    spapr_set_associativity(associativity, cpu->node_id, index);
+    spapr_set_associativity(associativity, cpu->node_id, index, machine);
 
     /* Advertise NUMA via ibm,associativity */
     return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
@@ -335,14 +341,14 @@ static void add_str(GString *s, const gchar *s1)
 }
 
 static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
-                                hwaddr size)
+                                hwaddr size, MachineState *machine)
 {
     uint32_t associativity[5];
     char mem_name[32];
     uint64_t mem_reg_property[2];
     int off;
 
-    spapr_set_associativity(associativity, nodeid, -1);
+    spapr_set_associativity(associativity, nodeid, -1, machine);
 
     mem_reg_property[0] = cpu_to_be64(start);
     mem_reg_property[1] = cpu_to_be64(size);
@@ -574,6 +580,7 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
                                                    void *fdt)
 {
     MachineState *machine = MACHINE(spapr);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int ret, i, offset;
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
@@ -628,12 +635,17 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
     int_buf[1] = cpu_to_be32(4); /* Number of entries per associativity list */
     cur_index += 2;
     for (i = 0; i < nr_nodes; i++) {
+        uint8_t assoc_domain1 = smc->numa_assoc_domains[i][0];
+        uint8_t assoc_domain2 = smc->numa_assoc_domains[i][1];
+        uint8_t assoc_domain3 = smc->numa_assoc_domains[i][2];
+
         uint32_t associativity[] = {
-            cpu_to_be32(0x0),
-            cpu_to_be32(0x0),
-            cpu_to_be32(0x0),
+            cpu_to_be32(assoc_domain1),
+            cpu_to_be32(assoc_domain2),
+            cpu_to_be32(assoc_domain3),
             cpu_to_be32(i)
         };
+
         memcpy(cur_index, associativity, sizeof(associativity));
         cur_index += 4;
     }
@@ -667,7 +679,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
         if (!mem_start) {
             /* spapr_machine_init() checks for rma_size <= node0_size
              * already */
-            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size);
+            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size, machine);
             mem_start += spapr->rma_size;
             node_size -= spapr->rma_size;
         }
@@ -679,7 +691,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
                 sizetmp = 1ULL << ctzl(mem_start);
             }
 
-            spapr_dt_memory_node(fdt, i, mem_start, sizetmp);
+            spapr_dt_memory_node(fdt, i, mem_start, sizetmp, machine);
             node_size -= sizetmp;
             mem_start += sizetmp;
         }
@@ -809,7 +821,7 @@ static void spapr_dt_cpu(CPUState *cs, void *fdt, int offset,
                       pft_size_prop, sizeof(pft_size_prop))));
 
     if (ms->numa_state->num_nodes > 1) {
-        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu));
+        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu, ms));
     }
 
     _FDT(spapr_fixup_cpu_smt_dt(fdt, offset, cpu, compat_smt));
@@ -1335,7 +1347,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
 
     /* NVDIMM devices */
     if (mc->nvdimm_supported) {
-        spapr_dt_persistent_memory(fdt);
+        spapr_dt_persistent_memory(fdt, machine);
     }
 
     return fdt;
@@ -3453,6 +3465,7 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
 int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                           void *fdt, int *fdt_start_offset, Error **errp)
 {
+    MachineState *machine = MACHINE(spapr);
     uint64_t addr;
     uint32_t node;
 
@@ -3460,7 +3473,8 @@ int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
     node = object_property_get_uint(OBJECT(drc->dev), PC_DIMM_NODE_PROP,
                                     &error_abort);
     *fdt_start_offset = spapr_dt_memory_node(fdt, node, addr,
-                                             SPAPR_MEMORY_BLOCK_SIZE);
+                                             SPAPR_MEMORY_BLOCK_SIZE,
+                                             machine);
     return 0;
 }
 
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index c1d01228c6..6e94e513cf 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1878,10 +1878,15 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               target_ulong opcode,
                                               target_ulong *args)
 {
+    MachineState *machine = MACHINE(spapr);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
     target_ulong flags = args[0];
     target_ulong procno = args[1];
     PowerPCCPU *tcpu;
     int idx;
+    uint8_t assoc_domain1;
+    uint8_t assoc_domain2;
+    uint8_t assoc_domain3;
 
     /* only support procno from H_REGISTER_VPA */
     if (flags != 0x1) {
@@ -1893,13 +1898,17 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
         return H_P2;
     }
 
+    assoc_domain1 = smc->numa_assoc_domains[tcpu->node_id][0];
+    assoc_domain2 = smc->numa_assoc_domains[tcpu->node_id][1];
+    assoc_domain3 = smc->numa_assoc_domains[tcpu->node_id][2];
+
     /* sequence is the same as in the "ibm,associativity" property */
 
     idx = 0;
 #define ASSOCIATIVITY(a, b) (((uint64_t)(a) << 32) | \
                              ((uint64_t)(b) & 0xffffffff))
-    args[idx++] = ASSOCIATIVITY(0, 0);
-    args[idx++] = ASSOCIATIVITY(0, tcpu->node_id);
+    args[idx++] = ASSOCIATIVITY(assoc_domain1, assoc_domain2);
+    args[idx++] = ASSOCIATIVITY(assoc_domain3, tcpu->node_id);
     args[idx++] = ASSOCIATIVITY(procno, -1);
     for ( ; idx < 6; idx++) {
         args[idx] = -1;
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index bd109bfc00..7f1f088c39 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -87,8 +87,9 @@ int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                            void *fdt, int *fdt_start_offset, Error **errp)
 {
     NVDIMMDevice *nvdimm = NVDIMM(drc->dev);
+    MachineState *machine = MACHINE(spapr);
 
-    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm);
+    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm, machine);
 
     return 0;
 }
@@ -104,8 +105,8 @@ void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
 }
 
 
-int spapr_dt_nvdimm(void *fdt, int parent_offset,
-                           NVDIMMDevice *nvdimm)
+int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
+                    MachineState *machine)
 {
     int child_offset;
     char *buf;
@@ -120,7 +121,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
 
-    spapr_set_associativity(associativity, node, -1);
+    spapr_set_associativity(associativity, node, -1, machine);
 
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
@@ -159,7 +160,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
     return child_offset;
 }
 
-void spapr_dt_persistent_memory(void *fdt)
+void spapr_dt_persistent_memory(void *fdt, MachineState *machine)
 {
     int offset = fdt_subnode_offset(fdt, 0, "persistent-memory");
     GSList *iter, *nvdimms = nvdimm_get_device_list();
@@ -177,7 +178,7 @@ void spapr_dt_persistent_memory(void *fdt)
     for (iter = nvdimms; iter; iter = iter->next) {
         NVDIMMDevice *nvdimm = iter->data;
 
-        spapr_dt_nvdimm(fdt, offset, nvdimm);
+        spapr_dt_nvdimm(fdt, offset, nvdimm, machine);
     }
     g_slist_free(nvdimms);
 
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index c02ace226c..4d9ef63f3e 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -2355,7 +2355,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
 
     /* Advertise NUMA via ibm,associativity */
     if (phb->numa_node != -1) {
-        spapr_set_associativity(associativity, phb->numa_node, -1);
+        MachineState *machine = MACHINE(spapr);
+        spapr_set_associativity(associativity, phb->numa_node, -1, machine);
         _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
                          sizeof(associativity)));
     }
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index cd158bf95a..1f9700ac19 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -104,6 +104,9 @@ typedef enum {
 
 #define FDT_MAX_SIZE                    0x100000
 
+/* Taken from Linux kernel arch/powerpc/mm/numa.h */
+#define MAX_DISTANCE_REF_POINTS         4
+
 typedef struct SpaprCapabilities SpaprCapabilities;
 struct SpaprCapabilities {
     uint8_t caps[SPAPR_CAP_NUM];
@@ -131,6 +134,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    uint8_t numa_assoc_domains[MAX_NODES][MAX_DISTANCE_REF_POINTS-1];
 
     void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio, 
@@ -863,7 +867,8 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
 
 void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
 int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
-void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
+void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
+                             MachineState *machine);
 
 #define TYPE_SPAPR_RNG "spapr-rng"
 
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index b3330cc485..be30773c7d 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -27,8 +27,9 @@ QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
 
 int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
                            void *fdt, int *fdt_start_offset, Error **errp);
-int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm);
-void spapr_dt_persistent_memory(void *fdt);
+int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
+                    MachineState *machine);
+void spapr_dt_persistent_memory(void *fdt, MachineState *machine);
 void spapr_nvdimm_validate_opts(NVDIMMDevice *nvdimm, uint64_t size,
                                 Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot, Error **errp);
-- 
2.26.2




* [PATCH 09/10] spapr: consider user input when defining spapr guest NUMA
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (7 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  2020-08-14 20:54 ` [PATCH 10/10] specs/ppc-spapr-numa: update with new NUMA support Daniel Henrique Barboza
  9 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

This patch puts all the pieces together to finally allow user
input when defining the NUMA topology of the spapr guest.

The logic is centered in the new spapr_init_numa_assoc_domains()
function. This is called once at machine_init(), if we're not
using legacy NUMA mode, to initialize the numa_assoc_domains matrix
introduced in the previous patch. The logic can be divided into two
stages that are combined in the body of this function.

The first stage is to sanitize the user input from numa_state. Due to
the difference between what ACPI allows the user to do (directly
define the distances the guest will see in the DT) and what PAPR
allows (we can only hint at associativity relations; the OS decides
what to do with them), we had to bake kernel logic in here. The kernel
allows 4 levels of NUMA, where the last one is always the node_id
itself, with distance = 10. Each remaining level doubles the distance
of the previous one, meaning that the pSeries kernel will only
show distances of 20, 40, 80 and 160 (in case no match is found).
The first stage is then to take the distances defined by the user
and approximate them to those discrete values (a standalone sketch
of this mapping is shown right after the list below):

- user distance 11 to 30 will be interpreted as 20
- user distance 31 to 60 will be interpreted as 40
- user distance 61 to 120 will be interpreted as 80
- user distance 121 and beyond will be interpreted as 160
- user distance 10 stays 10
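
A hedged, standalone sketch of this bucketing (the function name is
hypothetical and not part of this series; the real code below walks
distance intervals while iterating over the nodes):

static int spapr_numa_level_for_distance(uint8_t distance)
{
    /*
     * Returns the deepest associativity level (0-3) at which two
     * nodes should share a domain, level 3 being the node id itself.
     * -1 means no shared level, so the guest reports distance 160.
     */
    if (distance <= 10) {
        return 3;   /* same node: guest keeps distance 10 */
    } else if (distance <= 30) {
        return 2;   /* guest will see distance 20 */
    } else if (distance <= 60) {
        return 1;   /* guest will see distance 40 */
    } else if (distance <= 120) {
        return 0;   /* guest will see distance 80 */
    }
    return -1;
}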

The second stage is defining the associativity domains based
on the NUMA level match. Again, more than one strategy exists
for this problem, with different results. The approach
taken is to reuse any existing associativity values for the
new matches, instead of overwriting them with a new associativity
value. This decision is necessary because neither we nor the
pSeries kernel support multiple associativity domains for
each resource, meaning that we have to decide what to preserve.
With the current logic, the associativities established by
the earlier nodes take precedence, i.e. associativities defined
by the first node are retained across all other nodes.

These decisions have a direct impact on how the user will interact
with the NUMA topology, and none of them is perfect. To keep
this commit message no longer than it already is, let's update the
existing documentation in ppc-spapr-numa.rst with more in-depth
details and design considerations/drawbacks in the next patch.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4f50ab21ee..0d60d06cf4 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -222,6 +222,109 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
     assoc[4] = cpu_to_be32(node_id);
 }
 
+static void spapr_numa_assoc_assign_domain(SpaprMachineClass *smc,
+                                           uint8_t nodeA, uint8_t nodeB,
+                                           uint8_t numaLevel,
+                                           uint8_t curr_domain)
+{
+    uint8_t assoc_A, assoc_B;
+
+    assoc_A = smc->numa_assoc_domains[nodeA][numaLevel];
+    assoc_B = smc->numa_assoc_domains[nodeB][numaLevel];
+
+    /* No associativity domain on either node. Assign and move on. */
+    if ((assoc_A | assoc_B) == 0) {
+        smc->numa_assoc_domains[nodeA][numaLevel] = curr_domain;
+        smc->numa_assoc_domains[nodeB][numaLevel] = curr_domain;
+        return;
+    }
+
+    /* Use the existing assoc domain of either node so as not to
+     * disrupt previously defined associations. */
+    if (assoc_A != 0) {
+        smc->numa_assoc_domains[nodeB][numaLevel] = assoc_A;
+    } else {
+        smc->numa_assoc_domains[nodeA][numaLevel] = assoc_B;
+    }
+}
+
+static void spapr_init_numa_assoc_domains(MachineState *machine)
+{
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+    int nb_numa_nodes = machine->numa_state->num_nodes;
+    NodeInfo *numa_info = machine->numa_state->nodes;
+    uint8_t existing_nodes[nb_numa_nodes];
+    int i, j, src_node, dst_node, index = 0;
+
+    /* We don't have information about any extra NUMA nodes that
+     * the machine might create at this point (e.g. NVLink2 GPUs).
+     * Assigning associativity domains with low numbers might have
+     * unintended consequences in the presence of GPUs, which are
+     * supposed to always be at maximum distance from everything else,
+     * because we might end up using a GPU numa_id identifier by
+     * accident.
+     *
+     * Starting this counter at MAX_NODES avoids any possible
+     * collision since no NUMA id can reach this value. */
+    uint8_t assoc_domain = MAX_NODES;
+
+    /* We will assume that the NUMA nodes might be sparse. This
+     * preliminary fetch step is required to avoid having to search
+     * for an existing NUMA node more than once. */
+    for (i = 0; i < MAX_NODES; i++) {
+        if (numa_info[i].present) {
+            existing_nodes[index++] = i;
+            if (index == nb_numa_nodes) {
+                break;
+            }
+        }
+    }
+
+    /* Start iterating through the existing numa nodes to
+     * define associativity groups */
+    for (i = 0; i < nb_numa_nodes; i++) {
+        uint8_t distance = 20;
+        uint8_t lower_end = 10;
+        uint8_t upper_end = 0;
+
+        src_node = existing_nodes[i];
+
+        /* Calculate all associativity domains src_node belongs to. */
+        for (index = 0; index < 3; index++) {
+            upper_end = distance / 2 + distance;
+
+            for (j = i + 1; j < nb_numa_nodes; j++) {
+                uint8_t node_dist;
+
+                dst_node = existing_nodes[j];
+                node_dist = numa_info[src_node].distance[dst_node];
+
+                if (node_dist > lower_end && node_dist <= upper_end) {
+                    spapr_numa_assoc_assign_domain(smc, src_node, dst_node,
+                                                   2 - index, assoc_domain);
+                    assoc_domain++;
+                }
+            }
+
+            lower_end = upper_end;
+            distance *= 2;
+        }
+    }
+
+    /* Zero (0) is considered a valid associativity domain identifier.
+     * To avoid NUMA nodes matching where it wasn't intended, fill
+     * the zeros with unique identifiers. */
+    for (i = 0; i < nb_numa_nodes; i++) {
+        src_node = existing_nodes[i];
+        for (j = 0; j < 3; j++) {
+            if (smc->numa_assoc_domains[src_node][j] == 0) {
+                smc->numa_assoc_domains[src_node][j] = assoc_domain;
+                assoc_domain++;
+            }
+        }
+    }
+}
+
 static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
                                    MachineState *machine)
 {
@@ -2887,6 +2990,12 @@ static void spapr_machine_init(MachineState *machine)
     spapr->current_numa_id = 0;
     spapr->extra_numa_nodes = 0;
 
+    /* We don't need to init the NUMA matrix if we're running in
+     * legacy NUMA mode. */
+    if (!spapr_machine_using_legacy_numa(spapr)) {
+        spapr_init_numa_assoc_domains(machine);
+    }
+
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/10] specs/ppc-spapr-numa: update with new NUMA support
  2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
                   ` (8 preceding siblings ...)
  2020-08-14 20:54 ` [PATCH 09/10] spapr: consider user input when defining spapr guest NUMA Daniel Henrique Barboza
@ 2020-08-14 20:54 ` Daniel Henrique Barboza
  9 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-14 20:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, david

This update provides more in-depth information about the
choices and drawbacks of the new NUMA support for the
spapr machine.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/specs/ppc-spapr-numa.rst | 213 ++++++++++++++++++++++++++++++++++
 1 file changed, 213 insertions(+)

diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
index e762038022..994bfb996f 100644
--- a/docs/specs/ppc-spapr-numa.rst
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -189,3 +189,216 @@ QEMU up to 5.1, as follows:
 
 This also means that user input in QEMU command line does not change the
 NUMA distancing inside the guest for the pseries machine.
+
+New NUMA mechanics for pseries in QEMU 5.2
+==========================================
+
+Starting in QEMU 5.2, the pseries machine now considers user input when
+setting NUMA topology of the guest. The following changes were made:
+
+* ibm,associativity-reference-points was changed to {0x4, 0x3, 0x2, 0x1}, allowing
+  for 4 distinct NUMA distance values based on the NUMA levels
+
+* ibm,max-associativity-domains was changed to support multiple associativity
+  domains in all NUMA levels. This is needed to ensure user flexibility
+
+* ibm,associativity for all resources now varies with user input
+
+These changes are only effective for pseries-5.2 and newer machines that are
+created with more than one NUMA node (not counting NUMA nodes created by
+the machine itself, e.g. NVLink2 GPUs). The now-legacy support has been
+around for a long time, with users seeing NUMA distances 10 and 40
+(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
+existing experience of those guests.
+
+To provide a user experience closer to what x86 users have when tuning
+NUMA, we had to operate under the current pseries Linux kernel logic
+described in `How the pseries Linux guest calculates NUMA distances`_.
+The result is that we needed to translate NUMA distance user input into
+pseries Linux kernel input.
+
+Translating user distance to kernel distance
+--------------------------------------------
+
+User input for NUMA distance can vary from 10 to 254. We need to translate
+that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
+This is how it is being done:
+
+* user distance 11 to 30 will be interpreted as 20
+* user distance 31 to 60 will be interpreted as 40
+* user distance 61 to 120 will be interpreted as 80
+* user distance 121 and beyond will be interpreted as 160
+* user distance 10 stays 10
+
+The reasoning behind this approximation is to avoid rounding any remote
+distance down to the local distance (10), keeping it exclusive to the 4th
+NUMA level (which is still exclusive to the node_id). All other ranges were
+chosen at the developer's discretion of what would be (somewhat) sensible
+considering the user input. Any other strategy could be used here, but in
+the end we'll have to accept that a wide range of values will be translated
+to the same NUMA topology in the guest, e.g. this user input:
+
+::
+
+      0   1   2
+  0  10  31 120
+  1  31  10  30
+  2 120  30  10
+
+And this other user input:
+
+::
+
+      0   1   2
+  0  10  60  61
+  1  60  10  11
+  2  61  11  10
+
+Will both be translated to the same values internally:
+
+::
+
+      0   1   2
+  0  10  40  80
+  1  40  10  20
+  2  80  20  10
+
+Users are encouraged to use only the kernel values in the NUMA definition to
+avoid being taken by surprise by what the guest actually sees in the
+topology. There are enough potential surprises inherent to the
+associativity domain assignment process, discussed below.
+
+
+How associativity domains are assigned
+--------------------------------------
+
+LOPAPR allows more than one associativity array (or 'string') per allocated
+resource. This would be used to represent that the resource has multiple
+connections to the board, and then the operating system, when deciding
+NUMA distancing, should consider the associativity information that provides
+the shortest distance.
+
+The spapr implementation does not support multiple associativity arrays per
+resource, and neither does the pseries Linux kernel. We'll have to represent
+the NUMA topology using one associativity per resource, which means that
+choices and compromises are going to be made.
+
+Consider the following NUMA topology entered by user input:
+
+::
+
+      0   1   2   3
+  0  10  20  20  40
+  1  20  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+Honoring just the relative distances of node 0 to every other node, one possible
+set of values for the associativity arrays would be:
+
+* node 0: 0 B A 0
+* node 1: 0 0 A 1
+* node 2: 0 0 A 2
+* node 3: 0 B 0 3
+
+With the reference points {0x4, 0x3, 0x2, 0x1}, for node 0:
+
+* distance from 0 to 1 is 20 (no match at 0x4, will match at 0x3)
+* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
+* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
+  at 0x2)
+
+The distances related to node 0 are well represented. Doing the same for node 1,
+and keeping in mind that we don't need to revisit node 0, the distance from
+node 1 to 2 is 80, matching at 0x4:
+
+* node 1: C 0 A 1
+* node 2: C 0 A 2
+
+Here we already have the first conflict. Even if we assign a new associativity
+domain at 0x4 for 1 and 2 (and we do that in the code), the kernel will define
+the distance between 1 and 2 as 20, not 80, because both 1 and 2 have the "A"
+associativity domain from the previous step. If we decide to discard the "A"
+associativity instead, then the node 0 distances are compromised.
+
+Following up with the distance from 1 to 3 being 40 (a match in 0x2) we have another
+decision to make. These are the current associativity domains of each:
+
+* node 1: C 0 A 1
+* node 3: 0 B 0 3
+
+There is already an associativity domain at 0x2 in node 3, "B", which was assigned
+by the node 0 distances. If we define a new associativity domain at this level
+for 1 and 3 we will overwrite the existing associativity between 0 and 3. What
+the code does in this case is assign the existing domain to the current
+associativity: "B" is now assigned to the 0x2 of node 1, resulting in the
+following associativity arrays:
+
+* node 0: 0 B A 0
+* node 1: C 0 A 1
+* node 2: C B A 2
+* node 3: 0 B 0 3
+
+In the last step we will analyze just nodes 2 and 3. The desired distance between
+2 and 3 is 20, i.e. a match in 0x3. Node 2 already has a domain assigned in 0x3,
+A, so we do the same as we did in the previous case and assign it to node 3
+at 0x3. This is the end result for the associativity arrays:
+
+* node 0: 0 B A 0
+* node 1: C 0 A 1
+* node 2: C B A 2
+* node 3: 0 B A 3
+
+The kernel will read these arrays and will calculate the following NUMA topology for
+the guest:
+
+::
+
+      0   1   2   3
+  0  10  20  20  20
+  1  20  10  20  20
+  2  20  20  10  20
+  3  20  20  20  10
+
+This is not what the user wanted, but it is what the current logic and the
+implementation constraints of the kernel and QEMU can provide within the
+LOPAPR specification.
+
+Changing a single value, especially a low distance value, makes for drastic changes
+in the result. For example, with the same user input from above, but changing the
+node distance from 0 to 1 to 40:
+
+::
+
+      0   1   2   3
+  0  10  40  20  40
+  1  40  10  80  40
+  2  20  80  10  20
+  3  40  40  20  10
+
+This is the result inside the guest, applying the same heuristics:
+
+::
+
+  $ numactl -H
+  available: 4 nodes (0-3)
+  (...)
+  node distances:
+  node   0   1   2   3
+    0:  10  40  20  20
+    1:  40  10  80  40
+    2:  20  80  10  20
+    3:  20  40  20  10
+
+This result is much closer to the user input and only a single distance was changed
+from the original.
+
+The kernel will always match with the shortest associativity domain possible, and we're
+attempting to retain the previously established relations between the nodes. This means
+that a distance equal to 20 between nodes A and B and the same distance 20 between nodes
+A and F will cause the distance between B and F to also be 20. The same will happen to
+other distances, but shorter distances take precedence in the distance calculation.
+
+Users are welcome to use this knowledge and experiment with the input to get the
+NUMA topology they want, or as close to it as possible. The important thing is to
+keep expectations in line with what we are capable of providing at this moment: an
+approximation.
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-14 20:54 ` [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa Daniel Henrique Barboza
@ 2020-08-20  1:17   ` David Gibson
  2020-08-20  2:11     ` Eduardo Habkost
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-20  1:17 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, Eduardo Habkost

[-- Attachment #1: Type: text/plain, Size: 2802 bytes --]

On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> The pSeries machine does not support asymmetrical NUMA
> configurations.

This seems a bit oddly specific to have as a global machine class
property.

Would it make more sense for machines with specific NUMA constraints
to just verify those during their initialization?
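
Purely as an illustration of what I have in mind (hypothetical helper,
untested, reusing the fields this series already touches), the machine
could do something like this from its init path:

    static void spapr_check_numa_symmetry(MachineState *ms)
    {
        NodeInfo *numa_info = ms->numa_state->nodes;
        int nb_numa_nodes = ms->numa_state->num_nodes;
        int src, dst;

        for (src = 0; src < nb_numa_nodes; src++) {
            for (dst = src + 1; dst < nb_numa_nodes; dst++) {
                if (numa_info[src].distance[dst] !=
                    numa_info[dst].distance[src]) {
                    error_report("pseries does not support asymmetrical "
                                 "NUMA distances");
                    exit(EXIT_FAILURE);
                }
            }
        }
    }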
> 
> CC: Eduardo Habkost <ehabkost@redhat.com>
> CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/core/numa.c      | 7 +++++++
>  hw/ppc/spapr.c      | 1 +
>  include/hw/boards.h | 1 +
>  3 files changed, 9 insertions(+)
> 
> diff --git a/hw/core/numa.c b/hw/core/numa.c
> index d1a94a14f8..1e81233c1d 100644
> --- a/hw/core/numa.c
> +++ b/hw/core/numa.c
> @@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>   */
>  static void validate_numa_distance(MachineState *ms)
>  {
> +    MachineClass *mc = MACHINE_GET_CLASS(ms);
>      int src, dst;
>      bool is_asymmetrical = false;
>      int nb_numa_nodes = ms->numa_state->num_nodes;
> @@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
>      }
>  
>      if (is_asymmetrical) {
> +        if (mc->forbid_asymmetrical_numa) {
> +            error_report("This machine type does not support "
> +                         "asymmetrical numa distances.");
> +            exit(EXIT_FAILURE);
> +        }
> +
>          for (src = 0; src < nb_numa_nodes; src++) {
>              for (dst = 0; dst < nb_numa_nodes; dst++) {
>                  if (src != dst && numa_info[src].distance[dst] == 0) {
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index dd2fa4826b..3b16edaf4c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>       */
>      mc->numa_mem_align_shift = 28;
>      mc->auto_enable_numa = true;
> +    mc->forbid_asymmetrical_numa = true;
>  
>      smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
>      smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index bc5b82ad20..dc6cdd1c53 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -215,6 +215,7 @@ struct MachineClass {
>      bool nvdimm_supported;
>      bool numa_mem_supported;
>      bool auto_enable_numa;
> +    bool forbid_asymmetrical_numa;
>      const char *default_ram_id;
>  
>      HotplugHandler *(*get_hotplug_handler)(MachineState *machine,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20  1:17   ` David Gibson
@ 2020-08-20  2:11     ` Eduardo Habkost
  2020-08-20  4:15       ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Eduardo Habkost @ 2020-08-20  2:11 UTC (permalink / raw)
  To: David Gibson; +Cc: John Snow, Daniel Henrique Barboza, qemu-ppc, qemu-devel

On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> > The pSeries machine does not support asymmetrical NUMA
> > configurations.
> 
> This seems a bit oddly specific to have as a global machine class
> property.
> 
> Would it make more sense for machines with specific NUMA constraints
> to just verify those during their initialization?

This would be much simpler.  However, I like the idea of
representing machine-specific configuration validation rules as
data that can eventually be exported to management software.

(CCing John Snow, who had spent some time thinking about
configuration validation recently.)


> > 
> > CC: Eduardo Habkost <ehabkost@redhat.com>
> > CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > ---
> >  hw/core/numa.c      | 7 +++++++
> >  hw/ppc/spapr.c      | 1 +
> >  include/hw/boards.h | 1 +
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/hw/core/numa.c b/hw/core/numa.c
> > index d1a94a14f8..1e81233c1d 100644
> > --- a/hw/core/numa.c
> > +++ b/hw/core/numa.c
> > @@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >   */
> >  static void validate_numa_distance(MachineState *ms)
> >  {
> > +    MachineClass *mc = MACHINE_GET_CLASS(ms);
> >      int src, dst;
> >      bool is_asymmetrical = false;
> >      int nb_numa_nodes = ms->numa_state->num_nodes;
> > @@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
> >      }
> >  
> >      if (is_asymmetrical) {
> > +        if (mc->forbid_asymmetrical_numa) {
> > +            error_report("This machine type does not support "
> > +                         "asymmetrical numa distances.");
> > +            exit(EXIT_FAILURE);
> > +        }
> > +
> >          for (src = 0; src < nb_numa_nodes; src++) {
> >              for (dst = 0; dst < nb_numa_nodes; dst++) {
> >                  if (src != dst && numa_info[src].distance[dst] == 0) {
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index dd2fa4826b..3b16edaf4c 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >       */
> >      mc->numa_mem_align_shift = 28;
> >      mc->auto_enable_numa = true;
> > +    mc->forbid_asymmetrical_numa = true;
> >  
> >      smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
> >      smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
> > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > index bc5b82ad20..dc6cdd1c53 100644
> > --- a/include/hw/boards.h
> > +++ b/include/hw/boards.h
> > @@ -215,6 +215,7 @@ struct MachineClass {
> >      bool nvdimm_supported;
> >      bool numa_mem_supported;
> >      bool auto_enable_numa;
> > +    bool forbid_asymmetrical_numa;
> >      const char *default_ram_id;
> >  
> >      HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
> 
> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson



-- 
Eduardo



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic
  2020-08-14 20:54 ` [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic Daniel Henrique Barboza
@ 2020-08-20  2:14   ` David Gibson
  2020-08-26 21:49     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-20  2:14 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 9148 bytes --]

On Fri, Aug 14, 2020 at 05:54:17PM -0300, Daniel Henrique Barboza wrote:
> NVLink2 GPUs are allocated in their own NUMA node, at maximum
> distance from every other resource in the board. The existing
> logic makes some assumptions that don't scale well:
> 
> - only NVLink2 GPUs will ever require such mechanism, meaning
> that the GPU logic is tightly coupled with the NUMA setup of
> the machine, via how ibm,max-associativity-domains is set.
> 
> - the code is relying on the lack of support for sparse NUMA
> nodes in QEMU. Eventually this support can be implemented, and
> then the assumption that spapr->gpu_numa_id represents the total
> number of NUMA nodes plus all generated NUMA ids for the GPUs, which
> relies on all QEMU NUMA nodes not being sparse, has a good
> potential for disaster.
> 
> This patch aims to fix both assumptions by creating a generic
> mechanism to get an available NUMA node, regardless of the
> NUMA setup being sparse or not. The idea is to rename the existing
> spapr->gpu_numa_id to spapr->current_numa_id and add a new
> spapr->extra_numa_nodes attribute. They are used in a new function
> called spapr_pci_get_available_numa_id(), that takes into account
> that the NUMA conf can be sparse or not, to retrieve an available
> NUMA id for the caller. Each consecutive call of
> spapr_pci_get_available_numa_id() will generate a new ID, up
> to the limit of numa_state->num_nodes + spapr->extra_numa_nodes
> exceeding MAX_NODES. This is a generic code being used only by
> NVLink2 ATM, being available to be used in the future by any
> other device.
> 
> With this new function in place, we can decouple
> ibm,max-associativity-domains logic from NVLink2 logic by
> using the new spapr->extra_numa_nodes to define the maxdomains
> of the fourth NUMA level. Instead of defining it as gpu_numa_id,
> use num_nodes + extra_numa_nodes. This also makes it resilient
> to any future change in the support of sparse NUMA nodes.
> 
> Despite all the code juggling, no functional change was made
> because sparse NUMA nodes aren't a thing and we do not support
> distinct NUMA distances via user input. Next patches will
> change that.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              | 15 ++++++++++-----
>  hw/ppc/spapr_pci.c          | 33 +++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_pci_nvlink2.c  | 10 ++++++----
>  include/hw/pci-host/spapr.h |  2 ++
>  include/hw/ppc/spapr.h      |  4 +++-
>  5 files changed, 54 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 3b16edaf4c..22e78cfc84 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -910,13 +910,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>          cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
>          cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
>      };
> -    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
> +    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
>      uint32_t maxdomains[] = {
>          cpu_to_be32(4),
>          maxdomain,
>          maxdomain,
>          maxdomain,
> -        cpu_to_be32(spapr->gpu_numa_id),
> +        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
>      };
>  
>      _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
> @@ -2824,13 +2824,18 @@ static void spapr_machine_init(MachineState *machine)
>      /*
>       * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>       * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> -     * called from vPHB reset handler so we initialize the counter here.
> +     * called from vPHB reset handler. We have code to generate an extra numa
> +     * id to place the GPU via 'extra_numa_nodes' and 'current_numa_node', which
> +     * are initialized here.
> +     *
>       * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>       * must be equally distant from any other node.
> -     * The final value of spapr->gpu_numa_id is going to be written to
> +     *
> +     * The extra NUMA node ids generated for GPU usage will be written to
>       * max-associativity-domains in spapr_build_fdt().
>       */
> -    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
> +    spapr->current_numa_id = 0;
> +    spapr->extra_numa_nodes = 0;
>  
>      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>          ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 0a418f1e67..09ac58fd7f 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -2492,3 +2492,36 @@ void spapr_pci_switch_vga(bool big_endian)
>                             &big_endian);
>      }
>  }
> +
> +unsigned spapr_pci_get_available_numa_id(Error **errp)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
> +    NodeInfo *numa_info = machine->numa_state->nodes;
> +    unsigned i, start;
> +
> +    if (machine->numa_state->num_nodes + spapr->extra_numa_nodes >= MAX_NODES) {
> +        error_setg(errp,
> +                   "Unable to get an extra NUMA node beyond MAX_NODES = %d",
> +                   MAX_NODES);
> +        return spapr->current_numa_id;
> +    }
> +
> +    if (spapr->extra_numa_nodes == 0) {
> +        start = 0;
> +    } else {
> +        start = spapr->current_numa_id + 1;
> +    }
> +
> +    for (i = start; i < MAX_NODES; i++) {
> +        if (!numa_info[i].present) {
> +            spapr->extra_numa_nodes++;
> +            spapr->current_numa_id = i;
> +            return i;
> +        }
> +    }

I think I see what you're trying to do, but this logic makes me a bit
nervous.  I guess migration isn't really an issue for the GPUs, but as
a general rule we want the NUMA ids for everything to be the same from
boot to boot (assuming similar command line configuration).

I think that's probably true here, but the rules are complex enough
that it's hard to convince myself there isn't some edge case where
something that doesn't seem relevant could change the numa nodes the
gpus end up with.

> +
> +    error_setg(errp, "Unable to find a valid NUMA id");
> +
> +    return spapr->current_numa_id;
> +}
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 76ae77ebc8..611c8a2957 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -87,9 +87,8 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
>                                      PCIDevice *pdev, uint64_t tgt,
>                                      MemoryRegion *mr, Error **errp)
>  {
> -    MachineState *machine = MACHINE(qdev_get_machine());
> -    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
>      SpaprPhbPciNvGpuSlot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
> +    Error *local_err = NULL;
>  
>      if (!nvslot) {
>          error_setg(errp, "Found too many GPUs per vPHB");
> @@ -100,8 +99,11 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
>  
>      nvslot->gpa = nvgpus->nv2_ram_current;
>      nvgpus->nv2_ram_current += memory_region_size(mr);
> -    nvslot->numa_id = spapr->gpu_numa_id;
> -    ++spapr->gpu_numa_id;
> +
> +    nvslot->numa_id = spapr_pci_get_available_numa_id(&local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +    }
>  }
>  
>  static void spapr_pci_collect_nvnpu(SpaprPhbPciNvGpuConfig *nvgpus,
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 600eb55c34..8d93223a76 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -129,6 +129,8 @@ struct SpaprPhbState {
>  #define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
>                                        64 * KiB)
>  
> +unsigned spapr_pci_get_available_numa_id(Error **errp);
> +
>  int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>                   uint32_t intc_phandle, void *fdt, int *node_offset);
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 3134d339e8..739a6a4942 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -227,7 +227,9 @@ struct SpaprMachineState {
>      bool cmd_line_caps[SPAPR_CAP_NUM];
>      SpaprCapabilities def, eff, mig;
>  
> -    unsigned gpu_numa_id;
> +    unsigned current_numa_id;

The nane "current_numa_id" is a bit unclear.  AFAICT this is used
specifically for the GPU "extra" numa nodes, which isn't obvious from
the name.

> +    unsigned extra_numa_nodes;
> +
>      SpaprTpmProxy *tpm_proxy;
>  
>      Error *fwnmi_migration_blocker;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper
  2020-08-14 20:54 ` [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper Daniel Henrique Barboza
@ 2020-08-20  2:15   ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2020-08-20  2:15 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3307 bytes --]

On Fri, Aug 14, 2020 at 05:54:18PM -0300, Daniel Henrique Barboza wrote:
> The changes to come to NUMA support are all guest visible. In
> theory we could just create a new 5_1 class option flag to
> avoid the changes cascading to 5.1 and older. The reality is that
> these changes are only relevant if the machine has more than one
> NUMA node. There is no need to needlessly change guest behavior
> that has been around for years.
> 
> This new helper will be used by the next patches to determine
> whether we should retain the (soon to be) legacy NUMA behavior
> in the pSeries machine. The new behavior will only be exposed
> if::
> 
> - machine is pseries-5.2 and newer;
> - more than one NUMA node is declared in NUMA state.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Seems reasonable.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  hw/ppc/spapr.c         | 12 ++++++++++++
>  include/hw/ppc/spapr.h |  2 ++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 22e78cfc84..073a59c47d 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -308,6 +308,15 @@ static hwaddr spapr_node0_size(MachineState *machine)
>      return machine->ram_size;
>  }
>  
> +bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> +{
> +    MachineState *machine = MACHINE(spapr);
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +
> +    return smc->pre_5_2_numa_associativity ||
> +           machine->numa_state->num_nodes <= 1;
> +}
> +
>  static void add_str(GString *s, const gchar *s1)
>  {
>      g_string_append_len(s, s1, strlen(s1) + 1);
> @@ -4602,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(5_2, "5.2", true);
>   */
>  static void spapr_machine_5_1_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_5_2_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_5_1, hw_compat_5_1_len);
> +    smc->pre_5_2_numa_associativity = true;
>  }
>  
>  DEFINE_SPAPR_MACHINE(5_1, "5.1", false);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 739a6a4942..d9f1afa8b2 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -130,6 +130,7 @@ struct SpaprMachineClass {
>      bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
> +    bool pre_5_2_numa_associativity;
>  
>      void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio, 
> @@ -847,6 +848,7 @@ int spapr_max_server_number(SpaprMachineState *spapr);
>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>                        uint64_t pte0, uint64_t pte1);
>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
> +bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr);
>  
>  /* DRC callbacks. */
>  void spapr_core_release(DeviceState *dev);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/10] spapr: make ibm,max-associativity-domains scale with user input
  2020-08-14 20:54 ` [PATCH 05/10] spapr: make ibm, max-associativity-domains scale with user input Daniel Henrique Barboza
@ 2020-08-20  2:55   ` David Gibson
  2020-08-26 21:17     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-20  2:55 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]

On Fri, Aug 14, 2020 at 05:54:19PM -0300, Daniel Henrique Barboza wrote:
> The ibm,max-associativity-domains property assumes that only a single
> associativity domain can exist in the same NUMA level. This is true
> today because we do not support any type of NUMA distance user
> customization, and all nodes are at the same distance from each other.
> 
> To enhance NUMA distance support in the pSeries machine we need to
> make this limit flexible. This patch rewrites the max-associativity
> logic to consider that multiple associativity domains can co-exist
> in the same NUMA level. We're using the legacy_numa() helper to
> avoid leaking unneeded guest changes.


Hrm.  I find the above a bit hard to understand.  Having the limit be
one less than the number of nodes at every level except the last seems
kind of odd to me.

> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 073a59c47d..b0c4b80a23 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -919,13 +919,20 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>          cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
>          cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
>      };
> -    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
> +
> +    /* The maximum number of domains for a given NUMA level, supposing that
> +     * every additional NUMA node belongs to the same domain (aside from the
> +     * 4th level, where we must support all available NUMA domains), is
> +     * the total number of domains - 1. */
> +    uint32_t total_nodes_number = ms->numa_state->num_nodes +
> +                                  spapr->extra_numa_nodes;
> +    uint32_t maxdomain = cpu_to_be32(total_nodes_number - 1);
>      uint32_t maxdomains[] = {
>          cpu_to_be32(4),
>          maxdomain,
>          maxdomain,
>          maxdomain,
> -        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
> +        cpu_to_be32(total_nodes_number),
>      };
>  
>      _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
> @@ -962,6 +969,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>                       qemu_hypertas->str, qemu_hypertas->len));
>      g_string_free(qemu_hypertas, TRUE);
>  
> +    if (spapr_machine_using_legacy_numa(spapr)) {
> +        maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
> +        maxdomains[1] = maxdomain;
> +        maxdomains[2] = maxdomain;
> +        maxdomains[3] = maxdomain;
> +    }
> +
>      if (smc->pre_5_1_assoc_refpoints) {
>          nr_refpoints = 2;
>      }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/10] spapr: create helper to set ibm,associativity
  2020-08-14 20:54 ` [PATCH 07/10] spapr: create helper to set ibm,associativity Daniel Henrique Barboza
@ 2020-08-20  3:00   ` David Gibson
  2020-08-20 10:39     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-20  3:00 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6197 bytes --]

On Fri, Aug 14, 2020 at 05:54:21PM -0300, Daniel Henrique Barboza wrote:
> We have several places around hw/ppc files where we use the
> same code to set the ibm,associativity array. This patch
> creates a helper called spapr_set_associativity() to do
> that in a single place. It'll also make it saner to change
> the value of ibm,associativity in the next patches.
> 
> After this patch, only 2 places are left with open-coded
> ibm,associativity assignment:
> 
> - spapr_dt_dynamic_reconfiguration_memory()
> - h_home_node_associativity() in spapr_hcall.c
> 
> The update of associativity values will be made in these places
> manually later on.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I like this - any chance you could move this to the front of the
series so that we can make this code easier to follow while we're
still discussing the more meaningful changes?

> ---
>  hw/ppc/spapr.c         | 32 +++++++++++++++++++++-----------
>  hw/ppc/spapr_nvdimm.c  |  8 +++-----
>  hw/ppc/spapr_pci.c     |  8 +++-----
>  include/hw/ppc/spapr.h |  1 +
>  4 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index bc51d2db90..b80a6f6936 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -201,15 +201,27 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>      return ret;
>  }
>  
> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
> +{
> +    uint8_t assoc_size = 0x4;
> +
> +    if (cpu_index >= 0) {
> +        assoc_size = 0x5;
> +        assoc[5] = cpu_to_be32(cpu_index);
> +    }
> +
> +    assoc[0] = cpu_to_be32(assoc_size);
> +    assoc[1] = cpu_to_be32(0x0);
> +    assoc[2] = cpu_to_be32(0x0);
> +    assoc[3] = cpu_to_be32(0x0);
> +    assoc[4] = cpu_to_be32(node_id);
> +}
> +
>  static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
>  {
>      int index = spapr_get_vcpu_id(cpu);
> -    uint32_t associativity[] = {cpu_to_be32(0x5),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(cpu->node_id),
> -                                cpu_to_be32(index)};
> +    uint32_t associativity[6];
> +    spapr_set_associativity(associativity, cpu->node_id, index);
>  
>      /* Advertise NUMA via ibm,associativity */
>      return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
> @@ -325,15 +337,13 @@ static void add_str(GString *s, const gchar *s1)
>  static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
>                                  hwaddr size)
>  {
> -    uint32_t associativity[] = {
> -        cpu_to_be32(0x4), /* length */
> -        cpu_to_be32(0x0), cpu_to_be32(0x0),
> -        cpu_to_be32(0x0), cpu_to_be32(nodeid)
> -    };
> +    uint32_t associativity[5];
>      char mem_name[32];
>      uint64_t mem_reg_property[2];
>      int off;
>  
> +    spapr_set_associativity(associativity, nodeid, -1);
> +
>      mem_reg_property[0] = cpu_to_be64(start);
>      mem_reg_property[1] = cpu_to_be64(size);
>  
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 81410aa63f..bd109bfc00 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -115,15 +115,13 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>                                               &error_abort);
>      uint64_t slot = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_SLOT_PROP,
>                                               &error_abort);
> -    uint32_t associativity[] = {
> -        cpu_to_be32(0x4), /* length */
> -        cpu_to_be32(0x0), cpu_to_be32(0x0),
> -        cpu_to_be32(0x0), cpu_to_be32(node)
> -    };
> +    uint32_t associativity[5];
>      uint64_t lsize = nvdimm->label_size;
>      uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>                                              NULL);
>  
> +    spapr_set_associativity(associativity, node, -1);
> +
>      drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>      g_assert(drc);
>  
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 09ac58fd7f..c02ace226c 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -2321,11 +2321,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>          cpu_to_be32(1),
>          cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>      };
> -    uint32_t associativity[] = {cpu_to_be32(0x4),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(0x0),
> -                                cpu_to_be32(phb->numa_node)};
> +    uint32_t associativity[5];
> +
>      SpaprTceTable *tcet;
>      SpaprDrc *drc;
>      Error *err = NULL;
> @@ -2358,6 +2355,7 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>  
>      /* Advertise NUMA via ibm,associativity */
>      if (phb->numa_node != -1) {
> +        spapr_set_associativity(associativity, phb->numa_node, -1);
>          _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
>                           sizeof(associativity)));
>      }
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d9f1afa8b2..cd158bf95a 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -863,6 +863,7 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>  
>  void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
>  int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
>  
>  #define TYPE_SPAPR_RNG "spapr-rng"
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20  2:11     ` Eduardo Habkost
@ 2020-08-20  4:15       ` David Gibson
  2020-08-20 10:33         ` Daniel Henrique Barboza
  2020-08-20 16:51         ` Eduardo Habkost
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2020-08-20  4:15 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: John Snow, Daniel Henrique Barboza, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4247 bytes --]

On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
> On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
> > On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> > > The pSeries machine does not support asymmetrical NUMA
> > > configurations.
> > 
> > This seems a bit oddly specific to have as a global machine class
> > property.
> > 
> > Would it make more sense for machines with specific NUMA constraints
> > to just verify those during their initialization?
> 
> This would be much simpler.  However, I like the idea of
> representing machine-specific configuration validation rules as
> data that can eventually be exported to management software.

Ah, ok, so basically the usual tradeoff between flexibility and
advertisability.

So, in that case, I guess the question is whether we envisage "no
assymmetry" as a constraint common enough that it's worth creating an
advertisable rule or not.  If we only ever have one user, then we
haven't really done any better than hard coding the constraint in the
management software.

Of course to complicate matters, in the longer term we're looking at
removing that constraint from pseries - but doing so will be dependent
on the guest kernel understanding a new format for the NUMA
information in the device tree.  So qemu alone won't have enough
information to tell if such a configuration is possible or not.

> (CCing John Snow, who had spent some time thinking about
> configuration validation recently.)
> 
> 
> > > 
> > > CC: Eduardo Habkost <ehabkost@redhat.com>
> > > CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > > ---
> > >  hw/core/numa.c      | 7 +++++++
> > >  hw/ppc/spapr.c      | 1 +
> > >  include/hw/boards.h | 1 +
> > >  3 files changed, 9 insertions(+)
> > > 
> > > diff --git a/hw/core/numa.c b/hw/core/numa.c
> > > index d1a94a14f8..1e81233c1d 100644
> > > --- a/hw/core/numa.c
> > > +++ b/hw/core/numa.c
> > > @@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> > >   */
> > >  static void validate_numa_distance(MachineState *ms)
> > >  {
> > > +    MachineClass *mc = MACHINE_GET_CLASS(ms);
> > >      int src, dst;
> > >      bool is_asymmetrical = false;
> > >      int nb_numa_nodes = ms->numa_state->num_nodes;
> > > @@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
> > >      }
> > >  
> > >      if (is_asymmetrical) {
> > > +        if (mc->forbid_asymmetrical_numa) {
> > > +            error_report("This machine type does not support "
> > > +                         "asymmetrical numa distances.");
> > > +            exit(EXIT_FAILURE);
> > > +        }
> > > +
> > >          for (src = 0; src < nb_numa_nodes; src++) {
> > >              for (dst = 0; dst < nb_numa_nodes; dst++) {
> > >                  if (src != dst && numa_info[src].distance[dst] == 0) {
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index dd2fa4826b..3b16edaf4c 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >       */
> > >      mc->numa_mem_align_shift = 28;
> > >      mc->auto_enable_numa = true;
> > > +    mc->forbid_asymmetrical_numa = true;
> > >  
> > >      smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
> > >      smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
> > > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > > index bc5b82ad20..dc6cdd1c53 100644
> > > --- a/include/hw/boards.h
> > > +++ b/include/hw/boards.h
> > > @@ -215,6 +215,7 @@ struct MachineClass {
> > >      bool nvdimm_supported;
> > >      bool numa_mem_supported;
> > >      bool auto_enable_numa;
> > > +    bool forbid_asymmetrical_numa;
> > >      const char *default_ram_id;
> > >  
> > >      HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
> > 
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
  2020-08-14 20:54 ` [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains Daniel Henrique Barboza
@ 2020-08-20  4:26   ` David Gibson
  2020-08-26 20:06     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-20  4:26 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 16985 bytes --]

On Fri, Aug 14, 2020 at 05:54:22PM -0300, Daniel Henrique Barboza wrote:
> We can't use the input from machine->numa_state->nodes directly
> in the pSeries machine because PAPR does not work with raw distance
> values, like ACPI SLIT does. We need to determine common
> associativity domains, based on similar performance/distance of the
> resources, and set these domains in the associativity array that goes
> to the FDT of each resource.
> 
> To ease the translation between regular ACPI NUMA distance info
> to our PAPR dialect, let's create a matrix called numa_assoc_domains
> in the SpaprMachineClass. This matrix will be initialized during
> machine init, where we will read NUMA information from user input,
> apply a heuristic to determine the associativity domains for each node,
> then populate numa_assoc_domains accordingly.
> 
> The changes are mostly centered in the spapr_set_associativity()
> helper that will use the values of numa_assoc_domains instead of
> using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
> h_home_node_associativity() being the exceptions.
> 
> To keep the changes under control, we'll plug in the matrix usage
> in the existing code first. The actual heuristic to determine
> the associativity domains for each NUMA node will come in a follow-up
> patch.
> 
> Note that the matrix is initialized with zeros, meaning that there are
> no guest changes implemented in this patch. We'll keep these
> changes away from legacy NUMA guests by not initializing the matrix
> in those cases.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

IIUC, what this is basically doing is that instead of doing the
translation from qemu's internal NUMA representation to PAPR's at the
point(s) we emit the PAPR information, we're moving it to persistent
data we calculate for each (qemu) numa node and then just copy out
when we emit the information?

That could be a reasonable idea, and indeed the rest of the series
might be easier to understand if this change were earlier in the
series.  In particular it means we might be able to localise all the
hacks for calculating the right vectors depending on machine
version/options into one place.

A couple of nits, though:

> ---
>  hw/ppc/spapr.c                | 46 +++++++++++++++++++++++------------
>  hw/ppc/spapr_hcall.c          | 13 ++++++++--
>  hw/ppc/spapr_nvdimm.c         | 13 +++++-----
>  hw/ppc/spapr_pci.c            |  3 ++-
>  include/hw/ppc/spapr.h        |  7 +++++-
>  include/hw/ppc/spapr_nvdimm.h |  5 ++--
>  6 files changed, 59 insertions(+), 28 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b80a6f6936..4f50ab21ee 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>      return ret;
>  }
>  
> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
> +                             MachineState *machine)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
> +    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
> +    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
>      uint8_t assoc_size = 0x4;
>  
>      if (cpu_index >= 0) {
> @@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>      }
>  
>      assoc[0] = cpu_to_be32(assoc_size);
> -    assoc[1] = cpu_to_be32(0x0);
> -    assoc[2] = cpu_to_be32(0x0);
> -    assoc[3] = cpu_to_be32(0x0);
> +    assoc[1] = cpu_to_be32(assoc_domain1);
> +    assoc[2] = cpu_to_be32(assoc_domain2);
> +    assoc[3] = cpu_to_be32(assoc_domain3);
>      assoc[4] = cpu_to_be32(node_id);

So spapr_set_associativity() is already a slightly dangerous function,
because the required buffer space for 'assoc' varies in a non-obvious
way depending on whether cpu_index is >= 0.  I didn't comment on that when
it was introduced, because it's not really any worse than what it
replaced.

But with this change, I think we can do better.  I'd suggest storing
the full PAPR associativity vector for each qemu numa node verbatim,
so it can just be copied straight into the device tree without
interpretation.  Then the helper can actually do the property set, and
we don't need magically sized locals any more.

Obviously there will need to be some more handling for the extra layer
we add on cpu assoc vectors.  We could store a vector for each vcpu as
well, or just have a hack to adjust these (fdt_appendprop() might be
useful).
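
Rough shape of what I'm suggesting (hypothetical names, untested):

    /* One PAPR associativity vector per (qemu) numa node, stored
     * already in big-endian: the size word followed by the domains. */
    static uint32_t numa_assoc_array[MAX_NODES][5];

    static int spapr_dt_numa_associativity(void *fdt, int offset,
                                           int node_id)
    {
        /* Copied out verbatim, no interpretation at emit time.  vcpus
         * would additionally need the vcpu index appended (and the
         * size word bumped), e.g. via fdt_appendprop(). */
        return fdt_setprop(fdt, offset, "ibm,associativity",
                           numa_assoc_array[node_id],
                           sizeof(numa_assoc_array[node_id]));
    }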

>  }
>  
> -static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
> +static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
> +                                   MachineState *machine)
>  {
>      int index = spapr_get_vcpu_id(cpu);
>      uint32_t associativity[6];
> -    spapr_set_associativity(associativity, cpu->node_id, index);
> +    spapr_set_associativity(associativity, cpu->node_id, index, machine);
>  
>      /* Advertise NUMA via ibm,associativity */
>      return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
> @@ -335,14 +341,14 @@ static void add_str(GString *s, const gchar *s1)
>  }
>  
>  static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
> -                                hwaddr size)
> +                                hwaddr size, MachineState *machine)
>  {
>      uint32_t associativity[5];
>      char mem_name[32];
>      uint64_t mem_reg_property[2];
>      int off;
>  
> -    spapr_set_associativity(associativity, nodeid, -1);
> +    spapr_set_associativity(associativity, nodeid, -1, machine);
>  
>      mem_reg_property[0] = cpu_to_be64(start);
>      mem_reg_property[1] = cpu_to_be64(size);
> @@ -574,6 +580,7 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
>                                                     void *fdt)
>  {
>      MachineState *machine = MACHINE(spapr);
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>      int nb_numa_nodes = machine->numa_state->num_nodes;
>      int ret, i, offset;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> @@ -628,12 +635,17 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
>      int_buf[1] = cpu_to_be32(4); /* Number of entries per associativity list */
>      cur_index += 2;
>      for (i = 0; i < nr_nodes; i++) {
> +        uint8_t assoc_domain1 = smc->numa_assoc_domains[i][0];
> +        uint8_t assoc_domain2 = smc->numa_assoc_domains[i][1];
> +        uint8_t assoc_domain3 = smc->numa_assoc_domains[i][2];
> +
>          uint32_t associativity[] = {
> -            cpu_to_be32(0x0),
> -            cpu_to_be32(0x0),
> -            cpu_to_be32(0x0),
> +            cpu_to_be32(assoc_domain1),
> +            cpu_to_be32(assoc_domain2),
> +            cpu_to_be32(assoc_domain3),
>              cpu_to_be32(i)

So again, I suggest this be copied as a blob from a stored vector for
the node, rather than a mix of stored info and locally calculated info
(the final 'i').

>          };
> +
>          memcpy(cur_index, associativity, sizeof(associativity));
>          cur_index += 4;
>      }
> @@ -667,7 +679,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
>          if (!mem_start) {
>              /* spapr_machine_init() checks for rma_size <= node0_size
>               * already */
> -            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size);
> +            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size, machine);
>              mem_start += spapr->rma_size;
>              node_size -= spapr->rma_size;
>          }
> @@ -679,7 +691,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
>                  sizetmp = 1ULL << ctzl(mem_start);
>              }
>  
> -            spapr_dt_memory_node(fdt, i, mem_start, sizetmp);
> +            spapr_dt_memory_node(fdt, i, mem_start, sizetmp, machine);
>              node_size -= sizetmp;
>              mem_start += sizetmp;
>          }
> @@ -809,7 +821,7 @@ static void spapr_dt_cpu(CPUState *cs, void *fdt, int offset,
>                        pft_size_prop, sizeof(pft_size_prop))));
>  
>      if (ms->numa_state->num_nodes > 1) {
> -        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu));
> +        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu, ms));
>      }
>  
>      _FDT(spapr_fixup_cpu_smt_dt(fdt, offset, cpu, compat_smt));
> @@ -1335,7 +1347,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>  
>      /* NVDIMM devices */
>      if (mc->nvdimm_supported) {
> -        spapr_dt_persistent_memory(fdt);
> +        spapr_dt_persistent_memory(fdt, machine);
>      }
>  
>      return fdt;
> @@ -3453,6 +3465,7 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
>  int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                            void *fdt, int *fdt_start_offset, Error **errp)
>  {
> +    MachineState *machine = MACHINE(spapr);
>      uint64_t addr;
>      uint32_t node;
>  
> @@ -3460,7 +3473,8 @@ int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>      node = object_property_get_uint(OBJECT(drc->dev), PC_DIMM_NODE_PROP,
>                                      &error_abort);
>      *fdt_start_offset = spapr_dt_memory_node(fdt, node, addr,
> -                                             SPAPR_MEMORY_BLOCK_SIZE);
> +                                             SPAPR_MEMORY_BLOCK_SIZE,
> +                                             machine);
>      return 0;
>  }
>  
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index c1d01228c6..6e94e513cf 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1878,10 +1878,15 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                target_ulong opcode,
>                                                target_ulong *args)
>  {
> +    MachineState *machine = MACHINE(spapr);
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>      target_ulong flags = args[0];
>      target_ulong procno = args[1];
>      PowerPCCPU *tcpu;
>      int idx;
> +    uint8_t assoc_domain1;
> +    uint8_t assoc_domain2;
> +    uint8_t assoc_domain3;
>  
>      /* only support procno from H_REGISTER_VPA */
>      if (flags != 0x1) {
> @@ -1893,13 +1898,17 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>          return H_P2;
>      }
>  
> +    assoc_domain1 = smc->numa_assoc_domains[tcpu->node_id][0];
> +    assoc_domain2 = smc->numa_assoc_domains[tcpu->node_id][1];
> +    assoc_domain3 = smc->numa_assoc_domains[tcpu->node_id][2];
> +
>      /* sequence is the same as in the "ibm,associativity" property */
>  
>      idx = 0;
>  #define ASSOCIATIVITY(a, b) (((uint64_t)(a) << 32) | \
>                               ((uint64_t)(b) & 0xffffffff))
> -    args[idx++] = ASSOCIATIVITY(0, 0);
> -    args[idx++] = ASSOCIATIVITY(0, tcpu->node_id);
> +    args[idx++] = ASSOCIATIVITY(assoc_domain1, assoc_domain2);
> +    args[idx++] = ASSOCIATIVITY(assoc_domain3, tcpu->node_id);
>      args[idx++] = ASSOCIATIVITY(procno, -1);
>      for ( ; idx < 6; idx++) {
>          args[idx] = -1;
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index bd109bfc00..7f1f088c39 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -87,8 +87,9 @@ int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                             void *fdt, int *fdt_start_offset, Error **errp)
>  {
>      NVDIMMDevice *nvdimm = NVDIMM(drc->dev);
> +    MachineState *machine = MACHINE(spapr);
>  
> -    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm);
> +    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm, machine);
>  
>      return 0;
>  }
> @@ -104,8 +105,8 @@ void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
>  }
>  
>  
> -int spapr_dt_nvdimm(void *fdt, int parent_offset,
> -                           NVDIMMDevice *nvdimm)
> +int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
> +                    MachineState *machine)
>  {
>      int child_offset;
>      char *buf;
> @@ -120,7 +121,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>      uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>                                              NULL);
>  
> -    spapr_set_associativity(associativity, node, -1);
> +    spapr_set_associativity(associativity, node, -1, machine);
>  
>      drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>      g_assert(drc);
> @@ -159,7 +160,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>      return child_offset;
>  }
>  
> -void spapr_dt_persistent_memory(void *fdt)
> +void spapr_dt_persistent_memory(void *fdt, MachineState *machine)
>  {
>      int offset = fdt_subnode_offset(fdt, 0, "persistent-memory");
>      GSList *iter, *nvdimms = nvdimm_get_device_list();
> @@ -177,7 +178,7 @@ void spapr_dt_persistent_memory(void *fdt)
>      for (iter = nvdimms; iter; iter = iter->next) {
>          NVDIMMDevice *nvdimm = iter->data;
>  
> -        spapr_dt_nvdimm(fdt, offset, nvdimm);
> +        spapr_dt_nvdimm(fdt, offset, nvdimm, machine);
>      }
>      g_slist_free(nvdimms);
>  
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index c02ace226c..4d9ef63f3e 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -2355,7 +2355,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>  
>      /* Advertise NUMA via ibm,associativity */
>      if (phb->numa_node != -1) {
> -        spapr_set_associativity(associativity, phb->numa_node, -1);
> +        MachineState *machine = MACHINE(spapr);
> +        spapr_set_associativity(associativity, phb->numa_node, -1, machine);
>          _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
>                           sizeof(associativity)));
>      }
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index cd158bf95a..1f9700ac19 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -104,6 +104,9 @@ typedef enum {
>  
>  #define FDT_MAX_SIZE                    0x100000
>  
> +/* Taken from Linux kernel arch/powerpc/mm/numa.h */
> +#define MAX_DISTANCE_REF_POINTS         4
> +
>  typedef struct SpaprCapabilities SpaprCapabilities;
>  struct SpaprCapabilities {
>      uint8_t caps[SPAPR_CAP_NUM];
> @@ -131,6 +134,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    uint8_t numa_assoc_domains[MAX_NODES][MAX_DISTANCE_REF_POINTS-1];
>  
>      void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio, 
> @@ -863,7 +867,8 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>  
>  void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
>  int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
> +                             MachineState *machine);
>  
>  #define TYPE_SPAPR_RNG "spapr-rng"
>  
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index b3330cc485..be30773c7d 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -27,8 +27,9 @@ QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
>  
>  int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>                             void *fdt, int *fdt_start_offset, Error **errp);
> -int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm);
> -void spapr_dt_persistent_memory(void *fdt);
> +int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
> +                    MachineState *machine);
> +void spapr_dt_persistent_memory(void *fdt, MachineState *machine);
>  void spapr_nvdimm_validate_opts(NVDIMMDevice *nvdimm, uint64_t size,
>                                  Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot, Error **errp);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20  4:15       ` David Gibson
@ 2020-08-20 10:33         ` Daniel Henrique Barboza
  2020-08-20 14:29           ` Igor Mammedov
  2020-08-20 16:51         ` Eduardo Habkost
  1 sibling, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-20 10:33 UTC (permalink / raw)
  To: David Gibson, Eduardo Habkost; +Cc: qemu-ppc, John Snow, qemu-devel



On 8/20/20 1:15 AM, David Gibson wrote:
> On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
>> On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
>>> On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
>>>> The pSeries machine does not support asymmetrical NUMA
>>>> configurations.
>>>
>>> This seems a bit oddly specific to have as a global machine class
>>> property.
>>>
>>> Would it make more sense for machines with specific NUMA constraints
>>> to just verify those during their initialization?
>>
>> This would be much simpler.  However, I like the idea of
>> representing machine-specific configuration validation rules as
>> data that can eventually be exported to management software.
> 
> Ah, ok, so basically the usual tradeoff between flexibility and
> advertisability.



To provide context, what I did here was inspired by this commit:

commit 0533ef5f2089f4f12a0ec5c8035e5e15ba0b5556
Author: Tao Xu <tao3.xu@intel.com>
Date:   Thu Sep 5 16:32:38 2019 +0800

     numa: Introduce MachineClass::auto_enable_numa for implicit NUMA node


In that commit, NUMA code that was exclusive to spapr.c was moved into
numa.c, with a flag being set in the spapr machine_init.


Thanks,


DHB


> 
> So, in that case, I guess the question is whether we envisage "no
> assymmetry" as a constraint common enough that it's worth creating an
> advertisable rule or not.  If we only ever have one user, then we
> haven't really done any better than hard coding the constraint in the
> manageent software.
> 
> Of course to complicate matters, in the longer term we're looking at
> removing that constraint from pseries - but doing so will be dependent
> on the guest kernel understanding a new format for the NUMA
> information in the device tree.  So qemu alone won't have enough
> information to tell if such a configuration is possible or not.
> 
>> (CCing John Snow, who had spent some time thinking about
>> configuration validation recently.)
>>
>>
>>>>
>>>> CC: Eduardo Habkost <ehabkost@redhat.com>
>>>> CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>>> ---
>>>>   hw/core/numa.c      | 7 +++++++
>>>>   hw/ppc/spapr.c      | 1 +
>>>>   include/hw/boards.h | 1 +
>>>>   3 files changed, 9 insertions(+)
>>>>
>>>> diff --git a/hw/core/numa.c b/hw/core/numa.c
>>>> index d1a94a14f8..1e81233c1d 100644
>>>> --- a/hw/core/numa.c
>>>> +++ b/hw/core/numa.c
>>>> @@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>>>>    */
>>>>   static void validate_numa_distance(MachineState *ms)
>>>>   {
>>>> +    MachineClass *mc = MACHINE_GET_CLASS(ms);
>>>>       int src, dst;
>>>>       bool is_asymmetrical = false;
>>>>       int nb_numa_nodes = ms->numa_state->num_nodes;
>>>> @@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
>>>>       }
>>>>   
>>>>       if (is_asymmetrical) {
>>>> +        if (mc->forbid_asymmetrical_numa) {
>>>> +            error_report("This machine type does not support "
>>>> +                         "asymmetrical numa distances.");
>>>> +            exit(EXIT_FAILURE);
>>>> +        }
>>>> +
>>>>           for (src = 0; src < nb_numa_nodes; src++) {
>>>>               for (dst = 0; dst < nb_numa_nodes; dst++) {
>>>>                   if (src != dst && numa_info[src].distance[dst] == 0) {
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index dd2fa4826b..3b16edaf4c 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>>>>        */
>>>>       mc->numa_mem_align_shift = 28;
>>>>       mc->auto_enable_numa = true;
>>>> +    mc->forbid_asymmetrical_numa = true;
>>>>   
>>>>       smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
>>>>       smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>> index bc5b82ad20..dc6cdd1c53 100644
>>>> --- a/include/hw/boards.h
>>>> +++ b/include/hw/boards.h
>>>> @@ -215,6 +215,7 @@ struct MachineClass {
>>>>       bool nvdimm_supported;
>>>>       bool numa_mem_supported;
>>>>       bool auto_enable_numa;
>>>> +    bool forbid_asymmetrical_numa;
>>>>       const char *default_ram_id;
>>>>   
>>>>       HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
>>>
>>
>>
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 07/10] spapr: create helper to set ibm,associativity
  2020-08-20  3:00   ` David Gibson
@ 2020-08-20 10:39     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-20 10:39 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 8/20/20 12:00 AM, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:21PM -0300, Daniel Henrique Barboza wrote:
>> We have several places around hw/ppc files where we use the
>> same code to set the ibm,associativity array. This patch
>> creates a helper called spapr_set_associativity() to do
>> that in a single place. It'll also make it saner to change
>> the value of ibm,associativity in the next patches.
>>
>> After this patch, only 2 places are left with open code
>> ibm,associativity assignment:
>>
>> - spapr_dt_dynamic_reconfiguration_memory()
>> - h_home_node_associativity() in spapr_hcall.c
>>
>> The update of associativity values will be made in these places
>> manually later on.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> I like this - any chance you could move this to the front of the
> series so that we can make this code easier to follow while we're
> still discussing the more meaningful changes?


No problem. I'll move this up front in v2.


DHB

> 
>> ---
>>   hw/ppc/spapr.c         | 32 +++++++++++++++++++++-----------
>>   hw/ppc/spapr_nvdimm.c  |  8 +++-----
>>   hw/ppc/spapr_pci.c     |  8 +++-----
>>   include/hw/ppc/spapr.h |  1 +
>>   4 files changed, 28 insertions(+), 21 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index bc51d2db90..b80a6f6936 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -201,15 +201,27 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>>       return ret;
>>   }
>>   
>> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>> +{
>> +    uint8_t assoc_size = 0x4;
>> +
>> +    if (cpu_index >= 0) {
>> +        assoc_size = 0x5;
>> +        assoc[5] = cpu_to_be32(cpu_index);
>> +    }
>> +
>> +    assoc[0] = cpu_to_be32(assoc_size);
>> +    assoc[1] = cpu_to_be32(0x0);
>> +    assoc[2] = cpu_to_be32(0x0);
>> +    assoc[3] = cpu_to_be32(0x0);
>> +    assoc[4] = cpu_to_be32(node_id);
>> +}
>> +
>>   static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
>>   {
>>       int index = spapr_get_vcpu_id(cpu);
>> -    uint32_t associativity[] = {cpu_to_be32(0x5),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(cpu->node_id),
>> -                                cpu_to_be32(index)};
>> +    uint32_t associativity[6];
>> +    spapr_set_associativity(associativity, cpu->node_id, index);
>>   
>>       /* Advertise NUMA via ibm,associativity */
>>       return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
>> @@ -325,15 +337,13 @@ static void add_str(GString *s, const gchar *s1)
>>   static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
>>                                   hwaddr size)
>>   {
>> -    uint32_t associativity[] = {
>> -        cpu_to_be32(0x4), /* length */
>> -        cpu_to_be32(0x0), cpu_to_be32(0x0),
>> -        cpu_to_be32(0x0), cpu_to_be32(nodeid)
>> -    };
>> +    uint32_t associativity[5];
>>       char mem_name[32];
>>       uint64_t mem_reg_property[2];
>>       int off;
>>   
>> +    spapr_set_associativity(associativity, nodeid, -1);
>> +
>>       mem_reg_property[0] = cpu_to_be64(start);
>>       mem_reg_property[1] = cpu_to_be64(size);
>>   
>> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
>> index 81410aa63f..bd109bfc00 100644
>> --- a/hw/ppc/spapr_nvdimm.c
>> +++ b/hw/ppc/spapr_nvdimm.c
>> @@ -115,15 +115,13 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>>                                                &error_abort);
>>       uint64_t slot = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_SLOT_PROP,
>>                                                &error_abort);
>> -    uint32_t associativity[] = {
>> -        cpu_to_be32(0x4), /* length */
>> -        cpu_to_be32(0x0), cpu_to_be32(0x0),
>> -        cpu_to_be32(0x0), cpu_to_be32(node)
>> -    };
>> +    uint32_t associativity[5];
>>       uint64_t lsize = nvdimm->label_size;
>>       uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>>                                               NULL);
>>   
>> +    spapr_set_associativity(associativity, node, -1);
>> +
>>       drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>>       g_assert(drc);
>>   
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 09ac58fd7f..c02ace226c 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -2321,11 +2321,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>>           cpu_to_be32(1),
>>           cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>       };
>> -    uint32_t associativity[] = {cpu_to_be32(0x4),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(0x0),
>> -                                cpu_to_be32(phb->numa_node)};
>> +    uint32_t associativity[5];
>> +
>>       SpaprTceTable *tcet;
>>       SpaprDrc *drc;
>>       Error *err = NULL;
>> @@ -2358,6 +2355,7 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>>   
>>       /* Advertise NUMA via ibm,associativity */
>>       if (phb->numa_node != -1) {
>> +        spapr_set_associativity(associativity, phb->numa_node, -1);
>>           _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
>>                            sizeof(associativity)));
>>       }
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index d9f1afa8b2..cd158bf95a 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -863,6 +863,7 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>   
>>   void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
>>   int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
>> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
>>   
>>   #define TYPE_SPAPR_RNG "spapr-rng"
>>   
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20 10:33         ` Daniel Henrique Barboza
@ 2020-08-20 14:29           ` Igor Mammedov
  0 siblings, 0 replies; 34+ messages in thread
From: Igor Mammedov @ 2020-08-20 14:29 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, John Snow, qemu-ppc, Eduardo Habkost, David Gibson

On Thu, 20 Aug 2020 07:33:00 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> On 8/20/20 1:15 AM, David Gibson wrote:
> > On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:  
> >> On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:  
> >>> On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:  
> >>>> The pSeries machine does not support asymmetrical NUMA
> >>>> configurations.  
> >>>
> >>> This seems a bit oddly specific to have as a global machine class
> >>> property.
> >>>
> >>> Would it make more sense for machines with specific NUMA constraints
> >>> to just verify those during their initialization?  
> >>
> >> This would be much simpler.  However, I like the idea of
> >> representing machine-specific configuration validation rules as
> >> data that can eventually be exported to management software.  
> > 
> > Ah, ok, so basically the usual tradeoff between flexibility and
> > advertisability.  
> 
> 
> 
> To provide context, what I did here was inspired by this commit:
> 
> commit 0533ef5f2089f4f12a0ec5c8035e5e15ba0b5556
> Author: Tao Xu <tao3.xu@intel.com>
> Date:   Thu Sep 5 16:32:38 2019 +0800
> 
>      numa: Introduce MachineClass::auto_enable_numa for implicit NUMA node
> 
> 
> In this commit, exclusive NUMA code from spapr.c was taken and put it
> into numa.c, with a flag being set in spapr machine_init.

We have too many auto_enable_numa* flags; we should probably merge them
into a single auto_enable_numa if possible.

> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> > So, in that case, I guess the question is whether we envisage "no
> > assymmetry" as a constraint common enough that it's worth creating an
> > advertisable rule or not.  If we only ever have one user, then we
> > haven't really done any better than hard coding the constraint in the
> > manageent software.
> > 
> > Of course to complicate matters, in the longer term we're looking at
> > removing that constraint from pseries - but doing so will be dependent
> > on the guest kernel understanding a new format for the NUMA
> > information in the device tree.  So qemu alone won't have enough
> > information to tell if such a configuration is possible or not.

If it's a guest limitation, then it doesn't sound like QEMU material to me.
It would be more appropriate to validate this at the management level, which
knows which guest is being used.

> >   
> >> (CCing John Snow, who had spent some time thinking about
> >> configuration validation recently.)
> >>
> >>  
> >>>>
> >>>> CC: Eduardo Habkost <ehabkost@redhat.com>
> >>>> CC: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> >>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >>>> ---
> >>>>   hw/core/numa.c      | 7 +++++++
> >>>>   hw/ppc/spapr.c      | 1 +
> >>>>   include/hw/boards.h | 1 +
> >>>>   3 files changed, 9 insertions(+)
> >>>>
> >>>> diff --git a/hw/core/numa.c b/hw/core/numa.c
> >>>> index d1a94a14f8..1e81233c1d 100644
> >>>> --- a/hw/core/numa.c
> >>>> +++ b/hw/core/numa.c
> >>>> @@ -547,6 +547,7 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >>>>    */
> >>>>   static void validate_numa_distance(MachineState *ms)
> >>>>   {
> >>>> +    MachineClass *mc = MACHINE_GET_CLASS(ms);
> >>>>       int src, dst;
> >>>>       bool is_asymmetrical = false;
> >>>>       int nb_numa_nodes = ms->numa_state->num_nodes;
> >>>> @@ -575,6 +576,12 @@ static void validate_numa_distance(MachineState *ms)
> >>>>       }
> >>>>   
> >>>>       if (is_asymmetrical) {
> >>>> +        if (mc->forbid_asymmetrical_numa) {
> >>>> +            error_report("This machine type does not support "
> >>>> +                         "asymmetrical numa distances.");
> >>>> +            exit(EXIT_FAILURE);
> >>>> +        }
> >>>> +
> >>>>           for (src = 0; src < nb_numa_nodes; src++) {
> >>>>               for (dst = 0; dst < nb_numa_nodes; dst++) {
> >>>>                   if (src != dst && numa_info[src].distance[dst] == 0) {
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index dd2fa4826b..3b16edaf4c 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -4512,6 +4512,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>>        */
> >>>>       mc->numa_mem_align_shift = 28;
> >>>>       mc->auto_enable_numa = true;
> >>>> +    mc->forbid_asymmetrical_numa = true;
> >>>>   
> >>>>       smc->default_caps.caps[SPAPR_CAP_HTM] = SPAPR_CAP_OFF;
> >>>>       smc->default_caps.caps[SPAPR_CAP_VSX] = SPAPR_CAP_ON;
> >>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >>>> index bc5b82ad20..dc6cdd1c53 100644
> >>>> --- a/include/hw/boards.h
> >>>> +++ b/include/hw/boards.h
> >>>> @@ -215,6 +215,7 @@ struct MachineClass {
> >>>>       bool nvdimm_supported;
> >>>>       bool numa_mem_supported;
> >>>>       bool auto_enable_numa;
> >>>> +    bool forbid_asymmetrical_numa;
> >>>>       const char *default_ram_id;
> >>>>   
> >>>>       HotplugHandler *(*get_hotplug_handler)(MachineState *machine,  
> >>>  
> >>
> >>
> >>  
> >   
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20  4:15       ` David Gibson
  2020-08-20 10:33         ` Daniel Henrique Barboza
@ 2020-08-20 16:51         ` Eduardo Habkost
  2020-08-21  8:55           ` Igor Mammedov
  2020-09-23 15:21           ` John Snow
  1 sibling, 2 replies; 34+ messages in thread
From: Eduardo Habkost @ 2020-08-20 16:51 UTC (permalink / raw)
  To: David Gibson; +Cc: John Snow, Daniel Henrique Barboza, qemu-ppc, qemu-devel

On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
> On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
> > On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
> > > On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> > > > The pSeries machine does not support asymmetrical NUMA
> > > > configurations.
> > > 
> > > This seems a bit oddly specific to have as a global machine class
> > > property.
> > > 
> > > Would it make more sense for machines with specific NUMA constraints
> > > to just verify those during their initialization?
> > 
> > This would be much simpler.  However, I like the idea of
> > representing machine-specific configuration validation rules as
> > data that can eventually be exported to management software.
> 
> Ah, ok, so basically the usual tradeoff between flexibility and
> advertisability.
> 
> So, in that case, I guess the question is whether we envisage "no
> assymmetry" as a constraint common enough that it's worth creating an
> advertisable rule or not.  If we only ever have one user, then we
> haven't really done any better than hard coding the constraint in the
> manageent software.
> 
> Of course to complicate matters, in the longer term we're looking at
> removing that constraint from pseries - but doing so will be dependent
> on the guest kernel understanding a new format for the NUMA
> information in the device tree.  So qemu alone won't have enough
> information to tell if such a configuration is possible or not.

Requiring both QEMU (and possibly management software) to be
patched again after the guest kernel is fixed sounds undesirable.
Perhaps a warning would be better in this case?

In either case, it sounds like this won't be a common constraint
and I now agree with your original suggestion of doing this in
machine initialization code.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20 16:51         ` Eduardo Habkost
@ 2020-08-21  8:55           ` Igor Mammedov
  2020-08-21 12:47             ` Daniel Henrique Barboza
  2020-09-23 15:21           ` John Snow
  1 sibling, 1 reply; 34+ messages in thread
From: Igor Mammedov @ 2020-08-21  8:55 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: qemu-ppc, Daniel Henrique Barboza, John Snow, qemu-devel, David Gibson

On Thu, 20 Aug 2020 12:51:03 -0400
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
> > On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:  
> > > On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:  
> > > > On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:  
> > > > > The pSeries machine does not support asymmetrical NUMA
> > > > > configurations.  
> > > > 
> > > > This seems a bit oddly specific to have as a global machine class
> > > > property.
> > > > 
> > > > Would it make more sense for machines with specific NUMA constraints
> > > > to just verify those during their initialization?  
> > > 
> > > This would be much simpler.  However, I like the idea of
> > > representing machine-specific configuration validation rules as
> > > data that can eventually be exported to management software.  
> > 
> > Ah, ok, so basically the usual tradeoff between flexibility and
> > advertisability.
> > 
> > So, in that case, I guess the question is whether we envisage "no
> > assymmetry" as a constraint common enough that it's worth creating an
> > advertisable rule or not.  If we only ever have one user, then we
> > haven't really done any better than hard coding the constraint in the
> > manageent software.
> > 
> > Of course to complicate matters, in the longer term we're looking at
> > removing that constraint from pseries - but doing so will be dependent
> > on the guest kernel understanding a new format for the NUMA
> > information in the device tree.  So qemu alone won't have enough
> > information to tell if such a configuration is possible or not.  
> 
> Requiring both QEMU (and possibly management software) to be
> patched again after the guest kernel is fixed sounds undesirable.
If we drop this restriction, then we don't need to touch QEMU when
the guest kernel is ready.

Btw, what does the spapr spec say about the matter?

> Perhaps a warning would be better in this case?
> 
> In either case, it sounds like this won't be a common constraint
> and I now agree with your original suggestion of doing this in
> machine initialization code.
Agreed, if it goes into spapr-specific machine code I will not object much
(it will burden just the spapr maintainers, so in the end it's about
convincing David).



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-21  8:55           ` Igor Mammedov
@ 2020-08-21 12:47             ` Daniel Henrique Barboza
  2020-08-24  6:08               ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-21 12:47 UTC (permalink / raw)
  To: Igor Mammedov, Eduardo Habkost
  Cc: qemu-ppc, John Snow, qemu-devel, David Gibson



On 8/21/20 5:55 AM, Igor Mammedov wrote:
> On Thu, 20 Aug 2020 12:51:03 -0400
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
>> On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
>>> On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
>>>> On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
>>>>> On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
>>>>>> The pSeries machine does not support asymmetrical NUMA
>>>>>> configurations.
>>>>>
>>>>> This seems a bit oddly specific to have as a global machine class
>>>>> property.
>>>>>
>>>>> Would it make more sense for machines with specific NUMA constraints
>>>>> to just verify those during their initialization?
>>>>
>>>> This would be much simpler.  However, I like the idea of
>>>> representing machine-specific configuration validation rules as
>>>> data that can eventually be exported to management software.
>>>
>>> Ah, ok, so basically the usual tradeoff between flexibility and
>>> advertisability.
>>>
>>> So, in that case, I guess the question is whether we envisage "no
>>> assymmetry" as a constraint common enough that it's worth creating an
>>> advertisable rule or not.  If we only ever have one user, then we
>>> haven't really done any better than hard coding the constraint in the
>>> manageent software.
>>>
>>> Of course to complicate matters, in the longer term we're looking at
>>> removing that constraint from pseries - but doing so will be dependent
>>> on the guest kernel understanding a new format for the NUMA
>>> information in the device tree.  So qemu alone won't have enough
>>> information to tell if such a configuration is possible or not.
>>
>> Requiring both QEMU (and possibly management software) to be
>> patched again after the guest kernel is fixed sounds undesirable.
> If we drop this restriction, then we don't need to touch QEMU when
> guest kernel is ready.
> 
> Btw, what spapr spec says about the matter?


LOPAPR supports a somewhat asymmetrical NUMA setup in its current form, but
the Linux kernel doesn't support it. The effort to implement it in the current
spapr machine code is not worth it, given that Linux would just ignore it
anyway. This is why I chose to invalidate it for pseries.


> 
>> Perhaps a warning would be better in this case?
>>
>> In either case, it sounds like this won't be a common constraint
>> and I now agree with your original suggestion of doing this in
>> machine initialization code.
> Agreed, if it goes to spapr specific machine code I will not object much.
> (it will burden just spapr maintainers, so it's about convincing
> David in the end)

I believe he's ok with it given that he suggested it in his first reply.

I'll move this verification to spapr machine_init in the next version.



Thanks,

DHB


> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-21 12:47             ` Daniel Henrique Barboza
@ 2020-08-24  6:08               ` David Gibson
  2020-08-24 11:45                 ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-24  6:08 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-ppc, Igor Mammedov, John Snow, Eduardo Habkost, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4530 bytes --]

On Fri, Aug 21, 2020 at 09:47:47AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 8/21/20 5:55 AM, Igor Mammedov wrote:
> > On Thu, 20 Aug 2020 12:51:03 -0400
> > Eduardo Habkost <ehabkost@redhat.com> wrote:
> > 
> > > On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
> > > > On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
> > > > > On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
> > > > > > On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> > > > > > > The pSeries machine does not support asymmetrical NUMA
> > > > > > > configurations.
> > > > > > 
> > > > > > This seems a bit oddly specific to have as a global machine class
> > > > > > property.
> > > > > > 
> > > > > > Would it make more sense for machines with specific NUMA constraints
> > > > > > to just verify those during their initialization?
> > > > > 
> > > > > This would be much simpler.  However, I like the idea of
> > > > > representing machine-specific configuration validation rules as
> > > > > data that can eventually be exported to management software.
> > > > 
> > > > Ah, ok, so basically the usual tradeoff between flexibility and
> > > > advertisability.
> > > > 
> > > > So, in that case, I guess the question is whether we envisage "no
> > > > assymmetry" as a constraint common enough that it's worth creating an
> > > > advertisable rule or not.  If we only ever have one user, then we
> > > > haven't really done any better than hard coding the constraint in the
> > > > manageent software.
> > > > 
> > > > Of course to complicate matters, in the longer term we're looking at
> > > > removing that constraint from pseries - but doing so will be dependent
> > > > on the guest kernel understanding a new format for the NUMA
> > > > information in the device tree.  So qemu alone won't have enough
> > > > information to tell if such a configuration is possible or not.
> > > 
> > > Requiring both QEMU (and possibly management software) to be
> > > patched again after the guest kernel is fixed sounds undesirable.
> > If we drop this restriction, then we don't need to touch QEMU when
> > guest kernel is ready.
> > 
> > Btw, what spapr spec says about the matter?
> 
> LOPAPR support a somewhat asymmetrical NUMA setup in its current
> form,

Huh, I didn't even realize that.  What's the mechanism?

> but
> the Linux kernel doesn't support it. The effort to implement it in the current
> spapr machine code, given that Linux wouldn't mind it, is not worth it. This
> is why I chose to invalidate it for pseries.

Igor,

It's kind of difficult to answer that question - PAPR doesn't
specifically describe limitations, it's just that the representation
it uses is inherently limited.  Instead of the obvious, simple and
pretty much universal method (used in the generic kernel and qemu) of
having a matrix of distances between all the nodes, it describes the
hierarchy of components that give rise to the different
distances.

So, for each NUMA-relevant object (cpu, memory block, host bridge,
etc.) there is a vector of IDs.  Each number in the vector gives one
level of the object's location in the hierarchy.

So, for example, the first number might be the physical chip/socket,
the second one which group of cores & memory interfaces share an Ln
cache, and the third one the specific core number.  To work out how far
two objects are from each other you essentially look at how long a
prefix of their vectors they share, which tells you how far up the
hierarchy you have to go to reach one from the other.

There's a bunch of complicating details, but that's the gist of it.
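
In toy C terms (nothing to do with the actual qemu code, just to
illustrate the matching rule - the vectors here are plain ints,
highest level first):

/* "how associated" two resources are is the length of the common prefix
 * of their vectors; a larger count means a shorter distance */
static int assoc_count(const int *a, const int *b, int len)
{
    int i;

    for (i = 0; i < len && a[i] == b[i]; i++) {
        ;
    }
    return i;
}

/* e.g. {socket 0, cache 1, core 2} vs {socket 0, cache 1, core 5} -> 2
 *      {socket 0, cache 1, core 2} vs {socket 1, cache 0, core 0} -> 0 */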

> > > Perhaps a warning would be better in this case?
> > > 
> > > In either case, it sounds like this won't be a common constraint
> > > and I now agree with your original suggestion of doing this in
> > > machine initialization code.
> > Agreed, if it goes to spapr specific machine code I will not object much.
> > (it will burden just spapr maintainers, so it's about convincing
> > David in the end)
> 
> I believe he's ok with it given that he suggested it in his first reply.
> 
> I'll move this verification to spapr machine_init in the next version.
> 
> 
> 
> Thanks,
> 
> DHB
> 
> 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-24  6:08               ` David Gibson
@ 2020-08-24 11:45                 ` Daniel Henrique Barboza
  2020-08-24 23:49                   ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-24 11:45 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, Igor Mammedov, John Snow, Eduardo Habkost, qemu-devel



On 8/24/20 3:08 AM, David Gibson wrote:
> On Fri, Aug 21, 2020 at 09:47:47AM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 8/21/20 5:55 AM, Igor Mammedov wrote:
>>> On Thu, 20 Aug 2020 12:51:03 -0400
>>> Eduardo Habkost <ehabkost@redhat.com> wrote:
>>>
>>>> On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
>>>>> On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
>>>>>> On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
>>>>>>> On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>> The pSeries machine does not support asymmetrical NUMA
>>>>>>>> configurations.
>>>>>>>
>>>>>>> This seems a bit oddly specific to have as a global machine class
>>>>>>> property.
>>>>>>>
>>>>>>> Would it make more sense for machines with specific NUMA constraints
>>>>>>> to just verify those during their initialization?
>>>>>>
>>>>>> This would be much simpler.  However, I like the idea of
>>>>>> representing machine-specific configuration validation rules as
>>>>>> data that can eventually be exported to management software.
>>>>>
>>>>> Ah, ok, so basically the usual tradeoff between flexibility and
>>>>> advertisability.
>>>>>
>>>>> So, in that case, I guess the question is whether we envisage "no
>>>>> assymmetry" as a constraint common enough that it's worth creating an
>>>>> advertisable rule or not.  If we only ever have one user, then we
>>>>> haven't really done any better than hard coding the constraint in the
>>>>> manageent software.
>>>>>
>>>>> Of course to complicate matters, in the longer term we're looking at
>>>>> removing that constraint from pseries - but doing so will be dependent
>>>>> on the guest kernel understanding a new format for the NUMA
>>>>> information in the device tree.  So qemu alone won't have enough
>>>>> information to tell if such a configuration is possible or not.
>>>>
>>>> Requiring both QEMU (and possibly management software) to be
>>>> patched again after the guest kernel is fixed sounds undesirable.
>>> If we drop this restriction, then we don't need to touch QEMU when
>>> guest kernel is ready.
>>>
>>> Btw, what spapr spec says about the matter?
>>
>> LOPAPR support a somewhat asymmetrical NUMA setup in its current
>> form,
> 
> Huh, I didn't even realize that.  What's the mechanism?

LOPAPR mentions that a single resource/node can have multiple associativity
arrays. The idea is to cover situations where the node has more than one
connection to the board.

I say "somewhat" because, right after mentioning that, the spec also says that
the OS should consider that the distance between two nodes must always be
the shortest one of all available arrays. I'll copy/paste the except here
(end of section 15.2, "Numa Resource Associativity":

-----

The reason that the “ibm,associativity” property may contain multiple associativity
lists is that a resource may be multiply connected into the platform. This resource
then has a different associativity characteristics relative to its multiple connections.
To determine the associativity between any two resources, the OS scans down the two
resources associativity lists in all pair wise combinations counting how many domains
are the same until the first domain where the two list do not agree. The highest such
count is the associativity between the two resources.

----
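
Put as a toy C function (just to illustrate the pairwise rule above,
not qemu code), with each resource carrying one or more lists of
domain ids:

#define ASSOC_LEVELS 4   /* arbitrary depth, just for the example */

static int assoc_best(int na, int a[][ASSOC_LEVELS],
                      int nb, int b[][ASSOC_LEVELS])
{
    int best = 0;

    /* all pairwise combinations of the two resources' lists */
    for (int i = 0; i < na; i++) {
        for (int j = 0; j < nb; j++) {
            int c = 0;

            /* count matching domains until the first disagreement */
            while (c < ASSOC_LEVELS && a[i][c] == b[j][c]) {
                c++;
            }
            if (c > best) {
                best = c;
            }
        }
    }

    return best;   /* the highest count is the associativity between the two */
}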


DHB


> 
>> but
>> the Linux kernel doesn't support it. The effort to implement it in the current
>> spapr machine code, given that Linux wouldn't mind it, is not worth it. This
>> is why I chose to invalidate it for pseries.
> 
> Igor,
> 
> It's kind of difficult to answer that question - PAPR doesn't
> specifically describe limitations, it's just that the representation
> it uses is inherently limited.  Instead of the obvious, simple and
> pretty much universal method (used in the generic kernel and qemu) of
> having a matrix of distance between all the nodes, it instead
> describes the hierarchy of components that give rise to the different
> distances.
> 
> So, for each NUMA relevant object (cpu, memory block, host bridge,
> etc.) there is a vector of IDs.  Each number in the vector gives one
> level of the objects location in the heirarchy.
> 
> So, for example the first number might be the physical chip/socket.
> the second one which group of cores & memory interfaces sharing an Ln
> cache, the third one the specific core number.  So to work out how far
> objects are from each other you essentially look at how long a prefix
> of their vector they share, which tells you how far above in the
> hierarchy you have to go to reach it.
> 
> There's a bunch of complicating details, but that's the gist of it.
> 
>>>> Perhaps a warning would be better in this case?
>>>>
>>>> In either case, it sounds like this won't be a common constraint
>>>> and I now agree with your original suggestion of doing this in
>>>> machine initialization code.
>>> Agreed, if it goes to spapr specific machine code I will not object much.
>>> (it will burden just spapr maintainers, so it's about convincing
>>> David in the end)
>>
>> I believe he's ok with it given that he suggested it in his first reply.
>>
>> I'll move this verification to spapr machine_init in the next version.
>>
>>
>>
>> Thanks,
>>
>> DHB
>>
>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-24 11:45                 ` Daniel Henrique Barboza
@ 2020-08-24 23:49                   ` David Gibson
  2020-08-25  9:56                     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2020-08-24 23:49 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-ppc, Igor Mammedov, John Snow, Eduardo Habkost, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6324 bytes --]

On Mon, Aug 24, 2020 at 08:45:12AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 8/24/20 3:08 AM, David Gibson wrote:
> > On Fri, Aug 21, 2020 at 09:47:47AM -0300, Daniel Henrique Barboza wrote:
> > > 
> > > 
> > > On 8/21/20 5:55 AM, Igor Mammedov wrote:
> > > > On Thu, 20 Aug 2020 12:51:03 -0400
> > > > Eduardo Habkost <ehabkost@redhat.com> wrote:
> > > > 
> > > > > On Thu, Aug 20, 2020 at 02:15:04PM +1000, David Gibson wrote:
> > > > > > On Wed, Aug 19, 2020 at 10:11:28PM -0400, Eduardo Habkost wrote:
> > > > > > > On Thu, Aug 20, 2020 at 11:17:26AM +1000, David Gibson wrote:
> > > > > > > > On Fri, Aug 14, 2020 at 05:54:16PM -0300, Daniel Henrique Barboza wrote:
> > > > > > > > > The pSeries machine does not support asymmetrical NUMA
> > > > > > > > > configurations.
> > > > > > > > 
> > > > > > > > This seems a bit oddly specific to have as a global machine class
> > > > > > > > property.
> > > > > > > > 
> > > > > > > > Would it make more sense for machines with specific NUMA constraints
> > > > > > > > to just verify those during their initialization?
> > > > > > > 
> > > > > > > This would be much simpler.  However, I like the idea of
> > > > > > > representing machine-specific configuration validation rules as
> > > > > > > data that can eventually be exported to management software.
> > > > > > 
> > > > > > Ah, ok, so basically the usual tradeoff between flexibility and
> > > > > > advertisability.
> > > > > > 
> > > > > > So, in that case, I guess the question is whether we envisage "no
> > > > > > assymmetry" as a constraint common enough that it's worth creating an
> > > > > > advertisable rule or not.  If we only ever have one user, then we
> > > > > > haven't really done any better than hard coding the constraint in the
> > > > > > manageent software.
> > > > > > 
> > > > > > Of course to complicate matters, in the longer term we're looking at
> > > > > > removing that constraint from pseries - but doing so will be dependent
> > > > > > on the guest kernel understanding a new format for the NUMA
> > > > > > information in the device tree.  So qemu alone won't have enough
> > > > > > information to tell if such a configuration is possible or not.
> > > > > 
> > > > > Requiring both QEMU (and possibly management software) to be
> > > > > patched again after the guest kernel is fixed sounds undesirable.
> > > > If we drop this restriction, then we don't need to touch QEMU when
> > > > guest kernel is ready.
> > > > 
> > > > Btw, what spapr spec says about the matter?
> > > 
> > > LOPAPR support a somewhat asymmetrical NUMA setup in its current
> > > form,
> > 
> > Huh, I didn't even realize that.  What's the mechanism?
> 
> LOPAPR mentions that a single resource/node can have multiple associativity
> arrays. The idea is to contemplate the situations where the node has
> more than one connection with the board.
> 
> I say "somewhat" because, right after mentioning that, the spec also says that
> the OS should consider that the distance between two nodes must always be
> the shortest one of all available arrays. I'll copy/paste the except here
> (end of section 15.2, "Numa Resource Associativity":

Ah.  I didn't think that's what "asymmetric NUMA" meant... but come to
think of it, I'm not very sure about that.

> -----
> 
> The reason that the “ibm,associativity” property may contain multiple associativity
> lists is that a resource may be multiply connected into the platform. This resource
> then has a different associativity characteristics relative to its multiple connections.
> To determine the associativity between any two resources, the OS scans down the two
> resources associativity lists in all pair wise combinations counting how many domains
> are the same until the first domain where the two list do not agree. The highest such
> count is the associativity between the two resources.
> 
> ----
> 
> 
> DHB
> 
> 
> > 
> > > but
> > > the Linux kernel doesn't support it. The effort to implement it in the current
> > > spapr machine code, given that Linux wouldn't mind it, is not worth it. This
> > > is why I chose to invalidate it for pseries.
> > 
> > Igor,
> > 
> > It's kind of difficult to answer that question - PAPR doesn't
> > specifically describe limitations, it's just that the representation
> > it uses is inherently limited.  Instead of the obvious, simple and
> > pretty much universal method (used in the generic kernel and qemu) of
> > having a matrix of distance between all the nodes, it instead
> > describes the hierarchy of components that give rise to the different
> > distances.
> > 
> > So, for each NUMA relevant object (cpu, memory block, host bridge,
> > etc.) there is a vector of IDs.  Each number in the vector gives one
> > level of the objects location in the heirarchy.
> > 
> > So, for example the first number might be the physical chip/socket.
> > the second one which group of cores & memory interfaces sharing an Ln
> > cache, the third one the specific core number.  So to work out how far
> > objects are from each other you essentially look at how long a prefix
> > of their vector they share, which tells you how far above in the
> > hierarchy you have to go to reach it.
> > 
> > There's a bunch of complicating details, but that's the gist of it.
> > 
> > > > > Perhaps a warning would be better in this case?
> > > > > 
> > > > > In either case, it sounds like this won't be a common constraint
> > > > > and I now agree with your original suggestion of doing this in
> > > > > machine initialization code.
> > > > Agreed, if it goes to spapr specific machine code I will not object much.
> > > > (it will burden just spapr maintainers, so it's about convincing
> > > > David in the end)
> > > 
> > > I believe he's ok with it given that he suggested it in his first reply.
> > > 
> > > I'll move this verification to spapr machine_init in the next version.
> > > 
> > > 
> > > 
> > > Thanks,
> > > 
> > > DHB
> > > 
> > > 
> > > > 
> > > 
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-24 23:49                   ` David Gibson
@ 2020-08-25  9:56                     ` Daniel Henrique Barboza
  2020-08-25 11:12                       ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-25  9:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, Igor Mammedov, John Snow, Eduardo Habkost, qemu-devel



On 8/24/20 8:49 PM, David Gibson wrote:
> On Mon, Aug 24, 2020 at 08:45:12AM -0300, Daniel Henrique Barboza wrote:
>>
>>

[...]

>>>> LOPAPR support a somewhat asymmetrical NUMA setup in its current
>>>> form,
>>>
>>> Huh, I didn't even realize that.  What's the mechanism?
>>
>> LOPAPR mentions that a single resource/node can have multiple associativity
>> arrays. The idea is to contemplate the situations where the node has
>> more than one connection with the board.
>>
>> I say "somewhat" because, right after mentioning that, the spec also says that
>> the OS should consider that the distance between two nodes must always be
>> the shortest one of all available arrays. I'll copy/paste the except here
>> (end of section 15.2, "Numa Resource Associativity":
> 
> Ah.  I didn't think that's what "asymmetric NUMA" meant... but come to
> think of it, I'm not very sure about that.


This was a poor attempt on my part to cut PAPR some slack.

TBH, even if current PAPR allows for some form of NUMA asymmetry, I don't think
it's worth implementing at all. It'll be more complexity on top of what I
already added here, and the best case scenario is the kernel ignoring it
(worst case, the kernel blowing up because we're adding more associativity
arrays to each CPU and so on).



Thanks,


DHB

> 
>> -----
>>
>> The reason that the “ibm,associativity” property may contain multiple associativity
>> lists is that a resource may be multiply connected into the platform. This resource
>> then has a different associativity characteristics relative to its multiple connections.
>> To determine the associativity between any two resources, the OS scans down the two
>> resources associativity lists in all pair wise combinations counting how many domains
>> are the same until the first domain where the two list do not agree. The highest such
>> count is the associativity between the two resources.
>>
>> ----
>>
>>
>> DHB
>>
>>
>>>
>>>> but
>>>> the Linux kernel doesn't support it. The effort to implement it in the current
>>>> spapr machine code, given that Linux wouldn't mind it, is not worth it. This
>>>> is why I chose to invalidate it for pseries.
>>>
>>> Igor,
>>>
>>> It's kind of difficult to answer that question - PAPR doesn't
>>> specifically describe limitations, it's just that the representation
>>> it uses is inherently limited.  Instead of the obvious, simple and
>>> pretty much universal method (used in the generic kernel and qemu) of
>>> having a matrix of distance between all the nodes, it instead
>>> describes the hierarchy of components that give rise to the different
>>> distances.
>>>
>>> So, for each NUMA relevant object (cpu, memory block, host bridge,
>>> etc.) there is a vector of IDs.  Each number in the vector gives one
>>> level of the objects location in the heirarchy.
>>>
>>> So, for example the first number might be the physical chip/socket.
>>> the second one which group of cores & memory interfaces sharing an Ln
>>> cache, the third one the specific core number.  So to work out how far
>>> objects are from each other you essentially look at how long a prefix
>>> of their vector they share, which tells you how far above in the
>>> hierarchy you have to go to reach it.
>>>
>>> There's a bunch of complicating details, but that's the gist of it.
>>>
>>>>>> Perhaps a warning would be better in this case?
>>>>>>
>>>>>> In either case, it sounds like this won't be a common constraint
>>>>>> and I now agree with your original suggestion of doing this in
>>>>>> machine initialization code.
>>>>> Agreed, if it goes to spapr specific machine code I will not object much.
>>>>> (it will burden just spapr maintainers, so it's about convincing
>>>>> David in the end)
>>>>
>>>> I believe he's ok with it given that he suggested it in his first reply.
>>>>
>>>> I'll move this verification to spapr machine_init in the next version.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> DHB
>>>>
>>>>
>>>>>
>>>>
>>>
>>
> 



* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-25  9:56                     ` Daniel Henrique Barboza
@ 2020-08-25 11:12                       ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2020-08-25 11:12 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-ppc, Igor Mammedov, John Snow, Eduardo Habkost, qemu-devel


On Tue, Aug 25, 2020 at 06:56:46AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 8/24/20 8:49 PM, David Gibson wrote:
> > On Mon, Aug 24, 2020 at 08:45:12AM -0300, Daniel Henrique Barboza wrote:
> > > 
> > > 
> 
> [...]
> 
> > > > > LOPAPR support a somewhat asymmetrical NUMA setup in its current
> > > > > form,
> > > > 
> > > > Huh, I didn't even realize that.  What's the mechanism?
> > > 
> > > LOPAPR mentions that a single resource/node can have multiple associativity
> > > arrays. The idea is to contemplate the situations where the node has
> > > more than one connection with the board.
> > > 
> > > I say "somewhat" because, right after mentioning that, the spec also says that
> > > the OS should consider that the distance between two nodes must always be
> > > the shortest one of all available arrays. I'll copy/paste the excerpt here
> > > (end of section 15.2, "Numa Resource Associativity"):
> > 
> > Ah.  I didn't think that's what "asymmetric NUMA" meant... but come to
> > think of it, I'm not very sure about that.
> 
> 
> This was a poor attempt on my part to cut PAPR some slack.
> 
> TBH, even if current PAPR allows for some form of NUMA asymmetry, I don't think
> it's worth implementing at all. It would add more complexity on top of what I
> already added here, and the best-case scenario is the kernel ignoring it
> (worst case, the kernel blowing up because we're adding more associativity
> arrays to each CPU and so on).

Yes, I agree.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
  2020-08-20  4:26   ` David Gibson
@ 2020-08-26 20:06     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-26 20:06 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 8/20/20 1:26 AM, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:22PM -0300, Daniel Henrique Barboza wrote:
>> We can't use the input from machine->numa_state->nodes directly
>> in the pSeries machine because PAPR does not work with raw distance
>> values, like ACPI SLIT does. We need to determine common
>> associativity domains, based on similar performance/distance of the
>> resources, and set these domains in the associativity array that goes
>> to the FDT of each resource.
>>
>> To ease the translation between regular ACPI NUMA distance info
>> to our PAPR dialect, let's create a matrix called numa_assoc_domains
>> in the SpaprMachineClass. This matrix will be initiated during
>> machine init, where  we will read NUMA information from user input,
>> apply a heuristic to determine the associativity domains for each node,
>> then populate numa_assoc_domains accordingly.
>>
>> The changes are mostly centered in the spapr_set_associativity()
>> helper that will use the values of numa_assoc_domains instead of
>> using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
>> h_home_node_associativity() being the exceptions.
>>
>> To keep the changes under control, we'll plug in the matrix usage
>> in the existing code first. The actual heuristic to determine
>> the associativity domains for each NUMA node will come in a follow-up
>> patch.
>>
>> Note that the matrix is initialized with zeros, meaning that there are
>> no guest-visible changes implemented in this patch. We'll keep these
>> changes away from legacy NUMA guests by not initializing the matrix
>> in those cases.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> IIUC, what this is basically doing is that instead of doing the
> translation from qemu's internal NUMA representation to PAPR's at the
> point(s) we emit the PAPR information, we're moving it to persistent
> data we calculate for each (qemu) numa node and then just copy out when we
> emit the information?

Yes. The original intention was to not go straight from QEMU numa_state
to associativity arrays, given that for each numa_state entry we need to
(1) round the value up to what the kernel understands (10,20,40,80,160)
and (2) determine the associativity domains for each position of the
associativity array. I could have done basically the same thing on top of
the existing numa_state info, but I wasn't sure if overwriting it was
a good idea.
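
The rounding step I have in mind is something as trivial as this (just a
sketch, the helper name is made up):

static uint8_t spapr_numa_round_distance(uint8_t dist)
{
    /* distances the PAPR-aware Linux kernel currently understands */
    static const uint8_t valid[] = { 10, 20, 40, 80, 160 };
    int i;

    for (i = 0; i < (int)(sizeof(valid) / sizeof(valid[0])); i++) {
        if (dist <= valid[i]) {
            return valid[i];
        }
    }
    return 160; /* anything larger gets capped at the maximum distance */
}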


> 
> That could be a reasonable idea, and indeed the rest of the series
> might be easier to understand if this change were earlier in the
> series.  In particular it means we might be able to localise all the
> hacks for calculating the right vectors depending on machine
> version/options into one place.


This didn't cross my mind when I first wrote it, but it is a good idea to
put all the NUMA-related code into the same function. I considered creating
a spapr_numa.c to avoid putting more stuff into spapr.c but, for the
amount of code being added, I didn't think it was justified.


> 
> A couple of nits, though:
> 
>> ---
>>   hw/ppc/spapr.c                | 46 +++++++++++++++++++++++------------
>>   hw/ppc/spapr_hcall.c          | 13 ++++++++--
>>   hw/ppc/spapr_nvdimm.c         | 13 +++++-----
>>   hw/ppc/spapr_pci.c            |  3 ++-
>>   include/hw/ppc/spapr.h        |  7 +++++-
>>   include/hw/ppc/spapr_nvdimm.h |  5 ++--
>>   6 files changed, 59 insertions(+), 28 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index b80a6f6936..4f50ab21ee 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>>       return ret;
>>   }
>>   
>> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
>> +                             MachineState *machine)
>>   {
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>> +    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
>> +    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
>> +    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
>>       uint8_t assoc_size = 0x4;
>>   
>>       if (cpu_index >= 0) {
>> @@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>>       }
>>   
>>       assoc[0] = cpu_to_be32(assoc_size);
>> -    assoc[1] = cpu_to_be32(0x0);
>> -    assoc[2] = cpu_to_be32(0x0);
>> -    assoc[3] = cpu_to_be32(0x0);
>> +    assoc[1] = cpu_to_be32(assoc_domain1);
>> +    assoc[2] = cpu_to_be32(assoc_domain2);
>> +    assoc[3] = cpu_to_be32(assoc_domain3);
>>       assoc[4] = cpu_to_be32(node_id);
> 
> So spapr_set_associativity() is already a slightly dangerous function,
> because the required buffer space for 'assoc' varies in a non-obvious
> way depending on whether cpu_index is >= 0.  I didn't comment on that when
> it was introduced, because it's not really any worse than what it
> replaced.
> 
> But with this change, I think we can do better.  I'd suggest storing
> the full PAPR associativity vector for each qemu numa node verbatim,
> so it can just be copied straight into the device tree without
> interpretation.  Then the helper can actually do the property set, and
> we don't need magically sized locals any more.

Interesting. IIUC your idea, we're more than halfway there with what we
have here already. What we need is an extra column in this numa_assoc_domains
matrix to hold the numa_id as well. That alone takes care of all cases
except CPUs. For CPUs, we can simply ask for the vcpu_id as an argument. Then
a function like spapr_write_associativity(node_id, vcpu_id) would write the
FDT with the numa_assoc_domains[node_id] array and, if a vcpu_id is given,
add the vcpu_id and update the size before writing the FDT.
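
Roughly something like this (only a sketch, assuming the matrix grows the
extra column and already ends with the node_id; names and sizes are not
final):

static int spapr_write_associativity(SpaprMachineClass *smc, void *fdt,
                                     int offset, int node_id, int vcpu_id)
{
    uint32_t assoc[MAX_DISTANCE_REF_POINTS + 2];
    int nentries = MAX_DISTANCE_REF_POINTS;
    int i;

    /* the row already holds every domain, node_id included */
    for (i = 0; i < nentries; i++) {
        assoc[i + 1] = cpu_to_be32(smc->numa_assoc_domains[node_id][i]);
    }

    if (vcpu_id >= 0) {
        /* CPUs carry one extra entry: the vcpu_id itself */
        assoc[++i] = cpu_to_be32(vcpu_id);
        nentries++;
    }

    assoc[0] = cpu_to_be32(nentries);

    return fdt_setprop(fdt, offset, "ibm,associativity",
                       assoc, (nentries + 1) * sizeof(uint32_t));
}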

Given this new role, numa_assoc_domains[][] would be renamed to something
more sensible, such as associativity_arrays[][] or similar.

This will also make changing the size of the associativity arrays less painful when
Power10 hits (spoiler: we will need more NUMA levels for P10 ...).


I'm not sure if this change would supersede the previous change made by patch 07
(add spapr_set_associativity). Not sure whether I should drop that patch or use it as a
preliminary step for this change. I'll see what I can do.


> 
> Obviously there will need to be some more handling for the extra layer
> we add on cpu assoc vectors.  We could store a vector for each vcpu as
> well, or just have a hack to adjust these (fdt_appendprop() might be
> useful).


I think we can handle CPUs like any other resource, just passing an extra arg
(vcpu_id) to a spapr_write_associativity() function.



Thanks,


DHB

> 
>>   }
>>   
>> -static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
>> +static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
>> +                                   MachineState *machine)
>>   {
>>       int index = spapr_get_vcpu_id(cpu);
>>       uint32_t associativity[6];
>> -    spapr_set_associativity(associativity, cpu->node_id, index);
>> +    spapr_set_associativity(associativity, cpu->node_id, index, machine);
>>   
>>       /* Advertise NUMA via ibm,associativity */
>>       return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
>> @@ -335,14 +341,14 @@ static void add_str(GString *s, const gchar *s1)
>>   }
>>   
>>   static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
>> -                                hwaddr size)
>> +                                hwaddr size, MachineState *machine)
>>   {
>>       uint32_t associativity[5];
>>       char mem_name[32];
>>       uint64_t mem_reg_property[2];
>>       int off;
>>   
>> -    spapr_set_associativity(associativity, nodeid, -1);
>> +    spapr_set_associativity(associativity, nodeid, -1, machine);
>>   
>>       mem_reg_property[0] = cpu_to_be64(start);
>>       mem_reg_property[1] = cpu_to_be64(size);
>> @@ -574,6 +580,7 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
>>                                                      void *fdt)
>>   {
>>       MachineState *machine = MACHINE(spapr);
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>>       int nb_numa_nodes = machine->numa_state->num_nodes;
>>       int ret, i, offset;
>>       uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>> @@ -628,12 +635,17 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
>>       int_buf[1] = cpu_to_be32(4); /* Number of entries per associativity list */
>>       cur_index += 2;
>>       for (i = 0; i < nr_nodes; i++) {
>> +        uint8_t assoc_domain1 = smc->numa_assoc_domains[i][0];
>> +        uint8_t assoc_domain2 = smc->numa_assoc_domains[i][1];
>> +        uint8_t assoc_domain3 = smc->numa_assoc_domains[i][2];
>> +
>>           uint32_t associativity[] = {
>> -            cpu_to_be32(0x0),
>> -            cpu_to_be32(0x0),
>> -            cpu_to_be32(0x0),
>> +            cpu_to_be32(assoc_domain1),
>> +            cpu_to_be32(assoc_domain2),
>> +            cpu_to_be32(assoc_domain3),
>>               cpu_to_be32(i)
> 
> So again, I suggest this be copied as a blob from a stored vector for
> the node, rather than a mix of stored info and locally calculated info
> (the final 'i').
> 
>>           };
>> +
>>           memcpy(cur_index, associativity, sizeof(associativity));
>>           cur_index += 4;
>>       }
>> @@ -667,7 +679,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
>>           if (!mem_start) {
>>               /* spapr_machine_init() checks for rma_size <= node0_size
>>                * already */
>> -            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size);
>> +            spapr_dt_memory_node(fdt, i, 0, spapr->rma_size, machine);
>>               mem_start += spapr->rma_size;
>>               node_size -= spapr->rma_size;
>>           }
>> @@ -679,7 +691,7 @@ static int spapr_dt_memory(SpaprMachineState *spapr, void *fdt)
>>                   sizetmp = 1ULL << ctzl(mem_start);
>>               }
>>   
>> -            spapr_dt_memory_node(fdt, i, mem_start, sizetmp);
>> +            spapr_dt_memory_node(fdt, i, mem_start, sizetmp, machine);
>>               node_size -= sizetmp;
>>               mem_start += sizetmp;
>>           }
>> @@ -809,7 +821,7 @@ static void spapr_dt_cpu(CPUState *cs, void *fdt, int offset,
>>                         pft_size_prop, sizeof(pft_size_prop))));
>>   
>>       if (ms->numa_state->num_nodes > 1) {
>> -        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu));
>> +        _FDT(spapr_fixup_cpu_numa_dt(fdt, offset, cpu, ms));
>>       }
>>   
>>       _FDT(spapr_fixup_cpu_smt_dt(fdt, offset, cpu, compat_smt));
>> @@ -1335,7 +1347,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>   
>>       /* NVDIMM devices */
>>       if (mc->nvdimm_supported) {
>> -        spapr_dt_persistent_memory(fdt);
>> +        spapr_dt_persistent_memory(fdt, machine);
>>       }
>>   
>>       return fdt;
>> @@ -3453,6 +3465,7 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
>>   int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>                             void *fdt, int *fdt_start_offset, Error **errp)
>>   {
>> +    MachineState *machine = MACHINE(spapr);
>>       uint64_t addr;
>>       uint32_t node;
>>   
>> @@ -3460,7 +3473,8 @@ int spapr_lmb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>       node = object_property_get_uint(OBJECT(drc->dev), PC_DIMM_NODE_PROP,
>>                                       &error_abort);
>>       *fdt_start_offset = spapr_dt_memory_node(fdt, node, addr,
>> -                                             SPAPR_MEMORY_BLOCK_SIZE);
>> +                                             SPAPR_MEMORY_BLOCK_SIZE,
>> +                                             machine);
>>       return 0;
>>   }
>>   
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index c1d01228c6..6e94e513cf 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -1878,10 +1878,15 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>                                                 target_ulong opcode,
>>                                                 target_ulong *args)
>>   {
>> +    MachineState *machine = MACHINE(spapr);
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>>       target_ulong flags = args[0];
>>       target_ulong procno = args[1];
>>       PowerPCCPU *tcpu;
>>       int idx;
>> +    uint8_t assoc_domain1;
>> +    uint8_t assoc_domain2;
>> +    uint8_t assoc_domain3;
>>   
>>       /* only support procno from H_REGISTER_VPA */
>>       if (flags != 0x1) {
>> @@ -1893,13 +1898,17 @@ static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>           return H_P2;
>>       }
>>   
>> +    assoc_domain1 = smc->numa_assoc_domains[tcpu->node_id][0];
>> +    assoc_domain2 = smc->numa_assoc_domains[tcpu->node_id][1];
>> +    assoc_domain3 = smc->numa_assoc_domains[tcpu->node_id][2];
>> +
>>       /* sequence is the same as in the "ibm,associativity" property */
>>   
>>       idx = 0;
>>   #define ASSOCIATIVITY(a, b) (((uint64_t)(a) << 32) | \
>>                                ((uint64_t)(b) & 0xffffffff))
>> -    args[idx++] = ASSOCIATIVITY(0, 0);
>> -    args[idx++] = ASSOCIATIVITY(0, tcpu->node_id);
>> +    args[idx++] = ASSOCIATIVITY(assoc_domain1, assoc_domain2);
>> +    args[idx++] = ASSOCIATIVITY(assoc_domain3, tcpu->node_id);
>>       args[idx++] = ASSOCIATIVITY(procno, -1);
>>       for ( ; idx < 6; idx++) {
>>           args[idx] = -1;
>> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
>> index bd109bfc00..7f1f088c39 100644
>> --- a/hw/ppc/spapr_nvdimm.c
>> +++ b/hw/ppc/spapr_nvdimm.c
>> @@ -87,8 +87,9 @@ int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>                              void *fdt, int *fdt_start_offset, Error **errp)
>>   {
>>       NVDIMMDevice *nvdimm = NVDIMM(drc->dev);
>> +    MachineState *machine = MACHINE(spapr);
>>   
>> -    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm);
>> +    *fdt_start_offset = spapr_dt_nvdimm(fdt, 0, nvdimm, machine);
>>   
>>       return 0;
>>   }
>> @@ -104,8 +105,8 @@ void spapr_create_nvdimm_dr_connectors(SpaprMachineState *spapr)
>>   }
>>   
>>   
>> -int spapr_dt_nvdimm(void *fdt, int parent_offset,
>> -                           NVDIMMDevice *nvdimm)
>> +int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
>> +                    MachineState *machine)
>>   {
>>       int child_offset;
>>       char *buf;
>> @@ -120,7 +121,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>>       uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>>                                               NULL);
>>   
>> -    spapr_set_associativity(associativity, node, -1);
>> +    spapr_set_associativity(associativity, node, -1, machine);
>>   
>>       drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>>       g_assert(drc);
>> @@ -159,7 +160,7 @@ int spapr_dt_nvdimm(void *fdt, int parent_offset,
>>       return child_offset;
>>   }
>>   
>> -void spapr_dt_persistent_memory(void *fdt)
>> +void spapr_dt_persistent_memory(void *fdt, MachineState *machine)
>>   {
>>       int offset = fdt_subnode_offset(fdt, 0, "persistent-memory");
>>       GSList *iter, *nvdimms = nvdimm_get_device_list();
>> @@ -177,7 +178,7 @@ void spapr_dt_persistent_memory(void *fdt)
>>       for (iter = nvdimms; iter; iter = iter->next) {
>>           NVDIMMDevice *nvdimm = iter->data;
>>   
>> -        spapr_dt_nvdimm(fdt, offset, nvdimm);
>> +        spapr_dt_nvdimm(fdt, offset, nvdimm, machine);
>>       }
>>       g_slist_free(nvdimms);
>>   
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index c02ace226c..4d9ef63f3e 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -2355,7 +2355,8 @@ int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>>   
>>       /* Advertise NUMA via ibm,associativity */
>>       if (phb->numa_node != -1) {
>> -        spapr_set_associativity(associativity, phb->numa_node, -1);
>> +        MachineState *machine = MACHINE(spapr);
>> +        spapr_set_associativity(associativity, phb->numa_node, -1, machine);
>>           _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
>>                            sizeof(associativity)));
>>       }
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index cd158bf95a..1f9700ac19 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -104,6 +104,9 @@ typedef enum {
>>   
>>   #define FDT_MAX_SIZE                    0x100000
>>   
>> +/* Taken from Linux kernel arch/powerpc/mm/numa.h */
>> +#define MAX_DISTANCE_REF_POINTS         4
>> +
>>   typedef struct SpaprCapabilities SpaprCapabilities;
>>   struct SpaprCapabilities {
>>       uint8_t caps[SPAPR_CAP_NUM];
>> @@ -131,6 +134,7 @@ struct SpaprMachineClass {
>>       hwaddr rma_limit;          /* clamp the RMA to this size */
>>       bool pre_5_1_assoc_refpoints;
>>       bool pre_5_2_numa_associativity;
>> +    uint8_t numa_assoc_domains[MAX_NODES][MAX_DISTANCE_REF_POINTS-1];
>>   
>>       void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>>                             uint64_t *buid, hwaddr *pio,
>> @@ -863,7 +867,8 @@ int spapr_phb_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>   
>>   void spapr_rtc_read(SpaprRtcState *rtc, struct tm *tm, uint32_t *ns);
>>   int spapr_rtc_import_offset(SpaprRtcState *rtc, int64_t legacy_offset);
>> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index);
>> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
>> +                             MachineState *machine);
>>   
>>   #define TYPE_SPAPR_RNG "spapr-rng"
>>   
>> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
>> index b3330cc485..be30773c7d 100644
>> --- a/include/hw/ppc/spapr_nvdimm.h
>> +++ b/include/hw/ppc/spapr_nvdimm.h
>> @@ -27,8 +27,9 @@ QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
>>   
>>   int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
>>                              void *fdt, int *fdt_start_offset, Error **errp);
>> -int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm);
>> -void spapr_dt_persistent_memory(void *fdt);
>> +int spapr_dt_nvdimm(void *fdt, int parent_offset, NVDIMMDevice *nvdimm,
>> +                    MachineState *machine);
>> +void spapr_dt_persistent_memory(void *fdt, MachineState *machine);
>>   void spapr_nvdimm_validate_opts(NVDIMMDevice *nvdimm, uint64_t size,
>>                                   Error **errp);
>>   void spapr_add_nvdimm(DeviceState *dev, uint64_t slot, Error **errp);
> 



* Re: [PATCH 05/10] spapr: make ibm,max-associativity-domains scale with user input
  2020-08-20  2:55   ` [PATCH 05/10] spapr: make ibm,max-associativity-domains " David Gibson
@ 2020-08-26 21:17     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-26 21:17 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 8/19/20 11:55 PM, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:19PM -0300, Daniel Henrique Barboza wrote:
>> The ibm,max-associativity-domains property assumes that only a single
>> associativity domain can exist in the same NUMA level. This is true
>> today because we do not support any type of NUMA distance user
>> customization, and all nodes are at the same distance from each other.
>>
>> To enhance NUMA distance support in the pSeries machine we need to
>> make this limit flexible. This patch rewrites the max-associativity
>> logic to consider that multiple associativity domains can co-exist
>> in the same NUMA level. We're using the legacy_numa() helper to
>> avoid leaking unneeded guest changes.
> 
> 
> Hrm.  I find the above a bit hard to understand.  Having the limit be
> one less than the number of nodes at every level except the last seems
> kind of odd to me.

It took me a while to reply to this because I was reconsidering this logic.

I tried to "not be greedy" with this maximum number and ended up doing
something that breaks in a simple scenario. Today, in a configuration with
a single NUMA node, a single CPU and, say, 2 GPUs, given that all GPUs
are in their own associativity domains, we would have something like:

cpu0: 0 0 0 0 0 0
gpu1: gpu_1 gpu_1 gpu_1 gpu_1
gpu2: gpu_2 gpu_2 gpu_2 gpu_2

This already breaks what I did there. I think we should simplify
and just set maxdomains to be all nodes in all levels, like we do today,
but using spapr->gpu_numa_id as an alias for maxnodes.
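
Something like this (sketch only, going back to the existing gpu_numa_id
counter):

    /* in spapr_dt_rtas(): every level can hold up to the total number of
     * nodes, i.e. QEMU NUMA nodes plus the GPU-only ones counted by
     * spapr->gpu_numa_id */
    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id);
    uint32_t maxdomains[] = {
        cpu_to_be32(4),
        maxdomain,
        maxdomain,
        maxdomain,
        maxdomain,
    };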


Thanks,

DHB

> 
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c | 18 ++++++++++++++++--
>>   1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 073a59c47d..b0c4b80a23 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -919,13 +919,20 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>           cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
>>           cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
>>       };
>> -    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
>> +
>> +    /* The maximum domains for a given NUMA level, supposing that every
>> +     * additional NUMA node belongs to the same domain (aside from the
>> +     * 4th level, where we must support all available NUMA domains), is
>> +     * total number of domains - 1. */
>> +    uint32_t total_nodes_number = ms->numa_state->num_nodes +
>> +                                  spapr->extra_numa_nodes;
>> +    uint32_t maxdomain = cpu_to_be32(total_nodes_number - 1);
>>       uint32_t maxdomains[] = {
>>           cpu_to_be32(4),
>>           maxdomain,
>>           maxdomain,
>>           maxdomain,
>> -        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
>> +        cpu_to_be32(total_nodes_number),
>>       };
>>   
>>       _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
>> @@ -962,6 +969,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>                        qemu_hypertas->str, qemu_hypertas->len));
>>       g_string_free(qemu_hypertas, TRUE);
>>   
>> +    if (spapr_machine_using_legacy_numa(spapr)) {
>> +        maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
>> +        maxdomains[1] = maxdomain;
>> +        maxdomains[2] = maxdomain;
>> +        maxdomains[3] = maxdomain;
>> +    }
>> +
>>       if (smc->pre_5_1_assoc_refpoints) {
>>           nr_refpoints = 2;
>>       }
> 



* Re: [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic
  2020-08-20  2:14   ` David Gibson
@ 2020-08-26 21:49     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Henrique Barboza @ 2020-08-26 21:49 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel



On 8/19/20 11:14 PM, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:17PM -0300, Daniel Henrique Barboza wrote:
>> NVLink2 GPUs are allocated in their own NUMA node, at maximum
>> distance from every other resource in the board. The existing
>> logic makes some assumptions that don't scale well:
>>
>> - only NVLink2 GPUs will ever require such mechanism, meaning
>> that the GPU logic is tightly coupled with the NUMA setup of
>> the machine, via how ibm,max-associativity-domains is set.
>>
>> - the code is relying on the lack of support for sparse NUMA
>> nodes in QEMU. Eventually this support can be implemented, and
>> then the assumption that spapr->gpu_numa_id represents the total
>> of NUMA nodes plus all generated NUMA ids for the GPUs, which
>> relies on all QEMU NUMA nodes not being sparse, has a good
>> potential for disaster.
>>
>> This patch aims to fix both assumptions by creating a generic
>> mechanism to get an available NUMA node, regardless of the
>> NUMA setup being sparse or not. The idea is to rename the existing
>> spapr->gpu_numa_id to spapr->current_numa_id and add a new
>> spapr->extra_numa_nodes attribute. They are used in a new function
>> called spapr_pci_get_available_numa_id(), that takes into account
>> that the NUMA conf can be sparse or not, to retrieve an available
>> NUMA id for the caller. Each consecutive call of
>> spapr_pci_get_available_numa_id() will generate a new ID, up
>> to the limit of numa_state->num_nodes + spapr->extra_numa_nodes
>> exceeding MAX_NODES. This is a generic code being used only by
>> NVLink2 ATM, being available to be used in the future by any
>> other device.
>>
>> With this new function in place, we can decouple
>> ibm,max-associativity-domains logic from NVLink2 logic by
>> using the new spapr->extra_numa_nodes to define the maxdomains
>> of the fourth NUMA level. Instead of defining it as gpu_numa_id,
>> use num_nodes + extra_numa_nodes. This also makes it resilient
>> to any future change in the support of sparse NUMA nodes.
>>
>> Despite all the code juggling, no functional change was made
>> because sparse NUMA nodes aren't a thing and we do not support
>> distinct NUMA distances via user input. Next patches will
>> change that.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> ---
>>   hw/ppc/spapr.c              | 15 ++++++++++-----
>>   hw/ppc/spapr_pci.c          | 33 +++++++++++++++++++++++++++++++++
>>   hw/ppc/spapr_pci_nvlink2.c  | 10 ++++++----
>>   include/hw/pci-host/spapr.h |  2 ++
>>   include/hw/ppc/spapr.h      |  4 +++-
>>   5 files changed, 54 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 3b16edaf4c..22e78cfc84 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -910,13 +910,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>           cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
>>           cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
>>       };
>> -    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
>> +    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
>>       uint32_t maxdomains[] = {
>>           cpu_to_be32(4),
>>           maxdomain,
>>           maxdomain,
>>           maxdomain,
>> -        cpu_to_be32(spapr->gpu_numa_id),
>> +        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
>>       };
>>   
>>       _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
>> @@ -2824,13 +2824,18 @@ static void spapr_machine_init(MachineState *machine)
>>       /*
>>        * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
>>        * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
>> -     * called from vPHB reset handler so we initialize the counter here.
>> +     * called from vPHB reset handler. We have code to generate an extra numa
>> +     * id to place the GPU via 'extra_numa_nodes' and 'current_numa_node', which
>> +     * are initialized here.
>> +     *
>>        * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
>>        * must be equally distant from any other node.
>> -     * The final value of spapr->gpu_numa_id is going to be written to
>> +     *
>> +     * The extra NUMA node ids generated for GPU usage will be written to
>>        * max-associativity-domains in spapr_build_fdt().
>>        */
>> -    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
>> +    spapr->current_numa_id = 0;
>> +    spapr->extra_numa_nodes = 0;
>>   
>>       if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>>           ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 0a418f1e67..09ac58fd7f 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -2492,3 +2492,36 @@ void spapr_pci_switch_vga(bool big_endian)
>>                              &big_endian);
>>       }
>>   }
>> +
>> +unsigned spapr_pci_get_available_numa_id(Error **errp)
>> +{
>> +    MachineState *machine = MACHINE(qdev_get_machine());
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
>> +    NodeInfo *numa_info = machine->numa_state->nodes;
>> +    unsigned i, start;
>> +
>> +    if (machine->numa_state->num_nodes + spapr->extra_numa_nodes >= MAX_NODES) {
>> +        error_setg(errp,
>> +                   "Unable to get an extra NUMA node beyond MAX_NODES = %d",
>> +                   MAX_NODES);
>> +        return spapr->current_numa_id;
>> +    }
>> +
>> +    if (spapr->extra_numa_nodes == 0) {
>> +        start = 0;
>> +    } else {
>> +        start = spapr->current_numa_id + 1;
>> +    }
>> +
>> +    for (i = start; i < MAX_NODES; i++) {
>> +        if (!numa_info[i].present) {
>> +            spapr->extra_numa_nodes++;
>> +            spapr->current_numa_id = i;
>> +            return i;
>> +        }
>> +    }
> 
> I think I see what you're trying to do, but this logic makes me a bit
> nervous.  I guess migration isn't really an issue for the GPUs, but as
> a general rule we want the NUMA ids for everything to be the same from
> boot to boot (assuming similar command line configuration).

I didn't change the existing logic of GPU node assignment, although I surely
added more complexity on top of it.

This patch, and some chunks of other patches as well, were heavily influenced by
my idea of trying to accommodate sparse NUMA nodes. Right now QEMU does not
support them, but when it does, the GPU logic will hit us in the face pretty hard.
For example, I can create 2 NUMA nodes on the command line, with ids 0 and 2. The GPU code
will consider that the next gpu_id is max_nodes + spapr->gpu_numa_id, so 2 + 0 = 2,
and suddenly the GPU is in node_id 2 along with other stuff. There are other areas
where a sparse configuration, if not considered, will be a spectacular disaster.
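
To illustrate, roughly, with the pre-series logic shown in the hunks above
(assuming QEMU accepted sparse node ids):

    /* user creates NUMA nodes 0 and 2, so num_nodes == 2 */
    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);  /* == 2 */

    /* later, in spapr_pci_collect_nvgpu(): */
    nvslot->numa_id = spapr->gpu_numa_id;  /* first GPU lands on node 2,
                                            * which the user already owns */
    ++spapr->gpu_numa_id;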

All that said, and given the amount of immediate work we can push forward
in this series, I'd be OK with dropping this sparse NUMA node support for now. I
can document in ppc-spapr-numa.rst the hotspots (well, some of them at least)
that will blow up when QEMU moves to sparse NUMA configurations. I would drop this
patch entirely and simplify the rest of the code as well.


Thanks,


DHB


> 
> I think that's probably true here, but the rules are complex enough
> that it's hard to convince myself there isn't some edge case where
> something that doesn't seem relevant could change the numa nodes the
> gpus end up with.
> 
>> +
>> +    error_setg(errp, "Unable to find a valid NUMA id");
>> +
>> +    return spapr->current_numa_id;
>> +}
>> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
>> index 76ae77ebc8..611c8a2957 100644
>> --- a/hw/ppc/spapr_pci_nvlink2.c
>> +++ b/hw/ppc/spapr_pci_nvlink2.c
>> @@ -87,9 +87,8 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
>>                                       PCIDevice *pdev, uint64_t tgt,
>>                                       MemoryRegion *mr, Error **errp)
>>   {
>> -    MachineState *machine = MACHINE(qdev_get_machine());
>> -    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
>>       SpaprPhbPciNvGpuSlot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
>> +    Error *local_err = NULL;
>>   
>>       if (!nvslot) {
>>           error_setg(errp, "Found too many GPUs per vPHB");
>> @@ -100,8 +99,11 @@ static void spapr_pci_collect_nvgpu(SpaprPhbPciNvGpuConfig *nvgpus,
>>   
>>       nvslot->gpa = nvgpus->nv2_ram_current;
>>       nvgpus->nv2_ram_current += memory_region_size(mr);
>> -    nvslot->numa_id = spapr->gpu_numa_id;
>> -    ++spapr->gpu_numa_id;
>> +
>> +    nvslot->numa_id = spapr_pci_get_available_numa_id(&local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +    }
>>   }
>>   
>>   static void spapr_pci_collect_nvnpu(SpaprPhbPciNvGpuConfig *nvgpus,
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 600eb55c34..8d93223a76 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -129,6 +129,8 @@ struct SpaprPhbState {
>>   #define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
>>                                         64 * KiB)
>>   
>> +unsigned spapr_pci_get_available_numa_id(Error **errp);
>> +
>>   int spapr_dt_phb(SpaprMachineState *spapr, SpaprPhbState *phb,
>>                    uint32_t intc_phandle, void *fdt, int *node_offset);
>>   
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 3134d339e8..739a6a4942 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -227,7 +227,9 @@ struct SpaprMachineState {
>>       bool cmd_line_caps[SPAPR_CAP_NUM];
>>       SpaprCapabilities def, eff, mig;
>>   
>> -    unsigned gpu_numa_id;
>> +    unsigned current_numa_id;
> 
> The name "current_numa_id" is a bit unclear.  AFAICT this is used
> specifically for the GPU "extra" numa nodes, which isn't obvious from
> the name.
> 
>> +    unsigned extra_numa_nodes;
>> +
>>       SpaprTpmProxy *tpm_proxy;
>>   
>>       Error *fwnmi_migration_blocker;
> 



* Re: [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa
  2020-08-20 16:51         ` Eduardo Habkost
  2020-08-21  8:55           ` Igor Mammedov
@ 2020-09-23 15:21           ` John Snow
  1 sibling, 0 replies; 34+ messages in thread
From: John Snow @ 2020-09-23 15:21 UTC (permalink / raw)
  To: Eduardo Habkost, David Gibson
  Cc: Daniel Henrique Barboza, qemu-ppc, qemu-devel

On 8/20/20 12:51 PM, Eduardo Habkost wrote:
> 
> In either case, it sounds like this won't be a common constraint
> and I now agree with your original suggestion of doing this in
> machine initialization code.

I'm late to this chat, but not every constraint needs to be modeled as 
data -- it's okay to do validation in the machine initialization code.

(It can always be changed later!)
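
For instance, a rough sketch of what such a check could look like in the
spapr machine init path (hypothetical, just to illustrate the point):

static void spapr_validate_numa_distances(MachineState *ms, Error **errp)
{
    NodeInfo *nodes = ms->numa_state->nodes;
    int n = ms->numa_state->num_nodes;
    int i, j;

    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            if (nodes[i].distance[j] != nodes[j].distance[i]) {
                error_setg(errp, "asymmetrical NUMA topologies are not "
                                 "supported by the pseries machine");
                return;
            }
        }
    }
}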

I have been more concerned with schema-based validation of the shape of
data, but whether a board supports NUMA or not falls squarely into the
realm of semantic validation, outside the scope of static data
validation, I think.

--js




Thread overview: 34+ messages
2020-08-14 20:54 [PATCH 00/10] pseries NUMA distance rework Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 01/10] hw: add compat machines for 5.2 Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 02/10] numa: introduce MachineClass::forbid_asymmetrical_numa Daniel Henrique Barboza
2020-08-20  1:17   ` David Gibson
2020-08-20  2:11     ` Eduardo Habkost
2020-08-20  4:15       ` David Gibson
2020-08-20 10:33         ` Daniel Henrique Barboza
2020-08-20 14:29           ` Igor Mammedov
2020-08-20 16:51         ` Eduardo Habkost
2020-08-21  8:55           ` Igor Mammedov
2020-08-21 12:47             ` Daniel Henrique Barboza
2020-08-24  6:08               ` David Gibson
2020-08-24 11:45                 ` Daniel Henrique Barboza
2020-08-24 23:49                   ` David Gibson
2020-08-25  9:56                     ` Daniel Henrique Barboza
2020-08-25 11:12                       ` David Gibson
2020-09-23 15:21           ` John Snow
2020-08-14 20:54 ` [PATCH 03/10] spapr: robustify NVLink2 NUMA node logic Daniel Henrique Barboza
2020-08-20  2:14   ` David Gibson
2020-08-26 21:49     ` Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 04/10] spapr: add spapr_machine_using_legacy_numa() helper Daniel Henrique Barboza
2020-08-20  2:15   ` David Gibson
2020-08-14 20:54 ` [PATCH 05/10] spapr: make ibm, max-associativity-domains scale with user input Daniel Henrique Barboza
2020-08-20  2:55   ` [PATCH 05/10] spapr: make ibm,max-associativity-domains " David Gibson
2020-08-26 21:17     ` Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 06/10] spapr: allow 4 NUMA levels in ibm, associativity-reference-points Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 07/10] spapr: create helper to set ibm,associativity Daniel Henrique Barboza
2020-08-20  3:00   ` David Gibson
2020-08-20 10:39     ` Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains Daniel Henrique Barboza
2020-08-20  4:26   ` David Gibson
2020-08-26 20:06     ` Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 09/10] spapr: consider user input when defining spapr guest NUMA Daniel Henrique Barboza
2020-08-14 20:54 ` [PATCH 10/10] specs/ppc-spapr-numa: update with new NUMA support Daniel Henrique Barboza
