* [PATCH 0/2] pseries: SMP sockets must match NUMA nodes
@ 2021-03-19 18:34 Daniel Henrique Barboza
  2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
  2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-19 18:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: clg, Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

The main change made in this series, in patch 01, is a result of
discussions with the kernel community. See the commit message
for more details.

Patch 02 is a second version of the patch sent in [1], but now the
changes are only effective for the default pseries machine. It is being
sent together with patch 01 since they 'share history'.


[1] https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg04029.html


Daniel Henrique Barboza (2):
  spapr: number of SMP sockets must be equal to NUMA nodes
  spapr.c: remove 'ibm,chip-id' from DT

 hw/ppc/spapr.c                 | 10 +++++--
 hw/ppc/spapr_numa.c            |  7 +++++
 include/hw/ppc/spapr.h         |  1 +
 tests/qtest/cpu-plug-test.c    |  4 +--
 tests/qtest/device-plug-test.c |  9 +++++-
 tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
 6 files changed, 69 insertions(+), 14 deletions(-)

-- 
2.29.2




* [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
@ 2021-03-19 18:34 ` Daniel Henrique Barboza
  2021-03-23  1:03   ` David Gibson
  2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-19 18:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju,
	Daniel Henrique Barboza, groug, qemu-ppc, clg, Igor Mammedov,
	david

Kernel commit 4bce545903fa ("powerpc/topology: Update
topology_core_cpumask") caused a regression in the pseries machine when
defining certain SMP topologies [1]. The reasoning behind the change is
explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
because, as far as the kernel's understanding of SMP topologies goes,
both masks are equivalent.

Further discussions on the kernel mailing list [2] showed that the
powerpc kernel has always considered the number of sockets to be equal
to the number of NUMA nodes. The claim is that it doesn't make sense,
for Power hardware at least, for two or more sockets to be in the same
NUMA node. The immediate conclusion is that all the SMP topologies the
pseries machine was supplying to the kernel with more than one socket
in the same NUMA node, as in [1], happened to be represented correctly
in the kernel only by accident during all these years.

There's a case to be made for virtual topologies being detached from
hardware constraints, allowing maximum flexibility to users. At the same
time, this freedom can't result in unrealistic hardware representations
being emulated. If the real hardware and the pseries kernel don't
support multiple chips/sockets in the same NUMA node, neither should we.

Starting with 6.0.0, every socket must map to a unique NUMA node in the
pseries machine. The qtest changes adapt the affected tests to this new
condition.
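
For instance, mirroring the qtest command lines below, a topology such
as:

  -smp 8,threads=1,cores=4,sockets=2 \
  -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7

is still accepted (2 sockets, 2 NUMA nodes), while the same guest with
'-smp 8,sockets=4' now exits at startup with "Number of CPU sockets
must be equal to the number of NUMA nodes".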

[1] https://bugzilla.redhat.com/1934421
[2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/

CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Cédric Le Goater <clg@kaod.org>
CC: Igor Mammedov <imammedo@redhat.com>
CC: Laurent Vivier <lvivier@redhat.com>
CC: Thomas Huth <thuth@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c                 |  3 ++
 hw/ppc/spapr_numa.c            |  7 +++++
 include/hw/ppc/spapr.h         |  1 +
 tests/qtest/cpu-plug-test.c    |  4 +--
 tests/qtest/device-plug-test.c |  9 +++++-
 tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
 6 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..745f71c243 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
  */
 static void spapr_machine_5_2_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_0_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
+    smc->pre_6_0_smp_topology = true;
 }
 
 DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..0ade43dd79 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
     int i, j, max_nodes_with_gpus;
     bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
 
+    if (!smc->pre_6_0_smp_topology &&
+        nb_numa_nodes != machine->smp.sockets) {
+        error_report("Number of CPU sockets must be equal to the number "
+                     "of NUMA nodes");
+        exit(EXIT_FAILURE);
+    }
+
     /*
      * For all associativity arrays: first position is the size,
      * position MAX_DISTANCE_REF_POINTS is always the numa_id,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 47cebaf3ac..98dc5d198a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -142,6 +142,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_0_smp_topology;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c
index a1c689414b..946b9129ea 100644
--- a/tests/qtest/cpu-plug-test.c
+++ b/tests/qtest/cpu-plug-test.c
@@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname)
     data->machine = g_strdup(mname);
     data->cpu_model = "power8_v2.0";
     data->device_model = g_strdup("power8_v2.0-spapr-cpu-core");
-    data->sockets = 2;
-    data->cores = 3;
+    data->sockets = 1;
+    data->cores = 6;
     data->threads = 1;
     data->maxcpus = data->sockets * data->cores * data->threads;
 
diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c
index 559d47727a..dd7d8268d2 100644
--- a/tests/qtest/device-plug-test.c
+++ b/tests/qtest/device-plug-test.c
@@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void)
 {
     QTestState *qtest;
 
-    qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 "
+    /*
+     * Default smp settings will prioritize sockets over cores and
+     * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However,
+     * the pseries machine requires a NUMA node for each socket
+     * (since 6.0.0). Specify sockets=1 to make life easier.
+     */
+    qtest = qtest_initf("-cpu power9_v2.0 "
+                        "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 "
                         "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0");
 
     /* similar to test_pci_unplug_request */
diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
index dc0ec571ca..bb13f7131b 100644
--- a/tests/qtest/numa-test.c
+++ b/tests/qtest/numa-test.c
@@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data)
     QTestState *qts;
     g_autofree char *s = NULL;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                             "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
+                             "-numa node,nodeid=1,cpus=4-7");
+    } else {
+        cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
+                             "-numa node,nodeid=1,cpus=4-7");
+    }
 
-    cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
-                         "-numa node,nodeid=1,cpus=4-7");
     qts = qtest_init(cli);
 
     s = qtest_hmp(qts, "info numa");
@@ -57,10 +65,18 @@ static void test_mon_partial(const void *data)
     QTestState *qts;
     g_autofree char *s = NULL;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                             "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
+                             "-numa node,nodeid=1,cpus=4-5 ");
+    } else {
+        cli = make_cli(data, "-smp 8 "
+                             "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
+                             "-numa node,nodeid=1,cpus=4-5 ");
+    }
 
-    cli = make_cli(data, "-smp 8 "
-                   "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
-                   "-numa node,nodeid=1,cpus=4-5 ");
     qts = qtest_init(cli);
 
     s = qtest_hmp(qts, "info numa");
@@ -85,9 +101,17 @@ static void test_query_cpus(const void *data)
     QObject *e;
     QTestState *qts;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                             "-numa node,memdev=ram,cpus=0-3 "
+                             "-numa node,cpus=4-7");
+    } else {
+        cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
+                             "-numa node,cpus=4-7");
+    }
 
-    cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
-                         "-numa node,cpus=4-7");
     qts = qtest_init(cli);
     cpus = get_cpus(qts, &resp);
     g_assert(cpus);
@@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data)
     QTestState *qts;
     g_autofree char *cli = NULL;
 
-    cli = make_cli(data, "-smp 4,cores=4 "
+    cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 "
         "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
         "-numa cpu,node-id=0,core-id=0 "
         "-numa cpu,node-id=0,core-id=1 "
@@ -554,7 +578,17 @@ int main(int argc, char **argv)
 
     g_test_init(&argc, &argv, NULL);
 
-    qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
+    /*
+     * Starting with 6.0.0, for the pseries machine, '-smp 8' will only work
+     * if we have 8 NUMA nodes. If we specify '-smp 8,sockets=2' to match
+     * the 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of the
+     * 0246/1357 split that test_def_cpu_split expects. In short, this test
+     * is no longer valid for ppc64 in 6.0.0.
+     */
+    if (!g_str_equal(arch, "ppc64")) {
+        qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
+    }
+
     qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit);
     qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
     qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);
-- 
2.29.2




* [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT
  2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
  2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
@ 2021-03-19 18:34 ` Daniel Henrique Barboza
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-19 18:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Daniel Henrique Barboza, groug, qemu-ppc,
	clg, david

The attribute 'ibm,chip-id' does not exist in PAPR. This alone would be
enough reason to remove it from the spapr DT, but before doing that,
let's give some brief context on how and why it was introduced.

'ibm,chip-id' was added to the spapr DT by commit 10582ff83279. This
commit references kernel commit 256f2d4b463d ("powerpc: Use ibm,chip-id
property to compute cpu_core_mask if available"). In that kernel commit,
the 'ibm,chip-id' DT attribute is used to calculate the cpu_core_mask
in traverse_siblings_chip_id(). This function still needs to handle the
case where 'ibm,chip-id' is not available, to avoid breaking older
guests.

Later on, kernel commit df52f6714071 ("powerpc/smp: Rework CPU topology
construction") removed traverse_siblings_chip_id() and its callers,
making the CPU topology calculation independent of the 'ibm,chip-id'
attribute.

Last, but perhaps most relevant, kernel commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask") changed the way the kernel
calculates the cpu_core_mask, removing the use of 'ibm,chip-id' in the
calculation altogether, with consequences already discussed in the
previous patch.

All that said, since it's not in PAPR and the pseries kernel does not
rely on it, let's remove 'ibm,chip-id' from the DT for the default
machine type. This removal is related to the previous SMP change, so
reuse the same smc->pre_6_0_smp_topology flag.
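
As an illustration of what is being removed, a sketch of the guest DT
for a hypothetical guest with 2 sockets, each holding one core with 2
threads (node names and values abbreviated):

  cpus {
      PowerPC,POWER9@0 { ... ibm,chip-id = <0x0>; ... };
      PowerPC,POWER9@2 { ... ibm,chip-id = <0x1>; ... };
  };

The value is simply cpu_index / vcpus_per_socket, as seen in the hunk
below; with this patch the property is omitted for the default machine
type.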

CC: Alexey Kardashevskiy <aik@ozlabs.ru>
Suggested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 745f71c243..58efb51ac7 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -649,6 +649,7 @@ static void spapr_dt_cpu(CPUState *cs, void *fdt, int offset,
     PowerPCCPU *cpu = POWERPC_CPU(cs);
     CPUPPCState *env = &cpu->env;
     PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cs);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
     int index = spapr_get_vcpu_id(cpu);
     uint32_t segs[] = {cpu_to_be32(28), cpu_to_be32(40),
                        0xffffffff, 0xffffffff};
@@ -745,8 +746,10 @@ static void spapr_dt_cpu(CPUState *cs, void *fdt, int offset,
 
     spapr_dt_pa_features(spapr, cpu, fdt, offset);
 
-    _FDT((fdt_setprop_cell(fdt, offset, "ibm,chip-id",
-                           cs->cpu_index / vcpus_per_socket)));
+    if (smc->pre_6_0_smp_topology) {
+        _FDT((fdt_setprop_cell(fdt, offset, "ibm,chip-id",
+                               cs->cpu_index / vcpus_per_socket)));
+    }
 
     _FDT((fdt_setprop(fdt, offset, "ibm,pft-size",
                       pft_size_prop, sizeof(pft_size_prop))));
-- 
2.29.2




* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
@ 2021-03-23  1:03   ` David Gibson
  2021-03-23 17:21     ` Daniel Henrique Barboza
  2021-03-29 23:51     ` Igor Mammedov
  0 siblings, 2 replies; 23+ messages in thread
From: David Gibson @ 2021-03-23  1:03 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, qemu-devel,
	groug, qemu-ppc, clg, Igor Mammedov


On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> Kernel commit 4bce545903fa ("powerpc/topology: Update
> topology_core_cpumask") caused a regression in the pseries machine when
> defining certain SMP topologies [1]. The reasoning behind the change is
> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
> because, as far as the kernel's understanding of SMP topologies goes,
> both masks are equivalent.
> 
> Further discussions on the kernel mailing list [2] showed that the
> powerpc kernel has always considered the number of sockets to be equal
> to the number of NUMA nodes. The claim is that it doesn't make sense,
> for Power hardware at least, for two or more sockets to be in the same
> NUMA node. The immediate conclusion is that all the SMP topologies the
> pseries machine was supplying to the kernel with more than one socket
> in the same NUMA node, as in [1], happened to be represented correctly
> in the kernel only by accident during all these years.
> 
> There's a case to be made for virtual topologies being detached from
> hardware constraints, allowing maximum flexibility to users. At the same
> time, this freedom can't result in unrealistic hardware representations
> being emulated. If the real hardware and the pseries kernel don't
> support multiple chips/sockets in the same NUMA node, neither should we.
> 
> Starting with 6.0.0, every socket must map to a unique NUMA node in the
> pseries machine. The qtest changes adapt the affected tests to this new
> condition.

Oof.  I really don't like this idea.  It means a bunch of fiddly work
for users to match these up, for no real gain.  I'm also concerned
that this will require follow-on changes in libvirt to not make this a
really cryptic and irritating point of failure.

> [...]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-23  1:03   ` David Gibson
@ 2021-03-23 17:21     ` Daniel Henrique Barboza
  2021-03-25  2:10       ` David Gibson
  2021-03-29 23:51     ` Igor Mammedov
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-23 17:21 UTC (permalink / raw)
  To: David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, qemu-devel,
	groug, qemu-ppc, clg, Igor Mammedov



On 3/22/21 10:03 PM, David Gibson wrote:
> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>> topology_core_cpumask") caused a regression in the pseries machine when
>> defining certain SMP topologies [1]. The reasoning behind the change is
>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
>> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
>> because, as far as the kernel's understanding of SMP topologies goes,
>> both masks are equivalent.
>>
>> Further discussions on the kernel mailing list [2] showed that the
>> powerpc kernel has always considered the number of sockets to be equal
>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>> for Power hardware at least, for two or more sockets to be in the same
>> NUMA node. The immediate conclusion is that all the SMP topologies the
>> pseries machine was supplying to the kernel with more than one socket
>> in the same NUMA node, as in [1], happened to be represented correctly
>> in the kernel only by accident during all these years.
>>
>> There's a case to be made for virtual topologies being detached from
>> hardware constraints, allowing maximum flexibility to users. At the same
>> time, this freedom can't result in unrealistic hardware representations
>> being emulated. If the real hardware and the pseries kernel don't
>> support multiple chips/sockets in the same NUMA node, neither should we.
>>
>> Starting with 6.0.0, every socket must map to a unique NUMA node in the
>> pseries machine. The qtest changes adapt the affected tests to this new
>> condition.
> 
> Oof.  I really don't like this idea.  It means a bunch of fiddly work
> for users to match these up, for no real gain.  I'm also concerned
> that this will require follow-on changes in libvirt to not make this a
> really cryptic and irritating point of failure.

I haven't thought about the required Libvirt changes, although I can say that
some amount will be needed and they will probably annoy existing users
(everyone that has a multiple-sockets-per-NUMA-node topology).

There is not much we can do from the QEMU layer aside from what I've proposed
here. The other alternative is to keep interacting with the kernel folks to
see if there is a way to keep our use case untouched. This also means that
'ibm,chip-id' will probably remain in use, since it's the only place where
we convey cores-per-socket information to the kernel.
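
As a sketch of what that means in practice (topology numbers assumed):
with '-smp 8,sockets=2,cores=4,threads=1', vcpus_per_socket is 8/2 = 4,
so spapr_dt_cpu() emits

  ibm,chip-id = cpu_index / 4;  /* vCPUs 0-3 -> chip 0, 4-7 -> chip 1 */

and that division is the only hint about socket boundaries that the
guest kernel gets from us.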



Thanks,


DHB




>> [...]



* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-23 17:21     ` Daniel Henrique Barboza
@ 2021-03-25  2:10       ` David Gibson
  2021-03-25  8:56         ` Cédric Le Goater
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2021-03-25  2:10 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, qemu-devel,
	groug, qemu-ppc, clg, Igor Mammedov


On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/22/21 10:03 PM, David Gibson wrote:
> > On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> > > Kernel commit 4bce545903fa ("powerpc/topology: Update
> > > topology_core_cpumask") caused a regression in the pseries machine when
> > > defining certain SMP topologies [1]. The reasoning behind the change is
> > > explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> > > cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
> > > with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
> > > because, as far as the kernel's understanding of SMP topologies goes,
> > > both masks are equivalent.
> > > 
> > > Further discussions on the kernel mailing list [2] showed that the
> > > powerpc kernel has always considered the number of sockets to be equal
> > > to the number of NUMA nodes. The claim is that it doesn't make sense,
> > > for Power hardware at least, for two or more sockets to be in the same
> > > NUMA node. The immediate conclusion is that all the SMP topologies the
> > > pseries machine was supplying to the kernel with more than one socket
> > > in the same NUMA node, as in [1], happened to be represented correctly
> > > in the kernel only by accident during all these years.
> > > 
> > > There's a case to be made for virtual topologies being detached from
> > > hardware constraints, allowing maximum flexibility to users. At the same
> > > time, this freedom can't result in unrealistic hardware representations
> > > being emulated. If the real hardware and the pseries kernel don't
> > > support multiple chips/sockets in the same NUMA node, neither should we.
> > > 
> > > Starting with 6.0.0, every socket must map to a unique NUMA node in the
> > > pseries machine. The qtest changes adapt the affected tests to this new
> > > condition.
> > 
> > Oof.  I really don't like this idea.  It means a bunch of fiddly work
> > for users to match these up, for no real gain.  I'm also concerned
> > that this will require follow-on changes in libvirt to not make this a
> > really cryptic and irritating point of failure.
> 
> I haven't thought about the required Libvirt changes, although I can say that
> some amount will be needed and they will probably annoy existing users
> (everyone that has a multiple-sockets-per-NUMA-node topology).
> 
> There is not much we can do from the QEMU layer aside from what I've proposed
> here. The other alternative is to keep interacting with the kernel folks to
> see if there is a way to keep our use case untouched.

Right.  Well.. not necessarily untouched, but I'm hoping for more
replies from Cédric to my objections and mpe's.  Even with sockets
being a kinda meaningless concept in PAPR, I don't think tying it to
NUMA nodes makes sense.

> This also means that
> 'ibm,chip-id' will probably remain in use, since it's the only place where
> we convey cores-per-socket information to the kernel.

Well.. unless we can find some other sensible way to convey that
information.  I haven't given up hope for that yet.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-25  2:10       ` David Gibson
@ 2021-03-25  8:56         ` Cédric Le Goater
  2021-03-25 10:15           ` Daniel Henrique Barboza
  2021-03-29  4:20           ` David Gibson
  0 siblings, 2 replies; 23+ messages in thread
From: Cédric Le Goater @ 2021-03-25  8:56 UTC (permalink / raw)
  To: David Gibson, Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov

On 3/25/21 3:10 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/22/21 10:03 PM, David Gibson wrote:
>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>> topology_core_cpumask") caused a regression in the pseries machine when
>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
>>>> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
>>>> because, as far as the kernel's understanding of SMP topologies goes,
>>>> both masks are equivalent.
>>>>
>>>> Further discussions on the kernel mailing list [2] showed that the
>>>> powerpc kernel has always considered the number of sockets to be equal
>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>> for Power hardware at least, for two or more sockets to be in the same
>>>> NUMA node. The immediate conclusion is that all the SMP topologies the
>>>> pseries machine was supplying to the kernel with more than one socket
>>>> in the same NUMA node, as in [1], happened to be represented correctly
>>>> in the kernel only by accident during all these years.
>>>>
>>>> There's a case to be made for virtual topologies being detached from
>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>> time, this freedom can't result in unrealistic hardware representations
>>>> being emulated. If the real hardware and the pseries kernel don't
>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>
>>>> Starting with 6.0.0, every socket must map to a unique NUMA node in the
>>>> pseries machine. The qtest changes adapt the affected tests to this new
>>>> condition.
>>>
>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>> for users to match these up, for no real gain.  I'm also concerned
>>> that this will require follow-on changes in libvirt to not make this a
>>> really cryptic and irritating point of failure.
>>
>> I haven't thought about the required Libvirt changes, although I can say that
>> some amount will be needed and they will probably annoy existing users
>> (everyone that has a multiple-sockets-per-NUMA-node topology).
>>
>> There is not much we can do from the QEMU layer aside from what I've proposed
>> here. The other alternative is to keep interacting with the kernel folks to
>> see if there is a way to keep our use case untouched.
> 
> Right.  Well.. not necessarily untouched, but I'm hoping for more
> replies from Cédric to my objections and mpe's.  Even with sockets
> being a kinda meaningless concept in PAPR, I don't think tying it to
> NUMA nodes makes sense.

I sent a couple of replies in different email threads, but maybe not
to all. I felt it was going nowhere :/ A couple of thoughts,

Shouldn't we get rid of the socket concept, and of die too, under
pseries, since they don't exist under PAPR? We only have NUMA nodes,
cores, and threads AFAICT.

Should we diverge from PAPR and add extra DT properties "qemu,..."?
There are a couple of places where Linux checks for the underlying 
hypervisor already.

>> This also means that
>> 'ibm,chip-id' will probably remain in use, since it's the only place where
>> we convey cores-per-socket information to the kernel.
> 
> Well.. unless we can find some other sensible way to convey that
> information.  I haven't given up hope for that yet.

Well, we could start by fixing the value in QEMU. It is broken today.


This is all coming from some work we did last year to evaluate our HW 
(mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. 
We saw some real problems because Linux did not have a clear view of the 
topology. See the figures here:

http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/

The node id is a key parameter for system resource management, memory 
allocation, interrupt affinity, etc. Linux scales much better if used
correctly. 

C.



* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-25  8:56         ` Cédric Le Goater
@ 2021-03-25 10:15           ` Daniel Henrique Barboza
  2021-03-29  4:20           ` David Gibson
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-25 10:15 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov



On 3/25/21 5:56 AM, Cédric Le Goater wrote:
> On 3/25/21 3:10 AM, David Gibson wrote:
>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>> topology_core_cpumask") caused a regression in the pseries machine when
>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
>>>>> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
>>>>> because, as far as the kernel's understanding of SMP topologies goes,
>>>>> both masks are equivalent.
>>>>>
>>>>> Further discussions on the kernel mailing list [2] showed that the
>>>>> powerpc kernel has always considered the number of sockets to be equal
>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>> for Power hardware at least, for two or more sockets to be in the same
>>>>> NUMA node. The immediate conclusion is that all the SMP topologies the
>>>>> pseries machine was supplying to the kernel with more than one socket
>>>>> in the same NUMA node, as in [1], happened to be represented correctly
>>>>> in the kernel only by accident during all these years.
>>>>>
>>>>> There's a case to be made for virtual topologies being detached from
>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>
>>>>> Starting with 6.0.0, every socket must map to a unique NUMA node in the
>>>>> pseries machine. The qtest changes adapt the affected tests to this new
>>>>> condition.
>>>>
>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>> for users to match these up, for no real gain.  I'm also concerned
>>>> that this will require follow-on changes in libvirt to not make this a
>>>> really cryptic and irritating point of failure.
>>>
>>> I haven't thought about the required Libvirt changes, although I can say that
>>> some amount will be needed and they will probably annoy existing users
>>> (everyone that has a multiple-sockets-per-NUMA-node topology).
>>>
>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>> here. The other alternative is to keep interacting with the kernel folks to
>>> see if there is a way to keep our use case untouched.
>>
>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>> replies from Cédric to my objections and mpe's.  Even with sockets
>> being a kinda meaningless concept in PAPR, I don't think tying it to
>> NUMA nodes makes sense.
> 
> I sent a couple of replies in different email threads, but maybe not
> to all. I felt it was going nowhere :/ A couple of thoughts,
> 
> Shouldn't we get rid of the socket concept, and of die too, under
> pseries, since they don't exist under PAPR? We only have NUMA nodes,
> cores, and threads AFAICT.

I don't think we work with 'die'.

Getting rid of the 'socket' representation is sensible as far as PAPR is
concerned, but the effect for pseries will be similar to what this patch
is already doing: today users can have multiple sockets in the same NUMA
node, and afterwards they won't, either because we got rid of the
'socket' representation or because socket == NUMA node.

> 
> Should we diverge from PAPR and add extra DT properties "qemu,..."?
> There are a couple of places where Linux checks for the underlying
> hypervisor already.
> 
>>> This also means that
>>> 'ibm,chip-id' will probably remain in use, since it's the only place where
>>> we convey cores-per-socket information to the kernel.
>>
>> Well.. unless we can find some other sensible way to convey that
>> information.  I haven't given up hope for that yet.
> 
> Well, we could start by fixing the value in QEMU. It is broken today.


I'll look into it. It makes more sense to talk about keeping it when
it's working properly.



DHB


> 
> 
> This is all coming from some work we did last year to evaluate our HW
> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
> We saw some real problems because Linux did not have a clear view of the
> topology. See the figures here:
> 
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
> 
> The node id is a key parameter for system resource management, memory
> allocation, interrupt affinity, etc. Linux scales much better if used
> correctly.
> 
> C.
> 



* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-25  8:56         ` Cédric Le Goater
  2021-03-25 10:15           ` Daniel Henrique Barboza
@ 2021-03-29  4:20           ` David Gibson
  2021-03-29 15:32             ` Cédric Le Goater
  1 sibling, 1 reply; 23+ messages in thread
From: David Gibson @ 2021-03-29  4:20 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc,
	Igor Mammedov


On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> On 3/25/21 3:10 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> >>
> >>
> >> On 3/22/21 10:03 PM, David Gibson wrote:
> >>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> >>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
> >>>> topology_core_cpumask") caused a regression in the pseries machine when
> >>>> defining certain SMP topologies [1]. The reasoning behind the change is
> >>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> >>>> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
> >>>> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
> >>>> because, as far as the kernel's understanding of SMP topologies goes,
> >>>> both masks are equivalent.
> >>>>
> >>>> Further discussions on the kernel mailing list [2] showed that the
> >>>> powerpc kernel has always considered the number of sockets to be equal
> >>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
> >>>> for Power hardware at least, for two or more sockets to be in the same
> >>>> NUMA node. The immediate conclusion is that all the SMP topologies the
> >>>> pseries machine was supplying to the kernel with more than one socket
> >>>> in the same NUMA node, as in [1], happened to be represented correctly
> >>>> in the kernel only by accident during all these years.
> >>>>
> >>>> There's a case to be made for virtual topologies being detached from
> >>>> hardware constraints, allowing maximum flexibility to users. At the same
> >>>> time, this freedom can't result in unrealistic hardware representations
> >>>> being emulated. If the real hardware and the pseries kernel don't
> >>>> support multiple chips/sockets in the same NUMA node, neither should we.
> >>>>
> >>>> Starting with 6.0.0, every socket must map to a unique NUMA node in the
> >>>> pseries machine. The qtest changes adapt the affected tests to this new
> >>>> condition.
> >>>
> >>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
> >>> for users to match these up, for no real gain.  I'm also concerned
> >>> that this will require follow-on changes in libvirt to not make this a
> >>> really cryptic and irritating point of failure.
> >>
> >> I haven't thought about the required Libvirt changes, although I can say that
> >> some amount will be needed and they will probably annoy existing users
> >> (everyone that has a multiple-sockets-per-NUMA-node topology).
> >>
> >> There is not much we can do from the QEMU layer aside from what I've proposed
> >> here. The other alternative is to keep interacting with the kernel folks to
> >> see if there is a way to keep our use case untouched.
> > 
> > Right.  Well.. not necessarily untouched, but I'm hoping for more
> > replies from Cédric to my objections and mpe's.  Even with sockets
> > being a kinda meaningless concept in PAPR, I don't think tying it to
> > NUMA nodes makes sense.
> 
> I sent a couple of replies in different email threads, but maybe not
> to all. I felt it was going nowhere :/ A couple of thoughts,

I think I saw some of those, but maybe not all.

> Shouldn't we get rid of the socket concept, and of die too, under
> pseries, since they don't exist under PAPR? We only have NUMA nodes,
> cores, and threads AFAICT.

Theoretically, yes.  I'm not sure it's really practical, though, since
AFAICT, both qemu and the kernel have the notion of sockets (though
not dies) built into generic code.

It does mean that one possible approach here - maybe the best one - is
to simply declare that sockets are meaningless under PAPR, so we simply
don't expect what the guest kernel reports to match what's given to
qemu.

It'd be nice to avoid that if we can: in a sense it's just cosmetic,
but it is likely to surprise and confuse people.

> Should we diverge from PAPR and add extra DT properties "qemu,..."?
> There are a couple of places where Linux checks for the underlying 
> hypervisor already.
> 
> >> This also means that
> >> 'ibm,chip-id' will probably remain in use, since it's the only place where
> >> we convey cores-per-socket information to the kernel.
> > 
> > Well.. unless we can find some other sensible way to convey that
> > information.  I haven't given up hope for that yet.
> 
> Well, we could start by fixing the value in QEMU. It is broken
> today.

Fixing what value, exactly?

> This is all coming from some work we did last year to evaluate our HW 
> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. 
> We saw some real problems because Linux did not have a clear view of the 
> topology. See the figures here:
> 
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
> 
> The node id is a key parameter for system resource management, memory 
> allocation, interrupt affinity, etc. Linux scales much better when it is
> used correctly.

Well, sure.  And we have all the ibm,associativity stuff to convey the
node ids to the guest (which has its own problems, but none that are
relevant here).  What's throwing me is what getting node IDs correct
has to do with socket numbers.
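
(For the record, that node id lives in the ibm,associativity cells. Reading
the dts dump further down the thread, the layout is, as far as I can tell
(worth double-checking against the PAPR text):

    ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
    /* 0x05 = number of cells that follow; the 0x01 entries are the
       associativity domains (here, node 1); the last cell (0x03) is
       the vcpu id */

so the guest derives the node id for that CPU without ever looking at
ibm,chip-id.)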

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-29  4:20           ` David Gibson
@ 2021-03-29 15:32             ` Cédric Le Goater
  2021-03-29 18:32               ` Daniel Henrique Barboza
  0 siblings, 1 reply; 23+ messages in thread
From: Cédric Le Goater @ 2021-03-29 15:32 UTC (permalink / raw)
  To: David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc,
	Igor Mammedov

On 3/29/21 6:20 AM, David Gibson wrote:
> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>> On 3/25/21 3:10 AM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>> equivalent.
>>>>>>
>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>> accident during all these years.
>>>>>>
>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>
>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>> condition.
>>>>>
>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>> that this will require follow on changes in libvirt to not make this a
>>>>> really cryptic and irritating point of failure.
>>>>
>>>> Haven't thought about required Libvirt changes, although I can say that there
>>>> will be some amount of work to be done and it will probably annoy existing
>>>> users (everyone that has a multiple socket per NUMA node topology).
>>>>
>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>> see if there is a way to keep our use case untouched.
>>>
>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>> NUMA nodes makes sense.
>>
>> I did a couple of replies in different email threads but maybe not 
>> to all. I felt it was going nowhere :/ Couple of thoughts,
> 
> I think I saw some of those, but maybe not all.
> 
>>> Shouldn't we get rid of the socket concept, and the die concept also, under
>>> pseries since they don't exist under PAPR? We only have numa nodes, cores,
>>> threads AFAICT.
> 
> Theoretically, yes.  I'm not sure it's really practical, though, since
> AFAICT, both qemu and the kernel have the notion of sockets (though
> not dies) built into generic code.

Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" 
and PPC Linux only has a NUMA node id, on pseries and powernv.

> It does mean that one possible approach here - maybe the best one - is
> to simply declare that sockets are meaningless under PAPR, so we simply
> don't expect what the guest kernel reports to match what's given to
> qemu.
> 
> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> but it is likely to surprise and confuse people.
> 
>>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
>> There are a couple of places where Linux checks for the underlying 
>> hypervisor already.
>>
>>>> This also means that
>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>> we convey cores-per-socket information to the kernel.
>>>
>>> Well.. unless we can find some other sensible way to convey that
>>> information.  I haven't given up hope for that yet.
>>
>> Well, we could start by fixing the value in QEMU. It is broken
>> today.
> 
> Fixing what value, exactly?

The value of the "ibm,chip-id" property, since we are keeping it under
QEMU.

>> This is all coming from some work we did last year to evaluate our HW 
>> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. 
>> We saw some real problems because Linux did not have a clear view of the 
>> topology. See the figures here:
>>
>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
>>
>> The node id is a key parameter for system resource management, memory 
>> allocation, interrupt affinity, etc. Linux scales much better when it is
>> used correctly.
> 
> Well, sure.  And we have all the ibm,associativity stuff to convey the
> node ids to the guest (which has its own problems, but none that are
> relevant here).  What's throwing me is what getting node IDs correct
> has to do with socket numbers.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-29 15:32             ` Cédric Le Goater
@ 2021-03-29 18:32               ` Daniel Henrique Barboza
  2021-03-29 23:55                 ` Igor Mammedov
  2021-03-31  0:57                 ` David Gibson
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-29 18:32 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov



On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> On 3/29/21 6:20 AM, David Gibson wrote:
>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>> equivalent.
>>>>>>>
>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>> accident during all these years.
>>>>>>>
>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>
>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>> condition.
>>>>>>
>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>> really cryptic and irritating point of failure.
>>>>>
>>>>> Haven't thought about required Libvirt changes, although I can say that there
>>>>> will be some amount of work to be done and it will probably annoy existing
>>>>> users (everyone that has a multiple socket per NUMA node topology).
>>>>>
>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>> see if there is a way to keep our use case untouched.
>>>>
>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>> NUMA nodes makes sense.
>>>
>>> I did a couple of replies in different email threads but maybe not
>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>
>> I think I saw some of those, but maybe not all.
>>
>>> Shouldn't we get rid of the socket concept, and the die concept also, under
>>> pseries since they don't exist under PAPR? We only have numa nodes, cores,
>>> threads AFAICT.
>>
>> Theoretically, yes.  I'm not sure it's really practical, though, since
>> AFAICT, both qemu and the kernel have the notion of sockets (though
>> not dies) built into generic code.
> 
> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> and PPC Linux only has a NUMA node id, on pseries and powernv.
> 
>> It does mean that one possible approach here - maybe the best one - is
>> to simply declare that sockets are meaningless under PAPR, so we simply
>> don't expect what the guest kernel reports to match what's given to
>> qemu.
>>
>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>> but it is likely to surprise and confuse people.
>>
>>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
>>> There are a couple of places where Linux checks for the underlying
>>> hypervisor already.
>>>
>>>>> This also means that
>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>> we convey cores-per-socket information to the kernel.
>>>>
>>>> Well.. unless we can find some other sensible way to convey that
>>>> information.  I haven't given up hope for that yet.
>>>
>>> Well, we could start by fixing the value in QEMU. It is broken
>>> today.
>>
>> Fixing what value, exactly?
> 
> The value of the "ibm,chip-id" property, since we are keeping it under
> QEMU.

David, I believe this has to do with the discussion we had last Friday.

I mentioned that the ibm,chip-id property is being calculated in a way that
assigns the same ibm,chip-id to CPUs that belong to different NUMA nodes,
e.g.:

-smp 4,cores=4,maxcpus=8,threads=1 \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1


$ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
			ibm,pft-size = <0x00 0x19>;
			ibm,chip-id = <0x00>;
--
			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
			ibm,pft-size = <0x00 0x19>;
			ibm,chip-id = <0x00>;
--
			ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
			ibm,pft-size = <0x00 0x19>;
			ibm,chip-id = <0x00>;
--
			ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
			ibm,pft-size = <0x00 0x19>;
			ibm,chip-id = <0x00>;

We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a different
NUMA node than 0-1. This would mean that the same socket would belong to
different NUMA nodes at the same time.

I believe this is what Cedric wants to be addressed. Given that the property is
named after the OPAL property ibm,chip-id, the kernel expects it to have the
same semantics as in OPAL.



Thanks,


DHB




> 
>>> This is all coming from some work we did last year to evaluate our HW
>>> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
>>> We saw some real problems because Linux did not have a clear view of the
>>> topology. See the figures here:
>>>
>>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
>>>
>>> The node id is a key parameter for system resource management, memory
>>> allocation, interrupt affinity, etc. Linux scales much better when it is
>>> used correctly.
>>
>> Well, sure.  And we have all the ibm,associativity stuff to convey the
>> node ids to the guest (which has its own problems, but none that are
>> relevant here).  What's throwing me is what getting node IDs correct
>> has to do with socket numbers.
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-23  1:03   ` David Gibson
  2021-03-23 17:21     ` Daniel Henrique Barboza
@ 2021-03-29 23:51     ` Igor Mammedov
  2021-03-30 21:33       ` Daniel Henrique Barboza
  1 sibling, 1 reply; 23+ messages in thread
From: Igor Mammedov @ 2021-03-29 23:51 UTC (permalink / raw)
  To: David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc, clg

On Tue, 23 Mar 2021 12:03:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> > Kernel commit 4bce545903fa ("powerpc/topology: Update
> > topology_core_cpumask") cause a regression in the pseries machine when
> > defining certain SMP topologies [1]. The reasoning behind the change is
> > explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> > cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
> > large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
> > far as the kernel understanding of SMP topologies goes, both masks are
> > equivalent.
> > 
> > Further discussions in the kernel mailing list [2] shown that the
> > powerpc kernel always considered that the number of sockets were equal
> > to the number of NUMA nodes. The claim is that it doesn't make sense,
> > for Power hardware at least, 2+ sockets being in the same NUMA node. The
> > immediate conclusion is that all SMP topologies the pseries machine were
> > supplying to the kernel, with more than one socket in the same NUMA node
> > as in [1], happened to be correctly represented in the kernel by
> > accident during all these years.
> > 
> > There's a case to be made for virtual topologies being detached from
> > hardware constraints, allowing maximum flexibility to users. At the same
> > time, this freedom can't result in unrealistic hardware representations
> > being emulated. If the real hardware and the pseries kernel don't
> > support multiple chips/sockets in the same NUMA node, neither should we.
> > 
> > Starting in 6.0.0, all sockets must match an unique NUMA node in the
> > pseries machine. qtest changes were made to adapt to this new
> > condition.  
> 
> Oof.  I really don't like this idea.  It means a bunch of fiddly work
> for users to match these up, for no real gain.  I'm also concerned
> that this will require follow on changes in libvirt to not make this a
> really cryptic and irritating point of failure.

Yes, it is annoying to users, but we have been very slowly tightening
limitations on the NUMA path, especially where nonsense configs cause
problems on the guest side (and I suspect there are a few things left to
enforce in the generic NUMA code; currently they are just warnings). So
it's nothing new.

So if the limit applies only to new machine types it should be fine (i.e. no
existing VMs will suffer). Later on we can deprecate invalid configurations
altogether and just error out.
 
> > 
> > [1] https://bugzilla.redhat.com/1934421
> > [2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/
> > 
> > CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > CC: Cédric Le Goater <clg@kaod.org>
> > CC: Igor Mammedov <imammedo@redhat.com>
> > CC: Laurent Vivier <lvivier@redhat.com>
> > CC: Thomas Huth <thuth@redhat.com>
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > ---
> >  hw/ppc/spapr.c                 |  3 ++
> >  hw/ppc/spapr_numa.c            |  7 +++++
> >  include/hw/ppc/spapr.h         |  1 +
> >  tests/qtest/cpu-plug-test.c    |  4 +--
> >  tests/qtest/device-plug-test.c |  9 +++++-
> >  tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
> >  6 files changed, 64 insertions(+), 12 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index d56418ca29..745f71c243 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
> >   */
> >  static void spapr_machine_5_2_class_options(MachineClass *mc)
> >  {
> > +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> >      spapr_machine_6_0_class_options(mc);
> >      compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
> > +    smc->pre_6_0_smp_topology = true;
> >  }
> >  
> >  DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
> > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > index 779f18b994..0ade43dd79 100644
> > --- a/hw/ppc/spapr_numa.c
> > +++ b/hw/ppc/spapr_numa.c
> > @@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
> >      int i, j, max_nodes_with_gpus;
> >      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> >  
> > +    if (!smc->pre_6_0_smp_topology &&
> > +        nb_numa_nodes != machine->smp.sockets) {
> > +        error_report("Number of CPU sockets must be equal to the number "
> > +                     "of NUMA nodes");
> > +        exit(EXIT_FAILURE);
> > +    }
> > +
> >      /*
> >       * For all associativity arrays: first position is the size,
> >       * position MAX_DISTANCE_REF_POINTS is always the numa_id,
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index 47cebaf3ac..98dc5d198a 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -142,6 +142,7 @@ struct SpaprMachineClass {
> >      hwaddr rma_limit;          /* clamp the RMA to this size */
> >      bool pre_5_1_assoc_refpoints;
> >      bool pre_5_2_numa_associativity;
> > +    bool pre_6_0_smp_topology;
> >  
> >      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> >                            uint64_t *buid, hwaddr *pio,
> > diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c
> > index a1c689414b..946b9129ea 100644
> > --- a/tests/qtest/cpu-plug-test.c
> > +++ b/tests/qtest/cpu-plug-test.c
> > @@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname)
> >      data->machine = g_strdup(mname);
> >      data->cpu_model = "power8_v2.0";
> >      data->device_model = g_strdup("power8_v2.0-spapr-cpu-core");
> > -    data->sockets = 2;
> > -    data->cores = 3;
> > +    data->sockets = 1;
> > +    data->cores = 6;
> >      data->threads = 1;
> >      data->maxcpus = data->sockets * data->cores * data->threads;
> >  
> > diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c
> > index 559d47727a..dd7d8268d2 100644
> > --- a/tests/qtest/device-plug-test.c
> > +++ b/tests/qtest/device-plug-test.c
> > @@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void)
> >  {
> >      QTestState *qtest;
> >  
> > -    qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 "
> > +    /*
> > +     * Default smp settings will prioritize sockets over cores and
> > +     * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However,
> > +     * the pseries machine requires a NUMA node for each socket
> > +     * (since 6.0.0). Specify sockets=1 to make life easier.
> > +     */
> > +    qtest = qtest_initf("-cpu power9_v2.0 "
> > +                        "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 "
> >                          "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0");
> >  
> >      /* similar to test_pci_unplug_request */
> > diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
> > index dc0ec571ca..bb13f7131b 100644
> > --- a/tests/qtest/numa-test.c
> > +++ b/tests/qtest/numa-test.c
> > @@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data)
> >      QTestState *qts;
> >      g_autofree char *s = NULL;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                             "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > +                             "-numa node,nodeid=1,cpus=4-7");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > +                             "-numa node,nodeid=1,cpus=4-7");
> > +    }
> >  
> > -    cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > -                         "-numa node,nodeid=1,cpus=4-7");
> >      qts = qtest_init(cli);
> >  
> >      s = qtest_hmp(qts, "info numa");
> > @@ -57,10 +65,18 @@ static void test_mon_partial(const void *data)
> >      QTestState *qts;
> >      g_autofree char *s = NULL;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                             "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > +                             "-numa node,nodeid=1,cpus=4-5 ");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 "
> > +                             "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > +                             "-numa node,nodeid=1,cpus=4-5 ");
> > +    }
> >  
> > -    cli = make_cli(data, "-smp 8 "
> > -                   "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > -                   "-numa node,nodeid=1,cpus=4-5 ");
> >      qts = qtest_init(cli);
> >  
> >      s = qtest_hmp(qts, "info numa");
> > @@ -85,9 +101,17 @@ static void test_query_cpus(const void *data)
> >      QObject *e;
> >      QTestState *qts;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                             "-numa node,memdev=ram,cpus=0-3 "
> > +                             "-numa node,cpus=4-7");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> > +                             "-numa node,cpus=4-7");
> > +    }
> >  
> > -    cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> > -                         "-numa node,cpus=4-7");
> >      qts = qtest_init(cli);
> >      cpus = get_cpus(qts, &resp);
> >      g_assert(cpus);
> > @@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data)
> >      QTestState *qts;
> >      g_autofree char *cli = NULL;
> >  
> > -    cli = make_cli(data, "-smp 4,cores=4 "
> > +    cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 "
> >          "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
> >          "-numa cpu,node-id=0,core-id=0 "
> >          "-numa cpu,node-id=0,core-id=1 "
> > @@ -554,7 +578,17 @@ int main(int argc, char **argv)
> >  
> >      g_test_init(&argc, &argv, NULL);
> >  
> > -    qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> > +    /*
> > +     * Starting on 6.0.0, for the pseries machine, '-smp 8' will only work
> > +     * if we have 8 NUMA nodes. If we specify 'smp 8,sockets=2' to match
> > +     * 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of
> > +     * 0246/1357 that test_def_cpu_split expects. In short, this test is
> > +     * no longer valid for ppc64 in 6.0.0.
> > +     */
> > +    if (!g_str_equal(arch, "ppc64")) {
> > +        qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> > +    }
> > +
> >      qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit);
> >      qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
> >      qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);  
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-29 18:32               ` Daniel Henrique Barboza
@ 2021-03-29 23:55                 ` Igor Mammedov
  2021-03-31  0:57                 ` David Gibson
  1 sibling, 0 replies; 23+ messages in thread
From: Igor Mammedov @ 2021-03-29 23:55 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Cédric Le Goater, David Gibson

On Mon, 29 Mar 2021 15:32:37 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:  
> >> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:  
> >>> On 3/25/21 3:10 AM, David Gibson wrote:  
> >>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:  
> >>>>>
> >>>>>
> >>>>> On 3/22/21 10:03 PM, David Gibson wrote:  
> >>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:  
> >>>>>>> [...]
> >>>>>>
> >>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
> >>>>>> for users to match these up, for no real gain.  I'm also concerned
> >>>>>> that this will require follow on changes in libvirt to not make this a
> >>>>>> really cryptic and irritating point of failure.  
> >>>>>
> >>>>> Haven't thought about required Libvirt changes, although I can say that there
> >>>>> will be some amount of work to be done and it will probably annoy existing
> >>>>> users (everyone that has a multiple socket per NUMA node topology).
> >>>>>
> >>>>> There is not much we can do from the QEMU layer aside from what I've proposed
> >>>>> here. The other alternative is to keep interacting with the kernel folks to
> >>>>> see if there is a way to keep our use case untouched.  
> >>>>
> >>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
> >>>> replies from Cédric to my objections and mpe's.  Even with sockets
> >>>> being a kinda meaningless concept in PAPR, I don't think tying it to
> >>>> NUMA nodes makes sense.  
> >>>
> >>> I did a couple of replies in different email threads but maybe not
> >>> to all. I felt it was going nowhere :/ Couple of thoughts,  
> >>
> >> I think I saw some of those, but maybe not all.
> >>  
> >>> Shouldn't we get rid of the socket concept, and the die concept also, under
> >>> pseries since they don't exist under PAPR? We only have numa nodes, cores,
> >>> threads AFAICT.
> >>
> >> Theoretically, yes.  I'm not sure it's really practical, though, since
> >> AFAICT, both qemu and the kernel have the notion of sockets (though
> >> not dies) built into generic code.  
> > 
> > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> >   
> >> It does mean that one possible approach here - maybe the best one - is
> >> to simply declare that sockets are meaningless under PAPR, so we simply
> >> don't expect what the guest kernel reports to match what's given to
> >> qemu.
> >>
> >> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> >> but it is likely to surprise and confuse people.
> >>  
> >>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
> >>> There are a couple of places where Linux checks for the underlying
> >>> hypervisor already.
> >>>  
> >>>>> This also means that
> >>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
> >>>>> we convey cores-per-socket information to the kernel.
> >>>>
> >>>> Well.. unless we can find some other sensible way to convey that
> >>>> information.  I haven't given up hope for that yet.  
> >>>
> >>> Well, we could start by fixing the value in QEMU. It is broken
> >>> today.  
> >>
> >> Fixing what value, exactly?  
> > 
> > The value of the "ibm,chip-id" property, since we are keeping it under
> > QEMU.
> 
> David, I believe this has to do with the discussion we had last Friday.
> 
> I mentioned that the ibm,chip-id property is being calculated in a way that
> assigns the same ibm,chip-id to CPUs that belong to different NUMA nodes,
> e.g.:
> 
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

Just a heads up: 'cpus=' is going away (the deprecation patch got lost during
6.0, but I plan on pushing it during the 6.1 devel window); you can use
'-numa cpu,node-id=a,core-id=b' instead.
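
For example, the topology quoted above could be rewritten as something like
this (an untested sketch, assuming that with threads=1 the spapr core-id
values match the vcpu indexes):

-smp 4,cores=4,maxcpus=8,threads=1 \
-numa node,nodeid=0,memdev=ram-node0 \
-numa node,nodeid=1,memdev=ram-node1 \
-numa cpu,node-id=0,core-id=0 \
-numa cpu,node-id=0,core-id=1 \
-numa cpu,node-id=1,core-id=2 \
-numa cpu,node-id=1,core-id=3 \
-numa cpu,node-id=0,core-id=4 \
-numa cpu,node-id=0,core-id=5 \
-numa cpu,node-id=1,core-id=6 \
-numa cpu,node-id=1,core-id=7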

> [...]



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-29 23:51     ` Igor Mammedov
@ 2021-03-30 21:33       ` Daniel Henrique Barboza
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-30 21:33 UTC (permalink / raw)
  To: Igor Mammedov, David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, qemu-devel,
	groug, qemu-ppc, clg



On 3/29/21 8:51 PM, Igor Mammedov wrote:
> On Tue, 23 Mar 2021 12:03:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>> [...]
>>
>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>> for users to match these up, for no real gain.  I'm also concerned
>> that this will require follow on changes in libvirt to not make this a
>> really cryptic and irritating point of failure.
> 
> Yes, it is annoying to users, but we have been very slowly tightening
> limitations on the NUMA path, especially where nonsense configs cause
> problems on the guest side (and I suspect there are a few things left to
> enforce in the generic NUMA code; currently they are just warnings). So
> it's nothing new.

Do you mind elaborating a bit on the 'nonsense configs' bit? Do you mean
configs that are unrealistic because they're non-compliant with ACPI? I always
thought that all possible SMP/NUMA topologies QEMU allows were ACPI/x86
compliant.

If x86 and PowerPC have restrictions in common, we might want to consider
putting this code in a common helper/place.


> 
> So if the limit applies only to new machine types it should be fine (i.e.
> no existing VMs will suffer). Later on we can deprecate invalid
> configurations altogether and just error out.

Hmmm, this patch doesn't warn the user about the deprecated topologies in
existing guests. I guess it doesn't hurt to add a warning for existing
VMs.
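
Something along these lines on top of the check from the patch, reusing the
same variables (an untested sketch, not final code):

    if (smc->pre_6_0_smp_topology &&
        nb_numa_nodes != machine->smp.sockets) {
        warn_report("Number of CPU sockets (%u) does not match the number "
                    "of NUMA nodes (%d); this topology is deprecated and "
                    "may be rejected by future pseries machine types",
                    machine->smp.sockets, nb_numa_nodes);
    }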



Thanks,


DHB


> [...]


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-29 18:32               ` Daniel Henrique Barboza
  2021-03-29 23:55                 ` Igor Mammedov
@ 2021-03-31  0:57                 ` David Gibson
  2021-03-31  4:58                   ` Michael Ellerman
  2021-03-31 15:18                   ` Cédric Le Goater
  1 sibling, 2 replies; 23+ messages in thread
From: David Gibson @ 2021-03-31  0:57 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	groug, qemu-devel, qemu-ppc, Cédric Le Goater,
	Igor Mammedov

[-- Attachment #1: Type: text/plain, Size: 7676 bytes --]

On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:
> > > On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> > > > On 3/25/21 3:10 AM, David Gibson wrote:
> > > > > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> > > > > > 
> > > > > > 
> > > > > > On 3/22/21 10:03 PM, David Gibson wrote:
> > > > > > > On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> > > > > > > > > [...]
> > > > > > > 
> > > > > > > Oof.  I really don't like this idea.  It means a bunch of fiddly work
> > > > > > > for users to match these up, for no real gain.  I'm also concerned
> > > > > > > that this will require follow on changes in libvirt to not make this a
> > > > > > > really cryptic and irritating point of failure.
> > > > > > 
> > > > > > > Haven't thought about required Libvirt changes, although I can say that
> > > > > > > there will be some amount of work to be done and it will probably annoy
> > > > > > > existing users (everyone that has a multiple socket per NUMA node topology).
> > > > > > 
> > > > > > There is not much we can do from the QEMU layer aside from what I've proposed
> > > > > > here. The other alternative is to keep interacting with the kernel folks to
> > > > > > see if there is a way to keep our use case untouched.
> > > > > 
> > > > > Right.  Well.. not necessarily untouched, but I'm hoping for more
> > > > > replies from Cédric to my objections and mpe's.  Even with sockets
> > > > > being a kinda meaningless concept in PAPR, I don't think tying it to
> > > > > NUMA nodes makes sense.
> > > > 
> > > > I did a couple of replies in different email threads but maybe not
> > > > to all. I felt it was going nowhere :/ Couple of thoughts,
> > > 
> > > I think I saw some of those, but maybe not all.
> > > 
> > > > > Shouldn't we get rid of the socket concept, and the die concept also, under
> > > > > pseries since they don't exist under PAPR? We only have numa nodes, cores,
> > > > > threads AFAICT.
> > > 
> > > Theoretically, yes.  I'm not sure it's really practical, though, since
> > > AFAICT, both qemu and the kernel have the notion of sockets (though
> > > not dies) built into generic code.
> > 
> > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> > 
> > > It does mean that one possible approach here - maybe the best one - is
> > > to simply declare that sockets are meaningless under PAPR, so we simply
> > > don't expect what the guest kernel reports to match what's given to
> > > qemu.
> > > 
> > > It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> > > but it is likely to surprise and confuse people.
> > > 
> > > > Should we diverge from PAPR and add extra DT properties "qemu,..."?
> > > > There are a couple of places where Linux checks for the underlying
> > > > hypervisor already.
> > > > 
> > > > > > This also means that
> > > > > > 'ibm,chip-id' will probably remain in use since it's the only place where
> > > > > > we convey cores-per-socket information to the kernel.
> > > > > 
> > > > > Well.. unless we can find some other sensible way to convey that
> > > > > information.  I haven't given up hope for that yet.
> > > > 
> > > > Well, we could start by fixing the value in QEMU. It is broken
> > > > today.
> > > 
> > > Fixing what value, exactly?
> > 
> > The value of the "ibm,chip-id" property, since we are keeping it under
> > QEMU.
> 
> David, I believe this has to do with the discussion we had last Friday.
> 
> I mentioned that the ibm,chip-id property is being calculated in a way that
> assigns the same ibm,chip-id to CPUs that belong to different NUMA nodes,
> e.g.:
> 
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> 
> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> 			ibm,pft-size = <0x00 0x19>;
> 			ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;

> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> different NUMA node than 0-1. This would mean that the same socket
> would belong to different NUMA nodes at the same time.

Right... and I'm still not seeing why that's a problem.  AFAICT that's
a possible, if unexpected, situation on real hardware - though
maybe not for POWER9 specifically.

> I believe this is what Cedric wants to be addressed. Given that the
> property is named after the OPAL property ibm,chip-id, the kernel
> expects it to have the same semantics as in OPAL.

Even on powernv, I'm not clear why chip-id is tied into the NUMA
configuration, rather than getting all the NUMA info from
associativity properties.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31  0:57                 ` David Gibson
@ 2021-03-31  4:58                   ` Michael Ellerman
  2021-03-31 15:22                     ` Cédric Le Goater
  2021-03-31 15:18                   ` Cédric Le Goater
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2021-03-31  4:58 UTC (permalink / raw)
  To: David Gibson, Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, groug,
	qemu-devel, qemu-ppc, Cédric Le Goater, Igor Mammedov

David Gibson <david@gibson.dropbear.id.au> writes:
> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
...
>
>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>> different NUMA node than 0-1. This would mean that the same socket
>> would belong to different NUMA nodes at the same time.
>
> Right... and I'm still not seeing why that's a problem.  AFAICT that's
> a possible, if unexpected, situation on real hardware - though
> maybe not for POWER9 specifically.

I think I agree.

>> I believe this is what Cedric wants to be addressed. Given that the
>> property is called after the OPAL property ibm,chip-id, the kernel
>> expects that the property will have the same semantics as in OPAL.
>
> Even on powernv, I'm not clear why chip-id is tied into the NUMA
> configuration, rather than getting all the NUMA info from
> associativity properties.

AFAIK we don't use chip-id for anything related to NUMA; if we do, I'd
consider that a bug.

We do use it for topology_physical_package_id(), but that's almost
completely unused.
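
(For reference, on powerpc that macro is just a thin wrapper over the device
tree property; quoting from memory, so worth double-checking against the
tree:

    /* arch/powerpc/include/asm/topology.h */
    #define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))

and cpu_to_chip_id() resolves the value by walking the CPU's device tree
node looking for the "ibm,chip-id" property.)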

cheers


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31  0:57                 ` David Gibson
  2021-03-31  4:58                   ` Michael Ellerman
@ 2021-03-31 15:18                   ` Cédric Le Goater
  2021-03-31 17:29                     ` Daniel Henrique Barboza
  2021-04-01  2:59                     ` David Gibson
  1 sibling, 2 replies; 23+ messages in thread
From: Cédric Le Goater @ 2021-03-31 15:18 UTC (permalink / raw)
  To: David Gibson, Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov

On 3/31/21 2:57 AM, David Gibson wrote:
> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>>>> equivalent.
>>>>>>>>>
>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>> accident during all these years.
>>>>>>>>>
>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>
>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>> condition.
>>>>>>>>
>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>
>>>>>>> Haven't thought about required Libvirt changes, although I can say that there
>>>>>>> will be some amount to be made and it will probably annoy existing users
>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>
>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>
>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>> NUMA nodes makes sense.
>>>>>
>>>>> I did a couple of replies in different email threads but maybe not
>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>
>>>> I think I saw some of those, but maybe not all.
>>>>
>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>> threads AFAICT.
>>>>
>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>> not dies) built into generic code.
>>>
>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>
>>>> It does mean that one possible approach here - maybe the best one - is
>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>> don't expect what the guest kernel reports to match what's given to
>>>> qemu.
>>>>
>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>> but it is likely to surprise and confuse people.
>>>>
>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>> There are a couple of places where Linux checks for the underlying
>>>>> hypervisor already.
>>>>>
>>>>>>> This also means that
>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>> we inform cores per socket information to the kernel.
>>>>>>
>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>> information.  I haven't given up hope for that yet.
>>>>>
>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>> today.
>>>>
>>>> Fixing what value, exactly?
>>>
>>> The value of the "ibm,chip-id" since we are keeping the property under
>>> QEMU.
>>
>> David, I believe this has to do with the discussion we had last Friday.
>>
>> I mentioned that the ibm,chip-id property is being calculated in a way that
>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>> e.g.:
>>
>> -smp 4,cores=4,maxcpus=8,threads=1 \
>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>>
>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>> 			ibm,pft-size = <0x00 0x19>;
>> 			ibm,chip-id = <0x00>;
>> --
>>                       ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>                       ibm,pft-size = <0x00 0x19>;
>>                       ibm,chip-id = <0x00>;
>> --
>>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>                       ibm,pft-size = <0x00 0x19>;
>>                       ibm,chip-id = <0x00>;
>> --
>>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>                       ibm,pft-size = <0x00 0x19>;
>>                       ibm,chip-id = <0x00>;
> 
>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>> different NUMA node than 0-1. This would mean that the same socket
>> would belong to different NUMA nodes at the same time.
> 
> Right... and I'm still not seeing why that's a problem.  AFAICT that's
> a possible, if unexpected, situation under real hardware - though
> maybe not for POWER9 specifically.
The ibm,chip-id property does not exist under PAPR. PAPR only has
NUMA nodes, no sockets nor chips. 

And the property value is simply broken under QEMU. Try this  :

   -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
		ibm,chip-id = <0x01>;
		ibm,chip-id = <0x02>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x03>;

>> I believe this is what Cedric wants to be addressed. Given that the
>> property is called after the OPAL property ibm,chip-id, the kernel
>> expects that the property will have the same semantics as in OPAL.
> Even on powernv, I'm not clear why chip-id is tied into the NUMA
> configuration, rather than getting all the NUMA info from
> associativity properties.

It is the case. 

The associativity properties are built from chip-id in OPAL though.

The chip-id property is only used in low level PowerNV drivers, VAS,
XSCOM, LPC, etc. 

It's also badly used in the common part of the XIVE driver, which I am
trying to fix in order to introduce an IPI per node on all platforms.

C.

 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31  4:58                   ` Michael Ellerman
@ 2021-03-31 15:22                     ` Cédric Le Goater
  2021-04-01  2:53                       ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Cédric Le Goater @ 2021-03-31 15:22 UTC (permalink / raw)
  To: Michael Ellerman, David Gibson, Daniel Henrique Barboza
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, qemu-devel,
	groug, qemu-ppc, Igor Mammedov

On 3/31/21 6:58 AM, Michael Ellerman wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> ...
>>
>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>> different NUMA node than 0-1. This would mean that the same socket
>>> would belong to different NUMA nodes at the same time.
>>
>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>> a possible, if unexpected, situation under real hardware - though
>> maybe not for POWER9 specifically.
> 
> I think I agree.
> 
>>> I believe this is what Cedric wants to be addressed. Given that the
>>> property is called after the OPAL property ibm,chip-id, the kernel
>>> expects that the property will have the same semantics as in OPAL.
>>
>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>> configuration, rather than getting all the NUMA info from
>> associativity properties.
> 
> AFAIK we don't use chip-id for anything related to NUMA, if we do I'd
> consider that a bug.

Since PAPR only has NUMA nodes, is the use of chip-id in XIVE PAPR
considered a bug? I would say so.

> We do use it for topology_physical_package_id(), but that's almost
> completely unused.

In that case, I think it should be fine to return -1 like under PowerVM.

Thanks,

C. 




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31 15:18                   ` Cédric Le Goater
@ 2021-03-31 17:29                     ` Daniel Henrique Barboza
  2021-03-31 17:40                       ` Cédric Le Goater
  2021-04-01  2:59                     ` David Gibson
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-31 17:29 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov


On 3/31/21 12:18 PM, Cédric Le Goater wrote:
> On 3/31/21 2:57 AM, David Gibson wrote:
>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>>>>> equivalent.
>>>>>>>>>>
>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>> accident during all these years.
>>>>>>>>>>
>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>>
>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>> condition.
>>>>>>>>>
>>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>
>>>>>>>> Haven't thought about required Libvirt changes, although I can say that there
>>>>>>>> will be some amount to be made and it will probably annoy existing users
>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>
>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>
>>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>> NUMA nodes makes sense.
>>>>>>
>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>
>>>>> I think I saw some of those, but maybe not all.
>>>>>
>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>> threads AFAICT.
>>>>>
>>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>> not dies) built into generic code.
>>>>
>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>
>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>>> don't expect what the guest kernel reports to match what's given to
>>>>> qemu.
>>>>>
>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>> but it is likely to surprise and confuse people.
>>>>>
>>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>> hypervisor already.
>>>>>>
>>>>>>>> This also means that
>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>>> we inform cores per socket information to the kernel.
>>>>>>>
>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>> information.  I haven't given up hope for that yet.
>>>>>>
>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>> today.
>>>>>
>>>>> Fixing what value, exactly?
>>>>
>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>> QEMU.
>>>
>>> David, I believe this has to do with the discussion we had last Friday.
>>>
>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>>> e.g.:
>>>
>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>>
>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>> 			ibm,pft-size = <0x00 0x19>;
>>> 			ibm,chip-id = <0x00>;
>>> --
>>>                        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>>                        ibm,pft-size = <0x00 0x19>;
>>>                        ibm,chip-id = <0x00>;
>>> --
>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>>                        ibm,pft-size = <0x00 0x19>;
>>>                        ibm,chip-id = <0x00>;
>>> --
>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>>                        ibm,pft-size = <0x00 0x19>;
>>>                        ibm,chip-id = <0x00>;
>>
>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>> different NUMA node than 0-1. This would mean that the same socket
>>> would belong to different NUMA nodes at the same time.
>>
>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>> a possible, if unexpected, situation under real hardware - though
>> maybe not for POWER9 specifically.
> The ibm,chip-id property does not exist under PAPR. PAPR only has
> NUMA nodes, no sockets nor chips.
> 
> And the property value is simply broken under QEMU. Try this  :
> 
>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;

These values are not wrong. When you do:

-smp 4,cores=1,maxcpus=8 (....)

You didn't fill in threads and sockets. QEMU's default is to prioritize
sockets when completing the missing information, up to the maxcpus value.
This means that what you did is equivalent to:

-smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)


It's a config with 1 thread per core, 1 core per socket and 8 sockets. Each
possible CPU will sit in its own single-core socket, with its own ibm,chip-id. So:

"-numa node,nodeid=0,cpus=0-1"

is in fact allocating sockets 0 and 1 to NUMA node 0.
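
That also matches how spapr.c computes the property today -- roughly the
following, with variable names approximated from memory rather than
copied from the source:

    /* Sketch of the ibm,chip-id calculation in hw/ppc/spapr.c (names
     * approximate): the id is the vcpu index divided by the number of
     * vcpus per socket, so with cores=1,threads=1 every possible vcpu
     * is its own socket and gets a distinct id. */
    int vcpus_per_socket = smp_threads * smp_cores;   /* 1 * 1 = 1 here */
    int chip_id = cs->cpu_index / vcpus_per_socket;   /* vcpus 0..3 -> 0..3 */

    _FDT(fdt_setprop_cell(fdt, offset, "ibm,chip-id", chip_id));

The 1,2,0,3 ordering in your grep is presumably just the directory
iteration order of /proc/device-tree/cpus/, not the assignment order.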


Thanks,



DHB

> 
>>> I believe this is what Cedric wants to be addressed. Given that the
>>> property is called after the OPAL property ibm,chip-id, the kernel
>>> expects that the property will have the same semantics as in OPAL.
>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>> configuration, rather than getting all the NUMA info from
>> associativity properties.
> 
> It is the case.
> 
> The associativity properties are built from chip-id in OPAL though.
> 
> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.
> 
> It's also badly used in the common part of the XIVE driver, which I am
> trying to fix in order to introduce an IPI per node on all platforms.
> 
> C.
> 
>   
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31 17:29                     ` Daniel Henrique Barboza
@ 2021-03-31 17:40                       ` Cédric Le Goater
  0 siblings, 0 replies; 23+ messages in thread
From: Cédric Le Goater @ 2021-03-31 17:40 UTC (permalink / raw)
  To: Daniel Henrique Barboza, David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	qemu-devel, groug, qemu-ppc, Igor Mammedov

On 3/31/21 7:29 PM, Daniel Henrique Barboza wrote:
> 
> On 3/31/21 12:18 PM, Cédric Le Goater wrote:
>> On 3/31/21 2:57 AM, David Gibson wrote:
>>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>>>>>> equivalent.
>>>>>>>>>>>
>>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>>> accident during all these years.
>>>>>>>>>>>
>>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>>>
>>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>>> condition.
>>>>>>>>>>
>>>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>>
>>>>>>>>> Haven't thought about required Libvirt changes, although I can say that there
>>>>>>>>> will be some amount to be made and it will probably annoy existing users
>>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>>
>>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>>
>>>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>>> NUMA nodes makes sense.
>>>>>>>
>>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>>
>>>>>> I think I saw some of those, but maybe not all.
>>>>>>
>>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>>> threads AFAICT.
>>>>>>
>>>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>>> not dies) built into generic code.
>>>>>
>>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>>
>>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>>>> don't expect what the guest kernel reports to match what's given to
>>>>>> qemu.
>>>>>>
>>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>>> but it is likely to surprise and confuse people.
>>>>>>
>>>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>>> hypervisor already.
>>>>>>>
>>>>>>>>> This also means that
>>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>>>> we inform cores per socket information to the kernel.
>>>>>>>>
>>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>>> information.  I haven't given up hope for that yet.
>>>>>>>
>>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>>> today.
>>>>>>
>>>>>> Fixing what value, exactly?
>>>>>
>>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>>> QEMU.
>>>>
>>>> David, I believe this has to do with the discussion we had last Friday.
>>>>
>>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>>>> e.g.:
>>>>
>>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>>
>>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>>             ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>>             ibm,pft-size = <0x00 0x19>;
>>>>             ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>
>>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>>> different NUMA node than 0-1. This would mean that the same socket
>>>> would belong to different NUMA nodes at the same time.
>>>
>>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>>> a possible, if unexpected, situation under real hardware - though
>>> maybe not for POWER9 specifically.
>> The ibm,chip-id property does not exist under PAPR. PAPR only has
>> NUMA nodes, no sockets nor chips.
>>
>> And the property value is simply broken under QEMU. Try this  :
>>
>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>         ibm,chip-id = <0x01>;
>>         ibm,chip-id = <0x02>;
>>         ibm,chip-id = <0x00>;
>>         ibm,chip-id = <0x03>;
> 
> These values are not wrong. When you do:
> 
> -smp 4,cores=1,maxcpus=8 (....)
> 
> You didn't fill in threads and sockets. QEMU's default is to prioritize
> sockets when completing the missing information, up to the maxcpus value.
> This means that what you did is equivalent to:
> 
> -smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)
> 
> 
> It's a config with 1 thread per core, 1 core per socket and 8 sockets. Each
> possible CPU will sit in its own single-core socket, with its own ibm,chip-id. So:
> 
> "-numa node,nodeid=0,cpus=0-1"
> 
> is in fact allocating sockets 0 and 1 to NUMA node 0.

ok. 
Thanks,

C.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31 15:22                     ` Cédric Le Goater
@ 2021-04-01  2:53                       ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2021-04-01  2:53 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc,
	Igor Mammedov

[-- Attachment #1: Type: text/plain, Size: 2182 bytes --]

On Wed, Mar 31, 2021 at 05:22:39PM +0200, Cédric Le Goater wrote:
> On 3/31/21 6:58 AM, Michael Ellerman wrote:
> > David Gibson <david@gibson.dropbear.id.au> writes:
> >> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> > ...
> >>
> >>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> >>> different NUMA node than 0-1. This would mean that the same socket
> >>> would belong to different NUMA nodes at the same time.
> >>
> >> Right... and I'm still not seeing why that's a problem.  AFAICT that's
> >> a possible, if unexpected, situation under real hardware - though
> >> maybe not for POWER9 specifically.
> > 
> > I think I agree.
> > 
> >>> I believe this is what Cedric wants to be addressed. Given that the
> >>> property is called after the OPAL property ibm,chip-id, the kernel
> >>> expects that the property will have the same semantics as in OPAL.
> >>
> >> Even on powernv, I'm not clear why chip-id is tied into the NUMA
> >> configuration, rather than getting all the NUMA info from
> >> associativity properties.
> > 
> > AFAIK we don't use chip-id for anything related to NUMA, if we do I'd
> > consider that a bug.
> 
> Since PAPR only has NUMA nodes, is the use of chip-id in XIVE PAPR
> considered a bug? I would say so.

As noted in another thread, XIVE PAPR *doesn't* actually use chip_id.

And even on PowerNV, I don't think this has any real connection to
NUMA.  For PowerNV we care about whether we're working within a single
XIVE hardware instance, or across multiple.  There's one XIVE
per-chip, hence the relevance of chip-id.  That happens to also match
the NUMA topology on (currently existing) POWER9 chips, but I don't see
that it inherently has to.

> > We do use it for topology_physical_package_id(), but that's almost
> > completely unused.
> 
> In that case, I think it should be fine to return -1 like under PowerVM.
> 
> Thanks,
> 
> C. 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-03-31 15:18                   ` Cédric Le Goater
  2021-03-31 17:29                     ` Daniel Henrique Barboza
@ 2021-04-01  2:59                     ` David Gibson
  2021-04-01  9:21                       ` Cédric Le Goater
  1 sibling, 1 reply; 23+ messages in thread
From: David Gibson @ 2021-04-01  2:59 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc,
	Igor Mammedov

[-- Attachment #1: Type: text/plain, Size: 8751 bytes --]

On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote:
> On 3/31/21 2:57 AM, David Gibson wrote:
> > On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> >>
> >>
> >> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> >>> On 3/29/21 6:20 AM, David Gibson wrote:
> >>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> >>>>> On 3/25/21 3:10 AM, David Gibson wrote:
> >>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
> >>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> >>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
> >>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
> >>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
> >>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> >>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
> >>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
> >>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
> >>>>>>>>> equivalent.
> >>>>>>>>>
> >>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
> >>>>>>>>> powerpc kernel always considered that the number of sockets were equal
> >>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
> >>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
> >>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
> >>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
> >>>>>>>>> as in [1], happened to be correctly represented in the kernel by
> >>>>>>>>> accident during all these years.
> >>>>>>>>>
> >>>>>>>>> There's a case to be made for virtual topologies being detached from
> >>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
> >>>>>>>>> time, this freedom can't result in unrealistic hardware representations
> >>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
> >>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
> >>>>>>>>>
> >>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
> >>>>>>>>> pseries machine. qtest changes were made to adapt to this new
> >>>>>>>>> condition.
> >>>>>>>>
> >>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
> >>>>>>>> for users to match these up, for no real gain.  I'm also concerned
> >>>>>>>> that this will require follow on changes in libvirt to not make this a
> >>>>>>>> really cryptic and irritating point of failure.
> >>>>>>>
> >>>>>>> Haven't thought about required Libvirt changes, although I can say that there
> >>>>>>> will be some amount to be made and it will probably annoy existing users
> >>>>>>> (everyone that has a multiple socket per NUMA node topology).
> >>>>>>>
> >>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
> >>>>>>> here. The other alternative is to keep interacting with the kernel folks to
> >>>>>>> see if there is a way to keep our use case untouched.
> >>>>>>
> >>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
> >>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
> >>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
> >>>>>> NUMA nodes makes sense.
> >>>>>
> >>>>> I did a couple of replies in different email threads but maybe not
> >>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
> >>>>
> >>>> I think I saw some of those, but maybe not all.
> >>>>
> >>>>> Shouldn't we get rid of the socket concept, die also, under pseries
> >>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
> >>>>> threads AFAICT.
> >>>>
> >>>> Theoretically, yes.  I'm not sure it's really practical, though, since
> >>>> AFAICT, both qemu and the kernel have the notion of sockets (though
> >>>> not dies) built into generic code.
> >>>
> >>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> >>> and PPC Linux only has a NUMA node id, on pseries and powernv.
> >>>
> >>>> It does mean that one possible approach here - maybe the best one - is
> >>>> to simply declare that sockets are meaningless under PAPR, so we simply
> >>>> don't expect what the guest kernel reports to match what's given to
> >>>> qemu.
> >>>>
> >>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> >>>> but it is likely to surprise and confuse people.
> >>>>
> >>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
> >>>>> There are a couple of places where Linux checks for the underlying
> >>>>> hypervisor already.
> >>>>>
> >>>>>>> This also means that
> >>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
> >>>>>>> we inform cores per socket information to the kernel.
> >>>>>>
> >>>>>> Well.. unless we can find some other sensible way to convey that
> >>>>>> information.  I haven't given up hope for that yet.
> >>>>>
> >>>>> Well, we could start by fixing the value in QEMU. It is broken
> >>>>> today.
> >>>>
> >>>> Fixing what value, exactly?
> >>>
> >>> The value of the "ibm,chip-id" since we are keeping the property under
> >>> QEMU.
> >>
> >> David, I believe this has to do with the discussion we had last Friday.
> >>
> >> I mentioned that the ibm,chip-id property is being calculated in a way that
> >> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> >> e.g.:
> >>
> >> -smp 4,cores=4,maxcpus=8,threads=1 \
> >> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> >> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>
> >>
> >> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> >> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> >> 			ibm,pft-size = <0x00 0x19>;
> >> 			ibm,chip-id = <0x00>;
> > 
> >> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> >> different NUMA node than 0-1. This would mean that the same socket
> >> would belong to different NUMA nodes at the same time.
> > 
> > Right... and I'm still not seeing why that's a problem.  AFAICT that's
> > a possible, if unexpected, situation under real hardware - though
> > maybe not for POWER9 specifically.
> The ibm,chip-id property does not exist under PAPR. PAPR only has
> NUMA nodes, no sockets nor chips. 
> 
> And the property value is simply broken under QEMU. Try this  :
> 
>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;

Not seeing the problem here.  You've specified 8 (possible) cpus, 1
core-per-socket, so each of the 4 cpus present at boot occupies a
socket of its own, hence the 4 distinct chip-ids.
Again, I don't see where this assumption that the chip-id is somehow
related to the NUMA topology is coming from.
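
Spelling the arithmetic out, with the numbers from your command line:

    possible sockets = maxcpus / (cores * threads) = 8 / (1 * 1) = 8
    present sockets  = cpus    / (cores * threads) = 4 / (1 * 1) = 4

so the 4 present vcpus get chip-ids 0..3, which is exactly what your
dtc output shows.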

> >> I believe this is what Cedric wants to be addressed. Given that the
> >> property is called after the OPAL property ibm,chip-id, the kernel
> >> expects that the property will have the same semantics as in OPAL.
> > Even on powernv, I'm not clear why chip-id is tied into the NUMA
> > configuration, rather than getting all the NUMA info from
> > associativity properties.
> 
> It is the case. 

What exactly is the case?

> The associativity properties are built from chip-id in OPAL though.

Ok, so?  Why do we care how OPAL builds the associativity properties
once we *have* the associativity properties it's built?

> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.

Which, AFAIK, really *do* need to know which chip they're on, not the
NUMA topology.

> It's also badly used in the common part of the XIVE driver, which I am
> trying to fix in order to introduce an IPI per node on all platforms.

See comments on other thread.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
  2021-04-01  2:59                     ` David Gibson
@ 2021-04-01  9:21                       ` Cédric Le Goater
  0 siblings, 0 replies; 23+ messages in thread
From: Cédric Le Goater @ 2021-04-01  9:21 UTC (permalink / raw)
  To: David Gibson
  Cc: Laurent Vivier, Thomas Huth, Srikar Dronamraju, Michael Ellerman,
	Daniel Henrique Barboza, qemu-devel, groug, qemu-ppc,
	Igor Mammedov

On 4/1/21 4:59 AM, David Gibson wrote:
> On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote:
>> On 3/31/21 2:57 AM, David Gibson wrote:
>>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>>>>>> equivalent.
>>>>>>>>>>>
>>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>>> accident during all these years.
>>>>>>>>>>>
>>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>>>
>>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>>> condition.
>>>>>>>>>>
>>>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>>
> >>>>>>>>> Haven't thought about required Libvirt changes, although I can say that there
> >>>>>>>>> will be some amount to be made and it will probably annoy existing users
>>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>>
>>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>>
>>>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>>> NUMA nodes makes sense.
>>>>>>>
>>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>>
>>>>>> I think I saw some of those, but maybe not all.
>>>>>>
>>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>>> threads AFAICT.
>>>>>>
>>>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>>> not dies) built into generic code.
>>>>>
>>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>>
>>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>>>> don't expect what the guest kernel reports to match what's given to
>>>>>> qemu.
>>>>>>
>>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>>> but it is likely to surprise and confuse people.
>>>>>>
>>>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>>> hypervisor already.
>>>>>>>
>>>>>>>>> This also means that
>>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>>>> we inform cores per socket information to the kernel.
>>>>>>>>
>>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>>> information.  I haven't given up hope for that yet.
>>>>>>>
>>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>>> today.
>>>>>>
>>>>>> Fixing what value, exactly?
>>>>>
>>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>>> QEMU.
>>>>
>>>> David, I believe this has to do with the discussion we had last Friday.
>>>>
>>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>>>> e.g.:
>>>>
>>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>>
>>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>> 			ibm,pft-size = <0x00 0x19>;
>>>> 			ibm,chip-id = <0x00>;
>>>
>>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>>> different NUMA node than 0-1. This would mean that the same socket
>>>> would belong to different NUMA nodes at the same time.
>>>
>>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>>> a possible, if unexpected, situation under real hardware - though
>>> maybe not for POWER9 specifically.
>> The ibm,chip-id property does not exist under PAPR. PAPR only has
>> NUMA nodes, no sockets nor chips. 
>>
>> And the property value is simply broken under QEMU. Try this  :
>>
>>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> 		ibm,chip-id = <0x01>;
>> 		ibm,chip-id = <0x02>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x03>;
> 
> Not seeing the problem here.  You've specified 8 (possible) cpus, 1
> core-per-socket, so each of the 4 cpus present at boot occupies a
> socket of its own, hence the 4 distinct chip-ids.
> Again, I don't see where this assumption that the chip-id is somehow
> related to the NUMA topology is coming from.
> 
>>>> I believe this is what Cedric wants to be addressed. Given that the
>>>> property is called after the OPAL property ibm,chip-id, the kernel
>>>> expects that the property will have the same semantics as in OPAL.
>>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>>> configuration, rather than getting all the NUMA info from
>>> associativity properties.
>>
>> It is the case. 
> 
> What exactly is the case?

Sorry, that was badly phrased. I meant: powernv gets the NUMA info from
the "ibm,associativity" property.

C.

>> The associativity properties are built from chip-id in OPAL though.
> 
> Ok, so?  Why do we care how OPAL builds the associativity properties
> once we *have* the associativity properties its built?
> 
>> The chip-id property is only used in low level PowerNV drivers, VAS,
>> XSCOM, LPC, etc.
> 
> Which, AFAIK, really *do* need to know which chip they're on, not the
> NUMA topology.
> 
>> It's also badly used in the common part of the XIVE driver, which I am
>> trying to fix in order to introduce an IPI per node on all platforms.
> 
> See comments on other thread.
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-04-01  9:22 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23  1:03   ` David Gibson
2021-03-23 17:21     ` Daniel Henrique Barboza
2021-03-25  2:10       ` David Gibson
2021-03-25  8:56         ` Cédric Le Goater
2021-03-25 10:15           ` Daniel Henrique Barboza
2021-03-29  4:20           ` David Gibson
2021-03-29 15:32             ` Cédric Le Goater
2021-03-29 18:32               ` Daniel Henrique Barboza
2021-03-29 23:55                 ` Igor Mammedov
2021-03-31  0:57                 ` David Gibson
2021-03-31  4:58                   ` Michael Ellerman
2021-03-31 15:22                     ` Cédric Le Goater
2021-04-01  2:53                       ` David Gibson
2021-03-31 15:18                   ` Cédric Le Goater
2021-03-31 17:29                     ` Daniel Henrique Barboza
2021-03-31 17:40                       ` Cédric Le Goater
2021-04-01  2:59                     ` David Gibson
2021-04-01  9:21                       ` Cédric Le Goater
2021-03-29 23:51     ` Igor Mammedov
2021-03-30 21:33       ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).