* [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains
@ 2021-01-28 15:17 Daniel Henrique Barboza
  2021-01-28 15:17 ` [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c Daniel Henrique Barboza
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Daniel Henrique Barboza @ 2021-01-28 15:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: clg, Daniel Henrique Barboza, qemu-ppc, groug, david

Hi,

Patches 02 and 03 contain fixes for a problem Cedric found out when
booting TCG guests with multiple NUMA nodes. See patch 03 commit
message for more info.

First patch is an unrelated cleanup I did while investigating.

Daniel Henrique Barboza (3):
  spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
  spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper
  spapr_numa.c: fix ibm,max-associativity-domains calculation

 hw/ppc/spapr.c              | 21 ++------------------
 hw/ppc/spapr_numa.c         | 39 ++++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr.h      |  1 -
 include/hw/ppc/spapr_numa.h |  1 +
 4 files changed, 41 insertions(+), 21 deletions(-)

-- 
2.26.2




* [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
  2021-01-28 15:17 [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Daniel Henrique Barboza
@ 2021-01-28 15:17 ` Daniel Henrique Barboza
  2021-01-28 16:03   ` Greg Kurz
  2021-01-28 23:56   ` David Gibson
  2021-01-28 15:17 ` [PATCH 2/3] spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Daniel Henrique Barboza @ 2021-01-28 15:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: clg, Daniel Henrique Barboza, qemu-ppc, groug, david

This function is used only in spapr_numa.c.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c         | 9 ---------
 hw/ppc/spapr_numa.c    | 9 +++++++++
 include/hw/ppc/spapr.h | 1 -
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 6c47466fc2..2d60c6f594 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -295,15 +295,6 @@ static hwaddr spapr_node0_size(MachineState *machine)
     return machine->ram_size;
 }
 
-bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
-{
-    MachineState *machine = MACHINE(spapr);
-    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
-
-    return smc->pre_5_2_numa_associativity ||
-           machine->numa_state->num_nodes <= 1;
-}
-
 static void add_str(GString *s, const gchar *s1)
 {
     g_string_append_len(s, s1, strlen(s1) + 1);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index b50796bbe3..261810525b 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -19,6 +19,15 @@
 /* Moved from hw/ppc/spapr_pci_nvlink2.c */
 #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
 
+static bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
+{
+    MachineState *machine = MACHINE(spapr);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+
+    return smc->pre_5_2_numa_associativity ||
+           machine->numa_state->num_nodes <= 1;
+}
+
 static bool spapr_numa_is_symmetrical(MachineState *ms)
 {
     int src, dst;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index c27c7ce515..ccbeeca1de 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -851,7 +851,6 @@ int spapr_max_server_number(SpaprMachineState *spapr);
 void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
                       uint64_t pte0, uint64_t pte1);
 void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
-bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr);
 
 /* DRC callbacks. */
 void spapr_core_release(DeviceState *dev);
-- 
2.26.2




* [PATCH 2/3] spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper
  2021-01-28 15:17 [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Daniel Henrique Barboza
  2021-01-28 15:17 ` [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c Daniel Henrique Barboza
@ 2021-01-28 15:17 ` Daniel Henrique Barboza
  2021-01-28 15:50   ` Greg Kurz
  2021-01-28 15:17 ` [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation Daniel Henrique Barboza
  2021-01-28 16:03 ` [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Greg Kurz
  3 siblings, 1 reply; 12+ messages in thread
From: Daniel Henrique Barboza @ 2021-01-28 15:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: clg, Daniel Henrique Barboza, qemu-ppc, groug, david

We'll need to check the initial value given to spapr->gpu_numa_id when
building the rtas DT, so put it in a helper for easier access and to
avoid repetition.
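
For reference, the helper ends up with two users (the second one only appears
with patch 3; sketched here just for illustration, it is not extra code in
this patch):

    /* hw/ppc/spapr.c, spapr_machine_init() */
    spapr->gpu_numa_id = spapr_numa_initial_nvgpu_NUMA_id(machine);

    /* hw/ppc/spapr_numa.c, spapr_numa_get_number_nvgpus_nodes() in patch 3 */
    return spapr->gpu_numa_id - spapr_numa_initial_nvgpu_NUMA_id(ms);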

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c              | 12 ++----------
 hw/ppc/spapr_numa.c         | 14 ++++++++++++++
 include/hw/ppc/spapr_numa.h |  1 +
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 2d60c6f594..c2b74cbfdf 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2765,16 +2765,8 @@ static void spapr_machine_init(MachineState *machine)
 
     }
 
-    /*
-     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
-     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
-     * called from vPHB reset handler so we initialize the counter here.
-     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
-     * must be equally distant from any other node.
-     * The final value of spapr->gpu_numa_id is going to be written to
-     * max-associativity-domains in spapr_build_fdt().
-     */
-    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
+    /* Init gpu_numa_id */
+    spapr->gpu_numa_id = spapr_numa_initial_nvgpu_NUMA_id(machine);
 
     /* Init numa_assoc_array */
     spapr_numa_associativity_init(spapr, machine);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 261810525b..f71105c783 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -46,6 +46,20 @@ static bool spapr_numa_is_symmetrical(MachineState *ms)
     return true;
 }
 
+/*
+ * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
+ * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
+ * called from vPHB reset handler so we initialize the counter here.
+ * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
+ * must be equally distant from any other node.
+ * The final value of spapr->gpu_numa_id is going to be written to
+ * max-associativity-domains in spapr_build_fdt().
+ */
+unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine)
+{
+    return MAX(1, machine->numa_state->num_nodes);
+}
+
 /*
  * This function will translate the user distances into
  * what the kernel understand as possible values: 10
diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
index b3fd950634..6655bcf281 100644
--- a/include/hw/ppc/spapr_numa.h
+++ b/include/hw/ppc/spapr_numa.h
@@ -31,5 +31,6 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
                             int offset, PowerPCCPU *cpu);
 int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
                                          int offset);
+unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine);
 
 #endif /* HW_SPAPR_NUMA_H */
-- 
2.26.2




* [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation
  2021-01-28 15:17 [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Daniel Henrique Barboza
  2021-01-28 15:17 ` [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c Daniel Henrique Barboza
  2021-01-28 15:17 ` [PATCH 2/3] spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper Daniel Henrique Barboza
@ 2021-01-28 15:17 ` Daniel Henrique Barboza
  2021-01-28 16:21   ` [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation Greg Kurz
  2021-01-28 16:03 ` [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Greg Kurz
  3 siblings, 1 reply; 12+ messages in thread
From: Daniel Henrique Barboza @ 2021-01-28 15:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: clg, Daniel Henrique Barboza, qemu-ppc, groug, david

The current logic calculates 'maxdomain' as the sum of
numa_state->num_nodes and spapr->gpu_numa_id. spapr->gpu_numa_id is
used as an index to determine the next available NUMA id that a
given NVGPU can use.

The problem is that the initial value of gpu_numa_id, for any topology
that has more than one NUMA node, is equal to numa_state->num_nodes.
This means that our maxdomain will always be at least twice the
number of existing NUMA nodes. For example, a guest with 4 NUMA
nodes ends up with the following max-associativity-domains:

rtas/ibm,max-associativity-domains
                 00000004 00000008 00000008 00000008 00000008

This overestimation of maxdomain doesn't go unnoticed in the guest, being
detected by SLUB during boot:

 dmesg | grep SLUB
[    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=4, Nodes=8

SLUB detects 8 total nodes, with only 4 of them online.

This patch fixes ibm,max-associativity-domains by considering the number
of NVGPU NUMA nodes present in the guest, instead of
spapr->gpu_numa_id.
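
Just to illustrate the arithmetic, here is a minimal standalone sketch (not
QEMU code; the values mirror the 4-node guest above, with no NVGPU attached):

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
    unsigned int num_nodes = 4;                    /* NUMA nodes set by the user */
    unsigned int gpu_numa_id = MAX(1, num_nodes);  /* initial value, no NVGPU yet */

    /* current calculation: the initial gpu_numa_id is counted on top */
    unsigned int old_maxdomain = num_nodes + gpu_numa_id;        /* 4 + 4 = 8 */

    /* fixed calculation: only NUMA ids actually handed out to NVGPUs */
    unsigned int nvgpu_nodes = gpu_numa_id - MAX(1, num_nodes);  /* 4 - 4 = 0 */
    unsigned int new_maxdomain = num_nodes + nvgpu_nodes;        /* 4 + 0 = 4 */

    printf("old maxdomain = %u, new maxdomain = %u\n",
           old_maxdomain, new_maxdomain);
    return 0;
}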

Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_numa.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index f71105c783..f4d6abce87 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -60,6 +60,19 @@ unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine)
     return MAX(1, machine->numa_state->num_nodes);
 }
 
+/*
+ * Note: if called before spapr_phb_pci_collect_nvgpu() finishes collecting
+ * all NVGPUs, this function will not give the right number of NVGPUs NUMA
+ * nodes.
+ */
+static
+unsigned int spapr_numa_get_number_nvgpus_nodes(SpaprMachineState *spapr)
+{
+    MachineState *ms = MACHINE(spapr);
+
+    return spapr->gpu_numa_id - spapr_numa_initial_nvgpu_NUMA_id(ms);
+}
+
 /*
  * This function will translate the user distances into
  * what the kernel understand as possible values: 10
@@ -311,6 +324,7 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
 {
     MachineState *ms = MACHINE(spapr);
     SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+    uint32_t number_nvgpus_nodes = spapr_numa_get_number_nvgpus_nodes(spapr);
     uint32_t refpoints[] = {
         cpu_to_be32(0x4),
         cpu_to_be32(0x3),
@@ -318,7 +332,7 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
         cpu_to_be32(0x1),
     };
     uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
-    uint32_t maxdomain = ms->numa_state->num_nodes + spapr->gpu_numa_id;
+    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
         cpu_to_be32(maxdomain),
-- 
2.26.2




* Re: [PATCH 2/3] spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper
  2021-01-28 15:17 ` [PATCH 2/3] spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper Daniel Henrique Barboza
@ 2021-01-28 15:50   ` Greg Kurz
  0 siblings, 0 replies; 12+ messages in thread
From: Greg Kurz @ 2021-01-28 15:50 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: clg, qemu-ppc, qemu-devel, david

On Thu, 28 Jan 2021 12:17:30 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> We'll need to check the initial value given to spapr->gpu_numa_id when
> building the rtas DT, so put it in a helper for easier access and to
> avoid repetition.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c              | 12 ++----------
>  hw/ppc/spapr_numa.c         | 14 ++++++++++++++
>  include/hw/ppc/spapr_numa.h |  1 +
>  3 files changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 2d60c6f594..c2b74cbfdf 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2765,16 +2765,8 @@ static void spapr_machine_init(MachineState *machine)
>  
>      }
>  
> -    /*
> -     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> -     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> -     * called from vPHB reset handler so we initialize the counter here.
> -     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> -     * must be equally distant from any other node.
> -     * The final value of spapr->gpu_numa_id is going to be written to
> -     * max-associativity-domains in spapr_build_fdt().
> -     */
> -    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
> +    /* Init gpu_numa_id */

Code is trivial enough you don't really need to paraphrase with a
comment.

> +    spapr->gpu_numa_id = spapr_numa_initial_nvgpu_NUMA_id(machine);
>  

This _NUMA_ looks a bit aggressive and not especially informative
to me. Maybe just make it spapr_numa_initial_nvgpu_id()?

With these fixed,

Reviewed-by: Greg Kurz <groug@kaod.org>

>      /* Init numa_assoc_array */
>      spapr_numa_associativity_init(spapr, machine);
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 261810525b..f71105c783 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -46,6 +46,20 @@ static bool spapr_numa_is_symmetrical(MachineState *ms)
>      return true;
>  }
>  
> +/*
> + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> + * called from vPHB reset handler so we initialize the counter here.
> + * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> + * must be equally distant from any other node.
> + * The final value of spapr->gpu_numa_id is going to be written to
> + * max-associativity-domains in spapr_build_fdt().
> + */
> +unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine)
> +{
> +    return MAX(1, machine->numa_state->num_nodes);
> +}
> +
>  /*
>   * This function will translate the user distances into
>   * what the kernel understand as possible values: 10
> diff --git a/include/hw/ppc/spapr_numa.h b/include/hw/ppc/spapr_numa.h
> index b3fd950634..6655bcf281 100644
> --- a/include/hw/ppc/spapr_numa.h
> +++ b/include/hw/ppc/spapr_numa.h
> @@ -31,5 +31,6 @@ int spapr_numa_fixup_cpu_dt(SpaprMachineState *spapr, void *fdt,
>                              int offset, PowerPCCPU *cpu);
>  int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
>                                           int offset);
> +unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine);
>  
>  #endif /* HW_SPAPR_NUMA_H */




* Re: [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains
  2021-01-28 15:17 [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-01-28 15:17 ` [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation Daniel Henrique Barboza
@ 2021-01-28 16:03 ` Greg Kurz
  2021-01-28 17:05   ` Daniel Henrique Barboza
  3 siblings, 1 reply; 12+ messages in thread
From: Greg Kurz @ 2021-01-28 16:03 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: clg, qemu-ppc, qemu-devel, david

On Thu, 28 Jan 2021 12:17:28 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Hi,
> 
> Patches 02 and 03 contain fixes for a problem Cedric found out when
> booting TCG guests with multiple NUMA nodes. See patch 03 commit
> message for more info.
> 

This paragraph mentions "TCG guests", but I see nothing that is
specific to TCG in these patches... so I expect the problem to
also exist with KVM, right?

> First patch is an unrelated cleanup I did while investigating.
> 
> Daniel Henrique Barboza (3):
>   spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
>   spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper
>   spapr_numa.c: fix ibm,max-associativity-domains calculation
> 
>  hw/ppc/spapr.c              | 21 ++------------------
>  hw/ppc/spapr_numa.c         | 39 ++++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h      |  1 -
>  include/hw/ppc/spapr_numa.h |  1 +
>  4 files changed, 41 insertions(+), 21 deletions(-)
> 




* Re: [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
  2021-01-28 15:17 ` [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c Daniel Henrique Barboza
@ 2021-01-28 16:03   ` Greg Kurz
  2021-01-28 23:56   ` David Gibson
  1 sibling, 0 replies; 12+ messages in thread
From: Greg Kurz @ 2021-01-28 16:03 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: clg, qemu-ppc, qemu-devel, david

On Thu, 28 Jan 2021 12:17:29 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> This function is used only in spapr_numa.c.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>

>  hw/ppc/spapr.c         | 9 ---------
>  hw/ppc/spapr_numa.c    | 9 +++++++++
>  include/hw/ppc/spapr.h | 1 -
>  3 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 6c47466fc2..2d60c6f594 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -295,15 +295,6 @@ static hwaddr spapr_node0_size(MachineState *machine)
>      return machine->ram_size;
>  }
>  
> -bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> -{
> -    MachineState *machine = MACHINE(spapr);
> -    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> -
> -    return smc->pre_5_2_numa_associativity ||
> -           machine->numa_state->num_nodes <= 1;
> -}
> -
>  static void add_str(GString *s, const gchar *s1)
>  {
>      g_string_append_len(s, s1, strlen(s1) + 1);
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index b50796bbe3..261810525b 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -19,6 +19,15 @@
>  /* Moved from hw/ppc/spapr_pci_nvlink2.c */
>  #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
>  
> +static bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> +{
> +    MachineState *machine = MACHINE(spapr);
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +
> +    return smc->pre_5_2_numa_associativity ||
> +           machine->numa_state->num_nodes <= 1;
> +}
> +
>  static bool spapr_numa_is_symmetrical(MachineState *ms)
>  {
>      int src, dst;
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index c27c7ce515..ccbeeca1de 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -851,7 +851,6 @@ int spapr_max_server_number(SpaprMachineState *spapr);
>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>                        uint64_t pte0, uint64_t pte1);
>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
> -bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr);
>  
>  /* DRC callbacks. */
>  void spapr_core_release(DeviceState *dev);




* Re: [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation
  2021-01-28 15:17 ` [PATCH 3/3] spapr_numa.c: fix ibm,max-associativity-domains calculation Daniel Henrique Barboza
@ 2021-01-28 16:21   ` Greg Kurz
  0 siblings, 0 replies; 12+ messages in thread
From: Greg Kurz @ 2021-01-28 16:21 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: clg, qemu-ppc, qemu-devel, david

On Thu, 28 Jan 2021 12:17:31 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> The current logic calculates 'maxdomain' as the sum of
> numa_state->num_nodes and spapr->gpu_numa_id. spapr->gpu_numa_id is
> used as an index to determine the next available NUMA id that a
> given NVGPU can use.
> 
> The problem is that the initial value of gpu_numa_id, for any topology
> that has more than one NUMA node, is equal to numa_state->num_nodes.
> This means that our maxdomain will always be at least twice the
> number of existing NUMA nodes. For example, a guest with 4 NUMA
> nodes ends up with the following max-associativity-domains:
> 
> rtas/ibm,max-associativity-domains
>                  00000004 00000008 00000008 00000008 00000008
> 
> This overestimation of maxdomain doesn't go unnoticed in the guest, being
> detected by SLUB during boot:
> 
>  dmesg | grep SLUB
> [    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=4, Nodes=8
> 
> SLUB detects 8 total nodes, with only 4 of them online.
> 
> This patch fixes ibm,max-associativity-domains by considering the number
> of NVGPU NUMA nodes present in the guest, instead of
> spapr->gpu_numa_id.
> 
> Reported-by: Cédric Le Goater <clg@kaod.org>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr_numa.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index f71105c783..f4d6abce87 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -60,6 +60,19 @@ unsigned int spapr_numa_initial_nvgpu_NUMA_id(MachineState *machine)
>      return MAX(1, machine->numa_state->num_nodes);
>  }
>  
> +/*
> + * Note: if called before spapr_phb_pci_collect_nvgpu() finishes collecting
> + * all NVGPUs, this function will not give the right number of NVGPUs NUMA
> + * nodes.
> + */

This helper has exactly one user: spapr_numa_write_rtas_dt(). Maybe just
open-code it there, with a comment that spapr->gpu_numa_id is assumed to
be correct at the time we populate the device tree?
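
Something along these lines, I suppose (a rough, untested sketch of how the
open-coded version could look inside spapr_numa_write_rtas_dt(), not an
actual patch):

    /*
     * spapr->gpu_numa_id is the next NUMA id to hand out to an NVGPU and is
     * assumed to be final by the time we populate the device tree, so the
     * number of NVGPU nodes is its distance from the initial value.
     */
    uint32_t maxdomain = ms->numa_state->num_nodes +
                         (spapr->gpu_numa_id - spapr_numa_initial_nvgpu_NUMA_id(ms));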

> +static
> +unsigned int spapr_numa_get_number_nvgpus_nodes(SpaprMachineState *spapr)
> +{
> +    MachineState *ms = MACHINE(spapr);
> +
> +    return spapr->gpu_numa_id - spapr_numa_initial_nvgpu_NUMA_id(ms);
> +}
> +
>  /*
>   * This function will translate the user distances into
>   * what the kernel understand as possible values: 10
> @@ -311,6 +324,7 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>  {
>      MachineState *ms = MACHINE(spapr);
>      SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +    uint32_t number_nvgpus_nodes = spapr_numa_get_number_nvgpus_nodes(spapr);
>      uint32_t refpoints[] = {
>          cpu_to_be32(0x4),
>          cpu_to_be32(0x3),
> @@ -318,7 +332,7 @@ void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
>          cpu_to_be32(0x1),
>      };
>      uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
> -    uint32_t maxdomain = ms->numa_state->num_nodes + spapr->gpu_numa_id;
> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
>      uint32_t maxdomains[] = {
>          cpu_to_be32(4),
>          cpu_to_be32(maxdomain),




* Re: [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains
  2021-01-28 16:03 ` [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains Greg Kurz
@ 2021-01-28 17:05   ` Daniel Henrique Barboza
  2021-01-28 17:13     ` Cédric Le Goater
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Henrique Barboza @ 2021-01-28 17:05 UTC (permalink / raw)
  To: Greg Kurz; +Cc: clg, qemu-ppc, qemu-devel, david



On 1/28/21 1:03 PM, Greg Kurz wrote:
> On Thu, 28 Jan 2021 12:17:28 -0300
> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> 
>> Hi,
>>
>> Patches 02 and 03 contain fixes for a problem Cedric found out when
>> booting TCG guests with multiple NUMA nodes. See patch 03 commit
>> message for more info.
>>
> 
> This paragraph mentions "TCG guests", but I see nothing that is
> specific to TCG in these patches... so I expect the problem to
> also exist with KVM, right?

Yeah. I mentioned TCG because this is the use case Cedric reproduced
the bug with, but I myself had no problems reproducing it with
accel=kvm as well.


DHB

> 
>> First patch is an unrelated cleanup I did while investigating.
>>
>> Daniel Henrique Barboza (3):
>>    spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
>>    spapr_numa.c: create spapr_numa_initial_nvgpu_NUMA_id() helper
>>    spapr_numa.c: fix ibm,max-associativity-domains calculation
>>
>>   hw/ppc/spapr.c              | 21 ++------------------
>>   hw/ppc/spapr_numa.c         | 39 ++++++++++++++++++++++++++++++++++++-
>>   include/hw/ppc/spapr.h      |  1 -
>>   include/hw/ppc/spapr_numa.h |  1 +
>>   4 files changed, 41 insertions(+), 21 deletions(-)
>>
> 



* Re: [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains
  2021-01-28 17:05   ` Daniel Henrique Barboza
@ 2021-01-28 17:13     ` Cédric Le Goater
  2021-01-28 17:20       ` Cédric Le Goater
  0 siblings, 1 reply; 12+ messages in thread
From: Cédric Le Goater @ 2021-01-28 17:13 UTC (permalink / raw)
  To: Daniel Henrique Barboza, Greg Kurz; +Cc: qemu-ppc, qemu-devel, david

On 1/28/21 6:05 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 1/28/21 1:03 PM, Greg Kurz wrote:
>> On Thu, 28 Jan 2021 12:17:28 -0300
>> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Patches 02 and 03 contain fixes for a problem Cedric found out when
>>> booting TCG guests with multiple NUMA nodes. See patch 03 commit
>>> message for more info.
>>>
>>
>> This paragraph mentions "TCG guests", but I see nothing that is
>> specific to TCG in these patches... so I expect the problem to
>> also exist with KVM, right?
> 
> Yeah. I mentioned TCG because this is the use case Cedric reproduced
> the bug with, but I myself had no problems reproducing it with
> accel=kvm as well.

I was also seeing the issue on KVM and I am still seeing it with 
this patchset. It's gone on TCG however. 

C.



* Re: [PATCH 0/3] spapr, spapr_numa: fix max-associativity-domains
  2021-01-28 17:13     ` Cédric Le Goater
@ 2021-01-28 17:20       ` Cédric Le Goater
  0 siblings, 0 replies; 12+ messages in thread
From: Cédric Le Goater @ 2021-01-28 17:20 UTC (permalink / raw)
  To: Daniel Henrique Barboza, Greg Kurz; +Cc: qemu-ppc, qemu-devel, david

On 1/28/21 6:13 PM, Cédric Le Goater wrote:
> On 1/28/21 6:05 PM, Daniel Henrique Barboza wrote:
>>
>>
>> On 1/28/21 1:03 PM, Greg Kurz wrote:
>>> On Thu, 28 Jan 2021 12:17:28 -0300
>>> Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Patches 02 and 03 contain fixes for a problem Cedric found out when
>>>> booting TCG guests with multiple NUMA nodes. See patch 03 commit
>>>> message for more info.
>>>>
>>>
>>> This paragraph mentions "TCG guests", but I see nothing that is
>>> specific to TCG in these patches... so I expect the problem to
>>> also exist with KVM, right?
>>
>> Yeah. I mentioned TCG because this is the use case Cedric reproduced
>> the bug with, but I myself had no problems reproducing it with
>> accel=kvm as well.
> 
> I was also seeing the issue on KVM and I am still seeing it with 
> this patchset. It's gone on TCG however. 

Oops, sorry. All good on KVM also!

Tested-by: Cédric Le Goater <clg@kaod.org>

We can now safely use for_each_node() in the kernel.

Thanks Daniel,

C.



* Re: [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c
  2021-01-28 15:17 ` [PATCH 1/3] spapr: move spapr_machine_using_legacy_numa() to spapr_numa.c Daniel Henrique Barboza
  2021-01-28 16:03   ` Greg Kurz
@ 2021-01-28 23:56   ` David Gibson
  1 sibling, 0 replies; 12+ messages in thread
From: David Gibson @ 2021-01-28 23:56 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: clg, qemu-ppc, qemu-devel, groug


On Thu, Jan 28, 2021 at 12:17:29PM -0300, Daniel Henrique Barboza wrote:
> This function is used only in spapr_numa.c.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Applied to ppc-for-6.0, thanks.

> ---
>  hw/ppc/spapr.c         | 9 ---------
>  hw/ppc/spapr_numa.c    | 9 +++++++++
>  include/hw/ppc/spapr.h | 1 -
>  3 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 6c47466fc2..2d60c6f594 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -295,15 +295,6 @@ static hwaddr spapr_node0_size(MachineState *machine)
>      return machine->ram_size;
>  }
>  
> -bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> -{
> -    MachineState *machine = MACHINE(spapr);
> -    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> -
> -    return smc->pre_5_2_numa_associativity ||
> -           machine->numa_state->num_nodes <= 1;
> -}
> -
>  static void add_str(GString *s, const gchar *s1)
>  {
>      g_string_append_len(s, s1, strlen(s1) + 1);
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index b50796bbe3..261810525b 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -19,6 +19,15 @@
>  /* Moved from hw/ppc/spapr_pci_nvlink2.c */
>  #define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
>  
> +static bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr)
> +{
> +    MachineState *machine = MACHINE(spapr);
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +
> +    return smc->pre_5_2_numa_associativity ||
> +           machine->numa_state->num_nodes <= 1;
> +}
> +
>  static bool spapr_numa_is_symmetrical(MachineState *ms)
>  {
>      int src, dst;
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index c27c7ce515..ccbeeca1de 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -851,7 +851,6 @@ int spapr_max_server_number(SpaprMachineState *spapr);
>  void spapr_store_hpte(PowerPCCPU *cpu, hwaddr ptex,
>                        uint64_t pte0, uint64_t pte1);
>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered);
> -bool spapr_machine_using_legacy_numa(SpaprMachineState *spapr);
>  
>  /* DRC callbacks. */
>  void spapr_core_release(DeviceState *dev);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
