* [Qemu-devel] [PATCH v2 0/2] numa, spapr: equally distribute memory on nodes
@ 2017-04-27 10:12 Laurent Vivier
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass Laurent Vivier
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes Laurent Vivier
  0 siblings, 2 replies; 9+ messages in thread
From: Laurent Vivier @ 2017-04-27 10:12 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: David Gibson, Thomas Huth, qemu-devel, Laurent Vivier

When there is not enough memory to give every node the minimum
allowed amount, all of the memory ends up on the last node.

This series introduces a new MachineClass hook to
distribute the memory equally across the nodes
without breaking compatibility with previous
machine types.

The new hook uses an error diffusion algorithm to
distribute the memory across the nodes.

As the memory alignment for pseries is 256MB while for the other
machines it is only 8MB, we introduce this change only for
pseries-2.10.
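
As an illustration, here is a minimal standalone sketch (not part
of the series) that simply replays the loop added to hw/ppc/spapr.c
by patch 2/2:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* 256MB alignment mask, mirroring SPAPR_MEMORY_BLOCK_SIZE */
    #define BLOCK_MASK ((1ULL << 28) - 1)

    /* Error diffusion: the share a node loses to alignment is carried
     * over to the next node instead of being discarded, so it is
     * eventually paid out as a full 256MB block. */
    static void distribute(uint64_t *nodes, int nb_nodes, uint64_t size)
    {
        uint64_t granularity = size / nb_nodes;
        uint64_t propagate = 0, usedmem = 0;
        int i;

        for (i = 0; i < nb_nodes - 1; i++) {
            nodes[i] = (granularity + propagate) & ~BLOCK_MASK;
            propagate = granularity + propagate - nodes[i];
            usedmem += nodes[i];
        }
        nodes[i] = size - usedmem; /* last node absorbs the remainder */
    }

    int main(void)
    {
        uint64_t nodes[6];
        int i;

        distribute(nodes, 6, 1ULL << 30); /* 1G across 6 nodes */
        for (i = 0; i < 6; i++) {
            printf("node %d size: %" PRIu64 " MB\n", i, nodes[i] >> 20);
        }
        return 0;
    }

This prints 0, 256, 0, 256, 256 and 256 MB, matching the "After"
output shown in patch 2/2.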

Laurent Vivier (2):
  numa: introduce numa_auto_assign_ram() function in MachineClass
  numa, spapr: equally distribute memory on nodes

 hw/ppc/spapr.c         | 21 ++++++++++++++++++++-
 include/hw/boards.h    |  2 ++
 include/hw/ppc/spapr.h |  3 ++-
 numa.c                 | 44 +++++++++++++++++++++++++++++++++-----------
 4 files changed, 57 insertions(+), 13 deletions(-)

-- 
2.9.3

* [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass
  2017-04-27 10:12 [Qemu-devel] [PATCH v2 0/2] numa, spapr: equally distribute memory on nodes Laurent Vivier
@ 2017-04-27 10:12 ` Laurent Vivier
  2017-04-27 14:09   ` Eduardo Habkost
  2017-04-27 20:28   ` Eduardo Habkost
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes Laurent Vivier
  1 sibling, 2 replies; 9+ messages in thread
From: Laurent Vivier @ 2017-04-27 10:12 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: David Gibson, Thomas Huth, qemu-devel, Laurent Vivier

We need to change the way we distribute the memory across
the nodes. To keep compatibility between machine type versions,
introduce a machine-type-dependent function.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 include/hw/boards.h |  2 ++
 numa.c              | 44 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 31d9c72..e571b22 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -136,6 +136,8 @@ struct MachineClass {
     int minimum_page_bits;
     bool has_hotpluggable_cpus;
     int numa_mem_align_shift;
+    void (*numa_auto_assign_ram)(uint64_t *nodes,
+                                 int nb_nodes, ram_addr_t size);
 
     HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
                                            DeviceState *dev);
diff --git a/numa.c b/numa.c
index 6fc2393..2770c18 100644
--- a/numa.c
+++ b/numa.c
@@ -294,6 +294,38 @@ static void validate_numa_cpus(void)
     g_free(seen_cpus);
 }
 
+static void numa_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
+                                 int nb_nodes, ram_addr_t size)
+{
+    int i;
+    uint64_t usedmem = 0;
+
+    if (mc->numa_auto_assign_ram) {
+        uint64_t *mem = g_new(uint64_t, nb_nodes);
+
+        mc->numa_auto_assign_ram(mem, nb_nodes, size);
+
+        for (i = 0; i < nb_nodes; i++) {
+            nodes[i].node_mem = mem[i];
+        }
+
+        g_free(mem);
+
+        return;
+    }
+
+    /* Align each node according to the alignment
+     * requirements of the machine class
+     */
+
+    for (i = 0; i < nb_nodes - 1; i++) {
+        nodes[i].node_mem = (size / nb_nodes) &
+                            ~((1 << mc->numa_mem_align_shift) - 1);
+        usedmem += nodes[i].node_mem;
+    }
+    nodes[i].node_mem = size - usedmem;
+}
+
 void parse_numa_opts(MachineClass *mc)
 {
     int i;
@@ -336,17 +368,7 @@ void parse_numa_opts(MachineClass *mc)
             }
         }
         if (i == nb_numa_nodes) {
-            uint64_t usedmem = 0;
-
-            /* Align each node according to the alignment
-             * requirements of the machine class
-             */
-            for (i = 0; i < nb_numa_nodes - 1; i++) {
-                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
-                                        ~((1 << mc->numa_mem_align_shift) - 1);
-                usedmem += numa_info[i].node_mem;
-            }
-            numa_info[i].node_mem = ram_size - usedmem;
+            numa_auto_assign_ram(mc, numa_info, nb_numa_nodes, ram_size);
         }
 
         numa_total = 0;
-- 
2.9.3

* [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes
  2017-04-27 10:12 [Qemu-devel] [PATCH v2 0/2] numa, spapr: equally distribute memory on nodes Laurent Vivier
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass Laurent Vivier
@ 2017-04-27 10:12 ` Laurent Vivier
  2017-04-27 20:31   ` Eduardo Habkost
  1 sibling, 1 reply; 9+ messages in thread
From: Laurent Vivier @ 2017-04-27 10:12 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: David Gibson, Thomas Huth, qemu-devel, Laurent Vivier

When there is not enough memory to give every node the minimum
allowed amount, all of the memory ends up on the last node.

This is because we assign (ram_size / nb_numa_nodes) &
~((1 << mc->numa_mem_align_shift) - 1) to each node, so in this
case the per-node value is 0. This is particularly visible with
pseries, where the memory must be aligned to 256MB.

To avoid this problem, this patch uses an error diffusion algorithm [1]
to distribute the memory equally across the nodes.

Example:

qemu-system-ppc64 -S -nographic  -nodefaults -monitor stdio -m 1G -smp 8 \
                  -numa node -numa node -numa node \
                  -numa node -numa node -numa node

Before:

(qemu) info numa
6 nodes
node 0 cpus: 0 6
node 0 size: 0 MB
node 1 cpus: 1 7
node 1 size: 0 MB
node 2 cpus: 2
node 2 size: 0 MB
node 3 cpus: 3
node 3 size: 0 MB
node 4 cpus: 4
node 4 size: 0 MB
node 5 cpus: 5
node 5 size: 1024 MB

After:
(qemu) info numa
6 nodes
node 0 cpus: 0 6
node 0 size: 0 MB
node 1 cpus: 1 7
node 1 size: 256 MB
node 2 cpus: 2
node 2 size: 0 MB
node 3 cpus: 3
node 3 size: 256 MB
node 4 cpus: 4
node 4 size: 256 MB
node 5 cpus: 5
node 5 size: 256 MB

[1] https://en.wikipedia.org/wiki/Error_diffusion
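
Walking through the numbers: granularity = 1G / 6 = 178956970 bytes
(~170.7 MB), and each node keeps only the full 256MB blocks of its
accumulated share, carrying the rest forward:

  node 0: 170.7 MB accumulated -> 0 MB,   carry 170.7 MB
  node 1: 341.3 MB             -> 256 MB, carry  85.3 MB
  node 2: 256 MB minus 2 bytes -> 0 MB,   carry ~256.0 MB
  node 3: 426.7 MB             -> 256 MB, carry 170.7 MB
  node 4: 341.3 MB             -> 256 MB, carry  85.3 MB
  node 5: size - usedmem       -> the remaining 256 MB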

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 hw/ppc/spapr.c         | 21 ++++++++++++++++++++-
 include/hw/ppc/spapr.h |  3 ++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 80d12d0..be498e2 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3106,6 +3106,23 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
     ics_pic_print_info(spapr->ics, mon);
 }
 
+static void spapr_numa_auto_assign_ram(uint64_t *nodes, int nb_nodes,
+                                       ram_addr_t size)
+{
+    int i;
+    uint64_t usedmem = 0, node_mem;
+    uint64_t granularity = size / nb_nodes;
+    uint64_t propagate = 0;
+
+    for (i = 0; i < nb_nodes - 1; i++) {
+        node_mem = (granularity + propagate) & ~(SPAPR_MEMORY_BLOCK_SIZE - 1);
+        propagate = granularity + propagate - node_mem;
+        nodes[i] = node_mem;
+        usedmem += node_mem;
+    }
+    nodes[i] = size - usedmem;
+}
+
 static void spapr_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -3162,7 +3179,8 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
      * SPAPR_MEMORY_BLOCK_SIZE (256M) since that's the granularity
      * in which LMBs are represented and hot-added
      */
-    mc->numa_mem_align_shift = 28;
+    mc->numa_mem_align_shift = SPAPR_MEMORY_BLOCK_SIZE_SHIFT;
+    mc->numa_auto_assign_ram = spapr_numa_auto_assign_ram;
 }
 
 static const TypeInfo spapr_machine_info = {
@@ -3242,6 +3260,7 @@ static void spapr_machine_2_9_class_options(MachineClass *mc)
 {
     spapr_machine_2_10_class_options(mc);
     SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_9);
+    mc->numa_auto_assign_ram = NULL;
 }
 
 DEFINE_SPAPR_MACHINE(2_9, "2.9", false);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 5802f88..8f4a588 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -653,7 +653,8 @@ int spapr_rtc_import_offset(sPAPRRTCState *rtc, int64_t legacy_offset);
 
 int spapr_rng_populate_dt(void *fdt);
 
-#define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */
+#define SPAPR_MEMORY_BLOCK_SIZE_SHIFT 28 /* 256MB */
+#define SPAPR_MEMORY_BLOCK_SIZE (1 << SPAPR_MEMORY_BLOCK_SIZE_SHIFT)
 
 /*
  * This defines the maximum number of DIMM slots we can have for sPAPR
-- 
2.9.3

* Re: [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass Laurent Vivier
@ 2017-04-27 14:09   ` Eduardo Habkost
  2017-04-27 15:09     ` Laurent Vivier
  2017-04-27 20:28   ` Eduardo Habkost
  1 sibling, 1 reply; 9+ messages in thread
From: Eduardo Habkost @ 2017-04-27 14:09 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, Thomas Huth, qemu-devel

On Thu, Apr 27, 2017 at 12:12:58PM +0200, Laurent Vivier wrote:
> We need to change the way we distribute the memory across
> the nodes. To keep compatibility between machine type versions,
> introduce a machine-type-dependent function.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[...]
> +static void numa_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
> +                                 int nb_nodes, ram_addr_t size)
> +{
> +    int i;
> +    uint64_t usedmem = 0;
> +
> +    if (mc->numa_auto_assign_ram) {
> +        uint64_t *mem = g_new(uint64_t, nb_nodes);
> +
> +        mc->numa_auto_assign_ram(mem, nb_nodes, size);
> +
> +        for (i = 0; i < nb_nodes; i++) {
> +            nodes[i].node_mem = mem[i];
> +        }
> +
> +        g_free(mem);
> +
> +        return;
> +    }
> +
> +    /* Align each node according to the alignment
> +     * requirements of the machine class
> +     */
> +
> +    for (i = 0; i < nb_nodes - 1; i++) {
> +        nodes[i].node_mem = (size / nb_nodes) &
> +                            ~((1 << mc->numa_mem_align_shift) - 1);
> +        usedmem += nodes[i].node_mem;
> +    }
> +    nodes[i].node_mem = size - usedmem;
> +}

I would prefer to make your new algorithm the default, and move
this code to a legacy_auto_assign_ram() function set by the 2.9
machine-types.

> +
>  void parse_numa_opts(MachineClass *mc)
>  {
>      int i;
> @@ -336,17 +368,7 @@ void parse_numa_opts(MachineClass *mc)
>              }
>          }
>          if (i == nb_numa_nodes) {
> -            uint64_t usedmem = 0;
> -
> -            /* Align each node according to the alignment
> -             * requirements of the machine class
> -             */
> -            for (i = 0; i < nb_numa_nodes - 1; i++) {
> -                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> -                                        ~((1 << mc->numa_mem_align_shift) - 1);
> -                usedmem += numa_info[i].node_mem;
> -            }
> -            numa_info[i].node_mem = ram_size - usedmem;
> +            numa_auto_assign_ram(mc, numa_info, nb_numa_nodes, ram_size);
>          }
>  
>          numa_total = 0;
> -- 
> 2.9.3
> 

-- 
Eduardo

* Re: [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass
  2017-04-27 14:09   ` Eduardo Habkost
@ 2017-04-27 15:09     ` Laurent Vivier
  2017-04-27 16:09       ` Eduardo Habkost
  0 siblings, 1 reply; 9+ messages in thread
From: Laurent Vivier @ 2017-04-27 15:09 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: David Gibson, Thomas Huth, qemu-devel

On 27/04/2017 16:09, Eduardo Habkost wrote:
> On Thu, Apr 27, 2017 at 12:12:58PM +0200, Laurent Vivier wrote:
>> We need to change the way we distribute the memory across
>> the nodes. To keep compatibility between machine type versions,
>> introduce a machine-type-dependent function.
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> [...]
>> +static void numa_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
>> +                                 int nb_nodes, ram_addr_t size)
>> +{
>> +    int i;
>> +    uint64_t usedmem = 0;
>> +
>> +    if (mc->numa_auto_assign_ram) {
>> +        uint64_t *mem = g_new(uint64_t, nb_nodes);
>> +
>> +        mc->numa_auto_assign_ram(mem, nb_nodes, size);
>> +
>> +        for (i = 0; i < nb_nodes; i++) {
>> +            nodes[i].node_mem = mem[i];
>> +        }
>> +
>> +        g_free(mem);
>> +
>> +        return;
>> +    }
>> +
>> +    /* Align each node according to the alignment
>> +     * requirements of the machine class
>> +     */
>> +
>> +    for (i = 0; i < nb_nodes - 1; i++) {
>> +        nodes[i].node_mem = (size / nb_nodes) &
>> +                            ~((1 << mc->numa_mem_align_shift) - 1);
>> +        usedmem += nodes[i].node_mem;
>> +    }
>> +    nodes[i].node_mem = size - usedmem;
>> +}
> 
> I would prefer to make your new algorithm the default, and move
> this code to a legacy_auto_assign_ram() function set by the 2.9
> machine-types.

I think it's easier to do it as I've done, because otherwise we need:

- add numa_mem_align_shift to the parameter list of the
  numa_auto_assign_ram() function.

- set the function by default to numa_auto_assign_ram in
  hw/core/machine.c:machine_class_init()

- set the pointer to NULL in 2.10 pseries machine type,

- export the function to re-set the legacy function in the 2.9 pseries
  machine type definition.

So, we need to add one argument to the function, export the function so
it can be used from machine.c and at least spapr.c, and set the function
in machine_class_init() and spapr_machine_2_9_class_options() (as we
clear it in the 2.10 function).

I can do that, but is this what you want?

Thanks,
Laurent

* Re: [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass
  2017-04-27 15:09     ` Laurent Vivier
@ 2017-04-27 16:09       ` Eduardo Habkost
  0 siblings, 0 replies; 9+ messages in thread
From: Eduardo Habkost @ 2017-04-27 16:09 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, Thomas Huth, qemu-devel

On Thu, Apr 27, 2017 at 05:09:26PM +0200, Laurent Vivier wrote:
> On 27/04/2017 16:09, Eduardo Habkost wrote:
> > On Thu, Apr 27, 2017 at 12:12:58PM +0200, Laurent Vivier wrote:
> >> We need to change the way we distribute the memory across
> >> the nodes. To keep compatibility between machine type versions,
> >> introduce a machine-type-dependent function.
> >>
> >> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > [...]
> >> +static void numa_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
> >> +                                 int nb_nodes, ram_addr_t size)
> >> +{
> >> +    int i;
> >> +    uint64_t usedmem = 0;
> >> +
> >> +    if (mc->numa_auto_assign_ram) {
> >> +        uint64_t *mem = g_new(uint64_t, nb_nodes);
> >> +
> >> +        mc->numa_auto_assign_ram(mem, nb_nodes, size);
> >> +
> >> +        for (i = 0; i < nb_nodes; i++) {
> >> +            nodes[i].node_mem = mem[i];
> >> +        }
> >> +
> >> +        g_free(mem);
> >> +
> >> +        return;
> >> +    }
> >> +
> >> +    /* Align each node according to the alignment
> >> +     * requirements of the machine class
> >> +     */
> >> +
> >> +    for (i = 0; i < nb_nodes - 1; i++) {
> >> +        nodes[i].node_mem = (size / nb_nodes) &
> >> +                            ~((1 << mc->numa_mem_align_shift) - 1);
> >> +        usedmem += nodes[i].node_mem;
> >> +    }
> >> +    nodes[i].node_mem = size - usedmem;
> >> +}
> > 
> > I would prefer to make your new algorithm the default, and move
> > this code to a legacy_auto_assign_ram() function set by the 2.9
> > machine-types.
> 
> I think it's easier to do as I've done because otherwise, we need:
> 
> - to add the numa_mem_align_shift to the parameters list of the
>   numa_auto_assign_ram() function.

You can take MachineClass or MachineState as a parameter.

> 
> - set the function by default to numa_auto_assign_ram in
>   hw/core/machine.c:machine_class_init()

Yep.

> 
> - set the pointer to NULL in 2.10 pseries machine type,

Won't pseries-2.10 have the same behavior as all other machines
except pc/pseries <= 2.9? pseries-2.10 and pc-2.10 would just
reuse the default value set by machine_class_init().

> 
> - export the function to re-set the legacy function in the 2.9 pseries
>   machine type definition.

Well, this is the part where I agree it's too much hassle.  :)

> 
> So, we need to add one argument to the function, export the function to
> use it from machine.c and at least spapr.c, to set the function in
> machine_class_init() and spapr_machine_2_9_class_options() (as we clear
> it in 2.10 function).
> 
> I can do that, but is this what you want?

I don't have a strong opinion. If you believe your way is
simpler, we can keep it.

-- 
Eduardo

* Re: [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 1/2] numa: introduce numa_auto_assign_ram() function in MachineClass Laurent Vivier
  2017-04-27 14:09   ` Eduardo Habkost
@ 2017-04-27 20:28   ` Eduardo Habkost
  1 sibling, 0 replies; 9+ messages in thread
From: Eduardo Habkost @ 2017-04-27 20:28 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, Thomas Huth, qemu-devel

On Thu, Apr 27, 2017 at 12:12:58PM +0200, Laurent Vivier wrote:
> We need to change the way we distribute the memory across
> the nodes. To keep compatibility between machine type versions,
> introduce a machine-type-dependent function.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  include/hw/boards.h |  2 ++
>  numa.c              | 44 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 31d9c72..e571b22 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -136,6 +136,8 @@ struct MachineClass {
>      int minimum_page_bits;
>      bool has_hotpluggable_cpus;
>      int numa_mem_align_shift;
> +    void (*numa_auto_assign_ram)(uint64_t *nodes,
> +                                 int nb_nodes, ram_addr_t size);
>  

The code looks good, but I would like to have comments here
documenting the behavior when the field is NULL.

>      HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
>                                             DeviceState *dev);
> diff --git a/numa.c b/numa.c
> index 6fc2393..2770c18 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -294,6 +294,38 @@ static void validate_numa_cpus(void)
>      g_free(seen_cpus);
>  }
>  
> +static void numa_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
> +                                 int nb_nodes, ram_addr_t size)
> +{
> +    int i;
> +    uint64_t usedmem = 0;
> +
> +    if (mc->numa_auto_assign_ram) {
> +        uint64_t *mem = g_new(uint64_t, nb_nodes);
> +
> +        mc->numa_auto_assign_ram(mem, nb_nodes, size);
> +
> +        for (i = 0; i < nb_nodes; i++) {
> +            nodes[i].node_mem = mem[i];
> +        }
> +
> +        g_free(mem);
> +
> +        return;
> +    }
> +
> +    /* Align each node according to the alignment
> +     * requirements of the machine class
> +     */
> +
> +    for (i = 0; i < nb_nodes - 1; i++) {
> +        nodes[i].node_mem = (size / nb_nodes) &
> +                            ~((1 << mc->numa_mem_align_shift) - 1);
> +        usedmem += nodes[i].node_mem;
> +    }
> +    nodes[i].node_mem = size - usedmem;
> +}
> +
>  void parse_numa_opts(MachineClass *mc)
>  {
>      int i;
> @@ -336,17 +368,7 @@ void parse_numa_opts(MachineClass *mc)
>              }
>          }
>          if (i == nb_numa_nodes) {
> -            uint64_t usedmem = 0;
> -
> -            /* Align each node according to the alignment
> -             * requirements of the machine class
> -             */
> -            for (i = 0; i < nb_numa_nodes - 1; i++) {
> -                numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> -                                        ~((1 << mc->numa_mem_align_shift) - 1);
> -                usedmem += numa_info[i].node_mem;
> -            }
> -            numa_info[i].node_mem = ram_size - usedmem;
> +            numa_auto_assign_ram(mc, numa_info, nb_numa_nodes, ram_size);
>          }
>  
>          numa_total = 0;
> -- 
> 2.9.3
> 

-- 
Eduardo

* Re: [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes
  2017-04-27 10:12 ` [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes Laurent Vivier
@ 2017-04-27 20:31   ` Eduardo Habkost
  2017-05-01  3:56     ` David Gibson
  0 siblings, 1 reply; 9+ messages in thread
From: Eduardo Habkost @ 2017-04-27 20:31 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, Thomas Huth, qemu-devel

On Thu, Apr 27, 2017 at 12:12:59PM +0200, Laurent Vivier wrote:
> When there is not enough memory to give every node the minimum
> allowed amount, all of the memory ends up on the last node.
> 
> This is because we assign (ram_size / nb_numa_nodes) &
> ~((1 << mc->numa_mem_align_shift) - 1) to each node, so in this
> case the per-node value is 0. This is particularly visible with
> pseries, where the memory must be aligned to 256MB.
> 
> To avoid this problem, this patch uses an error diffusion algorithm [1]
> to distribute the memory equally across the nodes.
> 
> Example:
> 
> qemu-system-ppc64 -S -nographic  -nodefaults -monitor stdio -m 1G -smp 8 \
>                   -numa node -numa node -numa node \
>                   -numa node -numa node -numa node
> 
> Before:
> 
> (qemu) info numa
> 6 nodes
> node 0 cpus: 0 6
> node 0 size: 0 MB
> node 1 cpus: 1 7
> node 1 size: 0 MB
> node 2 cpus: 2
> node 2 size: 0 MB
> node 3 cpus: 3
> node 3 size: 0 MB
> node 4 cpus: 4
> node 4 size: 0 MB
> node 5 cpus: 5
> node 5 size: 1024 MB
> 
> After:
> (qemu) info numa
> 6 nodes
> node 0 cpus: 0 6
> node 0 size: 0 MB
> node 1 cpus: 1 7
> node 1 size: 256 MB
> node 2 cpus: 2
> node 2 size: 0 MB
> node 3 cpus: 3
> node 3 size: 256 MB
> node 4 cpus: 4
> node 4 size: 256 MB
> node 5 cpus: 5
> node 5 size: 256 MB
> 
> [1] https://en.wikipedia.org/wiki/Error_diffusion
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  hw/ppc/spapr.c         | 21 ++++++++++++++++++++-
>  include/hw/ppc/spapr.h |  3 ++-
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 80d12d0..be498e2 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3106,6 +3106,23 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
>      ics_pic_print_info(spapr->ics, mon);
>  }
>  
> +static void spapr_numa_auto_assign_ram(uint64_t *nodes, int nb_nodes,
> +                                       ram_addr_t size)
> +{
> +    int i;
> +    uint64_t usedmem = 0, node_mem;
> +    uint64_t granularity = size / nb_nodes;
> +    uint64_t propagate = 0;
> +
> +    for (i = 0; i < nb_nodes - 1; i++) {
> +        node_mem = (granularity + propagate) & ~(SPAPR_MEMORY_BLOCK_SIZE - 1);
> +        propagate = granularity + propagate - node_mem;
> +        nodes[i] = node_mem;
> +        usedmem += node_mem;
> +    }
> > +    nodes[i] = size - usedmem;
> +}

This new algorithm is a much better default, I would like to
enable it for all machines by default. We could move it to
core/machine.c:machine_class_init(), and set numa_auto_assign_ram
to NULL on pc-*-2.9 and spapr-2.9.

> +
>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> @@ -3162,7 +3179,8 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>       * SPAPR_MEMORY_BLOCK_SIZE (256M) since that's the granularity
>       * in which LMBs are represented and hot-added
>       */
> -    mc->numa_mem_align_shift = 28;
> +    mc->numa_mem_align_shift = SPAPR_MEMORY_BLOCK_SIZE_SHIFT;
> +    mc->numa_auto_assign_ram = spapr_numa_auto_assign_ram;
>  }
>  
>  static const TypeInfo spapr_machine_info = {
> @@ -3242,6 +3260,7 @@ static void spapr_machine_2_9_class_options(MachineClass *mc)
>  {
>      spapr_machine_2_10_class_options(mc);
>      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_9);
> +    mc->numa_auto_assign_ram = NULL;
>  }
>  
>  DEFINE_SPAPR_MACHINE(2_9, "2.9", false);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 5802f88..8f4a588 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -653,7 +653,8 @@ int spapr_rtc_import_offset(sPAPRRTCState *rtc, int64_t legacy_offset);
>  
>  int spapr_rng_populate_dt(void *fdt);
>  
> -#define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */
> +#define SPAPR_MEMORY_BLOCK_SIZE_SHIFT 28 /* 256MB */
> +#define SPAPR_MEMORY_BLOCK_SIZE (1 << SPAPR_MEMORY_BLOCK_SIZE_SHIFT)
>  
>  /*
>   * This defines the maximum number of DIMM slots we can have for sPAPR
> -- 
> 2.9.3
> 

-- 
Eduardo

* Re: [Qemu-devel] [PATCH v2 2/2] numa, spapr: equally distribute memory on nodes
  2017-04-27 20:31   ` Eduardo Habkost
@ 2017-05-01  3:56     ` David Gibson
  0 siblings, 0 replies; 9+ messages in thread
From: David Gibson @ 2017-05-01  3:56 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Laurent Vivier, Thomas Huth, qemu-devel

On Thu, Apr 27, 2017 at 05:31:48PM -0300, Eduardo Habkost wrote:
> On Thu, Apr 27, 2017 at 12:12:59PM +0200, Laurent Vivier wrote:
> > When there is not enough memory to give every node the minimum
> > allowed amount, all of the memory ends up on the last node.
> > 
> > This is because we assign (ram_size / nb_numa_nodes) &
> > ~((1 << mc->numa_mem_align_shift) - 1) to each node, so in this
> > case the per-node value is 0. This is particularly visible with
> > pseries, where the memory must be aligned to 256MB.
> > 
> > To avoid this problem, this patch uses an error diffusion algorithm [1]
> > to distribute the memory equally across the nodes.
> > 
> > Example:
> > 
> > qemu-system-ppc64 -S -nographic  -nodefaults -monitor stdio -m 1G -smp 8 \
> >                   -numa node -numa node -numa node \
> >                   -numa node -numa node -numa node
> > 
> > Before:
> > 
> > (qemu) info numa
> > 6 nodes
> > node 0 cpus: 0 6
> > node 0 size: 0 MB
> > node 1 cpus: 1 7
> > node 1 size: 0 MB
> > node 2 cpus: 2
> > node 2 size: 0 MB
> > node 3 cpus: 3
> > node 3 size: 0 MB
> > node 4 cpus: 4
> > node 4 size: 0 MB
> > node 5 cpus: 5
> > node 5 size: 1024 MB
> > 
> > After:
> > (qemu) info numa
> > 6 nodes
> > node 0 cpus: 0 6
> > node 0 size: 0 MB
> > node 1 cpus: 1 7
> > node 1 size: 256 MB
> > node 2 cpus: 2
> > node 2 size: 0 MB
> > node 3 cpus: 3
> > node 3 size: 256 MB
> > node 4 cpus: 4
> > node 4 size: 256 MB
> > node 5 cpus: 5
> > node 5 size: 256 MB
> > 
> > [1] https://en.wikipedia.org/wiki/Error_diffusion
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  hw/ppc/spapr.c         | 21 ++++++++++++++++++++-
> >  include/hw/ppc/spapr.h |  3 ++-
> >  2 files changed, 22 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 80d12d0..be498e2 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -3106,6 +3106,23 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
> >      ics_pic_print_info(spapr->ics, mon);
> >  }
> >  
> > +static void spapr_numa_auto_assign_ram(uint64_t *nodes, int nb_nodes,
> > +                                       ram_addr_t size)
> > +{
> > +    int i;
> > +    uint64_t usedmem = 0, node_mem;
> > +    uint64_t granularity = size / nb_nodes;
> > +    uint64_t propagate = 0;
> > +
> > +    for (i = 0; i < nb_nodes - 1; i++) {
> > +        node_mem = (granularity + propagate) & ~(SPAPR_MEMORY_BLOCK_SIZE - 1);
> > +        propagate = granularity + propagate - node_mem;
> > +        nodes[i] = node_mem;
> > +        usedmem += node_mem;
> > +    }
> > > +    nodes[i] = size - usedmem;
> > +}
> 
> This new algorithm is a much better default, I would like to
> enable it for all machines by default. We could move it to
> core/machine.c:machine_class_init(), and set numa_auto_assign_ram
> to NULL on pc-*-2.9 and spapr-2.9.

Right.  So, the convention that the default is the new behaviour and
we override to set the older behaviour on previous machine types is
strong enough that I think we should keep it here, even though it adds
some complications.

Note that it might be simpler not to have a "NULL" behaviour.  Instead
the core code can provide both old style and error diffusion versions
of the assign hook.  Set it to the error diffusion one in the base
MachineClass, then override it to the old one in previous versioned
machine types (basically just the previous pc and spapr models,
AFAIK).
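
Concretely, that would look something like the following
self-contained model (names here are illustrative only, not the
actual QEMU API):

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch only: types and names model QEMU, they are not QEMU. */
    typedef uint64_t ram_addr_t;

    typedef struct MachineClass {
        int numa_mem_align_shift;
        void (*numa_auto_assign_ram)(struct MachineClass *mc,
                                     uint64_t *nodes, int nb_nodes,
                                     ram_addr_t size);
    } MachineClass;

    /* old behaviour: equal share per node rounded down, rest on last */
    static void legacy_auto_assign_ram(MachineClass *mc, uint64_t *nodes,
                                       int nb_nodes, ram_addr_t size)
    {
        uint64_t align = 1ULL << mc->numa_mem_align_shift;
        uint64_t usedmem = 0;
        int i;

        for (i = 0; i < nb_nodes - 1; i++) {
            nodes[i] = (size / nb_nodes) & ~(align - 1);
            usedmem += nodes[i];
        }
        nodes[i] = size - usedmem;
    }

    /* new default: error diffusion, as in patch 2/2 */
    static void diffuse_auto_assign_ram(MachineClass *mc, uint64_t *nodes,
                                        int nb_nodes, ram_addr_t size)
    {
        uint64_t align = 1ULL << mc->numa_mem_align_shift;
        uint64_t granularity = size / nb_nodes;
        uint64_t propagate = 0, usedmem = 0;
        int i;

        for (i = 0; i < nb_nodes - 1; i++) {
            nodes[i] = (granularity + propagate) & ~(align - 1);
            propagate = granularity + propagate - nodes[i];
            usedmem += nodes[i];
        }
        nodes[i] = size - usedmem;
    }

    static void machine_class_init(MachineClass *mc)
    {
        mc->numa_auto_assign_ram = diffuse_auto_assign_ram; /* default */
    }

    static void machine_2_9_class_options(MachineClass *mc)
    {
        mc->numa_auto_assign_ram = legacy_auto_assign_ram; /* override */
    }

    int main(void)
    {
        MachineClass mc = { .numa_mem_align_shift = 28 }; /* 256MB */
        uint64_t nodes[6];
        int i;

        machine_class_init(&mc);            /* 2.10+: error diffusion */
        mc.numa_auto_assign_ram(&mc, nodes, 6, 1ULL << 30);
        for (i = 0; i < 6; i++) {
            printf("2.10 node %d: %4u MB\n", i, (unsigned)(nodes[i] >> 20));
        }

        machine_2_9_class_options(&mc);     /* <= 2.9: legacy split */
        mc.numa_auto_assign_ram(&mc, nodes, 6, 1ULL << 30);
        for (i = 0; i < 6; i++) {
            printf(" 2.9 node %d: %4u MB\n", i, (unsigned)(nodes[i] >> 20));
        }
        return 0;
    }

The 2.10 run reproduces the "After" distribution from patch 2/2 and
the 2.9 run the "Before" one (everything on the last node).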

> 
> > +
> >  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >  {
> >      MachineClass *mc = MACHINE_CLASS(oc);
> > @@ -3162,7 +3179,8 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >       * SPAPR_MEMORY_BLOCK_SIZE (256M) since that's the granularity
> >       * in which LMBs are represented and hot-added
> >       */
> > -    mc->numa_mem_align_shift = 28;
> > +    mc->numa_mem_align_shift = SPAPR_MEMORY_BLOCK_SIZE_SHIFT;
> > +    mc->numa_auto_assign_ram = spapr_numa_auto_assign_ram;
> >  }
> >  
> >  static const TypeInfo spapr_machine_info = {
> > @@ -3242,6 +3260,7 @@ static void spapr_machine_2_9_class_options(MachineClass *mc)
> >  {
> >      spapr_machine_2_10_class_options(mc);
> >      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_9);
> > +    mc->numa_auto_assign_ram = NULL;
> >  }
> >  
> >  DEFINE_SPAPR_MACHINE(2_9, "2.9", false);
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index 5802f88..8f4a588 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -653,7 +653,8 @@ int spapr_rtc_import_offset(sPAPRRTCState *rtc, int64_t legacy_offset);
> >  
> >  int spapr_rng_populate_dt(void *fdt);
> >  
> > -#define SPAPR_MEMORY_BLOCK_SIZE (1 << 28) /* 256MB */
> > +#define SPAPR_MEMORY_BLOCK_SIZE_SHIFT 28 /* 256MB */
> > +#define SPAPR_MEMORY_BLOCK_SIZE (1 << SPAPR_MEMORY_BLOCK_SIZE_SHIFT)
> >  
> >  /*
> >   * This defines the maximum number of DIMM slots we can have for sPAPR
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
