* [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
@ 2014-06-16  7:53 Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper Alexey Kardashevskiy
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf


c4177479 "spapr: make sure RMA is in first mode of first memory node"
introduced a regression which prevents running guests with a memoryless
NUMA node#0, which may happen on real POWER8 boxes and which would make
sense to debug in QEMU.

This patchset aims to fix that and also to fix various code problems in
memory node generation.

These 2 patches could be merged (the resulting patch looks rather ugly):
spapr: Use DT memory node rendering helper for other nodes
spapr: Move DT memory node rendering to a helper

Please comment. Thanks!




Alexey Kardashevskiy (7):
  spapr: Move DT memory node rendering to a helper
  spapr: Use DT memory node rendering helper for other nodes
  spapr: Refactor spapr_populate_memory()
  spapr: Split memory nodes to power-of-two blocks
  spapr: Add a helper for node0_size calculation
  spapr: Fix ibm,associativity for memory nodes
  numa: Allow empty nodes

 hw/ppc/spapr.c | 104 +++++++++++++++++++++++++++++++--------------------------
 vl.c           |   2 +-
 2 files changed, 57 insertions(+), 49 deletions(-)

-- 
2.0.0

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 2/7] spapr: Use DT memory node rendering helper for other nodes Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

This moves the recurring bits of code related to the creation of
memory@xxx nodes into a helper.

It also makes use of the new helper for the memory@0 node.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 48 ++++++++++++++++++++++++++++--------------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e95be8c..8390759 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -662,6 +662,31 @@ int spapr_h_cas_compose_response(target_ulong addr, target_ulong size)
     return 0;
 }
 
+static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
+                                       hwaddr size)
+{
+    uint32_t associativity[] = {
+        cpu_to_be32(0x4), /* length */
+        cpu_to_be32(0x0), cpu_to_be32(0x0),
+        cpu_to_be32(nodeid), cpu_to_be32(nodeid)
+    };
+    char mem_name[32];
+    uint64_t mem_reg_property[2];
+    int off;
+
+    mem_reg_property[0] = cpu_to_be64(start);
+    mem_reg_property[1] = cpu_to_be64(size);
+
+    sprintf(mem_name, "memory@" TARGET_FMT_lx, start);
+    off = fdt_add_subnode(fdt, 0, mem_name);
+    _FDT(off);
+    _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
+    _FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
+                      sizeof(mem_reg_property))));
+    _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
+                      sizeof(associativity))));
+}
+
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
     uint32_t associativity[] = {cpu_to_be32(0x4), cpu_to_be32(0x0),
@@ -680,29 +705,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
     }
 
     /* RMA */
-    mem_reg_property[0] = 0;
-    mem_reg_property[1] = cpu_to_be64(spapr->rma_size);
-    off = fdt_add_subnode(fdt, 0, "memory@0");
-    _FDT(off);
-    _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-    _FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-                      sizeof(mem_reg_property))));
-    _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-                      sizeof(associativity))));
+    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
 
     /* RAM: Node 0 */
     if (node0_size > spapr->rma_size) {
-        mem_reg_property[0] = cpu_to_be64(spapr->rma_size);
-        mem_reg_property[1] = cpu_to_be64(node0_size - spapr->rma_size);
-
-        sprintf(mem_name, "memory@" TARGET_FMT_lx, spapr->rma_size);
-        off = fdt_add_subnode(fdt, 0, mem_name);
-        _FDT(off);
-        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-        _FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-                          sizeof(mem_reg_property))));
-        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-                          sizeof(associativity))));
+        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
+                                   node0_size - spapr->rma_size);
     }
 
     /* RAM: Node 1 and beyond */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 2/7] spapr: Use DT memory node rendering helper for other nodes
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

This finishes the refactoring by using the spapr_populate_memory_node()
helper for all nodes and removing the leftovers from
spapr_populate_memory().

This is not part of the previous patch because the two patches are
easier to review apart.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8390759..cb3a10a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -689,13 +689,8 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
 
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
-    uint32_t associativity[] = {cpu_to_be32(0x4), cpu_to_be32(0x0),
-                                cpu_to_be32(0x0), cpu_to_be32(0x0),
-                                cpu_to_be32(0x0)};
-    char mem_name[32];
     hwaddr node0_size, mem_start, node_size;
-    uint64_t mem_reg_property[2];
-    int i, off;
+    int i;
 
     /* memory node(s) */
     if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
@@ -716,7 +711,6 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
     /* RAM: Node 1 and beyond */
     mem_start = node0_size;
     for (i = 1; i < nb_numa_nodes; i++) {
-        mem_reg_property[0] = cpu_to_be64(mem_start);
         if (mem_start >= ram_size) {
             node_size = 0;
         } else {
@@ -725,16 +719,7 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
                 node_size = ram_size - mem_start;
             }
         }
-        mem_reg_property[1] = cpu_to_be64(node_size);
-        associativity[3] = associativity[4] = cpu_to_be32(i);
-        sprintf(mem_name, "memory@" TARGET_FMT_lx, mem_start);
-        off = fdt_add_subnode(fdt, 0, mem_name);
-        _FDT(off);
-        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-        _FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-                          sizeof(mem_reg_property))));
-        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-                          sizeof(associativity))));
+        spapr_populate_memory_node(fdt, i, mem_start, node_size);
         mem_start += node_size;
     }
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 2/7] spapr: Use DT memory node rendering helper for other nodes Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-18  5:04   ` Alexey Kardashevskiy
                     ` (2 more replies)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

Current QEMU does not support memoryless NUMA nodes.
This prepares SPAPR for supporting them.

This moves the 2 calls of spapr_populate_memory_node() into
the existing loop which handles nodes other than
the first one.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 31 +++++++++++--------------------
 1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index cb3a10a..666b676 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
 
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
-    hwaddr node0_size, mem_start, node_size;
+    hwaddr mem_start, node_size;
     int i;
 
-    /* memory node(s) */
-    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
-        node0_size = node_mem[0];
-    } else {
-        node0_size = ram_size;
-    }
-
-    /* RMA */
-    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
-
-    /* RAM: Node 0 */
-    if (node0_size > spapr->rma_size) {
-        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
-                                   node0_size - spapr->rma_size);
-    }
-
-    /* RAM: Node 1 and beyond */
-    mem_start = node0_size;
-    for (i = 1; i < nb_numa_nodes; i++) {
+    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
+        if (!node_mem[i]) {
+            continue;
+        }
         if (mem_start >= ram_size) {
             node_size = 0;
         } else {
@@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
                 node_size = ram_size - mem_start;
             }
         }
+        if (!mem_start) {
+            /* ppc_spapr_init() checks for rma_size <= node0_size already */
+            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
+            mem_start += spapr->rma_size;
+            node_size -= spapr->rma_size;
+        }
         spapr_populate_memory_node(fdt, i, mem_start, node_size);
         mem_start += node_size;
     }
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-17  7:07   ` Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

The Linux kernel expects memory nodes to have a power-of-two size and
does a WARN_ON if this is not the case:
[    0.041052] devtmpfs: initialized
[    0.041292] ------------[ cut here ]------------
[    0.041456] WARNING: at drivers/base/memory.c:115

This splits memory nodes into a set of smaller blocks whose sizes are
powers of two, and makes sure the start address of every block is
aligned to the block size.
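The splitting rule can be exercised as a standalone sketch (hypothetical helper names; QEMU's _FDT()/libfdt plumbing is omitted, and 1ULL with __builtin_ctzll is used here purely so the sketch stays well-defined for addresses at or above 4 GB):

```c
#include <stdint.h>

/* Largest power of two <= x; local stand-in for QEMU's pow2floor(). */
static uint64_t pow2floor(uint64_t x)
{
    uint64_t r = 1;

    while (r * 2 != 0 && r * 2 <= x) {
        r *= 2;
    }
    return r;
}

/*
 * Split [start, start + size) into power-of-two blocks, each aligned
 * to its own size, mirroring the loop added to spapr_populate_memory().
 * Returns the number of blocks written to blk_start[]/blk_size[].
 */
static int split_node(uint64_t start, uint64_t size,
                      uint64_t blk_start[], uint64_t blk_size[])
{
    int n = 0;

    while (size) {
        uint64_t sizetmp = pow2floor(size);

        /* A block must not exceed the alignment of its start address. */
        if (start && (1ULL << __builtin_ctzll(start)) < sizetmp) {
            sizetmp = 1ULL << __builtin_ctzll(start);
        }
        blk_start[n] = start;
        blk_size[n] = sizetmp;
        n++;
        size -= sizetmp;
        start += sizetmp;
    }
    return n;
}
```

For start = 0x41000000 and size = 0x41000000 (a 1040 MB node) this emits blocks at 0x41000000, 0x42000000, 0x44000000, 0x48000000, 0x50000000, 0x60000000 and 0x80000000, matching the memory@... nodes in the lsprop output later in the thread.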

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 666b676..10202e2 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -710,8 +710,18 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
             mem_start += spapr->rma_size;
             node_size -= spapr->rma_size;
         }
-        spapr_populate_memory_node(fdt, i, mem_start, node_size);
-        mem_start += node_size;
+        for ( ; node_size; ) {
+            hwaddr sizetmp = pow2floor(node_size);
+
+            /* mem_start != 0 here */
+            if (ffs(mem_start) < ffs(sizetmp)) {
+                sizetmp = MIN(sizetmp, 1 << (ffs(mem_start) - 1));
+            }
+
+            spapr_populate_memory_node(fdt, i, mem_start, sizetmp);
+            node_size -= sizetmp;
+            mem_start += sizetmp;
+        }
     }
 
     return 0;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-16 18:43   ` Nishanth Aravamudan
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 6/7] spapr: Fix ibm, associativity for memory nodes Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

In multiple places there is a node0_size variable calculation
which assumes that NUMA node #0 and memory node #0 are the same
thing, which they are not. Since we are going to change that and
do not want to change it in multiple places, let's make a helper.

This adds a spapr_node0_size() helper and makes use of it.
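A minimal sketch of the helper's behaviour, with local stand-ins for QEMU's pow2floor() and for the node_mem[]/ram_size globals (the values in the note below use the 2 x 1040 MB example configuration from later in the thread):

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Largest power of two <= x; local stand-in for QEMU's pow2floor(). */
static uint64_t pow2floor(uint64_t x)
{
    uint64_t r = 1;

    while (r * 2 != 0 && r * 2 <= x) {
        r *= 2;
    }
    return r;
}

/*
 * The first NUMA node that actually has memory decides node0_size,
 * capped at ram_size; with a single node everything belongs to it.
 */
static uint64_t node0_size(const uint64_t node_mem[], int nb_numa_nodes,
                           uint64_t ram_size)
{
    int i;

    if (nb_numa_nodes > 1) {
        for (i = 0; i < nb_numa_nodes; ++i) {
            if (node_mem[i]) {
                return MIN(pow2floor(node_mem[i]), ram_size);
            }
        }
    }
    return ram_size;
}
```

With memory only on nodes 2 and 4 (1040 MB each), node0_size comes out as pow2floor(1040 MB) = 1 GB rather than node_mem[0], which is the point of decoupling NUMA node #0 from memory node #0.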

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 10202e2..00ef7a1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -351,6 +351,20 @@ static size_t create_page_sizes_prop(CPUPPCState *env, uint32_t *prop,
     return (p - prop) * sizeof(uint32_t);
 }
 
+static hwaddr spapr_node0_size(void)
+{
+    if (nb_numa_nodes > 1) {
+        int i;
+        for (i = 0; i < nb_numa_nodes; ++i) {
+            if (node_mem[i]) {
+                return MIN(pow2floor(node_mem[i]), ram_size);
+            }
+        }
+        return ram_size;
+    }
+    return ram_size;
+}
+
 #define _FDT(exp) \
     do { \
         int ret = (exp);                                           \
@@ -851,8 +865,8 @@ static void spapr_reset_htab(sPAPREnvironment *spapr)
 
     /* Update the RMA size if necessary */
     if (spapr->vrma_adjust) {
-        hwaddr node0_size = (nb_numa_nodes > 1) ? node_mem[0] : ram_size;
-        spapr->rma_size = kvmppc_rma_size(node0_size, spapr->htab_shift);
+        spapr->rma_size = kvmppc_rma_size(spapr_node0_size(),
+                                          spapr->htab_shift);
     }
 }
 
@@ -1284,7 +1298,7 @@ static void ppc_spapr_init(MachineState *machine)
     MemoryRegion *sysmem = get_system_memory();
     MemoryRegion *ram = g_new(MemoryRegion, 1);
     hwaddr rma_alloc_size;
-    hwaddr node0_size = (nb_numa_nodes > 1) ? node_mem[0] : ram_size;
+    hwaddr node0_size = spapr_node0_size();
     uint32_t initrd_base = 0;
     long kernel_size = 0, initrd_size = 0;
     long load_limit, rtas_limit, fw_size;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 6/7] spapr: Fix ibm, associativity for memory nodes
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes Alexey Kardashevskiy
  2014-06-16  8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
  7 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc, Alexander Graf

We want the associativity lists of memory and CPU nodes to match, but
memory nodes have an incorrect domain#3 which is zero for CPUs, so they
won't match.

This clears domain#3 in the list to match the CPUs' associativity lists.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 00ef7a1..57ae835 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -682,7 +682,7 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
     uint32_t associativity[] = {
         cpu_to_be32(0x4), /* length */
         cpu_to_be32(0x0), cpu_to_be32(0x0),
-        cpu_to_be32(nodeid), cpu_to_be32(nodeid)
+        cpu_to_be32(0x0), cpu_to_be32(nodeid)
     };
     char mem_name[32];
     uint64_t mem_reg_property[2];
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 6/7] spapr: Fix ibm, associativity for memory nodes Alexey Kardashevskiy
@ 2014-06-16  7:53 ` Alexey Kardashevskiy
  2014-06-16 16:15   ` Eduardo Habkost
  2014-06-16  8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
  7 siblings, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  7:53 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Nishanth Aravamudan, qemu-ppc,
	Alexander Graf, Eduardo Habkost

Currently NUMA nodes must be specified consecutively, and QEMU ignores
nodes with a @nodeid bigger than the number of NUMA nodes on the command
line. This prevents us from creating memoryless nodes, which is a
possible situation on POWERPC under pHyp or Sapphire.

This makes nb_numa_nodes the total number of nodes or the biggest node
number + 1, whichever is greater.
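With the sparse node IDs used in the example later in the thread (nodeid=0,2,4), the accounting change can be sketched as follows (hypothetical function names; vl.c actually updates nb_numa_nodes one option at a time):

```c
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Old accounting: one increment per -numa node option. */
static int nb_nodes_old(const int nodeids[], int count)
{
    int nb = 0, i;

    for (i = 0; i < count; i++) {
        nb++;
    }
    return nb;
}

/* Patched accounting: the biggest nodeid + 1 wins over the count. */
static int nb_nodes_new(const int nodeids[], int count)
{
    int nb = 0, i;

    for (i = 0; i < count; i++) {
        nb = MAX(nb + 1, nodeids[i] + 1);
    }
    return nb;
}
```

For nodeids 0, 2 and 4 the old code reports 3 nodes, so loops bounded by nb_numa_nodes never reach node_mem[4]; the patched code reports 5, covering all configured entries.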

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 vl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vl.c b/vl.c
index ac0e3d7..f1b75cb 100644
--- a/vl.c
+++ b/vl.c
@@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
         if (get_param_value(option, 128, "cpus", optarg) != 0) {
             numa_node_parse_cpus(nodenr, option);
         }
-        nb_numa_nodes++;
+        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
     } else {
         fprintf(stderr, "Invalid -numa option: %s\n", option);
         exit(1);
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes Alexey Kardashevskiy
@ 2014-06-16  8:16 ` Alexey Kardashevskiy
  2014-06-16 18:26   ` Nishanth Aravamudan
  2014-06-16 20:51   ` Eduardo Habkost
  7 siblings, 2 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-16  8:16 UTC (permalink / raw)
  To: qemu-devel; +Cc: Nishanth Aravamudan, qemu-ppc, Alexander Graf

On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> c4177479 "spapr: make sure RMA is in first mode of first memory node"
> introduced a regression which prevents running guests with a memoryless
> NUMA node#0, which may happen on real POWER8 boxes and which would make
> sense to debug in QEMU.
> 
> This patchset aims to fix that and also to fix various code problems in
> memory node generation.
> 
> These 2 patches could be merged (the resulting patch looks rather ugly):
> spapr: Use DT memory node rendering helper for other nodes
> spapr: Move DT memory node rendering to a helper
> 
> Please comment. Thanks!
> 

Sure, I forgot to add an example of what I am trying to run without
errors and warnings:

/home/aik/qemu-system-ppc64 \
-enable-kvm \
-machine pseries \
-nographic \
-vga none \
-drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
-device scsi-disk,id=id1,drive=id0 \
-m 2080 \
-smp 8 \
-numa node,nodeid=0,cpus=0-7,memory=0 \
-numa node,nodeid=2,cpus=0-3,mem=1040 \
-numa node,nodeid=4,cpus=4-7,mem=1040


[root@localhost ~]# find /proc/device-tree/ -iname "ibm,associativity" \
-exec lsprop {} \;
/proc/device-tree/cpus/PowerPC,POWER7@1c/ibm,associativity
                 00000005 00000000 00000000 00000000 00000004 0000001c
/proc/device-tree/cpus/PowerPC,POWER7@18/ibm,associativity
                 00000005 00000000 00000000 00000000 00000004 00000018
/proc/device-tree/cpus/PowerPC,POWER7@14/ibm,associativity
                 00000005 00000000 00000000 00000000 00000004 00000014
/proc/device-tree/cpus/PowerPC,POWER7@10/ibm,associativity
                 00000005 00000000 00000000 00000000 00000004 00000010
/proc/device-tree/cpus/PowerPC,POWER7@c/ibm,associativity
                 00000005 00000000 00000000 00000000 00000002 0000000c
/proc/device-tree/cpus/PowerPC,POWER7@8/ibm,associativity
                 00000005 00000000 00000000 00000000 00000002 00000008
/proc/device-tree/cpus/PowerPC,POWER7@4/ibm,associativity
                 00000005 00000000 00000000 00000000 00000002 00000004
/proc/device-tree/cpus/PowerPC,POWER7@0/ibm,associativity
                 00000005 00000000 00000000 00000000 00000002 00000000
/proc/device-tree/memory@0/ibm,associativity
                 00000004 00000000 00000000 00000000 00000002
/proc/device-tree/memory@40000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000002
/proc/device-tree/memory@41000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@42000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@44000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@48000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@50000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@60000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
/proc/device-tree/memory@80000000/ibm,associativity
                 00000004 00000000 00000000 00000000 00000004
[root@localhost ~]# numactl --hardware

available: 3 nodes (0,2,4)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3
node 2 size: 1021 MB
node 2 free: 610 MB
node 4 cpus: 4 5 6 7
node 4 size: 1038 MB
node 4 free: 881 MB
node distances:
node   0   2   4
  0:  10  40  40
  2:  40  10  40
  4:  40  40  10


Seems correct except for that weird node#0, which comes from I do not
know where.


And the patchset is made against the agraf/ppc-next tree.

> 
> 
> Alexey Kardashevskiy (7):
>   spapr: Move DT memory node rendering to a helper
>   spapr: Use DT memory node rendering helper for other nodes
>   spapr: Refactor spapr_populate_memory()
>   spapr: Split memory nodes to power-of-two blocks
>   spapr: Add a helper for node0_size calculation
>   spapr: Fix ibm,associativity for memory nodes
>   numa: Allow empty nodes
> 
>  hw/ppc/spapr.c | 104 +++++++++++++++++++++++++++++++--------------------------
>  vl.c           |   2 +-
>  2 files changed, 57 insertions(+), 49 deletions(-)
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes Alexey Kardashevskiy
@ 2014-06-16 16:15   ` Eduardo Habkost
  2014-06-16 18:49     ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-16 16:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Nishanth Aravamudan, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> Currently NUMA nodes must be specified consecutively, and QEMU ignores
> nodes with a @nodeid bigger than the number of NUMA nodes on the command
> line.

Why would somebody need a NUMA node with nodeid bigger than the number
of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.

> This prevents us from creating memoryless nodes, which is a
> possible situation on POWERPC under pHyp or Sapphire.

Why? If I recall correctly, nodes without any CPUs or any memory are
already allowed.

How exactly would this patch help you? How do you expect the
command-line to look like for your use case?

> 
> This makes nb_numa_nodes the total number of nodes or the biggest node
> number + 1, whichever is greater.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  vl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/vl.c b/vl.c
> index ac0e3d7..f1b75cb 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
>          if (get_param_value(option, 128, "cpus", optarg) != 0) {
>              numa_node_parse_cpus(nodenr, option);
>          }
> -        nb_numa_nodes++;
> +        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);

I would instead suggest that if any node in the [0, max_node_id] range
is not present on the command-line, QEMU should instead reject the
command-line.

>      } else {
>          fprintf(stderr, "Invalid -numa option: %s\n", option);
>          exit(1);
> -- 
> 2.0.0
> 

-- 
Eduardo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-16  8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
@ 2014-06-16 18:26   ` Nishanth Aravamudan
  2014-06-16 20:51   ` Eduardo Habkost
  1 sibling, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-16 18:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [18:16:29 +1000], Alexey Kardashevskiy wrote:
> On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> > c4177479 "spapr: make sure RMA is in first mode of first memory node"
> > introduced a regression which prevents running guests with a memoryless
> > NUMA node#0, which may happen on real POWER8 boxes and which would make
> > sense to debug in QEMU.
> > 
> > This patchset aims to fix that and also to fix various code problems in
> > memory node generation.
> > 
> > These 2 patches could be merged (the resulting patch looks rather ugly):
> > spapr: Use DT memory node rendering helper for other nodes
> > spapr: Move DT memory node rendering to a helper
> > 
> > Please comment. Thanks!
> > 
> 
> Sure, I forgot to add an example of what I am trying to run without
> errors and warnings:

<snip>

> -numa node,nodeid=0,cpus=0-7,memory=0 \
> -numa node,nodeid=2,cpus=0-3,mem=1040 \
> -numa node,nodeid=4,cpus=4-7,mem=1040

Semantically, what does this mean? CPUs 0-3 are on both node 0 and node
2? I didn't think the NUMA spec allowed that? Or does qemu's
command-line take the "last" specified assignment of a CPU to a nodeid?

Perhaps unrelated to your changes, but I think it would be most sensible
here to error out if a CPU is assigned to multiple NUMA nodes.

<snip>

> [root@localhost ~]# numactl --hardware
> 
> available: 3 nodes (0,2,4)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3
> node 2 size: 1021 MB
> node 2 free: 610 MB
> node 4 cpus: 4 5 6 7
> node 4 size: 1038 MB
> node 4 free: 881 MB
> node distances:
> node   0   2   4
>   0:  10  40  40
>   2:  40  10  40
>   4:  40  40  10
> 
> 
> Seems correct except for that weird node#0, which comes from I do not
> know where.

Well, Linux has a statically online node 0 which, if no CPUs or memory
are assigned to it, will show up as above as a CPU-less and memoryless
node. That's not a bug in QEMU, and is something I'm looking into
upstream in the kernel.

> And the patchset is made against the agraf/ppc-next tree.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation Alexey Kardashevskiy
@ 2014-06-16 18:43   ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-16 18:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:53:51 +1000], Alexey Kardashevskiy wrote:
> In multiple places there is a node0_size variable calculation
> which assumes that NUMA node #0 and memory node #0 are the same
> things which they are not. Since we are going to change it and
> do not want to change it in multiple places, let's make a helper.
> 
> This adds a spapr_node0_size() helper and makes use of it.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 10202e2..00ef7a1 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -351,6 +351,20 @@ static size_t create_page_sizes_prop(CPUPPCState *env, uint32_t *prop,
>      return (p - prop) * sizeof(uint32_t);
>  }
> 
> +static hwaddr spapr_node0_size(void)
> +{
> +    if (nb_numa_nodes > 1) {
> +        int i;
> +        for (i = 0; i < nb_numa_nodes; ++i) {
> +            if (node_mem[i]) {
> +                return MIN(pow2floor(node_mem[i]), ram_size);
> +            }
> +        }
> +        return ram_size;

Can't this just fall through to the return below as well?

> +    }
> +    return ram_size;
> +}
> +
>  #define _FDT(exp) \
>      do { \
>          int ret = (exp);                                           \
> @@ -851,8 +865,8 @@ static void spapr_reset_htab(sPAPREnvironment *spapr)
> 
>      /* Update the RMA size if necessary */
>      if (spapr->vrma_adjust) {
> -        hwaddr node0_size = (nb_numa_nodes > 1) ? node_mem[0] : ram_size;
> -        spapr->rma_size = kvmppc_rma_size(node0_size, spapr->htab_shift);
> +        spapr->rma_size = kvmppc_rma_size(spapr_node0_size(),
> +                                          spapr->htab_shift);
>      }
>  }
> 
> @@ -1284,7 +1298,7 @@ static void ppc_spapr_init(MachineState *machine)
>      MemoryRegion *sysmem = get_system_memory();
>      MemoryRegion *ram = g_new(MemoryRegion, 1);
>      hwaddr rma_alloc_size;
> -    hwaddr node0_size = (nb_numa_nodes > 1) ? node_mem[0] : ram_size;
> +    hwaddr node0_size = spapr_node0_size();
>      uint32_t initrd_base = 0;
>      long kernel_size = 0, initrd_size = 0;
>      long load_limit, rtas_limit, fw_size;
> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16 16:15   ` Eduardo Habkost
@ 2014-06-16 18:49     ` Nishanth Aravamudan
  2014-06-16 20:11       ` Eduardo Habkost
  0 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-16 18:49 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> > Currently NUMA node IDs must be consecutive, and QEMU ignores nodes
> > with a @nodeid bigger than the number of NUMA nodes on the command line.
> 
> Why would somebody need a NUMA node with nodeid bigger than the number
> of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.

That is not how the code works currently.

vl.c::numa_add()
...
        if (get_param_value(option, 128, "nodeid", optarg) == 0) {
            nodenr = nb_numa_nodes;
        } else {
            if (parse_uint_full(option, &nodenr, 10) < 0) {
                fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
                exit(1);
            }
        }
...
        if (get_param_value(option, 128, "mem", optarg) == 0) {
            node_mem[nodenr] = 0;
        } else {
            int64_t sval;
            sval = strtosz(option, &endptr);
            if (sval < 0 || *endptr) {
                fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
                exit(1);
            }
            node_mem[nodenr] = sval;
        }
...
	nb_numa_nodes++;
...

So if a user passes nodeid= to the NUMA node definition, that entry in
node_mem is set to the appropriate value, but nb_numa_nodes, which is
used to bound the iteration over that array, is not bumped accordingly.
So we end up looking at arbitrary indices in the node_mem array, which
are often 0.

Note also that this means we can't generically differentiate here
between a user-defined memoryless node and one that happens to be 0
because the particular nodeid was not specified on the command line.
Alexey, how do you differentiate between these two cases after your
patches? In patch 3, I see you check (and skip in the loop) explicitly
if !node_mem[nodeid], but I'm not sure how that check can differentiate
between a value that is statically 0 (from main's initialization loop)
and one where the user says a node's memory is 0. Probably something
obvious I'm missing (it is Monday after all)...

> > This prevents us from creating memory-less nodes, which is a possible
> > situation on POWERPC under pHyp or Sapphire.
> 
> Why? If I recall correctly, nodes without any CPUs or any memory are
> already allowed.

They are allowed, but it seems like the code throughout qemu (where it's
relevant) assumes that NUMA nodes are sequential and contiguous, but
that's not required (nor is it enforced on the command-line).

> How exactly would this patch help you? How do you expect the
> command-line to look like for your use case?

Alexey has replied with that, it looks like.

> > This makes nb_numa_nodes the total number of nodes or the biggest node
> > number + 1, whichever is greater.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> >  vl.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/vl.c b/vl.c
> > index ac0e3d7..f1b75cb 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
> >          if (get_param_value(option, 128, "cpus", optarg) != 0) {
> >              numa_node_parse_cpus(nodenr, option);
> >          }
> > -        nb_numa_nodes++;
> > +        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
> 
> I would instead suggest that if any node in the [0, max_node_id] range
> is not present on the command-line, QEMU should instead reject the
> command-line.

We already check two things:

Too many nodes (meaning we've filled the array):
        if (nb_numa_nodes >= MAX_NODES) {
            fprintf(stderr, "qemu: too many NUMA nodes\n");
            exit(1);
        }

Node ID itself is out of range (due to the use of an array):
        if (nodenr >= MAX_NODES) {
            fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
            exit(1);
        }

But that doesn't prevent one from having *sparse* NUMA node IDs. And, as
far as I can tell, this is allowed by the spec, but isn't properly
supported by qemu.

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16 18:49     ` Nishanth Aravamudan
@ 2014-06-16 20:11       ` Eduardo Habkost
  2014-06-16 20:31         ` Eduardo Habkost
  2014-06-17  0:16         ` Nishanth Aravamudan
  0 siblings, 2 replies; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-16 20:11 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 11:49:46AM -0700, Nishanth Aravamudan wrote:
> On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote:
> > On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> > > Currently NUMA node IDs must be consecutive, and QEMU ignores nodes
> > > with a @nodeid bigger than the number of NUMA nodes on the command line.
> > 
> > Why would somebody need a NUMA node with nodeid bigger than the number
> > of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.
> 
> That is not how the code works currently.
> 
> vl.c::numa_add()
> ...
>         if (get_param_value(option, 128, "nodeid", optarg) == 0) {
>             nodenr = nb_numa_nodes;
>         } else {
>             if (parse_uint_full(option, &nodenr, 10) < 0) {
>                 fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
>                 exit(1);
>             }
>         }
> ...
>         if (get_param_value(option, 128, "mem", optarg) == 0) {
>             node_mem[nodenr] = 0;
>         } else {
>             int64_t sval;
>             sval = strtosz(option, &endptr);
>             if (sval < 0 || *endptr) {
>                 fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
>                 exit(1);
>             }
>             node_mem[nodenr] = sval;
>         }
> ...
> 	nb_numa_nodes++;
> ...
> 
> So if a user passes nodeid= to the NUMA node definition, that entry in
> node_mem is set to the appropriate value, but nb_numa_nodes, which is
> used to bound the iteration of that array is not bumped appropriately.
> So we end up looking at arbitrary indices in the node_mem array, which
> are often 0.

Yes. What I claim is: the bug is that the code is not validating
nodeids, which must all be in the [0, num_nodes-1] range.

Alexey, why do you believe you need non-contiguous nodeids to declare
memory-less nodes at all? Just use "-numa node" with no arguments, and
you will get a node with node_mem==0.

In other words, if you are trying to do:
  -numa node,id=1,mem=X -numa node,id=3,mem=Y
(so you get nodes 0 and 2 with no memory).

You just need to do this instead:
  -numa node,id=0 -numa node,id=1,mem=X -numa node,id=2 -numa node,id=3,mem=Y

> 
> Note also that means that we can't generically differentiate here
> between a user-defined memoryless node and one that happens to be 0
> because the particular nodeid was not specified on the command-line.

That's because all nodeids must be specified in the command-line.
Accepting omitted nodes is a bug which should be fixed.

> Alexey, how do you differentiate between these two cases after your
> patches? In patch 3, I see you check (and skip in the loop) explicitly
> if !node_mem[nodeid], but I'm not sure how that check can differentiate
> between the statically 0 (from main's initialization loop) and when a
> user says a node's memory is 0. Probably something obvious I'm missing
> (it is Monday after all)...
> 
> > > > This prevents us from creating memory-less nodes, which is a possible
> > > > situation on POWERPC under pHyp or Sapphire.
> > 
> > Why? If I recall correctly, nodes without any CPUs or any memory are
> > already allowed.
> 
> They are allowed, but it seems like the code throughout qemu (where it's
> > relevant) assumes that NUMA nodes are sequential and contiguous, but
> that's not required (nor is it enforced on the command-line).

It is not being enforced, but you will get weird bugs if you don't do
it. What I suggest is that we start enforcing it instead of magically
creating new nodes when a wrong (too high) ID is provided.

> 
> > How exactly would this patch help you? How do you expect the
> > command-line to look like for your use case?
> 
> Alexey has replied with that, it looks like.

Where? I don't see a reply.

> 
> > > This makes nb_numa_nodes the total number of nodes or the biggest node
> > > number + 1, whichever is greater.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > ---
> > >  vl.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/vl.c b/vl.c
> > > index ac0e3d7..f1b75cb 100644
> > > --- a/vl.c
> > > +++ b/vl.c
> > > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
> > >          if (get_param_value(option, 128, "cpus", optarg) != 0) {
> > >              numa_node_parse_cpus(nodenr, option);
> > >          }
> > > -        nb_numa_nodes++;
> > > +        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
> > 
> > I would instead suggest that if any node in the [0, max_node_id] range
> > is not present on the command-line, QEMU should instead reject the
> > command-line.
> 
> We already check two things:
> 
> Too many nodes (meaning we've filled the array):
>         if (nb_numa_nodes >= MAX_NODES) {
>             fprintf(stderr, "qemu: too many NUMA nodes\n");
>             exit(1);
>         }
> 
> Node ID itself is out of range (due to the use of an array):
>         if (nodenr >= MAX_NODES) {
>             fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
>             exit(1);
>         }
> 
> But that doesn't prevent one from having *sparse* NUMA node IDs. And, as
> far as I can tell, this is allowed by the spec, but isn't properly
> supported by qemu.

Wait, is the node ID visible to the guest at all? I believe it is a
QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
command-line. I would even claim that the parameter is useless and
shouldn't have been introduced in the first place.

What I don't see is: why you need the command-line to look like:
  -numa node,id=1,mem=X
when you can simply write it as:
  -numa node,id=0 -numa node,id=1,mem=X

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16 20:11       ` Eduardo Habkost
@ 2014-06-16 20:31         ` Eduardo Habkost
  2014-06-17  0:21           ` Nishanth Aravamudan
  2014-06-17  0:16         ` Nishanth Aravamudan
  1 sibling, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-16 20:31 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 05:11:40PM -0300, Eduardo Habkost wrote:
[...]
> Wait, is the node ID visible to the guest at all? I believe it is a
> QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
> command-line. I would even claim that the parameter is useless and
> shouldn't have been introduced in the first place.
> 
> What I don't see is: why you need the command-line to look like:
>   -numa node,id=1,mem=X
> when you can simply write it as:
>   -numa node,id=0 -numa node,id=1,mem=X

Oh, I believe now I see it: the problem is not just that you need
"memory-less nodes" (which could simply be defined explicitly like
above), but that you need non-contiguous node IDs, which are visible to
the guest.

In this case, my example above with two -numa options would work, but it
would be confusing as the user just wants one node with ID=1 (instead of
two nodes).

So, now your patch makes sense to me. But we first need something to
make sure the following command-line:
  -numa node,id=3 -numa node,id=2
is treated differently from:
  -numa node,id=0 -numa node,id=1 -numa node,id=2 -numa node,id=3

The former should divide the memory in half, between nodes 2 and 3. The
latter should divide the memory in four, among nodes 0, 1, 2, and 3.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-16  8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
  2014-06-16 18:26   ` Nishanth Aravamudan
@ 2014-06-16 20:51   ` Eduardo Habkost
  2014-06-17  0:25     ` Nishanth Aravamudan
  2014-06-17  5:51     ` Alexey Kardashevskiy
  1 sibling, 2 replies; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-16 20:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Nishanth Aravamudan, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> > c4177479 "spapr: make sure RMA is in first mode of first memory node"
> > introduced a regression which prevents running guests with a memoryless
> > NUMA node#0, which may happen on real POWER8 boxes and which would make
> > sense to debug in QEMU.
> > 
> > This patchset's aim is to fix that and also to fix various code problems
> > in memory node generation.
> > 
> > These 2 patches could be merged (the resulting patch looks rather ugly):
> > spapr: Use DT memory node rendering helper for other nodes
> > spapr: Move DT memory node rendering to a helper
> > 
> > Please comment. Thanks!
> > 
> 
> Sure, I forgot to add an example of what I am trying to run without errors
> and warnings:
> 
> /home/aik/qemu-system-ppc64 \
> -enable-kvm \
> -machine pseries \
> -nographic \
> -vga none \
> -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> -device scsi-disk,id=id1,drive=id0 \
> -m 2080 \
> -smp 8 \
> -numa node,nodeid=0,cpus=0-7,memory=0 \
> -numa node,nodeid=2,cpus=0-3,mem=1040 \
> -numa node,nodeid=4,cpus=4-7,mem=1040

(Note: I will ignore the "cpus" argument for the discussion below.)

I understand now that the non-contiguous node IDs are guest-visible.

But I still would like to understand the motivations for your use case,
to understand which solution makes more sense.

If you really want 5 nodes, you just need to write this:
  -numa node,nodeid=0,cpus=0-7,memory=0 \
  -numa node,nodeid=1 \
  -numa node,nodeid=2,cpus=0-3,mem=1040 \
  -numa node,nodeid=3 \
  -numa node,nodeid=4,cpus=4-7,mem=1040

If you just want 3 nodes, you can just write this:
  -numa node,nodeid=0,cpus=0-7,memory=0 \
  -numa node,nodeid=1,cpus=0-3,mem=1040 \
  -numa node,nodeid=4,cpus=4-7,mem=1040

But you seem to claim you need 3 nodes with non-contiguous IDs. In that
case, which exactly is the guest-visible difference you expect to get
between:
  -numa node,nodeid=0,cpus=0-7,memory=0 \
  -numa node,nodeid=1 \
  -numa node,nodeid=2,cpus=0-3,mem=1040 \
  -numa node,nodeid=3 \
  -numa node,nodeid=4,cpus=4-7,mem=1040
and
  -numa node,nodeid=0,cpus=0-7,memory=0 \
  -numa node,nodeid=2,cpus=0-3,mem=1040 \
  -numa node,nodeid=4,cpus=4-7,mem=1040
?

Because your patch is making both be exactly the same, and I guess you
don't want that (otherwise you could simply use the 5-node command-line
above and we wouldn't need patch 7/7).

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16 20:11       ` Eduardo Habkost
  2014-06-16 20:31         ` Eduardo Habkost
@ 2014-06-17  0:16         ` Nishanth Aravamudan
  1 sibling, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17  0:16 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:11:40 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 11:49:46AM -0700, Nishanth Aravamudan wrote:
> > On 16.06.2014 [13:15:00 -0300], Eduardo Habkost wrote:
> > > On Mon, Jun 16, 2014 at 05:53:53PM +1000, Alexey Kardashevskiy wrote:
> > > > Currently NUMA node IDs must be consecutive, and QEMU ignores nodes
> > > > with a @nodeid bigger than the number of NUMA nodes on the command line.
> > > 
> > > Why would somebody need a NUMA node with nodeid bigger than the number
> > > of NUMA nodes? NUMA node IDs must be in the [0, numa_nodes-1] range.
> > 
> > That is not how the code works currently.
> > 
> > vl.c::numa_add()
> > ...
> >         if (get_param_value(option, 128, "nodeid", optarg) == 0) {
> >             nodenr = nb_numa_nodes;
> >         } else {
> >             if (parse_uint_full(option, &nodenr, 10) < 0) {
> >                 fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
> >                 exit(1);
> >             }
> >         }
> > ...
> >         if (get_param_value(option, 128, "mem", optarg) == 0) {
> >             node_mem[nodenr] = 0;
> >         } else {
> >             int64_t sval;
> >             sval = strtosz(option, &endptr);
> >             if (sval < 0 || *endptr) {
> >                 fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
> >                 exit(1);
> >             }
> >             node_mem[nodenr] = sval;
> >         }
> > ...
> > 	nb_numa_nodes++;
> > ...
> > 
> > So if a user passes nodeid= to the NUMA node definition, that entry in
> > node_mem is set to the appropriate value, but nb_numa_nodes, which is
> > used to bound the iteration of that array is not bumped appropriately.
> > So we end up looking at arbitrary indices in the node_mem array, which
> > are often 0.

<snip, replying to this part in your follow-up>
 
> > Note also that means that we can't generically differentiate here
> > between a user-defined memoryless node and one that happens to be 0
> > because the particular nodeid was not specified on the command-line.
> 
> That's because all nodeids must be specified in the command-line.
> Accepting omitted nodes is a bug which should be fixed.

Fair enough, but that's certainly not enforced. In fact, the way the
code is written right now (in hw/ppc/spapr.c and hw/i386/pc.c, e.g.) it
seems like the architecture code assumes nodeids are sequential (0 to
nb_numa_nodes), but the generic code puts values in node_mem based upon
the nodeid specified on the command-line. That's the bug Alexey's 7/7
patch is attempting to fix, I think.

> > Alexey, how do you differentiate between these two cases after your
> > patches? In patch 3, I see you check (and skip in the loop) explicitly
> > if !node_mem[nodeid], but I'm not sure how that check can differentiate
> > between the statically 0 (from main's initialization loop) and when a
> > user says a node's memory is 0. Probably something obvious I'm missing
> > (it is Monday after all)...
> > 
> > > > This prevents us from creating memory-less nodes, which is a possible
> > > > situation on POWERPC under pHyp or Sapphire.
> > > 
> > > Why? If I recall correctly, nodes without any CPUs or any memory are
> > > already allowed.
> > 
> > They are allowed, but it seems like the code throughout qemu (where it's
> > relevant) assumes that NUMA nodes are sequential and contiguous, but
> > that's not required (nor is it enforced on the command-line).
> 
> It is not being enforced, but you will get weird bugs if you don't do
> it. What I suggest is that we start enforcing it instead of magically
> creating new nodes when a wrong (too high) ID is provided.
> 
> > 
> > > How exactly would this patch help you? How do you expect the
> > > command-line to look like for your use case?
> > 
> > Alexey has replied with that, it looks like.
> 
> Where? I don't see a reply.
> 
> > 
> > > > This makes nb_numa_nodes the total number of nodes or the biggest node
> > > > number + 1, whichever is greater.
> > > > 
> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > ---
> > > >  vl.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/vl.c b/vl.c
> > > > index ac0e3d7..f1b75cb 100644
> > > > --- a/vl.c
> > > > +++ b/vl.c
> > > > @@ -1356,7 +1356,7 @@ static void numa_add(const char *optarg)
> > > >          if (get_param_value(option, 128, "cpus", optarg) != 0) {
> > > >              numa_node_parse_cpus(nodenr, option);
> > > >          }
> > > > -        nb_numa_nodes++;
> > > > +        nb_numa_nodes = MAX(nb_numa_nodes + 1, nodenr + 1);
> > > 
> > > I would instead suggest that if any node in the [0, max_node_id] range
> > > is not present on the command-line, QEMU should instead reject the
> > > command-line.
> > 
> > We already check two things:
> > 
> > Too many nodes (meaning we've filled the array):
> >         if (nb_numa_nodes >= MAX_NODES) {
> >             fprintf(stderr, "qemu: too many NUMA nodes\n");
> >             exit(1);
> >         }
> > 
> > Node ID itself is out of range (due to the use of an array):
> >         if (nodenr >= MAX_NODES) {
> >             fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
> >             exit(1);
> >         }
> > 
> > But that doesn't prevent one from having *sparse* NUMA node IDs. And, as
> > far as I can tell, this is allowed by the spec, but isn't properly
> > supported by qemu.
> 
> Wait, is the node ID visible to the guest at all? I believe it is a
> QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
> command-line. I would even claim that the parameter is useless and
> shouldn't have been introduced in the first place.

Nodeids show up in the topology visible to Linux, at least on ppc.

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes
  2014-06-16 20:31         ` Eduardo Habkost
@ 2014-06-17  0:21           ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17  0:21 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:31:24 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 05:11:40PM -0300, Eduardo Habkost wrote:
> [...]
> > Wait, is the node ID visible to the guest at all? I believe it is a
> > QEMU-internal thing, just to allow the NUMA nodes to be ordered in the
> > command-line. I would even claim that the parameter is useless and
> > shouldn't have been introduced in the first place.
> > 
> > What I don't see is: why you need the command-line to look like:
> >   -numa node,id=1,mem=X
> > when you can simply write it as:
> >   -numa node,id=0 -numa node,id=1,mem=X
> 
> Oh, I believe now I see it: the problem is not just that you need
> "memory-less nodes" (which could simply be defined explicitly like
> above), but that you need non-contiguous node IDs, which are visible to
> the guest.

Well, for powerpc at least, some changes are needed (the prior
patches) to get memory-less node 0 supported.

> In this case, my example above with two -numa options would work, but it
> would be confusing as the user just wants one node with ID=1 (instead of
> two nodes).

Yep, exactly. This is something we see on PowerVM, and it tends to trip
up the kernel (or lead to corner cases) and it would be nice to be able
to emulate these topologies with KVM.

> So, now your patch makes sense to me. But we first need something to
> make sure the following command-line:
>   -numa node,id=3 -numa node,id=2
> be different from:
>   -numa node,id=0 -numa node,id=1 -numa node,id=2 -numa node,id=3
> 
> The former should divide the memory in half, between nodes 2 and 3. The
> latter should divide the memory in four, among nodes 0, 1, 2, and 3.

Agreed on both those counts, but I think there is another case to ensure
we get right as well:

-numa node,id=3,mem=0 -numa node,id=2

All of the memory for the guest should be on node 2, but node 3 should
exist and be memoryless. But how do you differentiate between the values
in node_mem that started out zero and those that were set to 0 by the
user's arguments?

My first guess is that node_mem needs to be turned into a signed array
and we stuff -1 in there (currently 0 is written). Then we know any
non-negative entries are those specified by the user? Dunno if we care
about the reduction in per-node maximum memory (down to 2^63 from 2^64)?

Perhaps the resolution is the same as for your case: just making it
explicit which nodes were passed by the user in node_mem, and counting
the total correctly, should let us do the right thing?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-16 20:51   ` Eduardo Habkost
@ 2014-06-17  0:25     ` Nishanth Aravamudan
  2014-06-17  1:37       ` Eduardo Habkost
  2014-06-17  1:41       ` Eduardo Habkost
  2014-06-17  5:51     ` Alexey Kardashevskiy
  1 sibling, 2 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17  0:25 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:51:50 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> > On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> > > c4177479 "spapr: make sure RMA is in first mode of first memory node"
> > > introduced a regression which prevents running guests with a memoryless
> > > NUMA node#0, which may happen on real POWER8 boxes and which would make
> > > sense to debug in QEMU.
> > > 
> > > This patchset's aim is to fix that and also to fix various code problems
> > > in memory node generation.
> > > 
> > > These 2 patches could be merged (the resulting patch looks rather ugly):
> > > spapr: Use DT memory node rendering helper for other nodes
> > > spapr: Move DT memory node rendering to a helper
> > > 
> > > Please comment. Thanks!
> > > 
> > 
> > Sure I forgot to add an example of what I am trying to run without errors
> > and warnings:
> > 
> > /home/aik/qemu-system-ppc64 \
> > -enable-kvm \
> > -machine pseries \
> > -nographic \
> > -vga none \
> > -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> > -device scsi-disk,id=id1,drive=id0 \
> > -m 2080 \
> > -smp 8 \
> > -numa node,nodeid=0,cpus=0-7,memory=0 \
> > -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> (Note: I will ignore the "cpus" argument for the discussion below.)
> 
> I understand now that the non-contiguous node IDs are guest-visible.
> 
> But I still would like to understand the motivations for your use case,
> to understand which solution makes more sense.
> 
> If you really want 5 nodes, you just need to write this:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=3 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> If you just want 3 nodes, you can just write this:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1,cpus=0-3,mem=1040 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040

No, this doesn't do what you think it would :)

nb_numa_nodes = 3

but node_mem[0] = 0
node_mem[1] = 1040
node_mem[2] = 0
node_mem[3] = 0
node_mem[4] = 1040

Because of the generic parsing of the numa options.

I'd need to look at my test case again (and this is reproducible on
x86), but I believe it's actually worse if you skip node 0 altogether,
e.g.:

   -numa node,nodeid=1,cpus=0-7,memory=0 \
   -numa node,nodeid=2,cpus=0-3,mem=1040 \
   -numa node,nodeid=4,cpus=4-7,mem=1040

Node 0 will have node 4's memory (because we put the rest there, iirc),
and the cpus that should be on node 4 are on node 0 as well.

I'll try to get the exact test results later.

In any case, it's confusing when the topology you see in Linux differs
from what the command-line says.

> But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> case, which exactly is the guest-visible difference you expect to get
> between:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=3 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040

I guess here you'd see 5 NUMA nodes in Linux, with 0, 1 and 3 having no
memory.

> and
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> ?

And here you'd see 3 NUMA nodes in Linux, with 0 having no memory. I
would think the principle of least surprise means qemu doesn't change
the topology from the user-requested one without any indication that
it's happening?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17  0:25     ` Nishanth Aravamudan
@ 2014-06-17  1:37       ` Eduardo Habkost
  2014-06-17 18:36         ` Nishanth Aravamudan
  2014-06-17  1:41       ` Eduardo Habkost
  1 sibling, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-17  1:37 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 05:25:00PM -0700, Nishanth Aravamudan wrote:
> On 16.06.2014 [17:51:50 -0300], Eduardo Habkost wrote:
> > On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> > > On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> > > > c4177479 "spapr: make sure RMA is in first mode of first memory node"
> > > > introduced a regression which prevents running guests with a memoryless
> > > > NUMA node#0, which may happen on real POWER8 boxes and which would make
> > > > sense to debug in QEMU.
> > > > 
> > > > This patchset's aim is to fix that and also to fix various code problems
> > > > in memory node generation.
> > > > 
> > > > These 2 patches could be merged (the resulting patch looks rather ugly):
> > > > spapr: Use DT memory node rendering helper for other nodes
> > > > spapr: Move DT memory node rendering to a helper
> > > > 
> > > > Please comment. Thanks!
> > > > 
> > > 
> > > Sure I forgot to add an example of what I am trying to run without errors
> > > and warnings:
> > > 
> > > /home/aik/qemu-system-ppc64 \
> > > -enable-kvm \
> > > -machine pseries \
> > > -nographic \
> > > -vga none \
> > > -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> > > -device scsi-disk,id=id1,drive=id0 \
> > > -m 2080 \
> > > -smp 8 \
> > > -numa node,nodeid=0,cpus=0-7,memory=0 \
> > > -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > > -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > (Note: I will ignore the "cpus" argument for the discussion below.)
> > 
> > I understand now that the non-contiguous node IDs are guest-visible.
> > 
> > But I still would like to understand the motivations for your use case,
> > to understand which solution makes more sense.
> > 
> > If you really want 5 nodes, you just need to write this:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=3 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > If you just want 3 nodes, you can just write this:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> No, this doesn't do what you think it would :)

That was a typo. I meant:
   -numa node,nodeid=0,cpus=0-7,memory=0 \
   -numa node,nodeid=1,cpus=0-3,mem=1040 \
   -numa node,nodeid=2,cpus=4-7,mem=1040

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17  0:25     ` Nishanth Aravamudan
  2014-06-17  1:37       ` Eduardo Habkost
@ 2014-06-17  1:41       ` Eduardo Habkost
  2014-06-17 18:37         ` Nishanth Aravamudan
  1 sibling, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-17  1:41 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Mon, Jun 16, 2014 at 05:25:00PM -0700, Nishanth Aravamudan wrote:
[...]
> > But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> > case, which exactly is the guest-visible difference you expect to get
> > between:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=3 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> I guess here you'd see 5 NUMA nodes in Linux, with 0, 1 and 3 having no
> memory.
> 
> > and
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > ?
> 
> And here you'd see 3 NUMA nodes in Linux, with 0 having no memory. I
> would think the principle of least surprise means qemu doesn't change
> the topology from the user-requested one without any indicate that's
> happening?

OK, so we are on the same page. The problem is: with your patch, both
cases above are exactly the same. That's what confused me: I thought you
wanted to implement the former, but you want the latter.

When changing the code to allow non-contiguous node IDs, I believe we
need to differentiate those cases and implement both correctly.
Otherwise we will be forced to break compatibility in the future.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-16 20:51   ` Eduardo Habkost
  2014-06-17  0:25     ` Nishanth Aravamudan
@ 2014-06-17  5:51     ` Alexey Kardashevskiy
  2014-06-17 14:07       ` Eduardo Habkost
  1 sibling, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-17  5:51 UTC (permalink / raw)
  To: Eduardo Habkost; +Cc: Nishanth Aravamudan, qemu-ppc, qemu-devel, Alexander Graf

On 06/17/2014 06:51 AM, Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
>> On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
>>> c4177479 "spapr: make sure RMA is in first mode of first memory node"
>>> introduced regression which prevents from running guests with memoryless
>>> NUMA node#0 which may happen on real POWER8 boxes and which would make
>>> sense to debug in QEMU.
>>>
>>> This patchset aim is to fix that and also fix various code problems in
>>> memory nodes generation.
>>>
>>> These 2 patches could be merged (the resulting patch looks rather ugly):
>>> spapr: Use DT memory node rendering helper for other nodes
>>> spapr: Move DT memory node rendering to a helper
>>>
>>> Please comment. Thanks!
>>>
>>
>> Sure I forgot to add an example of what I am trying to run without errors
>> and warnings:
>>
>> /home/aik/qemu-system-ppc64 \
>> -enable-kvm \
>> -machine pseries \
>> -nographic \
>> -vga none \
>> -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
>> -device scsi-disk,id=id1,drive=id0 \
>> -m 2080 \
>> -smp 8 \
>> -numa node,nodeid=0,cpus=0-7,memory=0 \
>> -numa node,nodeid=2,cpus=0-3,mem=1040 \
>> -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> (Note: I will ignore the "cpus" argument for the discussion below.)

The example is quite bad; I should not have used the same CPUs in two
nodes. SPAPR allows this, but QEMU does not really support it and I am
not touching that now.


> 
> I understand now that the non-contiguous node IDs are guest-visible.
> 
> But I still would like to understand the motivations for your use case,
> to understand which solution makes more sense.

One example is two CPUs on one die where one CPU is connected to the
memory bus and the other is not; instead, the second CPU is connected to
the first one (via a super-fast bus) and the first CPU acts as a bridge.



> If you really want 5 nodes, you just need to write this:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=3 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> If you just want 3 nodes, you can just write this:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1,cpus=0-3,mem=1040 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> 
> But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> case, which exactly is the guest-visible difference you expect to get
> between:
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=1 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=3 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> and
>   -numa node,nodeid=0,cpus=0-7,memory=0 \
>   -numa node,nodeid=2,cpus=0-3,mem=1040 \
>   -numa node,nodeid=4,cpus=4-7,mem=1040
> ?
> 
> Because your patch is making both be exactly the same, and I guess you
> don't want that (otherwise you could simply use the 5-node command-line
> above and we wouldn't need patch 7/7).

If that is the canonical and kosher way of using NUMA in QEMU, OK, we
can use it. I just fail to see why we need to require node IDs to be
consecutive here. And it confuses me a bit as a user that I can add
"-numa node,nodeid=22" (no memory, no cpus) but do not get to see that
node in the guest.


By the way, how is this supposed to work with memory hotplug? The
current "-numa" does not support gaps in memory, and I would expect that
we will need them. Any plans here?


-- 
Alexey


* Re: [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks Alexey Kardashevskiy
@ 2014-06-17  7:07   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-17  7:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Nishanth Aravamudan, qemu-ppc, Alexander Graf

On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> Linux kernel expects nodes to have power-of-two size and
> does WARN_ON if this is not the case:
> [    0.041052] devtmpfs: initialized
> [    0.041292] ------------[ cut here ]------------
> [    0.041456] WARNING: at drivers/base/memory.c:115
> 
> This splits memory nodes into set of smaller blocks with
> a size which is a power of two. This makes sure the start
> address of every node is aligned to the node size.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 666b676..10202e2 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -710,8 +710,18 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>              mem_start += spapr->rma_size;
>              node_size -= spapr->rma_size;
>          }
> -        spapr_populate_memory_node(fdt, i, mem_start, node_size);
> -        mem_start += node_size;
> +        for ( ; node_size; ) {
> +            hwaddr sizetmp = pow2floor(node_size);
> +
> +            /* mem_start != 0 here */
> +            if (ffs(mem_start) < ffs(sizetmp)) {
> +                sizetmp = MIN(sizetmp, 1 << (ffs(mem_start) - 1));


And I do not need MIN() here. Uff.


> +            }
> +
> +            spapr_populate_memory_node(fdt, i, mem_start, sizetmp);
> +            node_size -= sizetmp;
> +            mem_start += sizetmp;
> +        }
>      }
>  
>      return 0;
> 
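For reference, the splitting logic discussed above can be written as a
standalone loop. This is an illustrative sketch, not the committed
patch: pow2floor() is a local reimplementation of QEMU's helper, and
split_region() with its output arrays is made up for demonstration. It
also shows why the MIN() is redundant: whenever ffs(start) is smaller
than ffs(sizetmp), the alignment-derived size 1 << (ffs(start) - 1) is
already the smaller of the two.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for QEMU's pow2floor(): largest power of two <= value. */
static uint64_t pow2floor(uint64_t value)
{
    uint64_t ret = 1;

    while (value >>= 1) {
        ret <<= 1;
    }
    return ret;
}

/* Split [start, start + size) into blocks whose size is a power of two
 * and whose start address is aligned to the block size.  Returns the
 * number of blocks, writing each block into starts[]/sizes[]. */
static int split_region(uint64_t start, uint64_t size,
                        uint64_t *starts, uint64_t *sizes)
{
    int n = 0;

    while (size) {
        uint64_t sizetmp = pow2floor(size);

        /* If start is less aligned than sizetmp, the alignment of start
         * limits the block; no MIN() needed since the shifted value is
         * already smaller than sizetmp here. */
        if (start && __builtin_ffsll(start) < __builtin_ffsll(sizetmp)) {
            sizetmp = 1ULL << (__builtin_ffsll(start) - 1);
        }
        starts[n] = start;
        sizes[n] = sizetmp;
        n++;
        start += sizetmp;
        size -= sizetmp;
    }
    return n;
}
```

For example, a region at start 6 of size 10 splits into (6, 2) and
(8, 8): each block start is aligned to its own size.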


-- 
Alexey


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17  5:51     ` Alexey Kardashevskiy
@ 2014-06-17 14:07       ` Eduardo Habkost
  2014-06-17 18:38         ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-17 14:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Nishanth Aravamudan, qemu-ppc, qemu-devel, Alexander Graf

On Tue, Jun 17, 2014 at 03:51:35PM +1000, Alexey Kardashevskiy wrote:
> On 06/17/2014 06:51 AM, Eduardo Habkost wrote:
> > On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> >>> c4177479 "spapr: make sure RMA is in first mode of first memory node"
> >>> introduced regression which prevents from running guests with memoryless
> >>> NUMA node#0 which may happen on real POWER8 boxes and which would make
> >>> sense to debug in QEMU.
> >>>
> >>> This patchset aim is to fix that and also fix various code problems in
> >>> memory nodes generation.
> >>>
> >>> These 2 patches could be merged (the resulting patch looks rather ugly):
> >>> spapr: Use DT memory node rendering helper for other nodes
> >>> spapr: Move DT memory node rendering to a helper
> >>>
> >>> Please comment. Thanks!
> >>>
> >>
> >> Sure I forgot to add an example of what I am trying to run without errors
> >> and warnings:
> >>
> >> /home/aik/qemu-system-ppc64 \
> >> -enable-kvm \
> >> -machine pseries \
> >> -nographic \
> >> -vga none \
> >> -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> >> -device scsi-disk,id=id1,drive=id0 \
> >> -m 2080 \
> >> -smp 8 \
> >> -numa node,nodeid=0,cpus=0-7,memory=0 \
> >> -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >> -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > (Note: I will ignore the "cpus" argument for the discussion below.)
> 
> The example is quite bad, I should not have used same CPUs in 2 nodes.
> SPAPR allows this but QEMU does not really support this and I am not
> touching this now.
> 
> 
> > 
> > I understand now that the non-contiguous node IDs are guest-visible.
> > 
> > But I still would like to understand the motivations for your use case,
> > to understand which solution makes more sense.
> 
> One of examples is a 2 CPUs on one die, one of CPUs is connected to memory
> bus, the other is not, instead it is connected to the first CPU (via super
> fast bus) and the first CPU acts as a bridge.
> 
> 
> 
> > If you really want 5 nodes, you just need to write this:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=3 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > If you just want 3 nodes, you can just write this:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> > case, which exactly is the guest-visible difference you expect to get
> > between:
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=1 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=3 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > and
> >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > ?
> > 
> > Because your patch is making both be exactly the same, and I guess you
> > don't want that (otherwise you could simply use the 5-node command-line
> > above and we wouldn't need patch 7/7).
> 
> If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> I just fail to see why we need a requirement for nodes to go consequently
> here. And it confuses me as a user a bit if I can add "-numa
> node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.

I agree with you it is confusing. But before we support that use case,
we need to make sure auto-allocation is handled properly, because it
would be hard to fix it later without breaking compatibility.

We probably just need a "present" field on struct NodeInfo, so
machine-specific code and auto-allocation code can differentiate nodes
that are not present on the command-line from empty nodes that were
specified in the command-line.

In the meantime, people can use the 5-node example above as a
workaround.
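
To make the idea concrete, here is a hedged sketch of what such a
"present" field could look like. The struct layout, the MAX_NODES
value, and count_present_nodes() are assumptions for illustration, not
the actual NodeInfo definition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 128 /* assumed limit; QEMU's real value may differ */

/* Illustrative NodeInfo with a "present" flag, so code can tell a node
 * omitted from the command line apart from one specified empty. */
typedef struct NodeInfo {
    uint64_t node_mem;
    bool present; /* set while parsing each -numa node,... option */
} NodeInfo;

static NodeInfo numa_info[MAX_NODES];

/* Count guest-visible nodes: empty-but-specified nodes count,
 * never-specified IDs in the holes do not. */
static int count_present_nodes(int max_nodeid_seen)
{
    int i, n = 0;

    for (i = 0; i <= max_nodeid_seen; i++) {
        if (numa_info[i].present) {
            n++;
        }
    }
    return n;
}
```

With nodeids 0, 2 and 4 on the command line, only those three entries
are marked present, so the sparse layout survives while auto-allocation
can still skip the holes.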

> 
> btw how is it supposed to work with memory hotplug? Current "-numa" does
> not support gaps in memory and I would expect that we will need it. Any
> plans here?

The DIMM device used for memory hotplug has a "node" property, for the
NUMA node ID.

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17  1:37       ` Eduardo Habkost
@ 2014-06-17 18:36         ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17 18:36 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [22:37:00 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 05:25:00PM -0700, Nishanth Aravamudan wrote:
> > On 16.06.2014 [17:51:50 -0300], Eduardo Habkost wrote:
> > > On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> > > > On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> > > > > c4177479 "spapr: make sure RMA is in first mode of first memory node"
> > > > > introduced regression which prevents from running guests with memoryless
> > > > > NUMA node#0 which may happen on real POWER8 boxes and which would make
> > > > > sense to debug in QEMU.
> > > > > 
> > > > > This patchset aim is to fix that and also fix various code problems in
> > > > > memory nodes generation.
> > > > > 
> > > > > These 2 patches could be merged (the resulting patch looks rather ugly):
> > > > > spapr: Use DT memory node rendering helper for other nodes
> > > > > spapr: Move DT memory node rendering to a helper
> > > > > 
> > > > > Please comment. Thanks!
> > > > > 
> > > > 
> > > > Sure I forgot to add an example of what I am trying to run without errors
> > > > and warnings:
> > > > 
> > > > /home/aik/qemu-system-ppc64 \
> > > > -enable-kvm \
> > > > -machine pseries \
> > > > -nographic \
> > > > -vga none \
> > > > -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> > > > -device scsi-disk,id=id1,drive=id0 \
> > > > -m 2080 \
> > > > -smp 8 \
> > > > -numa node,nodeid=0,cpus=0-7,memory=0 \
> > > > -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > > > -numa node,nodeid=4,cpus=4-7,mem=1040
> > > 
> > > (Note: I will ignore the "cpus" argument for the discussion below.)
> > > 
> > > I understand now that the non-contiguous node IDs are guest-visible.
> > > 
> > > But I still would like to understand the motivations for your use case,
> > > to understand which solution makes more sense.
> > > 
> > > If you really want 5 nodes, you just need to write this:
> > >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> > >   -numa node,nodeid=1 \
> > >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > >   -numa node,nodeid=3 \
> > >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > > 
> > > If you just want 3 nodes, you can just write this:
> > >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> > >   -numa node,nodeid=1,cpus=0-3,mem=1040 \
> > >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > No, this doesn't do what you think it would :)
> 
> That was a typo. I meant:
>    -numa node,nodeid=0,cpus=0-7,memory=0 \
>    -numa node,nodeid=1,cpus=0-3,mem=1040 \
>    -numa node,nodeid=2,cpus=4-7,mem=1040

Ok, that's fair, but I think we all agree now that the goal is two-fold:

1) properly support memoryless nodes (including node 0) on ppc
2) properly support sparse NUMA node numbering

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17  1:41       ` Eduardo Habkost
@ 2014-06-17 18:37         ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17 18:37 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [22:41:08 -0300], Eduardo Habkost wrote:
> On Mon, Jun 16, 2014 at 05:25:00PM -0700, Nishanth Aravamudan wrote:
> [...]
> > > But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> > > case, which exactly is the guest-visible difference you expect to get
> > > between:
> > >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> > >   -numa node,nodeid=1 \
> > >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > >   -numa node,nodeid=3 \
> > >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > 
> > I guess here you'd see 5 NUMA nodes in Linux, with 0, 1 and 3 having no
> > memory.
> > 
> > > and
> > >   -numa node,nodeid=0,cpus=0-7,memory=0 \
> > >   -numa node,nodeid=2,cpus=0-3,mem=1040 \
> > >   -numa node,nodeid=4,cpus=4-7,mem=1040
> > > ?
> > 
> > And here you'd see 3 NUMA nodes in Linux, with 0 having no memory. I
> > would think the principle of least surprise means qemu doesn't change
> > the topology from the user-requested one without any indicate that's
> > happening?
> 
> OK, so we are on the same page. The problem is: with your patch, both
> cases above are exactly the same. That's what confused me: I thought you
> wanted to implement the former, but you want the latter.
> 
> When changing the code to allow non-contiguous node IDs, I believe we
> need to differentiate those cases and implement both correctly.
> Otherwise we will be forced to break compatibility in the future.

Yep, I think we need at least one follow-on (or additional) patch in
Alexey's series.

-Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17 14:07       ` Eduardo Habkost
@ 2014-06-17 18:38         ` Nishanth Aravamudan
  2014-06-17 19:22           ` Eduardo Habkost
  0 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-17 18:38 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 17.06.2014 [11:07:00 -0300], Eduardo Habkost wrote:
<snip>
> > If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> > I just fail to see why we need a requirement for nodes to go consequently
> > here. And it confuses me as a user a bit if I can add "-numa
> > node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.
> 
> I agree with you it is confusing. But before we support that use case,
> we need to make sure auto-allocation is handled properly, because it
> would be hard to fix it later without breaking compatibility.
> 
> We probably just need a "present" field on struct NodeInfo, so
> machine-specific code and auto-allocation code can differentiate nodes
> that are not present on the command-line from empty nodes that were
> specified in the command-line.

What/where is struct NodeInfo?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17 18:38         ` Nishanth Aravamudan
@ 2014-06-17 19:22           ` Eduardo Habkost
  2014-06-18 18:28             ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-17 19:22 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Tue, Jun 17, 2014 at 11:38:16AM -0700, Nishanth Aravamudan wrote:
> On 17.06.2014 [11:07:00 -0300], Eduardo Habkost wrote:
> <snip>
> > > If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> > > I just fail to see why we need a requirement for nodes to go consequently
> > > here. And it confuses me as a user a bit if I can add "-numa
> > > node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.
> > 
> > I agree with you it is confusing. But before we support that use case,
> > we need to make sure auto-allocation is handled properly, because it
> > would be hard to fix it later without breaking compatibility.
> > 
> > We probably just need a "present" field on struct NodeInfo, so
> > machine-specific code and auto-allocation code can differentiate nodes
> > that are not present on the command-line from empty nodes that were
> > specified in the command-line.
> 
> What/where is struct NodeInfo?

It was introduced very recently. See the pull request at:

  From: "Michael S. Tsirkin" <mst@redhat.com>
  Message-ID: <1403021756-15960-1-git-send-email-mst@redhat.com>
  Subject: [Qemu-devel] [PULL 000/103] pc, pci, virtio, hotplug fixes, enhancements for 2.1

  git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
@ 2014-06-18  5:04   ` Alexey Kardashevskiy
  2014-06-20 19:10   ` Nishanth Aravamudan
  2014-06-20 22:55   ` Nishanth Aravamudan
  2 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-18  5:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Nishanth Aravamudan, qemu-ppc, Alexander Graf

On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> Current QEMU does not support memoryless NUMA nodes.
> This prepares SPAPR for that.
> 
> This moves 2 calls of spapr_populate_memory_node() into
> the existing loop which handles nodes other than than
> the first one.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>  1 file changed, 11 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index cb3a10a..666b676 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>  
>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>  {
> -    hwaddr node0_size, mem_start, node_size;
> +    hwaddr mem_start, node_size;
>      int i;
>  
> -    /* memory node(s) */
> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> -        node0_size = node_mem[0];
> -    } else {
> -        node0_size = ram_size;
> -    }
> -
> -    /* RMA */
> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> -
> -    /* RAM: Node 0 */
> -    if (node0_size > spapr->rma_size) {
> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> -                                   node0_size - spapr->rma_size);
> -    }
> -
> -    /* RAM: Node 1 and beyond */
> -    mem_start = node0_size;
> -    for (i = 1; i < nb_numa_nodes; i++) {


With no NUMA configured, this generates no memory nodes at all.
Embarrassing bug :)


> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> +        if (!node_mem[i]) {
> +            continue;
> +        }
>          if (mem_start >= ram_size) {
>              node_size = 0;
>          } else {
> @@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>                  node_size = ram_size - mem_start;
>              }
>          }
> +        if (!mem_start) {
> +            /* ppc_spapr_init() checks for rma_size <= node0_size already */
> +            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +            mem_start += spapr->rma_size;
> +            node_size -= spapr->rma_size;
> +        }
>          spapr_populate_memory_node(fdt, i, mem_start, node_size);
>          mem_start += node_size;
>      }
> 
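
A hedged sketch of how the zero-node case noted above could be handled.
The fallback logic and the stubbed state below are assumptions for
illustration only, not the fix that was actually committed:

```c
#include <assert.h>
#include <stdint.h>

/* Stubs standing in for QEMU machine state; values are test data. */
static uint64_t ram_size = 2048;
static uint64_t rma_size = 512;
static int n_nodes_written;

/* Stub for spapr_populate_memory_node(): just count invocations. */
static void populate_node(int nodeid, uint64_t start, uint64_t size)
{
    (void)nodeid; (void)start; (void)size;
    n_nodes_written++;
}

/* Possible fallback: with zero configured NUMA nodes, emit node 0
 * covering the RMA plus the rest of RAM, mirroring the pre-refactor
 * behaviour, then return before the per-node loop. */
static void populate_memory(int nb_numa_nodes)
{
    if (nb_numa_nodes == 0) {
        populate_node(0, 0, rma_size);
        if (ram_size > rma_size) {
            populate_node(0, rma_size, ram_size - rma_size);
        }
        return;
    }
    /* ... NUMA case as in the patch above ... */
}
```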


-- 
Alexey


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-17 19:22           ` Eduardo Habkost
@ 2014-06-18 18:28             ` Nishanth Aravamudan
  2014-06-18 19:33               ` Eduardo Habkost
  0 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-18 18:28 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 17.06.2014 [16:22:33 -0300], Eduardo Habkost wrote:
> On Tue, Jun 17, 2014 at 11:38:16AM -0700, Nishanth Aravamudan wrote:
> > On 17.06.2014 [11:07:00 -0300], Eduardo Habkost wrote:
> > <snip>
> > > > If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> > > > I just fail to see why we need a requirement for nodes to go consequently
> > > > here. And it confuses me as a user a bit if I can add "-numa
> > > > node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.
> > > 
> > > I agree with you it is confusing. But before we support that use case,
> > > we need to make sure auto-allocation is handled properly, because it
> > > would be hard to fix it later without breaking compatibility.
> > > 
> > > We probably just need a "present" field on struct NodeInfo, so
> > > machine-specific code and auto-allocation code can differentiate nodes
> > > that are not present on the command-line from empty nodes that were
> > > specified in the command-line.
> > 
> > What/where is struct NodeInfo?
> 
> It was introduced very recently. See the pull request at:
> 
>   From: "Michael S. Tsirkin" <mst@redhat.com>
>   Message-ID: <1403021756-15960-1-git-send-email-mst@redhat.com>
>   Subject: [Qemu-devel] [PULL 000/103] pc, pci, virtio, hotplug fixes, enhancements for 2.1
> 
>   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

Ah thank you very much!

Before I get cracking on some patches, wanted to clarify some things:

1) We need something like a "present" field to deal with
auto-allocation, which indicates a user-specified NUMA node.

2) We need something like a "defined" field to indicate which entries
are actually valid and which aren't just 0 because they were never set,
in order to support sparse node numbering.
	2a) We could add a defined field to indicate the defined
	entries, iterate over the entire array and skip those not
	defined [keeps index:nodeid mapping, changes all loops]
	2b) We could add a nodeid field to indicate the defined entries,
	iterate over only nb_numa_nodes [breaks index:nodeid, keeps
	loops the same, but requires using the nodeid member in the
	loops, not guaranteed for the array to be sorted on nodeid]

I'm currently in favor of 2b, with perhaps a call to qsort on the array
after parsing to sort by node id? I'd have to audit the users of the
array to make sure they use the nodeid member and not the index, but
that should be straightforward.
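
A hedged sketch of option 2b — the struct layout and comparator below
are illustrative assumptions, not the eventual implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-node record for option 2b: the array holds only the
 * nodes actually given on the command line, and each entry carries its
 * guest-visible node ID explicitly. */
typedef struct NodeInfo {
    uint32_t nodeid;
    uint64_t node_mem;
} NodeInfo;

/* qsort() comparator so the array ends up ordered by node ID even if
 * the -numa options were supplied out of order. */
static int nodeinfo_cmp(const void *a, const void *b)
{
    const NodeInfo *na = a;
    const NodeInfo *nb = b;

    return (na->nodeid > nb->nodeid) - (na->nodeid < nb->nodeid);
}
```

After parsing, one call sorts the array:
qsort(numa_info, nb_numa_nodes, sizeof(NodeInfo), nodeinfo_cmp);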

These patches would probably need to go in before Alexey's series
(Alexey, I can rebase your patches on top, if you want).

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-18 18:28             ` Nishanth Aravamudan
@ 2014-06-18 19:33               ` Eduardo Habkost
  2014-06-18 23:58                 ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Eduardo Habkost @ 2014-06-18 19:33 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On Wed, Jun 18, 2014 at 11:28:53AM -0700, Nishanth Aravamudan wrote:
> On 17.06.2014 [16:22:33 -0300], Eduardo Habkost wrote:
> > On Tue, Jun 17, 2014 at 11:38:16AM -0700, Nishanth Aravamudan wrote:
> > > On 17.06.2014 [11:07:00 -0300], Eduardo Habkost wrote:
> > > <snip>
> > > > > If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> > > > > I just fail to see why we need a requirement for nodes to go consequently
> > > > > here. And it confuses me as a user a bit if I can add "-numa
> > > > > node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.
> > > > 
> > > > I agree with you it is confusing. But before we support that use case,
> > > > we need to make sure auto-allocation is handled properly, because it
> > > > would be hard to fix it later without breaking compatibility.
> > > > 
> > > > We probably just need a "present" field on struct NodeInfo, so
> > > > machine-specific code and auto-allocation code can differentiate nodes
> > > > that are not present on the command-line from empty nodes that were
> > > > specified in the command-line.
> > > 
> > > What/where is struct NodeInfo?
> > 
> > It was introduced very recently. See the pull request at:
> > 
> >   From: "Michael S. Tsirkin" <mst@redhat.com>
> >   Message-ID: <1403021756-15960-1-git-send-email-mst@redhat.com>
> >   Subject: [Qemu-devel] [PULL 000/103] pc, pci, virtio, hotplug fixes, enhancements for 2.1
> > 
> >   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> 
> Ah thank you very much!
> 
> Before I get cracking on some patches, wanted to clarify some things:
> 
> 1) We need something like a "present" field to deal with
> auto-allocation, which indicates a user-specified NUMA node.
> 
> 2) We need something like a "defined" field to indicate which entries
> are actually valid and which aren't just 0 because they weren't ever set
> in order to support sparse node numbering.
> 	2a) We could add a defined field to indicate the defined
> 	entries, iterate over the entire array and skip those not
> 	defined [keeps index:nodeid mapping, changes all loops]
> 	2b) We could add a nodeid field to indicate the defined entries,
> 	iterate over only nb_numa_nodes [breaks index:nodeid, keeps
> 	loops the same, but requires using the nodeid member in the
> 	loops, not guaranteed for the array to be sorted on nodeid]
> 
> I'm currently in favor of 2b, with perhaps a call to qsort on the array
> after parsing to sort by node id? I'd have to audit the users of the
> array to make sure they use the nodeid member and not the index, but
> that should be straightforward.

As the holes in the node ID space don't seem to be frequently large, and
the ID space is currently very small (we support 8-bit IDs only), 2a
looks much simpler to implement and review. We can always change the
code to use 2b if we decide to support larger node IDs in the future.

(And we don't even need to iterate over the entire array. We just need
to iterate up to the highest ID seen on the command-line.)

-- 
Eduardo


* Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
  2014-06-18 19:33               ` Eduardo Habkost
@ 2014-06-18 23:58                 ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-18 23:58 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, Alexander Graf

On 18.06.2014 [16:33:55 -0300], Eduardo Habkost wrote:
> On Wed, Jun 18, 2014 at 11:28:53AM -0700, Nishanth Aravamudan wrote:
> > On 17.06.2014 [16:22:33 -0300], Eduardo Habkost wrote:
> > > On Tue, Jun 17, 2014 at 11:38:16AM -0700, Nishanth Aravamudan wrote:
> > > > On 17.06.2014 [11:07:00 -0300], Eduardo Habkost wrote:
> > > > <snip>
> > > > > > If it is canonical and kosher way of using NUMA in QEMU, ok, we can use it.
> > > > > > I just fail to see why we need a requirement for nodes to go consequently
> > > > > > here. And it confuses me as a user a bit if I can add "-numa
> > > > > > node,nodeid=22" (no memory, no cpus) but do not get to see it in the guest.
> > > > > 
> > > > > I agree with you it is confusing. But before we support that use case,
> > > > > we need to make sure auto-allocation is handled properly, because it
> > > > > would be hard to fix it later without breaking compatibility.
> > > > > 
> > > > > We probably just need a "present" field on struct NodeInfo, so
> > > > > machine-specific code and auto-allocation code can differentiate nodes
> > > > > that are not present on the command-line from empty nodes that were
> > > > > specified in the command-line.
> > > > 
> > > > What/where is struct NodeInfo?
> > > 
> > > It was introduced very recently. See the pull request at:
> > > 
> > >   From: "Michael S. Tsirkin" <mst@redhat.com>
> > >   Message-ID: <1403021756-15960-1-git-send-email-mst@redhat.com>
> > >   Subject: [Qemu-devel] [PULL 000/103] pc, pci, virtio, hotplug fixes, enhancements for 2.1
> > > 
> > >   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> > 
> > Ah thank you very much!
> > 
> > Before I get cracking on some patches, wanted to clarify some things:
> > 
> > 1) We need something like a "present" field to deal with
> > auto-allocation, which indicates a user-specified NUMA node.
> > 
> > 2) We need something like a "defined" field to indicate which entries
> > are actually valid and which aren't just 0 because they weren't ever set
> > in order to support sparse node numbering.
> > 	2a) We could add a defined field to indicate the defined
> > 	entries, iterate over the entire array and skip those not
> > 	defined [keeps index:nodeid mapping, changes all loops]
> > 	2b) We could add a nodeid field to indicate the defined entries,
> > 	iterate over only nb_numa_nodes [breaks index:nodeid, keeps
> > 	loops the same, but requires using the nodeid member in the
> > 	loops, not guaranteed for the array to be sorted on nodeid]
> > 
> > I'm currently in favor of 2b, with perhaps a call to qsort on the array
> > after parsing to sort by node id? I'd have to audit the users of the
> > array to make sure they use the nodeid member and not the index, but
> > that should be straightforward.
> 
> As the holes in the node ID space don't seem to be frequently large, and
> the ID space is currently very small (we support 8-bit IDs only), 2a
> looks much simpler to implement and review. We can always change the
> code to use 2b if we decide to support larger node IDs in the future.

Ah, I didn't even check to see that MAX_NODES is so small, will do.

> (And we don't even need to iterate over the entire array. We just need
> to iterate up to the highest ID seen on the command-line.)

Yep, that's a good shortcut, will add it to my changes. That is
equivalent, fwiw, to tracking how many valid nodes you've seen in any
given loop and only checking up to the number known.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
  2014-06-18  5:04   ` Alexey Kardashevskiy
@ 2014-06-20 19:10   ` Nishanth Aravamudan
  2014-06-21  3:08     ` Alexey Kardashevskiy
  2014-06-20 22:55   ` Nishanth Aravamudan
  2 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-20 19:10 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> Current QEMU does not support memoryless NUMA nodes.
> This prepares SPAPR for that.
> 
> This moves 2 calls of spapr_populate_memory_node() into
> the existing loop which handles nodes other than
> the first one.

<snip>

> @@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>                  node_size = ram_size - mem_start;
>              }
>          }
> +        if (!mem_start) {
> +            /* ppc_spapr_init() checks for rma_size <= node0_size already */
> +            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +            mem_start += spapr->rma_size;
> +            node_size -= spapr->rma_size;
> +        }

Why is this needed to be separate? The RMA fits in the first node, per
the comment and the prior checks, so can't we just leave the first node
alone?

-Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
  2014-06-18  5:04   ` Alexey Kardashevskiy
  2014-06-20 19:10   ` Nishanth Aravamudan
@ 2014-06-20 22:55   ` Nishanth Aravamudan
  2014-06-21  3:06     ` Alexey Kardashevskiy
  2 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-20 22:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> Current QEMU does not support memoryless NUMA nodes.
> This prepares SPAPR for that.
> 
> This moves 2 calls of spapr_populate_memory_node() into
> the existing loop which handles nodes other than
> the first one.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>  1 file changed, 11 insertions(+), 20 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index cb3a10a..666b676 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> 
>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>  {
> -    hwaddr node0_size, mem_start, node_size;
> +    hwaddr mem_start, node_size;
>      int i;
> 
> -    /* memory node(s) */
> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> -        node0_size = node_mem[0];
> -    } else {
> -        node0_size = ram_size;
> -    }
> -
> -    /* RMA */
> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> -
> -    /* RAM: Node 0 */
> -    if (node0_size > spapr->rma_size) {
> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> -                                   node0_size - spapr->rma_size);
> -    }
> -
> -    /* RAM: Node 1 and beyond */
> -    mem_start = node0_size;
> -    for (i = 1; i < nb_numa_nodes; i++) {
> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> +        if (!node_mem[i]) {
> +            continue;
> +        }

Doesn't this skip memoryless nodes? What actually puts the memoryless
node in the device-tree?

And if you were to put them in, wouldn't spapr_populate_memory_node()
fail because we'd be creating two nodes with memory@XXX where XXX is the
same (starting address) for both?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-20 22:55   ` Nishanth Aravamudan
@ 2014-06-21  3:06     ` Alexey Kardashevskiy
  2014-06-23 17:40       ` Nishanth Aravamudan
  2014-06-24  3:08       ` Nishanth Aravamudan
  0 siblings, 2 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-21  3:06 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>> Current QEMU does not support memoryless NUMA nodes.
>> This prepares SPAPR for that.
>>
>> This moves 2 calls of spapr_populate_memory_node() into
>> the existing loop which handles nodes other than
>> the first one.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>>  1 file changed, 11 insertions(+), 20 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index cb3a10a..666b676 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>>
>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>  {
>> -    hwaddr node0_size, mem_start, node_size;
>> +    hwaddr mem_start, node_size;
>>      int i;
>>
>> -    /* memory node(s) */
>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
>> -        node0_size = node_mem[0];
>> -    } else {
>> -        node0_size = ram_size;
>> -    }
>> -
>> -    /* RMA */
>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>> -
>> -    /* RAM: Node 0 */
>> -    if (node0_size > spapr->rma_size) {
>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
>> -                                   node0_size - spapr->rma_size);
>> -    }
>> -
>> -    /* RAM: Node 1 and beyond */
>> -    mem_start = node0_size;
>> -    for (i = 1; i < nb_numa_nodes; i++) {
>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
>> +        if (!node_mem[i]) {
>> +            continue;
>> +        }
> 
> Doesn't this skip memoryless nodes? What actually puts the memoryless
> node in the device-tree?

It does skip.

> And if you were to put them in, wouldn't spapr_populate_memory_node()
> fail because we'd be creating two nodes with memory@XXX where XXX is the
> same (starting address) for both?

I cannot do this now - there is no way to tell from the command line where
I want a NUMA node's memory to start, so I'll end up with multiple nodes
with the same name and QEMU won't start. When NUMA fixes reach upstream,
I'll try to work out something on top of that.


-- 
Alexey


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-20 19:10   ` Nishanth Aravamudan
@ 2014-06-21  3:08     ` Alexey Kardashevskiy
  2014-06-23 17:41       ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-21  3:08 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 06/21/2014 05:10 AM, Nishanth Aravamudan wrote:
> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>> Current QEMU does not support memoryless NUMA nodes.
>> This prepares SPAPR for that.
>>
>> This moves 2 calls of spapr_populate_memory_node() into
>> the existing loop which handles nodes other than
>> the first one.
> 
> <snip>
> 
>> @@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>                  node_size = ram_size - mem_start;
>>              }
>>          }
>> +        if (!mem_start) {
>> +            /* ppc_spapr_init() checks for rma_size <= node0_size already */
>> +            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>> +            mem_start += spapr->rma_size;
>> +            node_size -= spapr->rma_size;
>> +        }
> 
> Why is this needed to be separate? The RMA fits in the first node, per
> the comment and the prior checks, so can't we just leave the first node
> alone?

This is the way to tell SLOF what memory it can use. It can use the RMA
and it will use the first available memory node.



-- 
Alexey


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-21  3:06     ` Alexey Kardashevskiy
@ 2014-06-23 17:40       ` Nishanth Aravamudan
  2014-06-24  6:07         ` Alexey Kardashevskiy
  2014-06-24  3:08       ` Nishanth Aravamudan
  1 sibling, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-23 17:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> > On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >> Current QEMU does not support memoryless NUMA nodes.
> >> This prepares SPAPR for that.
> >>
> >> This moves 2 calls of spapr_populate_memory_node() into
> >> the existing loop which handles nodes other than
> >> the first one.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index cb3a10a..666b676 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>
> >>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>  {
> >> -    hwaddr node0_size, mem_start, node_size;
> >> +    hwaddr mem_start, node_size;
> >>      int i;
> >>
> >> -    /* memory node(s) */
> >> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >> -        node0_size = node_mem[0];
> >> -    } else {
> >> -        node0_size = ram_size;
> >> -    }
> >> -
> >> -    /* RMA */
> >> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >> -
> >> -    /* RAM: Node 0 */
> >> -    if (node0_size > spapr->rma_size) {
> >> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >> -                                   node0_size - spapr->rma_size);
> >> -    }
> >> -
> >> -    /* RAM: Node 1 and beyond */
> >> -    mem_start = node0_size;
> >> -    for (i = 1; i < nb_numa_nodes; i++) {
> >> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >> +        if (!node_mem[i]) {
> >> +            continue;
> >> +        }
> > 
> > Doesn't this skip memoryless nodes? What actually puts the memoryless
> > node in the device-tree?
> 
> It does skip.
> 
> > And if you were to put them in, wouldn't spapr_populate_memory_node()
> > fail because we'd be creating two nodes with memory@XXX where XXX is the
> > same (starting address) for both?
> 
> I cannot do this now - there is no way to tell from the command line
> where I want NUMA node memory start from so I'll end up with multiple
> nodes with the same name and QEMU won't start. When NUMA fixes reach
> upstream, I'll try to work out something on top of that.

So in mst's tree, which I've rebased your patches, we have a struct
defining each NUMA node, which has a size (and the index is the nodeid).
I've got patches working that allow for sparse indexing, but I'm curious
what you think we should do for the naming. I can send out the patches,
with the caveat that architectures still need to fix the remaining
issues for memoryless nodes?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-21  3:08     ` Alexey Kardashevskiy
@ 2014-06-23 17:41       ` Nishanth Aravamudan
  2014-06-23 22:02         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-23 17:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 21.06.2014 [13:08:59 +1000], Alexey Kardashevskiy wrote:
> On 06/21/2014 05:10 AM, Nishanth Aravamudan wrote:
> > On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >> Current QEMU does not support memoryless NUMA nodes.
> >> This prepares SPAPR for that.
> >>
> >> This moves 2 calls of spapr_populate_memory_node() into
> >> the existing loop which handles nodes other than
> >> the first one.
> > 
> > <snip>
> > 
> >> @@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>                  node_size = ram_size - mem_start;
> >>              }
> >>          }
> >> +        if (!mem_start) {
> >> +            /* ppc_spapr_init() checks for rma_size <= node0_size already */
> >> +            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> >> +            mem_start += spapr->rma_size;
> >> +            node_size -= spapr->rma_size;
> >> +        }
> > 
> > Why is this needed to be separate? The RMA fits in the first node, per
> > the comment and the prior checks, so can't we just leave the first node
> > alone?
> 
> This is the way to tell SLOF what memory it can use. It can use RMA and it
> will use first available memory node.

Right, but why does the RMA need to be in its own memory node? Can't it
just be part of the first present memory node?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-23 17:41       ` Nishanth Aravamudan
@ 2014-06-23 22:02         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-23 22:02 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 06/24/2014 03:41 AM, Nishanth Aravamudan wrote:
> On 21.06.2014 [13:08:59 +1000], Alexey Kardashevskiy wrote:
>> On 06/21/2014 05:10 AM, Nishanth Aravamudan wrote:
>>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>>>> Current QEMU does not support memoryless NUMA nodes.
>>>> This prepares SPAPR for that.
>>>>
>>>> This moves 2 calls of spapr_populate_memory_node() into
>>>> the existing loop which handles nodes other than
>>>> the first one.
>>>
>>> <snip>
>>>
>>>> @@ -719,6 +704,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>>>                  node_size = ram_size - mem_start;
>>>>              }
>>>>          }
>>>> +        if (!mem_start) {
>>>> +            /* ppc_spapr_init() checks for rma_size <= node0_size already */
>>>> +            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>>>> +            mem_start += spapr->rma_size;
>>>> +            node_size -= spapr->rma_size;
>>>> +        }
>>>
>>> Why is this needed to be separate? The RMA fits in the first node, per
>>> the comment and the prior checks, so can't we just leave the first node
>>> alone?
>>
>> This is the way to tell SLOF what memory it can use. It can use RMA and it
>> will use first available memory node.
> 
> Right, but why does the RMA need to be in its own memory node? Can't it
> just be part of the first present memory node?

As I understand it, SLOF can try to use the whole node.


-- 
Alexey


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-21  3:06     ` Alexey Kardashevskiy
  2014-06-23 17:40       ` Nishanth Aravamudan
@ 2014-06-24  3:08       ` Nishanth Aravamudan
  2014-06-24  6:14         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-24  3:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> > On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >> Current QEMU does not support memoryless NUMA nodes.
> >> This prepares SPAPR for that.
> >>
> >> This moves 2 calls of spapr_populate_memory_node() into
> >> the existing loop which handles nodes other than than
> >> the first one.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index cb3a10a..666b676 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>
> >>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>  {
> >> -    hwaddr node0_size, mem_start, node_size;
> >> +    hwaddr mem_start, node_size;
> >>      int i;
> >>
> >> -    /* memory node(s) */
> >> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >> -        node0_size = node_mem[0];
> >> -    } else {
> >> -        node0_size = ram_size;
> >> -    }
> >> -
> >> -    /* RMA */
> >> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >> -
> >> -    /* RAM: Node 0 */
> >> -    if (node0_size > spapr->rma_size) {
> >> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >> -                                   node0_size - spapr->rma_size);
> >> -    }
> >> -
> >> -    /* RAM: Node 1 and beyond */
> >> -    mem_start = node0_size;
> >> -    for (i = 1; i < nb_numa_nodes; i++) {
> >> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >> +        if (!node_mem[i]) {
> >> +            continue;
> >> +        }
> > 
> > Doesn't this skip memoryless nodes? What actually puts the memoryless
> > node in the device-tree?
> 
> It does skip.
> 
> > And if you were to put them in, wouldn't spapr_populate_memory_node()
> > fail because we'd be creating two nodes with memory@XXX where XXX is the
> > same (starting address) for both?
> 
> I cannot do this now - there is no way to tell from the command line where
> I want NUMA node memory start from so I'll end up with multiple nodes with
> the same name and QEMU won't start. When NUMA fixes reach upstream, I'll
> try to work out something on top of that.

Ah I got something here. With the patches I just sent to enable sparse
NUMA nodes, plus your series rebased on top, here's what I see in a
Linux LPAR:

qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -realtime mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 -numa node,nodeid=2,mem=0,cpus=0-1 -smp 4

info numa
2 nodes
node 2 cpus: 0 1
node 2 size: 0 MB
node 3 cpus: 2 3
node 3 size: 4096 MB

numactl --hardware
available: 3 nodes (0,2-3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 2 3
node 3 size: 4073 MB
node 3 free: 3830 MB
node distances:
node   0   2   3 
  0:  10  40  40 
  2:  40  10  40 
  3:  40  40  10 

The trick, it seems, is that if you have a memoryless node, it needs to
have CPUs assigned to it. The CPU's "ibm,associativity" property will
make Linux set up the proper NUMA topology.

Thoughts? Should there be a check that every "present" NUMA node at
least has either CPUs or memory? It seems like if neither is present,
we can just hotplug them later? Does qemu support topology for PCI
devices?

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-23 17:40       ` Nishanth Aravamudan
@ 2014-06-24  6:07         ` Alexey Kardashevskiy
  2014-06-24 17:07           ` Nishanth Aravamudan
  0 siblings, 1 reply; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-24  6:07 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 06/24/2014 03:40 AM, Nishanth Aravamudan wrote:
> On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
>> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
>>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>>>> Current QEMU does not support memoryless NUMA nodes.
>>>> This prepares SPAPR for that.
>>>>
>>>> This moves 2 calls of spapr_populate_memory_node() into
>>>> the existing loop which handles nodes other than
>>>> the first one.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>>>>  1 file changed, 11 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index cb3a10a..666b676 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>>>>
>>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>>>  {
>>>> -    hwaddr node0_size, mem_start, node_size;
>>>> +    hwaddr mem_start, node_size;
>>>>      int i;
>>>>
>>>> -    /* memory node(s) */
>>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
>>>> -        node0_size = node_mem[0];
>>>> -    } else {
>>>> -        node0_size = ram_size;
>>>> -    }
>>>> -
>>>> -    /* RMA */
>>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>>>> -
>>>> -    /* RAM: Node 0 */
>>>> -    if (node0_size > spapr->rma_size) {
>>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
>>>> -                                   node0_size - spapr->rma_size);
>>>> -    }
>>>> -
>>>> -    /* RAM: Node 1 and beyond */
>>>> -    mem_start = node0_size;
>>>> -    for (i = 1; i < nb_numa_nodes; i++) {
>>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
>>>> +        if (!node_mem[i]) {
>>>> +            continue;
>>>> +        }
>>>
>>> Doesn't this skip memoryless nodes? What actually puts the memoryless
>>> node in the device-tree?
>>
>> It does skip.
>>
>>> And if you were to put them in, wouldn't spapr_populate_memory_node()
>>> fail because we'd be creating two nodes with memory@XXX where XXX is the
>>> same (starting address) for both?
>>
>> I cannot do this now - there is no way to tell from the command line
>> where I want NUMA node memory start from so I'll end up with multiple
>> nodes with the same name and QEMU won't start. When NUMA fixes reach
>> upstream, I'll try to work out something on top of that.
> 
> So in mst's tree, which I've rebased your patches, we have a struct
> defining each NUMA node, which has a size (and the index is the nodeid).
> I've got patches working that allow for sparse indexing, but I'm curious
> what you think we should do for the naming.


There should be no memory@xxx nodes in the device tree for memoryless
NUMA nodes, so there is no problem with naming.


> I can send out the patches,
> with the caveat that architectures still need to fix the remaining
> issues for memoryless nodes?

? If we do not change the existing behavior and just extend it, why will
there be a problem?


-- 
Alexey


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-24  3:08       ` Nishanth Aravamudan
@ 2014-06-24  6:14         ` Alexey Kardashevskiy
  2014-06-24 17:01           ` Nishanth Aravamudan
  2014-07-21 18:08           ` Nishanth Aravamudan
  0 siblings, 2 replies; 45+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-24  6:14 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 06/24/2014 01:08 PM, Nishanth Aravamudan wrote:
> On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
>> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
>>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>>>> Current QEMU does not support memoryless NUMA nodes.
>>>> This prepares SPAPR for that.
>>>>
>>>> This moves 2 calls of spapr_populate_memory_node() into
>>>> the existing loop which handles nodes other than
>>>> the first one.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>>>>  1 file changed, 11 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index cb3a10a..666b676 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>>>>
>>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>>>  {
>>>> -    hwaddr node0_size, mem_start, node_size;
>>>> +    hwaddr mem_start, node_size;
>>>>      int i;
>>>>
>>>> -    /* memory node(s) */
>>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
>>>> -        node0_size = node_mem[0];
>>>> -    } else {
>>>> -        node0_size = ram_size;
>>>> -    }
>>>> -
>>>> -    /* RMA */
>>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>>>> -
>>>> -    /* RAM: Node 0 */
>>>> -    if (node0_size > spapr->rma_size) {
>>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
>>>> -                                   node0_size - spapr->rma_size);
>>>> -    }
>>>> -
>>>> -    /* RAM: Node 1 and beyond */
>>>> -    mem_start = node0_size;
>>>> -    for (i = 1; i < nb_numa_nodes; i++) {
>>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
>>>> +        if (!node_mem[i]) {
>>>> +            continue;
>>>> +        }
>>>
>>> Doesn't this skip memoryless nodes? What actually puts the memoryless
>>> node in the device-tree?
>>
>> It does skip.
>>
>>> And if you were to put them in, wouldn't spapr_populate_memory_node()
>>> fail because we'd be creating two nodes with memory@XXX where XXX is the
>>> same (starting address) for both?
>>
>> I cannot do this now - there is no way to tell from the command line where
>> I want NUMA node memory start from so I'll end up with multiple nodes with
>> the same name and QEMU won't start. When NUMA fixes reach upstream, I'll
>> try to work out something on top of that.
> 
> Ah I got something here. With the patches I just sent to enable sparse
> NUMA nodes, plus your series rebased on top, here's what I see in a
> Linux LPAR:
> 
> qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -realtime mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 -numa node,nodeid=2,mem=0,cpus=0-1 -smp 4
> 
> info numa
> 2 nodes
> node 2 cpus: 0 1
> node 2 size: 0 MB
> node 3 cpus: 2 3
> node 3 size: 4096 MB
> 
> numactl --hardware
> available: 3 nodes (0,2-3)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 2 3
> node 3 size: 4073 MB
> node 3 free: 3830 MB
> node distances:
> node   0   2   3 
>   0:  10  40  40 
>   2:  40  10  40 
>   3:  40  40  10 
> 
> The trick, it seems, is if you have a memoryless node, it needs to
> have CPUs assigned to it.

Yep. The device tree does not have NUMA nodes; it only has CPUs and
memory@xxx (memory banks?) and the guest kernel has to parse
ibm,associativity and reconstruct the NUMA topology. If some node is not
mentioned in any ibm,associativity, it does not exist.


> The CPU's "ibm,associativity" property will
> make Linux set up the proper NUMA topology.
> 
> Thoughts? Should there be a check that every "present" NUMA node at
> least either has CPUs or memory.

May be, I'll wait for NUMA stuff in upstream, apply your patch(es), my
patches and see what I get :)


> It seems like if neither are present,
> we can just hotplug them later?

hotplug what? NUMA nodes?

> Does qemu support topology for PCI devices?

Nope.



-- 
Alexey


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-24  6:14         ` Alexey Kardashevskiy
@ 2014-06-24 17:01           ` Nishanth Aravamudan
  2014-07-21 18:08           ` Nishanth Aravamudan
  1 sibling, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-24 17:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 24.06.2014 [16:14:11 +1000], Alexey Kardashevskiy wrote:
> On 06/24/2014 01:08 PM, Nishanth Aravamudan wrote:
> > On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> >> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> >>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >>>> Current QEMU does not support memoryless NUMA nodes.
> >>>> This prepares SPAPR for that.
> >>>>
> >>>> This moves 2 calls of spapr_populate_memory_node() into
> >>>> the existing loop which handles nodes other than
> >>>> the first one.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>>>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index cb3a10a..666b676 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>>>
> >>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>>>  {
> >>>> -    hwaddr node0_size, mem_start, node_size;
> >>>> +    hwaddr mem_start, node_size;
> >>>>      int i;
> >>>>
> >>>> -    /* memory node(s) */
> >>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >>>> -        node0_size = node_mem[0];
> >>>> -    } else {
> >>>> -        node0_size = ram_size;
> >>>> -    }
> >>>> -
> >>>> -    /* RMA */
> >>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >>>> -
> >>>> -    /* RAM: Node 0 */
> >>>> -    if (node0_size > spapr->rma_size) {
> >>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >>>> -                                   node0_size - spapr->rma_size);
> >>>> -    }
> >>>> -
> >>>> -    /* RAM: Node 1 and beyond */
> >>>> -    mem_start = node0_size;
> >>>> -    for (i = 1; i < nb_numa_nodes; i++) {
> >>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >>>> +        if (!node_mem[i]) {
> >>>> +            continue;
> >>>> +        }
> >>>
> >>> Doesn't this skip memoryless nodes? What actually puts the memoryless
> >>> node in the device-tree?
> >>
> >> It does skip.
> >>
> >>> And if you were to put them in, wouldn't spapr_populate_memory_node()
> >>> fail because we'd be creating two nodes with memory@XXX where XXX is the
> >>> same (starting address) for both?
> >>
> >> I cannot do this now - there is no way to tell from the command line where
> >> I want NUMA node memory start from so I'll end up with multiple nodes with
> >> the same name and QEMU won't start. When NUMA fixes reach upstream, I'll
> >> try to work out something on top of that.
> > 
> > Ah I got something here. With the patches I just sent to enable sparse
> > NUMA nodes, plus your series rebased on top, here's what I see in a
> > Linux LPAR:
> > 
> > qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -realtime mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 -numa node,nodeid=2,mem=0,cpus=0-1 -smp 4
> > 
> > info numa
> > 2 nodes
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4096 MB
> > 
> > numactl --hardware
> > available: 3 nodes (0,2-3)
> > node 0 cpus:
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 2 free: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4073 MB
> > node 3 free: 3830 MB
> > node distances:
> > node   0   2   3 
> >   0:  10  40  40 
> >   2:  40  10  40 
> >   3:  40  40  10 
> > 
> > The trick, it seems, is if you have a memoryless node, it needs to
> > have CPUs assigned to it.
> 
> Yep. The device tree does not have NUMA nodes, it only has CPUs and
> memory@xxx (memory banks?) and the guest kernel has to parse
> ibm,associativity and reconstruct the NUMA topology. If some node is not
> mentioned in any ibm,associativity, it does not exist.
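The mechanism described above can be sketched in a few lines: for each memory bank the firmware emits a node named after the bank's start address plus an "ibm,associativity" property whose last cell the guest kernel reads as the NUMA node id. This is a hypothetical illustration based on this thread; the property layout and helper shape are assumptions, not a verbatim copy of QEMU's spapr_populate_memory_node().

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch (assumed layout): compute the device-tree node name and the
 * 5-cell ibm,associativity property for one memory bank. */
static void memory_node_props(int nodeid, uint64_t start,
                              char *name, size_t name_len,
                              uint32_t assoc[5])
{
    /* The node name encodes the bank's start address -- hence the naming
     * clash discussed above when two banks would start at the same
     * address. */
    snprintf(name, name_len, "memory@%" PRIx64, start);

    /* First cell: number of associativity levels. The guest kernel takes
     * the last cell as the NUMA node id; a node id that appears in no
     * ibm,associativity property simply does not exist for the guest. */
    assoc[0] = 4;
    assoc[1] = 0;
    assoc[2] = 0;
    assoc[3] = 0;
    assoc[4] = (uint32_t)nodeid;
}
```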

Yep, that all makes sense, but we need something (I think) to handle
this kind of command-line, even if it's just a warning/error:

qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -numa node,nodeid=3,mem=4096,cpus=0-3 -numa node,nodeid=2,mem=0 -smp 4

info numa
2 nodes
node 2 cpus:
node 2 size: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4096 MB

numactl --hardware
available: 2 nodes (0,3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4076 MB
node 3 free: 3864 MB
node distances:
node   0   3 
  0:  10  40 
  3:  40  10 

A pathological case, obviously, but it's pretty trivial to enforce some
sanity here, I think.
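The sanity check proposed above could look like the following sketch: reject any declared NUMA node that has neither memory nor CPUs, since such a node never appears in any ibm,associativity property and silently vanishes from the guest topology. The function name and array shapes are illustrative, not QEMU's actual interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command-line validation: a node named with -numa must
 * contribute either memory or CPUs, otherwise the guest can never see
 * it (as in the pathological nodeid=2,mem=0 example above). */
static int validate_numa_nodes(int nb_nodes,
                               const uint64_t node_mem[],
                               const int node_ncpus[],
                               const bool node_present[])
{
    for (int i = 0; i < nb_nodes; i++) {
        if (node_present[i] && node_mem[i] == 0 && node_ncpus[i] == 0) {
            return -1; /* empty node: warn or refuse to start */
        }
    }
    return 0;
}
```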

> > The CPU's "ibm,associativity" property will
> > make Linux set up the proper NUMA topology.
> > 
> > Thoughts? Should there be a check that every "present" NUMA node at
> > least either has CPUs or memory?
> 
> May be, I'll wait for NUMA stuff in upstream, apply your patch(es), my
> patches and see what I get :)

Ok, sounds good.

> > It seems like if neither are present,
> > we can just hotplug them later?
> 
> hotplug what? NUMA nodes?

Well, this actually existed in practice, IIRC, with SGI's larger boxes
(or was planned at least). But I actually meant when we hotplug in a CPU
or memory later, the appropriate topology should show up. I wonder if
that works, as under PowerVM, that dynamically adjustable hardware is
in the drconf property, not in memory@ or CPU@ nodes. Ah well, cross
that bridge when we get to it.

> > Does qemu support topology for PCI devices?
> 
> Nope.

Ok, good to know -- as that's another place that can determine what NUMA
nodes are online/offline in Linux, I believe.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-24  6:07         ` Alexey Kardashevskiy
@ 2014-06-24 17:07           ` Nishanth Aravamudan
  0 siblings, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-06-24 17:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 24.06.2014 [16:07:41 +1000], Alexey Kardashevskiy wrote:
> On 06/24/2014 03:40 AM, Nishanth Aravamudan wrote:
> > On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> >> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> >>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >>>> Current QEMU does not support memoryless NUMA nodes.
> >>>> This prepares SPAPR for that.
> >>>>
> >>>> This moves 2 calls of spapr_populate_memory_node() into
> >>>> the existing loop which handles nodes other than
> >>>> the first one.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>>>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index cb3a10a..666b676 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>>>
> >>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>>>  {
> >>>> -    hwaddr node0_size, mem_start, node_size;
> >>>> +    hwaddr mem_start, node_size;
> >>>>      int i;
> >>>>
> >>>> -    /* memory node(s) */
> >>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >>>> -        node0_size = node_mem[0];
> >>>> -    } else {
> >>>> -        node0_size = ram_size;
> >>>> -    }
> >>>> -
> >>>> -    /* RMA */
> >>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >>>> -
> >>>> -    /* RAM: Node 0 */
> >>>> -    if (node0_size > spapr->rma_size) {
> >>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >>>> -                                   node0_size - spapr->rma_size);
> >>>> -    }
> >>>> -
> >>>> -    /* RAM: Node 1 and beyond */
> >>>> -    mem_start = node0_size;
> >>>> -    for (i = 1; i < nb_numa_nodes; i++) {
> >>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >>>> +        if (!node_mem[i]) {
> >>>> +            continue;
> >>>> +        }
> >>>
> >>> Doesn't this skip memoryless nodes? What actually puts the memoryless
> >>> node in the device-tree?
> >>
> >> It does skip.
> >>
> >>> And if you were to put them in, wouldn't spapr_populate_memory_node()
> >>> fail because we'd be creating two nodes with memory@XXX where XXX is the
> >>> same (starting address) for both?
> >>
> >> I cannot do this now - there is no way to tell from the command line
> >> where I want NUMA node memory start from so I'll end up with multiple
> >> nodes with the same name and QEMU won't start. When NUMA fixes reach
> >> upstream, I'll try to work out something on top of that.
> > 
> > So in mst's tree, which I've rebased your patches, we have a struct
> > defining each NUMA node, which has a size (and the index is the nodeid).
> > I've got patches working that allow for sparse indexing, but I'm curious
> > what you think we should do for the naming.
> 
> 
> There should be no nodes for memory@xxx in the device tree for memoryless
> NUMA nodes so no problem with naming.
> 
> 
> > I can send out the patches,
> > with the caveat that architectures still need to fix the remaining
> > issues for memoryless nodes?
> 
> ? If we do not change the existing behavior and just extending it, why will
> there be a problem?

Sorry, I meant to say to actually support memoryless or sparse NUMA
nodes, there may be further adjustments needed in the architecture code.
Your patches cover it (we don't need 7/7 anymore, btw) when rebased on
top of what I sent out, for powerpc. But x86 needs quite a bit of
massaging. So my point was that the generic qemu understands and
supports memoryless and sparse NUMA numbering (with my patches), but
that doesn't mean it will work as expected on a given platform yet. That
is status quo, though.
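The sparse-indexing scheme discussed above can be sketched as follows: each possible node id indexes a descriptor, and only present nodes with memory get a memory@xxx device-tree node, so memoryless and absent nodes cause no naming clash. Field names follow the thread's description ("a struct defining each NUMA node, which has a size") and may differ from the actual patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed shape of the per-node descriptor from the rebased series. */
typedef struct {
    bool present;       /* node id was named on the command line */
    uint64_t node_mem;  /* bytes assigned to this node, 0 if memoryless */
} NodeInfoSketch;

/* Count how many memory@xxx device-tree nodes would be rendered:
 * memoryless and absent nodes render nothing, matching the
 * "if (!node_mem[i]) continue;" skip in the diff quoted above. */
static int count_memory_nodes(const NodeInfoSketch *nodes, int max_nodes)
{
    int count = 0;
    for (int i = 0; i < max_nodes; i++) {
        if (nodes[i].present && nodes[i].node_mem != 0) {
            count++;
        }
    }
    return count;
}
```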

Thanks,
Nish


* Re: [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory()
  2014-06-24  6:14         ` Alexey Kardashevskiy
  2014-06-24 17:01           ` Nishanth Aravamudan
@ 2014-07-21 18:08           ` Nishanth Aravamudan
  1 sibling, 0 replies; 45+ messages in thread
From: Nishanth Aravamudan @ 2014-07-21 18:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Alexander Graf

On 24.06.2014 [16:14:11 +1000], Alexey Kardashevskiy wrote:

<snip>

> > It seems like if neither are present,
> > we can just hotplug them later?
> 
> hotplug what? NUMA nodes?

Just an FYI, as I didn't realize this, but Jiang Liu recently posted a
patchset for NUMA node hotplug issues Intel ran into on one of their
platforms due to a buggy BIOS (https://lkml.org/lkml/2014/7/11/75). So I
guess this is a real thing :) Ignore the subject line of that thread,
it's really about node online...

-Nish


end of thread, other threads:[~2014-07-21 18:09 UTC | newest]

Thread overview: 45+ messages
2014-06-16  7:53 [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
2014-06-16  7:53 ` [Qemu-devel] [PATCH 1/7] spapr: Move DT memory node rendering to a helper Alexey Kardashevskiy
2014-06-16  7:53 ` [Qemu-devel] [PATCH 2/7] spapr: Use DT memory node rendering helper for other nodes Alexey Kardashevskiy
2014-06-16  7:53 ` [Qemu-devel] [PATCH 3/7] spapr: Refactor spapr_populate_memory() Alexey Kardashevskiy
2014-06-18  5:04   ` Alexey Kardashevskiy
2014-06-20 19:10   ` Nishanth Aravamudan
2014-06-21  3:08     ` Alexey Kardashevskiy
2014-06-23 17:41       ` Nishanth Aravamudan
2014-06-23 22:02         ` Alexey Kardashevskiy
2014-06-20 22:55   ` Nishanth Aravamudan
2014-06-21  3:06     ` Alexey Kardashevskiy
2014-06-23 17:40       ` Nishanth Aravamudan
2014-06-24  6:07         ` Alexey Kardashevskiy
2014-06-24 17:07           ` Nishanth Aravamudan
2014-06-24  3:08       ` Nishanth Aravamudan
2014-06-24  6:14         ` Alexey Kardashevskiy
2014-06-24 17:01           ` Nishanth Aravamudan
2014-07-21 18:08           ` Nishanth Aravamudan
2014-06-16  7:53 ` [Qemu-devel] [PATCH 4/7] spapr: Split memory nodes to power-of-two blocks Alexey Kardashevskiy
2014-06-17  7:07   ` Alexey Kardashevskiy
2014-06-16  7:53 ` [Qemu-devel] [PATCH 5/7] spapr: Add a helper for node0_size calculation Alexey Kardashevskiy
2014-06-16 18:43   ` Nishanth Aravamudan
2014-06-16  7:53 ` [Qemu-devel] [PATCH 6/7] spapr: Fix ibm,associativity for memory nodes Alexey Kardashevskiy
2014-06-16  7:53 ` [Qemu-devel] [PATCH 7/7] numa: Allow empty nodes Alexey Kardashevskiy
2014-06-16 16:15   ` Eduardo Habkost
2014-06-16 18:49     ` Nishanth Aravamudan
2014-06-16 20:11       ` Eduardo Habkost
2014-06-16 20:31         ` Eduardo Habkost
2014-06-17  0:21           ` Nishanth Aravamudan
2014-06-17  0:16         ` Nishanth Aravamudan
2014-06-16  8:16 ` [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes Alexey Kardashevskiy
2014-06-16 18:26   ` Nishanth Aravamudan
2014-06-16 20:51   ` Eduardo Habkost
2014-06-17  0:25     ` Nishanth Aravamudan
2014-06-17  1:37       ` Eduardo Habkost
2014-06-17 18:36         ` Nishanth Aravamudan
2014-06-17  1:41       ` Eduardo Habkost
2014-06-17 18:37         ` Nishanth Aravamudan
2014-06-17  5:51     ` Alexey Kardashevskiy
2014-06-17 14:07       ` Eduardo Habkost
2014-06-17 18:38         ` Nishanth Aravamudan
2014-06-17 19:22           ` Eduardo Habkost
2014-06-18 18:28             ` Nishanth Aravamudan
2014-06-18 19:33               ` Eduardo Habkost
2014-06-18 23:58                 ` Nishanth Aravamudan
