* [PATCH 0/4]: NUMA: add host binding
@ 2010-08-11 13:52 Andre Przywara
  2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andre Przywara @ 2010-08-11 13:52 UTC (permalink / raw)
  To: avi, anthony; +Cc: kvm

Hi,

the following 4 patches add NUMA host binding to qemu(-kvm).
They allow specifying NUMA policies to be applied to each of the
(separately specified) guest NUMA nodes. The syntax mimics numactl's,
allowing "membind", "preferred" and "interleave" as possible
policies. An example is:
$ qemu -smp 4 ... -numa node,cpus=0,mem=1024M \
  -numa node,cpus=1-2,mem=2048M -numa node,cpus=3,mem=1024M \
  -numa host,nodeid=0,membind=0 -numa host,nodeid=1,interleave=0-1 \
  -numa host,nodeid=2,preferred=!0
The complete syntax is: -numa host,nodeid=<n>,\
  (membind|preferred|interleave)=[+!]<hostnode>[-<hostnode>]
The '+' denotes CPUSET relative nodes, '!' means negation
(see numactl(8)).
Since vCPU binding can be done more easily from outside of QEMU, I left
it out completely; AFAIK management applications already do this today.
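
For reference, here is a condensed, self-contained sketch of how the three
policies are meant to end up as mbind(2) modes (it mirrors the code in
patch 4; the NODE_HOST_* flags are the ones patch 3 adds to sysemu.h, and
the helper name is made up):

#include <numaif.h>

/* as added to sysemu.h by patch 3 */
#define NODE_HOST_BIND        0x01
#define NODE_HOST_INTERLEAVE  0x02
#define NODE_HOST_PREFERRED   0x03
#define NODE_HOST_POLICY_MASK 0x03
#define NODE_HOST_RELATIVE    0x04

#ifndef MPOL_F_RELATIVE_NODES  /* older numaif.h, same fallback as patch 4 */
#define MPOL_F_RELATIVE_NODES (1 << 14)
#define MPOL_F_STATIC_NODES   (1 << 15)
#endif

static int host_policy_to_mbind_mode(unsigned int flags)
{
    int mode;

    switch (flags & NODE_HOST_POLICY_MASK) {
    case NODE_HOST_BIND:       mode = MPOL_BIND;       break;
    case NODE_HOST_INTERLEAVE: mode = MPOL_INTERLEAVE; break;
    case NODE_HOST_PREFERRED:  mode = MPOL_PREFERRED;  break;
    default:                   mode = MPOL_DEFAULT;    break;
    }
    /* '+' selects CPUSET-relative host node numbers, otherwise static ones */
    mode |= (flags & NODE_HOST_RELATIVE) ? MPOL_F_RELATIVE_NODES
                                         : MPOL_F_STATIC_NODES;
    return mode;
}

int main(void)
{
    /* e.g. "-numa host,nodeid=1,interleave=+0-1" ends up as this mode: */
    int mode = host_policy_to_mbind_mode(NODE_HOST_INTERLEAVE |
                                         NODE_HOST_RELATIVE);
    return mode == (MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES) ? 0 : 1;
}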

This version is based on the new generic bitmap code submitted with
patch 11/15 of the VNC update by Corentin Chary on Aug 11th:
http://lists.gnu.org/archive/html/qemu-devel/2010-08/msg00517.html
So it requires this patch to build (but not to apply).

Please comment!

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany





* [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation
  2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
@ 2010-08-11 13:52 ` Andre Przywara
  2010-08-11 13:52 ` [PATCH 2/4] NUMA: add Linux libnuma detection Andre Przywara
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Andre Przywara @ 2010-08-11 13:52 UTC (permalink / raw)
  To: avi, anthony; +Cc: kvm, Andre Przywara

The current NUMA guest implementation uses a "poor man's bitmap"
consisting of a single uint64_t. This patch reworks that by
leveraging the new generic bitmap code and thus lifts the 64 VCPU
limit for NUMA guests.
Besides that, it improves the NUMA data structures in preparation
for the future host binding code.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
---
 cpus.c    |    2 +-
 hw/pc.c   |    4 +-
 monitor.c |    2 +-
 sysemu.h  |   11 ++++++-
 vl.c      |   94 +++++++++++++++++++++++++++++++++++++++----------------------
 5 files changed, 73 insertions(+), 40 deletions(-)

diff --git a/cpus.c b/cpus.c
index 2e40814..86a0a47 100644
--- a/cpus.c
+++ b/cpus.c
@@ -805,7 +805,7 @@ void set_numa_modes(void)
 
     for (env = first_cpu; env != NULL; env = env->next_cpu) {
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] & (1 << env->cpu_index)) {
+            if (test_bit(env->cpu_index, numa_info[i].guest_cpu)) {
                 env->numa_node = i;
             }
         }
diff --git a/hw/pc.c b/hw/pc.c
index 89bd4af..1b24409 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -529,14 +529,14 @@ static void *bochs_bios_init(void)
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < smp_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            if (node_cpumask[j] & (1 << i)) {
+            if (test_bit(i, numa_info[j].guest_cpu)) {
                 numa_fw_cfg[i + 1] = cpu_to_le64(j);
                 break;
             }
         }
     }
     for (i = 0; i < nb_numa_nodes; i++) {
-        numa_fw_cfg[smp_cpus + 1 + i] = cpu_to_le64(node_mem[i]);
+        numa_fw_cfg[smp_cpus + 1 + i] = cpu_to_le64(numa_info[i].guest_mem);
     }
     fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, (uint8_t *)numa_fw_cfg,
                      (1 + smp_cpus + nb_numa_nodes) * 8);
diff --git a/monitor.c b/monitor.c
index e51df62..74da6c4 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1983,7 +1983,7 @@ static void do_info_numa(Monitor *mon)
         }
         monitor_printf(mon, "\n");
         monitor_printf(mon, "node %d size: %" PRId64 " MB\n", i,
-            node_mem[i] >> 20);
+            numa_info[i].guest_mem >> 20);
     }
 }
 
diff --git a/sysemu.h b/sysemu.h
index bf1d68a..e5f88d1 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -7,6 +7,7 @@
 #include "qemu-queue.h"
 #include "qemu-timer.h"
 #include "notify.h"
+#include "bitmap.h"
 
 #ifdef _WIN32
 #include <windows.h>
@@ -136,9 +137,15 @@ extern QEMUClock *rtc_clock;
 extern long hpagesize;
 
 #define MAX_NODES 64
+#ifndef MAX_NUMA_VCPUS
+#define MAX_NUMA_VCPUS 256
+#endif
 extern int nb_numa_nodes;
-extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+struct numa_info {
+    uint64_t guest_mem;
+    DECLARE_BITMAP(guest_cpu, MAX_NUMA_VCPUS);
+};
+extern struct numa_info numa_info[MAX_NODES];
 
 #define MAX_OPTION_ROMS 16
 extern const char *option_rom[MAX_OPTION_ROMS];
diff --git a/vl.c b/vl.c
index 3d8298e..40fac59 100644
--- a/vl.c
+++ b/vl.c
@@ -161,6 +161,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
+#include "bitmap.h"
 
 //#define DEBUG_NET
 //#define DEBUG_SLIRP
@@ -230,8 +231,7 @@ const char *nvram = NULL;
 int boot_menu;
 
 int nb_numa_nodes;
-uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+struct numa_info numa_info[MAX_NODES];
 
 static QEMUTimer *nographic_timer;
 
@@ -717,11 +717,51 @@ static void restore_boot_devices(void *opaque)
     qemu_free(standard_boot_devices);
 }
 
+static int parse_bitmap(const char *str, unsigned long *bm, int maxlen)
+{
+    unsigned long long value, endvalue;
+    char *endptr;
+    unsigned int flags = 0;
+
+    if (str[0] == '!') {
+        flags |= 2;
+        bitmap_fill(bm, maxlen);
+        str++;
+    }
+    if (str[0] == '+') {
+        flags |= 1;
+        str++;
+    }
+    value = strtoull(str, &endptr, 10);
+    if (endptr == str) {
+        if (strcmp(str, "all"))
+            return -1;
+        bitmap_fill(bm, maxlen);
+        return flags;
+    }
+    if (value >= maxlen)
+        return -value;
+    if (*endptr == '-') {
+        endvalue = strtoull(endptr + 1, &endptr, 10);
+        if (endvalue >= maxlen)
+            endvalue = maxlen;
+    } else {
+        endvalue = value;
+    }
+
+    if (flags & 2)
+        bitmap_clear(bm, value, endvalue + 1 - value);
+    else
+        bitmap_set(bm, value, endvalue + 1 - value);
+
+    return flags;
+}
+
 static void numa_add(const char *optarg)
 {
     char option[128];
     char *endptr;
-    unsigned long long value, endvalue;
+    unsigned long long value;
     int nodenr;
 
     optarg = get_opt_name(option, 128, optarg, ',') + 1;
@@ -733,7 +773,7 @@ static void numa_add(const char *optarg)
         }
 
         if (get_param_value(option, 128, "mem", optarg) == 0) {
-            node_mem[nodenr] = 0;
+            numa_info[nodenr].guest_mem = 0;
         } else {
             value = strtoull(option, &endptr, 0);
             switch (*endptr) {
@@ -744,29 +784,12 @@ static void numa_add(const char *optarg)
                 value <<= 30;
                 break;
             }
-            node_mem[nodenr] = value;
+            numa_info[nodenr].guest_mem = value;
         }
         if (get_param_value(option, 128, "cpus", optarg) == 0) {
-            node_cpumask[nodenr] = 0;
+            bitmap_zero(numa_info[nodenr].guest_cpu, MAX_NUMA_VCPUS);
         } else {
-            value = strtoull(option, &endptr, 10);
-            if (value >= 64) {
-                value = 63;
-                fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
-            } else {
-                if (*endptr == '-') {
-                    endvalue = strtoull(endptr+1, &endptr, 10);
-                    if (endvalue >= 63) {
-                        endvalue = 62;
-                        fprintf(stderr,
-                            "only 63 CPUs in NUMA mode supported.\n");
-                    }
-                    value = (2ULL << endvalue) - (1ULL << value);
-                } else {
-                    value = 1ULL << value;
-                }
-            }
-            node_cpumask[nodenr] = value;
+            parse_bitmap(option, numa_info[nodenr].guest_cpu, MAX_NUMA_VCPUS);
         }
         nb_numa_nodes++;
     }
@@ -1870,8 +1893,8 @@ int main(int argc, char **argv, char **envp)
     translation = BIOS_ATA_TRANSLATION_AUTO;
 
     for (i = 0; i < MAX_NODES; i++) {
-        node_mem[i] = 0;
-        node_cpumask[i] = 0;
+        numa_info[i].guest_mem = 0;
+        bitmap_zero(numa_info[i].guest_cpu, MAX_NUMA_VCPUS);
     }
 
     assigned_devices_index = 0;
@@ -2887,7 +2910,7 @@ int main(int argc, char **argv, char **envp)
          * and distribute the available memory equally across all nodes
          */
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_mem[i] != 0)
+            if (numa_info[i].guest_mem != 0)
                 break;
         }
         if (i == nb_numa_nodes) {
@@ -2897,14 +2920,18 @@ int main(int argc, char **argv, char **envp)
              * the final node gets the rest.
              */
             for (i = 0; i < nb_numa_nodes - 1; i++) {
-                node_mem[i] = (ram_size / nb_numa_nodes) & ~((1 << 23UL) - 1);
-                usedmem += node_mem[i];
+                numa_info[i].guest_mem = (ram_size / nb_numa_nodes) &
+                    ~((1 << 23UL) - 1);
+                usedmem += numa_info[i].guest_mem;
             }
-            node_mem[i] = ram_size - usedmem;
+            numa_info[i].guest_mem = ram_size - usedmem;
         }
 
+        /* check whether any guest CPU number has been specified.
+         * If not, we use an automatic assignment algorithm.
+         */
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] != 0)
+            if (!bitmap_empty(numa_info[i].guest_cpu, MAX_NUMA_VCPUS))
                 break;
         }
         /* assigning the VCPUs round-robin is easier to implement, guest OSes
@@ -2912,9 +2939,8 @@ int main(int argc, char **argv, char **envp)
          * real machines which also use this scheme.
          */
         if (i == nb_numa_nodes) {
-            for (i = 0; i < smp_cpus; i++) {
-                node_cpumask[i % nb_numa_nodes] |= 1 << i;
-            }
+            for (i = 0; i < smp_cpus; i++)
+                set_bit(i, numa_info[i % nb_numa_nodes].guest_cpu);
         }
     }
 
-- 
1.6.4




* [PATCH 2/4] NUMA: add Linux libnuma detection
  2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
  2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
@ 2010-08-11 13:52 ` Andre Przywara
  2010-08-11 13:52 ` [PATCH 3/4] NUMA: parse new host dependent command line options Andre Przywara
  2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
  3 siblings, 0 replies; 13+ messages in thread
From: Andre Przywara @ 2010-08-11 13:52 UTC (permalink / raw)
  To: avi, anthony; +Cc: kvm, Andre Przywara

Add detection of libnuma (mostly contained in the numactl package)
to the configure script. Currently this is Linux only, but can be
extended later should the need for other interfaces come up.
It can be enabled or disabled on the configure command line; the
default is to use it if available.
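
Apart from the mbind(2) call itself, the series essentially only needs
numa_available() (used by the probe below) and numa_num_configured_nodes()
(used in patch 4) from libnuma. A minimal standalone check, built with
-lnuma, could look like this:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {  /* returns -1 if the host lacks NUMA support */
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }
    printf("%d host node(s) configured\n", numa_num_configured_nodes());
    return 0;
}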

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
---
 configure |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index af50607..91d5e48 100755
--- a/configure
+++ b/configure
@@ -282,6 +282,7 @@ xen=""
 linux_aio=""
 attr=""
 vhost_net=""
+numa="yes"
 
 gprof="no"
 debug_tcg="no"
@@ -722,6 +723,10 @@ for opt do
   ;;
   --enable-vhost-net) vhost_net="yes"
   ;;
+  --disable-numa) numa="no"
+  ;;
+  --enable-numa) numa="yes"
+  ;;
   --*dir)
   ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
@@ -909,6 +914,8 @@ echo "  --enable-docs            enable documentation build"
 echo "  --disable-docs           disable documentation build"
 echo "  --disable-vhost-net      disable vhost-net acceleration support"
 echo "  --enable-vhost-net       enable vhost-net acceleration support"
+echo "  --disable-numa           disable host Linux NUMA support"
+echo "  --enable-numa            enable host Linux NUMA support"
 echo ""
 echo "NOTE: The object files are built at the place where configure is launched"
 exit 1
@@ -1987,6 +1994,28 @@ if compile_prog "" "" ; then
   signalfd=yes
 fi
 
+##########################################
+# libnuma probe
+
+if test "$numa" = "yes" ; then
+  numa=no
+  cat > $TMPC << EOF
+#include <numa.h>
+int main(void) { return numa_available(); }
+EOF
+
+  if compile_prog "" "-lnuma" ; then
+    numa=yes
+    libs_softmmu="-lnuma $libs_softmmu"
+  else
+    if test "$numa" = "yes" ; then
+      feature_not_found "linux NUMA (install numactl?)"
+    fi
+    numa=no
+  fi
+fi
+
+
 # check if eventfd is supported
 eventfd=no
 cat > $TMPC << EOF
@@ -2256,6 +2285,7 @@ echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
 echo "uuid support      $uuid"
 echo "vhost-net support $vhost_net"
+echo "NUMA host support $numa"
 
 if test $sdl_too_old = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -2487,6 +2517,9 @@ if test $cpu_emulation = "yes"; then
 else
   echo "CONFIG_NO_CPU_EMULATION=y" >> $config_host_mak
 fi
+if test "$numa" = "yes"; then
+  echo "CONFIG_NUMA=y" >> $config_host_mak
+fi
 
 # XXX: suppress that
 if [ "$bsd" = "yes" ] ; then
-- 
1.6.4




* [PATCH 3/4] NUMA: parse new host dependent command line options
  2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
  2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
  2010-08-11 13:52 ` [PATCH 2/4] NUMA: add Linux libnuma detection Andre Przywara
@ 2010-08-11 13:52 ` Andre Przywara
  2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
  3 siblings, 0 replies; 13+ messages in thread
From: Andre Przywara @ 2010-08-11 13:52 UTC (permalink / raw)
  To: avi, anthony; +Cc: kvm, Andre Przywara

To keep the host and the guest NUMA configuration separate, the host
NUMA options can be specified independently of the guest ones.
Mimicking numactl's syntax, the parser allows specifying the NUMA
binding policy for each guest node. It supports membind, interleave
and preferred together with negation (!) and CPUSET relative
addressing (+). Since the comma is already used by the QEMU
command line interpreter, it cannot be used here to enumerate
host nodes (but '-' is supported). Example:
$ qemu ... -numa node -numa host,nodeid=0,interleave=+0-1
(uses interleaving on the first two nodes belonging to the current
CPUSET for the one guest node)
$ qemu ... -numa node -numa node -numa host,nodeid=0,membind=3 \
  -numa host,nodeid=1,preferred=!2-3
(binding the first guest node to host node 3 and the second guest
node to any node except 2 and 3)
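
To illustrate the '!' form, here is a tiny standalone rendition of the
negation/range handling for "preferred=!2-3" (the real parsing is done by
parse_bitmap() from patch 1 on the per-node host bitmap):

#include <stdio.h>

int main(void)
{
    unsigned long mask = ~0UL;   /* '!' -> start from all host nodes ... */
    int n;

    for (n = 2; n <= 3; n++) {   /* ... and clear the given range 2-3 */
        mask &= ~(1UL << n);
    }
    /* bits 2 and 3 are now cleared, every other node remains allowed */
    printf("resulting host node mask: %#lx\n", mask);
    return 0;
}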

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
---
 sysemu.h |    8 ++++++++
 vl.c     |   25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/sysemu.h b/sysemu.h
index e5f88d1..52fedd4 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -140,10 +140,18 @@ extern long hpagesize;
 #ifndef MAX_NUMA_VCPUS
 #define MAX_NUMA_VCPUS 256
 #endif
+#define NODE_HOST_NONE        0x00
+#define NODE_HOST_BIND        0x01
+#define NODE_HOST_INTERLEAVE  0x02
+#define NODE_HOST_PREFERRED   0x03
+#define NODE_HOST_POLICY_MASK 0x03
+#define NODE_HOST_RELATIVE    0x04
 extern int nb_numa_nodes;
 struct numa_info {
     uint64_t guest_mem;
     DECLARE_BITMAP(guest_cpu, MAX_NUMA_VCPUS);
+    DECLARE_BITMAP(host_mem, MAX_NUMA_VCPUS);
+    unsigned int flags;
 };
 extern struct numa_info numa_info[MAX_NODES];
 
diff --git a/vl.c b/vl.c
index 40fac59..6df9cc9 100644
--- a/vl.c
+++ b/vl.c
@@ -792,6 +792,29 @@ static void numa_add(const char *optarg)
             parse_bitmap(option, numa_info[nodenr].guest_cpu, MAX_NUMA_VCPUS);
         }
         nb_numa_nodes++;
+    } else if (!strcmp(option, "host")) {
+        if (get_param_value(option, 128, "nodeid", optarg) == 0) {
+            fprintf(stderr, "error: need nodeid for -numa host,...\n");
+            exit(1);
+        }
+        nodenr = strtoull(option, NULL, 10);
+        if (nodenr >= nb_numa_nodes) {
+            fprintf(stderr, "nodeid exceeds specified NUMA nodes\n");
+            exit(1);
+        }
+        numa_info[nodenr].flags = NODE_HOST_NONE;
+        option[0] = 0;
+        if (get_param_value(option, 128, "interleave", optarg) != 0)
+            numa_info[nodenr].flags |= NODE_HOST_INTERLEAVE;
+        else if (get_param_value(option, 128, "preferred", optarg) != 0)
+            numa_info[nodenr].flags |= NODE_HOST_PREFERRED;
+        else if (get_param_value(option, 128, "membind", optarg) != 0)
+            numa_info[nodenr].flags |= NODE_HOST_BIND;
+        if (option[0] != 0) {
+            if (parse_bitmap(option, numa_info[nodenr].host_mem,
+                MAX_NUMA_VCPUS) & 1)
+                numa_info[nodenr].flags |= NODE_HOST_RELATIVE;
+        }
     }
     return;
 }
@@ -1895,6 +1918,8 @@ int main(int argc, char **argv, char **envp)
     for (i = 0; i < MAX_NODES; i++) {
         numa_info[i].guest_mem = 0;
         bitmap_zero(numa_info[i].guest_cpu, MAX_NUMA_VCPUS);
+        bitmap_zero(numa_info[i].host_mem, MAX_NUMA_VCPUS);
+        numa_info[i].flags = NODE_HOST_NONE;
     }
 
     assigned_devices_index = 0;
-- 
1.6.4




* [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
                   ` (2 preceding siblings ...)
  2010-08-11 13:52 ` [PATCH 3/4] NUMA: parse new host dependent command line options Andre Przywara
@ 2010-08-11 13:52 ` Andre Przywara
  2010-08-23 18:59   ` Marcelo Tosatti
  3 siblings, 1 reply; 13+ messages in thread
From: Andre Przywara @ 2010-08-11 13:52 UTC (permalink / raw)
  To: avi, anthony; +Cc: kvm, Andre Przywara

According to the user-provided assignment, bind the respective part
of the guest's memory to the given host node. This uses Linux'
mbind syscall (which is wrapped only in libnuma) to realize the
pinning right after the allocation.
Failures are not fatal, but produce a warning.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
---
 hw/pc.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 1b24409..dbfc082 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -42,6 +42,15 @@
 #include "device-assignment.h"
 #include "kvm.h"
 
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#include <numaif.h>
+#ifndef MPOL_F_RELATIVE_NODES
+  #define MPOL_F_RELATIVE_NODES (1 << 14)
+  #define MPOL_F_STATIC_NODES (1 << 15)
+#endif
+#endif
+
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
 
@@ -882,6 +891,53 @@ void pc_cpus_init(const char *cpu_model)
     }
 }
 
+static void bind_numa(ram_addr_t ram_addr)
+{
+#ifdef CONFIG_NUMA
+    int i;
+    char* ram_ptr;
+    ram_addr_t len, ram_offset;
+    int bind_mode;
+
+    ram_ptr = qemu_get_ram_ptr(ram_addr);
+
+    ram_offset = 0;
+    for (i = 0; i < nb_numa_nodes; i++) {
+        len = numa_info[i].guest_mem;
+        if (numa_info[i].flags != 0) {
+            switch (numa_info[i].flags & NODE_HOST_POLICY_MASK) {
+            case NODE_HOST_BIND:
+                bind_mode = MPOL_BIND;
+                break;
+            case NODE_HOST_INTERLEAVE:
+                bind_mode = MPOL_INTERLEAVE;
+                break;
+            case NODE_HOST_PREFERRED:
+                bind_mode = MPOL_PREFERRED;
+                break;
+            default:
+                bind_mode = MPOL_DEFAULT;
+                break;
+            }
+            bind_mode |= (numa_info[i].flags & NODE_HOST_RELATIVE) ?
+                MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
+
+            /* This is a workaround for a long standing bug in Linux'
+             * mbind implementation, which cuts off the last specified
+             * node. To stay compatible should this bug be fixed, we
+             * specify one more node and zero this one out.
+             */
+            clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem);
+            if (mbind(ram_ptr + ram_offset, len, bind_mode,
+                numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0))
+                    perror("mbind");
+        }
+        ram_offset += len;
+    }
+#endif
+    return;
+}
+
 void pc_memory_init(ram_addr_t ram_size,
                     const char *kernel_filename,
                     const char *kernel_cmdline,
@@ -919,6 +975,8 @@ void pc_memory_init(ram_addr_t ram_size,
     cpu_register_physical_memory(0x100000,
                  below_4g_mem_size - 0x100000,
                  ram_addr + 0x100000);
+    bind_numa(ram_addr);
+
 #if TARGET_PHYS_ADDR_BITS > 32
     cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size,
                                  ram_addr + below_4g_mem_size);
-- 
1.6.4




* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
@ 2010-08-23 18:59   ` Marcelo Tosatti
  2010-08-23 19:27     ` Anthony Liguori
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2010-08-23 18:59 UTC (permalink / raw)
  To: Andre Przywara; +Cc: avi, anthony, kvm

On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> According to the user-provided assignment bind the respective part
> of the guest's memory to the given host node. This uses Linux'
> mbind syscall (which is wrapped only in libnuma) to realize the
> pinning right after the allocation.
> Failures are not fatal, but produce a warning.
> 
> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> ---
>  hw/pc.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 58 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/pc.c b/hw/pc.c
> index 1b24409..dbfc082 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -42,6 +42,15 @@
>  #include "device-assignment.h"
>  #include "kvm.h"
>  
> +#ifdef CONFIG_NUMA
> +#include <numa.h>
> +#include <numaif.h>
> +#ifndef MPOL_F_RELATIVE_NODES
> +  #define MPOL_F_RELATIVE_NODES (1 << 14)
> +  #define MPOL_F_STATIC_NODES (1 << 15)
> +#endif
> +#endif
> +
>  /* output Bochs bios info messages */
>  //#define DEBUG_BIOS
>  
> @@ -882,6 +891,53 @@ void pc_cpus_init(const char *cpu_model)
>      }
>  }
>  
> +static void bind_numa(ram_addr_t ram_addr)
> +{
> +#ifdef CONFIG_NUMA
> +    int i;
> +    char* ram_ptr;
> +    ram_addr_t len, ram_offset;
> +    int bind_mode;
> +
> +    ram_ptr = qemu_get_ram_ptr(ram_addr);
> +
> +    ram_offset = 0;
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        len = numa_info[i].guest_mem;
> +        if (numa_info[i].flags != 0) {
> +            switch (numa_info[i].flags & NODE_HOST_POLICY_MASK) {
> +            case NODE_HOST_BIND:
> +                bind_mode = MPOL_BIND;
> +                break;
> +            case NODE_HOST_INTERLEAVE:
> +                bind_mode = MPOL_INTERLEAVE;
> +                break;
> +            case NODE_HOST_PREFERRED:
> +                bind_mode = MPOL_PREFERRED;
> +                break;
> +            default:
> +                bind_mode = MPOL_DEFAULT;
> +                break;
> +            }
> +            bind_mode |= (numa_info[i].flags & NODE_HOST_RELATIVE) ?
> +                MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
> +
> +            /* This is a workaround for a long standing bug in Linux'
> +             * mbind implementation, which cuts off the last specified
> +             * node. To stay compatible should this bug be fixed, we
> +             * specify one more node and zero this one out.
> +             */
> +            clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem);
> +            if (mbind(ram_ptr + ram_offset, len, bind_mode,
> +                numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0))
> +                    perror("mbind");
> +        }
> +        ram_offset += len;
> +    }
> +#endif

Why is it not possible (or perhaps not desired) to change the binding
after the guest is started?

Sounds inflexible.



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-23 18:59   ` Marcelo Tosatti
@ 2010-08-23 19:27     ` Anthony Liguori
  2010-08-23 21:16       ` Andre Przywara
  0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2010-08-23 19:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andre Przywara, avi, kvm

On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>    
>> According to the user-provided assignment bind the respective part
>> of the guest's memory to the given host node. This uses Linux'
>> mbind syscall (which is wrapped only in libnuma) to realize the
>> pinning right after the allocation.
>> Failures are not fatal, but produce a warning.
>>
>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
>> ---
>>   hw/pc.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 files changed, 58 insertions(+), 0 deletions(-)
>>
>> diff --git a/hw/pc.c b/hw/pc.c
>> index 1b24409..dbfc082 100644
>> --- a/hw/pc.c
>> +++ b/hw/pc.c
>> @@ -42,6 +42,15 @@
>>   #include "device-assignment.h"
>>   #include "kvm.h"
>>
>> +#ifdef CONFIG_NUMA
>> +#include<numa.h>
>> +#include<numaif.h>
>> +#ifndef MPOL_F_RELATIVE_NODES
>> +  #define MPOL_F_RELATIVE_NODES (1<<  14)
>> +  #define MPOL_F_STATIC_NODES (1<<  15)
>> +#endif
>> +#endif
>> +
>>   /* output Bochs bios info messages */
>>   //#define DEBUG_BIOS
>>
>> @@ -882,6 +891,53 @@ void pc_cpus_init(const char *cpu_model)
>>       }
>>   }
>>
>> +static void bind_numa(ram_addr_t ram_addr)
>> +{
>> +#ifdef CONFIG_NUMA
>> +    int i;
>> +    char* ram_ptr;
>> +    ram_addr_t len, ram_offset;
>> +    int bind_mode;
>> +
>> +    ram_ptr = qemu_get_ram_ptr(ram_addr);
>> +
>> +    ram_offset = 0;
>> +    for (i = 0; i<  nb_numa_nodes; i++) {
>> +        len = numa_info[i].guest_mem;
>> +        if (numa_info[i].flags != 0) {
>> +            switch (numa_info[i].flags&  NODE_HOST_POLICY_MASK) {
>> +            case NODE_HOST_BIND:
>> +                bind_mode = MPOL_BIND;
>> +                break;
>> +            case NODE_HOST_INTERLEAVE:
>> +                bind_mode = MPOL_INTERLEAVE;
>> +                break;
>> +            case NODE_HOST_PREFERRED:
>> +                bind_mode = MPOL_PREFERRED;
>> +                break;
>> +            default:
>> +                bind_mode = MPOL_DEFAULT;
>> +                break;
>> +            }
>> +            bind_mode |= (numa_info[i].flags&  NODE_HOST_RELATIVE) ?
>> +                MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
>> +
>> +            /* This is a workaround for a long standing bug in Linux'
>> +             * mbind implementation, which cuts off the last specified
>> +             * node. To stay compatible should this bug be fixed, we
>> +             * specify one more node and zero this one out.
>> +             */
>> +            clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem);
>> +            if (mbind(ram_ptr + ram_offset, len, bind_mode,
>> +                numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0))
>> +                    perror("mbind");
>> +        }
>> +        ram_offset += len;
>> +    }
>> +#endif
>>      
> Why is it not possible (or perhaps not desired) to change the binding
> after the guest is started?
>
> Sounds unflexible.
>    

We really need a solution that lets a user use a tool like numactl 
outside of the QEMU instance.

Regards,

Anthony Liguori



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-23 19:27     ` Anthony Liguori
@ 2010-08-23 21:16       ` Andre Przywara
  2010-08-23 21:27         ` Anthony Liguori
  0 siblings, 1 reply; 13+ messages in thread
From: Andre Przywara @ 2010-08-23 21:16 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Marcelo Tosatti, avi, kvm

Anthony Liguori wrote:
> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>    
>>> According to the user-provided assignment bind the respective part
>>> of the guest's memory to the given host node. This uses Linux'
>>> mbind syscall (which is wrapped only in libnuma) to realize the
>>> pinning right after the allocation.
>>> Failures are not fatal, but produce a warning.
>>>
>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
 >>> ...
>>>      
>> Why is it not possible (or perhaps not desired) to change the binding
>> after the guest is started?
>>
>> Sounds unflexible.
>>    
The solution is to introduce a monitor interface to adjust the 
pinning later, allowing both changing just the affinity (which only affects 
future fault-ins) and actually copying the memory (more costly).
Actually this is the next item on my list, but I wanted to bring up the 
basics first to avoid recoding parts afterwards. Also I am not (yet) 
familiar with the QMP protocol.
> 
> We really need a solution that lets a user use a tool like numactl 
> outside of the QEMU instance.
I fear that is not how it's meant to work with the Linux NUMA API. In 
contrast to the VCPU threads, which are externally visible entities 
(PIDs), the memory should be private to the QEMU process. While you can 
change the NUMA allocation policy of the _whole_ process, there is no 
way to externally distinguish parts of the process' memory. Although you 
could later (and externally) migrate already faulted pages (via 
move_pages(2) and by looking at /proc/$$/numa_maps), you would let an 
external tool interfere with QEMU's internal memory management. Take for 
instance the change of the allocation policy regarding the 1MB and 
3.5-4GB holes. An external tool would either have to track such changes, 
or such things simply could not be changed in QEMU anymore. So what is 
wrong with keeping that code in QEMU, which knows best about the internals 
and already has flexible and powerful ways (command line and QMP) of 
manipulating its behavior?

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-23 21:16       ` Andre Przywara
@ 2010-08-23 21:27         ` Anthony Liguori
  2010-08-31 20:54           ` Andrew Theurer
  0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2010-08-23 21:27 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Marcelo Tosatti, avi, kvm

On 08/23/2010 04:16 PM, Andre Przywara wrote:
> Anthony Liguori wrote:
>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>> According to the user-provided assignment bind the respective part
>>>> of the guest's memory to the given host node. This uses Linux'
>>>> mbind syscall (which is wrapped only in libnuma) to realize the
>>>> pinning right after the allocation.
>>>> Failures are not fatal, but produce a warning.
>>>>
>>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
> >>> ...
>>> Why is it not possible (or perhaps not desired) to change the binding
>>> after the guest is started?
>>>
>>> Sounds unflexible.
> The solution is to introduce a monitor interface to later adjust the 
> pinning, allowing both changing the affinity only (only valid for 
> future fault-ins) and actually copying the memory (more costly).

This is just duplicating numactl.

> Actually this is the next item on my list, but I wanted to bring up 
> the basics first to avoid recoding parts afterwards. Also I am not 
> (yet) familiar with the QMP protocol.
>>
>> We really need a solution that lets a user use a tool like numactl 
>> outside of the QEMU instance.
> I fear that is not how it's meant to work with the Linux' NUMA API. In 
> opposite to the VCPU threads, which are externally visible entities 
> (PIDs), the memory should be private to the QEMU process. While you 
> can change the NUMA allocation policy of the _whole_ process, there is 
> no way to externally distinguish parts of the process' memory. 
> Although you could later (and externally) migrate already faulted 
> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you 
> would let an external tool interfere with QEMUs internal memory 
> management. Take for instance the change of the allocation policy 
> regarding the 1MB and 3.5-4GB holes. An external tool would have to 
> either track such changes or you simply could not change such things 
> in QEMU.

It's extremely likely that if you're doing NUMA pinning, you're also 
doing large pages via hugetlbfs.  numactl can already set policies for 
files in hugetlbfs so all you need to do is have a separate hugetlbfs 
file for each numa node.

Then you have all the flexibility of numactl and you can implement node 
migration external to QEMU if you so desire.
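
To make that concrete, here is a minimal sketch of the QEMU side (made-up
paths and sizes, not a patch): back each guest node with its own file on a
hugetlbfs mount, and let an external tool such as numactl apply a policy
per file (e.g. via its --file mode):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_node_backing(const char *path, size_t len)
{
    void *p;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0 || ftruncate(fd, len) < 0) {
        perror(path);
        exit(1);
    }
    /* a file on hugetlbfs is already backed by huge pages */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    close(fd);
    return p;
}

int main(void)
{
    /* made-up example paths: one backing file per guest node, so the
     * NUMA policy can be set externally on each file */
    map_node_backing("/hugepages/guest-node0", 1UL << 30);
    map_node_backing("/hugepages/guest-node1", 1UL << 30);
    return 0;
}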

> So what is wrong with keeping that code in QEMU, which knows best 
> about the internals and already has flexible and mighty ways (command 
> line and QMP) of manipulating its behavior?

NUMA is a last-mile optimization.  For the audience that cares about 
this level of optimization, only providing an interface that allows a 
small set of those optimizations to be used is unacceptable.

There's a very simple way to do this right and that's by adding 
interfaces to QEMU that let us work with existing tooling instead of 
inventing new interfaces.

Regards,

Anthony Liguori

> Regards,
> Andre.
>



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-23 21:27         ` Anthony Liguori
@ 2010-08-31 20:54           ` Andrew Theurer
  2010-08-31 22:03             ` Anthony Liguori
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Theurer @ 2010-08-31 20:54 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Andre Przywara, Marcelo Tosatti, avi, kvm

On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> > Anthony Liguori wrote:
> >> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>> According to the user-provided assignment bind the respective part
> >>>> of the guest's memory to the given host node. This uses Linux'
> >>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>> pinning right after the allocation.
> >>>> Failures are not fatal, but produce a warning.
> >>>>
> >>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
> > >>> ...
> >>> Why is it not possible (or perhaps not desired) to change the binding
> >>> after the guest is started?
> >>>
> >>> Sounds unflexible.
> > The solution is to introduce a monitor interface to later adjust the 
> > pinning, allowing both changing the affinity only (only valid for 
> > future fault-ins) and actually copying the memory (more costly).
> 
> This is just duplicating numactl.
> 
> > Actually this is the next item on my list, but I wanted to bring up 
> > the basics first to avoid recoding parts afterwards. Also I am not 
> > (yet) familiar with the QMP protocol.
> >>
> >> We really need a solution that lets a user use a tool like numactl 
> >> outside of the QEMU instance.
> > I fear that is not how it's meant to work with the Linux' NUMA API. In 
> > opposite to the VCPU threads, which are externally visible entities 
> > (PIDs), the memory should be private to the QEMU process. While you 
> > can change the NUMA allocation policy of the _whole_ process, there is 
> > no way to externally distinguish parts of the process' memory. 
> > Although you could later (and externally) migrate already faulted 
> > pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you 
> > would let an external tool interfere with QEMUs internal memory 
> > management. Take for instance the change of the allocation policy 
> > regarding the 1MB and 3.5-4GB holes. An external tool would have to 
> > either track such changes or you simply could not change such things 
> > in QEMU.
> 
> It's extremely likely that if you're doing NUMA pinning, you're also 
> doing large pages via hugetlbfs.  numactl can already set policies for 
> files in hugetlbfs so all you need to do is have a separate hugetlbfs 
> file for each numa node.

Why would we resort to hugetlbfs when we have transparent hugepages?

FWIW, large apps like databases have set a precedent for managing their
own NUMA policies.  I don't see why qemu should be any different.
Numactl is great for small apps that need to be pinned in one node, or
spread evenly on all nodes.  Having to get hugetlbfs involved just to
work around a shortcoming of numactl seems like a bad idea.
> 
> Then you have all the flexibility of numactl and you can implement node 
> migration external to QEMU if you so desire.
> 
> > So what is wrong with keeping that code in QEMU, which knows best 
> > about the internals and already has flexible and mighty ways (command 
> > line and QMP) of manipulating its behavior?
> 
> NUMA is a last-mile optimization.  For the audience that cares about 
> this level of optimization, only providing an interface that allows a 
> small set of those optimizations to be used is unacceptable.
> 
> There's a very simple way to do this right and that's by adding 
> interfaces to QEMU that let's us work with existing tooling instead of 
> inventing new interfaces.
> 
> Regards,
> 
> Anthony Liguori
> 
> > Regards,
> > Andre.

-Andrew Theurer



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-31 20:54           ` Andrew Theurer
@ 2010-08-31 22:03             ` Anthony Liguori
  2010-09-01  3:38               ` Andrew Theurer
  2010-09-09 20:00               ` Andre Przywara
  0 siblings, 2 replies; 13+ messages in thread
From: Anthony Liguori @ 2010-08-31 22:03 UTC (permalink / raw)
  To: habanero; +Cc: Andre Przywara, Marcelo Tosatti, avi, kvm

On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
>    
>> On 08/23/2010 04:16 PM, Andre Przywara wrote:
>>      
>>> Anthony Liguori wrote:
>>>        
>>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>>>          
>>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>>>            
>>>>>> According to the user-provided assignment bind the respective part
>>>>>> of the guest's memory to the given host node. This uses Linux'
>>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
>>>>>> pinning right after the allocation.
>>>>>> Failures are not fatal, but produce a warning.
>>>>>>
>>>>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
>>>>>> ...
>>>>>>              
>>>>> Why is it not possible (or perhaps not desired) to change the binding
>>>>> after the guest is started?
>>>>>
>>>>> Sounds unflexible.
>>>>>            
>>> The solution is to introduce a monitor interface to later adjust the
>>> pinning, allowing both changing the affinity only (only valid for
>>> future fault-ins) and actually copying the memory (more costly).
>>>        
>> This is just duplicating numactl.
>>
>>      
>>> Actually this is the next item on my list, but I wanted to bring up
>>> the basics first to avoid recoding parts afterwards. Also I am not
>>> (yet) familiar with the QMP protocol.
>>>        
>>>> We really need a solution that lets a user use a tool like numactl
>>>> outside of the QEMU instance.
>>>>          
>>> I fear that is not how it's meant to work with the Linux' NUMA API. In
>>> opposite to the VCPU threads, which are externally visible entities
>>> (PIDs), the memory should be private to the QEMU process. While you
>>> can change the NUMA allocation policy of the _whole_ process, there is
>>> no way to externally distinguish parts of the process' memory.
>>> Although you could later (and externally) migrate already faulted
>>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
>>> would let an external tool interfere with QEMUs internal memory
>>> management. Take for instance the change of the allocation policy
>>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
>>> either track such changes or you simply could not change such things
>>> in QEMU.
>>>        
>> It's extremely likely that if you're doing NUMA pinning, you're also
>> doing large pages via hugetlbfs.  numactl can already set policies for
>> files in hugetlbfs so all you need to do is have a separate hugetlbfs
>> file for each numa node.
>>      
> Why would we resort to hugetlbfs when we have transparent hugepages?
>    

If you care about NUMA pinning, I can't believe you don't want 
guaranteed large page allocation which THP does not provide.

The general point though is that we should find a way to partition 
memory in qemu such that an external process can control the actual NUMA 
placement.  This gives us maximum flexibility.

Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
nodes?  Can we migrate memory between nodes?  Should we support 
interleaving memory between two virtual nodes?  Why pick and choose when 
we can have it all.

> FWIW, large apps like databases have set a precedent for managing their
> own NUMA policies.

Of course because they know what their NUMA policy should be.  They live 
in a simple world where they assume they're the only application in the 
system, they read the distance tables, figure they'll use XX% of all 
physical memory, and then pin how they see fit.

But an individual QEMU process lives in a complex world.  It's almost 
never the only thing on the system and it's only allowed to use a subset 
of resources.  It's not sure what set of resources it can and can't use 
and that's often times changing.  The topology chosen for a guest is 
static but it's host topology may be dynamic due to thinks like live 
migration.

In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
Instead, it needs to let something with a larger view of the system 
determine a NUMA policy that makes sense overall.

There are two ways we can do this.  We can implement monitor commands 
that attempt to expose every single NUMA tunable possible.  Or, we can 
tie into the existing commands which guarantee that we support every 
possible tunable and that as NUMA support in Linux evolves, we get all 
the new features for free.

And, since numactl already supports setting policies on files in 
hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
work per-node instead of globally.  And it's useful to implement other 
types of things like having one node be guaranteed large pages and 
another node THP or some other fanciness.

Sounds awfully appealing to me.

>    I don't see why qemu should be any different.
> Numactl is great for small apps that need to be pinned in one node, or
> spread evenly on all nodes.  Having to get hugetlbfs involved just to
> workaround a shortcoming of numactl just seems like a bad idea.
>    

You seem to be asserting that we should implement a full NUMA policy in 
QEMU.  What should it be when we don't (in QEMU) know what else is 
running on the system?

Regards,

Anthony Liguori



* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-31 22:03             ` Anthony Liguori
@ 2010-09-01  3:38               ` Andrew Theurer
  2010-09-09 20:00               ` Andre Przywara
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Theurer @ 2010-09-01  3:38 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Andre Przywara, Marcelo Tosatti, avi, kvm

On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >    
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>      
> >>> Anthony Liguori wrote:
> >>>        
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>          
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>            
> >>>>>> According to the user-provided assignment bind the respective part
> >>>>>> of the guest's memory to the given host node. This uses Linux'
> >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>>>> pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
> >>>>>> ...
> >>>>>>              
> >>>>> Why is it not possible (or perhaps not desired) to change the binding
> >>>>> after the guest is started?
> >>>>>
> >>>>> Sounds unflexible.
> >>>>>            
> >>> The solution is to introduce a monitor interface to later adjust the
> >>> pinning, allowing both changing the affinity only (only valid for
> >>> future fault-ins) and actually copying the memory (more costly).
> >>>        
> >> This is just duplicating numactl.
> >>
> >>      
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>        
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>          
> >>> I fear that is not how it's meant to work with the Linux' NUMA API. In
> >>> opposite to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process' memory.
> >>> Although you could later (and externally) migrate already faulted
> >>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMUs internal memory
> >>> management. Take for instance the change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
> >>> either track such changes or you simply could not change such things
> >>> in QEMU.
> >>>        
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs.  numactl can already set policies for
> >> files in hugetlbfs so all you need to do is have a separate hugetlbfs
> >> file for each numa node.
> >>      
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >    
> 
> If you care about NUMA pinning, I can't believe you don't want 
> guaranteed large page allocation which THP does not provide.

I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place cpus and memory, especially for large
VMs that need to be defined as multi-node.  For user defined pinning,
libhugetlbfs will probably be fine, but for most VMs, I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.  

> The general point though is that we should find a way to partition 
> memory in qemu such that an external process can control the actual NUMA 
> placement.  This gives us maximum flexibility.
> 
> Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
> nodes?  Can we migrate memory between nodes?  Should we support 
> interleaving memory between two virtual nodes?  Why pick and choose when 
> we can have it all.

If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from this.  Is there another way to change NUMA
policies on mappings from a user tool?  We can already inspect
with /proc/<pid>/numa_maps.  Is this something that could be added to
numactl?

> 
> > FWIW, large apps like databases have set a precedent for managing their
> > own NUMA policies.
> 
> Of course because they know what their NUMA policy should be.  They live 
> in a simple world where they assume they're the only application in the 
> system, they read the distance tables, figure they'll use XX% of all 
> physical memory, and then pin how they see fit.
> 
> But an individual QEMU process lives in a complex world.  It's almost 
> never the only thing on the system and it's only allowed to use a subset 
> of resources.  It's not sure what set of resources it can and can't use 
> and that's often times changing.  The topology chosen for a guest is 
> static but it's host topology may be dynamic due to thinks like live 
> migration.

True, that's why this would require support for changing it in the monitor.

> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
> Instead, it needs to let something with a larger view of the system 
> determine a NUMA policy that makes sense overall.

I agree.

> There are two ways we can do this.  We can implement monitor commands 
> that attempt to expose every single NUMA tunable possible.  Or, we can 
> tie into the existing commands which guarantee that we support every 
> possible tunable and that as NUMA support in Linux evolves, we get all 
> the new features for free.

Assuming there's no new thing one needs to expose in qemu to work with
whatever new feature numactl/libnuma gets.  But perhaps that's a lot
less likely.

> And, since numactl already supports setting policies on files in 
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
> work per-node instead of globally.  And it's useful to implement other 
> types of things like having one node be guaranteed large pages and 
> another node THP or some other fanciness.

If it were not dependent on hugetlbfs, then I don't think I would have
an issue.

> Sounds awfully appealing to me.
> 
> >    I don't see why qemu should be any different.
> > Numactl is great for small apps that need to be pinned in one node, or
> > spread evenly on all nodes.  Having to get hugetlbfs involved just to
> > workaround a shortcoming of numactl just seems like a bad idea.
> >    
> 
> You seem to be asserting that we should implement a full NUMA policy in 
> QEMU.  What should it be when we don't (in QEMU) know what else is 
> running on the system?

I don't think qemu itself should decide where to "be" on the system.
I would like to have -something- else make those decisions, either a
user or some mgmt daemon that looks at the whole picture.  Or <gulp> get
the scheduler involved (with new algorithms).

I am still quite curious whether numactl/libnuma could be extended to set
some policies on individual mappings.  Then we would not even need to
have multiple -mem-path's.

-Andrew

> 
> Regards,
> 
> Anthony Liguori
> 




* Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
  2010-08-31 22:03             ` Anthony Liguori
  2010-09-01  3:38               ` Andrew Theurer
@ 2010-09-09 20:00               ` Andre Przywara
  1 sibling, 0 replies; 13+ messages in thread
From: Andre Przywara @ 2010-09-09 20:00 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: habanero, Marcelo Tosatti, avi, kvm

Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
>> On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
>>    
>>> On 08/23/2010 04:16 PM, Andre Przywara wrote:
>>>> Anthony Liguori wrote:
>>>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>>>>

Sorry for the delay in this discussion; I was busy with other things.
...
>>>>        
>>> It's extremely likely that if you're doing NUMA pinning, you're also
>>> doing large pages via hugetlbfs.  numactl can already set policies for
>>> files in hugetlbfs so all you need to do is have a separate hugetlbfs
>>> file for each numa node.
>>>      
>> Why would we resort to hugetlbfs when we have transparent hugepages?
>>    
> 
> If you care about NUMA pinning, I can't believe you don't want 
> guaranteed large page allocation which THP does not provide.
I doubt that anyone _wants_ to care about NUMA. You only _have_ to care
about it sometimes, mostly when your performance drops while scaling up. So I
wouldn't consider NUMA a special HPC-only scenario, since virtually
every recent server has NUMA. I don't want to tell people that their
shiny new 48-core, 96 GB RAM box can only run VMs with at most 8 GB RAM
and not more than 6 cores.
So I don't want to see NUMA tied to the (IMHO clumsy) hugetlbfs
interface, if everyone else uses THP and is happy with that.

> The general point though is that we should find a way to partition 
> memory in qemu such that an external process can control the actual NUMA 
> placement.  This gives us maximum flexibility.
Why not do this from the management application and use the QEMU monitor
protocol?
> 
> Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
> nodes?  Can we migrate memory between nodes?  Should we support 
> interleaving memory between two virtual nodes?  Why pick and choose when 
> we can have it all.
We use either libnuma or the interface the kernel provides. I don't see
how this restricts us.
In general I don't see why you care so much about avoiding duplicating
numactl. numactl is only a small wrapper around the libnuma, which is
itself a wrapper around the kernel interface (mostly mbind and
setpolicy). numactl itself mostly provides command line parsing, which
is no rocket science and is even partly provided by the lib itself. I
only couldn't use it because the comma is used in QEMUs own cmdline parsing.

>> FWIW, large apps like databases have set a precedent for managing their
>> own NUMA policies.
> 
> Of course because they know what their NUMA policy should be.  They live 
> in a simple world where they assume they're the only application in the 
> system, they read the distance tables, figure they'll use XX% of all 
> physical memory, and then pin how they see fit.
> 
> But an individual QEMU process lives in a complex world.  It's almost 
> never the only thing on the system and it's only allowed to use a subset 
> of resources.  It's not sure what set of resources it can and can't use 
> and that's often times changing.  The topology chosen for a guest is 
> static but it's host topology may be dynamic due to thinks like live 
> migration.
> 
> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
> Instead, it needs to let something with a larger view of the system 
> determine a NUMA policy that makes sense overall.
That's why I want to provide an interface to an external management
application (which you probably have anyway).
> 
> There are two ways we can do this.  We can implement monitor commands 
> that attempt to expose every single NUMA tunable possible.  Or, we can 
> tie into the existing commands which guarantee that we support every 
> possible tunable and that as NUMA support in Linux evolves, we get all 
> the new features for free.
I don't see that there are so many tuning parameters that QEMU cannot
sanely implement them.
And basically that means that your virt management app calls numactl,
which then tunes QEMU. I consider the direct way cleaner.
> 
> And, since numactl already supports setting policies on files in 
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
> work per-node instead of globally.  And it's useful to implement other 
> types of things like having one node be guaranteed large pages and 
> another node THP or some other fanciness.
> 
> Sounds awfully appealing to me.
I see your point; it would be rather easy to just do so. But I also see
the deficiencies of this approach on the other hand:
1. We tie ourselves to hugetlbfs, which I consider broken and on its way
out. If THP is upstream (I hope it's only a matter of when, not if), I
don't want to be tied to the old way if I have a NUMA machine and need
larger guests.
2. We need to expose QEMU's internal guest memory layout and stick to
that "interface". Currently we allocate it all in one large chunk, but it
wasn't always so and may change again in the future. What about the
dynamic memory approaches; would this still work with visibly mmapped files?
3. We have to live with the shortcomings of hugetlbfs, namely not being
able to swap out, the need to allocate all memory early, the need to
reserve it beforehand, and the missing possibility to scatter it again.

I don't want NUMA to become a second-class citizen, in the sense that
using it restricts the possibilities.
> 
>>    I don't see why qemu should be any different.
>> Numactl is great for small apps that need to be pinned in one node, or
>> spread evenly on all nodes.  Having to get hugetlbfs involved just to
>> workaround a shortcoming of numactl just seems like a bad idea.
>>    
> 
> You seem to be asserting that we should implement a full NUMA policy in 
> QEMU.  What should it be when we don't (in QEMU) know what else is 
> running on the system?
?? Nobody wants QEMU to do the assignment itself; that is all the job of a
management application. We just provide an interface (which QEMU owns!)
to allow control.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12


