* [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes
@ 2013-08-23 4:09 Wanlong Gao
0 siblings, 12 replies; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
As you know, QEMU currently cannot direct its memory allocation, which may cause
cross-node access performance regressions in the guest.
Worse, if PCI passthrough is used, the directly attached device
does DMA transfers between the device and the QEMU process,
so all pages of the guest will be pinned by get_user_pages():
KVM_ASSIGN_PCI_DEVICE ioctl
kvm_vm_ioctl_assign_device()
=>kvm_assign_device()
=> kvm_iommu_map_memslots()
=> kvm_iommu_map_pages()
=> kvm_pin_pages()
So, with a directly attached device, every guest page's reference count is
incremented and page migration will no longer work; AutoNUMA will not work either.
Therefore, we should set the memory allocation policy of the guest nodes before
the pages are actually mapped.
With this patch set, we are able to set the memory policy of guest nodes
like the following:
-numa node,nodeid=0,cpus=0, \
-numa mem,size=1024M,policy=membind,host-nodes=0-1 \
-numa node,nodeid=1,cpus=1 \
-numa mem,size=1024M,policy=interleave,host-nodes=1
This supports a format like "policy={default|membind|interleave|preferred},relative=true,host-nodes=N-N".
Also add "set-mem-policy" QMP and HMP commands to set the memory policy.
Patch 11/12 adds a QMP command "query-numa" to show numa info through
this API, and patch 12/12 converts the "info numa" monitor command to use this
QMP command "query-numa".
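As an aside, the "host-nodes=N-N" range syntax used above can be illustrated with a small stand-alone parser. This is a hypothetical helper written for exposition only; the series itself parses such ranges through OptsVisitor's integer-range support rather than code like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Parse a host-nodes value of the form "N" or "N-M" into a node bitmask.
 * Returns 0 on success, -1 on malformed or reversed ranges.  Illustrative
 * only: the real option parsing in this series is done by OptsVisitor. */
static int parse_host_nodes(const char *str, uint64_t *mask)
{
    unsigned int first, last;
    int n = sscanf(str, "%u-%u", &first, &last);

    if (n == 1) {
        last = first;              /* a single node, e.g. "host-nodes=1" */
    } else if (n != 2) {
        return -1;                 /* not a number at all */
    }
    if (last < first || last >= 64) {
        return -1;                 /* reversed range or beyond the bitmask */
    }
    *mask = 0;
    for (unsigned int i = first; i <= last; i++) {
        *mask |= UINT64_C(1) << i; /* one bit per host node */
    }
    return 0;
}
```

With this sketch, "0-1" yields a mask of 0x3 and "1" yields 0x2, mirroring how a membind or interleave policy addresses a set of host nodes.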
V1->V2:
change to use QemuOpts in numa options (Paolo)
handle Error in mpol parser (Paolo)
change qmp command format to mem-policy=membind,mem-hostnode=0-1 like (Paolo)
V2->V3:
also handle Error in cpus parser (5/10)
split out common parser from cpus and hostnode parser (Bandan 6/10)
V3->V4:
rebase to request for comments
V4->V5:
use OptVisitor and split -numa option (Paolo)
- s/set-mpol/set-mem-policy (Andreas)
- s/mem-policy/policy
- s/mem-hostnode/host-nodes
fix hmp command process after error (Luiz)
add qmp command query-numa and convert info numa to it (Luiz)
V5->V6:
remove tabs in json file (Laszlo, Paolo)
add back "-numa node,mem=xxx" as legacy (Paolo)
change cpus and host-nodes to array (Laszlo, Eric)
change "nodeid" to "uint16"
add NumaMemPolicy enum type (Eric)
rebased on Laszlo's "OptsVisitor: support / flatten integer ranges for repeating options" patch set, thanks for Laszlo's help
V6->V7:
change UInt16 to uint16 (Laszlo)
fix a typo in adding qmp command set-mem-policy
V7-V8:
rebase to current master with Laszlo's V2 of OptsVisitor patch set
fix an error from an added whitespace line
V8->V9:
rebase to current master
check if total numa memory size is equal to ram_size (Paolo)
add comments to the OptsVisitor stuff in qapi-schema.json (Eric, Laszlo)
replace the use of numa_num_configured_nodes() (Andrew)
avoid abusing the fact i==nodeid (Andrew)
Wanlong Gao (12):
NUMA: add NumaOptions, NumaNodeOptions and NumaMemOptions
NUMA: split -numa option
NUMA: check if the total numa memory size is equal to ram_size
NUMA: move numa related code to numa.c
NUMA: Add numa_info structure to contain numa nodes info
NUMA: Add Linux libnuma detection
NUMA: parse guest numa nodes memory policy
NUMA: set guest numa nodes memory policy
NUMA: add qmp command set-mem-policy to set memory policy for NUMA
node
NUMA: add hmp command set-mem-policy
NUMA: add qmp command query-numa
NUMA: convert hmp command info_numa to use qmp command query_numa
Makefile.target | 2 +-
configure | 32 ++++
cpus.c | 14 --
hmp-commands.hx | 16 ++
hmp.c | 119 +++++++++++++
hmp.h | 2 +
hw/i386/pc.c | 4 +-
include/sysemu/cpus.h | 1 -
include/sysemu/sysemu.h | 16 +-
monitor.c | 21 +--
numa.c | 455 ++++++++++++++++++++++++++++++++++++++++++++++++
qapi-schema.json | 131 ++++++++++++++
qemu-options.hx | 6 +-
qmp-commands.hx | 90 ++++++++++
vl.c | 160 ++---------------
15 files changed, 885 insertions(+), 184 deletions(-)
create mode 100644 numa.c
--
1.8.4.rc4
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 01/12] NUMA: add NumaOptions, NumaNodeOptions and NumaMemOptions
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
These types are used to generate the option-parsing structures for OptsVisitor.
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
qapi-schema.json | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/qapi-schema.json b/qapi-schema.json
index a51f7d2..11851a1 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3773,3 +3773,50 @@
##
{ 'command': 'query-rx-filter', 'data': { '*name': 'str' },
'returns': ['RxFilterInfo'] }
+
+##
+# @NumaOptions
+#
+# A discriminated record of NUMA options. (for OptsVisitor)
+#
+# Since 1.7
+##
+{ 'union': 'NumaOptions',
+ 'data': {
+ 'node': 'NumaNodeOptions',
+ 'mem': 'NumaMemOptions' }}
+
+##
+# @NumaNodeOptions
+#
+# Create a guest NUMA node. (for OptsVisitor)
+#
+# @nodeid: #optional NUMA node ID
+#
+# @cpus: #optional VCPUs belong to this node
+#
+# @mem: #optional memory size of this node (remain as legacy)
+#
+# Since: 1.7
+##
+{ 'type': 'NumaNodeOptions',
+ 'data': {
+ '*nodeid': 'uint16',
+ '*cpus': ['uint16'],
+ '*mem': 'str' }}
+
+##
+# @NumaMemOptions
+#
+# Set memory information of guest NUMA node. (for OptsVisitor)
+#
+# @nodeid: #optional NUMA node ID
+#
+# @size: #optional memory size of this node
+#
+# Since 1.7
+##
+{ 'type': 'NumaMemOptions',
+ 'data': {
+ '*nodeid': 'uint16',
+ '*size': 'size' }}
--
1.8.4.rc4
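To make the schema above concrete: the QAPI generator turns a discriminated union like @NumaOptions into C types roughly shaped as below. This is a sketch derived from the schema alone; the exact names and layout emitted by the generator may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Approximate C shape of the schema.  An optional '*field' member becomes
 * a has_<field> flag plus the value; ['uint16'] becomes a linked list.
 * Illustrative only, not the generator's literal output. */
typedef struct uint16List {
    struct uint16List *next;
    uint16_t value;
} uint16List;

typedef struct {
    bool has_nodeid;
    uint16_t nodeid;
    bool has_cpus;
    uint16List *cpus;
    bool has_mem;
    char *mem;            /* legacy "mem=" size, kept as a string */
} NumaNodeOptions;

typedef struct {
    bool has_nodeid;
    uint16_t nodeid;
    bool has_size;
    uint64_t size;
} NumaMemOptions;

typedef enum {
    NUMA_OPTIONS_KIND_NODE,
    NUMA_OPTIONS_KIND_MEM,
} NumaOptionsKind;

typedef struct {
    NumaOptionsKind kind; /* discriminator: "type=node" or "type=mem" */
    union {
        NumaNodeOptions *node;
        NumaMemOptions *mem;
    };
} NumaOptions;
```

Later patches in the series dispatch on the discriminator in exactly this style, e.g. object->kind == NUMA_OPTIONS_KIND_MEM selects object->mem.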
* [Qemu-devel] [PATCH V9 02/12] NUMA: split -numa option
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Change the -numa option as Paolo suggested, like the following:
-numa node,nodeid=0,cpus=0-1 \
-numa mem,nodeid=0,size=1G
This new option will make upcoming memory hotplug support easier.
The new option is implemented using OptsVisitor, and
"-numa node,mem=xx" remains supported as a legacy format.
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
Makefile.target | 2 +-
include/sysemu/sysemu.h | 3 +
numa.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++
qemu-options.hx | 6 +-
vl.c | 113 ++++++-------------------------------
5 files changed, 168 insertions(+), 100 deletions(-)
create mode 100644 numa.c
diff --git a/Makefile.target b/Makefile.target
index 9a49852..7e1fddf 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -113,7 +113,7 @@ endif #CONFIG_BSD_USER
#########################################################
# System emulator target
ifdef CONFIG_SOFTMMU
-obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o ioport.o
+obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o ioport.o numa.o
obj-y += qtest.o
obj-y += hw/
obj-$(CONFIG_FDT) += device_tree.o
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index d7a77b6..474dd9e 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -129,8 +129,11 @@ extern QEMUClock *rtc_clock;
#define MAX_NODES 64
#define MAX_CPUMASK_BITS 255
extern int nb_numa_nodes;
+extern int nb_numa_mem_nodes;
extern uint64_t node_mem[MAX_NODES];
extern unsigned long *node_cpumask[MAX_NODES];
+extern QemuOptsList qemu_numa_opts;
+int numa_init_func(QemuOpts *opts, void *opaque);
#define MAX_OPTION_ROMS 16
typedef struct QEMUOptionRom {
diff --git a/numa.c b/numa.c
new file mode 100644
index 0000000..e6924f4
--- /dev/null
+++ b/numa.c
@@ -0,0 +1,144 @@
+/*
+ * QEMU System Emulator
+ *
+ * Copyright (c) 2013 Fujitsu Ltd.
+ * Author: Wanlong Gao <gaowanlong@cn.fujitsu.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "sysemu/sysemu.h"
+#include "qemu/bitmap.h"
+#include "qapi-visit.h"
+#include "qapi/opts-visitor.h"
+#include "qapi/dealloc-visitor.h"
+
+QemuOptsList qemu_numa_opts = {
+ .name = "numa",
+ .implied_opt_name = "type",
+ .head = QTAILQ_HEAD_INITIALIZER(qemu_numa_opts.head),
+ .desc = { { 0 } } /* validated with OptsVisitor */
+};
+
+static int numa_node_parse(NumaNodeOptions *opts)
+{
+ uint16_t nodenr;
+ uint16List *cpus = NULL;
+
+ if (opts->has_nodeid) {
+ nodenr = opts->nodeid;
+ if (nodenr >= MAX_NODES) {
+ fprintf(stderr, "qemu: Max number of NUMA nodes reached: %"
+ PRIu16 "\n", nodenr);
+ return -1;
+ }
+ } else {
+ nodenr = nb_numa_nodes;
+ }
+
+ for (cpus = opts->cpus; cpus; cpus = cpus->next) {
+ bitmap_set(node_cpumask[nodenr], cpus->value, 1);
+ }
+
+ if (opts->has_mem) {
+ int64_t mem_size;
+ char *endptr;
+ mem_size = strtosz(opts->mem, &endptr);
+ if (mem_size < 0 || *endptr) {
+ fprintf(stderr, "qemu: invalid numa mem size: %s\n", opts->mem);
+ return -1;
+ }
+ node_mem[nodenr] = mem_size;
+ }
+
+ return 0;
+}
+
+static int numa_mem_parse(NumaMemOptions *opts)
+{
+ uint16_t nodenr;
+ uint64_t mem_size;
+
+ if (opts->has_nodeid) {
+ nodenr = opts->nodeid;
+ if (nodenr >= MAX_NODES) {
+ fprintf(stderr, "qemu: Max number of NUMA nodes reached: %"
+ PRIu16 "\n", nodenr);
+ return -1;
+ }
+ } else {
+ nodenr = nb_numa_mem_nodes;
+ }
+
+ if (opts->has_size) {
+ mem_size = opts->size;
+ node_mem[nodenr] = mem_size;
+ }
+
+ return 0;
+}
+
+int numa_init_func(QemuOpts *opts, void *opaque)
+{
+ NumaOptions *object = NULL;
+ Error *err = NULL;
+ int ret = 0;
+
+ {
+ OptsVisitor *ov = opts_visitor_new(opts);
+ visit_type_NumaOptions(opts_get_visitor(ov), &object, NULL, &err);
+ opts_visitor_cleanup(ov);
+ }
+
+ if (error_is_set(&err)) {
+ fprintf(stderr, "qemu: %s\n", error_get_pretty(err));
+ error_free(err);
+ ret = -1;
+ goto error;
+ }
+
+ switch (object->kind) {
+ case NUMA_OPTIONS_KIND_NODE:
+ if (nb_numa_nodes >= MAX_NODES) {
+ fprintf(stderr, "qemu: too many NUMA nodes\n");
+ ret = -1;
+ goto error;
+ }
+ ret = numa_node_parse(object->node);
+ nb_numa_nodes++;
+ break;
+ case NUMA_OPTIONS_KIND_MEM:
+ ret = numa_mem_parse(object->mem);
+ nb_numa_mem_nodes++;
+ break;
+ default:
+ fprintf(stderr, "qemu: Invalid NUMA options type.\n");
+ ret = -1;
+ }
+
+error:
+ if (object) {
+ QapiDeallocVisitor *dv = qapi_dealloc_visitor_new();
+ visit_type_NumaOptions(qapi_dealloc_get_visitor(dv),
+ &object, NULL, NULL);
+ qapi_dealloc_visitor_cleanup(dv);
+ }
+
+ return ret;
+}
diff --git a/qemu-options.hx b/qemu-options.hx
index d15338e..e9123b8 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -95,11 +95,13 @@ specifies the maximum number of hotpluggable CPUs.
ETEXI
DEF("numa", HAS_ARG, QEMU_OPTION_numa,
- "-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
+ "-numa node[,nodeid=node][,cpus=cpu[-cpu]]\n"
+ "-numa mem[,nodeid=node][,size=size]\n"
+ , QEMU_ARCH_ALL)
STEXI
@item -numa @var{opts}
@findex -numa
-Simulate a multi node NUMA system. If mem and cpus are omitted, resources
+Simulate a multi node NUMA system. If @var{size} and @var{cpus} are omitted, resources
are split equally.
ETEXI
diff --git a/vl.c b/vl.c
index 1c283c9..8829344 100644
--- a/vl.c
+++ b/vl.c
@@ -250,6 +250,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
QTAILQ_HEAD_INITIALIZER(fw_boot_order);
int nb_numa_nodes;
+int nb_numa_mem_nodes;
uint64_t node_mem[MAX_NODES];
unsigned long *node_cpumask[MAX_NODES];
@@ -1330,102 +1331,6 @@ char *get_boot_devices_list(size_t *size)
return list;
}
-static void numa_node_parse_cpus(int nodenr, const char *cpus)
-{
- char *endptr;
- unsigned long long value, endvalue;
-
- /* Empty CPU range strings will be considered valid, they will simply
- * not set any bit in the CPU bitmap.
- */
- if (!*cpus) {
- return;
- }
-
- if (parse_uint(cpus, &value, &endptr, 10) < 0) {
- goto error;
- }
- if (*endptr == '-') {
- if (parse_uint_full(endptr + 1, &endvalue, 10) < 0) {
- goto error;
- }
- } else if (*endptr == '\0') {
- endvalue = value;
- } else {
- goto error;
- }
-
- if (endvalue >= MAX_CPUMASK_BITS) {
- endvalue = MAX_CPUMASK_BITS - 1;
- fprintf(stderr,
- "qemu: NUMA: A max of %d VCPUs are supported\n",
- MAX_CPUMASK_BITS);
- }
-
- if (endvalue < value) {
- goto error;
- }
-
- bitmap_set(node_cpumask[nodenr], value, endvalue-value+1);
- return;
-
-error:
- fprintf(stderr, "qemu: Invalid NUMA CPU range: %s\n", cpus);
- exit(1);
-}
-
-static void numa_add(const char *optarg)
-{
- char option[128];
- char *endptr;
- unsigned long long nodenr;
-
- optarg = get_opt_name(option, 128, optarg, ',');
- if (*optarg == ',') {
- optarg++;
- }
- if (!strcmp(option, "node")) {
-
- if (nb_numa_nodes >= MAX_NODES) {
- fprintf(stderr, "qemu: too many NUMA nodes\n");
- exit(1);
- }
-
- if (get_param_value(option, 128, "nodeid", optarg) == 0) {
- nodenr = nb_numa_nodes;
- } else {
- if (parse_uint_full(option, &nodenr, 10) < 0) {
- fprintf(stderr, "qemu: Invalid NUMA nodeid: %s\n", option);
- exit(1);
- }
- }
-
- if (nodenr >= MAX_NODES) {
- fprintf(stderr, "qemu: invalid NUMA nodeid: %llu\n", nodenr);
- exit(1);
- }
-
- if (get_param_value(option, 128, "mem", optarg) == 0) {
- node_mem[nodenr] = 0;
- } else {
- int64_t sval;
- sval = strtosz(option, &endptr);
- if (sval < 0 || *endptr) {
- fprintf(stderr, "qemu: invalid numa mem size: %s\n", optarg);
- exit(1);
- }
- node_mem[nodenr] = sval;
- }
- if (get_param_value(option, 128, "cpus", optarg) != 0) {
- numa_node_parse_cpus(nodenr, option);
- }
- nb_numa_nodes++;
- } else {
- fprintf(stderr, "Invalid -numa option: %s\n", option);
- exit(1);
- }
-}
-
static QemuOptsList qemu_smp_opts = {
.name = "smp-opts",
.implied_opt_name = "cpus",
@@ -2961,6 +2866,7 @@ int main(int argc, char **argv, char **envp)
qemu_add_opts(&qemu_tpmdev_opts);
qemu_add_opts(&qemu_realtime_opts);
qemu_add_opts(&qemu_msg_opts);
+ qemu_add_opts(&qemu_numa_opts);
runstate_init();
@@ -2986,6 +2892,7 @@ int main(int argc, char **argv, char **envp)
}
nb_numa_nodes = 0;
+ nb_numa_mem_nodes = 0;
nb_nics = 0;
bdrv_init_with_whitelist();
@@ -3147,7 +3054,10 @@ int main(int argc, char **argv, char **envp)
}
break;
case QEMU_OPTION_numa:
- numa_add(optarg);
+ opts = qemu_opts_parse(qemu_find_opts("numa"), optarg, 1);
+ if (!opts) {
+ exit(1);
+ }
break;
case QEMU_OPTION_display:
display_type = select_display(optarg);
@@ -4226,6 +4136,15 @@ int main(int argc, char **argv, char **envp)
register_savevm_live(NULL, "ram", 0, 4, &savevm_ram_handlers, NULL);
+ if (qemu_opts_foreach(qemu_find_opts("numa"), numa_init_func,
+ NULL, 1) != 0) {
+ exit(1);
+ }
+
+ if (nb_numa_mem_nodes > nb_numa_nodes) {
+ nb_numa_nodes = nb_numa_mem_nodes;
+ }
+
if (nb_numa_nodes > 0) {
int i;
--
1.8.4.rc4
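The nodeid selection shared by numa_node_parse() and numa_mem_parse() above can be restated in isolation: an explicit nodeid= is bounds-checked against MAX_NODES, and options without one take the next sequential index. A simplified sketch, not the patch's literal code:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64

/* Pick the node index for one -numa option: use the explicit nodeid if
 * present (rejecting out-of-range values), otherwise fall back to the
 * running count of options seen so far. */
static int pick_nodeid(bool has_nodeid, uint16_t nodeid,
                       uint16_t next_implicit, uint16_t *out)
{
    if (has_nodeid) {
        if (nodeid >= MAX_NODES) {
            return -1;             /* "Max number of NUMA nodes reached" */
        }
        *out = nodeid;
    } else {
        *out = next_implicit;      /* e.g. nb_numa_nodes or nb_numa_mem_nodes */
    }
    return 0;
}
```

This is why "-numa node" lines may omit nodeid= entirely: each one simply claims the next index.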
* [Qemu-devel] [PATCH V9 03/12] NUMA: check if the total numa memory size is equal to ram_size
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
If the total memory size of the assigned numa nodes is not
equal to the assigned ram size, the wrong data will be written
to the ACPI table, and the guest will then ignore the invalid ACPI
table and assign all memory to one node. This is buggy, so we should
check the sizes to ensure that we write the right data to the ACPI table.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
vl.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/vl.c b/vl.c
index 8829344..46d1d55 100644
--- a/vl.c
+++ b/vl.c
@@ -4172,6 +4172,16 @@ int main(int argc, char **argv, char **envp)
node_mem[i] = ram_size - usedmem;
}
+ uint64_t numa_total = 0;
+ for (i = 0; i < nb_numa_nodes; i++) {
+ numa_total += node_mem[i];
+ }
+ if (numa_total != ram_size) {
+ fprintf(stderr, "qemu: numa nodes total memory size "
+ "should equal to ram_size\n");
+ exit(1);
+ }
+
for (i = 0; i < nb_numa_nodes; i++) {
if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
break;
--
1.8.4.rc4
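The check added above reduces to a small predicate; restated stand-alone (with the node sizes passed in rather than read from globals):

```c
#include <stdint.h>

/* True if the per-node memory sizes sum exactly to the configured RAM
 * size.  If they do not, the SRAT written into the ACPI table would
 * disagree with the machine's real memory map, which is what the guest
 * then rejects. */
static int numa_total_matches_ram(const uint64_t *node_mem, int nb_nodes,
                                  uint64_t ram_size)
{
    uint64_t total = 0;
    for (int i = 0; i < nb_nodes; i++) {
        total += node_mem[i];
    }
    return total == ram_size;
}
```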
* [Qemu-devel] [PATCH V9 04/12] NUMA: move numa related code to numa.c
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
cpus.c | 14 ---------
include/sysemu/cpus.h | 1 -
include/sysemu/sysemu.h | 2 ++
numa.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
vl.c | 57 +------------------------------------
5 files changed, 79 insertions(+), 71 deletions(-)
diff --git a/cpus.c b/cpus.c
index 70cc617..8a1344e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1201,20 +1201,6 @@ static void tcg_exec_all(void)
exit_request = 0;
}
-void set_numa_modes(void)
-{
- CPUState *cpu;
- int i;
-
- for (cpu = first_cpu; cpu != NULL; cpu = cpu->next_cpu) {
- for (i = 0; i < nb_numa_nodes; i++) {
- if (test_bit(cpu->cpu_index, node_cpumask[i])) {
- cpu->numa_node = i;
- }
- }
- }
-}
-
void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
{
/* XXX: implement xxx_cpu_list for targets that still miss it */
diff --git a/include/sysemu/cpus.h b/include/sysemu/cpus.h
index 6502488..4f79081 100644
--- a/include/sysemu/cpus.h
+++ b/include/sysemu/cpus.h
@@ -23,7 +23,6 @@ extern int smp_threads;
#define smp_threads 1
#endif
-void set_numa_modes(void);
void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg);
#endif
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 474dd9e..b42f4a1 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -134,6 +134,8 @@ extern uint64_t node_mem[MAX_NODES];
extern unsigned long *node_cpumask[MAX_NODES];
extern QemuOptsList qemu_numa_opts;
int numa_init_func(QemuOpts *opts, void *opaque);
+void set_numa_nodes(void);
+void set_numa_modes(void);
#define MAX_OPTION_ROMS 16
typedef struct QEMUOptionRom {
diff --git a/numa.c b/numa.c
index e6924f4..035fb86 100644
--- a/numa.c
+++ b/numa.c
@@ -142,3 +142,79 @@ error:
return ret;
}
+
+void set_numa_nodes(void)
+{
+ if (nb_numa_mem_nodes > nb_numa_nodes) {
+ nb_numa_nodes = nb_numa_mem_nodes;
+ }
+
+ if (nb_numa_nodes > 0) {
+ int i;
+
+ if (nb_numa_nodes > MAX_NODES) {
+ nb_numa_nodes = MAX_NODES;
+ }
+
+ /* If no memory size is given for any node, assume the default case
+ * and distribute the available memory equally across all nodes
+ */
+ for (i = 0; i < nb_numa_nodes; i++) {
+ if (node_mem[i] != 0) {
+ break;
+ }
+ }
+
+ if (i == nb_numa_nodes) {
+ uint64_t usedmem = 0;
+
+ /* On Linux, each node's border has to be 8MB aligned,
+ * the final node gets the rest.
+ */
+ for (i = 0; i < nb_numa_nodes - 1; i++) {
+ node_mem[i] = (ram_size / nb_numa_nodes) & ~((1 << 23UL) - 1);
+ usedmem += node_mem[i];
+ }
+ node_mem[i] = ram_size - usedmem;
+ }
+
+ uint64_t numa_total = 0;
+ for (i = 0; i < nb_numa_nodes; i++) {
+ numa_total += node_mem[i];
+ }
+ if (numa_total != ram_size) {
+ fprintf(stderr, "qemu: numa nodes total memory size "
+ "should equal to ram_size\n");
+ exit(1);
+ }
+
+ for (i = 0; i < nb_numa_nodes; i++) {
+ if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
+ break;
+ }
+ }
+ /* assigning the VCPUs round-robin is easier to implement, guest OSes
+ * must cope with this anyway, because there are BIOSes out there in
+ * real machines which also use this scheme.
+ */
+ if (i == nb_numa_nodes) {
+ for (i = 0; i < max_cpus; i++) {
+ set_bit(i, node_cpumask[i % nb_numa_nodes]);
+ }
+ }
+ }
+}
+
+void set_numa_modes(void)
+{
+ CPUState *cpu;
+ int i;
+
+ for (cpu = first_cpu; cpu != NULL; cpu = cpu->next_cpu) {
+ for (i = 0; i < nb_numa_nodes; i++) {
+ if (test_bit(cpu->cpu_index, node_cpumask[i])) {
+ cpu->numa_node = i;
+ }
+ }
+ }
+}
diff --git a/vl.c b/vl.c
index 46d1d55..0f180fe 100644
--- a/vl.c
+++ b/vl.c
@@ -4141,62 +4141,7 @@ int main(int argc, char **argv, char **envp)
exit(1);
}
- if (nb_numa_mem_nodes > nb_numa_nodes) {
- nb_numa_nodes = nb_numa_mem_nodes;
- }
-
- if (nb_numa_nodes > 0) {
- int i;
-
- if (nb_numa_nodes > MAX_NODES) {
- nb_numa_nodes = MAX_NODES;
- }
-
- /* If no memory size if given for any node, assume the default case
- * and distribute the available memory equally across all nodes
- */
- for (i = 0; i < nb_numa_nodes; i++) {
- if (node_mem[i] != 0)
- break;
- }
- if (i == nb_numa_nodes) {
- uint64_t usedmem = 0;
-
- /* On Linux, the each node's border has to be 8MB aligned,
- * the final node gets the rest.
- */
- for (i = 0; i < nb_numa_nodes - 1; i++) {
- node_mem[i] = (ram_size / nb_numa_nodes) & ~((1 << 23UL) - 1);
- usedmem += node_mem[i];
- }
- node_mem[i] = ram_size - usedmem;
- }
-
- uint64_t numa_total = 0;
- for (i = 0; i < nb_numa_nodes; i++) {
- numa_total += node_mem[i];
- }
- if (numa_total != ram_size) {
- fprintf(stderr, "qemu: numa nodes total memory size "
- "should equal to ram_size\n");
- exit(1);
- }
-
- for (i = 0; i < nb_numa_nodes; i++) {
- if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
- break;
- }
- }
- /* assigning the VCPUs round-robin is easier to implement, guest OSes
- * must cope with this anyway, because there are BIOSes out there in
- * real machines which also use this scheme.
- */
- if (i == nb_numa_nodes) {
- for (i = 0; i < max_cpus; i++) {
- set_bit(i, node_cpumask[i % nb_numa_nodes]);
- }
- }
- }
+ set_numa_nodes();
if (qemu_opts_foreach(qemu_find_opts("mon"), mon_init_func, NULL, 1) != 0) {
exit(1);
--
1.8.4.rc4
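The default split performed by set_numa_nodes() when no node carries an explicit size can be sketched stand-alone as follows; the arithmetic mirrors the moved code (8MB-aligned shares, remainder to the last node):

```c
#include <stdint.h>

/* Split ram_size across nb_nodes when the user gave no sizes: every node
 * but the last gets an equal share rounded down to an 8MB boundary, and
 * the last node absorbs the remainder so the sum is exactly ram_size. */
static void split_ram_equally(uint64_t ram_size, int nb_nodes,
                              uint64_t *node_mem)
{
    uint64_t usedmem = 0;
    for (int i = 0; i < nb_nodes - 1; i++) {
        node_mem[i] = (ram_size / nb_nodes) & ~((UINT64_C(1) << 23) - 1);
        usedmem += node_mem[i];
    }
    node_mem[nb_nodes - 1] = ram_size - usedmem;
}
```

For example, 4096MB over three nodes gives two equal 8MB-aligned shares and a slightly larger third node, and the sizes still sum to ram_size, which satisfies the check introduced in patch 03.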
* [Qemu-devel] [PATCH V9 05/12] NUMA: Add numa_info structure to contain numa nodes info
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Add the numa_info structure to contain the memory size and VCPU
information of each numa node, as well as the host memory policies
of numa nodes that later patches will add.
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
hw/i386/pc.c | 4 ++--
include/sysemu/sysemu.h | 8 ++++++--
monitor.c | 2 +-
numa.c | 23 ++++++++++++-----------
vl.c | 7 +++----
5 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 3a620a1..2243184 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -653,14 +653,14 @@ static FWCfgState *bochs_bios_init(void)
unsigned int apic_id = x86_cpu_apic_id_from_index(i);
assert(apic_id < apic_id_limit);
for (j = 0; j < nb_numa_nodes; j++) {
- if (test_bit(i, node_cpumask[j])) {
+ if (test_bit(i, numa_info[j].node_cpu)) {
numa_fw_cfg[apic_id + 1] = cpu_to_le64(j);
break;
}
}
}
for (i = 0; i < nb_numa_nodes; i++) {
- numa_fw_cfg[apic_id_limit + 1 + i] = cpu_to_le64(node_mem[i]);
+ numa_fw_cfg[apic_id_limit + 1 + i] = cpu_to_le64(numa_info[i].node_mem);
}
fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, numa_fw_cfg,
(1 + apic_id_limit + nb_numa_nodes) *
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index b42f4a1..b683d08 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -9,6 +9,7 @@
#include "qapi-types.h"
#include "qemu/notify.h"
#include "qemu/main-loop.h"
+#include "qemu/bitmap.h"
/* vl.c */
@@ -130,8 +131,11 @@ extern QEMUClock *rtc_clock;
#define MAX_CPUMASK_BITS 255
extern int nb_numa_nodes;
extern int nb_numa_mem_nodes;
-extern uint64_t node_mem[MAX_NODES];
-extern unsigned long *node_cpumask[MAX_NODES];
+typedef struct node_info {
+ uint64_t node_mem;
+ DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
+} NodeInfo;
+extern NodeInfo numa_info[MAX_NODES];
extern QemuOptsList qemu_numa_opts;
int numa_init_func(QemuOpts *opts, void *opaque);
void set_numa_nodes(void);
diff --git a/monitor.c b/monitor.c
index da9c9a2..343f9f4 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1826,7 +1826,7 @@ static void do_info_numa(Monitor *mon, const QDict *qdict)
}
monitor_printf(mon, "\n");
monitor_printf(mon, "node %d size: %" PRId64 " MB\n", i,
- node_mem[i] >> 20);
+ numa_info[i].node_mem >> 20);
}
}
diff --git a/numa.c b/numa.c
index 035fb86..3e2dfc1 100644
--- a/numa.c
+++ b/numa.c
@@ -53,7 +53,7 @@ static int numa_node_parse(NumaNodeOptions *opts)
}
for (cpus = opts->cpus; cpus; cpus = cpus->next) {
- bitmap_set(node_cpumask[nodenr], cpus->value, 1);
+ bitmap_set(numa_info[nodenr].node_cpu, cpus->value, 1);
}
if (opts->has_mem) {
@@ -64,7 +64,7 @@ static int numa_node_parse(NumaNodeOptions *opts)
fprintf(stderr, "qemu: invalid numa mem size: %s\n", opts->mem);
return -1;
}
- node_mem[nodenr] = mem_size;
+ numa_info[nodenr].node_mem = mem_size;
}
return 0;
@@ -88,7 +88,7 @@ static int numa_mem_parse(NumaMemOptions *opts)
if (opts->has_size) {
mem_size = opts->size;
- node_mem[nodenr] = mem_size;
+ numa_info[nodenr].node_mem = mem_size;
}
return 0;
@@ -160,7 +160,7 @@ void set_numa_nodes(void)
* and distribute the available memory equally across all nodes
*/
for (i = 0; i < nb_numa_nodes; i++) {
- if (node_mem[i] != 0) {
+ if (numa_info[i].node_mem != 0) {
break;
}
}
@@ -172,15 +172,16 @@ void set_numa_nodes(void)
* the final node gets the rest.
*/
for (i = 0; i < nb_numa_nodes - 1; i++) {
- node_mem[i] = (ram_size / nb_numa_nodes) & ~((1 << 23UL) - 1);
- usedmem += node_mem[i];
+ numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
+ ~((1 << 23UL) - 1);
+ usedmem += numa_info[i].node_mem;
}
- node_mem[i] = ram_size - usedmem;
+ numa_info[i].node_mem = ram_size - usedmem;
}
uint64_t numa_total = 0;
for (i = 0; i < nb_numa_nodes; i++) {
- numa_total += node_mem[i];
+ numa_total += numa_info[i].node_mem;
}
if (numa_total != ram_size) {
fprintf(stderr, "qemu: numa nodes total memory size "
@@ -189,7 +190,7 @@ void set_numa_nodes(void)
}
for (i = 0; i < nb_numa_nodes; i++) {
- if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
+ if (!bitmap_empty(numa_info[i].node_cpu, MAX_CPUMASK_BITS)) {
break;
}
}
@@ -199,7 +200,7 @@ void set_numa_nodes(void)
*/
if (i == nb_numa_nodes) {
for (i = 0; i < max_cpus; i++) {
- set_bit(i, node_cpumask[i % nb_numa_nodes]);
+ set_bit(i, numa_info[i % nb_numa_nodes].node_cpu);
}
}
}
@@ -212,7 +213,7 @@ void set_numa_modes(void)
for (cpu = first_cpu; cpu != NULL; cpu = cpu->next_cpu) {
for (i = 0; i < nb_numa_nodes; i++) {
- if (test_bit(cpu->cpu_index, node_cpumask[i])) {
+ if (test_bit(cpu->cpu_index, numa_info[i].node_cpu)) {
cpu->numa_node = i;
}
}
diff --git a/vl.c b/vl.c
index 0f180fe..2377b67 100644
--- a/vl.c
+++ b/vl.c
@@ -251,8 +251,7 @@ static QTAILQ_HEAD(, FWBootEntry) fw_boot_order =
int nb_numa_nodes;
int nb_numa_mem_nodes;
-uint64_t node_mem[MAX_NODES];
-unsigned long *node_cpumask[MAX_NODES];
+NodeInfo numa_info[MAX_NODES];
uint8_t qemu_uuid[16];
@@ -2887,8 +2886,8 @@ int main(int argc, char **argv, char **envp)
translation = BIOS_ATA_TRANSLATION_AUTO;
for (i = 0; i < MAX_NODES; i++) {
- node_mem[i] = 0;
- node_cpumask[i] = bitmap_new(MAX_CPUMASK_BITS);
+ numa_info[i].node_mem = 0;
+ bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
}
nb_numa_nodes = 0;
--
1.8.4.rc4
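The round-robin VCPU fallback in set_numa_nodes(), restated against the new NodeInfo layout, looks roughly like the following. The byte-array bitmap here is a plain stand-in for DECLARE_BITMAP/set_bit/test_bit, so treat it as a sketch rather than the kernel-style bitmap API the patch actually uses.

```c
#include <stdint.h>

#define MAX_CPUMASK_BITS 255

/* Simplified NodeInfo: per-node memory size plus a CPU membership bitmap,
 * stored as raw bytes instead of DECLARE_BITMAP. */
typedef struct {
    uint64_t node_mem;
    unsigned char node_cpu[(MAX_CPUMASK_BITS + 7) / 8];
} NodeInfo;

/* When no cpus= mapping was given, VCPU i is placed on node i % nb_nodes,
 * the same round-robin scheme some real BIOSes use. */
static void assign_cpus_round_robin(NodeInfo *nodes, int nb_nodes,
                                    int max_cpus)
{
    for (int i = 0; i < max_cpus; i++) {
        nodes[i % nb_nodes].node_cpu[i / 8] |= 1u << (i % 8);
    }
}

static int cpu_on_node(const NodeInfo *node, int cpu)
{
    return (node->node_cpu[cpu / 8] >> (cpu % 8)) & 1;
}
```

With two nodes and four VCPUs, CPUs 0 and 2 land on node 0 and CPUs 1 and 3 on node 1.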
* [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Add detection of libnuma (mostly contained in the numactl package)
to the configure script. It can be enabled or disabled on the command
line; the default is to use it if available.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
configure | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/configure b/configure
index 18fa608..b82e89a 100755
--- a/configure
+++ b/configure
@@ -243,6 +243,7 @@ gtk=""
gtkabi="2.0"
tpm="no"
libssh2=""
+numa=""
# parse CC options first
for opt do
@@ -945,6 +946,10 @@ for opt do
;;
--enable-libssh2) libssh2="yes"
;;
+ --disable-numa) numa="no"
+ ;;
+ --enable-numa) numa="yes"
+ ;;
*) echo "ERROR: unknown option $opt"; show_help="yes"
;;
esac
@@ -1159,6 +1164,8 @@ echo " --gcov=GCOV use specified gcov [$gcov_tool]"
echo " --enable-tpm enable TPM support"
echo " --disable-libssh2 disable ssh block device support"
echo " --enable-libssh2 enable ssh block device support"
+echo " --disable-numa disable libnuma support"
+echo " --enable-numa enable libnuma support"
echo ""
echo "NOTE: The object files are built at the place where configure is launched"
exit 1
@@ -2412,6 +2419,27 @@ EOF
fi
##########################################
+# libnuma probe
+
+if test "$numa" != "no" ; then
+ numa=no
+ cat > $TMPC << EOF
+#include <numa.h>
+int main(void) { return numa_available(); }
+EOF
+
+ if compile_prog "" "-lnuma" ; then
+ numa=yes
+ libs_softmmu="-lnuma $libs_softmmu"
+ else
+ if test "$numa" = "yes" ; then
+ feature_not_found "linux NUMA (install numactl?)"
+ fi
+ numa=no
+ fi
+fi
+
+##########################################
# linux-aio probe
if test "$linux_aio" != "no" ; then
@@ -3613,6 +3641,7 @@ echo "TPM support $tpm"
echo "libssh2 support $libssh2"
echo "TPM passthrough $tpm_passthrough"
echo "QOM debugging $qom_cast_debug"
+echo "NUMA host support $numa"
if test "$sdl_too_old" = "yes"; then
echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -3646,6 +3675,9 @@ echo "extra_cflags=$EXTRA_CFLAGS" >> $config_host_mak
echo "extra_ldflags=$EXTRA_LDFLAGS" >> $config_host_mak
echo "qemu_localedir=$qemu_localedir" >> $config_host_mak
echo "libs_softmmu=$libs_softmmu" >> $config_host_mak
+if test "$numa" = "yes"; then
+ echo "CONFIG_NUMA=y" >> $config_host_mak
+fi
echo "ARCH=$ARCH" >> $config_host_mak
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (5 preceding siblings ...)
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection Wanlong Gao
@ 2013-08-23 4:09 ` Wanlong Gao
2013-08-23 14:11 ` Andrew Jones
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 08/12] NUMA: set " Wanlong Gao
` (4 subsequent siblings)
11 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
The memory policy setting format is like:
policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N
We add this setting as a suboption of "-numa mem,"; the memory
policy can then be set as follows:
-numa node,nodeid=0,cpus=0 \
-numa node,nodeid=1,cpus=1 \
-numa mem,nodeid=0,size=1G,policy=membind,host-nodes=0-1 \
-numa mem,nodeid=1,size=1G,policy=interleave,relative=true,host-nodes=1
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
include/sysemu/sysemu.h | 3 +++
numa.c | 13 +++++++++++++
qapi-schema.json | 31 +++++++++++++++++++++++++++++--
vl.c | 3 +++
4 files changed, 48 insertions(+), 2 deletions(-)
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index b683d08..81d16a5 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -134,6 +134,9 @@ extern int nb_numa_mem_nodes;
typedef struct node_info {
uint64_t node_mem;
DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
+ DECLARE_BITMAP(host_mem, MAX_CPUMASK_BITS);
+ NumaNodePolicy policy;
+ bool relative;
} NodeInfo;
extern NodeInfo numa_info[MAX_NODES];
extern QemuOptsList qemu_numa_opts;
diff --git a/numa.c b/numa.c
index 3e2dfc1..4ccc6cb 100644
--- a/numa.c
+++ b/numa.c
@@ -74,6 +74,7 @@ static int numa_mem_parse(NumaMemOptions *opts)
{
uint16_t nodenr;
uint64_t mem_size;
+ uint16List *nodes;
if (opts->has_nodeid) {
nodenr = opts->nodeid;
@@ -91,6 +92,18 @@ static int numa_mem_parse(NumaMemOptions *opts)
numa_info[nodenr].node_mem = mem_size;
}
+ if (opts->has_policy) {
+ numa_info[nodenr].policy = opts->policy;
+ }
+
+ if (opts->has_relative) {
+ numa_info[nodenr].relative = opts->relative;
+ }
+
+ for (nodes = opts->host_nodes; nodes; nodes = nodes->next) {
+ bitmap_set(numa_info[nodenr].host_mem, nodes->value, 1);
+ }
+
return 0;
}
diff --git a/qapi-schema.json b/qapi-schema.json
index 11851a1..650741f 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3806,6 +3806,24 @@
'*mem': 'str' }}
##
+# @NumaNodePolicy
+#
+# NUMA node policy types
+#
+# @default: restore default policy, remove any nondefault policy
+#
+# @membind: a strict policy that restricts memory allocation to the
+# nodes specified
+#
+# @interleave: page allocations are interleaved across the set
+# of nodes specified
+#
+# @preferred: set the preferred node for allocation
+##
+{ 'enum': 'NumaNodePolicy',
+ 'data': [ 'default', 'membind', 'interleave', 'preferred' ] }
+
+##
# @NumaMemOptions
#
# Set memory information of guest NUMA node. (for OptsVisitor)
@@ -3814,9 +3832,18 @@
#
# @size: #optional memory size of this node
#
+# @policy: #optional memory policy of this node
+#
+# @relative: #optional whether the nodes specified are relative
+#
+# @host-nodes: #optional host nodes for its memory policy
+#
# Since 1.7
##
{ 'type': 'NumaMemOptions',
'data': {
- '*nodeid': 'uint16',
- '*size': 'size' }}
+ '*nodeid': 'uint16',
+ '*size': 'size',
+ '*policy': 'NumaNodePolicy',
+ '*relative': 'bool',
+ '*host-nodes': ['uint16'] }}
diff --git a/vl.c b/vl.c
index 2377b67..91b0d76 100644
--- a/vl.c
+++ b/vl.c
@@ -2888,6 +2888,9 @@ int main(int argc, char **argv, char **envp)
for (i = 0; i < MAX_NODES; i++) {
numa_info[i].node_mem = 0;
bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
+ bitmap_zero(numa_info[i].host_mem, MAX_CPUMASK_BITS);
+ numa_info[i].policy = NUMA_NODE_POLICY_DEFAULT;
+ numa_info[i].relative = false;
}
nb_numa_nodes = 0;
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 08/12] NUMA: set guest numa nodes memory policy
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (6 preceding siblings ...)
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy Wanlong Gao
@ 2013-08-23 4:09 ` Wanlong Gao
2013-08-23 8:44 ` Andrew Jones
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 09/12] NUMA: add qmp command set-mem-policy to set memory policy for NUMA node Wanlong Gao
` (3 subsequent siblings)
11 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:09 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Set the guest numa nodes memory policies using the mbind(2)
system call node by node.
After this patch, we are able to set guest nodes memory policies
through the QEMU options; this aims to solve the guest cross-node
memory access performance issue.
And as you all know, if PCI-passthrough is used,
direct-attached-device uses DMA transfer between device and qemu process.
All pages of the guest will be pinned by get_user_pages().
KVM_ASSIGN_PCI_DEVICE ioctl
kvm_vm_ioctl_assign_device()
=>kvm_assign_device()
=> kvm_iommu_map_memslots()
=> kvm_iommu_map_pages()
=> kvm_pin_pages()
So, with a direct-attached device, every guest page's reference count
is incremented and page migration will not work; AutoNUMA won't work either.
So, we should set the guest nodes memory allocation policies before
the pages are really mapped.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
numa.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)
diff --git a/numa.c b/numa.c
index 4ccc6cb..4a9c368 100644
--- a/numa.c
+++ b/numa.c
@@ -28,6 +28,16 @@
#include "qapi-visit.h"
#include "qapi/opts-visitor.h"
#include "qapi/dealloc-visitor.h"
+#include "exec/memory.h"
+
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#include <numaif.h>
+#ifndef MPOL_F_RELATIVE_NODES
+#define MPOL_F_RELATIVE_NODES (1 << 14)
+#define MPOL_F_STATIC_NODES (1 << 15)
+#endif
+#endif
QemuOptsList qemu_numa_opts = {
.name = "numa",
@@ -219,6 +229,79 @@ void set_numa_nodes(void)
}
}
+#ifdef CONFIG_NUMA
+static int node_parse_bind_mode(unsigned int nodeid)
+{
+ int bind_mode;
+
+ switch (numa_info[nodeid].policy) {
+ case NUMA_NODE_POLICY_MEMBIND:
+ bind_mode = MPOL_BIND;
+ break;
+ case NUMA_NODE_POLICY_INTERLEAVE:
+ bind_mode = MPOL_INTERLEAVE;
+ break;
+ case NUMA_NODE_POLICY_PREFERRED:
+ bind_mode = MPOL_PREFERRED;
+ break;
+ case NUMA_NODE_POLICY_DEFAULT:
+ default:
+ bind_mode = MPOL_DEFAULT;
+ return bind_mode;
+ }
+
+ bind_mode |= numa_info[nodeid].relative ?
+ MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
+
+ return bind_mode;
+}
+#endif
+
+static int set_node_mem_policy(int nodeid)
+{
+#ifdef CONFIG_NUMA
+ void *ram_ptr;
+ RAMBlock *block;
+ ram_addr_t len, ram_offset = 0;
+ int bind_mode;
+ int i;
+
+ QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+ if (!strcmp(block->mr->name, "pc.ram")) {
+ break;
+ }
+ }
+
+    if (!block || block->host == NULL) {
+ return -1;
+ }
+
+ ram_ptr = block->host;
+ for (i = 0; i < nodeid; i++) {
+ len = numa_info[i].node_mem;
+ ram_offset += len;
+ }
+
+ len = numa_info[nodeid].node_mem;
+ bind_mode = node_parse_bind_mode(nodeid);
+ unsigned long *nodes = numa_info[nodeid].host_mem;
+
+ /* This is a workaround for a long standing bug in Linux'
+ * mbind implementation, which cuts off the last specified
+ * node. To stay compatible should this bug be fixed, we
+ * specify one more node and zero this one out.
+ */
+ unsigned long maxnode = find_last_bit(nodes, MAX_CPUMASK_BITS);
+ clear_bit(maxnode + 1, nodes);
+ if (mbind(ram_ptr + ram_offset, len, bind_mode, nodes, maxnode + 1, 0)) {
+ perror("mbind");
+ return -1;
+ }
+#endif
+
+ return 0;
+}
+
void set_numa_modes(void)
{
CPUState *cpu;
@@ -231,4 +314,11 @@ void set_numa_modes(void)
}
}
}
+
+ for (i = 0; i < nb_numa_nodes; i++) {
+ if (set_node_mem_policy(i) == -1) {
+ fprintf(stderr,
+ "qemu: can not set host memory policy for node%d\n", i);
+ }
+ }
}
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 09/12] NUMA: add qmp command set-mem-policy to set memory policy for NUMA node
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (7 preceding siblings ...)
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 08/12] NUMA: set " Wanlong Gao
@ 2013-08-23 4:10 ` Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 10/12] NUMA: add hmp command set-mem-policy Wanlong Gao
` (2 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:10 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
This QMP command allows the user to set a guest node's memory policy
through the QMP protocol. A qmp-shell invocation looks like:
set-mem-policy nodeid=0 policy=membind relative=true host-nodes=0-1
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
numa.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
qapi-schema.json | 21 +++++++++++++++++++
qmp-commands.hx | 41 +++++++++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+)
diff --git a/numa.c b/numa.c
index 4a9c368..04b6444 100644
--- a/numa.c
+++ b/numa.c
@@ -29,6 +29,7 @@
#include "qapi/opts-visitor.h"
#include "qapi/dealloc-visitor.h"
#include "exec/memory.h"
+#include "qmp-commands.h"
#ifdef CONFIG_NUMA
#include <numa.h>
@@ -322,3 +323,64 @@ void set_numa_modes(void)
}
}
}
+
+void qmp_set_mem_policy(uint16_t nodeid, bool has_policy, NumaNodePolicy policy,
+ bool has_relative, bool relative,
+ bool has_host_nodes, uint16List *host_nodes,
+ Error **errp)
+{
+ NumaNodePolicy old_policy;
+ bool old_relative;
+ DECLARE_BITMAP(host_mem, MAX_CPUMASK_BITS);
+ uint16List *nodes;
+
+ if (nodeid >= nb_numa_nodes) {
+ error_setg(errp, "Only has '%d' NUMA nodes", nb_numa_nodes);
+ return;
+ }
+
+ bitmap_copy(host_mem, numa_info[nodeid].host_mem, MAX_CPUMASK_BITS);
+ old_policy = numa_info[nodeid].policy;
+ old_relative = numa_info[nodeid].relative;
+
+ numa_info[nodeid].policy = NUMA_NODE_POLICY_DEFAULT;
+ numa_info[nodeid].relative = false;
+ bitmap_zero(numa_info[nodeid].host_mem, MAX_CPUMASK_BITS);
+
+ if (!has_policy) {
+ if (set_node_mem_policy(nodeid) == -1) {
+ error_setg(errp, "Failed to set memory policy for node%" PRIu16,
+ nodeid);
+ goto error;
+ }
+ return;
+ }
+
+ numa_info[nodeid].policy = policy;
+
+ if (has_relative) {
+ numa_info[nodeid].relative = relative;
+ }
+
+ if (!has_host_nodes) {
+ bitmap_fill(numa_info[nodeid].host_mem, MAX_CPUMASK_BITS);
+ }
+
+ for (nodes = host_nodes; nodes; nodes = nodes->next) {
+ bitmap_set(numa_info[nodeid].host_mem, nodes->value, 1);
+ }
+
+ if (set_node_mem_policy(nodeid) == -1) {
+ error_setg(errp, "Failed to set memory policy for node%" PRIu16,
+ nodeid);
+ goto error;
+ }
+
+ return;
+
+error:
+ bitmap_copy(numa_info[nodeid].host_mem, host_mem, MAX_CPUMASK_BITS);
+ numa_info[nodeid].policy = old_policy;
+ numa_info[nodeid].relative = old_relative;
+ return;
+}
diff --git a/qapi-schema.json b/qapi-schema.json
index 650741f..3c0bdfe 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3847,3 +3847,24 @@
'*policy': 'NumaNodePolicy',
'*relative': 'bool',
'*host-nodes': ['uint16'] }}
+
+##
+# @set-mem-policy:
+#
+# Set the host memory binding policy for guest NUMA node.
+#
+# @nodeid: The node ID of guest NUMA node to set memory policy to.
+#
+# @policy: #optional The memory policy to be set (default 'default').
+#
+# @relative: #optional If the specified nodes are relative (default 'false')
+#
+# @host-nodes: #optional The host nodes range for memory policy.
+#
+# Returns: Nothing on success
+#
+# Since: 1.7
+##
+{ 'command': 'set-mem-policy',
+ 'data': {'nodeid': 'uint16', '*policy': 'NumaNodePolicy',
+ '*relative': 'bool', '*host-nodes': ['uint16'] } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index cf47e3f..52e6ff3 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -3061,6 +3061,7 @@ Example:
<- { "return": {} }
EQMP
+
{
.name = "query-rx-filter",
.args_type = "name:s?",
@@ -3124,3 +3125,43 @@ Example:
}
EQMP
+
+ {
+ .name = "set-mem-policy",
+ .args_type = "nodeid:i,policy:s?,relative:b?,host-nodes:q?",
+ .help = "Set the host memory binding policy for guest NUMA node",
+ .mhandler.cmd_new = qmp_marshal_input_set_mem_policy,
+ },
+
+SQMP
+set-mem-policy
+--------------
+
+Set the host memory binding policy for guest NUMA node
+
+Arguments:
+
+- "nodeid": The nodeid of guest NUMA node to set memory policy to.
+ (json-int)
+- "policy": The memory policy to set.
+ (json-string, optional)
+- "relative": If the specified nodes are relative.
+ (json-bool, optional)
+- "host-nodes": The host nodes covered by this memory policy.
+ (a json-array of int, optional)
+
+Example:
+
+-> { "execute": "set-mem-policy", "arguments": { "nodeid": 0,
+ "policy": "membind",
+ "relative": true,
+ "host-nodes": [0, 1] } }
+<- { "return": {} }
+
+Notes:
+ 1. If "policy" is not set, the memory policy of this "nodeid" will be set
+ to "default".
+ 2. If "host-nodes" is not set, the node mask of this "policy" will be set
+ to all available host nodes.
+
+EQMP
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 10/12] NUMA: add hmp command set-mem-policy
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (8 preceding siblings ...)
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 09/12] NUMA: add qmp command set-mem-policy to set memory policy for NUMA node Wanlong Gao
@ 2013-08-23 4:10 ` Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 11/12] NUMA: add qmp command query-numa Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 12/12] NUMA: convert hmp command info_numa to use qmp command query_numa Wanlong Gao
11 siblings, 0 replies; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:10 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Add the hmp command set-mem-policy to set the host memory policy for a
guest NUMA node. A node's memory policy can then also be set from
the monitor like:
(qemu) set-mem-policy 0 policy=membind,relative=false,host-nodes=0-1
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
hmp-commands.hx | 16 ++++++++++++++
hmp.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
hmp.h | 1 +
3 files changed, 82 insertions(+)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 8c6b91a..fe3a26f 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1587,6 +1587,22 @@ Executes a qemu-io command on the given block device.
ETEXI
{
+ .name = "set-mem-policy",
+ .args_type = "nodeid:i,args:s?",
+ .params = "nodeid [args]",
+ .help = "set host memory policy for a guest NUMA node",
+ .mhandler.cmd = hmp_set_mem_policy,
+ },
+
+STEXI
+@item set-mem-policy @var{nodeid} @var{args}
+@findex set-mem-policy
+
+Set host memory policy for a guest NUMA node
+
+ETEXI
+
+ {
.name = "info",
.args_type = "item:s?",
.params = "[subcommand]",
diff --git a/hmp.c b/hmp.c
index c45514b..98d2a76 100644
--- a/hmp.c
+++ b/hmp.c
@@ -24,6 +24,9 @@
#include "ui/console.h"
#include "block/qapi.h"
#include "qemu-io.h"
+#include "qapi-visit.h"
+#include "qapi/opts-visitor.h"
+#include "qapi/dealloc-visitor.h"
static void hmp_handle_error(Monitor *mon, Error **errp)
{
@@ -1514,3 +1517,65 @@ void hmp_qemu_io(Monitor *mon, const QDict *qdict)
hmp_handle_error(mon, &err);
}
+
+void hmp_set_mem_policy(Monitor *mon, const QDict *qdict)
+{
+ Error *local_err = NULL;
+ bool has_policy = true;
+ bool has_relative = true;
+ bool has_host_nodes = true;
+ QemuOpts *opts;
+ NumaMemOptions *object = NULL;
+ NumaNodePolicy policy = NUMA_NODE_POLICY_DEFAULT;
+ bool relative = false;
+ uint16List *host_nodes = NULL;
+
+ uint64_t nodeid = qdict_get_int(qdict, "nodeid");
+ const char *args = qdict_get_try_str(qdict, "args");
+
+ if (args == NULL) {
+ has_policy = false;
+ has_relative = false;
+ has_host_nodes = false;
+ } else {
+ opts = qemu_opts_parse(qemu_find_opts("numa"), args, 1);
+ if (opts == NULL) {
+ monitor_printf(mon, "Parsing memory policy args failed\n");
+ return;
+ } else {
+ OptsVisitor *ov = opts_visitor_new(opts);
+ visit_type_NumaMemOptions(opts_get_visitor(ov), &object, NULL,
+ &local_err);
+ opts_visitor_cleanup(ov);
+
+ if (error_is_set(&local_err)) {
+ goto error;
+ }
+
+ has_policy = object->has_policy;
+ if (has_policy) {
+ policy = object->policy;
+ }
+ has_relative = object->has_relative;
+ if (has_relative) {
+ relative = object->relative;
+ }
+ has_host_nodes = object->has_host_nodes;
+ if (has_host_nodes) {
+ host_nodes = object->host_nodes;
+ }
+ }
+ }
+
+ qmp_set_mem_policy(nodeid, has_policy, policy, has_relative, relative,
+ has_host_nodes, host_nodes, &local_err);
+error:
+ if (object) {
+ QapiDeallocVisitor *dv = qapi_dealloc_visitor_new();
+ visit_type_NumaMemOptions(qapi_dealloc_get_visitor(dv),
+ &object, NULL, NULL);
+ qapi_dealloc_visitor_cleanup(dv);
+ }
+
+ hmp_handle_error(mon, &local_err);
+}
diff --git a/hmp.h b/hmp.h
index 6c3bdcd..ae09525 100644
--- a/hmp.h
+++ b/hmp.h
@@ -87,5 +87,6 @@ void hmp_nbd_server_stop(Monitor *mon, const QDict *qdict);
void hmp_chardev_add(Monitor *mon, const QDict *qdict);
void hmp_chardev_remove(Monitor *mon, const QDict *qdict);
void hmp_qemu_io(Monitor *mon, const QDict *qdict);
+void hmp_set_mem_policy(Monitor *mon, const QDict *qdict);
#endif
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 11/12] NUMA: add qmp command query-numa
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (9 preceding siblings ...)
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 10/12] NUMA: add hmp command set-mem-policy Wanlong Gao
@ 2013-08-23 4:10 ` Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 12/12] NUMA: convert hmp command info_numa to use qmp command query_numa Wanlong Gao
11 siblings, 0 replies; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:10 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Add qmp command query-numa to show guest NUMA information.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
numa.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
qapi-schema.json | 36 +++++++++++++++++++++++++++++
qmp-commands.hx | 49 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+)
diff --git a/numa.c b/numa.c
index 04b6444..4a78143 100644
--- a/numa.c
+++ b/numa.c
@@ -384,3 +384,72 @@ error:
numa_info[nodeid].relative = old_relative;
return;
}
+
+NUMAInfoList *qmp_query_numa(Error **errp)
+{
+ NUMAInfoList *head = NULL, *cur_item = NULL;
+ CPUState *cpu;
+ int i;
+
+ for (i = 0; i < nb_numa_nodes; i++) {
+ NUMAInfoList *info;
+ uint16List *cur_cpu_item = NULL;
+ info = g_malloc0(sizeof(*info));
+ info->value = g_malloc0(sizeof(*info->value));
+ info->value->nodeid = i;
+ for (cpu = first_cpu; cpu != NULL; cpu = cpu->next_cpu) {
+ if (cpu->numa_node == i) {
+ uint16List *node_cpu = g_malloc0(sizeof(*node_cpu));
+ node_cpu->value = cpu->cpu_index;
+
+ if (!cur_cpu_item) {
+ info->value->cpus = cur_cpu_item = node_cpu;
+ } else {
+ cur_cpu_item->next = node_cpu;
+ cur_cpu_item = node_cpu;
+ }
+ }
+ }
+ info->value->memory = numa_info[i].node_mem;
+
+#ifdef CONFIG_NUMA
+ info->value->policy = numa_info[i].policy;
+ info->value->relative = numa_info[i].relative;
+
+ unsigned long first, next;
+ next = first = find_first_bit(numa_info[i].host_mem, MAX_CPUMASK_BITS);
+ if (first == MAX_CPUMASK_BITS) {
+ goto end;
+ }
+ uint16List *cur_node_item = g_malloc0(sizeof(*cur_node_item));
+ cur_node_item->value = first;
+ info->value->host_nodes = cur_node_item;
+ do {
+ if (next == numa_max_node()) {
+ break;
+ }
+
+ next = find_next_bit(numa_info[i].host_mem, MAX_CPUMASK_BITS,
+ next + 1);
+ if (next > numa_max_node() || next == MAX_CPUMASK_BITS) {
+ break;
+ }
+
+ uint16List *host_node = g_malloc0(sizeof(*host_node));
+ host_node->value = next;
+ cur_node_item->next = host_node;
+ cur_node_item = host_node;
+ } while (true);
+end:
+#endif
+
+ if (!cur_item) {
+ head = cur_item = info;
+ } else {
+ cur_item->next = info;
+ cur_item = info;
+ }
+ }
+
+ return head;
+}
diff --git a/qapi-schema.json b/qapi-schema.json
index 3c0bdfe..7546057 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3868,3 +3868,39 @@
{ 'command': 'set-mem-policy',
'data': {'nodeid': 'uint16', '*policy': 'NumaNodePolicy',
'*relative': 'bool', '*host-nodes': ['uint16'] } }
+
+##
+# @NUMAInfo:
+#
+# Information about guest NUMA nodes
+#
+# @nodeid: NUMA node ID
+#
+# @cpus: VCPUs contained in this node
+#
+# @memory: memory size of this node
+#
+# @policy: memory policy of this node
+#
+# @relative: whether host nodes are relative for the memory policy
+#
+# @host-nodes: host nodes for its memory policy
+#
+# Since: 1.7
+#
+##
+{ 'type': 'NUMAInfo',
+ 'data': {'nodeid': 'uint16', 'cpus': ['uint16'], 'memory': 'uint64',
+ 'policy': 'NumaNodePolicy', 'relative': 'bool',
+ 'host-nodes': ['uint16'] }}
+
+##
+# @query-numa:
+#
+# Returns a list of information about each guest node.
+#
+# Returns: a list of @NUMAInfo for each guest node
+#
+# Since: 1.7
+##
+{ 'command': 'query-numa', 'returns': ['NUMAInfo'] }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 52e6ff3..20f1e74 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -3165,3 +3165,52 @@ Notes:
to all available host nodes.
EQMP
+
+ {
+ .name = "query-numa",
+ .args_type = "",
+ .mhandler.cmd_new = qmp_marshal_input_query_numa,
+ },
+
+SQMP
+query-numa
+----------
+
+Show NUMA information.
+
+Return a json-array. Each NUMA node is represented by a json-object,
+which contains:
+
+- "nodeid": NUMA node ID (json-int)
+- "cpus": a json-array of VCPUs contained in this node
+- "memory": memory size of the node in bytes (json-int)
+- "policy": memory policy of this node (json-string)
+- "relative": whether the host nodes are relative for its memory policy (json-bool)
+- "host-nodes": a json-array of host nodes for its memory policy
+
+Arguments: None.
+
+Example:
+
+-> { "execute": "query-numa" }
+<- { "return":[
+ {
+ "nodeid": 0,
+ "cpus": [0, 1],
+ "memory": 536870912,
+ "policy": "membind",
+ "relative": false,
+ "host-nodes": [0, 1]
+ },
+ {
+ "nodeid": 1,
+ "cpus": [2, 3],
+ "memory": 536870912,
+ "policy": "interleave",
+ "relative": false,
+ "host-nodes": [1]
+ }
+ ]
+ }
+
+EQMP
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [Qemu-devel] [PATCH V9 12/12] NUMA: convert hmp command info_numa to use qmp command query_numa
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
` (10 preceding siblings ...)
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 11/12] NUMA: add qmp command query-numa Wanlong Gao
@ 2013-08-23 4:10 ` Wanlong Gao
11 siblings, 0 replies; 26+ messages in thread
From: Wanlong Gao @ 2013-08-23 4:10 UTC (permalink / raw)
To: qemu-devel
Cc: aliguori, ehabkost, lersek, peter.huangpeng, lcapitulino,
drjones, bsd, hutao, y-goto, pbonzini, afaerber, gaowanlong
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
hmp.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
hmp.h | 1 +
monitor.c | 21 +--------------------
3 files changed, 56 insertions(+), 20 deletions(-)
diff --git a/hmp.c b/hmp.c
index 98d2a76..3b2f04d 100644
--- a/hmp.c
+++ b/hmp.c
@@ -27,6 +27,7 @@
#include "qapi-visit.h"
#include "qapi/opts-visitor.h"
#include "qapi/dealloc-visitor.h"
+#include "sysemu/sysemu.h"
static void hmp_handle_error(Monitor *mon, Error **errp)
{
@@ -1579,3 +1580,56 @@ error:
hmp_handle_error(mon, &local_err);
}
+
+void hmp_info_numa(Monitor *mon, const QDict *qdict)
+{
+ NUMAInfoList *node_list, *node;
+ uint16List *head;
+ int nodeid;
+ char *policy_str = NULL;
+
+ node_list = qmp_query_numa(NULL);
+
+ monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
+ for (node = node_list; node; node = node->next) {
+ nodeid = node->value->nodeid;
+ monitor_printf(mon, "node %d cpus:", nodeid);
+ head = node->value->cpus;
+ for (head = node->value->cpus; head != NULL; head = head->next) {
+ monitor_printf(mon, " %d", (int)head->value);
+ }
+ monitor_printf(mon, "\n");
+ monitor_printf(mon, "node %d size: %" PRId64 " MB\n",
+ nodeid, node->value->memory >> 20);
+ switch (node->value->policy) {
+ case NUMA_NODE_POLICY_DEFAULT:
+ policy_str = g_strdup("default");
+ break;
+ case NUMA_NODE_POLICY_MEMBIND:
+ policy_str = g_strdup("membind");
+ break;
+ case NUMA_NODE_POLICY_INTERLEAVE:
+ policy_str = g_strdup("interleave");
+ break;
+ case NUMA_NODE_POLICY_PREFERRED:
+ policy_str = g_strdup("preferred");
+ break;
+ default:
+ break;
+ }
+ monitor_printf(mon, "node %d policy: %s\n",
+ nodeid, policy_str ? : " ");
+ if (policy_str) {
+        g_free(policy_str);
+ }
+ monitor_printf(mon, "node %d relative: %s\n", nodeid,
+ node->value->relative ? "true" : "false");
+ monitor_printf(mon, "node %d host-nodes:", nodeid);
+ for (head = node->value->host_nodes; head != NULL; head = head->next) {
+ monitor_printf(mon, " %d", (int)head->value);
+ }
+ monitor_printf(mon, "\n");
+ }
+
+ qapi_free_NUMAInfoList(node_list);
+}
diff --git a/hmp.h b/hmp.h
index ae09525..56a5efd 100644
--- a/hmp.h
+++ b/hmp.h
@@ -37,6 +37,7 @@ void hmp_info_balloon(Monitor *mon, const QDict *qdict);
void hmp_info_pci(Monitor *mon, const QDict *qdict);
void hmp_info_block_jobs(Monitor *mon, const QDict *qdict);
void hmp_info_tpm(Monitor *mon, const QDict *qdict);
+void hmp_info_numa(Monitor *mon, const QDict *qdict);
void hmp_quit(Monitor *mon, const QDict *qdict);
void hmp_stop(Monitor *mon, const QDict *qdict);
void hmp_system_reset(Monitor *mon, const QDict *qdict);
diff --git a/monitor.c b/monitor.c
index 343f9f4..712435c 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1811,25 +1811,6 @@ static void do_info_mtree(Monitor *mon, const QDict *qdict)
mtree_info((fprintf_function)monitor_printf, mon);
}
-static void do_info_numa(Monitor *mon, const QDict *qdict)
-{
- int i;
- CPUState *cpu;
-
- monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
- for (i = 0; i < nb_numa_nodes; i++) {
- monitor_printf(mon, "node %d cpus:", i);
- for (cpu = first_cpu; cpu != NULL; cpu = cpu->next_cpu) {
- if (cpu->numa_node == i) {
- monitor_printf(mon, " %d", cpu->cpu_index);
- }
- }
- monitor_printf(mon, "\n");
- monitor_printf(mon, "node %d size: %" PRId64 " MB\n", i,
- numa_info[i].node_mem >> 20);
- }
-}
-
#ifdef CONFIG_PROFILER
int64_t qemu_time;
@@ -2597,7 +2578,7 @@ static mon_cmd_t info_cmds[] = {
.args_type = "",
.params = "",
.help = "show NUMA information",
- .mhandler.cmd = do_info_numa,
+ .mhandler.cmd = hmp_info_numa,
},
{
.name = "usb",
--
1.8.4.rc4
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection Wanlong Gao
@ 2013-08-23 8:40 ` Andrew Jones
2013-08-26 1:43 ` Wanlong Gao
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2013-08-23 8:40 UTC (permalink / raw)
To: Wanlong Gao
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber
----- Original Message -----
> Add detection of libnuma (mostly contained in the numactl package)
> to the configure script. Can be enabled or disabled on the command line,
> default is use if available.
>
> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Is this patch still necessary? I thought that dropping the
numa_num_configured_nodes() calls from patch 8/12 got rid
of the need for this library. Maybe I missed other uses?
drew
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Qemu-devel] [PATCH V9 08/12] NUMA: set guest numa nodes memory policy
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 08/12] NUMA: set " Wanlong Gao
@ 2013-08-23 8:44 ` Andrew Jones
0 siblings, 0 replies; 26+ messages in thread
From: Andrew Jones @ 2013-08-23 8:44 UTC (permalink / raw)
To: Wanlong Gao
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber
----- Original Message -----
> Set the guest numa nodes memory policies using the mbind(2)
> system call node by node.
> After this patch, we are able to set guest nodes memory policies
> through the QEMU options, this arms to solve the guest cross
> nodes memory access performance issue.
> And as you all know, if PCI-passthrough is used,
> direct-attached-device uses DMA transfer between device and qemu process.
> All pages of the guest will be pinned by get_user_pages().
>
> KVM_ASSIGN_PCI_DEVICE ioctl
> kvm_vm_ioctl_assign_device()
> =>kvm_assign_device()
> => kvm_iommu_map_memslots()
> => kvm_iommu_map_pages()
> => kvm_pin_pages()
>
> So, with direct-attached-device, all guest page's page count will be +1 and
> any page migration will not work. AutoNUMA won't too.
>
> So, we should set the guest nodes memory allocation policies before
> the pages are really mapped.
>
> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> ---
> numa.c | 90
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 90 insertions(+)
>
> diff --git a/numa.c b/numa.c
> index 4ccc6cb..4a9c368 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -28,6 +28,16 @@
> #include "qapi-visit.h"
> #include "qapi/opts-visitor.h"
> #include "qapi/dealloc-visitor.h"
> +#include "exec/memory.h"
> +
> +#ifdef CONFIG_NUMA
> +#include <numa.h>
> +#include <numaif.h>
> +#ifndef MPOL_F_RELATIVE_NODES
> +#define MPOL_F_RELATIVE_NODES (1 << 14)
> +#define MPOL_F_STATIC_NODES (1 << 15)
> +#endif
> +#endif
>
> QemuOptsList qemu_numa_opts = {
> .name = "numa",
> @@ -219,6 +229,79 @@ void set_numa_nodes(void)
> }
> }
>
> +#ifdef CONFIG_NUMA
> +static int node_parse_bind_mode(unsigned int nodeid)
> +{
> + int bind_mode;
> +
> + switch (numa_info[nodeid].policy) {
> + case NUMA_NODE_POLICY_MEMBIND:
> + bind_mode = MPOL_BIND;
> + break;
> + case NUMA_NODE_POLICY_INTERLEAVE:
> + bind_mode = MPOL_INTERLEAVE;
> + break;
> + case NUMA_NODE_POLICY_PREFERRED:
> + bind_mode = MPOL_PREFERRED;
> + break;
> + case NUMA_NODE_POLICY_DEFAULT:
> + default:
> + bind_mode = MPOL_DEFAULT;
> + return bind_mode;
> + }
> +
> + bind_mode |= numa_info[nodeid].relative ?
> + MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
> +
> + return bind_mode;
> +}
> +#endif
> +
> +static int set_node_mem_policy(int nodeid)
> +{
> +#ifdef CONFIG_NUMA
> + void *ram_ptr;
> + RAMBlock *block;
> + ram_addr_t len, ram_offset = 0;
> + int bind_mode;
> + int i;
> +
> + QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> + if (!strcmp(block->mr->name, "pc.ram")) {
> + break;
> + }
> + }
> +
> + if (block->host == NULL) {
> + return -1;
> + }
> +
> + ram_ptr = block->host;
> + for (i = 0; i < nodeid; i++) {
> + len = numa_info[i].node_mem;
> + ram_offset += len;
> + }
> +
> + len = numa_info[nodeid].node_mem;
> + bind_mode = node_parse_bind_mode(nodeid);
> + unsigned long *nodes = numa_info[nodeid].host_mem;
> +
> + /* This is a workaround for a long standing bug in Linux'
> + * mbind implementation, which cuts off the last specified
> + * node. To stay compatible should this bug be fixed, we
> + * specify one more node and zero this one out.
> + */
> + unsigned long maxnode = find_last_bit(nodes, MAX_CPUMASK_BITS);
> + clear_bit(maxnode + 1, nodes);
This clear_bit() isn't necessary. We know that maxnode+1 is certainly
already clear, because find_last_bit() just returned maxnode as the last
"non-clear" bit.
> + if (mbind(ram_ptr + ram_offset, len, bind_mode, nodes, maxnode + 1, 0)) {
> + perror("mbind");
> + return -1;
> + }
> +#endif
> +
> + return 0;
> +}
> +
> void set_numa_modes(void)
> {
> CPUState *cpu;
> @@ -231,4 +314,11 @@ void set_numa_modes(void)
> }
> }
> }
> +
> + for (i = 0; i < nb_numa_nodes; i++) {
> + if (set_node_mem_policy(i) == -1) {
> + fprintf(stderr,
> + "qemu: can not set host memory policy for node%d\n", i);
> + }
> + }
> }
> --
> 1.8.4.rc4
>
>
>
* Re: [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy Wanlong Gao
@ 2013-08-23 14:11 ` Andrew Jones
2013-08-26 1:07 ` Wanlong Gao
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2013-08-23 14:11 UTC (permalink / raw)
To: Wanlong Gao
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber
----- Original Message -----
> The memory policy setting format is like:
> policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N
> And we are adding this setting as a suboption of "-numa mem,",
> the memory policy can then be set like the following:
> -numa node,nodeid=0,cpus=0 \
> -numa node,nodeid=1,cpus=1 \
> -numa mem,nodeid=0,size=1G,policy=membind,host-nodes=0-1 \
> -numa mem,nodeid=1,size=1G,policy=interleave,relative=true,host-nodes=1
>
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> ---
> include/sysemu/sysemu.h | 3 +++
> numa.c | 13 +++++++++++++
> qapi-schema.json | 31 +++++++++++++++++++++++++++++--
> vl.c | 3 +++
> 4 files changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index b683d08..81d16a5 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -134,6 +134,9 @@ extern int nb_numa_mem_nodes;
> typedef struct node_info {
> uint64_t node_mem;
> DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
> + DECLARE_BITMAP(host_mem, MAX_CPUMASK_BITS);
> + NumaNodePolicy policy;
> + bool relative;
> } NodeInfo;
> extern NodeInfo numa_info[MAX_NODES];
> extern QemuOptsList qemu_numa_opts;
> diff --git a/numa.c b/numa.c
> index 3e2dfc1..4ccc6cb 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -74,6 +74,7 @@ static int numa_mem_parse(NumaMemOptions *opts)
> {
> uint16_t nodenr;
> uint64_t mem_size;
> + uint16List *nodes;
>
> if (opts->has_nodeid) {
> nodenr = opts->nodeid;
> @@ -91,6 +92,18 @@ static int numa_mem_parse(NumaMemOptions *opts)
> numa_info[nodenr].node_mem = mem_size;
> }
>
> + if (opts->has_policy) {
> + numa_info[nodenr].policy = opts->policy;
> + }
> +
> + if (opts->has_relative) {
> + numa_info[nodenr].relative = opts->relative;
> + }
> +
> + for (nodes = opts->host_nodes; nodes; nodes = nodes->next) {
> + bitmap_set(numa_info[nodenr].host_mem, nodes->value, 1);
> + }
> +
> return 0;
> }
>
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 11851a1..650741f 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -3806,6 +3806,24 @@
> '*mem': 'str' }}
>
> ##
> +# @NumaNodePolicy
> +#
> +# NUMA node policy types
> +#
> +# @default: restore default policy, remove any nondefault policy
> +#
> +# @membind: a strict policy that restricts memory allocation to the
> +# nodes specified
> +#
> +# @interleave: page allocations are interleaved across the set
> +# of nodes specified
> +#
> +# @preferred: set the preferred node for allocation
> +##
> +{ 'enum': 'NumaNodePolicy',
> + 'data': [ 'default', 'membind', 'interleave', 'preferred' ] }
> +
> +##
> # @NumaMemOptions
> #
> # Set memory information of guest NUMA node. (for OptsVisitor)
> @@ -3814,9 +3832,18 @@
> #
> # @size: #optional memory size of this node
> #
> +# @policy: #optional memory policy of this node
> +#
> +# @relative: #optional if the nodes specified are relative
> +#
> +# @host-nodes: #optional host nodes for its memory policy
> +#
> # Since 1.7
> ##
> { 'type': 'NumaMemOptions',
> 'data': {
> - '*nodeid': 'uint16',
> - '*size': 'size' }}
> + '*nodeid': 'uint16',
> + '*size': 'size',
> + '*policy': 'NumaNodePolicy',
> + '*relative': 'bool',
> + '*host-nodes': ['uint16'] }}
> diff --git a/vl.c b/vl.c
> index 2377b67..91b0d76 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2888,6 +2888,9 @@ int main(int argc, char **argv, char **envp)
> for (i = 0; i < MAX_NODES; i++) {
> numa_info[i].node_mem = 0;
> bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
> + bitmap_zero(numa_info[i].host_mem, MAX_CPUMASK_BITS);
Shouldn't the bitmap size of host_mem be MAX_NODES? If so, and you
change it, then make sure the find_last_bit() call also gets
updated in patch 8/12, and anywhere else needed.
drew
> + numa_info[i].policy = NUMA_NODE_POLICY_DEFAULT;
> + numa_info[i].relative = false;
> }
>
> nb_numa_nodes = 0;
> --
> 1.8.4.rc4
>
>
>
* Re: [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
2013-08-23 14:11 ` Andrew Jones
@ 2013-08-26 1:07 ` Wanlong Gao
2013-08-26 7:12 ` Andrew Jones
0 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-26 1:07 UTC (permalink / raw)
To: Andrew Jones
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber,
Wanlong Gao
On 08/23/2013 10:11 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> The memory policy setting format is like:
>> policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N
>>
>> [...]
>>
>> diff --git a/vl.c b/vl.c
>> index 2377b67..91b0d76 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -2888,6 +2888,9 @@ int main(int argc, char **argv, char **envp)
>> for (i = 0; i < MAX_NODES; i++) {
>> numa_info[i].node_mem = 0;
>> bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
>> + bitmap_zero(numa_info[i].host_mem, MAX_CPUMASK_BITS);
>
> Shouldn't the bitmap size of host_mem be MAX_NODES? If so, and you
MAX_NODES is the number of guest numa nodes, but this bitmap is for host
numa nodes. AFAIK, MAX_NODES is not big enough for the host node count;
the default host kernel NODES_SHIFT is 9.
Thanks,
Wanlong Gao
> change it, then make sure the find_last_bit() call also gets
> updated in patch 8/12, and anywhere else needed.
>
> drew
>
>> + numa_info[i].policy = NUMA_NODE_POLICY_DEFAULT;
>> + numa_info[i].relative = false;
>> }
>>
>> nb_numa_nodes = 0;
>> --
>> 1.8.4.rc4
>>
>>
>>
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-23 8:40 ` Andrew Jones
@ 2013-08-26 1:43 ` Wanlong Gao
2013-08-26 7:46 ` Andrew Jones
0 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-26 1:43 UTC (permalink / raw)
To: Andrew Jones
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber,
Wanlong Gao
On 08/23/2013 04:40 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> Add detection of libnuma (mostly contained in the numactl package)
>> to the configure script. It can be enabled or disabled on the command line;
>> the default is to use it if available.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
>
> Is this patch still necessary? I thought that dropping the
> numa_num_configured_nodes() calls from patch 8/12 got rid
> of the need for this library. Maybe I missed other uses?
Yes, in 08/12 we also use mbind(), and in 09/12 we use max_numa_node().
Thanks,
Wanlong Gao
>
> drew
>
* Re: [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
2013-08-26 1:07 ` Wanlong Gao
@ 2013-08-26 7:12 ` Andrew Jones
0 siblings, 0 replies; 26+ messages in thread
From: Andrew Jones @ 2013-08-26 7:12 UTC (permalink / raw)
To: gaowanlong
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, pbonzini, y-goto, lersek, afaerber
----- Original Message -----
> On 08/23/2013 10:11 PM, Andrew Jones wrote:
> >
> >
> > ----- Original Message -----
> >> The memory policy setting format is like:
> >> policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N
> >>
> >> [...]
> >>
> >> diff --git a/vl.c b/vl.c
> >> index 2377b67..91b0d76 100644
> >> --- a/vl.c
> >> +++ b/vl.c
> >> @@ -2888,6 +2888,9 @@ int main(int argc, char **argv, char **envp)
> >> for (i = 0; i < MAX_NODES; i++) {
> >> numa_info[i].node_mem = 0;
> >> bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS);
> >> + bitmap_zero(numa_info[i].host_mem, MAX_CPUMASK_BITS);
> >
> > Shouldn't the bitmap size of host_mem be MAX_NODES? If so, and you
>
> MAX_NODES is for guest numa nodes number, but this bitmap is for host
> numa nodes. AFAIK, this MAX_NODES is not big enough for host nodes number,
> the default host kernel NODES_SHIFT is 9.
MAX_CPUMASK_BITS == 255 is also too small for a default node shift of 9.
You have to pick something, and then manage that limit. I think MAX_NODES
== 64 will be big enough for quite some time, but libnuma chooses 128 (see
/usr/include/numa.h:NUMA_NUM_NODES). So maybe we can bump MAX_NODES up to
128? You can also add a warning for when you detect starting on a machine
that has more than MAX_NODES nodes.
drew
>
> Thanks,
> Wanlong Gao
>
> > change it, then make sure the find_last_bit() call also gets
> > updated in patch 8/12, and anywhere else needed.
> >
> > drew
> >
> >> + numa_info[i].policy = NUMA_NODE_POLICY_DEFAULT;
> >> + numa_info[i].relative = false;
> >> }
> >>
> >> nb_numa_nodes = 0;
> >> --
> >> 1.8.4.rc4
> >>
> >>
> >>
> >
>
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-26 1:43 ` Wanlong Gao
@ 2013-08-26 7:46 ` Andrew Jones
2013-08-26 8:16 ` Wanlong Gao
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2013-08-26 7:46 UTC (permalink / raw)
To: gaowanlong
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, y-goto, pbonzini, lersek, afaerber
----- Original Message -----
> On 08/23/2013 04:40 PM, Andrew Jones wrote:
> >
> >
> > ----- Original Message -----
> >> Add detection of libnuma (mostly contained in the numactl package)
> >> to the configure script. It can be enabled or disabled on the command line;
> >> the default is to use it if available.
> >>
> >> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> >> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> >
> > Is this patch still necessary? I thought that dropping the
> > numa_num_configured_nodes() calls from patch 8/12 got rid
> > of the need for this library. Maybe I missed other uses?
>
> Yes, in 08/12 we also use mbind(),
You don't need a whole library for mbind(), it's a syscall. See syscall(2).
> and in 09/12 we use max_numa_node().
Really? I didn't see it there. And anyway, that goes back to our discussion
about setting qemu's MAX_NODES to whatever we think qemu should support,
and then just checking that we don't blow that limit whenever reading
host node info, i.e.
maxnode = 0;
while (host_nodes[maxnode] && maxnode < MAX_NODES)
node_read(&info[maxnode++]);
type of a thing.
And, if there's a place you really need to know the current online number
of host nodes, then, like I said earlier, you should just go to sysfs
yourself. libnuma:numa_max_node() returns an int that it only initializes
at library load time, so it's not going to adapt to onlining/offlining.
drew
>
> Thanks,
> Wanlong Gao
>
> >
> > drew
> >
>
>
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-26 7:46 ` Andrew Jones
@ 2013-08-26 8:16 ` Wanlong Gao
2013-08-26 8:43 ` Andrew Jones
0 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-26 8:16 UTC (permalink / raw)
To: Andrew Jones
Cc: aliguori, ehabkost, qemu-devel, hutao, peter huangpeng,
lcapitulino, bsd, y-goto, pbonzini, lersek, afaerber,
Wanlong Gao
On 08/26/2013 03:46 PM, Andrew Jones wrote:
>>> Is this patch still necessary? I thought that dropping the
>>> numa_num_configured_nodes() calls from patch 8/12 got rid
>>> of the need for this library. Maybe I missed other uses?
>>
>> Yes, in 08/12 we also use mbind(),
> You don't need a whole library for mbind(), it's a syscall. See syscall(2).
>
>> and in 09/12 we use max_numa_node().
> Really? I didn't see it there. And anyway, that goes back to our discussion
> about setting qemu's MAX_NODES to whatever we think qemu should support,
> and then just checking that we don't blow that limit whenever reading
> host node info, i.e.
>
> maxnode = 0;
> while (host_nodes[maxnode] && maxnode < MAX_NODES)
> node_read(&info[maxnode++]);
>
> type of a thing.
>
> And, if there's a place you really need to know the current online number
> of host nodes, then, like I said earlier, you should just go to sysfs
> yourself. libnuma:numa_max_node() returns an int that it only initializes
> at library load time, so it's not going to adapt to onlining/offlining.
OK, thank you.
Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly, right?
Thanks,
Wanlong Gao
>
> drew
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-26 8:16 ` Wanlong Gao
@ 2013-08-26 8:43 ` Andrew Jones
2013-08-28 13:44 ` Paolo Bonzini
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2013-08-26 8:43 UTC (permalink / raw)
To: gaowanlong, aliguori, pbonzini
Cc: ehabkost, qemu-devel, hutao, peter huangpeng, lcapitulino, bsd,
y-goto, lersek, afaerber
----- Original Message -----
> On 08/26/2013 03:46 PM, Andrew Jones wrote:
> >>> Is this patch still necessary? I thought that dropping the
> >>> > > numa_num_configured_nodes() calls from patch 8/12 got rid
> >>> > > of the need for this library. Maybe I missed other uses?
> >> >
> >> > Yes, in 08/12 we also use mbind(),
> > You don't need a whole library for mbind(), it's a syscall. See syscall(2).
> >
> >> > and in 09/12 we use max_numa_node().
> > Really? I didn't see it there. And anyway, that goes back to our discussion
> > about setting qemu's MAX_NODES to whatever we think qemu should support,
> > and then just checking that we don't blow that limit whenever reading
> > host node info, i.e.
> >
> > maxnode = 0;
> > while (host_nodes[maxnode] && maxnode < MAX_NODES)
> > node_read(&info[maxnode++]);
> >
> > type of a thing.
> >
> > And, if there's a place you really need to know the current online number
> > of host nodes, then, like I said earlier, you should just go to sysfs
> > yourself. libnuma:numa_max_node() returns an int that it only initializes
> > at library load time, so it's not going to adapt to onlining/offlining.
>
> OK, thank you.
> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly,
> right?
Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more
general lib. Whether or not we want to redefine those symbols within
qemu, in order to avoid the dependency on installing numactl-devel, isn't
something I can answer. That's a better question for Anthony. Anthony? Paolo,
any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the
linux-header synch script?
thanks,
drew
>
> Thanks,
> Wanlong Gao
>
> >
> > drew
> >
>
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-26 8:43 ` Andrew Jones
@ 2013-08-28 13:44 ` Paolo Bonzini
2013-08-29 2:22 ` Wanlong Gao
0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2013-08-28 13:44 UTC (permalink / raw)
To: Andrew Jones
Cc: aliguori, ehabkost, hutao, peter huangpeng, qemu-devel, bsd,
y-goto, lcapitulino, lersek, afaerber, gaowanlong
Il 26/08/2013 10:43, Andrew Jones ha scritto:
>
> ----- Original Message -----
>> On 08/26/2013 03:46 PM, Andrew Jones wrote:
>>>>> Is this patch still necessary? I thought that dropping the
>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid
>>>>> of the need for this library. Maybe I missed other uses?
>>>>
>>>> Yes, in 08/12 we also use mbind(),
>>> You don't need a whole library for mbind(), it's a syscall. See syscall(2).
>>>
>>>> and in 09/12 we use max_numa_node().
>>> Really? I didn't see it there. And anyway, that goes back to our discussion
>>> about setting qemu's MAX_NODES to whatever we think qemu should support,
>>> and then just checking that we don't blow that limit whenever reading
>>> host node info, i.e.
>>>
>>> maxnode = 0;
>>> while (host_nodes[maxnode] && maxnode < MAX_NODES)
>>> node_read(&info[maxnode++]);
>>>
>>> type of a thing.
>>>
>>> And, if there's a place you really need to know the current online number
>>> of host nodes, then, like I said earlier, you should just go to sysfs
>>> yourself. libnuma:numa_max_node() returns an int that it only initializes
>>> at library load time, so it's not going to adapt to onlining/offlining.
>>
>> OK, thank you.
>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly,
>> right?
> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more
> general lib. Whether or not we want to redefine those symbols within
> qemu, in order to avoid the dependency on installing numactl-devel, isn't
> something I can answer. That's a better question for Anthony. Anthony? Paolo,
> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the
> linux-header synch script?
>
I think using libnuma is fine. In principle this could be used on other
OSes than Linux, I think?
Paolo
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-28 13:44 ` Paolo Bonzini
@ 2013-08-29 2:22 ` Wanlong Gao
2013-08-29 8:15 ` Andrew Jones
0 siblings, 1 reply; 26+ messages in thread
From: Wanlong Gao @ 2013-08-29 2:22 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Andrew Jones, ehabkost, hutao, qemu-devel, peter huangpeng, bsd,
aliguori, y-goto, lcapitulino, lersek, afaerber, Wanlong Gao
On 08/28/2013 09:44 PM, Paolo Bonzini wrote:
> Il 26/08/2013 10:43, Andrew Jones ha scritto:
>>
>> ----- Original Message -----
>>> On 08/26/2013 03:46 PM, Andrew Jones wrote:
>>>>>> Is this patch still necessary? I thought that dropping the
>>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid
>>>>>> of the need for this library. Maybe I missed other uses?
>>>>>
>>>>> Yes, in 08/12 we also use mbind(),
>>>> You don't need a whole library for mbind(), it's a syscall. See syscall(2).
>>>>
>>>>> and in 09/12 we use max_numa_node().
>>>> Really? I didn't see it there. And anyway, that goes back to our discussion
>>>> about setting qemu's MAX_NODES to whatever we think qemu should support,
>>>> and then just checking that we don't blow that limit whenever reading
>>>> host node info, i.e.
>>>>
>>>> maxnode = 0;
>>>> while (host_nodes[maxnode] && maxnode < MAX_NODES)
>>>> node_read(&info[maxnode++]);
>>>>
>>>> type of a thing.
>>>>
>>>> And, if there's a place you really need to know the current online number
>>>> of host nodes, then, like I said earlier, you should just go to sysfs
>>>> yourself. libnuma:numa_max_node() returns an int that it only initializes
>>>> at library load time, so it's not going to adapt to onlining/offlining.
>>>
>>> OK, thank you.
>>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly,
>>> right?
>> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more
>> general lib. Whether or not we want to redefine those symbols within
>> qemu, in order to avoid the dependency on installing numactl-devel, isn't
>> something I can answer. That's a better question for Anthony. Anthony? Paolo,
>> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the
>> linux-header synch script?
>>
>
> I think using libnuma is fine. In principle this could be used on other
> OSes than Linux, I think?
But it seems that mbind(2) is a Linux-specific syscall, right?
Thanks,
Wanlong Gao
>
> Paolo
>
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-29 2:22 ` Wanlong Gao
@ 2013-08-29 8:15 ` Andrew Jones
2013-08-29 8:31 ` Andrew Jones
0 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2013-08-29 8:15 UTC (permalink / raw)
To: gaowanlong
Cc: aliguori, ehabkost, hutao, qemu-devel, peter huangpeng, bsd,
y-goto, Paolo Bonzini, lcapitulino, lersek, afaerber
----- Original Message -----
> On 08/28/2013 09:44 PM, Paolo Bonzini wrote:
> > Il 26/08/2013 10:43, Andrew Jones ha scritto:
> >>
> >> ----- Original Message -----
> >>> On 08/26/2013 03:46 PM, Andrew Jones wrote:
> >>>>>> Is this patch still necessary? I thought that dropping the
> >>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid
> >>>>>> of the need for this library. Maybe I missed other uses?
> >>>>>
> >>>>> Yes, in 08/12 we also use mbind(),
> >>>> You don't need a whole library for mbind(), it's a syscall. See syscall(2).
> >>>>
> >>>>> and in 09/12 we use max_numa_node().
> >>>> Really? I didn't see it there. And anyway, that goes back to our discussion
> >>>> about setting qemu's MAX_NODES to whatever we think qemu should support,
> >>>> and then just checking that we don't blow that limit whenever reading
> >>>> host node info, i.e.
> >>>>
> >>>> maxnode = 0;
> >>>> while (host_nodes[maxnode] && maxnode < MAX_NODES)
> >>>> node_read(&info[maxnode++]);
> >>>>
> >>>> type of a thing.
> >>>>
> >>>> And, if there's a place you really need to know the current online number
> >>>> of host nodes, then, like I said earlier, you should just go to sysfs
> >>>> yourself. libnuma:numa_max_node() returns an int that it only initializes
> >>>> at library load time, so it's not going to adapt to onlining/offlining.
> >>>
> >>> OK, thank you.
> >>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly,
> >>> right?
> >> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more
> >> general lib. Whether or not we want to redefine those symbols within
> >> qemu, in order to avoid the dependency on installing numactl-devel, isn't
> >> something I can answer. That's a better question for Anthony. Anthony?
> >> Paolo,
> >> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the
> >> linux-header synch script?
> >>
> >
> > I think using libnuma is fine. In principle this could be used on other
> > OSes than Linux, I think?
>
> But seems that mbind(2) is Linux-specific syscall, right?
>
You would need to avoid calling mbind directly, i.e. use libnuma for all
NUMA-related calls. Then, if libnuma were to support more OSes, qemu would
automatically support them (with respect to NUMA) as well. Your mbind() with
libnuma would look like this:
numa_set_bind_policy(strict)
numa_tonodemask_memory(addr, size, nodemask)
The problem is that set_bind_policy only takes a bool, and thus only
allows two of the four possibly policies
MPOL_BIND strict == 1
MPOL_PREFERRED strict == 0
So, due to libnuma's policy-setting limitations, and the fact that it
currently supports no OS other than Linux, I prefer your current series
version that drops libnuma. If qemu needs to support NUMA on another OS,
we can cross that bridge when we get there.
drew
* Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
2013-08-29 8:15 ` Andrew Jones
@ 2013-08-29 8:31 ` Andrew Jones
0 siblings, 0 replies; 26+ messages in thread
From: Andrew Jones @ 2013-08-29 8:31 UTC (permalink / raw)
To: gaowanlong
Cc: aliguori, ehabkost, hutao, peter.huangpeng, qemu-devel, bsd,
Paolo Bonzini, y-goto, lcapitulino, lersek, afaerber
----- Original Message -----
>
>
> ----- Original Message -----
> > On 08/28/2013 09:44 PM, Paolo Bonzini wrote:
> > > Il 26/08/2013 10:43, Andrew Jones ha scritto:
> > >>
> > >> ----- Original Message -----
> > >>>> On 08/26/2013 03:46 PM, Andrew Jones wrote:
> > >>>>>>>>>> Is this patch still necessary? I thought that dropping the
> > >>>>>>>>>>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid
> > >>>>>>>>>>>>>> of the need for this library. Maybe I missed other uses?
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, in 08/12 we also use mbind(),
> > >>>>>> You don't need a whole library for mbind(), it's a syscall. See
> > >>>>>> syscall(2).
> > >>>>>>
> > >>>>>>>>>> and in 09/12 we use max_numa_node().
> > >>>>>> Really? I didn't see it there. And anyway, that goes back to our
> > >>>>>> discussion
> > >>>>>> about setting qemu's MAX_NODES to whatever we think qemu should
> > >>>>>> support,
> > >>>>>> and then just checking that we don't blow that limit whenever
> > >>>>>> reading
> > >>>>>> host node info, i.e.
> > >>>>>>
> > >>>>>> maxnode = 0;
> > >>>>>> while (maxnode < MAX_NODES && host_nodes[maxnode])
> > >>>>>>     node_read(&info[maxnode++]);
> > >>>>>>
> > >>>>>> type of a thing.
> > >>>>>>
> > >>>>>> And, if there's a place you really need to know the current online
> > >>>>>> number
> > >>>>>> of host nodes, then, like I said earlier, you should just go to
> > >>>>>> sysfs
> > >>>>>> yourself. libnuma:numa_max_node() returns an int that it only
> > >>>>>> initializes
> > >>>>>> at library load time, so it's not going to adapt to
> > >>>>>> onlining/offlining.
> > >>>>
> > >>>> OK, thank you.
> > >>>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall
> > >>>> directly,
> > >>>> right?
> > >> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a
> > >> more
> > >> general lib. Whether or not we want to redefine those symbols within
> > >> qemu, in order to avoid the dependency on installing numactl-devel,
> > >> isn't
> > >> something I can answer. That's a better question for Anthony. Anthony?
> > >> Paolo,
> > >> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the
> > >> linux-header synch script?
> > >>
> > >
> > > I think using libnuma is fine. In principle this could be used on other
> > > OSes than Linux, I think?
> >
> > But it seems that mbind(2) is a Linux-specific syscall, right?
> >
>
> You would need to avoid calling mbind directly, i.e. use libnuma for all
> NUMA-related calls. Then, if libnuma were to support more OSes, qemu would
> automatically support them (with respect to NUMA) as well. Your mbind()
> with libnuma would look like this
>
> numa_set_bind_policy(strict)
> numa_tonodemask_memory(addr, size, nodemask)
>
> The problem is that numa_set_bind_policy() only takes a bool, and thus
> only allows two of the four possible policies
>
> MPOL_BIND strict == 1
> MPOL_PREFERRED strict == 0
>
Ah, there is a way to get the interleave policy

if (policy == interleave) {
    numa_interleave_memory(addr, size, nodemask)
} else {
    numa_set_bind_policy(strict)
    numa_tonodemask_memory(addr, size, nodemask)
}

but it's a bit clunky. And I still don't see a way to select MPOL_DEFAULT,
nor a way to use any additional flags, such as MPOL_F_RELATIVE_NODES.
> So, due to libnuma's policy-setting limitations, and the fact that it
> currently supports no OS other than Linux, I prefer your current series
> version that drops libnuma. If qemu needs to support NUMA on another OS,
> we can cross that bridge when we get there.
end of thread [~2013-08-29 8:32 UTC]
Thread overview: 26+ messages
2013-08-23 4:09 [Qemu-devel] [PATCH V9 00/12] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 01/12] NUMA: add NumaOptions, NumaNodeOptions and NumaMemOptions Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 02/12] NUMA: split -numa option Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 03/12] NUMA: check if the total numa memory size is equal to ram_size Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 04/12] NUMA: move numa related code to numa.c Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 05/12] NUMA: Add numa_info structure to contain numa nodes info Wanlong Gao
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection Wanlong Gao
2013-08-23 8:40 ` Andrew Jones
2013-08-26 1:43 ` Wanlong Gao
2013-08-26 7:46 ` Andrew Jones
2013-08-26 8:16 ` Wanlong Gao
2013-08-26 8:43 ` Andrew Jones
2013-08-28 13:44 ` Paolo Bonzini
2013-08-29 2:22 ` Wanlong Gao
2013-08-29 8:15 ` Andrew Jones
2013-08-29 8:31 ` Andrew Jones
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy Wanlong Gao
2013-08-23 14:11 ` Andrew Jones
2013-08-26 1:07 ` Wanlong Gao
2013-08-26 7:12 ` Andrew Jones
2013-08-23 4:09 ` [Qemu-devel] [PATCH V9 08/12] NUMA: set " Wanlong Gao
2013-08-23 8:44 ` Andrew Jones
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 09/12] NUMA: add qmp command set-mem-policy to set memory policy for NUMA node Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 10/12] NUMA: add hmp command set-mem-policy Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 11/12] NUMA: add qmp command query-numa Wanlong Gao
2013-08-23 4:10 ` [Qemu-devel] [PATCH V9 12/12] NUMA: convert hmp command info_numa to use qmp command query_numa Wanlong Gao