* [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection @ 2013-06-05 3:58 Wanlong Gao 2013-06-05 3:58 ` [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes Wanlong Gao 2013-06-05 10:02 ` [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection Andreas Färber 0 siblings, 2 replies; 19+ messages in thread From: Wanlong Gao @ 2013-06-05 3:58 UTC (permalink / raw) To: qemu-devel; +Cc: andre.przywara, aliguori, ehabkost, pbonzini, Wanlong Gao Add detection of libnuma (mostly contained in the numactl package) to the configure script. Currently this is Linux only, but can be extended later should the need for other interfaces come up. Can be enabled or disabled on the command line, default is use if available. Original-signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> --- configure | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/configure b/configure index 1654413..d897f98 100755 --- a/configure +++ b/configure @@ -242,6 +242,7 @@ gtk="" gtkabi="2.0" tpm="no" libssh2="" +numa="yes" # parse CC options first for opt do @@ -937,6 +938,10 @@ for opt do ;; --enable-libssh2) libssh2="yes" ;; + --disable-numa) numa="no" + ;; + --enable-numa) numa="yes" + ;; *) echo "ERROR: unknown option $opt"; show_help="yes" ;; esac @@ -1151,6 +1156,8 @@ echo " --gcov=GCOV use specified gcov [$gcov_tool]" echo " --enable-tpm enable TPM support" echo " --disable-libssh2 disable ssh block device support" echo " --enable-libssh2 enable ssh block device support" +echo " --disable-numa disable host Linux NUMA support" +echo " --enable-numa enable host Linux NUMA support" echo "" echo "NOTE: The object files are built at the place where configure is launched" exit 1 @@ -2382,6 +2389,27 @@ EOF fi ########################################## +# libnuma probe + +if test "$numa" = "yes" ; then + numa=no + cat > $TMPC << EOF +#include <numa.h> +int main(void) { return numa_available(); } +EOF + + if compile_prog 
"" "-lnuma" ; then + numa=yes + libs_softmmu="-lnuma $libs_softmmu" + else + if test "$numa" = "yes" ; then + feature_not_found "linux NUMA (install numactl?)" + fi + numa=no + fi +fi + +########################################## # linux-aio probe if test "$linux_aio" != "no" ; then @@ -3550,6 +3578,7 @@ echo "TPM support $tpm" echo "libssh2 support $libssh2" echo "TPM passthrough $tpm_passthrough" echo "QOM debugging $qom_cast_debug" +echo "NUMA host support $numa" if test "$sdl_too_old" = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -3584,6 +3613,9 @@ echo "extra_cflags=$EXTRA_CFLAGS" >> $config_host_mak echo "extra_ldflags=$EXTRA_LDFLAGS" >> $config_host_mak echo "qemu_localedir=$qemu_localedir" >> $config_host_mak echo "libs_softmmu=$libs_softmmu" >> $config_host_mak +if test "$numa" = "yes"; then + echo "CONFIG_NUMA=y" >> $config_host_mak +fi echo "ARCH=$ARCH" >> $config_host_mak -- 1.8.3.rc2.10.g0c2b1cf ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-05 3:58 [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection Wanlong Gao @ 2013-06-05 3:58 ` Wanlong Gao 2013-06-05 12:39 ` Eric Blake ` (2 more replies) 2013-06-05 10:02 ` [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection Andreas Färber 1 sibling, 3 replies; 19+ messages in thread From: Wanlong Gao @ 2013-06-05 3:58 UTC (permalink / raw) To: qemu-devel; +Cc: andre.przywara, aliguori, ehabkost, pbonzini, Wanlong Gao Add monitor command mem-nodes to show the huge mapped memory nodes locations. (qemu) info mem-nodes /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 Refer to the proposal of Eduardo and Daniel. http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476 Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> --- monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/monitor.c b/monitor.c index eefc7f0..85c865f 100644 --- a/monitor.c +++ b/monitor.c @@ -74,6 +74,10 @@ #endif #include "hw/lm32/lm32_pic.h" +#if defined(CONFIG_NUMA) +#include <numaif.h> +#endif + //#define DEBUG //#define DEBUG_COMPLETION @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict) } #endif +#if defined(CONFIG_NUMA) +static void mem_nodes(Monitor *mon, const QDict *qdict) +{ + RAMBlock *block; + int prevnode, node; + unsigned long long c, start, area; + int fd; + int pid = getpid(); + QTAILQ_FOREACH(block, &ram_list.blocks, next) { + if (!(fd = block->fd)) + continue; + prevnode = -1; + start = 0; + area = (unsigned long long)block->host; + for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) { + if (get_mempolicy(&node, NULL, 0, c + block->host, + MPOL_F_ADDR | MPOL_F_NODE) < 0) + continue; + if (node == prevnode) + continue; + if (prevnode != -1) + 
monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", + pid, fd, start + area, c + area, prevnode); + prevnode = node; + start = c; + } + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", + pid, fd, start + area, c + area, prevnode); + } +} +#endif + #if defined(TARGET_SH4) static void print_tlb(Monitor *mon, int idx, tlb_t *tlb) @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = { .mhandler.cmd = mem_info, }, #endif +#if defined(CONFIG_NUMA) + { + .name = "mem-nodes", + .args_type = "", + .params = "", + .help = "show the huge mapped memory nodes location", + .mhandler.cmd = mem_nodes, + }, +#endif { .name = "mtree", .args_type = "", -- 1.8.3.rc2.10.g0c2b1cf ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-05 3:58 ` [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes Wanlong Gao @ 2013-06-05 12:39 ` Eric Blake 2013-06-05 12:57 ` Anthony Liguori 2013-06-05 13:46 ` Eduardo Habkost 2 siblings, 0 replies; 19+ messages in thread From: Eric Blake @ 2013-06-05 12:39 UTC (permalink / raw) To: Wanlong Gao; +Cc: andre.przywara, aliguori, pbonzini, qemu-devel, ehabkost On 06/04/2013 09:58 PM, Wanlong Gao wrote: > Add monitor command mem-nodes to show the huge mapped > memory nodes locations. Missing a QMP counterpart. Libvirt would probably like to use this command, and providing it HMP-only is not friendly. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-05 3:58 ` [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes Wanlong Gao 2013-06-05 12:39 ` Eric Blake @ 2013-06-05 12:57 ` Anthony Liguori 2013-06-05 15:54 ` Eduardo Habkost 2013-06-05 13:46 ` Eduardo Habkost 2 siblings, 1 reply; 19+ messages in thread From: Anthony Liguori @ 2013-06-05 12:57 UTC (permalink / raw) To: Wanlong Gao, qemu-devel; +Cc: pbonzini, ehabkost, andre.przywara Wanlong Gao <gaowanlong@cn.fujitsu.com> writes: > Add monitor command mem-nodes to show the huge mapped > memory nodes locations. > > (qemu) info mem-nodes > /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 > /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 > /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 > /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 This creates an ABI that we don't currently support. Memory hotplug or a variety of things can break this mapping and then we'd have to provide an interface to describe that the mapping was broken. Also, it only works with hugetlbfs, which is probably not widely used given the existence of THP. I had hoped that we would get proper userspace interfaces for describing memory groups but that appears to have stalled out. Does anyone know if this is still on the table? If we can't get a proper kernel interface, then perhaps we need to add full libnuma support but that would really be unfortunate... Regards, Anthony Liguori > > Refer to the proposal of Eduardo and Daniel.
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476 > > Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> > --- > monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 45 insertions(+) > > diff --git a/monitor.c b/monitor.c > index eefc7f0..85c865f 100644 > --- a/monitor.c > +++ b/monitor.c > @@ -74,6 +74,10 @@ > #endif > #include "hw/lm32/lm32_pic.h" > > +#if defined(CONFIG_NUMA) > +#include <numaif.h> > +#endif > + > //#define DEBUG > //#define DEBUG_COMPLETION > > @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict) > } > #endif > > +#if defined(CONFIG_NUMA) > +static void mem_nodes(Monitor *mon, const QDict *qdict) > +{ > + RAMBlock *block; > + int prevnode, node; > + unsigned long long c, start, area; > + int fd; > + int pid = getpid(); > + QTAILQ_FOREACH(block, &ram_list.blocks, next) { > + if (!(fd = block->fd)) > + continue; > + prevnode = -1; > + start = 0; > + area = (unsigned long long)block->host; > + for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) { > + if (get_mempolicy(&node, NULL, 0, c + block->host, > + MPOL_F_ADDR | MPOL_F_NODE) < 0) > + continue; > + if (node == prevnode) > + continue; > + if (prevnode != -1) > + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > + pid, fd, start + area, c + area, prevnode); > + prevnode = node; > + start = c; > + } > + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > + pid, fd, start + area, c + area, prevnode); > + } > +} > +#endif > + > #if defined(TARGET_SH4) > > static void print_tlb(Monitor *mon, int idx, tlb_t *tlb) > @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = { > .mhandler.cmd = mem_info, > }, > #endif > +#if defined(CONFIG_NUMA) > + { > + .name = "mem-nodes", > + .args_type = "", > + .params = "", > + .help = "show the huge mapped memory nodes location", > + .mhandler.cmd = mem_nodes, > + }, > +#endif > { > .name = "mtree", > .args_type = "", > -- > 1.8.3.rc2.10.g0c2b1cf ^ permalink raw reply 
[flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-05 12:57 ` Anthony Liguori @ 2013-06-05 15:54 ` Eduardo Habkost 2013-06-06 9:30 ` Wanlong Gao 2013-06-14 1:04 ` Anthony Liguori 0 siblings, 2 replies; 19+ messages in thread From: Eduardo Habkost @ 2013-06-05 15:54 UTC (permalink / raw) To: Anthony Liguori; +Cc: pbonzini, qemu-devel, Wanlong Gao, andre.przywara On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote: > Wanlong Gao <gaowanlong@cn.fujitsu.com> writes: > > > Add monitor command mem-nodes to show the huge mapped > > memory nodes locations. > > > > (qemu) info mem-nodes > > /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 > > /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 > > /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 > > /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 > > This creates an ABI that we don't currently support. Memory hotplug or > a variety of things can break this mapping and then we'd have to provide > an interface to describe that the mapping was broken. What do you mean by "breaking this mapping", exactly? Would the backing file of existing guest RAM ever change? (It would require a memory copy from one file to another, why would QEMU ever do that?) > > Also, it only works with hugetlbfs which is probbably not widely used > given the existance of THP. Quoting yourself at http://article.gmane.org/gmane.comp.emulators.kvm.devel/58227: >> It's extremely likely that if you're doing NUMA pinning, you're also >> doing large pages via hugetlbfs. numactl can already set policies for >> files in hugetlbfs so all you need to do is have a separate hugetlbfs >> file for each numa node. >> >> Then you have all the flexibility of numactl and you can implement node >> migration external to QEMU if you so desire. 
And if we simply report which backing files and offsets are being used for guest RAM, one could simply use 'numactl --file --offset --length', so we don't even need separate files/mem-paths for each node. Does THP already work with tmpfs? If it does, people who don't want hugetlbfs and want numa tuning to work with THP could just use tmpfs for -mem-path. > > I had hoped that we would get proper userspace interfaces for describing > memory groups but that appears to have stalled out. I would love to have it. But while we don't have it, sharing the tmpfs/hugetlbfs backing files seems to work just fine as a mechanism to let other tools manipulate guest memory policy. We just need to let external tools know where the backing files are. > > Does anyone know if this is still on the table? > > If we can't get a proper kernel interface, then perhaps we need to add > full libnuma support but that would really be unfortunate... Why isn't the "info mem-nodes" solution (I mean: not this version, but a proper QMP version that exposes all the information we need) an option? > > Regards, > > Anthony Liguori > > > > > Refer to the proposal of Eduardo and Daniel.
> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476 > > > > Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> > > --- > > monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 45 insertions(+) > > > > diff --git a/monitor.c b/monitor.c > > index eefc7f0..85c865f 100644 > > --- a/monitor.c > > +++ b/monitor.c > > @@ -74,6 +74,10 @@ > > #endif > > #include "hw/lm32/lm32_pic.h" > > > > +#if defined(CONFIG_NUMA) > > +#include <numaif.h> > > +#endif > > + > > //#define DEBUG > > //#define DEBUG_COMPLETION > > > > @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict) > > } > > #endif > > > > +#if defined(CONFIG_NUMA) > > +static void mem_nodes(Monitor *mon, const QDict *qdict) > > +{ > > + RAMBlock *block; > > + int prevnode, node; > > + unsigned long long c, start, area; > > + int fd; > > + int pid = getpid(); > > + QTAILQ_FOREACH(block, &ram_list.blocks, next) { > > + if (!(fd = block->fd)) > > + continue; > > + prevnode = -1; > > + start = 0; > > + area = (unsigned long long)block->host; > > + for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) { > > + if (get_mempolicy(&node, NULL, 0, c + block->host, > > + MPOL_F_ADDR | MPOL_F_NODE) < 0) > > + continue; > > + if (node == prevnode) > > + continue; > > + if (prevnode != -1) > > + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > > + pid, fd, start + area, c + area, prevnode); > > + prevnode = node; > > + start = c; > > + } > > + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > > + pid, fd, start + area, c + area, prevnode); > > + } > > +} > > +#endif > > + > > #if defined(TARGET_SH4) > > > > static void print_tlb(Monitor *mon, int idx, tlb_t *tlb) > > @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = { > > .mhandler.cmd = mem_info, > > }, > > #endif > > +#if defined(CONFIG_NUMA) > > + { > > + .name = "mem-nodes", > > + .args_type = "", > > + .params = "", > > + .help = "show the huge mapped memory nodes 
location", > > + .mhandler.cmd = mem_nodes, > > + }, > > +#endif > > { > > .name = "mtree", > > .args_type = "", > > -- > > 1.8.3.rc2.10.g0c2b1cf > -- Eduardo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-05 15:54 ` Eduardo Habkost @ 2013-06-06 9:30 ` Wanlong Gao 2013-06-06 16:15 ` Eduardo Habkost 2013-06-14 1:04 ` Anthony Liguori 1 sibling, 1 reply; 19+ messages in thread From: Wanlong Gao @ 2013-06-06 9:30 UTC (permalink / raw) To: Eduardo Habkost, Anthony Liguori Cc: aarcange, andre.przywara, qemu-devel, mgorman, pbonzini, Wanlong Gao On 06/05/2013 11:54 PM, Eduardo Habkost wrote: > On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote: >> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes: >> >>> Add monitor command mem-nodes to show the huge mapped >>> memory nodes locations. >>> >>> (qemu) info mem-nodes >>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 >>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 >>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 >>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 >> >> This creates an ABI that we don't currently support. Memory hotplug or >> a variety of things can break this mapping and then we'd have to provide >> an interface to describe that the mapping was broken. > > What do you mean by "breaking this mapping", exactly? Would the backing > file of existing guest RAM ever change? (It would require a memory copy > from one file to another, why would QEMU ever do that?) > >> >> Also, it only works with hugetlbfs which is probbably not widely used >> given the existance of THP. > > Quoting yourself at > http://article.gmane.org/gmane.comp.emulators.kvm.devel/58227: > >>> It's extremely likely that if you're doing NUMA pinning, you're also >>> doing large pages via hugetlbfs. numactl can already set policies for >>> files in hugetlbfs so all you need to do is have a separate hugetlbfs >>> file for each numa node. >>> >>> Then you have all the flexibility of numactl and you can implement node >>> migration external to QEMU if you so desire. 
> > And if we simply report where are the backing files and offsets being > used for guest RAM, one could simply use > 'numactl --file --offset --length', so we don't even need separate > files/mem-paths for each node. Does "numactl" work after the QEMU process has mapped the hugetlbfs file? I'm afraid a mempolicy set after that will take no effect. And if PCI passthrough is used, the directly attached device does DMA transfers to and from the QEMU process, so all pages of the guest will be pinned by get_user_pages(): KVM_ASSIGN_PCI_DEVICE ioctl kvm_vm_ioctl_assign_device() => kvm_assign_device() => kvm_iommu_map_memslots() => kvm_iommu_map_pages() => kvm_pin_pages() So, with a directly attached device, every guest page's refcount is raised by one and page migration will not work. AutoNUMA won't either, so any NUMA directives are likewise ignored. Mel? Andrea? > > Does THP work with tmpfs, already? If it does, people who doesn't want > hugetlbfs and want numa tuning to work with THP could just use tmpfs for > -mem-path. > >> >> I had hoped that we would get proper userspace interfaces for describing >> memory groups but that appears to have stalled out. > > I would love to have it. But while we don't have it, sharing the > tmpfs/hugetlbfs backing files seem to work just fine as a mechanism to > let other tools manipulate guest memory policy. We just need to let > external tools know where the backing files are. > >> >> Does anyone know if this is still on the table? >> >> If we can't get a proper kernel interface, then perhaps we need to add >> full libnuma support but that would really be unfortunate... > > Why isn't the "info mem-nodes" solution (I mean: not this version, but a > proper QMP version that exposes all the information we need) an option? > And the shortcoming of hugetlbfs is that we can't know in advance how many virtual machines there will be, so we can't determine how many huge pages to reserve for them. 
In order to set NUMA mempolicies on THP or normal memory, as Anthony said, I think we should add full libnuma support to QEMU, and allow setting mempolicies manually on the QEMU command line and through the QEMU monitor after QEMU has started. The external tools can't do exactly the right thing. So IMO the right direction is the earlier, objected-to proposal - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@amd.com> http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684 - Message-ID: <4C7D7C2A.7000205@codemonkey.ws> http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835 If you agree, I'll refactor this patch set for review. Thanks, Wanlong Gao > >> >> Regards, >> >> Anthony Liguori >> >>> >>> Refer to the proposal of Eduardo and Daniel. >>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476 >>> >>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> >>> --- >>> monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 45 insertions(+) >>> >>> diff --git a/monitor.c b/monitor.c >>> index eefc7f0..85c865f 100644 >>> --- a/monitor.c >>> +++ b/monitor.c >>> @@ -74,6 +74,10 @@ >>> #endif >>> #include "hw/lm32/lm32_pic.h" >>> >>> +#if defined(CONFIG_NUMA) >>> +#include <numaif.h> >>> +#endif >>> + >>> //#define DEBUG >>> //#define DEBUG_COMPLETION >>> >>> @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict) >>> } >>> #endif >>> >>> +#if defined(CONFIG_NUMA) >>> +static void mem_nodes(Monitor *mon, const QDict *qdict) >>> +{ >>> + RAMBlock *block; >>> + int prevnode, node; >>> + unsigned long long c, start, area; >>> + int fd; >>> + int pid = getpid(); >>> + QTAILQ_FOREACH(block, &ram_list.blocks, next) { >>> + if (!(fd = block->fd)) >>> + continue; >>> + prevnode = -1; >>> + start = 0; >>> + area = (unsigned long long)block->host; >>> + for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) { >>> + if (get_mempolicy(&node, NULL, 0, c + block->host, >>> + MPOL_F_ADDR | MPOL_F_NODE) < 0) >>> + continue; >>> + 
if (node == prevnode) >>> + continue; >>> + if (prevnode != -1) >>> + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", >>> + pid, fd, start + area, c + area, prevnode); >>> + prevnode = node; >>> + start = c; >>> + } >>> + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", >>> + pid, fd, start + area, c + area, prevnode); >>> + } >>> +} >>> +#endif >>> + >>> #if defined(TARGET_SH4) >>> >>> static void print_tlb(Monitor *mon, int idx, tlb_t *tlb) >>> @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = { >>> .mhandler.cmd = mem_info, >>> }, >>> #endif >>> +#if defined(CONFIG_NUMA) >>> + { >>> + .name = "mem-nodes", >>> + .args_type = "", >>> + .params = "", >>> + .help = "show the huge mapped memory nodes location", >>> + .mhandler.cmd = mem_nodes, >>> + }, >>> +#endif >>> { >>> .name = "mtree", >>> .args_type = "", >>> -- >>> 1.8.3.rc2.10.g0c2b1cf >> > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-06 9:30 ` Wanlong Gao @ 2013-06-06 16:15 ` Eduardo Habkost 0 siblings, 0 replies; 19+ messages in thread From: Eduardo Habkost @ 2013-06-06 16:15 UTC (permalink / raw) To: Wanlong Gao Cc: aarcange, andre.przywara, Anthony Liguori, qemu-devel, mgorman, pbonzini On Thu, Jun 06, 2013 at 05:30:16PM +0800, Wanlong Gao wrote: > On 06/05/2013 11:54 PM, Eduardo Habkost wrote: > > On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote: > >> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes: > >> > >>> Add monitor command mem-nodes to show the huge mapped > >>> memory nodes locations. > >>> > >>> (qemu) info mem-nodes > >>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 > >>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 > >>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 > >>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 > >> > >> This creates an ABI that we don't currently support. Memory hotplug or > >> a variety of things can break this mapping and then we'd have to provide > >> an interface to describe that the mapping was broken. > > > > What do you mean by "breaking this mapping", exactly? Would the backing > > file of existing guest RAM ever change? (It would require a memory copy > > from one file to another, why would QEMU ever do that?) > > > >> > >> Also, it only works with hugetlbfs which is probbably not widely used > >> given the existance of THP. > > > > Quoting yourself at > > http://article.gmane.org/gmane.comp.emulators.kvm.devel/58227: > > > >>> It's extremely likely that if you're doing NUMA pinning, you're also > >>> doing large pages via hugetlbfs. numactl can already set policies for > >>> files in hugetlbfs so all you need to do is have a separate hugetlbfs > >>> file for each numa node. > >>> > >>> Then you have all the flexibility of numactl and you can implement node > >>> migration external to QEMU if you so desire. 
> > > > And if we simply report where are the backing files and offsets being > > used for guest RAM, one could simply use > > 'numactl --file --offset --length', so we don't even need separate > > files/mem-paths for each node. > > Does "numactl" work after QEMU process mapped the hugetlbfs file? > I'm afraid the mempolicy set after it will take no effect. I was always expecting it to work. I will make some tests to find out. > > And, if PCI-passthrough is used, direct-attached-device uses DMA transfer > between device and qemu process. All pages of the guest will be pinned by get_user_pages(). > > KVM_ASSIGN_PCI_DEVICE ioctl > kvm_vm_ioctl_assign_device() > =>kvm_assign_device() > => kvm_iommu_map_memslots() > => kvm_iommu_map_pages() > => kvm_pin_pages() > > So, with direct-attached-device, all guest page's page count will be +1 and > any page migration will not work. AutoNUMA won't too. Then any numa directions > are also ignored. We could have a new -mem-path-like option where we could tell QEMU exactly which hugetlbfs/tmpfs files should be used for each memory region. This way an external tool could create the files, set the policy on them, and then tell QEMU to use them. > > Mel? Andrea? > > > > > Does THP work with tmpfs, already? If it does, people who doesn't want > > hugetlbfs and want numa tuning to work with THP could just use tmpfs for > > -mem-path. > > > >> > >> I had hoped that we would get proper userspace interfaces for describing > >> memory groups but that appears to have stalled out. > > > > I would love to have it. But while we don't have it, sharing the > > tmpfs/hugetlbfs backing files seem to work just fine as a mechanism to > > let other tools manipulate guest memory policy. We just need to let > > external tools know where the backing files are. > > > >> > >> Does anyone know if this is still on the table? 
> >> > >> If we can't get a proper kernel interface, then perhaps we need to add > >> full libnuma support but that would really be unfortunate... > > > > Why isn't the "info mem-nodes" solution (I mean: not this version, but a > > proper QMP version that exposes all the information we need) an option? > > > > And the shortage of hugetlbfs is that we can't know how many virtual > machines, so that we can't determine how many huge pages to be reserved > for virtual machines. Then we just need to have THP working on tmpfs, and people who don't want hugetlbfs could use tmpfs for -mem-path. > > In order to set numa mempolices on THP or normal memories, as Anthony said, > I think we should add full libnuma support into QEMU, and allow manually > set mempolices in the QEMU command line and through QEMU monitor after > QEMU started. The external tools can't do exactly the right thing. > > So IMO the right direction is the objected proposal > - Message-ID: <1281534738-8310-1-git-send-email-andre.przywara@amd.com> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/57684 > - Message-ID: <4C7D7C2A.7000205@codemonkey.ws> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/58835 > > If you agree, I'll refact this patch set for review. I believe there are many different ways to avoid having QEMU reimplementing what numactl and other tools can already do, and we should consider those solutions before going that way. But my main question about this specific patch is: if you are not exposing this information so other tools could set policy themselves and QEMU can't set the policy itself yet, what's exactly the purpose of this patch? > > Thanks, > Wanlong Gao > > > > >> > >> Regards, > >> > >> Anthony Liguori > >> > >>> > >>> Refer to the proposal of Eduardo and Daniel. 
> >>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476 > >>> > >>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> > >>> --- > >>> monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ > >>> 1 file changed, 45 insertions(+) > >>> > >>> diff --git a/monitor.c b/monitor.c > >>> index eefc7f0..85c865f 100644 > >>> --- a/monitor.c > >>> +++ b/monitor.c > >>> @@ -74,6 +74,10 @@ > >>> #endif > >>> #include "hw/lm32/lm32_pic.h" > >>> > >>> +#if defined(CONFIG_NUMA) > >>> +#include <numaif.h> > >>> +#endif > >>> + > >>> //#define DEBUG > >>> //#define DEBUG_COMPLETION > >>> > >>> @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict) > >>> } > >>> #endif > >>> > >>> +#if defined(CONFIG_NUMA) > >>> +static void mem_nodes(Monitor *mon, const QDict *qdict) > >>> +{ > >>> + RAMBlock *block; > >>> + int prevnode, node; > >>> + unsigned long long c, start, area; > >>> + int fd; > >>> + int pid = getpid(); > >>> + QTAILQ_FOREACH(block, &ram_list.blocks, next) { > >>> + if (!(fd = block->fd)) > >>> + continue; > >>> + prevnode = -1; > >>> + start = 0; > >>> + area = (unsigned long long)block->host; > >>> + for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) { > >>> + if (get_mempolicy(&node, NULL, 0, c + block->host, > >>> + MPOL_F_ADDR | MPOL_F_NODE) < 0) > >>> + continue; > >>> + if (node == prevnode) > >>> + continue; > >>> + if (prevnode != -1) > >>> + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > >>> + pid, fd, start + area, c + area, prevnode); > >>> + prevnode = node; > >>> + start = c; > >>> + } > >>> + monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n", > >>> + pid, fd, start + area, c + area, prevnode); > >>> + } > >>> +} > >>> +#endif > >>> + > >>> #if defined(TARGET_SH4) > >>> > >>> static void print_tlb(Monitor *mon, int idx, tlb_t *tlb) > >>> @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = { > >>> .mhandler.cmd = mem_info, > >>> }, > >>> #endif > >>> +#if defined(CONFIG_NUMA) > 
>>> + { > >>> + .name = "mem-nodes", > >>> + .args_type = "", > >>> + .params = "", > >>> + .help = "show the huge mapped memory nodes location", > >>> + .mhandler.cmd = mem_nodes, > >>> + }, > >>> +#endif > >>> { > >>> .name = "mtree", > >>> .args_type = "", > >>> -- > >>> 1.8.3.rc2.10.g0c2b1cf > >> > > > -- Eduardo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-05 15:54   ` Eduardo Habkost
  2013-06-06  9:30     ` Wanlong Gao
@ 2013-06-14  1:04   ` Anthony Liguori
  2013-06-14 13:56     ` Eduardo Habkost
  1 sibling, 1 reply; 19+ messages in thread
From: Anthony Liguori @ 2013-06-14  1:04 UTC (permalink / raw)
To: Eduardo Habkost; +Cc: pbonzini, qemu-devel, Wanlong Gao, andre.przywara

Eduardo Habkost <ehabkost@redhat.com> writes:

> On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote:
>> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
>>
>> > Add monitor command mem-nodes to show the huge mapped
>> > memory nodes locations.
>> >
>> > (qemu) info mem-nodes
>> > /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
>> > /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
>> > /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
>> > /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
>>
>> This creates an ABI that we don't currently support.  Memory hotplug or
>> a variety of things can break this mapping and then we'd have to provide
>> an interface to describe that the mapping was broken.
>
> What do you mean by "breaking this mapping", exactly? Would the backing
> file of existing guest RAM ever change? (It would require a memory copy
> from one file to another, why would QEMU ever do that?)

Memory hot-add will change the mapping.  hot-remove (if ever
implemented) would break it.

>
>>
>> Also, it only works with hugetlbfs which is probbably not widely used
>> given the existance of THP.
>
> Quoting yourself at
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/58227:

Unfortunately the kernel side of the world has all but stalled making
progress here.  I don't think we have better choices.

>
>>> It's extremely likely that if you're doing NUMA pinning, you're also
>>> doing large pages via hugetlbfs.  numactl can already set policies for
>>> files in hugetlbfs so all you need to do is have a separate hugetlbfs
>>> file for each numa node.
>>>
>>> Then you have all the flexibility of numactl and you can implement node
>>> migration external to QEMU if you so desire.
>
> And if we simply report where are the backing files and offsets being
> used for guest RAM, one could simply use
> 'numactl --file --offset --length', so we don't even need separate
> files/mem-paths for each node.
>
> Does THP work with tmpfs, already?

No.

> If it does, people who doesn't want
> hugetlbfs and want numa tuning to work with THP could just use tmpfs for
> -mem-path.
>
>>
>> I had hoped that we would get proper userspace interfaces for describing
>> memory groups but that appears to have stalled out.
>
> I would love to have it. But while we don't have it, sharing the
> tmpfs/hugetlbfs backing files seem to work just fine as a mechanism to
> let other tools manipulate guest memory policy. We just need to let
> external tools know where the backing files are.

Is this meant for numad?  Wouldn't you want numad to work without
hugetlbfs?

You have to preallocate pages to hugetlbfs.  It's very difficult to use
in practice.

>
>>
>> Does anyone know if this is still on the table?
>>
>> If we can't get a proper kernel interface, then perhaps we need to add
>> full libnuma support but that would really be unfortunate...
>
> Why isn't the "info mem-nodes" solution (I mean: not this version, but a
> proper QMP version that exposes all the information we need) an
> option?

We're exposing internal QEMU information (the HVA -> GPA mapping) as an
external stable interface.

Regards,

Anthony Liguori

>
>
>>
>> Regards,
>>
>> Anthony Liguori
>>
>> >
>> > Refer to the proposal of Eduardo and Daniel.
>> > http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476
>> >
>> > Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
>> > ---
>> >  monitor.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 45 insertions(+)
>> >
>> > diff --git a/monitor.c b/monitor.c
>> > index eefc7f0..85c865f 100644
>> > --- a/monitor.c
>> > +++ b/monitor.c
>> > @@ -74,6 +74,10 @@
>> >  #endif
>> >  #include "hw/lm32/lm32_pic.h"
>> >
>> > +#if defined(CONFIG_NUMA)
>> > +#include <numaif.h>
>> > +#endif
>> > +
>> >  //#define DEBUG
>> >  //#define DEBUG_COMPLETION
>> >
>> > @@ -1759,6 +1763,38 @@ static void mem_info(Monitor *mon, const QDict *qdict)
>> >  }
>> >  #endif
>> >
>> > +#if defined(CONFIG_NUMA)
>> > +static void mem_nodes(Monitor *mon, const QDict *qdict)
>> > +{
>> > +    RAMBlock *block;
>> > +    int prevnode, node;
>> > +    unsigned long long c, start, area;
>> > +    int fd;
>> > +    int pid = getpid();
>> > +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>> > +        if (!(fd = block->fd))
>> > +            continue;
>> > +        prevnode = -1;
>> > +        start = 0;
>> > +        area = (unsigned long long)block->host;
>> > +        for (c = 0; c < block->length; c += TARGET_PAGE_SIZE) {
>> > +            if (get_mempolicy(&node, NULL, 0, c + block->host,
>> > +                              MPOL_F_ADDR | MPOL_F_NODE) < 0)
>> > +                continue;
>> > +            if (node == prevnode)
>> > +                continue;
>> > +            if (prevnode != -1)
>> > +                monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n",
>> > +                               pid, fd, start + area, c + area, prevnode);
>> > +            prevnode = node;
>> > +            start = c;
>> > +        }
>> > +        monitor_printf(mon, "/proc/%d/fd/%d: %016Lx-%016Lx: node%d\n",
>> > +                       pid, fd, start + area, c + area, prevnode);
>> > +    }
>> > +}
>> > +#endif
>> > +
>> >  #if defined(TARGET_SH4)
>> >
>> >  static void print_tlb(Monitor *mon, int idx, tlb_t *tlb)
>> > @@ -2567,6 +2603,15 @@ static mon_cmd_t info_cmds[] = {
>> >          .mhandler.cmd = mem_info,
>> >      },
>> >  #endif
>> > +#if defined(CONFIG_NUMA)
>> > +    {
>> > +        .name       = "mem-nodes",
>> > +        .args_type  = "",
>> > +        .params     = "",
>> > +        .help       = "show the huge mapped memory nodes location",
>> > +        .mhandler.cmd = mem_nodes,
>> > +    },
>> > +#endif
>> >      {
>> >          .name       = "mtree",
>> >          .args_type  = "",
>> > --
>> > 1.8.3.rc2.10.g0c2b1cf
>> >
>
--
Eduardo

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-14  1:04 ` Anthony Liguori
@ 2013-06-14 13:56   ` Eduardo Habkost
  0 siblings, 0 replies; 19+ messages in thread
From: Eduardo Habkost @ 2013-06-14 13:56 UTC (permalink / raw)
To: Anthony Liguori; +Cc: pbonzini, qemu-devel, Wanlong Gao, andre.przywara

On Thu, Jun 13, 2013 at 08:04:00PM -0500, Anthony Liguori wrote:
> Eduardo Habkost <ehabkost@redhat.com> writes:
>
> > On Wed, Jun 05, 2013 at 07:57:42AM -0500, Anthony Liguori wrote:
> >> Wanlong Gao <gaowanlong@cn.fujitsu.com> writes:
> >>
> >> > Add monitor command mem-nodes to show the huge mapped
> >> > memory nodes locations.
> >> >
> >> > (qemu) info mem-nodes
> >> > /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
> >> > /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
> >> > /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
> >> > /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
> >>
> >> This creates an ABI that we don't currently support.  Memory hotplug or
> >> a variety of things can break this mapping and then we'd have to provide
> >> an interface to describe that the mapping was broken.
> >
> > What do you mean by "breaking this mapping", exactly? Would the backing
> > file of existing guest RAM ever change? (It would require a memory copy
> > from one file to another, why would QEMU ever do that?)
>
> Memory hot-add will change the mapping.  hot-remove (if ever
> implemented) would break it.

So, would the backing-file/offset of existing guest RAM ever change? (It
would require a memory copy from one file to another, why would QEMU
ever do that?)

[...]
> >
> > Does THP work with tmpfs, already?
>
> No.

OK, that's a real problem.

> > If it does, people who doesn't want
> > hugetlbfs and want numa tuning to work with THP could just use tmpfs for
> > -mem-path.
> >
> >>
> >> I had hoped that we would get proper userspace interfaces for describing
> >> memory groups but that appears to have stalled out.
> >
> > I would love to have it. But while we don't have it, sharing the
> > tmpfs/hugetlbfs backing files seem to work just fine as a mechanism to
> > let other tools manipulate guest memory policy. We just need to let
> > external tools know where the backing files are.
>
> Is this meant for numad?  Wouldn't you want numad to work without
> hugetlbfs?
>
> You have to preallocate pages to hugetlbfs.  It's very difficult to use
> in practice.

If you don't want hugetlbfs you could use tmpfs, and set the policy on
the tmpfs files.

What I am asking is: why do we need to ask the kernel folks for
interfaces to define and set policies on memory groups if we can (in
theory) do the exactly same using tmpfs and hugetlbfs files?

(But the fact that THP doesn't work with tmpfs is a real problem, as I
said above)

> >>
> >> Does anyone know if this is still on the table?
> >>
> >> If we can't get a proper kernel interface, then perhaps we need to add
> >> full libnuma support but that would really be unfortunate...
> >
> > Why isn't the "info mem-nodes" solution (I mean: not this version, but a
> > proper QMP version that exposes all the information we need) an
> > option?
>
> We're exposing internal QEMU information (the HVA -> GPA mapping) as an
> external stable interface.

I never wanted to expose the HVA -> GPA mapping. What I want to expose
is:

 * The virtual-NUMA-node -> GPA-range mapping
 * The GPA -> mem-path file/offset mapping

(Alternatively, a simple virtual-NUMA-node -> mem-path file/offset
mapping would be enough, too)

We could even replace "mem-path file/offset mapping" with "memory
groups", if the kernel already had interfaces to deal with memory
groups.

--
Eduardo

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-05  3:58 ` [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes Wanlong Gao
  2013-06-05 12:39   ` Eric Blake
  2013-06-05 12:57   ` Anthony Liguori
@ 2013-06-05 13:46   ` Eduardo Habkost
  2013-06-11  7:22     ` Wanlong Gao
  2 siblings, 1 reply; 19+ messages in thread
From: Eduardo Habkost @ 2013-06-05 13:46 UTC (permalink / raw)
To: Wanlong Gao; +Cc: pbonzini, aliguori, qemu-devel, andre.przywara

On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
> Add monitor command mem-nodes to show the huge mapped
> memory nodes locations.
>

This is for machine consumption, so we need a QMP command.

> (qemu) info mem-nodes
> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1

Are node0/node1 _host_ nodes?

How do I know what's the _guest_ address/node corresponding to each
file/range above?

What I am really looking for is:

 * The correspondence between guest (virtual) NUMA nodes and guest
   physical address ranges (it could be provided by the QMP version of
   "info numa")
 * The correspondence between guest physical address ranges and ranges
   inside the mapped files (so external tools could set the policy on
   those files instead of requiring QEMU to set it directly)

I understand that your use case may require additional information and
additional interfaces. But if we provide the information above we will
allow external components set the policy on the hugetlbfs files before
we add new interfaces required for your use case.

Also, what about making it conditional to OSes where we really know
"/proc/<pid>/fd/<fd>" is available?

>
> Refer to the proposal of Eduardo and Daniel.
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476
>
> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>

[quoted patch trimmed]

--
Eduardo

^ permalink raw reply [flat|nested] 19+ messages in thread
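[Editor's sketch] For illustration, a management tool consuming the example output quoted above might parse it into structured records like this. The line format is taken directly from the commit message; as Eduardo points out in the thread, a real consumer should get this data via a QMP command rather than scraping HMP output:

```python
import re

# Sketch: parse "info mem-nodes" lines of the form
#   /proc/<pid>/fd/<fd>: <start>-<end>: node<N>
# into dicts. Format taken from the example in the commit message.

LINE_RE = re.compile(
    r"^(?P<path>/proc/\d+/fd/\d+):\s*"
    r"(?P<start>[0-9a-fA-F]+)-(?P<end>[0-9a-fA-F]+):\s*"
    r"node(?P<node>\d+)$"
)

def parse_mem_nodes(text):
    """Return a list of {path, start, end, node} dicts, skipping
    any lines that do not match the expected format."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        entries.append({
            "path": m.group("path"),
            "start": int(m.group("start"), 16),
            "end": int(m.group("end"), 16),
            "node": int(m.group("node")),
        })
    return entries

sample = """\
/proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
/proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
"""
entries = parse_mem_nodes(sample)
```

Note that, per the discussion above, the node numbers here are host nodes and the addresses are QEMU virtual addresses, so a tool would still need extra information (file offsets or guest physical addresses) to act on them.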
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-05 13:46 ` Eduardo Habkost
@ 2013-06-11  7:22   ` Wanlong Gao
  2013-06-11 13:40     ` Eduardo Habkost
  0 siblings, 1 reply; 19+ messages in thread
From: Wanlong Gao @ 2013-06-11  7:22 UTC (permalink / raw)
To: Eduardo Habkost
Cc: andre.przywara, aliguori, qemu-devel, pbonzini, Wanlong Gao

On 06/05/2013 09:46 PM, Eduardo Habkost wrote:
> On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
>> Add monitor command mem-nodes to show the huge mapped
>> memory nodes locations.
>>
>
> This is for machine consumption, so we need a QMP command.
>
>> (qemu) info mem-nodes
>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
>
> Are node0/node1 _host_ nodes?
>
> How do I know what's the _guest_ address/node corresponding to each
> file/range above?
>
> What I am really looking for is:
>
> * The correspondence between guest (virtual) NUMA nodes and guest
>   physical address ranges (it could be provided by the QMP version of
>   "info numa")

AFAIK, the guest NUMA nodes and guest physical address ranges are set
by seabios, we can't get this information from QEMU, and I think this
information is useless for pinning memory range to host.

> * The correspondence between guest physical address ranges and ranges
>   inside the mapped files (so external tools could set the policy on
>   those files instead of requiring QEMU to set it directly)
>
> I understand that your use case may require additional information and
> additional interfaces. But if we provide the information above we will
> allow external components set the policy on the hugetlbfs files before
> we add new interfaces required for your use case.

But the file backed memory is not good for the host which has many
virtual machines, in this situation, we can't handle anon THP yet.

And as I mentioned, the cross numa node access performance regression
is caused by pci-passthrough, it's a very long time bug, we should
back port the host memory pinning patch to old QEMU to resolve this
performance problem, too.

Thanks,
Wanlong Gao

>
> Also, what about making it conditional to OSes where we really know
> "/proc/<pid>/fd/<fd>" is available?
>
>
>>
>> Refer to the proposal of Eduardo and Daniel.
>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476
>>
>> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>

[quoted patch trimmed]

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-11  7:22 ` Wanlong Gao
@ 2013-06-11 13:40   ` Eduardo Habkost
  2013-06-13  1:40     ` Wanlong Gao
  0 siblings, 1 reply; 19+ messages in thread
From: Eduardo Habkost @ 2013-06-11 13:40 UTC (permalink / raw)
To: Wanlong Gao; +Cc: pbonzini, aliguori, qemu-devel, andre.przywara

On Tue, Jun 11, 2013 at 03:22:13PM +0800, Wanlong Gao wrote:
> On 06/05/2013 09:46 PM, Eduardo Habkost wrote:
> > On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
> >> Add monitor command mem-nodes to show the huge mapped
> >> memory nodes locations.
> >>
> >
> > This is for machine consumption, so we need a QMP command.
> >
> >> (qemu) info mem-nodes
> >> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
> >> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
> >> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
> >> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
> >
> > Are node0/node1 _host_ nodes?
> >
> > How do I know what's the _guest_ address/node corresponding to each
> > file/range above?
> >
> > What I am really looking for is:
> >
> > * The correspondence between guest (virtual) NUMA nodes and guest
> >   physical address ranges (it could be provided by the QMP version of
> >   "info numa")
>
> AFAIK, the guest NUMA nodes and guest physical address ranges are set
> by seabios, we can't get this information from QEMU,

QEMU _has_ to know about it, otherwise we would never be able to know
which virtual addresses inside the QEMU process (or offsets inside the
backing files) belong to which virtual NUMA node.

(After all, the NUMA wiring is a hardware feature, not something that
the BIOS can decide)

> and I think this
> information is useless for pinning memory range to host.

Well, we have to somehow identify each region of guest memory when
deciding how to pin it. How would you identify it without using guest
physical addresses? Guest physical addresses are more meaningful than
the QEMU virtual addresses your patch exposes (that are meaningless
outside QEMU).

> > * The correspondence between guest physical address ranges and ranges
> >   inside the mapped files (so external tools could set the policy on
> >   those files instead of requiring QEMU to set it directly)
> >
> > I understand that your use case may require additional information and
> > additional interfaces. But if we provide the information above we will
> > allow external components set the policy on the hugetlbfs files before
> > we add new interfaces required for your use case.
>
> But the file backed memory is not good for the host which has many
> virtual machines, in this situation, we can't handle anon THP yet.

I don't understand what you mean, here. What prevents someone from using
file-backed memory with multiple virtual machines?

>
> And as I mentioned, the cross numa node access performance regression
> is caused by pci-passthrough, it's a very long time bug, we should
> back port the host memory pinning patch to old QEMU to resolve this performance
> problem, too.

If it's a regression, what's the last version of QEMU where the bug
wasn't present?

>
> Thanks,
> Wanlong Gao
>
> >
> > Also, what about making it conditional to OSes where we really know
> > "/proc/<pid>/fd/<fd>" is available?
> >
> >
> >>
> >> Refer to the proposal of Eduardo and Daniel.
> >> http://article.gmane.org/gmane.comp.emulators.kvm.devel/93476
> >>
> >> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>

[quoted patch trimmed]

--
Eduardo

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
  2013-06-11 13:40 ` Eduardo Habkost
@ 2013-06-13  1:40   ` Wanlong Gao
  2013-06-13 12:50     ` Eduardo Habkost
  0 siblings, 1 reply; 19+ messages in thread
From: Wanlong Gao @ 2013-06-13  1:40 UTC (permalink / raw)
To: Eduardo Habkost
Cc: andre.przywara, aliguori, qemu-devel, pbonzini, Wanlong Gao

On 06/11/2013 09:40 PM, Eduardo Habkost wrote:
> On Tue, Jun 11, 2013 at 03:22:13PM +0800, Wanlong Gao wrote:
>> On 06/05/2013 09:46 PM, Eduardo Habkost wrote:
>>> On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
>>>> Add monitor command mem-nodes to show the huge mapped
>>>> memory nodes locations.
>>>>
>>>
>>> This is for machine consumption, so we need a QMP command.
>>>
>>>> (qemu) info mem-nodes
>>>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
>>>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
>>>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
>>>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
>>>
>>> Are node0/node1 _host_ nodes?
>>>
>>> How do I know what's the _guest_ address/node corresponding to each
>>> file/range above?
>>>
>>> What I am really looking for is:
>>>
>>> * The correspondence between guest (virtual) NUMA nodes and guest
>>>   physical address ranges (it could be provided by the QMP version of
>>>   "info numa")
>>
>> AFAIK, the guest NUMA nodes and guest physical address ranges are set
>> by seabios, we can't get this information from QEMU,
>
> QEMU _has_ to know about it, otherwise we would never be able to know
> which virtual addresses inside the QEMU process (or offsets inside the
> backing files) belong to which virtual NUMA node.

Nope, if I'm right, actually it's linear except that there are holes in
the physical address spaces. So we can know which node the guest virtual
address is included just by each numa node size. It's enough for us if we
can provide a QMP interface from QEMU to let external tools like libvirt
set the host memory binding polices according to the QMP interface, and
we can also provide the QEMU command line option to be able to set host
bindings through command line options before we start QEMU process.

>
> (After all, the NUMA wiring is a hardware feature, not something that
> the BIOS can decide)

But this is ACPI table which wrote by seabios now. AFAIK, there is no
unified idea about moving this part to QEMU with the QEMU interfaces
for seabios removed or just stay it there.

>
>>  and I think this
>> information is useless for pinning memory range to host.
>
> Well, we have to somehow identify each region of guest memory when
> deciding how to pin it. How would you identify it without using guest
> physical addresses? Guest physical addresses are more meaningful than
> the QEMU virtual addresses your patch exposes (that are meaningless
> outside QEMU).

As I mentioned above, we can know this just by the guest node memory size,
and can set the host bindings by treating this sizes as offsets.
And I think we only need to set the host memory binding polices to each
guest numa nodes. It's unnecessary to set polices to each region as you
said.

>
>>> * The correspondence between guest physical address ranges and ranges
>>>   inside the mapped files (so external tools could set the policy on
>>>   those files instead of requiring QEMU to set it directly)
>>>
>>> I understand that your use case may require additional information and
>>> additional interfaces. But if we provide the information above we will
>>> allow external components set the policy on the hugetlbfs files before
>>> we add new interfaces required for your use case.
>>
>> But the file backed memory is not good for the host which has many
>> virtual machines, in this situation, we can't handle anon THP yet.
>
> I don't understand what you mean, here. What prevents someone from using
> file-backed memory with multiple virtual machines?

While if we use hugetlbfs backed memory, we should know how many virtual
machines, how much memory each vm will use, then reserve these pages for
them. And even should reserve more pages for external tools(numactl) to
set memory polices. Even the memory reservation also has it's own memory
policies. It's very hard to control it to what we want to set.

>
>>
>> And as I mentioned, the cross numa node access performance regression
>> is caused by pci-passthrough, it's a very long time bug, we should
>> back port the host memory pinning patch to old QEMU to resolve this performance
>> problem, too.
>
> If it's a regression, what's the last version of QEMU where the bug
> wasn't present?
>

As QEMU doesn't support host memory binding, I think this was present
since we support guest NUMA, and the pci-passthrough made it even worse.

Thanks,
Wanlong Gao

^ permalink raw reply [flat|nested] 19+ messages in thread
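[Editor's sketch] The "sizes as offsets" scheme Wanlong describes above (deriving each guest node's slice of the backing file purely from the per-node sizes, assuming a linear layout) can be sketched as follows. The node sizes are invented for illustration, and as Eduardo notes the real guest-physical layout contains holes, which this linear sketch deliberately ignores:

```python
# Sketch of the "treat node sizes as offsets" idea: assuming guest RAM is
# laid out linearly in the backing file, each guest NUMA node's byte range
# follows directly from the per-node sizes. Real layouts have holes in the
# guest physical address space, which is one argument in the thread for an
# explicit GPA-based interface instead.

def node_offsets(node_sizes):
    """Map guest node index -> (offset, length) under a linear layout."""
    ranges = {}
    offset = 0
    for node, size in enumerate(node_sizes):
        ranges[node] = (offset, size)
        offset += size
    return ranges

GiB = 1 << 30
# Hypothetical guest with two 1G NUMA nodes:
layout = node_offsets([1 * GiB, 1 * GiB])
```

Under this assumption, `layout[1]` gives the offset and length an external tool would pass to something like `numactl --offset/--length` for guest node 1; the limitation is exactly the one Eduardo raises, namely that any hole in the layout silently breaks the computed offsets.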
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes 2013-06-13 1:40 ` Wanlong Gao @ 2013-06-13 12:50 ` Eduardo Habkost 2013-06-13 22:32 ` Paolo Bonzini 0 siblings, 1 reply; 19+ messages in thread From: Eduardo Habkost @ 2013-06-13 12:50 UTC (permalink / raw) To: Wanlong Gao; +Cc: pbonzini, aliguori, qemu-devel, andre.przywara On Thu, Jun 13, 2013 at 09:40:14AM +0800, Wanlong Gao wrote: > On 06/11/2013 09:40 PM, Eduardo Habkost wrote: > > On Tue, Jun 11, 2013 at 03:22:13PM +0800, Wanlong Gao wrote: > >> On 06/05/2013 09:46 PM, Eduardo Habkost wrote: > >>> On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote: > >>>> Add monitor command mem-nodes to show the huge mapped > >>>> memory nodes locations. > >>>> > >>> > >>> This is for machine consumption, so we need a QMP command. > >>> > >>>> (qemu) info mem-nodes > >>>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0 > >>>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1 > >>>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0 > >>>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1 > >>> > >>> Are node0/node1 _host_ nodes? > >>> > >>> How do I know what's the _guest_ address/node corresponding to each > >>> file/range above? > >>> > >>> What I am really looking for is: > >>> > >>> * The correspondence between guest (virtual) NUMA nodes and guest > >>> physical address ranges (it could be provided by the QMP version of > >>> "info numa") > >> > >> AFAIK, the guest NUMA nodes and guest physical address ranges are set > >> by seabios, we can't get this information from QEMU, > > > > QEMU _has_ to know about it, otherwise we would never be able to know > > which virtual addresses inside the QEMU process (or offsets inside the > > backing files) belong to which virtual NUMA node. > > Nope, if I'm right, actually it's linear except that there are holes in > the physical address spaces. 
So we can know which node the guest virtual > address is included just by each numa node size. You are just describing a way to accomplish the item I asked about above: finding out the correspondence between guest physical addresses and NUMA nodes. :) (But I would prefer to have something more explicit in the QMP interface instead of something implicit that assumes a predefined binding) > It's enough for us if we > can provide a QMP interface from QEMU to let external tools like libvirt > set the host memory binding polices according to the QMP interface, and > we can also provide the QEMU command line option to be able to set host > bindings through command line options before we start QEMU process. And how would you identify memory regions through this memory binding QMP interface, if not by guest physical addresses? > > > > > (After all, the NUMA wiring is a hardware feature, not something that > > the BIOS can decide) > > But this is ACPI table which wrote by seabios now. AFAIK, there is no > unified idea about moving this part to QEMU with the QEMU interfaces > for seabios removed or just stay it there. It doesn't matter who writes the ACPI table. QEMU must always know on which virtual NUMA node each memory region is located. > > > >> and I think this > >> information is useless for pinning memory range to host. > > > > Well, we have to somehow identify each region of guest memory when > > deciding how to pin it. How would you identify it without using guest > > physical addresses? Guest physical addresses are more meaningful than > > the QEMU virtual addresses your patch exposes (that are meaningless > > outside QEMU). > > As I mentioned above, we can know this just by the guest node memory size, > and can set the host bindings by treating this sizes as offsets. > And I think we only need to set the host memory binding polices to each > guest numa nodes. It's unnecessary to set polices to each region as you > said. 
I believe an interface based on guest physical memory addresses is more flexible (and even simpler!) than one that only allows binding of whole virtual NUMA nodes. (And I still don't understand why you are exposing QEMU virtual memory addresses in the new command, if they are useless). > > > > > >>> * The correspondence between guest physical address ranges and ranges > >>> inside the mapped files (so external tools could set the policy on > >>> those files instead of requiring QEMU to set it directly) > >>> > >>> I understand that your use case may require additional information and > >>> additional interfaces. But if we provide the information above we will > >>> allow external components set the policy on the hugetlbfs files before > >>> we add new interfaces required for your use case. > >> > >> But the file backed memory is not good for the host which has many > >> virtual machines, in this situation, we can't handle anon THP yet. > > > > I don't understand what you mean, here. What prevents someone from using > > file-backed memory with multiple virtual machines? > > While if we use hugetlbfs backed memory, we should know how many virtual machines, > how much memory each vm will use, then reserve these pages for them. And even > should reserve more pages for external tools(numactl) to set memory polices. > Even the memory reservation also has it's own memory policies. It's very hard > to control it to what we want to set. Well, it's hard because we don't even have tools to help on that, yet. Anyway, I understand that you want to make it work with THP as well. But if THP works with tmpfs (does it?), people then could use exactly the same file-based mechanisms with tmpfs and keep THP working. (Right now I am doing some experiments to understand how the system behaves when using numactl on hugetlbfs and tmpfs, before and after getting the files mapped). 
>>> And as I mentioned, the cross-NUMA-node access performance
>>> regression is caused by PCI passthrough; it's a very long-standing
>>> bug, and we should back-port the host memory pinning patch to old
>>> QEMU to resolve this performance problem, too.
>>
>> If it's a regression, what's the last version of QEMU where the bug
>> wasn't present?
>
> As QEMU doesn't support host memory binding, I think this has been
> present since we started supporting guest NUMA, and PCI passthrough
> made it even worse.

If the problem was always present, it is not a regression, is it?

--
Eduardo
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes

From: Paolo Bonzini @ 2013-06-13 22:32 UTC (permalink / raw)
To: Eduardo Habkost
Cc: andre.przywara, aliguori, qemu-devel, Wanlong Gao

On 13/06/2013 08:50, Eduardo Habkost wrote:
> I believe an interface based on guest physical memory addresses is more
> flexible (and even simpler!) than one that only allows binding of whole
> virtual NUMA nodes.

And "-numa node" is already one; what about just adding "mem-path=/foo"
or "host_node=NN" suboptions? Then "-mem-path /foo" would be a shortcut
for "-numa node,mem-path=/foo".

I even had patches to convert -numa to QemuOpts; I can dig them out if
you're interested.

Paolo
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes

From: Anthony Liguori @ 2013-06-14 1:05 UTC (permalink / raw)
To: Paolo Bonzini, Eduardo Habkost
Cc: andre.przywara, qemu-devel, Wanlong Gao

Paolo Bonzini <pbonzini@redhat.com> writes:
> On 13/06/2013 08:50, Eduardo Habkost wrote:
>> I believe an interface based on guest physical memory addresses is more
>> flexible (and even simpler!) than one that only allows binding of whole
>> virtual NUMA nodes.
>
> And "-numa node" is already one; what about just adding "mem-path=/foo"
> or "host_node=NN" suboptions? Then "-mem-path /foo" would be a shortcut
> for "-numa node,mem-path=/foo".
>
> I even had patches to convert -numa to QemuOpts; I can dig them out if
> you're interested.

Ack.  This is a very reasonable thing to add.

Regards,

Anthony Liguori
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes

From: Wanlong Gao @ 2013-06-14 1:16 UTC (permalink / raw)
To: Anthony Liguori
Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Wanlong Gao, andre.przywara

On 06/14/2013 09:05 AM, Anthony Liguori wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
>> And "-numa node" is already one; what about just adding "mem-path=/foo"
>> or "host_node=NN" suboptions? Then "-mem-path /foo" would be a shortcut
>> for "-numa node,mem-path=/foo".
>>
>> I even had patches to convert -numa to QemuOpts; I can dig them out if
>> you're interested.
>
> Ack.  This is a very reasonable thing to add.

How about making "-numa node,membind=0"-style options, and also
providing a QMP interface "numa_set guest_node_id mempolicy"? So that
we can set the memory policy not only for file-backed memory but also
for anonymously mapped guest NUMA nodes. This is the full NUMA support
in QEMU you mentioned. I'm making the patches now.

Thanks,
Wanlong Gao
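A QMP command along the lines Wanlong proposes might look like the following; the command name and argument names are purely hypothetical, since no such command existed at this point in the thread:

```json
{ "execute": "numa_set",
  "arguments": { "guest-node": 0,
                 "policy": "membind",
                 "host-nodes": [0] } }
```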
* Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes

From: Paolo Bonzini @ 2013-06-15 17:23 UTC (permalink / raw)
To: gaowanlong
Cc: andre.przywara, Anthony Liguori, Eduardo Habkost, qemu-devel

On 13/06/2013 21:16, Wanlong Gao wrote:
> How about making "-numa node,membind=0"-style options, and also
> providing a QMP interface "numa_set guest_node_id mempolicy"? So that
> we can set the memory policy not only for file-backed memory but also
> for anonymously mapped guest NUMA nodes. This is the full NUMA support
> in QEMU you mentioned. I'm making the patches now.

Yup, that's exactly what I called "host_node". membind also makes
sense, but make it mem-bind or mem-host-node for consistency with
mem-path.

Paolo
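Spelled out as a command line, the suboptions discussed in this subthread would look roughly as follows. The syntax is hypothetical (it is what the thread is proposing, not an option set that existed at the time):

```shell
# One backing file and one host node per guest NUMA node (proposed syntax)
qemu-system-x86_64 \
    -m 4096 \
    -numa node,nodeid=0,mem=2048,mem-path=/mnt/huge/node0,mem-host-node=0 \
    -numa node,nodeid=1,mem=2048,mem-path=/mnt/huge/node1,mem-host-node=1

# with "-mem-path /mnt/huge" remaining a shortcut for
# "-numa node,mem-path=/mnt/huge" on a single-node guest
```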
* Re: [Qemu-devel] [PATCH 1/2] Add Linux libnuma detection

From: Andreas Färber @ 2013-06-05 10:02 UTC (permalink / raw)
To: Wanlong Gao
Cc: andre.przywara, aliguori, pbonzini, qemu-devel, ehabkost

On 05.06.2013 05:58, Wanlong Gao wrote:
> Add detection of libnuma (mostly contained in the numactl package)
> to the configure script. Currently this is Linux only, but can be
> extended later should the need for other interfaces come up.
> Can be enabled or disabled on the command line; the default is to use
> it if available.
>
> Original-signed-off-by: Andre Przywara <andre.przywara@amd.com>

This should be just Signed-off-by; the difference between From and Sob
already indicates that you changed his code.

> Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
> ---
>  configure | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/configure b/configure
> index 1654413..d897f98 100755
> --- a/configure
> +++ b/configure
> @@ -242,6 +242,7 @@ gtk=""
>  gtkabi="2.0"
>  tpm="no"
>  libssh2=""
> +numa="yes"

This should be "" by default: it is not needed by linux-user, and when
--enable-numa is used we should error out if it's not there.

>
>  # parse CC options first
>  for opt do
> @@ -937,6 +938,10 @@ for opt do
>    ;;
>    --enable-libssh2) libssh2="yes"
>    ;;
> +  --disable-numa) numa="no"
> +  ;;
> +  --enable-numa) numa="yes"
> +  ;;
>    *) echo "ERROR: unknown option $opt"; show_help="yes"
>    ;;
>    esac
> @@ -1151,6 +1156,8 @@ echo "  --gcov=GCOV              use specified gcov [$gcov_tool]"
>  echo "  --enable-tpm             enable TPM support"
>  echo "  --disable-libssh2        disable ssh block device support"
>  echo "  --enable-libssh2         enable ssh block device support"
> +echo "  --disable-numa           disable host Linux NUMA support"
> +echo "  --enable-numa            enable host Linux NUMA support"
>  echo ""
>  echo "NOTE: The object files are built at the place where configure is launched"
>  exit 1
> @@ -2382,6 +2389,27 @@ EOF
>  fi
>
>  ##########################################
> +# libnuma probe
> +
> +if test "$numa" = "yes" ; then

This then needs to check for != "no", so that the probe still runs with
the default empty value.

> +  numa=no
> +  cat > $TMPC << EOF
> +#include <numa.h>
> +int main(void) { return numa_available(); }
> +EOF
> +
> +  if compile_prog "" "-lnuma" ; then
> +    numa=yes
> +    libs_softmmu="-lnuma $libs_softmmu"
> +  else
> +    if test "$numa" = "yes" ; then
> +      feature_not_found "linux NUMA (install numactl?)"
> +    fi
> +    numa=no
> +  fi
> +fi
> +
> +##########################################
>  # linux-aio probe
>
>  if test "$linux_aio" != "no" ; then
> @@ -3550,6 +3578,7 @@ echo "TPM support       $tpm"
>  echo "libssh2 support   $libssh2"
>  echo "TPM passthrough   $tpm_passthrough"
>  echo "QOM debugging     $qom_cast_debug"
> +echo "NUMA host support $numa"
>
>  if test "$sdl_too_old" = "yes"; then
>    echo "-> Your SDL version is too old - please upgrade to have SDL support"
> @@ -3584,6 +3613,9 @@ echo "extra_cflags=$EXTRA_CFLAGS" >> $config_host_mak
>  echo "extra_ldflags=$EXTRA_LDFLAGS" >> $config_host_mak
>  echo "qemu_localedir=$qemu_localedir" >> $config_host_mak
>  echo "libs_softmmu=$libs_softmmu" >> $config_host_mak
> +if test "$numa" = "yes"; then
> +  echo "CONFIG_NUMA=y" >> $config_host_mak
> +fi
>
>  echo "ARCH=$ARCH" >> $config_host_mak

Regards,
Andreas

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
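A sketch of the probe with Andreas's suggestions applied: auto-detect by default (numa=""), and a hard error only when --enable-numa was given explicitly. This illustrates the review comments rather than the patch that was eventually merged; configure's compile_prog helper is stood in for by a direct compiler call so the snippet runs on its own. Note that the original patch sets numa=no before the check in the else branch, so its feature_not_found call could never fire; remembering the requested value first avoids that.

```shell
# numa="" means auto-detect; --enable-numa / --disable-numa override it
numa=""
for opt in "$@"; do
    case "$opt" in
        --enable-numa)  numa="yes" ;;
        --disable-numa) numa="no" ;;
    esac
done

if test "$numa" != "no"; then
    want="$numa"              # remember whether it was explicitly requested
    tmpd=$(mktemp -d)
    cat > "$tmpd/probe.c" <<EOF
#include <numa.h>
int main(void) { return numa_available(); }
EOF
    if cc "$tmpd/probe.c" -lnuma -o "$tmpd/probe" 2>/dev/null; then
        numa=yes
    else
        # only a hard failure if the user asked for NUMA explicitly
        if test "$want" = "yes"; then
            echo "ERROR: NUMA requested but libnuma not found (install numactl?)" >&2
            rm -rf "$tmpd"
            exit 1
        fi
        numa=no
    fi
    rm -rf "$tmpd"
fi
echo "NUMA host support $numa"
```

On a host without libnuma this prints "NUMA host support no" and succeeds, while "--enable-numa" on the same host fails loudly, which is the behavior the review asks for.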