From: Eduardo Habkost <ehabkost@redhat.com>
To: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: pbonzini@redhat.com, aliguori@us.ibm.com, qemu-devel@nongnu.org,
	andre.przywara@amd.com
Subject: Re: [Qemu-devel] [PATCH 2/2] Add monitor command mem-nodes
Date: Thu, 13 Jun 2013 09:50:19 -0300	[thread overview]
Message-ID: <20130613125019.GI2895@otherpad.lan.raisama.net> (raw)
In-Reply-To: <51B922FE.8090109@cn.fujitsu.com>

On Thu, Jun 13, 2013 at 09:40:14AM +0800, Wanlong Gao wrote:
> On 06/11/2013 09:40 PM, Eduardo Habkost wrote:
> > On Tue, Jun 11, 2013 at 03:22:13PM +0800, Wanlong Gao wrote:
> >> On 06/05/2013 09:46 PM, Eduardo Habkost wrote:
> >>> On Wed, Jun 05, 2013 at 11:58:25AM +0800, Wanlong Gao wrote:
> >>>> Add a monitor command mem-nodes to show the locations of the
> >>>> mapped hugepage memory nodes.
> >>>>
> >>>
> >>> This is for machine consumption, so we need a QMP command.
> >>>
> >>>> (qemu) info mem-nodes
> >>>> /proc/14132/fd/13: 00002aaaaac00000-00002aaaeac00000: node0
> >>>> /proc/14132/fd/13: 00002aaaeac00000-00002aab2ac00000: node1
> >>>> /proc/14132/fd/14: 00002aab2ac00000-00002aab2b000000: node0
> >>>> /proc/14132/fd/14: 00002aab2b000000-00002aab2b400000: node1
> >>>
> >>> Are node0/node1 _host_ nodes?
> >>>
> >>> How do I know which _guest_ address/node corresponds to each
> >>> file/range above?
> >>>
> >>> What I am really looking for is:
> >>>
> >>>  * The correspondence between guest (virtual) NUMA nodes and guest
> >>>    physical address ranges (it could be provided by the QMP version of
> >>>    "info numa")
> >>
> >> AFAIK, the guest NUMA nodes and guest physical address ranges are set
> >> by seabios, so we can't get this information from QEMU,
> > 
> > QEMU _has_ to know about it, otherwise we would never be able to know
> > which virtual addresses inside the QEMU process (or offsets inside the
> > backing files) belong to which virtual NUMA node.
> 
> Nope, if I'm right, it's actually linear except that there are holes in
> the physical address space. So we can tell which node a guest virtual
> address falls in just from the sizes of the NUMA nodes.

You are just describing a way to accomplish the item I asked about
above: finding out the correspondence between guest physical addresses
and NUMA nodes.  :)

(But I would prefer to have something more explicit in the QMP interface
instead of something implicit that assumes a predefined binding)
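
For illustration, the implicit mapping you describe boils down to
something like this (a minimal sketch; the node sizes are made up, and
real code would also have to skip the holes in the guest physical
address space):

    /* Minimal sketch of the implicit address->node mapping described
     * above.  Node sizes are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    static const uint64_t node_size[] = {
        1ULL << 30,             /* node 0: 1 GB (made up) */
        1ULL << 30,             /* node 1: 1 GB (made up) */
    };

    static int guest_addr_to_node(uint64_t addr)
    {
        uint64_t base = 0;
        unsigned i;

        for (i = 0; i < sizeof(node_size) / sizeof(node_size[0]); i++) {
            if (addr < base + node_size[i]) {
                return i;
            }
            base += node_size[i];
        }
        return -1;              /* beyond the last node */
    }

    int main(void)
    {
        /* 0x20000000 (512MB) is in node 0, 0x50000000 (1.25GB) in node 1 */
        printf("node %d\n", guest_addr_to_node(0x20000000ULL));
        printf("node %d\n", guest_addr_to_node(0x50000000ULL));
        return 0;
    }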

> It's enough for us if we
> can provide a QMP interface that lets external tools like libvirt
> set the host memory binding policies, and we can also provide a QEMU
> command line option to set host bindings before the QEMU process
> starts.

And how would you identify memory regions through this memory binding
QMP interface, if not by guest physical addresses?


> 
> > 
> > (After all, the NUMA wiring is a hardware feature, not something that
> > the BIOS can decide)
> 
> But this is in the ACPI table, which is written by seabios now. AFAIK,
> there is no consensus yet on whether to move this part into QEMU (with
> the QEMU interfaces for seabios removed) or to just leave it there.

It doesn't matter who writes the ACPI table. QEMU must always know on
which virtual NUMA node each memory region is located.

> > 
> >> and I think this
> >> information is useless for pinning memory ranges on the host.
> > 
> > Well, we have to somehow identify each region of guest memory when
> > deciding how to pin it. How would you identify it without using guest
> > physical addresses? Guest physical addresses are more meaningful than
> > the QEMU virtual addresses your patch exposes (that are meaningless
> > outside QEMU).
> 
> As I mentioned above, we can work this out just from the guest node
> memory sizes, and can set the host bindings by treating those sizes as
> offsets. And I think we only need to set the host memory binding
> policies for each guest NUMA node. It's unnecessary to set policies for
> each region as you suggest.

I believe an interface based on guest physical memory addresses is more
flexible (and even simpler!) than one that only allows binding of whole
virtual NUMA nodes.

(And I still don't understand why you are exposing QEMU virtual memory
addresses in the new command, if they are useless).
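
To make that concrete, the kind of record I have in mind for each
region would be something like the following (just a sketch of a
possible interface, not an existing QMP command; all the names are
made up):

    /* Hypothetical QMP reply entry describing one guest memory region. */
    typedef struct GuestMemRegion {
        uint64_t guest_phys_start;  /* guest physical start address */
        uint64_t guest_phys_end;    /* guest physical end address */
        int      guest_node;        /* virtual NUMA node of the region */
        char    *backing_file;      /* hugetlbfs/tmpfs file, if any */
        uint64_t file_offset;       /* offset of the region in that file */
    } GuestMemRegion;

An external tool could then pick regions by guest physical address (or
by node, which is just the special case of binding all regions of that
node) and apply the host policy to the corresponding file ranges.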


> > 
> > 
> >>>  * The correspondence between guest physical address ranges and ranges
> >>>    inside the mapped files (so external tools could set the policy on
> >>>    those files instead of requiring QEMU to set it directly)
> >>>
> >>> I understand that your use case may require additional information and
> >>> additional interfaces. But if we provide the information above we will
> >>> allow external components set the policy on the hugetlbfs files before
> >>> we add new interfaces required for your use case.
> >>
> >> But file-backed memory is not good for a host that runs many virtual
> >> machines, and in that situation we can't make use of anonymous THP
> >> yet.
> > 
> > I don't understand what you mean, here. What prevents someone from using
> > file-backed memory with multiple virtual machines?
> 
> Well, if we use hugetlbfs-backed memory, we have to know in advance how
> many virtual machines there will be and how much memory each VM will
> use, and then reserve those pages for them. We would even have to
> reserve extra pages for external tools (numactl) to set memory policies
> with. And the reservation itself has its own memory policies; it's very
> hard to steer it to what we want to set.

Well, it's hard because we don't even have tools to help with that yet.

Anyway, I understand that you want to make it work with THP as well. But
if THP works with tmpfs (does it?), people could then use exactly the
same file-based mechanisms with tmpfs and keep THP working.

(Right now I am doing some experiments to understand how the system
behaves when using numactl on hugetlbfs and tmpfs, before and after
getting the files mapped).
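
For reference, what "setting the policy on those files" means at the
bottom is an mbind(2) call on a mapping of the file -- which is, as far
as I can tell, what numactl does under the hood. A minimal sketch (the
file path and node number are made up; build with -lnuma):

    /* Bind the pages of a file-backed region to one host NUMA node. */
    #include <fcntl.h>
    #include <numaif.h>         /* mbind(), MPOL_BIND */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;     /* one 2MB huge page */
        int fd = open("/hugetlbfs/guest-mem", O_RDWR); /* made-up path */
        if (fd < 0) { perror("open"); return 1; }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long nodemask = 1UL << 1;  /* host node 1 */
        /* Pages faulted in through this mapping come from node 1. */
        if (mbind(p, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_STRICT) < 0) {
            perror("mbind");
            return 1;
        }

        munmap(p, len);
        close(fd);
        return 0;
    }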


> > 
> >>
> >> And as I mentioned, the cross-NUMA-node access performance regression
> >> is caused by pci-passthrough; it's a long-standing bug, and we should
> >> backport the host memory pinning patch to older QEMU to resolve this
> >> performance problem, too.
> > 
> > If it's a regression, what's the last version of QEMU where the bug
> > wasn't present?
> > 
> 
> As QEMU doesn't support host memory binding, I think
> this has been present since guest NUMA support was added, and
> pci-passthrough made it even worse.

If the problem was always present, it is not a regression, is it?

-- 
Eduardo
