From: Dipankar Sarma <dipankar@in.ibm.com>
To: Alexander Graf <agraf@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	bharata@linux.vnet.ibm.com, kvm list <kvm@vger.kernel.org>,
	qemu-devel Developers <qemu-devel@nongnu.org>,
	Chris Wright <chrisw@sous-sol.org>,
	Vaidyanathan S <svaidy@in.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 21:52:37 +0530
Message-ID: <20111130162237.GC27308@in.ibm.com>
In-Reply-To: <4ECD3CBD.7010902@suse.de>

On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> >Hi!
> >
> >
> >In my view the trouble of the numa hard bindings is not the fact
> >they're hard and qemu has to also decide the location (in fact it
> >doesn't need to decide the location if you use cpusets and relative
> >mbinds). The bigger problem is the fact either the admin or the app
> >developer has to explicitly scan the numa physical topology (both cpus
> >and memory) and tell the kernel how much memory to bind to each
> >thread. ms_mbind/ms_tbind only partially solve that problem. They're
> >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> >don't need an admin or a cpuset-job-scheduler (or a perl script) to
> >redistribute the hardware resources.
> 
> Well yeah, of course the guest needs to see some topology. I don't
> see why we'd have to actually scan the host for this though. All we
> need to tell the kernel is "this memory region is close to that
> thread".
> 
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able
> to tell the kernel that this GB of RAM actually is close to that
> vCPU thread.
> 
> Of course the admin still needs to decide how to split up memory.
> That's the deal with emulating real hardware. You get the interfaces
> hardware gets :). However, if you follow a reasonable default
> strategy such as numa splitting your RAM into equal chunks between
> guest vCPUs you're probably close enough to optimal usage models. Or
> at least you could have a close enough approximation of how this
> mapping could work for the _guest_ regardless of the host and when
> you migrate it somewhere else it should also work reasonably well.
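
For reference, the hard binding Andrea mentions is what we get today
from mbind() with MPOL_F_RELATIVE_NODES. A rough, untested sketch
(link with -lnuma; MPOL_F_RELATIVE_NODES needs reasonably recent
headers):

    #include <stdio.h>
    #include <numaif.h>             /* mbind(); from libnuma */
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;     /* the 1G chunk from -numa node,mem=1G */
        void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED)
            return 1;

        /* "Relative node 0" = first node of the cpuset we run in, so
         * the cpuset can be moved later without qemu ever seeing
         * physical node numbers. */
        unsigned long nodemask = 1UL << 0;
        if (mbind(ram, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
                  &nodemask, sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");
        return 0;
    }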

Allowing specification of the NUMA nodes to qemu, letting qemu create
cpu+mem groupings (without binding), and leaving it to the kernel to
decide how to manage them seems like a reasonable incremental step
between no guest/host NUMA awareness and automatic NUMA configuration
in the host kernel. It would suffice for the current needs we see.
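
Per guest node, qemu would then only need something like the sketch
below. The ms_* signatures and the syscall numbers here are guesses
based on this thread's discussion; nothing like this exists in
mainline yet:

    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder syscall numbers -- illustration only. */
    #define __NR_ms_mbind   400
    #define __NR_ms_tbind   401

    static int ms_mbind(void *addr, unsigned long len, int group)
    {
        return syscall(__NR_ms_mbind, addr, len, group);
    }

    static int ms_tbind(pid_t tid, int group)
    {
        return syscall(__NR_ms_tbind, tid, group);
    }

    /* For "-numa node,mem=1G,cpus=0": put vCPU 0 and its 1G chunk in
     * the same group; the host kernel is then free to pick (and later
     * re-pick) the physical node for the group as a whole. */
    static void group_vcpu(pid_t vcpu_tid, void *chunk,
                           unsigned long chunk_len, int group)
    {
        ms_tbind(vcpu_tid, group);
        ms_mbind(chunk, chunk_len, group);
    }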

Besides migration, we also have use cases where we want large
multi-node VMs that are static (like LPARs); having the guest aware
of the topology is helpful there.

Also, if the topology does change due to migration or host kernel
decisions, we can use something like the VPHN (virtual processor
home node) capability on Power systems to have the guest kernel
update its topology knowledge; see arch/powerpc/mm/numa.c for the
details. Otherwise, as long as the host kernel maintains the
mappings requested by ms_tbind()/ms_mbind(), we can create the
guest topology correctly and optimize for NUMA. This would work
for us.
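
To make the default concrete: the equal-chunk split Alexander
suggests above would reduce to a loop like this in qemu (again only
a sketch, reusing the hypothetical group_vcpu() helper from earlier):

    /* One group per vCPU; guest RAM split into equal chunks. */
    static void default_numa_split(pid_t *vcpu_tids, int nr_vcpus,
                                   void *guest_ram, unsigned long ram_len)
    {
        unsigned long chunk = ram_len / nr_vcpus;
        int i;

        for (i = 0; i < nr_vcpus; i++)
            group_vcpu(vcpu_tids[i], (char *)guest_ram + i * chunk,
                       chunk, /* group id */ i);
    }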

Thanks
Dipankar

