From: Andrea Arcangeli <aarcange@redhat.com>
To: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	kvm list <kvm@vger.kernel.org>,
	qemu-devel Developers <qemu-devel@nongnu.org>,
	Alexander Graf <agraf@suse.de>,
	Chris Wright <chrisw@sous-sol.org>,
	bharata@linux.vnet.ibm.com, Vaidyanathan S <svaidy@in.ibm.com>
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 18:41:13 +0100
Message-ID: <20111130174113.GM23466@redhat.com>
In-Reply-To: <20111130162237.GC27308@in.ibm.com>

On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even in the case of 1 guest that fits in one node, you're not going to
max out the full bandwidth of all memory channels with this.

All qemu can do with ms_mbind/ms_tbind is create a vtopology that
matches the hardware topology. It has these limits (a rough sketch of
how qemu would use the interface follows the list):

1) it requires all userland applications to be modified to scan either
   the physical topology (when run on the host) or the vtopology (when
   run in a guest) to get the full benefit.

2) it breaks across live migration if the host physical topology changes

3) a single small guest that fits in one NUMA node, on an otherwise
   idle NUMA system, doesn't give the host kernel enough information

4) if used outside of qemu, a thread that allocates more memory than
   fits in one node still won't give the host kernel enough information.
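
To make the point concrete, here's a rough sketch of how qemu might use
such an interface. ms_mbind/ms_tbind were only proposed in this thread
and the signatures below are my guess, not the actual proposal, so take
it purely as illustration: guest RAM ranges and vcpu threads get grouped
into "virtual nodes" and the kernel decides where each group lives.

#include <stddef.h>
#include <sys/types.h>

/* assumed signatures for the proposed (unmerged) syscalls */
int ms_mbind(void *addr, size_t len, int vnode);  /* RAM range -> virtual node */
int ms_tbind(pid_t tid, int vnode);               /* thread    -> virtual node */

static void bind_guest_vnode(void *node_ram, size_t node_ram_len,
                             pid_t *vcpu_tids, int nr_vcpus, int vnode)
{
        int i;

        /* this guest RAM range belongs to virtual node 'vnode' */
        ms_mbind(node_ram, node_ram_len, vnode);

        /* and these vcpu threads should run near that memory */
        for (i = 0; i < nr_vcpus; i++)
                ms_tbind(vcpu_tids[i], vnode);
}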

About 3): if you've got just one guest that fits in one node, the vcpus
should probably be spread across all the nodes and the guest RAM
interleaved across them (like MPOL_INTERLEAVE). Even if the guest CPU
scheduler migrates guest processes the "wrong" way, the global memory
bandwidth will still be fully used, even though the vcpus will be
accessing remote memory. I've just seen benchmarks where no pinning runs
more than _twice_ as fast as pinning, with just 1 guest and only 10 vcpu
threads, probably because of that.
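
Just to illustrate the behavior I mean (this isn't in the patch): with
what exists today, the closest userland approximation is to interleave
the guest RAM over all nodes with libnuma, roughly like this:

#include <numa.h>        /* link with -lnuma */

static void interleave_guest_ram(void *guest_ram, size_t ram_size)
{
        if (numa_available() < 0)
                return;  /* kernel without NUMA support */

        /*
         * Spread the pages of the guest RAM round-robin over all nodes,
         * so even a single small guest keeps every memory channel busy.
         */
        numa_interleave_memory(guest_ram, ram_size, numa_all_nodes_ptr);
}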

About 4): even if the thread scans the NUMA topology, it won't be able
to give the kernel enough information to know which parts of its memory
will be used more or less heavily (it may be possible to call mbind and
vary the bindings at runtime, but that adds even more complexity left to
the programmer).
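
A hedged sketch of what such an application would have to keep doing by
hand (the function and the phase logic are made up for illustration, not
taken from any real program):

#include <numa.h>        /* link with -lnuma */

/*
 * Re-bind the currently hot region on every phase change of the
 * program; the kernel still only ever sees a static binding and the
 * burden of getting it right stays with the programmer.
 */
static void rebind_hot_region(void *hot, size_t len, int phase)
{
        int nr_nodes;

        if (numa_available() < 0)
                return;

        nr_nodes = numa_max_node() + 1;

        /* move the hot region to the node chosen for this phase */
        numa_tonode_memory(hot, len, phase % nr_nodes);
}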

If the vcpu is free to run on any node, and we have an automatic
vcpu<->memory affinity, then the memory will follow the vcpu. And the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Troubles 1/2/3/4 apply to the hard bindings as well, not just to
ms_mbind/ms_tbind.

In short it's an incremental step that moves some logic into the
kernel, but I don't see it solving all situations optimally, and it
shares a lot of the limits of the hard bindings.
