From: Andrea Arcangeli
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 23 Nov 2011 21:19:06 +0100
Message-ID: <20111123201906.GL8397@redhat.com>
References: <20111029184502.GH11038@in.ibm.com> <7816C401-9BE5-48A9-8BA9-4CDAD1B39FC8@suse.de>
 <20111108173304.GA14486@sequoia.sous-sol.org> <20111121150054.GA3602@in.ibm.com>
 <1321889126.28118.5.camel@twins> <20111121160001.GB3602@in.ibm.com>
 <1321894980.28118.16.camel@twins> <4ECB0019.7020800@codemonkey.ws>
 <20111123150300.GH8397@redhat.com> <4ECD3CBD.7010902@suse.de>
In-Reply-To: <4ECD3CBD.7010902@suse.de>
To: Alexander Graf
Cc: Peter Zijlstra, kvm list, dipankar@in.ibm.com, qemu-devel Developers, Chris Wright, bharata@linux.vnet.ibm.com, Vaidyanathan S

On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to
> tell the kernel that this GB of RAM actually is close to that vCPU thread.
> Of course the admin still needs to decide how to split up memory. That's
> the deal with emulating real hardware. You get the interfaces hardware
> gets :). However, if you follow a reasonable default strategy such as

The problem is how you decide the parameter "-numa node,mem=1G,cpus=0"
in the first place. The real hardware is only known when the VM starts,
but then the VM can be migrated. Or the VM may have to be split across
two nodes, regardless of "-numa node,mem=1G,cpus=0-1", to avoid
swapping: then there may be two 512M nodes with one cpu each instead of
one NUMA node with 1G and two cpus. Especially once the hard bindings
are relaxed in favour of ms_mbind/ms_tbind, the vtopology you create
won't match the real hardware, because you don't know which real
hardware you will actually get.

> numa splitting your RAM into equal chunks between guest vCPUs you're
> probably close enough to optimal usage models. Or at least you could
> have a close enough approximation of how this mapping could work for the
> _guest_ regardless of the host and when you migrate it somewhere else it
> should also work reasonably well.

If you enforce these assumptions, the admin still has to choose the
"-numa node,mem=1G" parameters after checking the physical NUMA
topology, to make sure the vtopology can match the real physical
topology and that the guest runs on "real hardware". That's not very
different from using hard bindings: hard bindings enforce the "real
hardware", so there's no way it can go wrong (a minimal sketch of what
such hard bindings amount to follows below). I mean, you still need
some NUMA topology knowledge outside of QEMU to be sure you get "real
hardware" out of the vtopology.

Ok, cpusets would restrict the availability of idle cpus, so there's a
slight improvement in maximizing idle CPU usage (it's better to run 50%
slower than not to run at all), but that could also be achieved by
relaxing the cpuset semantics (if that's not already available).

> So you want to basically dynamically create NUMA topologies from the
> runtime behavior of the guest? What if it changes over time?

Yes, though I wouldn't call it a NUMA topology: that sounds like a
vtopology, and a vtopology is something fixed at boot time, the sort of
thing created by a command line like "-numa node,mem=1G,cpus=0".
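To make the contrast concrete, this is roughly what hard-binding one
virtual node amounts to today with the existing mbind(2) and
sched_setaffinity(2) interfaces. It's only an illustrative sketch: the
1G size, host node 0 and cpu 0 are invented numbers, and a real setup
would repeat this for every vcpu thread and every guest memory region.

/*
 * Illustrative sketch only: hard-binding one virtual node with the
 * existing interfaces.  Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t guest_node_ram = 1UL << 30;	/* the virtual node's 1G of RAM */
	unsigned long host_node0 = 1UL << 0;	/* nodemask with only node 0 set */
	cpu_set_t cpus;
	void *ram;

	/* back the virtual node's RAM and hard-bind it to host node 0 */
	ram = mmap(NULL, guest_node_ram, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED ||
	    mbind(ram, guest_node_ram, MPOL_BIND,
		  &host_node0, sizeof(host_node0) * 8, 0)) {
		perror("mmap/mbind");
		return 1;
	}

	/* hard-bind the vcpu thread (here: this thread) to cpu 0 of node 0 */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ... run the vcpu: nothing can move anymore, for better or worse */
	return 0;
}

Everything here is nailed down before the guest ever runs, which is
exactly the kind of static setup I'd rather not have to depend on.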
I wouldn't try to give the guest any "memory" topology: the vcpus are
just magic threads that don't behave like normal threads in
memory-affinity terms, so they need a paravirtualization layer to be
dealt with. The fact that vcpu0 accessed N pages right now doesn't mean
there's a real affinity between vcpu0 and those N pages, if the guest
scheduler is free to migrate anything anywhere. The guest thread
running on vcpu0 may be migrated to vcpu7, which may belong to a
different physical node. So if we want to automatically detect
thread<->memory affinity between vcpus and guest memory, we also need
to group the guest threads on certain vcpus and prevent those cpu
migrations. The threads in the guest had better stick to vcpu0/1/2/3
(instead of migrating to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with
the same memory, which fits in one node. That can only be told
dynamically from KVM to the guest OS scheduler, as we may migrate
virtual machines or move the memory around.

Take the example of 3 VMs of 2.5G RAM each on an 8G system with 2 nodes
(4G per node). Suppose one of the two VMs that have all their 2.5G
allocated in a single node quits. Then the VM that was split across the
two nodes will be memory-migrated to fit in one node. So far so good,
but then KVM should tell the guest OS scheduler that it should stop
grouping vcpus, that all vcpus are equal again, and that all guest
threads can be migrated to any vcpu. I don't see a way to do those
things with a vtopology fixed at boot.

> I actually like the idea of just telling the kernel how close memory
> will be to a thread. Sure, you can handle this basically by shoving your
> scheduler into user space, but isn't managing processes what a kernel is
> supposed to do in the first place?

Assume you're not in virt and you just want to tell the kernel that
thread A uses memory range A and thread B uses memory range B. If
memory range A fits in one node you're ok. But if "memory A" now spans
two nodes (maybe to avoid swapping), you're still screwed, and you
won't have given the kernel enough information about the real runtime
affinity that "thread A" has with that memory. Now, if statistically
the accesses to "memory A" are all equal, it won't make a difference,
but if you end up using one half of "memory A" 99% of the time, it
won't work as well. This is especially a problem for KVM, because
statistically the accesses to the "memory A" given to vcpu0 won't be
equal: 50% of it may not be used at all, just holding pagecache or even
free memory, so if "memory A" has to be split across two nodes to avoid
swapping we can do better by detecting the vcpu<->memory affinity
dynamically.

> You can always argue for a microkernel, but having a scheduler in user
> space (perl script) and another one in the kernel doesn't sound very
> appealing to me. If you want to go full-on user space, sure, I can see
> why :).
>
> Either way, your approach sounds to be very much in the concept phase,
> while this is more something that can actually be tested and benchmarked

The thread<->memory affinity is in the concept phase, but the
process<->memory affinity already runs, and in benchmarks it already
performs almost as well as hard bindings. It has the cost of a knumad
daemon scanning memory in the background, but that's cheap, not even
comparable to something like KSM: it's comparable to the khugepaged
overhead, which is orders of magnitude lower, and considering these are
big systems with many CPUs I don't think it's a big deal.
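Since that scanning may sound abstract, here is a purely illustrative
userland sketch of the kind of information it is after: on which nodes
a process's pages actually sit, versus where the process runs. This is
not knumad itself (knumad does this inside the kernel); it only uses
the existing move_pages(2) query mode, and the buffer and sample size
are made up.

/*
 * Conceptual illustration, not knumad: sample the node placement of
 * this process's pages via move_pages(2) with nodes == NULL (nothing
 * is moved, placement is only reported).  Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 64

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *buf = malloc(NPAGES * page_size);
	void *pages[NPAGES];
	int status[NPAGES], node_count[64] = { 0 };
	int i;

	if (!buf || numa_available() < 0)
		return 1;
	memset(buf, 1, NPAGES * page_size);	/* fault the pages in */
	for (i = 0; i < NPAGES; i++)
		pages[i] = buf + (long)i * page_size;

	/* query only: report the node each sampled page resides on */
	if (move_pages(0 /* self */, NPAGES, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}
	for (i = 0; i < NPAGES; i++)
		if (status[i] >= 0 && status[i] < 64)
			node_count[status[i]]++;

	printf("running on node %d\n", numa_node_of_cpu(sched_getcpu()));
	for (i = 0; i <= numa_max_node() && i < 64; i++)
		printf("node %d holds %d of %d sampled pages\n",
		       i, node_count[i], NPAGES);
	return 0;
}

Comparing the two is enough for the process<->memory case; the
per-thread case is where it gets harder, for the reasons above.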
Once process<->memory affinity works well, if we go for thread<->memory
affinity we'll have to tweak knumad to trigger page faults, so it can
give us per-thread information on the memory affinity. Also, I'm only
working on anonymous memory right now; maybe it should be extended to
other types of memory, handling memory shared by entities running on
different nodes by not touching it, while pagecache used by just one
thread (or process, initially) could still be migrated. For readonly
shared memory, duplicating it per node is the way to go, but I'm not
going in that direction as it's not useful for virt. It remains a
possibility for the future.

> against today. So yes, I want the interim solution - just in case your
> plan doesn't work out :). Oh, and then there's the non-PV guests too...

Actually, to me it looks like the code is missing the memory affinity
and migration parts, so I'm not sure how much you can benchmark it yet.
It does seem to tweak the scheduler, though.

I don't mean ms_tbind/ms_mbind are a bad idea: they allow moving the
migration currently invoked by a perl script into the kernel (a sketch
of what that userland-driven migration amounts to is at the end of this
mail). But I'm not satisfied with the trouble that creating a vtopology
still gives us: a vtopology only makes sense if it matches the "real
hardware", and as said above we don't always have real hardware; and if
you enforce real hardware you're pretty close to using hard bindings,
except that the kernel does the migration of cpus and memory instead of
it being invoked by userland, and it also allows maximizing the usage
of the idle CPUs.

But even after you create a vtopology in the guest and make sure it
won't be split across nodes, so that the vtopology runs on the "real
hardware" it has been created for, it won't help much if the userland
apps in the guest aren't also modified to use ms_mbind/ms_tbind, which
I don't see happening any time soon. You could still run knumad in the
guest, though, to take advantage of the vtopology without having to
modify guest apps. But if knumad ran in the host (if we can solve the
thread<->memory affinity) there would be no need for a vtopology in the
guest in the first place.

I'm positive, and I have a proof of concept showing that knumad works
for process<->memory affinity, but considering the automigration code
isn't complete (it doesn't even migrate THP without splitting them yet)
I'm not yet delving into the complications of the thread affinity.
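For reference, the "migration invoked by a perl script" mentioned above
is nothing more exotic than a userland manager calling migrate_pages(2)
and sched_setaffinity(2) on the guest when it decides to repack it onto
another node. A minimal sketch: the pid comes from the command line,
node 0 -> node 1 and cpus 4-7 are invented numbers, and in reality
every vcpu thread needs the affinity change, not just the main thread.

/*
 * Sketch of today's userland-driven repacking: move a guest's RAM
 * from node 0 to node 1 and follow with its cpu affinity.
 * Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int qemu_pid = argc > 1 ? atoi(argv[1]) : 0;
	unsigned long from = 1UL << 0;	/* old node: 0 */
	unsigned long to   = 1UL << 1;	/* new node: 1 */
	cpu_set_t cpus;
	int cpu;

	if (qemu_pid <= 0) {
		fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
		return 1;
	}

	/* migrate whatever guest RAM sits on node 0 over to node 1 */
	if (migrate_pages(qemu_pid, sizeof(from) * 8, &from, &to) < 0) {
		perror("migrate_pages");
		return 1;
	}

	/* follow with the cpus: pretend node 1 is cpus 4-7 */
	CPU_ZERO(&cpus);
	for (cpu = 4; cpu <= 7; cpu++)
		CPU_SET(cpu, &cpus);
	if (sched_setaffinity(qemu_pid, sizeof(cpus), &cpus)) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}

ms_tbind/ms_mbind would move that decision into the kernel, which is an
improvement, but as said above it still leaves the vtopology problem
open.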