From: Andrea Arcangeli
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 23 Nov 2011 21:19:06 +0100
Message-ID: <20111123201906.GL8397@redhat.com>
References: <20111029184502.GH11038@in.ibm.com> <7816C401-9BE5-48A9-8BA9-4CDAD1B39FC8@suse.de>
 <20111108173304.GA14486@sequoia.sous-sol.org> <20111121150054.GA3602@in.ibm.com>
 <1321889126.28118.5.camel@twins> <20111121160001.GB3602@in.ibm.com>
 <1321894980.28118.16.camel@twins> <4ECB0019.7020800@codemonkey.ws>
 <20111123150300.GH8397@redhat.com> <4ECD3CBD.7010902@suse.de>
In-Reply-To: <4ECD3CBD.7010902@suse.de>
To: Alexander Graf
Cc: Peter Zijlstra, kvm list, dipankar@in.ibm.com, qemu-devel Developers, Chris Wright, bharata@linux.vnet.ibm.com, Vaidyanathan S

On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to
> tell the kernel that this GB of RAM actually is close to that vCPU thread.
> Of course the admin still needs to decide how to split up memory. That's
> the deal with emulating real hardware. You get the interfaces hardware
> gets :). However, if you follow a reasonable default strategy such as

The problem is how you decide the parameter "-numa node,mem=1G,cpus=0"
in the first place. The real hardware is only known when the VM starts,
but then the VM can be migrated. Or the VM may have to be split across
two nodes, regardless of "-numa node,mem=1G,cpus=0-1", to avoid
swapping: then there may be two 512M nodes with one cpu each instead of
one NUMA node with 1G and two cpus. Especially once the hard bindings
are relaxed in favour of ms_mbind/ms_tbind, the vtopology you create
won't match the real hardware, because you don't know which real
hardware you will actually get.

> numa splitting your RAM into equal chunks between guest vCPUs you're
> probably close enough to optimal usage models. Or at least you could
> have a close enough approximation of how this mapping could work for the
> _guest_ regardless of the host and when you migrate it somewhere else it
> should also work reasonably well.

If you enforce these assumptions, the admin still has to choose the
"-numa node,mem=1G" parameters after checking the physical NUMA
topology, to make sure the vtopology can match the real physical
topology and that the guest runs on "real hardware". That's not very
different from using hard bindings: hard bindings enforce the "real
hardware", so there's no way it can go wrong (a minimal sketch of what
such hard bindings amount to follows below). I mean, you still need
some NUMA topology knowledge outside of QEMU to be sure you get "real
hardware" out of the vtopology.

Ok, cpusets would restrict the availability of idle cpus, so there's a
slight improvement in maximizing idle CPU usage (it's better to run 50%
slower than not to run at all), but that could also be achieved by
relaxing the cpuset semantics (if that's not already available).

> So you want to basically dynamically create NUMA topologies from the
> runtime behavior of the guest? What if it changes over time?

Yes, though I wouldn't call it a NUMA topology: that sounds like a
vtopology, and a vtopology is something fixed at boot time, the sort of
thing created by a command line like "-numa node,mem=1G,cpus=0".
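To make the contrast concrete, this is roughly what hard-binding one
virtual node amounts to today with the existing mbind(2) and
sched_setaffinity(2) interfaces. It's only an illustrative sketch: the
1G size, host node 0 and cpu 0 are invented numbers, and a real setup
would repeat this for every vcpu thread and every guest memory region.

/*
 * Illustrative sketch only: hard-binding one virtual node with the
 * existing interfaces.  Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t guest_node_ram = 1UL << 30;	/* the virtual node's 1G of RAM */
	unsigned long host_node0 = 1UL << 0;	/* nodemask with only node 0 set */
	cpu_set_t cpus;
	void *ram;

	/* back the virtual node's RAM and hard-bind it to host node 0 */
	ram = mmap(NULL, guest_node_ram, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED ||
	    mbind(ram, guest_node_ram, MPOL_BIND,
		  &host_node0, sizeof(host_node0) * 8, 0)) {
		perror("mmap/mbind");
		return 1;
	}

	/* hard-bind the vcpu thread (here: this thread) to cpu 0 of node 0 */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ... run the vcpu: nothing can move anymore, for better or worse */
	return 0;
}

Everything here is nailed down before the guest ever runs, which is
exactly the kind of static setup I'd rather not have to depend on.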
I wouldn't try to give the guest any "memory" topology: the vcpus are
just magic threads that don't behave like normal threads in
memory-affinity terms, so they need a paravirtualization layer to be
dealt with. The fact that vcpu0 accessed N pages right now doesn't mean
there's a real affinity between vcpu0 and those N pages, if the guest
scheduler is free to migrate anything anywhere. The guest thread
running on vcpu0 may be migrated to vcpu7, which may belong to a
different physical node. So if we want to automatically detect
thread<->memory affinity between vcpus and guest memory, we also need
to group the guest threads on certain vcpus and prevent those cpu
migrations. The threads in the guest had better stick to vcpu0/1/2/3
(instead of migrating to vcpu4/5/6/7) if vcpu0/1/2/3 have affinity with
the same memory, which fits in one node. That can only be told
dynamically from KVM to the guest OS scheduler, as we may migrate
virtual machines or move the memory around.

Take the example of 3 VMs of 2.5G RAM each on an 8G system with 2 nodes
(4G per node). Suppose one of the two VMs that have all their 2.5G
allocated in a single node quits. Then the VM that was split across the
two nodes will be memory-migrated to fit in one node. So far so good,
but then KVM should tell the guest OS scheduler that it should stop
grouping vcpus, that all vcpus are equal again, and that all guest
threads can be migrated to any vcpu. I don't see a way to do those
things with a vtopology fixed at boot.

> I actually like the idea of just telling the kernel how close memory
> will be to a thread. Sure, you can handle this basically by shoving your
> scheduler into user space, but isn't managing processes what a kernel is
> supposed to do in the first place?

Assume you're not in virt and you just want to tell the kernel that
thread A uses memory range A and thread B uses memory range B. If
memory range A fits in one node you're ok. But if "memory A" now spans
two nodes (maybe to avoid swapping), you're still screwed, and you
won't have given the kernel enough information about the real runtime
affinity that "thread A" has with that memory. Now, if statistically
the accesses to "memory A" are all equal, it won't make a difference,
but if you end up using one half of "memory A" 99% of the time, it
won't work as well. This is especially a problem for KVM, because
statistically the accesses to the "memory A" given to vcpu0 won't be
equal: 50% of it may not be used at all, just holding pagecache or even
free memory, so if "memory A" has to be split across two nodes to avoid
swapping we can do better by detecting the vcpu<->memory affinity
dynamically.

> You can always argue for a microkernel, but having a scheduler in user
> space (perl script) and another one in the kernel doesn't sound very
> appealing to me. If you want to go full-on user space, sure, I can see
> why :).
>
> Either way, your approach sounds to be very much in the concept phase,
> while this is more something that can actually be tested and benchmarked

The thread<->memory affinity is in the concept phase, but the
process<->memory affinity already runs, and in benchmarks it already
performs almost as well as hard bindings. It has the cost of a knumad
daemon scanning memory in the background, but that's cheap, not even
comparable to something like KSM: it's comparable to the khugepaged
overhead, which is orders of magnitude lower, and considering these are
big systems with many CPUs I don't think it's a big deal.
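Since that scanning may sound abstract, here is a purely illustrative
userland sketch of the kind of information it is after: on which nodes
a process's pages actually sit, versus where the process runs. This is
not knumad itself (knumad does this inside the kernel); it only uses
the existing move_pages(2) query mode, and the buffer and sample size
are made up.

/*
 * Conceptual illustration, not knumad: sample the node placement of
 * this process's pages via move_pages(2) with nodes == NULL (nothing
 * is moved, placement is only reported).  Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 64

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *buf = malloc(NPAGES * page_size);
	void *pages[NPAGES];
	int status[NPAGES], node_count[64] = { 0 };
	int i;

	if (!buf || numa_available() < 0)
		return 1;
	memset(buf, 1, NPAGES * page_size);	/* fault the pages in */
	for (i = 0; i < NPAGES; i++)
		pages[i] = buf + (long)i * page_size;

	/* query only: report the node each sampled page resides on */
	if (move_pages(0 /* self */, NPAGES, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}
	for (i = 0; i < NPAGES; i++)
		if (status[i] >= 0 && status[i] < 64)
			node_count[status[i]]++;

	printf("running on node %d\n", numa_node_of_cpu(sched_getcpu()));
	for (i = 0; i <= numa_max_node() && i < 64; i++)
		printf("node %d holds %d of %d sampled pages\n",
		       i, node_count[i], NPAGES);
	return 0;
}

Comparing the two is enough for the process<->memory case; the
per-thread case is where it gets harder, for the reasons above.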
Once process<->memory affinity works well, if we go for thread<->memory
affinity we'll have to tweak knumad to trigger page faults, so it can
give us per-thread information on the memory affinity. Also, I'm only
working on anonymous memory right now; maybe it should be extended to
other types of memory, handling memory shared by entities running on
different nodes by not touching it, while pagecache used by just one
thread (or process, initially) could still be migrated. For readonly
shared memory, duplicating it per node is the way to go, but I'm not
going in that direction as it's not useful for virt. It remains a
possibility for the future.

> against today. So yes, I want the interim solution - just in case your
> plan doesn't work out :). Oh, and then there's the non-PV guests too...

Actually, to me it looks like the code is missing the memory affinity
and migration parts, so I'm not sure how much you can benchmark it yet.
It does seem to tweak the scheduler, though.

I don't mean ms_tbind/ms_mbind are a bad idea: they allow moving the
migration currently invoked by a perl script into the kernel (a sketch
of what that userland-driven migration amounts to is at the end of this
mail). But I'm not satisfied with the trouble that creating a vtopology
still gives us: a vtopology only makes sense if it matches the "real
hardware", and as said above we don't always have real hardware; and if
you enforce real hardware you're pretty close to using hard bindings,
except that the kernel does the migration of cpus and memory instead of
it being invoked by userland, and it also allows maximizing the usage
of the idle CPUs.

But even after you create a vtopology in the guest and make sure it
won't be split across nodes, so that the vtopology runs on the "real
hardware" it has been created for, it won't help much if the userland
apps in the guest aren't also modified to use ms_mbind/ms_tbind, which
I don't see happening any time soon. You could still run knumad in the
guest, though, to take advantage of the vtopology without having to
modify guest apps. But if knumad ran in the host (if we can solve the
thread<->memory affinity) there would be no need for a vtopology in the
guest in the first place.

I'm positive, and I have a proof of concept showing that knumad works
for process<->memory affinity, but considering the automigration code
isn't complete (it doesn't even migrate THP without splitting them yet)
I'm not yet delving into the complications of the thread affinity.
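For reference, the "migration invoked by a perl script" mentioned above
is nothing more exotic than a userland manager calling migrate_pages(2)
and sched_setaffinity(2) on the guest when it decides to repack it onto
another node. A minimal sketch: the pid comes from the command line,
node 0 -> node 1 and cpus 4-7 are invented numbers, and in reality
every vcpu thread needs the affinity change, not just the main thread.

/*
 * Sketch of today's userland-driven repacking: move a guest's RAM
 * from node 0 to node 1 and follow with its cpu affinity.
 * Build with -lnuma.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int qemu_pid = argc > 1 ? atoi(argv[1]) : 0;
	unsigned long from = 1UL << 0;	/* old node: 0 */
	unsigned long to   = 1UL << 1;	/* new node: 1 */
	cpu_set_t cpus;
	int cpu;

	if (qemu_pid <= 0) {
		fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
		return 1;
	}

	/* migrate whatever guest RAM sits on node 0 over to node 1 */
	if (migrate_pages(qemu_pid, sizeof(from) * 8, &from, &to) < 0) {
		perror("migrate_pages");
		return 1;
	}

	/* follow with the cpus: pretend node 1 is cpus 4-7 */
	CPU_ZERO(&cpus);
	for (cpu = 4; cpu <= 7; cpu++)
		CPU_SET(cpu, &cpus);
	if (sched_setaffinity(qemu_pid, sizeof(cpus), &cpus)) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}

ms_tbind/ms_mbind would move that decision into the kernel, which is an
improvement, but as said above it still leaves the vtopology problem
open.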