From: Andrea Arcangeli
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 18:41:13 +0100
Message-ID: <20111130174113.GM23466@redhat.com>
In-Reply-To: <20111130162237.GC27308@in.ibm.com>
References: <7816C401-9BE5-48A9-8BA9-4CDAD1B39FC8@suse.de>
 <20111108173304.GA14486@sequoia.sous-sol.org>
 <20111121150054.GA3602@in.ibm.com> <1321889126.28118.5.camel@twins>
 <20111121160001.GB3602@in.ibm.com> <1321894980.28118.16.camel@twins>
 <4ECB0019.7020800@codemonkey.ws> <20111123150300.GH8397@redhat.com>
 <4ECD3CBD.7010902@suse.de> <20111130162237.GC27308@in.ibm.com>
To: Dipankar Sarma
Cc: Peter Zijlstra, kvm list, qemu-devel Developers, Alexander Graf,
 Chris Wright, bharata@linux.vnet.ibm.com, Vaidyanathan S

On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even in the case of one guest that fits in one node, you're not going
to max out the full bandwidth of all memory channels with this. All
qemu can do with ms_mbind/ms_tbind is create a vtopology that matches
the hardware topology. That has these limits:

1) it requires all userland applications to be modified to scan either
   the physical topology when run on the host, or the vtopology when
   run in a guest, to get the full benefit (a sketch of such a scan is
   at the end of this mail)

2) it breaks across live migration if the host physical topology
   changes

3) one small guest that fits in a single NUMA node, on an otherwise
   idle NUMA system, doesn't give the host kernel enough information

4) if used outside of qemu, and one thread allocates more memory than
   fits in one node, it doesn't give the host kernel enough information
   either.

About 3): if you've got just one guest that fits in one node, each vcpu
should probably be spread across all the nodes and behave like
MPOL_INTERLEAVE; then even if the guest CPU scheduler migrates guest
processes in reverse, the global memory bandwidth is still fully used,
even though both end up accessing remote memory. I've just seen
benchmarks where no pinning runs more than _twice_ as fast as pinning,
with just one guest and only 10 vcpu threads, probably because of that.
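Concretely, that interleave behaviour boils down to something like the
following in each vcpu thread. This is a minimal, untested sketch using
the libnuma v2 API; the allocation size and error handling are only
illustrative, build with -lnuma:

  /* Interleave this thread's future anonymous memory across all
   * configured NUMA nodes, i.e. the MPOL_INTERLEAVE behaviour
   * described above.  Illustration only. */
  #include <numa.h>
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support\n");
          return 1;
      }

      /* all nodes this process is allowed to allocate from */
      struct bitmask *nodes = numa_all_nodes_ptr;

      /* the raw syscall behind numa_set_interleave_mask() */
      if (set_mempolicy(MPOL_INTERLEAVE, nodes->maskp, nodes->size + 1)) {
          perror("set_mempolicy");
          return 1;
      }

      /* pages faulted in from now on are spread over all nodes, so
       * all memory channels get used even by a single thread */
      char *buf = malloc(64UL << 20);
      if (!buf)
          return 1;
      for (size_t i = 0; i < (64UL << 20); i += 4096)
          buf[i] = 1;
      return 0;
  }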
About 4): even if the thread scans the NUMA topology, it can't give the
kernel enough information to know which parts of its memory will be
used more or less over time (it may be possible to call mbind and vary
it at runtime, but that adds even more complexity left to the
programmer; see the mbind sketch further below).

If the vcpu is free to go to any node, and we have automatic
vcpu<->memory affinity, then the memory will follow the vcpu, and the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Trouble 1/2/3/4 applies to the hard bindings as well, not just to
ms_mbind/ms_tbind.
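For reference, "call mbind and vary it at runtime" from point 4) means
putting something like this in the application, and knowing when to
call it again. A minimal, untested sketch; the region size, node
numbers and rebind point are made up for illustration, build with
-lnuma:

  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>

  #define LEN (128UL << 20)          /* 128 MiB region, for example */

  /* Bind an already-mapped region to a single node.  MPOL_MF_MOVE
   * asks the kernel to migrate pages that were already faulted in;
   * without it only future faults obey the new policy. */
  static void bind_to_node(void *addr, unsigned long node)
  {
      unsigned long nodemask = 1UL << node;

      if (mbind(addr, LEN, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8, MPOL_MF_MOVE))
          perror("mbind");
  }

  int main(void)
  {
      void *region = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (region == MAP_FAILED)
          return 1;

      bind_to_node(region, 0);       /* phase 1: hot on node 0 */
      /* ... workload runs, access pattern shifts ... */
      bind_to_node(region, 1);       /* phase 2: rebind to node 1 */

      /* The programmer has to know when and where to rebind; that is
       * the complexity that automatic vcpu<->memory affinity in the
       * kernel would avoid. */
      return 0;
  }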
In short, it's an incremental step that moves some logic into the
kernel, but I don't see it solving all situations optimally, and it
shares a lot of the limits of the hard bindings.
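Finally, this is the kind of per-application topology scan that point
1) refers to: every program that wants the full benefit has to
discover the nodes it runs on (the physical topology on the host, the
vtopology in the guest) and lay out its memory accordingly. A minimal,
untested sketch with libnuma, build with -lnuma:

  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
      if (numa_available() < 0) {
          printf("no NUMA topology exposed here\n");
          return 0;
      }

      int max = numa_max_node();
      for (int node = 0; node <= max; node++) {
          long long free_mem;
          long long size = numa_node_size64(node, &free_mem);

          if (size < 0)
              continue;              /* memoryless node */
          printf("node %d: %lld MiB total, %lld MiB free\n",
                 node, size >> 20, free_mem >> 20);
      }
      return 0;
  }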