From: Andrea Arcangeli
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 18:41:13 +0100
Message-ID: <20111130174113.GM23466@redhat.com>
In-Reply-To: <20111130162237.GC27308@in.ibm.com>
References: <7816C401-9BE5-48A9-8BA9-4CDAD1B39FC8@suse.de>
 <20111108173304.GA14486@sequoia.sous-sol.org>
 <20111121150054.GA3602@in.ibm.com> <1321889126.28118.5.camel@twins>
 <20111121160001.GB3602@in.ibm.com> <1321894980.28118.16.camel@twins>
 <4ECB0019.7020800@codemonkey.ws> <20111123150300.GH8397@redhat.com>
 <4ECD3CBD.7010902@suse.de> <20111130162237.GC27308@in.ibm.com>
To: Dipankar Sarma
Cc: Peter Zijlstra, kvm list, qemu-devel Developers, Alexander Graf,
 Chris Wright, bharata@linux.vnet.ibm.com, Vaidyanathan S

On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even in the case of one guest that fits in one node, you're not going
to max out the full bandwidth of all memory channels with this. All
qemu can do with ms_mbind/ms_tbind is create a vtopology that matches
the hardware topology. That has these limits:

1) it requires all userland applications to be modified to scan either
   the physical topology when run on the host, or the vtopology when
   run in a guest, to get the full benefit (a sketch of such a scan is
   at the end of this mail)

2) it breaks across live migration if the host physical topology
   changes

3) one small guest that fits in a single NUMA node, on an otherwise
   idle NUMA system, doesn't give the host kernel enough information

4) if used outside of qemu, and one thread allocates more memory than
   fits in one node, it doesn't give the host kernel enough information
   either.

About 3): if you've got just one guest that fits in one node, each vcpu
should probably be spread across all the nodes and behave like
MPOL_INTERLEAVE; then even if the guest CPU scheduler migrates guest
processes in reverse, the global memory bandwidth is still fully used,
even though both end up accessing remote memory. I've just seen
benchmarks where no pinning runs more than _twice_ as fast as pinning,
with just one guest and only 10 vcpu threads, probably because of that.
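Concretely, that interleave behaviour boils down to something like the
following in each vcpu thread. This is a minimal, untested sketch using
the libnuma v2 API; the allocation size and error handling are only
illustrative, build with -lnuma:

  /* Interleave this thread's future anonymous memory across all
   * configured NUMA nodes, i.e. the MPOL_INTERLEAVE behaviour
   * described above.  Illustration only. */
  #include <numa.h>
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support\n");
          return 1;
      }

      /* all nodes this process is allowed to allocate from */
      struct bitmask *nodes = numa_all_nodes_ptr;

      /* the raw syscall behind numa_set_interleave_mask() */
      if (set_mempolicy(MPOL_INTERLEAVE, nodes->maskp, nodes->size + 1)) {
          perror("set_mempolicy");
          return 1;
      }

      /* pages faulted in from now on are spread over all nodes, so
       * all memory channels get used even by a single thread */
      char *buf = malloc(64UL << 20);
      if (!buf)
          return 1;
      for (size_t i = 0; i < (64UL << 20); i += 4096)
          buf[i] = 1;
      return 0;
  }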
About 4): even if the thread scans the NUMA topology, it can't give the
kernel enough information to know which parts of its memory will be
used more or less over time (it may be possible to call mbind and vary
it at runtime, but that adds even more complexity left to the
programmer; see the mbind sketch further below).

If the vcpu is free to go to any node, and we have automatic
vcpu<->memory affinity, then the memory will follow the vcpu, and the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Trouble 1/2/3/4 applies to the hard bindings as well, not just to
ms_mbind/ms_tbind.
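For reference, "call mbind and vary it at runtime" from point 4) means
putting something like this in the application, and knowing when to
call it again. A minimal, untested sketch; the region size, node
numbers and rebind point are made up for illustration, build with
-lnuma:

  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>

  #define LEN (128UL << 20)          /* 128 MiB region, for example */

  /* Bind an already-mapped region to a single node.  MPOL_MF_MOVE
   * asks the kernel to migrate pages that were already faulted in;
   * without it only future faults obey the new policy. */
  static void bind_to_node(void *addr, unsigned long node)
  {
      unsigned long nodemask = 1UL << node;

      if (mbind(addr, LEN, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8, MPOL_MF_MOVE))
          perror("mbind");
  }

  int main(void)
  {
      void *region = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (region == MAP_FAILED)
          return 1;

      bind_to_node(region, 0);       /* phase 1: hot on node 0 */
      /* ... workload runs, access pattern shifts ... */
      bind_to_node(region, 1);       /* phase 2: rebind to node 1 */

      /* The programmer has to know when and where to rebind; that is
       * the complexity that automatic vcpu<->memory affinity in the
       * kernel would avoid. */
      return 0;
  }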
In short, it's an incremental step that moves some logic into the
kernel, but I don't see it solving all situations optimally, and it
shares a lot of the limits of the hard bindings.
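Finally, this is the kind of per-application topology scan that point
1) refers to: every program that wants the full benefit has to
discover the nodes it runs on (the physical topology on the host, the
vtopology in the guest) and lay out its memory accordingly. A minimal,
untested sketch with libnuma, build with -lnuma:

  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
      if (numa_available() < 0) {
          printf("no NUMA topology exposed here\n");
          return 0;
      }

      int max = numa_max_node();
      for (int node = 0; node <= max; node++) {
          long long free_mem;
          long long size = numa_node_size64(node, &free_mem);

          if (size < 0)
              continue;              /* memoryless node */
          printf("node %d: %lld MiB total, %lld MiB free\n",
                 node, size >> 20, free_mem >> 20);
      }
      return 0;
  }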