From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Theurer
Subject: Re: kvm scaling question
Date: Tue, 15 Sep 2009 09:10:05 -0500
Message-ID: <1253023806.4204.9.camel@twinturbo.austin.ibm.com>
References: <4AAA1A0A0200004800080E06@novprvlin0050.provo.novell.com>
	<20090911215355.GD6244@amt.cnet>
	<4AAE7B3B0200004800081118@novprvlin0050.provo.novell.com>
Reply-To: habanero@linux.vnet.ibm.com
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: Marcelo Tosatti, kvm@vger.kernel.org
To: Bruce Rogers
Return-path:
Received: from e35.co.us.ibm.com ([32.97.110.153]:51682 "EHLO e35.co.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754131AbZIOOKk
	(ORCPT ); Tue, 15 Sep 2009 10:10:40 -0400
Received: from d03relay05.boulder.ibm.com (d03relay05.boulder.ibm.com [9.17.195.107])
	by e35.co.us.ibm.com (8.14.3/8.13.1) with ESMTP id n8FE0m8B015063
	for ; Tue, 15 Sep 2009 08:00:48 -0600
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167])
	by d03relay05.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id n8FEALU8171252
	for ; Tue, 15 Sep 2009 08:10:22 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1])
	by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n8FEA79v025069
	for ; Tue, 15 Sep 2009 08:10:08 -0600
In-Reply-To: <4AAE7B3B0200004800081118@novprvlin0050.provo.novell.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Mon, 2009-09-14 at 17:19 -0600, Bruce Rogers wrote:
> On 9/11/2009 at 3:53 PM, Marcelo Tosatti wrote:
> > On Fri, Sep 11, 2009 at 09:36:10AM -0600, Bruce Rogers wrote:
> >> I am wondering if anyone has investigated how well kvm scales when
> >> supporting many guests, or many vcpus, or both.
> >>
> >> I'll do some investigations into the per-vm memory overhead and
> >> play with bumping the max vcpu limit way beyond 16, but hopefully
> >> someone can comment on issues such as locking problems that are
> >> known to exist and need to be addressed to increase parallelism,
> >> general overhead percentages which can help provide consolidation
> >> expectations, etc.
> >
> > I suppose it depends on the guest and workload. With an EPT host and
> > a 16-way Linux guest doing kernel compilations, on a recent kernel,
> > I see:
> >
> > # Samples: 98703304
> > #
> > # Overhead  Command  Shared Object  Symbol
> > # ........  .......  .............  ......
> > #
> >    97.15%   sh       [kernel]       [k] vmx_vcpu_run
> >     0.27%   sh       [kernel]       [k] kvm_arch_vcpu_ioctl_
> >     0.12%   sh       [kernel]       [k] default_send_IPI_mas
> >     0.09%   sh       [kernel]       [k] _spin_lock_irq
> >
> > Which is pretty good. Without EPT/NPT, the mmu_lock seems to be the
> > major bottleneck to parallelism.
> >
> >> Also, when I did a simple experiment with vcpu overcommitment, I was
> >> surprised how quickly performance suffered (just bringing a Linux vm
> >> up), since I would have assumed the additional vcpus would have been
> >> halted the vast majority of the time. On a 2-processor box,
> >> overcommitting a guest to 8 vcpus (I know this isn't a good usage
> >> scenario, but it does provide some insights) caused the boot time to
> >> increase almost exponentially with the number of vcpus. At 16 vcpus,
> >> it took hours just to reach the gui login prompt.
> >
> > One probable reason for that is that vcpus which hold spinlocks in
> > the guest are scheduled out in favour of vcpus which spin on that
> > same lock.
>
> I suspected it might be a whole lot of spinning happening.

That does seem most likely. I was just surprised how bad the behavior
was.
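To make the failure mode concrete, here is a small user-space sketch
(purely illustrative; this is not guest or KVM code, and the program
name and everything in it are made up): a handful of threads hammer a
single pthread spinlock, and once the process is pinned to fewer cpus
than it has threads, aggregate throughput falls off a cliff, because
whenever the current holder is preempted every other thread just burns
its time slice spinning until the scheduler gets back around to the
holder. That is essentially what an over-committed guest's vcpus do on
a guest spinlock whose holder vcpu has been descheduled.

/*
 * spin-demo.c: illustrative user-space analogue only, not guest or
 * KVM code.  N threads hammer one pthread spinlock.  Pin the process
 * to fewer cpus than threads (e.g. "taskset -c 0,1 ./spin-demo 8")
 * and the total acquisition rate collapses due to lock-holder
 * preemption.
 *
 * Build: gcc -O2 -pthread -o spin-demo spin-demo.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_spinlock_t lock;
static volatile int stop;

static void *worker(void *arg)
{
        long long *acquired = arg;

        while (!stop) {
                pthread_spin_lock(&lock);
                (*acquired)++;           /* tiny critical section */
                pthread_spin_unlock(&lock);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int i, nthreads = argc > 1 ? atoi(argv[1]) : 8;
        pthread_t *tids = calloc(nthreads, sizeof(*tids));
        long long *counts = calloc(nthreads, sizeof(*counts));
        long long total = 0;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (i = 0; i < nthreads; i++)
                pthread_create(&tids[i], NULL, worker, &counts[i]);

        sleep(5);                        /* let them fight for a while */
        stop = 1;

        for (i = 0; i < nthreads; i++) {
                pthread_join(tids[i], NULL);
                total += counts[i];
        }
        printf("%d threads: %lld lock acquisitions in 5s\n",
               nthreads, total);
        return 0;
}

Comparing something like "taskset -c 0,1 ./spin-demo 2" against
"taskset -c 0,1 ./spin-demo 8" should show the drop-off clearly.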
I have collected lock_stat info on a similar vcpu over-commit
configuration, but on an EPT system, and saw a very significant amount
of spinning. However, if you don't have EPT or NPT, I would bet that's
the first problem. Even so, I am a little surprised that simply booting
is such a problem. It would be interesting to see what lock_stat shows
on your guest after booting with 16 vcpus.

I have observed that shortening the time between vcpus getting
scheduled can help mitigate the lock-holder-preemption problem
(presumably because the spinning vcpu is descheduled earlier and the
vcpu holding the lock is scheduled sooner), but I imagine there are
other unwanted side effects, such as lower cache hit rates.

-Andrew

>
> Bruce
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html