Message-ID: <50615EE4.1040809@linux.vnet.ibm.com>
Date: Tue, 25 Sep 2012 13:06:04 +0530
From: Raghavendra K T
To: Avi Kivity
Cc: Rik van Riel, Peter Zijlstra, "H. Peter Anvin", Ingo Molnar,
    Marcelo Tosatti, Srikar, "Nikunj A. Dadhania", KVM, Jiannan Ouyang,
    chegu vinod, "Andrew M. Theurer", LKML, Srivatsa Vaddagiri,
    Gleb Natapov
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
References: <20120921115942.27611.67488.sendpatchset@codeblue>
    <20120921120000.27611.71321.sendpatchset@codeblue>
    <505C654B.2050106@redhat.com> <505CA2EB.7050403@linux.vnet.ibm.com>
    <50607F1F.2040704@redhat.com>
In-Reply-To: <50607F1F.2040704@redhat.com>

On 09/24/2012 09:11 PM, Avi Kivity wrote:
> On 09/21/2012 08:24 PM, Raghavendra K T wrote:
>> On 09/21/2012 06:32 PM, Rik van Riel wrote:
>>> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>>>> From: Raghavendra K T
>>>>
>>>> When the total number of VCPUs in the system is less than or equal
>>>> to the number of physical CPUs, PLE exits become costly, since each
>>>> VCPU can have a dedicated PCPU and trying to find a target VCPU to
>>>> yield_to just burns time in the PLE handler.
>>>>
>>>> This patch reduces that overhead by simply returning in such
>>>> scenarios, based on the length of the current CPU's runqueue.
>>>
>>> I am not convinced this is the way to go.
>>>
>>> The VCPU that is holding the lock, and is not releasing it,
>>> probably got scheduled out. That implies that VCPU is on a
>>> runqueue with at least one other task.
>>
>> I see your point here; we have two cases:
>>
>> case 1)
>>
>> rq1 : vcpu1->wait(lockA) (spinning)
>> rq2 : vcpu2->holding(lockA) (running)
>>
>> Here, ideally, vcpu1 should not enter the PLE handler, since it would
>> surely get the lock within ple_window cycles (assuming ple_window is
>> tuned perfectly for that workload).
>>
>> Maybe this explains why we are not seeing a benefit with kernbench.
>>
>> On the other hand, since we cannot have a perfect ple_window tuned
>> for all types of workloads, we gain for those workloads which may
>> need more than 4096 cycles. Thinking about it, is that what we are
>> seeing in the benefited cases?
>
> Maybe we need to increase the ple window regardless. 4096 cycles is 2
> microseconds or less (call it t_spin). The overhead from
> kvm_vcpu_on_spin() and the associated task switches is at least a few
> microseconds, increasing as contention is added (call it t_yield). The
> time for a natural context switch is several milliseconds (call it
> t_slice). There is also the time the lock holder owns the lock,
> assuming no contention (t_hold).
>
> If t_yield > t_spin, then in the undercommitted case it dominates
> t_spin. If t_hold > t_spin we lose badly.
>
> If t_spin > t_yield, then the undercommitted case doesn't suffer as
> much, since most of the spinning happens in the guest instead of the
> host, so it can pick up the unlock in a timely manner. We don't lose
> too much in the overcommitted case provided the values aren't too far
> apart (say a factor of 3).
>
> Obviously t_spin must be significantly smaller than t_slice, otherwise
> it accomplishes nothing.
>
> Regarding t_hold: if it is small, then a larger t_spin helps avoid
> false exits. If it is large, then we're not very sensitive to t_spin.
> It doesn't matter whether it takes us 2 usec or 20 usec to yield, if
> we end up yielding for several milliseconds.
>
> So I think it's worth trying again with a ple_window of 20000-40000.
>

Agreed that spinning is not costly, and I have tried increasing
ple_window earlier; I'll give it one more shot. I was thinking that
unnecessary spinning of vcpus (spinning while the lock holder is
preempted) adds up to significant degradation, and the ticketlock
scenario in particular is more problematic. No?
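
To put rough numbers on the t_spin vs. t_slice argument (assuming a
2-2.5 GHz clock for the cycle-to-time conversion; the exact rate does
not change the conclusion):

    ple_window =  4096 cycles  ->  ~2 us  at 2 GHz
    ple_window = 20000 cycles  ->  ~8 us  at 2.5 GHz
    ple_window = 40000 cycles  -> ~16 us  at 2.5 GHz

Even 40000 cycles is still roughly two orders of magnitude below a
t_slice of a few milliseconds, so a 5-10x larger window mostly trades a
little more guest-side spinning for fewer false PLE exits. If I remember
the parameter handling right, this only needs reloading kvm_intel with a
different ple_window module parameter, so it is easy to experiment with.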
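
For reference, the check being discussed is roughly the following (only
a sketch of the idea, not the exact RFC code; this_cpu_rq_nr_running()
is a made-up name standing in for whatever scheduler hook ends up
exporting the current CPU's runqueue length):

/*
 * Sketch only: bail out of the PLE handler when this physical CPU has
 * nothing else to run, i.e. the undercommitted case, where the lock
 * holder very likely has its own dedicated PCPU and will release the
 * lock within a few ple_window's worth of spinning.
 */
static bool cpu_is_undercommitted(void)
{
	/* this_cpu_rq_nr_running() is hypothetical, see above */
	return this_cpu_rq_nr_running() <= 1;
}

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	if (cpu_is_undercommitted())
		return;	/* resume spinning in the guest instead of yielding */

	/* ... existing directed yield / candidate VCPU search ... */
}

Nothing clever here; the whole point is to keep the undercommitted exit
path as close to a plain return as possible.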