Date: Fri, 19 Oct 2012 14:00:40 +0530
From: Raghavendra K T
To: habanero@linux.vnet.ibm.com
Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, "H. Peter Anvin", Ingo Molnar,
    Marcelo Tosatti, Srikar, "Nikunj A. Dadhania", KVM, Jiannan Ouyang,
    chegu vinod, LKML, Srivatsa Vaddagiri, Gleb Natapov, Andrew Jones
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
>> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>>>> * Avi Kivity [2012-10-04 17:00:28]:
>>>>>>
>>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>>>
>> [...]
>>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>>>> has just terrible scalability to begin with. I do not think we should
>>>>> try to optimize such a bad workload.
>>>>>
>>>>
>>>> I think my way of running dbench has some flaw, so I went to ebizzy.
>>>> Could you let me know how you generally run dbench?
>>>
>>> I mount a tmpfs and then specify that mount for dbench to run on. This
>>> eliminates all IO. I use a 300 second run time, and the number of threads
>>> is equal to the number of vcpus. All of the VMs of course need to have a
>>> synchronized start.
>>>
>>> I would also make sure you are using a recent kernel for dbench, where
>>> the dcache scalability is much improved. Without any lock-holder
>>> preemption, the time in spin_lock should be very low:
>>>
>>>>  21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
>>>>   3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
>>>>   2.81%  10176  dbench  dbench             [.] child_run
>>>>   2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
>>>>   2.33%   8423  dbench  dbench             [.] next_token
>>>>   2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
>>>>   1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
>>>>   1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
>>>>   1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
>>>>   1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
>>>>   1.38%   5009  dbench  libc-2.12.so       [.] memmove
>>>>   1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
>>>>   1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
>>>
>>
>> Hi Andrew,
>> I ran the dbench test with tmpfs. I do not see any improvement in
>> dbench with a 16k ple window.
>>
>> So it seems that, apart from ebizzy, no workload benefited from it, and
>> I agree that it may not be good to optimize for ebizzy.
>> I shall drop the change to a 16k default window and continue with the
>> rest of the original patch series. I need to experiment with the latest
>> kernel.
>
> Thanks for running this again. I do believe there are some workloads that,
> when run at 1x overcommit, would benefit from a larger ple_window [with
> the current ple handling code], but I do not also want to potentially
> degrade >1x with a larger window. I do, however, think there may be
> another option. I have not fully worked this out, but I think I am on
> to something.
>
> I decided to revert back to just a yield() instead of a yield_to(). My
> motivation was that yield_to() [for large VMs] is like a dog chasing its
> tail, round and round we go.... Just yield(), in particular a yield()
> which results in yielding to something -other- than the current VM's
> vcpus, helps synchronize the execution of sibling vcpus by deferring
> them until the lock holder vcpu is running again. The more we can do to
> get all vcpus running at the same time, the less we have to deal with
> the preemption problem. The other benefit is that yield() is far, far
> lower overhead than yield_to().
>
> This does assume that vcpus from the same VM do not share runqueues.
> Yielding to a sibling vcpu with yield() is not productive for larger VMs
> in the same way that yield_to() is not. My recent results include
> restricting vcpu placement so that sibling vcpus do not get to run on
> the same runqueue. I do believe we could implement an initial placement
> and load balance policy to strive for this restriction (making it purely
> optional, but I bet it could also help user apps which use spin locks).
>
> For 1x VMs which still vm_exit due to PLE, I believe we could probably
> just leave the ple_window alone, as long as we mostly use yield()
> instead of yield_to(). The problem with the unneeded exits in this case
> has been the overhead in the routines leading up to yield_to() and the
> yield_to() itself. If we use yield() most of the time, this overhead
> will go away.
>
> Here is a comparison of yield_to() and yield():
>
> dbench with 20-way VMs, 8 of them on an 80-way host:
>
>   no PLE                  426  +/- 11.03%
>   no PLE w/ gangsched   32001  +/-   .37%
>   PLE with yield()      29207  +/-   .28%
>   PLE with yield_to()    8175  +/-  1.37%
>
> Yield() is far and away better than yield_to() here and almost
> approaches the gang sched result. Here is a link for the perf sched map
> bitmap:
>
> https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
>
> The thrashing is way down and sibling vcpus tend to run together,
> approximating the behavior of gang scheduling without needing to
> actually implement gang scheduling.
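
Just to check that I understand the yield() variant correctly: what I
picture is something like the sketch below, against kvm_vcpu_on_spin() in
virt/kvm/kvm_main.c. This is only a sketch for discussion, certainly not
your actual change, and it assumes (as in your pinned runs) that sibling
vcpus of a VM sit on different host runqueues:

#include <linux/kvm_host.h>
#include <linux/sched.h>

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
        /*
         * Skip the directed-yield candidate search entirely and let the
         * host scheduler pick whatever it wants to run next, ideally a
         * vcpu of some other VM, so the sibling vcpus of this VM are
         * deferred until the preempted lock holder can run again.
         */
        yield();
}

So the PLE exit becomes little more than a hint to the host scheduler,
and all of the candidate-selection overhead of the current handler goes
away. Is that the idea?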
>
> I did test a smaller VM:
>
> dbench with 10-way VMs, 16 of them on an 80-way host:
>
>   no PLE                 6248  +/- 7.69%
>   no PLE w/ gangsched   28379  +/-  .07%
>   PLE with yield()      29196  +/- 1.62%
>   PLE with yield_to()   32217  +/- 1.76%

Hi Andrew,
Results are encouraging.

> There is some degradation with yield() compared to yield_to() here, but
> not nearly as large as the uplift we see on the larger VMs. Regardless,
> I have an idea to fix that: Instead of using yield() all the time, we
> could use yield_to(), but limit the rate per vcpu to something like 1
> per jiffy. All other exits use yield(). That rate of yield_to() should
> be more than enough for the smaller VMs, and the result should hopefully
> be just the same as the current code. I have not coded this up yet, but
> it's my next step.

I personally feel rate limiting yield_to() may be a good idea.

> I am also hopeful that limiting yield_to() will make the 1x issue go
> away as well (even with a 4096 ple_window). The vast majority of exits
> will result in yield(), which should be harmless.
>
> Keep in mind this did require ensuring sibling vcpus do not share host
> runqueues - I do think that is possible given some optional scheduler
> tweaks.

I think this (vcpu placement) is a concern. Having the rate limit alone
may suffice. Maybe tuning it with the overcommitted/non-overcommitted
scenario also taken into account would be better.

Okay, below is the V2 implementation I am experimenting with:

1) check the source -and- target runq to decide on exiting the ple handler
2) vcpu_on_spin()
   {
        .....
        if yield_to_same_vm did not succeed and we are overcommitted
                yield()
   }

I think combining your thoughts and (2) complicates the scenario a bit.
Anyway, let me see how my experiment goes. I will also check how yield()
performs without any pinning.
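
To make (2) above a bit more concrete, here is a rough sketch of what I am
experimenting with. It is for discussion only: rq_is_overcommitted() is
just a placeholder name for the source/target runqueue check in (1), not
an existing helper, and the round-robin/eligibility filtering of the
current handler is left out for brevity:

#include <linux/kvm_host.h>
#include <linux/sched.h>

/*
 * Placeholder for the check in (1): decide from the source/target
 * runqueue lengths whether the host is overcommitted.  This helper does
 * not exist today and would need help from the scheduler.
 */
static bool rq_is_overcommitted(void)
{
        return true;    /* placeholder */
}

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
        struct kvm *kvm = me->kvm;
        struct kvm_vcpu *vcpu;
        bool yielded = false;
        int i;

        /* First try the usual directed yield to a sibling vcpu. */
        kvm_for_each_vcpu(i, vcpu, kvm) {
                if (vcpu == me)
                        continue;
                if (kvm_vcpu_yield_to(vcpu)) {
                        yielded = true;
                        break;
                }
        }

        /*
         * If yield_to() within the same VM did not succeed and we are
         * overcommitted, fall back to a plain yield() so that a vcpu of
         * some other VM gets a chance to run.
         */
        if (!yielded && rq_is_overcommitted())
                yield();
}

The real patch will of course keep the eligibility and last_boosted_vcpu
logic of the current handler; the sketch only shows where the yield()
fallback would sit.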