From: Avi Kivity <avi@redhat.com>
To: habanero@linux.vnet.ibm.com
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@redhat.com>,
	KVM <kvm@vger.kernel.org>, chegu vinod <chegu_vinod@hp.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
	Gleb Natapov <gleb@redhat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
Date: Wed, 19 Sep 2012 16:39:29 +0300
Message-ID: <5059CB11.2000804@redhat.com>
In-Reply-To: <1347937398.10325.190.camel@oc6622382223.ibm.com>

On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>> 
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run.  The result is high cpu usage
>> > in get_pid_task() and still some contention on the double runqueue
>> > lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> > 
>> > On reducing the occurrence:  The worst case for lock-holder preemption
>> > is having vcpus of the same VM on the same runqueue.  This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not.  To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>> > 
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
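
For anyone wanting to reproduce that placement from the host, plain
affinity calls are enough.  A rough sketch follows - pin_vcpu(), the
way the vcpu thread tids are found, and the 5-cpus-per-vcpu-index
layout are all assumptions for illustration, not anything from the
patch set:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Rough sketch only: pin one vcpu thread to the host cpu range reserved
 * for its vcpu index, matching the layout described above.  The caller
 * is assumed to have found the vcpu thread tids, e.g. under
 * /proc/<qemu-pid>/task/. */
static int pin_vcpu(pid_t tid, int vcpu_index)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	for (cpu = vcpu_index * 5; cpu < (vcpu_index + 1) * 5; cpu++)
		CPU_SET(cpu, &set);	/* vcpu-0 -> cpus 0-4, vcpu-1 -> 5-9, ... */

	return sched_setaffinity(tid, sizeof(set), &set);
}
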
>> > 
>> > On picking a better vcpu to yield to:  I really hesitate to rely on a
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
>> > are yielding to way more vcpus than are preempted lock-holders, and
>> > that IMO is just work accomplishing nothing.  Trying to think of a way
>> > to further reduce candidate vcpus....
>> 
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in a pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning.  While we haven't fixed the problem, at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
> 
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful.  I was thinking of the worst-case scenario, where any other
> vcpu would likely spin as well, and the host-side cpu time spent
> switching vcpu threads was not all that productive.  Well, I suppose it
> does help eliminate potential lock-holding vcpus; it just does not seem
> efficient or fast enough.

If we have N-1 vcpus spin-waiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right.  Eventually
its not-a-timeslice will expire and we go through this again.  If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus.  So in this worst case, we get quadratic behaviour.
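
To put rough numbers on it (invented for illustration, not measured):
with N = 80 spinning vcpus and t_yield ~ 30us per directed-yield
attempt, one full sweep is ~2.4ms, already a large fraction of a
few-millisecond timeslice; and if runqueue-lock contention inflates
t_yield with host cpu count, a single sweep can exceed it outright.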

One way out is to increase the not-a-timeslice.  Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?
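
Roughly this shape - to be clear, both helpers below are invented;
there is no vcpu_currently_running() or sched_extend_slice() today,
they just stand for "this vcpu's task is on a cpu right now" and "let
it keep that cpu a little longer":

/* Sketch only.  If the PLE handler cannot find a runnable-but-not-running
 * vcpu to yield to, ask the scheduler to extend the slice of the vcpus
 * that are running right now, on the theory that one of them holds the
 * lock the spinner is waiting for. */
static void ple_extend_running_vcpus(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		if (vcpu == me)
			continue;
		if (vcpu_currently_running(vcpu))	/* invented helper */
			sched_extend_slice(vcpu);	/* invented helper */
	}
}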

That's not guaranteed to help: if we boost a running vcpu too much, it
will skew how vcpu runtime is distributed even after the lock is released.

> 
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
> 
> Granted, my test case is quite brutal.  It's nothing but over-committed
> VMs which always have some spin lock activity.  However, we really
> should try to fix the worst case scenario.

Yes.  And other guests may not scale as well as Linux, so they may show
this behaviour more often.

> 
>>   So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early.  That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself.  Is this
>> possible with the current code?
>> 
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
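
Concretely, something like this in place of the current candidate loop
in kvm_vcpu_on_spin() - an untested sketch; vcpu_task() is shorthand
for the get_pid_task() lookup kvm_vcpu_yield_to() already does
(reference counting of the task elided):

/* Untested sketch: the first pass only considers candidates whose task
 * is queued on this cpu (no cross-cpu runqueue locking needed to win);
 * the second pass falls back to remote vcpus as today. */
static bool ple_yield_local_first(struct kvm *kvm, struct kvm_vcpu *me)
{
	int this_cpu = raw_smp_processor_id();
	struct kvm_vcpu *vcpu;
	int pass, i;

	for (pass = 0; pass < 2; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			struct task_struct *task = vcpu_task(vcpu); /* shorthand */

			if (!task || vcpu == me)
				continue;
			if (pass == 0 && task_cpu(task) != this_cpu)
				continue;	/* local runqueue first */
			if (yield_to(task, true) > 0)
				return true;
		}
	}
	return false;
}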
>> 
>> Let's examine a few cases:
>> 
>> 1. spinner on cpu 0, lock holder on cpu 0
>> 
>> win!
>> 
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>> 
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock); by the time the spinner gets scheduled again we
>> might have won.
>> 
>> 3. spinner on cpu 0, another spinner on cpu 0
>> 
>> Worst case, we'll just spin some more.  Need to detect this case and
>> migrate something in.
> 
> Well, we can certainly experiment and see what we get.
> 
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-.  What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu.  I guess I need to find a faster way to get there.

pvspinlocks will find the right one, every time.  Otherwise I see no way
to do this.
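
For reference, the guest side of that is shaped roughly like the below.
This is not the actual pv ticketlock series; the holder_vcpu field and
the KVM_HC_YIELD_TO_VCPU hypercall number are invented here purely to
show where "the right one" comes from:

/* Illustration only.  The lock remembers which vcpu holds it, so a
 * contended vcpu can tell the host exactly whom to run instead of the
 * PLE handler guessing.  Assumes vcpu id == guest cpu id; memory
 * ordering and fairness are glossed over. */
struct pv_spinlock {
	arch_spinlock_t lock;
	int holder_vcpu;		/* -1 when unlocked */
};

static void pv_spin_lock(struct pv_spinlock *l)
{
	while (!arch_spin_trylock(&l->lock)) {
		int holder = l->holder_vcpu;

		if (holder >= 0)
			/* invented hypercall: "please run vcpu 'holder'" */
			kvm_hypercall1(KVM_HC_YIELD_TO_VCPU, holder);
		cpu_relax();
	}
	l->holder_vcpu = smp_processor_id();
}

static void pv_spin_unlock(struct pv_spinlock *l)
{
	l->holder_vcpu = -1;
	arch_spin_unlock(&l->lock);
}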

-- 
error compiling committee.c: too many arguments to function


Thread overview: 41+ messages
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39   ` Raghavendra K T
2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34   ` Raghavendra K T
2012-07-22 12:43     ` Avi Kivity
2012-07-23  7:35       ` Christian Borntraeger
2012-07-22 17:58     ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06     ` Raghavendra K T
2012-09-07 19:42       ` Andrew Theurer
2012-09-08  8:43         ` Srikar Dronamraju
2012-09-10 13:16           ` Andrew Theurer
2012-09-10 16:03             ` Peter Zijlstra
2012-09-10 16:56               ` Srikar Dronamraju
2012-09-10 17:12                 ` Peter Zijlstra
2012-09-10 19:10                   ` Raghavendra K T
2012-09-10 20:12                   ` Andrew Theurer
2012-09-10 20:19                     ` Peter Zijlstra
2012-09-10 20:31                       ` Rik van Riel
2012-09-11  6:08                     ` Raghavendra K T
2012-09-11 12:48                       ` Andrew Theurer
2012-09-11 18:27                       ` Andrew Theurer
2012-09-13 11:48                         ` Raghavendra K T
2012-09-13 21:30                           ` Andrew Theurer
2012-09-14 17:10                             ` Andrew Jones
2012-09-15 16:08                               ` Raghavendra K T
2012-09-17 13:48                                 ` Andrew Jones
2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
2012-09-17  8:02                               ` Andrew Jones
2012-09-16  8:55                             ` Avi Kivity
2012-09-17  8:10                               ` Andrew Jones
2012-09-18  3:03                               ` Andrew Theurer
2012-09-19 13:39                                 ` Avi Kivity [this message]
2012-09-13 12:13                         ` Avi Kivity
2012-09-11  7:04                   ` Srikar Dronamraju
2012-09-10 14:43         ` Raghavendra K T
