Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake

* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
       [not found] <20160913085437.GA14514@linux-gk3p>
@ 2016-09-13 11:30 ` Dario Faggioli
  2016-09-14 10:44   ` Wei Yang
  2016-09-18  4:03   ` Wei Yang
  0 siblings, 2 replies; 17+ messages in thread
From: Dario Faggioli @ 2016-09-13 11:30 UTC (permalink / raw)
  To: Wei Yang, xen-devel; +Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6

[-- Attachment #1.1: Type: text/plain, Size: 14374 bytes --]

[using xendevel correct address]

On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
> > 
> > I'm not surprised by that. Yet, I'd be interested in hearing more
> > about this profiling you have done (things like, how you captured
> > the data, what workloads you are exactly considering, how you
> > determined what is the bottleneck, etc).
> Let me try to explain this.
> 
> Workload:         a. Blue Screen in Windows Guests
> 		  b. some client using Windows to do some video
> processing
> 		     which need precise timestamp (we are not sure the
> core
> 		     reason but we see the system is very slow)
>
Do you mind sharing just a bit more, such as:
 - number of pcpus
 - number of vcpus of the various VMs

I also am not sure what "a. Blue screen in Windows guests" above
means... is there a workload called "Blue Screen"? Or is it that you
are hitting a BSOD in some of your guests (which ones, what were they
doing)? Or is it that you _want_ to provoke a BSOD on some of your
guests? Or is that something else? :-P

> Capture the data: lock profile
> Bottleneck Check: The schedule lock wait time is really high      
> 
Ok, cool. Interesting that lock profiling works on 4.1! :-O

> > The scheduler tries to see whether the v->processor of the waking
> > vcpu can be re-used, but that's not at all guaranteed, and again,
> > on a very loaded system, not even that likely!
> > 
> Hmm... I may missed something.
> 
> Took your assumption below for example. 
> In my mind, the process looks like this:
> 
>     csched_vcpu_wake()
>         __runq_insert(),  insert the vcpu in pcpu's 6 runq
>         __runq_tickle(),  raise SCHEDULE_SOFTIRQ on pcpu 6 or other
> (1)
>     
Well, yes. More precisely, at least in current staging,
SCHEDULE_SOFTIRQ is raised for pcpu 6:
 - if pcpu 6 is idle, or
 - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
   the waking vcpu is higher in priority than what is running on pcpu 6

>     __do_softirq()
>         schedule()
>             csched_schedule(),  for pcpu 6, it may wake up d2v2 based
> on it's
>                                 priority
>
Yes, but it is pcpu 6 that will run csched_schedule() only if the
conditions mentioned above are met. If not, it will be some other pcpu
that will do so (or none!).

But I just realized that the fact that you are actually working on 4.1
is going to be an issue. In fact, the code that it is now in Xen has
changed **a lot** since 4.1. In fact, you're missing all the soft-
affinity work (which you may or may not find useful) but also
improvements and bugfixing.

I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
but there's very few chances I'll be available to provide detailed
review, advice, testing, etc., on such an old codebase. :-(

> By looking at the code, what I missed may be in __runq_tickle(),
> which in case
> there are idle pcpu, schedule_softirq is also raised on them. By
> doing so,
> those idle pcpus would steal other busy tasks and may in chance get
> d2v2?
> 
Yes, it looks like, in Xen 4.1, this is more or less what happens. The
idea is to always tickle pcpu 6, if the priority of the waking vcpu is
higher than what is running there. 

If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
 - if the waking vcpu preempted what was running on pcpu 6, an idler
   can pick-up ("steal") such preempted vcpu and run it;
 - if the waking vcpu ended up in the runqueue, an idler can steal it

> BTW, if the system is in heavy load, how much chance would we have
> idle pcpu?
> 
It's not very likely that there will be idle pcpus in a very busy
system, I agree.

> > > We can divide this in three steps:
> > > a) Send IPI to the target CPU and tell it which vcpu we want it
> > > to 
> > > wake up.
> > > b) The interrupt handler on target cpu insert vcpu to a percpu
> > > queue 
> > > and raise
> > >    softirq.
> > > c) softirq will dequeue the vcpu and wake it up.
> > > 
> > I'm not sure I see how you think this would improve the situation.
> > 
> > So, quickly, right now:
> > 
> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
> > - let's assume d2v2->processor = 6
> > - let's assume the wakeup is happening on pcpu 4
> > 
> > Right now:
> > 
> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
> >   which is the runqueue lock of pcpu 6 (in Credit1, there is 1
> >   runqueue per pcpu, and locks are per-cpu already)
> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> > - still executing on pcpu 4, __runq_tickle() is called, and it
> >   determines on which pcpu d2v2 should run
> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >   following two cases:
> >    a) if it was determined that d2v2 should run on pcpu 6:
> >        - in a (little, hopefully) while, pcpu 6 will schedule
> >        - it takes its own runqueue lock
> >        - it finds d2v2 in there, and runs it
> Even in this case, it depends on the priority of d2v2 whether it
> would be run
> now. Right?
> 
Yes. But both in 4.1 and in current staging, we only raise
SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking
vcpu is higher and it should preempt the currently running vcpu.

> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
> >        - in a (little, hopefully) while, pcpu 9 will schedule
> >        - it takes its own runqueue lock
> >        - if it has nothing in its runqueue (likely, if
> >          __runq_tickle() chose it, but not necessarily and always
> >          true, especially under high load), it looks around to
> >          see if other pcpus have any vcpu that it can execute
> >        - it will try to take the locks of the various pcpu's
> >          runqueues it comes across
> >        - if it manages to take the lock of pcpu's 6 runqueue, it
> >          sees d2v2 in it, steal it and run it.
> Oh, my understanding matches yours :-)
> 
> BYW, we also found in csched_load_balance() will hold the schedule
> lock, but
> not found a proper vcpu to steal. Maybe this would be a potential
> optimization
> point.
> 
Mmm... I actually don't understand what you mean here... what is the
possible optimization?

> > With your modifications, AFAICT:
> > 
> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
> >   of pcpu 6 and queue d2v2 in pcpu's 6 deferred wakeup list
> > - poke pcpu 6 with an IPI
> > - in a (little, hopefully) while, pcpu 6 react to the IPI and,
> >   _I_think_, call vcpu_wake(d2v2) ? [*]
> You are almost right. Here in the interrupt handler, it does two
> things:
> 1. echo back notify_remote_vcpu_wake() it finishes the job
> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
> 
> > 
> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> > - on pcpu 6, __runq_tickle() is called, and it determines on which
> >   pcpu d2v2 should run
> At this place, the behavior of __run_tickle() is what I mentioned
> above. Raise
> softirq on pcpu6 and idle pcpus. Who is faster, who get d2v2.
>
Sort of, yes. In general, you can expect pcpu 6 going through
csched_schedule(), and hence setting d2v2 to run, to be faster that
SCHEDULE_SOFIRQ being notified to someone remote, and it scheduling and
getting to work stealing from pcpu 6's runqueue.

However, that is the case only if d2v2 had higher priority than what is
running on pcpu 6. If it does not, you:
 - _don't_ tickle pcpu 6
 - tickle one or more idlers, if any.

So, again, it's not that you always wake and run it on pcpu 6

> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >   following two cases:
> >    a) if it was determined that d2v2 should run on pcpu 6
> >       (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
> >        - at the first softirq check, pcpu 6 will schedule
> >        - it takes its own runqueue lock
> >        - it finds d2v2 in there, and runs it
> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
> >        - in a (little, hopefully) while, pcpu 9 will schedule
> >        - it takes its own runqueue lock
> >        - if it has nothing in its runqueue (likely, if
> >          __runq_tickle() chose it, but not necessarily and always
> >          true, especially under high load), it looks around to
> >          see if other pcpus have any vcpu that it can execute
> >        - it will try to take the locks of the various pcpu's
> >          runqueues it comes across
> >        - if it manages to take the lock of pcpu's 6 runqueue, it
> >          sees d2v2 in it, steal it and run it.
> > 
> > [*] I don't see in the code you shared what happens in reaction to
> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
> > calls vcpu_wake()
> > 
> > So, basically, it looks to me that you're adding a level of
> > indirection, I'm not sure for what benefit.

> In my mind, we are trying to reduce the contention on schedule lock
> from two
> aspects:
> 1. for those vcpus would run on other pcpu, it will take
> wake_vcpu_lock
>    instead of schedule lock
>
I don't see how you can say "instead". It looks to me that, for those
vcpus what would run on other pcpu, we need to take wake_vcpu_lock
_in_addition_ to the runqueue lock.

> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
> also
>    reduce the cache coherency between pcpus
> 
Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
remote lock, though.

> > In fact, in the best case (i.e., d2v2 will actually be run on its
> > v->processor), in the original code there is:
> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
> > In your case, there is:
> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
> > 
> Our idea may looks like this:
>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
> 
So, it's even worse than how I imagined! :-P

> > Also, as far as locking within the wakeup path is concerned, in the
> > original code:
> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> >   (i.e., for the time of __runq_insert() and __runq_tickle().
> > In your case:
> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the
> >   deferred wakeup
> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> > 
> The wake_vcpu_lock is hold in the first step in above.
> The runqueue lock(I think you mean the schedule lock?) won't be taken
> in this
> case.
> 
It will be taken, for executing vcpu_wake(). From your own description
of the mechanism above:

"Our idea may looks like this:
   1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
   2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the
      vcpu.
   3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."

1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake()
3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
   csched_schedule()

> > 
> > (i.e., for the time of __runq_insert() and __runq_tickle().
> > 
> > Which, to me, looks like both more inter-CPU communication overhead
> > and more chances for lock contention.
> Hmm... yes, more inter-CPU communication, while less lock contention
> I think.
> 
More inter-CPU communication, the same level of local lock contention
plus some added moderately to low contented remote lock contention,
AFAICT.

> > So, again, more IPIs and more (potential) lock contention.
> > 
> > True the queueing into the deferred wakeup list will be very simple
> > and quick, and hence the critical section needed for implementing
> > it very short. But the problem is that you are not shortening
> > something long, or splitting something long in two shorter chunks,
> > you are _adding_ something, and thus you can't possibly win. :-/
> > 
> Yes, we are moving some thing, let me digest it for a while.
> 
Sure. Let me also add that I'm not saying it's not going to work, or
that it's not going to improve the situation.

What I'm saying is that, architecturally, it's not too different from
the current situation, so I would not expect wonders from just this.

What I'd do would be to try figuring out where it is that the most
waiting time is being accumulated, i.e., which *specific* locking
attempts of the scheduler's runqueues' locks are the most contended,
and focus on those first.

If you're on a big host, I think the way in which Credit1 does load
balancing (i.e., by work stealing) may be the thing to blame. But all
these are things that only profiling and benchmarking can really tell!
:-)

As a side note, we're also working on something which, at the end of
the day, is rather similar, although for different purposes. In fact,
see here:
https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html

The goal is not having to disable IRQs during scheduling, and I'm
looking at keeping the per-pcpu deferred wakeup list lockless, in order
not to add more lock contention.

If you're interested on knowing more or working on that, just let me
know. But, of course, this have to be done on Xen upstream, not on an
old version.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread