* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Dario Faggioli
Date: 2016-09-13 11:30 UTC
To: Wei Yang, xen-devel
Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6

[using xendevel correct address]

On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
> >
> > I'm not surprised by that. Yet, I'd be interested in hearing more
> > about this profiling you have done (things like, how you captured
> > the data, what workloads you are exactly considering, how you
> > determined what is the bottleneck, etc).
> Let me try to explain this.
>
> Workload: a. Blue Screen in Windows Guests
>           b. some client using Windows to do some video processing
>              which needs precise timestamps (we are not sure of the
>              core reason but we see the system is very slow)
>
Do you mind sharing just a bit more, such as:
 - number of pcpus
 - number of vcpus of the various VMs

I also am not sure what "a. Blue screen in Windows guests" above
means... is there a workload called "Blue Screen"? Or is it that you
are hitting a BSOD in some of your guests (which ones, what were they
doing)? Or is it that you _want_ to provoke a BSOD on some of your
guests? Or is that something else? :-P

> Capture the data: lock profile
> Bottleneck Check: The schedule lock wait time is really high
>
Ok, cool. Interesting that lock profiling works on 4.1! :-O

> > The scheduler tries to see whether the v->processor of the waking
> > vcpu can be re-used, but that's not at all guaranteed, and again,
> > on a very loaded system, not even that likely!
> >
> Hmm... I may have missed something.
>
> Took your assumption below for example.
> In my mind, the process looks like this:
>
>     csched_vcpu_wake()
>         __runq_insert(), insert the vcpu in pcpu's 6 runq
>         __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other (1)
>
Well, yes. More precisely, at least in current staging,
SCHEDULE_SOFTIRQ is raised for pcpu 6:
 - if pcpu 6 is idle, or
 - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
   the waking vcpu is higher in priority than what is running on pcpu 6

>     __do_softirq()
>         schedule()
>             csched_schedule(), for pcpu 6, it may wake up d2v2 based
>             on its priority
>
Yes, but it is pcpu 6 that will run csched_schedule() only if the
conditions mentioned above are met. If not, it will be some other pcpu
that will do so (or none!).

But I just realized that the fact that you are actually working on 4.1
is going to be an issue. In fact, the code that is now in Xen has
changed **a lot** since 4.1. In fact, you're missing all the
soft-affinity work (which you may or may not find useful) but also
improvements and bugfixing.

I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
but there's very little chance I'll be available to provide detailed
review, advice, testing, etc., on such an old codebase. :-(

> By looking at the code, what I missed may be in __runq_tickle(),
> which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on them
> too. By doing so, those idle pcpus would steal other busy tasks and
> may by chance get d2v2?
>
Yes, it looks like, in Xen 4.1, this is more or less what happens. The
idea is to always tickle pcpu 6, if the priority of the waking vcpu is
higher than what is running there.

If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
 - if the waking vcpu preempted what was running on pcpu 6, an idler
   can pick up ("steal") such preempted vcpu and run it;
 - if the waking vcpu ended up in the runqueue, an idler can steal it

> BTW, if the system is in heavy load, how much chance would we have
> idle pcpus?
It's not very likely that there will be idle pcpus in a very busy
system, I agree.

> > > We can divide this in three steps:
> > > a) Send IPI to the target CPU and tell it which vcpu we want it
> > >    to wake up.
> > > b) The interrupt handler on the target cpu inserts the vcpu into
> > >    a percpu queue and raises a softirq.
> > > c) The softirq will dequeue the vcpu and wake it up.
> > >
> > I'm not sure I see how you think this would improve the situation.
> >
> > So, quickly, right now:
> >
> >  - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
> >  - let's assume d2v2->processor = 6
> >  - let's assume the wakeup is happening on pcpu 4
> >
> > Right now:
> >
> >  - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
> >    which is the runqueue lock of pcpu 6 (in Credit1, there is 1
> >    runqueue per pcpu, and locks are per-cpu already)
> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> >  - still executing on pcpu 4, __runq_tickle() is called, and it
> >    determines on which pcpu d2v2 should run
> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >    following two cases:
> >    a) if it was determined that d2v2 should run on pcpu 6:
> >       - in a (little, hopefully) while, pcpu 6 will schedule
> >       - it takes its own runqueue lock
> >       - it finds d2v2 in there, and runs it
> Even in this case, it depends on the priority of d2v2 whether it
> would be run now. Right?
>
Yes. But both in 4.1 and in current staging, we only raise
SCHEDULE_SOFTIRQ on pcpu 6 if and only if the priority of the waking
vcpu is higher and it should preempt the currently running vcpu.
> > b) if it was determined that d2v2 should run on, say, pcpu 9:
> >    - in a (little, hopefully) while, pcpu 9 will schedule
> >    - it takes its own runqueue lock
> >    - if it has nothing in its runqueue (likely, if
> >      __runq_tickle() chose it, but not necessarily and always
> >      true, especially under high load), it looks around to
> >      see if other pcpus have any vcpu that it can execute
> >    - it will try to take the locks of the various pcpus'
> >      runqueues it comes across
> >    - if it manages to take the lock of pcpu's 6 runqueue, it
> >      sees d2v2 in it, steals it and runs it.
> Oh, my understanding matches yours :-)
>
> BTW, we also found that csched_load_balance() will hold the schedule
> lock but not find a proper vcpu to steal. Maybe this would be a
> potential optimization point.
>
Mmm... I actually don't understand what you mean here... what is the
possible optimization?

> > With your modifications, AFAICT:
> >
> >  - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
> >    of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list
> >  - poke pcpu 6 with an IPI
> >  - in a (little, hopefully) while, pcpu 6 reacts to the IPI and,
> >    _I_think_, calls vcpu_wake(d2v2) ? [*]
> You are almost right. Here in the interrupt handler, it does two
> things:
> 1. echo back to notify_remote_vcpu_wake() that it finished the job
> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>
> >
> >  - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> >  - on pcpu 6, __runq_tickle() is called, and it determines on which
> >    pcpu d2v2 should run
> At this place, the behavior of __runq_tickle() is what I mentioned
> above: raise softirq on pcpu 6 and the idle pcpus. Whoever is faster
> gets d2v2.
>
Sort of, yes.
In general, you can expect pcpu 6 going through csched_schedule(), and
hence setting d2v2 to run, to be faster than SCHEDULE_SOFTIRQ being
notified to someone remote, and that someone scheduling and getting to
work stealing from pcpu 6's runqueue.

However, that is the case only if d2v2 had higher priority than what is
running on pcpu 6. If it does not, you:
 - _don't_ tickle pcpu 6
 - tickle one or more idlers, if any.

So, again, it's not that you always wake and run it on pcpu 6

> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >   following two cases:
> >   a) if it was determined that d2v2 should run on pcpu 6
> >      (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
> >      - at the first softirq check, pcpu 6 will schedule
> >      - it takes its own runqueue lock
> >      - it finds d2v2 in there, and runs it
> >   b) if it was determined that d2v2 should run on, say, pcpu 9:
> >      - in a (little, hopefully) while, pcpu 9 will schedule
> >      - it takes its own runqueue lock
> >      - if it has nothing in its runqueue (likely, if
> >        __runq_tickle() chose it, but not necessarily and always
> >        true, especially under high load), it looks around to
> >        see if other pcpus have any vcpu that it can execute
> >      - it will try to take the locks of the various pcpus'
> >        runqueues it comes across
> >      - if it manages to take the lock of pcpu's 6 runqueue, it
> >        sees d2v2 in it, steals it and runs it.
> >
> > [*] I don't see in the code you shared what happens in reaction to
> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
> > calls vcpu_wake()
> >
> > So, basically, it looks to me that you're adding a level of
> > indirection, I'm not sure for what benefit.
> In my mind, we are trying to reduce the contention on the schedule
> lock from two aspects:
> 1. for those vcpus that would run on other pcpus, it will take the
>    wake_vcpu_lock instead of the schedule lock
>
I don't see how you can say "instead".
It looks to me that, for those vcpus that would run on other pcpus, we
need to take the wake_vcpu_lock _in_addition_ to the runqueue lock.

> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
>    also reduces the cache coherency traffic between pcpus
>
Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
remote lock, though.

> > In fact, in the best case (i.e., d2v2 will actually be run on its
> > v->processor), in the original code there is:
> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
> > In your case, there is:
> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
> >  - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
> >
> Our idea may look like this:
>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>
So, it's even worse than how I imagined! :-P

> > Also, as far as locking within the wakeup path is concerned, in the
> > original code:
> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> >    (i.e., for the time of __runq_insert() and __runq_tickle()).
> > In your case:
> >  - the wake_vcpu_lock of pcpu 6 is held during queueing of the
> >    deferred wakeup
> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> >
> The wake_vcpu_lock is held in the first step above.
> The runqueue lock (I think you mean the schedule lock?) won't be
> taken in this case.
>
It will be taken, for executing vcpu_wake(). From your own description
of the mechanism above:

"Our idea may look like this:
 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
   csched_schedule()

> > > > (i.e., for the time of __runq_insert() and __runq_tickle()).
> > >
> > > Which, to me, looks like both more inter-CPU communication
> > > overhead and more chances for lock contention.
> > Hmm... yes, more inter-CPU communication, while less lock
> > contention I think.
> >
More inter-CPU communication, the same level of local lock contention,
plus some additional, moderately to lightly contended, remote locking,
AFAICT.

> > So, again, more IPIs and more (potential) lock contention.
> >
> > True, the queueing into the deferred wakeup list will be very
> > simple and quick, and hence the critical section needed for
> > implementing it very short. But the problem is that you are not
> > shortening something long, or splitting something long in two
> > shorter chunks; you are _adding_ something, and thus you can't
> > possibly win. :-/
> >
> Yes, we are moving some things around, let me digest it for a while.
>
Sure. Let me also add that I'm not saying it's not going to work, or
that it's not going to improve the situation.

What I'm saying is that, architecturally, it's not too different from
the current situation, so I would not expect wonders from just this.

What I'd do would be to try figuring out where it is that the most
waiting time is being accumulated, i.e., which *specific* locking
attempts on the scheduler's runqueues' locks are the most contended,
and focus on those first.

If you're on a big host, I think the way in which Credit1 does load
balancing (i.e., by work stealing) may be the thing to blame. But all
these are things that only profiling and benchmarking can really tell!
:-)

As a side note, we're also working on something which, at the end of
the day, is rather similar, although for different purposes.
In fact, see here:
https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html

The goal is not having to disable IRQs during scheduling, and I'm
looking at keeping the per-pcpu deferred wakeup list lockless, in
order not to add more lock contention.

If you're interested in knowing more or working on that, just let me
know. But, of course, this has to be done on Xen upstream, not on an
old version.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Wei Yang
Date: 2016-09-14 10:44 UTC
To: Dario Faggioli
Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang,
    mengxu, Wei Yang

On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>[using xendevel correct address]
>
>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
>> >
>> > I'm not surprised by that. Yet, I'd be interested in hearing more
>> > about this profiling you have done (things like, how you captured
>> > the data, what workloads you are exactly considering, how you
>> > determined what is the bottleneck, etc).
>> Let me try to explain this.
>>
>> Workload: a. Blue Screen in Windows Guests
>>           b. some client using Windows to do some video processing
>>              which needs precise timestamps (we are not sure of the
>>              core reason but we see the system is very slow)
>>
>Do you mind sharing just a bit more, such as:
> - number of pcpus
> - number of vcpus of the various VMs

160 pcpus
16 vcpus per VM, and 8 VMs

>
>I also am not sure what "a. Blue screen in Windows guests" above
>means... is there a workload called "Blue Screen"? Or is it that you
>are hitting a BSOD in some of your guests (which ones, what were they
>doing)? Or is it that you _want_ to provoke a BSOD on some of your
>guests? Or is that something else? :-P
>
Yes, the "Blue Screen" is what we use to mimic the behavior from the
client.

The "Blue Screen" will force the hypervisor to do load balancing, in
my mind.

>> Capture the data: lock profile
>> Bottleneck Check: The schedule lock wait time is really high
>>
>Ok, cool. Interesting that lock profiling works on 4.1!
:-O
>
>> > The scheduler tries to see whether the v->processor of the waking
>> > vcpu can be re-used, but that's not at all guaranteed, and again,
>> > on a very loaded system, not even that likely!
>> >
>> Hmm... I may have missed something.
>>
>> Took your assumption below for example.
>> In my mind, the process looks like this:
>>
>>     csched_vcpu_wake()
>>         __runq_insert(), insert the vcpu in pcpu's 6 runq
>>         __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other (1)
>>
>Well, yes. More precisely, at least in current staging,
>SCHEDULE_SOFTIRQ is raised for pcpu 6:
> - if pcpu 6 is idle, or
> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
>   the waking vcpu is higher in priority than what is running on pcpu 6
>
>>     __do_softirq()
>>         schedule()
>>             csched_schedule(), for pcpu 6, it may wake up d2v2 based
>>             on its priority
>>
>Yes, but it is pcpu 6 that will run csched_schedule() only if the
>conditions mentioned above are met. If not, it will be some other pcpu
>that will do so (or none!).
>
>But I just realized that the fact that you are actually working on 4.1
>is going to be an issue. In fact, the code that is now in Xen has
>changed **a lot** since 4.1. In fact, you're missing all the
>soft-affinity work (which you may or may not find useful) but also
>improvements and bugfixing.
>
That's true...

>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
>but there's very little chance I'll be available to provide detailed
>review, advice, testing, etc., on such an old codebase. :-(
>
>> By looking at the code, what I missed may be in __runq_tickle(),
>> which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on them
>> too. By doing so, those idle pcpus would steal other busy tasks and
>> may by chance get d2v2?
>>
>Yes, it looks like, in Xen 4.1, this is more or less what happens. The
>idea is to always tickle pcpu 6, if the priority of the waking vcpu is
>higher than what is running there.
>
Hmm...
in case there are idle pcpus and the priority of the waking vcpu is
higher than what is running on pcpu 6, would pcpu 6 have more chance
to run it? Or would another idle pcpu steal it from pcpu 6? Or do they
have an equal chance?

>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
> - if the waking vcpu preempted what was running on pcpu 6, an idler
>   can pick up ("steal") such preempted vcpu and run it;
> - if the waking vcpu ended up in the runqueue, an idler can steal it
>
Hmm... I don't get the difference between these two cases.

It looks like both are an idler stealing the vcpu.

>> BTW, if the system is in heavy load, how much chance would we have
>> idle pcpus?
>>
>It's not very likely that there will be idle pcpus in a very busy
>system, I agree.
>
>> > > We can divide this in three steps:
>> > > a) Send IPI to the target CPU and tell it which vcpu we want it
>> > >    to wake up.
>> > > b) The interrupt handler on the target cpu inserts the vcpu into
>> > >    a percpu queue and raises a softirq.
>> > > c) The softirq will dequeue the vcpu and wake it up.
>> > >
>> > I'm not sure I see how you think this would improve the situation.
>> >
>> > So, quickly, right now:
>> >
>> >  - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
>> >  - let's assume d2v2->processor = 6
>> >  - let's assume the wakeup is happening on pcpu 4
>> >
>> > Right now:
>> >
>> >  - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
>> >    which is the runqueue lock of pcpu 6 (in Credit1, there is 1
>> >    runqueue per pcpu, and locks are per-cpu already)
>> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> >  - still executing on pcpu 4, __runq_tickle() is called, and it
>> >    determines on which pcpu d2v2 should run
>> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu.
Let's look at the
>> >    following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6:
>> >       - in a (little, hopefully) while, pcpu 6 will schedule
>> >       - it takes its own runqueue lock
>> >       - it finds d2v2 in there, and runs it
>> Even in this case, it depends on the priority of d2v2 whether it
>> would be run now. Right?
>>
>Yes. But both in 4.1 and in current staging, we only raise
>SCHEDULE_SOFTIRQ on pcpu 6 if and only if the priority of the waking
>vcpu is higher and it should preempt the currently running vcpu.
>
Oh, you are right. I didn't catch this either.

This means that in case
  a) the priority is lower than the current one, and
  b) there is no idler in the system,
the vcpu will sit quietly in the runq and wait for the next schedule.

>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >       - in a (little, hopefully) while, pcpu 9 will schedule
>> >       - it takes its own runqueue lock
>> >       - if it has nothing in its runqueue (likely, if
>> >         __runq_tickle() chose it, but not necessarily and always
>> >         true, especially under high load), it looks around to
>> >         see if other pcpus have any vcpu that it can execute
>> >       - it will try to take the locks of the various pcpus'
>> >         runqueues it comes across
>> >       - if it manages to take the lock of pcpu's 6 runqueue, it
>> >         sees d2v2 in it, steals it and runs it.
>> Oh, my understanding matches yours :-)
>>
>> BTW, we also found that csched_load_balance() will hold the schedule
>> lock but not find a proper vcpu to steal. Maybe this would be a
>> potential optimization point.
>>
>Mmm... I actually don't understand what you mean here... what is the
>possible optimization?
>
>> > With your modifications, AFAICT:
>> >
>> >  - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
>> >    of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list
>> >  - poke pcpu 6 with an IPI
>> >  - in a (little, hopefully) while, pcpu 6 reacts to the IPI and,
>> >    _I_think_, calls vcpu_wake(d2v2) ?
[*]
>> You are almost right. Here in the interrupt handler, it does two
>> things:
>> 1. echo back to notify_remote_vcpu_wake() that it finished the job
>> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>>
>> >
>> >  - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
>> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> >  - on pcpu 6, __runq_tickle() is called, and it determines on which
>> >    pcpu d2v2 should run
>> At this place, the behavior of __runq_tickle() is what I mentioned
>> above: raise softirq on pcpu 6 and the idle pcpus. Whoever is faster
>> gets d2v2.
>>
>Sort of, yes. In general, you can expect pcpu 6 going through
>csched_schedule(), and hence setting d2v2 to run, to be faster than
>SCHEDULE_SOFTIRQ being notified to someone remote, and that someone
>scheduling and getting to work stealing from pcpu 6's runqueue.
>
>However, that is the case only if d2v2 had higher priority than what is
>running on pcpu 6. If it does not, you:
> - _don't_ tickle pcpu 6
> - tickle one or more idlers, if any.
>
>So, again, it's not that you always wake and run it on pcpu 6
>
I see what you mean :-)

So I am curious why we add this to pcpu's 6 runqueue, when we already
know in __runq_tickle() that the priority is low and we won't tickle
pcpu 6, and we are sure someone will grab pcpu's 6 schedule lock in
the future to take it.

Or, in this case -- when there are idlers and the vcpu's priority is
higher -- instead of letting pcpu 6 wake it, let an idler take it?
Ask someone who is free to help a busy one.

>> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu.
Let's look at the
>> >    following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6
>> >       (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
>> >       - at the first softirq check, pcpu 6 will schedule
>> >       - it takes its own runqueue lock
>> >       - it finds d2v2 in there, and runs it
>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >       - in a (little, hopefully) while, pcpu 9 will schedule
>> >       - it takes its own runqueue lock
>> >       - if it has nothing in its runqueue (likely, if
>> >         __runq_tickle() chose it, but not necessarily and always
>> >         true, especially under high load), it looks around to
>> >         see if other pcpus have any vcpu that it can execute
>> >       - it will try to take the locks of the various pcpus'
>> >         runqueues it comes across
>> >       - if it manages to take the lock of pcpu's 6 runqueue, it
>> >         sees d2v2 in it, steals it and runs it.
>> >
>> > [*] I don't see in the code you shared what happens in reaction to
>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
>> > calls vcpu_wake()
>> >
>> > So, basically, it looks to me that you're adding a level of
>> > indirection, I'm not sure for what benefit.
>
>> In my mind, we are trying to reduce the contention on the schedule
>> lock from two aspects:
>> 1. for those vcpus that would run on other pcpus, it will take the
>>    wake_vcpu_lock instead of the schedule lock
>>
>I don't see how you can say "instead". It looks to me that, for those
>vcpus that would run on other pcpus, we need to take the
>wake_vcpu_lock _in_addition_ to the runqueue lock.
>
>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
>>    also reduces the cache coherency traffic between pcpus
>>
>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
>remote lock, though.

Hmm... you are right.

>
>> > In fact, in the best case (i.e., d2v2 will actually be run on its
>> > v->processor), in the original code there is:
>> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
>> > In your case, there is:
>> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
>> >  - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
>> >
>> Our idea may look like this:
>>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>
>So, it's even worse than how I imagined! :-P
>
>> > Also, as far as locking within the wakeup path is concerned, in the
>> > original code:
>> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> >    (i.e., for the time of __runq_insert() and __runq_tickle()).
>> > In your case:
>> >  - the wake_vcpu_lock of pcpu 6 is held during queueing of the
>> >    deferred wakeup
>> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> >
>> The wake_vcpu_lock is held in the first step above.
>> The runqueue lock (I think you mean the schedule lock?) won't be
>> taken in this case.
>>
>It will be taken, for executing vcpu_wake(). From your own description
>of the mechanism above:
>
>"Our idea may look like this:
>  1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>  2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>  3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>
>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
>2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
>   csched_schedule()
>
Oh, you are right.

We try to let pcpu's 6 schedule lock be grabbed only by pcpu 6 itself,
while the wake_vcpu_lock is the one grabbed by others.

I am thinking our change may help in a situation like this:
 * the vcpu's priority is higher
 * pcpu 9 is idle
 * SCHEDULE_SOFTIRQ is raised on both pcpu 6 and pcpu 9

At that moment, if pcpu 4 is holding pcpu 6's runqueue lock, neither
pcpu 6 nor pcpu 9 can do their job until pcpu 4 releases pcpu 6's
runqueue lock.
While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6
and pcpu 9 could do their job immediately.

Ah I know, a very corner case; not sure about the effect.

BTW, I suddenly found that a softirq is implemented as an IPI. So I am
wondering whether it is safe to raise a softirq to ourselves, since it
looks like we already hold the runqueue lock. Or is it because we have
turned the irq off in vcpu_wake()?

>> > >
>> > > (i.e., for the time of __runq_insert() and __runq_tickle()).
>> >
>> > Which, to me, looks like both more inter-CPU communication
>> > overhead and more chances for lock contention.
>> Hmm... yes, more inter-CPU communication, while less lock
>> contention I think.
>>
>More inter-CPU communication, the same level of local lock contention,
>plus some additional, moderately to lightly contended, remote locking,
>AFAICT.
>
>> > So, again, more IPIs and more (potential) lock contention.
>> >
>> > True, the queueing into the deferred wakeup list will be very
>> > simple and quick, and hence the critical section needed for
>> > implementing it very short. But the problem is that you are not
>> > shortening something long, or splitting something long in two
>> > shorter chunks; you are _adding_ something, and thus you can't
>> > possibly win. :-/
>> >
>> Yes, we are moving some things around, let me digest it for a while.
>>
>Sure. Let me also add that I'm not saying it's not going to work, or
>that it's not going to improve the situation.
>
>What I'm saying is that, architecturally, it's not too different from
>the current situation, so I would not expect wonders from just this.
>
>What I'd do would be to try figuring out where it is that the most
>waiting time is being accumulated, i.e., which *specific* locking
>attempts on the scheduler's runqueues' locks are the most contended,
>and focus on those first.
>
>If you're on a big host, I think the way in which Credit1 does load
>balancing (i.e., by work stealing) may be the thing to blame.
>But all
>these are things that only profiling and benchmarking can really tell!
>:-)
>
>As a side note, we're also working on something which, at the end of
>the day, is rather similar, although for different purposes. In fact,
>see here:
>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html
>
>The goal is not having to disable IRQs during scheduling, and I'm
>looking at keeping the per-pcpu deferred wakeup list lockless, in
>order not to add more lock contention.
>
>If you're interested in knowing more or working on that, just let me
>know. But, of course, this has to be done on Xen upstream, not on an
>old version.
>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

--
Richard Yang
Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Dario Faggioli
Date: 2016-09-14 16:18 UTC
To: Wei Yang
Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu,
    xen-devel

On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
> >
> > Do you mind sharing just a bit more, such as:
> >  - number of pcpus
> >  - number of vcpus of the various VMs
> 160 pcpus
> 16 vcpus per VM, and 8 VMs
>
So, 16x8=128, which means you're not even oversubscribing. Maybe some
of the pcpus are hyperthreads and not full cores (are they?), but
still, this is not something I'd describe as "super intensive cpu
load".

Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32
vcpus (is this the case?), you technically are oversubscribing. But
then, what does dom0 do? Is it very busy doing some stuff on its own,
or running the backends/etc for the VMs? Can you check that with top,
xentop, and the like?

If the system is not overbooked, it's a bit strange that the scheduler
is the bottleneck.

> Yes, the "Blue Screen" is what we use to mimic the behavior from the
> client.
>
> The "Blue Screen" will force the hypervisor to do load balancing, in
> my mind.
>
I've no idea what this means (but I don't know much of Windows and of
what happens when it crashes with a blue screen).

> > I'll have a quick look at how __runq_tickle() looks like in Xen
> > 4.1, but there's very little chance I'll be available to provide
> > detailed review, advice, testing, etc., on such an old codebase. :-(
> >
> > > By looking at the code, what I missed may be in __runq_tickle(),
> > > which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on
> > > them too.
By > > > doing so, > > > those idle pcpus would steal other busy tasks and may in chance > > > get > > > d2v2? > > > > > Yes, it looks like, in Xen 4.1, this is more or less what happens. > > The > > idea is to always tickle pcpu 6, if the priority of the waking vcpu > > is > > higher than what is running there. > > > Hmm... in case there are idle pcpus and the priority of the waking > vcpu is > higher than what is running on pcpu 6, would pcpu 6 have more chance > to run it? > or other idle pcpu would steal it from pcpu6? or they have equal > chance? >
No, it works like this:

 - pcpu 6 is either idle or it is running someone already (say d3v4)
   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) pcpu 6 itself, and we're done
   + if pcpu 6 is running d3v4 and there is no other idle pcpu:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done
     * if prio(d2v2) < prio(d3v4) we just don't tickle anyone
   + if pcpu 6 is running d3v4 and there are some idle pcpus:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or more of the idle pcpus [1]
     * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more of the idle pcpus [2]

Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's why we tickle pcpu 6. However, that would mean that d3v4 would be put back in pcpu's 6 runqueue. So, if there are idle pcpus, we tickle them so that one can come and steal d3v4.

Idea behind [2] is that d3v4 should continue to run on pcpu 6, while d2v2 will be put in pcpu's 6 runqueue. However, if there are idle pcpus, we tickle them so that one can come and steal d2v2.

What really happens is not always what is expected, because it's possible that, even if prio(d2v2)>prio(d3v4), an idler is quicker in waking up and stealing d2v2 from pcpu's 6 runqueue where it was stashed while pcpu 6 itself reschedules and enacts the preemption...
But that's unlikely and, all in all, not catastrophic (although, of course, worse in terms of overhead, locality, etc)

> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so > > that: > > - if the waking vcpu preempted what was running on pcpu 6, an > > idler > > can pick-up ("steal") such preempted vcpu and run it; > > - if the waking vcpu ended up in the runqueue, an idler can steal > > it > > > Hmm... I don't get the difference between these two cases. > > Looks both are an idler steal the vcpu. >
I hope it's more clear now. :-)

> > Yes. But both in 4.1 and in current staging, we only raise > > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the > > waking > > vcpu is higher and it should preempt the currently running vcpu. > > > Oh, you are right. I didn't catch this either. > > This means in case > a) the priority is lower than current one > b) no idler in system > > The vcpu will sit quietly in the runq and waiting for the schedule > next time. >
Well, yes. It will stay in pcpu's 6 runqueue until either:
 - what's running on pcpu 6 blocks, or finishes its credits, etc., and pcpu 6 reschedules and picks it up
 - some other pcpu becomes idle and, when poking around the various pcpus for stealing work, picks it up

Of course, both depend on what position in pcpu's 6 runqueue the vcpu is in when the specific event happens.

This is a limit of the Credit1 scheduler, especially of the version that you find in 4.1. Newer code is a bit better wrt this (although not dramatically... that has not been possible! :-/).

> > However, that is the case only if d2v2 had higher priority than > > what is > > running on pcpu 6. If it does not, you: > > - _don't_ tickle pcpu 6 > > - tickle one or more idlers, if any. > > > > So, again, it's not that you always wake and run it on pcpu 6 > > > I see what you mean :-) > > So I am curious about why we add this in pcpu's 6 queue, when we know > the > priority is low in __runq_tickle() and won't tickle that.
And we are > sure > someone in the future will grab pcpu's 6 schedule lock and take it. >
Yes, but what should we do? We've woken it up and its v->processor points to pcpu 6, so the lock that vcpu_schedule_lock() takes is pcpu's 6 runqueue lock.

To queue it on some other pcpu we'd need to basically migrate it (i.e., at least changing the value of v->processor, but also other things, like checking hard affinity, etc). And we also would need to take the runqueue lock of another pcpu's runqueue. All this in the wakeup path, which will become slow and complex. And we'd need all the work stealing logic (in load_balance(), etc) to still exist, to cover the case of work needing to be stolen when wakeups are not involved (i.e., it's not that we make other parts of the scheduler's code less complex).

So, basically, of course there probably were alternatives, but the basic model, pretty much everywhere, is "queue where it was and tickle others to come pick it up if suitable", and that's what's done here too (and consistency is a very good thing :-D).

> Or in this case -- when there are idlers and vcpu's priority is > higher, > instead of let pcpu 6 to wake this, but let an idler to take it? Ask > someone > who is free to help a busy one. >
And in fact, in more recent versions of Xen, I made the code do something very close to what you're suggesting. Here's the comment that you can find where all this logic is implemented (it's way more complicated than this, and than the code in 4.1, because it takes soft-affinity into account).

    /*
     * If the pcpu is idle, or there are no idlers and the new
     * vcpu is a higher priority than the old vcpu, run it here.
     *
     * If there are idle cpus, first try to find one suitable to run
     * new, so we can avoid preempting cur.  If we cannot find a
     * suitable idler on which to run new, run it here, but try to
     * find a suitable idler on which to run cur instead.
     */

> > It will be taken, for executing vcpu_wake().
From your own > > description > > of the mechanism above: > >
> > "Our idea may look like this:
> > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
> > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
> > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
> >
> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
> > 2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for csched_schedule()
> >
> Oh, you are right. > > We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, > while > looks the wake_vcpu_lock is grabbed by others. >
I understand that, and I can't say that this is not going to improve things for you. All I'm saying is that you're potentially reducing the overhead of one "operation", but at the same time adding some overhead in other "operations".

Depending on a multiplicity of aspects, the net effect could be positive or negative.

You asked for general advice on the solution, and help with the code. I can't help much debugging hangs in Xen 4.1. My opinion on the solution is that, architecturally, it is not something that I'd call an absolute improvement.

Let me put it this way: if you'd send a patch series implementing what we're talking about for upstream Xen, I would not Ack it without solid data, coming from benchmarks run on different platforms, with different workloads and under different load conditions, that clearly shows performance is improved.

Translated to your case, which is, AFAICT, that you need something on top of 4.1. If you can hack this up quickly and try it, then good. If this is taking a lot of time, I'm not sure I'd invest such time on this (e.g., I'd rather try to narrow down even more the cause of the issue you're seeing and, after that, look at other solutions). But of course I'm not you and this is only my opinion, and you absolutely have no reasons to follow my advice.
:-) :-)

> I am thinking our change may benefit like this. > >
> The situation is:
> * vcpu's priority is higher
> * pcpu 9 is idle
> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ
>
> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6 > nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue > lock. >
Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab their own locks for running csched_schedule(), after having reacted to the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.

BTW, you say that pcpu 9 is idle, but then that "it can't do his job" because pcpu 4 holds the lock of pcpu 6... I don't think I understand what you mean with this.

> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 > and pcpu 9 > could do their job immediately. Ah I know, very corner, not sure > about the > effect. >
Again, I don't get what "do their job" and "immediately" means.

In both cases the sequence of events is as follows:
 1. queue the task in pcpu's 6 queue (either the runq, in original code, or the wakeup list, with your modifications)
 2. tickle pcpu 6
 3. pcpu 6 reacts to being tickled
 4. pcpu 6 starts running the new vcpu

In the original code, 1 and 4 are serialized by the same lock; with your modifications, half of 1 can happen using a different lock. But 1 and 4 are almost never overlapped anyway, so I don't think you're gaining much parallelism.

What (maybe) you're achieving is that 1 interferes a little bit less with other things that want to happen on pcpu 6 (e.g., because of other vcpus wrt the one waking up), which may be a good thing, if there are a lot of wakeups. But then, again, if there are a lot of wakeups, you probably will see contention on the wake_vcpu_lock!

> BTW, I suddenly found softirq is implemented as an IPI. So I am > thinking > whether it is safe to raise softirq to itself?
>
Softirqs are implemented by means of IPIs, but only when they're raised for other processors. Self-raising a softirq does not involve IPIs, just setting a bit in the pending mask (see raise_softirq() in xen/common/softirq.c).

> Since it looks we already hold > the runqueue lock. Or because we have turned the irq off in > vcpu_wake()? >
We hold the runqueue lock, so what? Should we schedule that immediately? That would probably be possible, but it's again an architectural decision, and a matter of consistency. In Xen, we schedule when we find the SCHEDULE_SOFTIRQ to be pending.

Anyhow, I would not expect wonders by embedding the re-scheduling code directly here either.

Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-14 16:18 ` Dario Faggioli @ 2016-09-16 2:49 ` Wei Yang 2016-09-16 16:07 ` Dario Faggioli 0 siblings, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-16 2:49 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang, mengxu, Wei Yang

On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote: >On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote: >> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >> > >> > Do you mind sharing just a bit more, such as: >> > - number of pcpus >> > - number of vcpus of the various VMs >> 160 pcpus >> 16 vcpus in VM and 8 VMs >> >So, 16x8=128, which means you're not even oversubscribing. Maybe some >of the pcpus are hyperthreads and not full cores (are they?), but >still, this is not something I'd describe "super intensive cpu load". > >Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32 >vcpus (is this the case?), you technically are oversubscribing. But >then, what does dom0 do? Is it very busy doing some stuff on its own, >or running the backends/etc for the VMs? Can you check that with top, >xentop, and alike? > >If the system is not overbooked, it's a bit strange that the scheduler >is the bottleneck.

I looked at the original data again. I don't see detailed data to describe the dom0 configuration.

The exact user model is not accessible from our client. We guess their model looks like this.

    +--------+     +-----------+       +-------------+
    | Timer  |---->|Coordinator|---+-->| Worker      |
    +--------+     +-----------+   |   +-------------+
                                   |
                                   |   +-------------+
                                   +-->| Worker      |
                                   |   +-------------+
                                   |
                                   |   +-------------+
                                   +-->| Worker      |
                                       +-------------+

One Coordinator would drive several workers based on a high resolution timer. Periodically, workers would be woken up by the coordinator. So at one moment, many workers would be woken up and would trigger the vcpu_wake() in xen.
Not sure this would be a possible reason for the burst vcpu_wake()?

> >> Yes, the "Blue Screen" is what we want to mimic the behavior from >> client. >> >> The "Blue Screen" will force the hypervisor to do load balance in my >> mind. >> >I've no idea what this means (but I don't know much of Windows and of >what happens when it crashes with a blue screen). > >> > I'll have a quick look at how __runq_tickle() looks like in Xen >> > 4.1, >> > but there's very few chances I'll be available to provide detailed >> > review, advice, testing, etc., on such an old codebase. :-( >> > >> > > >> > > By looking at the code, what I missed may be in __runq_tickle(), >> > > which in case >> > > there are idle pcpu, schedule_softirq is also raised on them. By >> > > doing so, >> > > those idle pcpus would steal other busy tasks and may in chance >> > > get >> > > d2v2? >> > > >> > Yes, it looks like, in Xen 4.1, this is more or less what happens. >> > The >> > idea is to always tickle pcpu 6, if the priority of the waking vcpu >> > is >> > higher than what is running there. >> > >> Hmm... in case there are idle pcpus and the priority of the waking >> vcpu is >> higher than what is running on pcpu 6, would pcpu 6 have more chance >> to run it? >> or other idle pcpu would steal it from pcpu6? or they have equal >> chance? >> >No, it works like this: > > - pcpu 6 is either idle or it is running someone already (say d3v4) > + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) > pcpu 6 itself, and we're done

Ok, it looks like the behavior differs between 4.1 and current upstream. Upstream just raises the softirq to pcpu6 when it is idle, while 4.1 would raise the softirq to both pcpu6 and other idlers even when pcpu6 is idle.

I think current upstream is more clever.
> + if pcpu 6 is running d3v4 and there is no other idle pcpu: > * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done > * if prio(d2v2) < prio(d3v4) we just don't tickle anyone > + if pcpu 6 is running d3v4 and there are some idle pcpus: > * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or [1] > more of the idle pcpus > * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more [2] > of the idle pcpus > >Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's >why we tickle pcpu 6. However, that would mean that d3v4 would be put >back in pcpu's 6 runqueue. So, if there are idle pcpus, we tickle them >so that one can come and steam d3v4. > >Idea behind [2] is that d3v4 should continue to run on pcpu 6, while >d2v2 will be put in pcpu's 6 runqueue. However, if there are idle >pcpus, we tickle them so that one can come and steal d2v2. > >What really happens is not always what is expected, because it's >possible that, even if prio(d2v2)>prio(d3v4), an idler is quicker in >waking up and stealing d2v2 from pcpu's 6 runqueue where it was stashed >while pcpu 6 itself reschedules and enacts the preemption... But that's >unlikely and, all in all, not catastrophic (although, of course, worse >in terms of overhead, locality, etc) > >> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so >> > that: >> > - if the waking vcpu preempted what was running on pcpu 6, an >> > idler >> > can pick-up ("steal") such preempted vcpu and run it; >> > - if the waking vcpu ended up in the runqueue, an idler can steal >> > it >> > >> Hmm... I don't get the difference between these two cases. >> >> Looks both are an idler steal the vcpu. >> >I hope it's more clear now. :-) > Yep, thanks. :-) >> > Yes. But both in 4.1 and in current staging, we only raise >> > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the >> > waking >> > vcpu is higher and it should preempt the currently running vcpu. >> > >> Oh, you are right. 
I didn't catch this either. >> >> This means in case >> a) the priority is lower than current one >> b) no idler in system >> >> The vcpu will sit quietly in the runq and waiting for the schedule >> next time. >> >Well, yes. It will stay in pcpu's 6 runqueue until either: > - what's running on pcpu 6 blocks, or finishes its credits, etc., > and pcpu 6 reschedules and picks it up > - some other pcpu becomes idle and, when poking around the various > pcpus for stealing work, picks it up > >Of course, both depends on in what position in pcpu's 6 runqueue the >vcpu is when the specific event happens. > >This is a limit of the Credit1 scheduler, especially of the version >that you find in 4.1. Newer code is a bit better wrt this (although not >dramatically... that has not been possible! :-/). > >> > However, that is the case only if d2v2 had higher priority than >> > what is >> > running on pcpu 6. If it does not, you: >> > - _don't_ tickle pcpu 6 >> > - tickle one or more idlers, if any. >> > >> > So, again, it's not that you always wake and run it on pcpu 6 >> > >> I see what you mean :-) >> >> So I am curious about why we add this in pcpu's 6 queue, when we know >> the >> priority is low in __runq_tickle() and won't tickle that. And we are >> sure >> someone in the future will grab pcpu's 6 schedule lock and take it. >> >Yes, but what we should do? We've woken it up and it's v->processor >points to pcpu 6, so the lock that vcpu_schedule_lock() takes is pcpu's >6 runqueue lock. > >To queue it on some other pcpu we'd need to basically migrate it (i.e., >at least changing the value of v->processor, but also other things, >like checking hard affinity, etc). And we also would need to take the >runqueue lock of another pcpu's runqueue. All this in the wakeup path, >which will become slow and complex. 
And we'd need all the work stealing >logic (in load_balance(), etc) to still exist, to cover the case of >work needing to be stolen when wakeups are not involved (i.e., it's not >that we make other part of the scheduler's code less complex). > >So, basically, of course there probably were alternatives, but the >basic model, pretty much everywhere, is "queue where it was and tickle >others to come pick it up if suitable", and that's what's done here too >(and consistency is a very good thing :-D). > >> Or in this case -- when there are idlers and vcpu's priority is >> higher, >> instead of let pcpu 6 to wake this, but let an idler to take it? Ask >> someone >> who is free to help a busy one. >> >And in fact, in more recent version of Xen, I made the code do >something very close to what you're suggesting. Here's the comment that >you can find where all this logic is implemented (it's way more >complicated than this, and than the code in 4.1, because it takes soft- >affinity into account). > > /* > * If the pcpu is idle, or there are no idlers and the new > * vcpu is a higher priority than the old vcpu, run it here. > * > * If there are idle cpus, first try to find one suitable to run > * new, so we can avoid preempting cur. If we cannot find a > * suitable idler on which to run new, run it here, but try to > * find a suitable idler on which to run cur instead. > */ > I like this idea. >> > It will be taken, for executing vcpu_wake(). From your own >> > description >> > of the mechanism above: >> > >> > "Our idea may looks like this: >> > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the >> > vcpu. >> > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6." >> > >> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >> > 2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake() >> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for >> > csched_schedule() >> > >> Oh, you are right. 
>> We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, >> while >> looks the wake_vcpu_lock is grabbed by others. >> >I understand that, and I can't say that this is not going to improve >things for you. All I'm saying is that you're potentially reducing the >overhead of one "operation", but at the same time adding some overhead >in other "operations". > >Depending on a multiplicity of aspects, the net effect could be >positive or negative. > >You asked for general advice on the solution, and help with the >code. I can't help much debugging hangs in Xen 4.1. My opinion on the >solution is that, architecturally, it is not something that I'd call an >absolute improvement. > >Let me put it this way: if you'd send a patch series implementing what >we're talking about for upstream Xen, I would not Ack it without solid >data, coming from benchmarks run on different platforms, with different >workloads and under different load conditions, that clearly shows >performance is improved. > >Translated to your case, which is, AFAICT, that you need something on >top of 4.1. If you can hack this up quickly and try it, then good. If >this is taking a lot of time, I'm not sure I'd invest such time on this >(e.g., I'd rather try to narrow down even more the cause of the issue >you're seeing and, after that, look at other solutions). But of course >I'm not you and this is only my opinion, and you absolutely have no reasons to >follow my advice. :-) :-) >

Actually, I appreciate your advice :-)

We found the schedule lock to be a consuming point in our scenario, so the direct thought was to try to avoid grabbing it. Hmm... while our idea is not that sparkling.

>> I am thinking our change may benefit like this. >> >> The situation is: >> * vcpu's priority is higher >> * pcpu 9 is idle >> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ >> >> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6 >> nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue >> lock.
>> >Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the >softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab >their own locks for running csched_schedule(), after having reacted to >the IPI, etc., pcpu 4 will have released pcpu's 6 lock already. > >BTW, you say that pcpu 9 is idle, but then that "it can't do his job" >because pcpu 4 holds the lock of pcpu 6... I don't think I understand >what you mean with this.

After pcpu9 gets the softirq from pcpu4, it tries to schedule and do load balancing, which would take the pcpu6 schedule lock.

As I thought previously, the pcpu6 schedule lock is held by pcpu4, so pcpu9 should wait until pcpu4 releases it. This is what I mean by "it can't do his job immediately".

While as you mentioned, compared with pcpu9, pcpu4 would release the pcpu6 schedule lock earlier. :-)

A little complicated; hope I described it clearly. :-)

> >> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 >> and pcpu 9 >> could do their job immediately. Ah I know, very corner, not sure >> about the >> effect. >> >Again, I don't get what "do their job" and "immediately" means. > >In both cases the sequence of events is as follows: > 1. queue the task in pcpu's 6 queue (either the runq, in original > code, or the wakeup list, with your modifications) > 2. tickle pcpu 6 > 3. pcpu 6 reacts to being tickled > 4. pcpu 6 starts running the new vcpu > >In the original code, 1 and 4 are serialized by the same lock; with your >modifications, half of 1 can happen using a different lock. But 1 and 4 >are almost never overlapped anyway, so I don't think you're gaining >much parallelism. > >What (maybe) you're achieving is that 1 interferes a little bit less >with other things that want to happen on pcpu 6 (e.g., because of >other vcpus wrt the one waking up), which may be a good thing, if there >are a lot of wakeups. But then, again, if there are a lot of wakeups, >you probably will see contention on the wake_vcpu_lock!
> >> BTW, I suddenly found softirq is implemented as an IPI. So I am >> thinking >> whether it is safe to raise softirq to itself? >> >Softirqs are implemented by means of IPIs, but only when they're raised for >other processors. Self-raising a softirq does not involve IPIs, just >setting a bit in the pending mask (see raise_softirq() in >xen/common/softirq.c). > >> Since it looks we already hold >> the runqueue lock. Or because we have turned the irq off in >> vcpu_wake()? >> >We hold the runqueue lock, so what? Should we schedule that >immediately? That would probably be possible, but it's again an >architectural decision, and a matter of consistency. In Xen, we >schedule when we find the SCHEDULE_SOFTIRQ to be pending. > >Anyhow, I would not expect wonders by embedding the re-scheduling code >directly here either. > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >

-- Richard Yang
Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-16 2:49 ` Wei Yang @ 2016-09-16 16:07 ` Dario Faggioli 2016-09-17 0:31 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-16 16:07 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 8605 bytes --] On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote: > On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote: > > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote: > > If the system is not overbooked, it's a bit strange that the > > scheduler > > is the bottleneck. > I looked at the original data again. I don't see detailed data to > describe the > dom0 configuration. > I see. No collection of output of top and xentop in dom0 either then, I'm guessing? :-/ > The exact user model is not accessible from our client. We guess > their model > looks like this. > > > +--------+ +-----------+ +-------------+ > |Timer | --->|Coordinator| ---+--->|Worker | > +--------+ +-----------+ | +-------------+ > | > | +-------------+ > +--->|Worker | > | +-------------+ > | > | +-------------+ > +--->|Worker | > +-------------+ > > One Coordinator would driver several workers based on a high > resolution timer. > Periodically, workers would be waked up by the coordinator. So at one > moment, > many workers would be waked up and would triggers the vcpu_wake() in > xen. > It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if the graph describes the workload run inside the (and if yes, which ones) VMs... but that is not terribly important, after all. > Not sure this would be a possible reason for the burst vcpu_wake()? > Well, there would be, at least potentially, a sensible number of vcpus waking up, which indeed can make the runqueue locks of the various pcpus contended. 
But then again, if the system is not oversubscribed, I'd tend to think it to be tolerable, and I'd expect the biggest problem to be the work-stealing logic (considering the high number of pcpus), rather than the duration of the critical sections within vcpu_wake().

> > - pcpu 6 is either idle or it is running someone already (say d3v4) > > + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) > > pcpu 6 itself, and we're done > Ok, it looks like the behavior differs between 4.1 and current upstream. > Upstream > just raises the softirq to pcpu6 when it is idle, while 4.1 would > raise the softirq > to both pcpu6 and other idlers even when pcpu6 is idle. > > I think current upstream is more clever. >
I also think current upstream is a bit better, especially because it's mostly me that made it look the way it does. :-D But I was actually describing how 4.1 works.

In fact, in 4.1, if pcpu 6 is idle (see the '//xxx xxx xxx' comments I'm adding to the code excerpts):

    if ( new->pri > cur->pri )          //is true, so we put pcpu 6 in mask
    {
        cpu_set(cpu, mask);
    }
    if ( cur->pri > CSCHED_PRI_IDLE )   //is false!!
    {
        ....
    }
    if ( !cpus_empty(mask) )            //the mask contains pcpu 6 only
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

On the other hand, if pcpu 6 is not idle (and, sticking to the example started in the last email, is running d3v4):

    if ( new->pri > cur->pri )          //depends on d2v2's and d3v4's prios
    {
        cpu_set(cpu, mask);
    }
    if ( cur->pri > CSCHED_PRI_IDLE )   //is true, so let's see...
    {
        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle pcpus.
                                        //Let's assume there are (i.e., this is false)
        {
            ....
        }
        else
        {
            cpumask_t idle_mask;

            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpus_empty(idle_mask) )   //is true if there are idlers suitable
                                            //for new (let's assume there are)
            {
                if ( opt_tickle_one_idle )  //chosen on boot, default is true
                {
                    this_cpu(last_tickle_cpu) =
                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
                    cpu_set(this_cpu(last_tickle_cpu), mask);
                }
                else
                    ....
            }
            cpus_and(mask, mask, new->vcpu->cpu_affinity);
        }
    }
    if ( !cpus_empty(mask) )            //mask contains pcpu 6 if d2v2's prio was
                                        //higher, and also contains one idle pcpu
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's not, at least one idler (if it exists) is tickled, and pcpu 6 is tickled or not, depending on priorities.

> > And in fact, in more recent versions of Xen, I made the code do > > something very close to what you're suggesting. Here's the comment > > that > > you can find where all this logic is implemented (it's way more > > complicated than this, and than the code in 4.1, because it takes > > soft-affinity into account). > >
> > /*
> >  * If the pcpu is idle, or there are no idlers and the new
> >  * vcpu is a higher priority than the old vcpu, run it here.
> >  *
> >  * If there are idle cpus, first try to find one suitable to run
> >  * new, so we can avoid preempting cur.  If we cannot find a
> >  * suitable idler on which to run new, run it here, but try to
> >  * find a suitable idler on which to run cur instead.
> >  */
> >
> I like this idea. >
Then update... Xen 4.2 or 4.3 should already contain it. :-P

> We found the schedule lock be a consuming point in our scenario, so > the direct > thought is try to avoid to grab it. Hmm... while our idea is not that > sparkling. >
Again, I can't say how sparkling it will turn out once implemented.
Architecturally, it doesn't sound much different from the status quo to me, and so I'd say it's unlikely that it will solve your problem, but this is something always very hard to tell. And again, I'd personally spend some more time --even by temporarily and hackishly instrumenting the code-- to understand better where the bottleneck is.

> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised > > the > > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to > > grab > > their own locks for running csched_schedule(), after having reacted > > to > > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already. > > > > BTW, you say that pcpu 9 is idle, but then that "it can't do his > > job" > > because pcpu 4 holds the lock of pcpu 6... I don't think I > > understand > > what you mean with this. > After pcpu9 gets the softirq from pcpu4, it tries to schedule and do > load > balancing, which would take the pcpu6 schedule lock. > > As I thought previously, the pcpu6 schedule lock is held by pcpu4, so > pcpu9 should > wait until pcpu4 releases it. This is what I mean by "it can't do his job > immediately". >
Yeah, but...

> While as you mentioned, compared with pcpu9, pcpu4 would release > the pcpu6 > schedule lock earlier. :-) >
... indeed that's what I think happens 99% of the time. And this is also why I'd be tempted to think that the issue may be lying somewhere else.

E.g., something that has proven effective for others (and for which I've got an unfinished and never submitted patch to upstream), is keeping track of which pcpus actually have at least 1 vcpu in their runqueues (i.e., at least 1 vcpu that is runnable and not running) and, inside balance_load(), only considering those when looking for work to steal.
Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
  2016-09-16 16:07         ` Dario Faggioli
@ 2016-09-17  0:31           ` Wei Yang
  2016-09-19 23:24             ` Dario Faggioli
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Yang @ 2016-09-17  0:31 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang,
	mengxu, Wei Yang

On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
>On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote:
>> On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
>> > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
>> > If the system is not overbooked, it's a bit strange that the
>> > scheduler is the bottleneck.
>> I looked at the original data again. I don't see detailed data
>> describing the dom0 configuration.
>>
>I see. No collection of the output of top and xentop in dom0 either
>then, I'm guessing? :-/
>
Probably; let me check with someone to see whether we are in luck.

>> The exact user model is not accessible from our client. We guess
>> their model looks like this.
>>
>>
>>  +--------+     +-----------+        +-------------+
>>  |Timer   | --->|Coordinator| ---+-->|Worker       |
>>  +--------+     +-----------+    |   +-------------+
>>                                  |
>>                                  |   +-------------+
>>                                  +-->|Worker       |
>>                                  |   +-------------+
>>                                  |
>>                                  |   +-------------+
>>                                  +-->|Worker       |
>>                                      +-------------+
>>
>> One Coordinator would drive several workers based on a high
>> resolution timer. Periodically, the workers would be woken up by the
>> coordinator. So at one moment, many workers would be woken up, and
>> that would trigger vcpu_wake() in Xen.
>>
>It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if
>the graph describes the workload run inside the (and if yes, which
>ones) VMs... but that is not terribly important, after all.
>
Oh, yes, these are threads in a VM. Each VM may contain several groups
(instances) of these threads.

>> Not sure this would be a possible reason for the burst of vcpu_wake()
>> calls?
>>
>Well, there would be, at least potentially, a sensible number of vcpus
>waking up, which indeed can make the runqueue locks of the various
>pcpus contended.
>
>But then again, if the system is not oversubscribed, I'd tend to think
>it to be tolerable, and I'd expect the biggest problem to be the work-
>stealing logic (considering the high number of pcpus), rather than the
>duration of the critical sections within vcpu_wake().
>

Yes, we are trying to improve the stealing part too.

Sounds reasonable: vcpu_wake() is O(1), while "stealing" is O(N) in
terms of #PCPUs.

>> > - pcpu 6 is either idle or it is running someone already (say d3v4)
>> >   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
>> >     pcpu 6 itself, and we're done
>> Ok, it looks like the behavior differs between 4.1 and current
>> upstream. Upstream just raises the softirq on pcpu6 when it is idle,
>> while 4.1 would raise the softirq on both pcpu6 and other idlers even
>> when pcpu6 is idle.
>>
>> I think current upstream is more clever.
>>
>I also think current upstream is a bit better, especially because it's
>mostly me that made it look the way it does. :-D

Ah, it was not intended to flatter you.

>
>But I was actually describing how 4.1 works. In fact, in 4.1, if pcpu 6
>is idle (see the '//xxx xxx xxx' comments I'm adding to the code
>excerpts):
>
>    if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
>    {
>        ....
>    }
>    if ( !cpus_empty(mask) )  //the mask contains pcpu 6 only
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>

Hmm... I don't have the code at hand, but it looks like you are right.
I misunderstood the code.

>On the other hand, if pcpu 6 is not idle (and, sticking to the example
>started in the last email, is running d3v4):
>
>    if ( new->pri > cur->pri )  //depends on d2v2's prio and d3v4's prio
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is true, so let's see...
>    {
>        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle pcpus. Let's assume there are (i.e., let's assume this is false)
>        {
>            ....
>        }
>        else
>        {
>            cpumask_t idle_mask;
>
>            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>            if ( !cpus_empty(idle_mask) )  //is true if there are idlers suitable for new (let's assume there are)
>            {
>                if ( opt_tickle_one_idle )  //chosen on boot, default is true
>                {
>                    this_cpu(last_tickle_cpu) =
>                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
>                    cpu_set(this_cpu(last_tickle_cpu), mask);

I may have misunderstood the code previously, and I like this part too.
So only one idler would be tickled even when there are several idlers
in the system. I thought we would tickle several idlers, which may
introduce some contention between them.

BTW, any idea behind cycle_cpu()? If this is about locality, how about
cycling within the node? And cycling within the node where v->processor
is?

>                }
>                else
>                    ....
>            }
>            cpus_and(mask, mask, new->vcpu->cpu_affinity);
>        }
>    }
>    if ( !cpus_empty(mask) )  //mask contains pcpu 6 if d2v2's prio was higher, and also contains one idle pcpu
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>
>So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's
>not, at least one idler (if it exists) is tickled, and pcpu 6 is
>tickled or not, depending on priorities.
>

That's clear to me now :-)

>> > And in fact, in more recent versions of Xen, I made the code do
>> > something very close to what you're suggesting. Here's the comment
>> > you can find where all this logic is implemented (it's way more
>> > complicated than this, and than the code in 4.1, because it takes
>> > soft-affinity into account).
>> >
>> >  /*
>> >   * If the pcpu is idle, or there are no idlers and the new
>> >   * vcpu is of higher priority than the old vcpu, run it here.
>> >   *
>> >   * If there are idle cpus, first try to find one suitable to run
>> >   * new, so we can avoid preempting cur. If we cannot find a
>> >   * suitable idler on which to run new, run it here, but try to
>> >   * find a suitable idler on which to run cur instead.
>> >   */
>> >
>> I like this idea.
>>
>Then update... Xen 4.2 or 4.3 should already contain it. :-P
>
>> We found the schedule lock to be a consuming point in our scenario, so
>> the direct thought is to try to avoid grabbing it. Hmm... while our
>> idea is not that sparkling.
>>
>Again, I can't say how sparkling it will reveal itself to be once
>implemented.
>Architecturally, it doesn't sound much different from the status quo to
>me, and so I'd say it's unlikely that it will solve your problem, but
>this is something always very hard to tell.
>
>And again, I'd personally spend some more time --even by temporarily
>and hackishly instrumenting the code-- to understand better where the
>bottleneck is.
>

Hmm... usually we use xentrace and lock profiling to identify the
bottleneck. Is there any other method you would like to suggest? And
does instrumenting the code mean adding some logging to the code?

>> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the
>> > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab
>> > their own locks for running csched_schedule(), after having reacted to
>> > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
>> >
>> > BTW, you say that pcpu 9 is idle, but then that "it can't do its job"
>> > because pcpu 4 holds the lock of pcpu 6... I don't think I understand
>> > what you mean with this.
>> After pcpu9 gets the softirq from pcpu4, it tries to schedule and do
>> load balancing, which would take pcpu6's schedule lock.
>>
>> As I thought previously, pcpu6's schedule lock is held by pcpu4, so
>> pcpu9 should wait until pcpu4 releases it. This is what I mean by "it
>> can't do its job immediately".
>>
>Yeah, but...
>
>> While, as you mentioned, compared with pcpu9, pcpu4 would release
>> pcpu6's schedule lock earlier. :-)
>>
>...
indeed that's what I think happens 99% of the time. And this is
>also why I'd be tempted to think that the issue may be lying somewhere
>else.
>
>E.g., something that has proven effective for others (and for which
>I've got an unfinished and never submitted patch to upstream) is
>keeping track of which pcpus actually have at least 1 vcpu in their
>runqueues (i.e., at least 1 vcpu that is runnable and not running)
>and, inside balance_load(), only considering those when looking for
>work to steal.

Looks like we keep a cpumask of those who have 1 or more queued vcpus
and, during the stealing process, just scan those instead of the online
cpumask.

>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--
Wei Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread
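The 4.1 tickling rules traced through in this exchange can be condensed into a toy decision function. This is a hypothetical simplification with made-up types and names; the real __runq_tickle() builds a cpumask under the runqueue lock and raises SCHEDULE_SOFTIRQ on every cpu in it:

```c
#include <assert.h>

/* Toy condensation of the Xen 4.1 __runq_tickle() logic discussed in
 * the thread. PRI_IDLE stands in for CSCHED_PRI_IDLE. */
#define PRI_IDLE  (-64)

struct tickle { int tickle_cpu; int tickle_idler; };

static struct tickle runq_tickle(int cpu, int new_pri, int cur_pri,
                                 int have_idlers)
{
    struct tickle t = { -1, 0 };

    /* Tickle the waking vcpu's own pcpu only if the new vcpu outranks
     * whatever runs there (which may be the idle vcpu). */
    if ( new_pri > cur_pri )
        t.tickle_cpu = cpu;

    /* If that pcpu was busy, also tickle one idler, so it can steal
     * either the preempted vcpu or the newly queued one. */
    if ( cur_pri > PRI_IDLE && have_idlers )
        t.tickle_idler = 1;

    return t;
}
```

This reproduces the three cases from the discussion: an idle pcpu 6 gets tickled alone; a busy pcpu 6 running a lower-priority vcpu gets tickled together with one idler; and if the waking vcpu loses on priority, only an idler is tickled (the vcpu waits in the runqueue to be stolen).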
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
  2016-09-17  0:31           ` Wei Yang
@ 2016-09-19 23:24             ` Dario Faggioli
  2016-09-20  3:51               ` Wei Yang
  0 siblings, 1 reply; 17+ messages in thread
From: Dario Faggioli @ 2016-09-19 23:24 UTC (permalink / raw)
  To: Wei Yang
  Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, mengxu, Wei Yang

[-- Attachment #1.1: Type: text/plain, Size: 4451 bytes --]

On Sat, 2016-09-17 at 00:31 +0000, Wei Yang wrote:
> On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
> > But then again, if the system is not oversubscribed, I'd tend to
> > think it to be tolerable, and I'd expect the biggest problem to be
> > the work-stealing logic (considering the high number of pcpus),
> > rather than the duration of the critical sections within
> > vcpu_wake().
> >
> Yes, we are trying to improve the stealing part too.
>
Great.

> Sounds reasonable: vcpu_wake() is O(1) while "stealing" is O(N) in
> terms of #PCPUs.
>
Exactly!

> > I also think current upstream is a bit better, especially because
> > it's mostly me that made it look the way it does. :-D
>
> Ah, it was not intended to flatter you.
>
EhEh! Sure, I was joking myself. :-)

> >
> > But I was actually describing how 4.1 works. In fact, in 4.1, if
> > pcpu 6 is idle (see the '//xxx xxx xxx' comments I'm adding to the
> > code excerpts):
> >
> >     if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
> >     {
> >         cpu_set(cpu, mask);
> >     }
> >     if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
> >     {
> >         ....
> >     }
> >     if ( !cpus_empty(mask) )  //the mask contains pcpu 6 only
> >         cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
> >
> Hmm... I don't have the code at hand, but it looks like you are
> right. I misunderstood the code.

> So only one idler would be tickled even when there are several idlers
> in the system. I thought we would tickle several idlers, which may
> introduce some contention between them.
>
You can ask Xen to tickle more than one idler, by means of that
parameter. But again, tickling idlers --either one or many-- would only
happen if there's actual need for it.

> BTW, any idea behind cycle_cpu()? If this is about locality, how
> about cycling within the node? And cycling within the node where
> v->processor is?
>
Cycling is there as a form of load spreading, i.e., we don't want to
risk always tickling the same set of pcpus, or always the ones with the
lower cpu IDs.

You're right that taking locality into account could be a good thing in
big systems. No, that is not done, not in 4.1 nor in upstream, and it
may be something actually worth trying (although, again, it's probably
unrelated to your issue).

> > And again, I'd personally spend some more time --even by temporarily
> > and hackishly instrumenting the code-- to understand better where
> > the bottleneck is.
> >
> Hmm... usually we use xentrace and lock profiling to identify the
> bottleneck. Is there any other method you would like to suggest? And
> does instrumenting the code mean adding some logging to the code?
>
xentrace and lock profiling are all I use too, and there's not much
else, without touching the code. And in fact, yes, by "instrumenting
the code" I mean temporarily changing the code to display the
information you need.

In this specific case, rather than adding printks here and there
(which, e.g., is what you usually do for debugging crashes and the
like), I'd see about modifying a little bit either the xentrace or the
lock profiling code (or both!) to make them provide the info you need.

Sorry, but I don't think I can be much more specific without going
looking at the code and actually trying to do that...
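For illustration, the cycling behavior being discussed could be sketched like this. It is a hypothetical simplification of the real cpumask helper, using a plain uint64_t bitmask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplification of the cpu cycling used by the "tickle
 * one idle" path: starting just after 'prev', return the next cpu set
 * in 'mask', wrapping around. Feeding the result back in as the next
 * 'prev' (what last_tickle_cpu does) spreads tickles across the
 * idlers instead of always hitting the lowest-numbered one. */
static int cycle_cpu(int prev, uint64_t mask, int nr_cpus)
{
    for ( int i = 1; i <= nr_cpus; i++ )
    {
        int cpu = (prev + i) % nr_cpus;
        if ( mask & (1ULL << cpu) )
            return cpu;
    }
    return -1; /* no cpu set in mask */
}
```

A locality-aware variant, as suggested in the thread, would simply run this cycle over `mask & node_mask(v->processor)` first, falling back to the full idler mask when that intersection is empty.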
> > E.g., something that has proven effective for others (and for which > > I've got an unfinished and never submitted patch to upstream), is > > keeping track of what pcpus actually have at least 1 vcpu in their > > runqueues (i.e., at least 1 vcpus that is runnable and not running) > > and, inside balance_load(), only consider those when looking for > > work > > to steal. > > Looks like we keep a cpumask with those who have 1 more vcpus and > during > stealing process, just scan those instead of the online cpumask. > Yep, that sounds like a plan, and was along the lines of what I was doing. If you want to give it a try, patches (for upstream, of course :-D) are welcome. :-P Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-19 23:24 ` Dario Faggioli @ 2016-09-20 3:51 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-20 3:51 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Tue, Sep 20, 2016 at 01:24:17AM +0200, Dario Faggioli wrote: >On Sat, 2016-09-17 at 00:31 +0000, Wei Yang wrote: >> On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote: >> > But then again, if the system is not oversubscribed, I'd tend to >> > think >> > it to be tolerable, and I'd expect the biggest problem to be the >> > work- >> > stealing logic (considering the high number of pcpus), rather than >> > the >> > duration of the critical sections within vcpu_wake(). >> > >> Yes, we are trying to improve the stealing part too. >> >Great. > >> Sounds reasonable, vcpu_wake() is O(1) while "stealing" is O(N) in >> terms of >> #PCPUs. >> >Exactly! > >> > I also think current upstream is a bit better, especially because >> > it's >> > mostly me that made it look the way it does. :-D >> >> Ah, not intended to flattering you. >> >EhEh! Sure, I was joking myself. :-) > Ah, really enjoy talking with you :-) You got some special sense of humor, and I like it~ Have a good day! >> > >> > >> > But I was actually describing how 4.1 works. In fact, in 4.1, if >> > pcpu 6 >> > is idle (se the '//xxx xxx xxx' comments I'm adding to the code >> > excerpts: >> > >> > if ( new->pri > cur->pri ) //is true, so we put pcpu 6 in mask >> > { >> > cpu_set(cpu, mask); >> > } >> > if ( cur->pri > CSCHED_PRI_IDLE ) //is false!! >> > { >> > .... >> > } >> > if ( !cpus_empty(mask) ) //the mask contains pcpu 6 only >> > cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ); >> > >> >> Hmm... don't have the code at hand, while looks you are right. I >> misunderstood >> the code. > >> So only one idler would be tickled even there would be several idlers >> in the >> system. 
I thought we would tickle several idlers, which may introduce >> some >> contention between them. >> >You can ask Xen to tickle more than one idle, by means of that >parameter. But again, tickling idlers --either one or many-- would only >happen if there's actual need for it. > >> BTW, any idea behind the cycle_cpu()? If this is about the locality, >> how about >> cycle within the node? and cycle within the node where v->processor >> is? >> >Cycle is there as a form of load spreading, i.e., we don't want to risk >tickling always the same set of pcpus, or always the ones with the >lower cpu IDs. > >You're right that taking locality into account could be a good thing in >big systems. No, that is not done, not in 4.1 nor in upstream, and it >may be something actually worth trying (although, again, it's probably >unrelated to your issue). > >> > And again, I'd personally spend some more time --even by >> > temporarily >> > and hackishly instrumenting the code-- to understand better where >> > the >> > bottleneck is. >> > >> >> Hmm... usually we use xentrace and lock profile to identify the >> bottleneck, >> other method you would like to suggest? and instrumenting the code >> means to >> add some log in code? >> >xentrace and lock profile is all I use too, and there's not much else, >without touching the code. And in fact, yes, with "instrumenting the >code" I means temporary changing the code to display the information >you need. > >In this specific case, rather than adding printks here and there >(which, e.g., is what you usually do for debugging crashes or alike), >I'd see about modifying a little bit either xentrace or lock profiling >code (or both!) to make them provide the info you need. > >Sorry, but I don't think I can be much more specific without going >looking at the code and actually trying to do that... 
>
>> E.g., something that has proven effective for others (and for which
>> I've got an unfinished and never submitted patch to upstream), is
>> keeping track of which pcpus actually have at least 1 vcpu in their
>> runqueues (i.e., at least 1 vcpu that is runnable and not running)
>> and, inside balance_load(), only considering those when looking for
>> work to steal.
>
>> Looks like we keep a cpumask of those who have 1 or more queued
>> vcpus and, during the stealing process, just scan those instead of
>> the online cpumask.
>>
>Yep, that sounds like a plan, and was along the lines of what I was
>doing. If you want to give it a try, patches (for upstream, of course
>:-D) are welcome. :-P
>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--
Richard Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-14 10:44 ` Wei Yang 2016-09-14 16:18 ` Dario Faggioli @ 2016-09-17 3:30 ` Wei Yang 2016-09-22 9:12 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-17 3:30 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, Dario Faggioli, xen-devel, mengxu, richard.weiyang On Wed, Sep 14, 2016 at 06:44:17PM +0800, Wei Yang wrote: >On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >>[using xendevel correct address] >> >>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote: >>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote: >>> > >>> > I'm not surprised by that. Yet, I'd be interested in hearing more >>> > about this profiling you have done (things like, how you captured >>> > the data, what workloads you are exactly considering, how you >>> > determined what is the bottleneck, etc). >>> Let me try to explain this. >>> >>> Workload: a. Blue Screen in Windows Guests >>> b. some client using Windows to do some video >>> processing >>> which need precise timestamp (we are not sure the >>> core >>> reason but we see the system is very slow) >>> >>Do you mind sharing just a bit more, such as: >> - number of pcpus >> - number of vcpus of the various VMs > >160 pcpus >16 vcpus in VM and 8 VMs > >> >>I also am not sure what "a. Blue screen in Windows guests" above >>means... is there a workload called "Blue Screen"? Or is it that you >>are hitting a BSOD in some of your guests (which ones, what were they >>doing)? Or is it that you _want_ to provoke a BSOD on some of your >>guests? Or is that something else? :-P >> > >Yes, the "Blue Screen" is what we want to mimic the behavior from client. > >The "Blue Screen" will force the hypervisor to do load balance in my mind. > >>> Capture the data: lock profile >>> Bottleneck Check: The schedule lock wait time is really high >>> >>Ok, cool. Interesting that lock profiling works on 4.1! 
:-O >> >>> > The scheduler tries to see whether the v->processor of the waking >>> > vcpu can be re-used, but that's not at all guaranteed, and again, >>> > on a very loaded system, not even that likely! >>> > >>> Hmm... I may missed something. >>> >>> Took your assumption below for example. >>> In my mind, the process looks like this: >>> >>> csched_vcpu_wake() >>> __runq_insert(), insert the vcpu in pcpu's 6 runq >>> __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other >>> (1) >>> >>Well, yes. More precisely, at least in current staging, >>SCHEDULE_SOFTIRQ is raised for pcpu 6: >> - if pcpu 6 is idle, or >> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and >> the waking vcpu is higher in priority than what is running on pcpu 6 >> >>> __do_softirq() >>> schedule() >>> csched_schedule(), for pcpu 6, it may wake up d2v2 based >>> on it's >>> priority >>> >>Yes, but it is pcpu 6 that will run csched_schedule() only if the >>conditions mentioned above are met. If not, it will be some other pcpu >>that will do so (or none!). >> >>But I just realized that the fact that you are actually working on 4.1 >>is going to be an issue. In fact, the code that it is now in Xen has >>changed **a lot** since 4.1. In fact, you're missing all the soft- >>affinity work (which you may or may not find useful) but also >>improvements and bugfixing. >> > >That's true... > >>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1, >>but there's very few chances I'll be available to provide detailed >>review, advice, testing, etc., on such an old codebase. :-( >> >>> By looking at the code, what I missed may be in __runq_tickle(), >>> which in case >>> there are idle pcpu, schedule_softirq is also raised on them. By >>> doing so, >>> those idle pcpus would steal other busy tasks and may in chance get >>> d2v2? >>> >>Yes, it looks like, in Xen 4.1, this is more or less what happens. 
The >>idea is to always tickle pcpu 6, if the priority of the waking vcpu is >>higher than what is running there. >> > >Hmm... in case there are idle pcpus and the priority of the waking vcpu is >higher than what is running on pcpu 6, would pcpu 6 have more chance to run it? >or other idle pcup would steal it from pcpu6? or they have equal chance? > >>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that: >> - if the waking vcpu preempted what was running on pcpu 6, an idler >> can pick-up ("steal") such preempted vcpu and run it; >> - if the waking vcpu ended up in the runqueue, an idler can steal it >> > >Hmm... I don't get the difference between these two cases. > >Looks both are an idler steal the vcpu. > >>> BTW, if the system is in heavy load, how much chance would we have >>> idle pcpu? >>> >>It's not very likely that there will be idle pcpus in a very busy >>system, I agree. >> >>> > > We can divide this in three steps: >>> > > a) Send IPI to the target CPU and tell it which vcpu we want it >>> > > to >>> > > wake up. >>> > > b) The interrupt handler on target cpu insert vcpu to a percpu >>> > > queue >>> > > and raise >>> > > softirq. >>> > > c) softirq will dequeue the vcpu and wake it up. >>> > > >>> > I'm not sure I see how you think this would improve the situation. >>> > >>> > So, quickly, right now: >>> > >>> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up >>> > - let's assume d2v2->processor = 6 >>> > - let's assume the wakeup is happening on pcpu 4 >>> > >>> > Right now: >>> > >>> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2, >>> > which is the runqueue lock of pcpu 6 (in Credit1, there is 1 >>> > runqueue per pcpu, and locks are per-cpu already) >>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >>> > - still executing on pcpu 4, __runq_tickle() is called, and it >>> > determines on which pcpu d2v2 should run >>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. 
Let's look at the >>> > following two cases: >>> > a) if it was determined that d2v2 should run on pcpu 6: >>> > - in a (little, hopefully) while, pcpu 6 will schedule >>> > - it takes its own runqueue lock >>> > - it finds d2v2 in there, and runs it >>> Even in this case, it depends on the priority of d2v2 whether it >>> would be run >>> now. Right? >>> >>Yes. But both in 4.1 and in current staging, we only raise >>SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking >>vcpu is higher and it should preempt the currently running vcpu. >> > >Oh, you are right. I didn't catch this either. > >This means in case >a) the priority is lower than current one >b) no idler in system > >The vcpu will sit quietly in the runq and waiting for the schedule next time. > >>> > b) if it was determined that d2v2 should run on, say, pcpu 9: >>> > - in a (little, hopefully) while, pcpu 9 will schedule >>> > - it takes its own runqueue lock >>> > - if it has nothing in its runqueue (likely, if >>> > __runq_tickle() chose it, but not necessarily and always >>> > true, especially under high load), it looks around to >>> > see if other pcpus have any vcpu that it can execute >>> > - it will try to take the locks of the various pcpu's >>> > runqueues it comes across >>> > - if it manages to take the lock of pcpu's 6 runqueue, it >>> > sees d2v2 in it, steal it and run it. >>> Oh, my understanding matches yours :-) >>> >>> BYW, we also found in csched_load_balance() will hold the schedule >>> lock, but >>> not found a proper vcpu to steal. Maybe this would be a potential >>> optimization >>> point. >>> >>Mmm... I actually don't understand what you mean here... what is the >>possible optimization? 
>> >>> > With your modifications, AFAICT: >>> > >>> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock >>> > of pcpu 6 and queue d2v2 in pcpu's 6 deferred wakeup list >>> > - poke pcpu 6 with an IPI >>> > - in a (little, hopefully) while, pcpu 6 react to the IPI and, >>> > _I_think_, call vcpu_wake(d2v2) ? [*] >>> You are almost right. Here in the interrupt handler, it does two >>> things: >>> 1. echo back notify_remote_vcpu_wake() it finishes the job >>> 2. raise a separate softirq, which will call vcpu_wake(d2v2) >>> >>> > >>> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock >>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >>> > - on pcpu 6, __runq_tickle() is called, and it determines on which >>> > pcpu d2v2 should run >>> At this place, the behavior of __run_tickle() is what I mentioned >>> above. Raise >>> softirq on pcpu6 and idle pcpus. Who is faster, who get d2v2. >>> >>Sort of, yes. In general, you can expect pcpu 6 going through >>csched_schedule(), and hence setting d2v2 to run, to be faster that >>SCHEDULE_SOFIRQ being notified to someone remote, and it scheduling and >>getting to work stealing from pcpu 6's runqueue. >> >>However, that is the case only if d2v2 had higher priority than what is >>running on pcpu 6. If it does not, you: >> - _don't_ tickle pcpu 6 >> - tickle one or more idlers, if any. >> >>So, again, it's not that you always wake and run it on pcpu 6 >> > >I see what you mean :-) > >So I am curious about why we add this in pcpu's 6 queue, when we know the >priority is low in __runq_tickle() and won't tickle that. And we are sure >someone in the future will grab pcpu's 6 schedule lock and take it. > >Or in this case -- when there are idlers and vcpu's priority is higher, >instead of let pcpu 6 to wake this, but let an idler to take it? Ask someone >who is free to help a busy one. > >>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. 
Let's look at the >>> > following two cases: >>> > a) if it was determined that d2v2 should run on pcpu 6 >>> > (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself): >>> > - at the first softirq check, pcpu 6 will schedule >>> > - it takes its own runqueue lock >>> > - it finds d2v2 in there, and runs it >>> > b) if it was determined that d2v2 should run on, say, pcpu 9: >>> > - in a (little, hopefully) while, pcpu 9 will schedule >>> > - it takes its own runqueue lock >>> > - if it has nothing in its runqueue (likely, if >>> > __runq_tickle() chose it, but not necessarily and always >>> > true, especially under high load), it looks around to >>> > see if other pcpus have any vcpu that it can execute >>> > - it will try to take the locks of the various pcpu's >>> > runqueues it comes across >>> > - if it manages to take the lock of pcpu's 6 runqueue, it >>> > sees d2v2 in it, steal it and run it. >>> > >>> > [*] I don't see in the code you shared what happens in reaction to >>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually >>> > calls vcpu_wake() >>> > >>> > So, basically, it looks to me that you're adding a level of >>> > indirection, I'm not sure for what benefit. >> >>> In my mind, we are trying to reduce the contention on schedule lock >>> from two >>> aspects: >>> 1. for those vcpus would run on other pcpu, it will take >>> wake_vcpu_lock >>> instead of schedule lock >>> >>I don't see how you can say "instead". It looks to me that, for those >>vcpus what would run on other pcpu, we need to take wake_vcpu_lock >>_in_addition_ to the runqueue lock. >> >>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which >>> also >>> reduce the cache coherency between pcpus >>> >>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a >>remote lock, though. > >Hmm... you are right. 
> >> >>> > In fact, in the best case (i.e., d2v2 will actually be run on its >>> > v->processor), in the original code there is: >>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6. >>> > In your case, there is: >>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6 >>> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6. >>> > >>> Our idea may looks like this: >>> - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >>> - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu. >>> - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6. >>> >>So, it's even worse than how I imagined! :-P >> >>> > Also, as far as locking within the wakeup path is concerned, in the >>> > original code: >>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >>> > (i.e., for the time of __runq_insert() and __runq_tickle(). >>> > In your case: >>> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the >>> > deferred wakeup >>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >>> > >>> The wake_vcpu_lock is hold in the first step in above. >>> The runqueue lock(I think you mean the schedule lock?) won't be taken >>> in this >>> case. >>> >>It will be taken, for executing vcpu_wake(). From your own description >>of the mechanism above: >> >>"Our idea may looks like this: >> 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the >> vcpu. >> 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6." >> >>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >>2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake() >>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for >> csched_schedule() >> > >Oh, you are right. > >We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, while >looks the wake_vcpu_lock is grabbed by others. > >I am thinking our change may benefit like this . 
> >The situation is: >* vcpu's priority is higher >* pcpu 9 is idle >* pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ > >At this moment pcpu 4 is holding pcpu 6's runqueue lock, neither pcpu 6 >nor pcpu 9 could do their job until pcpu 4 releases pcpu 6's runqueue lock. > >While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 and pcpu 9 >could do their job immediately. Ah I know, a very corner case, not sure about the >effect. > >BTW, I suddenly found softirq is implemented as an IPI. So I am thinking >whether it is safe to raise softirq to itself? Since it looks like we already hold >the runqueue lock. Or because we have turned the irq off in vcpu_wake()? > >>> > >>> > (i.e., for the time of __runq_insert() and __runq_tickle()). >>> > >>> > Which, to me, looks like both more inter-CPU communication overhead >>> > and more chances for lock contention. >>> Hmm... yes, more inter-CPU communication, while less lock contention >>> I think. >>> >>More inter-CPU communication, the same level of local lock contention, >>plus some added moderately-to-low contended remote lock contention, >>AFAICT. >> >>> > So, again, more IPIs and more (potential) lock contention. >>> > >>> > True, the queueing into the deferred wakeup list will be very simple >>> > and quick, and hence the critical section needed for implementing >>> > it very short. But the problem is that you are not shortening >>> > something long, or splitting something long in two shorter chunks, >>> > you are _adding_ something, and thus you can't possibly win. :-/ >>> > >>> Yes, we are moving something, let me digest it for a while. >>> >>Sure. Let me also add that I'm not saying it's not going to work, or >>that it's not going to improve the situation. >> >>What I'm saying is that, architecturally, it's not too different from >>the current situation, so I would not expect wonders from just this.
>> >>What I'd do would be to try figuring out where it is that the most >>waiting time is being accumulated, i.e., which *specific* locking >>attempts of the scheduler's runqueues' locks are the most contended, >>and focus on those first. >> >>If you're on a big host, I think the way in which Credit1 does load >>balancing (i.e., by work stealing) may be the thing to blame. But all >>these are things that only profiling and benchmarking can really tell! >>:-) >> >>As a side note, we're also working on something which, at the end of >>the day, is rather similar, although for different purposes. In fact, >>see here: >>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html >> >>The goal is not having to disable IRQs during scheduling, and I'm >>looking at keeping the per-pcpu deferred wakeup list lockless, in order >>not to add more lock contention. >> Dario, Just got a chance to look into this. This is interesting and I am trying to understand the problem you want to solve first. vcpu_wake() is a _special_ case which has to turn off the irq, because SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle a pcpu to pick up the queued vcpu. And the tickling part needs the irq to be turned off. I don't get this part. Why do we have to turn off the irq for tickling a pcpu? And does enabling irq in schedule handlers benefit the performance? Or what is the motivation behind this? Didn't look into your code yet. From the description from Andrew, I come up with several questions: 1. which per-cpu list you want to queue the vcpu on? v->processor? 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? 3. purpose is so that schedule softirq could run with irq enabled 4. the schedule softirq would do the jobs currently in vcpu_wake()? Taking the per-cpu branch as an example, I see several commits. The top 6 are related ones, right? >>If you're interested in knowing more or working on that, just let me know.
>>But, of course, this has to be done on Xen upstream, not on an >>old version. >> >>Regards, >>Dario >>-- >><<This happens because I choose it to happen!>> (Raistlin Majere) >>----------------------------------------------------------------- >>Dario Faggioli, Ph.D, http://about.me/dario.faggioli >>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> > > > >-- >Richard Yang >Help you, Help me -- Wei Yang Help you, Help me _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-17 3:30 ` Wei Yang @ 2016-09-22 9:12 ` Dario Faggioli 2016-09-22 9:55 ` Dario Faggioli 2016-09-24 3:39 ` Wei Yang 0 siblings, 2 replies; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:12 UTC (permalink / raw) To: Wei Yang, Wei Yang Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 3967 bytes --] On Sat, 2016-09-17 at 03:30 +0000, Wei Yang wrote: > Dario, > Hey, hi again, and sorry for the delay in getting back at this particular part of the conversation. > Just got a chance to look into this. This is interesting and I am > trying to > understand the problem you want to solve first. > :-) > vcpu_wake() is a _special_ case which has to turn off the irq, because > SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle > a pcpu to pick > up the queued vcpu. And the tickling part needs the irq to be turned > off. > _Almost_ correct. However, the problem is more that vcpu_wake() can happen in response to an IRQ, and when you grab a spinlock in IRQ context, you need to disable IRQs. There is a good explanation of why, here: > I don't get this part. Why do we have to turn off the irq for > tickling a pcpu? > We don't. The point here is that, in Xen, we want spinlocks to either _always_ be taken with IRQs disabled or _always_ be taken with IRQs enabled. Well, since we now take the scheduler's runqueue lock in vcpu_wake() (and as said that must be done with IRQs disabled), this implies that the scheduler's runqueue lock needs to always be taken with IRQs disabled, even when leaving them enabled would not actually be an issue, for being consistent with the locking rule. So, if we manage to avoid taking the scheduler's runqueue lock in vcpu_wake(), we will manage to put ourselves in the opposite situation, wrt the locking consistency rule, i.e., we can _always_ take the scheduler's runqueue lock with IRQs enabled.
This is, broadly speaking, the problem we wanted to solve, and how we thought about solving it. > And does enabling irq in schedule handlers benefit the performance? Or > what is the > motivation behind this? > As I said to you about your modification, here too it is not super-easy to tell in advance whether, and if yes, by how much, we'll see a boost in performance. In this case, the idea is that, in general, keeping IRQs disabled is bad. It is bad for concurrency, it is bad for latency in both scheduling and when it comes to reacting to external events. And it has been profiled that the scheduler is one of the components that keeps the IRQs disabled for long chunks of time. I honestly did expect things to improve a bit, especially on certain workloads, but of course, as usual, benchmarks will tell. :-) > Didn't look into your code yet. > Ok. Even if/when you look at it, bear in mind it's still only a stub. > From the description from Andrew, I come up with > several questions: > 1. which per-cpu list you want to queue the vcpu on? v->processor? > ISTR having thought, and even given a try, to both v->processor (where v is the waking vcpu) and the current cpu (where for 'current cpu' I meant the cpu where the wake-up is occurring). I think I decided to use the current cpu (mostly for the same reasons why I don't think there is much advantage in what you are trying to do :-P) > 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? > It would call what is now vcpu_wake(). So, yes, for most schedulers that means inserting and tickling, but it's really scheduler specific. > 3. purpose is so that schedule softirq could run with irq enabled > Purpose is that scheduling functions can be executed with IRQs enabled, yes. > 4. the schedule softirq would do the jobs currently in vcpu_wake()? > Err... not sure what you mean (but maybe I've already answered at point 2.?). > Taking the per-cpu branch as an example, I see several commits. The top > 6 are > related ones, right?
> The top 4 (I told you it's a stub :-P). Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:12 ` Dario Faggioli @ 2016-09-22 9:55 ` Dario Faggioli 2016-09-24 7:26 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:55 UTC (permalink / raw) To: Wei Yang, Wei Yang Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 711 bytes --] On Thu, 2016-09-22 at 11:12 +0200, Dario Faggioli wrote: > _Almost_ correct. However, the problem is more that vcpu_wake() can > happen in response to an IRQ, and when you grab a spinlock in IRQ > context, you need to disable IRQs. > > There is a good explanation of why, here: > Ah, sorry, link is missing! Here: https://www.kernel.org/doc/Documentation/locking/spinlocks.txt Read the "Lesson 3: spinlocks revisited." section. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:55 ` Dario Faggioli @ 2016-09-24 7:26 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-24 7:26 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, xen-devel, liuxiaojian6, george.dunlap, Wei Yang, mengxu, Wei Yang On Thu, Sep 22, 2016 at 11:55:16AM +0200, Dario Faggioli wrote: >On Thu, 2016-09-22 at 11:12 +0200, Dario Faggioli wrote: >> _Almost_ correct. However, the problem is more that vcpu_wake() can >> happen in response to an IRQ, and when you grab a spinlock in IRQ >> context, you need to disable IRQs. >> >> There is a good explanation of why, here: >> >Ah, sorry, link is missing! Here: > >https://www.kernel.org/doc/Documentation/locking/spinlocks.txt > >Read the "Lesson 3: spinlocks revisited." section. > Oh, you have a link here ... let me digest it. >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:12 ` Dario Faggioli 2016-09-22 9:55 ` Dario Faggioli @ 2016-09-24 3:39 ` Wei Yang 2016-09-26 10:18 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-24 3:39 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: >On Sat, 2016-09-17 at 03:30 +0000, Wei Yang wrote: >> Dario, >> >Hey, hi again, and sorry for the delay in getting back at this particular >part of the conversation. > Sure, I was busy with other stuff these days :-( >> Just got a chance to look into this. This is interesting and I am >> trying to >> understand the problem you want to solve first. >> >:-) > >> vcpu_wake() is a _special_ case which has to turn off the irq, because >> SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle >> a pcpu to pick >> up the queued vcpu. And the tickling part needs the irq to be turned >> off. >> >_Almost_ correct. However, the problem is more that vcpu_wake() can >happen in response to an IRQ, and when you grab a spinlock in IRQ >context, you need to disable IRQs. > Ok, now I have a question about this sentence. I have checked some posts which say in the interrupt handling, irq would be disabled. Well, this really depends on the implementation. This means in Xen, irq is still enabled during interrupt handling? Because of this, we have to disable IRQ when we need to grab the spin lock? >There is a good explanation of why, here: > > >> I don't get this part. Why do we have to turn off the irq for >> tickling a pcpu? >> >We don't. The point here is that, in Xen, we want spinlocks to either >_always_ be taken with IRQs disabled or _always_ be taken with IRQs >enabled.
>Well, since we now take the scheduler's runqueue lock in >vcpu_wake() (and as said that must be done with IRQs disabled), this >implies that the scheduler's runqueue lock needs to always be taken >with IRQs disabled, even when leaving them enabled would not actually >be an issue, for being consistent with the locking rule. > >So, if we manage to avoid taking the scheduler's runqueue lock in vcpu_wake(), we will manage to put ourselves in the opposite situation, wrt the locking consistency rule, i.e., we can _always_ take the scheduler's runqueue lock with IRQs enabled. > >This is, broadly speaking, the problem we wanted to solve, and how we thought about solving it. Looks like I almost get everything. Let me do a summary to see whether I have got your words. (spinlock usage convention in Xen) & (vcpu_wake() may happen in an IRQ) => (irq all disabled || irq all enabled) & (disable irq on grabbing spinlock) => should grab schedule lock with IRQ disabled
>> >ISTR having thought, and even given a try, to both v->processor (where >v is the waking vcpu) and the current cpu (where for 'current cpu' I >meant the cpu where the wake-up is occurring). > >I think I decided to use the current cpu (mostly for the same reasons >why I don't think there is much advantage in what you are trying to do >:-P) > >> 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? >> >It would call what is now vcpu_wake(). So, yes, for most schedulers that >means inserting and tickling, but it's really scheduler specific. > >> 3. purpose is so that schedule softirq could run with irq enabled >> >Purpose is that scheduling functions can be executed with IRQs enabled, >yes. > >> 4. the schedule softirq would do the jobs currently in vcpu_wake()? >> >Err... not sure what you mean (but maybe I've already answered at point >2.?). > >> Taking the per-cpu branch as an example, I see several commits. The top >> 6 are >> related ones, right? >> >The top 4 (I told you it's a stub :-P). > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-24 3:39 ` Wei Yang @ 2016-09-26 10:18 ` Dario Faggioli 2016-09-28 3:44 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-26 10:18 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, mengxu, Wei Yang [-- Attachment #1.1: Type: text/plain, Size: 2327 bytes --] On Sat, 2016-09-24 at 11:39 +0800, Wei Yang wrote: > On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: > > _Almost_ correct. However, the problem is more that vcpu_wake() can > > happen in response to an IRQ, and when you grab a spinlock in IRQ > > context, you need to disable IRQs. > > > > Ok, now I have a question about this sentence. > > I have checked some posts which say in the interrupt handling, irq > would be > disabled. Well, this really depends on the implementation. > > This means in Xen, irq is still enabled during interrupt handling? > Because of > this, we have to disable IRQ when we need to grab the spin lock? > So, I don't think I'm getting all you're saying and asking. The situation is like this: currently, vcpu_wake() can be called both from inside or outside IRQ context, and both with IRQs enabled or disabled. Given this situation, to be safe, we need to disable IRQs when taking whatever spinlock vcpu_wake() (and the functions it calls, of course) takes. Since it takes the scheduler's runq lock, this means we need to take the lock with IRQ disabled. And because of that, _everyone_ that takes the scheduler's lock must do that with IRQs disabled. If we manage to *not* take the scheduler's lock with IRQ disabled in vcpu_wake(), we then can do the same everywhere else. But we can't do that with the current structure of the code, and that's why we're thinking to defer the actual wakeup --i.e., the actual call to vcpu_wake()-- to a moment when: - IRQs are enabled, - we don't need to disable them. This is it. About the following...
> Looks like I almost get everything. Let me do a summary to see whether I > have got > your words. > > (spinlock usage convention in Xen) & (vcpu_wake() may happen in > an IRQ) > This is ok. > => > > (irq all disabled || irq all enabled) & (disable irq on grabbing > spinlock) > This, I don't get it. > => > > should grab schedule lock with IRQ disabled > Yes, sounds right. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-26 10:18 ` Dario Faggioli @ 2016-09-28 3:44 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-28 3:44 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Mon, Sep 26, 2016 at 12:18:06PM +0200, Dario Faggioli wrote: >On Sat, 2016-09-24 at 11:39 +0800, Wei Yang wrote: >> On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: >> > _Almost_ correct. However, the problem is more that vcpu_wake() can >> > happen in response to an IRQ, and when you grab a spinlock in IRQ >> > context, you need to disable IRQs. >> > >> >> Ok, now I have a question about this sentence. >> >> I have checked some posts which say in the interrupt handling, irq >> would be >> disabled. Well, this really depends on the implementation. >> >> This means in Xen, irq is still enabled during interrupt handling? >> Because of >> this, we have to disable IRQ when we need to grab the spin lock? >> >So, I don't think I'm getting all you're saying and asking. > >The situation is like this: currently, vcpu_wake() can be called both >from inside or outside IRQ context, and both with IRQs enabled or >disabled. > >Given this situation, to be safe, we need to disable IRQs when taking >whatever spinlock vcpu_wake() (and the functions it calls, of course) >takes. Since it takes the scheduler's runq lock, this means we need to >take the lock with IRQ disabled. > >And because of that, _everyone_ that takes the scheduler's lock must do >that with IRQs disabled. If we manage to *not* take the scheduler's >lock with IRQ disabled in vcpu_wake(), we then can do the same >everywhere else. But we can't do that with the current structure of the >code, and that's why we're thinking to defer the actual wakeup --i.e., >the actual call to vcpu_wake()-- to a moment when: > - IRQs are enabled, > - we don't need to disable them.
> Dario, I appreciate your patience and almost get the background and your purpose. First, let me give my conclusion. By taking the schedule lock with IRQ enabled, it will improve the parallelism to some extent, while it would be hard to say the net effect. Need more data from benchmarks to verify. Then, let me describe what I understand. (Maybe not correct :-)) This is all about SMP, spinlocks and IRQ/IRQ context. Two basic assumptions: a. all interrupts are the same priority; an interrupt could not be preempted by others. b. in the whole interrupt handler, IRQ is disabled on the local processor Let's assume there is only one processor first, 1. If the spinlock is just used in the interrupt handler, it looks safe, no race condition. 2. While if the spinlock may be used outside the interrupt handler, ok, we need to at least disable IRQ when grabbing the lock outside the interrupt handler. Otherwise there is deadlock. Then if there are several CPUs in the system. 1. If the spinlock is just used in the interrupt handler, would this still be safe without disabling the IRQ? It looks ok to me. 2. While if the spinlock may be used outside the interrupt handler, ok, we need to at least disable IRQ when grabbing the lock outside the interrupt handler. This leads to the spinlock usage convention: if a spinlock would be grabbed in an interrupt handler (IRQ context), we need to grab it with IRQ disabled _always_. Then, to achieve your purpose -- grab the spinlock with IRQ enabled, you need to move all the lock operations outside of the interrupt handler. What you are doing in the code is to defer the job to softirq when it is in_irq. By now, I can understand the first part of your check. If vcpu_wake() is not running in_irq() we can do the job directly now. But not the second part, why do we need to check the local irq is also enabled? In my mind, checking it is not in_irq() is enough. Thanks a lot for your time :-) >This is it. About the following... > >> Looks like I almost get everything.
>> Let me do a summary to see whether I >> have got >> your words. >> >> (spinlock usage convention in Xen) & (vcpu_wake() may happen in >> an IRQ) >> >This is ok. > >> => >> >> (irq all disabled || irq all enabled) & (disable irq on grabbing >> spinlock) >> >This, I don't get it. > >> => >> >> should grab schedule lock with IRQ disabled >> >Yes, sounds right. > >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-13 11:30 ` [Help] Trigger Watchdog when adding an IPI in vcpu_wake Dario Faggioli 2016-09-14 10:44 ` Wei Yang @ 2016-09-18 4:03 ` Wei Yang 2016-09-22 9:47 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-18 4:03 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang, mengxu, Wei Yang On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >[using xendevel correct address] > >On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote: >> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote: >> > >> > I'm not surprised by that. Yet, I'd be interested in hearing more >> > about this profiling you have done (things like, how you captured >> > the data, what workloads you are exactly considering, how you >> > determined what is the bottleneck, etc). >> Let me try to explain this. >> >> Workload: a. Blue Screen in Windows Guests >> b. some client using Windows to do some video >> processing >> which need precise timestamp (we are not sure the >> core >> reason but we see the system is very slow) >> >Do you mind sharing just a bit more, such as: > - number of pcpus > - number of vcpus of the various VMs > >I also am not sure what "a. Blue screen in Windows guests" above >means... is there a workload called "Blue Screen"? Or is it that you >are hitting a BSOD in some of your guests (which ones, what were they >doing)? Or is it that you _want_ to provoke a BSOD on some of your >guests? Or is that something else? :-P > >> Capture the data: lock profile >> Bottleneck Check: The schedule lock wait time is really high >> >Ok, cool. Interesting that lock profiling works on 4.1! :-O > >> > The scheduler tries to see whether the v->processor of the waking >> > vcpu can be re-used, but that's not at all guaranteed, and again, >> > on a very loaded system, not even that likely! >> > >> Hmm... I may missed something. 
>> >> Took your assumption below for example. >> In my mind, the process looks like this: >> >> csched_vcpu_wake() >> __runq_insert(), insert the vcpu in pcpu's 6 runq >> __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other >> (1) >> >Well, yes. More precisely, at least in current staging, >SCHEDULE_SOFTIRQ is raised for pcpu 6: > - if pcpu 6 is idle, or > - if pcpu 6 is not idle but there actually isn't any idle vcpu, and > the waking vcpu is higher in priority than what is running on pcpu 6 > >> __do_softirq() >> schedule() >> csched_schedule(), for pcpu 6, it may wake up d2v2 based >> on it's >> priority >> >Yes, but it is pcpu 6 that will run csched_schedule() only if the >conditions mentioned above are met. If not, it will be some other pcpu >that will do so (or none!). > >But I just realized that the fact that you are actually working on 4.1 >is going to be an issue. In fact, the code that it is now in Xen has >changed **a lot** since 4.1. In fact, you're missing all the soft- >affinity work (which you may or may not find useful) but also >improvements and bugfixing. > >I'll have a quick look at how __runq_tickle() looks like in Xen 4.1, >but there's very few chances I'll be available to provide detailed >review, advice, testing, etc., on such an old codebase. :-( > >> By looking at the code, what I missed may be in __runq_tickle(), >> which in case >> there are idle pcpu, schedule_softirq is also raised on them. By >> doing so, >> those idle pcpus would steal other busy tasks and may in chance get >> d2v2? >> >Yes, it looks like, in Xen 4.1, this is more or less what happens. The >idea is to always tickle pcpu 6, if the priority of the waking vcpu is >higher than what is running there. 
> >If pcpu 6 was not idle, we also tickle one or more idle pcpus so that: > - if the waking vcpu preempted what was running on pcpu 6, an idler > can pick-up ("steal") such preempted vcpu and run it; > - if the waking vcpu ended up in the runqueue, an idler can steal it > >> BTW, if the system is in heavy load, how much chance would we have >> idle pcpu? >> >It's not very likely that there will be idle pcpus in a very busy >system, I agree. > >> > > We can divide this in three steps: >> > > a) Send IPI to the target CPU and tell it which vcpu we want it >> > > to >> > > wake up. >> > > b) The interrupt handler on target cpu insert vcpu to a percpu >> > > queue >> > > and raise >> > > softirq. >> > > c) softirq will dequeue the vcpu and wake it up. >> > > >> > I'm not sure I see how you think this would improve the situation. >> > >> > So, quickly, right now: >> > >> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up >> > - let's assume d2v2->processor = 6 >> > - let's assume the wakeup is happening on pcpu 4 >> > >> > Right now: >> > >> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2, >> > which is the runqueue lock of pcpu 6 (in Credit1, there is 1 >> > runqueue per pcpu, and locks are per-cpu already) >> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >> > - still executing on pcpu 4, __runq_tickle() is called, and it >> > determines on which pcpu d2v2 should run >> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the >> > following two cases: >> > a) if it was determined that d2v2 should run on pcpu 6: >> > - in a (little, hopefully) while, pcpu 6 will schedule >> > - it takes its own runqueue lock >> > - it finds d2v2 in there, and runs it >> Even in this case, it depends on the priority of d2v2 whether it >> would be run >> now. Right? >> >Yes. 
>But both in 4.1 and in current staging, we only raise >SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking >vcpu is higher and it should preempt the currently running vcpu. > >> > b) if it was determined that d2v2 should run on, say, pcpu 9: >> > - in a (little, hopefully) while, pcpu 9 will schedule >> > - it takes its own runqueue lock >> > - if it has nothing in its runqueue (likely, if >> > __runq_tickle() chose it, but not necessarily and always >> > true, especially under high load), it looks around to >> > see if other pcpus have any vcpu that it can execute >> > - it will try to take the locks of the various pcpu's >> > runqueues it comes across >> > - if it manages to take the lock of pcpu's 6 runqueue, it >> > sees d2v2 in it, steals it and runs it. >> Oh, my understanding matches yours :-) >> >> BTW, we also found csched_load_balance() will hold the schedule >> lock but >> not find a proper vcpu to steal. Maybe this would be a potential >> optimization >> point. >> >Mmm... I actually don't understand what you mean here... what is the >possible optimization? > >> > With your modifications, AFAICT: >> > >> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock >> > of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list >> > - poke pcpu 6 with an IPI >> > - in a (little, hopefully) while, pcpu 6 reacts to the IPI and, >> > _I_think_, calls vcpu_wake(d2v2) ? [*] >> You are almost right. Here in the interrupt handler, it does two >> things: >> 1. echo back to notify_remote_vcpu_wake() that it finishes the job >> 2. raise a separate softirq, which will call vcpu_wake(d2v2) >> >> > >> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock >> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >> > - on pcpu 6, __runq_tickle() is called, and it determines on which >> > pcpu d2v2 should run >> At this place, the behavior of __runq_tickle() is what I mentioned >> above. Raise >> softirq on pcpu6 and idle pcpus.
Whichever is faster gets d2v2. >> >Sort of, yes. In general, you can expect pcpu 6 going through >csched_schedule(), and hence setting d2v2 to run, to be faster than >SCHEDULE_SOFTIRQ being notified to someone remote, and it scheduling and >getting to work stealing from pcpu 6's runqueue. > >However, that is the case only if d2v2 had higher priority than what is >running on pcpu 6. If it does not, you: > - _don't_ tickle pcpu 6 > - tickle one or more idlers, if any. > >So, again, it's not that you always wake and run it on pcpu 6. > >> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the >> > following two cases: >> > a) if it was determined that d2v2 should run on pcpu 6 >> > (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself): >> > - at the first softirq check, pcpu 6 will schedule >> > - it takes its own runqueue lock >> > - it finds d2v2 in there, and runs it >> > b) if it was determined that d2v2 should run on, say, pcpu 9: >> > - in a (little, hopefully) while, pcpu 9 will schedule >> > - it takes its own runqueue lock >> > - if it has nothing in its runqueue (likely, if >> > __runq_tickle() chose it, but not necessarily and always >> > true, especially under high load), it looks around to >> > see if other pcpus have any vcpu that it can execute >> > - it will try to take the locks of the various pcpu's >> > runqueues it comes across >> > - if it manages to take the lock of pcpu's 6 runqueue, it >> > sees d2v2 in it, steals it and runs it. >> > >> > [*] I don't see in the code you shared what happens in reaction to >> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually >> > calls vcpu_wake() >> > >> > So, basically, it looks to me that you're adding a level of >> > indirection, I'm not sure for what benefit. > >> In my mind, we are trying to reduce the contention on the schedule lock >> from two >> aspects: >> 1.
for those vcpus that would run on other pcpus, it will take >> wake_vcpu_lock >> instead of the schedule lock >> >I don't see how you can say "instead". It looks to me that, for those >vcpus that would run on other pcpus, we need to take wake_vcpu_lock >_in_addition_ to the runqueue lock. > >> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which >> also >> reduces cache-coherency traffic between pcpus >> >Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a >remote lock, though. > >> > In fact, in the best case (i.e., d2v2 will actually be run on its >> > v->processor), in the original code there is: >> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6. >> > In your case, there is: >> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6 >> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6. >> > >> Our idea may look like this: >> - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu. >> - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6. >> >So, it's even worse than how I imagined! :-P > >> > Also, as far as locking within the wakeup path is concerned, in the >> > original code: >> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >> > (i.e., for the time of __runq_insert() and __runq_tickle()). >> > In your case: >> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the >> > deferred wakeup >> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >> > >> The wake_vcpu_lock is held in the first step above. >> The runqueue lock (I think you mean the schedule lock?) won't be taken >> in this >> case. >> >It will be taken, for executing vcpu_wake(). From your own description >of the mechanism above: > >"Our idea may look like this: > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the > vcpu. > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
> >1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake() >3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for > csched_schedule() > >> > >> > (i.e., for the time of __runq_insert() and __runq_tickle()). >> > >> > Which, to me, looks like both more inter-CPU communication overhead >> > and more chances for lock contention. >> Hmm... yes, more inter-CPU communication, while less lock contention >> I think. >> >More inter-CPU communication, the same level of local lock contention, >plus some added moderately-to-lightly contended remote lock contention, >AFAICT. > >> > So, again, more IPIs and more (potential) lock contention. >> > >> > True, the queueing into the deferred wakeup list will be very simple >> > and quick, and hence the critical section needed for implementing >> > it very short. But the problem is that you are not shortening >> > something long, or splitting something long into two shorter chunks, >> > you are _adding_ something, and thus you can't possibly win. :-/ >> > >> Yes, we are moving something, let me digest it for a while. >> >Sure. Let me also add that I'm not saying it's not going to work, or >that it's not going to improve the situation. > >What I'm saying is that, architecturally, it's not too different from >the current situation, so I would not expect wonders from just this. > >What I'd do would be to try figuring out where it is that the most >waiting time is being accumulated, i.e., which *specific* locking >attempts of the scheduler's runqueues' locks are the most contended, >and focus on those first. > >If you're on a big host, I think the way in which Credit1 does load >balancing (i.e., by work stealing) may be the thing to blame. But all >these are things that only profiling and benchmarking can really tell! >:-) > >As a side note, we're also working on something which, at the end of >the day, is rather similar, although for different purposes.
In fact, >see here: >https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html > Dario, I took a look at your code; some questions below: 1. vcpu_wake is split into two cases, the first case being "not in irq" and "irq enabled". Is there some reason for this classification? Maybe some background knowledge I need to have. 2. In vcpu_wake(), I see you raise the def_vcpu_wake softirq when the list is empty, and then add the vcpu to the list. Any reason for this sequence? BTW, at this moment vcpu_wake() holds the spin lock, and vcpu_wake_deferred() should take the spin lock before continuing. In the last patch, I saw you introduce a flag, wakeup_triggered, and then change the sequence a little. Why do you want to change the sequence there? 3. The main purpose of this series is to enable irq for the "scheduler handlers". I am not sure of the original motivation for this, but I guess this could increase parallelism, right? And I think we have to make sure those functions are not re-entrant, otherwise they will deadlock when locking on the same spin-lock. 4. This looks similar to what we try to achieve. We try to give the vcpu to someone else, who we think may pick it, to wake it up, while in your implementation, you add this vcpu to its own list and wake it up some time later. I guess the next step you want to take is to make the wakeup_defer.list a lock-less list. 5. Well, this may not be relevant to your code directly. I found some difference between raise_softirq() and cpumask_raise_softirq(). The first one just sets the pending bit, while the second one also sends an IPI. __runq_tickle() uses the second one to send the SCHEDULE softirq; does this mean the other pcpus would react immediately? >The goal is not having to disable IRQs during scheduling, and I'm >looking at keeping the per-pcpu deferred wakeup list lockless, in order >not to add more lock contention. > >If you're interested in knowing more or working on that, just let me
But, of course, this have to be done on Xen upstream, not on an >old version. > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > -- Richard Yang\nHelp you, Help me _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-18 4:03 ` Wei Yang @ 2016-09-22 9:47 ` Dario Faggioli 0 siblings, 0 replies; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:47 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 6347 bytes --] On Sun, 2016-09-18 at 12:03 +0800, Wei Yang wrote: > Dario, > > I took a look at your code; some questions below: > Haha, here you are! :-D > 1. vcpu_wake is split into two cases, the first case being "not in irq" > and "irq > enabled". Is there some reason for this classification? Maybe some background > knowledge > I need to have. > Mmm... yes. Well, probably that distinction should not be there, as it is a (potential, benchmark will tell if that's the case) optimization. And optimizations should be added later, with their own patch and patch description, etc... But as I said in the other email, the code is not even RFC. :-/ So, the whole point of the patch series is to leave interrupts enabled inside scheduling functions, including vcpu_wake() (which is becoming _vcpu_wake()). In order to do that, when taking the scheduler lock, we'll start using spin_lock() and vcpu_spin_lock(), instead of vcpu_spinlock_irq[save](). But this also means that IRQs can't be disabled when we do that, or: - we violate Xen's lock consistency rule (there's ASSERT()s checking that in debug builds) - we risk re-enabling interrupts at the wrong time. So, if IRQs are disabled when vcpu_wake() is called, or we are in IRQ context (where not disabling them would expose us to races), we want to go the deferred way the patch series is implementing. However, if vcpu_wake() is called with IRQ enabled and not from IRQ context, none of the above risks stand, and we can avoid doing all the dance required for deferred wakeup, and just call _vcpu_wake().
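The decision just described — wake directly when we are outside IRQ context with IRQs enabled, defer otherwise — can be modeled with a few counters. Everything below is a toy sketch with invented names (`wake_ctx`, with two flags standing in for Xen's real context checks such as `in_irq()`), not the actual patches:

```c
#include <assert.h>

/* Toy model of the deferred-wakeup decision described above.  The
 * counters just record what happened, so the policy is observable. */
struct wake_ctx {
    int in_irq;          /* called from interrupt context? */
    int irqs_enabled;    /* are IRQs currently enabled? */
    int deferred_len;    /* entries on the percpu deferred list */
    int softirqs_raised; /* wakeup softirqs raised so far */
    int direct_wakes;    /* times we could call _vcpu_wake() directly */
};

static void model_vcpu_wake(struct wake_ctx *c)
{
    if (!c->in_irq && c->irqs_enabled) {
        /* Safe case: take the lock with plain spin_lock() and wake
         * the vcpu right away. */
        c->direct_wakes++;
        return;
    }
    /* Unsafe case: defer.  Raise the softirq only if the list was
     * empty -- a non-empty list means someone will already walk it
     * and will find our entry too (one plausible policy from the
     * thread, not necessarily the final one). */
    if (c->deferred_len == 0)
        c->softirqs_raised++;
    c->deferred_len++;
}
```

Under this model, a burst of wakeups from IRQ context costs a single softirq, which is the point of checking for an empty list before raising.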
I'm saying this is a (potential) optimization because deferring the wakeup has a cost, and this 'if' is trying to avoid paying that cost when not strictly necessary. But really, in a proper version of this series I should just always defer the wakeup in this patch, and have another patch, at the end of the series, where this attempt to avoid deferral is implemented, and run benchmarks with and without said patch applied. > 2. In vcpu_wake(), I see you raise the def_vcpu_wake softirq when the > list is > empty, and then add the vcpu to the list. Any reason for this > sequence? BTW, > at this moment vcpu_wake() holds the spin lock, and > vcpu_wake_deferred() should > take the spin lock before continuing. > If the list is empty it means there is no one going through it and waking up vcpus from there. This means I need to raise the softirq that will cause someone to go check the list, so it will find the vcpu that I'm inserting there right now. OTOH, if the list is not empty, it means there is already someone going through the list itself. This means I don't need to raise anything, as the vcpu that I'm inserting in the list right now will be found by such waker, and woken up. This is for avoiding raising too many softirqs and poking too many cpus to run their deferred wakers, which would introduce contention issues on the deferred list lock. The reason why the check is before the insertion is pure cosmetics. I could just as well insert first and check for list_is_singleton(). In the code you see there, that does not matter, because the deferred list has its spinlock. But this will probably change if I get rid of that, making the list lockless. > In the last patch, I saw you introduce a flag, wakeup_triggered, and > then > change the sequence a little. Why do you want to change > the > sequence there? > It was another attempt at optimization. But again, this is all changing with the list becoming lockless. > 3.
The main purpose of this series is to enable irq for the "scheduler > handlers". > I am not sure of the original motivation for this, but I guess this > could > increase parallelism, right? And I think we have to make sure > those > functions are not re-entrant, otherwise they will deadlock when > locking > on the same spin-lock. > Hopefully, I explained the purpose in my previous email. I'm not sure what you mean here with 're-entrant'. The only reason why we want IRQs to be disabled when taking a lock, is because we take that lock from both inside and outside of IRQ context. So, if we're sure that a lock is never taken from IRQ context, we can avoid disabling IRQs. And this is what this is all about: the scheduler lock is taken once from IRQ context, and this series is trying to get rid of that one case. > 4. This looks similar to what we try to achieve. We try to > give the > vcpu to someone else, who we think may pick it, to wake it up, while in > your > implementation, you add this vcpu to its own list and wake it up some > time > later. > It does look similar (this is the reason why I mentioned it to you :-)), although: - both the goal and the expectations are different, - I'm not really 'giving the vcpu to someone else', I'm _deferring_ it, i.e., it is still going to happen on the same cpu where the event happens, only a little later in time (when no longer in IRQ context). > I guess the next step you want to take is to make the wakeup_defer.list a > lock-less list. > Yes, that would be the plan (and refreshing the series, of course). > 5. Well, this may not be relevant to your code directly. I found some > difference > between raise_softirq() and cpumask_raise_softirq(). The first one > just sets the > pending bit, while the second one also sends an IPI. > Exactly. > __runq_tickle() uses the > second one to send the SCHEDULE softirq; does this mean the other pcpus would > react > immediately? > I don't think there's much difference in observable behavior.
Perhaps, in cpumask_raise_softirq(), if the current cpu is in the mask, it would be enough to set a pending bit for it. However, this is something very different again between 4.1 and upstream. In fact, check these commits (not present in Xen 4.1): c47c316d99b3b570d0b "x86/HVM: batch vCPU wakeups" 9a727a813e9b25003e4 "x86: suppress event check IPI to MWAITing CPUs" So I can't really be sure or comment much on this. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
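For reference, the raise_softirq()/cpumask_raise_softirq() difference that question 5 asks about can be modeled as follows. This is a simplified, hypothetical sketch (the `model_` names are invented): the real functions live in Xen's common softirq code and also batch and filter the IPIs, as the commits quoted above show.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* Per-cpu state for the model: a softirq pending bitmask, plus a
 * counter of IPIs received (so the difference is observable). */
struct cpu_state {
    uint32_t pending;
    int ipis_received;
};

/* raise_softirq() flavor: just set the pending bit on the current
 * cpu; it is noticed at the next softirq check, with no IPI. */
static void model_raise_softirq(struct cpu_state *self, int nr)
{
    self->pending |= 1u << nr;
}

/* cpumask_raise_softirq() flavor: set the bit on every cpu in the
 * mask and IPI the remote ones, so they react immediately instead of
 * at their next natural softirq check. */
static void model_cpumask_raise_softirq(struct cpu_state cpus[],
                                        uint32_t mask, int nr,
                                        int current_cpu)
{
    for (int i = 0; i < NR_CPUS; i++) {
        if (!(mask & (1u << i)))
            continue;
        cpus[i].pending |= 1u << nr;
        if (i != current_cpu)
            cpus[i].ipis_received++;
    }
}
```

In other words, only the cpumask variant makes a remote pcpu react promptly, which is why __runq_tickle() needs it: the tickled pcpu should reschedule right away, not whenever it next happens to check its softirq bits.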