* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Dario Faggioli
Date: 2016-09-13 11:30 UTC
To: Wei Yang, xen-devel
Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6

[using xendevel correct address]

On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
> >
> > I'm not surprised by that. Yet, I'd be interested in hearing more
> > about this profiling you have done (things like, how you captured
> > the data, what workloads you are exactly considering, how you
> > determined what is the bottleneck, etc).
> Let me try to explain this.
>
> Workload: a. Blue Screen in Windows Guests
>           b. some client using Windows to do some video processing
>              which needs precise timestamps (we are not sure of the
>              core reason but we see the system is very slow)
>
Do you mind sharing just a bit more, such as:
 - number of pcpus
 - number of vcpus of the various VMs

I also am not sure what "a. Blue screen in Windows guests" above
means... is there a workload called "Blue Screen"? Or is it that you
are hitting a BSOD in some of your guests (which ones, what were they
doing)? Or is it that you _want_ to provoke a BSOD on some of your
guests? Or is that something else? :-P

> Capture the data: lock profile
> Bottleneck Check: The schedule lock wait time is really high
>
Ok, cool. Interesting that lock profiling works on 4.1! :-O

> > The scheduler tries to see whether the v->processor of the waking
> > vcpu can be re-used, but that's not at all guaranteed, and again,
> > on a very loaded system, not even that likely!
> >
> Hmm... I may have missed something.
>
> Took your assumption below for example.
> In my mind, the process looks like this:
>
>     csched_vcpu_wake()
>         __runq_insert(), insert the vcpu in pcpu's 6 runq
>         __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other (1)
>
Well, yes. More precisely, at least in current staging,
SCHEDULE_SOFTIRQ is raised for pcpu 6:
 - if pcpu 6 is idle, or
 - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
   the waking vcpu is higher in priority than what is running on pcpu 6

>     __do_softirq()
>         schedule()
>             csched_schedule(), for pcpu 6, it may wake up d2v2 based
>             on its priority
>
Yes, but it is pcpu 6 that will run csched_schedule() only if the
conditions mentioned above are met. If not, it will be some other pcpu
that will do so (or none!).

But I just realized that the fact that you are actually working on 4.1
is going to be an issue. In fact, the code that is now in Xen has
changed **a lot** since 4.1. In fact, you're missing all the
soft-affinity work (which you may or may not find useful) but also
improvements and bugfixing.

I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
but there's very little chance I'll be available to provide detailed
review, advice, testing, etc., on such an old codebase. :-(

> By looking at the code, what I missed may be in __runq_tickle(),
> which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on them
> too. By doing so, those idle pcpus would steal other busy tasks and
> may by chance get d2v2?
>
Yes, it looks like, in Xen 4.1, this is more or less what happens. The
idea is to always tickle pcpu 6, if the priority of the waking vcpu is
higher than what is running there.

If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
 - if the waking vcpu preempted what was running on pcpu 6, an idler
   can pick up ("steal") such preempted vcpu and run it;
 - if the waking vcpu ended up in the runqueue, an idler can steal it

> BTW, if the system is in heavy load, how much chance would we have
> idle pcpus?
It's not very likely that there will be idle pcpus in a very busy
system, I agree.

> > > We can divide this in three steps:
> > > a) Send IPI to the target CPU and tell it which vcpu we want it
> > >    to wake up.
> > > b) The interrupt handler on the target cpu inserts the vcpu into
> > >    a percpu queue and raises a softirq.
> > > c) The softirq will dequeue the vcpu and wake it up.
> > >
> > I'm not sure I see how you think this would improve the situation.
> >
> > So, quickly, right now:
> >
> >  - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
> >  - let's assume d2v2->processor = 6
> >  - let's assume the wakeup is happening on pcpu 4
> >
> > Right now:
> >
> >  - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
> >    which is the runqueue lock of pcpu 6 (in Credit1, there is 1
> >    runqueue per pcpu, and locks are per-cpu already)
> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> >  - still executing on pcpu 4, __runq_tickle() is called, and it
> >    determines on which pcpu d2v2 should run
> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >    following two cases:
> >    a) if it was determined that d2v2 should run on pcpu 6:
> >       - in a (little, hopefully) while, pcpu 6 will schedule
> >       - it takes its own runqueue lock
> >       - it finds d2v2 in there, and runs it
> Even in this case, it depends on the priority of d2v2 whether it
> would be run now. Right?
>
Yes. But both in 4.1 and in current staging, we only raise
SCHEDULE_SOFTIRQ on pcpu 6 if and only if the priority of the waking
vcpu is higher and it should preempt the currently running vcpu.
> > b) if it was determined that d2v2 should run on, say, pcpu 9:
> >    - in a (little, hopefully) while, pcpu 9 will schedule
> >    - it takes its own runqueue lock
> >    - if it has nothing in its runqueue (likely, if
> >      __runq_tickle() chose it, but not necessarily and always
> >      true, especially under high load), it looks around to
> >      see if other pcpus have any vcpu that it can execute
> >    - it will try to take the locks of the various pcpus'
> >      runqueues it comes across
> >    - if it manages to take the lock of pcpu's 6 runqueue, it
> >      sees d2v2 in it, steals it and runs it.
> Oh, my understanding matches yours :-)
>
> BTW, we also found that csched_load_balance() will hold the schedule
> lock but not find a proper vcpu to steal. Maybe this would be a
> potential optimization point.
>
Mmm... I actually don't understand what you mean here... what is the
possible optimization?

> > With your modifications, AFAICT:
> >
> >  - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
> >    of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list
> >  - poke pcpu 6 with an IPI
> >  - in a (little, hopefully) while, pcpu 6 reacts to the IPI and,
> >    _I_think_, calls vcpu_wake(d2v2) ? [*]
> You are almost right. Here in the interrupt handler, it does two
> things:
> 1. echo back to notify_remote_vcpu_wake() that it finished the job
> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>
> >
> >  - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
> >  - on pcpu 6, __runq_tickle() is called, and it determines on which
> >    pcpu d2v2 should run
> At this place, the behavior of __runq_tickle() is what I mentioned
> above: raise softirq on pcpu 6 and the idle pcpus. Whoever is faster
> gets d2v2.
>
Sort of, yes.
In general, you can expect pcpu 6 going through csched_schedule(), and
hence setting d2v2 to run, to be faster than SCHEDULE_SOFTIRQ being
notified to someone remote, and that someone scheduling and getting to
work stealing from pcpu 6's runqueue.

However, that is the case only if d2v2 had higher priority than what is
running on pcpu 6. If it does not, you:
 - _don't_ tickle pcpu 6
 - tickle one or more idlers, if any.

So, again, it's not that you always wake and run it on pcpu 6

> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
> >   following two cases:
> >   a) if it was determined that d2v2 should run on pcpu 6
> >      (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
> >      - at the first softirq check, pcpu 6 will schedule
> >      - it takes its own runqueue lock
> >      - it finds d2v2 in there, and runs it
> >   b) if it was determined that d2v2 should run on, say, pcpu 9:
> >      - in a (little, hopefully) while, pcpu 9 will schedule
> >      - it takes its own runqueue lock
> >      - if it has nothing in its runqueue (likely, if
> >        __runq_tickle() chose it, but not necessarily and always
> >        true, especially under high load), it looks around to
> >        see if other pcpus have any vcpu that it can execute
> >      - it will try to take the locks of the various pcpus'
> >        runqueues it comes across
> >      - if it manages to take the lock of pcpu's 6 runqueue, it
> >        sees d2v2 in it, steals it and runs it.
> >
> > [*] I don't see in the code you shared what happens in reaction to
> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
> > calls vcpu_wake()
> >
> > So, basically, it looks to me that you're adding a level of
> > indirection, I'm not sure for what benefit.
> In my mind, we are trying to reduce the contention on the schedule
> lock from two aspects:
> 1. for those vcpus that would run on other pcpus, it will take the
>    wake_vcpu_lock instead of the schedule lock
>
I don't see how you can say "instead".
It looks to me that, for those vcpus that would run on other pcpus, we
need to take the wake_vcpu_lock _in_addition_ to the runqueue lock.

> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
>    also reduces the cache coherency traffic between pcpus
>
Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
remote lock, though.

> > In fact, in the best case (i.e., d2v2 will actually be run on its
> > v->processor), in the original code there is:
> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
> > In your case, there is:
> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
> >  - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
> >
> Our idea may look like this:
>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>
So, it's even worse than how I imagined! :-P

> > Also, as far as locking within the wakeup path is concerned, in the
> > original code:
> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> >    (i.e., for the time of __runq_insert() and __runq_tickle()).
> > In your case:
> >  - the wake_vcpu_lock of pcpu 6 is held during queueing of the
> >    deferred wakeup
> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
> >
> The wake_vcpu_lock is held in the first step above.
> The runqueue lock (I think you mean the schedule lock?) won't be
> taken in this case.
>
It will be taken, for executing vcpu_wake(). From your own description
of the mechanism above:

"Our idea may look like this:
 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
   csched_schedule()

> > > > (i.e., for the time of __runq_insert() and __runq_tickle()).
> > >
> > > Which, to me, looks like both more inter-CPU communication
> > > overhead and more chances for lock contention.
> > Hmm... yes, more inter-CPU communication, while less lock
> > contention I think.
> >
More inter-CPU communication, the same level of local lock contention,
plus some additional, moderately to lightly contended, remote locking,
AFAICT.

> > So, again, more IPIs and more (potential) lock contention.
> >
> > True, the queueing into the deferred wakeup list will be very
> > simple and quick, and hence the critical section needed for
> > implementing it very short. But the problem is that you are not
> > shortening something long, or splitting something long in two
> > shorter chunks; you are _adding_ something, and thus you can't
> > possibly win. :-/
> >
> Yes, we are moving some things around, let me digest it for a while.
>
Sure. Let me also add that I'm not saying it's not going to work, or
that it's not going to improve the situation.

What I'm saying is that, architecturally, it's not too different from
the current situation, so I would not expect wonders from just this.

What I'd do would be to try figuring out where it is that the most
waiting time is being accumulated, i.e., which *specific* locking
attempts on the scheduler's runqueues' locks are the most contended,
and focus on those first.

If you're on a big host, I think the way in which Credit1 does load
balancing (i.e., by work stealing) may be the thing to blame. But all
these are things that only profiling and benchmarking can really tell!
:-)

As a side note, we're also working on something which, at the end of
the day, is rather similar, although for different purposes.
In fact, see here:
https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html

The goal is not having to disable IRQs during scheduling, and I'm
looking at keeping the per-pcpu deferred wakeup list lockless, in
order not to add more lock contention.

If you're interested in knowing more or working on that, just let me
know. But, of course, this has to be done on Xen upstream, not on an
old version.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Wei Yang
Date: 2016-09-14 10:44 UTC
To: Dario Faggioli
Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang,
    mengxu, Wei Yang

On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>[using xendevel correct address]
>
>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
>> >
>> > I'm not surprised by that. Yet, I'd be interested in hearing more
>> > about this profiling you have done (things like, how you captured
>> > the data, what workloads you are exactly considering, how you
>> > determined what is the bottleneck, etc).
>> Let me try to explain this.
>>
>> Workload: a. Blue Screen in Windows Guests
>>           b. some client using Windows to do some video processing
>>              which needs precise timestamps (we are not sure of the
>>              core reason but we see the system is very slow)
>>
>Do you mind sharing just a bit more, such as:
> - number of pcpus
> - number of vcpus of the various VMs

160 pcpus
16 vcpus per VM, and 8 VMs

>
>I also am not sure what "a. Blue screen in Windows guests" above
>means... is there a workload called "Blue Screen"? Or is it that you
>are hitting a BSOD in some of your guests (which ones, what were they
>doing)? Or is it that you _want_ to provoke a BSOD on some of your
>guests? Or is that something else? :-P
>
Yes, the "Blue Screen" is what we use to mimic the behavior from the
client.

The "Blue Screen" will force the hypervisor to do load balancing, in
my mind.

>> Capture the data: lock profile
>> Bottleneck Check: The schedule lock wait time is really high
>>
>Ok, cool. Interesting that lock profiling works on 4.1!
:-O
>
>> > The scheduler tries to see whether the v->processor of the waking
>> > vcpu can be re-used, but that's not at all guaranteed, and again,
>> > on a very loaded system, not even that likely!
>> >
>> Hmm... I may have missed something.
>>
>> Took your assumption below for example.
>> In my mind, the process looks like this:
>>
>>     csched_vcpu_wake()
>>         __runq_insert(), insert the vcpu in pcpu's 6 runq
>>         __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other (1)
>>
>Well, yes. More precisely, at least in current staging,
>SCHEDULE_SOFTIRQ is raised for pcpu 6:
> - if pcpu 6 is idle, or
> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
>   the waking vcpu is higher in priority than what is running on pcpu 6
>
>>     __do_softirq()
>>         schedule()
>>             csched_schedule(), for pcpu 6, it may wake up d2v2 based
>>             on its priority
>>
>Yes, but it is pcpu 6 that will run csched_schedule() only if the
>conditions mentioned above are met. If not, it will be some other pcpu
>that will do so (or none!).
>
>But I just realized that the fact that you are actually working on 4.1
>is going to be an issue. In fact, the code that is now in Xen has
>changed **a lot** since 4.1. In fact, you're missing all the
>soft-affinity work (which you may or may not find useful) but also
>improvements and bugfixing.
>
That's true...

>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1,
>but there's very little chance I'll be available to provide detailed
>review, advice, testing, etc., on such an old codebase. :-(
>
>> By looking at the code, what I missed may be in __runq_tickle(),
>> which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on them
>> too. By doing so, those idle pcpus would steal other busy tasks and
>> may by chance get d2v2?
>>
>Yes, it looks like, in Xen 4.1, this is more or less what happens. The
>idea is to always tickle pcpu 6, if the priority of the waking vcpu is
>higher than what is running there.
>
Hmm...
in case there are idle pcpus and the priority of the waking vcpu is
higher than what is running on pcpu 6, would pcpu 6 have more chance
to run it? Or would another idle pcpu steal it from pcpu 6? Or do they
have an equal chance?

>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
> - if the waking vcpu preempted what was running on pcpu 6, an idler
>   can pick up ("steal") such preempted vcpu and run it;
> - if the waking vcpu ended up in the runqueue, an idler can steal it
>
Hmm... I don't get the difference between these two cases.

It looks like both are an idler stealing the vcpu.

>> BTW, if the system is in heavy load, how much chance would we have
>> idle pcpus?
>>
>It's not very likely that there will be idle pcpus in a very busy
>system, I agree.
>
>> > > We can divide this in three steps:
>> > > a) Send IPI to the target CPU and tell it which vcpu we want it
>> > >    to wake up.
>> > > b) The interrupt handler on the target cpu inserts the vcpu into
>> > >    a percpu queue and raises a softirq.
>> > > c) The softirq will dequeue the vcpu and wake it up.
>> > >
>> > I'm not sure I see how you think this would improve the situation.
>> >
>> > So, quickly, right now:
>> >
>> >  - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
>> >  - let's assume d2v2->processor = 6
>> >  - let's assume the wakeup is happening on pcpu 4
>> >
>> > Right now:
>> >
>> >  - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
>> >    which is the runqueue lock of pcpu 6 (in Credit1, there is 1
>> >    runqueue per pcpu, and locks are per-cpu already)
>> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> >  - still executing on pcpu 4, __runq_tickle() is called, and it
>> >    determines on which pcpu d2v2 should run
>> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu.
Let's look at the
>> >    following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6:
>> >       - in a (little, hopefully) while, pcpu 6 will schedule
>> >       - it takes its own runqueue lock
>> >       - it finds d2v2 in there, and runs it
>> Even in this case, it depends on the priority of d2v2 whether it
>> would be run now. Right?
>>
>Yes. But both in 4.1 and in current staging, we only raise
>SCHEDULE_SOFTIRQ on pcpu 6 if and only if the priority of the waking
>vcpu is higher and it should preempt the currently running vcpu.
>
Oh, you are right. I didn't catch this either.

This means that in case
  a) the priority is lower than the current one, and
  b) there is no idler in the system,
the vcpu will sit quietly in the runq and wait for the next schedule.

>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >       - in a (little, hopefully) while, pcpu 9 will schedule
>> >       - it takes its own runqueue lock
>> >       - if it has nothing in its runqueue (likely, if
>> >         __runq_tickle() chose it, but not necessarily and always
>> >         true, especially under high load), it looks around to
>> >         see if other pcpus have any vcpu that it can execute
>> >       - it will try to take the locks of the various pcpus'
>> >         runqueues it comes across
>> >       - if it manages to take the lock of pcpu's 6 runqueue, it
>> >         sees d2v2 in it, steals it and runs it.
>> Oh, my understanding matches yours :-)
>>
>> BTW, we also found that csched_load_balance() will hold the schedule
>> lock but not find a proper vcpu to steal. Maybe this would be a
>> potential optimization point.
>>
>Mmm... I actually don't understand what you mean here... what is the
>possible optimization?
>
>> > With your modifications, AFAICT:
>> >
>> >  - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
>> >    of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list
>> >  - poke pcpu 6 with an IPI
>> >  - in a (little, hopefully) while, pcpu 6 reacts to the IPI and,
>> >    _I_think_, calls vcpu_wake(d2v2) ?
[*]
>> You are almost right. Here in the interrupt handler, it does two
>> things:
>> 1. echo back to notify_remote_vcpu_wake() that it finished the job
>> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>>
>> >
>> >  - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
>> >  - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>> >  - on pcpu 6, __runq_tickle() is called, and it determines on which
>> >    pcpu d2v2 should run
>> At this place, the behavior of __runq_tickle() is what I mentioned
>> above: raise softirq on pcpu 6 and the idle pcpus. Whoever is faster
>> gets d2v2.
>>
>Sort of, yes. In general, you can expect pcpu 6 going through
>csched_schedule(), and hence setting d2v2 to run, to be faster than
>SCHEDULE_SOFTIRQ being notified to someone remote, and that someone
>scheduling and getting to work stealing from pcpu 6's runqueue.
>
>However, that is the case only if d2v2 had higher priority than what is
>running on pcpu 6. If it does not, you:
> - _don't_ tickle pcpu 6
> - tickle one or more idlers, if any.
>
>So, again, it's not that you always wake and run it on pcpu 6
>
I see what you mean :-)

So I am curious why we add this to pcpu's 6 runqueue, when we already
know in __runq_tickle() that the priority is low and we won't tickle
pcpu 6, and we are sure someone will grab pcpu's 6 schedule lock in
the future to take it.

Or, in this case -- when there are idlers and the vcpu's priority is
higher -- instead of letting pcpu 6 wake it, let an idler take it?
Ask someone who is free to help a busy one.

>> >  - it raises the SCHEDULE_SOFTIRQ for such pcpu.
Let's look at the
>> >    following two cases:
>> >    a) if it was determined that d2v2 should run on pcpu 6
>> >       (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
>> >       - at the first softirq check, pcpu 6 will schedule
>> >       - it takes its own runqueue lock
>> >       - it finds d2v2 in there, and runs it
>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>> >       - in a (little, hopefully) while, pcpu 9 will schedule
>> >       - it takes its own runqueue lock
>> >       - if it has nothing in its runqueue (likely, if
>> >         __runq_tickle() chose it, but not necessarily and always
>> >         true, especially under high load), it looks around to
>> >         see if other pcpus have any vcpu that it can execute
>> >       - it will try to take the locks of the various pcpus'
>> >         runqueues it comes across
>> >       - if it manages to take the lock of pcpu's 6 runqueue, it
>> >         sees d2v2 in it, steals it and runs it.
>> >
>> > [*] I don't see in the code you shared what happens in reaction to
>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
>> > calls vcpu_wake()
>> >
>> > So, basically, it looks to me that you're adding a level of
>> > indirection, I'm not sure for what benefit.
>
>> In my mind, we are trying to reduce the contention on the schedule
>> lock from two aspects:
>> 1. for those vcpus that would run on other pcpus, it will take the
>>    wake_vcpu_lock instead of the schedule lock
>>
>I don't see how you can say "instead". It looks to me that, for those
>vcpus that would run on other pcpus, we need to take the
>wake_vcpu_lock _in_addition_ to the runqueue lock.
>
>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which
>>    also reduces the cache coherency traffic between pcpus
>>
>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
>remote lock, though.

Hmm... you are right.

>
>> > In fact, in the best case (i.e., d2v2 will actually be run on its
>> > v->processor), in the original code there is:
>> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
>> > In your case, there is:
>> >  - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
>> >  - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
>> >
>> Our idea may look like this:
>>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>
>So, it's even worse than how I imagined! :-P
>
>> > Also, as far as locking within the wakeup path is concerned, in the
>> > original code:
>> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> >    (i.e., for the time of __runq_insert() and __runq_tickle()).
>> > In your case:
>> >  - the wake_vcpu_lock of pcpu 6 is held during queueing of the
>> >    deferred wakeup
>> >  - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>> >
>> The wake_vcpu_lock is held in the first step above.
>> The runqueue lock (I think you mean the schedule lock?) won't be
>> taken in this case.
>>
>It will be taken, for executing vcpu_wake(). From your own description
>of the mechanism above:
>
>"Our idea may look like this:
>  1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>  2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>  3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>
>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
>2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
>   csched_schedule()
>
Oh, you are right.

We try to let pcpu's 6 schedule lock be grabbed only by pcpu 6 itself,
while the wake_vcpu_lock is the one grabbed by others.

I am thinking our change may help in a situation like this:
 * the vcpu's priority is higher
 * pcpu 9 is idle
 * SCHEDULE_SOFTIRQ is raised on both pcpu 6 and pcpu 9

At that moment, if pcpu 4 is holding pcpu 6's runqueue lock, neither
pcpu 6 nor pcpu 9 can do their job until pcpu 4 releases pcpu 6's
runqueue lock.
While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6
and pcpu 9 could do their job immediately.

Ah I know, a very corner case; not sure about the effect.

BTW, I suddenly found that a softirq is implemented as an IPI. So I am
wondering whether it is safe to raise a softirq to ourselves, since it
looks like we already hold the runqueue lock. Or is it because we have
turned the irq off in vcpu_wake()?

>> > >
>> > > (i.e., for the time of __runq_insert() and __runq_tickle()).
>> >
>> > Which, to me, looks like both more inter-CPU communication
>> > overhead and more chances for lock contention.
>> Hmm... yes, more inter-CPU communication, while less lock
>> contention I think.
>>
>More inter-CPU communication, the same level of local lock contention,
>plus some additional, moderately to lightly contended, remote locking,
>AFAICT.
>
>> > So, again, more IPIs and more (potential) lock contention.
>> >
>> > True, the queueing into the deferred wakeup list will be very
>> > simple and quick, and hence the critical section needed for
>> > implementing it very short. But the problem is that you are not
>> > shortening something long, or splitting something long in two
>> > shorter chunks; you are _adding_ something, and thus you can't
>> > possibly win. :-/
>> >
>> Yes, we are moving some things around, let me digest it for a while.
>>
>Sure. Let me also add that I'm not saying it's not going to work, or
>that it's not going to improve the situation.
>
>What I'm saying is that, architecturally, it's not too different from
>the current situation, so I would not expect wonders from just this.
>
>What I'd do would be to try figuring out where it is that the most
>waiting time is being accumulated, i.e., which *specific* locking
>attempts on the scheduler's runqueues' locks are the most contended,
>and focus on those first.
>
>If you're on a big host, I think the way in which Credit1 does load
>balancing (i.e., by work stealing) may be the thing to blame.
>But all
>these are things that only profiling and benchmarking can really tell!
>:-)
>
>As a side note, we're also working on something which, at the end of
>the day, is rather similar, although for different purposes. In fact,
>see here:
>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html
>
>The goal is not having to disable IRQs during scheduling, and I'm
>looking at keeping the per-pcpu deferred wakeup list lockless, in
>order not to add more lock contention.
>
>If you're interested in knowing more or working on that, just let me
>know. But, of course, this has to be done on Xen upstream, not on an
>old version.
>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

--
Richard Yang
Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
From: Dario Faggioli
Date: 2016-09-14 16:18 UTC
To: Wei Yang
Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu,
    xen-devel

On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
> >
> > Do you mind sharing just a bit more, such as:
> >  - number of pcpus
> >  - number of vcpus of the various VMs
> 160 pcpus
> 16 vcpus per VM, and 8 VMs
>
So, 16x8=128, which means you're not even oversubscribing. Maybe some
of the pcpus are hyperthreads and not full cores (are they?), but
still, this is not something I'd describe as "super intensive cpu
load".

Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32
vcpus (is this the case?), you technically are oversubscribing. But
then, what does dom0 do? Is it very busy doing some stuff on its own,
or running the backends/etc for the VMs? Can you check that with top,
xentop, and the like?

If the system is not overbooked, it's a bit strange that the scheduler
is the bottleneck.

> Yes, the "Blue Screen" is what we use to mimic the behavior from the
> client.
>
> The "Blue Screen" will force the hypervisor to do load balancing, in
> my mind.
>
I've no idea what this means (but I don't know much of Windows and of
what happens when it crashes with a blue screen).

> > I'll have a quick look at how __runq_tickle() looks like in Xen
> > 4.1, but there's very little chance I'll be available to provide
> > detailed review, advice, testing, etc., on such an old codebase. :-(
> >
> > > By looking at the code, what I missed may be in __runq_tickle(),
> > > which, in case there are idle pcpus, raises SCHEDULE_SOFTIRQ on
> > > them too.
By > > > doing so, > > > those idle pcpus would steal other busy tasks and may in chance > > > get > > > d2v2? > > > > > Yes, it looks like, in Xen 4.1, this is more or less what happens. > > The > > idea is to always tickle pcpu 6, if the priority of the waking vcpu > > is > > higher than what is running there. > > > Hmm... in case there are idle pcpus and the priority of the waking > vcpu is > higher than what is running on pcpu 6, would pcpu 6 have more chance > to run it? > or other idle pcpu would steal it from pcpu6? or they have equal > chance? >
No, it works like this:

 - pcpu 6 is either idle or it is running someone already (say d3v4)
   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) pcpu 6 itself, and we're done
   + if pcpu 6 is running d3v4 and there is no other idle pcpu:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done
     * if prio(d2v2) < prio(d3v4) we just don't tickle anyone
   + if pcpu 6 is running d3v4 and there are some idle pcpus:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or more of the idle pcpus [1]
     * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more of the idle pcpus [2]

Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's why we tickle pcpu 6. However, that would mean that d3v4 would be put back in pcpu's 6 runqueue. So, if there are idle pcpus, we tickle them so that one can come and steal d3v4.

Idea behind [2] is that d3v4 should continue to run on pcpu 6, while d2v2 will be put in pcpu's 6 runqueue. However, if there are idle pcpus, we tickle them so that one can come and steal d2v2.

What really happens is not always what is expected, because it's possible that, even if prio(d2v2)>prio(d3v4), an idler is quicker in waking up and stealing d2v2 from pcpu's 6 runqueue where it was stashed while pcpu 6 itself reschedules and enacts the preemption...
But that's unlikely and, all in all, not catastrophic (although, of course, worse in terms of overhead, locality, etc)

> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so > > that: > > - if the waking vcpu preempted what was running on pcpu 6, an > > idler > > can pick-up ("steal") such preempted vcpu and run it; > > - if the waking vcpu ended up in the runqueue, an idler can steal > > it > > > Hmm... I don't get the difference between these two cases. > > Looks both are an idler steal the vcpu. >
I hope it's more clear now. :-)

> > Yes. But both in 4.1 and in current staging, we only raise > > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the > > waking > > vcpu is higher and it should preempt the currently running vcpu. > > > Oh, you are right. I didn't catch this either. > > This means in case > a) the priority is lower than current one > b) no idler in system > > The vcpu will sit quietly in the runq and waiting for the schedule > next time. >
Well, yes. It will stay in pcpu's 6 runqueue until either:
 - what's running on pcpu 6 blocks, or finishes its credits, etc., and pcpu 6 reschedules and picks it up
 - some other pcpu becomes idle and, when poking around the various pcpus for stealing work, picks it up

Of course, both depend on what position in pcpu's 6 runqueue the vcpu is in when the specific event happens.

This is a limit of the Credit1 scheduler, especially of the version that you find in 4.1. Newer code is a bit better wrt this (although not dramatically... that has not been possible! :-/).

> > However, that is the case only if d2v2 had higher priority than > > what is > > running on pcpu 6. If it does not, you: > > - _don't_ tickle pcpu 6 > > - tickle one or more idlers, if any. > > > > So, again, it's not that you always wake and run it on pcpu 6 > > > I see what you mean :-) > > So I am curious about why we add this in pcpu's 6 queue, when we know > the > priority is low in __runq_tickle() and won't tickle that.
And we are > sure > someone in the future will grab pcpu's 6 schedule lock and take it. >
Yes, but what should we do? We've woken it up and its v->processor points to pcpu 6, so the lock that vcpu_schedule_lock() takes is pcpu's 6 runqueue lock.

To queue it on some other pcpu we'd need to basically migrate it (i.e., at least changing the value of v->processor, but also other things, like checking hard affinity, etc). And we also would need to take the runqueue lock of another pcpu's runqueue. All this in the wakeup path, which will become slow and complex. And we'd need all the work stealing logic (in load_balance(), etc) to still exist, to cover the case of work needing to be stolen when wakeups are not involved (i.e., it's not that we make other parts of the scheduler's code less complex).

So, basically, of course there probably were alternatives, but the basic model, pretty much everywhere, is "queue where it was and tickle others to come pick it up if suitable", and that's what's done here too (and consistency is a very good thing :-D).

> Or in this case -- when there are idlers and vcpu's priority is > higher, > instead of let pcpu 6 to wake this, but let an idler to take it? Ask > someone > who is free to help a busy one. >
And in fact, in more recent versions of Xen, I made the code do something very close to what you're suggesting. Here's the comment that you can find where all this logic is implemented (it's way more complicated than this, and than the code in 4.1, because it takes soft-affinity into account).

    /*
     * If the pcpu is idle, or there are no idlers and the new
     * vcpu is a higher priority than the old vcpu, run it here.
     *
     * If there are idle cpus, first try to find one suitable to run
     * new, so we can avoid preempting cur.  If we cannot find a
     * suitable idler on which to run new, run it here, but try to
     * find a suitable idler on which to run cur instead.
     */

> > It will be taken, for executing vcpu_wake().
From your own > > description > > of the mechanism above: > >
> > "Our idea may look like this:
> > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
> > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
> > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
> >
> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
> > 2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for csched_schedule()
> >
> Oh, you are right. > > We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, > while > looks the wake_vcpu_lock is grabbed by others. >
I understand that, and I can't say that this is not going to improve things for you. All I'm saying is that you're potentially reducing the overhead of one "operation", but at the same time adding some overhead in other "operations".

Depending on a multiplicity of aspects, the net effect could be positive or negative.

You asked for general advice on the solution, and help with the code. I can't help much debugging hangs in Xen 4.1. My opinion on the solution is that, architecturally, it is not something that I'd call an absolute improvement.

Let me put it this way: if you'd send a patch series implementing what we're talking about for upstream Xen, I would not Ack it without solid data, coming from benchmarks run on different platforms, with different workloads and under different load conditions, that clearly shows performance is improved.

Translated to your case, which is, AFAICT, that you need something on top of 4.1. If you can hack this up quickly and try it, then good. If this is taking a lot of time, I'm not sure I'd invest such time on this (e.g., I'd rather try to narrow down even more the cause of the issue you're seeing and, after that, look at other solutions). But of course I'm not you and this is only my opinion, and you absolutely have no reasons to follow my advice.
:-) :-)

> I am thinking our change may benefit like this. > >
> The situation is:
> * vcpu's priority is higher
> * pcpu 9 is idle
> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ
>
> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6 > nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue > lock. >
Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab their own locks for running csched_schedule(), after having reacted to the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.

BTW, you say that pcpu 9 is idle, but then that "it can't do his job" because pcpu 4 holds the lock of pcpu 6... I don't think I understand what you mean with this.

> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 > and pcpu 9 > could do their job immediately. Ah I know, very corner, not sure > about the > effect. >
Again, I don't get what "do their job" and "immediately" means.

In both cases the sequence of events is as follows:
 1. queue the task in pcpu's 6 queue (either the runq, in original code, or the wakeup list, with your modifications)
 2. tickle pcpu 6
 3. pcpu 6 reacts to being tickled
 4. pcpu 6 starts running the new vcpu

In the original code, 1 and 4 are serialized by the same lock; with your modifications, half of 1 can happen using a different lock. But 1 and 4 are almost never overlapped anyway, so I don't think you're gaining much parallelism.

What (maybe) you're achieving is that 1 interferes a little bit less with other things that want to happen on pcpu 6 (e.g., because of other vcpus wrt the one waking up), which may be a good thing, if there are a lot of wakeups. But then, again, if there are a lot of wakeups, you probably will see contention on the wake_vcpu_lock!

> BTW, I suddenly found softirq is implemented as an IPI. So I am > thinking > whether it is safe to raise softirq to itself?
>
Softirqs are implemented by means of IPIs, but only when they're raised for other processors. Self-raising a softirq does not involve IPIs, just setting a bit in the pending mask (see raise_softirq() in xen/common/softirq.c).

> Since it looks we already hold > the runqueue lock. Or because we have turned the irq off in > vcpu_wake()? >
We hold the runqueue lock, so what? Should we schedule that immediately? That would probably be possible, but it's again an architectural decision, and a matter of consistency. In Xen, we schedule when we find the SCHEDULE_SOFTIRQ to be pending.

Anyhow, I would not expect wonders by embedding the re-scheduling code directly here either.

Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-14 16:18 ` Dario Faggioli @ 2016-09-16 2:49 ` Wei Yang 2016-09-16 16:07 ` Dario Faggioli 0 siblings, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-16 2:49 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang, mengxu, Wei Yang

On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote: >On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote: >> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >> > >> > Do you mind sharing just a bit more, such as: >> > - number of pcpus >> > - number of vcpus of the various VMs >> 160 pcpus >> 16 vcpus in VM and 8 VMs >> >So, 16x8=128, which means you're not even oversubscribing. Maybe some >of the pcpus are hyperthreads and not full cores (are they?), but >still, this is not something I'd describe "super intensive cpu load". > >Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32 >vcpus (is this the case?), you technically are oversubscribing. But >then, what does dom0 do? Is it very busy doing some stuff on its own, >or running the backends/etc for the VMs? Can you check that with top, >xentop, and alike? > >If the system is not overbooked, it's a bit strange that the scheduler >is the bottleneck.

I looked at the original data again. I don't see detailed data to describe the dom0 configuration.

The exact user model is not accessible from our client. We guess their model looks like this.

    +--------+     +-----------+       +-------------+
    | Timer  |---->|Coordinator|---+-->| Worker      |
    +--------+     +-----------+   |   +-------------+
                                   |
                                   |   +-------------+
                                   +-->| Worker      |
                                   |   +-------------+
                                   |
                                   |   +-------------+
                                   +-->| Worker      |
                                       +-------------+

One Coordinator would drive several workers based on a high resolution timer. Periodically, workers would be woken up by the coordinator. So at one moment, many workers would be woken up and would trigger the vcpu_wake() in xen.
Not sure this would be a possible reason for the burst vcpu_wake()?

> >> Yes, the "Blue Screen" is what we want to mimic the behavior from >> client. >> >> The "Blue Screen" will force the hypervisor to do load balance in my >> mind. >> >I've no idea what this means (but I don't know much of Windows and of >what happens when it crashes with a blue screen). > >> > I'll have a quick look at how __runq_tickle() looks like in Xen >> > 4.1, >> > but there's very few chances I'll be available to provide detailed >> > review, advice, testing, etc., on such an old codebase. :-( >> > >> > > >> > > By looking at the code, what I missed may be in __runq_tickle(), >> > > which in case >> > > there are idle pcpu, schedule_softirq is also raised on them. By >> > > doing so, >> > > those idle pcpus would steal other busy tasks and may in chance >> > > get >> > > d2v2? >> > > >> > Yes, it looks like, in Xen 4.1, this is more or less what happens. >> > The >> > idea is to always tickle pcpu 6, if the priority of the waking vcpu >> > is >> > higher than what is running there. >> > >> Hmm... in case there are idle pcpus and the priority of the waking >> vcpu is >> higher than what is running on pcpu 6, would pcpu 6 have more chance >> to run it? >> or other idle pcpu would steal it from pcpu6? or they have equal >> chance? >> >No, it works like this: > > - pcpu 6 is either idle or it is running someone already (say d3v4) > + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) > pcpu 6 itself, and we're done

Ok, it looks like the behavior differs between 4.1 and current upstream. Upstream just raises the softirq to pcpu6 when it is idle, while 4.1 would raise the softirq to both pcpu6 and other idlers even when pcpu6 is idle.

I think current upstream is more clever.
> + if pcpu 6 is running d3v4 and there is no other idle pcpu: > * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done > * if prio(d2v2) < prio(d3v4) we just don't tickle anyone > + if pcpu 6 is running d3v4 and there are some idle pcpus: > * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or [1] > more of the idle pcpus > * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more [2] > of the idle pcpus > >Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's >why we tickle pcpu 6. However, that would mean that d3v4 would be put >back in pcpu's 6 runqueue. So, if there are idle pcpus, we tickle them >so that one can come and steam d3v4. > >Idea behind [2] is that d3v4 should continue to run on pcpu 6, while >d2v2 will be put in pcpu's 6 runqueue. However, if there are idle >pcpus, we tickle them so that one can come and steal d2v2. > >What really happens is not always what is expected, because it's >possible that, even if prio(d2v2)>prio(d3v4), an idler is quicker in >waking up and stealing d2v2 from pcpu's 6 runqueue where it was stashed >while pcpu 6 itself reschedules and enacts the preemption... But that's >unlikely and, all in all, not catastrophic (although, of course, worse >in terms of overhead, locality, etc) > >> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so >> > that: >> > - if the waking vcpu preempted what was running on pcpu 6, an >> > idler >> > can pick-up ("steal") such preempted vcpu and run it; >> > - if the waking vcpu ended up in the runqueue, an idler can steal >> > it >> > >> Hmm... I don't get the difference between these two cases. >> >> Looks both are an idler steal the vcpu. >> >I hope it's more clear now. :-) > Yep, thanks. :-) >> > Yes. But both in 4.1 and in current staging, we only raise >> > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the >> > waking >> > vcpu is higher and it should preempt the currently running vcpu. >> > >> Oh, you are right. 
I didn't catch this either. >> >> This means in case >> a) the priority is lower than current one >> b) no idler in system >> >> The vcpu will sit quietly in the runq and waiting for the schedule >> next time. >> >Well, yes. It will stay in pcpu's 6 runqueue until either: > - what's running on pcpu 6 blocks, or finishes its credits, etc., > and pcpu 6 reschedules and picks it up > - some other pcpu becomes idle and, when poking around the various > pcpus for stealing work, picks it up > >Of course, both depends on in what position in pcpu's 6 runqueue the >vcpu is when the specific event happens. > >This is a limit of the Credit1 scheduler, especially of the version >that you find in 4.1. Newer code is a bit better wrt this (although not >dramatically... that has not been possible! :-/). > >> > However, that is the case only if d2v2 had higher priority than >> > what is >> > running on pcpu 6. If it does not, you: >> > - _don't_ tickle pcpu 6 >> > - tickle one or more idlers, if any. >> > >> > So, again, it's not that you always wake and run it on pcpu 6 >> > >> I see what you mean :-) >> >> So I am curious about why we add this in pcpu's 6 queue, when we know >> the >> priority is low in __runq_tickle() and won't tickle that. And we are >> sure >> someone in the future will grab pcpu's 6 schedule lock and take it. >> >Yes, but what we should do? We've woken it up and it's v->processor >points to pcpu 6, so the lock that vcpu_schedule_lock() takes is pcpu's >6 runqueue lock. > >To queue it on some other pcpu we'd need to basically migrate it (i.e., >at least changing the value of v->processor, but also other things, >like checking hard affinity, etc). And we also would need to take the >runqueue lock of another pcpu's runqueue. All this in the wakeup path, >which will become slow and complex. 
And we'd need all the work stealing >logic (in load_balance(), etc) to still exist, to cover the case of >work needing to be stolen when wakeups are not involved (i.e., it's not >that we make other part of the scheduler's code less complex). > >So, basically, of course there probably were alternatives, but the >basic model, pretty much everywhere, is "queue where it was and tickle >others to come pick it up if suitable", and that's what's done here too >(and consistency is a very good thing :-D). > >> Or in this case -- when there are idlers and vcpu's priority is >> higher, >> instead of let pcpu 6 to wake this, but let an idler to take it? Ask >> someone >> who is free to help a busy one. >> >And in fact, in more recent version of Xen, I made the code do >something very close to what you're suggesting. Here's the comment that >you can find where all this logic is implemented (it's way more >complicated than this, and than the code in 4.1, because it takes soft- >affinity into account). > > /* > * If the pcpu is idle, or there are no idlers and the new > * vcpu is a higher priority than the old vcpu, run it here. > * > * If there are idle cpus, first try to find one suitable to run > * new, so we can avoid preempting cur. If we cannot find a > * suitable idler on which to run new, run it here, but try to > * find a suitable idler on which to run cur instead. > */ > I like this idea. >> > It will be taken, for executing vcpu_wake(). From your own >> > description >> > of the mechanism above: >> > >> > "Our idea may looks like this: >> > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the >> > vcpu. >> > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6." >> > >> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >> > 2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake() >> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for >> > csched_schedule() >> > >> Oh, you are right. 
>> We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, >> while >> looks the wake_vcpu_lock is grabbed by others. >> >I understand that, and I can't say that this is not going to improve >things for you. All I'm saying is that you're potentially reducing the >overhead of one "operation", but at the same time adding some overhead >in other "operations". > >Depending on a multiplicity of aspects, the net effect could be >positive or negative. > >You asked for general advice on the solution, and help with the >code. I can't help much debugging hangs in Xen 4.1. My opinion on the >solution is that, architecturally, it is not something that I'd call an >absolute improvement. > >Let me put it this way: if you'd send a patch series implementing what >we're talking about for upstream Xen, I would not Ack it without solid >data, coming from benchmarks run on different platforms, with different >workloads and under different load conditions, that clearly shows >performance is improved. > >Translated to your case, which is, AFAICT, that you need something on >top of 4.1. If you can hack this up quickly and try it, then good. If >this is taking a lot of time, I'm not sure I'd invest such time on this >(e.g., I'd rather try to narrow down even more the cause of the issue >you're seeing and, after that, look at other solutions). But of course >I'm not you and this is only my opinion, and you absolutely have no reasons to >follow my advice. :-) :-) >

Actually, I appreciate your advice :-)

We found the schedule lock to be a consuming point in our scenario, so the direct thought was to try to avoid grabbing it. Hmm... while our idea is not that sparkling.

>> I am thinking our change may benefit like this. >> >> The situation is: >> * vcpu's priority is higher >> * pcpu 9 is idle >> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ >> >> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6 >> nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue >> lock.
>> >Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the >softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab >their own locks for running csched_schedule(), after having reacted to >the IPI, etc., pcpu 4 will have released pcpu's 6 lock already. > >BTW, you say that pcpu 9 is idle, but then that "it can't do his job" >because pcpu 4 holds the lock of pcpu 6... I don't think I understand >what you mean with this.

After pcpu9 gets the softirq from pcpu4, it tries to schedule and do load balancing, which would take the pcpu6 schedule lock.

As I thought previously, the pcpu6 schedule lock is held by pcpu4, so pcpu9 should wait until pcpu4 releases it. This is what I mean by "it can't do his job immediately".

While as you mentioned, compared with pcpu9, pcpu4 would release the pcpu6 schedule lock earlier. :-)

A little complicated; hope I described it clearly. :-)

> >> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 >> and pcpu 9 >> could do their job immediately. Ah I know, very corner, not sure >> about the >> effect. >> >Again, I don't get what "do their job" and "immediately" means. > >In both cases the sequence of events is as follows: > 1. queue the task in pcpu's 6 queue (either the runq, in original > code, or the wakeup list, with your modifications) > 2. tickle pcpu 6 > 3. pcpu 6 reacts to being tickled > 4. pcpu 6 starts running the new vcpu > >In the original code, 1 and 4 are serialized by the same lock; with your >modifications, half of 1 can happen using a different lock. But 1 and 4 >are almost never overlapped anyway, so I don't think you're gaining >much parallelism. > >What (maybe) you're achieving is that 1 interferes a little bit less >with other things that want to happen on pcpu 6 (e.g., because of >other vcpus wrt the one waking up), which may be a good thing, if there >are a lot of wakeups. But then, again, if there are a lot of wakeups, >you probably will see contention on the wake_vcpu_lock!
> >> BTW, I suddenly found softirq is implemented as an IPI. So I am >> thinking >> whether it is safe to raise softirq to itself? >> >Softirqs are implemented by means of IPIs, but only when they're raised for >other processors. Self-raising a softirq does not involve IPIs, just >setting a bit in the pending mask (see raise_softirq() in >xen/common/softirq.c). > >> Since it looks we already hold >> the runqueue lock. Or because we have turned the irq off in >> vcpu_wake()? >> >We hold the runqueue lock, so what? Should we schedule that >immediately? That would probably be possible, but it's again an >architectural decision, and a matter of consistency. In Xen, we >schedule when we find the SCHEDULE_SOFTIRQ to be pending. > >Anyhow, I would not expect wonders by embedding the re-scheduling code >directly here either. > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >

-- Richard Yang
Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-16 2:49 ` Wei Yang @ 2016-09-16 16:07 ` Dario Faggioli 2016-09-17 0:31 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-16 16:07 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 8605 bytes --] On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote: > On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote: > > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote: > > If the system is not overbooked, it's a bit strange that the > > scheduler > > is the bottleneck. > I looked at the original data again. I don't see detailed data to > describe the > dom0 configuration. > I see. No collection of output of top and xentop in dom0 either then, I'm guessing? :-/ > The exact user model is not accessible from our client. We guess > their model > looks like this. > > > +--------+ +-----------+ +-------------+ > |Timer | --->|Coordinator| ---+--->|Worker | > +--------+ +-----------+ | +-------------+ > | > | +-------------+ > +--->|Worker | > | +-------------+ > | > | +-------------+ > +--->|Worker | > +-------------+ > > One Coordinator would driver several workers based on a high > resolution timer. > Periodically, workers would be waked up by the coordinator. So at one > moment, > many workers would be waked up and would triggers the vcpu_wake() in > xen. > It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if the graph describes the workload run inside the (and if yes, which ones) VMs... but that is not terribly important, after all. > Not sure this would be a possible reason for the burst vcpu_wake()? > Well, there would be, at least potentially, a sensible number of vcpus waking up, which indeed can make the runqueue locks of the various pcpus contended. 
But then again, if the system is not oversubscribed, I'd tend to think it to be tolerable, and I'd expect the biggest problem to be the work-stealing logic (considering the high number of pcpus), rather than the duration of the critical sections within vcpu_wake().

> > - pcpu 6 is either idle or it is running someone already (say d3v4) > > + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ) > > pcpu 6 itself, and we're done > Ok, it looks like the behavior differs between 4.1 and current upstream. > Upstream > just raises the softirq to pcpu6 when it is idle, while 4.1 would > raise the softirq > to both pcpu6 and other idlers even when pcpu6 is idle. > > I think current upstream is more clever. >
I also think current upstream is a bit better, especially because it's mostly me that made it look the way it does. :-D But I was actually describing how 4.1 works.

In fact, in 4.1, if pcpu 6 is idle (see the '//xxx xxx xxx' comments I'm adding to the code excerpts):

    if ( new->pri > cur->pri )          //is true, so we put pcpu 6 in mask
    {
        cpu_set(cpu, mask);
    }
    if ( cur->pri > CSCHED_PRI_IDLE )   //is false!!
    {
        ....
    }
    if ( !cpus_empty(mask) )            //the mask contains pcpu 6 only
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

On the other hand, if pcpu 6 is not idle (and, sticking to the example started in the last email, is running d3v4):

    if ( new->pri > cur->pri )          //depends on d2v2's and d3v4's prios
    {
        cpu_set(cpu, mask);
    }
    if ( cur->pri > CSCHED_PRI_IDLE )   //is true, so let's see...
    {
        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle pcpus.
                                        //Let's assume there are (i.e., this is false)
        {
            ....
        }
        else
        {
            cpumask_t idle_mask;

            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpus_empty(idle_mask) )   //is true if there are idlers suitable
                                            //for new (let's assume there are)
            {
                if ( opt_tickle_one_idle )  //chosen on boot, default is true
                {
                    this_cpu(last_tickle_cpu) =
                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
                    cpu_set(this_cpu(last_tickle_cpu), mask);
                }
                else
                    ....
            }
            cpus_and(mask, mask, new->vcpu->cpu_affinity);
        }
    }
    if ( !cpus_empty(mask) )            //mask contains pcpu 6 if d2v2's prio was
                                        //higher, and also contains one idle pcpu
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's not, at least one idler (if it exists) is tickled, and pcpu 6 is tickled or not, depending on priorities.

> > And in fact, in more recent versions of Xen, I made the code do > > something very close to what you're suggesting. Here's the comment > > that > > you can find where all this logic is implemented (it's way more > > complicated than this, and than the code in 4.1, because it takes > > soft-affinity into account). > >
> > /*
> >  * If the pcpu is idle, or there are no idlers and the new
> >  * vcpu is a higher priority than the old vcpu, run it here.
> >  *
> >  * If there are idle cpus, first try to find one suitable to run
> >  * new, so we can avoid preempting cur.  If we cannot find a
> >  * suitable idler on which to run new, run it here, but try to
> >  * find a suitable idler on which to run cur instead.
> >  */
> >
> I like this idea. >
Then update... Xen 4.2 or 4.3 should already contain it. :-P

> We found the schedule lock be a consuming point in our scenario, so > the direct > thought is try to avoid to grab it. Hmm... while our idea is not that > sparkling. >
Again, I can't say how sparkling it will turn out once implemented.
Architecturally, it doesn't sound much different from the status quo to me, and so I'd say it's unlikely that it will solve your problem, but this is something always very hard to tell. And again, I'd personally spend some more time --even by temporarily and hackishly instrumenting the code-- to understand better where the bottleneck is.

> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised > > the > > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to > > grab > > their own locks for running csched_schedule(), after having reacted > > to > > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already. > > > > BTW, you say that pcpu 9 is idle, but then that "it can't do his > > job" > > because pcpu 4 holds the lock of pcpu 6... I don't think I > > understand > > what you mean with this. > After pcpu9 gets the softirq from pcpu4, it tries to schedule and do > load > balancing, which would take the pcpu6 schedule lock. > > As I thought previously, the pcpu6 schedule lock is held by pcpu4, so > pcpu9 should > wait until pcpu4 releases it. This is what I mean by "it can't do his job > immediately". >
Yeah, but...

> While as you mentioned, compared with pcpu9, pcpu4 would release > the pcpu6 > schedule lock earlier. :-) >
... indeed that's what I think happens 99% of the time. And this is also why I'd be tempted to think that the issue may be lying somewhere else.

E.g., something that has proven effective for others (and for which I've got an unfinished and never submitted patch to upstream), is keeping track of which pcpus actually have at least 1 vcpu in their runqueues (i.e., at least 1 vcpu that is runnable and not running) and, inside balance_load(), only considering those when looking for work to steal.
Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
  2016-09-16 16:07         ` Dario Faggioli
@ 2016-09-17  0:31           ` Wei Yang
  2016-09-19 23:24             ` Dario Faggioli
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Yang @ 2016-09-17  0:31 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang,
	mengxu, Wei Yang

On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
>On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote:
>> On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
>> > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
>> > If the system is not overbooked, it's a bit strange that the
>> > scheduler is the bottleneck.
>> I looked at the original data again. I don't see detailed data
>> describing the dom0 configuration.
>>
>I see. No collection of the output of top and xentop in dom0 either
>then, I'm guessing? :-/
>
Probably; let me check with someone to see whether we are in luck.

>> The exact user model is not accessible from our client. We guess
>> their model looks like this.
>>
>>
>>  +--------+     +-----------+        +-------------+
>>  |Timer   | --->|Coordinator| ---+-->|Worker       |
>>  +--------+     +-----------+    |   +-------------+
>>                                  |
>>                                  |   +-------------+
>>                                  +-->|Worker       |
>>                                  |   +-------------+
>>                                  |
>>                                  |   +-------------+
>>                                  +-->|Worker       |
>>                                      +-------------+
>>
>> One Coordinator would drive several workers based on a high
>> resolution timer. Periodically, the workers would be woken up by the
>> coordinator. So at one moment, many workers would be woken up, and
>> that would trigger vcpu_wake() in Xen.
>>
>It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if
>the graph describes the workload run inside the (and if yes, which
>ones) VMs... but that is not terribly important, after all.
>
Oh, yes, these are threads in a VM. Each VM may contain several groups
(instances) of these threads.

>> Not sure this would be a possible reason for the burst of vcpu_wake()
>> calls?
>>
>Well, there would be, at least potentially, a sensible number of vcpus
>waking up, which indeed can make the runqueue locks of the various
>pcpus contended.
>
>But then again, if the system is not oversubscribed, I'd tend to think
>it to be tolerable, and I'd expect the biggest problem to be the work-
>stealing logic (considering the high number of pcpus), rather than the
>duration of the critical sections within vcpu_wake().
>

Yes, we are trying to improve the stealing part too.

Sounds reasonable: vcpu_wake() is O(1), while "stealing" is O(N) in
terms of #PCPUs.

>> > - pcpu 6 is either idle or it is running someone already (say d3v4)
>> >   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
>> >     pcpu 6 itself, and we're done
>> Ok, it looks like the behavior differs between 4.1 and current
>> upstream. Upstream just raises the softirq on pcpu6 when it is idle,
>> while 4.1 would raise the softirq on both pcpu6 and other idlers even
>> when pcpu6 is idle.
>>
>> I think current upstream is more clever.
>>
>I also think current upstream is a bit better, especially because it's
>mostly me that made it look the way it does. :-D

Ah, it was not intended to flatter you.

>
>But I was actually describing how 4.1 works. In fact, in 4.1, if pcpu 6
>is idle (see the '//xxx xxx xxx' comments I'm adding to the code
>excerpts):
>
>    if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
>    {
>        ....
>    }
>    if ( !cpus_empty(mask) )  //the mask contains pcpu 6 only
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>

Hmm... I don't have the code at hand, but it looks like you are right.
I misunderstood the code.

>On the other hand, if pcpu 6 is not idle (and, sticking to the example
>started in the last email, is running d3v4):
>
>    if ( new->pri > cur->pri )  //depends on d2v2's prio and d3v4's prio
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is true, so let's see...
>    {
>        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle pcpus. Let's assume there are (i.e., let's assume this is false)
>        {
>            ....
>        }
>        else
>        {
>            cpumask_t idle_mask;
>
>            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>            if ( !cpus_empty(idle_mask) )  //is true if there are idlers suitable for new (let's assume there are)
>            {
>                if ( opt_tickle_one_idle )  //chosen on boot, default is true
>                {
>                    this_cpu(last_tickle_cpu) =
>                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
>                    cpu_set(this_cpu(last_tickle_cpu), mask);

I may have misunderstood the code previously, and I like this part too.
So only one idler would be tickled even when there are several idlers
in the system. I thought we would tickle several idlers, which may
introduce some contention between them.

BTW, any idea behind cycle_cpu()? If this is about locality, how about
cycling within the node? And cycling within the node where v->processor
is?

>                }
>                else
>                    ....
>            }
>            cpus_and(mask, mask, new->vcpu->cpu_affinity);
>        }
>    }
>    if ( !cpus_empty(mask) )  //mask contains pcpu 6 if d2v2's prio was higher, and also contains one idle pcpu
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>
>So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's
>not, at least one idler (if it exists) is tickled, and pcpu 6 is
>tickled or not, depending on priorities.
>

That's clear to me now :-)

>> > And in fact, in more recent versions of Xen, I made the code do
>> > something very close to what you're suggesting. Here's the comment
>> > you can find where all this logic is implemented (it's way more
>> > complicated than this, and than the code in 4.1, because it takes
>> > soft-affinity into account).
>> >
>> >  /*
>> >   * If the pcpu is idle, or there are no idlers and the new
>> >   * vcpu is of higher priority than the old vcpu, run it here.
>> >   *
>> >   * If there are idle cpus, first try to find one suitable to run
>> >   * new, so we can avoid preempting cur. If we cannot find a
>> >   * suitable idler on which to run new, run it here, but try to
>> >   * find a suitable idler on which to run cur instead.
>> >   */
>> >
>> I like this idea.
>>
>Then update... Xen 4.2 or 4.3 should already contain it. :-P
>
>> We found the schedule lock to be a consuming point in our scenario, so
>> the direct thought is to try to avoid grabbing it. Hmm... while our
>> idea is not that sparkling.
>>
>Again, I can't say how sparkling it will reveal itself to be once
>implemented.
>Architecturally, it doesn't sound much different from the status quo to
>me, and so I'd say it's unlikely that it will solve your problem, but
>this is something always very hard to tell.
>
>And again, I'd personally spend some more time --even by temporarily
>and hackishly instrumenting the code-- to understand better where the
>bottleneck is.
>

Hmm... usually we use xentrace and lock profiling to identify the
bottleneck. Is there any other method you would like to suggest? And
does instrumenting the code mean adding some logging to the code?

>> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised the
>> > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to grab
>> > their own locks for running csched_schedule(), after having reacted to
>> > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
>> >
>> > BTW, you say that pcpu 9 is idle, but then that "it can't do its job"
>> > because pcpu 4 holds the lock of pcpu 6... I don't think I understand
>> > what you mean with this.
>> After pcpu9 gets the softirq from pcpu4, it tries to schedule and do
>> load balancing, which would take pcpu6's schedule lock.
>>
>> As I thought previously, pcpu6's schedule lock is held by pcpu4, so
>> pcpu9 should wait until pcpu4 releases it. This is what I mean by "it
>> can't do its job immediately".
>>
>Yeah, but...
>
>> While, as you mentioned, compared with pcpu9, pcpu4 would release
>> pcpu6's schedule lock earlier. :-)
>>
>...
indeed that's what I think happens 99% of the time. And this is
>also why I'd be tempted to think that the issue may be lying somewhere
>else.
>
>E.g., something that has proven effective for others (and for which
>I've got an unfinished and never submitted patch to upstream) is
>keeping track of which pcpus actually have at least 1 vcpu in their
>runqueues (i.e., at least 1 vcpu that is runnable and not running)
>and, inside balance_load(), only considering those when looking for
>work to steal.

Looks like we keep a cpumask of those who have 1 or more queued vcpus
and, during the stealing process, just scan those instead of the online
cpumask.

>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--
Wei Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread
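The 4.1 tickling rules traced through in this exchange can be condensed into a toy decision function. This is a hypothetical simplification with made-up types and names; the real __runq_tickle() builds a cpumask under the runqueue lock and raises SCHEDULE_SOFTIRQ on every cpu in it:

```c
#include <assert.h>

/* Toy condensation of the Xen 4.1 __runq_tickle() logic discussed in
 * the thread. PRI_IDLE stands in for CSCHED_PRI_IDLE. */
#define PRI_IDLE  (-64)

struct tickle { int tickle_cpu; int tickle_idler; };

static struct tickle runq_tickle(int cpu, int new_pri, int cur_pri,
                                 int have_idlers)
{
    struct tickle t = { -1, 0 };

    /* Tickle the waking vcpu's own pcpu only if the new vcpu outranks
     * whatever runs there (which may be the idle vcpu). */
    if ( new_pri > cur_pri )
        t.tickle_cpu = cpu;

    /* If that pcpu was busy, also tickle one idler, so it can steal
     * either the preempted vcpu or the newly queued one. */
    if ( cur_pri > PRI_IDLE && have_idlers )
        t.tickle_idler = 1;

    return t;
}
```

This reproduces the three cases from the discussion: an idle pcpu 6 gets tickled alone; a busy pcpu 6 running a lower-priority vcpu gets tickled together with one idler; and if the waking vcpu loses on priority, only an idler is tickled (the vcpu waits in the runqueue to be stolen).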
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake
  2016-09-17  0:31           ` Wei Yang
@ 2016-09-19 23:24             ` Dario Faggioli
  2016-09-20  3:51               ` Wei Yang
  0 siblings, 1 reply; 17+ messages in thread
From: Dario Faggioli @ 2016-09-19 23:24 UTC (permalink / raw)
  To: Wei Yang
  Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, mengxu, Wei Yang

[-- Attachment #1.1: Type: text/plain, Size: 4451 bytes --]

On Sat, 2016-09-17 at 00:31 +0000, Wei Yang wrote:
> On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
> > But then again, if the system is not oversubscribed, I'd tend to
> > think it to be tolerable, and I'd expect the biggest problem to be
> > the work-stealing logic (considering the high number of pcpus),
> > rather than the duration of the critical sections within
> > vcpu_wake().
> >
> Yes, we are trying to improve the stealing part too.
>
Great.

> Sounds reasonable: vcpu_wake() is O(1) while "stealing" is O(N) in
> terms of #PCPUs.
>
Exactly!

> > I also think current upstream is a bit better, especially because
> > it's mostly me that made it look the way it does. :-D
>
> Ah, it was not intended to flatter you.
>
EhEh! Sure, I was joking myself. :-)

> >
> > But I was actually describing how 4.1 works. In fact, in 4.1, if
> > pcpu 6 is idle (see the '//xxx xxx xxx' comments I'm adding to the
> > code excerpts):
> >
> >     if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
> >     {
> >         cpu_set(cpu, mask);
> >     }
> >     if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
> >     {
> >         ....
> >     }
> >     if ( !cpus_empty(mask) )  //the mask contains pcpu 6 only
> >         cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
> >
> Hmm... I don't have the code at hand, but it looks like you are
> right. I misunderstood the code.

> So only one idler would be tickled even when there are several idlers
> in the system. I thought we would tickle several idlers, which may
> introduce some contention between them.
>
You can ask Xen to tickle more than one idler, by means of that
parameter. But again, tickling idlers --either one or many-- would only
happen if there's actual need for it.

> BTW, any idea behind cycle_cpu()? If this is about locality, how
> about cycling within the node? And cycling within the node where
> v->processor is?
>
Cycling is there as a form of load spreading, i.e., we don't want to
risk always tickling the same set of pcpus, or always the ones with the
lower cpu IDs.

You're right that taking locality into account could be a good thing in
big systems. No, that is not done, not in 4.1 nor in upstream, and it
may be something actually worth trying (although, again, it's probably
unrelated to your issue).

> > And again, I'd personally spend some more time --even by temporarily
> > and hackishly instrumenting the code-- to understand better where
> > the bottleneck is.
> >
> Hmm... usually we use xentrace and lock profiling to identify the
> bottleneck. Is there any other method you would like to suggest? And
> does instrumenting the code mean adding some logging to the code?
>
xentrace and lock profiling are all I use too, and there's not much
else, without touching the code. And in fact, yes, by "instrumenting
the code" I mean temporarily changing the code to display the
information you need.

In this specific case, rather than adding printks here and there
(which, e.g., is what you usually do for debugging crashes and the
like), I'd see about modifying a little bit either the xentrace or the
lock profiling code (or both!) to make them provide the info you need.

Sorry, but I don't think I can be much more specific without going
looking at the code and actually trying to do that...
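For illustration, the cycling behavior being discussed could be sketched like this. It is a hypothetical simplification of the real cpumask helper, using a plain uint64_t bitmask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplification of the cpu cycling used by the "tickle
 * one idle" path: starting just after 'prev', return the next cpu set
 * in 'mask', wrapping around. Feeding the result back in as the next
 * 'prev' (what last_tickle_cpu does) spreads tickles across the
 * idlers instead of always hitting the lowest-numbered one. */
static int cycle_cpu(int prev, uint64_t mask, int nr_cpus)
{
    for ( int i = 1; i <= nr_cpus; i++ )
    {
        int cpu = (prev + i) % nr_cpus;
        if ( mask & (1ULL << cpu) )
            return cpu;
    }
    return -1; /* no cpu set in mask */
}
```

A locality-aware variant, as suggested in the thread, would simply run this cycle over `mask & node_mask(v->processor)` first, falling back to the full idler mask when that intersection is empty.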
> > E.g., something that has proven effective for others (and for which > > I've got an unfinished and never submitted patch to upstream), is > > keeping track of what pcpus actually have at least 1 vcpu in their > > runqueues (i.e., at least 1 vcpus that is runnable and not running) > > and, inside balance_load(), only consider those when looking for > > work > > to steal. > > Looks like we keep a cpumask with those who have 1 more vcpus and > during > stealing process, just scan those instead of the online cpumask. > Yep, that sounds like a plan, and was along the lines of what I was doing. If you want to give it a try, patches (for upstream, of course :-D) are welcome. :-P Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-19 23:24 ` Dario Faggioli @ 2016-09-20 3:51 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-20 3:51 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Tue, Sep 20, 2016 at 01:24:17AM +0200, Dario Faggioli wrote: >On Sat, 2016-09-17 at 00:31 +0000, Wei Yang wrote: >> On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote: >> > But then again, if the system is not oversubscribed, I'd tend to >> > think >> > it to be tolerable, and I'd expect the biggest problem to be the >> > work- >> > stealing logic (considering the high number of pcpus), rather than >> > the >> > duration of the critical sections within vcpu_wake(). >> > >> Yes, we are trying to improve the stealing part too. >> >Great. > >> Sounds reasonable, vcpu_wake() is O(1) while "stealing" is O(N) in >> terms of >> #PCPUs. >> >Exactly! > >> > I also think current upstream is a bit better, especially because >> > it's >> > mostly me that made it look the way it does. :-D >> >> Ah, not intended to flattering you. >> >EhEh! Sure, I was joking myself. :-) > Ah, really enjoy talking with you :-) You got some special sense of humor, and I like it~ Have a good day! >> > >> > >> > But I was actually describing how 4.1 works. In fact, in 4.1, if >> > pcpu 6 >> > is idle (se the '//xxx xxx xxx' comments I'm adding to the code >> > excerpts: >> > >> > if ( new->pri > cur->pri ) //is true, so we put pcpu 6 in mask >> > { >> > cpu_set(cpu, mask); >> > } >> > if ( cur->pri > CSCHED_PRI_IDLE ) //is false!! >> > { >> > .... >> > } >> > if ( !cpus_empty(mask) ) //the mask contains pcpu 6 only >> > cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ); >> > >> >> Hmm... don't have the code at hand, while looks you are right. I >> misunderstood >> the code. > >> So only one idler would be tickled even there would be several idlers >> in the >> system. 
I thought we would tickle several idlers, which may introduce >> some >> contention between them. >> >You can ask Xen to tickle more than one idle, by means of that >parameter. But again, tickling idlers --either one or many-- would only >happen if there's actual need for it. > >> BTW, any idea behind the cycle_cpu()? If this is about the locality, >> how about >> cycle within the node? and cycle within the node where v->processor >> is? >> >Cycle is there as a form of load spreading, i.e., we don't want to risk >tickling always the same set of pcpus, or always the ones with the >lower cpu IDs. > >You're right that taking locality into account could be a good thing in >big systems. No, that is not done, not in 4.1 nor in upstream, and it >may be something actually worth trying (although, again, it's probably >unrelated to your issue). > >> > And again, I'd personally spend some more time --even by >> > temporarily >> > and hackishly instrumenting the code-- to understand better where >> > the >> > bottleneck is. >> > >> >> Hmm... usually we use xentrace and lock profile to identify the >> bottleneck, >> other method you would like to suggest? and instrumenting the code >> means to >> add some log in code? >> >xentrace and lock profile is all I use too, and there's not much else, >without touching the code. And in fact, yes, with "instrumenting the >code" I means temporary changing the code to display the information >you need. > >In this specific case, rather than adding printks here and there >(which, e.g., is what you usually do for debugging crashes or alike), >I'd see about modifying a little bit either xentrace or lock profiling >code (or both!) to make them provide the info you need. > >Sorry, but I don't think I can be much more specific without going >looking at the code and actually trying to do that... 
>
>> E.g., something that has proven effective for others (and for which
>> I've got an unfinished and never submitted patch to upstream), is
>> keeping track of which pcpus actually have at least 1 vcpu in their
>> runqueues (i.e., at least 1 vcpu that is runnable and not running)
>> and, inside balance_load(), only considering those when looking for
>> work to steal.
>
>> Looks like we keep a cpumask of those who have 1 or more queued
>> vcpus and, during the stealing process, just scan those instead of
>> the online cpumask.
>>
>Yep, that sounds like a plan, and was along the lines of what I was
>doing. If you want to give it a try, patches (for upstream, of course
>:-D) are welcome. :-P
>
>Regards,
>Dario
>--
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--
Richard Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-14 10:44 ` Wei Yang 2016-09-14 16:18 ` Dario Faggioli @ 2016-09-17 3:30 ` Wei Yang 2016-09-22 9:12 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-17 3:30 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, Dario Faggioli, xen-devel, mengxu, richard.weiyang On Wed, Sep 14, 2016 at 06:44:17PM +0800, Wei Yang wrote: >On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >>[using xendevel correct address] >> >>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote: >>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote: >>> > >>> > I'm not surprised by that. Yet, I'd be interested in hearing more >>> > about this profiling you have done (things like, how you captured >>> > the data, what workloads you are exactly considering, how you >>> > determined what is the bottleneck, etc). >>> Let me try to explain this. >>> >>> Workload: a. Blue Screen in Windows Guests >>> b. some client using Windows to do some video >>> processing >>> which need precise timestamp (we are not sure the >>> core >>> reason but we see the system is very slow) >>> >>Do you mind sharing just a bit more, such as: >> - number of pcpus >> - number of vcpus of the various VMs > >160 pcpus >16 vcpus in VM and 8 VMs > >> >>I also am not sure what "a. Blue screen in Windows guests" above >>means... is there a workload called "Blue Screen"? Or is it that you >>are hitting a BSOD in some of your guests (which ones, what were they >>doing)? Or is it that you _want_ to provoke a BSOD on some of your >>guests? Or is that something else? :-P >> > >Yes, the "Blue Screen" is what we want to mimic the behavior from client. > >The "Blue Screen" will force the hypervisor to do load balance in my mind. > >>> Capture the data: lock profile >>> Bottleneck Check: The schedule lock wait time is really high >>> >>Ok, cool. Interesting that lock profiling works on 4.1! 
:-O >> >>> > The scheduler tries to see whether the v->processor of the waking >>> > vcpu can be re-used, but that's not at all guaranteed, and again, >>> > on a very loaded system, not even that likely! >>> > >>> Hmm... I may missed something. >>> >>> Took your assumption below for example. >>> In my mind, the process looks like this: >>> >>> csched_vcpu_wake() >>> __runq_insert(), insert the vcpu in pcpu's 6 runq >>> __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other >>> (1) >>> >>Well, yes. More precisely, at least in current staging, >>SCHEDULE_SOFTIRQ is raised for pcpu 6: >> - if pcpu 6 is idle, or >> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and >> the waking vcpu is higher in priority than what is running on pcpu 6 >> >>> __do_softirq() >>> schedule() >>> csched_schedule(), for pcpu 6, it may wake up d2v2 based >>> on it's >>> priority >>> >>Yes, but it is pcpu 6 that will run csched_schedule() only if the >>conditions mentioned above are met. If not, it will be some other pcpu >>that will do so (or none!). >> >>But I just realized that the fact that you are actually working on 4.1 >>is going to be an issue. In fact, the code that it is now in Xen has >>changed **a lot** since 4.1. In fact, you're missing all the soft- >>affinity work (which you may or may not find useful) but also >>improvements and bugfixing. >> > >That's true... > >>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1, >>but there's very few chances I'll be available to provide detailed >>review, advice, testing, etc., on such an old codebase. :-( >> >>> By looking at the code, what I missed may be in __runq_tickle(), >>> which in case >>> there are idle pcpu, schedule_softirq is also raised on them. By >>> doing so, >>> those idle pcpus would steal other busy tasks and may in chance get >>> d2v2? >>> >>Yes, it looks like, in Xen 4.1, this is more or less what happens. 
The >>idea is to always tickle pcpu 6, if the priority of the waking vcpu is >>higher than what is running there. >> > >Hmm... in case there are idle pcpus and the priority of the waking vcpu is >higher than what is running on pcpu 6, would pcpu 6 have more chance to run it? >or other idle pcup would steal it from pcpu6? or they have equal chance? > >>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that: >> - if the waking vcpu preempted what was running on pcpu 6, an idler >> can pick-up ("steal") such preempted vcpu and run it; >> - if the waking vcpu ended up in the runqueue, an idler can steal it >> > >Hmm... I don't get the difference between these two cases. > >Looks both are an idler steal the vcpu. > >>> BTW, if the system is in heavy load, how much chance would we have >>> idle pcpu? >>> >>It's not very likely that there will be idle pcpus in a very busy >>system, I agree. >> >>> > > We can divide this in three steps: >>> > > a) Send IPI to the target CPU and tell it which vcpu we want it >>> > > to >>> > > wake up. >>> > > b) The interrupt handler on target cpu insert vcpu to a percpu >>> > > queue >>> > > and raise >>> > > softirq. >>> > > c) softirq will dequeue the vcpu and wake it up. >>> > > >>> > I'm not sure I see how you think this would improve the situation. >>> > >>> > So, quickly, right now: >>> > >>> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up >>> > - let's assume d2v2->processor = 6 >>> > - let's assume the wakeup is happening on pcpu 4 >>> > >>> > Right now: >>> > >>> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2, >>> > which is the runqueue lock of pcpu 6 (in Credit1, there is 1 >>> > runqueue per pcpu, and locks are per-cpu already) >>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >>> > - still executing on pcpu 4, __runq_tickle() is called, and it >>> > determines on which pcpu d2v2 should run >>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. 
Let's look at the >>> > following two cases: >>> > a) if it was determined that d2v2 should run on pcpu 6: >>> > - in a (little, hopefully) while, pcpu 6 will schedule >>> > - it takes its own runqueue lock >>> > - it finds d2v2 in there, and runs it >>> Even in this case, it depends on the priority of d2v2 whether it >>> would be run >>> now. Right? >>> >>Yes. But both in 4.1 and in current staging, we only raise >>SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking >>vcpu is higher and it should preempt the currently running vcpu. >> > >Oh, you are right. I didn't catch this either. > >This means in case >a) the priority is lower than current one >b) no idler in system > >The vcpu will sit quietly in the runq and waiting for the schedule next time. > >>> > b) if it was determined that d2v2 should run on, say, pcpu 9: >>> > - in a (little, hopefully) while, pcpu 9 will schedule >>> > - it takes its own runqueue lock >>> > - if it has nothing in its runqueue (likely, if >>> > __runq_tickle() chose it, but not necessarily and always >>> > true, especially under high load), it looks around to >>> > see if other pcpus have any vcpu that it can execute >>> > - it will try to take the locks of the various pcpu's >>> > runqueues it comes across >>> > - if it manages to take the lock of pcpu's 6 runqueue, it >>> > sees d2v2 in it, steal it and run it. >>> Oh, my understanding matches yours :-) >>> >>> BYW, we also found in csched_load_balance() will hold the schedule >>> lock, but >>> not found a proper vcpu to steal. Maybe this would be a potential >>> optimization >>> point. >>> >>Mmm... I actually don't understand what you mean here... what is the >>possible optimization? 
>> >>> > With your modifications, AFAICT: >>> > >>> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock >>> > of pcpu 6 and queue d2v2 in pcpu's 6 deferred wakeup list >>> > - poke pcpu 6 with an IPI >>> > - in a (little, hopefully) while, pcpu 6 react to the IPI and, >>> > _I_think_, call vcpu_wake(d2v2) ? [*] >>> You are almost right. Here in the interrupt handler, it does two >>> things: >>> 1. echo back notify_remote_vcpu_wake() it finishes the job >>> 2. raise a separate softirq, which will call vcpu_wake(d2v2) >>> >>> > >>> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock >>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >>> > - on pcpu 6, __runq_tickle() is called, and it determines on which >>> > pcpu d2v2 should run >>> At this place, the behavior of __run_tickle() is what I mentioned >>> above. Raise >>> softirq on pcpu6 and idle pcpus. Who is faster, who get d2v2. >>> >>Sort of, yes. In general, you can expect pcpu 6 going through >>csched_schedule(), and hence setting d2v2 to run, to be faster that >>SCHEDULE_SOFIRQ being notified to someone remote, and it scheduling and >>getting to work stealing from pcpu 6's runqueue. >> >>However, that is the case only if d2v2 had higher priority than what is >>running on pcpu 6. If it does not, you: >> - _don't_ tickle pcpu 6 >> - tickle one or more idlers, if any. >> >>So, again, it's not that you always wake and run it on pcpu 6 >> > >I see what you mean :-) > >So I am curious about why we add this in pcpu's 6 queue, when we know the >priority is low in __runq_tickle() and won't tickle that. And we are sure >someone in the future will grab pcpu's 6 schedule lock and take it. > >Or in this case -- when there are idlers and vcpu's priority is higher, >instead of let pcpu 6 to wake this, but let an idler to take it? Ask someone >who is free to help a busy one. > >>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. 
Let's look at the >>> > following two cases: >>> > a) if it was determined that d2v2 should run on pcpu 6 >>> > (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself): >>> > - at the first softirq check, pcpu 6 will schedule >>> > - it takes its own runqueue lock >>> > - it finds d2v2 in there, and runs it >>> > b) if it was determined that d2v2 should run on, say, pcpu 9: >>> > - in a (little, hopefully) while, pcpu 9 will schedule >>> > - it takes its own runqueue lock >>> > - if it has nothing in its runqueue (likely, if >>> > __runq_tickle() chose it, but not necessarily and always >>> > true, especially under high load), it looks around to >>> > see if other pcpus have any vcpu that it can execute >>> > - it will try to take the locks of the various pcpu's >>> > runqueues it comes across >>> > - if it manages to take the lock of pcpu's 6 runqueue, it >>> > sees d2v2 in it, steal it and run it. >>> > >>> > [*] I don't see in the code you shared what happens in reaction to >>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually >>> > calls vcpu_wake() >>> > >>> > So, basically, it looks to me that you're adding a level of >>> > indirection, I'm not sure for what benefit. >> >>> In my mind, we are trying to reduce the contention on schedule lock >>> from two >>> aspects: >>> 1. for those vcpus would run on other pcpu, it will take >>> wake_vcpu_lock >>> instead of schedule lock >>> >>I don't see how you can say "instead". It looks to me that, for those >>vcpus what would run on other pcpu, we need to take wake_vcpu_lock >>_in_addition_ to the runqueue lock. >> >>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which >>> also >>> reduce the cache coherency between pcpus >>> >>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a >>remote lock, though. > >Hmm... you are right. 
> >> >>> > In fact, in the best case (i.e., d2v2 will actually be run on its >>> > v->processor), in the original code there is: >>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6. >>> > In your case, there is: >>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6 >>> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6. >>> > >>> Our idea may looks like this: >>> - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >>> - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu. >>> - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6. >>> >>So, it's even worse than how I imagined! :-P >> >>> > Also, as far as locking within the wakeup path is concerned, in the >>> > original code: >>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >>> > (i.e., for the time of __runq_insert() and __runq_tickle(). >>> > In your case: >>> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the >>> > deferred wakeup >>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >>> > >>> The wake_vcpu_lock is hold in the first step in above. >>> The runqueue lock(I think you mean the schedule lock?) won't be taken >>> in this >>> case. >>> >>It will be taken, for executing vcpu_wake(). From your own description >>of the mechanism above: >> >>"Our idea may looks like this: >> 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the >> vcpu. >> 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6." >> >>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >>2) runs on pcpu 6 and takes pcpu'6 6 runqueue lock, for vcpu_wake() >>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for >> csched_schedule() >> > >Oh, you are right. > >We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6, while >looks the wake_vcpu_lock is grabbed by others. > >I am thinking our change may benefit like this . 
> >The situation is: >* vcpu's priority is higher >* pcpu 9 is idle >* pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ > >At this moment pcpu 4 is holding pcpu 6's runqueue lock, neither pcpu 6 >nor pcpu 9 could do their job until pcpu 4 releases pcpu 6's runqueue lock. > >While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 and pcpu 9 >could do their job immediately. Ah I know, a very corner case, not sure about the >effect. > >BTW, I suddenly found softirq is implemented as an IPI. So I am thinking >whether it is safe to raise softirq to itself? Since it looks like we already hold >the runqueue lock. Or because we have turned the irq off in vcpu_wake()? > >>> > >>> > (i.e., for the time of __runq_insert() and __runq_tickle()). >>> > >>> > Which, to me, looks like both more inter-CPU communication overhead >>> > and more chances for lock contention. >>> Hmm... yes, more inter-CPU communication, while less lock contention >>> I think. >>> >>More inter-CPU communication, the same level of local lock contention, >>plus some added moderately-to-low contended remote lock contention, >>AFAICT. >> >>> > So, again, more IPIs and more (potential) lock contention. >>> > >>> > True, the queueing into the deferred wakeup list will be very simple >>> > and quick, and hence the critical section needed for implementing >>> > it very short. But the problem is that you are not shortening >>> > something long, or splitting something long in two shorter chunks, >>> > you are _adding_ something, and thus you can't possibly win. :-/ >>> > >>> Yes, we are moving something, let me digest it for a while. >>> >>Sure. Let me also add that I'm not saying it's not going to work, or >>that it's not going to improve the situation. >> >>What I'm saying is that, architecturally, it's not too different from >>the current situation, so I would not expect wonders from just this.
>> >>What I'd do would be to try figuring out where it is that the most >>waiting time is being accumulated, i.e., which *specific* locking >>attempts of the scheduler's runqueues' locks are the most contended, >>and focus on those first. >> >>If you're on a big host, I think the way in which Credit1 does load >>balancing (i.e., by work stealing) may be the thing to blame. But all >>these are things that only profiling and benchmarking can really tell! >>:-) >> >>As a side note, we're also working on something which, at the end of >>the day, is rather similar, although for different purposes. In fact, >>see here: >>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html >> >>The goal is not having to disable IRQs during scheduling, and I'm >>looking at keeping the per-pcpu deferred wakeup list lockless, in order >>not to add more lock contention. >> Dario, Just got a chance to look into this. This is interesting and I am trying to understand the problem you want to solve first. vcpu_wake() is a _special_ case which has to turn off the irq, because SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle a pcpu to pick up the queued vcpu. And the tickling part needs the irq to be turned off. I don't get this part. Why do we have to turn off the irq for tickling a pcpu? And does enabling irq in schedule handlers benefit the performance? Or what is the motivation behind this? Didn't look into your code yet. From the description from Andrew, I come up with several questions: 1. which per-cpu list you want to queue the vcpu on? v->processor? 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? 3. purpose is so that schedule softirq could run with irq enabled 4. the schedule softirq would do the jobs currently in vcpu_wake()? Taking the per-cpu branch as an example, I see several commits. The top 6 are related ones, right? >>If you're interested in knowing more or working on that, just let me know.
>>But, of course, this has to be done on Xen upstream, not on an >>old version. >> >>Regards, >>Dario >>-- >><<This happens because I choose it to happen!>> (Raistlin Majere) >>----------------------------------------------------------------- >>Dario Faggioli, Ph.D, http://about.me/dario.faggioli >>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> > > > >-- >Richard Yang >Help you, Help me -- Wei Yang Help you, Help me _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-17 3:30 ` Wei Yang @ 2016-09-22 9:12 ` Dario Faggioli 2016-09-22 9:55 ` Dario Faggioli 2016-09-24 3:39 ` Wei Yang 0 siblings, 2 replies; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:12 UTC (permalink / raw) To: Wei Yang, Wei Yang Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 3967 bytes --] On Sat, 2016-09-17 at 03:30 +0000, Wei Yang wrote: > Dario, > Hey, hi again, and sorry for the delay in getting back at this particular part of the conversation. > Just got a chance to look into this. This is interesting and I am > trying to > understand the problem you want to solve first. > :-) > vcpu_wake() is a _special_ case which has to turn off the irq, because > SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle > a pcpu to pick > up the queued vcpu. And the tickling part needs the irq to be turned > off. > _Almost_ correct. However, the problem is more that vcpu_wake() can happen in response to an IRQ, and when you grab a spinlock in IRQ context, you need to disable IRQs. There is a good explanation of why, here: > I don't get this part. Why do we have to turn off the irq for > tickling a pcpu? > We don't. The point here is that, in Xen, we want spinlocks to either _always_ be taken with IRQs disabled or _always_ be taken with IRQs enabled. Well, since we now take the scheduler's runqueue lock in vcpu_wake() (and as said that must be done with IRQs disabled), this implies that the scheduler's runqueue lock needs to always be taken with IRQs disabled, even when leaving them enabled would not actually be an issue, for being consistent with the locking rule. So, if we manage to avoid taking the scheduler's runqueue lock in vcpu_wake(), we will manage to put ourselves in the opposite situation, wrt the locking consistency rule, i.e., we can _always_ take the scheduler's runqueue lock with IRQs enabled.
This is, broadly speaking, the problem we wanted to solve, and how we thought about solving it. > And does enabling irq in schedule handlers benefit the performance? Or > what is the > motivation behind this? > As I said to you about your modification, here too it is not super-easy to tell in advance whether, and if yes, by how much, we'll see a boost in performance. In this case, the idea is that, in general, keeping IRQs disabled is bad. It is bad for concurrency, it is bad for latency in both scheduling and when it comes to reacting to external events. And it has been profiled that the scheduler is one of the components that keeps the IRQs disabled for long chunks of time. I honestly did expect things to improve a bit, especially on certain workloads, but of course, as usual, benchmarks will tell. :-) > Didn't look into your code yet. > Ok. Even if/when you look at it, bear in mind it's still only a stub. > From the description from Andrew, I come up with > several questions: > 1. which per-cpu list you want to queue the vcpu on? v->processor? > ISTR having thought, and even given a try, to both v->processor (where v is the waking vcpu) and the current cpu (where for 'current cpu' I meant the cpu where the wake-up is occurring). I think I decided to use the current cpu (mostly for the same reasons why I don't think there is much advantage in what you are trying to do :-P) > 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? > It would call what is now vcpu_wake(). So, yes, for most schedulers that means inserting and tickling, but it's really scheduler specific. > 3. purpose is so that schedule softirq could run with irq enabled > Purpose is that scheduling functions can be executed with IRQs enabled, yes. > 4. the schedule softirq would do the jobs currently in vcpu_wake()? > Err... not sure what you mean (but maybe I've already answered at point 2.?). > Taking the per-cpu branch as an example, I see several commits. The top > 6 are > related ones, right?
> The top 4 (I told you it's a stub :-P). Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:12 ` Dario Faggioli @ 2016-09-22 9:55 ` Dario Faggioli 2016-09-24 7:26 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:55 UTC (permalink / raw) To: Wei Yang, Wei Yang Cc: george.dunlap, xuquan8, mengxu, liuxiaojian6, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 711 bytes --] On Thu, 2016-09-22 at 11:12 +0200, Dario Faggioli wrote: > _Almost_ correct. However, the problem is more that vcpu_wake() can > happen in response to an IRQ, and when you grab a spinlock in IRQ > context, you need to disable IRQs. > > There is a good explanation of why, here: > Ah, sorry, link is missing! Here: https://www.kernel.org/doc/Documentation/locking/spinlocks.txt Read the "Lesson 3: spinlocks revisited." section. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:55 ` Dario Faggioli @ 2016-09-24 7:26 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-24 7:26 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, xen-devel, liuxiaojian6, george.dunlap, Wei Yang, mengxu, Wei Yang On Thu, Sep 22, 2016 at 11:55:16AM +0200, Dario Faggioli wrote: >On Thu, 2016-09-22 at 11:12 +0200, Dario Faggioli wrote: >> _Almost_ correct. However, the problem is more that vcpu_wake() can >> happen in response to an IRQ, and when you grab a spinlock in IRQ >> context, you need to disable IRQs. >> >> There is a good explanation of why, here: >> >Ah, sorry, link is missing! Here: > >https://www.kernel.org/doc/Documentation/locking/spinlocks.txt > >Read the "Lesson 3: spinlocks revisited." section. > Oh, you have a link here ... let me digest it. >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-22 9:12 ` Dario Faggioli 2016-09-22 9:55 ` Dario Faggioli @ 2016-09-24 3:39 ` Wei Yang 2016-09-26 10:18 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-24 3:39 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: >On Sat, 2016-09-17 at 03:30 +0000, Wei Yang wrote: >> Dario, >> >Hey, hi again, and sorry for the delay in getting back at this particular >part of the conversation. > Sure, I was busy with other stuff these days :-( >> Just got a chance to look into this. This is interesting and I am >> trying to >> understand the problem you want to solve first. >> >:-) > >> vcpu_wake() is a _special_ case which has to turn off the irq, because >> SCHED_OP(wake, v) would not only enqueue the vcpu but also tickle >> a pcpu to pick >> up the queued vcpu. And the tickling part needs the irq to be turned >> off. >> >_Almost_ correct. However, the problem is more that vcpu_wake() can >happen in response to an IRQ, and when you grab a spinlock in IRQ >context, you need to disable IRQs. > Ok, now I have a question about this sentence. I have checked some posts which say in the interrupt handling, irq would be disabled. Well, this really depends on the implementation. This means in Xen, irq is still enabled during interrupt handling? Because of this, we have to disable IRQ when we need to grab the spin lock? >There is a good explanation of why, here: > > >> I don't get this part. Why do we have to turn off the irq for >> tickling a pcpu? >> >We don't. The point here is that, in Xen, we want spinlocks to either >_always_ be taken with IRQs disabled or _always_ be taken with IRQs >enabled.
>Well, since we now take the scheduler's runqueue lock in >vcpu_wake() (and as said that must be done with IRQs disabled), this >implies that the scheduler's runqueue lock needs to always be taken >with IRQs disabled, even when leaving them enabled would not actually >be an issue, for being consistent with the locking rule. > >So, if we manage to avoid taking the scheduler's runqueue lock in vcpu_wake(), we will manage to put ourselves in the opposite situation, wrt the locking consistency rule, i.e., we can _always_ take the scheduler's runqueue lock with IRQs enabled. > >This is, broadly speaking, the problem we wanted to solve, and how we thought about solving it. Looks like I almost get everything. Let me do a summary to see whether I have got your words. (spinlock usage convention in Xen) & (vcpu_wake() may happen in an IRQ) => (irq all disabled || irq all enabled) & (disable irq on grabbing spinlock) => should grab schedule lock with IRQ disabled
>> >ISTR having thought, and even given a try, to both v->processor (where >v is the waking vcpu) and the current cpu (where for 'current cpu' I >meant the cpu where the wake-up is occurring). > >I think I decided to use the current cpu (mostly for the same reasons >why I don't think there is much advantage in what you are trying to do >:-P) > >> 2. SCHEDULE_WAKE_SOFTIRQ would do runqueue insert and tickle? >> >It would call what is now vcpu_wake(). So, yes, for most schedulers that >means inserting and tickling, but it's really scheduler specific. > >> 3. purpose is so that schedule softirq could run with irq enabled >> >Purpose is that scheduling functions can be executed with IRQs enabled, >yes. > >> 4. the schedule softirq would do the jobs currently in vcpu_wake()? >> >Err... not sure what you mean (but maybe I've already answered at point >2.?). > >> Taking the per-cpu branch as an example, I see several commits. The top >> 6 are >> related ones, right? >> >The top 4 (I told you it's a stub :-P). > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-24 3:39 ` Wei Yang @ 2016-09-26 10:18 ` Dario Faggioli 2016-09-28 3:44 ` Wei Yang 0 siblings, 1 reply; 17+ messages in thread From: Dario Faggioli @ 2016-09-26 10:18 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, mengxu, Wei Yang [-- Attachment #1.1: Type: text/plain, Size: 2327 bytes --] On Sat, 2016-09-24 at 11:39 +0800, Wei Yang wrote: > On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: > > _Almost_ correct. However, the problem is more that vcpu_wake() can > > happen in response to an IRQ, and when you grab a spinlock in IRQ > > context, you need to disable IRQs. > > > > Ok, now I have a question about this sentence. > > I have checked some posts which say in the interrupt handling, irq > would be > disabled. Well, this really depends on the implementation. > > This means in Xen, irq is still enabled during interrupt handling? > Because of > this, we have to disable IRQ when we need to grab the spin lock? > So, I don't think I'm getting all you're saying and asking. The situation is like this: currently, vcpu_wake() can be called both from inside or outside IRQ context, and both with IRQs enabled or disabled. Given this situation, to be safe, we need to disable IRQs when taking whatever spinlock vcpu_wake() (and the functions it calls, of course) takes. Since it takes the scheduler's runq lock, this means we need to take the lock with IRQ disabled. And because of that, _everyone_ that takes the scheduler's lock must do that with IRQs disabled. If we manage to *not* take the scheduler's lock with IRQ disabled in vcpu_wake(), we then can do the same everywhere else. But we can't do that with the current structure of the code, and that's why we're thinking to defer the actual wakeup --i.e., the actual call to vcpu_wake()-- to a moment when: - IRQs are enabled, - we don't need to disable them. This is it. About the following...
> Looks like I almost get everything. Let me do a summary to see whether I > have got > your words. > > (spinlock usage convention in Xen) & (vcpu_wake() may happen in > an IRQ) > This is ok. > => > > (irq all disabled || irq all enabled) & (disable irq on grabbing > spinlock) > This, I don't get it. > => > > should grab schedule lock with IRQ disabled > Yes, sounds right. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] [-- Attachment #2: Type: text/plain, Size: 127 bytes --]
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-26 10:18 ` Dario Faggioli @ 2016-09-28 3:44 ` Wei Yang 0 siblings, 0 replies; 17+ messages in thread From: Wei Yang @ 2016-09-28 3:44 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, Wei Yang, mengxu, Wei Yang On Mon, Sep 26, 2016 at 12:18:06PM +0200, Dario Faggioli wrote: >On Sat, 2016-09-24 at 11:39 +0800, Wei Yang wrote: >> On Thu, Sep 22, 2016 at 11:12:15AM +0200, Dario Faggioli wrote: >> > _Almost_ correct. However, the problem is more that vcpu_wake() can >> > happen in response to an IRQ, and when you grab a spinlock in IRQ >> > context, you need to disable IRQs. >> > >> >> Ok, now I have a question about this sentence. >> >> I have checked some posts which say in the interrupt handling, irq >> would be >> disabled. Well, this really depends on the implementation. >> >> This means in Xen, irq is still enabled during interrupt handling? >> Because of >> this, we have to disable IRQ when we need to grab the spin lock? >> >So, I don't think I'm getting all you're saying and asking. > >The situation is like this: currently, vcpu_wake() can be called both >from inside or outside IRQ context, and both with IRQs enabled or >disabled. > >Given this situation, to be safe, we need to disable IRQs when taking >whatever spinlock vcpu_wake() (and the functions it calls, of course) >takes. Since it takes the scheduler's runq lock, this means we need to >take the lock with IRQ disabled. > >And because of that, _everyone_ that takes the scheduler's lock must do >that with IRQs disabled. If we manage to *not* take the scheduler's >lock with IRQ disabled in vcpu_wake(), we then can do the same >everywhere else. But we can't do that with the current structure of the >code, and that's why we're thinking to defer the actual wakeup --i.e., >the actual call to vcpu_wake()-- to a moment when: > - IRQs are enabled, > - we don't need to disable them.
> Dario, I appreciate your patience and almost get the background and your purpose. First, let me give my conclusion. By taking the schedule lock with IRQ enabled, it will improve the parallelism to some extent, while it would be hard to say the net effect. Need more data from benchmarks to verify. Then, let me describe what I understand. (Maybe not correct :-)) This is all about SMP, spinlocks and IRQ/IRQ context. Two basic assumptions: a. all interrupts are the same priority; an interrupt could not be preempted by others. b. in the whole interrupt handler, IRQ is disabled on the local processor Let's assume there is only one processor first, 1. If the spinlock is just used in the interrupt handler, it looks safe, no race condition. 2. While if the spinlock may be used outside the interrupt handler, ok, we need to at least disable IRQ when grabbing the lock outside the interrupt handler. Otherwise there is deadlock. Then if there are several CPUs in the system. 1. If the spinlock is just used in the interrupt handler, would this still be safe without disabling the IRQ? It looks ok to me. 2. While if the spinlock may be used outside the interrupt handler, ok, we need to at least disable IRQ when grabbing the lock outside the interrupt handler. This leads to the spinlock usage convention: if a spinlock would be grabbed in an interrupt handler (IRQ context), we need to grab it with IRQ disabled _always_. Then, to achieve your purpose -- grab the spinlock with IRQ enabled, you need to move all the lock operations outside of the interrupt handler. What you are doing in the code is to defer the job to softirq when it is in_irq. By now, I can understand the first part of your check. If vcpu_wake() is not running in_irq() we can do the job directly now. But not the second part, why do we need to check the local irq is also enabled? In my mind, checking it is not in_irq() is enough. Thanks a lot for your time :-) >This is it. About the following... > >> Looks like I almost get everything.
>> Let me do a summary to see whether I >> have got >> your words. >> >> (spinlock usage convention in Xen) & (vcpu_wake() may happen in >> an IRQ) >> >This is ok. > >> => >> >> (irq all disabled || irq all enabled) & (disable irq on grabbing >> spinlock) >> >This, I don't get it. > >> => >> >> should grab schedule lock with IRQ disabled >> >Yes, sounds right. > >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) -- Richard Yang Help you, Help me
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-13 11:30 ` [Help] Trigger Watchdog when adding an IPI in vcpu_wake Dario Faggioli 2016-09-14 10:44 ` Wei Yang @ 2016-09-18 4:03 ` Wei Yang 2016-09-22 9:47 ` Dario Faggioli 1 sibling, 1 reply; 17+ messages in thread From: Wei Yang @ 2016-09-18 4:03 UTC (permalink / raw) To: Dario Faggioli Cc: xuquan8, liuxiaojian6, george.dunlap, xen-devel, richard.weiyang, mengxu, Wei Yang On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote: >[using xendevel correct address] > >On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote: >> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote: >> > >> > I'm not surprised by that. Yet, I'd be interested in hearing more >> > about this profiling you have done (things like, how you captured >> > the data, what workloads you are exactly considering, how you >> > determined what is the bottleneck, etc). >> Let me try to explain this. >> >> Workload: a. Blue Screen in Windows Guests >> b. some client using Windows to do some video >> processing >> which need precise timestamp (we are not sure the >> core >> reason but we see the system is very slow) >> >Do you mind sharing just a bit more, such as: > - number of pcpus > - number of vcpus of the various VMs > >I also am not sure what "a. Blue screen in Windows guests" above >means... is there a workload called "Blue Screen"? Or is it that you >are hitting a BSOD in some of your guests (which ones, what were they >doing)? Or is it that you _want_ to provoke a BSOD on some of your >guests? Or is that something else? :-P > >> Capture the data: lock profile >> Bottleneck Check: The schedule lock wait time is really high >> >Ok, cool. Interesting that lock profiling works on 4.1! :-O > >> > The scheduler tries to see whether the v->processor of the waking >> > vcpu can be re-used, but that's not at all guaranteed, and again, >> > on a very loaded system, not even that likely! >> > >> Hmm... I may missed something. 
>> >> Took your assumption below for example. >> In my mind, the process looks like this: >> >> csched_vcpu_wake() >> __runq_insert(), insert the vcpu in pcpu's 6 runq >> __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other >> (1) >> >Well, yes. More precisely, at least in current staging, >SCHEDULE_SOFTIRQ is raised for pcpu 6: > - if pcpu 6 is idle, or > - if pcpu 6 is not idle but there actually isn't any idle vcpu, and > the waking vcpu is higher in priority than what is running on pcpu 6 > >> __do_softirq() >> schedule() >> csched_schedule(), for pcpu 6, it may wake up d2v2 based >> on it's >> priority >> >Yes, but it is pcpu 6 that will run csched_schedule() only if the >conditions mentioned above are met. If not, it will be some other pcpu >that will do so (or none!). > >But I just realized that the fact that you are actually working on 4.1 >is going to be an issue. In fact, the code that it is now in Xen has >changed **a lot** since 4.1. In fact, you're missing all the soft- >affinity work (which you may or may not find useful) but also >improvements and bugfixing. > >I'll have a quick look at how __runq_tickle() looks like in Xen 4.1, >but there's very few chances I'll be available to provide detailed >review, advice, testing, etc., on such an old codebase. :-( > >> By looking at the code, what I missed may be in __runq_tickle(), >> which in case >> there are idle pcpu, schedule_softirq is also raised on them. By >> doing so, >> those idle pcpus would steal other busy tasks and may in chance get >> d2v2? >> >Yes, it looks like, in Xen 4.1, this is more or less what happens. The >idea is to always tickle pcpu 6, if the priority of the waking vcpu is >higher than what is running there. 
> >If pcpu 6 was not idle, we also tickle one or more idle pcpus so that: > - if the waking vcpu preempted what was running on pcpu 6, an idler > can pick-up ("steal") such preempted vcpu and run it; > - if the waking vcpu ended up in the runqueue, an idler can steal it > >> BTW, if the system is in heavy load, how much chance would we have >> idle pcpu? >> >It's not very likely that there will be idle pcpus in a very busy >system, I agree. > >> > > We can divide this in three steps: >> > > a) Send IPI to the target CPU and tell it which vcpu we want it >> > > to >> > > wake up. >> > > b) The interrupt handler on target cpu insert vcpu to a percpu >> > > queue >> > > and raise >> > > softirq. >> > > c) softirq will dequeue the vcpu and wake it up. >> > > >> > I'm not sure I see how you think this would improve the situation. >> > >> > So, quickly, right now: >> > >> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up >> > - let's assume d2v2->processor = 6 >> > - let's assume the wakeup is happening on pcpu 4 >> > >> > Right now: >> > >> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2, >> > which is the runqueue lock of pcpu 6 (in Credit1, there is 1 >> > runqueue per pcpu, and locks are per-cpu already) >> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >> > - still executing on pcpu 4, __runq_tickle() is called, and it >> > determines on which pcpu d2v2 should run >> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the >> > following two cases: >> > a) if it was determined that d2v2 should run on pcpu 6: >> > - in a (little, hopefully) while, pcpu 6 will schedule >> > - it takes its own runqueue lock >> > - it finds d2v2 in there, and runs it >> Even in this case, it depends on the priority of d2v2 whether it >> would be run >> now. Right? >> >Yes. 
>But both in 4.1 and in current staging, we only raise >SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking >vcpu is higher and it should preempt the currently running vcpu. > >> > b) if it was determined that d2v2 should run on, say, pcpu 9: >> > - in a (little, hopefully) while, pcpu 9 will schedule >> > - it takes its own runqueue lock >> > - if it has nothing in its runqueue (likely, if >> > __runq_tickle() chose it, but not necessarily and always >> > true, especially under high load), it looks around to >> > see if other pcpus have any vcpu that it can execute >> > - it will try to take the locks of the various pcpu's >> > runqueues it comes across >> > - if it manages to take the lock of pcpu's 6 runqueue, it >> > sees d2v2 in it, steals it and runs it. >> Oh, my understanding matches yours :-) >> >> BTW, we also found csched_load_balance() will hold the schedule >> lock but >> not find a proper vcpu to steal. Maybe this would be a potential >> optimization >> point. >> >Mmm... I actually don't understand what you mean here... what is the >possible optimization? > >> > With your modifications, AFAICT: >> > >> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock >> > of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list >> > - poke pcpu 6 with an IPI >> > - in a (little, hopefully) while, pcpu 6 reacts to the IPI and, >> > _I_think_, calls vcpu_wake(d2v2) ? [*] >> You are almost right. Here in the interrupt handler, it does two >> things: >> 1. echo back to notify_remote_vcpu_wake() that it finishes the job >> 2. raise a separate softirq, which will call vcpu_wake(d2v2) >> >> > >> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock >> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue >> > - on pcpu 6, __runq_tickle() is called, and it determines on which >> > pcpu d2v2 should run >> At this place, the behavior of __runq_tickle() is what I mentioned >> above. Raise >> softirq on pcpu6 and idle pcpus.
Whichever is faster gets d2v2. >> >Sort of, yes. In general, you can expect pcpu 6 going through >csched_schedule(), and hence setting d2v2 to run, to be faster than >SCHEDULE_SOFTIRQ being notified to someone remote, and it scheduling and >getting to work stealing from pcpu 6's runqueue. > >However, that is the case only if d2v2 had higher priority than what is >running on pcpu 6. If it does not, you: > - _don't_ tickle pcpu 6 > - tickle one or more idlers, if any. > >So, again, it's not that you always wake and run it on pcpu 6. > >> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the >> > following two cases: >> > a) if it was determined that d2v2 should run on pcpu 6 >> > (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself): >> > - at the first softirq check, pcpu 6 will schedule >> > - it takes its own runqueue lock >> > - it finds d2v2 in there, and runs it >> > b) if it was determined that d2v2 should run on, say, pcpu 9: >> > - in a (little, hopefully) while, pcpu 9 will schedule >> > - it takes its own runqueue lock >> > - if it has nothing in its runqueue (likely, if >> > __runq_tickle() chose it, but not necessarily and always >> > true, especially under high load), it looks around to >> > see if other pcpus have any vcpu that it can execute >> > - it will try to take the locks of the various pcpu's >> > runqueues it comes across >> > - if it manages to take the lock of pcpu's 6 runqueue, it >> > sees d2v2 in it, steals it and runs it. >> > >> > [*] I don't see in the code you shared what happens in reaction to >> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually >> > calls vcpu_wake() >> > >> > So, basically, it looks to me that you're adding a level of >> > indirection, I'm not sure for what benefit. > >> In my mind, we are trying to reduce the contention on the schedule lock >> from two >> aspects: >> 1.
for those vcpus that would run on other pcpus, it will take >> wake_vcpu_lock >> instead of the schedule lock >> >I don't see how you can say "instead". It looks to me that, for those >vcpus that would run on other pcpus, we need to take wake_vcpu_lock >_in_addition_ to the runqueue lock. > >> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which >> also >> reduces cache-coherency traffic between pcpus >> >Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a >remote lock, though. > >> > In fact, in the best case (i.e., d2v2 will actually be run on its >> > v->processor), in the original code there is: >> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6. >> > In your case, there is: >> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6 >> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6. >> > >> Our idea may look like this: >> - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 >> - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu. >> - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6. >> >So, it's even worse than how I imagined! :-P > >> > Also, as far as locking within the wakeup path is concerned, in the >> > original code: >> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >> > (i.e., for the time of __runq_insert() and __runq_tickle()). >> > In your case: >> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the >> > deferred wakeup >> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() >> > >> The wake_vcpu_lock is held in the first step above. >> The runqueue lock (I think you mean the schedule lock?) won't be taken >> in this >> case. >> >It will be taken, for executing vcpu_wake(). From your own description >of the mechanism above: > >"Our idea may look like this: > 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6 > 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the > vcpu. > 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
> >1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock >2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake() >3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for > csched_schedule() > >> > >> > (i.e., for the time of __runq_insert() and __runq_tickle()). >> > >> > Which, to me, looks like both more inter-CPU communication overhead >> > and more chances for lock contention. >> Hmm... yes, more inter-CPU communication, while less lock contention >> I think. >> >More inter-CPU communication, the same level of local lock contention, >plus some added moderately-to-lightly contended remote lock contention, >AFAICT. > >> > So, again, more IPIs and more (potential) lock contention. >> > >> > True, the queueing into the deferred wakeup list will be very simple >> > and quick, and hence the critical section needed for implementing >> > it very short. But the problem is that you are not shortening >> > something long, or splitting something long into two shorter chunks, >> > you are _adding_ something, and thus you can't possibly win. :-/ >> > >> Yes, we are moving something, let me digest it for a while. >> >Sure. Let me also add that I'm not saying it's not going to work, or >that it's not going to improve the situation. > >What I'm saying is that, architecturally, it's not too different from >the current situation, so I would not expect wonders from just this. > >What I'd do would be to try figuring out where it is that the most >waiting time is being accumulated, i.e., which *specific* locking >attempts of the scheduler's runqueues' locks are the most contended, >and focus on those first. > >If you're on a big host, I think the way in which Credit1 does load >balancing (i.e., by work stealing) may be the thing to blame. But all >these are things that only profiling and benchmarking can really tell! >:-) > >As a side note, we're also working on something which, at the end of >the day, is rather similar, although for different purposes.
In fact, >see here: >https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html > Dario, I took a look at your code; some questions below: 1. vcpu_wake is split into two cases, the first case being "not in irq" and "irq enabled". Is there some reason for this classification? Maybe some background knowledge I need to have. 2. In vcpu_wake(), I see you raise the def_vcpu_wake softirq when the list is empty, and then add the vcpu to the list. Any reason for this sequence? BTW, at this moment vcpu_wake() holds the spin lock, and vcpu_wake_deferred() should take the spin lock before continuing. In the last patch, I saw you introduce a flag, wakeup_triggered, and then change the sequence a little. Why do you want to change the sequence there? 3. The main purpose of this series is to enable irq for the "scheduler handlers". I am not sure of the original motivation for this, but I guess this could increase parallelism, right? And I think we have to make sure those functions are not re-entrant, otherwise they will deadlock when locking on the same spin-lock. 4. This looks similar to what we try to achieve. We try to give the vcpu to someone else, who we think may pick it, to wake it up, while in your implementation, you add this vcpu to its own list and wake it up some time later. I guess the next step you want to take is to make the wakeup_defer.list a lock-less list. 5. Well, this may not be relevant to your code directly. I found some difference between raise_softirq() and cpumask_raise_softirq(). The first one just sets the pending bit, while the second one also sends an IPI. __runq_tickle() uses the second one to send the SCHEDULE softirq; does this mean the other pcpus would react immediately? >The goal is not having to disable IRQs during scheduling, and I'm >looking at keeping the per-pcpu deferred wakeup list lockless, in order >not to add more lock contention. > >If you're interested in knowing more or working on that, just let me
But, of course, this have to be done on Xen upstream, not on an >old version. > >Regards, >Dario >-- ><<This happens because I choose it to happen!>> (Raistlin Majere) >----------------------------------------------------------------- >Dario Faggioli, Ph.D, http://about.me/dario.faggioli >Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) > -- Richard Yang\nHelp you, Help me _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake 2016-09-18 4:03 ` Wei Yang @ 2016-09-22 9:47 ` Dario Faggioli 0 siblings, 0 replies; 17+ messages in thread From: Dario Faggioli @ 2016-09-22 9:47 UTC (permalink / raw) To: Wei Yang Cc: xuquan8, liuxiaojian6, george.dunlap, richard.weiyang, mengxu, xen-devel [-- Attachment #1.1: Type: text/plain, Size: 6347 bytes --] On Sun, 2016-09-18 at 12:03 +0800, Wei Yang wrote: > Dario, > > I took a look at your code; some questions below: > Haha, here you are! :-D > 1. vcpu_wake is split into two cases, the first case being "not in irq" > and "irq > enabled". Is there some reason for this classification? Maybe some background > knowledge > I need to have. > Mmm... yes. Well, probably that distinction should not be there, as it is a (potential, benchmark will tell if that's the case) optimization. And optimizations should be added later, with their own patch and patch description, etc... But as I said in the other email, the code is not even RFC. :-/ So, the whole point of the patch series is to leave interrupts enabled inside scheduling functions, including vcpu_wake() (which is becoming _vcpu_wake()). In order to do that, when taking the scheduler lock, we'll start using spin_lock() and vcpu_spin_lock(), instead of vcpu_spinlock_irq[save](). But this also means that IRQs can't be disabled when we do that, or: - we violate Xen's lock consistency rule (there's ASSERT()s checking that in debug builds) - we risk re-enabling interrupts at the wrong time. So, if IRQs are disabled when vcpu_wake() is called, or we are in IRQ context (where not disabling them would expose us to races), we want to go the deferred way the patch series is implementing. However, if vcpu_wake() is called with IRQ enabled and not from IRQ context, none of the above risks stand, and we can avoid doing all the dance required for deferred wakeup, and just call _vcpu_wake().
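The decision just described — wake directly when we are outside IRQ context with IRQs enabled, defer otherwise — can be modeled with a few counters. Everything below is a toy sketch with invented names (`wake_ctx`, with two flags standing in for Xen's real context checks such as `in_irq()`), not the actual patches:

```c
#include <assert.h>

/* Toy model of the deferred-wakeup decision described above.  The
 * counters just record what happened, so the policy is observable. */
struct wake_ctx {
    int in_irq;          /* called from interrupt context? */
    int irqs_enabled;    /* are IRQs currently enabled? */
    int deferred_len;    /* entries on the percpu deferred list */
    int softirqs_raised; /* wakeup softirqs raised so far */
    int direct_wakes;    /* times we could call _vcpu_wake() directly */
};

static void model_vcpu_wake(struct wake_ctx *c)
{
    if (!c->in_irq && c->irqs_enabled) {
        /* Safe case: take the lock with plain spin_lock() and wake
         * the vcpu right away. */
        c->direct_wakes++;
        return;
    }
    /* Unsafe case: defer.  Raise the softirq only if the list was
     * empty -- a non-empty list means someone will already walk it
     * and will find our entry too (one plausible policy from the
     * thread, not necessarily the final one). */
    if (c->deferred_len == 0)
        c->softirqs_raised++;
    c->deferred_len++;
}
```

Under this model, a burst of wakeups from IRQ context costs a single softirq, which is the point of checking for an empty list before raising.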
I'm saying this is a (potential) optimization because deferring the wakeup has a cost, and this 'if' is trying to avoid paying that cost when not strictly necessary. But really, in a proper version of this series I should just always defer the wakeup in this patch, and have another patch, at the end of the series, where this attempt to avoid deferral is implemented, and run benchmarks with and without said patch applied. > 2. In vcpu_wake(), I see you raise the def_vcpu_wake softirq when the > list is > empty, and then add the vcpu to the list. Any reason for this > sequence? BTW, > at this moment vcpu_wake() holds the spin lock, and > vcpu_wake_deferred() should > take the spin lock before continuing. > If the list is empty it means there is no one going through it and waking up vcpus from there. This means I need to raise the softirq that will cause someone to go check the list, so it will find the vcpu that I'm inserting there right now. OTOH, if the list is not empty, it means there is already someone going through the list itself. This means I don't need to raise anything, as the vcpu that I'm inserting in the list right now will be found by such waker, and woken up. This is for avoiding raising too many softirqs and poking too many cpus to run their deferred wakers, which would introduce contention issues on the deferred list lock. The reason why the check is before the insertion is pure cosmetics. I could just as well insert first and check for list_is_singleton(). In the code you see there, that does not matter, because the deferred list has its spinlock. But this will probably change if I get rid of that, making the list lockless. > In the last patch, I saw you introduce a flag, wakeup_triggered, and > then > change the sequence a little. Why do you want to change > the > sequence there? > It was another attempt at optimization. But again, this is all changing with the list becoming lockless. > 3.
The main purpose of this series is to enable irq for the "scheduler > handlers". > I am not sure of the original motivation for this, but I guess this > could > increase parallelism, right? And I think we have to make sure > those > functions are not re-entrant, otherwise they will deadlock when > locking > on the same spin-lock. > Hopefully, I explained the purpose in my previous email. I'm not sure what you mean here with 're-entrant'. The only reason why we want IRQs to be disabled when taking a lock, is because we take that lock from both inside and outside of IRQ context. So, if we're sure that a lock is never taken from IRQ context, we can avoid disabling IRQs. And this is what this is all about: the scheduler lock is taken once from IRQ context, and this series is trying to get rid of that one case. > 4. This looks similar to what we try to achieve. We try to > give the > vcpu to someone else, who we think may pick it, to wake it up, while in > your > implementation, you add this vcpu to its own list and wake it up some > time > later. > It does look similar (this is the reason why I mentioned it to you :-)), although: - both the goal and the expectations are different, - I'm not really 'giving the vcpu to someone else', I'm _deferring_ it, i.e., it is still going to happen on the same cpu where the event happens, only a little later in time (when no longer in IRQ context). > I guess the next step you want to take is to make the wakeup_defer.list a > lock-less list. > Yes, that would be the plan (and refreshing the series, of course). > 5. Well, this may not be relevant to your code directly. I found some > difference > between raise_softirq() and cpumask_raise_softirq(). The first one > just sets the > pending bit, while the second one also sends an IPI. > Exactly. > __runq_tickle() uses the > second one to send the SCHEDULE softirq; does this mean the other pcpus would > react > immediately? > I don't think there's much difference in observable behavior.
Perhaps, in cpumask_raise_softirq(), if the current cpu is in the mask, it would be enough to set a pending bit for it. However, this is something very different again between 4.1 and upstream. In fact, check these commits (not present in Xen 4.1): c47c316d99b3b570d0b "x86/HVM: batch vCPU wakeups" 9a727a813e9b25003e4 "x86: suppress event check IPI to MWAITing CPUs" So I can't really be sure or comment much on this. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
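For reference, the raise_softirq()/cpumask_raise_softirq() difference that question 5 asks about can be modeled as follows. This is a simplified, hypothetical sketch (the `model_` names are invented): the real functions live in Xen's common softirq code and also batch and filter the IPIs, as the commits quoted above show.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* Per-cpu state for the model: a softirq pending bitmask, plus a
 * counter of IPIs received (so the difference is observable). */
struct cpu_state {
    uint32_t pending;
    int ipis_received;
};

/* raise_softirq() flavor: just set the pending bit on the current
 * cpu; it is noticed at the next softirq check, with no IPI. */
static void model_raise_softirq(struct cpu_state *self, int nr)
{
    self->pending |= 1u << nr;
}

/* cpumask_raise_softirq() flavor: set the bit on every cpu in the
 * mask and IPI the remote ones, so they react immediately instead of
 * at their next natural softirq check. */
static void model_cpumask_raise_softirq(struct cpu_state cpus[],
                                        uint32_t mask, int nr,
                                        int current_cpu)
{
    for (int i = 0; i < NR_CPUS; i++) {
        if (!(mask & (1u << i)))
            continue;
        cpus[i].pending |= 1u << nr;
        if (i != current_cpu)
            cpus[i].ipis_received++;
    }
}
```

In other words, only the cpumask variant makes a remote pcpu react promptly, which is why __runq_tickle() needs it: the tickled pcpu should reschedule right away, not whenever it next happens to check its softirq bits.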