From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [Help] Trigger Watchdog when adding an IPI in vcpu_wake Date: Thu, 22 Sep 2016 11:47:23 +0200 Message-ID: <1474537643.4393.367.camel@citrix.com> References: <20160913085437.GA14514@linux-gk3p> <1473766217.6339.103.camel@citrix.com> <20160918040320.GA15774@linux-gk3p> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1488462778690791428==" Return-path: In-Reply-To: <20160918040320.GA15774@linux-gk3p> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" To: Wei Yang Cc: xuquan8@huawei.com, liuxiaojian6@huawei.com, george.dunlap@eu.citrix.com, richard.weiyang@gmail.com, mengxu@cis.upenn.edu, xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org --===============1488462778690791428== Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="=-WoupWSypRCFmHSSGi2Nb" --=-WoupWSypRCFmHSSGi2Nb Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, 2016-09-18 at 12:03 +0800, Wei Yang wrote: > Dario, >=20 > Took a look into your code, some questions as below: > Haha, here you are! :-D > 1. vcpu_wake is split into two cases, the first case is "not in irq" > and "irq > enabled". Some reason for this classification? Maybe some background > knowledge > I need to have. >=20 Mmm... yes. Well, probably that distinction should not be there, as it is a (potential, benchmark will tell if that's the case) optimization. And optimizations should be added later, with their own patch and patch description, etc... But as I said in the other email, the code is not even RFC. :-/ So, point is, the whole point of the patch series is to leave interrupts enabled inside scheduling functions, including vcpu_wake() (which is becoming _vcpu_wake()). In order to do that, we taking the scheduler lock, we'll start using spin_lock() and vcpu_spin_lock(), instead of vcpu_spinlock_irq[save](). But this also means that IRQs can't be disabled when we do that, or: =C2=A0- we violate Xen's lock consistency rule (there's ASSERT()s checking =C2=A0 =C2=A0that in debug builds) =C2=A0- we risk re-enabling interrupts at the wrong time. So, if IRQs are disabled when vcpu_wake() is called, or we are in IRQ context and not disabling them expose us to races, we want to go the deferred way the patch series is implementing. However, if vcpu_wake() is called with IRQ enabled and not from IRQ context, none of the above risks stand, and we can avoid doing all the dance required for deferred wakeup, and just call _vcpu_wake(). I'm saying this is a (potential) optimization because deferring the wakeup has a cost, and this 'if' is trying to avoid paying such when not strictly necessary. But really, in a proper version of this series I should just always defer the wakeup in this patch, and have another patch, at the end of the series, where this attempt to avoid deferral is implemented, and run benchmarks with and without said patch applied. > 2. in vcpu_wake(), I see you raise def_vcpu_wake softirq when the > list is > empty. and then add the vcpu to the list. any reason for this > sequence? BTW, > at this moment vcpu_wake() holds the spin lock and > vcpu_wake_deferred() should > take the spin lock before continuing. >=20 If the list is empty it means there is no one going through it and waking up vcpus from there. This means I need to raise the softirq that will cause someone to go check the list, so it will find the vcpu that I'm inserting there right now. OTOH, if the list is not empty, it means there is already someone going through the list itself. This means I don't need to raise anything, as the vcpu that I'm inserting in the list right now will be found by such waker, and woken up. This is for avoiding raising too many softirq and poking too many cpus to run their deferred wakers, which would introduce contention issue on the deferred list lock. The reason why the check is before the insertion is pure cosmetics. I can well insert first and check for list_is_singleton(). In the code you see there, that does not matter, because the deferred list has it's spinlock. But this will probably change if I get rid of that, making the list lockless. > In the last patch, I saw you introduce a flag wakeup_triggered, and > then > change the sequence a little. Why at this place, you want to change > the > sequence here? >=20 It was another attempt of optimization. But again, this is all changing with the list becoming lockless. > 3. The main purpose of this series is to enable irq for "scheduler > handlers". > I am not sure the original initiative for this, while I guess this > could > increase the parallelism, right? And I think we have to make sure > those > functions are not re-entrant, otherwise they will face dead lock when > locking > on the same spin-lock. >=20 Hopefully, I explained the purpose in my previous email. I'm not sure what you mean here with 're-entrant'. The only reason why we want IRQs to be disabled when taking a lock, is because we take that lock from both inside and outside of IRQ context. So, if we're sure that a lock is never taken from IRQ context, we can avoid disabling IRQs. And this is what this is all about: the scheduler lock is taken once from IRQ context, and this series is trying to get rid of that one case. > 4. This looks similar with what we try to achieve. While we try to > give the > vcpu to someone else, who we think may pick it, to wake it up, and in > your > implementation, you at this vcpu to its own list and wake it up some > time > later.=20 > It does looks similar (this is the reason why I mentioned it to you :- )), although: =C2=A0- both the goal and the expectations are different, =C2=A0- I'm not really 'giving the vcpu to someone else', I'm _deferring_= =C2=A0 =C2=A0 =C2=A0it, i.e., it is still going to happen on the same cpu where th= e=C2=A0 =C2=A0 =C2=A0event happens, only a little later in time (when no longer in = IRQ =C2=A0 =C2=A0context) > I guess next step you want to do is to make the wakeup_defer.list a > lock-less list. >=20 Yes, that would be the plan (and refreshing the series, of course). > 5. Well, this may not relevant to your code directly. I found some > difference > between raise_softirq() and cpumask_raise_softirq(). The first one > just set > pending bit, while the second one will send IPI. > Exactly. > __runq_tickle() use the > second one to send SCHEDULE softirq, this means other pcpu would > react > immediately? >=20 I don't think there's much difference in observable behavior. Perhaps, in cpumask_raise_softirq(), if the current cpu is in the mast, it would be enough to set a pensing bit for it. However, this is something very different again, between 4.1 and upstream. In fact, check commits (not present in Xen 4.1): =C2=A0c47c316d99b3b570d0b "x86/HVM: batch vCPU wakeups" =C2=A09a727a813e9b25003e4 "x86: suppress event check IPI to MWAITing CPUs" So I can't really be sure or comment much on this. Regards, Dario --=C2=A0 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-WoupWSypRCFmHSSGi2Nb Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJX46isAAoJEBZCeImluHPuMNcQAJDF59QEw2dXk4MPybMOjNID nAlamyuiXP4aKjEToXqn/TL2Llod9AtFprl5VJjH8iksi9PYyMQtILjqOJjvtCxr +l/3O0bsfZPfANbF8uFUtfv3Gu+YEuzv9faRZUVHlHuRgatr8huPNmF0VCSWJRqM xTSt6pG6xusy9z2rXAbardv/zzFdv6p8J5i9oTGXAlB4ZBUIPgkwAZWkcPT6GbTc uawzRU7VEW2YB2sdR37bgVZOBRHIbSney8HVxv6+FTxiWO/RlpkCmsKUPfxfEmOp 3OL5hIEC6Hy3NRM0SxsgBlgkJDXFAl76xt2Jhr3ubm2MwW1OnWFmulz6lMd4M8kl dNGax0tyihpFmBKp8EZDFmwBK/i3silt3ghTLpW0u9O85ynqyFN23ukJWQWYFdEN oqsEObbJkaWYv9iVQZaRuHrADTHAnQg1aHResJ6X4t91qEb2SBuvVeQbNlx05D3t 9ktkW3QBXl8H7bHiMYIgbWhrsQtFaD5BDP8GpGj/gwozY6//yqViglQEds2nqYkg 121qTsptCDWo13hYOG9C52KU739aFmr7XkmTziI1zAScmHO2JcKzrcILATQVcchk N+i1HlehsMQKfVP14f6QpDu4SI+6oBu3cZAHGwoMCt3LftHzNKrbbWzu2qo3w15J BHAkLt/HsqZ15mEGZEnF =56TS -----END PGP SIGNATURE----- --=-WoupWSypRCFmHSSGi2Nb-- --===============1488462778690791428== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v cmcveGVuLWRldmVsCg== --===============1488462778690791428==--