From: Dario Faggioli
Subject: Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
Date: Wed, 2 Aug 2017 12:14:01 +0200
To: Stefano Stabellini
Cc: xen-devel@lists.xenproject.org, Julien Grall, Jan Beulich, Andrew Cooper

Hey,

apologies in advance for the very long email! :-O

On Tue, 2017-08-01 at 12:13 -0700, Stefano Stabellini wrote:
> Given the description below, it's possible that the new timer has to
> fire several times before the callback can be finally invoked, right?
>
> I would make some measurements to check how many times the timer has
> to fire (how long we actually need to wait before calling the
> callback) in various scenarios. Then I would choose a value to
> minimize the number of unnecessary wake-ups.
>
You seem to be thinking of this timer as something that disturbs the
idleness of a CPU. It's not, because the CPU is not idle! It has a
callback waiting to be invoked, and it had better do that as soon as
possible, in order to:

- free the memory/resources as soon as practically possible. E.g., if
  we wait for, say, 1 sec, then destroying a VM with a passed-through
  I/O range, and then trying to create another one with the same
  resource assigned, would fail until that 1 sec expires. True, we
  save more power on that CPU, but striving for that looks neither
  fair nor correct to me, in this situation. Power should indeed be
  saved, and unnecessary wake-ups prevented, but when there isn't
  anything to do! If there is stuff to do, and this is the case, our
  goal should be to get it done as soon as possible, so that,
  afterwards, we can go into a (hopefully) long sleep.

- by delaying detection of a finished grace period, we're also
  delaying the beginning of a new one. This means that, in case there
  are RCU operations waiting for it (because they arrived too late for
  the current one, and have been queued), we're potentially delaying
  them significantly too. So, basically, this again looks like the
  same kind of paradox to me, where we have stuff to do, but we decide
  to try to save power instead.

RCU is not a mechanism for letting CPUs stay idle longer; it's a
lockless and asymmetric serialization primitive. And, as with any
serialization primitive --either lock-based or lockless-- keeping the
duration of critical sections small is rather important.
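To make the first bullet concrete, here is a minimal sketch of how a
resource might be freed through our call_rcu(); the iorange structure
and its fields are made up just for illustration. The memory only
actually goes away when the grace period ends and the callback runs,
so any delay in detecting the end of the grace period directly delays
when the resource can be assigned to a new VM:

    #include <xen/rcupdate.h>
    #include <xen/xmalloc.h>

    /* Hypothetical wrapper around a passed-through I/O range. */
    struct iorange {
        struct rcu_head rcu;
        unsigned long start, end;
    };

    static void iorange_free(struct rcu_head *head)
    {
        /* Runs only once the grace period has ended. */
        xfree(container_of(head, struct iorange, rcu));
    }

    static void iorange_destroy(struct iorange *r)
    {
        /* Unpublish the range, then defer the actual freeing: readers
         * still inside their RCU read-side critical sections may keep
         * using it until they're done. */
        call_rcu(&r->rcu, iorange_free);
    }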
So, I think the goal here is not minimizing the wakeups; rather, it is
completing the grace period sooner rather than later. In theory, if
rcu_needs_cpu() returns true, we may very well not even let the CPU go
idle, and have it spin (which indeed is what happens without this
patch). That would guarantee prompt detection of CPU quiescence, and
of the ending of a grace period.

Then, since we know that spinning consumes a lot of power, and
generates overhead that, potentially, may even affect and slow down
activities going on on other CPUs, we let the CPU go idle with a timer
armed. But that's a compromise, and can't be confused with the primary
goal. And for every nanosecond that we spend, on that CPU, consuming
less power than we'd have consumed running a tight loop around
rcu_pending(), we should be thankful, instead of regretting that there
could have been more. :-)
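In code, that compromise looks something like the following minimal
sketch (the function and field names are illustrative, not necessarily
what the patch actually uses):

    /* Simplified idle-entry path: if this CPU has pending callbacks,
     * don't let it sleep indefinitely; arm a timer so it wakes up
     * periodically, goes through do_softirq(), and rcu_pending() can
     * detect the end of the grace period. */
    void rcu_idle_enter(unsigned int cpu)
    {
        if ( rcu_needs_cpu(cpu) )
            set_timer(&per_cpu(rcu_data, cpu).idle_timer,
                      NOW() + RCU_IDLE_TIMER_PERIOD);
    }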
In Linux, as said a few times, they do the same, and the role of this
new timer I'm introducing here is played by the tick. The tick period
is controlled by the HZ config variable, possible values for which (in
2.6.21, where CONFIG_NO_HZ was introduced for the first time for all
arches) are 100Hz, 250Hz, 300Hz and 1000Hz, which translate into 10ms,
4ms, 3.33ms and 1ms. The default is 250Hz == 4ms. The periodic tick is
the minimum time granularity of the kernel for a lot of subsystems
(especially in that kernel), so it's something that can be considered
to run pretty frequently. As also said many times, we don't have any
such thing, but I still think we should use something with a similar
order of magnitude.

[NOTE that I'm not using Linux as an example because what is done in
Linux is always correct and always what we should do too. However:
(1) RCU has basically been invented for Linux, or at least Linux is
where it has seen the widest adoption and development, and (2) our RCU
code has been (not in the best possible way, allow me to say) imported
from there. Therefore, looking at what's done in Linux seems
reasonable to me, in this case.]

About the kind of measurements you're suggesting doing: I think I
understand the rationale behind it, but at the same time, I'm quite
sure that what we'd get would be pseudo-random numbers. In fact, the
length of this "wait time", although it has a relationship with the
workload being run, also depends on a lot of other things within the
*system* as a whole. It would be basically impossible to run any
meaningful statistical analysis on the results and come up with
something that is good enough in general.

Consider, for instance, how this is not at all architecture-specific
code, and yet the behavior we're seeing is so different between x86
and ARM. Consider how this doesn't really have anything to do with the
scheduler, and yet the behavior we observe varies depending on some
internal implementation details of a scheduler!

What follows is a more detailed explanation of how things work, in a
simple enough scenario.

 + == CPU is busy
 . == CPU is idle
 R == when call_rcu() is invoked (i.e., beginning of a new grace period)
 Q == CPU quiesces (when all have done so, the grace period ends).
      This happens when do_softirq() is called on the CPU
 C == pending callbacks are invoked

                      t1
                      |
                      |
CPU1 +++R+++++++++++++Q..............
CPU2 ++++++++++++++++++++++++++Q++...
                               |
                               |
                               t2

If CPU1 is where call_rcu() happens, and CPU2 is the only other busy
CPU in the system at the time, and T is the period of our timer, the
following cases are possible:

x) t2 < t1. In this case, the timer never fires:

                         t1
                         |
                         |
   CPU1 +++R+++++++++++++QC.........
   CPU2 ++++++++++++++Q.............
                      |
                      |
                      t2

x) t2 - t1 < T. In this case, the timer fires only once:

                    t1      t1+T
                    |       |
                    |       |
   CPU1 +++R++++++++Q.......C......
   CPU2 +++++++++++++++++Q.........
                         |
                         |
                         t2

x) t2 - t1 > (n-1)*T. In this case, the timer fires n times (n=4 in
   the example):

                t1      t1+T    t1+2T   t1+3T   t1+4T
                |       |       |       |       |
                |       |       |       |       |
   CPU1 +R++++++Q...............................C..
   CPU2 ++++++++++++++++++++++++++++++++++++Q......
                                            |
                                            |
                                            t2

So, you're suggesting measuring t2-t1 under different workloads, and
then doing some kind of averaging/statistical analysis on it. What I'm
saying is that the occurrence of those events marked 'Q', on the
appropriate CPU, is really hard to predict.
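For completeness, the three cases collapse into one relationship: the
timer fires at t1+T, t1+2T, ..., and the first firing at or after t2
is the one that finds the grace period over. A tiny sketch of that
arithmetic (plain C; treating the time values as, say, nanoseconds):

    #include <stdint.h>

    /* Number of timer firings before the callback can run: 0 if CPU2
     * quiesced before CPU1 went idle (t2 <= t1, the timer is never
     * needed), otherwise ceil((t2 - t1) / T). */
    static unsigned int timer_firings(uint64_t t1, uint64_t t2, uint64_t T)
    {
        return t2 <= t1 ? 0 : (unsigned int)((t2 - t1 + T - 1) / T);
    }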
In fact, t2-t1 is, basically, the difference between the times at
which the two CPUs call do_softirq(). How much is the frequency of
invocation of do_softirq() related to the actual characteristics of
the workload we're running? Well, a relationship surely exists, but
not one that I think is easy (possible?) to generalize... And here
we're also talking about considering *different* workloads! :-O

For instance, because of either hard-affinity, soft-affinity, or
scheduling weights, there may be more or fewer preemptions on the CPUs
in question (and a preemption means going through do_softirq())... And
that's not related to the load a particular workload produces on the
system, as it's in total control of the user. So we'd have to measure
not only with different workloads but, for each workload, with a
certain number of variations of the scheduling parameters of the
vCPUs! :-O

On x86, there's an upper bound to the time CPUs will stay idle,
because there's always (something like) a timer running (for the time
sync rendezvous, said Andrew; I haven't checked myself, but have no
reason to doubt that's the case), while on ARM there's no such thing
(or we wouldn't be here! :-D). So we'd have to measure each workload,
with each combination of parameters, on each (present and future)
architecture we support! :-O

CPU onlining and offlining, or CPUs moving around cpupools, can cause
timers (and even RCU callbacks, in the online/offline case!) to be
moved around, which means they'd be firing (and a timer firing means
going through do_softirq()) on different CPUs from the ones we
expected. So we'd have to measure each workload, with each combination
of parameters, on each architecture, while doing CPU online/offline,
and while moving CPUs around pools! :-O

And these were just the few examples I could fish from the top of my
head... No, IMO, there are too many variables and too many unrelated
factors involved to even hope that any set of measurements we actually
manage to put together can be representative of the general case.

*PERHAPS* we can think of a heuristic... provided it's not too
complex, or at least, that it does not try to capture all this
variability (because that would be equally impossible). Something like
this: when a CPU with queued callbacks quiesces and then goes idle, we
know how many other CPUs, among the ones that are participating in the
grace period, have not quiesced yet. Instead of using, say, always
10ms, we can program the timer to fire at something like
1ms*cpumask_weight(not_yet_quiesced_cpus).

In fact, I think it makes sense to assume that, if only a few of the
CPUs involved in the grace period have not quiesced yet, then the
grace period itself is about to finish, and we should check frequently
whether or not that has happened. If, OTOH, a lot of the CPUs involved
in the grace period haven't quiesced yet, we can check whether the
grace period has ended in a less eager way. This will make the period
of the timer vary, i.e., it may differ between two subsequent
set_timer() calls, but that is not a problem at all. Basically, we're
making the timer kind of adaptive.

This does not work well in the case where there is, e.g., only 1 CPU
that has not quiesced yet (so we'd set the timer to fire very
frequently), but it takes a long time for that CPU to do so. Well,
it's a heuristic, so it can't always work well... but I think this
isn't a common situation.

Would you be happy with something like this?
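To make the idea concrete, here is a minimal sketch of the adaptive
period, assuming that rcp->cpumask is the mask of CPUs that still need
to quiesce, and that rdp->idle_timer is the per-CPU timer introduced
by this patch (the names and the 1ms base value are illustrative):

    #define RCU_IDLE_TIMER_BASE MILLISECS(1)

    static void rcu_idle_timer_start(struct rcu_data *rdp,
                                     struct rcu_ctrlblk *rcp)
    {
        /* The fewer CPUs still holding up the grace period, the
         * sooner it is likely to end: scale the period accordingly,
         * so we re-check frequently when the end looks close, and
         * more lazily when many CPUs still have to quiesce. */
        s_time_t period = RCU_IDLE_TIMER_BASE *
                          cpumask_weight(&rcp->cpumask);

        set_timer(&rdp->idle_timer, NOW() + period);
    }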
Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)