From: Dario Faggioli
Subject: Re: [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.
Date: Wed, 2 Aug 2017 12:14:01 +0200
To: Stefano Stabellini
Cc: xen-devel@lists.xenproject.org, Julien Grall, Jan Beulich, Andrew Cooper

Hey,

apologies in advance for the very long email! :-O

On Tue, 2017-08-01 at 12:13 -0700, Stefano Stabellini wrote:
> Given the description below, it's possible that the new timer has to
> fire several times before the callback can be finally invoked, right?
>
> I would make some measurements to check how many times the timer has
> to fire (how long we actually need to wait before calling the
> callback) in various scenarios. Then I would choose a value to
> minimize the number of unnecessary wake-ups.
>
You seem to be thinking of this timer as something that disturbs the
idleness of a CPU. It's not, because the CPU is not idle! It has a
callback waiting to be invoked, and it had better do that as soon as
possible, in order to:

- free the memory/resources as soon as practically possible. E.g., if
  we wait for, say, 1 sec, then destroying a VM with a passed-through
  I/O range, and then trying to create another one with the same
  resource assigned, would fail until that 1 sec expires. True, we
  save more power on that CPU, but striving for that looks neither
  fair nor correct to me, in this situation. Power should indeed be
  saved, and unnecessary wake-ups prevented, but when there isn't
  anything to do! If there is stuff to do, and this is the case, our
  goal should be to get it done as soon as possible, so that,
  afterwards, we can go into a (hopefully) long sleep.

- by delaying detection of a finished grace period, we're also
  delaying the beginning of a new one. This means that, in case there
  are RCU operations waiting for it (because they arrived too late for
  the current one, and have been queued), we're potentially delaying
  them significantly too. So, basically, this again looks like the
  same kind of paradox to me, where we have stuff to do, but we decide
  to try to save power instead.

RCU is not a mechanism for letting CPUs stay idle longer; it's a
lockless and asymmetric serialization primitive. And, as with any
serialization primitive --either lock-based or lockless-- keeping the
duration of critical sections small is rather important.
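To make the first bullet concrete, here is a minimal sketch of how a
resource might be freed through our call_rcu(); the iorange structure
and its fields are made up just for illustration. The memory only
actually goes away when the grace period ends and the callback runs,
so any delay in detecting the end of the grace period directly delays
when the resource can be assigned to a new VM:

    #include <xen/rcupdate.h>
    #include <xen/xmalloc.h>

    /* Hypothetical wrapper around a passed-through I/O range. */
    struct iorange {
        struct rcu_head rcu;
        unsigned long start, end;
    };

    static void iorange_free(struct rcu_head *head)
    {
        /* Runs only once the grace period has ended. */
        xfree(container_of(head, struct iorange, rcu));
    }

    static void iorange_destroy(struct iorange *r)
    {
        /* Unpublish the range, then defer the actual freeing: readers
         * still inside their RCU read-side critical sections may keep
         * using it until they're done. */
        call_rcu(&r->rcu, iorange_free);
    }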
So, I think the goal here is not minimizing the wakeups; rather, it is
completing the grace period sooner rather than later. In theory, if
rcu_needs_cpu() returns true, we may very well not even let the CPU go
idle, and have it spin (which indeed is what happens without this
patch). That would guarantee prompt detection of CPU quiescence, and
of the ending of a grace period.

Then, since we know that spinning consumes a lot of power, and
generates overhead that, potentially, may even affect and slow down
activities going on on other CPUs, we let the CPU go idle with a timer
armed. But that's a compromise, and can't be confused with the primary
goal. And for every nanosecond that we spend, on that CPU, consuming
less power than we'd have consumed running a tight loop around
rcu_pending(), we should be thankful, instead of regretting that there
could have been more. :-)
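In code, that compromise looks something like the following minimal
sketch (the function and field names are illustrative, not necessarily
what the patch actually uses):

    /* Simplified idle-entry path: if this CPU has pending callbacks,
     * don't let it sleep indefinitely; arm a timer so it wakes up
     * periodically, goes through do_softirq(), and rcu_pending() can
     * detect the end of the grace period. */
    void rcu_idle_enter(unsigned int cpu)
    {
        if ( rcu_needs_cpu(cpu) )
            set_timer(&per_cpu(rcu_data, cpu).idle_timer,
                      NOW() + RCU_IDLE_TIMER_PERIOD);
    }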
In Linux, as said a few times, they do the same, and the role of this
new timer I'm introducing here is played by the tick. The tick period
is controlled by the HZ config variable, possible values for which (in
2.6.21, where CONFIG_NO_HZ was introduced for the first time for all
arches) are 100Hz, 250Hz, 300Hz and 1000Hz, which translate into 10ms,
4ms, 3.33ms and 1ms. The default is 250Hz == 4ms. The periodic tick is
the minimum time granularity of the kernel for a lot of subsystems
(especially in that kernel), so it's something that can be considered
to run pretty frequently. As also said many times, we don't have any
such thing, but I still think we should use something with a similar
order of magnitude.

[NOTE that I'm not using Linux as an example because what is done in
Linux is always correct and always what we should do too. However:
(1) RCU has basically been invented for Linux, or at least Linux is
where it has seen the widest adoption and development, and (2) our RCU
code has been (not in the best possible way, allow me to say) imported
from there. Therefore, looking at what's done in Linux seems
reasonable to me, in this case.]

About the kind of measurements you're suggesting doing: I think I
understand the rationale behind it, but at the same time, I'm quite
sure that what we'd get would be pseudo-random numbers. In fact, the
length of this "wait time", although it has a relationship with the
workload being run, also depends on a lot of other things within the
*system* as a whole. It would be basically impossible to run any
meaningful statistical analysis on the results and come up with
something that is good enough in general.

Consider, for instance, how this is not at all architecture-specific
code, and yet the behavior we're seeing is so different between x86
and ARM. Consider how this doesn't really have anything to do with the
scheduler, and yet the behavior we observe varies depending on some
internal implementation details of a scheduler!

What follows is a more detailed explanation of how things work, in a
simple enough scenario.

 + == CPU is busy
 . == CPU is idle
 R == when call_rcu() is invoked (i.e., beginning of a new grace period)
 Q == CPU quiesces (when all have done so, the grace period ends).
      This happens when do_softirq() is called on the CPU
 C == pending callbacks are invoked

                      t1
                      |
                      |
CPU1 +++R+++++++++++++Q..............
CPU2 ++++++++++++++++++++++++++Q++...
                               |
                               |
                               t2

If CPU1 is where call_rcu() happens, and CPU2 is the only other busy
CPU in the system at the time, and T is the period of our timer, the
following cases are possible:

x) t2 < t1. In this case, the timer never fires:

                         t1
                         |
                         |
   CPU1 +++R+++++++++++++QC.........
   CPU2 ++++++++++++++Q.............
                      |
                      |
                      t2

x) t2 - t1 < T. In this case, the timer fires only once:

                    t1      t1+T
                    |       |
                    |       |
   CPU1 +++R++++++++Q.......C......
   CPU2 +++++++++++++++++Q.........
                         |
                         |
                         t2

x) t2 - t1 > (n-1)*T. In this case, the timer fires n times (n=4 in
   the example):

                t1      t1+T    t1+2T   t1+3T   t1+4T
                |       |       |       |       |
                |       |       |       |       |
   CPU1 +R++++++Q...............................C..
   CPU2 ++++++++++++++++++++++++++++++++++++Q......
                                            |
                                            |
                                            t2

So, you're suggesting measuring t2-t1 under different workloads, and
then doing some kind of averaging/statistical analysis on it. What I'm
saying is that the occurrence of those events marked 'Q', on the
appropriate CPU, is really hard to predict.
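For completeness, the three cases collapse into one relationship: the
timer fires at t1+T, t1+2T, ..., and the first firing at or after t2
is the one that finds the grace period over. A tiny sketch of that
arithmetic (plain C; treating the time values as, say, nanoseconds):

    #include <stdint.h>

    /* Number of timer firings before the callback can run: 0 if CPU2
     * quiesced before CPU1 went idle (t2 <= t1, the timer is never
     * needed), otherwise ceil((t2 - t1) / T). */
    static unsigned int timer_firings(uint64_t t1, uint64_t t2, uint64_t T)
    {
        return t2 <= t1 ? 0 : (unsigned int)((t2 - t1 + T - 1) / T);
    }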
In fact, t2-t1 is, basically, the difference between the times at
which the two CPUs call do_softirq(). How much is the frequency of
invocation of do_softirq() related to the actual characteristics of
the workload we're running? Well, a relationship surely exists, but
not one that I think is easy (possible?) to generalize... And here
we're also talking about considering *different* workloads! :-O

For instance, because of either hard-affinity, soft-affinity, or
scheduling weights, there may be more or fewer preemptions on the CPUs
in question (and a preemption means going through do_softirq())... And
that's not related to the load a particular workload produces on the
system, as it's in total control of the user. So we'd have to measure
not only with different workloads but, for each workload, with a
certain number of variations of the scheduling parameters of the
vCPUs! :-O

On x86, there's an upper bound to the time CPUs will stay idle,
because there's always (something like) a timer running (for the time
sync rendezvous, said Andrew; I haven't checked myself, but have no
reason to doubt that's the case), while on ARM there's no such thing
(or we wouldn't be here! :-D). So we'd have to measure each workload,
with each combination of parameters, on each (present and future)
architecture we support! :-O

CPU onlining and offlining, or CPUs moving around cpupools, can cause
timers (and even RCU callbacks, in the online/offline case!) to be
moved around, which means they'd be firing (and a timer firing means
going through do_softirq()) on different CPUs from the ones we
expected. So we'd have to measure each workload, with each combination
of parameters, on each architecture, while doing CPU online/offline,
and while moving CPUs around pools! :-O

And these were just the few examples I could fish from the top of my
head... No, IMO, there are too many variables and too many unrelated
factors involved to even hope that any set of measurements we actually
manage to put together can be representative of the general case.

*PERHAPS* we can think of a heuristic... provided it's not too
complex, or at least, that it does not try to capture all this
variability (because that would be equally impossible). Something like
this: when a CPU with queued callbacks quiesces and then goes idle, we
know how many other CPUs, among the ones that are participating in the
grace period, have not quiesced yet. Instead of using, say, always
10ms, we can program the timer to fire at something like
1ms*cpumask_weight(not_yet_quiesced_cpus).

In fact, I think it makes sense to assume that, if only a few of the
CPUs involved in the grace period have not quiesced yet, then the
grace period itself is about to finish, and we should check frequently
whether or not that has happened. If, OTOH, a lot of the CPUs involved
in the grace period haven't quiesced yet, we can check whether the
grace period has ended in a less eager way. This will make the period
of the timer vary, i.e., it may differ between two subsequent
set_timer() calls, but that is not a problem at all. Basically, we're
making the timer kind of adaptive.

This does not work well in the case where there is, e.g., only 1 CPU
that has not quiesced yet (so we'd set the timer to fire very
frequently), but it takes a long time for that CPU to do so. Well,
it's a heuristic, so it can't always work well... but I think this
isn't a common situation.

Would you be happy with something like this?
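To make the idea concrete, here is a minimal sketch of the adaptive
period, assuming that rcp->cpumask is the mask of CPUs that still need
to quiesce, and that rdp->idle_timer is the per-CPU timer introduced
by this patch (the names and the 1ms base value are illustrative):

    #define RCU_IDLE_TIMER_BASE MILLISECS(1)

    static void rcu_idle_timer_start(struct rcu_data *rdp,
                                     struct rcu_ctrlblk *rcp)
    {
        /* The fewer CPUs still holding up the grace period, the
         * sooner it is likely to end: scale the period accordingly,
         * so we re-check frequently when the end looks close, and
         * more lazily when many CPUs still have to quiesce. */
        s_time_t period = RCU_IDLE_TIMER_BASE *
                          cpumask_weight(&rcp->cpumask);

        set_timer(&rdp->idle_timer, NOW() + period);
    }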
Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)