```xen-devel.lists.xenproject.org archive mirror
help / color / mirror / Atom feed```
```From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xenproject.org
Cc: Anshul Makkar <anshul.makkar@citrix.com>,
George Dunlap <george.dunlap@citrix.com>,
David Vrabel <david.vrabel@citrix.com>
Subject: [PATCH 10/19] xen: credit2: rework load tracking logic
Date: Sat, 18 Jun 2016 01:12:29 +0200	[thread overview]
Message-ID: <146620514940.29766.7233346867325214395.stgit@Solace.fritz.box> (raw)

maintain, and not entirely consistent. This is due to a
number of reasons:
- code and comments were not in perfect sync, making it
difficult to figure out what the intent of a particular
choice was (e.g., the choice of 18 for load_window_shift);
- the math, although effective, was not entirely consistent.
In fact, we were doing (if W is the lenght of the window):

which does not match any known variant of 'smoothing
moving average'. In fact, it should have been:

(for details on why, see the doc comments inside this
patch.). Furthermore, with

The reason why the formula above sort of worked was because
the number of bits used for the fractional parts of the
values used in fixed point math and the number of bits used
for the lenght of the window were the same (load_window_shift
was being used for both).

This may look handy, but it introduced a (not especially well
documented) dependency between the lenght of the window and
the precision of the calculations, which really should be
two independent things. Especially if treating them as such
(like it is done in this patch) does not lead to more
complex maths (same number of multiplications and shifts, and
there is still room for some optimization).

Therefore, in this patch, we:
- split length of the window and precision (and, since there
is already a command line parameter for length of window,
introduce one for precision too),
- align the math with one proper incarnation of exponential
math used.

While there fix a couple of style issues as well (pointless

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
docs/misc/xen-command-line.markdown |   30 +++
xen/common/sched_credit2.c          |  328 ++++++++++++++++++++++++++++++-----
2 files changed, 310 insertions(+), 48 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index fed732c..29a554b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -477,9 +477,39 @@ the address range the area should fall into.
### credit2\_balance\_under
> `= <integer>`

+> `= <integer>`
+
+> Default: `18`
+
+Specify the number of bits to use for the fractional part of the
+
> `= <integer>`

+> Default: `30`
+
+Specify the number of bits to use for represent the length of the
+window (in nanoseconds) we use for load tracking inside Credit2.
+This means that, with the default value (30), we use
+2^30 nsec ~= 1 sec long window.
+
+Load tracking is done by means of a variation of exponentially
+weighted moving average (EWMA). The window length defined here
+is what tells for how long we give value to previous history
+of the load itself. In fact, after a full window has passed,
+what happens is that we discard all previous history entirely.
+
+A short window will make the load balancer quick at reacting
+(and hence, e.g., long term load trends). A long window will
+make the load balancer thoughtful of previous history (and
+hence capable of capturing, e.g., long term load trends), but
+also slow in responding to load changes.
+
+The default value of `1 sec` is rather long.
+
### credit2\_runqueue
> `= core | socket | node | all`

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index c534f9c..230a512 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -173,16 +173,88 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
#define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])

/*
- * Shifts for load average.
- * - granularity: Reduce granularity of time by a factor of 1000, so we can use 32-bit maths
- * - window shift: Given granularity shift, make the window about 1 second
- * - scale shift: Shift up load by this amount rather than using fractions; 128 corresponds
- *   to a load of 1.
+ *
+ * Load history of runqueues and vcpus is accounted for by using an
+ * exponential weighted moving average algorithm. However, instead of using
+ * fractions,we shift everything to left by the number of bits we want to
+ * use for representing the fractional part (Q-format).
+ *
+ * We may also want to reduce the precision of time accounting, to
+ * accommodate 'longer  windows'. So, if that is the case, we just need to
+ * shift all time samples to the right.
+ *
+ * The details of the formulas used for load tracking are explained close to
+ * __update_runq_load(). Let's just say here that, with full nanosecond time
+ * granularity, a 30 bits wide 'decaying window' is ~1 second long.
+ *
+ * We want to consider the following equations:
+ *
+ *  avg[i+1] = avg[i] + delta*load*P/W - delta*avg[i]/W,  0 <= delta <= W
+ *
+ * where W is the lenght of the window, P the multiplier for transitiong into
+ * Q-format fixed point arithmetic and load is the instantaneous load of a
+ * runqueue, which basically is the number of runnable vcpus there are on the
+ * runqueue (for the meaning of the other terms, look at the doc comment to
+ *
+ *  So, with the default values defined below, we have:
+ *
+ *  W = 2^30
+ *  P = 2^18
+ *
+ * The maximum possible value for the average load, which we want to store in
+ * s_time_t type variables (i.e., we have 63 bits available) is load*P. This
+ * means that, with P 18 bits wide, load can occupy 45 bits. This in turn
+ * means we can have 2^45 vcpus in each runqueue, before overflow occurs!
+ *
+ * However, it can happen that, at step j+1, if:
+ *
+ *  delta = W
+ *
+ * then:
+ *
+ *
+ * So we must be able to deal with W*load*P. This means load can't be higher
+ * than:
+ *
+ *  2^(63 - 30 - 18) = 2^15 = 32768
+ *
+ * So 32768 is the maximum number of vcpus the we can have in a runqueue,
+ * at any given time, and still not have problems with the load tracking
+ * calculations... and this is more than fine.
+ *
+ * If/when, for some reason, this will not be acceptable any longer, we can
+ * act on time granularity, window lenght or precision (or a combination of
+ * them). For instance, reducing the granularity to microseconds we could
+ * switch to W=2^20 and still have 18 fractional bits and a 1 second long
+ * window (which would mean 2^25 = 33554432 vcpus per runq and no overflow).
+ */
+
+/* If >0, decreases the granularity of time samples used for load tracking. */
+/* Time window during which we still give value to previous load history. */
+/* 18 bits by default (and not less than 4) for decimals. */
+
+/*
+ * Both the lenght of the window and the number of fractional bits can be
+ * decided with boot parameters.
+ *
+ * When LOADAVG_GRANULARITY_SHIFT is 0, the length of the window is given in
+ * nanoseconds. The same is true for a granularity of 10 (== microseconds) and
+ * a window of 20 (the default).
*/
+
@@ -279,6 +351,7 @@ struct csched2_private {
cpumask_t active_queues; /* Queues which may have active cpus */
struct csched2_runqueue_data rqd[NR_CPUS];

};

@@ -387,19 +460,148 @@ __runq_elem(struct list_head *elem)
return list_entry(elem, struct csched2_vcpu, runq_elem);
}

+/*
+ * Track the runq load by gathering instantaneous load samples, and using
+ * exponentially weighted moving average (EWMA) for the 'decaying'.
+ *
+ * We consider a window of lenght W=2^(prv->load_window_shift) nsecs.
+ * (The actual calculations may use coarser granularity time sampling,
+ * if LOADAVG_GRANULARITY_SHIFT is not 0.)
+ *
+ * If load is the instantaneous load, the formula for EWMA looks as follows,
+ * for the i-eth sample:
+ *
+ *  avg[i] = a*load + (1 - a)*avg[i-1]
+ *
+ * where avg[i] is the new value of the average load, avg[i-1] is the value
+ * of the average load calculated so far, and a is a coefficient less or
+ * equal to 1.
+ *
+ * So, for us, it becomes:
+ *
+ *
+ * For determining a, we consider _when_ we are doing the load update, wrt
+ * the lenght of the window. We define delta as follows:
+ *
+ *  delta = t - load_last_update
+ *
+ * where t is current time (i.e., time at which we are both sampling and
+ * updating the load average) and load_last_update is the last time we did
+ * that.
+ *
+ * There are two possible situations:
+ *
+ * a) delta <= W
+ *    this means that, during the last window of lenght W, the runeuque load
+ *    was avgload for (W - detla) time, and load for delta time:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *     -------------------------|---------|---
+ *                |             |         |
+ *                \__W - delta__/\_delta__/
+ *                |             |         |
+ *
+ *    So, what about using delta/W as our smoothing coefficient a. If we do,
+ *    here's what happens:
+ *
+ *     a = delta / W
+ *     1 - a = 1 - (delta / W) = (W - delta) / W
+ *
+ *    Which matches the above description of what happened in the last
+ *    window of lenght W.
+ *
+ *    Note that this also means that the weight that we assign to both the
+ *    latest load sample, and to previous history, varies at each update.
+ *    The longer the latest load sample has been in efect, within the last
+ *    window, the higher it weights (and the lesser the previous history
+ *    weights).
+ *
+ *    This is some sort of extension of plain EWMA to fit even better to our
+ *    use case.
+ *
+ * b) delta > W
+ *    this means more than a full window has passed since the last update:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *     ----|------------------------------|---
+ *         |                              |
+ *         \_________________delta________/
+ *
+ *    Basically, it means the last load sample has been in effect for more
+ *    than W time, and hence we should just use it, and forget everything
+ *    before that.
+ *
+ *    This can be seen as a 'reset condition', occurring when, for whatever
+ *    reason, load has not been updated for longer than we expected. (It is
+ *    also how avgload is assigned its first value.)
+ *
+ * The formula for avgload then becomes:
+ *
+ *
+ * So, final form is:
+ *
+ *
+ * As a confirmation, let's look at the extremes, when delta is 0 (i.e.,
+ * what happens if we  update the load twice, at the same time instant?):
+ *
+ *
+ * and when delta is W (i.e., what happens if we update at the last
+ * possible instant before the window 'expires'?):
+ *
+ *
+ * Which, in both cases, is what we expect.
+ */
static void
struct csched2_runqueue_data *rqd, int change, s_time_t now)
{
struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
+    unsigned int P, W;

+    /*
+     * To avoid using fractions, we shift to left by load_precision_shift,
+     * and use the least last load_precision_shift bits as fractional part.
+     * Looking back at the formula we want to use, we now have:
+     *
+     *
+     * And if we are ok storing and using P*avgload, we can rewrite this as:
+     *
+     *
+     * Coupled with, of course:
+     *
+     */
+
+    if ( rqd->load_last_update + (1ULL << W)  < now )
{
}
else
{
@@ -411,26 +613,39 @@ __update_runq_load(const struct scheduler *ops,
delta = 0;
}

-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
+        /*
+         * Note that, if we were to enforce (or check) some relationship
+         * between P and W, we may save one shift. E.g., if we are sure
+         * that P < W, we could write:
+         *
+         *  (delta * (load << P)) >> W
+         *
+         * as:
+         *
+         *  (delta * load) >> (W - P)
+         */
+                       ((delta * (load << P)) >> W) -
+                       ((delta * rqd->avgload) >> W);
+                         ((delta * (load << P)) >> W) -
+                         ((delta * rqd->b_avgload) >> W);
}

+
{
struct {
} d;
-        d.rq_id=rqd->id;
+        d.rq_id = rqd->id;
+        d.shift = P;
sizeof(d),
(unsigned char *)&d);
@@ -442,8 +657,8 @@ __update_svc_load(const struct scheduler *ops,
struct csched2_vcpu *svc, int change, s_time_t now)
{
struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
+    unsigned int P, W;

if ( change == -1 )
@@ -452,11 +667,13 @@ __update_svc_load(const struct scheduler *ops,
else

+    if ( svc->load_last_update + (1ULL << W) < now )
{
}
else
{
@@ -468,20 +685,22 @@ __update_svc_load(const struct scheduler *ops,
delta = 0;
}

-            ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )
+                       ((delta * (vcpu_load << P)) >> W) -
+                       ((delta * svc->avgload) >> W);
}

{
struct {
unsigned vcpu:16, dom:16;
+            unsigned shift;
} d;
d.dom = svc->vcpu->domain->domain_id;
d.vcpu = svc->vcpu->vcpu_id;
+        d.shift = P;
sizeof(d),
(unsigned char *)&d);
@@ -903,7 +1122,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
svc->credit = CSCHED2_CREDIT_INIT;
svc->weight = svc->sdom->weight;
/* Starting load of 50% */
}
else
@@ -1152,7 +1371,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
vcpu_schedule_unlock_irq(lock, vc);
}

static int
csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
{
@@ -1447,15 +1666,19 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
if ( i > cpus_max )
cpus_max = i;

-        /* If we're under 100% capacaty, only shift if load difference
-         * is > 1.  otherwise, shift if under 12.5% */
+        /*
+         * If we're under 100% capacaty, only shift if load difference
+         * is > 1.  otherwise, shift if under 12.5%
+         */
{
goto out;
}
else
goto out;
}

@@ -1965,7 +2188,7 @@ csched2_schedule(
}

static void
-csched2_dump_vcpu(struct csched2_vcpu *svc)
+csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
{
printk("[%i.%i] flags=%x cpu=%i",
svc->vcpu->domain->domain_id,
@@ -1975,6 +2198,9 @@ csched2_dump_vcpu(struct csched2_vcpu *svc)

printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);

+
printk("\n");
}

@@ -2012,7 +2238,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
if ( svc )
{
printk("\trun: ");
-        csched2_dump_vcpu(svc);
+        csched2_dump_vcpu(prv, svc);
}

loop = 0;
@@ -2022,7 +2248,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
if ( svc )
{
printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
}
}

@@ -2051,8 +2277,8 @@ csched2_dump(const struct scheduler *ops)
for_each_cpu(i, &prv->active_queues)
{
s_time_t fraction;
-
+

cpulist_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].active);
printk("Runqueue %d:\n"
@@ -2060,12 +2286,13 @@ csched2_dump(const struct scheduler *ops)
"\tcpus               = %s\n"
"\tmax_weight         = %d\n"
i,
cpustr,
prv->rqd[i].max_weight,
fraction);

@@ -2096,7 +2323,7 @@ csched2_dump(const struct scheduler *ops)
lock = vcpu_schedule_lock(svc->vcpu);

printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);

vcpu_schedule_unlock(lock, svc->vcpu);
}
@@ -2357,18 +2584,22 @@ csched2_init(struct scheduler *ops)
" WARNING: This is experimental software in development.\n" \
" Use at your own risk.\n");

printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);

{
-        printk("%s: opt_load_window_shift %d below min %d, resetting\n",
+        printk("WARNING: %s: opt_load_precision_shift %d below min %d, resetting\n",
}

+    printk(XENLOG_INFO "load tracking window lenght %llu ns\n",
+
/* Basically no CPU information is available at this point; just
* set up basic structures, and a callback when the CPU info is
* available. */
@@ -2388,6 +2619,7 @@ csched2_init(struct scheduler *ops)
prv->rqd[i].id = -1;
}

return 0;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
```

```next prev parent reply	other threads:[~2016-06-17 23:12 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
2016-06-20  7:48   ` Jan Beulich
2016-07-07 10:11     ` Dario Faggioli
2016-06-21 16:17   ` anshul makkar
2016-07-06 15:41   ` George Dunlap
2016-07-07 10:25     ` Dario Faggioli
2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
2016-06-18  0:36   ` Meng Xu
2016-07-06 15:52   ` George Dunlap
2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
2016-06-21 16:41   ` anshul makkar
2016-07-06 15:59   ` George Dunlap
2016-06-17 23:11 ` [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu Dario Faggioli
2016-07-06 16:02   ` George Dunlap
2016-07-07 10:26     ` Dario Faggioli
2016-06-17 23:11 ` [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once Dario Faggioli
2016-07-06 16:05   ` George Dunlap
2016-06-17 23:12 ` [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held Dario Faggioli
2016-06-20  7:56   ` Jan Beulich
2016-07-06 16:10     ` George Dunlap
2016-07-07 10:28       ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
2016-06-20  8:02   ` Jan Beulich
2016-07-06 16:21     ` George Dunlap
2016-07-07  7:29       ` Jan Beulich
2016-07-07  9:09         ` George Dunlap
2016-07-07  9:18           ` Jan Beulich
2016-07-07 10:53             ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 08/19] xen: credit2: when tickling, check idle cpus first Dario Faggioli
2016-07-06 16:36   ` George Dunlap
2016-06-17 23:12 ` [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu Dario Faggioli
2016-07-06 16:40   ` George Dunlap
2016-06-17 23:12 ` Dario Faggioli [this message]
2016-07-06 17:33   ` [PATCH 10/19] xen: credit2: rework load tracking logic George Dunlap
2016-06-17 23:12 ` [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format Dario Faggioli
2016-06-21  9:27   ` Wei Liu
2016-06-17 23:12 ` [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
2016-07-07  9:45   ` George Dunlap
2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
2016-06-20  8:13   ` Jan Beulich
2016-07-07 10:59     ` Dario Faggioli
2016-07-07 15:17   ` George Dunlap
2016-07-07 16:43     ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
2016-06-20  8:15   ` Jan Beulich
2016-07-07 15:34     ` George Dunlap
2016-07-07 15:34   ` George Dunlap
2016-06-17 23:13 ` [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
2016-07-07 15:37   ` George Dunlap
2016-06-17 23:13 ` [PATCH 16/19] tools: tracing: deal with new Credit2 events Dario Faggioli
2016-07-07 15:39   ` George Dunlap
2016-06-17 23:13 ` [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
2016-07-07 16:00   ` George Dunlap
2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
2016-06-20  8:26   ` Jan Beulich
2016-06-20 10:38     ` Dario Faggioli
2016-06-27 15:20   ` anshul makkar
2016-07-12 13:40   ` George Dunlap
2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
2016-06-20  8:30   ` Jan Beulich
2016-06-20 11:28     ` Dario Faggioli
2016-06-21 10:42   ` David Vrabel
2016-07-07 16:55     ` Dario Faggioli
```

```Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,

Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

switches of git-send-email(1):

git send-email \
--to=dario.faggioli@citrix.com \
--cc=anshul.makkar@citrix.com \
--cc=david.vrabel@citrix.com \
--cc=george.dunlap@citrix.com \
--cc=xen-devel@lists.xenproject.org \
```This is a public inbox, see mirroring instructions