xen-devel.lists.xenproject.org archive mirror
* [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2
@ 2016-07-15 14:49 Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar, David Vrabel, Jan Beulich

Hi,

Version 2 of the patch series. Not much to say apart from the fact that (I
think) I've addressed all the review comments v1 received (thanks everyone!).
Details are in the individual changelogs.

It's smaller because George committed some of the patches already.

As mentioned in the v1 thread, I've dropped the last patch "xen: credit2: use
cpumask_first instead of cpumask_any when choosing cpu". I still think it is
sound, but I want to evaluate its effects more thoroughly before
respinning it.

Posting of v1, with discussion and benchmarks, is here:
  https://lists.xen.org/archives/html/xen-devel/2016-06/msg02287.html
  <146618450041.23516.9007927860063823148.stgit@Solace.fritz.box>

And here are the git branches:
  (v2) git://xenbits.xen.org/people/dariof/xen.git  wip/sched/credit2-misc-improvements-v2
       http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/credit2-misc-improvements-v2
  
  (v1) git://xenbits.xen.org/people/dariof/xen.git  wip/sched/credit2-misc-improvements
       http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/credit2-misc-improvements

Thanks and Regards,
Dario
---
Dario Faggioli (11):
      xen: sched: leave CPUs doing tasklet work alone.
      xen: credit2: prevent load balancing to go mad if time goes backwards
      xen: credit2: rework load tracking logic
      xen/tools: improve tracing of Credit2 load tracking events
      xen: credit2: use non-atomic cpumask and bit operations
      xen: credit2: make the code less experimental
      xen: credit2: add yet some more tracing
      xen: credit2: only marshall trace point arguments if tracing enabled
      tools: tracing: deal with new Credit2 events
      xen: credit2: the private scheduler lock can be an rwlock.
      xen: credit2: implement true SMT support

 docs/misc/xen-command-line.markdown |   30 +
 tools/xentrace/formats              |   10
 tools/xentrace/xenalyze.c           |  103 +++
 xen/common/sched_credit.c           |    2
 xen/common/sched_credit2.c          | 1047 +++++++++++++++++++++++++----------
 5 files changed, 875 insertions(+), 317 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone.
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-18 14:35   ` George Dunlap
  2016-07-15 14:49 ` [PATCH v2 02/11] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, Jan Beulich, David Vrabel

In both Credit1 and Credit2, stop considering a pCPU idle
if the reason why the idle vCPU is being selected is to
do tasklet work.

Not doing so means that the tickling and load balancing
logic, seeing the pCPU as idle, considers it a candidate
for picking up vCPUs. But the pCPU won't actually pick
up or schedule any vCPU, which would then remain in the
runqueue. This is bad, especially if there are other,
truly idle pCPUs that could execute it.

The only drawback is that we can't assume that a pCPU is
always marked as idle when being removed from an
instance of the Credit2 scheduler (csched2_deinit_pdata).
In fact, if we are in stop-machine (i.e., during suspend
or shutdown), the pCPUs are running the stopmachine_tasklet
and hence are actually marked as busy. On the other hand,
when removing a pCPU from a Credit2 pool, it will indeed
be idle. The only thing we can do, therefore, is to
remove the BUG_ON() check.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
Changes from v1:
 * reorg. the Credit1 code for minimal churn, as requested during review;
 * don't move out: label up in Credit1, as requested during review.
---
As the code changed, I did not apply Anshul's R-b tag. Anshul, re-issue it if
you want (if you feel like it, of course).
---
 xen/common/sched_credit.c  |    2 +-
 xen/common/sched_credit2.c |   14 ++++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index ac22746..d547716 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1827,7 +1827,7 @@ csched_schedule(
      * Update idlers mask if necessary. When we're idling, other CPUs
      * will tickle us when they get extra work.
      */
-    if ( snext->pri == CSCHED_PRI_IDLE )
+    if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
     {
         if ( !cpumask_test_cpu(cpu, prv->idlers) )
             cpumask_set_cpu(cpu, prv->idlers);
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 8b95a47..7e572bf 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1923,8 +1923,16 @@ csched2_schedule(
     }
     else
     {
-        /* Update the idle mask if necessary */
-        if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        /*
+         * Update the idle mask if necessary. Note that, if we're scheduling
+         * idle in order to carry on some tasklet work, we want to play busy!
+         */
+        if ( tasklet_work_scheduled )
+        {
+            if ( cpumask_test_cpu(cpu, &rqd->idle) )
+                cpumask_clear_cpu(cpu, &rqd->idle);
+        }
+        else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
             cpumask_set_cpu(cpu, &rqd->idle);
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
@@ -2304,8 +2312,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     /* No need to save IRQs here, they're already disabled */
     spin_lock(&rqd->lock);
 
-    BUG_ON(!cpumask_test_cpu(cpu, &rqd->idle));
-
     printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     cpumask_clear_cpu(cpu, &rqd->idle);



* [PATCH v2 02/11] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 03/11] xen: credit2: rework load tracking logic Dario Faggioli
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, Jan Beulich, David Vrabel

This really should not happen, but:
 1. it does happen! Some more info here:
    http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
 2. independently of 1, it makes sense and is easy enough
    to have a 'safety catch'.

The reason why this is particularly bad for Credit2 is that
negative values of delta mean out-of-scale high load (because
of the conversion to unsigned). For instance, in the case of
runqueue load, this results in a runqueue having its load
updated to values of the order of 10000% or so, which in turn
means that the load balancer will migrate everything away from
the pCPUs in that runqueue, and leave them idle until the load
gets back to something sane... which may indeed take a while!
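
Just to illustrate the mechanism (a standalone, hypothetical snippet,
not code from this patch; the shift values are made up), here is how a
small negative delta blows up once it is promoted to unsigned in the
fixed point math:

    #include <stdio.h>

    int main(void)
    {
        /* Pretend the clock went backwards by 1000 ns. */
        long long now = 5000, load_last_update = 6000;
        long long delta = now - load_last_update;      /* -1000 */
        unsigned long long load = 2;

        /*
         * The multiplication promotes delta to unsigned long long, so
         * -1000 becomes 2^64 - 1000, and the resulting "average load"
         * is absurdly large instead of just slightly stale.
         */
        unsigned long long avgload = (delta * (load << 18)) >> 20;

        printf("avgload = %llu\n", avgload);
        return 0;
    }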

This is not a fix for the problem of time going backwards. In
fact, if that happens a lot, load tracking accuracy is still
compromised, but at least the effect is a lot less bad than
before.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
Leaving the patch as it was in v1, for reasons explained in these messages, by
both George and me:
 * <CAFLBxZZdS21UGW4TSDGvaDtn3LzD3X6L3f-rpDP4ySrHN0r7RQ@mail.gmail.com>
 * <a5b9e1c0-1d25-4cda-b1b6-a546189b43b0@citrix.com>
 * <1467888805.23985.62.camel@citrix.com>
---
 xen/common/sched_credit2.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 7e572bf..6cb06e8 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
     else
     {
         delta = now - rqd->load_last_update;
+        if ( unlikely(delta < 0) )
+        {
+            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+                     __func__, now, rqd->load_last_update);
+            delta = 0;
+        }
 
         rqd->avgload =
             ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
@@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
     else
     {
         delta = now - svc->load_last_update;
+        if ( unlikely(delta < 0) )
+        {
+            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+                     __func__, now, svc->load_last_update);
+            delta = 0;
+        }
 
         svc->avgload =
             ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )



* [PATCH v2 03/11] xen: credit2: rework load tracking logic
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 02/11] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-18 14:46   ` George Dunlap
  2016-07-15 14:49 ` [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events Dario Faggioli
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

The existing load tracking code was hard to understand and
maintain, and not entirely consistent. This is due to a
number of reasons:
 - code and comments were not in perfect sync, making it
   difficult to figure out what the intent of a particular
   choice was (e.g., the choice of 18 for load_window_shift);
 - the math, although effective, was not entirely consistent.
   In fact, we were doing (if W is the length of the window):

    avgload = (delta*load*W + (W - delta)*avgload)/W
    avgload = avgload + delta*load - delta*avgload/W

   which does not match any known variant of 'smoothed
   moving average'. In fact, it should have been:

    avgload = avgload + delta*load/W - delta*avgload/W

   (for details on why, see the doc comments inside this
   patch.). Furthermore, with the old formula, when a full
   window has passed (i.e., when delta = W), we get:

    avgload ~= avgload + W*load - avgload
    avgload ~= W*load

The reason why the formula above sort of worked is that the
number of bits used for the fractional parts of the values
used in fixed point math and the number of bits used for the
length of the window were the same (load_window_shift was
being used for both), so W*load was, in effect, just load
expressed in fixed point format.

This may look handy, but it introduced a (not especially well
documented) dependency between the length of the window and
the precision of the calculations, which really should be
two independent things, especially since treating them as such
(as is done in this patch) does not lead to more complex maths
(same number of multiplications and shifts, and there is still
room for some optimization).
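
To make the reworked update rule concrete, here is a minimal,
standalone sketch (not the actual Xen code; W and P are just example
values) of the new fixed point EWMA step, checked at the two extremes
(delta = 0 and delta = W):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    #define W_SHIFT 20   /* window length: 2^20 time units            */
    #define P_SHIFT 18   /* precision: 18 fractional bits (Q-format)  */

    /* avgload' = avgload' + delta*(load<<P)/W - delta*avgload'/W */
    static int64_t ewma_step(int64_t avgload, int64_t load, int64_t delta)
    {
        return avgload
               + ((delta * (load << P_SHIFT)) >> W_SHIFT)
               - ((delta * avgload) >> W_SHIFT);
    }

    int main(void)
    {
        int64_t avgload = 3LL << P_SHIFT; /* history: ~3 runnable vcpus  */
        int64_t load = 1;                 /* instantaneous load is now 1 */

        /* delta = 0: history unchanged. Prints 786432, i.e. 3 << 18. */
        printf("%" PRId64 "\n", ewma_step(avgload, load, 0));

        /* delta = W: history forgotten. Prints 262144, i.e. 1 << 18. */
        printf("%" PRId64 "\n", ewma_step(avgload, load, 1LL << W_SHIFT));

        return 0;
    }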

Therefore, in this patch, we:
 - split length of the window and precision (and, since there
   is already a command line parameter for length of window,
   introduce one for precision too),
 - align the math with one proper incarnation of exponential
   smoothing (at no added cost),
 - add comments, about the details of the algorithm and the
   math used.

While there, fix a couple of style issues as well (pointless
initialization, long lines, comments).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * reconciled comments and actual code about load_window_shift handling;
 * added some more sanity checking for opt_load_window_shift;
 * changes to trace records moved to next patch, as requested during review.
---
I've changed slightly more than just pushing the tracing related hunks to
another patch, so I'm not picking up George's Reviewed-by tag.
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 docs/misc/xen-command-line.markdown |   30 +++
 xen/common/sched_credit2.c          |  323 ++++++++++++++++++++++++++++++-----
 2 files changed, 308 insertions(+), 45 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 5500242..3a250cb 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -485,9 +485,39 @@ the address range the area should fall into.
 ### credit2\_balance\_under
 > `= <integer>`
 
+### credit2\_load\_precision\_shift
+> `= <integer>`
+
+> Default: `18`
+
+Specify the number of bits to use for the fractional part of the
+values involved in Credit2 load tracking and load balancing math.
+
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+> Default: `30`
+
+Specify the number of bits to use to represent the length of the
+window (in nanoseconds) we use for load tracking inside Credit2.
+This means that, with the default value (30), we use a
+2^30 nsec ~= 1 sec long window.
+
+Load tracking is done by means of a variation of exponentially
+weighted moving average (EWMA). The window length defined here
+determines for how long we give weight to the previous history
+of the load itself. In fact, once a full window has passed, we
+discard all previous history entirely.
+
+A short window will make the load balancer quick at reacting
+to load changes, but also short-sighted about previous history
+(and hence, e.g., long term load trends). A long window will
+make the load balancer thoughtful of previous history (and
+hence capable of capturing, e.g., long term load trends), but
+also slow in responding to load changes.
+
+The default value of `1 sec` is rather long.
+
 ### credit2\_runqueue
 > `= core | socket | node | all`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 6cb06e8..e695f1b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -173,16 +173,86 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
 
 /*
- * Shifts for load average.
- * - granularity: Reduce granularity of time by a factor of 1000, so we can use 32-bit maths
- * - window shift: Given granularity shift, make the window about 1 second
- * - scale shift: Shift up load by this amount rather than using fractions; 128 corresponds 
- *   to a load of 1.
+ * Load tracking and load balancing
+ *
+ * Load history of runqueues and vcpus is accounted for by using an
+ * exponentially weighted moving average algorithm. However, instead of using
+ * fractions, we shift everything to the left by the number of bits we want to
+ * use for representing the fractional part (Q-format).
+ *
+ * We may also want to reduce the precision of time accounting, to
+ * accommodate 'longer windows'. So, if that is the case, we just need to
+ * shift all time samples to the right.
+ *
+ * The details of the formulas used for load tracking are explained close to
+ * __update_runq_load(). Let's just say here that, with full nanosecond time
+ * granularity, a 30 bits wide 'decaying window' is ~1 second long.
+ *
+ * We want to consider the following equations:
+ *
+ *  avg[0] = load*P
+ *  avg[i+1] = avg[i] + delta*load*P/W - delta*avg[i]/W,  0 <= delta <= W
+ *
+ * where W is the length of the window, P the multiplier for transitioning into
+ * Q-format fixed point arithmetic and load is the instantaneous load of a
+ * runqueue, which basically is the number of runnable vcpus there are on the
+ * runqueue (for the meaning of the other terms, look at the doc comment to
+ *  __update_runq_load()).
+ *
+ *  So, again, with full nanosecond granularity, and 1 second window, we have:
+ *
+ *  W = 2^30
+ *  P = 2^18
+ *
+ * The maximum possible value for the average load, which we want to store in
+ * s_time_t type variables (i.e., we have 63 bits available) is load*P. This
+ * means that, with P 18 bits wide, load can occupy 45 bits. This in turn
+ * means we can have 2^45 vcpus in each runqueue, before overflow occurs!
+ *
+ * However, it can happen that, at step j+1, if:
+ *
+ *  avg[j] = load*P
+ *  delta = W
+ *
+ * then:
+ *
+ *  avg[j+1] = avg[j] + W*load*P/W - W*load*P/W
+ *
+ * So we must be able to deal with W*load*P. This means load can't be higher
+ * than:
+ *
+ *  2^(63 - 30 - 18) = 2^15 = 32768
+ *
+ * So 32768 is the maximum number of vcpus that we can have in a runqueue,
+ * at any given time, and still not have problems with the load tracking
+ * calculations... and this is more than fine.
+ *
+ * As a matter of fact, since we are using microseconds granularity, we have
+ * W=2^20. So, still with 18 fractional bits and a 1 second long window, there
+ * may be 2^25 = 33554432 vcpus in a runq before we have to start thinking
+ * about overflow.
  */
-#define LOADAVG_GRANULARITY_SHIFT (10)
-static unsigned int __read_mostly opt_load_window_shift = 18;
-#define  LOADAVG_WINDOW_SHIFT_MIN 4
+
+/* If >0, decreases the granularity of time samples used for load tracking. */
+#define LOADAVG_GRANULARITY_SHIFT   (10)
+/* Time window during which we still give value to previous load history. */
+#define LOADAVG_WINDOW_SHIFT        (30)
+/* 18 bits by default (and not less than 4) for decimals. */
+#define LOADAVG_PRECISION_SHIFT     (18)
+#define LOADAVG_PRECISION_SHIFT_MIN (4)
+
+/*
+ * Both the length of the window and the number of fractional bits can be
+ * decided with boot parameters.
+ *
+ * The length of the window is always expressed in nanoseconds. The actual
+ * value used by default is LOADAVG_WINDOW_SHIFT - LOADAVG_GRANULARITY_SHIFT.
+ */
+static unsigned int __read_mostly opt_load_window_shift = LOADAVG_WINDOW_SHIFT;
 integer_param("credit2_load_window_shift", opt_load_window_shift);
+static unsigned int __read_mostly opt_load_precision_shift = LOADAVG_PRECISION_SHIFT;
+integer_param("credit2_load_precision_shift", opt_load_precision_shift);
+
 static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
 static int __read_mostly opt_overload_balance_tolerance = -3;
@@ -279,6 +349,7 @@ struct csched2_private {
     cpumask_t active_queues; /* Queues which may have active cpus */
     struct csched2_runqueue_data rqd[NR_CPUS];
 
+    unsigned int load_precision_shift;
     unsigned int load_window_shift;
 };
 
@@ -387,19 +458,147 @@ __runq_elem(struct list_head *elem)
     return list_entry(elem, struct csched2_vcpu, runq_elem);
 }
 
+/*
+ * Track the runq load by gathering instantaneous load samples, and using
+ * exponentially weighted moving average (EWMA) for the 'decaying'.
+ *
+ * We consider a window of length W=2^(prv->load_window_shift) nsecs
+ * (which takes LOADAVG_GRANULARITY_SHIFT into account).
+ *
+ * If load is the instantaneous load, the formula for EWMA looks as follows,
+ * for the i-th sample:
+ *
+ *  avg[i] = a*load + (1 - a)*avg[i-1]
+ *
+ * where avg[i] is the new value of the average load, avg[i-1] is the value
+ * of the average load calculated so far, and a is a coefficient less than
+ * or equal to 1.
+ *
+ * So, for us, it becomes:
+ *
+ *  avgload = a*load + (1 - a)*avgload
+ *
+ * For determining a, we consider _when_ we are doing the load update, wrt
+ * the length of the window. We define delta as follows:
+ *
+ *  delta = t - load_last_update
+ *
+ * where t is current time (i.e., time at which we are both sampling and
+ * updating the load average) and load_last_update is the last time we did
+ * that.
+ *
+ * There are two possible situations:
+ *
+ * a) delta <= W
+ *    this means that, during the last window of length W, the runqueue load
+ *    was avgload for (W - delta) time, and load for delta time:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *                |     load_last_update  t
+ *     -------------------------|---------|---
+ *                |             |         |
+ *                \__W - delta__/\_delta__/
+ *                |             |         |
+ *                |___avgload___|__load___|
+ *
+ *    So, what about using delta/W as our smoothing coefficient a? If we do,
+ *    here's what happens:
+ *
+ *     a = delta / W
+ *     1 - a = 1 - (delta / W) = (W - delta) / W
+ *
+ *    Which matches the above description of what happened in the last
+ *    window of length W.
+ *
+ *    Note that this also means that the weight that we assign to both the
+ *    latest load sample, and to previous history, varies at each update.
+ *    The longer the latest load sample has been in effect, within the last
+ *    window, the more it weighs (and the less the previous history
+ *    weighs).
+ *
+ *    This is some sort of extension of plain EWMA to fit even better to our
+ *    use case.
+ *
+ * b) delta > W
+ *    this means more than a full window has passed since the last update:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *       load_last_update                 t
+ *     ----|------------------------------|---
+ *         |                              |
+ *         \_________________delta________/
+ *
+ *    Basically, it means the last load sample has been in effect for more
+ *    than W time, and hence we should just use it, and forget everything
+ *    before that.
+ *
+ *    This can be seen as a 'reset condition', occurring when, for whatever
+ *    reason, load has not been updated for longer than we expected. (It is
+ *    also how avgload is assigned its first value.)
+ *
+ * The formula for avgload then becomes:
+ *
+ *  avgload = (delta/W)*load + (W - delta)*avgload/W
+ *  avgload = delta*load/W + W*avgload/W - delta*avgload/W
+ *  avgload = avgload + delta*load/W - delta*avgload/W
+ *
+ * So, final form is:
+ *
+ *  avgload_0 = load
+ *  avgload = avgload + delta*load/W - delta*avgload/W,  0<=delta<=W
+ *
+ * As a confirmation, let's look at the extremes, when delta is 0 (i.e.,
+ * what happens if we update the load twice, at the same time instant?):
+ *
+ *  avgload = avgload + 0*load/W - 0*avgload/W
+ *  avgload = avgload
+ *
+ * and when delta is W (i.e., what happens if we update at the last
+ * possible instant before the window 'expires'?):
+ *
+ *  avgload = avgload + W*load/W - W*avgload/W
+ *  avgload = avgload + load - avgload
+ *  avgload = load
+ *
+ * Which, in both cases, is what we expect.
+ */
 static void
 __update_runq_load(const struct scheduler *ops,
                   struct csched2_runqueue_data *rqd, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
+    s_time_t delta, load = rqd->load;
+    unsigned int P, W;
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    /*
+     * To avoid using fractions, we shift to left by load_precision_shift,
+     * and use the last load_precision_shift bits as the fractional part.
+     * Looking back at the formula we want to use, we now have:
+     *
+     *  P = 2^(load_precision_shift)
+     *  P*avgload = P*(avgload + delta*load/W - delta*avgload/W)
+     *  P*avgload = P*avgload + delta*load*P/W - delta*P*avgload/W
+     *
+     * And if we are ok storing and using P*avgload, we can rewrite this as:
+     *
+     *  P*avgload = avgload'
+     *  avgload' = avgload' + delta*P*load/W - delta*avgload'/W
+     *
+     * Coupled with, of course:
+     *
+     *  avgload_0' = P*load
+     */
+
+    if ( rqd->load_last_update + (1ULL << W) < now )
     {
-        rqd->avgload = (unsigned long long)rqd->load << prv->load_window_shift;
-        rqd->b_avgload = (unsigned long long)rqd->load << prv->load_window_shift;
+        rqd->avgload = load << P;
+        rqd->b_avgload = load << P;
     }
     else
     {
@@ -411,17 +610,29 @@ __update_runq_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        rqd->avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift;
-
-        rqd->b_avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->b_avgload ) ) >> prv->load_window_shift;
+        /*
+         * Note that, if we were to enforce (or check) some relationship
+         * between P and W, we may save one shift. E.g., if we are sure
+         * that P < W, we could write:
+         *
+         *  (delta * (load << P)) >> W
+         *
+         * as:
+         *
+         *  (delta * load) >> (W - P)
+         */
+        rqd->avgload = rqd->avgload +
+                       ((delta * (load << P)) >> W) -
+                       ((delta * rqd->avgload) >> W);
+        rqd->b_avgload = rqd->b_avgload +
+                         ((delta * (load << P)) >> W) -
+                         ((delta * rqd->b_avgload) >> W);
     }
     rqd->load += change;
     rqd->load_last_update = now;
 
+    ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
+
     {
         struct {
             unsigned rq_load:4, rq_avgload:28;
@@ -442,8 +653,8 @@ __update_svc_load(const struct scheduler *ops,
                   struct csched2_vcpu *svc, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
-    int vcpu_load;
+    s_time_t delta, vcpu_load;
+    unsigned int P, W;
 
     if ( change == -1 )
         vcpu_load = 1;
@@ -452,11 +663,13 @@ __update_svc_load(const struct scheduler *ops,
     else
         vcpu_load = vcpu_runnable(svc->vcpu);
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( svc->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    if ( svc->load_last_update + (1ULL << W) < now )
     {
-        svc->avgload = (unsigned long long)vcpu_load << prv->load_window_shift;
+        svc->avgload = vcpu_load << P;
     }
     else
     {
@@ -468,9 +681,9 @@ __update_svc_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        svc->avgload =
-            ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * svc->avgload ) ) >> prv->load_window_shift;
+        svc->avgload = svc->avgload +
+                       ((delta * (vcpu_load << P)) >> W) -
+                       ((delta * svc->avgload) >> W);
     }
     svc->load_last_update = now;
 
@@ -903,7 +1116,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
         /* Starting load of 50% */
-        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_window_shift - 1);
+        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
         svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
     }
     else
@@ -1152,7 +1365,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     vcpu_schedule_unlock_irq(lock, vc);
 }
 
-#define MAX_LOAD (1ULL<<60);
+#define MAX_LOAD (STIME_MAX);
 static int
 csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 {
@@ -1446,15 +1659,19 @@ retry:
         if ( i > cpus_max )
             cpus_max = i;
 
-        /* If we're under 100% capacaty, only shift if load difference
-         * is > 1.  otherwise, shift if under 12.5% */
-        if ( load_max < (1ULL<<(prv->load_window_shift))*cpus_max )
+        /*
+         * If we're under 100% capacity, only shift if load difference
+         * is > 1. Otherwise, shift if under 12.5%.
+         */
+        if ( load_max < (cpus_max << prv->load_precision_shift) )
         {
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_underload_balance_tolerance) ) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_underload_balance_tolerance)) )
                  goto out;
         }
         else
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_overload_balance_tolerance)) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_overload_balance_tolerance)) )
                 goto out;
     }
              
@@ -1962,7 +2179,7 @@ csched2_schedule(
 }
 
 static void
-csched2_dump_vcpu(struct csched2_vcpu *svc)
+csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
 {
     printk("[%i.%i] flags=%x cpu=%i",
             svc->vcpu->domain->domain_id,
@@ -1972,6 +2189,9 @@ csched2_dump_vcpu(struct csched2_vcpu *svc)
 
     printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
 
+    printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
+           (svc->avgload * 100) >> prv->load_precision_shift);
+
     printk("\n");
 }
 
@@ -2009,7 +2229,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
     if ( svc )
     {
         printk("\trun: ");
-        csched2_dump_vcpu(svc);
+        csched2_dump_vcpu(prv, svc);
     }
 
     loop = 0;
@@ -2019,7 +2239,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
         if ( svc )
         {
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
         }
     }
 
@@ -2048,8 +2268,8 @@ csched2_dump(const struct scheduler *ops)
     for_each_cpu(i, &prv->active_queues)
     {
         s_time_t fraction;
-        
-        fraction = prv->rqd[i].avgload * 100 / (1ULL<<prv->load_window_shift);
+
+        fraction = (prv->rqd[i].avgload * 100) >> prv->load_precision_shift;
 
         cpulist_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].active);
         printk("Runqueue %d:\n"
@@ -2057,12 +2277,13 @@ csched2_dump(const struct scheduler *ops)
                "\tcpus               = %s\n"
                "\tmax_weight         = %d\n"
                "\tinstload           = %d\n"
-               "\taveload            = %3"PRI_stime"\n",
+               "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
                prv->rqd[i].load,
+               prv->rqd[i].avgload,
                fraction);
 
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].idle);
@@ -2093,7 +2314,7 @@ csched2_dump(const struct scheduler *ops)
             lock = vcpu_schedule_lock(svc->vcpu);
 
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
 
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
@@ -2354,17 +2575,27 @@ csched2_init(struct scheduler *ops)
            " WARNING: This is experimental software in development.\n" \
            " Use at your own risk.\n");
 
+    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
     printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
 
-    if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
+    if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
+    {
+        printk("WARNING: %s: opt_load_precision_shift %d below min %d, resetting\n",
+               __func__, opt_load_precision_shift, LOADAVG_PRECISION_SHIFT_MIN);
+        opt_load_precision_shift = LOADAVG_PRECISION_SHIFT_MIN;
+    }
+
+    if ( opt_load_window_shift <= LOADAVG_GRANULARITY_SHIFT )
     {
-        printk("%s: opt_load_window_shift %d below min %d, resetting\n",
-               __func__, opt_load_window_shift, LOADAVG_WINDOW_SHIFT_MIN);
-        opt_load_window_shift = LOADAVG_WINDOW_SHIFT_MIN;
+        printk("WARNING: %s: opt_load_window_shift %d too short, resetting\n",
+               __func__, opt_load_window_shift);
+        opt_load_window_shift = LOADAVG_WINDOW_SHIFT;
     }
+    printk(XENLOG_INFO "load tracking window length %llu ns\n",
+           1ULL << opt_load_window_shift);
 
     /* Basically no CPU information is available at this point; just
      * set up basic structures, and a callback when the CPU info is
@@ -2385,7 +2616,9 @@ csched2_init(struct scheduler *ops)
         prv->rqd[i].id = -1;
     }
 
-    prv->load_window_shift = opt_load_window_shift;
+    prv->load_precision_shift = opt_load_precision_shift;
+    prv->load_window_shift = opt_load_window_shift - LOADAVG_GRANULARITY_SHIFT;
+    ASSERT(opt_load_window_shift > 0);
 
     return 0;
 }



* [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2016-07-15 14:49 ` [PATCH v2 03/11] xen: credit2: rework load tracking logic Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-18 15:38   ` Wei Liu
  2016-07-15 14:49 ` [PATCH v2 05/11] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Anshul Makkar, Ian Jackson, George Dunlap

Add the shift used for the precision of the integer
arithmetic to the trace records, and update both xenalyze
and xentrace_format to make use of/print it.

In particular, in xenalyze, we can now show the
load as an (easier to interpret) percentage.
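
For reference, the conversion a trace consumer does with the new shift
field boils down to the following (a standalone sketch mirroring what
the xenalyze hunk below does; the numbers are made up):

    #include <stdio.h>
    #include <stdint.h>

    /* avgload has 'shift' fractional bits; 1.0 means one vcpu fully busy. */
    static double load_pct(uint64_t avgload, unsigned int shift)
    {
        return (avgload * 100.0) / (1ULL << shift);
    }

    int main(void)
    {
        /* E.g., avgload = 131072 with shift = 18 is a 50% load. */
        printf("%4.3f%%\n", load_pct(131072, 18));
        return 0;
    }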

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v1:
 * moved the hypervisor bits in this patch, as requested during review.
---
Sticking George's Reviewed-by tag there, as he said I could.
---
 tools/xentrace/formats     |    4 ++--
 tools/xentrace/xenalyze.c  |   25 ++++++++++++++++++-------
 xen/common/sched_credit2.c |   11 +++++++----
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index d204351..2e58d03 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -53,8 +53,8 @@
 0x00022208  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:sched_tasklet
 0x00022209  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:update_load
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
-0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(1)08x, avgload = %(2)d ]
-0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[4]:rq_avgload[28] = 0x%(1)08x, rq_id[4]:b_avgload[28] = 0x%(2)08x ]
+0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(3)08x, vcpuload = 0x%(2)08x%(1)08x, wshift = %(4)d ]
+0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x%(5)08x, rq_avgload = 0x%(2)08x%(1)08x, b_avgload = 0x%(4)08x%(3)08x ]
 
 0x00022801  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:tickle        [ cpu = %(1)d ]
 0x00022802  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:runq_pick     [ dom:vcpu = 0x%(1)08x, cur_deadline = 0x%(3)08x%(2)08x, cur_budget = 0x%(5)08x%(4)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 01ead8b..f2f97bd 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7802,25 +7802,36 @@ void sched_process(struct pcpu_info *p)
         case TRC_SCHED_CLASS_EVT(CSCHED2, 11): /* UPDATE_VCPU_LOAD */
             if(opt.dump_all) {
                 struct {
+                    uint64_t vcpuload;
                     unsigned int vcpuid:16, domid:16;
-                    unsigned int avgload;
+                    unsigned int shift;
                 } *r = (typeof(r))ri->d;
+                double vcpuload;
 
-                printf(" %s csched2:update_vcpu_load d%uv%u, avg_load = %u\n",
-                       ri->dump_header, r->domid, r->vcpuid, r->avgload);
+                vcpuload = (r->vcpuload * 100.0) / (1ULL << r->shift);
+
+                printf(" %s csched2:update_vcpu_load d%uv%u, "
+                       "vcpu_load = %4.3f%% (%"PRIu64")\n",
+                       ri->dump_header, r->domid, r->vcpuid, vcpuload,
+                       r->vcpuload);
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 12): /* UPDATE_RUNQ_LOAD */
             if(opt.dump_all) {
                 struct {
-                    unsigned int rq_load:4, rq_avgload:28;
-                    unsigned int rq_id:4, b_avgload:28;
+                    uint64_t rq_avgload, b_avgload;
+                    unsigned int rq_load:16, rq_id:8, shift:8;
                 } *r = (typeof(r))ri->d;
+                double avgload, b_avgload;
+
+                avgload = (r->rq_avgload * 100.0) / (1ULL << r->shift);
+                b_avgload = (r->b_avgload * 100.0) / (1ULL << r->shift);
 
                 printf(" %s csched2:update_rq_load rq# %u, load = %u, "
-                       "avgload = %u, b_avgload = %u\n",
+                       "avgload = %4.3f%% (%"PRIu64"), "
+                       "b_avgload = %4.3f%% (%"PRIu64")\n",
                        ri->dump_header, r->rq_id, r->rq_load,
-                       r->rq_avgload, r->b_avgload);
+                       avgload, r->rq_avgload, b_avgload, r->b_avgload);
             }
             break;
         /* RTDS (TRC_RTDS_xxx) */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index e695f1b..2978eac 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -635,13 +635,14 @@ __update_runq_load(const struct scheduler *ops,
 
     {
         struct {
-            unsigned rq_load:4, rq_avgload:28;
-            unsigned rq_id:4, b_avgload:28;
+            uint64_t rq_avgload, b_avgload;
+            unsigned rq_load:16, rq_id:8, shift:8;
         } d;
-        d.rq_id=rqd->id;
+        d.rq_id = rqd->id;
         d.rq_load = rqd->load;
         d.rq_avgload = rqd->avgload;
         d.b_avgload = rqd->b_avgload;
+        d.shift = P;
         trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
                   sizeof(d),
                   (unsigned char *)&d);
@@ -689,12 +690,14 @@ __update_svc_load(const struct scheduler *ops,
 
     {
         struct {
+            uint64_t v_avgload;
             unsigned vcpu:16, dom:16;
-            unsigned v_avgload:32;
+            unsigned shift;
         } d;
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.v_avgload = svc->avgload;
+        d.shift = P;
         trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
                   sizeof(d),
                   (unsigned char *)&d);



* [PATCH v2 05/11] xen: credit2: use non-atomic cpumask and bit operations
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2016-07-15 14:49 ` [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-15 14:49 ` [PATCH v2 06/11] xen: credit2: make the code less experimental Dario Faggioli
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

as all the accesses to both the masks and the flags are
already serialized by the runqueue locks.
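
To spell out the pattern this relies on (a simplified, hypothetical
sketch using pthreads, not the Xen locking primitives): as long as
every update of a word happens under the same lock, the plain,
non-LOCK-prefixed read-modify-write is enough:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long idle_mask;   /* stand-in for rqd->idle */

    /*
     * Every writer takes rq_lock first, so nothing can race with this
     * non-atomic OR; it is the analogue of using __cpumask_set_cpu()
     * instead of the atomic cpumask_set_cpu().
     */
    static void mark_idle(unsigned int cpu)
    {
        pthread_mutex_lock(&rq_lock);
        idle_mask |= 1UL << cpu;
        pthread_mutex_unlock(&rq_lock);
    }

    int main(void)
    {
        mark_idle(3);
        printf("idle_mask = %#lx\n", idle_mask);   /* prints 0x8 */
        return 0;
    }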

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   48 ++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 2978eac..f40e307 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -906,7 +906,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                   sizeof(d),
                   (unsigned char *)&d);
     }
-    cpumask_set_cpu(ipid, &rqd->tickled);
+    __cpumask_set_cpu(ipid, &rqd->tickled);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1274,7 +1274,7 @@ csched2_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         __runq_remove(svc);
     }
     else if ( svc->flags & CSFLAG_delayed_runq_add )
-        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        __clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
 }
 
 static void
@@ -1311,7 +1311,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
      * after the context has been saved. */
     if ( unlikely(svc->flags & CSFLAG_scheduled) )
     {
-        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
         goto out;
     }
 
@@ -1344,7 +1344,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor));
 
     /* This vcpu is now eligible to be put on the runqueue again */
-    clear_bit(__CSFLAG_scheduled, &svc->flags);
+    __clear_bit(__CSFLAG_scheduled, &svc->flags);
 
     /* If someone wants it on the runqueue, put it there. */
     /*
@@ -1354,7 +1354,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
      * it seems a bit pointless; especially as we have plenty of
      * bits free.
      */
-    if ( test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
+    if ( __test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
          && likely(vcpu_runnable(vc)) )
     {
         BUG_ON(__vcpu_on_runq(svc));
@@ -1396,10 +1396,10 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !spin_trylock(&prv->lock) )
     {
-        if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
+        if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
         {
             d2printk("%pv -\n", svc->vcpu);
-            clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
+            __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
 
         return get_fallback_cpu(svc);
@@ -1407,7 +1407,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* First check to see if we're here because someone else suggested a place
      * for us to move. */
-    if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
+    if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
         {
@@ -1542,8 +1542,8 @@ static void migrate(const struct scheduler *ops,
         d2printk("%pv %d-%d a\n", svc->vcpu, svc->rqd->id, trqd->id);
         /* It's running; mark it to migrate. */
         svc->migrate_rqd = trqd;
-        set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
-        set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
+        __set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
+        __set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         SCHED_STAT_CRANK(migrate_requested);
     }
     else
@@ -2073,7 +2073,7 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
-        cpumask_clear_cpu(cpu, &rqd->tickled);
+        __cpumask_clear_cpu(cpu, &rqd->tickled);
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2109,7 +2109,7 @@ csched2_schedule(
     if ( snext != scurr
          && !is_idle_vcpu(scurr->vcpu)
          && vcpu_runnable(current) )
-        set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
+        __set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
 
     ret.migrated = 0;
 
@@ -2128,7 +2128,7 @@ csched2_schedule(
                        cpu, snext->vcpu, snext->vcpu->processor, scurr->vcpu);
                 BUG();
             }
-            set_bit(__CSFLAG_scheduled, &snext->flags);
+            __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
         /* Check for the reset condition */
@@ -2140,7 +2140,7 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
-            cpumask_clear_cpu(cpu, &rqd->idle);
+            __cpumask_clear_cpu(cpu, &rqd->idle);
 
         snext->start_time = now;
 
@@ -2162,10 +2162,10 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
-                cpumask_clear_cpu(cpu, &rqd->idle);
+                __cpumask_clear_cpu(cpu, &rqd->idle);
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
-            cpumask_set_cpu(cpu, &rqd->idle);
+            __cpumask_set_cpu(cpu, &rqd->idle);
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2341,7 +2341,7 @@ static void activate_runqueue(struct csched2_private *prv, int rqi)
     INIT_LIST_HEAD(&rqd->runq);
     spin_lock_init(&rqd->lock);
 
-    cpumask_set_cpu(rqi, &prv->active_queues);
+    __cpumask_set_cpu(rqi, &prv->active_queues);
 }
 
 static void deactivate_runqueue(struct csched2_private *prv, int rqi)
@@ -2354,7 +2354,7 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     
     rqd->id = -1;
 
-    cpumask_clear_cpu(rqi, &prv->active_queues);
+    __cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
 static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
@@ -2443,9 +2443,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
     
-    cpumask_set_cpu(cpu, &rqd->idle);
-    cpumask_set_cpu(cpu, &rqd->active);
-    cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->idle);
+    __cpumask_set_cpu(cpu, &rqd->active);
+    __cpumask_set_cpu(cpu, &prv->initialized);
 
     return rqi;
 }
@@ -2550,8 +2550,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
 
-    cpumask_clear_cpu(cpu, &rqd->idle);
-    cpumask_clear_cpu(cpu, &rqd->active);
+    __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
     {
@@ -2561,7 +2561,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_unlock(&rqd->lock);
 
-    cpumask_clear_cpu(cpu, &prv->initialized);
+    __cpumask_clear_cpu(cpu, &prv->initialized);
 
     spin_unlock_irqrestore(&prv->lock, flags);
 



* [PATCH v2 06/11] xen: credit2: make the code less experimental
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2016-07-15 14:49 ` [PATCH v2 05/11] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-18 15:04   ` George Dunlap
  2016-07-15 14:49 ` [PATCH v2 07/11] xen: credit2: add yet some more tracing Dario Faggioli
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, Jan Beulich, David Vrabel

Mainly, almost all of the BUG_ON()s can be converted into
ASSERT()s, and almost all of the debug printk()s can either
be removed or turned into tracing.

The 'TODO' list, in a comment at the beginning of the file,
was also stale, so remove the items that have actually been
done already.
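
The conversion relies on BUG_ON() being checked in every build, while
ASSERT() is only checked in debug builds. A rough, standalone
approximation of the difference (not the actual Xen macros):

    #include <stdlib.h>

    /* Userspace stand-ins for the two Xen macros. */
    #define MY_BUG_ON(cond)  do { if (cond) abort(); } while (0)   /* always on */
    #ifndef NDEBUG
    # define MY_ASSERT(cond) do { if (!(cond)) abort(); } while (0)
    #else
    # define MY_ASSERT(cond) ((void)0)                  /* compiled out */
    #endif

    int main(void)
    {
        int runnable = 1;
        MY_BUG_ON(!runnable);   /* checked in every build */
        MY_ASSERT(runnable);    /* dropped when built with -DNDEBUG */
        return 0;
    }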

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
Changes from v1:
 * d2printk is not killed, and the "Time goes backward" messages are only
   printed if that's defined;
 * make sure the XENLOG_INFO applies to all the lines printing the values of
   all the parameters, as pointed out during review;
 * avoid adding variables for values used only once in migrate() (they were
   remnants from the development of the series), as suggested during review;
 * vcpu to runq assignment debugging checks removed from csched2_schedule,
   as it was suggested during review that they're not useful at all any longer.
---
 xen/common/sched_credit2.c |  256 +++++++++++++++++++++-----------------------
 1 file changed, 121 insertions(+), 135 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index f40e307..d72f530 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -27,8 +27,10 @@
 #include <xen/cpu.h>
 #include <xen/keyhandler.h>
 
+/* Meant only for helping developers during debugging. */
+/* #define d2printk printk */
 #define d2printk(x...)
-//#define d2printk printk
+
 
 /*
  * Credit2 tracing events ("only" 512 available!). Check
@@ -46,16 +48,16 @@
 #define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS_EVT(CSCHED2, 10)
 #define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11)
 #define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12)
+#define TRC_CSCHED2_TICKLE_NEW       TRC_SCHED_CLASS_EVT(CSCHED2, 13)
+#define TRC_CSCHED2_RUNQ_MAX_WEIGHT  TRC_SCHED_CLASS_EVT(CSCHED2, 14)
+#define TRC_CSCHED2_MIGRATE          TRC_SCHED_CLASS_EVT(CSCHED2, 15)
 
 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the
  * credit2 wiki page:
  *  http://wiki.xen.org/wiki/Credit2_Scheduler_Development
+ *
  * TODO:
- * + Multiple sockets
- *  - Simple load balancer / runqueue assignment
- *  - Runqueue load measurement
- *  - Load-based load balancer
  * + Hyperthreading
  *  - Look for non-busy core if possible
  *  - "Discount" time run on a thread with busy siblings
@@ -605,7 +607,7 @@ __update_runq_load(const struct scheduler *ops,
         delta = now - rqd->load_last_update;
         if ( unlikely(delta < 0) )
         {
-            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+            d2printk("WARNING: %s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
                      __func__, now, rqd->load_last_update);
             delta = 0;
         }
@@ -677,7 +679,7 @@ __update_svc_load(const struct scheduler *ops,
         delta = now - svc->load_last_update;
         if ( unlikely(delta < 0) )
         {
-            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+            d2printk("WARNING: %s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
                      __func__, now, svc->load_last_update);
             delta = 0;
         }
@@ -720,23 +722,18 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu *svc)
     struct list_head *iter;
     int pos = 0;
 
-    d2printk("rqi %pv\n", svc->vcpu);
-
-    BUG_ON(&svc->rqd->runq != runq);
-    /* Idle vcpus not allowed on the runqueue anymore */
-    BUG_ON(is_idle_vcpu(svc->vcpu));
-    BUG_ON(svc->vcpu->is_running);
-    BUG_ON(svc->flags & CSFLAG_scheduled);
+    ASSERT(&svc->rqd->runq == runq);
+    ASSERT(!is_idle_vcpu(svc->vcpu));
+    ASSERT(!svc->vcpu->is_running);
+    ASSERT(!(svc->flags & CSFLAG_scheduled));
 
     list_for_each( iter, runq )
     {
         struct csched2_vcpu * iter_svc = __runq_elem(iter);
 
         if ( svc->credit > iter_svc->credit )
-        {
-            d2printk(" p%d %pv\n", pos, iter_svc->vcpu);
             break;
-        }
+
         pos++;
     }
 
@@ -752,10 +749,10 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
     struct list_head * runq = &RQD(ops, cpu)->runq;
     int pos = 0;
 
-    ASSERT( spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock) );
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
-    BUG_ON( __vcpu_on_runq(svc) );
-    BUG_ON( c2r(ops, cpu) != c2r(ops, svc->vcpu->processor) );
+    ASSERT(!__vcpu_on_runq(svc));
+    ASSERT(c2r(ops, cpu) == c2r(ops, svc->vcpu->processor));
 
     pos = __runq_insert(runq, svc);
 
@@ -778,7 +775,7 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
 static inline void
 __runq_remove(struct csched2_vcpu *svc)
 {
-    BUG_ON( !__vcpu_on_runq(svc) );
+    ASSERT(__vcpu_on_runq(svc));
     list_del_init(&svc->runq_elem);
 }
 
@@ -803,16 +800,29 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_ti
 static void
 runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 {
-    int i, ipid=-1;
-    s_time_t lowest=(1<<30);
+    int i, ipid = -1;
+    s_time_t lowest = (1<<30);
     unsigned int cpu = new->vcpu->processor;
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     cpumask_t mask;
     struct csched2_vcpu * cur;
 
-    d2printk("rqt %pv curr %pv\n", new->vcpu, current);
+    ASSERT(new->rqd == rqd);
 
-    BUG_ON(new->rqd != rqd);
+    /* TRACE */
+    {
+        struct {
+            unsigned vcpu:16, dom:16;
+            unsigned processor, credit;
+        } d;
+        d.dom = new->vcpu->domain->domain_id;
+        d.vcpu = new->vcpu->vcpu_id;
+        d.processor = new->vcpu->processor;
+        d.credit = new->credit;
+        trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
 
     /*
      * Get a mask of idle, but not tickled, processors that new is
@@ -858,7 +868,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
         cur = CSCHED2_VCPU(curr_on_cpu(i));
 
-        BUG_ON(is_idle_vcpu(cur->vcpu));
+        ASSERT(!is_idle_vcpu(cur->vcpu));
 
         /* Update credits for current to see if we want to preempt. */
         burn_credits(rqd, cur, now);
@@ -948,8 +958,8 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
 
         svc = list_entry(iter, struct csched2_vcpu, rqd_elem);
 
-        BUG_ON( is_idle_vcpu(svc->vcpu) );
-        BUG_ON( svc->rqd != rqd );
+        ASSERT(!is_idle_vcpu(svc->vcpu));
+        ASSERT(svc->rqd == rqd);
 
         start_credit = svc->credit;
 
@@ -993,12 +1003,11 @@ void burn_credits(struct csched2_runqueue_data *rqd,
 {
     s_time_t delta;
 
-    /* Assert svc is current */
-    ASSERT(svc==CSCHED2_VCPU(curr_on_cpu(svc->vcpu->processor)));
+    ASSERT(svc == CSCHED2_VCPU(curr_on_cpu(svc->vcpu->processor)));
 
     if ( unlikely(is_idle_vcpu(svc->vcpu)) )
     {
-        BUG_ON(svc->credit != CSCHED2_IDLE_CREDIT);
+        ASSERT(svc->credit == CSCHED2_IDLE_CREDIT);
         return;
     }
 
@@ -1009,13 +1018,11 @@ void burn_credits(struct csched2_runqueue_data *rqd,
         SCHED_STAT_CRANK(burn_credits_t2c);
         t2c_update(rqd, delta, svc);
         svc->start_time = now;
-
-        d2printk("b %pv c%d\n", svc->vcpu, svc->credit);
     }
     else if ( delta < 0 )
     {
-        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
-               __func__, now, svc->start_time);
+        d2printk("WARNING: %s: Time went backwards? now %"PRI_stime" start_time %"PRI_stime"\n",
+                 __func__, now, svc->start_time);
     }
 
     /* TRACE */
@@ -1048,7 +1055,6 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
     if ( new_weight > rqd->max_weight )
     {
         rqd->max_weight = new_weight;
-        d2printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
         SCHED_STAT_CRANK(upd_max_weight_quick);
     }
     else if ( old_weight == rqd->max_weight )
@@ -1065,9 +1071,20 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
         }
 
         rqd->max_weight = max_weight;
-        d2printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
         SCHED_STAT_CRANK(upd_max_weight_full);
     }
+
+    /* TRACE */
+    {
+        struct {
+            unsigned rqi:16, max_weight:16;
+        } d;
+        d.rqi = rqd->id;
+        d.max_weight = rqd->max_weight;
+        trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
 }
 
 #ifndef NDEBUG
@@ -1114,8 +1131,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
 
     if ( ! is_idle_vcpu(vc) )
     {
-        BUG_ON( svc->sdom == NULL );
-
+        ASSERT(svc->sdom != NULL);
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
         /* Starting load of 50% */
@@ -1124,7 +1140,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
     }
     else
     {
-        BUG_ON( svc->sdom != NULL );
+        ASSERT(svc->sdom == NULL);
         svc->credit = CSCHED2_IDLE_CREDIT;
         svc->weight = 0;
     }
@@ -1168,7 +1184,7 @@ runq_assign(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu *svc = vc->sched_priv;
 
-    BUG_ON(svc->rqd != NULL);
+    ASSERT(svc->rqd == NULL);
 
     __runq_assign(svc, RQD(ops, vc->processor));
 }
@@ -1176,8 +1192,8 @@ runq_assign(const struct scheduler *ops, struct vcpu *vc)
 static void
 __runq_deassign(struct csched2_vcpu *svc)
 {
-    BUG_ON(__vcpu_on_runq(svc));
-    BUG_ON(svc->flags & CSFLAG_scheduled);
+    ASSERT(!__vcpu_on_runq(svc));
+    ASSERT(!(svc->flags & CSFLAG_scheduled));
 
     list_del_init(&svc->rqd_elem);
     update_max_weight(svc->rqd, 0, svc->weight);
@@ -1193,7 +1209,7 @@ runq_deassign(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu *svc = vc->sched_priv;
 
-    BUG_ON(svc->rqd != RQD(ops, vc->processor));
+    ASSERT(svc->rqd == RQD(ops, vc->processor));
 
     __runq_deassign(svc);
 }
@@ -1205,9 +1221,8 @@ csched2_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     struct csched2_dom * const sdom = svc->sdom;
     spinlock_t *lock;
 
-    printk("%s: Inserting %pv\n", __func__, vc);
-
-    BUG_ON(is_idle_vcpu(vc));
+    ASSERT(!is_idle_vcpu(vc));
+    ASSERT(list_empty(&svc->runq_elem));
 
     /* Add vcpu to runqueue of initial processor */
     lock = vcpu_schedule_lock_irq(vc);
@@ -1235,26 +1250,21 @@ static void
 csched2_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
-    struct csched2_dom * const sdom = svc->sdom;
-
-    BUG_ON( sdom == NULL );
-    BUG_ON( !list_empty(&svc->runq_elem) );
+    spinlock_t *lock;
 
-    if ( ! is_idle_vcpu(vc) )
-    {
-        spinlock_t *lock;
+    ASSERT(!is_idle_vcpu(vc));
+    ASSERT(list_empty(&svc->runq_elem));
 
-        SCHED_STAT_CRANK(vcpu_remove);
+    SCHED_STAT_CRANK(vcpu_remove);
 
-        /* Remove from runqueue */
-        lock = vcpu_schedule_lock_irq(vc);
+    /* Remove from runqueue */
+    lock = vcpu_schedule_lock_irq(vc);
 
-        runq_deassign(ops, vc);
+    runq_deassign(ops, vc);
 
-        vcpu_schedule_unlock_irq(lock, vc);
+    vcpu_schedule_unlock_irq(lock, vc);
 
-        svc->sdom->nr_vcpus--;
-    }
+    svc->sdom->nr_vcpus--;
 }
 
 static void
@@ -1262,14 +1272,14 @@ csched2_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
 
-    BUG_ON( is_idle_vcpu(vc) );
+    ASSERT(!is_idle_vcpu(vc));
     SCHED_STAT_CRANK(vcpu_sleep);
 
     if ( curr_on_cpu(vc->processor) == vc )
         cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
     else if ( __vcpu_on_runq(svc) )
     {
-        BUG_ON(svc->rqd != RQD(ops, vc->processor));
+        ASSERT(svc->rqd == RQD(ops, vc->processor));
         update_load(ops, svc->rqd, svc, -1, NOW());
         __runq_remove(svc);
     }
@@ -1285,9 +1295,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 
     /* Schedule lock should be held at this point. */
 
-    d2printk("w %pv\n", vc);
-
-    BUG_ON( is_idle_vcpu(vc) );
+    ASSERT(!is_idle_vcpu(vc));
 
     if ( unlikely(curr_on_cpu(vc->processor) == vc) )
     {
@@ -1319,7 +1327,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     if ( svc->rqd == NULL )
         runq_assign(ops, vc);
     else
-        BUG_ON(RQD(ops, vc->processor) != svc->rqd );
+        ASSERT(RQD(ops, vc->processor) == svc->rqd );
 
     now = NOW();
 
@@ -1330,7 +1338,6 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     runq_tickle(ops, svc, now);
 
 out:
-    d2printk("w-\n");
     return;
 }
 
@@ -1342,6 +1349,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     s_time_t now = NOW();
 
     BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor));
+    ASSERT(is_idle_vcpu(vc) || svc->rqd == RQD(ops, vc->processor));
 
     /* This vcpu is now eligible to be put on the runqueue again */
     __clear_bit(__CSFLAG_scheduled, &svc->flags);
@@ -1357,7 +1365,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     if ( __test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
          && likely(vcpu_runnable(vc)) )
     {
-        BUG_ON(__vcpu_on_runq(svc));
+        ASSERT(!__vcpu_on_runq(svc));
 
         runq_insert(ops, svc);
         runq_tickle(ops, svc, now);
@@ -1377,7 +1385,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
     s_time_t min_avgload;
 
-    BUG_ON(cpumask_empty(&prv->active_queues));
+    ASSERT(!cpumask_empty(&prv->active_queues));
 
     /* Locking:
      * - vc->processor is already locked
@@ -1396,12 +1404,8 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !spin_trylock(&prv->lock) )
     {
-        if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
-        {
-            d2printk("%pv -\n", svc->vcpu);
-            __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
-        }
-
+        /* We may be here because someone requested us to migrate. */
+        __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         return get_fallback_cpu(svc);
     }
 
@@ -1411,7 +1415,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
         {
-            printk("%s: Runqueue migrate aborted because target runqueue disappeared!\n",
+            printk(XENLOG_WARNING "%s: target runqueue disappeared!\n",
                    __func__);
         }
         else
@@ -1420,10 +1424,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
                         &svc->migrate_rqd->active);
             new_cpu = cpumask_any(cpumask_scratch);
             if ( new_cpu < nr_cpu_ids )
-            {
-                d2printk("%pv +\n", svc->vcpu);
                 goto out_up;
-            }
         }
         /* Fall-through to normal cpu pick */
     }
@@ -1537,9 +1538,23 @@ static void migrate(const struct scheduler *ops,
                     struct csched2_runqueue_data *trqd, 
                     s_time_t now)
 {
+    /* TRACE */
+    {
+        struct {
+            unsigned vcpu:16, dom:16;
+            unsigned rqi:16, trqi:16;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.rqi = svc->rqd->id;
+        d.trqi = trqd->id;
+        trace_var(TRC_CSCHED2_MIGRATE, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
     if ( svc->flags & CSFLAG_scheduled )
     {
-        d2printk("%pv %d-%d a\n", svc->vcpu, svc->rqd->id, trqd->id);
         /* It's running; mark it to migrate. */
         svc->migrate_rqd = trqd;
         __set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
@@ -1548,21 +1563,20 @@ static void migrate(const struct scheduler *ops,
     }
     else
     {
-        int on_runq=0;
+        int on_runq = 0;
         /* It's not running; just move it */
-        d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
         if ( __vcpu_on_runq(svc) )
         {
             __runq_remove(svc);
             update_load(ops, svc->rqd, NULL, -1, now);
-            on_runq=1;
+            on_runq = 1;
         }
         __runq_deassign(svc);
 
         cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
                     &trqd->active);
         svc->vcpu->processor = cpumask_any(cpumask_scratch);
-        BUG_ON(svc->vcpu->processor >= nr_cpu_ids);
+        ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         __runq_assign(svc, trqd);
         if ( on_runq )
@@ -1754,7 +1768,7 @@ csched2_vcpu_migrate(
     struct csched2_runqueue_data *trqd;
 
     /* Check if new_cpu is valid */
-    BUG_ON(!cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
+    ASSERT(cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
     ASSERT(cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));
 
     trqd = RQD(ops, new_cpu);
@@ -1814,7 +1828,7 @@ csched2_dom_cntl(
                  * been disabled. */
                 spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
 
-                BUG_ON(svc->rqd != RQD(ops, svc->vcpu->processor));
+                ASSERT(svc->rqd == RQD(ops, svc->vcpu->processor));
 
                 svc->weight = sdom->weight;
                 update_max_weight(svc->rqd, svc->weight, old_weight);
@@ -1863,8 +1877,6 @@ csched2_dom_init(const struct scheduler *ops, struct domain *dom)
 {
     struct csched2_dom *sdom;
 
-    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
-
     if ( is_idle_domain(dom) )
         return 0;
 
@@ -1895,7 +1907,7 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
 static void
 csched2_dom_destroy(const struct scheduler *ops, struct domain *dom)
 {
-    BUG_ON(CSCHED2_DOM(dom)->nr_vcpus > 0);
+    ASSERT(CSCHED2_DOM(dom)->nr_vcpus == 0);
 
     csched2_free_domdata(ops, CSCHED2_DOM(dom));
 }
@@ -2036,8 +2048,6 @@ csched2_schedule(
     SCHED_STAT_CRANK(schedule);
     CSCHED2_VCPU_CHECK(current);
 
-    d2printk("sc p%d c %pv now %"PRI_stime"\n", cpu, scurr->vcpu, now);
-
     BUG_ON(!cpumask_test_cpu(cpu, &CSCHED2_PRIV(ops)->initialized));
 
     rqd = RQD(ops, cpu);
@@ -2045,30 +2055,6 @@ csched2_schedule(
 
     /* Protected by runqueue lock */        
 
-    /* DEBUG */
-    if ( !is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd)
-    {
-        int other_rqi = -1, this_rqi = c2r(ops, cpu);
-
-        if ( scurr->rqd )
-        {
-            int rq;
-            other_rqi = -2;
-            for_each_cpu ( rq, &CSCHED2_PRIV(ops)->active_queues )
-            {
-                if ( scurr->rqd == &CSCHED2_PRIV(ops)->rqd[rq] )
-                {
-                    other_rqi = rq;
-                    break;
-                }
-            }
-        }
-        printk("%s: pcpu %d rq %d, but scurr %pv assigned to "
-               "pcpu %d rq %d!\n",
-               __func__,
-               cpu, this_rqi,
-               scurr->vcpu, scurr->vcpu->processor, other_rqi);
-    }
     BUG_ON(!is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd);
 
     /* Clear "tickled" bit now that we've been scheduled */
@@ -2119,15 +2105,10 @@ csched2_schedule(
         /* If switching, remove this from the runqueue and mark it scheduled */
         if ( snext != scurr )
         {
-            BUG_ON(snext->rqd != rqd);
-    
+            ASSERT(snext->rqd == rqd);
+            ASSERT(!snext->vcpu->is_running);
+
             __runq_remove(snext);
-            if ( snext->vcpu->is_running )
-            {
-                printk("p%d: snext %pv running on p%d! scurr %pv\n",
-                       cpu, snext->vcpu, snext->vcpu->processor, scurr->vcpu);
-                BUG();
-            }
             __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
@@ -2433,10 +2414,10 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
 
     rqd = prv->rqd + rqi;
 
-    printk("Adding cpu %d to runqueue %d\n", cpu, rqi);
+    printk(XENLOG_INFO "Adding cpu %d to runqueue %d\n", cpu, rqi);
     if ( ! cpumask_test_cpu(rqi, &prv->active_queues) )
     {
-        printk(" First cpu on runqueue, activating\n");
+        printk(XENLOG_INFO " First cpu on runqueue, activating\n");
         activate_runqueue(prv, rqi);
     }
     
@@ -2548,14 +2529,14 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     /* No need to save IRQs here, they're already disabled */
     spin_lock(&rqd->lock);
 
-    printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
+    printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
     {
-        printk(" No cpus left on runqueue, disabling\n");
+        printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
 
@@ -2574,15 +2555,20 @@ csched2_init(struct scheduler *ops)
     int i;
     struct csched2_private *prv;
 
-    printk("Initializing Credit2 scheduler\n" \
-           " WARNING: This is experimental software in development.\n" \
+    printk("Initializing Credit2 scheduler\n");
+    printk(" WARNING: This is experimental software in development.\n" \
            " Use at your own risk.\n");
 
-    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
-    printk(" load_window_shift: %d\n", opt_load_window_shift);
-    printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
-    printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
-    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
+    printk(XENLOG_INFO " load_precision_shift: %d\n"
+           XENLOG_INFO " load_window_shift: %d\n"
+           XENLOG_INFO " underload_balance_tolerance: %d\n"
+           XENLOG_INFO " overload_balance_tolerance: %d\n"
+           XENLOG_INFO " runqueues arrangement: %s\n",
+           opt_load_precision_shift,
+           opt_load_window_shift,
+           opt_underload_balance_tolerance,
+           opt_overload_balance_tolerance,
+           opt_runqueue_str[opt_runqueue]);
 
     if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
     {



* [PATCH v2 07/11] xen: credit2: add yet some more tracing
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2016-07-15 14:49 ` [PATCH v2 06/11] xen: credit2: make the code less experimental Dario Faggioli
@ 2016-07-15 14:49 ` Dario Faggioli
  2016-07-15 14:50 ` [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:49 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

(and fix the style of two labels as well.)

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   58 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index d72f530..a4aec73 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -51,6 +51,9 @@
 #define TRC_CSCHED2_TICKLE_NEW       TRC_SCHED_CLASS_EVT(CSCHED2, 13)
 #define TRC_CSCHED2_RUNQ_MAX_WEIGHT  TRC_SCHED_CLASS_EVT(CSCHED2, 14)
 #define TRC_CSCHED2_MIGRATE          TRC_SCHED_CLASS_EVT(CSCHED2, 15)
+#define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
+#define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
+#define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
 
 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the
@@ -711,6 +714,8 @@ update_load(const struct scheduler *ops,
             struct csched2_runqueue_data *rqd,
             struct csched2_vcpu *svc, int change, s_time_t now)
 {
+    trace_var(TRC_CSCHED2_UPDATE_LOAD, 1, 0,  NULL);
+
     __update_runq_load(ops, rqd, change, now);
     if ( svc )
         __update_svc_load(ops, svc, change, now);
@@ -1486,6 +1491,23 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 out_up:
     spin_unlock(&prv->lock);
 
+    /* TRACE */
+    {
+        struct {
+            uint64_t b_avgload;
+            unsigned vcpu:16, dom:16;
+            unsigned rq_id:16, new_cpu:16;
+        } d;
+        d.b_avgload = prv->rqd[min_rqi].b_avgload;
+        d.dom = vc->domain->domain_id;
+        d.vcpu = vc->vcpu_id;
+        d.rq_id = c2r(ops, new_cpu);
+        d.new_cpu = new_cpu;
+        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
     return new_cpu;
 }
 
@@ -1611,7 +1633,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     bool_t inner_load_updated = 0;
 
     balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
-    
+
     /*
      * Basic algorithm: Push, pull, or swap.
      * - Find the runqueue with the furthest load distance
@@ -1676,6 +1698,20 @@ retry:
         if ( i > cpus_max )
             cpus_max = i;
 
+        /* TRACE */
+        {
+            struct {
+                unsigned lrq_id:16, orq_id:16;
+                unsigned load_delta;
+            } d;
+            d.lrq_id = st.lrqd->id;
+            d.orq_id = st.orqd->id;
+            d.load_delta = st.load_delta;
+            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
+                      sizeof(d),
+                      (unsigned char *)&d);
+        }
+
         /*
          * If we're under 100% capacaty, only shift if load difference
          * is > 1.  otherwise, shift if under 12.5%
@@ -1704,6 +1740,21 @@ retry:
     if ( unlikely(st.orqd->id < 0) )
         goto out_up;
 
+    /* TRACE */
+    {
+        struct {
+            uint64_t lb_avgload, ob_avgload;
+            unsigned lrq_id:16, orq_id:16;
+        } d;
+        d.lrq_id = st.lrqd->id;
+        d.lb_avgload = st.lrqd->b_avgload;
+        d.orq_id = st.orqd->id;
+        d.ob_avgload = st.orqd->b_avgload;
+        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
     /* Look for "swap" which gives the best load average
      * FIXME: O(n^2)! */
 
@@ -1753,10 +1804,9 @@ retry:
     if ( st.best_pull_svc )
         migrate(ops, st.best_pull_svc, st.lrqd, now);
 
-out_up:
+ out_up:
     spin_unlock(&st.orqd->lock);
-
-out:
+ out:
     return;
 }
 



* [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (6 preceding siblings ...)
  2016-07-15 14:49 ` [PATCH v2 07/11] xen: credit2: add yet some more tracing Dario Faggioli
@ 2016-07-15 14:50 ` Dario Faggioli
  2016-07-18 15:10   ` George Dunlap
  2016-07-15 14:50 ` [PATCH v2 09/11] tools: tracing: deal with new Credit2 events Dario Faggioli
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
Changes from v1:
 * avoid stray code removal in balance_load(), as pointed out by George
   during review.
---
 xen/common/sched_credit2.c |  112 +++++++++++++++++++++++---------------------
 1 file changed, 59 insertions(+), 53 deletions(-)
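
The change is the same simple pattern applied to every trace point: guard
the marshalling of the per-event struct with tb_init_done (the flag that
trace_var() checks internally anyway) and call __trace_var() directly, so
the struct is only filled in when tracing is actually active. A minimal
sketch of the before/after shape (TRC_CSCHED2_FOO is a placeholder event
name; tb_init_done, trace_var() and __trace_var() are the existing tracing
primitives used throughout the diff below):

    /* Before: the record is always marshalled, even with tracing off. */
    {
        struct { unsigned vcpu:16, dom:16; } d;
        d.dom = svc->vcpu->domain->domain_id;
        d.vcpu = svc->vcpu->vcpu_id;
        trace_var(TRC_CSCHED2_FOO, 1, sizeof(d), (unsigned char *)&d);
    }

    /* After: both the marshalling and the (now redundant) check hidden
     * inside trace_var() are skipped unless trace buffers are set up. */
    if ( unlikely(tb_init_done) )
    {
        struct { unsigned vcpu:16, dom:16; } d;
        d.dom = svc->vcpu->domain->domain_id;
        d.vcpu = svc->vcpu->vcpu_id;
        __trace_var(TRC_CSCHED2_FOO, 1, sizeof(d), (unsigned char *)&d);
    }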

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a4aec73..be27ba3 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -638,6 +638,7 @@ __update_runq_load(const struct scheduler *ops,
 
     ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t rq_avgload, b_avgload;
@@ -648,9 +649,9 @@ __update_runq_load(const struct scheduler *ops,
         d.rq_avgload = rqd->avgload;
         d.b_avgload = rqd->b_avgload;
         d.shift = P;
-        trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -693,6 +694,7 @@ __update_svc_load(const struct scheduler *ops,
     }
     svc->load_last_update = now;
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t v_avgload;
@@ -703,9 +705,9 @@ __update_svc_load(const struct scheduler *ops,
         d.vcpu = svc->vcpu->vcpu_id;
         d.v_avgload = svc->avgload;
         d.shift = P;
-        trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -761,6 +763,7 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
 
     pos = __runq_insert(runq, svc);
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -769,9 +772,9 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.pos = pos;
-        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     return;
@@ -814,7 +817,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
     ASSERT(new->rqd == rqd);
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -824,9 +827,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         d.vcpu = new->vcpu->vcpu_id;
         d.processor = new->vcpu->processor;
         d.credit = new->credit;
-        trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     /*
@@ -884,7 +887,8 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
             lowest = cur->credit;
         }
 
-        /* TRACE */ {
+        if ( unlikely(tb_init_done) )
+        {
             struct {
                 unsigned vcpu:16, dom:16;
                 unsigned credit;
@@ -892,9 +896,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
             d.dom = cur->vcpu->domain->domain_id;
             d.vcpu = cur->vcpu->vcpu_id;
             d.credit = cur->credit;
-            trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
     }
 
@@ -912,14 +916,15 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
  tickle:
     BUG_ON(ipid == -1);
 
-    /* TRACE */ {
+    if ( unlikely(tb_init_done) )
+    {
         struct {
             unsigned cpu:16, pad:16;
         } d;
         d.cpu = ipid; d.pad = 0;
-        trace_var(TRC_CSCHED2_TICKLE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_TICKLE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
     __cpumask_set_cpu(ipid, &rqd->tickled);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
@@ -981,7 +986,8 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
 
         svc->start_time = now;
 
-        /* TRACE */ {
+        if ( unlikely(tb_init_done) )
+        {
             struct {
                 unsigned vcpu:16, dom:16;
                 unsigned credit_start, credit_end;
@@ -992,9 +998,9 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
             d.credit_start = start_credit;
             d.credit_end = svc->credit;
             d.multiplier = m;
-            trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
     }
 
@@ -1030,7 +1036,7 @@ void burn_credits(struct csched2_runqueue_data *rqd,
                  __func__, now, svc->start_time);
     }
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1041,9 +1047,9 @@ void burn_credits(struct csched2_runqueue_data *rqd,
         d.vcpu = svc->vcpu->vcpu_id;
         d.credit = svc->credit;
         d.delta = delta;
-        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -1079,16 +1085,16 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
         SCHED_STAT_CRANK(upd_max_weight_full);
     }
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned rqi:16, max_weight:16;
         } d;
         d.rqi = rqd->id;
         d.max_weight = rqd->max_weight;
-        trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -1168,7 +1174,7 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
     /* Expected new load based on adding this vcpu */
     rqd->b_avgload += svc->avgload;
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1177,9 +1183,9 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.rqi=rqd->id;
-        trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
 }
@@ -1491,7 +1497,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 out_up:
     spin_unlock(&prv->lock);
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t b_avgload;
@@ -1503,9 +1509,9 @@ out_up:
         d.vcpu = vc->vcpu_id;
         d.rq_id = c2r(ops, new_cpu);
         d.new_cpu = new_cpu;
-        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_PICKED_CPU, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     return new_cpu;
@@ -1560,7 +1566,7 @@ static void migrate(const struct scheduler *ops,
                     struct csched2_runqueue_data *trqd, 
                     s_time_t now)
 {
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1570,9 +1576,9 @@ static void migrate(const struct scheduler *ops,
         d.vcpu = svc->vcpu->vcpu_id;
         d.rqi = svc->rqd->id;
         d.trqi = trqd->id;
-        trace_var(TRC_CSCHED2_MIGRATE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_MIGRATE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     if ( svc->flags & CSFLAG_scheduled )
@@ -1698,7 +1704,7 @@ retry:
         if ( i > cpus_max )
             cpus_max = i;
 
-        /* TRACE */
+        if ( unlikely(tb_init_done) )
         {
             struct {
                 unsigned lrq_id:16, orq_id:16;
@@ -1707,9 +1713,9 @@ retry:
             d.lrq_id = st.lrqd->id;
             d.orq_id = st.orqd->id;
             d.load_delta = st.load_delta;
-            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
 
         /*
@@ -1740,7 +1746,7 @@ retry:
     if ( unlikely(st.orqd->id < 0) )
         goto out_up;
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t lb_avgload, ob_avgload;
@@ -1750,9 +1756,9 @@ retry:
         d.lb_avgload = st.lrqd->b_avgload;
         d.orq_id = st.orqd->id;
         d.ob_avgload = st.orqd->b_avgload;
-        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     /* Look for "swap" which gives the best load average



* [PATCH v2 09/11] tools: tracing: deal with new Credit2 events
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (7 preceding siblings ...)
  2016-07-15 14:50 ` [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
@ 2016-07-15 14:50 ` Dario Faggioli
  2016-07-18 15:39   ` Wei Liu
  2016-07-15 14:50 ` [PATCH v2 10/11] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Anshul Makkar, Ian Jackson, George Dunlap

more specifically, with: TICKLE_NEW, RUNQ_MAX_WEIGHT,
MIGRATE, LOAD_CHECK, LOAD_BALANCE and PICKED_CPU, and
in both xenalyze and formats (for xentrace_format).
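
As a reminder of how the two sides line up: in xentrace_format's formats
file, %(1), %(2), ... refer to the 1st, 2nd, ... 32-bit word of the trace
record payload, so each bitfield struct emitted by the hypervisor maps
onto its format string word by word. For instance, for the MIGRATE event
(struct layout as in the trace point added earlier in this series):

    /* Hypervisor side: two 32-bit payload words. */
    struct {
        unsigned vcpu:16, dom:16;  /* word 1: dom:vcpu = 0x%(1)08x               */
        unsigned rqi:16, trqi:16;  /* word 2: rq_id[16]:trq_id[16] = 0x%(2)08x   */
    } d;

and the corresponding formats line below simply prints those two words.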

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 tools/xentrace/formats    |    6 +++
 tools/xentrace/xenalyze.c |   78 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 2e58d03..caafb5f 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -55,6 +55,12 @@
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
 0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(3)08x, vcpuload = 0x%(2)08x%(1)08x, wshift = %(4)d ]
 0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x%(5)08x, rq_avgload = 0x%(2)08x%(1)08x, b_avgload = 0x%(4)08x%(3)08x ]
+0x0002220d  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_new     [ dom:vcpu = 0x%(1)08x, processor = %(2)d, credit = %(3)d ]
+0x0002220e  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_max_weight [ rq_id[16]:max_weight[16] = 0x%(1)08x ]
+0x0002220f  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:migrate        [ dom:vcpu = 0x%(1)08x, rq_id[16]:trq_id[16] = 0x%(2)08x ]
+0x00022210  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_check     [ lrq_id[16]:orq_id[16] = 0x%(1)08x, delta = %(2)d ]
+0x00022211  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_balance   [ l_bavgload = 0x%(2)08x%(1)08x, o_bavgload = 0x%(4)08x%(3)08x, lrq_id[16]:orq_id[16] = 0x%(5)08x ]
+0x00022212  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:pick_cpu       [ b_avgload = 0x%(2)08x%(1)08x, dom:vcpu = 0x%(3)08x, rq_id[16]:new_cpu[16] = %(4)d ]
 
 0x00022801  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:tickle        [ cpu = %(1)d ]
 0x00022802  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:runq_pick     [ dom:vcpu = 0x%(1)08x, cur_deadline = 0x%(3)08x%(2)08x, cur_budget = 0x%(5)08x%(4)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index f2f97bd..d223de6 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7725,7 +7725,6 @@ void sched_process(struct pcpu_info *p)
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
-        case TRC_SCHED_CLASS_EVT(CSCHED2, 9): /* UPDATE_LOAD       */
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 2): /* RUNQ_POS          */
             if(opt.dump_all) {
@@ -7788,11 +7787,15 @@ void sched_process(struct pcpu_info *p)
             if(opt.dump_all)
                 printf(" %s csched2:sched_tasklet\n", ri->dump_header);
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 9):  /* UPDATE_LOAD      */
+            if(opt.dump_all)
+                printf(" %s csched2:update_load\n", ri->dump_header);
+            break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 10): /* RUNQ_ASSIGN      */
             if(opt.dump_all) {
                 struct {
                     unsigned int vcpuid:16, domid:16;
-                    unsigned int rqi;
+                    unsigned int rqi:16;
                 } *r = (typeof(r))ri->d;
 
                 printf(" %s csched2:runq_assign d%uv%u on rq# %u\n",
@@ -7834,6 +7837,77 @@ void sched_process(struct pcpu_info *p)
                        avgload, r->rq_avgload, b_avgload, r->b_avgload);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 13): /* TICKLE_NEW       */
+            if (opt.dump_all) {
+                struct {
+                    unsigned vcpuid:16, domid:16;
+                    unsigned processor, credit;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:runq_tickle_new d%uv%u, "
+                       "processor = %u, credit = %u\n",
+                       ri->dump_header, r->domid, r->vcpuid,
+                       r->processor, r->credit);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 14): /* RUNQ_MAX_WEIGHT  */
+            if (opt.dump_all) {
+                struct {
+                    unsigned rqi:16, max_weight:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:update_max_weight rq# %u, max_weight = %u\n",
+                       ri->dump_header, r->rqi, r->max_weight);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 15): /* MIGRATE          */
+            if (opt.dump_all) {
+                struct {
+                    unsigned vcpuid:16, domid:16;
+                    unsigned rqi:16, trqi:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:migrate d%uv%u rq# %u --> rq# %u\n",
+                       ri->dump_header, r->domid, r->vcpuid, r->rqi, r->trqi);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 16): /* LOAD_CHECK       */
+            if (opt.dump_all) {
+                struct {
+                    unsigned lrqi:16, orqi:16;
+                    unsigned load_delta;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:load_balance_check lrq# %u, orq# %u, "
+                       "delta = %u\n",
+                       ri->dump_header, r->lrqi, r->orqi, r->load_delta);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 17): /* LOAD_BALANCE     */
+            if (opt.dump_all) {
+                struct {
+                    uint64_t lb_avgload, ob_avgload;
+                    unsigned lrqi:16, orqi:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:load_balance_begin lrq# %u, "
+                       "avg_load = %"PRIu64" -- orq# %u, avg_load = %"PRIu64"\n",
+                       ri->dump_header, r->lrqi, r->lb_avgload,
+                       r->orqi, r->ob_avgload);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 19): /* PICKED_CPU       */
+            if (opt.dump_all) {
+                struct {
+                    uint64_t b_avgload;
+                    unsigned vcpuid:16, domid:16;
+                    unsigned rqi:16, cpu:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:picked_cpu d%uv%u, rq# %u, cpu %u\n",
+                       ri->dump_header, r->domid, r->vcpuid, r->rqi, r->cpu);
+            }
+            break;
         /* RTDS (TRC_RTDS_xxx) */
         case TRC_SCHED_CLASS_EVT(RTDS, 1): /* TICKLE           */
             if(opt.dump_all) {



* [PATCH v2 10/11] xen: credit2: the private scheduler lock can be an rwlock.
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (8 preceding siblings ...)
  2016-07-15 14:50 ` [PATCH v2 09/11] tools: tracing: deal with new Credit2 events Dario Faggioli
@ 2016-07-15 14:50 ` Dario Faggioli
  2016-07-15 14:50 ` [PATCH v2 11/11] xen: credit2: implement true SMT support Dario Faggioli
  2016-07-18 17:21 ` [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 George Dunlap
  11 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

In fact, the data it protects only changes either at init-time,
during cpupool manipulation, or when changing domains' weights.
In all other cases (namely, load balancing, reading weights
and status dumping), information is only read.

Therefore, let the lock be a read/write one. This means there
is no full serialization point for the whole scheduler and
for all the pCPUs of the host any longer.

This is particularly good for scalability (especially when doing
load balancing).

Also, update the high level description of the locking discipline,
and take the chance to reword it a little bit (as well as
to add a couple of locking-related ASSERT()-s).
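
In code terms, the resulting discipline is roughly the following
(condensed sketch, using the same rwlock primitives that appear in the
diff below; not meant as a standalone snippet):

    /* Readers (load balancing, weight reads, dump handlers) no longer
     * serialize against each other: */
    if ( !read_trylock(&prv->lock) )
        return;          /* hot paths just back off instead of spinning */
    /* ... scan prv->active_queues / prv->sdom ... */
    read_unlock(&prv->lock);

    /* Writers (init, cpupool manipulation, weight changes) still get
     * exclusive access: */
    write_lock_irqsave(&prv->lock, flags);
    /* ... update weights / runqueue assignment ... */
    write_unlock_irqrestore(&prv->lock, flags);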

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |  133 ++++++++++++++++++++++++++------------------
 1 file changed, 80 insertions(+), 53 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index be27ba3..b33ba7a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -90,17 +90,37 @@
 
 /*
  * Locking:
- * - Schedule-lock is per-runqueue
- *  + Protects runqueue data, runqueue insertion, &c
- *  + Also protects updates to private sched vcpu structure
- *  + Must be grabbed using vcpu_schedule_lock_irq() to make sure vcpu->processr
- *    doesn't change under our feet.
- * - Private data lock
- *  + Protects access to global domain list
- *  + All other private data is written at init and only read afterwards.
+ *
+ * - runqueue lock
+ *  + it is per-runqueue, so:
+ *   * cpus in a runqueue take the runqueue lock, when using
+ *     pcpu_schedule_lock() / vcpu_schedule_lock() (and friends),
+ *   * a cpu may (try to) take a "remote" runqueue lock, e.g., for
+ *     load balancing;
+ *  + serializes runqueue operations (removing and inserting vcpus);
+ *  + protects runqueue-wide data in csched2_runqueue_data;
+ *  + protects vcpu parameters in csched2_vcpu for the vcpu in the
+ *    runqueue.
+ *
+ * - Private scheduler lock
+ *  + protects scheduler-wide data in csched2_private, such as:
+ *   * the list of domains active in this scheduler,
+ *   * what cpus and what runqueues are active and in what
+ *     runqueue each cpu is;
+ *  + serializes the operation of changing the weights of domains;
+ *
+ * - Type:
+ *  + runqueue locks are 'regular' spinlocks;
+ *  + the private scheduler lock can be an rwlock. In fact, data
+ *    it protects is modified only during initialization, cpupool
+ *    manipulation and when changing weights, and read in all
+ *    other cases (e.g., during load balancing).
+ *
  * Ordering:
- * - We grab private->schedule when updating domain weight; so we
- *  must never grab private if a schedule lock is held.
+ *  + trylock must be used when wanting to take a runqueue lock,
+ *    if we already hold another one;
+ *  + if taking both a runqueue lock and the private scheduler
+ *    lock is necessary, the latter must always be taken first.
  */
 
 /*
@@ -345,7 +365,7 @@ struct csched2_runqueue_data {
  * System-wide private data
  */
 struct csched2_private {
-    spinlock_t lock;
+    rwlock_t lock;
     cpumask_t initialized; /* CPU is initialized for this pool */
     
     struct list_head sdom; /* Used mostly for dump keyhandler. */
@@ -1302,13 +1322,14 @@ static void
 csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
+    unsigned int cpu = vc->processor;
     s_time_t now;
 
-    /* Schedule lock should be held at this point. */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
     ASSERT(!is_idle_vcpu(vc));
 
-    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
+    if ( unlikely(curr_on_cpu(cpu) == vc) )
     {
         SCHED_STAT_CRANK(vcpu_wake_running);
         goto out;
@@ -1399,7 +1420,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     ASSERT(!cpumask_empty(&prv->active_queues));
 
     /* Locking:
-     * - vc->processor is already locked
+     * - Runqueue lock of vc->processor is already locked
      * - Need to grab prv lock to make sure active runqueues don't
      *   change
      * - Need to grab locks for other runqueues while checking
@@ -1412,8 +1433,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
      * just grab the prv lock.  Instead, we'll have to trylock, and
      * do something else reasonable if we fail.
      */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, vc->processor).schedule_lock));
 
-    if ( !spin_trylock(&prv->lock) )
+    if ( !read_trylock(&prv->lock) )
     {
         /* We may be here because someone requested us to migrate. */
         __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
@@ -1495,7 +1517,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     }
 
 out_up:
-    spin_unlock(&prv->lock);
+    read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
     {
@@ -1647,15 +1669,13 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
      * on either side may be empty).
      */
 
-    /* Locking:
-     * - pcpu schedule lock should be already locked
-     */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
     st.lrqd = RQD(ops, cpu);
 
     __update_runq_load(ops, st.lrqd, 0, now);
 
 retry:
-    if ( !spin_trylock(&prv->lock) )
+    if ( !read_trylock(&prv->lock) )
         return;
 
     st.load_delta = 0;
@@ -1685,8 +1705,8 @@ retry:
         spin_unlock(&st.orqd->lock);
     }
 
-    /* Minimize holding the big lock */
-    spin_unlock(&prv->lock);
+    /* Minimize holding the private scheduler lock. */
+    read_unlock(&prv->lock);
     if ( max_delta_rqi == -1 )
         goto out;
 
@@ -1854,14 +1874,19 @@ csched2_dom_cntl(
     unsigned long flags;
     int rc = 0;
 
-    /* Must hold csched2_priv lock to read and update sdom,
-     * runq lock to update csvcs. */
-    spin_lock_irqsave(&prv->lock, flags);
-
+    /*
+     * Locking:
+     *  - we must take the private lock for accessing the weights of the
+     *    vcpus of d,
+     *  - in the putinfo case, we also need the runqueue lock(s), for
+     *    updating the max weight of the runqueue(s).
+     */
     switch ( op->cmd )
     {
     case XEN_DOMCTL_SCHEDOP_getinfo:
+        read_lock_irqsave(&prv->lock, flags);
         op->u.credit2.weight = sdom->weight;
+        read_unlock_irqrestore(&prv->lock, flags);
         break;
     case XEN_DOMCTL_SCHEDOP_putinfo:
         if ( op->u.credit2.weight != 0 )
@@ -1869,6 +1894,8 @@ csched2_dom_cntl(
             struct vcpu *v;
             int old_weight;
 
+            write_lock_irqsave(&prv->lock, flags);
+
             old_weight = sdom->weight;
 
             sdom->weight = op->u.credit2.weight;
@@ -1877,11 +1904,6 @@ csched2_dom_cntl(
             for_each_vcpu ( d, v )
             {
                 struct csched2_vcpu *svc = CSCHED2_VCPU(v);
-
-                /* NB: Locking order is important here.  Because we grab this lock here, we
-                 * must never lock csched2_priv.lock if we're holding a runqueue lock.
-                 * Also, calling vcpu_schedule_lock() is enough, since IRQs have already
-                 * been disabled. */
                 spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
 
                 ASSERT(svc->rqd == RQD(ops, svc->vcpu->processor));
@@ -1891,6 +1913,8 @@ csched2_dom_cntl(
 
                 vcpu_schedule_unlock(lock, svc->vcpu);
             }
+
+            write_unlock_irqrestore(&prv->lock, flags);
         }
         break;
     default:
@@ -1898,7 +1922,6 @@ csched2_dom_cntl(
         break;
     }
 
-    spin_unlock_irqrestore(&prv->lock, flags);
 
     return rc;
 }
@@ -1906,6 +1929,7 @@ csched2_dom_cntl(
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_dom *sdom;
     unsigned long flags;
 
@@ -1919,11 +1943,11 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
     sdom->weight = CSCHED2_DEFAULT_WEIGHT;
     sdom->nr_vcpus = 0;
 
-    spin_lock_irqsave(&CSCHED2_PRIV(ops)->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     list_add_tail(&sdom->sdom_elem, &CSCHED2_PRIV(ops)->sdom);
 
-    spin_unlock_irqrestore(&CSCHED2_PRIV(ops)->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     return (void *)sdom;
 }
@@ -1950,12 +1974,13 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
 {
     unsigned long flags;
     struct csched2_dom *sdom = data;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
 
-    spin_lock_irqsave(&CSCHED2_PRIV(ops)->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     list_del_init(&sdom->sdom_elem);
 
-    spin_unlock_irqrestore(&CSCHED2_PRIV(ops)->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     xfree(data);
 }
@@ -2109,7 +2134,7 @@ csched2_schedule(
     rqd = RQD(ops, cpu);
     BUG_ON(!cpumask_test_cpu(cpu, &rqd->active));
 
-    /* Protected by runqueue lock */        
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
     BUG_ON(!is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd);
 
@@ -2248,12 +2273,12 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
 
     /*
      * We need both locks:
-     * - csched2_dump_vcpu() wants to access domains' scheduling
-     *   parameters, which are protected by the private scheduler lock;
+     * - csched2_dump_vcpu() wants to access domains' weights,
+     *   which are protected by the private scheduler lock;
      * - we scan through the runqueue, so we need the proper runqueue
      *   lock (the one of the runqueue this cpu is associated to).
      */
-    spin_lock_irqsave(&prv->lock, flags);
+    read_lock_irqsave(&prv->lock, flags);
     lock = per_cpu(schedule_data, cpu).schedule_lock;
     spin_lock(lock);
 
@@ -2284,7 +2309,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
     }
 
     spin_unlock(lock);
-    spin_unlock_irqrestore(&prv->lock, flags);
+    read_unlock_irqrestore(&prv->lock, flags);
 #undef cpustr
 }
 
@@ -2297,9 +2322,11 @@ csched2_dump(const struct scheduler *ops)
     int i, loop;
 #define cpustr keyhandler_scratch
 
-    /* We need the private lock as we access global scheduler data
-     * and (below) the list of active domains. */
-    spin_lock_irqsave(&prv->lock, flags);
+    /*
+     * We need the private scheduler lock as we access global
+     * scheduler data and (below) the list of active domains.
+     */
+    read_lock_irqsave(&prv->lock, flags);
 
     printk("Active queues: %d\n"
            "\tdefault-weight     = %d\n",
@@ -2360,7 +2387,7 @@ csched2_dump(const struct scheduler *ops)
         }
     }
 
-    spin_unlock_irqrestore(&prv->lock, flags);
+    read_unlock_irqrestore(&prv->lock, flags);
 #undef cpustr
 }
 
@@ -2462,7 +2489,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
 
-    ASSERT(spin_is_locked(&prv->lock));
+    ASSERT(rw_is_write_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
 
     /* Figure out which runqueue to put it in */
@@ -2501,7 +2528,7 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
      */
     ASSERT(!pdata);
 
-    spin_lock_irqsave(&prv->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
     old_lock = pcpu_schedule_lock(cpu);
 
     rqi = init_pdata(prv, cpu);
@@ -2510,7 +2537,7 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 
     /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
     spin_unlock(old_lock);
-    spin_unlock_irqrestore(&prv->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 }
 
 /* Change the scheduler of cpu to us (Credit2). */
@@ -2533,7 +2560,7 @@ csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
      * cpu) is what is necessary to prevent races.
      */
     ASSERT(!local_irq_is_enabled());
-    spin_lock(&prv->lock);
+    write_lock(&prv->lock);
 
     idle_vcpu[cpu]->sched_priv = vdata;
 
@@ -2558,7 +2585,7 @@ csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
     smp_mb();
     per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
 
-    spin_unlock(&prv->lock);
+    write_unlock(&prv->lock);
 }
 
 static void
@@ -2569,7 +2596,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     struct csched2_runqueue_data *rqd;
     int rqi;
 
-    spin_lock_irqsave(&prv->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     /*
      * alloc_pdata is not implemented, so pcpu must be NULL. On the other
@@ -2600,7 +2627,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     __cpumask_clear_cpu(cpu, &prv->initialized);
 
-    spin_unlock_irqrestore(&prv->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     return;
 }
@@ -2651,7 +2678,7 @@ csched2_init(struct scheduler *ops)
         return -ENOMEM;
     ops->sched_data = prv;
 
-    spin_lock_init(&prv->lock);
+    rwlock_init(&prv->lock);
     INIT_LIST_HEAD(&prv->sdom);
 
     /* But un-initialize all runqueues */



* [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (9 preceding siblings ...)
  2016-07-15 14:50 ` [PATCH v2 10/11] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
@ 2016-07-15 14:50 ` Dario Faggioli
  2016-07-18 16:48   ` George Dunlap
  2016-07-18 17:21 ` [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 George Dunlap
  11 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, Jan Beulich, David Vrabel

In fact, right now, we recommend keeping runqueues
arranged per-core, so that it is the inter-runqueue load
balancing code that automatically spreads the work in an
SMT friendly way. This means that any other runq
arrangement one may want to use falls short of SMT
scheduling optimizations.

This commit implements SMT awareness --similar to the
one we have in Credit1-- for any possible runq
arrangement. This turned out to be pretty easy to do,
as the logic can live entirely in runq_tickle()
(although, in order to avoid for_each_cpu loops in
that function, we use a new cpumask which indeed needs
to be updated in other places).

In addition to disentangling SMT awareness from load
balancing, this also allows us to support the
sched_smt_power_savings parameter in Credit2.
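
The heart of it is keeping the new per-runqueue smt_idle mask consistent
as cpus go idle and busy. A rough sketch of what the two helpers look
like (names from the changelog below; exact details may differ, and
cpumask_subset()/cpumask_or()/cpumask_andnot() and cpu_sibling_mask are
pre-existing Xen facilities):

    /* cpu went idle: if the whole core is now idle, add all of its
     * siblings to the fully-idle-cores mask. */
    static inline void smt_idle_mask_set(unsigned int cpu,
                                         const cpumask_t *idlers,
                                         cpumask_t *mask)
    {
        const cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);

        if ( cpumask_subset(siblings, idlers) )
            cpumask_or(mask, mask, siblings);
    }

    /* cpu became busy: its core is no longer fully idle, so none of
     * its siblings may stay in the mask. */
    static inline void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
    {
        cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
    }

runq_tickle() then just prefers cpus in rqd->smt_idle (or, with
sched_smt_power_savings, cpus that are idle but not in it) when deciding
which pcpu to kick.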

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
Changes from v1:
 * make smt_idle_mask_set() const correct, and use a local variable for
   per_cpu(cpu_sibling_masks) in there, as suggested during review;
 * removed a stray XXX, as pointed out during review;
 * full stop in a comment, as requested during review.
---
 xen/common/sched_credit2.c |  142 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 128 insertions(+), 14 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b33ba7a..6ccc6f0 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -354,7 +354,8 @@ struct csched2_runqueue_data {
     unsigned int max_weight;
 
     cpumask_t idle,        /* Currently idle */
-        tickled;           /* Another cpu in the queue is already targeted for this one */
+        smt_idle,          /* Fully idle cores (as in all the siblings are idle) */
+        tickled;           /* Have been asked to go through schedule */
     int load;              /* Instantaneous load: Length of queue  + num non-idle threads */
     s_time_t load_last_update;  /* Last time average was updated */
     s_time_t avgload;           /* Decaying queue load */
@@ -415,6 +416,76 @@ struct csched2_dom {
 };
 
 /*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT siblings in a core are idle, all their corresponding
+ *    bits are set in the smt_idle mask;
+ *  - when even _just one_ of the SMT siblings in a core is not idle, all the
+ *    bits corresponding to it and to all its siblings are clear in the
+ *    smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *    on pcpus from smt_idle mask first. This is what happens if
+ *    sched_smt_power_savings was not set at boot (default), and it maximizes
+ *    true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *    on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *    sched_smt_power_savings is set at boot, and it allows as many cores as
+ *    possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq, _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *    runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *    csched2_cpu_pick() looks for the least loaded runq and returns just any
+ *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *    the chosen runq, and it is again runq_tickle(), called by
+ *    csched2_vcpu_wake() that actually decides what pcpu to use within the
+ *    chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling  migrate()
+ *    directly, that again temporarily use a random pcpu from the new runq,
+ *    and then calls runq_tickle(), by itself.
+ */
+
+/*
+ * If all the siblings of cpu (including cpu itself) are in idlers,
+ * set all their bits in mask.
+ *
+ * In order to properly take into account tickling, idlers needs to be
+ * set qeual to something like:
+ *
+ *  rqd->idle & (~rqd->tickled)
+ *
+ * This is because cpus that have been tickled will very likely pick up some
+ * work as soon as the manage to schedule, and hence we should really consider
+ * them as busy.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
+                       cpumask_t *mask)
+{
+    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
+
+    if ( cpumask_subset(cpu_siblings, idlers) )
+        cpumask_or(mask, mask, cpu_siblings);
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask.
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{
+    cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
  * When a hard affinity change occurs, we may not be able to check some
  * (any!) of the other runqueues, when looking for the best new processor
  * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
@@ -853,9 +924,30 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     }
 
     /*
-     * Get a mask of idle, but not tickled, processors that new is
-     * allowed to run on. If that's not empty, choose someone from there
-     * (preferrably, the one were new was running on already).
+     * First of all, consider idle cpus, checking if we can just
+     * re-use the pcpu where we were running before.
+     *
+     * If there are cores where all the siblings are idle, consider
+     * them first, honoring whatever the spreading-vs-consolidation
+     * SMT policy wants us to do.
+     */
+    if ( unlikely(sched_smt_power_savings) )
+        cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
+    else
+        cpumask_copy(&mask, &rqd->smt_idle);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    i = cpumask_test_or_cycle(cpu, &mask);
+    if ( i < nr_cpu_ids )
+    {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
+        ipid = i;
+        goto tickle;
+    }
+
+    /*
+     * If there are no fully idle cores, check all idlers, after
+     * having filtered out pcpus that have been tickled but haven't
+     * gone through the scheduler yet.
      */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
@@ -947,6 +1039,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                     (unsigned char *)&d);
     }
     __cpumask_set_cpu(ipid, &rqd->tickled);
+    smt_idle_mask_clear(ipid, &rqd->smt_idle);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1442,8 +1535,10 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         return get_fallback_cpu(svc);
     }
 
-    /* First check to see if we're here because someone else suggested a place
-     * for us to move. */
+    /*
+     * First check to see if we're here because someone else suggested a place
+     * for us to move.
+     */
     if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
@@ -1464,7 +1559,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /* Find the runqueue with the lowest average load. */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
@@ -1507,16 +1602,17 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* We didn't find anyone (most likely because of spinlock contention). */
     if ( min_rqi == -1 )
-        new_cpu = get_fallback_cpu(svc);
-    else
     {
-        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
-                    &prv->rqd[min_rqi].active);
-        new_cpu = cpumask_any(cpumask_scratch);
-        BUG_ON(new_cpu >= nr_cpu_ids);
+        new_cpu = get_fallback_cpu(svc);
+        goto out_up;
     }
 
-out_up:
+    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                &prv->rqd[min_rqi].active);
+    new_cpu = cpumask_any(cpumask_scratch);
+    BUG_ON(new_cpu >= nr_cpu_ids);
+
+ out_up:
     read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
@@ -2140,7 +2236,11 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
+    {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
+        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+    }
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2202,7 +2302,10 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_clear_cpu(cpu, &rqd->idle);
+            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+        }
 
         snext->start_time = now;
 
@@ -2224,10 +2327,17 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
+            {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
+                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_set_cpu(cpu, &rqd->idle);
+            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2357,6 +2467,8 @@ csched2_dump(const struct scheduler *ops)
         printk("\tidlers: %s\n", cpustr);
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].tickled);
         printk("\ttickled: %s\n", cpustr);
+        cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].smt_idle);
+        printk("\tfully idle cores: %s\n", cpustr);
     }
 
     printk("Domain info:\n");
@@ -2510,6 +2622,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &rqd->idle);
     __cpumask_set_cpu(cpu, &rqd->active);
     __cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
     return rqi;
 }
@@ -2615,6 +2728,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->smt_idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
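
To summarize the selection order that runq_tickle() now follows, here is a
rough sketch in plain C (not the Xen code): an 8-pcpu box is modelled with
8-bit masks, __builtin_ctz() stands in for cpumask_first(), and the
hard-affinity filtering, the preference for the pcpu where the vcpu last
ran, and the final fallback of preempting the lowest-credit vcpu are all
left out.

    #include <stdint.h>

    static int first_cpu(uint8_t mask)
    {
        return mask ? __builtin_ctz(mask) : -1;
    }

    static int pick_tickle_target(uint8_t idle, uint8_t tickled,
                                  uint8_t smt_idle, int power_savings)
    {
        /* 1) Fully idle cores first (or last, if saving power is preferred). */
        uint8_t mask = power_savings ? (uint8_t)(idle & ~smt_idle) : smt_idle;
        int cpu = first_cpu(mask);

        if ( cpu >= 0 )
            return cpu;

        /* 2) Otherwise, any idle pcpu that has not been tickled yet. */
        return first_cpu(idle & ~tickled);
    }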



* Re: [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone.
  2016-07-15 14:49 ` [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
@ 2016-07-18 14:35   ` George Dunlap
  0 siblings, 0 replies; 26+ messages in thread
From: George Dunlap @ 2016-07-18 14:35 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, David Vrabel, Jan Beulich

On 15/07/16 15:49, Dario Faggioli wrote:
> In both Credit1 and Credit2, stop considering a pCPU idle,
> if the reason why the idle vCPU is being selected, is to
> do tasklet work.
> 
> Not doing so means that the tickling and load balancing
> logic, seeing the pCPU as idle, considers it a candidate
> for picking up vCPUs. But the pCPU won't actually pick
> up or schedule any vCPU, which would then remain in the
> runqueue, which is bad, especially if there were other,
> truly idle pCPUs, that could execute it.
> 
> The only drawback is that we can't assume that a pCPU is
> always marked as idle when being removed from an
> instance of the Credit2 scheduler (csched2_deinit_pdata).
> In fact, if we are in stop-machine (i.e., during suspend
> or shutdown), the pCPUs are running the stopmachine_tasklet
> and hence are actually marked as busy. On the other hand,
> when removing a pCPU from a Credit2 pool, it will indeed
> be idle. The only thing we can do, therefore, is to
> remove the BUG_ON() check.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
> Changes in v1:
>  * reorg. the Credit1 code for minimal churn, as requested during review;
>  * don't move out: label up in Credit1, as requested during review.
> ---
> As code changed, I did not apply Anshul's R-b tag. Anshul, re-issue it if you
> want (if you feel like it, of course).
> ---
>  xen/common/sched_credit.c  |    2 +-
>  xen/common/sched_credit2.c |   14 ++++++++++----
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index ac22746..d547716 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1827,7 +1827,7 @@ csched_schedule(
>       * Update idlers mask if necessary. When we're idling, other CPUs
>       * will tickle us when they get extra work.
>       */
> -    if ( snext->pri == CSCHED_PRI_IDLE )
> +    if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
>      {
>          if ( !cpumask_test_cpu(cpu, prv->idlers) )
>              cpumask_set_cpu(cpu, prv->idlers);
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 8b95a47..7e572bf 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1923,8 +1923,16 @@ csched2_schedule(
>      }
>      else
>      {
> -        /* Update the idle mask if necessary */
> -        if ( !cpumask_test_cpu(cpu, &rqd->idle) )
> +        /*
> +         * Update the idle mask if necessary. Note that, if we're scheduling
> +         * idle in order to carry on some tasklet work, we want to play busy!
> +         */
> +        if ( tasklet_work_scheduled )
> +        {
> +            if ( cpumask_test_cpu(cpu, &rqd->idle) )
> +                cpumask_clear_cpu(cpu, &rqd->idle);
> +        }
> +        else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
>              cpumask_set_cpu(cpu, &rqd->idle);
>          /* Make sure avgload gets updated periodically even
>           * if there's no activity */
> @@ -2304,8 +2312,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>      /* No need to save IRQs here, they're already disabled */
>      spin_lock(&rqd->lock);
>  
> -    BUG_ON(!cpumask_test_cpu(cpu, &rqd->idle));
> -
>      printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
>  
>      cpumask_clear_cpu(cpu, &rqd->idle);
> 
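
The effect being avoided can be seen with a toy example (plain C, not the
Xen code): if a pcpu that is about to run tasklet work is left in the
idlers mask, the tickling logic happily targets it, and the woken vcpu just
sits in the runqueue even though a truly idle pcpu exists.

    #include <stdint.h>
    #include <stdio.h>

    /* Pick the first pcpu from an 8-bit idlers mask (-1 if none). */
    static int pick_idler(uint8_t idlers)
    {
        return idlers ? __builtin_ctz(idlers) : -1;
    }

    int main(void)
    {
        uint8_t idlers = 0x06;   /* pcpus 1 and 2 are running the idle vcpu */
        int tasklet_cpu = 1;     /* ...but pcpu 1 does so only to run a tasklet */

        printf("before: tickle pcpu %d\n", pick_idler(idlers));  /* 1: stalls */

        /* What the patch does: a pcpu doing tasklet work plays busy. */
        idlers &= (uint8_t)~(1u << tasklet_cpu);
        printf("after:  tickle pcpu %d\n", pick_idler(idlers));  /* 2 */
        return 0;
    }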



* Re: [PATCH v2 03/11] xen: credit2: rework load tracking logic
  2016-07-15 14:49 ` [PATCH v2 03/11] xen: credit2: rework load tracking logic Dario Faggioli
@ 2016-07-18 14:46   ` George Dunlap
  2016-07-18 14:51     ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: George Dunlap @ 2016-07-18 14:46 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, David Vrabel

On 15/07/16 15:49, Dario Faggioli wrote:
> The existing load tracking code was hard to understand and
> maintain, and not entirely consistent. This is due to a
> number of reasons:
>  - code and comments were not in perfect sync, making it
>    difficult to figure out what the intent of a particular
>    choice was (e.g., the choice of 18 for load_window_shift);
>  - the math, although effective, was not entirely consistent.
>    In fact, we were doing (if W is the length of the window):
> 
>     avgload = (delta*load*W + (W - delta)*avgload)/W
>     avgload = avgload + delta*load - delta*avgload/W
> 
>    which does not match any known variant of 'smoothing
>    moving average'. In fact, it should have been:
> 
>     avgload = avgload + delta*load/W - delta*avgload/W
> 
>    (for details on why, see the doc comments inside this
>    patch.). Furthermore, with delta ~= W, the old formula gives:
> 
>     avgload ~= avgload + W*load - avgload
>     avgload ~= W*load
> 
> The reason why the formula above sort of worked was because
> the number of bits used for the fractional parts of the
> values used in fixed point math and the number of bits used
>    for the length of the window were the same (load_window_shift
> was being used for both).
> 
> This may look handy, but it introduced a (not especially well
> documented) dependency between the length of the window and
> the precision of the calculations, which really should be
> two independent things. Especially if treating them as such
> (like it is done in this patch) does not lead to more
> complex maths (same number of multiplications and shifts, and
> there is still room for some optimization).
> 
> Therefore, in this patch, we:
>  - split length of the window and precision (and, since there
>    is already a command line parameter for length of window,
>    introduce one for precision too),
>  - align the math with one proper incarnation of exponential
>    smoothing (at no added cost),
>  - add comments, about the details of the algorithm and the
>    math used.
> 
> While there fix a couple of style issues as well (pointless
> initialization, long lines, comments).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
>  * reconciled comments and actual code about load_window_shift handling;

And this apparently means changing the code to match the comments (not
the other way around), so that the actual load_window_shift default is
now 30 (as the comments say) instead of 20 (as it was in the previous
patch). :-)

At any rate:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>
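
As a worked illustration of the corrected update, here is a small,
self-contained fixed-point sketch (plain C, not the Xen code; the two shift
values are only illustrative, and W below is the log2 of the window length
rather than the length itself):

    #include <stdint.h>
    #include <stdio.h>

    #define P   18   /* fractional bits of the fixed-point values */
    #define W   30   /* log2 of the window length */

    /*
     * avgload carries P fractional bits, load is the plain integer
     * instantaneous load, delta is the time since the last update.
     * Implements: avgload += delta*load/2^W - delta*avgload/2^W.
     */
    static uint64_t update_avgload(uint64_t avgload, uint64_t load, uint64_t delta)
    {
        if ( delta >= (1ULL << W) )
            return load << P;               /* a full window has elapsed */

        return avgload
               + ((delta * (load << P)) >> W)
               - ((delta * avgload) >> W);
    }

    int main(void)
    {
        uint64_t avg = 0;

        /* A constant load of 2 converges towards 2.0 (i.e. 2 << P). */
        avg = update_avgload(avg, 2, 1ULL << (W - 1));
        printf("avgload = %.3f\n", (double)avg / (1 << P));   /* 1.000 */
        avg = update_avgload(avg, 2, 1ULL << (W - 1));
        printf("avgload = %.3f\n", (double)avg / (1 << P));   /* 1.500 */
        return 0;
    }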



* Re: [PATCH v2 03/11] xen: credit2: rework load tracking logic
  2016-07-18 14:46   ` George Dunlap
@ 2016-07-18 14:51     ` Dario Faggioli
  0 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-18 14:51 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Anshul Makkar, David Vrabel



On Mon, 2016-07-18 at 15:46 +0100, George Dunlap wrote:
> On 15/07/16 15:49, Dario Faggioli wrote:
> > 
> > This may look handy, but it introduced a (not especially well
> > documented) dependency between the length of the window and
> > the precision of the calculations, which really should be
> > two independent things. Especially if treating them as such
> > (like it is done in this patch) does not lead to more
> > complex maths (same number of multiplications and shifts, and
> > there is still room for some optimization).
> > 
> > Therefore, in this patch, we:
> >  - split length of the window and precision (and, since there
> >    is already a command line parameter for length of window,
> >    introduce one for precision too),
> >  - align the math with one proper incarnation of exponential
> >    smoothing (at no added cost),
> >  - add comments, about the details of the algorithm and the
> >    math used.
> > 
> > While there fix a couple of style issues as well (pointless
> > initialization, long lines, comments).
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > Changes from v1:
> >  * reconciled comments and actual code about load_window_shift
> > handling;
> And this apparently means changing the code to match the comments
> (not
> the other way around), so that the actual load_window_shift default
> is
> now 30 (as the comments say) instead of 20 (as it was in the previous
> patch). :-)
> 
Indeed. I should have made it more clear that this is what's
happening... Sorry! :-)

> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
> 
Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 06/11] xen: credit2: make the code less experimental
  2016-07-15 14:49 ` [PATCH v2 06/11] xen: credit2: make the code less experimental Dario Faggioli
@ 2016-07-18 15:04   ` George Dunlap
  0 siblings, 0 replies; 26+ messages in thread
From: George Dunlap @ 2016-07-18 15:04 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, David Vrabel, Jan Beulich

On 15/07/16 15:49, Dario Faggioli wrote:
> Mainly, almost all of the BUG_ON-s can be converted into
> ASSERTS, and almost all the debug printk can either be
> removed or turned into tracing.
> 
> The 'TODO' list, in a comment at the beginning of the file,
> was also stale, so remove items that were still there but
> are actually done.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Thanks,

Reviewed-by: George Dunlap <george.dunlap@citrix.com>



* Re: [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled
  2016-07-15 14:50 ` [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
@ 2016-07-18 15:10   ` George Dunlap
  0 siblings, 0 replies; 26+ messages in thread
From: George Dunlap @ 2016-07-18 15:10 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, David Vrabel

On 15/07/16 15:50, Dario Faggioli wrote:
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

It seems a bit strange not to have anything in the "body" of the
changeset here, but I can't think of anything in particular to say here, so:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Thanks

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
> Changes from v1:
>  * avoid stray code removal in balance_load(), as pointed out by George
>    during review.
> ---
>  xen/common/sched_credit2.c |  112 +++++++++++++++++++++++---------------------
>  1 file changed, 59 insertions(+), 53 deletions(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index a4aec73..be27ba3 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -638,6 +638,7 @@ __update_runq_load(const struct scheduler *ops,
>  
>      ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
>  
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              uint64_t rq_avgload, b_avgload;
> @@ -648,9 +649,9 @@ __update_runq_load(const struct scheduler *ops,
>          d.rq_avgload = rqd->avgload;
>          d.b_avgload = rqd->b_avgload;
>          d.shift = P;
> -        trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  }
>  
> @@ -693,6 +694,7 @@ __update_svc_load(const struct scheduler *ops,
>      }
>      svc->load_last_update = now;
>  
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              uint64_t v_avgload;
> @@ -703,9 +705,9 @@ __update_svc_load(const struct scheduler *ops,
>          d.vcpu = svc->vcpu->vcpu_id;
>          d.v_avgload = svc->avgload;
>          d.shift = P;
> -        trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  }
>  
> @@ -761,6 +763,7 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
>  
>      pos = __runq_insert(runq, svc);
>  
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned vcpu:16, dom:16;
> @@ -769,9 +772,9 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
>          d.dom = svc->vcpu->domain->domain_id;
>          d.vcpu = svc->vcpu->vcpu_id;
>          d.pos = pos;
> -        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_RUNQ_POS, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>      return;
> @@ -814,7 +817,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>  
>      ASSERT(new->rqd == rqd);
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned vcpu:16, dom:16;
> @@ -824,9 +827,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>          d.vcpu = new->vcpu->vcpu_id;
>          d.processor = new->vcpu->processor;
>          d.credit = new->credit;
> -        trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>      /*
> @@ -884,7 +887,8 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>              lowest = cur->credit;
>          }
>  
> -        /* TRACE */ {
> +        if ( unlikely(tb_init_done) )
> +        {
>              struct {
>                  unsigned vcpu:16, dom:16;
>                  unsigned credit;
> @@ -892,9 +896,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>              d.dom = cur->vcpu->domain->domain_id;
>              d.vcpu = cur->vcpu->vcpu_id;
>              d.credit = cur->credit;
> -            trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
> -                      sizeof(d),
> -                      (unsigned char *)&d);
> +            __trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
> +                        sizeof(d),
> +                        (unsigned char *)&d);
>          }
>      }
>  
> @@ -912,14 +916,15 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>   tickle:
>      BUG_ON(ipid == -1);
>  
> -    /* TRACE */ {
> +    if ( unlikely(tb_init_done) )
> +    {
>          struct {
>              unsigned cpu:16, pad:16;
>          } d;
>          d.cpu = ipid; d.pad = 0;
> -        trace_var(TRC_CSCHED2_TICKLE, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_TICKLE, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>      __cpumask_set_cpu(ipid, &rqd->tickled);
>      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> @@ -981,7 +986,8 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
>  
>          svc->start_time = now;
>  
> -        /* TRACE */ {
> +        if ( unlikely(tb_init_done) )
> +        {
>              struct {
>                  unsigned vcpu:16, dom:16;
>                  unsigned credit_start, credit_end;
> @@ -992,9 +998,9 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
>              d.credit_start = start_credit;
>              d.credit_end = svc->credit;
>              d.multiplier = m;
> -            trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
> -                      sizeof(d),
> -                      (unsigned char *)&d);
> +            __trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
> +                        sizeof(d),
> +                        (unsigned char *)&d);
>          }
>      }
>  
> @@ -1030,7 +1036,7 @@ void burn_credits(struct csched2_runqueue_data *rqd,
>                   __func__, now, svc->start_time);
>      }
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned vcpu:16, dom:16;
> @@ -1041,9 +1047,9 @@ void burn_credits(struct csched2_runqueue_data *rqd,
>          d.vcpu = svc->vcpu->vcpu_id;
>          d.credit = svc->credit;
>          d.delta = delta;
> -        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  }
>  
> @@ -1079,16 +1085,16 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
>          SCHED_STAT_CRANK(upd_max_weight_full);
>      }
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned rqi:16, max_weight:16;
>          } d;
>          d.rqi = rqd->id;
>          d.max_weight = rqd->max_weight;
> -        trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  }
>  
> @@ -1168,7 +1174,7 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
>      /* Expected new load based on adding this vcpu */
>      rqd->b_avgload += svc->avgload;
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned vcpu:16, dom:16;
> @@ -1177,9 +1183,9 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
>          d.dom = svc->vcpu->domain->domain_id;
>          d.vcpu = svc->vcpu->vcpu_id;
>          d.rqi=rqd->id;
> -        trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>  }
> @@ -1491,7 +1497,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>  out_up:
>      spin_unlock(&prv->lock);
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              uint64_t b_avgload;
> @@ -1503,9 +1509,9 @@ out_up:
>          d.vcpu = vc->vcpu_id;
>          d.rq_id = c2r(ops, new_cpu);
>          d.new_cpu = new_cpu;
> -        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_PICKED_CPU, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>      return new_cpu;
> @@ -1560,7 +1566,7 @@ static void migrate(const struct scheduler *ops,
>                      struct csched2_runqueue_data *trqd, 
>                      s_time_t now)
>  {
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              unsigned vcpu:16, dom:16;
> @@ -1570,9 +1576,9 @@ static void migrate(const struct scheduler *ops,
>          d.vcpu = svc->vcpu->vcpu_id;
>          d.rqi = svc->rqd->id;
>          d.trqi = trqd->id;
> -        trace_var(TRC_CSCHED2_MIGRATE, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_MIGRATE, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>      if ( svc->flags & CSFLAG_scheduled )
> @@ -1698,7 +1704,7 @@ retry:
>          if ( i > cpus_max )
>              cpus_max = i;
>  
> -        /* TRACE */
> +        if ( unlikely(tb_init_done) )
>          {
>              struct {
>                  unsigned lrq_id:16, orq_id:16;
> @@ -1707,9 +1713,9 @@ retry:
>              d.lrq_id = st.lrqd->id;
>              d.orq_id = st.orqd->id;
>              d.load_delta = st.load_delta;
> -            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
> -                      sizeof(d),
> -                      (unsigned char *)&d);
> +            __trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
> +                        sizeof(d),
> +                        (unsigned char *)&d);
>          }
>  
>          /*
> @@ -1740,7 +1746,7 @@ retry:
>      if ( unlikely(st.orqd->id < 0) )
>          goto out_up;
>  
> -    /* TRACE */
> +    if ( unlikely(tb_init_done) )
>      {
>          struct {
>              uint64_t lb_avgload, ob_avgload;
> @@ -1750,9 +1756,9 @@ retry:
>          d.lb_avgload = st.lrqd->b_avgload;
>          d.orq_id = st.orqd->id;
>          d.ob_avgload = st.orqd->b_avgload;
> -        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
> -                  sizeof(d),
> -                  (unsigned char *)&d);
> +        __trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
>      /* Look for "swap" which gives the best load average
> 
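
For context: trace_var() is a wrapper that already checks tb_init_done
before emitting anything, so the records themselves were never written with
tracing disabled; what the patch avoids is assembling the argument structs
at all in that case, by hoisting the check and then calling the unchecked
__trace_var() directly. A minimal sketch of the resulting pattern (the
event name and fields below are made up):

    if ( unlikely(tb_init_done) )
    {
        struct {
            unsigned cpu:16, pad:16;
        } d;

        d.cpu = cpu; d.pad = 0;
        /* Tracing is known to be on, so the unchecked variant is fine. */
        __trace_var(TRC_EXAMPLE_EVENT, 1, sizeof(d), (unsigned char *)&d);
    }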



* Re: [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events
  2016-07-15 14:49 ` [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events Dario Faggioli
@ 2016-07-18 15:38   ` Wei Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Wei Liu @ 2016-07-18 15:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Anshul Makkar, Ian Jackson, Wei Liu, George Dunlap

On Fri, Jul 15, 2016 at 04:49:33PM +0200, Dario Faggioli wrote:
> Add the shift used for the precision of the integer
> arithmetic to the trace records, and update both xenalyze
> and xentrace_format to make use of/print it.
> 
> In particular, in xenalyze, we can now show the
> load as an (easier to interpret) percentage.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Acked-by: Wei Liu <wei.liu2@citrix.com>
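
For illustration, with the shift now carried in the record, a trace
consumer can turn a fixed-point load sample into a percentage along these
lines (sketch only, not the actual xenalyze code):

    #include <stdint.h>

    /* 'load' has 'shift' fractional bits; 1.0 == one fully busy cpu == 100%. */
    static double load_to_percent(uint64_t load, unsigned int shift)
    {
        return (double)load * 100.0 / (double)(1ULL << shift);
    }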


* Re: [PATCH v2 09/11] tools: tracing: deal with new Credit2 events
  2016-07-15 14:50 ` [PATCH v2 09/11] tools: tracing: deal with new Credit2 events Dario Faggioli
@ 2016-07-18 15:39   ` Wei Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Wei Liu @ 2016-07-18 15:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Anshul Makkar, Ian Jackson, Wei Liu, George Dunlap

On Fri, Jul 15, 2016 at 04:50:11PM +0200, Dario Faggioli wrote:
> more specifically, with: TICKLE_NEW, RUNQ_MAX_WEIGHT,
> MIGRATE, LOAD_CHECK, LOAD_BALANCE and PICKED_CPU, and
> in both xenalyze and formats (for xentrace_format).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: George Dunlap <george.dunlap@citrix.com>

Acked-by: Wei Liu <wei.liu2@citrix.com>

George, I assume you are going to commit both this patch and patch #4.
Please let me know if you expect me to handle these.

Wei.


* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-15 14:50 ` [PATCH v2 11/11] xen: credit2: implement true SMT support Dario Faggioli
@ 2016-07-18 16:48   ` George Dunlap
  2016-07-18 17:24     ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: George Dunlap @ 2016-07-18 16:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, David Vrabel, Jan Beulich

On 15/07/16 15:50, Dario Faggioli wrote:
> In fact, right now, we recommend keeping runqueues
> arranged per-core, so that it is the inter-runqueue load
> balancing code that automatically spreads the work in an
> SMT friendly way. This means that any other runq
> arrangement one may want to use falls short of SMT
> scheduling optimizations.
> 
> This commit implements SMT awareness --similar to the
> one we have in Credit1-- for any possible runq
> arrangement. This turned out to be pretty easy to do,
> as the logic can live entirely in runq_tickle()
> (although, in order to avoid for_each_cpu loops in
> that function, we use a new cpumask which indeed needs
> to be updated in other places).
> 
> In addition to disentangling SMT awareness from load
> balancing, this also allows us to support the
> sched_smt_power_savings parameter in Credit2 as well.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
> Changes from v1:
>  * make smt_idle_mask_set() const correct, and use a local variable for
>    per_cpu(cpu_sibling_masks) in there, as suggested during review;
>  * removed a stray XXX, as pointed out during review;
>  * full stop in a comment, as requested during review.
> ---
>  xen/common/sched_credit2.c |  142 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 128 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index b33ba7a..6ccc6f0 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c

[snip]

> +/*
> + * If all the siblings of cpu (including cpu itself) are in idlers,
> + * set all their bits in mask.
> + *
> + * In order to properly take into account tickling, idlers needs to be
> + * set qeual to something like:

*equal (I can fix this on check-in)

> + *
> + *  rqd->idle & (~rqd->tickled)
> + *
> + * This is because cpus that have been tickled will very likely pick up some
> + * work as soon as the manage to schedule, and hence we should really consider
> + * them as busy.

OK this is something that slightly confused me when I was reviewing the
patch the first time: that rqd->idle is *all* pcpus which are currently
idle (and thus we need to & (~tickled) when using it), but rqd->smt_idle
is meant to be maintained as *non-tickled* idle pcpus.

Are you planning at some point to have a follow-up patch which changes
rqd->idle to be non-tickled idle pcpus as well?  Unless I missed
something it looks like at the moment the only times rqd->idle is acted
upon is after &~-ing out rqd->tickled anyway.

 -George


* Re: [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2
  2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (10 preceding siblings ...)
  2016-07-15 14:50 ` [PATCH v2 11/11] xen: credit2: implement true SMT support Dario Faggioli
@ 2016-07-18 17:21 ` George Dunlap
  11 siblings, 0 replies; 26+ messages in thread
From: George Dunlap @ 2016-07-18 17:21 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich

On Fri, Jul 15, 2016 at 3:49 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Hi,
>
> Version 2 of the patch series. Not much to say apart from the fact that (I
> think) I've addressed all the review comments v1 received (thanks everyone!).
> Details are in the individual changelogs.
>
> It's smaller because George commited some of the patches already.
>
> As mentioned in the v1 thread, I've dropped the last patch "xen: credit2: use
> cpumask_first instead of cpumask_any when choosing cpu". I still think it is
> sound, but I will better evaluate its effects and respin.
>
> Posting of v1, with discussion and benchmarks, is here:
>   https://lists.xen.org/archives/html/xen-devel/2016-06/msg02287.html
>   <146618450041.23516.9007927860063823148.stgit@Solace.fritz.box>
>
> And here's the git branches:
>   (v2) git://xenbits.xen.org/people/dariof/xen.git  wip/sched/credit2-misc-improvements-v2
>        http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/credit2-misc-improvements-v2
>
>   (v1) git://xenbits.xen.org/people/dariof/xen.git  wip/sched/credit2-misc-improvements
>        http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/wip/sched/credit2-misc-improvements
>
> Thanks and Regards,
> Dario
> ---
> Dario Faggioli (11):
>       xen: sched: leave CPUs doing tasklet work alone.
>       xen: credit2: prevent load balancing to go mad if time goes backwards
>       xen: credit2: rework load tracking logic
>       xen/tools: improve tracing of Credit2 load tracking events
>       xen: credit2: use non-atomic cpumask and bit operations
>       xen: credit2: make the code less experimental
>       xen: credit2: add yet some more tracing
>       xen: credit2: only marshall trace point arguments if tracing enabled
>       tools: tracing: deal with new Credit2 events
>       xen: credit2: the private scheduler lock can be an rwlock.

I've checked in everything up to here.

 -George


* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-18 16:48   ` George Dunlap
@ 2016-07-18 17:24     ` Dario Faggioli
  2016-07-19  9:39       ` George Dunlap
  0 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-18 17:24 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Anshul Makkar, David Vrabel, Jan Beulich



On Mon, 2016-07-18 at 17:48 +0100, George Dunlap wrote:
> On 15/07/16 15:50, Dario Faggioli wrote:
> > 
> > +/*
> > + * If all the siblings of cpu (including cpu itself) are in
> > idlers,
> > + * set all their bits in mask.
> > + *
> > + * In order to properly take into account tickling, idlers needs
> > to be
> > + * set qeual to something like:
> *equal (I can fix this on check-in)
> 
Oops!

> > + *
> > + *  rqd->idle & (~rqd->tickled)
> > + *
> > + * This is because cpus that have been tickled will very likely
> > pick up some
> > + * work as soon as the manage to schedule, and hence we should
> > really consider
> > + * them as busy.
> OK this is something that slightly confused me when I was reviewing
> the
> patch the first time: that rqd->idle is *all* pcpus which are
> currently
> idle (and thus we need to & (~tickled) when using it), but rqd-
> >smt_idle
> is meant to be maintained as *non-tickled* idle pcpus.
> 
Short answer is, "yes, this recap of yours is correct".

In fact, the difference between idle and smt_idle is that the former is
valid instantaneously, while the latter is tracking a state.

IOW, if, at any given time, I want to know what pcpus are idle, I check
rqd->idle. If I want to know what are idle and also are not (or are
unlikely) just about to pick up work, I can check
rqd->idle&(~rqd->tickled)

Let's now consider smt_idle and assume that, at time t, sibling pcpus 2
and 3 are idle (as in, their bit is 1 in rqd->idle). If I'd be basing
smt_idle just on that, I could at this point set the bit of the core in
smt_idle. This in turn means that work will likely be sent to either 2
or 3 (depending on all the other factors that influence this). Let's
assume we select 2. But if either of them --although being idle-- has
actually been tickled already, we may have taken a suboptimal
decision. In fact, if 3 was tickled, both 2 and 3 will pick up work,
and if there is another core (say, made up of siblings pcpus 6 and 7)
which is truly fully idle, we would better have chosen a pcpu from
there. If 2 was the one that was tickled, that's even worse, because I
most likely have 2 work items, and am tickling only 1 pcpu!

So, again, yes, basically this means that I need smt_idle to be
representative of the set of non-tickled idle pcpus.
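
To make the bookkeeping concrete, here is a toy version of the above in
plain C (not the Xen code), with 8 pcpus as 8-bit masks and siblings
assumed to be the pairs {0,1}, {2,3}, {4,5}, {6,7}:

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t idle, tickled, smt_idle;

    static uint8_t siblings(unsigned int cpu) { return 3u << (cpu & ~1u); }

    /* Like smt_idle_mask_set(): only fully idle-and-untickled cores get in. */
    static void smt_idle_set(unsigned int cpu)
    {
        uint8_t idlers = idle & ~tickled;

        if ( (siblings(cpu) & ~idlers) == 0 )
            smt_idle |= siblings(cpu);
    }

    int main(void)
    {
        unsigned int cpu;

        idle = 0xcc;             /* pcpus 2, 3, 6 and 7 are idle... */
        tickled = 0x08;          /* ...but pcpu 3 has already been tickled */

        for ( cpu = 0; cpu < 8; cpu++ )
            if ( idle & (1u << cpu) )
                smt_idle_set(cpu);

        /* Only core {6,7} qualifies: prints smt_idle = 0xc0. */
        printf("smt_idle = %#x\n", smt_idle);
        return 0;
    }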

> Are you planning at some point to have a follow-up patch which
> changes
> rqd->idle to be non-tickled idle pcpus as well?  Unless I missed
> something it looks like at the moment the only times rqd->idle is
> acted
> upon is after &~-ing out rqd->tickled anyway.
> 
I am indeed, but I was planning to do that after this round of changes
(this series, plus soft-affinity, plus caps, which I have in my queue).

It's, after all, an optimization, and hence I think it is fine to leave
it to when things will be proven to be working. :-)

If you're saying that this discrepancy between rqd->idle's and
rqd->smt_idle's semantic is, at minimum, unideal, I do agree... but I
think, for now at least, it's worth living with it.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-18 17:24     ` Dario Faggioli
@ 2016-07-19  9:39       ` George Dunlap
  2016-07-19  9:57         ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: George Dunlap @ 2016-07-19  9:39 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich


On Mon, Jul 18, 2016 at 6:24 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Mon, 2016-07-18 at 17:48 +0100, George Dunlap wrote:
>> On 15/07/16 15:50, Dario Faggioli wrote:
>> >
>> > +/*
>> > + * If all the siblings of cpu (including cpu itself) are in
>> > idlers,
>> > + * set all their bits in mask.
>> > + *
>> > + * In order to properly take into account tickling, idlers needs
>> > to be
>> > + * set qeual to something like:
>> *equal (I can fix this on check-in)
>>
> Oops!
>
>> > + *
>> > + *  rqd->idle & (~rqd->tickled)
>> > + *
>> > + * This is because cpus that have been tickled will very likely
>> > pick up some
>> > + * work as soon as the manage to schedule, and hence we should
>> > really consider
>> > + * them as busy.
>> OK this is something that slightly confused me when I was reviewing
>> the
>> patch the first time: that rqd->idle is *all* pcpus which are
>> currently
>> idle (and thus we need to & (~tickled) when using it), but rqd-
>> >smt_idle
>> is meant to be maintained as *non-tickled* idle pcpus.
>>
> Short answer is, "yes, this recap of yours is correct".
>
> In fact, the difference between idle and smt_idle is that the former is
> valid instantaneously, while the latter is tracking a state.
>
> IOW, if, at any given time, I want to know what pcpus are idle, I check
> rqd->idle. If I want to know what are idle and also are not (or are
> unlikely) just about to pick up work, I can check
> rqd->idle&(~rqd->tickled)
>
> Let's now consider smt_idle and assume that, at time t, sibling pcpus 2
> and 3 are idle (as in, their bit is 1 in rqd->idle). If I'd be basing
> smt_idle just on that, I could at this point set the bit of the core in
> smt_idle. This in turn means that work will likely be sent to either 2
> or 3 (depending on all the other factors that influence this). Let's
> assume we select 2. But if either of them --although being idle-- has
> actually been tickled already, we may have taken a suboptimal
> decision. In fact, if 3 was tickled, both 2 and 3 will pick up work,
> and if there is another core (say, made up of siblings pcpus 6 and 7)
> which is truly fully idle, we would better have chosen a pcpu from
> there. If 2 was the one that was tickled, that's even worse, because I
> most likely have 2 work items, and am tickling only 1 pcpu!
>
> So, again, yes, basically this means that I need smt_idle to be
> representative of the set of non-tickled idle pcpus.
>
>> Are you planning at some point to have a follow-up patch which
>> changes
>> rqd->idle to be non-tickled idle pcpus as well?  Unless I missed
>> something it looks like at the moment the only times rqd->idle is
>> acted
>> upon is after &~-ing out rqd->tickled anyway.
>>
> I am indeed, but I was planning to do that after this round of changes
> (this series, plus soft-affinity, plus caps, which I have in my queue).
>
> It's, after all, an optimization, and hence I think it is fine to leave
> it to when things will be proven to be working. :-)
>
> If you're saying that this discrepancy between rqd->idle's and
> rqd->smt_idle's semantic is, at minimum, unideal, I do agree... but I
> think, for now at least, it's worth living with it.

I hadn't actually said anything, but you know me well enough to guess
what I'm thinking. :-)  I am somewhat torn between my dislike of the
inconsistency and, as you say, the fact that this is a distinct
improvement; it would seem a bit petty to insist that you either
wait or produce a patch to change idle at the same time.

But I do think that the difference needs to be called out a bit
better.  What about folding in something like the attached patch?

  -George

[-- Attachment #2: comments.patch --]
[-- Type: text/x-diff, Size: 2289 bytes --]

commit fd8fe6d8526cc9d6abe510aae7a654d1b72d4305
Author: George Dunlap <george.dunlap@citrix.com>
Commit: George Dunlap <george.dunlap@citrix.com>

    George's mods

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 6ccc6f0..3e1720c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -353,8 +353,8 @@ struct csched2_runqueue_data {
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
 
-    cpumask_t idle,        /* Currently idle */
-        smt_idle,          /* Fully idle cores (as in all the siblings are idle) */
+    cpumask_t idle,        /* Currently idle pcpus */
+        smt_idle,          /* Fully idle-and-untickled cores (see below) */
         tickled;           /* Have been asked to go through schedule */
     int load;              /* Instantaneous load: Length of queue  + num non-idle threads */
     s_time_t load_last_update;  /* Last time average was updated */
@@ -454,17 +454,20 @@ struct csched2_dom {
  */
 
 /*
- * If all the siblings of cpu (including cpu itself) are in idlers,
- * set all their bits in mask.
- *
- * In order to properly take into account tickling, idlers needs to be
- * set qeual to something like:
- *
- *  rqd->idle & (~rqd->tickled)
- *
- * This is because cpus that have been tickled will very likely pick up some
- * work as soon as the manage to schedule, and hence we should really consider
- * them as busy.
+ * If all the siblings of cpu (including cpu itself) are both idle and
+ * untickled, set all their bits in mask.
+ *
+ * NB that rqd->smt_idle is different than rqd->idle.  rqd->idle
+ * records pcpus that are merely idle (i.e., at the moment do not
+ * have a vcpu running on them).  But you have to manually filter out
+ * which pcpus have been tickled in order to find cores that are not
+ * going to be busy soon.  Filtering out tickled cpus pairwise is a
+ * lot of extra pain; so for rqd->smt_idle, we explicitly make it so that
+ * the bits of a pcpu are set only if all the threads on its core are
+ * both idle *and* untickled.
+ *
+ * This means changing the mask when either rqd->idle or rqd->tickled
+ * changes.
  */
 static inline
 void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,


* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-19  9:39       ` George Dunlap
@ 2016-07-19  9:57         ` Dario Faggioli
  2016-07-19 10:05           ` George Dunlap
  0 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2016-07-19  9:57 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich



On Tue, 2016-07-19 at 10:39 +0100, George Dunlap wrote:
> On Mon, Jul 18, 2016 at 6:24 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > 
> > If you're saying that this discrepancy between rqd->idle's and
> > rqd->smt_idle's semantic is, at minimum, unideal, I do agree... but
> > I
> > think, for now at least, it's worth living with it.
> I hadn't actually said anything, but you know me well enough to guess
> what I'm thinking. :-)  
>
Hehe. :-)

> I am somewhat torn between feeling like the
> inconsistency and as you say, the fact that this is a distinct
> improvement and it would seem a bit petty to insist that you either
> wait or produce a patch to change idle at the same time.
> 
If we go ahead, I sign up for double checking and, if possible, fixing
the inconsistency.

> But I do think that the difference needs to be called out a bit
> better.  
>
Yes, I was about to re-replying saying "perhaps we should add a comment
about this".

> What about folding in something like the attached patch?
> 
I'd be totally fine with this.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-19  9:57         ` Dario Faggioli
@ 2016-07-19 10:05           ` George Dunlap
  2016-07-19 10:19             ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: George Dunlap @ 2016-07-19 10:05 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich


On 19/07/16 10:57, Dario Faggioli wrote:
> On Tue, 2016-07-19 at 10:39 +0100, George Dunlap wrote:
>> On Mon, Jul 18, 2016 at 6:24 PM, Dario Faggioli
>> <dario.faggioli@citrix.com> wrote:
>>>  
>>> If you're saying that this discrepancy between rqd->idle's and
>>> rqd->smt_idle's semantic is, at minimum, unideal, I do agree... but
>>> I
>>> think, for now at least, it's worth living with it.
>> I hadn't actually said anything, but you know me well enough to guess
>> what I'm thinking. :-)  
>>
> Hehe. :-)
> 
>> I am somewhat torn between feeling like the
>> inconsistency and as you say, the fact that this is a distinct
>> improvement and it would seem a bit petty to insist that you either
>> wait or produce a patch to change idle at the same time.
>>
> If we go ahead, I sign up for double checking and, if possible, fixing
> the inconsistency.
> 
>> But I do think that the difference needs to be called out a bit
>> better.  
>>
> Yes, I was about to re-replying saying "perhaps we should add a comment
> about this".
> 
>> What about folding in something like the attached patch?
>>
> I'd be totally fine with this.

Do you mean you ack me folding in that particular patch (so that the
resulting commit looks like the attached)?

 -George


[-- Attachment #2: credit2_smt_revised_comment.patch --]
[-- Type: text/x-diff, Size: 11475 bytes --]

commit 9dfa4b90867dedf4b1db0523a76c7007cbb9bd40
Author: George Dunlap <george.dunlap@citrix.com>
Commit: George Dunlap <george.dunlap@citrix.com>

    xen: credit2: implement true SMT support
    
    In fact, right now, we recommend keeping runqueues
    arranged per-core, so that it is the inter-runqueue load
    balancing code that automatically spreads the work in an
    SMT friendly way. This means that any other runq
    arrangement one may want to use falls short of SMT
    scheduling optimizations.
    
    This commit implements SMT awareness --similar to the
    one we have in Credit1-- for any possible runq
    arrangement. This turned out to be pretty easy to do,
    as the logic can live entirely in runq_tickle()
    (although, in order to avoid for_each_cpu loops in
    that function, we use a new cpumask which indeed needs
    to be updated in other places).
    
    In addition to disentangling SMT awareness from load
    balancing, this also allows us to support the
    sched_smt_power_savings parameter in Credit2.
    
    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Signed-off-by: George Dunlap <george.dunlap@citrix.com>
    Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b33ba7a..3e1720c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -353,8 +353,9 @@ struct csched2_runqueue_data {
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
 
-    cpumask_t idle,        /* Currently idle */
-        tickled;           /* Another cpu in the queue is already targeted for this one */
+    cpumask_t idle,        /* Currently idle pcpus */
+        smt_idle,          /* Fully idle-and-untickled cores (see below) */
+        tickled;           /* Have been asked to go through schedule */
     int load;              /* Instantaneous load: Length of queue  + num non-idle threads */
     s_time_t load_last_update;  /* Last time average was updated */
     s_time_t avgload;           /* Decaying queue load */
@@ -415,6 +416,79 @@ struct csched2_dom {
 };
 
 /*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT siblings in a core are idle, all their corresponding
+ *    bits are set in the smt_idle mask;
+ *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
+ *    bits corresponding to it and to all its siblings are clear in the
+ *    smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *    on pcpus from smt_idle mask first. This is what happens if
+ *    sched_smt_power_savings was not set at boot (default), and it maximizes
+ *    true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *    on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *    sched_smt_power_savings is set at boot, and it allows as many cores as
+ *    possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *    runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *    csched2_cpu_pick() looks for the least loaded runq and returns just any
+ *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *    the chosen runq, and it is again runq_tickle(), called by
+ *    csched2_vcpu_wake(), that actually decides what pcpu to use within the
+ *    chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling migrate()
+ *    directly, this again temporarily uses a random pcpu from the new runq,
+ *    and then calls runq_tickle() by itself.
+ */
+
+/*
+ * If all the siblings of cpu (including cpu itself) are both idle and
+ * untickled, set all their bits in mask.
+ *
+ * NB that rqd->smt_idle is different from rqd->idle.  rqd->idle
+ * records pcpus that are merely idle (i.e., at the moment do not
+ * have a vcpu running on them).  But you have to manually filter out
+ * which pcpus have been tickled in order to find cores that are not
+ * going to be busy soon.  Filtering out tickled cpus pairwise is a
+ * lot of extra pain; so for rqd->smt_idle, we explicitly make it so that
+ * the bits of a pcpu are set only if all the threads on its core are
+ * both idle *and* untickled.
+ *
+ * This means changing the mask when either rqd->idle or rqd->tickled
+ * changes.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
+                       cpumask_t *mask)
+{
+    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
+
+    if ( cpumask_subset(cpu_siblings, idlers) )
+        cpumask_or(mask, mask, cpu_siblings);
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask.
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{
+    cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
  * When a hard affinity change occurs, we may not be able to check some
  * (any!) of the other runqueues, when looking for the best new processor
  * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
@@ -853,9 +927,30 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     }
 
     /*
-     * Get a mask of idle, but not tickled, processors that new is
-     * allowed to run on. If that's not empty, choose someone from there
-     * (preferrably, the one were new was running on already).
+     * First of all, consider idle cpus, checking if we can just
+     * re-use the pcpu where we were running before.
+     *
+     * If there are cores where all the siblings are idle, consider
+     * them first, honoring whatever the spreading-vs-consolidation
+     * SMT policy wants us to do.
+     */
+    if ( unlikely(sched_smt_power_savings) )
+        cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
+    else
+        cpumask_copy(&mask, &rqd->smt_idle);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    i = cpumask_test_or_cycle(cpu, &mask);
+    if ( i < nr_cpu_ids )
+    {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
+        ipid = i;
+        goto tickle;
+    }
+
+    /*
+     * If there are no fully idle cores, check all idlers, after
+     * having filtered out pcpus that have been tickled but haven't
+     * gone through the scheduler yet.
      */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
@@ -947,6 +1042,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                     (unsigned char *)&d);
     }
     __cpumask_set_cpu(ipid, &rqd->tickled);
+    smt_idle_mask_clear(ipid, &rqd->smt_idle);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1442,8 +1538,10 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         return get_fallback_cpu(svc);
     }
 
-    /* First check to see if we're here because someone else suggested a place
-     * for us to move. */
+    /*
+     * First check to see if we're here because someone else suggested a place
+     * for us to move.
+     */
     if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
@@ -1464,7 +1562,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /* Find the runqueue with the lowest average load. */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
@@ -1507,16 +1605,17 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* We didn't find anyone (most likely because of spinlock contention). */
     if ( min_rqi == -1 )
-        new_cpu = get_fallback_cpu(svc);
-    else
     {
-        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
-                    &prv->rqd[min_rqi].active);
-        new_cpu = cpumask_any(cpumask_scratch);
-        BUG_ON(new_cpu >= nr_cpu_ids);
+        new_cpu = get_fallback_cpu(svc);
+        goto out_up;
     }
 
-out_up:
+    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                &prv->rqd[min_rqi].active);
+    new_cpu = cpumask_any(cpumask_scratch);
+    BUG_ON(new_cpu >= nr_cpu_ids);
+
+ out_up:
     read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
@@ -2140,7 +2239,11 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
+    {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
+        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+    }
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2202,7 +2305,10 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_clear_cpu(cpu, &rqd->idle);
+            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+        }
 
         snext->start_time = now;
 
@@ -2224,10 +2330,17 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
+            {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
+                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_set_cpu(cpu, &rqd->idle);
+            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2357,6 +2470,8 @@ csched2_dump(const struct scheduler *ops)
         printk("\tidlers: %s\n", cpustr);
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].tickled);
         printk("\ttickled: %s\n", cpustr);
+        cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].smt_idle);
+        printk("\tfully idle cores: %s\n", cpustr);
     }
 
     printk("Domain info:\n");
@@ -2510,6 +2625,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &rqd->idle);
     __cpumask_set_cpu(cpu, &rqd->active);
     __cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
     return rqi;
 }
@@ -2615,6 +2731,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->smt_idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 11/11] xen: credit2: implement true SMT support
  2016-07-19 10:05           ` George Dunlap
@ 2016-07-19 10:19             ` Dario Faggioli
  0 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2016-07-19 10:19 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel, Jan Beulich



On Tue, 2016-07-19 at 11:05 +0100, George Dunlap wrote:
> On 19/07/16 10:57, Dario Faggioli wrote:
> > 
> > > What about folding in something like the attached patch?
> > > 
> > I'd be totally fine with this.
> Do you mean you ack me folding in that particular patch (so that the
> resulting commit looks like the attached)?
> 
Yes, I do.

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-07-19 10:19 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-15 14:49 [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
2016-07-15 14:49 ` [PATCH v2 01/11] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
2016-07-18 14:35   ` George Dunlap
2016-07-15 14:49 ` [PATCH v2 02/11] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
2016-07-15 14:49 ` [PATCH v2 03/11] xen: credit2: rework load tracking logic Dario Faggioli
2016-07-18 14:46   ` George Dunlap
2016-07-18 14:51     ` Dario Faggioli
2016-07-15 14:49 ` [PATCH v2 04/11] xen/tools: improve tracing of Credit2 load tracking events Dario Faggioli
2016-07-18 15:38   ` Wei Liu
2016-07-15 14:49 ` [PATCH v2 05/11] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
2016-07-15 14:49 ` [PATCH v2 06/11] xen: credit2: make the code less experimental Dario Faggioli
2016-07-18 15:04   ` George Dunlap
2016-07-15 14:49 ` [PATCH v2 07/11] xen: credit2: add yet some more tracing Dario Faggioli
2016-07-15 14:50 ` [PATCH v2 08/11] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
2016-07-18 15:10   ` George Dunlap
2016-07-15 14:50 ` [PATCH v2 09/11] tools: tracing: deal with new Credit2 events Dario Faggioli
2016-07-18 15:39   ` Wei Liu
2016-07-15 14:50 ` [PATCH v2 10/11] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
2016-07-15 14:50 ` [PATCH v2 11/11] xen: credit2: implement true SMT support Dario Faggioli
2016-07-18 16:48   ` George Dunlap
2016-07-18 17:24     ` Dario Faggioli
2016-07-19  9:39       ` George Dunlap
2016-07-19  9:57         ` Dario Faggioli
2016-07-19 10:05           ` George Dunlap
2016-07-19 10:19             ` Dario Faggioli
2016-07-18 17:21 ` [PATCH v2 00/11] xen: sched: assorted fixes and improvements to Credit2 George Dunlap
