xen-devel.lists.xenproject.org archive mirror
* [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2
@ 2016-06-17 23:11 Dario Faggioli
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
                   ` (18 more replies)
  0 siblings, 19 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, Jan Beulich, David Vrabel

Hi everyone,

Here you go: a collection of pseudo-random fixes and improvements to Credit2.

In the process of working on Soft Affinity and Caps support, I stumbled upon
them, one after the other, and decided to take care of them.

It's been hard to test and run benchmarks, due to the "time goes backwards" bug
I uncovered [1], and this is at least part of the reason why the code for
affinity and caps is still missing. I've got it already, but need to refine a
couple of things, after double checking benchmark results. So, now that we have
Jan's series [2] (thanks! [*]), and that I managed to indeed run some tests on
this preliminary set of patches, I decided I'd better set this first group
free, while working on finishing the rest.

The various patches do a wide range of different things, so, please, refer to
the individual changelogs for more detailed explanations.

About the numbers I could collect so far, here's the situation. I've run
rather simple benchmarks such as:
 - Xen build inside a VM. Metric is how long that takes (in seconds), so
   lower is better.
 - Iperf from a VM to its host. Metric is total aggregate throughput, so
   higher is better.

The host is a 16 pCPUs / 2 NUMA nodes Xeon E5620, 6GB RAM per node. The VM had
16 vCPUs and 4GB of memory. Dom0 had 16 vCPUs as well, and 1GB of RAM.

The Xen build, I did it one time with -j4 --representative of low VM load--
and another time with -j24 --representative of high VM load. For the Iperf
test, I've only used 8 parallel streams (I wanted to do 4 and 8, but there was
a bug in my scripts! :-/).

I've run the above both with and without disturbing external (from the point of
view of the VM) load. Such load was just generated by means of running
processes in dom0. It's rather basic, but it certainly keeps dom0's vCPUs busy
and stresses the scheduler. This "noise", when present, was composed of:
 - 8 (v)CPU hog processes (`yes &> /dev/null'), running in dom0
 - 4 processes alternating computation and sleep with a duty cycle of 35%
   (a minimal sketch of these is below).

So, there basically were 12 vCPUs of dom0 kept busy, in a heterogeneous fashion.
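
For reference, here's a minimal sketch of one of those duty-cycle processes
(a reconstruction in C, not the actual script I used):

    #include <time.h>

    int main(void)
    {
        struct timespec sleep_ts = { 0, 65 * 1000 * 1000 };   /* 65 msec */

        for ( ;; )
        {
            struct timespec start, now;

            /* Busy-loop for 35 msec: 35% duty cycle over a 100 msec period. */
            clock_gettime(CLOCK_MONOTONIC, &start);
            do
                clock_gettime(CLOCK_MONOTONIC, &now);
            while ( (now.tv_sec - start.tv_sec) * 1000000000L +
                    (now.tv_nsec - start.tv_nsec) < 35 * 1000 * 1000L );

            /* Sleep for the remaining 65 msec of the period. */
            nanosleep(&sleep_ts, NULL);
        }
    }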

I benchmarked Credit2 with runqueues arranged per-core (the current default)
and per-socket, and also Credit1, for reference. The baseline was current
staging plus Jan's monotonicity series.

Actual numbers:

|=======================================================================|
| CREDIT 1 (for reference)                                              |
|=======================================================================|
| Xen build, low VM load, no noise    |
|-------------------------------------|
|               32.207                |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, no noise   | Iperf, high VM load, no noise   |
|-------------------------------------|---------------------------------|
|               18.500                |             22.633              |
|-------------------------------------|---------------------------------|
| Xen build, low VM load, with noise  |
|-------------------------------------|
|               38.700                |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, with noise | Iperf, high VM load, with noise |
|-------------------------------------|---------------------------------|
|               80.317                |             21.300              |
|=======================================================================|
| CREDIT 2                                                              |
|=======================================================================|
| Xen build, low VM load, no noise    | 
|-------------------------------------|
|            runq=core   runq=socket  |
| baseline     34.543       38.070    |
| patched      35.200       33.433    |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, no noise   | Iperf, high VM load, no noise   |
|-------------------------------------|---------------------------------|
|            runq=core   runq=socket  |           runq=core runq=socket |
| baseline     18.710       19.397    | baseline    21.300     21.933   |
| patched      18.013       18.530    | patched     23.200     23.466   |
|-------------------------------------|---------------------------------|
| Xen build, low VM load, with noise  |
|-------------------------------------|
|            runq=core   runq=socket  |
| baseline     44.483       40.747    |
| patched      45.866       39.493    |
|-------------------------------------|---------------------------------|
| Xen build, high VM load, with noise | Iperf, high VM load, with noise |
|-------------------------------------|---------------------------------|
|            runq=core   runq=socket  |           runq=core runq=socket |
| baseline     41.466       30.630    | baseline    20.333     20.633   |
| patched      36.840       29.080    | patched     19.967     21.000   |
|=======================================================================|

Which, summarizing, means:
 * as far as Credit2 is concerned, applying this series and using runq=socket
   is what _ALWAYS_ provides the best results.
 * when looking at Credit1 vs. patched Credit2 with runq=socket:
  - Xen build, low VM load,  no noise  : Credit1 slightly better
  - Xen build, high VM load, no noise  : on par
  - Xen build, low VM load,  with noise: Credit1 a bit better
  - Xen build, high VM load, with noise: Credit2 _ENORMOUSLY_ better (yes, I
    reran both cases a number of times!)
  - Iperf,     high VM load, no noise  : Credit2 a bit better
  - Iperf,     high VM load, with noise: Credit1 slightly better    

So, Credit1 still wins a few rounds, but performance is very, very close,
and this series seems to me to help narrow the gap (for some of the cases,
significantly).

It also looks like the 'Xen build, high VM load, with noise' test case,
although rather naive, exposed another of those issues with Credit1 (more
investigation is necessary), while Credit2 keeps up just fine.

Another interesting thing to note is that, on Credit2 (with this series), 'Xen
build, high VM load, with noise' turns out to be quicker than 'Xen build, low
VM load, with noise'. This means that using a higher value for `make -j' for a
build, inside a guest, results in a quicker build, which makes sense... But
that is _NOT_ what happens on Credit1, the whole thing (wildly :-P) hinting at
Credit2 being able to achieve better scalability and better fairness.

In any case, more benchmarking is necessary, and is already planned. More
investigation is also necessary to figure out whether, once we have this
series, going back to runq=socket as the default would indeed be the best
thing (which I suspect it will be).

But from all I see, and from all the various perspectives, this series seems
to be a step in the right direction.

Thanks and Regards,
Dario

[1] http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
[2] http://lists.xen.org/archives/html/xen-devel/2016-06/msg01884.html

[*] Jan, I confirm that, with your series applied, I haven't yet seen any of
those "Time went backwards?" printk from Credit2, as you sort of were
expecting...

---
Dario Faggioli (19):
      xen: sched: leave CPUs doing tasklet work alone.
      xen: sched: make the 'tickled' perf counter clearer
      xen: credit2: insert and tickle don't need a cpu parameter
      xen: credit2: kill useless helper function choose_cpu
      xen: credit2: do not warn if calling burn_credits more than once
      xen: credit2: read NOW() with the proper runq lock held
      xen: credit2: prevent load balancing to go mad if time goes backwards
      xen: credit2: when tickling, check idle cpus first
      xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu
      xen: credit2: rework load tracking logic
      tools: tracing: adapt Credit2 load tracking events to new format
      xen: credit2: use non-atomic cpumask and bit operations
      xen: credit2: make the code less experimental
      xen: credit2: add yet some more tracing
      xen: credit2: only marshall trace point arguments if tracing enabled
      tools: tracing: deal with new Credit2 events
      xen: credit2: the private scheduler lock can be an rwlock.
      xen: credit2: implement SMT support independent runq arrangement
      xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu


 docs/misc/xen-command-line.markdown |   30 +
 tools/xentrace/formats              |   10 
 tools/xentrace/xenalyze.c           |  103 +++
 xen/common/sched_credit.c           |   22 -
 xen/common/sched_credit2.c          | 1158 +++++++++++++++++++++++++----------
 xen/common/sched_rt.c               |    8 
 xen/include/xen/cpumask.h           |    8 
 xen/include/xen/perfc_defn.h        |    5 
 8 files changed, 973 insertions(+), 371 deletions(-)

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
@ 2016-06-17 23:11 ` Dario Faggioli
  2016-06-20  7:48   ` Jan Beulich
                     ` (2 more replies)
  2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
                   ` (17 subsequent siblings)
  18 siblings, 3 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

In both Credit1 and Credit2, stop considering a pCPU idle
if the reason why the idle vCPU is being selected is to
do tasklet work.

Not doing so means that the tickling and load balancing
logic, seeing the pCPU as idle, considers it a candidate
for picking up vCPUs. But the pCPU won't actually pick
up or schedule any vCPU, which would then remain in the
runqueue, which is bad, especially if there were other,
truly idle pCPUs, that could execute it.
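
In code terms, the idle-mask update then works like in the
following condensed sketch of the Credit2 hunk below:

    if ( tasklet_work_scheduled )
        cpumask_clear_cpu(cpu, &rqd->idle);   /* play busy! */
    else
        cpumask_set_cpu(cpu, &rqd->idle);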

The only drawback is that we can't assume that a pCPU is
always marked as idle when being removed from an
instance of the Credit2 scheduler (csched2_deinit_pdata).
In fact, if we are in stop-machine (i.e., during suspend
or shutdown), the pCPUs are running the stopmachine_tasklet
and hence are actually marked as busy. On the other hand,
when removing a pCPU from a Credit2 pool, it will indeed
be idle. The only thing we can do, therefore, is to
remove the BUG_ON() check.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit.c  |   12 ++++++------
 xen/common/sched_credit2.c |   14 ++++++++++----
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a38a63d..a6645a2 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1819,24 +1819,24 @@ csched_schedule(
     else
         snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
 
+ out:
     /*
      * Update idlers mask if necessary. When we're idling, other CPUs
      * will tickle us when they get extra work.
      */
-    if ( snext->pri == CSCHED_PRI_IDLE )
+    if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
     {
-        if ( !cpumask_test_cpu(cpu, prv->idlers) )
-            cpumask_set_cpu(cpu, prv->idlers);
+        if ( cpumask_test_cpu(cpu, prv->idlers) )
+            cpumask_clear_cpu(cpu, prv->idlers);
     }
-    else if ( cpumask_test_cpu(cpu, prv->idlers) )
+    else if ( !cpumask_test_cpu(cpu, prv->idlers) )
     {
-        cpumask_clear_cpu(cpu, prv->idlers);
+        cpumask_set_cpu(cpu, prv->idlers);
     }
 
     if ( !is_idle_vcpu(snext->vcpu) )
         snext->start_time += now;
 
-out:
     /*
      * Return task to run next...
      */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 1933ff1..cf8455c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1910,8 +1910,16 @@ csched2_schedule(
     }
     else
     {
-        /* Update the idle mask if necessary */
-        if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        /*
+         * Update the idle mask if necessary. Note that, if we're scheduling
+         * idle in order to carry on some tasklet work, we want to play busy!
+         */
+        if ( tasklet_work_scheduled )
+        {
+            if ( cpumask_test_cpu(cpu, &rqd->idle) )
+                cpumask_clear_cpu(cpu, &rqd->idle);
+        }
+        else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
             cpumask_set_cpu(cpu, &rqd->idle);
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
@@ -2291,8 +2299,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     /* No need to save IRQs here, they're already disabled */
     spin_lock(&rqd->lock);
 
-    BUG_ON(!cpumask_test_cpu(cpu, &rqd->idle));
-
     printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     cpumask_clear_cpu(cpu, &rqd->idle);



* [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
@ 2016-06-17 23:11 ` Dario Faggioli
  2016-06-18  0:36   ` Meng Xu
  2016-07-06 15:52   ` George Dunlap
  2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, Meng Xu, George Dunlap, David Vrabel

In fact, what we have right now, i.e., tickle_idlers_none
and tickle_idlers_some, is not good enough for describing
what really happens in the various tickling functions of
the various schedulers.

Switch to a more descriptive set of counters, such as:
 - tickled_no_cpu: for when we don't tickle anyone
 - tickled_idle_cpu: for when we tickle one or more
                     idlers
 - tickled_busy_cpu: for when we tickle one or more
                     non-idlers

While there, fix style of an "out:" label in sched_rt.c.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Meng Xu <mengxu@cis.upenn.edu>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit.c    |   10 +++++++---
 xen/common/sched_credit2.c   |   12 +++++-------
 xen/common/sched_rt.c        |    8 +++++---
 xen/include/xen/perfc_defn.h |    5 +++--
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a6645a2..a54bb2d 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -385,7 +385,9 @@ static inline void __runq_tickle(struct csched_vcpu *new)
          || (idlers_empty && new->pri > cur->pri) )
     {
         if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_idlers_none);
+            SCHED_STAT_CRANK(tickled_busy_cpu);
+        else
+            SCHED_STAT_CRANK(tickled_idle_cpu);
         __cpumask_set_cpu(cpu, &mask);
     }
     else if ( !idlers_empty )
@@ -444,13 +446,13 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                     set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
                 }
                 /* Tickle cpu anyway, to let new preempt cur. */
-                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_STAT_CRANK(tickled_busy_cpu);
                 __cpumask_set_cpu(cpu, &mask);
             }
             else if ( !new_idlers_empty )
             {
                 /* Which of the idlers suitable for new shall we wake up? */
-                SCHED_STAT_CRANK(tickle_idlers_some);
+                SCHED_STAT_CRANK(tickled_idle_cpu);
                 if ( opt_tickle_one_idle )
                 {
                     this_cpu(last_tickle_cpu) =
@@ -479,6 +481,8 @@ static inline void __runq_tickle(struct csched_vcpu *new)
         /* Send scheduler interrupts to designated CPUs */
         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
     }
+    else
+        SCHED_STAT_CRANK(tickled_no_cpu);
 }
 
 static void
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index cf8455c..0246453 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -589,6 +589,7 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     i = cpumask_cycle(cpu, &mask);
     if ( i < nr_cpu_ids )
     {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
         ipid = i;
         goto tickle;
     }
@@ -637,11 +638,12 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
      * than the migrate resistance */
     if ( ipid == -1 || lowest + CSCHED2_MIGRATE_RESIST > new->credit )
     {
-        SCHED_STAT_CRANK(tickle_idlers_none);
-        goto no_tickle;
+        SCHED_STAT_CRANK(tickled_no_cpu);
+        return;
     }
 
-tickle:
+    SCHED_STAT_CRANK(tickled_busy_cpu);
+ tickle:
     BUG_ON(ipid == -1);
 
     /* TRACE */ {
@@ -654,11 +656,7 @@ tickle:
                   (unsigned char *)&d);
     }
     cpumask_set_cpu(ipid, &rqd->tickled);
-    SCHED_STAT_CRANK(tickle_idlers_some);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
-
-no_tickle:
-    return;
 }
 
 /*
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index 5b077d7..dd1c4d3 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -1140,6 +1140,7 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
     /* 1) if new's previous cpu is idle, kick it for cache benefit */
     if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
     {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
         cpu_to_tickle = new->vcpu->processor;
         goto out;
     }
@@ -1151,6 +1152,7 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
         iter_vc = curr_on_cpu(cpu);
         if ( is_idle_vcpu(iter_vc) )
         {
+            SCHED_STAT_CRANK(tickled_idle_cpu);
             cpu_to_tickle = cpu;
             goto out;
         }
@@ -1164,14 +1166,15 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
     if ( latest_deadline_vcpu != NULL &&
          new->cur_deadline < latest_deadline_vcpu->cur_deadline )
     {
+        SCHED_STAT_CRANK(tickled_busy_cpu);
         cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
         goto out;
     }
 
     /* didn't tickle any cpu */
-    SCHED_STAT_CRANK(tickle_idlers_none);
+    SCHED_STAT_CRANK(tickled_no_cpu);
     return;
-out:
+ out:
     /* TRACE */
     {
         struct {
@@ -1185,7 +1188,6 @@ out:
     }
 
     cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
-    SCHED_STAT_CRANK(tickle_idlers_some);
     cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
     return;
 }
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 21c1e0b..a336c71 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -27,8 +27,9 @@ PERFCOUNTER(vcpu_wake_running,      "sched: vcpu_wake_running")
 PERFCOUNTER(vcpu_wake_onrunq,       "sched: vcpu_wake_onrunq")
 PERFCOUNTER(vcpu_wake_runnable,     "sched: vcpu_wake_runnable")
 PERFCOUNTER(vcpu_wake_not_runnable, "sched: vcpu_wake_not_runnable")
-PERFCOUNTER(tickle_idlers_none,     "sched: tickle_idlers_none")
-PERFCOUNTER(tickle_idlers_some,     "sched: tickle_idlers_some")
+PERFCOUNTER(tickled_no_cpu,         "sched: tickled_no_cpu")
+PERFCOUNTER(tickled_idle_cpu,       "sched: tickled_idle_cpu")
+PERFCOUNTER(tickled_busy_cpu,       "sched: tickled_busy_cpu")
 PERFCOUNTER(vcpu_check,             "sched: vcpu_check")
 
 /* credit specific counters */



* [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
  2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
@ 2016-06-17 23:11 ` Dario Faggioli
  2016-06-21 16:41   ` anshul makkar
  2016-07-06 15:59   ` George Dunlap
  2016-06-17 23:11 ` [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu Dario Faggioli
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, David Vrabel, George Dunlap

In fact, they always operate on svc->vcpu->processor, i.e.,
on the processor of the csched2_vcpu passed to them.

No functional change intended.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 0246453..5881583 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -518,8 +518,9 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu *svc)
 }
 
 static void
-runq_insert(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *svc)
+runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
 {
+    unsigned int cpu = svc->vcpu->processor;
     struct list_head * runq = &RQD(ops, cpu)->runq;
     int pos = 0;
 
@@ -558,17 +559,17 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_ti
 /* Check to see if the item on the runqueue is higher priority than what's
  * currently running; if so, wake up the processor */
 static /*inline*/ void
-runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *new, s_time_t now)
+runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 {
     int i, ipid=-1;
     s_time_t lowest=(1<<30);
+    unsigned int cpu = new->vcpu->processor;
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     cpumask_t mask;
     struct csched2_vcpu * cur;
 
     d2printk("rqt %pv curr %pv\n", new->vcpu, current);
 
-    BUG_ON(new->vcpu->processor != cpu);
     BUG_ON(new->rqd != rqd);
 
     /* Look at the cpu it's running on first */
@@ -1071,8 +1072,8 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     update_load(ops, svc->rqd, svc, 1, now);
         
     /* Put the VCPU on the runq */
-    runq_insert(ops, vc->processor, svc);
-    runq_tickle(ops, vc->processor, svc, now);
+    runq_insert(ops, svc);
+    runq_tickle(ops, svc, now);
 
 out:
     d2printk("w-\n");
@@ -1104,8 +1105,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     {
         BUG_ON(__vcpu_on_runq(svc));
 
-        runq_insert(ops, vc->processor, svc);
-        runq_tickle(ops, vc->processor, svc, now);
+        runq_insert(ops, svc);
+        runq_tickle(ops, svc, now);
     }
     else if ( !is_idle_vcpu(vc) )
         update_load(ops, svc->rqd, svc, -1, now);
@@ -1313,8 +1314,8 @@ static void migrate(const struct scheduler *ops,
         if ( on_runq )
         {
             update_load(ops, svc->rqd, NULL, 1, now);
-            runq_insert(ops, svc->vcpu->processor, svc);
-            runq_tickle(ops, svc->vcpu->processor, svc, now);
+            runq_insert(ops, svc);
+            runq_tickle(ops, svc, now);
             SCHED_STAT_CRANK(migrate_on_runq);
         }
         else



* [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
@ 2016-06-17 23:11 ` Dario Faggioli
  2016-07-06 16:02   ` George Dunlap
  2016-06-17 23:11 ` [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once Dario Faggioli
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

In fact, it has the same signature as csched2_cpu_pick,
which is also its unique caller.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 5881583..ef199e3 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -321,7 +321,7 @@ struct csched2_dom {
 /*
  * When a hard affinity change occurs, we may not be able to check some
  * (any!) of the other runqueues, when looking for the best new processor
- * for svc (as trylock-s in choose_cpu() can fail). If that happens, we
+ * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
  * pick, in order of decreasing preference:
  *  - svc's current pcpu;
  *  - another pcpu from svc's current runq;
@@ -1116,7 +1116,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
 
 #define MAX_LOAD (1ULL<<60);
 static int
-choose_cpu(const struct scheduler *ops, struct vcpu *vc)
+csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     int i, min_rqi = -1, new_cpu;
@@ -1490,16 +1490,6 @@ out:
     return;
 }
 
-static int
-csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
-{
-    int new_cpu;
-
-    new_cpu = choose_cpu(ops, vc);
-
-    return new_cpu;
-}
-
 static void
 csched2_vcpu_migrate(
     const struct scheduler *ops, struct vcpu *vc, unsigned int new_cpu)



* [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2016-06-17 23:11 ` [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu Dario Faggioli
@ 2016-06-17 23:11 ` Dario Faggioli
  2016-07-06 16:05   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held Dario Faggioli
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

on the same vcpu, without NOW() having changed.

This is, in fact, a legitimate use case. If it happens,
we should just do nothing, without producing any warning
or debug message.
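
As a trivial illustration of such a use case (the call sites
here are hypothetical):

    now = NOW();
    burn_credits(rqd, svc, now);
    /* ... more scheduling work, without sampling NOW() again ... */
    burn_credits(rqd, svc, now);   /* delta == 0: just do nothing */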

While there, fix style and add a couple of branching
hints.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index ef199e3..9e8e561 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -738,14 +738,15 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
     /* No need to resort runqueue, as everyone's order should be the same. */
 }
 
-void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *svc, s_time_t now)
+void burn_credits(struct csched2_runqueue_data *rqd,
+                  struct csched2_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
 
     /* Assert svc is current */
     ASSERT(svc==CSCHED2_VCPU(curr_on_cpu(svc->vcpu->processor)));
 
-    if ( is_idle_vcpu(svc->vcpu) )
+    if ( unlikely(is_idle_vcpu(svc->vcpu)) )
     {
         BUG_ON(svc->credit != CSCHED2_IDLE_CREDIT);
         return;
@@ -753,13 +754,16 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *svc, s
 
     delta = now - svc->start_time;
 
-    if ( delta > 0 ) {
+    if ( likely(delta > 0) )
+    {
         SCHED_STAT_CRANK(burn_credits_t2c);
         t2c_update(rqd, delta, svc);
         svc->start_time = now;
 
         d2printk("b %pv c%d\n", svc->vcpu, svc->credit);
-    } else {
+    }
+    else if ( delta < 0 )
+    {
         d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
                __func__, now, svc->start_time);
     }



* [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2016-06-17 23:11 ` [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-06-20  7:56   ` Jan Beulich
  2016-06-17 23:12 ` [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

Yet another situation very similar to 779511f4bf5ae
("sched: avoid races on time values read from NOW()").

In fact, when more than one runqueue is involved, we need
to make sure that the following does not happen:
 1. take the lock of 1st runq
 2. now = NOW()
 3. take the lock of 2nd runq
 4. use now

as, if we have to wait at step 3, the value in now may
be stale when we get to use it at step 4.
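
In code terms, the safe pattern looks like the following sketch
(condensed from the balance_load() hunk below, locking failures
not handled):

    spin_lock(&lrqd->lock);
    now = NOW();
    __update_runq_load(ops, lrqd, 0, now);  /* ok: sampled under lrqd's lock */
    ...
    if ( spin_trylock(&orqd->lock) )
        /* Re-sample: the value in 'now' may predate taking orqd's lock. */
        __update_runq_load(ops, orqd, 0, NOW());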

While there, fix the style of a label.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 9e8e561..50f8dfd 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1361,7 +1361,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
 
     __update_runq_load(ops, st.lrqd, 0, now);
 
-retry:
+ retry:
     if ( !spin_trylock(&prv->lock) )
         return;
 
@@ -1377,7 +1377,8 @@ retry:
              || !spin_trylock(&st.orqd->lock) )
             continue;
 
-        __update_runq_load(ops, st.orqd, 0, now);
+        /* Use a value of NOW() sampled after taking orqd's lock. */
+        __update_runq_load(ops, st.orqd, 0, NOW());
     
         delta = st.lrqd->b_avgload - st.orqd->b_avgload;
         if ( delta < 0 )
@@ -1435,6 +1436,8 @@ retry:
     if ( unlikely(st.orqd->id < 0) )
         goto out_up;
 
+    now = NOW();
+
     /* Look for "swap" which gives the best load average
      * FIXME: O(n^2)! */
 



* [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-06-20  8:02   ` Jan Beulich
  2016-06-17 23:12 ` [PATCH 08/19] xen: credit2: when tickling, check idle cpus first Dario Faggioli
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

This really should not happen, but:
 1. it does happen! Investigation is ongoing here:
    http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
 2. even when 1 is fixed, it makes sense and is easy enough
    to have a 'safety catch' for it.

The reason why this is particularly bad for Credit2 is that
negative values of delta mean out of scale high load (because
of the conversion to unsigned). This, for instance in the
case of runqueue load, results in a runqueue having its load
updated to values of the order of 10000% or so, which in turn
means that the load balancer will migrate everything off from
the pCPUs in the runqueue, and leave them idle until the load
gets back to something sane... which may indeed take a while!
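
To see the effect of the unsigned conversion on a negative delta,
consider this tiny standalone illustration (not code from the patch):

    s_time_t delta = -1000;   /* time went backwards by 1 usec */
    unsigned long long w = (unsigned long long)delta;
    /* w == 18446744073709550616 (~2^64): an absurdly huge value, which
     * is what was ending up in the load average calculations. */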

This is not a fix for the problem of time going backwards. In
fact, if that happens a lot, load tracking accuracy is still
compromised, but at least the effect is a lot less bad than
before.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 50f8dfd..b73d034 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
     else
     {
         delta = now - rqd->load_last_update;
+        if ( unlikely(delta < 0) )
+        {
+            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+                     __func__, now, rqd->load_last_update);
+            delta = 0;
+        }
 
         rqd->avgload =
             ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
@@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
     else
     {
         delta = now - svc->load_last_update;
+        if ( unlikely(delta < 0) )
+        {
+            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
+                     __func__, now, svc->load_last_update);
+            delta = 0;
+        }
 
         svc->avgload =
             ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )



* [PATCH 08/19] xen: credit2: when tickling, check idle cpus first
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (6 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-07-06 16:36   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu Dario Faggioli
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

If there are idle pCPUs, it's always better to try to
"ship" the new vCPU there, instead of letting it
preempt a currently busy one.

This commit also adds a cpumask_test_or_cycle() helper
function, to make it easier to code the preference for
the pCPU where the vCPU was running before.
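
Usage is as simple as it sounds (a sketch matching how the tickling
code below uses it; tickle() is just a placeholder):

    /* Pick cpu itself, if it's in mask; else, cycle to the next bit set. */
    i = cpumask_test_or_cycle(cpu, &mask);
    if ( i < nr_cpu_ids )
        tickle(i);   /* i is suitable; it's cpu itself, when possible */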

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   68 +++++++++++++++++++++++++++++---------------
 xen/include/xen/cpumask.h  |    8 +++++
 2 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b73d034..af28e7b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -568,9 +568,23 @@ __runq_remove(struct csched2_vcpu *svc)
 
 void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_time_t);
 
-/* Check to see if the item on the runqueue is higher priority than what's
- * currently running; if so, wake up the processor */
-static /*inline*/ void
+/*
+ * Check what processor it is best to 'wake', for picking up a vcpu that has
+ * just been put (back) in the runqueue. Logic is as follows:
+ *  1. if there are idle processors in the runq, wake one of them;
+ *  2. if there aren't idle processors, check the one where the vcpu was
+ *     running before, to see if we can preempt what's running there now
+ *     (and hence doing just one migration);
+ *  3. last stand: check all processors and see if the vcpu has the right
+ *     to preempt any of the other vcpus running on them (this requires
+ *     two migrations, and that's indeed why it is left as the last stand).
+ *
+ * Note that when we say 'idle processors' what we really mean is (pretty
+ * much always) both _idle_ and _not_already_tickled_. In fact, if a
+ * processor has been tickled, it will run csched2_schedule() shortly, and
+ * pick up some work, so it would be wrong to consider it idle.
+ */
+static void
 runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 {
     int i, ipid=-1;
@@ -584,22 +598,14 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
     BUG_ON(new->rqd != rqd);
 
-    /* Look at the cpu it's running on first */
-    cur = CSCHED2_VCPU(curr_on_cpu(cpu));
-    burn_credits(rqd, cur, now);
-
-    if ( cur->credit < new->credit )
-    {
-        ipid = cpu;
-        goto tickle;
-    }
-    
-    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
+    /*
+     * Get a mask of idle, but not tickled, processors that new is
+     * allowed to run on. If that's not empty, choose someone from there
+     * (preferably, the one where new was already running).
+     */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
-    
-    /* If it's not empty, choose one */
-    i = cpumask_cycle(cpu, &mask);
+    i = cpumask_test_or_cycle(cpu, &mask);
     if ( i < nr_cpu_ids )
     {
         SCHED_STAT_CRANK(tickled_idle_cpu);
@@ -607,12 +613,26 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         goto tickle;
     }
 
-    /* Otherwise, look for the non-idle cpu with the lowest credit,
-     * skipping cpus which have been tickled but not scheduled yet,
-     * that new is allowed to run on. */
+    /*
+     * Otherwise, look for the non-idle (and non-tickled) processors with
+     * the lowest credit, among the ones new is allowed to run on. Again,
+     * the cpu where it was running would be the best candidate.
+     */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    if ( cpumask_test_cpu(cpu, &mask) )
+    {
+        cur = CSCHED2_VCPU(curr_on_cpu(cpu));
+        burn_credits(rqd, cur, now);
+
+        if ( cur->credit < new->credit )
+        {
+            SCHED_STAT_CRANK(tickled_busy_cpu);
+            ipid = cpu;
+            goto tickle;
+        }
+    }
 
     for_each_cpu(i, &mask)
     {
@@ -624,7 +644,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
         BUG_ON(is_idle_vcpu(cur->vcpu));
 
-        /* Update credits for current to see if we want to preempt */
+        /* Update credits for current to see if we want to preempt. */
         burn_credits(rqd, cur, now);
 
         if ( cur->credit < lowest )
@@ -647,8 +667,10 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         }
     }
 
-    /* Only switch to another processor if the credit difference is greater
-     * than the migrate resistance */
+    /*
+     * Only switch to another processor if the credit difference is
+     * greater than the migrate resistance.
+     */
     if ( ipid == -1 || lowest + CSCHED2_MIGRATE_RESIST > new->credit )
     {
         SCHED_STAT_CRANK(tickled_no_cpu);
diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
index 0e7108c..3f340d6 100644
--- a/xen/include/xen/cpumask.h
+++ b/xen/include/xen/cpumask.h
@@ -266,6 +266,14 @@ static inline int cpumask_cycle(int n, const cpumask_t *srcp)
     return nxt;
 }
 
+static inline int cpumask_test_or_cycle(int n, const cpumask_t *srcp)
+{
+    if ( cpumask_test_cpu(n, srcp) )
+        return n;
+
+    return cpumask_cycle(n, srcp);
+}
+
 static inline unsigned int cpumask_any(const cpumask_t *srcp)
 {
     unsigned int cpu = cpumask_first(srcp);



* [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (7 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 08/19] xen: credit2: when tickling, check idle cpus first Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-07-06 16:40   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 10/19] xen: credit2: rework load tracking logic Dario Faggioli
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

by not resetting the variable that should guard against
that at the beginning of each step of the outer loop.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index af28e7b..c534f9c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1378,6 +1378,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     int i, max_delta_rqi = -1;
     struct list_head *push_iter, *pull_iter;
+    bool_t inner_load_updated = 0;
 
     balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
     
@@ -1478,7 +1479,6 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     /* Reuse load delta (as we're trying to minimize it) */
     list_for_each( push_iter, &st.lrqd->svc )
     {
-        int inner_load_updated = 0;
         struct csched2_vcpu * push_svc = list_entry(push_iter, struct csched2_vcpu, rqd_elem);
 
         __update_svc_load(ops, push_svc, 0, now);
@@ -1490,10 +1490,8 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
         {
             struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
             
-            if ( ! inner_load_updated )
-            {
+            if ( !inner_load_updated )
                 __update_svc_load(ops, pull_svc, 0, now);
-            }
         
             if ( !vcpu_is_migrateable(pull_svc, st.lrqd) )
                 continue;



* [PATCH 10/19] xen: credit2: rework load tracking logic
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (8 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-07-06 17:33   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format Dario Faggioli
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

The existing load tracking code was hard to understand and
maintain, and not entirely consistent. This is due to a
number of reasons:
 - code and comments were not in perfect sync, making it
   difficult to figure out what the intent of a particular
   choice was (e.g., the choice of 18 for load_window_shift);
 - the math, although effective, was not entirely consistent.
   In fact, we were doing (if W is the length of the window):

    avgload = (delta*load*W + (W - delta)*avgload)/W
    avgload = avgload + delta*load - delta*avgload/W

   which does not match any known variant of 'smoothing
   moving average'. In fact, it should have been:

    avgload = avgload + delta*load/W - delta*avgload/W

   (for details on why, see the doc comments inside this
   patch.). Furthermore, with the old formula, when delta gets close
   to W (i.e., at the end of a full window), we end up with:

     avgload ~= avgload + W*load - avgload
     avgload ~= W*load

The reason why the formula above sort of worked was because
the number of bits used for the fractional parts of the
values used in fixed point math and the number of bits used
for the length of the window were the same (load_window_shift
was being used for both).

This may look handy, but it introduced a (not especially well
documented) dependency between the length of the window and
the precision of the calculations, which really should be
two independent things. Especially if treating them as such
(like it is done in this patch) does not lead to more
complex maths (same number of multiplications and shifts, and
there is still room for some optimization).

Therefore, in this patch, we:
 - split length of the window and precision (and, since there
   is already a command line parameter for length of window,
   introduce one for precision too),
 - align the math with one proper incarnation of exponential
   smoothing (at no added cost),
 - add comments, about the details of the algorithm and the
   math used.

While there, fix a couple of style issues as well (pointless
initialization, long lines, comments).
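
For reference, here is a self-contained sketch of the resulting
update step (my own condensation of the formulas above; the names,
and the clamping of negative deltas, are illustrative):

    #include <stdint.h>

    #define P_SHIFT 18   /* precision: fractional bits (Q-format) */
    #define W_SHIFT 30   /* window length W = 2^30 nsec ~= 1 sec */

    /* avgload = avgload + delta*load*P/W - delta*avgload/W */
    static int64_t ewma_update(int64_t avgload, int64_t load,
                               int64_t now, int64_t last_update)
    {
        int64_t delta = now - last_update;

        if ( delta < 0 )                  /* time went backwards (patch 07) */
            delta = 0;
        if ( delta >= (1LL << W_SHIFT) )  /* window expired: reset condition */
            return load << P_SHIFT;

        return avgload + ((delta * (load << P_SHIFT)) >> W_SHIFT)
                       - ((delta * avgload) >> W_SHIFT);
    }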

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 docs/misc/xen-command-line.markdown |   30 +++
 xen/common/sched_credit2.c          |  328 ++++++++++++++++++++++++++++++-----
 2 files changed, 310 insertions(+), 48 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index fed732c..29a554b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -477,9 +477,39 @@ the address range the area should fall into.
 ### credit2\_balance\_under
 > `= <integer>`
 
+### credit2\_load\_precision\_shift
+> `= <integer>`
+
+> Default: `18`
+
+Specify the number of bits to use for the fractional part of the
+values involved in Credit2 load tracking and load balancing math.
+
 ### credit2\_load\_window\_shift
 > `= <integer>`
 
+> Default: `30`
+
+Specify the number of bits used to represent the length of the
+window (in nanoseconds) we use for load tracking inside Credit2.
+This means that, with the default value (30), we use
+2^30 nsec ~= 1 sec long window.
+
+Load tracking is done by means of a variation of exponentially
+weighted moving average (EWMA). The window length defined here
+determines for how long we give weight to the previous history
+of the load itself. In fact, after a full window has passed,
+all the previous history is discarded entirely.
+
+A short window will make the load balancer quick at reacting
+to load changes, but also short-sighted about previous history
+(and hence, e.g., long term load trends). A long window will
+make the load balancer thoughtful of previous history (and
+hence capable of capturing, e.g., long term load trends), but
+also slow in responding to load changes.
+
+The default value of `1 sec` is rather long.
+
 ### credit2\_runqueue
 > `= core | socket | node | all`
 
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index c534f9c..230a512 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -173,16 +173,88 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
 
 /*
- * Shifts for load average.
- * - granularity: Reduce granularity of time by a factor of 1000, so we can use 32-bit maths
- * - window shift: Given granularity shift, make the window about 1 second
- * - scale shift: Shift up load by this amount rather than using fractions; 128 corresponds 
- *   to a load of 1.
+ * Load tracking and load balancing
+ *
+ * Load history of runqueues and vcpus is accounted for by using an
+ * exponentially weighted moving average algorithm. However, instead of using
+ * fractions, we shift everything to the left by the number of bits we want to
+ * use for representing the fractional part (Q-format).
+ *
+ * We may also want to reduce the precision of time accounting, to
+ * accommodate 'longer windows'. So, if that is the case, we just need to
+ * shift all time samples to the right.
+ *
+ * The details of the formulas used for load tracking are explained close to
+ * __update_runq_load(). Let's just say here that, with full nanosecond time
+ * granularity, a 30 bits wide 'decaying window' is ~1 second long.
+ *
+ * We want to consider the following equations:
+ *
+ *  avg[0] = load*P
+ *  avg[i+1] = avg[i] + delta*load*P/W - delta*avg[i]/W,  0 <= delta <= W
+ *
+ * where W is the length of the window, P the multiplier for transitioning into
+ * Q-format fixed point arithmetic and load is the instantaneous load of a
+ * runqueue, which basically is the number of runnable vcpus there are on the
+ * runqueue (for the meaning of the other terms, look at the doc comment to
+ *  __update_runq_load()).
+ *
+ *  So, with the default values defined below, we have:
+ *
+ *  W = 2^30
+ *  P = 2^18
+ *
+ * The maximum possible value for the average load, which we want to store in
+ * s_time_t type variables (i.e., we have 63 bits available) is load*P. This
+ * means that, with P 18 bits wide, load can occupy 45 bits. This in turn
+ * means we can have 2^45 vcpus in each runqueue, before overflow occurs!
+ *
+ * However, it can happen that, at step j+1, if:
+ *
+ *  avg[j] = load*P
+ *  delta = W
+ *
+ * then:
+ *
+ *  avg[j+1] = avg[j] + W*load*P/W - W*load*P/W
+ *
+ * So we must be able to deal with W*load*P. This means load can't be higher
+ * than:
+ *
+ *  2^(63 - 30 - 18) = 2^15 = 32768
+ *
+ * So 32768 is the maximum number of vcpus that we can have in a runqueue,
+ * at any given time, and still not have problems with the load tracking
+ * calculations... and this is more than fine.
+ *
+ * If/when, for some reason, this will not be acceptable any longer, we can
+ * act on time granularity, window length or precision (or a combination of
+ * them). For instance, reducing the granularity to microseconds we could
+ * switch to W=2^20 and still have 18 fractional bits and a 1 second long
+ * window (which would mean 2^25 = 33554432 vcpus per runq and no overflow).
+ */
+
+/* If >0, decreases the granularity of time samples used for load tracking. */
+#define LOADAVG_GRANULARITY_SHIFT   (10)
+/* Time window during which we still give value to previous load history. */
+#define LOADAVG_WINDOW_SHIFT        (20)
+/* 18 bits by default (and not less than 4) for decimals. */
+#define LOADAVG_PRECISION_SHIFT     (18)
+#define LOADAVG_PRECISION_SHIFT_MIN (4)
+
+/*
+ * Both the length of the window and the number of fractional bits can be
+ * decided with boot parameters.
+ *
+ * When LOADAVG_GRANULARITY_SHIFT is 0, the length of the window is given in
+ * nanoseconds. The same is true for a granularity of 10 (== microseconds) and
+ * a window of 20 (the default).
  */
-#define LOADAVG_GRANULARITY_SHIFT (10)
-static unsigned int __read_mostly opt_load_window_shift = 18;
-#define  LOADAVG_WINDOW_SHIFT_MIN 4
+static unsigned int __read_mostly opt_load_window_shift = LOADAVG_WINDOW_SHIFT;
 integer_param("credit2_load_window_shift", opt_load_window_shift);
+static unsigned int __read_mostly opt_load_precision_shift = LOADAVG_PRECISION_SHIFT;
+integer_param("credit2_load_precision_shift", opt_load_precision_shift);
+
 static int __read_mostly opt_underload_balance_tolerance = 0;
 integer_param("credit2_balance_under", opt_underload_balance_tolerance);
 static int __read_mostly opt_overload_balance_tolerance = -3;
@@ -279,6 +351,7 @@ struct csched2_private {
     cpumask_t active_queues; /* Queues which may have active cpus */
     struct csched2_runqueue_data rqd[NR_CPUS];
 
+    unsigned int load_precision_shift;
     unsigned int load_window_shift;
 };
 
@@ -387,19 +460,148 @@ __runq_elem(struct list_head *elem)
     return list_entry(elem, struct csched2_vcpu, runq_elem);
 }
 
+/*
+ * Track the runq load by gathering instantaneous load samples, and using
+ * exponentially weighted moving average (EWMA) for the 'decaying'.
+ *
+ * We consider a window of length W=2^(prv->load_window_shift) nsecs.
+ * (The actual calculations may use coarser granularity time sampling,
+ * if LOADAVG_GRANULARITY_SHIFT is not 0.)
+ *
+ * If load is the instantaneous load, the formula for EWMA looks as follows,
+ * for the i-eth sample:
+ *
+ *  avg[i] = a*load + (1 - a)*avg[i-1]
+ *
+ * where avg[i] is the new value of the average load, avg[i-1] is the value
+ * of the average load calculated so far, and a is a coefficient less or
+ * equal to 1.
+ *
+ * So, for us, it becomes:
+ *
+ *  avgload = a*load + (1 - a)*avgload
+ *
+ * For determining a, we consider _when_ we are doing the load update, wrt
+ * the length of the window. We define delta as follows:
+ *
+ *  delta = t - load_last_update
+ *
+ * where t is current time (i.e., time at which we are both sampling and
+ * updating the load average) and load_last_update is the last time we did
+ * that.
+ *
+ * There are two possible situations:
+ *
+ * a) delta <= W
+ *    this means that, during the last window of length W, the runqueue load
+ *    was avgload for (W - delta) time, and load for delta time:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *                |     load_last_update  t
+ *     -------------------------|---------|---
+ *                |             |         |
+ *                \__W - delta__/\_delta__/
+ *                |             |         |
+ *                |___avgload___|__load___|
+ *
+ *    So, what about using delta/W as our smoothing coefficient a? If we do,
+ *    here's what happens:
+ *
+ *     a = delta / W
+ *     1 - a = 1 - (delta / W) = (W - delta) / W
+ *
+ *    Which matches the above description of what happened in the last
+ *    window of length W.
+ *
+ *    Note that this also means that the weight that we assign to both the
+ *    latest load sample, and to previous history, varies at each update.
+ *    The longer the latest load sample has been in effect, within the last
+ *    window, the more it weighs (and the less the previous history
+ *    weighs).
+ *
+ *    This is a sort of extension of plain EWMA, to better fit our use
+ *    case.
+ *
+ * b) delta > W
+ *    this means more than a full window has passed since the last update:
+ *
+ *                |----------- W ---------|
+ *                |                       |
+ *       load_last_update                 t
+ *     ----|------------------------------|---
+ *         |                              |
+ *         \_________________delta________/
+ *
+ *    Basically, it means the last load sample has been in effect for more
+ *    than W time, and hence we should just use it, and forget everything
+ *    before that.
+ *
+ *    This can be seen as a 'reset condition', occurring when, for whatever
+ *    reason, load has not been updated for longer than we expected. (It is
+ *    also how avgload is assigned its first value.)
+ *
+ * The formula for avgload then becomes:
+ *
+ *  avgload = (delta/W)*load + (W - delta)*avgload/W
+ *  avgload = delta*load/W + W*avgload/W - delta*avgload/W
+ *  avgload = avgload + delta*load/W - delta*avgload/W
+ *
+ * So, final form is:
+ *
+ *  avgload_0 = load
+ *  avgload = avgload + delta*load/W - delta*avgload/W,  0<=delta<=W
+ *
+ * As a confirmation, let's look at the extremes. When delta is 0 (i.e.,
+ * what happens if we update the load twice at the same time instant?):
+ *
+ *  avgload = avgload + 0*load/W - 0*avgload/W
+ *  avgload = avgload
+ *
+ * and when delta is W (i.e., what happens if we update at the last
+ * possible instant before the window 'expires'?):
+ *
+ *  avgload = avgload + W*load/W - W*avgload/W
+ *  avgload = avgload + load - avgload
+ *  avgload = load
+ *
+ * Which, in both cases, is what we expect.
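+ *
+ * A small worked example (with illustrative numbers only): take W = 100
+ * (not a power of two, just to keep the arithmetic readable), avgload = 50,
+ * and suppose the instantaneous load has been 100 for delta = 25 time
+ * units:
+ *
+ *  avgload = 50 + 25*100/100 - 25*50/100 = 50 + 25 - 12.5 = 62.5
+ *
+ * i.e., the new sample, with weight a = delta/W = 0.25, pulls the average
+ * a quarter of the way from 50 towards 100, just as EWMA prescribes.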
+ */
 static void
 __update_runq_load(const struct scheduler *ops,
                   struct csched2_runqueue_data *rqd, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
+    s_time_t delta, load = rqd->load;
+    unsigned int P, W;
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( rqd->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    /*
+     * To avoid using fractions, we shift left by load_precision_shift,
+     * and use the last load_precision_shift bits as the fractional part.
+     * Looking back at the formula we want to use, we now have:
+     *
+     *  P = 2^(load_precision_shift)
+     *  P*avgload = P*(avgload + delta*load/W - delta*avgload/W)
+     *  P*avgload = P*avgload + delta*load*P/W - delta*P*avgload/W
+     *
+     * And if we are ok storing and using P*avgload, we can rewrite this as:
+     *
+     *  P*avgload = avgload'
+     *  avgload' = avgload' + delta*P*load/W - delta*avgload'/W
+     *
+     * Coupled with, of course:
+     *
+     *  avgload_0' = P*load
+     */
+
+    if ( rqd->load_last_update + (1ULL << W) < now )
     {
-        rqd->avgload = (unsigned long long)rqd->load << prv->load_window_shift;
-        rqd->b_avgload = (unsigned long long)rqd->load << prv->load_window_shift;
+        rqd->avgload = load << P;
+        rqd->b_avgload = load << P;
     }
     else
     {
@@ -411,26 +613,39 @@ __update_runq_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        rqd->avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->avgload ) ) >> prv->load_window_shift;
-
-        rqd->b_avgload =
-            ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * rqd->b_avgload ) ) >> prv->load_window_shift;
+        /*
+         * Note that, if we were to enforce (or check) some relationship
+         * between P and W, we may save one shift. E.g., if we are sure
+         * that P < W, we could write:
+         *
+         *  (delta * (load << P)) >> W
+         *
+         * as:
+         *
+         *  (delta * load) >> (W - P)
+         */
+        rqd->avgload = rqd->avgload +
+                       ((delta * (load << P)) >> W) -
+                       ((delta * rqd->avgload) >> W);
+        rqd->b_avgload = rqd->b_avgload +
+                         ((delta * (load << P)) >> W) -
+                         ((delta * rqd->b_avgload) >> W);
     }
     rqd->load += change;
     rqd->load_last_update = now;
 
+    ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
+
     {
         struct {
-            unsigned rq_load:4, rq_avgload:28;
-            unsigned rq_id:4, b_avgload:28;
+            uint64_t rq_avgload, b_avgload;
+            unsigned rq_load:16, rq_id:8, shift:8;
         } d;
-        d.rq_id=rqd->id;
+        d.rq_id = rqd->id;
         d.rq_load = rqd->load;
         d.rq_avgload = rqd->avgload;
         d.b_avgload = rqd->b_avgload;
+        d.shift = P;
         trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
                   sizeof(d),
                   (unsigned char *)&d);
@@ -442,8 +657,8 @@ __update_svc_load(const struct scheduler *ops,
                   struct csched2_vcpu *svc, int change, s_time_t now)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
-    s_time_t delta=-1;
-    int vcpu_load;
+    s_time_t delta, vcpu_load;
+    unsigned int P, W;
 
     if ( change == -1 )
         vcpu_load = 1;
@@ -452,11 +667,13 @@ __update_svc_load(const struct scheduler *ops,
     else
         vcpu_load = vcpu_runnable(svc->vcpu);
 
+    W = prv->load_window_shift;
+    P = prv->load_precision_shift;
     now >>= LOADAVG_GRANULARITY_SHIFT;
 
-    if ( svc->load_last_update + (1ULL<<prv->load_window_shift) < now )
+    if ( svc->load_last_update + (1ULL << W) < now )
     {
-        svc->avgload = (unsigned long long)vcpu_load << prv->load_window_shift;
+        svc->avgload = vcpu_load << P;
     }
     else
     {
@@ -468,20 +685,22 @@ __update_svc_load(const struct scheduler *ops,
             delta = 0;
         }
 
-        svc->avgload =
-            ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )
-              + ( ((1ULL<<prv->load_window_shift) - delta) * svc->avgload ) ) >> prv->load_window_shift;
+        svc->avgload = svc->avgload +
+                       ((delta * (vcpu_load << P)) >> W) -
+                       ((delta * svc->avgload) >> W);
     }
     svc->load_last_update = now;
 
     {
         struct {
+            uint64_t v_avgload;
             unsigned vcpu:16, dom:16;
-            unsigned v_avgload:32;
+            unsigned shift;
         } d;
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.v_avgload = svc->avgload;
+        d.shift = P;
         trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
                   sizeof(d),
                   (unsigned char *)&d);
@@ -903,7 +1122,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
         /* Starting load of 50% */
-        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_window_shift - 1);
+        svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
         svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
     }
     else
@@ -1152,7 +1371,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     vcpu_schedule_unlock_irq(lock, vc);
 }
 
-#define MAX_LOAD (1ULL<<60);
+#define MAX_LOAD (STIME_MAX)
 static int
 csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 {
@@ -1447,15 +1666,19 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
         if ( i > cpus_max )
             cpus_max = i;
 
-        /* If we're under 100% capacaty, only shift if load difference
-         * is > 1.  otherwise, shift if under 12.5% */
-        if ( load_max < (1ULL<<(prv->load_window_shift))*cpus_max )
+        /*
+         * If we're under 100% capacity, only shift load if the difference
+         * is > 1; otherwise, shift if it is above 12.5%.
+         */
+        if ( load_max < (cpus_max << prv->load_precision_shift) )
         {
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_underload_balance_tolerance) ) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_underload_balance_tolerance)) )
                  goto out;
         }
         else
-            if ( st.load_delta < (1ULL<<(prv->load_window_shift+opt_overload_balance_tolerance)) )
+            if ( st.load_delta < (1ULL << (prv->load_precision_shift +
+                                           opt_overload_balance_tolerance)) )
                 goto out;
     }
              
@@ -1965,7 +2188,7 @@ csched2_schedule(
 }
 
 static void
-csched2_dump_vcpu(struct csched2_vcpu *svc)
+csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
 {
     printk("[%i.%i] flags=%x cpu=%i",
             svc->vcpu->domain->domain_id,
@@ -1975,6 +2198,9 @@ csched2_dump_vcpu(struct csched2_vcpu *svc)
 
     printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
 
+    printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
+           (svc->avgload * 100) >> prv->load_precision_shift);
+
     printk("\n");
 }
 
@@ -2012,7 +2238,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
     if ( svc )
     {
         printk("\trun: ");
-        csched2_dump_vcpu(svc);
+        csched2_dump_vcpu(prv, svc);
     }
 
     loop = 0;
@@ -2022,7 +2248,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
         if ( svc )
         {
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
         }
     }
 
@@ -2051,8 +2277,8 @@ csched2_dump(const struct scheduler *ops)
     for_each_cpu(i, &prv->active_queues)
     {
         s_time_t fraction;
-        
-        fraction = prv->rqd[i].avgload * 100 / (1ULL<<prv->load_window_shift);
+
+        fraction = (prv->rqd[i].avgload * 100) >> prv->load_precision_shift;
 
         cpulist_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].active);
         printk("Runqueue %d:\n"
@@ -2060,12 +2286,13 @@ csched2_dump(const struct scheduler *ops)
                "\tcpus               = %s\n"
                "\tmax_weight         = %d\n"
                "\tinstload           = %d\n"
-               "\taveload            = %3"PRI_stime"\n",
+               "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
                prv->rqd[i].load,
+               prv->rqd[i].avgload,
                fraction);
 
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].idle);
@@ -2096,7 +2323,7 @@ csched2_dump(const struct scheduler *ops)
             lock = vcpu_schedule_lock(svc->vcpu);
 
             printk("\t%3d: ", ++loop);
-            csched2_dump_vcpu(svc);
+            csched2_dump_vcpu(prv, svc);
 
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
@@ -2357,18 +2584,22 @@ csched2_init(struct scheduler *ops)
            " WARNING: This is experimental software in development.\n" \
            " Use at your own risk.\n");
 
+    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
     printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
 
-    if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
-    if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
+    if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
     {
-        printk("%s: opt_load_window_shift %d below min %d, resetting\n",
-               __func__, opt_load_window_shift, LOADAVG_WINDOW_SHIFT_MIN);
-        opt_load_window_shift = LOADAVG_WINDOW_SHIFT_MIN;
+        printk("WARNING: %s: opt_load_precision_shift %d below min %d, resetting\n",
+               __func__, opt_load_precision_shift, LOADAVG_PRECISION_SHIFT_MIN);
+        opt_load_precision_shift = LOADAVG_PRECISION_SHIFT_MIN;
     }
 
+    printk(XENLOG_INFO "load tracking window length %llu ns\n",
+           (1ULL << opt_load_window_shift) << LOADAVG_GRANULARITY_SHIFT);
+
     /* Basically no CPU information is available at this point; just
      * set up basic structures, and a callback when the CPU info is
      * available. */
@@ -2388,6 +2619,7 @@ csched2_init(struct scheduler *ops)
         prv->rqd[i].id = -1;
     }
 
+    prv->load_precision_shift = opt_load_precision_shift;
     prv->load_window_shift = opt_load_window_shift;
 
     return 0;
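
For reference, the fixed-point arithmetic above can be checked in isolation
with a minimal userspace sketch like the following (names and constants are
illustrative, not the hypervisor's):

    #include <stdint.h>
    #include <stdio.h>

    #define P 18 /* fractional bits, i.e., load_precision_shift */
    #define W 20 /* window length, as a shift, i.e., load_window_shift */

    /* avgload is stored scaled by 2^P; load and delta are plain integers. */
    static uint64_t ewma_update(uint64_t avgload, uint64_t load, uint64_t delta)
    {
        if ( delta > (1ULL << W) )     /* 'reset condition': window expired */
            return load << P;
        return avgload + ((delta * (load << P)) >> W)
                       - ((delta * avgload) >> W);
    }

    int main(void)
    {
        uint64_t avg = 2ULL << P;      /* start from an average load of 2 */

        /* Load drops to 0 for a full window: the average must decay to 0. */
        avg = ewma_update(avg, 0, 1ULL << W);
        printf("%f\n", (double)avg / (1ULL << P)); /* prints 0.000000 */
        return 0;
    }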



* [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (9 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 10/19] xen: credit2: rework load tracking logic Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-06-21  9:27   ` Wei Liu
  2016-06-17 23:12 ` [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Anshul Makkar, Ian Jackson, George Dunlap

in both xenalyze and formats (for xentrace_format).

In particular, in xenalyze, now that we have the precision
of the fixed-point load values in the tracepoint, show both
the raw value and the (easier to interpret) percentage.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 tools/xentrace/formats    |    4 ++--
 tools/xentrace/xenalyze.c |   25 ++++++++++++++++++-------
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index d204351..2e58d03 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -53,8 +53,8 @@
 0x00022208  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:sched_tasklet
 0x00022209  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:update_load
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
-0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(1)08x, avgload = %(2)d ]
-0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[4]:rq_avgload[28] = 0x%(1)08x, rq_id[4]:b_avgload[28] = 0x%(2)08x ]
+0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(3)08x, vcpuload = 0x%(2)08x%(1)08x, wshift = %(4)d ]
+0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x%(5)08x, rq_avgload = 0x%(2)08x%(1)08x, b_avgload = 0x%(4)08x%(3)08x ]
 
 0x00022801  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:tickle        [ cpu = %(1)d ]
 0x00022802  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:runq_pick     [ dom:vcpu = 0x%(1)08x, cur_deadline = 0x%(3)08x%(2)08x, cur_budget = 0x%(5)08x%(4)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 01ead8b..f2f97bd 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7802,25 +7802,36 @@ void sched_process(struct pcpu_info *p)
         case TRC_SCHED_CLASS_EVT(CSCHED2, 11): /* UPDATE_VCPU_LOAD */
             if(opt.dump_all) {
                 struct {
+                    uint64_t vcpuload;
                     unsigned int vcpuid:16, domid:16;
-                    unsigned int avgload;
+                    unsigned int shift;
                 } *r = (typeof(r))ri->d;
+                double vcpuload;
 
-                printf(" %s csched2:update_vcpu_load d%uv%u, avg_load = %u\n",
-                       ri->dump_header, r->domid, r->vcpuid, r->avgload);
+                vcpuload = (r->vcpuload * 100.0) / (1ULL << r->shift);
+
+                printf(" %s csched2:update_vcpu_load d%uv%u, "
+                       "vcpu_load = %4.3f%% (%"PRIu64")\n",
+                       ri->dump_header, r->domid, r->vcpuid, vcpuload,
+                       r->vcpuload);
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 12): /* UPDATE_RUNQ_LOAD */
             if(opt.dump_all) {
                 struct {
-                    unsigned int rq_load:4, rq_avgload:28;
-                    unsigned int rq_id:4, b_avgload:28;
+                    uint64_t rq_avgload, b_avgload;
+                    unsigned int rq_load:16, rq_id:8, shift:8;
                 } *r = (typeof(r))ri->d;
+                double avgload, b_avgload;
+
+                avgload = (r->rq_avgload * 100.0) / (1ULL << r->shift);
+                b_avgload = (r->b_avgload * 100.0) / (1ULL << r->shift);
 
                 printf(" %s csched2:update_rq_load rq# %u, load = %u, "
-                       "avgload = %u, b_avgload = %u\n",
+                       "avgload = %4.3f%% (%"PRIu64"), "
+                       "b_avgload = %4.3f%% (%"PRIu64")\n",
                        ri->dump_header, r->rq_id, r->rq_load,
-                       r->rq_avgload, r->b_avgload);
+                       avgload, r->rq_avgload, b_avgload, r->b_avgload);
             }
             break;
         /* RTDS (TRC_RTDS_xxx) */
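
As a side note: the formats entries print the 64-bit loads as
`0x%(2)08x%(1)08x' because a trace record's payload is a sequence of
32-bit words, so a 64-bit field spans two of them, least significant
word first (on little endian x86). A decoder can recombine them along
these lines (a sketch with an illustrative helper, not the actual
xenalyze code):

    #include <stdint.h>

    /* Rebuild a 64-bit field from two consecutive 32-bit trace words. */
    static uint64_t trace_u64(const uint32_t *w, unsigned int idx)
    {
        return ((uint64_t)w[idx + 1] << 32) | w[idx];
    }

    /* E.g., for UPDATE_RUNQ_LOAD above: rq_avgload = trace_u64(w, 0),
     * b_avgload = trace_u64(w, 2), and w[4] packs rq_load/rq_id/shift. */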



* [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (10 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-07-07  9:45   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

as all the accesses to both the masks and the flags are
serialized by the runqueue locks already.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   48 ++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 230a512..2ca63ae 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -909,7 +909,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                   sizeof(d),
                   (unsigned char *)&d);
     }
-    cpumask_set_cpu(ipid, &rqd->tickled);
+    __cpumask_set_cpu(ipid, &rqd->tickled);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1277,7 +1277,7 @@ csched2_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         __runq_remove(svc);
     }
     else if ( svc->flags & CSFLAG_delayed_runq_add )
-        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        __clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
 }
 
 static void
@@ -1314,7 +1314,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
      * after the context has been saved. */
     if ( unlikely(svc->flags & CSFLAG_scheduled) )
     {
-        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
         goto out;
     }
 
@@ -1347,7 +1347,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor));
 
     /* This vcpu is now eligible to be put on the runqueue again */
-    clear_bit(__CSFLAG_scheduled, &svc->flags);
+    __clear_bit(__CSFLAG_scheduled, &svc->flags);
 
     /* If someone wants it on the runqueue, put it there. */
     /*
@@ -1357,7 +1357,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
      * it seems a bit pointless; especially as we have plenty of
      * bits free.
      */
-    if ( test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
+    if ( __test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
          && likely(vcpu_runnable(vc)) )
     {
         BUG_ON(__vcpu_on_runq(svc));
@@ -1399,10 +1399,10 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !spin_trylock(&prv->lock) )
     {
-        if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
+        if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
         {
             d2printk("%pv -\n", svc->vcpu);
-            clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
+            __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
 
         return get_fallback_cpu(svc);
@@ -1410,7 +1410,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* First check to see if we're here because someone else suggested a place
      * for us to move. */
-    if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
+    if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
         {
@@ -1545,8 +1545,8 @@ static void migrate(const struct scheduler *ops,
         d2printk("%pv %d-%d a\n", svc->vcpu, svc->rqd->id, trqd->id);
         /* It's running; mark it to migrate. */
         svc->migrate_rqd = trqd;
-        set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
-        set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
+        __set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
+        __set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         SCHED_STAT_CRANK(migrate_requested);
     }
     else
@@ -2079,7 +2079,7 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
-        cpumask_clear_cpu(cpu, &rqd->tickled);
+        __cpumask_clear_cpu(cpu, &rqd->tickled);
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2115,7 +2115,7 @@ csched2_schedule(
     if ( snext != scurr
          && !is_idle_vcpu(scurr->vcpu)
          && vcpu_runnable(current) )
-        set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
+        __set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
 
     ret.migrated = 0;
 
@@ -2134,7 +2134,7 @@ csched2_schedule(
                        cpu, snext->vcpu, snext->vcpu->processor, scurr->vcpu);
                 BUG();
             }
-            set_bit(__CSFLAG_scheduled, &snext->flags);
+            __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
         /* Check for the reset condition */
@@ -2146,7 +2146,7 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
-            cpumask_clear_cpu(cpu, &rqd->idle);
+            __cpumask_clear_cpu(cpu, &rqd->idle);
 
         snext->start_time = now;
 
@@ -2168,10 +2168,10 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
-                cpumask_clear_cpu(cpu, &rqd->idle);
+                __cpumask_clear_cpu(cpu, &rqd->idle);
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
-            cpumask_set_cpu(cpu, &rqd->idle);
+            __cpumask_set_cpu(cpu, &rqd->idle);
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2347,7 +2347,7 @@ static void activate_runqueue(struct csched2_private *prv, int rqi)
     INIT_LIST_HEAD(&rqd->runq);
     spin_lock_init(&rqd->lock);
 
-    cpumask_set_cpu(rqi, &prv->active_queues);
+    __cpumask_set_cpu(rqi, &prv->active_queues);
 }
 
 static void deactivate_runqueue(struct csched2_private *prv, int rqi)
@@ -2360,7 +2360,7 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     
     rqd->id = -1;
 
-    cpumask_clear_cpu(rqi, &prv->active_queues);
+    __cpumask_clear_cpu(rqi, &prv->active_queues);
 }
 
 static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
@@ -2449,9 +2449,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     /* Set the runqueue map */
     prv->runq_map[cpu] = rqi;
     
-    cpumask_set_cpu(cpu, &rqd->idle);
-    cpumask_set_cpu(cpu, &rqd->active);
-    cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->idle);
+    __cpumask_set_cpu(cpu, &rqd->active);
+    __cpumask_set_cpu(cpu, &prv->initialized);
 
     return rqi;
 }
@@ -2556,8 +2556,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
 
-    cpumask_clear_cpu(cpu, &rqd->idle);
-    cpumask_clear_cpu(cpu, &rqd->active);
+    __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
     {
@@ -2567,7 +2567,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_unlock(&rqd->lock);
 
-    cpumask_clear_cpu(cpu, &prv->initialized);
+    __cpumask_clear_cpu(cpu, &prv->initialized);
 
     spin_unlock_irqrestore(&prv->lock, flags);
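
The pattern this relies on, in a minimal generic-C sketch (pthreads stand
in for the runqueue lock, and the helper is illustrative, not Xen's
actual primitive):

    #include <pthread.h>

    static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long tickled;    /* stands in for rqd->tickled */

    /* Non-atomic set, analogous to __cpumask_set_cpu() / __set_bit():
     * a plain read-modify-write, safe only while rq_lock is held. */
    static void nonatomic_set_bit(int nr, unsigned long *addr)
    {
        *addr |= 1UL << nr;
    }

    static void tickle_cpu(int cpu)
    {
        pthread_mutex_lock(&rq_lock);
        nonatomic_set_bit(cpu, &tickled);
        pthread_mutex_unlock(&rq_lock);
    }

The atomic variants only buy something when writers can race on the same
word; with every access serialized by the runqueue lock, they are pure
overhead (on x86, a LOCK-prefixed read-modify-write).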
 



* [PATCH 13/19] xen: credit2: make the code less experimental
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (11 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-06-20  8:13   ` Jan Beulich
  2016-07-07 15:17   ` George Dunlap
  2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

Mainly, almost all of the BUG_ON()-s can be converted into
ASSERT()-s, and the debug printk()-s either removed or turned
into tracing.

The 'TODO' list, in a comment at the beginning of the file,
was also stale, so remove items that were still there but
are actually done.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |  244 +++++++++++++++++++++++---------------------
 1 file changed, 126 insertions(+), 118 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 2ca63ae..ba3a78a 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -27,9 +27,6 @@
 #include <xen/cpu.h>
 #include <xen/keyhandler.h>
 
-#define d2printk(x...)
-//#define d2printk printk
-
 /*
  * Credit2 tracing events ("only" 512 available!). Check
  * include/public/trace.h for more details.
@@ -46,16 +43,16 @@
 #define TRC_CSCHED2_RUNQ_ASSIGN      TRC_SCHED_CLASS_EVT(CSCHED2, 10)
 #define TRC_CSCHED2_UPDATE_VCPU_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 11)
 #define TRC_CSCHED2_UPDATE_RUNQ_LOAD TRC_SCHED_CLASS_EVT(CSCHED2, 12)
+#define TRC_CSCHED2_TICKLE_NEW       TRC_SCHED_CLASS_EVT(CSCHED2, 13)
+#define TRC_CSCHED2_RUNQ_MAX_WEIGHT  TRC_SCHED_CLASS_EVT(CSCHED2, 14)
+#define TRC_CSCHED2_MIGRATE          TRC_SCHED_CLASS_EVT(CSCHED2, 15)
 
 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the
  * credit2 wiki page:
  *  http://wiki.xen.org/wiki/Credit2_Scheduler_Development
+ *
  * TODO:
- * + Multiple sockets
- *  - Simple load balancer / runqueue assignment
- *  - Runqueue load measurement
- *  - Load-based load balancer
  * + Hyperthreading
  *  - Look for non-busy core if possible
  *  - "Discount" time run on a thread with busy siblings
@@ -608,8 +605,8 @@ __update_runq_load(const struct scheduler *ops,
         delta = now - rqd->load_last_update;
         if ( unlikely(delta < 0) )
         {
-            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
-                     __func__, now, rqd->load_last_update);
+            printk("WARNING: %s: Time went backwards? now %"PRI_stime" load_last_update %"PRI_stime"\n",
+                   __func__, now, rqd->load_last_update);
             delta = 0;
         }
 
@@ -680,8 +677,8 @@ __update_svc_load(const struct scheduler *ops,
         delta = now - svc->load_last_update;
         if ( unlikely(delta < 0) )
         {
-            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
-                     __func__, now, svc->load_last_update);
+            printk("WARNING: %s: Time went backwards? now %"PRI_stime" load_last_update %"PRI_stime"\n",
+                   __func__, now, svc->load_last_update);
             delta = 0;
         }
 
@@ -723,23 +720,18 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu *svc)
     struct list_head *iter;
     int pos = 0;
 
-    d2printk("rqi %pv\n", svc->vcpu);
-
-    BUG_ON(&svc->rqd->runq != runq);
-    /* Idle vcpus not allowed on the runqueue anymore */
-    BUG_ON(is_idle_vcpu(svc->vcpu));
-    BUG_ON(svc->vcpu->is_running);
-    BUG_ON(svc->flags & CSFLAG_scheduled);
+    ASSERT(&svc->rqd->runq == runq);
+    ASSERT(!is_idle_vcpu(svc->vcpu));
+    ASSERT(!svc->vcpu->is_running);
+    ASSERT(!(svc->flags & CSFLAG_scheduled));
 
     list_for_each( iter, runq )
     {
         struct csched2_vcpu * iter_svc = __runq_elem(iter);
 
         if ( svc->credit > iter_svc->credit )
-        {
-            d2printk(" p%d %pv\n", pos, iter_svc->vcpu);
             break;
-        }
+
         pos++;
     }
 
@@ -755,10 +747,10 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
     struct list_head * runq = &RQD(ops, cpu)->runq;
     int pos = 0;
 
-    ASSERT( spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock) );
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
-    BUG_ON( __vcpu_on_runq(svc) );
-    BUG_ON( c2r(ops, cpu) != c2r(ops, svc->vcpu->processor) );
+    ASSERT(!__vcpu_on_runq(svc));
+    ASSERT(c2r(ops, cpu) == c2r(ops, svc->vcpu->processor));
 
     pos = __runq_insert(runq, svc);
 
@@ -781,7 +773,7 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
 static inline void
 __runq_remove(struct csched2_vcpu *svc)
 {
-    BUG_ON( !__vcpu_on_runq(svc) );
+    ASSERT(__vcpu_on_runq(svc));
     list_del_init(&svc->runq_elem);
 }
 
@@ -806,16 +798,29 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_ti
 static void
 runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 {
-    int i, ipid=-1;
-    s_time_t lowest=(1<<30);
+    int i, ipid = -1;
+    s_time_t lowest = (1<<30);
     unsigned int cpu = new->vcpu->processor;
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     cpumask_t mask;
     struct csched2_vcpu * cur;
 
-    d2printk("rqt %pv curr %pv\n", new->vcpu, current);
+    ASSERT(new->rqd == rqd);
 
-    BUG_ON(new->rqd != rqd);
+    /* TRACE */
+    {
+        struct {
+            unsigned vcpu:16, dom:16;
+            unsigned processor, credit;
+        } d;
+        d.dom = new->vcpu->domain->domain_id;
+        d.vcpu = new->vcpu->vcpu_id;
+        d.processor = new->vcpu->processor;
+        d.credit = new->credit;
+        trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
 
     /*
      * Get a mask of idle, but not tickled, processors that new is
@@ -861,7 +866,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
         cur = CSCHED2_VCPU(curr_on_cpu(i));
 
-        BUG_ON(is_idle_vcpu(cur->vcpu));
+        ASSERT(!is_idle_vcpu(cur->vcpu));
 
         /* Update credits for current to see if we want to preempt. */
         burn_credits(rqd, cur, now);
@@ -951,8 +956,8 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
 
         svc = list_entry(iter, struct csched2_vcpu, rqd_elem);
 
-        BUG_ON( is_idle_vcpu(svc->vcpu) );
-        BUG_ON( svc->rqd != rqd );
+        ASSERT(!is_idle_vcpu(svc->vcpu));
+        ASSERT(svc->rqd == rqd);
 
         start_credit = svc->credit;
 
@@ -996,12 +1001,11 @@ void burn_credits(struct csched2_runqueue_data *rqd,
 {
     s_time_t delta;
 
-    /* Assert svc is current */
-    ASSERT(svc==CSCHED2_VCPU(curr_on_cpu(svc->vcpu->processor)));
+    ASSERT(svc == CSCHED2_VCPU(curr_on_cpu(svc->vcpu->processor)));
 
     if ( unlikely(is_idle_vcpu(svc->vcpu)) )
     {
-        BUG_ON(svc->credit != CSCHED2_IDLE_CREDIT);
+        ASSERT(svc->credit == CSCHED2_IDLE_CREDIT);
         return;
     }
 
@@ -1012,12 +1016,10 @@ void burn_credits(struct csched2_runqueue_data *rqd,
         SCHED_STAT_CRANK(burn_credits_t2c);
         t2c_update(rqd, delta, svc);
         svc->start_time = now;
-
-        d2printk("b %pv c%d\n", svc->vcpu, svc->credit);
     }
     else if ( delta < 0 )
     {
-        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
+        printk("WARNING: %s: Time went backwards? now %"PRI_stime" start_time %"PRI_stime"\n",
                __func__, now, svc->start_time);
     }
 
@@ -1051,7 +1053,6 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
     if ( new_weight > rqd->max_weight )
     {
         rqd->max_weight = new_weight;
-        d2printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
         SCHED_STAT_CRANK(upd_max_weight_quick);
     }
     else if ( old_weight == rqd->max_weight )
@@ -1068,9 +1069,20 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
         }
 
         rqd->max_weight = max_weight;
-        d2printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
         SCHED_STAT_CRANK(upd_max_weight_full);
     }
+
+    /* TRACE */
+    {
+        struct {
+            unsigned rqi:16, max_weight:16;
+        } d;
+        d.rqi = rqd->id;
+        d.max_weight = rqd->max_weight;
+        trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
 }
 
 #ifndef NDEBUG
@@ -1117,8 +1129,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
 
     if ( ! is_idle_vcpu(vc) )
     {
-        BUG_ON( svc->sdom == NULL );
-
+        ASSERT(svc->sdom != NULL);
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
         /* Starting load of 50% */
@@ -1127,7 +1138,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
     }
     else
     {
-        BUG_ON( svc->sdom != NULL );
+        ASSERT(svc->sdom == NULL);
         svc->credit = CSCHED2_IDLE_CREDIT;
         svc->weight = 0;
     }
@@ -1171,7 +1182,7 @@ runq_assign(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu *svc = vc->sched_priv;
 
-    BUG_ON(svc->rqd != NULL);
+    ASSERT(svc->rqd == NULL);
 
     __runq_assign(svc, RQD(ops, vc->processor));
 }
@@ -1179,8 +1190,8 @@ runq_assign(const struct scheduler *ops, struct vcpu *vc)
 static void
 __runq_deassign(struct csched2_vcpu *svc)
 {
-    BUG_ON(__vcpu_on_runq(svc));
-    BUG_ON(svc->flags & CSFLAG_scheduled);
+    ASSERT(!__vcpu_on_runq(svc));
+    ASSERT(!(svc->flags & CSFLAG_scheduled));
 
     list_del_init(&svc->rqd_elem);
     update_max_weight(svc->rqd, 0, svc->weight);
@@ -1196,7 +1207,7 @@ runq_deassign(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu *svc = vc->sched_priv;
 
-    BUG_ON(svc->rqd != RQD(ops, vc->processor));
+    ASSERT(svc->rqd == RQD(ops, vc->processor));
 
     __runq_deassign(svc);
 }
@@ -1208,9 +1219,8 @@ csched2_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     struct csched2_dom * const sdom = svc->sdom;
     spinlock_t *lock;
 
-    printk("%s: Inserting %pv\n", __func__, vc);
-
-    BUG_ON(is_idle_vcpu(vc));
+    ASSERT(!is_idle_vcpu(vc));
+    ASSERT(list_empty(&svc->runq_elem));
 
     /* Add vcpu to runqueue of initial processor */
     lock = vcpu_schedule_lock_irq(vc);
@@ -1238,26 +1248,21 @@ static void
 csched2_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
-    struct csched2_dom * const sdom = svc->sdom;
-
-    BUG_ON( sdom == NULL );
-    BUG_ON( !list_empty(&svc->runq_elem) );
+    spinlock_t *lock;
 
-    if ( ! is_idle_vcpu(vc) )
-    {
-        spinlock_t *lock;
+    ASSERT(!is_idle_vcpu(vc));
+    ASSERT(list_empty(&svc->runq_elem));
 
-        SCHED_STAT_CRANK(vcpu_remove);
+    SCHED_STAT_CRANK(vcpu_remove);
 
-        /* Remove from runqueue */
-        lock = vcpu_schedule_lock_irq(vc);
+    /* Remove from runqueue */
+    lock = vcpu_schedule_lock_irq(vc);
 
-        runq_deassign(ops, vc);
+    runq_deassign(ops, vc);
 
-        vcpu_schedule_unlock_irq(lock, vc);
+    vcpu_schedule_unlock_irq(lock, vc);
 
-        svc->sdom->nr_vcpus--;
-    }
+    svc->sdom->nr_vcpus--;
 }
 
 static void
@@ -1265,14 +1270,14 @@ csched2_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
 
-    BUG_ON( is_idle_vcpu(vc) );
+    ASSERT(!is_idle_vcpu(vc));
     SCHED_STAT_CRANK(vcpu_sleep);
 
     if ( curr_on_cpu(vc->processor) == vc )
         cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
     else if ( __vcpu_on_runq(svc) )
     {
-        BUG_ON(svc->rqd != RQD(ops, vc->processor));
+        ASSERT(svc->rqd == RQD(ops, vc->processor));
         update_load(ops, svc->rqd, svc, -1, NOW());
         __runq_remove(svc);
     }
@@ -1288,9 +1293,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 
     /* Schedule lock should be held at this point. */
 
-    d2printk("w %pv\n", vc);
-
-    BUG_ON( is_idle_vcpu(vc) );
+    ASSERT(!is_idle_vcpu(vc));
 
     if ( unlikely(curr_on_cpu(vc->processor) == vc) )
     {
@@ -1322,7 +1325,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     if ( svc->rqd == NULL )
         runq_assign(ops, vc);
     else
-        BUG_ON(RQD(ops, vc->processor) != svc->rqd );
+        ASSERT(RQD(ops, vc->processor) == svc->rqd);
 
     now = NOW();
 
@@ -1333,7 +1336,6 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     runq_tickle(ops, svc, now);
 
 out:
-    d2printk("w-\n");
     return;
 }
 
@@ -1345,6 +1347,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     s_time_t now = NOW();
 
-    BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor));
+    ASSERT(is_idle_vcpu(vc) || svc->rqd == RQD(ops, vc->processor));
 
     /* This vcpu is now eligible to be put on the runqueue again */
     __clear_bit(__CSFLAG_scheduled, &svc->flags);
@@ -1360,7 +1363,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
     if ( __test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
          && likely(vcpu_runnable(vc)) )
     {
-        BUG_ON(__vcpu_on_runq(svc));
+        ASSERT(!__vcpu_on_runq(svc));
 
         runq_insert(ops, svc);
         runq_tickle(ops, svc, now);
@@ -1380,7 +1383,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
     s_time_t min_avgload;
 
-    BUG_ON(cpumask_empty(&prv->active_queues));
+    ASSERT(!cpumask_empty(&prv->active_queues));
 
     /* Locking:
      * - vc->processor is already locked
@@ -1399,12 +1402,8 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !spin_trylock(&prv->lock) )
     {
-        if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
-        {
-            d2printk("%pv -\n", svc->vcpu);
-            __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
-        }
-
+        /* We may be here because someone requested that we migrate. */
+        __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         return get_fallback_cpu(svc);
     }
 
@@ -1414,7 +1413,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
         {
-            printk("%s: Runqueue migrate aborted because target runqueue disappeared!\n",
+            printk(XENLOG_WARNING "%s: target runqueue disappeared!\n",
                    __func__);
         }
         else
@@ -1423,10 +1422,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
                         &svc->migrate_rqd->active);
             new_cpu = cpumask_any(cpumask_scratch);
             if ( new_cpu < nr_cpu_ids )
-            {
-                d2printk("%pv +\n", svc->vcpu);
                 goto out_up;
-            }
         }
         /* Fall-through to normal cpu pick */
     }
@@ -1540,9 +1536,26 @@ static void migrate(const struct scheduler *ops,
                     struct csched2_runqueue_data *trqd, 
                     s_time_t now)
 {
-    if ( svc->flags & CSFLAG_scheduled )
+    bool_t running = svc->flags & CSFLAG_scheduled;
+    bool_t on_runq = __vcpu_on_runq(svc);
+
+    /* TRACE */
+    {
+        struct {
+            unsigned vcpu:16, dom:16;
+            unsigned rqi:16, trqi:16;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.rqi = svc->rqd->id;
+        d.trqi = trqd->id;
+        trace_var(TRC_CSCHED2_MIGRATE, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    if ( running )
     {
-        d2printk("%pv %d-%d a\n", svc->vcpu, svc->rqd->id, trqd->id);
         /* It's running; mark it to migrate. */
         svc->migrate_rqd = trqd;
         __set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
@@ -1551,21 +1564,19 @@ static void migrate(const struct scheduler *ops,
     }
     else
     {
-        int on_runq=0;
         /* It's not running; just move it */
-        d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
-        if ( __vcpu_on_runq(svc) )
+        if ( on_runq )
         {
             __runq_remove(svc);
             update_load(ops, svc->rqd, NULL, -1, now);
-            on_runq=1;
         }
         __runq_deassign(svc);
 
         cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
                     &trqd->active);
         svc->vcpu->processor = cpumask_any(cpumask_scratch);
-        BUG_ON(svc->vcpu->processor >= nr_cpu_ids);
+        ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         __runq_assign(svc, trqd);
         if ( on_runq )
@@ -1760,7 +1771,7 @@ csched2_vcpu_migrate(
     struct csched2_runqueue_data *trqd;
 
     /* Check if new_cpu is valid */
-    BUG_ON(!cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
+    ASSERT(cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
     ASSERT(cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));
 
     trqd = RQD(ops, new_cpu);
@@ -1820,7 +1831,7 @@ csched2_dom_cntl(
                  * been disabled. */
                 spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
 
-                BUG_ON(svc->rqd != RQD(ops, svc->vcpu->processor));
+                ASSERT(svc->rqd == RQD(ops, svc->vcpu->processor));
 
                 svc->weight = sdom->weight;
                 update_max_weight(svc->rqd, svc->weight, old_weight);
@@ -1869,8 +1880,6 @@ csched2_dom_init(const struct scheduler *ops, struct domain *dom)
 {
     struct csched2_dom *sdom;
 
-    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
-
     if ( is_idle_domain(dom) )
         return 0;
 
@@ -1901,7 +1910,7 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
 static void
 csched2_dom_destroy(const struct scheduler *ops, struct domain *dom)
 {
-    BUG_ON(CSCHED2_DOM(dom)->nr_vcpus > 0);
+    ASSERT(CSCHED2_DOM(dom)->nr_vcpus == 0);
 
     csched2_free_domdata(ops, CSCHED2_DOM(dom));
 }
@@ -2042,8 +2051,6 @@ csched2_schedule(
     SCHED_STAT_CRANK(schedule);
     CSCHED2_VCPU_CHECK(current);
 
-    d2printk("sc p%d c %pv now %"PRI_stime"\n", cpu, scurr->vcpu, now);
-
     BUG_ON(!cpumask_test_cpu(cpu, &CSCHED2_PRIV(ops)->initialized));
 
     rqd = RQD(ops, cpu);
@@ -2051,7 +2058,7 @@ csched2_schedule(
 
     /* Protected by runqueue lock */        
 
-    /* DEBUG */
+#ifndef NDEBUG
     if ( !is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd)
     {
         int other_rqi = -1, this_rqi = c2r(ops, cpu);
@@ -2069,12 +2076,13 @@ csched2_schedule(
                 }
             }
         }
-        printk("%s: pcpu %d rq %d, but scurr %pv assigned to "
+        printk("DEBUG: %s: pcpu %d rq %d, but scurr %pv assigned to "
                "pcpu %d rq %d!\n",
                __func__,
                cpu, this_rqi,
                scurr->vcpu, scurr->vcpu->processor, other_rqi);
     }
+#endif
     BUG_ON(!is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd);
 
     /* Clear "tickled" bit now that we've been scheduled */
@@ -2125,15 +2133,10 @@ csched2_schedule(
         /* If switching, remove this from the runqueue and mark it scheduled */
         if ( snext != scurr )
         {
-            BUG_ON(snext->rqd != rqd);
-    
+            ASSERT(snext->rqd == rqd);
+            ASSERT(!snext->vcpu->is_running);
+
             __runq_remove(snext);
-            if ( snext->vcpu->is_running )
-            {
-                printk("p%d: snext %pv running on p%d! scurr %pv\n",
-                       cpu, snext->vcpu, snext->vcpu->processor, scurr->vcpu);
-                BUG();
-            }
             __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
@@ -2439,10 +2442,10 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
 
     rqd = prv->rqd + rqi;
 
-    printk("Adding cpu %d to runqueue %d\n", cpu, rqi);
+    printk(XENLOG_INFO "Adding cpu %d to runqueue %d\n", cpu, rqi);
     if ( ! cpumask_test_cpu(rqi, &prv->active_queues) )
     {
-        printk(" First cpu on runqueue, activating\n");
+        printk(XENLOG_INFO " First cpu on runqueue, activating\n");
         activate_runqueue(prv, rqi);
     }
     
@@ -2554,14 +2557,14 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     /* No need to save IRQs here, they're already disabled */
     spin_lock(&rqd->lock);
 
-    printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
+    printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )
     {
-        printk(" No cpus left on runqueue, disabling\n");
+        printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
 
@@ -2580,15 +2583,20 @@ csched2_init(struct scheduler *ops)
     int i;
     struct csched2_private *prv;
 
-    printk("Initializing Credit2 scheduler\n" \
-           " WARNING: This is experimental software in development.\n" \
+    printk("Initializing Credit2 scheduler\n");
+    printk(" WARNING: This is experimental software in development.\n" \
            " Use at your own risk.\n");
 
-    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
-    printk(" load_window_shift: %d\n", opt_load_window_shift);
-    printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
-    printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
-    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
+    printk(XENLOG_INFO " load_precision_shift: %d\n"
+           " load_window_shift: %d\n"
+           " underload_balance_tolerance: %d\n"
+           " overload_balance_tolerance: %d\n"
+           " runqueues arrangement: %s\n",
+           opt_load_precision_shift,
+           opt_load_window_shift,
+           opt_underload_balance_tolerance,
+           opt_overload_balance_tolerance,
+           opt_runqueue_str[opt_runqueue]);
 
     if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
     {
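
The reasoning behind the conversion, schematically (these are illustrative
approximations of the two checks' semantics, not the actual Xen macros):

    #include <stdio.h>
    #include <stdlib.h>

    static void panic(const char *msg)
    {
        fprintf(stderr, "%s\n", msg);
        abort();
    }

    /* Always compiled in: evaluated in release builds too. */
    #define MY_BUG_ON(cond)  do { if ( cond ) panic("bug"); } while ( 0 )

    /* Compiled out when NDEBUG is defined: free in release builds. */
    #ifndef NDEBUG
    #define MY_ASSERT(cond)  do { if ( !(cond) ) panic("assert"); } while ( 0 )
    #else
    #define MY_ASSERT(cond)  ((void)0)
    #endif

Invariants that only need checking during development can thus become
ASSERT()-s, leaving BUG_ON() for conditions that must be caught in
production builds too.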



* [PATCH 14/19] xen: credit2: add yet some more tracing
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (12 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
@ 2016-06-17 23:12 ` Dario Faggioli
  2016-06-20  8:15   ` Jan Beulich
  2016-07-07 15:34   ` George Dunlap
  2016-06-17 23:13 ` [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:12 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

(and fix the style of two labels as well.)

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |   58 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index ba3a78a..e9f3f13 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -46,6 +46,9 @@
 #define TRC_CSCHED2_TICKLE_NEW       TRC_SCHED_CLASS_EVT(CSCHED2, 13)
 #define TRC_CSCHED2_RUNQ_MAX_WEIGHT  TRC_SCHED_CLASS_EVT(CSCHED2, 14)
 #define TRC_CSCHED2_MIGRATE          TRC_SCHED_CLASS_EVT(CSCHED2, 15)
+#define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
+#define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
+#define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
 
 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the
@@ -709,6 +712,8 @@ update_load(const struct scheduler *ops,
             struct csched2_runqueue_data *rqd,
             struct csched2_vcpu *svc, int change, s_time_t now)
 {
+    trace_var(TRC_CSCHED2_UPDATE_LOAD, 1, 0, NULL);
+
     __update_runq_load(ops, rqd, change, now);
     if ( svc )
         __update_svc_load(ops, svc, change, now);
@@ -1484,6 +1489,23 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 out_up:
     spin_unlock(&prv->lock);
 
+    /* TRACE */
+    {
+        struct {
+            uint64_t b_avgload;
+            unsigned vcpu:16, dom:16;
+            unsigned rq_id:16, new_cpu:16;
+        } d;
+        d.b_avgload = prv->rqd[min_rqi].b_avgload;
+        d.dom = vc->domain->domain_id;
+        d.vcpu = vc->vcpu_id;
+        d.rq_id = c2r(ops, new_cpu);
+        d.new_cpu = new_cpu;
+        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
     return new_cpu;
 }
 
@@ -1611,7 +1633,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     bool_t inner_load_updated = 0;
 
     balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
-    
+
     /*
      * Basic algorithm: Push, pull, or swap.
      * - Find the runqueue with the furthest load distance
@@ -1677,6 +1699,20 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
         if ( i > cpus_max )
             cpus_max = i;
 
+        /* TRACE */
+        {
+            struct {
+                unsigned lrq_id:16, orq_id:16;
+                unsigned load_delta;
+            } d;
+            d.lrq_id = st.lrqd->id;
+            d.orq_id = st.orqd->id;
+            d.load_delta = st.load_delta;
+            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
+                      sizeof(d),
+                      (unsigned char *)&d);
+        }
+
         /*
          * If we're under 100% capacity, only shift load if the difference
          * is > 1; otherwise, shift if it is above 12.5%.
@@ -1705,6 +1741,21 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     if ( unlikely(st.orqd->id < 0) )
         goto out_up;
 
+    /* TRACE */
+    {
+        struct {
+            uint64_t lb_avgload, ob_avgload;
+            unsigned lrq_id:16, orq_id:16;
+        } d;
+        d.lrq_id = st.lrqd->id;
+        d.lb_avgload = st.lrqd->b_avgload;
+        d.orq_id = st.orqd->id;
+        d.ob_avgload = st.orqd->b_avgload;
+        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
     now = NOW();
 
     /* Look for "swap" which gives the best load average
@@ -1756,10 +1807,9 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     if ( st.best_pull_svc )
         migrate(ops, st.best_pull_svc, st.lrqd, now);
 
-out_up:
+ out_up:
     spin_unlock(&st.orqd->lock);
-
-out:
+ out:
     return;
 }
 



* [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (13 preceding siblings ...)
  2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
@ 2016-06-17 23:13 ` Dario Faggioli
  2016-07-07 15:37   ` George Dunlap
  2016-06-17 23:13 ` [PATCH 16/19] tools: tracing: deal with new Credit2 events Dario Faggioli
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |  114 +++++++++++++++++++++++---------------------
 1 file changed, 59 insertions(+), 55 deletions(-)
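
The idea of the patch, schematically: the trace record structs below are
currently filled in unconditionally, even when no tracer is attached;
checking the tracing-enabled flag first, and then calling the unchecked
__trace_var(), keeps that marshalling work off the hot paths. A sketch of
the pattern, with illustrative names:

    extern int tracing_enabled;           /* stands in for tb_init_done */
    extern void emit_record(unsigned int id, unsigned int len, const void *d);

    static inline void trace_point(unsigned int id, unsigned int a, unsigned int b)
    {
        /* Only marshall the payload if someone is actually tracing. */
        if ( tracing_enabled )
        {
            struct { unsigned int a, b; } d = { a, b };
            emit_record(id, sizeof(d), &d);
        }
    }

trace_var() does perform the same check internally, but only after the
arguments have been gathered by the caller; hoisting the check avoids
exactly that work.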

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index e9f3f13..3fdc91c 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -636,6 +636,7 @@ __update_runq_load(const struct scheduler *ops,
 
     ASSERT(rqd->avgload <= STIME_MAX && rqd->b_avgload <= STIME_MAX);
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t rq_avgload, b_avgload;
@@ -646,9 +647,9 @@ __update_runq_load(const struct scheduler *ops,
         d.rq_avgload = rqd->avgload;
         d.b_avgload = rqd->b_avgload;
         d.shift = P;
-        trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_UPDATE_RUNQ_LOAD, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -691,6 +692,7 @@ __update_svc_load(const struct scheduler *ops,
     }
     svc->load_last_update = now;
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t v_avgload;
@@ -701,9 +703,9 @@ __update_svc_load(const struct scheduler *ops,
         d.vcpu = svc->vcpu->vcpu_id;
         d.v_avgload = svc->avgload;
         d.shift = P;
-        trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_UPDATE_VCPU_LOAD, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -759,6 +761,7 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
 
     pos = __runq_insert(runq, svc);
 
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -767,9 +770,9 @@ runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.pos = pos;
-        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     return;
@@ -812,7 +815,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
     ASSERT(new->rqd == rqd);
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -822,9 +825,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         d.vcpu = new->vcpu->vcpu_id;
         d.processor = new->vcpu->processor;
         d.credit = new->credit;
-        trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_TICKLE_NEW, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     /*
@@ -882,7 +885,8 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
             lowest = cur->credit;
         }
 
-        /* TRACE */ {
+        if ( unlikely(tb_init_done) )
+        {
             struct {
                 unsigned vcpu:16, dom:16;
                 unsigned credit;
@@ -890,9 +894,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
             d.dom = cur->vcpu->domain->domain_id;
             d.vcpu = cur->vcpu->vcpu_id;
             d.credit = cur->credit;
-            trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
     }
 
@@ -910,14 +914,15 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
  tickle:
     BUG_ON(ipid == -1);
 
-    /* TRACE */ {
+    if ( unlikely(tb_init_done) )
+    {
         struct {
             unsigned cpu:16, pad:16;
         } d;
         d.cpu = ipid; d.pad = 0;
-        trace_var(TRC_CSCHED2_TICKLE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_TICKLE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
     __cpumask_set_cpu(ipid, &rqd->tickled);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
@@ -979,7 +984,8 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
 
         svc->start_time = now;
 
-        /* TRACE */ {
+        if ( unlikely(tb_init_done) )
+        {
             struct {
                 unsigned vcpu:16, dom:16;
                 unsigned credit_start, credit_end;
@@ -990,9 +996,9 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
             d.credit_start = start_credit;
             d.credit_end = svc->credit;
             d.multiplier = m;
-            trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_CREDIT_RESET, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
     }
 
@@ -1028,7 +1034,7 @@ void burn_credits(struct csched2_runqueue_data *rqd,
                __func__, now, svc->start_time);
     }
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1039,9 +1045,9 @@ void burn_credits(struct csched2_runqueue_data *rqd,
         d.vcpu = svc->vcpu->vcpu_id;
         d.credit = svc->credit;
         d.delta = delta;
-        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -1077,16 +1083,16 @@ static void update_max_weight(struct csched2_runqueue_data *rqd, int new_weight,
         SCHED_STAT_CRANK(upd_max_weight_full);
     }
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned rqi:16, max_weight:16;
         } d;
         d.rqi = rqd->id;
         d.max_weight = rqd->max_weight;
-        trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_MAX_WEIGHT, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 }
 
@@ -1166,7 +1172,7 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
     /* Expected new load based on adding this vcpu */
     rqd->b_avgload += svc->avgload;
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1175,9 +1181,9 @@ __runq_assign(struct csched2_vcpu *svc, struct csched2_runqueue_data *rqd)
         d.dom = svc->vcpu->domain->domain_id;
         d.vcpu = svc->vcpu->vcpu_id;
         d.rqi=rqd->id;
-        trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_RUNQ_ASSIGN, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
 }
@@ -1489,7 +1495,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 out_up:
     spin_unlock(&prv->lock);
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t b_avgload;
@@ -1501,9 +1507,9 @@ out_up:
         d.vcpu = vc->vcpu_id;
         d.rq_id = c2r(ops, new_cpu);
         d.new_cpu = new_cpu;
-        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_PICKED_CPU, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     return new_cpu;
@@ -1561,7 +1567,7 @@ static void migrate(const struct scheduler *ops,
     bool_t running = svc->flags & CSFLAG_scheduled;
     bool_t on_runq = __vcpu_on_runq(svc);
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             unsigned vcpu:16, dom:16;
@@ -1571,9 +1577,9 @@ static void migrate(const struct scheduler *ops,
         d.vcpu = svc->vcpu->vcpu_id;
         d.rqi = svc->rqd->id;
         d.trqi = trqd->id;
-        trace_var(TRC_CSCHED2_MIGRATE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_MIGRATE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     if ( running )
@@ -1696,10 +1702,8 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
 
         cpus_max = cpumask_weight(&st.lrqd->active);
         i = cpumask_weight(&st.orqd->active);
-        if ( i > cpus_max )
-            cpus_max = i;
 
-        /* TRACE */
+        if ( unlikely(tb_init_done) )
         {
             struct {
                 unsigned lrq_id:16, orq_id:16;
@@ -1708,9 +1712,9 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
             d.lrq_id = st.lrqd->id;
             d.orq_id = st.orqd->id;
             d.load_delta = st.load_delta;
-            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
-                      sizeof(d),
-                      (unsigned char *)&d);
+            __trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
+                        sizeof(d),
+                        (unsigned char *)&d);
         }
 
         /*
@@ -1741,7 +1745,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     if ( unlikely(st.orqd->id < 0) )
         goto out_up;
 
-    /* TRACE */
+    if ( unlikely(tb_init_done) )
     {
         struct {
             uint64_t lb_avgload, ob_avgload;
@@ -1751,9 +1755,9 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
         d.lb_avgload = st.lrqd->b_avgload;
         d.orq_id = st.orqd->id;
         d.ob_avgload = st.orqd->b_avgload;
-        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
-                  sizeof(d),
-                  (unsigned char *)&d);
+        __trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
     now = NOW();



* [PATCH 16/19] tools: tracing: deal with new Credit2 events
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (14 preceding siblings ...)
  2016-06-17 23:13 ` [PATCH 15/19] xen: credit2: only marshal trace point arguments if tracing enabled Dario Faggioli
@ 2016-06-17 23:13 ` Dario Faggioli
  2016-07-07 15:39   ` George Dunlap
  2016-06-17 23:13 ` [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Anshul Makkar, Ian Jackson, George Dunlap

more specifically, with: TICKLE_NEW, RUNQ_MAX_WEIGHT,
MIGRATE, LOAD_CHECK, LOAD_BALANCE and PICKED_CPU, and
in both xenalyze and formats (for xentrace_format).
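
For reference, this is how such an event round-trips, as a simplified
stand-alone sketch (not the real xentrace/xenalyze code; it assumes
GCC's little-endian bitfield layout): the hypervisor packs two 16-bit
fields into one 32-bit record word, the formats file prints that word
whole (hence entries like "lrq_id[16]:orq_id[16] = 0x%(1)08x"), and
xenalyze overlays a struct with the same layout to recover the
individual fields:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* What the hypervisor marshalls for a LOAD_CHECK event. */
        struct {
            unsigned lrq_id:16, orq_id:16;
            unsigned load_delta;
        } d = { .lrq_id = 2, .orq_id = 5, .load_delta = 1234 };

        uint32_t w[2];
        memcpy(w, &d, sizeof(w));

        /* formats-style view: the packed word is printed whole... */
        printf("lrq_id[16]:orq_id[16] = 0x%08x, delta = %u\n",
               (unsigned)w[0], (unsigned)w[1]);

        /* ...xenalyze-style view: overlay the layout, read the fields. */
        struct {
            unsigned lrqi:16, orqi:16;
            unsigned load_delta;
        } *r = (void *)w;
        printf("lrq# %u, orq# %u, delta = %u\n",
               r->lrqi, r->orqi, r->load_delta);
        return 0;
    }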

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 tools/xentrace/formats    |    6 +++
 tools/xentrace/xenalyze.c |   78 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 2e58d03..caafb5f 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -55,6 +55,12 @@
 0x0002220a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_assign    [ dom:vcpu = 0x%(1)08x, rq_id = %(2)d ]
 0x0002220b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_vcpu_load [ dom:vcpu = 0x%(3)08x, vcpuload = 0x%(2)08x%(1)08x, wshift = %(4)d ]
 0x0002220c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x%(5)08x, rq_avgload = 0x%(2)08x%(1)08x, b_avgload = 0x%(4)08x%(3)08x ]
+0x0002220d  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tickle_new     [ dom:vcpu = 0x%(1)08x, processor = %(2)d, credit = %(3)d ]
+0x0002220e  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_max_weight [ rq_id[16]:max_weight[16] = 0x%(1)08x ]
+0x0002220f  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:migrate        [ dom:vcpu = 0x%(1)08x, rq_id[16]:trq_id[16] = 0x%(2)08x ]
+0x00022210  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_check     [ lrq_id[16]:orq_id[16] = 0x%(1)08x, delta = %(2)d ]
+0x00022211  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:load_balance   [ l_bavgload = 0x%(2)08x%(1)08x, o_bavgload = 0x%(4)08x%(3)08x, lrq_id[16]:orq_id[16] = 0x%(5)08x ]
+0x00022213  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:pick_cpu       [ b_avgload = 0x%(2)08x%(1)08x, dom:vcpu = 0x%(3)08x, rq_id[16]:new_cpu[16] = 0x%(4)08x ]
 
 0x00022801  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:tickle        [ cpu = %(1)d ]
 0x00022802  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  rtds:runq_pick     [ dom:vcpu = 0x%(1)08x, cur_deadline = 0x%(3)08x%(2)08x, cur_budget = 0x%(5)08x%(4)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index f2f97bd..d223de6 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7725,7 +7725,6 @@ void sched_process(struct pcpu_info *p)
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
-        case TRC_SCHED_CLASS_EVT(CSCHED2, 9): /* UPDATE_LOAD       */
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 2): /* RUNQ_POS          */
             if(opt.dump_all) {
@@ -7788,11 +7787,15 @@ void sched_process(struct pcpu_info *p)
             if(opt.dump_all)
                 printf(" %s csched2:sched_tasklet\n", ri->dump_header);
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 9):  /* UPDATE_LOAD      */
+            if(opt.dump_all)
+                printf(" %s csched2:update_load\n", ri->dump_header);
+            break;
         case TRC_SCHED_CLASS_EVT(CSCHED2, 10): /* RUNQ_ASSIGN      */
             if(opt.dump_all) {
                 struct {
                     unsigned int vcpuid:16, domid:16;
-                    unsigned int rqi;
+                    unsigned int rqi:16;
                 } *r = (typeof(r))ri->d;
 
                 printf(" %s csched2:runq_assign d%uv%u on rq# %u\n",
@@ -7834,6 +7837,77 @@ void sched_process(struct pcpu_info *p)
                        avgload, r->rq_avgload, b_avgload, r->b_avgload);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 13): /* TICKLE_NEW       */
+            if (opt.dump_all) {
+                struct {
+                    unsigned vcpuid:16, domid:16;
+                    unsigned processor, credit;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:runq_tickle_new d%uv%u, "
+                       "processor = %u, credit = %u\n",
+                       ri->dump_header, r->domid, r->vcpuid,
+                       r->processor, r->credit);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 14): /* RUNQ_MAX_WEIGHT  */
+            if (opt.dump_all) {
+                struct {
+                    unsigned rqi:16, max_weight:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:update_max_weight rq# %u, max_weight = %u\n",
+                       ri->dump_header, r->rqi, r->max_weight);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 15): /* MIGRATE          */
+            if (opt.dump_all) {
+                struct {
+                    unsigned vcpuid:16, domid:16;
+                    unsigned rqi:16, trqi:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:migrate d%uv%u rq# %u --> rq# %u\n",
+                       ri->dump_header, r->domid, r->vcpuid, r->rqi, r->trqi);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 16): /* LOAD_CHECK       */
+            if (opt.dump_all) {
+                struct {
+                    unsigned lrqi:16, orqi:16;
+                    unsigned load_delta;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:load_balance_check lrq# %u, orq# %u, "
+                       "delta = %u\n",
+                       ri->dump_header, r->lrqi, r->orqi, r->load_delta);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 17): /* LOAD_BALANCE     */
+            if (opt.dump_all) {
+                struct {
+                    uint64_t lb_avgload, ob_avgload;
+                    unsigned lrqi:16, orqi:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:load_balance_begin lrq# %u, "
+                       "avg_load = %"PRIu64" -- orq# %u, avg_load = %"PRIu64"\n",
+                       ri->dump_header, r->lrqi, r->lb_avgload,
+                       r->orqi, r->ob_avgload);
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED2, 19): /* PICKED_CPU       */
+            if (opt.dump_all) {
+                struct {
+                    uint64_t b_avgload;
+                    unsigned vcpuid:16, domid:16;
+                    unsigned rqi:16, cpu:16;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched2:picked_cpu d%uv%u, rq# %u, cpu %u\n",
+                       ri->dump_header, r->domid, r->vcpuid, r->rqi, r->cpu);
+            }
+            break;
         /* RTDS (TRC_RTDS_xxx) */
         case TRC_SCHED_CLASS_EVT(RTDS, 1): /* TICKLE           */
             if(opt.dump_all) {



* [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock.
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (15 preceding siblings ...)
  2016-06-17 23:13 ` [PATCH 16/19] tools: tracing: deal with new Credit2 events Dario Faggioli
@ 2016-06-17 23:13 ` Dario Faggioli
  2016-07-07 16:00   ` George Dunlap
  2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
  2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
  18 siblings, 1 reply; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

In fact, the data it protects only changes either at init-time,
during cpupool manipulation, or when changing domains' weights.
In all other cases (namely, load balancing, reading weights
and status dumping), the information is only read.

Therefore, let the lock be a read/write one. This means there
no longer is a single, full serialization point for the whole
scheduler and for all the pCPUs of the host.

This is particularly good for scalability (especially when doing
load balancing).

Also, update the high level description of the locking discipline,
and take the chance for rewording it a little bit (as well as
for adding a couple of locking related ASSERT()-s).
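
As a minimal user-space model of the change (pthread primitives
standing in for Xen's rwlock_t; names and values are illustrative
only):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t prv_lock = PTHREAD_RWLOCK_INITIALIZER;
    static int weight = 256;       /* stand-in for the protected data */

    /* Readers (load balancing, dumping, weight queries) can now run
     * in parallel, instead of serializing on a single spinlock. */
    static int read_weight(void)
    {
        int w;

        pthread_rwlock_rdlock(&prv_lock);
        w = weight;
        pthread_rwlock_unlock(&prv_lock);
        return w;
    }

    /* Writers (init, cpupool manipulation, weight changes) are rare,
     * and take the lock exclusively. */
    static void write_weight(int w)
    {
        pthread_rwlock_wrlock(&prv_lock);
        weight = w;
        pthread_rwlock_unlock(&prv_lock);
    }

    int main(void)
    {
        write_weight(512);
        printf("weight = %d\n", read_weight());
        return 0;
    }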

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |  133 ++++++++++++++++++++++++++------------------
 1 file changed, 80 insertions(+), 53 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 3fdc91c..93943fa 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -85,17 +85,37 @@
 
 /*
  * Locking:
- * - Schedule-lock is per-runqueue
- *  + Protects runqueue data, runqueue insertion, &c
- *  + Also protects updates to private sched vcpu structure
- *  + Must be grabbed using vcpu_schedule_lock_irq() to make sure vcpu->processr
- *    doesn't change under our feet.
- * - Private data lock
- *  + Protects access to global domain list
- *  + All other private data is written at init and only read afterwards.
+ *
+ * - runqueue lock
+ *  + it is per-runqueue, so:
+ *   * cpus in a runqueue take the runqueue lock, when using
+ *     pcpu_schedule_lock() / vcpu_schedule_lock() (and friends),
+ *   * a cpu may (try to) take a "remote" runqueue lock, e.g., for
+ *     load balancing;
+ *  + serializes runqueue operations (removing and inserting vcpus);
+ *  + protects runqueue-wide data in csched2_runqueue_data;
+ *  + protects vcpu parameters in csched2_vcpu for the vcpu in the
+ *    runqueue.
+ *
+ * - Private scheduler lock
+ *  + protects scheduler-wide data in csched2_private, such as:
+ *   * the list of domains active in this scheduler,
+ *   * what cpus and what runqueues are active and in what
+ *     runqueue each cpu is;
+ *  + serializes the operation of changing the weights of domains;
+ *
+ * - Type:
+ *  + runqueue locks are 'regular' spinlocks;
+ *  + the private scheduler lock can be an rwlock. In fact, data
+ *    it protects is modified only during initialization, cpupool
+ *    manipulation and when changing weights, and read in all
+ *    other cases (e.g., during load balancing).
+ *
  * Ordering:
- * - We grab private->schedule when updating domain weight; so we
- *  must never grab private if a schedule lock is held.
+ *  + trylock must be used when wanting to take a runqueue lock,
+ *    if we already hold another one;
+ *  + if taking both a runqueue lock and the private scheduler
+ *    lock is necessary, the latter must always be taken first.
  */
 
 /*
@@ -342,7 +362,7 @@ struct csched2_runqueue_data {
  * System-wide private data
  */
 struct csched2_private {
-    spinlock_t lock;
+    rwlock_t lock;
     cpumask_t initialized; /* CPU is initialized for this pool */
     
     struct list_head sdom; /* Used mostly for dump keyhandler. */
@@ -1300,13 +1320,14 @@ static void
 csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
+    unsigned int cpu = vc->processor;
     s_time_t now;
 
-    /* Schedule lock should be held at this point. */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
     ASSERT(!is_idle_vcpu(vc));
 
-    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
+    if ( unlikely(curr_on_cpu(cpu) == vc) )
     {
         SCHED_STAT_CRANK(vcpu_wake_running);
         goto out;
@@ -1397,7 +1418,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     ASSERT(!cpumask_empty(&prv->active_queues));
 
     /* Locking:
-     * - vc->processor is already locked
+     * - Runqueue lock of vc->processor is already locked
      * - Need to grab prv lock to make sure active runqueues don't
      *   change
      * - Need to grab locks for other runqueues while checking
@@ -1410,8 +1431,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
      * just grab the prv lock.  Instead, we'll have to trylock, and
      * do something else reasonable if we fail.
      */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, vc->processor).schedule_lock));
 
-    if ( !spin_trylock(&prv->lock) )
+    if ( !read_trylock(&prv->lock) )
     {
         /* We may be here because someon requested us to migrate */
         __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
@@ -1493,7 +1515,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
     }
 
 out_up:
-    spin_unlock(&prv->lock);
+    read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
     {
@@ -1647,15 +1669,13 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
      * on either side may be empty).
      */
 
-    /* Locking:
-     * - pcpu schedule lock should be already locked
-     */
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
     st.lrqd = RQD(ops, cpu);
 
     __update_runq_load(ops, st.lrqd, 0, now);
 
  retry:
-    if ( !spin_trylock(&prv->lock) )
+    if ( !read_trylock(&prv->lock) )
         return;
 
     st.load_delta = 0;
@@ -1686,8 +1706,8 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
         spin_unlock(&st.orqd->lock);
     }
 
-    /* Minimize holding the big lock */
-    spin_unlock(&prv->lock);
+    /* Minimize holding the private scheduler lock. */
+    read_unlock(&prv->lock);
     if ( max_delta_rqi == -1 )
         goto out;
 
@@ -1855,14 +1875,19 @@ csched2_dom_cntl(
     unsigned long flags;
     int rc = 0;
 
-    /* Must hold csched2_priv lock to read and update sdom,
-     * runq lock to update csvcs. */
-    spin_lock_irqsave(&prv->lock, flags);
-
+    /*
+     * Locking:
+     *  - we must take the private lock for accessing the weights of the
+     *    vcpus of d,
+     *  - in the putinfo case, we also need the runqueue lock(s), for
+     *    updating the max weight of the runqueue(s).
+     */
     switch ( op->cmd )
     {
     case XEN_DOMCTL_SCHEDOP_getinfo:
+        read_lock_irqsave(&prv->lock, flags);
         op->u.credit2.weight = sdom->weight;
+        read_unlock_irqrestore(&prv->lock, flags);
         break;
     case XEN_DOMCTL_SCHEDOP_putinfo:
         if ( op->u.credit2.weight != 0 )
@@ -1870,6 +1895,8 @@ csched2_dom_cntl(
             struct vcpu *v;
             int old_weight;
 
+            write_lock_irqsave(&prv->lock, flags);
+
             old_weight = sdom->weight;
 
             sdom->weight = op->u.credit2.weight;
@@ -1878,11 +1905,6 @@ csched2_dom_cntl(
             for_each_vcpu ( d, v )
             {
                 struct csched2_vcpu *svc = CSCHED2_VCPU(v);
-
-                /* NB: Locking order is important here.  Because we grab this lock here, we
-                 * must never lock csched2_priv.lock if we're holding a runqueue lock.
-                 * Also, calling vcpu_schedule_lock() is enough, since IRQs have already
-                 * been disabled. */
                 spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
 
                 ASSERT(svc->rqd == RQD(ops, svc->vcpu->processor));
@@ -1892,6 +1914,8 @@ csched2_dom_cntl(
 
                 vcpu_schedule_unlock(lock, svc->vcpu);
             }
+
+            write_unlock_irqrestore(&prv->lock, flags);
         }
         break;
     default:
@@ -1899,7 +1923,6 @@ csched2_dom_cntl(
         break;
     }
 
-    spin_unlock_irqrestore(&prv->lock, flags);
 
     return rc;
 }
@@ -1907,6 +1930,7 @@ csched2_dom_cntl(
 static void *
 csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
 {
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
     struct csched2_dom *sdom;
     unsigned long flags;
 
@@ -1920,11 +1944,11 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
     sdom->weight = CSCHED2_DEFAULT_WEIGHT;
     sdom->nr_vcpus = 0;
 
-    spin_lock_irqsave(&CSCHED2_PRIV(ops)->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     list_add_tail(&sdom->sdom_elem, &CSCHED2_PRIV(ops)->sdom);
 
-    spin_unlock_irqrestore(&CSCHED2_PRIV(ops)->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     return (void *)sdom;
 }
@@ -1951,12 +1975,13 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
 {
     unsigned long flags;
     struct csched2_dom *sdom = data;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
 
-    spin_lock_irqsave(&CSCHED2_PRIV(ops)->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     list_del_init(&sdom->sdom_elem);
 
-    spin_unlock_irqrestore(&CSCHED2_PRIV(ops)->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     xfree(data);
 }
@@ -2110,7 +2135,7 @@ csched2_schedule(
     rqd = RQD(ops, cpu);
     BUG_ON(!cpumask_test_cpu(cpu, &rqd->active));
 
-    /* Protected by runqueue lock */        
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
 #ifndef NDEBUG
     if ( !is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd)
@@ -2274,12 +2299,12 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
 
     /*
      * We need both locks:
-     * - csched2_dump_vcpu() wants to access domains' scheduling
-     *   parameters, which are protected by the private scheduler lock;
+     * - csched2_dump_vcpu() wants to access domains' weights,
+     *   which are protected by the private scheduler lock;
      * - we scan through the runqueue, so we need the proper runqueue
      *   lock (the one of the runqueue this cpu is associated to).
      */
-    spin_lock_irqsave(&prv->lock, flags);
+    read_lock_irqsave(&prv->lock, flags);
     lock = per_cpu(schedule_data, cpu).schedule_lock;
     spin_lock(lock);
 
@@ -2310,7 +2335,7 @@ csched2_dump_pcpu(const struct scheduler *ops, int cpu)
     }
 
     spin_unlock(lock);
-    spin_unlock_irqrestore(&prv->lock, flags);
+    read_unlock_irqrestore(&prv->lock, flags);
 #undef cpustr
 }
 
@@ -2323,9 +2348,11 @@ csched2_dump(const struct scheduler *ops)
     int i, loop;
 #define cpustr keyhandler_scratch
 
-    /* We need the private lock as we access global scheduler data
-     * and (below) the list of active domains. */
-    spin_lock_irqsave(&prv->lock, flags);
+    /*
+     * We need the private scheduler lock as we access global
+     * scheduler data and (below) the list of active domains.
+     */
+    read_lock_irqsave(&prv->lock, flags);
 
     printk("Active queues: %d\n"
            "\tdefault-weight     = %d\n",
@@ -2386,7 +2413,7 @@ csched2_dump(const struct scheduler *ops)
         }
     }
 
-    spin_unlock_irqrestore(&prv->lock, flags);
+    read_unlock_irqrestore(&prv->lock, flags);
 #undef cpustr
 }
 
@@ -2488,7 +2515,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     unsigned rqi;
     struct csched2_runqueue_data *rqd;
 
-    ASSERT(spin_is_locked(&prv->lock));
+    ASSERT(rw_is_write_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
 
     /* Figure out which runqueue to put it in */
@@ -2527,7 +2554,7 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
      */
     ASSERT(!pdata);
 
-    spin_lock_irqsave(&prv->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
     old_lock = pcpu_schedule_lock(cpu);
 
     rqi = init_pdata(prv, cpu);
@@ -2536,7 +2563,7 @@ csched2_init_pdata(const struct scheduler *ops, void *pdata, int cpu)
 
     /* _Not_ pcpu_schedule_unlock(): schedule_lock may have changed! */
     spin_unlock(old_lock);
-    spin_unlock_irqrestore(&prv->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 }
 
 /* Change the scheduler of cpu to us (Credit2). */
@@ -2559,7 +2586,7 @@ csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
      * cpu) is what is necessary to prevent races.
      */
     ASSERT(!local_irq_is_enabled());
-    spin_lock(&prv->lock);
+    write_lock(&prv->lock);
 
     idle_vcpu[cpu]->sched_priv = vdata;
 
@@ -2584,7 +2611,7 @@ csched2_switch_sched(struct scheduler *new_ops, unsigned int cpu,
     smp_mb();
     per_cpu(schedule_data, cpu).schedule_lock = &prv->rqd[rqi].lock;
 
-    spin_unlock(&prv->lock);
+    write_unlock(&prv->lock);
 }
 
 static void
@@ -2595,7 +2622,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     struct csched2_runqueue_data *rqd;
     int rqi;
 
-    spin_lock_irqsave(&prv->lock, flags);
+    write_lock_irqsave(&prv->lock, flags);
 
     /*
      * alloc_pdata is not implemented, so pcpu must be NULL. On the other
@@ -2626,7 +2653,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     __cpumask_clear_cpu(cpu, &prv->initialized);
 
-    spin_unlock_irqrestore(&prv->lock, flags);
+    write_unlock_irqrestore(&prv->lock, flags);
 
     return;
 }
@@ -2671,7 +2698,7 @@ csched2_init(struct scheduler *ops)
         return -ENOMEM;
     ops->sched_data = prv;
 
-    spin_lock_init(&prv->lock);
+    rwlock_init(&prv->lock);
     INIT_LIST_HEAD(&prv->sdom);
 
     /* But un-initialize all runqueues */



* [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (16 preceding siblings ...)
  2016-06-17 23:13 ` [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
@ 2016-06-17 23:13 ` Dario Faggioli
  2016-06-20  8:26   ` Jan Beulich
                     ` (2 more replies)
  2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
  18 siblings, 3 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

In fact, right now, we recommend keeping runqueues
arranged per-core, so that it is the inter-runqueue load
balancing code that automatically spreads the work in an
SMT friendly way. This means that any other runq
arrangement one may want to use falls short of SMT
scheduling optimizations.

This commit implements SMT awareness --similar to the
one we have in Credit1-- for any possible runq
arrangement. This turned out to be pretty easy to do,
as the logic can live entirely in runq_tickle()
(although, in order to avoid for_each_cpu loops in
that function, we use a new cpumask which indeed needs
to be updated in other places).

In addition to disentangling SMT awareness from load
balancing, this also allows us to support the
sched_smt_power_savings parameter in Credit2 as well.
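
To make the smt_idle invariant concrete, here is a stand-alone sketch
of the bookkeeping (plain bitmasks in place of cpumask_t, and two
hardware threads per core assumed; it mirrors the logic of the new
helpers, it is not the actual Xen code):

    #include <stdio.h>

    /* The cpu itself plus its (single) sibling. */
    #define SIBLINGS(cpu) (3u << ((cpu) & ~1u))

    static unsigned idle, tickled, smt_idle;

    static void smt_idle_mask_set(unsigned cpu)
    {
        /* Tickled cpus count as busy: they are about to pick up work. */
        unsigned idlers = idle & ~tickled;

        /* A core enters smt_idle only if *all* its siblings are idle. */
        if ( (SIBLINGS(cpu) & idlers) == SIBLINGS(cpu) )
            smt_idle |= SIBLINGS(cpu);
    }

    static void smt_idle_mask_clear(unsigned cpu)
    {
        /* One busy sibling evicts the whole core from smt_idle. */
        smt_idle &= ~SIBLINGS(cpu);
    }

    int main(void)
    {
        idle = 0xf;                 /* cpus 0-3 idle: two full cores */
        smt_idle_mask_set(0);
        smt_idle_mask_set(2);
        printf("smt_idle = %#x\n", smt_idle);  /* 0xf */

        idle &= ~(1u << 1);         /* cpu 1 starts running... */
        smt_idle_mask_clear(1);
        printf("smt_idle = %#x\n", smt_idle);  /* 0xc: core 0 evicted */
        return 0;
    }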

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |  141 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 126 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 93943fa..a8b3a85 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -351,7 +351,8 @@ struct csched2_runqueue_data {
     unsigned int max_weight;
 
     cpumask_t idle,        /* Currently idle */
-        tickled;           /* Another cpu in the queue is already targeted for this one */
+        smt_idle,          /* Fully idle cores (as in all the siblings are idle) */
+        tickled;           /* Have been asked to go through schedule */
     int load;              /* Instantaneous load: Length of queue  + num non-idle threads */
     s_time_t load_last_update;  /* Last time average was updated */
     s_time_t avgload;           /* Decaying queue load */
@@ -412,6 +413,73 @@ struct csched2_dom {
 };
 
 /*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT siblings in a core are idle, all their corresponding
+ *    bits are set in the smt_idle mask;
+ *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
+ *    bits corresponding to it and to all its siblings are clear in the
+ *    smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *    on pcpus from smt_idle mask first. This is what happens if
+ *    sched_smt_power_savings was not set at boot (default), and it maximizes
+ *    true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *    on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *    sched_smt_power_savings is set at boot, and it allows as many cores as
+ *    possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq, _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *    runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *    csched2_cpu_pick() looks for the least loaded runq and returns just any
+ *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *    the chosen runq, and it is again runq_tickle(), called by
+ *    csched2_vcpu_wake() that actually decides what pcpu to use within the
+ *    chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling migrate()
+ *    directly, that again temporarily uses a random pcpu from the new runq,
+ *    and then calls runq_tickle() by itself.
+ */
+
+/*
+ * If all the siblings of cpu (including cpu itself) are in idlers,
+ * set all their bits in mask.
+ *
+ * In order to properly take tickling into account, idlers needs to be
+ * set equal to something like:
+ *
+ *  rqd->idle & (~rqd->tickled)
+ *
+ * This is because cpus that have been tickled will very likely pick up some
+ * work as soon as they manage to schedule, and hence we should really consider
+ * them as busy.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask)
+{
+    if ( cpumask_subset(per_cpu(cpu_sibling_mask, cpu), idlers) )
+        cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask.
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{
+    cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
+/*
  * When a hard affinity change occurs, we may not be able to check some
  * (any!) of the other runqueues, when looking for the best new processor
  * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
@@ -851,9 +919,30 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     }
 
     /*
-     * Get a mask of idle, but not tickled, processors that new is
-     * allowed to run on. If that's not empty, choose someone from there
-     * (preferrably, the one were new was running on already).
+     * First of all, consider idle cpus, checking if we can just
+     * re-use the pcpu where we were running before.
+     *
+     * If there are cores where all the siblings are idle, consider
+     * them first, honoring whatever the spreading-vs-consolidation
+     * SMT policy wants us to do.
+     */
+    if ( unlikely(sched_smt_power_savings) )
+        cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
+    else
+        cpumask_copy(&mask, &rqd->smt_idle);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    i = cpumask_test_or_cycle(cpu, &mask);
+    if ( i < nr_cpu_ids )
+    {
+        SCHED_STAT_CRANK(tickled_idle_cpu);
+        ipid = i;
+        goto tickle;
+    }
+
+    /*
+     * If there are no fully idle cores, check all idlers, after
+     * having filtered out pcpus that have been tickled but haven't
+     * gone through the scheduler yet.
      */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
     cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
@@ -945,6 +1034,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                     (unsigned char *)&d);
     }
     __cpumask_set_cpu(ipid, &rqd->tickled);
+    //smt_idle_mask_clear(ipid, &rqd->smt_idle); XXX
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
 }
 
@@ -1435,13 +1525,15 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     if ( !read_trylock(&prv->lock) )
     {
-        /* We may be here because someon requested us to migrate */
+        /* We may be here because someone requested us to migrate */
         __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         return get_fallback_cpu(svc);
     }
 
-    /* First check to see if we're here because someone else suggested a place
-     * for us to move. */
+    /*
+     * First check to see if we're here because someone else suggested a place
+     * for us to move.
+     */
     if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
     {
         if ( unlikely(svc->migrate_rqd->id < 0) )
@@ -1462,7 +1554,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /* Find the runqueue with the lowest average load. */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
@@ -1505,16 +1597,17 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     /* We didn't find anyone (most likely because of spinlock contention). */
     if ( min_rqi == -1 )
-        new_cpu = get_fallback_cpu(svc);
-    else
     {
-        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
-                    &prv->rqd[min_rqi].active);
-        new_cpu = cpumask_any(cpumask_scratch);
-        BUG_ON(new_cpu >= nr_cpu_ids);
+        new_cpu = get_fallback_cpu(svc);
+        goto out_up;
     }
 
-out_up:
+    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
+                &prv->rqd[min_rqi].active);
+    new_cpu = cpumask_any(cpumask_scratch);
+    BUG_ON(new_cpu >= nr_cpu_ids);
+
+ out_up:
     read_unlock(&prv->lock);
 
     if ( unlikely(tb_init_done) )
@@ -2166,7 +2259,11 @@ csched2_schedule(
 
     /* Clear "tickled" bit now that we've been scheduled */
     if ( cpumask_test_cpu(cpu, &rqd->tickled) )
+    {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
+        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle); // XXX
+    }
 
     /* Update credits */
     burn_credits(rqd, scurr, now);
@@ -2228,7 +2325,10 @@ csched2_schedule(
 
         /* Clear the idle mask if necessary */
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_clear_cpu(cpu, &rqd->idle);
+            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+        }
 
         snext->start_time = now;
 
@@ -2250,10 +2350,17 @@ csched2_schedule(
         if ( tasklet_work_scheduled )
         {
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
+            {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
+                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
+        {
             __cpumask_set_cpu(cpu, &rqd->idle);
+            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
+            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
@@ -2383,6 +2490,8 @@ csched2_dump(const struct scheduler *ops)
         printk("\tidlers: %s\n", cpustr);
         cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].tickled);
         printk("\ttickled: %s\n", cpustr);
+        cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].smt_idle);
+        printk("\tfully idle cores: %s\n", cpustr);
     }
 
     printk("Domain info:\n");
@@ -2536,6 +2645,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &rqd->idle);
     __cpumask_set_cpu(cpu, &rqd->active);
     __cpumask_set_cpu(cpu, &prv->initialized);
+    __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
     return rqi;
 }
@@ -2641,6 +2751,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
 
     __cpumask_clear_cpu(cpu, &rqd->idle);
+    __cpumask_clear_cpu(cpu, &rqd->smt_idle);
     __cpumask_clear_cpu(cpu, &rqd->active);
 
     if ( cpumask_empty(&rqd->active) )



* [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu
  2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
                   ` (17 preceding siblings ...)
  2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
@ 2016-06-17 23:13 ` Dario Faggioli
  2016-06-20  8:30   ` Jan Beulich
  2016-06-21 10:42   ` David Vrabel
  18 siblings, 2 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-17 23:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

because it is cheaper, and there is not much point in
randomizing which cpu gets selected anyway, as such a
choice will be overridden shortly after, in runq_tickle().

If we really feel the need (e.g., if benchmarking proves
it worthwhile), we can record the last cpu which was used
by csched2_cpu_pick() and migrate() in a per-runq variable,
and then use cpumask_cycle()... but this really does not
look necessary.
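
For illustration, the difference boils down to something like the
following (plain masks instead of cpumask_t, and GCC-style builtins;
the real cpumask_any() is implemented differently, but it does pay
for picking a random set bit):

    #include <stdio.h>
    #include <stdlib.h>
    #include <strings.h>

    static int mask_first(unsigned mask)   /* lowest set bit: cheap */
    {
        return mask ? ffs(mask) - 1 : -1;
    }

    static int mask_any(unsigned mask)     /* random set bit: costlier */
    {
        int n = __builtin_popcount(mask), target, cpu;

        if ( n == 0 )
            return -1;
        target = rand() % n;               /* the extra work */
        for ( cpu = 0; ; cpu++ )
            if ( (mask & (1u << cpu)) && target-- == 0 )
                return cpu;
    }

    int main(void)
    {
        unsigned scratch = 0x34;           /* cpus 2, 4 and 5 allowed */

        printf("first: %d, any: %d\n",
               mask_first(scratch), mask_any(scratch));
        return 0;
    }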

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/sched_credit2.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index a8b3a85..afd432e 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1545,7 +1545,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         {
             cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
                         &svc->migrate_rqd->active);
-            new_cpu = cpumask_any(cpumask_scratch);
+            new_cpu = cpumask_first(cpumask_scratch);
             if ( new_cpu < nr_cpu_ids )
                 goto out_up;
         }
@@ -1604,7 +1604,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
                 &prv->rqd[min_rqi].active);
-    new_cpu = cpumask_any(cpumask_scratch);
+    new_cpu = cpumask_first(cpumask_scratch);
     BUG_ON(new_cpu >= nr_cpu_ids);
 
  out_up:
@@ -1718,7 +1718,7 @@ static void migrate(const struct scheduler *ops,
 
         cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
                     &trqd->active);
-        svc->vcpu->processor = cpumask_any(cpumask_scratch);
+        svc->vcpu->processor = cpumask_first(cpumask_scratch);
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         __runq_assign(svc, trqd);



* Re: [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer
  2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
@ 2016-06-18  0:36   ` Meng Xu
  2016-07-06 15:52   ` George Dunlap
  1 sibling, 0 replies; 64+ messages in thread
From: Meng Xu @ 2016-06-18  0:36 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

On Fri, Jun 17, 2016 at 7:11 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
>
> In fact, what we have right now, i.e., tickle_idlers_none
> and tickle_idlers_some, is not good enough for describing
> what really happens in the various tickling functions of
> the various schedulers.
>
> Switch to a more descriptive set of counters, such as:
>  - tickled_no_cpu: for when we don't tickle anyone
>  - tickled_idle_cpu: for when we tickle one or more
>                      idlers
>  - tickled_busy_cpu: for when we tickle one or more
>                      non-idlers
>
> While there, fix style of an "out:" label in sched_rt.c.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit.c    |   10 +++++++---
>  xen/common/sched_credit2.c   |   12 +++++-------
>  xen/common/sched_rt.c        |    8 +++++---
>  xen/include/xen/perfc_defn.h |    5 +++--
>  4 files changed, 20 insertions(+), 15 deletions(-)


In terms of sched_rt.c and perfc_defn.h,

Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>

Thanks,

Meng

------------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/


* Re: [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
@ 2016-06-20  7:48   ` Jan Beulich
  2016-07-07 10:11     ` Dario Faggioli
  2016-06-21 16:17   ` anshul makkar
  2016-07-06 15:41   ` George Dunlap
  2 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  7:48 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:11, <dario.faggioli@citrix.com> wrote:
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1819,24 +1819,24 @@ csched_schedule(
>      else
>          snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
>  
> + out:
>      /*
>       * Update idlers mask if necessary. When we're idling, other CPUs
>       * will tickle us when they get extra work.
>       */
> -    if ( snext->pri == CSCHED_PRI_IDLE )
> +    if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
>      {
> -        if ( !cpumask_test_cpu(cpu, prv->idlers) )
> -            cpumask_set_cpu(cpu, prv->idlers);
> +        if ( cpumask_test_cpu(cpu, prv->idlers) )
> +            cpumask_clear_cpu(cpu, prv->idlers);
>      }
> -    else if ( cpumask_test_cpu(cpu, prv->idlers) )
> +    else if ( !cpumask_test_cpu(cpu, prv->idlers) )
>      {
> -        cpumask_clear_cpu(cpu, prv->idlers);
> +        cpumask_set_cpu(cpu, prv->idlers);
>      }

Is there a reason for this extra code churn? It would seem to me
that the change could be just the "out" label movement and
adjustment to the first if:

   if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
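
[ Expanded, the suggested shape would be (sketch): ]

     out:
        /*
         * Update idlers mask if necessary. When we're idling, other CPUs
         * will tickle us when they get extra work.
         */
        if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
        {
            if ( !cpumask_test_cpu(cpu, prv->idlers) )
                cpumask_set_cpu(cpu, prv->idlers);
        }
        else if ( cpumask_test_cpu(cpu, prv->idlers) )
        {
            cpumask_clear_cpu(cpu, prv->idlers);
        }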

Am I overlooking something?

Jan



* Re: [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held
  2016-06-17 23:12 ` [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held Dario Faggioli
@ 2016-06-20  7:56   ` Jan Beulich
  2016-07-06 16:10     ` George Dunlap
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  7:56 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> Yet another situation very similar to 779511f4bf5ae
> ("sched: avoid races on time values read from NOW()").
> 
> In fact, when more than one runqueue is involved, we need
> to make sure that the following does not happen:
>  1. take the lock of 1st runq
>  2. now = NOW()
>  3. take the lock of 2nd runq
>  4. use now
> 
> as, if we have to wait at step 3, the value in now may
> be stale when we get to use it at step 4.

Is this really meaningful here? We're talking of trylocks, which don't
incur any delay other than the latency of the LOCKed (on x86)
instruction to determine lock availability.

Jan



* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-06-17 23:12 ` [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
@ 2016-06-20  8:02   ` Jan Beulich
  2016-07-06 16:21     ` George Dunlap
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  8:02 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> This really should not happen, but:
>  1. it does happen! Investigation is ongoing here:
>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html 
>  2. even when 1 will be fixed it makes sense and is easy enough
>     to have a 'safety catch' for it.
> 
> The reason why this is particularly bad for Credit2 is that
> negative values of delta mean out of scale high load (because
> of the conversion to unsigned). This, for instance in the
> case of runqueue load, results in a runqueue having its load
> updated to values of the order of 10000% or so, which in turns
> means that the load balancer will migrate everything off from
> the pCPUs in the runqueue, and leave them idle until the load
> gets back to something sane... which may indeed take a while!
> 
> This is not a fix for the problem of time going backwards. In
> fact, if that happens a lot, load tracking accuracy is still
> compromised, but at least the effect is a lot less bad than
> before.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit2.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 50f8dfd..b73d034 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>      else
>      {
>          delta = now - rqd->load_last_update;
> +        if ( unlikely(delta < 0) )
> +        {
> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                     __func__, now, rqd->load_last_update);
> +            delta = 0;
> +        }
>  
>          rqd->avgload =
>              ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>      else
>      {
>          delta = now - svc->load_last_update;
> +        if ( unlikely(delta < 0) )
> +        {
> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                     __func__, now, svc->load_last_update);
> +            delta = 0;
> +        }
>  
>          svc->avgload =
>              ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )

Do the absolute times really matter here? I.e. wouldn't it be more
useful to simply log the value of delta?

Also, may I ask you to use the L modifier in favor of the ll one, for
being one byte shorter (and hence, even if just very slightly,
reducing both image size and cache pressure)?

And finally, instead of logging function names, could the two
messages be made distinguishable by other means resulting in less
data issued to the log (and potentially needing transmission over
a slow serial line)?
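
[ Taken together, the suggestions would give something like the
  following sketch, where the "runq load:" tag is only an example
  of telling the two call sites apart by message text: ]

    if ( unlikely(delta < 0) )
    {
        d2printk("WARNING: runq load: time went backwards? delta = %Ld\n",
                 delta);
        delta = 0;
    }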

Jan



* Re: [PATCH 13/19] xen: credit2: make the code less experimental
  2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
@ 2016-06-20  8:13   ` Jan Beulich
  2016-07-07 10:59     ` Dario Faggioli
  2016-07-07 15:17   ` George Dunlap
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  8:13 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> @@ -608,8 +605,8 @@ __update_runq_load(const struct scheduler *ops,
>          delta = now - rqd->load_last_update;
>          if ( unlikely(delta < 0) )
>          {
> -            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> -                     __func__, now, rqd->load_last_update);
> +            printk("WARNING: %s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                   __func__, now, rqd->load_last_update);
>              delta = 0;
>          }
>  
> @@ -680,8 +677,8 @@ __update_svc_load(const struct scheduler *ops,
>          delta = now - svc->load_last_update;
>          if ( unlikely(delta < 0) )
>          {
> -            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> -                     __func__, now, svc->load_last_update);
> +            printk("WARNING: %s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                   __func__, now, svc->load_last_update);
>              delta = 0;
>          }
>  

With these now becoming non-debugging ones - are they useful
_every_ time such an event occurs? I.e. wouldn't it be better to
e.g. only log new high watermark values?
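
One way to do that is to gate the message on a per-runqueue high watermark,
along these lines (a sketch; max_twb is a hypothetical field, not something
in the patch):

    if ( unlikely(delta < 0) )
    {
        /* Only warn when the backwards jump is the biggest seen so far. */
        if ( -delta > rqd->max_twb )
        {
            rqd->max_twb = -delta;
            printk(XENLOG_WARNING "credit2: time went backwards? delta %"PRI_stime"\n",
                   delta);
        }
        delta = 0;
    }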

> @@ -2580,15 +2583,20 @@ csched2_init(struct scheduler *ops)
>      int i;
>      struct csched2_private *prv;
>  
> -    printk("Initializing Credit2 scheduler\n" \
> -           " WARNING: This is experimental software in development.\n" \
> +    printk("Initializing Credit2 scheduler\n");
> +    printk(" WARNING: This is experimental software in development.\n" \
>             " Use at your own risk.\n");
>  
> -    printk(" load_precision_shift: %d\n", opt_load_precision_shift);
> -    printk(" load_window_shift: %d\n", opt_load_window_shift);
> -    printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
> -    printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> -    printk(" runqueues arrangement: %s\n", opt_runqueue_str[opt_runqueue]);
> +    printk(XENLOG_INFO " load_precision_shift: %d\n"
> +           " load_window_shift: %d\n"
> +           " underload_balance_tolerance: %d\n"
> +           " overload_balance_tolerance: %d\n"
> +           " runqueues arrangement: %s\n",
> +           opt_load_precision_shift,
> +           opt_load_window_shift,
> +           opt_underload_balance_tolerance,
> +           opt_overload_balance_tolerance,
> +           opt_runqueue_str[opt_runqueue]);

Note that this results in only the first line getting logged at info level;
all others will get the default logging level (i.e. warning) assigned. IOW
I think you want to repeat XENLOG_INFO a couple of times.
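
That is, something like the following (a sketch; all the option names come
from the quoted hunk):

    printk(XENLOG_INFO " load_precision_shift: %d\n", opt_load_precision_shift);
    printk(XENLOG_INFO " load_window_shift: %d\n", opt_load_window_shift);
    printk(XENLOG_INFO " underload_balance_tolerance: %d\n",
           opt_underload_balance_tolerance);
    printk(XENLOG_INFO " overload_balance_tolerance: %d\n",
           opt_overload_balance_tolerance);
    printk(XENLOG_INFO " runqueues arrangement: %s\n",
           opt_runqueue_str[opt_runqueue]);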

Jan



* Re: [PATCH 14/19] xen: credit2: add yet some more tracing
  2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
@ 2016-06-20  8:15   ` Jan Beulich
  2016-07-07 15:34     ` George Dunlap
  2016-07-07 15:34   ` George Dunlap
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  8:15 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> @@ -1484,6 +1489,23 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>  out_up:
>      spin_unlock(&prv->lock);
>  
> +    /* TRACE */
> +    {
> +        struct {
> +            uint64_t b_avgload;
> +            unsigned vcpu:16, dom:16;
> +            unsigned rq_id:16, new_cpu:16;
> +       } d;
> +        d.b_avgload = prv->rqd[min_rqi].b_avgload;
> +        d.dom = vc->domain->domain_id;
> +        d.vcpu = vc->vcpu_id;
> +        d.rq_id = c2r(ops, new_cpu);
> +        d.new_cpu = new_cpu;

I guess this follows pre-existing style, but it would seem more natural
to me for the variable to have an initializer instead of this series of
assignments.
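
The initializer form would look like this (a sketch; all the fields come
from the quoted tracepoint):

    struct {
        uint64_t b_avgload;
        unsigned vcpu:16, dom:16;
        unsigned rq_id:16, new_cpu:16;
    } d = {
        .b_avgload = prv->rqd[min_rqi].b_avgload,
        .vcpu      = vc->vcpu_id,
        .dom       = vc->domain->domain_id,
        .rq_id     = c2r(ops, new_cpu),
        .new_cpu   = new_cpu,
    };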

Jan



* Re: [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
  2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
@ 2016-06-20  8:26   ` Jan Beulich
  2016-06-20 10:38     ` Dario Faggioli
  2016-06-27 15:20   ` anshul makkar
  2016-07-12 13:40   ` George Dunlap
  2 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  8:26 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:13, <dario.faggioli@citrix.com> wrote:
> +static inline
> +void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask)
> +{
> +    if ( cpumask_subset( per_cpu(cpu_sibling_mask, cpu), idlers) )
> +        cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
> +}

I think helpers like this should be made const-correct. Here idlers
is only an input.

Also I'm not sure the compiler can fold the redundant
per_cpu(cpu_sibling_mask, cpu) in all cases. Is it maybe worth
helping it by using a local variable here or moving the expression
into the caller's invocation expression?

And as a side note - there is a stray space inside the cpumask_subset().
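
Folding all three points in would give something like this (a sketch,
assuming per_cpu(cpu_sibling_mask, cpu) can simply be cached in a const
local):

    static inline void
    smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
                      cpumask_t *mask)
    {
        const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);

        if ( cpumask_subset(cpu_siblings, idlers) )
            cpumask_or(mask, mask, cpu_siblings);
    }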

> @@ -945,6 +1034,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>                      (unsigned char *)&d);
>      }
>      __cpumask_set_cpu(ipid, &rqd->tickled);
> +    //smt_idle_mask_clear(ipid, &rqd->smt_idle); XXX
>      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>  }

With this, was the patch meant to be RFC?

> @@ -1435,13 +1525,15 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>  
>      if ( !read_trylock(&prv->lock) )
>      {
> -        /* We may be here because someon requested us to migrate */
> +        /* We may be here because someone requested us to migrate */

Please add the missing full stop at once.

Jan



* Re: [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu
  2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
@ 2016-06-20  8:30   ` Jan Beulich
  2016-06-20 11:28     ` Dario Faggioli
  2016-06-21 10:42   ` David Vrabel
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-06-20  8:30 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

>>> On 18.06.16 at 01:13, <dario.faggioli@citrix.com> wrote:
> because it is cheaper, and there is not much point in
> randomizing which cpu gets selected anyway, as such
> choice will be overridden shortly after, in runq_tickle().

If it will always be overridden, why fill it in the first place? And if there
are cases where it won't get overridden, you're re-introducing a
preference towards lower CPU numbers, which I think is not a good
idea. Can the code perhaps be rearranged to avoid the cpumask_any()
when another value will subsequently get stored anyway?

Jan



* Re: [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
  2016-06-20  8:26   ` Jan Beulich
@ 2016-06-20 10:38     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-20 10:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap



On Mon, 2016-06-20 at 02:26 -0600, Jan Beulich wrote:
> > > > On 18.06.16 at 01:13, <dario.faggioli@citrix.com> wrote:
> > +static inline
> > +void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers,
> > cpumask_t *mask)
> > +{
> > +    if ( cpumask_subset( per_cpu(cpu_sibling_mask, cpu), idlers) )
> > +        cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
> > +}
> I think helpers like this should be made const-correct. Here idlers
> is only an input.
> 
Ok.

> Also I'm not sure the compiler can fold the redundant
> per_cpu(cpu_sibling_mask, cpu) in all cases. Is it maybe worth
> helping it by using a local variable here or moving the expression
> into the caller's invocation expression?
> 
Agreed too.

> > @@ -945,6 +1034,7 @@ runq_tickle(const struct scheduler *ops,
> > struct csched2_vcpu *new, s_time_t now)
> >                      (unsigned char *)&d);
> >      }
> >      __cpumask_set_cpu(ipid, &rqd->tickled);
> > +    //smt_idle_mask_clear(ipid, &rqd->smt_idle); XXX
> >      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> >  }
> With this, was the patch meant to be RFC?
> 
No, it's me that should have removed this line after the last round of
testing, but forgot. Apologies. :-/

> > @@ -1435,13 +1525,15 @@ csched2_cpu_pick(const struct scheduler
> > *ops, struct vcpu *vc)
> >  
> >      if ( !read_trylock(&prv->lock) )
> >      {
> > -        /* We may be here because someon requested us to migrate
> > */
> > +        /* We may be here because someone requested us to migrate
> > */
> Please add the missing full stop at once.
> 
Yep.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu
  2016-06-20  8:30   ` Jan Beulich
@ 2016-06-20 11:28     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-06-20 11:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap



On Mon, 2016-06-20 at 02:30 -0600, Jan Beulich wrote:
> > > > On 18.06.16 at 01:13, <dario.faggioli@citrix.com> wrote:
> > because it is cheaper, and there is not much point in
> > randomizing which cpu gets selected anyway, as such
> > choice will be overridden shortly after, in runq_tickle().
> If it will always be overridden, why fill it in the first place? And
> if there
> are cases where it won't get overridden, you're re-introducing a
> preference towards lower CPU numbers, which I think is not a good
> idea. 
>
It will never be used directly as the actual target CPU --at least
according to my analysis of the code.

runq_tickle() will consider it, but only as a hint, and will actually
use it only if it satisfies all the other load balancing conditions
(being part of a fully idle core, being idle, being within hard affinity,
being preemptable, etc.).

As I said in the rest of the changelog, if we really fear, or start to
observe, that lower CPU numbers are being preferred, we can add
countermeasures (stashing the CPU we chose last time and using
cpumask_cycle(), as we do in Credit1 for a similar purpose).

My feeling is that they won't, as the load balancing logic in
runq_tickle() will make that unlikely enough.
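
The countermeasure mentioned above would look roughly like this (a sketch;
pick_bias is a hypothetical per-runqueue field remembering the last choice):

    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
                &prv->rqd[min_rqi].active);
    /* Cycle from the last pick instead of always starting from bit 0. */
    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias, cpumask_scratch);
    prv->rqd[min_rqi].pick_bias = new_cpu;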

> Can the code perhaps be rearranged to avoid the cpumask_any()
> when another value will subsequently get stored anyway?
> 
I thought about it, and although for sure there are alternatives, none
of the ones I could come up with were looking better than the present
situation.

Fact is, when the pick_cpu() hook is called in vcpu_migrate(), what
vcpu_migrate() wants back from it is indeed a CPU number. Then (through
vcpu_move_locked()) it either just sets v->processor equal to such CPU,
or calls the migrate() hook.

On Credit1, the CPU returned by pick_cpu() is indeed the CPU where we
want the vcpu to run, and setting v->processor to that is all we need
to do for migrating a vcpu (and in fact, migrate() is not even
defined).

On Credit2, we (ab?)use pick_cpu() to actually select not really a CPU,
but a runqueue. The fact that we return a random CPU from the runqueue
we decided we want is the (pretty clever, IMO) way in which we avoid
having to teach schedule.c about runqueues. Then, in migrate() (which
is defined for Credit2), we go the other way round: we hand a CPU to
Credit2 and it translates that back into a runqueue (the runqueue
where that CPU sits).

Archaeology confirms that the migrate() hook was introduced (in
ff38d3faa7d "credit2: add a callback to migrate to a new cpu")
specifically for Credit2.

The main difference, wrt all the above, between Credit1 and Credit2 is
that in Credit1 there is one runqueue per CPU, while in Credit2 multiple
CPUs share the same runqueue. The current pick_cpu()/migrate() approach
lets both the schedulers, despite this difference, achieve what they
want. Note also how such an approach targets the simplest case (<<hey,
sched_*.c, give me a CPU!>>), which is good when reading and wanting to
understand schedule.c. It's then the responsibility of any scheduler that
wants to play fancy tricks --like Credit2 does with runqueues-- to take
care of that, without making anyone else pay the price in terms of
complexity.
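
In rough code, the handshake described above looks more or less like this (a
sketch built from helper names visible elsewhere in the thread; the real
hook performs a few more checks):

    static void
    csched2_vcpu_migrate(const struct scheduler *ops, struct vcpu *vc,
                         unsigned int new_cpu)
    {
        struct csched2_vcpu * const svc = CSCHED2_VCPU(vc);
        /* schedule.c handed us a CPU; map it back to its runqueue. */
        struct csched2_runqueue_data *trqd = RQD(ops, new_cpu);

        if ( trqd != svc->rqd )
            migrate(ops, svc, trqd, NOW());
    }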

Every alternative I thought of involved making things less
straightforward in schedule.c, which is something I'd rather avoid. If
anyone has better alternatives, I'm all ears. :-)

I certainly can add more comments, in sched_credit2.c, for explaining
the situation.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format
  2016-06-17 23:12 ` [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format Dario Faggioli
@ 2016-06-21  9:27   ` Wei Liu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Liu @ 2016-06-21  9:27 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xen-devel, Anshul Makkar, Ian Jackson, Wei Liu, George Dunlap

On Sat, Jun 18, 2016 at 01:12:36AM +0200, Dario Faggioli wrote:
> in both xenalyze and formats (for xentrace_format).
> 
> In particular, in xenalyze, now that we have the precision
> of the fixed point load values in the tracepoint, show both
> the raw value and the (easier to interpret) percentage.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

FAOD I will leave this patch and the other one to George because he
knows better than me about xentrace.

Just by skimming the two patches, they look fine to me.

Wei.


* Re: [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu
  2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
  2016-06-20  8:30   ` Jan Beulich
@ 2016-06-21 10:42   ` David Vrabel
  2016-07-07 16:55     ` Dario Faggioli
  1 sibling, 1 reply; 64+ messages in thread
From: David Vrabel @ 2016-06-21 10:42 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar, George Dunlap, David Vrabel

On 18/06/16 00:13, Dario Faggioli wrote:
> because it is cheaper, and there is not much point in
> randomizing which cpu gets selected anyway, as such
> choice will be overridden shortly after, in runq_tickle().
> 
> If we really feel the need (e.g., we prove it worth with
> benchmarking), we can record the last cpu which was used
> by csched2_cpu_pick() and migrate() in a per-runq variable,
> and then use cpumask_cycle()... but this really does not
> look necessary.

Isn't this backwards?  Surely you should demonstrate that this change is
beneficial before proposing it?

I don't think any performance related change should be accepted without
experimental evidence that it makes something better, especially if it
looks like it might have negative consequences (e.g., by favouring low
cpus).

David



* Re: [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
  2016-06-20  7:48   ` Jan Beulich
@ 2016-06-21 16:17   ` anshul makkar
  2016-07-06 15:41   ` George Dunlap
  2 siblings, 0 replies; 64+ messages in thread
From: anshul makkar @ 2016-06-21 16:17 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, David Vrabel

On 18/06/16 00:11, Dario Faggioli wrote:
> In both Credit1 and Credit2, stop considering a pCPU idle,
> if the reason why the idle vCPU is being selected, is to
> do tasklet work.
>
> Not doing so means that the tickling and load balancing
> logic, seeing the pCPU as idle, considers it a candidate
> for picking up vCPUs. But the pCPU won't actually pick
> up or schedule any vCPU, which would then remain in the
> runqueue, which is bad, especially if there were other,
> truly idle pCPUs, that could execute it.
>
> The only drawback is that we can't assume that a pCPU is
> always marked as idle when being removed from an
> instance of the Credit2 scheduler (csched2_deinit_pdata).
> In fact, if we are in stop-machine (i.e., during suspend
> or shutdown), the pCPUs are running the stopmachine_tasklet
> and hence are actually marked as busy. On the other hand,
> when removing a pCPU from a Credit2 pool, it will indeed
> be idle. The only thing we can do, therefore, is to
> remove the BUG_ON() check.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>   xen/common/sched_credit.c  |   12 ++++++------
>   xen/common/sched_credit2.c |   14 ++++++++++----
>   2 files changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index a38a63d..a6645a2 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1819,24 +1819,24 @@ csched_schedule(
>       else
>           snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
>
> + out:
>       /*
>        * Update idlers mask if necessary. When we're idling, other CPUs
>        * will tickle us when they get extra work.
>        */
> -    if ( snext->pri == CSCHED_PRI_IDLE )
> +    if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
>       {
> -        if ( !cpumask_test_cpu(cpu, prv->idlers) )
> -            cpumask_set_cpu(cpu, prv->idlers);
> +        if ( cpumask_test_cpu(cpu, prv->idlers) )
> +            cpumask_clear_cpu(cpu, prv->idlers);
>       }
> -    else if ( cpumask_test_cpu(cpu, prv->idlers) )
> +    else if ( !cpumask_test_cpu(cpu, prv->idlers) )
>       {
> -        cpumask_clear_cpu(cpu, prv->idlers);
> +        cpumask_set_cpu(cpu, prv->idlers);
>       }
>
>       if ( !is_idle_vcpu(snext->vcpu) )
>           snext->start_time += now;
>
> -out:
>       /*
>        * Return task to run next...
>        */
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 1933ff1..cf8455c 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -1910,8 +1910,16 @@ csched2_schedule(
>       }
>       else
>       {
> -        /* Update the idle mask if necessary */
> -        if ( !cpumask_test_cpu(cpu, &rqd->idle) )
> +        /*
> +         * Update the idle mask if necessary. Note that, if we're scheduling
> +         * idle in order to carry on some tasklet work, we want to play busy!
> +         */
> +        if ( tasklet_work_scheduled )
> +        {
> +            if ( cpumask_test_cpu(cpu, &rqd->idle) )
> +                cpumask_clear_cpu(cpu, &rqd->idle);
> +        }
> +        else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
>               cpumask_set_cpu(cpu, &rqd->idle);
>           /* Make sure avgload gets updated periodically even
>            * if there's no activity */
> @@ -2291,8 +2299,6 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>       /* No need to save IRQs here, they're already disabled */
>       spin_lock(&rqd->lock);
>
> -    BUG_ON(!cpumask_test_cpu(cpu, &rqd->idle));
> -
>       printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
>
>       cpumask_clear_cpu(cpu, &rqd->idle);
>



* Re: [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter
  2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
@ 2016-06-21 16:41   ` anshul makkar
  2016-07-06 15:59   ` George Dunlap
  1 sibling, 0 replies; 64+ messages in thread
From: anshul makkar @ 2016-06-21 16:41 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: David Vrabel, George Dunlap

On 18/06/16 00:11, Dario Faggioli wrote:
> In fact, they always operate on the svc->processor of
> the csched2_vcpu passed to them.
>
> No functional change intended.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>   xen/common/sched_credit2.c |   19 ++++++++++---------
>   1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 0246453..5881583 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -518,8 +518,9 @@ __runq_insert(struct list_head *runq, struct csched2_vcpu *svc)
>   }
>
>   static void
> -runq_insert(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *svc)
> +runq_insert(const struct scheduler *ops, struct csched2_vcpu *svc)
>   {
> +    unsigned int cpu = svc->vcpu->processor;
>       struct list_head * runq = &RQD(ops, cpu)->runq;
>       int pos = 0;
>
> @@ -558,17 +559,17 @@ void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_ti
>   /* Check to see if the item on the runqueue is higher priority than what's
>    * currently running; if so, wake up the processor */
>   static /*inline*/ void
> -runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *new, s_time_t now)
> +runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>   {
>       int i, ipid=-1;
>       s_time_t lowest=(1<<30);
> +    unsigned int cpu = new->vcpu->processor;
>       struct csched2_runqueue_data *rqd = RQD(ops, cpu);
>       cpumask_t mask;
>       struct csched2_vcpu * cur;
>
>       d2printk("rqt %pv curr %pv\n", new->vcpu, current);
>
> -    BUG_ON(new->vcpu->processor != cpu);
>       BUG_ON(new->rqd != rqd);
>
>       /* Look at the cpu it's running on first */
> @@ -1071,8 +1072,8 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>       update_load(ops, svc->rqd, svc, 1, now);
>
>       /* Put the VCPU on the runq */
> -    runq_insert(ops, vc->processor, svc);
> -    runq_tickle(ops, vc->processor, svc, now);
> +    runq_insert(ops, svc);
> +    runq_tickle(ops, svc, now);
>
>   out:
>       d2printk("w-\n");
> @@ -1104,8 +1105,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
>       {
>           BUG_ON(__vcpu_on_runq(svc));
>
> -        runq_insert(ops, vc->processor, svc);
> -        runq_tickle(ops, vc->processor, svc, now);
> +        runq_insert(ops, svc);
> +        runq_tickle(ops, svc, now);
>       }
>       else if ( !is_idle_vcpu(vc) )
>           update_load(ops, svc->rqd, svc, -1, now);
> @@ -1313,8 +1314,8 @@ static void migrate(const struct scheduler *ops,
>           if ( on_runq )
>           {
>               update_load(ops, svc->rqd, NULL, 1, now);
> -            runq_insert(ops, svc->vcpu->processor, svc);
> -            runq_tickle(ops, svc->vcpu->processor, svc, now);
> +            runq_insert(ops, svc);
> +            runq_tickle(ops, svc, now);
>               SCHED_STAT_CRANK(migrate_on_runq);
>           }
>           else
>



* Re: [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
  2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
  2016-06-20  8:26   ` Jan Beulich
@ 2016-06-27 15:20   ` anshul makkar
  2016-07-12 13:40   ` George Dunlap
  2 siblings, 0 replies; 64+ messages in thread
From: anshul makkar @ 2016-06-27 15:20 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, David Vrabel

On 18/06/16 00:13, Dario Faggioli wrote:
> In fact, right now, we recommend keeping runqueues
> arranged per-core, so that it is the inter-runqueue load
> balancing code that automatically spreads the work in an
> SMT friendly way. This means that any other runq
> arrangement one may want to use falls short of SMT
> scheduling optimizations.
>
> This commit implements SMT awareness --similar to the
> one we have in Credit1-- for any possible runq
> arrangement. This turned out to be pretty easy to do,
> as the logic can live entirely in runq_tickle()
> (although, in order to avoid for_each_cpu loops in
> that function, we use a new cpumask which indeed needs
> to be updated in other places).
>
> In addition to disentangling SMT awareness from load
> balancing, this also allows us to support the
> sched_smt_power_savings parameter in Credit2 as well.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>   xen/common/sched_credit2.c |  141 +++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 126 insertions(+), 15 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 93943fa..a8b3a85 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -351,7 +351,8 @@ struct csched2_runqueue_data {
>       unsigned int max_weight;
>
>       cpumask_t idle,        /* Currently idle */
> -        tickled;           /* Another cpu in the queue is already targeted for this one */
> +        smt_idle,          /* Fully idle cores (as in all the siblings are idle) */
> +        tickled;           /* Have been asked to go through schedule */
>       int load;              /* Instantaneous load: Length of queue  + num non-idle threads */
>       s_time_t load_last_update;  /* Last time average was updated */
>       s_time_t avgload;           /* Decaying queue load */
> @@ -412,6 +413,73 @@ struct csched2_dom {
>   };
>
>   /*
> + * Hyperthreading (SMT) support.
> + *
> + * We use a special per-runq mask (smt_idle) and update it according to the
> + * following logic:
> + *  - when _all_ the SMT siblings in a core are idle, all their corresponding
> + *    bits are set in the smt_idle mask;
> + *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
> + *    bits corresponding to it and to all its siblings are clear in the
> + *    smt_idle mask.
> + *
> + * Once we have such a mask, it is easy to implement a policy that, either:
> + *  - uses fully idle cores first: it is enough to try to schedule the vcpus
> + *    on pcpus from smt_idle mask first. This is what happens if
> + *    sched_smt_power_savings was not set at boot (default), and it maximizes
> + *    true parallelism, and hence performance;
> + *  - uses already busy cores first: it is enough to try to schedule the vcpus
> + *    on pcpus that are idle, but are not in smt_idle. This is what happens if
> + *    sched_smt_power_savings is set at boot, and it allows as many cores as
> + *    possible to stay in low power states, minimizing power consumption.
> + *
> + * This logic is entirely implemented in runq_tickle(), and that is enough.
> + * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
> + * runq, _always_ happens by means of tickling:
> + *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
> + *    runq_tickle();
> + *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
> + *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
> + *    csched2_cpu_pick() looks for the least loaded runq and returns just any
> + *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
> + *    the chosen runq, and it is again runq_tickle(), called by
> + *    csched2_vcpu_wake() that actually decides what pcpu to use within the
> + *    chosen runq;
> + *  - when a migration is initiated in sched_credit2.c, by calling  migrate()
> + *    directly, that again temporarily uses a random pcpu from the new runq,
> + *    and then calls runq_tickle(), by itself.
> + */
> +
> +/*
> + * If all the siblings of cpu (including cpu itself) are in idlers,
> + * set all their bits in mask.
> + *
> + * In order to properly take into account tickling, idlers needs to be
> + * set equal to something like:
> + *
> + *  rqd->idle & (~rqd->tickled)
> + *
> + * This is because cpus that have been tickled will very likely pick up some
> + * work as soon as they manage to schedule, and hence we should really consider
> + * them as busy.
> + */
> +static inline
> +void smt_idle_mask_set(unsigned int cpu, cpumask_t *idlers, cpumask_t *mask)
> +{
> +    if ( cpumask_subset( per_cpu(cpu_sibling_mask, cpu), idlers) )
> +        cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
> +}
> +
> +/*
> + * Clear the bits of all the siblings of cpu from mask.
> + */
> +static inline
> +void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
> +{
> +    cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
> +}
> +
> +/*
>    * When a hard affinity change occurs, we may not be able to check some
>    * (any!) of the other runqueues, when looking for the best new processor
>    * for svc (as trylock-s in csched2_cpu_pick() can fail). If that happens, we
> @@ -851,9 +919,30 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>       }
>
>       /*
> -     * Get a mask of idle, but not tickled, processors that new is
> -     * allowed to run on. If that's not empty, choose someone from there
> -     * (preferrably, the one were new was running on already).
> +     * First of all, consider idle cpus, checking if we can just
> +     * re-use the pcpu where we were running before.
> +     *
> +     * If there are cores where all the siblings are idle, consider
> +     * them first, honoring whatever the spreading-vs-consolidation
> +     * SMT policy wants us to do.
> +     */
> +    if ( unlikely(sched_smt_power_savings) )
> +        cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
> +    else
> +        cpumask_copy(&mask, &rqd->smt_idle);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> +    i = cpumask_test_or_cycle(cpu, &mask);
> +    if ( i < nr_cpu_ids )
> +    {
> +        SCHED_STAT_CRANK(tickled_idle_cpu);
> +        ipid = i;
> +        goto tickle;
> +    }
> +
> +    /*
> +     * If there are no fully idle cores, check all idlers, after
> +     * having filtered out pcpus that have been tickled but haven't
> +     * gone through the scheduler yet.
>        */
>       cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
>       cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> @@ -945,6 +1034,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>                       (unsigned char *)&d);
>       }
>       __cpumask_set_cpu(ipid, &rqd->tickled);
> +    //smt_idle_mask_clear(ipid, &rqd->smt_idle); XXX
>       cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>   }
>
> @@ -1435,13 +1525,15 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>
>       if ( !read_trylock(&prv->lock) )
>       {
> -        /* We may be here because someon requested us to migrate */
> +        /* We may be here because someone requested us to migrate */
>           __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>           return get_fallback_cpu(svc);
>       }
>
> -    /* First check to see if we're here because someone else suggested a place
> -     * for us to move. */
> +    /*
> +     * First check to see if we're here because someone else suggested a place
> +     * for us to move.
> +     */
>       if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
>       {
>           if ( unlikely(svc->migrate_rqd->id < 0) )
> @@ -1462,7 +1554,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>
>       min_avgload = MAX_LOAD;
>
> -    /* Find the runqueue with the lowest instantaneous load */
> +    /* Find the runqueue with the lowest average load. */
>       for_each_cpu(i, &prv->active_queues)
>       {
>           struct csched2_runqueue_data *rqd;
> @@ -1505,16 +1597,17 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>
>       /* We didn't find anyone (most likely because of spinlock contention). */
>       if ( min_rqi == -1 )
> -        new_cpu = get_fallback_cpu(svc);
> -    else
>       {
> -        cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
> -                    &prv->rqd[min_rqi].active);
> -        new_cpu = cpumask_any(cpumask_scratch);
> -        BUG_ON(new_cpu >= nr_cpu_ids);
> +        new_cpu = get_fallback_cpu(svc);
> +        goto out_up;
>       }
>
> -out_up:
> +    cpumask_and(cpumask_scratch, vc->cpu_hard_affinity,
> +                &prv->rqd[min_rqi].active);
> +    new_cpu = cpumask_any(cpumask_scratch);
> +    BUG_ON(new_cpu >= nr_cpu_ids);
> +
> + out_up:
>       read_unlock(&prv->lock);
>
>       if ( unlikely(tb_init_done) )
> @@ -2166,7 +2259,11 @@ csched2_schedule(
>
>       /* Clear "tickled" bit now that we've been scheduled */
>       if ( cpumask_test_cpu(cpu, &rqd->tickled) )
> +    {
>           __cpumask_clear_cpu(cpu, &rqd->tickled);
> +        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
> +        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle); // XXX
> +    }
>
>       /* Update credits */
>       burn_credits(rqd, scurr, now);
> @@ -2228,7 +2325,10 @@ csched2_schedule(
>
>           /* Clear the idle mask if necessary */
>           if ( cpumask_test_cpu(cpu, &rqd->idle) )
> +        {
>               __cpumask_clear_cpu(cpu, &rqd->idle);
> +            smt_idle_mask_clear(cpu, &rqd->smt_idle);
> +        }
>
>           snext->start_time = now;
>
> @@ -2250,10 +2350,17 @@ csched2_schedule(
>           if ( tasklet_work_scheduled )
>           {
>               if ( cpumask_test_cpu(cpu, &rqd->idle) )
> +            {
>                   __cpumask_clear_cpu(cpu, &rqd->idle);
> +                smt_idle_mask_clear(cpu, &rqd->smt_idle);
> +            }
>           }
>           else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
> +        {
>               __cpumask_set_cpu(cpu, &rqd->idle);
> +            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
> +            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
> +        }
>           /* Make sure avgload gets updated periodically even
>            * if there's no activity */
>           update_load(ops, rqd, NULL, 0, now);
> @@ -2383,6 +2490,8 @@ csched2_dump(const struct scheduler *ops)
>           printk("\tidlers: %s\n", cpustr);
>           cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].tickled);
>           printk("\ttickled: %s\n", cpustr);
> +        cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->rqd[i].smt_idle);
> +        printk("\tfully idle cores: %s\n", cpustr);
>       }
>
>       printk("Domain info:\n");
> @@ -2536,6 +2645,7 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
>       __cpumask_set_cpu(cpu, &rqd->idle);
>       __cpumask_set_cpu(cpu, &rqd->active);
>       __cpumask_set_cpu(cpu, &prv->initialized);
> +    __cpumask_set_cpu(cpu, &rqd->smt_idle);
>
>       return rqi;
>   }
> @@ -2641,6 +2751,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>       printk(XENLOG_INFO "Removing cpu %d from runqueue %d\n", cpu, rqi);
>
>       __cpumask_clear_cpu(cpu, &rqd->idle);
> +    __cpumask_clear_cpu(cpu, &rqd->smt_idle);
>       __cpumask_clear_cpu(cpu, &rqd->active);
>
>       if ( cpumask_empty(&rqd->active) )
>



* Re: [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
  2016-06-20  7:48   ` Jan Beulich
  2016-06-21 16:17   ` anshul makkar
@ 2016-07-06 15:41   ` George Dunlap
  2016-07-07 10:25     ` Dario Faggioli
  2 siblings, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-06 15:41 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In both Credit1 and Credit2, stop considering a pCPU idle,
> if the reason why the idle vCPU is being selected, is to
> do tasklet work.
>
> Not doing so means that the tickling and load balancing
> logic, seeing the pCPU as idle, considers it a candidate
> for picking up vCPUs. But the pCPU won't actually pick
> up or schedule any vCPU, which would then remain in the
> runqueue, which is bas, especially if there were other,

*bad

> truly idle pCPUs, that could execute it.
>
> The only drawback is that we can't assume that a pCPU is
> in always marked as idle when being removed from an
> instance of the Credit2 scheduler (csched2_deinit_pdata).
> In fact, if we are in stop-machine (i.e., during suspend
> or shutdown), the pCPUs are running the stopmachine_tasklet
> and hence are actually marked as busy. On the other hand,
> when removing a pCPU from a Credit2 pool, it will indeed
> be idle. The only thing we can do, therefore, is to
> remove the BUG_ON() check.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit.c  |   12 ++++++------
>  xen/common/sched_credit2.c |   14 ++++++++++----
>  2 files changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index a38a63d..a6645a2 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1819,24 +1819,24 @@ csched_schedule(
>      else
>          snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
>
> + out:

Sorry if I'm being a bit dense, but why is this moving up here?  As
far as I can tell the only time the 'out' label will be used, the
'idler' status of the cpu cannot change.

At the very least, moving it up here introduces a bug, since now...

>      /*
>       * Update idlers mask if necessary. When we're idling, other CPUs
>       * will tickle us when they get extra work.
>       */
> -    if ( snext->pri == CSCHED_PRI_IDLE )
> +    if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
>      {
> -        if ( !cpumask_test_cpu(cpu, prv->idlers) )
> -            cpumask_set_cpu(cpu, prv->idlers);
> +        if ( cpumask_test_cpu(cpu, prv->idlers) )
> +            cpumask_clear_cpu(cpu, prv->idlers);
>      }
> -    else if ( cpumask_test_cpu(cpu, prv->idlers) )
> +    else if ( !cpumask_test_cpu(cpu, prv->idlers) )
>      {
> -        cpumask_clear_cpu(cpu, prv->idlers);
> +        cpumask_set_cpu(cpu, prv->idlers);
>      }
>
>      if ( !is_idle_vcpu(snext->vcpu) )
>          snext->start_time += now;

...this will happen twice in that case (once in the if() clause, once
here).  (Although arguably the one in the if() clause should go away
and the out: label should be moved above this line anyway).

Other than that, looks good.

 -George


* Re: [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer
  2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
  2016-06-18  0:36   ` Meng Xu
@ 2016-07-06 15:52   ` George Dunlap
  1 sibling, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 15:52 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, Meng Xu, David Vrabel

On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, what we have right now, i.e., tickle_idlers_none
> and tickle_idlers_some, is not good enough for describing
> what really happens in the various tickling functions of
> the various scheduler.
>
> Switch to a more descriptive set of counters, such as:
>  - tickled_no_cpu: for when we don't tickle anyone
>  - tickled_idle_cpu: for when we tickle one or more
>                      idlers
>  - tickled_busy_cpu: for when we tickle one or more
>                      non-idlers
>
> While there, fix style of an "out:" label in sched_rt.c.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Looks good:

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Meng Xu <mengxu@cis.upenn.edu>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit.c    |   10 +++++++---
>  xen/common/sched_credit2.c   |   12 +++++-------
>  xen/common/sched_rt.c        |    8 +++++---
>  xen/include/xen/perfc_defn.h |    5 +++--
>  4 files changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index a6645a2..a54bb2d 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -385,7 +385,9 @@ static inline void __runq_tickle(struct csched_vcpu *new)
>           || (idlers_empty && new->pri > cur->pri) )
>      {
>          if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> +            SCHED_STAT_CRANK(tickled_busy_cpu);
> +        else
> +            SCHED_STAT_CRANK(tickled_idle_cpu);
>          __cpumask_set_cpu(cpu, &mask);
>      }
>      else if ( !idlers_empty )
> @@ -444,13 +446,13 @@ static inline void __runq_tickle(struct csched_vcpu *new)
>                      set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
>                  }
>                  /* Tickle cpu anyway, to let new preempt cur. */
> -                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_STAT_CRANK(tickled_busy_cpu);
>                  __cpumask_set_cpu(cpu, &mask);
>              }
>              else if ( !new_idlers_empty )
>              {
>                  /* Which of the idlers suitable for new shall we wake up? */
> -                SCHED_STAT_CRANK(tickle_idlers_some);
> +                SCHED_STAT_CRANK(tickled_idle_cpu);
>                  if ( opt_tickle_one_idle )
>                  {
>                      this_cpu(last_tickle_cpu) =
> @@ -479,6 +481,8 @@ static inline void __runq_tickle(struct csched_vcpu *new)
>          /* Send scheduler interrupts to designated CPUs */
>          cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>      }
> +    else
> +        SCHED_STAT_CRANK(tickled_no_cpu);
>  }
>
>  static void
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index cf8455c..0246453 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -589,6 +589,7 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>      i = cpumask_cycle(cpu, &mask);
>      if ( i < nr_cpu_ids )
>      {
> +        SCHED_STAT_CRANK(tickled_idle_cpu);
>          ipid = i;
>          goto tickle;
>      }
> @@ -637,11 +638,12 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>       * than the migrate resistance */
>      if ( ipid == -1 || lowest + CSCHED2_MIGRATE_RESIST > new->credit )
>      {
> -        SCHED_STAT_CRANK(tickle_idlers_none);
> -        goto no_tickle;
> +        SCHED_STAT_CRANK(tickled_no_cpu);
> +        return;
>      }
>
> -tickle:
> +    SCHED_STAT_CRANK(tickled_busy_cpu);
> + tickle:
>      BUG_ON(ipid == -1);
>
>      /* TRACE */ {
> @@ -654,11 +656,7 @@ tickle:
>                    (unsigned char *)&d);
>      }
>      cpumask_set_cpu(ipid, &rqd->tickled);
> -    SCHED_STAT_CRANK(tickle_idlers_some);
>      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> -
> -no_tickle:
> -    return;
>  }
>
>  /*
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 5b077d7..dd1c4d3 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -1140,6 +1140,7 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
>      /* 1) if new's previous cpu is idle, kick it for cache benefit */
>      if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
>      {
> +        SCHED_STAT_CRANK(tickled_idle_cpu);
>          cpu_to_tickle = new->vcpu->processor;
>          goto out;
>      }
> @@ -1151,6 +1152,7 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
>          iter_vc = curr_on_cpu(cpu);
>          if ( is_idle_vcpu(iter_vc) )
>          {
> +            SCHED_STAT_CRANK(tickled_idle_cpu);
>              cpu_to_tickle = cpu;
>              goto out;
>          }
> @@ -1164,14 +1166,15 @@ runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
>      if ( latest_deadline_vcpu != NULL &&
>           new->cur_deadline < latest_deadline_vcpu->cur_deadline )
>      {
> +        SCHED_STAT_CRANK(tickled_busy_cpu);
>          cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
>          goto out;
>      }
>
>      /* didn't tickle any cpu */
> -    SCHED_STAT_CRANK(tickle_idlers_none);
> +    SCHED_STAT_CRANK(tickled_no_cpu);
>      return;
> -out:
> + out:
>      /* TRACE */
>      {
>          struct {
> @@ -1185,7 +1188,6 @@ out:
>      }
>
>      cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
> -    SCHED_STAT_CRANK(tickle_idlers_some);
>      cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
>      return;
>  }
> diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
> index 21c1e0b..a336c71 100644
> --- a/xen/include/xen/perfc_defn.h
> +++ b/xen/include/xen/perfc_defn.h
> @@ -27,8 +27,9 @@ PERFCOUNTER(vcpu_wake_running,      "sched: vcpu_wake_running")
>  PERFCOUNTER(vcpu_wake_onrunq,       "sched: vcpu_wake_onrunq")
>  PERFCOUNTER(vcpu_wake_runnable,     "sched: vcpu_wake_runnable")
>  PERFCOUNTER(vcpu_wake_not_runnable, "sched: vcpu_wake_not_runnable")
> -PERFCOUNTER(tickle_idlers_none,     "sched: tickle_idlers_none")
> -PERFCOUNTER(tickle_idlers_some,     "sched: tickle_idlers_some")
> +PERFCOUNTER(tickled_no_cpu,         "sched: tickled_no_cpu")
> +PERFCOUNTER(tickled_idle_cpu,       "sched: tickled_idle_cpu")
> +PERFCOUNTER(tickled_busy_cpu,       "sched: tickled_busy_cpu")
>  PERFCOUNTER(vcpu_check,             "sched: vcpu_check")
>
>  /* credit specific counters */
>
>

* Re: [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter
  2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
  2016-06-21 16:41   ` anshul makkar
@ 2016-07-06 15:59   ` George Dunlap
  1 sibling, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 15:59 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap

On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, they always operate on the svc->processor of
> the csched2_vcpu passed to them.
>
> No functional change intended.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

And queued this one and patch 2.

Good to see those finally go. :-)


* Re: [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu
  2016-06-17 23:11 ` [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu Dario Faggioli
@ 2016-07-06 16:02   ` George Dunlap
  2016-07-07 10:26     ` Dario Faggioli
  0 siblings, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:02 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, it has the same signature of csched2_cpu_pick,
> which also is its uniqe caller.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

And queued, fixing up the spelling of "unique".


* Re: [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once
  2016-06-17 23:11 ` [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once Dario Faggioli
@ 2016-07-06 16:05   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:05 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> on the same vcpu, without NOW() having changed.
>
> This is, in fact, a legitimate use case. If it happens,
> we should just do nothing, without producing any warning
> or debug message.
>
> While there, fix style and add a couple of branching
> hints.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

And queued.


* Re: [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held
  2016-06-20  7:56   ` Jan Beulich
@ 2016-07-06 16:10     ` George Dunlap
  2016-07-07 10:28       ` Dario Faggioli
  0 siblings, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Anshul Makkar, David Vrabel

On Mon, Jun 20, 2016 at 8:56 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>> Yet another situation very similar to 779511f4bf5ae
>> ("sched: avoid races on time values read from NOW()").
>>
>> In fact, when more than one runqueue is involved, we need
>> to make sure that the following does not happen:
>>  1. take the lock of 1st runq
>>  2. now = NOW()
>>  3. take the lock of 2nd runq
>>  4. use now
>>
>> as, if we have to wait at step 3, the value in now may
>> be stale when we get to use it at step 4.
>
> Is this really meaningful here? We're talking of trylocks, which don't
> incur any delay other than the latency of the LOCKed (on x86)
> instruction to determine lock availability.

This makes sense to me -- Dario?

 -George
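
For reference, the ordering the changelog argues for is simply (a sketch;
rqd1/rqd2 are illustrative names, and, per Jan's point, with trylocks the
wait at step 3 is bounded anyway):

    /* Read the clock only once every lock it is used under is held. */
    spin_lock(&rqd1->lock);
    spin_lock(&rqd2->lock);
    now = NOW();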


* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-06-20  8:02   ` Jan Beulich
@ 2016-07-06 16:21     ` George Dunlap
  2016-07-07  7:29       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Anshul Makkar, David Vrabel

On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>> This really should not happen, but:
>>  1. it does happen! Investigation is ongoing here:
>>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html
>>  2. even when 1 will be fixed it makes sense and is easy enough
>>     to have a 'safety catch' for it.
>>
>> The reason why this is particularly bad for Credit2 is that
>> negative values of delta mean out of scale high load (because
>> of the conversion to unsigned). This, for instance in the
>> case of runqueue load, results in a runqueue having its load
>> updated to values of the order of 10000% or so, which in turn
>> means that the load balancer will migrate everything off from
>> the pCPUs in the runqueue, and leave them idle until the load
>> gets back to something sane... which may indeed take a while!
>>
>> This is not a fix for the problem of time going backwards. In
>> fact, if that happens a lot, load tracking accuracy is still
>> compromised, but at least the effect is a lot less bad than
>> before.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>> ---
>> Cc: George Dunlap <george.dunlap@citrix.com>
>> Cc: Anshul Makkar <anshul.makkar@citrix.com>
>> Cc: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  xen/common/sched_credit2.c |   12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>> index 50f8dfd..b73d034 100644
>> --- a/xen/common/sched_credit2.c
>> +++ b/xen/common/sched_credit2.c
>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>>      else
>>      {
>>          delta = now - rqd->load_last_update;
>> +        if ( unlikely(delta < 0) )
>> +        {
>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
>> +                     __func__, now, rqd->load_last_update);
>> +            delta = 0;
>> +        }
>>
>>          rqd->avgload =
>>              ( ( delta * ( (unsigned long long)rqd->load << prv->load_window_shift ) )
>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>>      else
>>      {
>>          delta = now - svc->load_last_update;
>> +        if ( unlikely(delta < 0) )
>> +        {
>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
>> +                     __func__, now, svc->load_last_update);
>> +            delta = 0;
>> +        }
>>
>>          svc->avgload =
>>              ( ( delta * ( (unsigned long long)vcpu_load << prv->load_window_shift ) )
>
> Do the absolute times really matter here? I.e. wouldn't it be more
> useful to simply log the value of delta?
>
> Also, may I ask you to use the L modifier in favor of the ll one, for
> being one byte shorter (and hence, even if just very slightly,
> reducing both image size and cache pressure)?
>
> And finally, instead of logging function names, could the two
> messages be made distinguishable by other means resulting in less
> data issued to the log (and potentially needing transmission over
> a slow serial line)?

The reason this is under a "d2printk" is because it's really only to
help developers in debugging.  In-tree this warning isn't even on with
debug=y; you have to go to the top of the file and change the #define
to make it even exist.
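
For illustration, the gate in question looks roughly like this at the
top of sched_credit2.c (shape assumed, not a verbatim quote):

    /* Flip this by hand to get the d2printk() messages: */
    #define d2printk(x...)
    /* #define d2printk printk */

With the first definition active, every d2printk() call compiles away
to nothing, so the message isn't even present in the binary.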

Given that, I don't think the quibbles over the code size or the
length of what's logged really matter.  I think we should just take it
as it is.

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

 -George


* Re: [PATCH 08/19] xen: credit2: when tickling, check idle cpus first
  2016-06-17 23:12 ` [PATCH 08/19] xen: credit2: when tickling, check idle cpus first Dario Faggioli
@ 2016-07-06 16:36   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:36 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> If there are idle pCPUs, it's always better to try to
> "ship" the new vCPU there, instead than letting it
> preempting on a currently busy one.
>
> This commit also adds a cpumask_test_or_cycle() helper
> function, to make it easier to code the preference for
> the pCPU where the vCPU was running before.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

And queued.
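
The helper named in the changelog is presumably along these lines (a
sketch, built on the existing cpumask_test_cpu() / cpumask_cycle()
primitives):

    /* Return cpu itself if it is set in mask, else the next set cpu,
     * cycling; this encodes "prefer the pCPU the vCPU ran on before". */
    static inline unsigned int
    cpumask_test_or_cycle(unsigned int cpu, const cpumask_t *mask)
    {
        if ( cpumask_test_cpu(cpu, mask) )
            return cpu;

        return cpumask_cycle(cpu, mask);
    }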


* Re: [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu
  2016-06-17 23:12 ` [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu Dario Faggioli
@ 2016-07-06 16:40   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 16:40 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> by not resetting the variable that should guard against
> that at the beginning of each step of the outer loop.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Oops. :-/

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

And queued.
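
The bug class is easy to see in isolation. A minimal standalone
sketch (not the actual Xen code; only the guard's name is borrowed
from balance_load()):

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        bool inner_load_updated = false; /* the fix: initialize once, here */

        for ( int push = 0; push < 3; push++ )      /* outer loop */
        {
            /* bool inner_load_updated = false;  <- the bug: re-arming the
             * guard here made the update below run once per outer step. */
            for ( int pull = 0; pull < 2; pull++ )  /* inner loop */
            {
                if ( !inner_load_updated )
                    printf("update load of inner vcpu %d\n", pull);
            }
            inner_load_updated = true;
        }
        return 0;
    }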


* Re: [PATCH 10/19] xen: credit2: rework load tracking logic
  2016-06-17 23:12 ` [PATCH 10/19] xen: credit2: rework load tracking logic Dario Faggioli
@ 2016-07-06 17:33   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-06 17:33 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> The existing load tracking code was hard to understand and
> maintain, and not entirely consistent. This was due to a
> number of reasons:
>  - code and comments were not in perfect sync, making it
>    difficult to figure out what the intent of a particular
>    choice was (e.g., the choice of 18 for load_window_shift);
>  - the math, although effective, was not entirely consistent.
>    In fact, we were doing (if W is the length of the window):
>
>     avgload = (delta*load*W + (W - delta)*avgload)/W
>     avgload = avgload + delta*load - delta*avgload/W
>
>    which does not match any known variant of 'smoothing
>    moving average'. In fact, it should have been:
>
>     avgload = avgload + delta*load/W - delta*avgload/W
>
>    (for details on why, see the doc comments inside this
>    patch.). Furthermore, with delta equal to W, the old
>    formula degenerates to:
>
>     avgload ~= avgload + W*load - avgload
>     avgload ~= W*load
>
> The reason why the formula above sort of worked was because
> the number of bits used for the fractional parts of the
> values used in fixed point math and the number of bits used
> for the length of the window were the same (load_window_shift
> was being used for both).
>
> This may look handy, but it introduced a (not especially well
> documented) dependency between the length of the window and
> the precision of the calculations, which really should be
> two independent things. Especially if treating them as such
> (like it is done in this patch) does not lead to more
> complex maths (same number of multiplications and shifts, and
> there is still room for some optimization).
>
> Therefore, in this patch, we:
>  - split length of the window and precision (and, since there
>    is already a command line parameter for length of window,
>    introduce one for precision too),
>  - align the math with one proper incarnation of exponential
>    smoothing (at no added cost),
>  - add comments, about the details of the algorithm and the
>    math used.
>
> While there fix a couple of style issues as well (pointless
> initialization, long lines, comments).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

And queued.  Thanks for the work.

 -George
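
To make the corrected formula concrete: rearranged, it reads

    avgload' = avgload + (delta/W) * (load - avgload)

i.e. plain exponential smoothing with delta/W as the smoothing factor
(delta being clamped to at most W). With P bits of fixed-point
precision and W = 1 << window_shift, one update step can be sketched
as (this is the math, not the patch's actual code):

    if ( delta > W )
        delta = W;          /* the smoothing factor must not exceed 1 */
    /* avgload carries P fractional bits; load is an integer count.   */
    avgload += ( delta * ((load << P) - avgload) ) >> window_shift;

Note that delta == W now yields avgload' == load << P (a full
replacement of the old average), where the old math degenerated to
the out-of-scale W*load shown above.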


* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-07-06 16:21     ` George Dunlap
@ 2016-07-07  7:29       ` Jan Beulich
  2016-07-07  9:09         ` George Dunlap
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-07-07  7:29 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, Dario Faggioli, David Vrabel

>>> On 06.07.16 at 18:21, <george.dunlap@citrix.com> wrote:
> On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>>> This really should not happen, but:
>>>  1. it does happen! Investigation is ongoing here:
>>>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html 
>>>  2. even when 1 will be fixed it makes sense and is easy enough
>>>     to have a 'safety catch' for it.
>>>
>>> The reason why this is particularly bad for Credit2 is that
>>> negative values of delta mean out of scale high load (because
>>> of the conversion to unsigned). This, for instance in the
>>> case of runqueue load, results in a runqueue having its load
>>> updated to values of the order of 10000% or so, which in turn
>>> means that the load balancer will migrate everything off from
>>> the pCPUs in the runqueue, and leave them idle until the load
>>> gets back to something sane... which may indeed take a while!
>>>
>>> This is not a fix for the problem of time going backwards. In
>>> fact, if that happens a lot, load tracking accuracy is still
>>> compromised, but at least the effect is a lot less bad than
>>> before.
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> ---
>>> Cc: George Dunlap <george.dunlap@citrix.com>
>>> Cc: Anshul Makkar <anshul.makkar@citrix.com>
>>> Cc: David Vrabel <david.vrabel@citrix.com>
>>> ---
>>>  xen/common/sched_credit2.c |   12 ++++++++++++
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>>> index 50f8dfd..b73d034 100644
>>> --- a/xen/common/sched_credit2.c
>>> +++ b/xen/common/sched_credit2.c
>>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>>>      else
>>>      {
>>>          delta = now - rqd->load_last_update;
>>> +        if ( unlikely(delta < 0) )
>>> +        {
>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
> %"PRI_stime"\n",
>>> +                     __func__, now, rqd->load_last_update);
>>> +            delta = 0;
>>> +        }
>>>
>>>          rqd->avgload =
>>>              ( ( delta * ( (unsigned long long)rqd->load << 
> prv->load_window_shift ) )
>>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>>>      else
>>>      {
>>>          delta = now - svc->load_last_update;
>>> +        if ( unlikely(delta < 0) )
>>> +        {
>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
> %"PRI_stime"\n",
>>> +                     __func__, now, svc->load_last_update);
>>> +            delta = 0;
>>> +        }
>>>
>>>          svc->avgload =
>>>              ( ( delta * ( (unsigned long long)vcpu_load << 
> prv->load_window_shift ) )
>>
>> Do the absolute times really matter here? I.e. wouldn't it be more
>> useful to simply log the value of delta?
>>
>> Also, may I ask you to use the L modifier in favor of the ll one, for
>> being one byte shorter (and hence, even if just very slightly,
>> reducing both image size and cache pressure)?
>>
>> And finally, instead of logging function names, could the two
>> messages be made distinguishable by other means resulting in less
>> data issued to the log (and potentially needing transmission over
>> a slow serial line)?
> 
> The reason this is under a "d2printk" is because it's really only to
> help developers in debugging.  In-tree this warning isn't even on with
> debug=y; you have to go to the top of the file and change the #define
> to make it even exist.
> 
> Given that, I don't think the quibbles over the code size or the
> length of what's logged really matter.  I think we should just take it
> as it is.
> 
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>

Oh, okay - I agree on those two parts then. But the question on the
usefulness of absolute vs relative times remains.

Jan



* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-07-07  7:29       ` Jan Beulich
@ 2016-07-07  9:09         ` George Dunlap
  2016-07-07  9:18           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-07  9:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Anshul Makkar, Dario Faggioli, David Vrabel

On 07/07/16 08:29, Jan Beulich wrote:
>>>> On 06.07.16 at 18:21, <george.dunlap@citrix.com> wrote:
>> On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>>>> This really should not happen, but:
>>>>  1. it does happen! Investigation is ongoing here:
>>>>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html 
>>>>  2. even when 1 will be fixed it makes sense and is easy enough
>>>>     to have a 'safety catch' for it.
>>>>
>>>> The reason why this is particularly bad for Credit2 is that
>>>> negative values of delta mean out of scale high load (because
>>>> of the conversion to unsigned). This, for instance in the
>>>> case of runqueue load, results in a runqueue having its load
>>>> updated to values of the order of 10000% or so, which in turn
>>>> means that the load balancer will migrate everything off from
>>>> the pCPUs in the runqueue, and leave them idle until the load
>>>> gets back to something sane... which may indeed take a while!
>>>>
>>>> This is not a fix for the problem of time going backwards. In
>>>> fact, if that happens a lot, load tracking accuracy is still
>>>> compromised, but at least the effect is a lot less bad than
>>>> before.
>>>>
>>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>>> ---
>>>> Cc: George Dunlap <george.dunlap@citrix.com>
>>>> Cc: Anshul Makkar <anshul.makkar@citrix.com>
>>>> Cc: David Vrabel <david.vrabel@citrix.com>
>>>> ---
>>>>  xen/common/sched_credit2.c |   12 ++++++++++++
>>>>  1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>>>> index 50f8dfd..b73d034 100644
>>>> --- a/xen/common/sched_credit2.c
>>>> +++ b/xen/common/sched_credit2.c
>>>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>>>>      else
>>>>      {
>>>>          delta = now - rqd->load_last_update;
>>>> +        if ( unlikely(delta < 0) )
>>>> +        {
>>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
>> %"PRI_stime"\n",
>>>> +                     __func__, now, rqd->load_last_update);
>>>> +            delta = 0;
>>>> +        }
>>>>
>>>>          rqd->avgload =
>>>>              ( ( delta * ( (unsigned long long)rqd->load << 
>> prv->load_window_shift ) )
>>>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>>>>      else
>>>>      {
>>>>          delta = now - svc->load_last_update;
>>>> +        if ( unlikely(delta < 0) )
>>>> +        {
>>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
>> %"PRI_stime"\n",
>>>> +                     __func__, now, svc->load_last_update);
>>>> +            delta = 0;
>>>> +        }
>>>>
>>>>          svc->avgload =
>>>>              ( ( delta * ( (unsigned long long)vcpu_load << 
>> prv->load_window_shift ) )
>>>
>>> Do the absolute times really matter here? I.e. wouldn't it be more
>>> useful to simply log the value of delta?
>>>
>>> Also, may I ask you to use the L modifier in favor of the ll one, for
>>> being one byte shorter (and hence, even if just very slightly,
>>> reducing both image size and cache pressure)?
>>>
>>> And finally, instead of logging function names, could the two
>>> messages be made distinguishable by other means resulting in less
>>> data issued to the log (and potentially needing transmission over
>>> a slow serial line)?
>>
>> The reason this is under a "d2printk" is because it's really only to
>> help developers in debugging.  In-tree this warning isn't even on with
>> debug=y; you have to go to the top of the file and change the #define
>> to make it even exist.
>>
>> Given that, I don't think the quibbles over the code size or the
>> length of what's logged really matter.  I think we should just take it
>> as it is.
>>
>> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
> 
> Oh, okay - I agree on those two parts then. But the question on the
> usefulness of absolute vs relative times remains.

What is the usefulness of printing the relative time?  If you have the
absolute time, you have some chance of catching mistakes like one of the
times being 0 or something like that; or of being able to correlate it
with another time printed somewhere else (for instance, a timestamp from
a trace record).

In any case, I think it's really a bike shed.  Dario is the one who has
used this error message to find an actual bug recently, so I'll let him
decide what he thinks the most useful thing to print here is.

 -George




* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-07-07  9:09         ` George Dunlap
@ 2016-07-07  9:18           ` Jan Beulich
  2016-07-07 10:53             ` Dario Faggioli
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2016-07-07  9:18 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, Dario Faggioli, David Vrabel

>>> On 07.07.16 at 11:09, <george.dunlap@citrix.com> wrote:
> On 07/07/16 08:29, Jan Beulich wrote:
>>>>> On 06.07.16 at 18:21, <george.dunlap@citrix.com> wrote:
>>> On Mon, Jun 20, 2016 at 9:02 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>>>>> This really should not happen, but:
>>>>>  1. it does happen! Investigation is ongoing here:
>>>>>     http://lists.xen.org/archives/html/xen-devel/2016-06/msg00922.html 
>>>>>  2. even when 1 will be fixed it makes sense and is easy enough
>>>>>     to have a 'safety catch' for it.
>>>>>
>>>>> The reason why this is particularly bad for Credit2 is that
>>>>> negative values of delta mean out of scale high load (because
>>>>> of the conversion to unsigned). This, for instance in the
>>>>> case of runqueue load, results in a runqueue having its load
>>>>> updated to values of the order of 10000% or so, which in turn
>>>>> means that the load balancer will migrate everything off from
>>>>> the pCPUs in the runqueue, and leave them idle until the load
>>>>> gets back to something sane... which may indeed take a while!
>>>>>
>>>>> This is not a fix for the problem of time going backwards. In
>>>>> fact, if that happens a lot, load tracking accuracy is still
>>>>> compromised, but at least the effect is a lot less bad than
>>>>> before.
>>>>>
>>>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>>>> ---
>>>>> Cc: George Dunlap <george.dunlap@citrix.com>
>>>>> Cc: Anshul Makkar <anshul.makkar@citrix.com>
>>>>> Cc: David Vrabel <david.vrabel@citrix.com>
>>>>> ---
>>>>>  xen/common/sched_credit2.c |   12 ++++++++++++
>>>>>  1 file changed, 12 insertions(+)
>>>>>
>>>>> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
>>>>> index 50f8dfd..b73d034 100644
>>>>> --- a/xen/common/sched_credit2.c
>>>>> +++ b/xen/common/sched_credit2.c
>>>>> @@ -404,6 +404,12 @@ __update_runq_load(const struct scheduler *ops,
>>>>>      else
>>>>>      {
>>>>>          delta = now - rqd->load_last_update;
>>>>> +        if ( unlikely(delta < 0) )
>>>>> +        {
>>>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
>>> %"PRI_stime"\n",
>>>>> +                     __func__, now, rqd->load_last_update);
>>>>> +            delta = 0;
>>>>> +        }
>>>>>
>>>>>          rqd->avgload =
>>>>>              ( ( delta * ( (unsigned long long)rqd->load << 
>>> prv->load_window_shift ) )
>>>>> @@ -455,6 +461,12 @@ __update_svc_load(const struct scheduler *ops,
>>>>>      else
>>>>>      {
>>>>>          delta = now - svc->load_last_update;
>>>>> +        if ( unlikely(delta < 0) )
>>>>> +        {
>>>>> +            d2printk("%s: Time went backwards? now %"PRI_stime" llu 
>>> %"PRI_stime"\n",
>>>>> +                     __func__, now, svc->load_last_update);
>>>>> +            delta = 0;
>>>>> +        }
>>>>>
>>>>>          svc->avgload =
>>>>>              ( ( delta * ( (unsigned long long)vcpu_load << 
>>> prv->load_window_shift ) )
>>>>
>>>> Do the absolute times really matter here? I.e. wouldn't it be more
>>>> useful to simply log the value of delta?
>>>>
>>>> Also, may I ask you to use the L modifier in favor of the ll one, for
>>>> being one byte shorter (and hence, even if just very slightly,
>>>> reducing both image size and cache pressure)?
>>>>
>>>> And finally, instead of logging function names, could the two
>>>> messages be made distinguishable by other means resulting in less
>>>> data issued to the log (and potentially needing transmission over
>>>> a slow serial line)?
>>>
>>> The reason this is under a "d2printk" is because it's really only to
>>> help developers in debugging.  In-tree this warning isn't even on with
>>> debug=y; you have to go to the top of the file and change the #define
>>> to make it even exist.
>>>
>>> Given that, I don't think the quibbles over the code size or the
>>> length of what's logged really matter.  I think we should just take it
>>> as it is.
>>>
>>> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
>> 
>> Oh, okay - I agree on those two parts then. But the question on the
>> usefulness of absolute vs relative times remains.
> 
> What is the usefulness of printing the relative time?  If you have the
> absolute time, you have some chance of catching mistakes like one of the
> times being 0 or something like that; or of being able to correlate it
> with another time printed somewhere else (for instance, a timestamp from
> a trace record).

Well, having had to deal with time going backwards elsewhere
(both in the past and recently) I have always found it cumbersome
to work out the delta from huge (far into the billions) absolute
numbers, and therefore consider logging the delta more useful -
apart from seeing at the first glance whether a particular delta is
positive or negative, this also allows at almost the first glance to
at least recognize the magnitude of the difference. But anyway ...
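
Concretely, the alternative would amount to something like this
(hypothetical message, reusing the existing format macro):

    d2printk("Time went backwards? delta %"PRI_stime"\n", delta);

trading the ability to correlate with other timestamps for a value
that is readable at a glance.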

> In any case, I think it's really a bike shed.  Dario is the one who has
> used this error message to find an actual bug recently, so I'll let him
> decide what he thinks the most useful thing to print here is.

... fine with me; it was just a question after all.

Jan


* Re: [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations
  2016-06-17 23:12 ` [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
@ 2016-07-07  9:45   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07  9:45 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> as all the accesses to both the masks and the flags are
> serialized by the runqueues locks already.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

This one doesn't apply without 10/19, so will have to be resent.

 -George

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit2.c |   48 ++++++++++++++++++++++----------------------
>  1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 230a512..2ca63ae 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -909,7 +909,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>                    sizeof(d),
>                    (unsigned char *)&d);
>      }
> -    cpumask_set_cpu(ipid, &rqd->tickled);
> +    __cpumask_set_cpu(ipid, &rqd->tickled);
>      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>  }
>
> @@ -1277,7 +1277,7 @@ csched2_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
>          __runq_remove(svc);
>      }
>      else if ( svc->flags & CSFLAG_delayed_runq_add )
> -        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
> +        __clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
>  }
>
>  static void
> @@ -1314,7 +1314,7 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>       * after the context has been saved. */
>      if ( unlikely(svc->flags & CSFLAG_scheduled) )
>      {
> -        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
> +        __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
>          goto out;
>      }
>
> @@ -1347,7 +1347,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
>      BUG_ON( !is_idle_vcpu(vc) && svc->rqd != RQD(ops, vc->processor));
>
>      /* This vcpu is now eligible to be put on the runqueue again */
> -    clear_bit(__CSFLAG_scheduled, &svc->flags);
> +    __clear_bit(__CSFLAG_scheduled, &svc->flags);
>
>      /* If someone wants it on the runqueue, put it there. */
>      /*
> @@ -1357,7 +1357,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
>       * it seems a bit pointless; especially as we have plenty of
>       * bits free.
>       */
> -    if ( test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
> +    if ( __test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags)
>           && likely(vcpu_runnable(vc)) )
>      {
>          BUG_ON(__vcpu_on_runq(svc));
> @@ -1399,10 +1399,10 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>
>      if ( !spin_trylock(&prv->lock) )
>      {
> -        if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
> +        if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
>          {
>              d2printk("%pv -\n", svc->vcpu);
> -            clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
> +            __clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>          }
>
>          return get_fallback_cpu(svc);
> @@ -1410,7 +1410,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>
>      /* First check to see if we're here because someone else suggested a place
>       * for us to move. */
> -    if ( test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
> +    if ( __test_and_clear_bit(__CSFLAG_runq_migrate_request, &svc->flags) )
>      {
>          if ( unlikely(svc->migrate_rqd->id < 0) )
>          {
> @@ -1545,8 +1545,8 @@ static void migrate(const struct scheduler *ops,
>          d2printk("%pv %d-%d a\n", svc->vcpu, svc->rqd->id, trqd->id);
>          /* It's running; mark it to migrate. */
>          svc->migrate_rqd = trqd;
> -        set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
> -        set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
> +        __set_bit(_VPF_migrating, &svc->vcpu->pause_flags);
> +        __set_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>          SCHED_STAT_CRANK(migrate_requested);
>      }
>      else
> @@ -2079,7 +2079,7 @@ csched2_schedule(
>
>      /* Clear "tickled" bit now that we've been scheduled */
>      if ( cpumask_test_cpu(cpu, &rqd->tickled) )
> -        cpumask_clear_cpu(cpu, &rqd->tickled);
> +        __cpumask_clear_cpu(cpu, &rqd->tickled);
>
>      /* Update credits */
>      burn_credits(rqd, scurr, now);
> @@ -2115,7 +2115,7 @@ csched2_schedule(
>      if ( snext != scurr
>           && !is_idle_vcpu(scurr->vcpu)
>           && vcpu_runnable(current) )
> -        set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
> +        __set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
>
>      ret.migrated = 0;
>
> @@ -2134,7 +2134,7 @@ csched2_schedule(
>                         cpu, snext->vcpu, snext->vcpu->processor, scurr->vcpu);
>                  BUG();
>              }
> -            set_bit(__CSFLAG_scheduled, &snext->flags);
> +            __set_bit(__CSFLAG_scheduled, &snext->flags);
>          }
>
>          /* Check for the reset condition */
> @@ -2146,7 +2146,7 @@ csched2_schedule(
>
>          /* Clear the idle mask if necessary */
>          if ( cpumask_test_cpu(cpu, &rqd->idle) )
> -            cpumask_clear_cpu(cpu, &rqd->idle);
> +            __cpumask_clear_cpu(cpu, &rqd->idle);
>
>          snext->start_time = now;
>
> @@ -2168,10 +2168,10 @@ csched2_schedule(
>          if ( tasklet_work_scheduled )
>          {
>              if ( cpumask_test_cpu(cpu, &rqd->idle) )
> -                cpumask_clear_cpu(cpu, &rqd->idle);
> +                __cpumask_clear_cpu(cpu, &rqd->idle);
>          }
>          else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
> -            cpumask_set_cpu(cpu, &rqd->idle);
> +            __cpumask_set_cpu(cpu, &rqd->idle);
>          /* Make sure avgload gets updated periodically even
>           * if there's no activity */
>          update_load(ops, rqd, NULL, 0, now);
> @@ -2347,7 +2347,7 @@ static void activate_runqueue(struct csched2_private *prv, int rqi)
>      INIT_LIST_HEAD(&rqd->runq);
>      spin_lock_init(&rqd->lock);
>
> -    cpumask_set_cpu(rqi, &prv->active_queues);
> +    __cpumask_set_cpu(rqi, &prv->active_queues);
>  }
>
>  static void deactivate_runqueue(struct csched2_private *prv, int rqi)
> @@ -2360,7 +2360,7 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>
>      rqd->id = -1;
>
> -    cpumask_clear_cpu(rqi, &prv->active_queues);
> +    __cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>
>  static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
> @@ -2449,9 +2449,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
>      /* Set the runqueue map */
>      prv->runq_map[cpu] = rqi;
>
> -    cpumask_set_cpu(cpu, &rqd->idle);
> -    cpumask_set_cpu(cpu, &rqd->active);
> -    cpumask_set_cpu(cpu, &prv->initialized);
> +    __cpumask_set_cpu(cpu, &rqd->idle);
> +    __cpumask_set_cpu(cpu, &rqd->active);
> +    __cpumask_set_cpu(cpu, &prv->initialized);
>
>      return rqi;
>  }
> @@ -2556,8 +2556,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>
>      printk("Removing cpu %d from runqueue %d\n", cpu, rqi);
>
> -    cpumask_clear_cpu(cpu, &rqd->idle);
> -    cpumask_clear_cpu(cpu, &rqd->active);
> +    __cpumask_clear_cpu(cpu, &rqd->idle);
> +    __cpumask_clear_cpu(cpu, &rqd->active);
>
>      if ( cpumask_empty(&rqd->active) )
>      {
> @@ -2567,7 +2567,7 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>
>      spin_unlock(&rqd->lock);
>
> -    cpumask_clear_cpu(cpu, &prv->initialized);
> +    __cpumask_clear_cpu(cpu, &prv->initialized);
>
>      spin_unlock_irqrestore(&prv->lock, flags);
>
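
The pattern at work throughout the hunk above: the double-underscore
variants do the same read-modify-write as their atomic counterparts,
just without the LOCK prefix, so they are only safe when something
else serializes the accesses. A sketch of the reasoning (lock and
flag names as in the patch):

    /* Safe only because, as the changelog states, every access to
     * svc->flags happens with the vCPU's runqueue lock held:        */
    spin_lock(&rqd->lock);
    __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);  /* plain RMW */
    spin_unlock(&rqd->lock);

    /* Without such a lock, the atomic form would still be needed:   */
    set_bit(__CSFLAG_delayed_runq_add, &svc->flags);    /* LOCKed RMW */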

* Re: [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-06-20  7:48   ` Jan Beulich
@ 2016-07-07 10:11     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap


On Mon, 2016-06-20 at 01:48 -0600, Jan Beulich wrote:
> > > > On 18.06.16 at 01:11, <dario.faggioli@citrix.com> wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1819,24 +1819,24 @@ csched_schedule(
> >      else
> >          snext = csched_load_balance(prv, cpu, snext,
> > &ret.migrated);
> >  
> > + out:
> >      /*
> >       * Update idlers mask if necessary. When we're idling, other
> > CPUs
> >       * will tickle us when they get extra work.
> >       */
> > -    if ( snext->pri == CSCHED_PRI_IDLE )
> > +    if ( tasklet_work_scheduled || snext->pri != CSCHED_PRI_IDLE )
> >      {
> > -        if ( !cpumask_test_cpu(cpu, prv->idlers) )
> > -            cpumask_set_cpu(cpu, prv->idlers);
> > +        if ( cpumask_test_cpu(cpu, prv->idlers) )
> > +            cpumask_clear_cpu(cpu, prv->idlers);
> >      }
> > -    else if ( cpumask_test_cpu(cpu, prv->idlers) )
> > +    else if ( !cpumask_test_cpu(cpu, prv->idlers) )
> >      {
> > -        cpumask_clear_cpu(cpu, prv->idlers);
> > +        cpumask_set_cpu(cpu, prv->idlers);
> >      }
> Is there a reason for this extra code churn? It would seem to me
> that the change could be just the "out" label movement and
> adjustment to the first if:
> 
>    if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
> 
> Am I overlooking something?
> 
No, you are not. It indeed can be done as you suggest, and it's better,
so I'll go for it.

Thanks and regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone.
  2016-07-06 15:41   ` George Dunlap
@ 2016-07-07 10:25     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:25 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel


On Wed, 2016-07-06 at 16:41 +0100, George Dunlap wrote:
> On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
> 
> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > index a38a63d..a6645a2 100644
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1819,24 +1819,24 @@ csched_schedule(
> >      else
> >          snext = csched_load_balance(prv, cpu, snext,
> > &ret.migrated);
> > 
> > + out:
> Sorry if I'm being a bit dense, but why is this moving up here?  As
> far as I can tell the only time the 'out' label will be used, the
> 'idler' status of the cpu cannot change.
> 
Mmm... I think you're right. If we go to out:, we are running someone
(so we are not idle), and we will continue to do so (so we should not
be marked as idle).

I seem to recall having seen something because of which out needed
bubbling up, but I can't remember what it was right now, so I'll
leave it alone.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu
  2016-07-06 16:02   ` George Dunlap
@ 2016-07-07 10:26     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:26 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel


On Wed, 2016-07-06 at 17:02 +0100, George Dunlap wrote:
> On Sat, Jun 18, 2016 at 12:11 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > 
> > In fact, it has the same signature of csched2_cpu_pick,
> > which also is its uniqe caller.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
> 
> And queued, fixing up the spelling of "unique".
> 
Thanks! :-)
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held
  2016-07-06 16:10     ` George Dunlap
@ 2016-07-07 10:28       ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:28 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich; +Cc: xen-devel, Anshul Makkar, David Vrabel


On Wed, 2016-07-06 at 17:10 +0100, George Dunlap wrote:
> On Mon, Jun 20, 2016 at 8:56 AM, Jan Beulich <JBeulich@suse.com>
> wrote:
> > 
> > > > > On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> > > Yet another situation very similar to 779511f4bf5ae
> > > ("sched: avoid races on time values read from NOW()").
> > > 
> > > In fact, when more than one runqueue is involved, we need
> > > to make sure that the following does not happen:
> > >  1. take the lock of 1st runq
> > >  2. now = NOW()
> > >  3. take the lock of 2nd runq
> > >  4. use now
> > > 
> > > as, if we have to wait at step 3, the value in now may
> > > be stale when we get to use it at step 4.
> > Is this really meaningful here? We're talking of trylocks, which
> > don't
> > incur any delay other than the latency of the LOCKed (on x86)
> > instruction to determine lock availability.
> This makes sense to me -- Dario?
> 
Yes, I think this patch is, after all, not really necessary.
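
Spelled out, the hazard the patch was guarding against only exists
with blocking locks (use() below is just a placeholder):

    spin_lock(&rqd1->lock);     /* 1. take the lock of 1st runq        */
    now = NOW();                /* 2. sample the clock                 */
    spin_lock(&rqd2->lock);     /* 3. may spin here for a long time... */
    use(now);                   /* 4. ...so 'now' can be stale by now  */

With spin_trylock() at step 3 the call returns immediately whether or
not the lock was available, which is exactly Jan's point.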

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards
  2016-07-07  9:18           ` Jan Beulich
@ 2016-07-07 10:53             ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:53 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel


On Thu, 2016-07-07 at 03:18 -0600, Jan Beulich wrote:
> > > > On 07.07.16 at 11:09, <george.dunlap@citrix.com> wrote:
> > > Oh, okay - I agree on those two parts then. But the question on
> > > the
> > > usefulness of absolute vs relative times remains.
> > What is the usefulness of printing the relative time?  If you have
> > the
> > absolute time, you have some chance of catching mistakes like one
> > of the
> > times being 0 or something like that; or of being able to correlate
> > it
> > with another time printed somewhere else (for instance, a timestamp
> > from
> > a trace record).
> Well, having had to deal with time going backwards elsewhere
> (both in the past and recently) I have always found it cumbersome
> to work out the delta from huge (far into the billions) absolute
> numbers, and therefore consider logging the delta more useful -
>
And, in general, I agree with you. In this case, however...

> apart from seeing at the first glance whether a particular delta is
> positive or negative, this also allows at almost the first glance to
> at least recognize the magnitude of the difference. But anyway ...
> 
True, for the magnitude part. But in this case we only log anything
at all if delta is negative, so the very fact that something shows up
in the log already tells us the sign of the delta. And...

> > In any case, I think it's really a bike shed.  Dario is the one who
> > has
> > used this error message to find an actual bug recently, so I'll let
> > him
> > decide what he thinks the most useful thing to print here is.
> ... fine with me; it was just a question after all.
> 
...I caught two bugs thanks to this. For the (more recent) one, about
lack of monotonicity, having two absolute values or a delta would have
made no difference. For the other one (the fact that we were using
stale time samples, because we were calling NOW() and then
[potentially] waiting to acquire a lock), seeing repeated occurrences
of the same absolute value of now did indeed help both in identifying
and in debugging the issue, while having just a delta wouldn't have
been as effective.

And I've also done what George refers to, i.e., looked for the exact
value printed in the trace record, to get an idea of what was
happening, and again having the absolutes proved useful for this.

So, yes, I think it's actually fine to leave the message as it is.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 13/19] xen: credit2: make the code less experimental
  2016-06-20  8:13   ` Jan Beulich
@ 2016-07-07 10:59     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 10:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Anshul Makkar, David Vrabel, George Dunlap


On Mon, 2016-06-20 at 02:13 -0600, Jan Beulich wrote:
> > > > On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
> > @@ -680,8 +677,8 @@ __update_svc_load(const struct scheduler *ops,
> >          delta = now - svc->load_last_update;
> >          if ( unlikely(delta < 0) )
> >          {
> > -            d2printk("%s: Time went backwards? now %"PRI_stime"
> > llu %"PRI_stime"\n",
> > -                     __func__, now, svc->load_last_update);
> > +            printk("WARNING: %s: Time went backwards? now
> > %"PRI_stime" llu %"PRI_stime"\n",
> > +                   __func__, now, svc->load_last_update);
> >              delta = 0;
> >          }
> >  
> With these now becoming non-debugging ones - are they useful
> _every_ time such an event occurs? I.e. wouldn't it be better to
> e.g. only log new high watermark values?
>
Actually, I may want to reconsider this specific hunk (and the other
similar ones using printk instead of d2printk for 'time going
backwards' debug lines).

It's useful, but I'm not sure I want it printing all the time.

So hold off committing this patch, please (this is probably not
necessary to say, given the issue below, but just in case...)

> > @@ -2580,15 +2583,20 @@ csched2_init(struct scheduler *ops)
> >      int i;
> >      struct csched2_private *prv;
> >  
> > -    printk("Initializing Credit2 scheduler\n" \
> > -           " WARNING: This is experimental software in
> > development.\n" \
> > +    printk("Initializing Credit2 scheduler\n");
> > +    printk(" WARNING: This is experimental software in
> > development.\n" \
> >             " Use at your own risk.\n");
> >  
> > -    printk(" load_precision_shift: %d\n",
> > opt_load_precision_shift);
> > -    printk(" load_window_shift: %d\n", opt_load_window_shift);
> > -    printk(" underload_balance_tolerance: %d\n",
> > opt_underload_balance_tolerance);
> > -    printk(" overload_balance_tolerance: %d\n",
> > opt_overload_balance_tolerance);
> > -    printk(" runqueues arrangement: %s\n",
> > opt_runqueue_str[opt_runqueue]);
> > +    printk(XENLOG_INFO " load_precision_shift: %d\n"
> > +           " load_window_shift: %d\n"
> > +           " underload_balance_tolerance: %d\n"
> > +           " overload_balance_tolerance: %d\n"
> > +           " runqueues arrangement: %s\n",
> > +           opt_load_precision_shift,
> > +           opt_load_window_shift,
> > +           opt_underload_balance_tolerance,
> > +           opt_overload_balance_tolerance,
> > +           opt_runqueue_str[opt_runqueue]);
> Note that this results in only the first line getting logged at info
> level;
> all others will get the default logging level (i.e. warning)
> assigned. IOW
> I think you want to repeat XENLOG_INFO a couple of times.
> 
You know what, I did not notice that, sorry! Yes, I'll fix this.
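
Presumably the fix will look something like this, assuming (as the
remark above implies) that the console code honors a level marker at
the start of each line:

    printk(XENLOG_INFO " load_precision_shift: %d\n"
           XENLOG_INFO " load_window_shift: %d\n"
           XENLOG_INFO " underload_balance_tolerance: %d\n"
           XENLOG_INFO " overload_balance_tolerance: %d\n"
           XENLOG_INFO " runqueues arrangement: %s\n",
           opt_load_precision_shift,
           opt_load_window_shift,
           opt_underload_balance_tolerance,
           opt_overload_balance_tolerance,
           opt_runqueue_str[opt_runqueue]);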

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 13/19] xen: credit2: make the code less experimental
  2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
  2016-06-20  8:13   ` Jan Beulich
@ 2016-07-07 15:17   ` George Dunlap
  2016-07-07 16:43     ` Dario Faggioli
  1 sibling, 1 reply; 64+ messages in thread
From: George Dunlap @ 2016-07-07 15:17 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Mainly, almost all of the BUG_ON-s can be converted into
> ASSERTS, and the debug printk either removed or turned
> into tracing.
>
> The 'TODO' list, in a comment at the beginning of the file,
> was also stale, so remove items that were still there but
> are actually done.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Overall looks good.  A couple of things...

> @@ -680,8 +677,8 @@ __update_svc_load(const struct scheduler *ops,
>          delta = now - svc->load_last_update;
>          if ( unlikely(delta < 0) )
>          {
> -            d2printk("%s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> -                     __func__, now, svc->load_last_update);
> +            printk("WARNING: %s: Time went backwards? now %"PRI_stime" llu %"PRI_stime"\n",
> +                   __func__, now, svc->load_last_update);

Hmm, I'm afraid this now makes valid all of Jan's comments on patch
7, which I had argued against on the grounds that it was just a
debugging message.

> @@ -1540,9 +1536,26 @@ static void migrate(const struct scheduler *ops,
>                      struct csched2_runqueue_data *trqd,
>                      s_time_t now)
>  {
> -    if ( svc->flags & CSFLAG_scheduled )
> +    bool_t running = svc->flags & CSFLAG_scheduled;
> +    bool_t on_runq = __vcpu_on_runq(svc);

What's the point of having these variables here?  AFAICS 'running' is
used exactly once; and on_runq is only used inside the original else {
} clause where it was before.

> @@ -2069,12 +2076,13 @@ csched2_schedule(
>                  }
>              }
>          }
> -        printk("%s: pcpu %d rq %d, but scurr %pv assigned to "
> +        printk("DEBUG: %s: pcpu %d rq %d, but scurr %pv assigned to "
>                 "pcpu %d rq %d!\n",
>                 __func__,
>                 cpu, this_rqi,
>                 scurr->vcpu, scurr->vcpu->processor, other_rqi);
>      }
> +#endif

Do we need this path anymore? I think it was just there to help
debugging; but all this should have been sorted out a long time ago.
:-)

Thanks,
 -George


* Re: [PATCH 14/19] xen: credit2: add yet some more tracing
  2016-06-20  8:15   ` Jan Beulich
@ 2016-07-07 15:34     ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07 15:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Anshul Makkar, David Vrabel

On Mon, Jun 20, 2016 at 9:15 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 18.06.16 at 01:12, <dario.faggioli@citrix.com> wrote:
>> @@ -1484,6 +1489,23 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>>  out_up:
>>      spin_unlock(&prv->lock);
>>
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            uint64_t b_avgload;
>> +            unsigned vcpu:16, dom:16;
>> +            unsigned rq_id:16, new_cpu:16;
>> +       } d;
>> +        d.b_avgload = prv->rqd[min_rqi].b_avgload;
>> +        d.dom = vc->domain->domain_id;
>> +        d.vcpu = vc->vcpu_id;
>> +        d.rq_id = c2r(ops, new_cpu);
>> +        d.new_cpu = new_cpu;
>
> I guess this follows pre-existing style, but it would seem more natural
> to me for the variable to have an initializer instead of this series of
> assignments.

Well that doesn't actually save you that much typing, and I think it's
probably (slightly) less easy to read.  But the biggest thing at this
point is that it's inconsistent with what's there. :-)

 -George
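
For comparison, the initializer form being suggested would look like
this (same fields as the quoted trace block):

        struct {
            uint64_t b_avgload;
            unsigned vcpu:16, dom:16;
            unsigned rq_id:16, new_cpu:16;
        } d = {
            .b_avgload = prv->rqd[min_rqi].b_avgload,
            .dom       = vc->domain->domain_id,
            .vcpu      = vc->vcpu_id,
            .rq_id     = c2r(ops, new_cpu),
            .new_cpu   = new_cpu,
        };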


* Re: [PATCH 14/19] xen: credit2: add yet some more tracing
  2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
  2016-06-20  8:15   ` Jan Beulich
@ 2016-07-07 15:34   ` George Dunlap
  1 sibling, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07 15:34 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> (and fix the style of two labels as well.)
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  xen/common/sched_credit2.c |   58 +++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 54 insertions(+), 4 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index ba3a78a..e9f3f13 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -46,6 +46,9 @@
>  #define TRC_CSCHED2_TICKLE_NEW       TRC_SCHED_CLASS_EVT(CSCHED2, 13)
>  #define TRC_CSCHED2_RUNQ_MAX_WEIGHT  TRC_SCHED_CLASS_EVT(CSCHED2, 14)
>  #define TRC_CSCHED2_MIGRATE          TRC_SCHED_CLASS_EVT(CSCHED2, 15)
> +#define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
> +#define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
> +#define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
>
>  /*
>   * WARNING: This is still in an experimental phase.  Status and work can be found at the
> @@ -709,6 +712,8 @@ update_load(const struct scheduler *ops,
>              struct csched2_runqueue_data *rqd,
>              struct csched2_vcpu *svc, int change, s_time_t now)
>  {
> +    trace_var(TRC_CSCHED2_UPDATE_LOAD, 1, 0,  NULL);
> +
>      __update_runq_load(ops, rqd, change, now);
>      if ( svc )
>          __update_svc_load(ops, svc, change, now);
> @@ -1484,6 +1489,23 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>  out_up:
>      spin_unlock(&prv->lock);
>
> +    /* TRACE */
> +    {
> +        struct {
> +            uint64_t b_avgload;
> +            unsigned vcpu:16, dom:16;
> +            unsigned rq_id:16, new_cpu:16;
> +       } d;
> +        d.b_avgload = prv->rqd[min_rqi].b_avgload;
> +        d.dom = vc->domain->domain_id;
> +        d.vcpu = vc->vcpu_id;
> +        d.rq_id = c2r(ops, new_cpu);
> +        d.new_cpu = new_cpu;
> +        trace_var(TRC_CSCHED2_PICKED_CPU, 1,
> +                  sizeof(d),
> +                  (unsigned char *)&d);
> +    }
> +
>      return new_cpu;
>  }
>
> @@ -1611,7 +1633,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>      bool_t inner_load_updated = 0;
>
>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
> -
> +
>      /*
>       * Basic algorithm: Push, pull, or swap.
>       * - Find the runqueue with the furthest load distance
> @@ -1677,6 +1699,20 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>          if ( i > cpus_max )
>              cpus_max = i;
>
> +        /* TRACE */
> +        {
> +            struct {
> +                unsigned lrq_id:16, orq_id:16;
> +                unsigned load_delta;
> +            } d;
> +            d.lrq_id = st.lrqd->id;
> +            d.orq_id = st.orqd->id;
> +            d.load_delta = st.load_delta;
> +            trace_var(TRC_CSCHED2_LOAD_CHECK, 1,
> +                      sizeof(d),
> +                      (unsigned char *)&d);
> +        }
> +
>          /*
>           * If we're under 100% capacaty, only shift if load difference
>           * is > 1.  otherwise, shift if under 12.5%
> @@ -1705,6 +1741,21 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>      if ( unlikely(st.orqd->id < 0) )
>          goto out_up;
>
> +    /* TRACE */
> +    {
> +        struct {
> +            uint64_t lb_avgload, ob_avgload;
> +            unsigned lrq_id:16, orq_id:16;
> +        } d;
> +        d.lrq_id = st.lrqd->id;
> +        d.lb_avgload = st.lrqd->b_avgload;
> +        d.orq_id = st.orqd->id;
> +        d.ob_avgload = st.orqd->b_avgload;
> +        trace_var(TRC_CSCHED2_LOAD_BALANCE, 1,
> +                  sizeof(d),
> +                  (unsigned char *)&d);
> +    }
> +
>      now = NOW();
>
>      /* Look for "swap" which gives the best load average
> @@ -1756,10 +1807,9 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>      if ( st.best_pull_svc )
>          migrate(ops, st.best_pull_svc, st.lrqd, now);
>
> -out_up:
> + out_up:
>      spin_unlock(&st.orqd->lock);
> -
> -out:
> + out:
>      return;
>  }
>
>

* Re: [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled
  2016-06-17 23:13 ` [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
@ 2016-07-07 15:37   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07 15:37 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:13 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

> @@ -1696,10 +1702,8 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>
>          cpus_max = cpumask_weight(&st.lrqd->active);
>          i = cpumask_weight(&st.orqd->active);
> -        if ( i > cpus_max )
> -            cpus_max = i;

What is this about?

Other than that, looks good, thanks.

 -George
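
For readers following along, the technique in the patch title amounts
to wrapping each trace block along these lines (a sketch; tb_init_done
is the flag saying whether tracing is active):

        if ( unlikely(tb_init_done) )
        {
            struct {
                unsigned lrq_id:16, orq_id:16;
                unsigned load_delta;
            } d;
            d.lrq_id = st.lrqd->id;
            d.orq_id = st.orqd->id;
            d.load_delta = st.load_delta;
            __trace_var(TRC_CSCHED2_LOAD_CHECK, 1, sizeof(d), &d);
        }

so the struct is only marshalled when someone is actually tracing.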


* Re: [PATCH 16/19] tools: tracing: deal with new Credit2 events
  2016-06-17 23:13 ` [PATCH 16/19] tools: tracing: deal with new Credit2 events Dario Faggioli
@ 2016-07-07 15:39   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07 15:39 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, Wei Liu, Ian Jackson

On Sat, Jun 18, 2016 at 12:13 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> more specifically, with: TICKLE_NEW, RUNQ_MAX_WEIGHT,
> MIGRATE, LOAD_CHECK, LOAD_BALANCE and PICKED_CPU, and
> in both xenalyze and formats (for xentrace_format).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>


* Re: [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock.
  2016-06-17 23:13 ` [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
@ 2016-07-07 16:00   ` George Dunlap
  0 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-07 16:00 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, George Dunlap, David Vrabel

On Sat, Jun 18, 2016 at 12:13 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, the data it protects only changes either at init-time,
> during cpupools manipulation, or when changing domains' weights.
> In all other cases (namely, load balancing, reading weights
> and status dumping), information is only read.
>
> Therefore, let the lock be a read/write one. This means there
> is no full serialization point for the whole scheduler and
> for all the pCPUs of the host any longer.
>
> This is particularly good for scalability (especially when doing
> load balancing).
>
> Also, update the high level description of the locking discipline,
> and take the chance for rewording it a little bit (as well as
> for adding a couple of locking related ASSERT()-s).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Looks good:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>
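
The resulting discipline, in a nutshell (a sketch using Xen's rwlock
primitives; the field names, prv->lock and prv->active_queues, follow
the Credit2 code, but treat the snippet as illustrative rather than as
the patch's actual call sites):

    /* Rare writers, e.g., a domain's weight being changed: */
    write_lock_irqsave(&prv->lock, flags);
    sdom->weight = new_weight;
    /* ...recompute the runqueues' max weight, etc. ... */
    write_unlock_irqrestore(&prv->lock, flags);

    /* Frequent readers, e.g., load balancing looking at the other
     * runqueues: many pCPUs can now do this in parallel. */
    read_lock(&prv->lock);
    for_each_cpu ( i, &prv->active_queues )
    {
        /* compare the various runqueues' b_avgload... */
    }
    read_unlock(&prv->lock);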



* Re: [PATCH 13/19] xen: credit2: make the code less experimental
  2016-07-07 15:17   ` George Dunlap
@ 2016-07-07 16:43     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 16:43 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Anshul Makkar, David Vrabel



On Thu, 2016-07-07 at 16:17 +0100, George Dunlap wrote:
> On Sat, Jun 18, 2016 at 12:12 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > @@ -680,8 +677,8 @@ __update_svc_load(const struct scheduler *ops,
> >          delta = now - svc->load_last_update;
> >          if ( unlikely(delta < 0) )
> >          {
> > -            d2printk("%s: Time went backwards? now %"PRI_stime"
> > llu %"PRI_stime"\n",
> > -                     __func__, now, svc->load_last_update);
> > +            printk("WARNING: %s: Time went backwards? now
> > %"PRI_stime" llu %"PRI_stime"\n",
> > +                   __func__, now, svc->load_last_update);
> Hmm, I'm afraid this now makes valid all of Jan's comments from patch
> 7, which I had argued against since it was just a debugging message.
> 
Yes, and on second thoughts --as I wrote myself earlier today-- I think
we actually want to keep these messages debug-only.

I'll make things that way when resending.

> > @@ -1540,9 +1536,26 @@ static void migrate(const struct scheduler
> > *ops,
> >                      struct csched2_runqueue_data *trqd,
> >                      s_time_t now)
> >  {
> > -    if ( svc->flags & CSFLAG_scheduled )
> > +    bool_t running = svc->flags & CSFLAG_scheduled;
> > +    bool_t on_runq = __vcpu_on_runq(svc);
> What's the point of having these variables here?  AFAICS 'running' is
> used exactly once; and on_runq is only used inside the original
> else {} clause where it was before.
> 
Mmm... not much indeed. AFAICR, it's a remnant from a previous version
of the patch. Sorry.

> > @@ -2069,12 +2076,13 @@ csched2_schedule(
> >                  }
> >              }
> >          }
> > -        printk("%s: pcpu %d rq %d, but scurr %pv assigned to "
> > +        printk("DEBUG: %s: pcpu %d rq %d, but scurr %pv assigned
> > to "
> >                 "pcpu %d rq %d!\n",
> >                 __func__,
> >                 cpu, this_rqi,
> >                 scurr->vcpu, scurr->vcpu->processor, other_rqi);
> >      }
> > +#endif
> Do we need this path anymore? I think it was just there to help
> debugging; but all this should have been sorted out a long time ago.
> :-)
> 
Right. I'm more than up for killing it.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
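
For reference, keeping such messages debug-only amounts to a
compile-time guard along the following lines (a sketch; the actual
macro in sched_credit2.c may be spelled differently, and the
CSCHED2_DEBUG switch here is hypothetical):

    /* Compiled out by default; enable (or wire to a proper debug
     * switch) only while chasing scheduler bugs. */
    #ifdef CSCHED2_DEBUG
    #define d2printk printk
    #else
    #define d2printk(x...) ((void)0)
    #endif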




* Re: [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu
  2016-06-21 10:42   ` David Vrabel
@ 2016-07-07 16:55     ` Dario Faggioli
  0 siblings, 0 replies; 64+ messages in thread
From: Dario Faggioli @ 2016-07-07 16:55 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Anshul Makkar, George Dunlap



On Tue, 2016-06-21 at 11:42 +0100, David Vrabel wrote:
> On 18/06/16 00:13, Dario Faggioli wrote:
> > 
> > because it is cheaper, and there is not much point in
> > randomizing which cpu gets selected anyway, as such
> > choice will be overridden shortly after, in runq_tickle().
> > 
> > If we really feel the need (e.g., we prove it worthwhile with
> > benchmarking), we can record the last cpu which was used
> > by csched2_cpu_pick() and migrate() in a per-runq variable,
> > and then use cpumask_cycle()... but this really does not
> > look necessary.
> Isn't this backwards?  Surely you should demonstrate that this change
> is
> beneficial before proposing it?
> 
Right. I think it's my fault having presented things this way.

This patch gets rid of something that is pure overhead, and getting rid
of overhead is, in general, a good thing.

There is only one possible situation under which we may actually end up
favouring lower pCPU IDs, and it is unlikely enough that it is, IMO, of
no concern.

But in any case, let's just drop this patch. I'm rerunning the
benchmarks anyway, so I'll consider doing a set of runs with and without
this patch, and check whether it makes any difference.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
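
The cpumask_cycle() alternative mentioned in the changelog would look
something like this (a sketch; rqd->last_picked is a hypothetical
field, not something the series actually adds, and `mask' stands for
the candidate mask csched2_cpu_pick() has already computed):

    /* Remember, per runqueue, the last pCPU handed out, and start the
     * next search right after it, so ties do not always resolve to
     * the lowest CPU ID in the candidate mask. */
    new_cpu = cpumask_cycle(rqd->last_picked, &mask);
    rqd->last_picked = new_cpu;  /* unsynchronized: only a heuristic */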




* Re: [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement
  2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
  2016-06-20  8:26   ` Jan Beulich
  2016-06-27 15:20   ` anshul makkar
@ 2016-07-12 13:40   ` George Dunlap
  2 siblings, 0 replies; 64+ messages in thread
From: George Dunlap @ 2016-07-12 13:40 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, David Vrabel

On Sat, Jun 18, 2016 at 12:13 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, right now, we recommend keeping runqueues
> arranged per-core, so that it is the inter-runqueue load
> balancing code that automatically spreads the work in an
> SMT friendly way. This means that any other runq
> arrangement one may want to use falls short of SMT
> scheduling optimizations.
>
> This commit implements SMT awareness --similar to the
> one we have in Credit1-- for any possible runq
> arrangement. This turned out to be pretty easy to do,
> as the logic can live entirely in runq_tickle()
> (although, in order to avoid for_each_cpu loops in
> that function, we use a new cpumask which indeed needs
> to be updated in other places).
>
> In addition to disentangling SMT awareness from load
> balancing, this also allows us to support the
> sched_smt_power_savings parameter in Credit2.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Other than the issues Jan pointed out, looks good.

 -George
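
The core of the approach is a per-runqueue mask of pCPUs whose whole
core is idle, kept up to date with cheap mask operations and consulted
first in runq_tickle(). A sketch follows; helper and field names track
the patch, but treat the details as illustrative:

    /* When cpu becomes idle: its core is fully idle iff all of its
     * siblings are in the idlers mask too, so no for_each_cpu loop is
     * needed at tickle time. */
    static inline void smt_idle_mask_set(unsigned int cpu,
                                         const cpumask_t *idlers,
                                         cpumask_t *mask)
    {
        if ( cpumask_subset(per_cpu(cpu_sibling_mask, cpu), idlers) )
            cpumask_or(mask, mask, per_cpu(cpu_sibling_mask, cpu));
    }

    /* In runq_tickle(): prefer pCPUs belonging to fully idle cores,
     * and only fall back to any idler if there are none. */
    cpumask_and(&mask, &rqd->smt_idle, new->vcpu->cpu_hard_affinity);
    if ( !cpumask_empty(&mask) )
        ipid = cpumask_any(&mask);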



Thread overview: 64+ messages
2016-06-17 23:11 [PATCH 00/19] xen: sched: assorted fixes and improvements to Credit2 Dario Faggioli
2016-06-17 23:11 ` [PATCH 01/19] xen: sched: leave CPUs doing tasklet work alone Dario Faggioli
2016-06-20  7:48   ` Jan Beulich
2016-07-07 10:11     ` Dario Faggioli
2016-06-21 16:17   ` anshul makkar
2016-07-06 15:41   ` George Dunlap
2016-07-07 10:25     ` Dario Faggioli
2016-06-17 23:11 ` [PATCH 02/19] xen: sched: make the 'tickled' perf counter clearer Dario Faggioli
2016-06-18  0:36   ` Meng Xu
2016-07-06 15:52   ` George Dunlap
2016-06-17 23:11 ` [PATCH 03/19] xen: credit2: insert and tickle don't need a cpu parameter Dario Faggioli
2016-06-21 16:41   ` anshul makkar
2016-07-06 15:59   ` George Dunlap
2016-06-17 23:11 ` [PATCH 04/19] xen: credit2: kill useless helper function choose_cpu Dario Faggioli
2016-07-06 16:02   ` George Dunlap
2016-07-07 10:26     ` Dario Faggioli
2016-06-17 23:11 ` [PATCH 05/19] xen: credit2: do not warn if calling burn_credits more than once Dario Faggioli
2016-07-06 16:05   ` George Dunlap
2016-06-17 23:12 ` [PATCH 06/19] xen: credit2: read NOW() with the proper runq lock held Dario Faggioli
2016-06-20  7:56   ` Jan Beulich
2016-07-06 16:10     ` George Dunlap
2016-07-07 10:28       ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 07/19] xen: credit2: prevent load balancing to go mad if time goes backwards Dario Faggioli
2016-06-20  8:02   ` Jan Beulich
2016-07-06 16:21     ` George Dunlap
2016-07-07  7:29       ` Jan Beulich
2016-07-07  9:09         ` George Dunlap
2016-07-07  9:18           ` Jan Beulich
2016-07-07 10:53             ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 08/19] xen: credit2: when tickling, check idle cpus first Dario Faggioli
2016-07-06 16:36   ` George Dunlap
2016-06-17 23:12 ` [PATCH 09/19] xen: credit2: avoid calling __update_svc_load() multiple times on the same vcpu Dario Faggioli
2016-07-06 16:40   ` George Dunlap
2016-06-17 23:12 ` [PATCH 10/19] xen: credit2: rework load tracking logic Dario Faggioli
2016-07-06 17:33   ` George Dunlap
2016-06-17 23:12 ` [PATCH 11/19] tools: tracing: adapt Credit2 load tracking events to new format Dario Faggioli
2016-06-21  9:27   ` Wei Liu
2016-06-17 23:12 ` [PATCH 12/19] xen: credit2: use non-atomic cpumask and bit operations Dario Faggioli
2016-07-07  9:45   ` George Dunlap
2016-06-17 23:12 ` [PATCH 13/19] xen: credit2: make the code less experimental Dario Faggioli
2016-06-20  8:13   ` Jan Beulich
2016-07-07 10:59     ` Dario Faggioli
2016-07-07 15:17   ` George Dunlap
2016-07-07 16:43     ` Dario Faggioli
2016-06-17 23:12 ` [PATCH 14/19] xen: credit2: add yet some more tracing Dario Faggioli
2016-06-20  8:15   ` Jan Beulich
2016-07-07 15:34     ` George Dunlap
2016-07-07 15:34   ` George Dunlap
2016-06-17 23:13 ` [PATCH 15/19] xen: credit2: only marshall trace point arguments if tracing enabled Dario Faggioli
2016-07-07 15:37   ` George Dunlap
2016-06-17 23:13 ` [PATCH 16/19] tools: tracing: deal with new Credit2 events Dario Faggioli
2016-07-07 15:39   ` George Dunlap
2016-06-17 23:13 ` [PATCH 17/19] xen: credit2: the private scheduler lock can be an rwlock Dario Faggioli
2016-07-07 16:00   ` George Dunlap
2016-06-17 23:13 ` [PATCH 18/19] xen: credit2: implement SMT support independent runq arrangement Dario Faggioli
2016-06-20  8:26   ` Jan Beulich
2016-06-20 10:38     ` Dario Faggioli
2016-06-27 15:20   ` anshul makkar
2016-07-12 13:40   ` George Dunlap
2016-06-17 23:13 ` [PATCH 19/19] xen: credit2: use cpumask_first instead of cpumask_any when choosing cpu Dario Faggioli
2016-06-20  8:30   ` Jan Beulich
2016-06-20 11:28     ` Dario Faggioli
2016-06-21 10:42   ` David Vrabel
2016-07-07 16:55     ` Dario Faggioli
