* [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
@ 2017-04-07 16:56 Dario Faggioli
  2017-04-07 16:56 ` [PATCH v3 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:56 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu

Hello,

Here's v3.

 v2: https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg00800.html
 v1: https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg00265.html

The series has been reordered as suggested, and all the other comments from
v2's review have been addressed.

 git://xenbits.xen.org/people/dariof/xen.git  xenbits/rel/sched/credit1-credit2-optim-and-scalability-v3
 (yeah, messed up the branch name... :-/)
 https://travis-ci.org/fdario/xen/builds/219753675

As far as I can tell, what are now patches 6 and 3 need George's stamp.

What is now patch 4 ("xen/tools: tracing: add record for credit1 runqueue
stealing.") may require a tools' maintainer Ack, but I'm not sure (it does,
according to get_maintainers.pl, but it's all tracing, after all...).

Thanks and Regards,
Dario
---
Dario Faggioli (7):
      xen: credit1: simplify csched_runq_steal() a little bit.
      xen: credit: (micro) optimize csched_runq_steal().
      xen: credit: consider tickled pCPUs as busy.
      xen/tools: tracing: add record for credit1 runqueue stealing.
      xen: credit2: avoid cpumask_any() in pick_cpu().
      xen: credit1: increase efficiency and scalability of load balancing.
      xen: credit1: treat pCPUs more evenly during balancing.

 tools/xentrace/formats       |    1 
 tools/xentrace/xenalyze.c    |   11 ++
 xen/common/sched_credit.c    |  271 ++++++++++++++++++++++++++++++++----------
 xen/common/sched_credit2.c   |   22 +++
 xen/include/xen/perfc_defn.h |    1 
 5 files changed, 238 insertions(+), 68 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH v3 1/7] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-04-07 16:56 ` Dario Faggioli
  2017-04-07 16:56 ` [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such a pCPU to be idle, so we can:
 - tell the compiler this is actually unlikely,
 - bail early if the pCPU, unfortunately, turns out
   to really be idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v2:
* slightly reworded the changelog.
---
 xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4649e64..63a8675 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1593,64 +1593,65 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
     struct vcpu *vc;
 
+    ASSERT(peer_pcpu != NULL);
+
     /*
      * Don't steal from an idle CPU's runq because it's about to
      * pick up work from it itself.
      */
-    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
+    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+        goto out;
+
+    list_for_each( iter, &peer_pcpu->runq )
     {
-        list_for_each( iter, &peer_pcpu->runq )
-        {
-            speer = __runq_elem(iter);
+        speer = __runq_elem(iter);
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+        /*
+         * If next available VCPU here is not of strictly higher
+         * priority than ours, this PCPU is useless to us.
+         */
+        if ( speer->pri <= pri )
+            break;
 
-            /* Is this VCPU runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+        /* Is this VCPU runnable on our PCPU? */
+        vc = speer->vcpu;
+        BUG_ON( is_idle_vcpu(vc) );
 
-            /*
-             * If the vcpu has no useful soft affinity, skip this vcpu.
-             * In fact, what we want is to check if we have any "soft-affine
-             * work" to steal, before starting to look at "hard-affine work".
-             *
-             * Notice that, if not even one vCPU on this runq has a useful
-             * soft affinity, we could have avoid considering this runq for
-             * a soft balancing step in the first place. This, for instance,
-             * can be implemented by taking note of on what runq there are
-             * vCPUs with useful soft affinities in some sort of bitmap
-             * or counter.
-             */
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
-                continue;
+        /*
+         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * In fact, what we want is to check if we have any "soft-affine
+         * work" to steal, before starting to look at "hard-affine work".
+         *
+         * Notice that, if not even one vCPU on this runq has a useful
+         * soft affinity, we could have avoid considering this runq for
+         * a soft balancing step in the first place. This, for instance,
+         * can be implemented by taking note of on what runq there are
+         * vCPUs with useful soft affinities in some sort of bitmap
+         * or counter.
+         */
+        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+            continue;
 
-            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
-            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
-            {
-                /* We got a candidate. Grab it! */
-                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
-                         vc->domain->domain_id, vc->vcpu_id);
-                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                SCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
-            }
+        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
+        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
+        {
+            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
+            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
+            SCHED_STAT_CRANK(migrate_queued);
+            WARN_ON(vc->is_urgent);
+            __runq_remove(speer);
+            vc->processor = cpu;
+            return speer;
         }
     }
-
+ out:
     SCHED_STAT_CRANK(steal_peer_idle);
     return NULL;
 }



* [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal().
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-04-07 16:56 ` [PATCH v3 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-04-07 16:56 ` Dario Faggioli
  2017-04-07 17:09   ` George Dunlap
  2017-04-07 16:56 ` [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Chacking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that  as early as possible,
avoiding potentially useless complex checks, and
cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v2:
* add an assert enforcing the old !is_running check inside the
  is_migrateable() function.

Changes from v1:
* fixed a typo in a comment.
---
 xen/common/sched_credit.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a8675..59ed4ca 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -708,12 +708,15 @@ static inline int
 __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
-     * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that prefers and/or is allowed to run
-     * on our CPU.
+     * Don't pick up work that's hot on peer PCPU, or that can't (or
+     * would prefer not to) run on cpu.
+     *
+     * The caller is supposed to have already checked that vc is also
+     * not running.
      */
-    return !vc->is_running &&
-           !__csched_vcpu_is_cache_hot(vc) &&
+    ASSERT(!vc->is_running);
+
+    return !__csched_vcpu_is_cache_hot(vc) &&
            cpumask_test_cpu(dest_cpu, mask);
 }
 
@@ -1622,7 +1625,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         BUG_ON( is_idle_vcpu(vc) );
 
         /*
-         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * If the vcpu is still in peer_cpu's scheduling tail, or if it
+         * has no useful soft affinity, skip it.
+         *
          * In fact, what we want is to check if we have any "soft-affine
          * work" to steal, before starting to look at "hard-affine work".
          *
@@ -1633,8 +1638,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * vCPUs with useful soft affinities in some sort of bitmap
          * or counter.
          */
-        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        if ( vc->is_running ||
+             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
             continue;
 
         csched_balance_cpumask(vc, balance_step, cpumask_scratch);



* [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy.
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-04-07 16:56 ` [PATCH v3 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
  2017-04-07 16:56 ` [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-04-07 16:56 ` Dario Faggioli
  2017-04-07 17:12   ` George Dunlap
  2017-04-07 16:56 ` [PATCH v3 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickle() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.

When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.

Alleviate this by marking a pCPU as busy (i.e.,
no longer idle) right away when tickling it (like,
e.g., it happens in Credit2).

Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.

Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, so it may be a path that is just
not worth following.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v2:
* add comments to better explain the meaning of the added ASSERT()-s.
---
 xen/common/sched_credit.c |   32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 59ed4ca..7b4ea02 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -489,7 +489,17 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                 __trace_var(TRC_CSCHED_TICKLE, 1, sizeof(cpu), &cpu);
         }
 
-        /* Send scheduler interrupts to designated CPUs */
+        /*
+         * Mark the designated CPUs as busy and send them all the scheduler
+         * interrupt. We need the for_each_cpu for dealing with the
+         * !opt_tickle_one_idle case. We must use cpumask_clear_cpu() and
+         * can't use cpumask_andnot(), because prv->idlers needs atomic access.
+         *
+         * In the default (and most common) case, when opt_tickle_one_idle is
+         * true, the loop does only one step, and only one bit is cleared.
+         */
+        for_each_cpu(cpu, &mask)
+            cpumask_clear_cpu(cpu, prv->idlers);
         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
     }
     else
@@ -990,6 +1000,13 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
             SCHED_VCPU_STAT_CRANK(svc, migrate_r);
             SCHED_STAT_CRANK(migrate_running);
             set_bit(_VPF_migrating, &current->pause_flags);
+            /*
+             * As we are about to tickle cpu, we should clear its bit in
+             * idlers. But, if we are here, it means there is someone running
+             * on it, and hence the bit must be zero already.
+             */
+            ASSERT(!cpumask_test_cpu(cpu,
+                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
             cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
         }
     }
@@ -1082,13 +1099,22 @@ static void
 csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    unsigned int cpu = vc->processor;
 
     SCHED_STAT_CRANK(vcpu_sleep);
 
     BUG_ON( is_idle_vcpu(vc) );
 
-    if ( curr_on_cpu(vc->processor) == vc )
-        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    if ( curr_on_cpu(cpu) == vc )
+    {
+        /*
+         * We are about to tickle cpu, so we should clear its bit in idlers.
+         * But, we are here because vc is going to sleep while running on cpu,
+         * so the bit must be zero already.
+         */
+        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+    }
     else if ( __vcpu_on_runq(svc) )
         __runq_remove(svc);
 }
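
For illustration only, here is a minimal standalone sketch (not Xen code; the
names idlers, tickle() and pick_idle_cpu() are made up for the example, and a
plain unsigned long stands in for cpumask_t) of the idea the patch implements:
clear a CPU's idle bit at tickle time, so that a concurrent picker stops seeing
it as a candidate. As the changelog says, this only narrows the race window, it
does not close it.

/* Sketch only: bit i of 'idlers' set means CPU i is idle. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long idlers;

/* Tickle cpu: clear its idle bit *before* kicking it, so a concurrent
 * pick_idle_cpu() no longer considers it. */
static void tickle(unsigned int cpu)
{
    atomic_fetch_and(&idlers, ~(1UL << cpu));
    /* ...here Xen would raise SCHEDULE_SOFTIRQ on cpu... */
}

/* Pick the lowest-numbered CPU currently marked idle, or -1 if none. */
static int pick_idle_cpu(void)
{
    unsigned long snapshot = atomic_load(&idlers);
    return snapshot ? __builtin_ctzl(snapshot) : -1;
}

int main(void)
{
    atomic_store(&idlers, 0xbUL);     /* CPUs 0, 1 and 3 are idle */
    tickle(3);                        /* vCPU x is sent to CPU 3  */
    printf("next idle pick: %d\n", pick_idle_cpu());  /* 0, not 3 */
    return 0;
}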



* [PATCH v3 4/7] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2017-04-07 16:56 ` [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
@ 2017-04-07 16:56 ` Dario Faggioli
  2017-04-07 16:57 ` [PATCH v3 5/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Ian Jackson, George Dunlap

The record includes whether we actually tried stealing a
vCPU from a given pCPU, or skipped it because of lock
contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes from v2:
* make the code less cool! :-( (i.e., change "skipp'n" to just "skip" in a
  comment.)
---
 tools/xentrace/formats       |    1 +
 tools/xentrace/xenalyze.c    |   11 +++++++++++
 xen/common/sched_credit.c    |    6 +++++-
 xen/include/xen/perfc_defn.h |    1 +
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index a055231..8b31780 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -47,6 +47,7 @@
 0x00022008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:unboost       [ dom:vcpu = 0x%(1)04x%(2)04x ]
 0x00022009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:schedule      [ cpu[16]:tasklet[8]:idle[8] = %(1)08x ]
 0x0002200A  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:ratelimit     [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
+0x0002200B  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:steal_check   [ peer_cpu = %(1)d, checked = %(2)d ]
 
 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 029c89d..fa608ad 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7651,6 +7651,17 @@ void sched_process(struct pcpu_info *p)
                        r->runtime / 1000, r->runtime % 1000);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int peer_cpu, check;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:load_balance %s %u\n",
+                       ri->dump_header, r->check ? "checking" : "skipping",
+                       r->peer_cpu);
+            }
+            break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 7b4ea02..59b87f7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -134,6 +134,7 @@
 #define TRC_CSCHED_BOOST_END     TRC_SCHED_CLASS_EVT(CSCHED, 8)
 #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
 #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
+#define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
 
 
 /*
@@ -1753,14 +1754,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * balancing and trying to lock this CPU.
                  */
                 spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
-
+                SCHED_STAT_CRANK(steal_trylock);
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
                     peer_cpu = cpumask_cycle(peer_cpu, &workers);
                     continue;
                 }
 
+                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
+
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?
                     csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 0d702f0..53849af 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -48,6 +48,7 @@ PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
 PERFCOUNTER(load_balance_over,      "csched: load_balance_over")
 PERFCOUNTER(load_balance_other,     "csched: load_balance_other")
+PERFCOUNTER(steal_trylock,          "csched: steal_trylock")
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")



* [PATCH v3 5/7] xen: credit2: avoid cpumask_any() in pick_cpu().
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2017-04-07 16:56 ` [PATCH v3 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
@ 2017-04-07 16:57 ` Dario Faggioli
  2017-04-07 16:57 ` [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
  2017-04-07 16:57 ` [PATCH v3 7/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
  6 siblings, 0 replies; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.

However, since the picked CPU will still be used as a
hint within runq_tickle(), don't give up on it entirely:
let's make sure we don't always return the same CPU,
or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from
it the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit2.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index bb1c657..a76bedb 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -363,6 +363,7 @@ struct csched2_runqueue_data {
     struct list_head runq; /* Ordered list of runnable vms */
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
+    unsigned int pick_bias;/* Last CPU we picked. Start from it next time */
 
     cpumask_t idle,        /* Currently idle pcpus */
         smt_idle,          /* Fully idle-and-untickled cores (see below) */
@@ -1679,7 +1680,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         {
             cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->migrate_rqd->active);
-            new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+            new_cpu = cpumask_cycle(svc->migrate_rqd->pick_bias,
+                                    cpumask_scratch_cpu(cpu));
+            svc->migrate_rqd->pick_bias = new_cpu;
             goto out_up;
         }
         /* Fall-through to normal cpu pick */
@@ -1737,7 +1740,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                 &prv->rqd[min_rqi].active);
-    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
+                            cpumask_scratch_cpu(cpu));
+    prv->rqd[min_rqi].pick_bias = new_cpu;
     BUG_ON(new_cpu >= nr_cpu_ids);
 
  out_up:
@@ -1854,7 +1859,9 @@ static void migrate(const struct scheduler *ops,
                     cpupool_domain_cpumask(svc->vcpu->domain));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
-        svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
+        svc->vcpu->processor = cpumask_cycle(trqd->pick_bias,
+                                             cpumask_scratch_cpu(cpu));
+        trqd->pick_bias = svc->vcpu->processor;
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         _runq_assign(svc, trqd);
@@ -2819,13 +2826,15 @@ csched2_dump(const struct scheduler *ops)
         printk("Runqueue %d:\n"
                "\tncpus              = %u\n"
                "\tcpus               = %s\n"
-               "\tmax_weight         = %d\n"
+               "\tmax_weight         = %u\n"
+               "\tpick_bias          = %u\n"
                "\tinstload           = %d\n"
                "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
+               prv->rqd[i].pick_bias,
                prv->rqd[i].load,
                prv->rqd[i].avgload,
                fraction);
@@ -2928,6 +2937,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &prv->initialized);
     __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
+    if ( cpumask_weight(&rqd->active) == 1 )
+        rqd->pick_bias = cpu;
+
     return rqi;
 }
 
@@ -3040,6 +3052,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
+    else if ( rqd->pick_bias == cpu )
+        rqd->pick_bias = cpumask_first(&rqd->active);
 
     spin_unlock(&rqd->lock);
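
As a side note, here is a minimal standalone sketch (not Xen code; next_pick()
is a hypothetical stand-in for cpumask_cycle(), and a plain unsigned long
replaces cpumask_t) of the "cycle from the last pick" idea: successive picks
rotate through the runqueue's CPUs instead of always returning the same one or
paying for randomization.

#include <stdio.h>

static unsigned int pick_bias;   /* last CPU we handed out */

/* Return the next set bit in mask strictly after 'start', wrapping
 * around; ~0u if the mask is empty (mimics cpumask_cycle()). */
static unsigned int next_pick(unsigned long mask, unsigned int start)
{
    unsigned int nbits = 8 * sizeof(mask);

    for ( unsigned int i = 1; i <= nbits; i++ )
    {
        unsigned int cpu = (start + i) % nbits;

        if ( mask & (1UL << cpu) )
            return cpu;
    }
    return ~0u;
}

int main(void)
{
    unsigned long active = 0xf0UL;   /* CPUs 4-7 are in this runqueue */

    for ( int n = 0; n < 6; n++ )
    {
        pick_bias = next_pick(active, pick_bias);
        printf("picked CPU %u\n", pick_bias);   /* 4 5 6 7 4 5 */
    }
    return 0;
}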
 



* [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2017-04-07 16:57 ` [PATCH v3 5/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
@ 2017-04-07 16:57 ` Dario Faggioli
  2017-04-07 17:15   ` George Dunlap
  2017-04-07 16:57 ` [PATCH v3 7/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
  6 siblings, 1 reply; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Anshul Makkar, George Dunlap

During load balancing, we check the non-idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by, and set to run on, currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v2:
* do as much of the nr_runnable accounting as possible inside the runqueue
  remove and runqueue insert helpers. Add comments for the instances that
  must live outside of them.

Changes from v1:
* don't count the idle vCPU as runnable. This is just cosmetic and not at all a
  logic or functional change wrt v1;
* don't count inside of __runq_remove() or __runq_insert(), but provide
  specific counting functions, and call them when appropriate. This is necessary
  to avoid a spurious overloaded state being reported, basically *all* the time
  that a pCPU goes through the scheduler (due to the fact that the scheduler
  calls __runq_insert() on the current vCPU);
* get rid of the overloaded cpumask and only use the counter. I actually did
  like the cpumask solution better, but for this purpose, it was overkill.
---
 xen/common/sched_credit.c |   97 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 59b87f7..a0ad167 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -172,6 +172,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 }
 
 static inline void
+inc_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable++;
+
+}
+
+static inline void
+dec_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable--;
+    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
+}
+
+static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -292,12 +310,25 @@ __runq_insert(struct csched_vcpu *svc)
 }
 
 static inline void
+runq_insert(struct csched_vcpu *svc)
+{
+    __runq_insert(svc);
+    inc_nr_runnable(svc->vcpu->processor);
+}
+
+static inline void
 __runq_remove(struct csched_vcpu *svc)
 {
     BUG_ON( !__vcpu_on_runq(svc) );
     list_del_init(&svc->runq_elem);
 }
 
+static inline void
+runq_remove(struct csched_vcpu *svc)
+{
+    dec_nr_runnable(svc->vcpu->processor);
+    __runq_remove(svc);
+}
 
 #define for_each_csched_balance_step(step) \
     for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ )
@@ -601,6 +632,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 0;
 }
 
 static void
@@ -1052,7 +1084,7 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
 
     if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
-        __runq_insert(svc);
+        runq_insert(svc);
 
     vcpu_schedule_unlock_irq(lock, vc);
 
@@ -1117,7 +1149,7 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __vcpu_on_runq(svc) )
-        __runq_remove(svc);
+        runq_remove(svc);
 }
 
 static void
@@ -1177,7 +1209,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     }
 
     /* Put the VCPU on the runq and tickle CPUs */
-    __runq_insert(svc);
+    runq_insert(svc);
     __runq_tickle(svc);
 }
 
@@ -1679,8 +1711,14 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
             SCHED_VCPU_STAT_CRANK(speer, migrate_q);
             SCHED_STAT_CRANK(migrate_queued);
             WARN_ON(vc->is_urgent);
-            __runq_remove(speer);
+            runq_remove(speer);
             vc->processor = cpu;
+            /*
+             * speer will start executing directly on cpu, without having to
+             * go through runq_insert(). So we must update the runnable count
+             * for cpu here.
+             */
+            inc_nr_runnable(cpu);
             return speer;
         }
     }
@@ -1736,7 +1774,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
         peer_node = node;
         do
         {
-            /* Find out what the !idle are in this node */
+            /* Select the pCPUs in this node that have work we can steal. */
             cpumask_andnot(&workers, online, prv->idlers);
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
@@ -1746,6 +1784,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 goto next_node;
             do
             {
+                spinlock_t *lock;
+
+                /*
+                 * If there is only one runnable vCPU on peer_cpu, it means
+                 * there's no one to be stolen in its runqueue, so skip it.
+                 *
+                 * Checking this without holding the lock is racy... But that's
+                 * the whole point of this optimization!
+                 *
+                 * In more detail:
+                 * - if we race with dec_nr_runnable(), we may try to take the
+                 *   lock and call csched_runq_steal() for no reason. This is
+                 *   not a functional issue, and should be infrequent enough.
+                 *   And we can avoid that by re-checking nr_runnable after
+                 *   having grabbed the lock, if we want;
+                 * - if we race with inc_nr_runnable(), we skip a pCPU that may
+                 *   have runnable vCPUs in its runqueue, but that's not a
+                 *   problem because:
+                 *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
+                 *     __runq_tickle() will be called afterwards, so the vCPU
+                 *     won't get stuck in the runqueue for too long;
+                 *   + if racing with csched_runq_steal(), it may be that a
+                 *     vCPU that we could have picked up, stays in a runqueue
+                 *     until someone else tries to steal it again. But this is
+                 *     no worse than what can happen already (without this
+                 *     optimization), it the pCPU would schedule right after we
+                 *     have taken the lock, and hence block on it.
+                 */
+                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
+                {
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
+                    goto next_cpu;
+                }
+
                 /*
                  * Get ahold of the scheduler lock for this peer CPU.
                  *
@@ -1753,14 +1825,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * could cause a deadlock if the peer CPU is also load
                  * balancing and trying to lock this CPU.
                  */
-                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
+                lock = pcpu_schedule_trylock(peer_cpu);
                 SCHED_STAT_CRANK(steal_trylock);
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
-                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
-                    continue;
+                    goto next_cpu;
                 }
 
                 TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
@@ -1777,6 +1848,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                     return speer;
                 }
 
+ next_cpu:
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
             } while( peer_cpu != cpumask_first(&workers) );
@@ -1907,7 +1979,11 @@ csched_schedule(
     if ( vcpu_runnable(current) )
         __runq_insert(scurr);
     else
+    {
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+        /* Current has blocked. Update the runnable counter for this cpu. */
+        dec_nr_runnable(cpu);
+    }
 
     snext = __runq_elem(runq->next);
     ret.migrated = 0;
@@ -2024,7 +2100,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 



* [PATCH v3 7/7] xen: credit1: treat pCPUs more evenly during balancing.
  2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2017-04-07 16:57 ` [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-04-07 16:57 ` Dario Faggioli
  6 siblings, 0 replies; 11+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap

Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, which pCPU we have stolen from
last, and start from it the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit.c |   37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index a0ad167..93658dc 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -229,6 +229,7 @@ struct csched_private {
     uint32_t credit;
     int credit_balance;
     uint32_t runq_sort;
+    uint32_t *balance_bias;
     unsigned ratelimit_us;
     /* Period of master and tick in milliseconds */
     unsigned tslice_ms, tick_period_us, ticks_per_tslice;
@@ -560,6 +561,7 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
+    unsigned int node = cpu_to_node(cpu);
     unsigned long flags;
 
     /*
@@ -583,6 +585,12 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         prv->master = cpumask_first(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
     }
+    if ( prv->balance_bias[node] == cpu )
+    {
+        cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(node));
+        if ( !cpumask_empty(cpumask_scratch) )
+            prv->balance_bias[node] =  cpumask_first(cpumask_scratch);
+    }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
         kill_timer(&prv->master_ticker);
@@ -622,6 +630,10 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
                   NOW() + MILLISECS(prv->tslice_ms));
     }
 
+    cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(cpu_to_node(cpu)));
+    if ( cpumask_weight(cpumask_scratch) == 1 )
+        prv->balance_bias[cpu_to_node(cpu)] = cpu;
+
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
     set_timer(&spc->ticker, NOW() + MICROSECS(prv->tick_period_us) );
 
@@ -1735,7 +1747,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu, peer_node, bstep;
+    int peer_cpu, first_cpu, peer_node, bstep;
     int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
@@ -1779,9 +1791,10 @@ csched_load_balance(struct csched_private *prv, int cpu,
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
-            peer_cpu = cpumask_first(&workers);
-            if ( peer_cpu >= nr_cpu_ids )
+            first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
+            if ( first_cpu >= nr_cpu_ids )
                 goto next_node;
+            peer_cpu = first_cpu;
             do
             {
                 spinlock_t *lock;
@@ -1845,13 +1858,19 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( speer != NULL )
                 {
                     *stolen = 1;
+                    /*
+                     * Next time we'll look for work to steal on this node, we
+                     * will start from the next pCPU, with respect to this one,
+                     * so we don't risk stealing always from the same ones.
+                     */
+                    prv->balance_bias[peer_node] = peer_cpu;
                     return speer;
                 }
 
  next_cpu:
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
-            } while( peer_cpu != cpumask_first(&workers) );
+            } while( peer_cpu != first_cpu );
 
  next_node:
             peer_node = cycle_node(peer_node, node_online_map);
@@ -2204,10 +2223,19 @@ csched_init(struct scheduler *ops)
     prv = xzalloc(struct csched_private);
     if ( prv == NULL )
         return -ENOMEM;
+
+    prv->balance_bias = xzalloc_array(uint32_t, MAX_NUMNODES);
+    if ( prv->balance_bias == NULL )
+    {
+        xfree(prv);
+        return -ENOMEM;
+    }
+
     if ( !zalloc_cpumask_var(&prv->cpus) ||
          !zalloc_cpumask_var(&prv->idlers) )
     {
         free_cpumask_var(prv->cpus);
+        xfree(prv->balance_bias);
         xfree(prv);
         return -ENOMEM;
     }
@@ -2253,6 +2281,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        xfree(prv->balance_bias);
         xfree(prv);
     }
 }



* Re: [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal().
  2017-04-07 16:56 ` [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-04-07 17:09   ` George Dunlap
  0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2017-04-07 17:09 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar

On Fri, Apr 7, 2017 at 5:56 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Chacking whether or not a vCPU can be 'stolen'

"Checking" -- I'll fix this one on check-in. :-)

> from a peer pCPU's runqueue is relatively cheap.
>
> Therefore, let's do that  as early as possible,
> avoiding potentially useless complex checks, and
> cpumask manipulations.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Changes from v2:
> * add an assert enforcing the old !is_running check inside the
>   is_migrateable() function.
>
> Changes from v1:
> * fixed a typo in a comment.
> ---
>  xen/common/sched_credit.c |   22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 63a8675..59ed4ca 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -708,12 +708,15 @@ static inline int
>  __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>  {
>      /*
> -     * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> -     * on our CPU.
> +     * Don't pick up work that's hot on peer PCPU, or that can't (or
> +     * would prefer not to) run on cpu.
> +     *
> +     * The caller is supposed to have already checked that vc is also
> +     * not running.
>       */
> -    return !vc->is_running &&
> -           !__csched_vcpu_is_cache_hot(vc) &&
> +    ASSERT(!vc->is_running);
> +
> +    return !__csched_vcpu_is_cache_hot(vc) &&
>             cpumask_test_cpu(dest_cpu, mask);
>  }
>
> @@ -1622,7 +1625,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>          BUG_ON( is_idle_vcpu(vc) );
>
>          /*
> -         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * If the vcpu is still in peer_cpu's scheduling tail, or if it
> +         * has no useful soft affinity, skip it.
> +         *
>           * In fact, what we want is to check if we have any "soft-affine
>           * work" to steal, before starting to look at "hard-affine work".
>           *
> @@ -1633,8 +1638,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>           * vCPUs with useful soft affinities in some sort of bitmap
>           * or counter.
>           */
> -        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +        if ( vc->is_running ||
> +             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
>              continue;
>
>          csched_balance_cpumask(vc, balance_step, cpumask_scratch);
>
>


* Re: [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy.
  2017-04-07 16:56 ` [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
@ 2017-04-07 17:12   ` George Dunlap
  0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2017-04-07 17:12 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar

On Fri, Apr 7, 2017 at 5:56 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Currently, it can happen that __runq_tickle(),
> running on pCPU 2 because vCPU x woke up, decides
> to tickle pCPU 3, because it's idle. Just after
> that, but before pCPU 3 manages to schedule and
> pick up x, either __runq_tickle() or
> __csched_cpu_pick(), running on pCPU 6, sees that
> idle pCPUs are 0, 1 and also 3, and for whatever
> reason it also chooses 3 for waking up (or
> migrating) vCPU y.
>
> When pCPU 3 goes through the scheduler, it will
> pick up, say, vCPU x, and y will sit in its
> runqueue, even if there are idle pCPUs.
>
> Alleviate this by marking a pCPU as busy (i.e.,
> no longer idle) right away when tickling it (like,
> e.g., it happens in Credit2).
>
> Note that this does not eliminate the race. That
> is not possible without introducing proper locking
> for the cpumasks the scheduler uses. It significantly
> reduces the window during which it can happen, though.
>
> Introducing proper locking for the cpumasks can, in
> theory, be done, and may be investigated in future.
> It is a significant amount of work to do it properly
> (e.g., avoiding deadlock), and it is likely to adversely
> affect scalability, so it may be a path that is just
> not worth following.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Changes from v2:
> * add comments to better explain the meaning of the added ASSERT()-s.

Thanks.

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
>  xen/common/sched_credit.c |   32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 59ed4ca..7b4ea02 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -489,7 +489,17 @@ static inline void __runq_tickle(struct csched_vcpu *new)
>                  __trace_var(TRC_CSCHED_TICKLE, 1, sizeof(cpu), &cpu);
>          }
>
> -        /* Send scheduler interrupts to designated CPUs */
> +        /*
> +         * Mark the designated CPUs as busy and send them all the scheduler
> +         * interrupt. We need the for_each_cpu for dealing with the
> +         * !opt_tickle_one_idle case. We must use cpumask_clear_cpu() and
> +         * can't use cpumask_andnot(), because prv->idlers needs atomic access.
> +         *
> +         * In the default (and most common) case, when opt_tickle_one_idle is
> +         * true, the loop does only one step, and only one bit is cleared.
> +         */
> +        for_each_cpu(cpu, &mask)
> +            cpumask_clear_cpu(cpu, prv->idlers);
>          cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>      }
>      else
> @@ -990,6 +1000,13 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
>              SCHED_VCPU_STAT_CRANK(svc, migrate_r);
>              SCHED_STAT_CRANK(migrate_running);
>              set_bit(_VPF_migrating, &current->pause_flags);
> +            /*
> +             * As we are about to tickle cpu, we should clear its bit in
> +             * idlers. But, if we are here, it means there is someone running
> +             * on it, and hence the bit must be zero already.
> +             */
> +            ASSERT(!cpumask_test_cpu(cpu,
> +                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
>              cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
>          }
>      }
> @@ -1082,13 +1099,22 @@ static void
>  csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct csched_vcpu * const svc = CSCHED_VCPU(vc);
> +    unsigned int cpu = vc->processor;
>
>      SCHED_STAT_CRANK(vcpu_sleep);
>
>      BUG_ON( is_idle_vcpu(vc) );
>
> -    if ( curr_on_cpu(vc->processor) == vc )
> -        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> +    if ( curr_on_cpu(cpu) == vc )
> +    {
> +        /*
> +         * We are about to tickle cpu, so we should clear its bit in idlers.
> +         * But, we are here because vc is going to sleep while running on cpu,
> +         * so the bit must be zero already.
> +         */
> +        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
> +        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> +    }
>      else if ( __vcpu_on_runq(svc) )
>          __runq_remove(svc);
>  }
>
>


* Re: [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 16:57 ` [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-04-07 17:15   ` George Dunlap
  0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2017-04-07 17:15 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Anshul Makkar, Andrew Cooper

On Fri, Apr 7, 2017 at 5:57 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> During load balancing, we check the non-idle pCPUs to
> see if they have runnable but not running vCPUs that
> can be stolen by, and set to run on, currently idle pCPUs.
>
> If a pCPU has only one running (or runnable) vCPU,
> though, we don't want to steal it from there, and
> it's therefore pointless bothering with it
> (especially considering that bothering means trying
> to take its runqueue lock!).
>
> On large systems, when load is only slightly higher
> than the number of pCPUs (i.e., there are just a few
> more active vCPUs than the number of the pCPUs), this
> may mean that:
>  - we go through all the pCPUs,
>  - for each one, we (try to) take its runqueue locks,
>  - we figure out there's actually nothing to be stolen!
>
> To mitigate this, we introduce a counter for the number
> of runnable vCPUs on each pCPU. In fact, unless there
> are at least 2 runnable vCPUs --typically, one running,
> and the others in the runqueue-- it does not make sense
> to try stealing anything.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Changes from v2:
> * do as much of the nr_runnable accounting as possible inside the runqueue
>   remove and runqueue insert helpers. Add comments for the instances that
>   must live outside of them.
>
> Changes from v1:
> * don't count the idle vCPU as runnable. This is just cosmetic and not at all a
>   logic or functional change wrt v1;
> * don't count inside of __runq_remove() or __runq_insert(), but provide
>   specific counting functions, and call them when appropriate. This is necessary
>   to avoid a spurious overloaded state being reported, basically *all* the time
>   that a pCPU goes through the scheduler (due to the fact that the scheduler
>   calls __runq_insert() on the current vCPU);
> * get rid of the overloaded cpumask and only use the counter. I actually did
>   like the cpumask solution better, but for this purpose, it was overkill.
> ---
>  xen/common/sched_credit.c |   97 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 87 insertions(+), 10 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 59b87f7..a0ad167 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -172,6 +172,7 @@ struct csched_pcpu {
>      struct timer ticker;
>      unsigned int tick;
>      unsigned int idle_bias;
> +    unsigned int nr_runnable;
>  };
>
>  /*
> @@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
>  }
>
>  static inline void
> +inc_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +    CSCHED_PCPU(cpu)->nr_runnable++;
> +
> +}
> +
> +static inline void
> +dec_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 1);
> +    CSCHED_PCPU(cpu)->nr_runnable--;
> +}
> +
> +static inline void
>  __runq_insert(struct csched_vcpu *svc)
>  {
> -    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
> +    unsigned int cpu = svc->vcpu->processor;
> +    const struct list_head * const runq = RUNQ(cpu);
>      struct list_head *iter;
>
>      BUG_ON( __vcpu_on_runq(svc) );
> @@ -292,12 +310,25 @@ __runq_insert(struct csched_vcpu *svc)
>  }
>
>  static inline void
> +runq_insert(struct csched_vcpu *svc)
> +{
> +    __runq_insert(svc);
> +    inc_nr_runnable(svc->vcpu->processor);
> +}
> +
> +static inline void
>  __runq_remove(struct csched_vcpu *svc)
>  {
>      BUG_ON( !__vcpu_on_runq(svc) );
>      list_del_init(&svc->runq_elem);
>  }
>
> +static inline void
> +runq_remove(struct csched_vcpu *svc)
> +{
> +    dec_nr_runnable(svc->vcpu->processor);
> +    __runq_remove(svc);
> +}
>
>  #define for_each_csched_balance_step(step) \
>      for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ )
> @@ -601,6 +632,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
>      /* Start off idling... */
>      BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
>      cpumask_set_cpu(cpu, prv->idlers);
> +    spc->nr_runnable = 0;
>  }
>
>  static void
> @@ -1052,7 +1084,7 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
>      lock = vcpu_schedule_lock_irq(vc);
>
>      if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> -        __runq_insert(svc);
> +        runq_insert(svc);
>
>      vcpu_schedule_unlock_irq(lock, vc);
>
> @@ -1117,7 +1149,7 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
>          cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
>      }
>      else if ( __vcpu_on_runq(svc) )
> -        __runq_remove(svc);
> +        runq_remove(svc);
>  }
>
>  static void
> @@ -1177,7 +1209,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>      }
>
>      /* Put the VCPU on the runq and tickle CPUs */
> -    __runq_insert(svc);
> +    runq_insert(svc);
>      __runq_tickle(svc);
>  }
>
> @@ -1679,8 +1711,14 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>              SCHED_VCPU_STAT_CRANK(speer, migrate_q);
>              SCHED_STAT_CRANK(migrate_queued);
>              WARN_ON(vc->is_urgent);
> -            __runq_remove(speer);
> +            runq_remove(speer);
>              vc->processor = cpu;
> +            /*
> +             * speer will start executing directly on cpu, without having to
> +             * go through runq_insert(). So we must update the runnable count
> +             * for cpu here.
> +             */
> +            inc_nr_runnable(cpu);
>              return speer;
>          }
>      }
> @@ -1736,7 +1774,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
>          peer_node = node;
>          do
>          {
> -            /* Find out what the !idle are in this node */
> +            /* Select the pCPUs in this node that have work we can steal. */
>              cpumask_andnot(&workers, online, prv->idlers);
>              cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
>              __cpumask_clear_cpu(cpu, &workers);
> @@ -1746,6 +1784,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                  goto next_node;
>              do
>              {
> +                spinlock_t *lock;
> +
> +                /*
> +                 * If there is only one runnable vCPU on peer_cpu, it means
> +                 * there's nothing to be stolen from its runqueue, so skip it.
> +                 *
> +                 * Checking this without holding the lock is racy... But that's
> +                 * the whole point of this optimization!
> +                 *
> +                 * In more detail:
> +                 * - if we race with dec_nr_runnable(), we may try to take the
> +                 *   lock and call csched_runq_steal() for no reason. This is
> +                 *   not a functional issue, and should be infrequent enough.
> +                 *   And we can avoid that by re-checking nr_runnable after
> +                 *   having grabbed the lock, if we want;
> +                 * - if we race with inc_nr_runnable(), we skip a pCPU that may
> +                 *   have runnable vCPUs in its runqueue, but that's not a
> +                 *   problem because:
> +                 *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
> +                 *     __runq_tickle() will be called afterwards, so the vCPU
> +                 *     won't get stuck in the runqueue for too long;
> +                 *   + if racing with csched_runq_steal(), it may be that a
> +                 *     vCPU that we could have picked up, stays in a runqueue
> +                 *     until someone else tries to steal it again. But this is
> +                 *     no worse than what can happen already (without this
> +                 *     optimization), if the pCPU would schedule right after we
> +                 *     have taken the lock, and hence block on it.
> +                 */
> +                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
> +                {
> +                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
> +                    goto next_cpu;
> +                }
> +
>                  /*
>                   * Get ahold of the scheduler lock for this peer CPU.
>                   *
> @@ -1753,14 +1825,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                   * could cause a deadlock if the peer CPU is also load
>                   * balancing and trying to lock this CPU.
>                   */
> -                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
> +                lock = pcpu_schedule_trylock(peer_cpu);
>                  SCHED_STAT_CRANK(steal_trylock);
>                  if ( !lock )
>                  {
>                      SCHED_STAT_CRANK(steal_trylock_failed);
>                      TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
> -                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -                    continue;
> +                    goto next_cpu;
>                  }
>
>                  TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
> @@ -1777,6 +1848,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                      return speer;
>                  }
>
> + next_cpu:
>                  peer_cpu = cpumask_cycle(peer_cpu, &workers);
>
>              } while( peer_cpu != cpumask_first(&workers) );
> @@ -1907,7 +1979,11 @@ csched_schedule(
>      if ( vcpu_runnable(current) )
>          __runq_insert(scurr);
>      else
> +    {
>          BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
> +        /* Current has blocked. Update the runnable counter for this cpu. */
> +        dec_nr_runnable(cpu);
> +    }
>
>      snext = __runq_elem(runq->next);
>      ret.migrated = 0;
> @@ -2024,7 +2100,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
>      runq = &spc->runq;
>
>      cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
> -    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
> +    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
> +           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
>      cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
>      printk("core=%s\n", cpustr);
>
>
>
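
The load balancing path above combines a lock-free peek at nr_runnable with a
trylock on the peer's runqueue lock. The pattern boils down to something like
the following hypothetical, simplified sketch (plain C with pthreads, not the
Xen code; a real implementation would want an atomic/READ_ONCE-style access
for the unlocked read rather than a plain load):

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  struct peer {
      pthread_mutex_t lock;      /* the peer's runqueue lock             */
      unsigned int nr_runnable;  /* maintained under 'lock' by its owner */
  };

  /* Returns true with the lock held iff the peer's runqueue is worth
   * scanning; returns false (lock not held) if we decided to skip it. */
  static bool try_steal_check(struct peer *p)
  {
      /* Racy, lock-free peek: may be stale, but costs nothing. A value
       * <= 1 means at most the running vCPU, i.e. nothing to steal. */
      if (p->nr_runnable <= 1)
          return false;

      /* Never spin on a busy peer: if it is scheduling or balancing
       * itself, just move on to the next candidate. */
      if (pthread_mutex_trylock(&p->lock) != 0)
          return false;

      /* Optional re-check, now that the counter is stable under the lock. */
      if (p->nr_runnable <= 1) {
          pthread_mutex_unlock(&p->lock);
          return false;
      }
      return true;
  }

  int main(void)
  {
      static struct peer p = { PTHREAD_MUTEX_INITIALIZER, 2 };

      if (try_steal_check(&p)) {
          printf("worth scanning: %u runnable\n", p.nr_runnable);
          /* ...scan the runqueue, possibly steal a vCPU... */
          pthread_mutex_unlock(&p.lock);
      } else {
          printf("skipped\n");
      }
      return 0;
  }

The worst a stale read can cost is one wasted trylock or one skipped steal
attempt, which matches the reasoning in the comment added by the patch.
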

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-04-07 17:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2017-04-07 16:56 [PATCH v3 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-04-07 16:56 ` [PATCH v3 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
2017-04-07 16:56 ` [PATCH v3 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
2017-04-07 17:09   ` George Dunlap
2017-04-07 16:56 ` [PATCH v3 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
2017-04-07 17:12   ` George Dunlap
2017-04-07 16:56 ` [PATCH v3 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
2017-04-07 16:57 ` [PATCH v3 5/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
2017-04-07 16:57 ` [PATCH v3 6/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
2017-04-07 17:15   ` George Dunlap
2017-04-07 16:57 ` [PATCH v3 7/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
