* [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
@ 2017-03-02 10:37 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu

Hello,

This series introduces some optimizations and performance improvements for
Credit1 (which show up in certain specific situations), and lightly touches
Credit2 as well.

The core of the series is patches 3 and 4, which aim at both redistributing and
reducing spinlock contention during load balancing. In fact, Credit1 load
balancing is based on "work stealing": when a pCPU is about to go idle, it looks
around inside other pCPUs' runqueues, to see if there are vCPUs waiting to run,
and steals the first one it finds.

This scan of the pCPUs happens NUMA-node-wise, and always starts from the first
pCPU of each node. That may lead to higher scheduler lock pressure on lower-ID
pCPUs (of each node), as well as to work being stolen from them more frequently.
This is what patch 4 fixes. It is not necessarily expected to improve
performance per se, although a fairer lock pressure is likely to bring benefits.
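
Just to illustrate the idea, here is a minimal, standalone sketch (the names
mirror the actual code where possible, but this is not the real implementation:
patch 4 below uses Xen's cpumask_cycle() on a per-node prv->balance_bias):

    /*
     * Start the steal scan just after the pCPU we last stole from, instead
     * of always starting from the node's first pCPU, so that probing (and
     * runqueue lock pressure) spreads evenly across the node.
     */
    #define NODE_CPUS 8

    static unsigned int balance_bias;             /* pCPU we stole from last */

    /* Stand-in for probing one pCPU's runqueue (csched_runq_steal()). */
    static int try_steal(unsigned int cpu) { return 0; }

    int steal_from_node(void)
    {
        unsigned int i, cpu;

        for ( i = 1; i <= NODE_CPUS; i++ )
        {
            cpu = (balance_bias + i) % NODE_CPUS;  /* rotate the start point */
            if ( try_steal(cpu) )
            {
                balance_bias = cpu;       /* next scan starts after this one */
                return cpu;
            }
        }
        return -1;                        /* nothing to steal on this node */
    }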

Still about load balancing: when deciding whether or not to try to steal work
from a pCPU, we only consider the ones that are non-idle. But a pCPU which is
running a vCPU, and does not have any other vCPU waiting in its runqueue, is not
idle, and yet there is nothing we can steal from it. It's therefore possible
that we check a number of pCPUs, which includes at least trying to take their
runqueue locks, only to find out that there is no vCPU we can grab, and that we
need to keep checking other processors.
On a large system, in situations where the load (i.e., the number of runnable
and running vCPUs) is only _slightly_ higher than the number of pCPUs, this can
have a significant performance impact. A way of improving this situation is to
keep track not only of whether pCPUs are idle, but also of which ones have more
than one runnable vCPU, which basically means they have at least one vCPU ready
to be stolen by anyone that would otherwise go idle.
And this is exactly what patch 3 does.
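
Again as an illustration only, here is a minimal, standalone sketch of the
bookkeeping (a plain bitmask and simplified names are used here, while the
actual patch uses Xen's cpumask_t and the csched_pcpu/csched_private
structures; also, the real code counts the always-present idle vCPU too, so
its thresholds are 2 rather than 1):

    #define NR_CPUS 16

    static unsigned int nr_runnable[NR_CPUS];   /* runnable vCPUs per pCPU */
    static unsigned long overloaded;            /* one bit per pCPU */

    void runq_insert(unsigned int cpu)
    {
        /* More than one runnable vCPU: at least one of them can be stolen. */
        if ( ++nr_runnable[cpu] > 1 )
            overloaded |= 1UL << cpu;
    }

    void runq_remove(unsigned int cpu)
    {
        /* Back to one (or zero) runnable vCPU: nothing left to steal here. */
        if ( --nr_runnable[cpu] <= 1 )
            overloaded &= ~(1UL << cpu);
    }

The load balancer then only walks the pCPUs whose bit is set in the overloaded
mask, rather than all the non-idle ones.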

Finally, patch 6 does to Credit2 something similar to what patch 3 does to
Credit1, although the context is actually different. There are places in
Credit2 where we just want the scheduler to give us one pCPU from a certain
runqueue. We do that by means of cpumask_any(), which is great, but comes at a
price. As a matter of fact, we don't really care much which pCPU we get, as a
subsequent call to runq_tickle() will override that choice anyway. But, within
runq_tickle() itself, the chosen pCPU is at least used as a hint, so we don't
want to give up on it entirely and introduce biases (by, e.g., just using
cpumask_first()). We therefore use an approach similar to the one in patch 3,
i.e., we record which pCPU we picked last, and start from it the next time.
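
In code terms (this is essentially an excerpt of what patch 6 below does in
csched2_cpu_pick()), the change boils down to:

    /* Before: pick a (pseudo-)random pCPU from the mask. */
    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));

    /* After: cycle from the pCPU we picked last time, and remember it. */
    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
                            cpumask_scratch_cpu(cpu));
    prv->rqd[min_rqi].pick_bias = new_cpu;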

As said already, the performance benefits of this series are to be expected on
large systems, under rather specific load conditions. I've done some
benchmarking on a 16-CPU NUMA box that I have at hand.

I've run three experiments: a Xen compile ('MAKEXEN') inside a 16-vCPU guest;
two Xen compiles running concurrently inside two 16-vCPU VMs; and a Xen compile
and Iperf ('IPERF') running concurrently inside two 16-vCPU VMs.

Here are the results for Credit1. For MAKEXEN, lower is better, while for
IPERF, higher is better. The tables below show the average and standard
deviation over 10 runs.

    |CREDIT1                                                            |
    |-------------------------------------------------------------------|
    |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN     vm2: IPERF      |
    |baseline patched|baseline patched|baseline patched baseline patched|
    |----------------|----------------|---------------------------------|
avg | 18.154   17.906| 52.832   51.088| 29.306   28.936  15.840   18.580|
stdd|  0.580    0.059|  1.061    1.717|  0.757    0.296   4.264    2.492|

So, with this patch applied, Xen compiles a little bit faster, and Iperf
achieves higher throughput, which is great. :-D
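
(For reference, computed from the averages above, that is roughly a 1.4% and a
3.3% reduction in MAKEXEN time in the 1 VM and 2 VMs cases respectively, about
1.3% for MAKEXEN in the mixed case, and about a 17% increase in IPERF
throughput.)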

As far as Credit2 goes, here are the numbers:

    |CREDIT2                                                            |
    |-------------------------------------------------------------------|
    |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN     vm2: IPERF      |
    |baseline patched|baseline patched|baseline patched baseline patched|
    |----------------|----------------|---------------------------------|
avg | 18.062   17.894| 53.136   52.968| 32.754   32.880  18.160   19.240|
stdd|  0.331    0.205|  0.886    0.566|  0.787    0.548   1.910    1.842|

In this case, the expected impact of the series is smaller, and that in fact
matches what we get, with baseline and patched numbers very close. What I
wanted to verify is that I was not introducing regressions, and that seems to
be confirmed.
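
(Quantitatively, every baseline vs. patched difference above is within roughly
one standard deviation: e.g., MAKEXEN with 1 VM differs by 0.168 against
standard deviations of 0.331 and 0.205, and IPERF differs by 1.08 against
standard deviations of 1.910 and 1.842.)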

Thanks and Regards,
Dario
---
Dario Faggioli (6):
      xen: credit1: simplify csched_runq_steal() a little bit.
      xen: credit: (micro) optimize csched_runq_steal().
      xen: credit1: increase efficiency and scalability of load balancing.
      xen: credit1: treat pCPUs more evenly during balancing.
      xen/tools: tracing: add record for credit1 runqueue stealing.
      xen: credit2: avoid cpumask_any() in pick_cpu().

 tools/xentrace/formats       |    1
 tools/xentrace/xenalyze.c    |   11 ++
 xen/common/sched_credit.c    |  199 +++++++++++++++++++++++++++++-------------
 xen/common/sched_credit2.c   |   22 ++++-
 xen/include/xen/perfc_defn.h |    1
 5 files changed, 169 insertions(+), 65 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-03  9:35   ` anshul makkar
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such a pCPU to be idle, so we mark that
case as unlikely (and bail out early if it unfortunately is).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4649e64..63a8675 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1593,64 +1593,65 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
     struct vcpu *vc;
 
+    ASSERT(peer_pcpu != NULL);
+
     /*
      * Don't steal from an idle CPU's runq because it's about to
      * pick up work from it itself.
      */
-    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
+    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+        goto out;
+
+    list_for_each( iter, &peer_pcpu->runq )
     {
-        list_for_each( iter, &peer_pcpu->runq )
-        {
-            speer = __runq_elem(iter);
+        speer = __runq_elem(iter);
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+        /*
+         * If next available VCPU here is not of strictly higher
+         * priority than ours, this PCPU is useless to us.
+         */
+        if ( speer->pri <= pri )
+            break;
 
-            /* Is this VCPU runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+        /* Is this VCPU runnable on our PCPU? */
+        vc = speer->vcpu;
+        BUG_ON( is_idle_vcpu(vc) );
 
-            /*
-             * If the vcpu has no useful soft affinity, skip this vcpu.
-             * In fact, what we want is to check if we have any "soft-affine
-             * work" to steal, before starting to look at "hard-affine work".
-             *
-             * Notice that, if not even one vCPU on this runq has a useful
-             * soft affinity, we could have avoid considering this runq for
-             * a soft balancing step in the first place. This, for instance,
-             * can be implemented by taking note of on what runq there are
-             * vCPUs with useful soft affinities in some sort of bitmap
-             * or counter.
-             */
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
-                continue;
+        /*
+         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * In fact, what we want is to check if we have any "soft-affine
+         * work" to steal, before starting to look at "hard-affine work".
+         *
+         * Notice that, if not even one vCPU on this runq has a useful
+         * soft affinity, we could have avoid considering this runq for
+         * a soft balancing step in the first place. This, for instance,
+         * can be implemented by taking note of on what runq there are
+         * vCPUs with useful soft affinities in some sort of bitmap
+         * or counter.
+         */
+        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+            continue;
 
-            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
-            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
-            {
-                /* We got a candidate. Grab it! */
-                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
-                         vc->domain->domain_id, vc->vcpu_id);
-                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                SCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
-            }
+        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
+        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
+        {
+            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
+            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
+            SCHED_STAT_CRANK(migrate_queued);
+            WARN_ON(vc->is_urgent);
+            __runq_remove(speer);
+            vc->processor = cpu;
+            return speer;
         }
     }
-
+ out:
     SCHED_STAT_CRANK(steal_peer_idle);
     return NULL;
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-03  9:48   ` anshul makkar
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Checking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that as early as possible,
avoiding potentially useless, more complex checks
and cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a8675..2b13e99 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -708,12 +708,10 @@ static inline int
 __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
-     * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that prefers and/or is allowed to run
-     * on our CPU.
+     * Don't pick up work that's or hot on peer PCPU, or that can't (or
+     * would prefer not to) run on cpu.
      */
-    return !vc->is_running &&
-           !__csched_vcpu_is_cache_hot(vc) &&
+    return !__csched_vcpu_is_cache_hot(vc) &&
            cpumask_test_cpu(dest_cpu, mask);
 }
 
@@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         BUG_ON( is_idle_vcpu(vc) );
 
         /*
-         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * If the vcpu is still in peer_cpu's scheduling tail, or if it
+         * has no useful soft affinity, skip it.
+         *
          * In fact, what we want is to check if we have any "soft-affine
          * work" to steal, before starting to look at "hard-affine work".
          *
@@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * vCPUs with useful soft affinities in some sort of bitmap
          * or counter.
          */
-        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        if ( vc->is_running ||
+             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
             continue;
 
         csched_balance_cpumask(vc, balance_step, cpumask_scratch);


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper

During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce here the concept of
overloaded runqueues, and a cpumask where to record
what pCPUs are in such state.

An overloaded runqueue has at least 2 runnable vCPUs
(plus the idle one, which is always there). Typically,
this means 1 vCPU is running, and 1 is sitting in the
runqueue, and can hence be stolen.

Then, in csched_load_balance(), it is enough to go
over the overloaded pCPUs, rather than all non-idle
pCPUs, which is better.

signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
I'm Cc-ing Andy on this patch, because we've discussed once about doing
something like this upstream.
---
 xen/common/sched_credit.c |   56 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 2b13e99..529b6c7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -171,6 +171,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -221,6 +222,7 @@ struct csched_private {
     uint32_t ncpus;
     struct timer  master_ticker;
     unsigned int master;
+    cpumask_var_t overloaded;
     cpumask_var_t idlers;
     cpumask_var_t cpus;
     uint32_t weight;
@@ -263,7 +265,10 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -288,12 +293,37 @@ __runq_insert(struct csched_vcpu *svc)
     }
 
     list_add_tail(&svc->runq_elem, iter);
+
+    /*
+     * If there is more than just the idle vCPU and a "regular" vCPU runnable
+     * on the runqueue of this pCPU, mark it as overloaded (so other pCPU
+     * can come and pick up some work).
+     */
+    if ( ++spc->nr_runnable > 2 &&
+         !cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_set_cpu(cpu, prv->overloaded);
 }
 
 static inline void
 __runq_remove(struct csched_vcpu *svc)
 {
+    unsigned int cpu = svc->vcpu->processor;
+    struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
+
     BUG_ON( !__vcpu_on_runq(svc) );
+
+    /*
+     * Mark the CPU as no longer overloaded when we drop to having only
+     * 1 vCPU in its runqueue. In fact, this means that just the idle
+     * vCPU and a "regular" vCPU are around.
+     */
+    if ( --spc->nr_runnable <= 2 &&
+         cpumask_test_cpu(cpu, prv->overloaded) )
+        cpumask_clear_cpu(cpu, prv->overloaded);
+
+    ASSERT(spc->nr_runnable >= 1);
+
     list_del_init(&svc->runq_elem);
 }
 
@@ -590,6 +620,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 1;
 }
 
 static void
@@ -1704,8 +1735,8 @@ csched_load_balance(struct csched_private *prv, int cpu,
         peer_node = node;
         do
         {
-            /* Find out what the !idle are in this node */
-            cpumask_andnot(&workers, online, prv->idlers);
+            /* Select the pCPUs in this node that have work we can steal. */
+            cpumask_and(&workers, online, prv->overloaded);
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
@@ -1989,7 +2020,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
@@ -2027,7 +2059,7 @@ csched_dump(const struct scheduler *ops)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-#define idlers_buf keyhandler_scratch
+#define cpumask_buf keyhandler_scratch
 
     printk("info:\n"
            "\tncpus              = %u\n"
@@ -2055,8 +2087,10 @@ csched_dump(const struct scheduler *ops)
            prv->ticks_per_tslice,
            vcpu_migration_delay);
 
-    cpumask_scnprintf(idlers_buf, sizeof(idlers_buf), prv->idlers);
-    printk("idlers: %s\n", idlers_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->idlers);
+    printk("idlers: %s\n", cpumask_buf);
+    cpumask_scnprintf(cpumask_buf, sizeof(cpumask_buf), prv->overloaded);
+    printk("overloaded: %s\n", cpumask_buf);
 
     printk("active vcpus:\n");
     loop = 0;
@@ -2079,7 +2113,7 @@ csched_dump(const struct scheduler *ops)
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
     }
-#undef idlers_buf
+#undef cpumask_buf
 
     spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -2093,8 +2127,11 @@ csched_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
     if ( !zalloc_cpumask_var(&prv->cpus) ||
-         !zalloc_cpumask_var(&prv->idlers) )
+         !zalloc_cpumask_var(&prv->idlers) ||
+         !zalloc_cpumask_var(&prv->overloaded) )
     {
+        free_cpumask_var(prv->overloaded);
+        free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->cpus);
         xfree(prv);
         return -ENOMEM;
@@ -2141,6 +2178,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        free_cpumask_var(prv->overloaded);
         xfree(prv);
     }
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower-ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, which pCPU we stole from last, and
start from it the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 529b6c7..bae29a7 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -229,6 +229,7 @@ struct csched_private {
     uint32_t credit;
     int credit_balance;
     uint32_t runq_sort;
+    uint32_t *balance_bias;
     unsigned ratelimit_us;
     /* Period of master and tick in milliseconds */
     unsigned tslice_ms, tick_period_us, ticks_per_tslice;
@@ -548,6 +549,7 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
+    unsigned int node = cpu_to_node(cpu);
     unsigned long flags;
 
     /*
@@ -571,6 +573,12 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         prv->master = cpumask_first(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
     }
+    if ( prv->balance_bias[node] == cpu )
+    {
+        cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(node));
+        if ( !cpumask_empty(cpumask_scratch) )
+            prv->balance_bias[node] =  cpumask_first(cpumask_scratch);
+    }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
         kill_timer(&prv->master_ticker);
@@ -610,6 +618,10 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
                   NOW() + MILLISECS(prv->tslice_ms));
     }
 
+    cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(cpu_to_node(cpu)));
+    if ( cpumask_weight(cpumask_scratch) == 1 )
+        prv->balance_bias[cpu_to_node(cpu)] = cpu;
+
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
     set_timer(&spc->ticker, NOW() + MICROSECS(prv->tick_period_us) );
 
@@ -1696,7 +1708,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu, peer_node, bstep;
+    int peer_cpu, first_cpu, peer_node, bstep;
     int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
@@ -1740,9 +1752,10 @@ csched_load_balance(struct csched_private *prv, int cpu,
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
-            peer_cpu = cpumask_first(&workers);
-            if ( peer_cpu >= nr_cpu_ids )
+            first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
+            if ( first_cpu >= nr_cpu_ids )
                 goto next_node;
+            peer_cpu = first_cpu;
             do
             {
                 /*
@@ -1770,12 +1783,18 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( speer != NULL )
                 {
                     *stolen = 1;
+                    /*
+                     * Next time we'll look for work to steal on this node, we
+                     * will start from the next pCPU, with respect to this one,
+                     * so we don't risk stealing always from the same ones.
+                     */
+                    prv->balance_bias[peer_node] = peer_cpu;
                     return speer;
                 }
 
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
-            } while( peer_cpu != cpumask_first(&workers) );
+            } while( peer_cpu != first_cpu );
 
  next_node:
             peer_node = cycle_node(peer_node, node_online_map);
@@ -2126,6 +2145,14 @@ csched_init(struct scheduler *ops)
     prv = xzalloc(struct csched_private);
     if ( prv == NULL )
         return -ENOMEM;
+
+    prv->balance_bias = xzalloc_array(uint32_t, MAX_NUMNODES);
+    if ( prv->balance_bias == NULL )
+    {
+        xfree(prv);
+        return -ENOMEM;
+    }
+
     if ( !zalloc_cpumask_var(&prv->cpus) ||
          !zalloc_cpumask_var(&prv->idlers) ||
          !zalloc_cpumask_var(&prv->overloaded) )
@@ -2133,6 +2160,7 @@ csched_init(struct scheduler *ops)
         free_cpumask_var(prv->overloaded);
         free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->cpus);
+        xfree(prv->balance_bias);
         xfree(prv);
         return -ENOMEM;
     }
@@ -2179,6 +2207,7 @@ csched_deinit(struct scheduler *ops)
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
         free_cpumask_var(prv->overloaded);
+        xfree(prv->balance_bias);
         xfree(prv);
     }
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Ian Jackson, George Dunlap

Including whether we actually tried stealing a vCPU from
a given pCPU, or skipped it because of lock contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/xentrace/formats       |    1 +
 tools/xentrace/xenalyze.c    |   11 +++++++++++
 xen/common/sched_credit.c    |    6 ++++++
 xen/include/xen/perfc_defn.h |    1 +
 4 files changed, 19 insertions(+)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index a055231..8b31780 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -47,6 +47,7 @@
 0x00022008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:unboost       [ dom:vcpu = 0x%(1)04x%(2)04x ]
 0x00022009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:schedule      [ cpu[16]:tasklet[8]:idle[8] = %(1)08x ]
 0x0002200A  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:ratelimit     [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
+0x0002200B  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:steal_check   [ peer_cpu = %(1)d, checked = %(2)d ]
 
 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 68ffcc2..fd8ddb9 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7651,6 +7651,17 @@ void sched_process(struct pcpu_info *p)
                        r->runtime / 1000, r->runtime % 1000);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int peer_cpu, check;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:load_balance %s %u\n",
+                       ri->dump_header, r->check ? "checking" : "skipping",
+                       r->peer_cpu);
+            }
+            break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index bae29a7..ed44d40 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -134,6 +134,7 @@
 #define TRC_CSCHED_BOOST_END     TRC_SCHED_CLASS_EVT(CSCHED, 8)
 #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
 #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
+#define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
 
 
 /*
@@ -1767,12 +1768,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  */
                 spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
 
+                SCHED_STAT_CRANK(steal_check);
+
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipped */ 0);
                     peer_cpu = cpumask_cycle(peer_cpu, &workers);
                     continue;
                 }
+                else
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
 
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 0d702f0..742d429 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -48,6 +48,7 @@ PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
 PERFCOUNTER(load_balance_over,      "csched: load_balance_over")
 PERFCOUNTER(load_balance_other,     "csched: load_balance_other")
+PERFCOUNTER(steal_check,            "csched: steal_check")
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu().
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
@ 2017-03-02 10:38 ` Dario Faggioli
  2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
on achieving true randomization is pretty pointless.

Since the picked CPU, however, is still used as a hint
within runq_tickle(), we don't want to give up on it
entirely, so let's make sure we don't always return the
same CPU, or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from it
the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit2.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index af457c1..7b9e1a1 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -363,6 +363,7 @@ struct csched2_runqueue_data {
     struct list_head runq; /* Ordered list of runnable vms */
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
+    unsigned int pick_bias;/* Last CPU we picked. Start from it next time */
 
     cpumask_t idle,        /* Currently idle pcpus */
         smt_idle,          /* Fully idle-and-untickled cores (see below) */
@@ -1679,7 +1680,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         {
             cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->migrate_rqd->active);
-            new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+            new_cpu = cpumask_cycle(svc->migrate_rqd->pick_bias,
+                                    cpumask_scratch_cpu(cpu));
+            svc->migrate_rqd->pick_bias = new_cpu;
             goto out_up;
         }
         /* Fall-through to normal cpu pick */
@@ -1737,7 +1740,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                 &prv->rqd[min_rqi].active);
-    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
+                            cpumask_scratch_cpu(cpu));
+    prv->rqd[min_rqi].pick_bias = new_cpu;
     BUG_ON(new_cpu >= nr_cpu_ids);
 
  out_up:
@@ -1854,7 +1859,9 @@ static void migrate(const struct scheduler *ops,
                     cpupool_domain_cpumask(svc->vcpu->domain));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
-        svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
+        svc->vcpu->processor = cpumask_cycle(trqd->pick_bias,
+                                             cpumask_scratch_cpu(cpu));
+        trqd->pick_bias = svc->vcpu->processor;
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         _runq_assign(svc, trqd);
@@ -2821,13 +2828,15 @@ csched2_dump(const struct scheduler *ops)
         printk("Runqueue %d:\n"
                "\tncpus              = %u\n"
                "\tcpus               = %s\n"
-               "\tmax_weight         = %d\n"
+               "\tmax_weight         = %u\n"
+               "\tpick_bias          = %u\n"
                "\tinstload           = %d\n"
                "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
+               prv->rqd[i].pick_bias,
                prv->rqd[i].load,
                prv->rqd[i].avgload,
                fraction);
@@ -2930,6 +2939,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &prv->initialized);
     __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
+    if ( cpumask_weight(&rqd->active) == 1 )
+        rqd->pick_bias = cpu;
+
     return rqi;
 }
 
@@ -3042,6 +3054,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
+    else if ( rqd->pick_bias == cpu )
+        rqd->pick_bias = cpumask_first(&rqd->active);
 
     spin_unlock(&rqd->lock);
 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
@ 2017-03-02 10:58 ` Dario Faggioli
  2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 10:58 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu


[-- Attachment #1.1: Type: text/plain, Size: 1511 bytes --]

On Thu, 2017-03-02 at 11:37 +0100, Dario Faggioli wrote:
> ---
> Dario Faggioli (6):
>       xen: credit1: simplify csched_runq_steal() a little bit.
>       xen: credit: (micro) optimize csched_runq_steal().
>       xen: credit1: increase efficiency and scalability of load balancing.
>       xen: credit1: treat pCPUs more evenly during balancing.
>       xen/tools: tracing: add record for credit1 runqueue stealing.
>       xen: credit2: avoid cpumask_any() in pick_cpu().
> 
>  tools/xentrace/formats       |    1
>  tools/xentrace/xenalyze.c    |   11 ++
>  xen/common/sched_credit.c    |  199 +++++++++++++++++++++++++++++-------------
>  xen/common/sched_credit2.c   |   22 ++++-
>  xen/include/xen/perfc_defn.h |    1
>  5 files changed, 169 insertions(+), 65 deletions(-)
>
And there's a git branch available here:

 git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit1-credit2-optim-and-scalability
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit1-credit2-optim-and-scalability
 https://travis-ci.org/fdario/xen/builds/206774498

Sorry I forgot the links before.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 11:35     ` Dario Faggioli
  2017-04-06  7:37     ` Dario Faggioli
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Cooper @ 2017-03-02 11:06 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 2156 bytes --]

On 02/03/17 10:38, Dario Faggioli wrote:
> During load balancing, we check the non idle pCPUs to
> see if they have runnable but not running vCPUs that
> can be stolen by and set to run on currently idle pCPUs.
>
> If a pCPU has only one running (or runnable) vCPU,
> though, we don't want to steal it from there, and
> it's therefore pointless bothering with it
> (especially considering that bothering means trying
> to take its runqueue lock!).
>
> On large systems, when load is only slightly higher
> than the number of pCPUs (i.e., there are just a few
> more active vCPUs than the number of the pCPUs), this
> may mean that:
>  - we go through all the pCPUs,
>  - for each one, we (try to) take its runqueue locks,
>  - we figure out there's actually nothing to be stolen!
>
> To mitigate this, we introduce here the concept of
> overloaded runqueues, and a cpumask where to record
> what pCPUs are in such state.
>
> An overloaded runqueue has at least 2 runnable vCPUs
> (plus the idle one, which is always there). Typically,
> this means 1 vCPU is running, and 1 is sitting in the
> runqueue, and can hence be stolen.
>
> Then, in csched_load_balance(), it is enough to go
> over the overloaded pCPUs, rather than all non-idle
> pCPUs, which is better.
>
> signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Malcolm’s solution to this problem is
https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966 
This has been in 2 releases of XenServer now, and has a very visible
improvement for aggregate multi-queue multi-vm intrahost network
performance (although I can't find the numbers right now).

The root of the performance problems is that pcpu_schedule_trylock() is
expensive even for the local case, while cross-cpu locking is much
worse.  Locking every single pcpu in turn is terribly expensive, in
terms of hot cacheline pingpong, and the lock is frequently contended.

As a first opinion of this patch, you are adding another cpumask which
is going to play hot cacheline pingpong.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 3012 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 11:06   ` Andrew Cooper
@ 2017-03-02 11:35     ` Dario Faggioli
  2017-04-06  7:37     ` Dario Faggioli
  1 sibling, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-02 11:35 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 1854 bytes --]

On Thu, 2017-03-02 at 11:06 +0000, Andrew Cooper wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>  
> Malcolm’s solution to this problem is https://github.com/xenserver/xe
> n-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966  This has
> been in 2 releases of XenServer now, and has a very visible
> improvement for aggregate multi-queue multi-vm intrahost network
> performance (although I can't find the numbers right now).
> 
Yep, as you know, I had become aware of that.

> The root of the performance problems is that pcpu_schedule_trylock()
> is expensive even for the local case, while cross-cpu locking is much
> worse.  Locking every single pcpu in turn is terribly expensive, in
> terms of hot cacheline pingpong, and the lock is frequently
> contended.
> 
> As a first opinion of this patch, you are adding another cpumask
> which is going to play hot cacheline pingpong.
> 
Can you clarify? Inside csched_load_balance(), I'm actually trading one
currently existing cpumask_and() for another.

Sure, this new mask needs updating, but that only happens when a pCPU
acquires or leaves the "overloaded" status.

Malcolm's patch uses an atomic counter which does not fit very well in
Credit1's load balancer's logic, and in fact it (potentially) requires
an additional cpumask_cycle(). And it also comes with cache coherency
implications, doesn't it?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-03-03  9:35   ` anshul makkar
  2017-03-03 13:39     ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: anshul makkar @ 2017-03-03  9:35 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap



On 02/03/17 10:38, Dario Faggioli wrote:
> Since we're holding the lock on the pCPU from which we
> are trying to steal, it can't have disappeared, so we
> can drop the check for that (and convert it into an
> ASSERT()).
>
> And since we try to steal only from busy pCPUs, it's
> unlikely for such a pCPU to be idle, so we mark that
> case as unlikely (and bail out early if it unfortunately is).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
>   1 file changed, 44 insertions(+), 43 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 4649e64..63a8675 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1593,64 +1593,65 @@ static struct csched_vcpu *
>   csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> -    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
>       struct vcpu *vc;
>   
> +    ASSERT(peer_pcpu != NULL);
> +
>       /*
>        * Don't steal from an idle CPU's runq because it's about to
>        * pick up work from it itself.
>        */
> -    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
> +    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
> +        goto out;
We can just use if (!is_idle_vcpu(peer_vcpu)). Why replace it with
code that introduces an unnecessary branch statement?
> +
> +    list_for_each( iter, &peer_pcpu->runq )
>       {
> -        list_for_each( iter, &peer_pcpu->runq )
> -        {
> -            speer = __runq_elem(iter);
> +        speer = __runq_elem(iter);
>   
> -            /*
> -             * If next available VCPU here is not of strictly higher
> -             * priority than ours, this PCPU is useless to us.
> -             */
> -            if ( speer->pri <= pri )
> -                break;
> +        /*
> +         * If next available VCPU here is not of strictly higher
> +         * priority than ours, this PCPU is useless to us.
> +         */
> +        if ( speer->pri <= pri )
> +            break;
>   
> -            /* Is this VCPU runnable on our PCPU? */
> -            vc = speer->vcpu;
> -            BUG_ON( is_idle_vcpu(vc) );
> +        /* Is this VCPU runnable on our PCPU? */
> +        vc = speer->vcpu;
> +        BUG_ON( is_idle_vcpu(vc) );
>   
> -            /*
> -             * If the vcpu has no useful soft affinity, skip this vcpu.
> -             * In fact, what we want is to check if we have any "soft-affine
> -             * work" to steal, before starting to look at "hard-affine work".
> -             *
> -             * Notice that, if not even one vCPU on this runq has a useful
> -             * soft affinity, we could have avoid considering this runq for
> -             * a soft balancing step in the first place. This, for instance,
> -             * can be implemented by taking note of on what runq there are
> -             * vCPUs with useful soft affinities in some sort of bitmap
> -             * or counter.
> -             */
> -            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> -                continue;
> +        /*
> +         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * In fact, what we want is to check if we have any "soft-affine
> +         * work" to steal, before starting to look at "hard-affine work".
> +         *
> +         * Notice that, if not even one vCPU on this runq has a useful
> +         * soft affinity, we could have avoid considering this runq for
> +         * a soft balancing step in the first place. This, for instance,
> +         * can be implemented by taking note of on what runq there are
> +         * vCPUs with useful soft affinities in some sort of bitmap
> +         * or counter.
> +         */
Isn't it a better approach that, now that we have come across a vcpu which
doesn't have the desired soft affinity but is a potential candidate for
migration, instead of just forgetting it, we save the information about
such vcpus in some data structure, in some order, so that we don't have to
scan them again if we don't find anything useful in the present run?
> +        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +            continue;
>   
> -            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> -            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> -            {
> -                /* We got a candidate. Grab it! */
> -                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> -                         vc->domain->domain_id, vc->vcpu_id);
> -                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> -                SCHED_STAT_CRANK(migrate_queued);
> -                WARN_ON(vc->is_urgent);
> -                __runq_remove(speer);
> -                vc->processor = cpu;
> -                return speer;
> -            }
> +        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> +        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> +        {
> +            /* We got a candidate. Grab it! */
> +            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> +                     vc->domain->domain_id, vc->vcpu_id);
> +            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> +            SCHED_STAT_CRANK(migrate_queued);
> +            WARN_ON(vc->is_urgent);
> +            __runq_remove(speer);
> +            vc->processor = cpu;
> +            return speer;
>           }
>       }
> -
> + out:
>       SCHED_STAT_CRANK(steal_peer_idle);
>       return NULL;
>   }
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Anshul

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-03-03  9:48   ` anshul makkar
  2017-03-03 13:53     ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: anshul makkar @ 2017-03-03  9:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap



On 02/03/17 10:38, Dario Faggioli wrote:
> Checking whether or not a vCPU can be 'stolen'
> from a peer pCPU's runqueue is relatively cheap.
>
> Therefore, let's do that as early as possible,
> avoiding potentially useless, more complex checks
> and cpumask manipulations.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> ---
>   xen/common/sched_credit.c |   17 +++++++++--------
>   1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 63a8675..2b13e99 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -708,12 +708,10 @@ static inline int
>   __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
> -     * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> -     * on our CPU.
> +     * Don't pick up work that's or hot on peer PCPU, or that can't (or
Not clear.
> +     * would prefer not to) run on cpu.
>        */
> -    return !vc->is_running &&
> -           !__csched_vcpu_is_cache_hot(vc) &&
> +    return !__csched_vcpu_is_cache_hot(vc) &&
>              cpumask_test_cpu(dest_cpu, mask);
Removing !vc->is_running doesn't ease the complexity and doesn't save much on
CPU cycles. In fact, I think (!vc->is_running) here makes the intention of the
migratability check much clearer.

Yeah, apart from the above reasons, it's safe to remove this check from here.
>   }
>   
> @@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>           BUG_ON( is_idle_vcpu(vc) );
>   
>           /*
> -         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * If the vcpu is still in peer_cpu's scheduling tail, or if it
> +         * has no useful soft affinity, skip it.
> +         *
>            * In fact, what we want is to check if we have any "soft-affine
>            * work" to steal, before starting to look at "hard-affine work".
>            *
> @@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>            * vCPUs with useful soft affinities in some sort of bitmap
>            * or counter.
>            */
> -        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +        if ( vc->is_running ||
> +             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
>               continue;
>   
>           csched_balance_cpumask(vc, balance_step, cpumask_scratch);
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-03-03  9:35   ` anshul makkar
@ 2017-03-03 13:39     ` Dario Faggioli
  0 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-03 13:39 UTC (permalink / raw)
  To: anshul makkar, xen-devel; +Cc: George Dunlap



On Fri, 2017-03-03 at 09:35 +0000, anshul makkar wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1593,64 +1593,65 @@ static struct csched_vcpu *
> >   csched_runq_steal(int peer_cpu, int cpu, int pri, int
> > balance_step)
> >   {
> >       const struct csched_pcpu * const peer_pcpu =
> > CSCHED_PCPU(peer_cpu);
> > -    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
> >       struct csched_vcpu *speer;
> >       struct list_head *iter;
> >       struct vcpu *vc;
> >   
> > +    ASSERT(peer_pcpu != NULL);
> > +
> >       /*
> >        * Don't steal from an idle CPU's runq because it's about to
> >        * pick up work from it itself.
> >        */
> > -    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
> > +    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
> > +        goto out;
> We can just use if (!is_idle_vcpu(peer_vcpu)). Why replace it with
> some code that introduces an unnecessary branch statement?
>
Mmm... I don't think I understand what this means.

> > +        /*
> > +         * If the vcpu has no useful soft affinity, skip this
> > vcpu.
> > +         * In fact, what we want is to check if we have any "soft-
> > affine
> > +         * work" to steal, before starting to look at "hard-affine 
> > work".
> > +         *
> > +         * Notice that, if not even one vCPU on this runq has a
> > useful
> > +         * soft affinity, we could have avoid considering this
> > runq for
> > +         * a soft balancing step in the first place. This, for
> > instance,
> > +         * can be implemented by taking note of on what runq there
> > are
> > +         * vCPUs with useful soft affinities in some sort of
> > bitmap
> > +         * or counter.
> > +         */
>
> Isn't it a better approach that, now that we have come across a vcpu
> which doesn't have the desired soft affinity but is a potential
> candidate for migration, instead of just forgetting it, we save the
> information for such vcpus in some data structure, in some order, so
> that we don't have to scan them again if we don't find anything useful
> in the present run?
>
So, AFAIUI, you're suggesting something like this:
 1. for each vcpu in the runqueue, we check soft-affinity. If it 
    matches, we're done;
 2. if it does not match, we check hard-affinity. If it matches, we 
    save that vcpu somewhere. We only need to save one vcpu, the first 
    one that we find to have matching hard-affinity;
 3. if we don't find any vcpu with matching soft affinity, we steal 
    the one we've saved.

Is this correct? If yes, and if there isn't anything I'm overlooking, I
think it could work.
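
In code, a rough, untested sketch of that (with a hypothetical
soft_aff_matches() helper, and glossing over the is_running/cache-hot
checks, locking and the exact field names) could be something like:

    static struct csched_vcpu *
    runq_steal_with_fallback(struct list_head *runq, int cpu)
    {
        struct csched_vcpu *speer, *fallback = NULL;
        struct list_head *iter;

        list_for_each( iter, runq )
        {
            speer = __runq_elem(iter);

            /* 1. soft-affinity match: steal this one right away. */
            if ( soft_aff_matches(speer->vcpu, cpu) )
                return speer;

            /* 2. remember the first vCPU whose hard affinity matches. */
            if ( fallback == NULL &&
                 cpumask_test_cpu(cpu, speer->vcpu->cpu_hard_affinity) )
                fallback = speer;
        }

        /* 3. no soft-affine work found: fall back (may well be NULL). */
        return fallback;
    }

i.e., the runqueue would be scanned only once, at the price of keeping
one extra pointer around.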

It's a separate patch, of course. I can try putting that together,
unless of course you want to give this a go yourself. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal().
  2017-03-03  9:48   ` anshul makkar
@ 2017-03-03 13:53     ` Dario Faggioli
  0 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-03 13:53 UTC (permalink / raw)
  To: anshul makkar, xen-devel; +Cc: George Dunlap



On Fri, 2017-03-03 at 09:48 +0000, anshul makkar wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -708,12 +708,10 @@ static inline int
> >   __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu,
> > cpumask_t *mask)
> >   {
> >       /*
> > -     * Don't pick up work that's in the peer's scheduling tail or
> > hot on
> > -     * peer PCPU. Only pick up work that prefers and/or is allowed
> > to run
> > -     * on our CPU.
> > +     * Don't pick up work that's or hot on peer PCPU, or that
> > can't (or
> Not clear.
>
Well, there's actually a typo (redundant 'or'). Good catch. :-)
> > 
> > +     * would prefer not to) run on cpu.
> >        */
> > -    return !vc->is_running &&
> > -           !__csched_vcpu_is_cache_hot(vc) &&
> > +    return !__csched_vcpu_is_cache_hot(vc) &&
> >              cpumask_test_cpu(dest_cpu, mask);
> !vc->is_running doesn't ease the complexity and doesn't save many
> CPU cycles. In fact, I think (!vc->is_running) makes the intention
> of the migratability check much clearer.
> 
But the point is not saving the overhead of a !vc->is_running check
here; it is actually to pull it out from within this function and check
it earlier. And that's ok because the value won't change, and it's a
good thing because what we save is a call to

  __vcpu_has_soft_affinity()

and, potentially, to

  csched_balance_cpumask()

i.e., more specifically...
> > @@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int
> > pri, int balance_step)
> >            * vCPUs with useful soft affinities in some sort of
> > bitmap
> >            * or counter.
> >            */
> > -        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> > -             && !__vcpu_has_soft_affinity(vc, vc-
> > >cpu_hard_affinity) )
> > +        if ( vc->is_running ||
> > +             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> > +              && !__vcpu_has_soft_affinity(vc, vc-
> > >cpu_hard_affinity)) )
> >               continue;
> >   
> >           csched_balance_cpumask(vc, balance_step,
> > cpumask_scratch);
> > 
...these ones here.

I agree that the check was a good fit for that function, but --with the
updated comments-- I don't think it's too terrible to have it outside.

Or were you suggesting to have it in _both_ places? If that's the case,
no... I agree it's cheap, but that would look confusing to me (I
totally see myself, in 3 months, sending a patch to remove the
redundant is_running check! :-P).

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
  2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (6 preceding siblings ...)
  2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-03-27  9:08 ` Dario Faggioli
  7 siblings, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-03-27  9:08 UTC (permalink / raw)
  To: xen-devel, George Dunlap; +Cc: Andrew Cooper, Anshul Makkar



On Thu, 2017-03-02 at 11:37 +0100, Dario Faggioli wrote:
> Hello,
> 
Hey, George,

About this series. I was re-looking at it, and I figured out that:

> Dario Faggioli (6):
>       xen: credit1: simplify csched_runq_steal() a little bit.
>       xen: credit: (micro) optimize csched_runq_steal().
>       xen: credit1: increase efficiency and scalability of load
> balancing.
>
Here in patch 3, overhead inside __runq_insert() and __runq_remove()
can be reduced.

>       xen: credit1: treat pCPUs more evenly during balancing.
>
And about patch 4, I like it, but I'm currently running more benchmarks
to make sure of its impact, and its cost-vs-benefit ratio.

I will send a v2 shortly so, maybe, even if you're desperately looking
for something of mine to review, feel free to skip (or just quickly
glance at) this series. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing.
  2017-03-02 11:06   ` Andrew Cooper
  2017-03-02 11:35     ` Dario Faggioli
@ 2017-04-06  7:37     ` Dario Faggioli
  1 sibling, 0 replies; 16+ messages in thread
From: Dario Faggioli @ 2017-04-06  7:37 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: George Dunlap



On Thu, 2017-03-02 at 11:06 +0000, Andrew Cooper wrote:
> On 02/03/17 10:38, Dario Faggioli wrote:
> > 
> > To mitigate this, we introduce here the concept of
> > overloaded runqueues, and a cpumask where to record
> > what pCPUs are in such state.
> > 
> > An overloaded runqueue has at least 2 runnable vCPUs
> > (plus the idle one, which is always there). Typically,
> > this means 1 vCPU is running, and 1 is sitting in the
> > runqueue, and can hence be stolen.
> >
> > Then, in csched_balance_load(), it is enough to go
> > over the overloaded pCPUs, instead of over all non-idle
> > pCPUs, which is better.
> > 
> Malcolm’s solution to this problem is
> https://github.com/xenserver/xen-4.7.pg/commit/0f830b9f229fa6472accc9630ad16cfa42258966
> This has been in 2 releases of XenServer now, and has a very visible
> improvement for aggregate multi-queue multi-vm intrahost network
> performance (although I can't find the numbers right now).
> 
> The root of the performance problems is that pcpu_schedule_trylock()
> is expensive even for the local case, while cross-cpu locking is much
> worse.  Locking every single pcpu in turn is terribly expensive, in
> terms of hot cacheline pingpong, and the lock is frequently
> contended.
> 
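(For illustration only, the bookkeeping described in the quoted commit
message could look roughly like the sketch below; the field and helper
names are hypothetical, the hooks would sit in __runq_insert() and
__runq_remove(), and this is not the actual patch.)

    static cpumask_t csched_overloaded;  /* pCPUs with stealable work */

    static inline void
    runq_count_up(unsigned int cpu, struct csched_pcpu *spc)
    {
        /* 2+ runnable vCPUs (idle excluded): others can steal from here. */
        if ( ++spc->nr_runnable >= 2 )
            cpumask_set_cpu(cpu, &csched_overloaded);
    }

    static inline void
    runq_count_down(unsigned int cpu, struct csched_pcpu *spc)
    {
        if ( --spc->nr_runnable < 2 )
            cpumask_clear_cpu(cpu, &csched_overloaded);
    }

Load balancing would then iterate over csched_overloaded only, rather
than over all non-idle pCPUs.
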
BTW, both my patch in this series, and the patch linked above are
_wrong_ in using __runq_insert() and __runq_remove() for counting the
runnable vCPUs.

In fact, in Credit1, during the main scheduling function
(csched_schedule()), we call runqueue insert to temporarily put back
the running vCPU. This increments the counter, making all the other
pCPUs think that there is a vCPU available for stealing in there, while:
1) that may not be true, if we end up choosing to run the same vCPU again;
2) even if true, they'll always fail on the trylock until we're out of
csched_schedule(), as it holds the runqueue lock itself.

So, yeah, it's not really a matter of correctness, but there's more
overhead to be cut.

In v2 of this series, that I'm about to send, I've "fixed" this (i.e.,
I'm only modifying the counter when really necessary).
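
In other words, something along these lines (a sketch only; the names
are hypothetical and the real v2 code is structured differently):

    /*
     * When (re)inserting a vCPU in the runqueue, only advertise it as
     * stealable if it is not the vCPU that csched_schedule() is
     * temporarily putting back, i.e., not the pCPU's scheduling tail.
     */
    static inline void
    runq_account_insert(struct csched_pcpu *spc, struct csched_vcpu *svc)
    {
        if ( !svc->vcpu->is_running )
            spc->nr_runnable++;   /* hypothetical per-pCPU counter */
    }

plus the symmetric check when removing.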

> As a first opinion of this patch, you are adding another cpumask
> which is going to play hot cacheline pingpong.
> 
Yeah, well, despite liking the cpumask based approach, I agree it's
overkill in this case. In v2, I got rid of it, and I am doing something
 even more similar to Malcolm's patch above.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-04-06  7:38 UTC | newest]

Thread overview: 16+ messages
2017-03-02 10:37 [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-03-02 10:38 ` [PATCH 1/6] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
2017-03-03  9:35   ` anshul makkar
2017-03-03 13:39     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 2/6] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
2017-03-03  9:48   ` anshul makkar
2017-03-03 13:53     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 3/6] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
2017-03-02 11:06   ` Andrew Cooper
2017-03-02 11:35     ` Dario Faggioli
2017-04-06  7:37     ` Dario Faggioli
2017-03-02 10:38 ` [PATCH 4/6] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
2017-03-02 10:38 ` [PATCH 5/6] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
2017-03-02 10:38 ` [PATCH 6/6] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
2017-03-02 10:58 ` [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-03-27  9:08 ` Dario Faggioli
