* [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
@ 2018-08-24 23:35 Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 01/16] xen: Credit1: count runnable vcpus, not running ones Dario Faggioli
                   ` (17 more replies)
  0 siblings, 18 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich

Hello,

As anticipated here:
https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02052.html

Here's a preliminary version of my work, trying to implement
core-scheduling in Xen.

First of all, this deals with Credit1 only. I have patches for Credit2,
and I've been working on having them ready by today, but I could not
defeat the latest bugs. :-/
I'll post them when back from vacation. Let me just anticipate that
doing something like this in Credit2 is a lot simpler than what you
see here for Credit1.

Even these patches that I'm posting are not perfect; in fact, there
are some TODOs and XXXs, both in the changelogs and in the code.

They give me a system that boots, where I can do basic stuff (like
playing with dom0, creating guests, etc), and where the constraint of
only scheduling vcpus from one domain at a time on pcpus that are part
of the same core is, as far as I've seen, respected.
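
To make that constraint concrete, here is a minimal sketch of the
invariant that the series tries to maintain on every core. This is a
hypothetical debug helper, written for illustration only; it is not
code from the patches:

    /* On each core, either all threads run the idle vcpu, or all the
     * non-idle vcpus running on its threads belong to one same domain. */
    static bool core_sched_invariant_holds(unsigned int cpu)
    {
        const struct domain *d = NULL;
        unsigned int sibling;

        for_each_cpu ( sibling, per_cpu(cpu_sibling_mask, cpu) )
        {
            const struct vcpu *v = curr_on_cpu(sibling);

            if ( is_idle_vcpu(v) )
                continue;
            if ( d == NULL )
                d = v->domain;
            else if ( v->domain != d )
                return false;
        }

        return true;
    }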

There are still cases where the behavior is not ideal; e.g., we could
make better use of some of the cores which are, at times, left idle.

There are git branches here:
 https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-RFCv1
 https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1

Any comment is more than welcome.

Regards,
Dario
---
Dario Faggioli (16):
      xen: Credit1: count runnable vcpus, not running ones
      xen: Credit1: always steal from pcpus with runnable but not running vcpus
      xen: Credit1: do not always tickle an idle pcpu
      xen: sched: make the logic for tracking idle core generic.
      xen: Credit1: track fully idle cores.
      xen: Credit1: check for fully idle cores when tickling
      xen: Credit1: reorg __runq_tickle() code a bit.
      xen: Credit1: reorg csched_schedule() code a bit.
      xen: Credit1: SMT-aware domain co-scheduling parameter and data structs
      xen: Credit1: support sched_smt_cosched in csched_schedule()
      xen: Credit1: support sched_smt_cosched in _csched_cpu_pick()
      xen: Credit1: support sched_smt_cosched in csched_runq_steal().
      xen: Credit1: sched_smt_cosched support in __csched_vcpu_is_migrateable().
      xen: Credit1: sched_smt_cosched support in __runq_tickle() for pinned vcpus.
      xen: Credit1: sched_smt_cosched support in __runq_tickle().
      xen/tools: tracing of Credit1 SMT domain co-scheduling support

 docs/misc/xen-command-line.markdown |   11 +
 tools/xentrace/xenalyze.c           |   85 +++++-
 xen/common/sched_credit.c           |  532 +++++++++++++++++++++++++++--------
 xen/common/sched_credit2.c          |  128 +++-----
 xen/common/schedule.c               |    7 
 xen/include/xen/sched.h             |   42 +++
 6 files changed, 603 insertions(+), 202 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* [RFC PATCH v1 01/16] xen: Credit1: count runnable vcpus, not running ones
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 02/16] xen: Credit1: always steal from pcpus with runnable but not running vcpus Dario Faggioli
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

For scalability reasons, we keep track of the number of runnable vcpus
in each runqueue (there is one runq for each pcpu). Right now, this
count includes both runnable and running vcpus.

Change the mechanism so that we count only the runnable ones, which is
more useful. In fact, we want to be able to distinguish between pcpus
which are idle but have (at least) one queued runnable vcpu, and pcpus
which are running one vcpu and have none queued.

Right now, the counter is 1 in both these cases, while with this
commit, nr_runnable will be 1 in the former case, and 0 in the
latter.

This is going to be useful in follow-up patches. As long as just this
commit is considered, there is no (intended) functional change.
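
As an illustration, here is a minimal debug check of the new
invariant. It is hypothetical, not part of the patch; it only uses
helpers visible in the diff below, and assumes the runqueue lock is
held:

    /* The running vcpu is not queued, so nr_runnable must equal the
     * number of non-idle vcpus currently sitting in cpu's runq. */
    static void check_nr_runnable(unsigned int cpu)
    {
        struct list_head *iter, *runq = RUNQ(cpu);
        unsigned int cnt = 0;

        list_for_each( iter, runq )
            if ( !is_idle_vcpu(__runq_elem(iter)->vcpu) )
                cnt++;

        ASSERT(cnt == CSCHED_PCPU(cpu)->nr_runnable);
    }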

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   43 +++++++++++++------------------------------
 1 file changed, 13 insertions(+), 30 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 84e744bfe4..b8765ac6df 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -271,7 +271,7 @@ dec_nr_runnable(unsigned int cpu)
 }
 
 static inline void
-__runq_insert(struct csched_vcpu *svc)
+runq_insert(struct csched_vcpu *svc)
 {
     unsigned int cpu = svc->vcpu->processor;
     const struct list_head * const runq = RUNQ(cpu);
@@ -299,29 +299,20 @@ __runq_insert(struct csched_vcpu *svc)
     }
 
     list_add_tail(&svc->runq_elem, iter);
-}
 
-static inline void
-runq_insert(struct csched_vcpu *svc)
-{
-    __runq_insert(svc);
-    inc_nr_runnable(svc->vcpu->processor);
+    if ( !is_idle_vcpu(svc->vcpu) )
+        inc_nr_runnable(svc->vcpu->processor);
 }
 
 static inline void
-__runq_remove(struct csched_vcpu *svc)
+runq_remove(struct csched_vcpu *svc)
 {
     BUG_ON( !__vcpu_on_runq(svc) );
+    if ( !is_idle_vcpu(svc->vcpu) )
+        dec_nr_runnable(svc->vcpu->processor);
     list_del_init(&svc->runq_elem);
 }
 
-static inline void
-runq_remove(struct csched_vcpu *svc)
-{
-    dec_nr_runnable(svc->vcpu->processor);
-    __runq_remove(svc);
-}
-
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -1666,12 +1657,6 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
             WARN_ON(vc->is_urgent);
             runq_remove(speer);
             vc->processor = cpu;
-            /*
-             * speer will start executing directly on cpu, without having to
-             * go through runq_insert(). So we must update the runnable count
-             * for cpu here.
-             */
-            inc_nr_runnable(cpu);
             return speer;
         }
     }
@@ -1741,8 +1726,8 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 spinlock_t *lock;
 
                 /*
-                 * If there is only one runnable vCPU on peer_cpu, it means
-                 * there's no one to be stolen in its runqueue, so skip it.
+                 * If there are no runnable vCPUs in peer_cpu's runq, it means
+                 * there's nothing to steal, so skip it.
                  *
                  * Checking this without holding the lock is racy... But that's
                  * the whole point of this optimization!
@@ -1766,7 +1751,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  *     optimization), it the pCPU would schedule right after we
                  *     have taken the lock, and hence block on it.
                  */
-                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
+                if ( CSCHED_PCPU(peer_cpu)->nr_runnable == 0 )
                 {
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
                     goto next_cpu;
@@ -1820,7 +1805,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
 
  out:
     /* Failed to find more important work elsewhere... */
-    __runq_remove(snext);
+    runq_remove(snext);
     return snext;
 }
 
@@ -1937,12 +1922,10 @@ csched_schedule(
      * Select next runnable local VCPU (ie top of local runq)
      */
     if ( vcpu_runnable(current) )
-        __runq_insert(scurr);
+        runq_insert(scurr);
     else
     {
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
-        /* Current has blocked. Update the runnable counter for this cpu. */
-        dec_nr_runnable(cpu);
     }
 
     snext = __runq_elem(runq->next);
@@ -1970,7 +1953,7 @@ csched_schedule(
      * already removed from the runq.
      */
     if ( snext->pri > CSCHED_PRI_TS_OVER )
-        __runq_remove(snext);
+        runq_remove(snext);
     else
         snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
 
@@ -2060,7 +2043,7 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+    printk("CPU[%02d] nr_runbl=%d, sort=%d, sibling=%s, ",
            cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);



* [RFC PATCH v1 02/16] xen: Credit1: always steal from pcpus with runnable but not running vcpus
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 01/16] xen: Credit1: count runnable vcpus, not running ones Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 03/16] xen: Credit1: do not always tickle an idle pcpu Dario Faggioli
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Wei Liu, Ian Jackson

Currently, if a pcpu is idle but has some runnable vcpus in its
runqueue, the other pcpus (e.g., when they're going idle) never try to
steal any of those runnable vcpus.

This is because Credit1 assumes that, when a vcpu wakes up, if the pcpu
on which it is queued is idle, that very pcpu is _always_ tickled, and
will _always_ have the chance to pick up someone from its own runqueue.

While true for now, this is going to change. In fact:
- we want to be able to tickle a different pcpu than the one where the
  vcpu has been queued, if that is better for load distribution (or for
  honoring soft-affinity);
- when introducing SMT-aware domain coscheduling, it can happen that a
  pcpu, even if idle, can't pick up vcpus from its own runqueue, and
  hence we do want other pcpus to be able to try to steal and run them.

Let's, therefore, remove the logic implementing this assumption,
making sure that runnable vcpus can be picked up from the runqueue of
whatever pcpu they're queued on.

We also make it more clear, while tracing, whether a cpu is being
considered or skipped (and, if the latter, why) during load balancing.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
TODO:
- for tracing, deal with xentrace's formats too;
- we now check a lot of pcpus. Although we do that without taking their
  runqueue lock, the overhead may be significant on big systems. In
  order to avoid that, we can introduce a new cpumask, tracking the
  'overloaded' pcpus (i.e., the pcpus which have at least one queued
  runnable vcpu), and only look at them; see the sketch below.
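
A minimal sketch of that idea (the prv->overloaded field, and these
exact bodies of inc_nr_runnable()/dec_nr_runnable(), are hypothetical,
not part of this patch): the bit is flipped only on the 0<->1
transitions of the counter, under the runqueue lock, so that load
balancers can scan just that mask:

    /* Hypothetical: prv->overloaded is a new cpumask in csched_private. */
    static inline void
    inc_nr_runnable(unsigned int cpu)
    {
        if ( CSCHED_PCPU(cpu)->nr_runnable++ == 0 )
            cpumask_set_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->overloaded);
    }

    static inline void
    dec_nr_runnable(unsigned int cpu)
    {
        ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 1);
        if ( --CSCHED_PCPU(cpu)->nr_runnable == 0 )
            cpumask_clear_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->overloaded);
    }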
---
 tools/xentrace/xenalyze.c |   11 +++++++----
 xen/common/sched_credit.c |   29 +++++++++++++----------------
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 5ed0a12327..a1e2531945 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7655,12 +7655,15 @@ void sched_process(struct pcpu_info *p)
         case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
             if(opt.dump_all) {
                 struct {
-                    unsigned int peer_cpu, check;
+                    unsigned int peer_cpu;
+                    int check;
                 } *r = (typeof(r))ri->d;
 
-                printf(" %s csched:load_balance %s %u\n",
-                       ri->dump_header, r->check ? "checking" : "skipping",
-                       r->peer_cpu);
+                printf(" %s csched:load_balance cpu %u: %s (nr_runnable=%d)\n",
+                       ri->dump_header, r->peer_cpu,
+                       r->check == 0 ? "skipping" :
+                           (r->check < 0 ? "locked" : "checking" ),
+                       r->check < 0 ? -r->check : r->check);
             }
             break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index b8765ac6df..48d80993b1 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1606,11 +1606,8 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 
     ASSERT(peer_pcpu != NULL);
 
-    /*
-     * Don't steal from an idle CPU's runq because it's about to
-     * pick up work from it itself.
-     */
-    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+    /* Don't steal from pcpus which have no runnable vcpu queued. */
+    if ( unlikely(peer_pcpu->nr_runnable == 0) )
         goto out;
 
     list_for_each( iter, &peer_pcpu->runq )
@@ -1695,15 +1692,15 @@ csched_load_balance(struct csched_private *prv, int cpu,
 
     /*
      * Let's look around for work to steal, taking both hard affinity
-     * and soft affinity into account. More specifically, we check all
-     * the non-idle CPUs' runq, looking for:
+     * and soft affinity into account. More specifically, we check the
+     * CPUs' runq with runnable vCPUs, looking for:
      *  1. any "soft-affine work" to steal first,
      *  2. if not finding anything, any "hard-affine work" to steal.
      */
     for_each_affinity_balance_step( bstep )
     {
         /*
-         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+         * We peek at the CPUs in a node-wise fashion. In fact,
          * it is more likely that we find some affine work on our same
          * node, not to mention that migrating vcpus within the same node
          * could well expected to be cheaper than across-nodes (memory
@@ -1713,8 +1710,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
         do
         {
             /* Select the pCPUs in this node that have work we can steal. */
-            cpumask_andnot(&workers, online, prv->idlers);
-            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+            cpumask_and(&workers, online, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
             first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
@@ -1723,6 +1719,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
             peer_cpu = first_cpu;
             do
             {
+                const struct csched_pcpu * const pspc = CSCHED_PCPU(peer_cpu);
                 spinlock_t *lock;
 
                 /*
@@ -1735,9 +1732,9 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * In more details:
                  * - if we race with dec_nr_runnable(), we may try to take the
                  *   lock and call csched_runq_steal() for no reason. This is
-                 *   not a functional issue, and should be infrequent enough.
-                 *   And we can avoid that by re-checking nr_runnable after
-                 *   having grabbed the lock, if we want;
+                 *   not a functional issue and, in any case, we re-check
+                 *   nr_runnable inside csched_runq_steal, after grabbing the
+                 *   lock;
                  * - if we race with inc_nr_runnable(), we skip a pCPU that may
                  *   have runnable vCPUs in its runqueue, but that's not a
                  *   problem because:
@@ -1751,7 +1748,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  *     optimization), it the pCPU would schedule right after we
                  *     have taken the lock, and hence block on it.
                  */
-                if ( CSCHED_PCPU(peer_cpu)->nr_runnable == 0 )
+                if ( pspc->nr_runnable == 0 )
                 {
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
                     goto next_cpu;
@@ -1769,11 +1766,11 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
-                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, -pspc->nr_runnable);
                     goto next_cpu;
                 }
 
-                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
+                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, pspc->nr_runnable);
 
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?



* [RFC PATCH v1 03/16] xen: Credit1: do not always tickle an idle pcpu
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 01/16] xen: Credit1: count runnable vcpus, not running ones Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 02/16] xen: Credit1: always steal from pcpus with runnable but not running vcpus Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 04/16] xen: sched: make the logic for tracking idle core generic Dario Faggioli
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If a vcpu wakes up on an idle pcpu, we always (try to) run it there.
This may seem ok, but it may actually not be, in case there is another
(also idle) pcpu which would be better suited (for load balancing
and/or affinity reasons, for instance) to run it.

So, instead of blindly tickling the pcpu where the vcpu woke up, let's
go through the full tickling logic, even if that pcpu is idle. If
there is no 'better' pcpu on which to run the vcpu, we still prefer
the one where it woke up, over others with similar characteristics.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 48d80993b1..d0eeb5a335 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -373,21 +373,18 @@ static inline void __runq_tickle(struct csched_vcpu *new)
     }
 
     /*
-     * If the pcpu is idle, or there are no idlers and the new
-     * vcpu is a higher priority than the old vcpu, run it here.
-     *
+     * If there are no idlers, and the new vcpu is a higher priority than
+     * the old vcpu, run it here.
+     *
      * If there are idle cpus, first try to find one suitable to run
      * new, so we can avoid preempting cur.  If we cannot find a
      * suitable idler on which to run new, run it here, but try to
      * find a suitable idler on which to run cur instead.
      */
-    if ( cur->pri == CSCHED_PRI_IDLE
-         || (idlers_empty && new->pri > cur->pri) )
+    if ( idlers_empty && new->pri > cur->pri )
     {
-        if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickled_busy_cpu);
-        else
-            SCHED_STAT_CRANK(tickled_idle_cpu);
+        ASSERT(cpumask_test_cpu(cpu, new->vcpu->cpu_hard_affinity));
+        SCHED_STAT_CRANK(tickled_busy_cpu);
         __cpumask_set_cpu(cpu, &mask);
     }
     else if ( !idlers_empty )
@@ -452,9 +449,12 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                 SCHED_STAT_CRANK(tickled_idle_cpu);
                 if ( opt_tickle_one_idle )
                 {
-                    this_cpu(last_tickle_cpu) =
-                        cpumask_cycle(this_cpu(last_tickle_cpu),
-                                      cpumask_scratch_cpu(cpu));
+                    if ( cpumask_test_cpu(cpu, cpumask_scratch_cpu(cpu)) )
+                        this_cpu(last_tickle_cpu) = cpu;
+                    else
+                        this_cpu(last_tickle_cpu) =
+                            cpumask_cycle(this_cpu(last_tickle_cpu),
+                                          cpumask_scratch_cpu(cpu));
                     __cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
                 }
                 else



* [RFC PATCH v1 04/16] xen: sched: make the logic for tracking idle core generic.
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (2 preceding siblings ...)
  2018-08-24 23:35 ` [RFC PATCH v1 03/16] xen: Credit1: do not always tickle an idle pcpu Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 05/16] xen: Credit1: track fully idle cores Dario Faggioli
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Jan Beulich

This logic was introduced in 9bb9c73884d ("xen: credit2: implement
true SMT support"), but it was available only to Credit2.

Move the functions to common headers, so that other schedulers can
also use them to track the idleness of full cores.

No functional change intended.
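
For reference, here is a minimal usage sketch (the caller and the
next_is_idle flag are hypothetical; the `prv' masks follow the naming
used in this series) of how a scheduler is expected to drive these
helpers from its schedule path, under a suitable lock, given that
they are not atomic:

    /* Keep prv->smt_idle consistent when cpu's idleness changes. */
    if ( next_is_idle )   /* hypothetical: cpu is going idle */
    {
        cpumask_set_cpu(cpu, prv->idlers);
        /* Sets the whole core's bits iff all siblings are now idle. */
        smt_idle_mask_set(cpu, prv->idlers, prv->smt_idle);
    }
    else
    {
        cpumask_clear_cpu(cpu, prv->idlers);
        /* Clears all the core's bits as soon as one sibling is busy. */
        smt_idle_mask_clear(cpu, prv->smt_idle);
    }

This is, in fact, the pattern that the next patch implements for
Credit1, with a dedicated spinlock for serialization.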

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 xen/common/sched_credit2.c |  128 ++++++++++++++++++--------------------------
 xen/include/xen/sched.h    |   35 ++++++++++++
 2 files changed, 87 insertions(+), 76 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 743848121f..5d2040ff90 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -217,6 +217,44 @@
  *    must always be taken for first.
  */
 
+/*
+ * Hyperthreading (SMT) support.
+ *
+ * We use a special per-runq mask (smt_idle) and update it according to the
+ * following logic:
+ *  - when _all_ the SMT siblings in a core are idle, all their corresponding
+ *    bits are set in the smt_idle mask;
+ *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
+ *    bits corresponding to it and to all its siblings are clear in the
+ *    smt_idle mask.
+ *
+ * Once we have such a mask, it is easy to implement a policy that, either:
+ *  - uses fully idle cores first: it is enough to try to schedule the vcpus
+ *    on pcpus from smt_idle mask first. This is what happens if
+ *    sched_smt_power_savings was not set at boot (default), and it maximizes
+ *    true parallelism, and hence performance;
+ *  - uses already busy cores first: it is enough to try to schedule the vcpus
+ *    on pcpus that are idle, but are not in smt_idle. This is what happens if
+ *    sched_smt_power_savings is set at boot, and it allows as many cores as
+ *    possible to stay in low power states, minimizing power consumption.
+ *
+ * This logic is entirely implemented in runq_tickle(), and that is enough.
+ * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
+ * runq, _always_ happens by means of tickling:
+ *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
+ *    runq_tickle();
+ *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
+ *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
+ *    csched2_cpu_pick() looks for the least loaded runq and return just any
+ *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
+ *    the chosen runq, and it is again runq_tickle(), called by
+ *    csched2_vcpu_wake() that actually decides what pcpu to use within the
+ *    chosen runq;
+ *  - when a migration is initiated in sched_credit2.c, by calling  migrate()
+ *    directly, that again temporarily use a random pcpu from the new runq,
+ *    and then calls runq_tickle(), by itself.
+ */
+
 /*
  * Basic constants
  */
@@ -489,6 +527,20 @@ struct csched2_runqueue_data {
     unsigned int pick_bias;    /* Last picked pcpu. Start from it next time  */
 };
 
+/*
+ * Note that rqd->smt_idle is different from rqd->idle. rqd->idle
+ * records pcpus that are merely idle (i.e., at the moment do not
+ * have a vcpu running on them).  But you have to manually filter out
+ * which pcpus have been tickled in order to find cores that are not
+ * going to be busy soon.  Filtering out tickled cpus pairwise is a
+ * lot of extra pain; so for rqd->smt_idle, we explicitly make so that
+ * the bits of a pcpu are set only if all the threads on its core are
+ * both idle *and* untickled.
+ *
+ * This means changing the mask when either rqd->idle or rqd->tickled
+ * changes.
+ */
+
 /*
  * System-wide private data
  */
@@ -600,82 +652,6 @@ static inline bool has_cap(const struct csched2_vcpu *svc)
     return svc->budget != STIME_MAX;
 }
 
-/*
- * Hyperthreading (SMT) support.
- *
- * We use a special per-runq mask (smt_idle) and update it according to the
- * following logic:
- *  - when _all_ the SMT sibling in a core are idle, all their corresponding
- *    bits are set in the smt_idle mask;
- *  - when even _just_one_ of the SMT siblings in a core is not idle, all the
- *    bits correspondings to it and to all its siblings are clear in the
- *    smt_idle mask.
- *
- * Once we have such a mask, it is easy to implement a policy that, either:
- *  - uses fully idle cores first: it is enough to try to schedule the vcpus
- *    on pcpus from smt_idle mask first. This is what happens if
- *    sched_smt_power_savings was not set at boot (default), and it maximizes
- *    true parallelism, and hence performance;
- *  - uses already busy cores first: it is enough to try to schedule the vcpus
- *    on pcpus that are idle, but are not in smt_idle. This is what happens if
- *    sched_smt_power_savings is set at boot, and it allows as more cores as
- *    possible to stay in low power states, minimizing power consumption.
- *
- * This logic is entirely implemented in runq_tickle(), and that is enough.
- * In fact, in this scheduler, placement of a vcpu on one of the pcpus of a
- * runq, _always_ happens by means of tickling:
- *  - when a vcpu wakes up, it calls csched2_vcpu_wake(), which calls
- *    runq_tickle();
- *  - when a migration is initiated in schedule.c, we call csched2_cpu_pick(),
- *    csched2_vcpu_migrate() (which calls migrate()) and csched2_vcpu_wake().
- *    csched2_cpu_pick() looks for the least loaded runq and return just any
- *    of its processors. Then, csched2_vcpu_migrate() just moves the vcpu to
- *    the chosen runq, and it is again runq_tickle(), called by
- *    csched2_vcpu_wake() that actually decides what pcpu to use within the
- *    chosen runq;
- *  - when a migration is initiated in sched_credit2.c, by calling  migrate()
- *    directly, that again temporarily use a random pcpu from the new runq,
- *    and then calls runq_tickle(), by itself.
- */
-
-/*
- * If all the siblings of cpu (including cpu itself) are both idle and
- * untickled, set all their bits in mask.
- *
- * NB that rqd->smt_idle is different than rqd->idle.  rqd->idle
- * records pcpus that at are merely idle (i.e., at the moment do not
- * have a vcpu running on them).  But you have to manually filter out
- * which pcpus have been tickled in order to find cores that are not
- * going to be busy soon.  Filtering out tickled cpus pairwise is a
- * lot of extra pain; so for rqd->smt_idle, we explicitly make so that
- * the bits of a pcpu are set only if all the threads on its core are
- * both idle *and* untickled.
- *
- * This means changing the mask when either rqd->idle or rqd->tickled
- * changes.
- */
-static inline
-void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
-                       cpumask_t *mask)
-{
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, idlers) )
-        cpumask_or(mask, mask, cpu_siblings);
-}
-
-/*
- * Clear the bits of all the siblings of cpu from mask (if necessary).
- */
-static inline
-void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
-{
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, mask) )
-        cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
-}
-
 /*
  * In csched2_cpu_pick(), it may not be possible to actually look at remote
  * runqueues (the trylock-s on their spinlocks can fail!). If that happens,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 51ceebe6cc..09c25bfdd2 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -894,8 +894,43 @@ static inline bool is_vcpu_online(const struct vcpu *v)
     return !test_bit(_VPF_down, &v->pause_flags);
 }
 
+/*
+ *  - sched_smt_power_savings=0 means maximum true parallelism. The schedulers
+ *    should try to schedule vcpus on pcpus belonging to cores on which all
+ *    the threads are currently idle;
+ *  - sched_smt_power_savings=1 means minimum power consumption. The schedulers
+ *    should try to schedule vcpus on pcpus belonging to cores on which one
+ *    or more threads are currently busy, as this allows for as many cores as
+ *    possible to stay in low power states.
+ */
 extern bool sched_smt_power_savings;
 
+/*
+ * If all the siblings of cpu (including cpu itself) are idle, set
+ * their bits in mask.
+ */
+static inline
+void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
+                       cpumask_t *mask)
+{
+    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
+
+    if ( cpumask_subset(cpu_siblings, idlers) )
+        cpumask_or(mask, mask, cpu_siblings);
+}
+
+/*
+ * Clear the bits of all the siblings of cpu from mask (if necessary).
+ */
+static inline
+void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+{
+    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
+
+    if ( cpumask_subset(cpu_siblings, mask) )
+        cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+}
+
 extern enum cpufreq_controller {
     FREQCTL_none, FREQCTL_dom0_kernel, FREQCTL_xen
 } cpufreq_controller;



* [RFC PATCH v1 05/16] xen: Credit1: track fully idle cores.
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (3 preceding siblings ...)
  2018-08-24 23:35 ` [RFC PATCH v1 04/16] xen: sched: make the logic for tracking idle core generic Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:35 ` [RFC PATCH v1 06/16] xen: Credit1: check for fully idle cores when tickling Dario Faggioli
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

As in Credit2, we add a per-scheduler cpumask in which the bits of
the CPUs of a core are set if and only if all the threads of the core
are idle.

If even just one thread is not idle (or has just been tickled, and
hence is about to pick up work), all the bits are zero.

Note that this new mask needs its own serialization. In fact, it can't
be kept consistent with atomic cpumask operations alone; without
serialization, its content would become totally unreliable, and not
representative of the actual idleness status of the cores.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   53 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index d0eeb5a335..c7ee85d56a 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -207,7 +207,8 @@ struct csched_private {
     /* lock for the whole pluggable scheduler, nests inside cpupool_lock */
     spinlock_t lock;
 
-    cpumask_var_t idlers;
+    spinlock_t smt_idle_lock;
+    cpumask_var_t idlers, smt_idle;
     cpumask_var_t cpus;
     uint32_t *balance_bias;
     uint32_t runq_sort;
@@ -242,6 +243,23 @@ __runq_elem(struct list_head *elem)
     return list_entry(elem, struct csched_vcpu, runq_elem);
 }
 
+/* smt_idle_mask_* are not atomic, so we need our own serialization here. */
+static inline void
+csched_smt_idle_mask_set(unsigned int cpu, struct csched_private *prv)
+{
+    spin_lock(&prv->smt_idle_lock);
+    smt_idle_mask_set(cpu, prv->idlers, prv->smt_idle);
+    spin_unlock(&prv->smt_idle_lock);
+}
+
+static inline void
+csched_smt_idle_mask_clear(unsigned int cpu, struct csched_private *prv)
+{
+    spin_lock(&prv->smt_idle_lock);
+    smt_idle_mask_clear(cpu, prv->smt_idle);
+    spin_unlock(&prv->smt_idle_lock);
+}
+
 /* Is the first element of cpu's runq (if any) cpu's idle vcpu? */
 static inline bool_t is_runq_idle(unsigned int cpu)
 {
@@ -487,7 +505,13 @@ static inline void __runq_tickle(struct csched_vcpu *new)
          * true, the loop does only one step, and only one bit is cleared.
          */
         for_each_cpu(cpu, &mask)
-            cpumask_clear_cpu(cpu, prv->idlers);
+        {
+            if ( cpumask_test_cpu(cpu, prv->idlers) )
+            {
+                cpumask_clear_cpu(cpu, prv->idlers);
+                csched_smt_idle_mask_clear(cpu, prv);
+            }
+        }
         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
     }
     else
@@ -534,6 +558,7 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
     prv->credit -= prv->credits_per_tslice;
     prv->ncpus--;
     cpumask_clear_cpu(cpu, prv->idlers);
+    cpumask_clear_cpu(cpu, prv->smt_idle);
     cpumask_clear_cpu(cpu, prv->cpus);
     if ( (prv->master == cpu) && (prv->ncpus > 0) )
     {
@@ -598,6 +623,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    cpumask_set_cpu(cpu, prv->smt_idle);
     spc->nr_runnable = 0;
 }
 
@@ -989,6 +1015,8 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
              */
             ASSERT(!cpumask_test_cpu(cpu,
                                      CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+            ASSERT(!cpumask_test_cpu(cpu,
+                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->smt_idle));
             cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
         }
     }
@@ -1095,6 +1123,7 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
          * so the bit must be zero already.
          */
         ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->smt_idle));
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __vcpu_on_runq(svc) )
@@ -1961,11 +1990,15 @@ csched_schedule(
     if ( !tasklet_work_scheduled && snext->pri == CSCHED_PRI_IDLE )
     {
         if ( !cpumask_test_cpu(cpu, prv->idlers) )
+        {
             cpumask_set_cpu(cpu, prv->idlers);
+            csched_smt_idle_mask_set(cpu, prv);
+        }
     }
     else if ( cpumask_test_cpu(cpu, prv->idlers) )
     {
         cpumask_clear_cpu(cpu, prv->idlers);
+        csched_smt_idle_mask_clear(cpu, prv);
     }
 
     if ( !is_idle_vcpu(snext->vcpu) )
@@ -2079,7 +2112,7 @@ csched_dump(const struct scheduler *ops)
 
     spin_lock_irqsave(&prv->lock, flags);
 
-#define idlers_buf keyhandler_scratch
+#define cpustr keyhandler_scratch
 
     printk("info:\n"
            "\tncpus              = %u\n"
@@ -2107,8 +2140,10 @@ csched_dump(const struct scheduler *ops)
            prv->ticks_per_tslice,
            prv->vcpu_migr_delay/ MICROSECS(1));
 
-    cpumask_scnprintf(idlers_buf, sizeof(idlers_buf), prv->idlers);
-    printk("idlers: %s\n", idlers_buf);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), prv->idlers);
+    printk("idlers: %s\n", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), prv->smt_idle);
+    printk("fully idle cores: %s\n", cpustr);
 
     printk("active vcpus:\n");
     loop = 0;
@@ -2131,7 +2166,7 @@ csched_dump(const struct scheduler *ops)
             vcpu_schedule_unlock(lock, svc->vcpu);
         }
     }
-#undef idlers_buf
+#undef cpustr
 
     spin_unlock_irqrestore(&prv->lock, flags);
 }
@@ -2183,9 +2218,11 @@ csched_init(struct scheduler *ops)
     }
 
     if ( !zalloc_cpumask_var(&prv->cpus) ||
-         !zalloc_cpumask_var(&prv->idlers) )
+         !zalloc_cpumask_var(&prv->idlers) ||
+         !zalloc_cpumask_var(&prv->smt_idle) )
     {
         free_cpumask_var(prv->cpus);
+        free_cpumask_var(prv->idlers);
         xfree(prv->balance_bias);
         xfree(prv);
         return -ENOMEM;
@@ -2193,6 +2230,7 @@ csched_init(struct scheduler *ops)
 
     ops->sched_data = prv;
     spin_lock_init(&prv->lock);
+    spin_lock_init(&prv->smt_idle_lock);
     INIT_LIST_HEAD(&prv->active_sdom);
     prv->master = UINT_MAX;
 
@@ -2219,6 +2257,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        free_cpumask_var(prv->smt_idle);
         xfree(prv->balance_bias);
         xfree(prv);
     }



* [RFC PATCH v1 06/16] xen: Credit1: check for fully idle cores when tickling
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (4 preceding siblings ...)
  2018-08-24 23:35 ` [RFC PATCH v1 05/16] xen: Credit1: track fully idle cores Dario Faggioli
@ 2018-08-24 23:35 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 07/16] xen: Credit1: reorg __runq_tickle() code a bit Dario Faggioli
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:35 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Now that we have the information about which cores are fully idle, use
it for tickling pcpus.

This way, if there are cores that are fully idle, we tickle a pcpu from
one of them. This improves the SMT-awareness of the scheduler,
guaranteeing a better distribution of the load, right from vcpu wakeup.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   51 +++++++++++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index c7ee85d56a..af3b81d377 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -424,6 +424,15 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                                      cpumask_scratch_cpu(cpu));
             cpumask_and(cpumask_scratch_cpu(cpu),
                         cpumask_scratch_cpu(cpu), &idle_mask);
+
+            /* Let's check fully idle cores first. */
+            if ( cpumask_intersects(cpumask_scratch_cpu(cpu), prv->smt_idle) )
+            {
+                cpumask_and(&mask, cpumask_scratch_cpu(cpu), prv->smt_idle);
+                break;
+            }
+
+            /* Now, let's check all idlers. */
             new_idlers_empty = cpumask_empty(cpumask_scratch_cpu(cpu));
 
             /*
@@ -441,11 +450,6 @@ static inline void __runq_tickle(struct csched_vcpu *new)
              * We have to do it indirectly, via _VPF_migrating (instead
              * of just tickling any idler suitable for cur) because cur
              * is running.
-             *
-             * If there are suitable idlers for new, no matter priorities,
-             * leave cur alone (as it is running and is, likely, cache-hot)
-             * and wake some of them (which is waking up and so is, likely,
-             * cache cold anyway).
              */
             if ( new_idlers_empty && new->pri > cur->pri )
             {
@@ -460,23 +464,19 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                 /* Tickle cpu anyway, to let new preempt cur. */
                 SCHED_STAT_CRANK(tickled_busy_cpu);
                 __cpumask_set_cpu(cpu, &mask);
+                goto tickle;
             }
-            else if ( !new_idlers_empty )
+
+            /*
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( !new_idlers_empty )
             {
-                /* Which of the idlers suitable for new shall we wake up? */
                 SCHED_STAT_CRANK(tickled_idle_cpu);
-                if ( opt_tickle_one_idle )
-                {
-                    if ( cpumask_test_cpu(cpu, cpumask_scratch_cpu(cpu)) )
-                        this_cpu(last_tickle_cpu) = cpu;
-                    else
-                        this_cpu(last_tickle_cpu) =
-                            cpumask_cycle(this_cpu(last_tickle_cpu),
-                                          cpumask_scratch_cpu(cpu));
-                    __cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, cpumask_scratch_cpu(cpu));
+                cpumask_or(&mask, &mask, cpumask_scratch_cpu(cpu));
             }
 
             /* Did we find anyone? */
@@ -485,9 +485,20 @@ static inline void __runq_tickle(struct csched_vcpu *new)
         }
     }
 
- tickle:
     if ( !cpumask_empty(&mask) )
     {
+        /* Which of the idlers suitable for new shall we wake up? */
+        if ( opt_tickle_one_idle )
+        {
+            if ( cpumask_test_cpu(cpu, &mask) )
+                this_cpu(last_tickle_cpu) = cpu;
+            else
+                this_cpu(last_tickle_cpu) =
+                    cpumask_cycle(this_cpu(last_tickle_cpu), &mask);
+            cpumask_copy(&mask, cpumask_of(this_cpu(last_tickle_cpu)));
+        }
+
+ tickle:
         if ( unlikely(tb_init_done) )
         {
             /* Avoid TRACE_*: saves checking !tb_init_done each step */



* [RFC PATCH v1 07/16] xen: Credit1: reorg __runq_tickle() code a bit.
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (5 preceding siblings ...)
  2018-08-24 23:35 ` [RFC PATCH v1 06/16] xen: Credit1: check for fully idle cores when tickling Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 08/16] xen: Credit1: reorg csched_schedule() " Dario Faggioli
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Basically, we deal earlier with the case when there are some idle
pcpus that are suitable for running the newly waking vcpu (and bail
out if there are none).

This is pure code refactoring, with the purpose of separating code
movement from functional changes (the latter coming in follow-up
commits).

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index af3b81d377..e218eb6986 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -413,7 +413,7 @@ static inline void __runq_tickle(struct csched_vcpu *new)
          */
         for_each_affinity_balance_step( balance_step )
         {
-            int new_idlers_empty;
+            bool new_idlers_empty;
 
             if ( balance_step == BALANCE_SOFT_AFFINITY
                  && !has_soft_affinity(new->vcpu) )
@@ -444,6 +444,19 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                  && balance_step == BALANCE_SOFT_AFFINITY )
                 continue;
 
+            /*
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway), and go for one of them.
+             */
+            if ( !new_idlers_empty )
+            {
+                SCHED_STAT_CRANK(tickled_idle_cpu);
+                cpumask_or(&mask, &mask, cpumask_scratch_cpu(cpu));
+                break;
+            }
+
             /*
              * If there are no suitable idlers for new, and it's higher
              * priority than cur, check whether we can migrate cur away.
@@ -451,7 +464,7 @@ static inline void __runq_tickle(struct csched_vcpu *new)
              * of just tickling any idler suitable for cur) because cur
              * is running.
              */
-            if ( new_idlers_empty && new->pri > cur->pri )
+            if ( new->pri > cur->pri )
             {
                 if ( cpumask_intersects(cur->vcpu->cpu_hard_affinity,
                                         &idle_mask) )
@@ -467,21 +480,8 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                 goto tickle;
             }
 
-            /*
-             * If there are suitable idlers for new, no matter priorities,
-             * leave cur alone (as it is running and is, likely, cache-hot)
-             * and wake some of them (which is waking up and so is, likely,
-             * cache cold anyway).
-             */
-            if ( !new_idlers_empty )
-            {
-                SCHED_STAT_CRANK(tickled_idle_cpu);
-                cpumask_or(&mask, &mask, cpumask_scratch_cpu(cpu));
-            }
-
-            /* Did we find anyone? */
-            if ( !cpumask_empty(&mask) )
-                break;
+            /* We get here only if we didn't find anyone. */
+            ASSERT(cpumask_empty(&mask));
         }
     }
 



* [RFC PATCH v1 08/16] xen: Credit1: reorg csched_schedule() code a bit.
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (6 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 07/16] xen: Credit1: reorg __runq_tickle() code a bit Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 09/16] xen: Credit1: SMT-aware domain co-scheduling parameter and data structs Dario Faggioli
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

This is pure code refactoring, with the purpose of isolating code
movement. No functional change intended.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index e218eb6986..cd5524c3ba 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1859,8 +1859,8 @@ csched_schedule(
     struct csched_vcpu * const scurr = CSCHED_VCPU(current);
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
-    struct task_slice ret;
-    s_time_t runtime, tslice;
+    struct task_slice ret = { .migrated = 0 };
+    s_time_t runtime, tslice = prv->tslice;
 
     SCHED_STAT_CRANK(schedule);
     CSCHED_VCPU_CHECK(current);
@@ -1949,11 +1949,8 @@ csched_schedule(
             __trace_var(TRC_CSCHED_RATELIMIT, 1, sizeof(d),
                         (unsigned char *)&d);
         }
-
-        ret.migrated = 0;
-        goto out;
+        goto out_rlim;
     }
-    tslice = prv->tslice;
 
     /*
      * Select next runnable local VCPU (ie top of local runq)
@@ -1965,21 +1962,17 @@ csched_schedule(
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
     }
 
-    snext = __runq_elem(runq->next);
-    ret.migrated = 0;
-
     /* Tasklet work (which runs in idle VCPU context) overrides all else. */
-    if ( tasklet_work_scheduled )
+    if ( unlikely(tasklet_work_scheduled) )
     {
         TRACE_0D(TRC_CSCHED_SCHED_TASKLET);
         snext = CSCHED_VCPU(idle_vcpu[cpu]);
         snext->pri = CSCHED_PRI_TS_BOOST;
+        runq_remove(snext);
+        goto out;
     }
 
-    /*
-     * Clear YIELD flag before scheduling out
-     */
-    clear_bit(CSCHED_FLAG_VCPU_YIELD, &scurr->flags);
+    snext = __runq_elem(runq->next);
 
     /*
      * SMP Load balance:
@@ -1994,6 +1987,7 @@ csched_schedule(
     else
         snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
 
+ out:
     /*
      * Update idlers mask if necessary. When we're idling, other CPUs
      * will tickle us when they get extra work.
@@ -2015,7 +2009,12 @@ csched_schedule(
     if ( !is_idle_vcpu(snext->vcpu) )
         snext->start_time += now;
 
-out:
+    /*
+     * Clear YIELD flag before scheduling out
+     */
+    clear_bit(CSCHED_FLAG_VCPU_YIELD, &scurr->flags);
+
+ out_rlim:
     /*
      * Return task to run next...
      */



* [RFC PATCH v1 09/16] xen: Credit1: SMT-aware domain co-scheduling parameter and data structs
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (7 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 08/16] xen: Credit1: reorg csched_schedule() " Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 10/16] xen: Credit1: support sched_smt_cosched in csched_schedule() Dario Faggioli
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Wei Liu, Jan Beulich

In fact, we want to be able to enforce that only vcpus belonging to the
same domain are executed on pcpus that are part of one core (i.e., on
pcpus that are 'SMT siblings', or just 'hyperthreads', of each other).

To achieve that, we introduce a new data structure, representing a
physical core, and use it to track which domain's vcpus are currently
running on the pcpus of each core.

As far as this commit is concerned, however, only the boot command line
parameter (to enable or disable the feature), the data structures and
the domain tracking logic are implemented.

Of course, until we actually enforce that only vcpus of the same
domain run on each core, whatever the tracking shows (e.g., via
`xl debug-keys r') is not valid, and should be ignored.
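
To give an idea of where the series is heading, here is a minimal
sketch of the eligibility test that the enforcement patches build on
top of the new structure (the helper itself is hypothetical, not code
from this patch):

    /* With SMT co-scheduling enabled, a vcpu of domain sdom can run on
     * a pcpu of this core only if the core is fully idle, or if the
     * domain currently running on the core's threads is sdom itself. */
    static inline bool
    csched_smt_cosched_ok(const struct csched_core *core,
                          const struct csched_dom *sdom)
    {
        return !sched_smt_cosched || core->sdom == NULL ||
               core->sdom == sdom;
    }

The feature itself is controlled by booting Xen with
sched_smt_cosched=true (see the xen-command-line.markdown hunk below).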

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
TODO:
 - (in this patch, and in all the following ones) most of the tracking
   and serialization logic necessary to implement the feature is not
   really executed only when sched_smt_cosched=true, but all the time
   (basically, setting sched_smt_cosched=false does disable the
   feature, but we still pay some of the overhead).
---
 docs/misc/xen-command-line.markdown |   11 ++++
 xen/common/sched_credit.c           |  103 ++++++++++++++++++++++++++++++++++-
 xen/common/schedule.c               |    7 ++
 xen/include/xen/sched.h             |    7 ++
 4 files changed, 124 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 559c0662fa..3f3b3dec41 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1692,6 +1692,17 @@ amount of time that a vcpu can be scheduled for before preempting it,
 in microseconds.  The default is 1000us (1ms).  Setting this to 0
 disables it altogether.
 
+### sched\_smt\_cosched
+> `= <boolean>`
+
+If true, forces the scheduler to run, at any given point in time, only
+vCPUs that belong to one domain (or nothing!) on the various pCPUs
+that belong to one physical core (the so called SMT-siblings, or
+SMT-hyperthreads, or just hyperthreads).
+
+This feature is referred to as SMT domain co-scheduling, or SMT
+co-scheduling, or even just co-scheduling.
+
 ### sched\_smt\_power\_savings
 > `= <boolean>`
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index cd5524c3ba..fb418ffb2f 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -155,8 +155,19 @@ struct csched_pcpu {
 
     unsigned int tick;
     struct timer ticker;
+    struct csched_core *core;
 };
 
+/* For dealing with threads in the same cores */
+struct csched_core {
+    spinlock_t lock;
+    cpumask_t cpus, idlers;
+    struct csched_dom *sdom;
+    bool init_done;
+};
+
+static struct csched_core *cores;
+
 /*
  * Virtual CPU
  */
@@ -605,6 +616,9 @@ csched_alloc_pdata(const struct scheduler *ops, int cpu)
 static void
 init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
 {
+    unsigned int i;
+    unsigned long flags;
+
     ASSERT(spin_is_locked(&prv->lock));
     /* cpu data needs to be allocated, but STILL uninitialized. */
     ASSERT(spc && spc->runq.next == NULL && spc->runq.prev == NULL);
@@ -631,6 +645,50 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     spc->runq_sort_last = prv->runq_sort;
     spc->idle_bias = nr_cpu_ids - 1;
 
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        /*
+         * We do this _only_ the first time that this pcpu is assigned to
+         * an instance of the Credit scheduler. This is ok since, no matter
+         * to which cpupool and scheduler instance the CPU is moved later,
+         * the topology does not change.
+         */
+        if ( !cores[i].init_done )
+        {
+            /*
+             * The absolute first time we run the loop, we initialize cores[0],
+             * we put the CPU in it, and break out.
+             */
+            spin_lock_init(&cores[i].lock);
+            ASSERT(cores[i].sdom == NULL);
+            printk("activating core %u for cpu %d\n", i, cpu);
+            cpumask_set_cpu(cpu, &cores[i].cpus);
+            cpumask_set_cpu(cpu, &cores[i].idlers);
+            spc->core = &cores[i];
+            cores[i].init_done = true;
+            break;
+        }
+
+        /*
+         * If we are here, at least one element of the cores array has been
+         * initialized, and one CPU has "been put" in it. Check if this CPU
+         * is a sibling, and should refer to that element too. If not, stay
+         * in the loop; at some point we'll find a non-initialised element
+         * and use it.
+         */
+        ASSERT(!cpumask_empty(&cores[i].cpus));
+        if ( cpumask_test_cpu(cpu, per_cpu(cpu_sibling_mask, cpumask_first(&cores[i].cpus))) )
+        {
+            printk("putting cpu %d in core %u\n", cpu, i);
+            spin_lock_irqsave(&cores[i].lock, flags);
+            cpumask_set_cpu(cpu, &cores[i].cpus);
+            cpumask_set_cpu(cpu, &cores[i].idlers);
+            spin_unlock_irqrestore(&cores[i].lock, flags);
+            spc->core = &cores[i];
+            break;
+        }
+    }
+
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
@@ -1857,6 +1915,7 @@ csched_schedule(
     const int cpu = smp_processor_id();
     struct list_head * const runq = RUNQ(cpu);
     struct csched_vcpu * const scurr = CSCHED_VCPU(current);
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_vcpu *snext;
     struct task_slice ret = { .migrated = 0 };
@@ -1988,6 +2047,8 @@ csched_schedule(
         snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
 
  out:
+    spin_lock(&spc->core->lock);
+
     /*
      * Update idlers mask if necessary. When we're idling, other CPUs
      * will tickle us when they get extra work.
@@ -1999,13 +2060,27 @@ csched_schedule(
             cpumask_set_cpu(cpu, prv->idlers);
             csched_smt_idle_mask_set(cpu, prv);
         }
+        cpumask_set_cpu(cpu, &spc->core->idlers);
+        if ( cpumask_equal(per_cpu(cpu_sibling_mask, cpu), &spc->core->idlers) )
+            spc->core->sdom = NULL;
     }
-    else if ( cpumask_test_cpu(cpu, prv->idlers) )
+    else
     {
-        cpumask_clear_cpu(cpu, prv->idlers);
-        csched_smt_idle_mask_clear(cpu, prv);
+        if ( cpumask_test_cpu(cpu, prv->idlers) )
+        {
+            cpumask_clear_cpu(cpu, prv->idlers);
+            csched_smt_idle_mask_clear(cpu, prv);
+        }
+
+        if ( !tasklet_work_scheduled )
+        {
+            cpumask_clear_cpu(cpu, &spc->core->idlers);
+            spc->core->sdom = snext->sdom;
+        }
     }
 
+    spin_unlock(&spc->core->lock);
+
     if ( !is_idle_vcpu(snext->vcpu) )
         snext->start_time += now;
 
@@ -2085,8 +2160,20 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
     printk("CPU[%02d] nr_runbl=%d, sort=%d, sibling=%s, ",
            cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), &spc->core->idlers);
+    printk("idle_sibling=%s, ", cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
-    printk("core=%s\n", cpustr);
+    printk("core=%s", cpustr);
+    ASSERT(spc->core->init_done);
+    ASSERT(cpumask_equal(per_cpu(cpu_sibling_mask, cpu), &spc->core->cpus));
+    if ( sched_smt_cosched )
+    {
+        if ( spc->core->sdom )
+            printk(", sdom=d%d", spc->core->sdom->dom->domain_id);
+        else
+            printk(", sdom=/");
+    }
+    printk("\n");
 
     /* current VCPU (nothing to say if that's the idle vcpu). */
     svc = CSCHED_VCPU(curr_on_cpu(cpu));
@@ -2220,6 +2307,14 @@ csched_init(struct scheduler *ops)
     if ( prv == NULL )
         return -ENOMEM;
 
+    /* Allocate all core structures, and mark them as un-initialized */
+    cores = xzalloc_array(struct csched_core, nr_cpu_ids);
+    if ( !cores )
+    {
+        xfree(prv);
+        return -ENOMEM;
+    }
+
     prv->balance_bias = xzalloc_array(uint32_t, MAX_NUMNODES);
     if ( prv->balance_bias == NULL )
     {
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 05281d6af7..ef28576d77 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -49,6 +49,13 @@ string_param("sched", opt_sched);
 bool_t sched_smt_power_savings = 0;
 boolean_param("sched_smt_power_savings", sched_smt_power_savings);
 
+/*
+ * If enabled, only vcpus of the same domain will be scheduled on sibling
+ * hyperthreads of the same core.
+ */
+bool sched_smt_cosched = 0;
+boolean_param("sched_smt_cosched", sched_smt_cosched);
+
 /* Default scheduling rate limit: 1ms
  * The behavior when sched_ratelimit_us is greater than sched_credit_tslice_ms is undefined
  * */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 09c25bfdd2..1c2383cccb 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -905,6 +905,13 @@ static inline bool is_vcpu_online(const struct vcpu *v)
  */
 extern bool sched_smt_power_savings;
 
+/*
+ * If sched_smt_cosched is 1, vcpus which are not from the same domain
+ * will never be scheduled and run, at the same time, on two sibling
+ * hyperthreads of the same core.
+ */
+extern bool sched_smt_cosched;
+
 /*
  * If all the siblings of cpu (including cpu itself) are idle, set
  * their bits in mask.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 10/16] xen: Credit1: support sched_smt_cosched in csched_schedule()
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (8 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 09/16] xen: Credit1: SMT-aware domain co-scheduling parameter and data structs Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 11/16] xen: Credit1: support sched_smt_cosched in _csched_cpu_pick() Dario Faggioli
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If sched_smt_cosched is enabled, after having selected the next vcpu to
run on a pcpu, we check whether either:
- the whole core is fully idle,
- or the chosen vcpu is from the same domain as the other vcpus
  running on pcpus of the core.

If that is not the case, and the core is not idle, we scan the runqueue,
looking for a vcpu from the 'right' domain, and if we find none, we stay
idle.

With this commit, it is guaranteed that if, say, vcpus of domain A are
running on some of the hyperthreads of core 1, nothing other than another
vcpu of domain A will be picked from the runqueue of any other pcpu of
core 1.

It is, however, still possible that a vcpu from a random domain is run
on a non fully idle core, as a consequence of load balancing; this will
be dealt with in a later commit.
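
The constraint being enforced can be summarised by a small predicate
(illustrative only; no such helper actually exists in the patch):

/* Sketch: can a core, owned as per 'core', run vcpu 'svc'? */
static inline bool core_allows(const struct csched_core *core,
                               const struct csched_vcpu *svc)
{
    return is_idle_vcpu(svc->vcpu) ||   /* idling is always allowed   */
           core->sdom == NULL ||        /* fully idle, unowned core   */
           core->sdom == svc->sdom;     /* svc's domain owns the core */
}

When this fails for the head of the runqueue, we scan for a same-priority
vcpu of the owning domain and, if there is none, we go idle rather than
running a lower priority vcpu, so that fairness is not compromised.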

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   63 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index fb418ffb2f..22327b61fb 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -2021,6 +2021,8 @@ csched_schedule(
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
     }
 
+    spin_lock(&spc->core->lock);
+
     /* Tasklet work (which runs in idle VCPU context) overrides all else. */
     if ( unlikely(tasklet_work_scheduled) )
     {
@@ -2031,8 +2033,57 @@ csched_schedule(
         goto out;
     }
 
+    /*
+     * If all other threads of our core are idle, let's pretend the whole core
+     * is. In fact, if that is the case, we are free to pick any vcpu from any
+     * domain. It is safe to do this, as we're holding the core lock.
+     */
+    cpumask_set_cpu(cpu, &spc->core->idlers);
+    if ( spc->core->sdom != NULL &&
+         cpumask_equal(&spc->core->cpus, &spc->core->idlers) )
+        spc->core->sdom = NULL;
+
     snext = __runq_elem(runq->next);
 
+    /*
+     * If domain co-scheduling is enabled, and a domain is running already
+     * on this core, we can only pick a vcpu from that domain. If snext
+     * is not such a vcpu, look whether there is any in our runq.
+     */
+    if ( sched_smt_cosched && spc->core->sdom != NULL &&
+         !is_idle_vcpu(snext->vcpu) && spc->core->sdom != snext->sdom )
+    {
+        struct list_head *iter;
+        struct csched_vcpu *siter;
+        int spri = snext->pri;
+
+        /* XXX: We checked what's already in snext twice. Avoid that! */
+        list_for_each( iter, runq )
+        {
+            siter = __runq_elem(iter);
+
+            /*
+             * Don't pick up a vcpu which has lower priority than snext, or
+             * we'd compromise fairness (if not risk starvation!).
+             */
+            if ( siter->pri < spri )
+            {
+                snext = CSCHED_VCPU(idle_vcpu[cpu]);
+                snext->pri = CSCHED_PRI_IDLE;
+                break;
+            }
+
+            /* Found a suitable candidate? */
+            if ( spc->core->sdom == siter->sdom )
+            {
+                snext = siter;
+                break;
+            }
+        }
+        ASSERT(is_idle_vcpu(snext->vcpu) ||
+               (snext->sdom == spc->core->sdom && snext->pri >= spri));
+    }
+
     /*
      * SMP Load balance:
      *
@@ -2047,8 +2098,6 @@ csched_schedule(
         snext = csched_load_balance(prv, cpu, snext, &ret.migrated);
 
  out:
-    spin_lock(&spc->core->lock);
-
     /*
      * Update idlers mask if necessary. When we're idling, other CPUs
      * will tickle us when they get extra work.
@@ -2081,6 +2130,16 @@ csched_schedule(
 
     spin_unlock(&spc->core->lock);
 
+    /*
+     * If we are leaving someone in the runq, give other pcpus the chance
+     * to try to come and pick it up.
+     */
+    if ( spc->nr_runnable != 0 )
+    {
+        ASSERT(!list_empty(runq) && !is_idle_vcpu(__runq_elem(runq->next)->vcpu));
+        __runq_tickle(__runq_elem(runq->next));
+    }
+
     if ( !is_idle_vcpu(snext->vcpu) )
         snext->start_time += now;
 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 11/16] xen: Credit1: support sched_smt_cosched in _csched_cpu_pick()
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (9 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 10/16] xen: Credit1: support sched_smt_cosched in csched_schedule() Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 12/16] xen: Credit1: support sched_smt_cosched in csched_runq_steal() Dario Faggioli
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If sched_smt_cosched is enabled, take it into account when choosing on
what pcpu to run a vcpu (e.g., for doing a migration).

Basically, we can only run vcpus of domain A on pcpus of fully idle
cores, or of cores where other vcpus of domain A are running already
(and, vice versa, we absolutely don't want to run a vcpu of domain A
on pcpus of cores where vcpus of other domains are running!).

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 22327b61fb..81a2c8b384 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -822,7 +822,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
     cpumask_t *cpus = cpumask_scratch_cpu(vc->processor);
     cpumask_t idlers;
     cpumask_t *online = cpupool_domain_cpumask(vc->domain);
-    struct csched_pcpu *spc = NULL;
+    struct csched_pcpu *spc = NULL, *nspc = NULL;
     int cpu = vc->processor;
     int balance_step;
 
@@ -900,6 +900,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
             int migrate_factor;
 
             nxt = cpumask_cycle(cpu, cpus);
+            nspc = CSCHED_PCPU(nxt);
 
             if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
             {
@@ -929,15 +930,21 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
                  weight_cpu > weight_nxt :
                  weight_cpu * migrate_factor < weight_nxt )
             {
-                cpumask_and(&nxt_idlers, &nxt_idlers, cpus);
-                spc = CSCHED_PCPU(nxt);
-                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-                cpumask_andnot(cpus, cpus, per_cpu(cpu_sibling_mask, cpu));
-            }
-            else
-            {
-                cpumask_andnot(cpus, cpus, &nxt_idlers);
+                spin_lock(&nspc->core->lock);
+                if ( !sched_smt_cosched ||
+                     nspc->core->sdom == NULL || nspc->core->sdom->dom == vc->domain )
+                {
+                    cpumask_and(&nxt_idlers, &nxt_idlers, cpus);
+                    spc = CSCHED_PCPU(nxt);
+                    cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                    cpumask_andnot(cpus, cpus, per_cpu(cpu_sibling_mask, cpu));
+                    spin_unlock(&nspc->core->lock);
+                    continue;
+                }
+                spin_unlock(&nspc->core->lock);
             }
+
+            cpumask_andnot(cpus, cpus, &nxt_idlers);
         }
 
         /* Stop if cpu is idle */


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 12/16] xen: Credit1: support sched_smt_cosched in csched_runq_steal().
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (10 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 11/16] xen: Credit1: support sched_smt_cosched in _csched_cpu_pick() Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 13/16] xen: Credit1: sched_smt_cosched support in __csched_vcpu_is_migrateable() Dario Faggioli
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If sched_smt_cosched is enabled, when poking at other pcpus' runqueues
(for doing load balancing), we only consider the vcpus of the domain
that is already running on the core, unless the core is fully idle, in
which case we can pick up any vcpu.
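
As the XXX in the hunk below notes, the checks want to be more compact.
A possible folding (a sketch; it should be equivalent, given that an idle
vcpu's priority, CSCHED_PRI_IDLE, is never strictly higher than pri):

        /* Sketch of a more compact form of the three checks below. */
        if ( speer->pri < pri ||
             (speer->pri == pri &&
              (!sched_smt_cosched || spc->core->sdom == NULL)) ||
             (sched_smt_cosched && is_idle_vcpu(speer->vcpu)) )
            break;

That is: keep scanning only if speer has strictly higher priority, or has
equal priority while our core is owned by a domain we could co-schedule
with (the actual domain match is enforced by the migrateability check,
see the next patch).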

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 81a2c8b384..18167ee399 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1705,6 +1705,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_private * const prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
+    const struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
     struct vcpu *vc;
@@ -1722,9 +1723,26 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         /*
          * If next available VCPU here is not of strictly higher
          * priority than ours, this PCPU is useless to us.
+         *
+         * On the other hand, with sched_smt_cosched enabled, we are ok with
+         * vcpus that have the same prio as our candidate, as long as
+         * they come from the domain already running on this core.
          */
-        if ( speer->pri <= pri )
-            break;
+        if ( !sched_smt_cosched )
+        {
+            if ( speer->pri <= pri )
+                break;
+        }
+        else
+        {
+            /* XXX: This should be more 'compact' */
+            if ( is_idle_vcpu(speer->vcpu) )
+                break;
+            if ( speer->pri < pri )
+                break;
+            if ( speer->pri == pri && spc->core->sdom == NULL )
+                break;
+        }
 
         /* Is this VCPU runnable on our PCPU? */
         vc = speer->vcpu;


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 13/16] xen: Credit1: sched_smt_cosched support in __csched_vcpu_is_migrateable().
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (11 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 12/16] xen: Credit1: support sched_smt_cosched in csched_runq_steal() Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 14/16] xen: Credit1: sched_smt_cosched support in __runq_tickle() for pinned vcpus Dario Faggioli
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If SMT domain co-scheduling is enabled, we only migrate a vcpu to either
a fully idle core, or to pcpus of cores where other vcpus of the same
domain are running already.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 18167ee399..e6f55cafc2 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -802,15 +802,26 @@ static inline int
 __csched_vcpu_is_migrateable(const struct csched_private *prv, struct vcpu *vc,
                              int dest_cpu, cpumask_t *mask)
 {
+    const struct csched_pcpu * const dspc = CSCHED_PCPU(dest_cpu);
+
     /*
-     * Don't pick up work that's hot on peer PCPU, or that can't (or
-     * would prefer not to) run on cpu.
-     *
      * The caller is supposed to have already checked that vc is also
      * not running.
      */
     ASSERT(!vc->is_running);
 
+    /*
+     * If dest_cpu's core is not idle (and domain co-scheduling is on),
+     * we can only pick a vcpu of the domain already running there.
+     */
+    ASSERT(spin_is_locked(&dspc->core->lock));
+    if ( sched_smt_cosched && dspc->core->sdom != NULL &&
+         dspc->core->sdom->dom != vc->domain )
+        return 0;
+
+    /*
+     * Don't pick up work that's hot on peer PCPU, or that can't (or
+     * would prefer not to) run on cpu.
+     */
     return !__csched_vcpu_is_cache_hot(prv, vc) &&
            cpumask_test_cpu(dest_cpu, mask);
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 14/16] xen: Credit1: sched_smt_cosched support in __runq_tickle() for pinned vcpus.
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (12 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 13/16] xen: Credit1: sched_smt_cosched support in __csched_vcpu_is_migrateable() Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:36 ` [RFC PATCH v1 15/16] xen: Credit1: sched_smt_cosched support in __runq_tickle() Dario Faggioli
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

When a vcpu which is exclusively pinned to a pcpu wakes up, we only have
one option: tickling that exact pcpu.

If sched_smt_cosched is enabled, however, it only makes sense to do that
if the pcpu is in a fully idle core, or in a core where vcpus of the same
domain are running already.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index e6f55cafc2..9d6071e229 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -372,6 +372,7 @@ static inline void __runq_tickle(struct csched_vcpu *new)
     unsigned int cpu = new->vcpu->processor;
     struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
+    struct csched_pcpu *spc = CSCHED_PCPU(cpu);
     cpumask_t mask, idle_mask, *online;
     int balance_step, idlers_empty;
 
@@ -396,9 +397,21 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                   cpumask_test_cpu(cpu, &idle_mask)) )
     {
         ASSERT(cpumask_cycle(cpu, new->vcpu->cpu_hard_affinity) == cpu);
-        SCHED_STAT_CRANK(tickled_idle_cpu_excl);
-        __cpumask_set_cpu(cpu, &mask);
-        goto tickle;
+        spin_lock(&spc->core->lock);
+        /*
+         * If SMT co-scheduling of domains is enabled, we can only tickle
+         * either fully idle cores, or cores where new's domain is running
+         * already on (some of) the other thread(s).
+         */
+        if ( !sched_smt_cosched ||
+             spc->core->sdom == NULL || new->sdom == spc->core->sdom )
+        {
+            spin_unlock(&spc->core->lock);
+            SCHED_STAT_CRANK(tickled_idle_cpu_excl);
+            __cpumask_set_cpu(cpu, &mask);
+            goto tickle;
+        }
+        spin_unlock(&spc->core->lock);
     }
 
     /*


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 15/16] xen: Credit1: sched_smt_cosched support in __runq_tickle().
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (13 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 14/16] xen: Credit1: sched_smt_cosched support in __runq_tickle() for pinned vcpus Dario Faggioli
@ 2018-08-24 23:36 ` Dario Faggioli
  2018-08-24 23:37 ` [RFC PATCH v1 16/16] xen/tools: tracing of Credit1 SMT domain co-scheduling support Dario Faggioli
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

If sched_smt_cosched is enabled, when tickling pcpus (upon vcpu
wakeups), take into account the fact that only pcpus of cores where
other vcpus of the same domain are already running, or pcpus of fully
idle cores, will actually be able to pick the vcpu up.

*NB* there are places where the behavior needs a bit more refining, in
order to actually behave as described above (see the 'TODO's in the
sources).
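
For reference, both TODOs could be approached with the same kind of
filtering that the last hunk adds at the end of __runq_tickle(), sketched
here as a standalone helper (hypothetical, not part of the patch):

/*
 * Sketch: drop from 'mask' all the cpus whose core is owned by a domain
 * different from 'sdom' (unowned, i.e., fully idle, cores are kept).
 */
static void filter_by_core_owner(cpumask_t *mask,
                                 const struct csched_dom *sdom)
{
    unsigned int c;

    for_each_cpu ( c, mask )
    {
        struct csched_core *core = CSCHED_PCPU(c)->core;

        spin_lock(&core->lock);
        if ( core->sdom != NULL && core->sdom != sdom )
            cpumask_clear_cpu(c, mask);
        spin_unlock(&core->lock);
    }
}

The per-core lock round-trips are the price; the dedicated data structure
hinted at in the TODOs (e.g., a mask of the cpus of cores running each
domain) would trade them for more state to keep coherent.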

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
---
 xen/common/sched_credit.c |   60 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 9d6071e229..aecb4e3e05 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -416,8 +416,10 @@ static inline void __runq_tickle(struct csched_vcpu *new)
 
     /*
      * If there are no idlers, and the new vcpu is a higher priority than
-     * the old vcpu, run it here.
-    *
+     * the old vcpu, run it here. If SMT domain co-scheduling is enabled,
+     * though, another vcpu of the same domain must already be running
+     * on this core.
+     *
      * If there are idle cpus, first try to find one suitable to run
      * new, so we can avoid preempting cur.  If we cannot find a
      * suitable idler on which to run new, run it here, but try to
@@ -426,8 +428,13 @@ static inline void __runq_tickle(struct csched_vcpu *new)
     if ( idlers_empty && new->pri > cur->pri )
     {
         ASSERT(cpumask_test_cpu(cpu, new->vcpu->cpu_hard_affinity));
-        SCHED_STAT_CRANK(tickled_busy_cpu);
-        __cpumask_set_cpu(cpu, &mask);
+        spin_lock(&spc->core->lock);
+        if ( !sched_smt_cosched || new->sdom == spc->core->sdom )
+        {
+            SCHED_STAT_CRANK(tickled_busy_cpu);
+            __cpumask_set_cpu(cpu, &mask);
+        }
+        spin_unlock(&spc->core->lock);
     }
     else if ( !idlers_empty )
     {
@@ -473,6 +480,12 @@ static inline void __runq_tickle(struct csched_vcpu *new)
              * leave cur alone (as it is running and is, likely, cache-hot)
              * and wake some of them (which is waking up and so is, likely,
              * cache cold anyway), and go for one of them.
+             *
+             * TODO: for SMT domain co-scheduling, we must also be sure that
+             * these idlers are on cores where new->domain is running already
+             * (they can't be on fully idle cores, or we would have found
+             * them in prv->smt_idle). That can be done with a loop, or by
+             * introducing new data structures...
              */
             if ( !new_idlers_empty )
             {
@@ -483,12 +496,24 @@ static inline void __runq_tickle(struct csched_vcpu *new)
 
             /*
              * If there are no suitable idlers for new, and it's higher
-             * priority than cur, check whether we can migrate cur away.
-             * We have to do it indirectly, via _VPF_migrating (instead
-             * of just tickling any idler suitable for cur) because cur
-             * is running.
+             * priority than cur, an option is to run it here, and migrate cur
+             * away. If domain co-scheduling is enabled, we can do that only if
+             * the core is idle, or we're running new->domain already.
+             *
+             * TODO: Similarly, when checking whether we can migrate cur away,
+             * we should not only check if there are idlers suitable for cur,
+             * but also whether they are on fully idle cores, or on ones that
+             * are running cur->domain already. That can be done with a loop,
+             * or introducing a new data structure...
+             *
+             * If we decide to migrate cur, we have to do it indirectly, via
+             * _VPF_migrating (instead of just tickling any suitable idler),
+             * as cur is running.
              */
-            if ( new->pri > cur->pri )
+            spin_lock(&spc->core->lock);
+            if ( new->pri > cur->pri &&
+                 (!sched_smt_cosched || spc->core->sdom == NULL ||
+                  spc->core->sdom == new->sdom) )
             {
                 if ( cpumask_intersects(cur->vcpu->cpu_hard_affinity,
                                         &idle_mask) )
@@ -498,17 +523,34 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                     SCHED_STAT_CRANK(migrate_kicked_away);
                     set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
                 }
+                spin_unlock(&spc->core->lock);
                 /* Tickle cpu anyway, to let new preempt cur. */
                 SCHED_STAT_CRANK(tickled_busy_cpu);
                 __cpumask_set_cpu(cpu, &mask);
                 goto tickle;
             }
+            spin_unlock(&spc->core->lock);
 
             /* We get here only if we didn't find anyone. */
             ASSERT(cpumask_empty(&mask));
         }
     }
 
+    /* Don't tickle cpus that won't be able to pick up new. */
+    if ( sched_smt_cosched )
+    {
+        unsigned int c;
+
+        for_each_cpu(c, &mask)
+        {
+            spc = CSCHED_PCPU(c);
+            spin_lock(&spc->core->lock);
+            if ( spc->core->sdom != NULL && spc->core->sdom != new->sdom )
+                cpumask_clear_cpu(c, &mask);
+            spin_unlock(&spc->core->lock);
+        }
+    }
+
     if ( !cpumask_empty(&mask) )
     {
         /* Which of the idlers suitable for new shall we wake up? */


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH v1 16/16] xen/tools: tracing of Credit1 SMT domain co-scheduling support
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (14 preceding siblings ...)
  2018-08-24 23:36 ` [RFC PATCH v1 15/16] xen: Credit1: sched_smt_cosched support in __runq_tickle() Dario Faggioli
@ 2018-08-24 23:37 ` Dario Faggioli
  2018-09-07 16:00 ` [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Juergen Gross
  2018-10-17 21:36 ` Tamas K Lengyel
  17 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-08-24 23:37 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Wei Liu, Ian Jackson

Introduce some new events, related to SMT-aware domain co-scheduling,
in the Credit1 code, together with their handling and parsing in xenalyze.
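
For illustration, given the format strings below, the new records are
rendered by xenalyze (run with --dump-all) more or less like this, where
<hdr> stands for the usual dump header and all the numbers are made up:

 <hdr> csched:schedule cpu 3 curr=d1v2 prio=UNDER, (busy), cosched=d1
 <hdr> csched:runq_next d2v0 prio=UNDER
 <hdr> csched:runq_check d2v0 prio=UNDER
 <hdr> csched:smt_cosched_dom=1
 <hdr> csched:steal_dom_check d1v3 prio=OVER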

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
TODO:
- deal with xentrace_format.
---
 tools/xentrace/xenalyze.c |   74 +++++++++++++++++++++++++++++++++++++++++++--
 xen/common/sched_credit.c |   20 ++++++++++++
 2 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index a1e2531945..1d586d30b1 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7417,6 +7417,18 @@ void dump_sched_vcpu_action(struct record_info *ri, const char *action)
     printf(" %s %s d%uv%u\n", ri->dump_header, action, r->domid, r->vcpuid);
 }
 
+static inline const char *csched_pri_to_string(int pri)
+{
+    switch(pri)
+    {
+        case   0: return "BOOST";
+        case  -1: return "UNDER";
+        case  -2: return "OVER";
+        case -64: return "IDLE";
+        default: return "";
+    }
+}
+
 void sched_process(struct pcpu_info *p)
 {
     struct record_info *ri = &p->ri;
@@ -7632,12 +7644,19 @@ void sched_process(struct pcpu_info *p)
             if(opt.dump_all) {
                 struct {
                     unsigned int cpu:16, tasklet:8, idle:8;
+                    unsigned int vcpuid:16, domid:16;
+                    int pri, cosc_dom;
                 } *r = (typeof(r))ri->d;
 
-                printf(" %s csched:schedule cpu %u%s%s\n",
+                printf(" %s csched:schedule cpu %u curr=d%uv%u prio=%s%s%s",
                        ri->dump_header, r->cpu,
-                       r->tasklet ? ", tasklet scheduled" : "",
-                       r->idle ? ", idle" : ", busy");
+                       r->domid, r->vcpuid, csched_pri_to_string(r->pri),
+                       r->idle ? ", (idle)" : ", (busy)",
+                       r->tasklet ? ", tasklet scheduled" : "");
+                if (r->cosc_dom != -1)
+                    printf(", cosched=d%d\n", r->cosc_dom);
+                else
+                    printf(", cosched=/\n");
             }
             break;
         case TRC_SCHED_CLASS_EVT(CSCHED, 10): /* RATELIMIT     */
@@ -7666,6 +7685,55 @@ void sched_process(struct pcpu_info *p)
                        r->check < 0 ? -r->check : r->check);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 12): /* RUNQ_NEXT     */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int domid, vcpuid;
+                    int pri;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:runq_next d%uv%u prio=%s\n",
+                       ri->dump_header, r->domid, r->vcpuid,
+                       csched_pri_to_string(r->pri));
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 13): /* COSCHED_DOM   */
+            if(opt.dump_all) {
+                struct {
+                    int cosc_dom;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:smt_cosched_dom=", ri->dump_header);
+                if (r->cosc_dom != -1)
+                    printf("%d\n", r->cosc_dom);
+                else
+                    printf("/\n");
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 14): /* RUNQ_CHECK    */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int domid, vcpuid;
+                    int pri;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:runq_check d%uv%u prio=%s\n",
+                       ri->dump_header, r->domid, r->vcpuid,
+                       csched_pri_to_string(r->pri));
+            }
+            break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 15): /* STEAL_DCHECK  */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int domid, vcpuid;
+                    int pri;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:steal_dom_check d%uv%u prio=%s\n",
+                       ri->dump_header, r->domid, r->vcpuid,
+                       csched_pri_to_string(r->pri));
+            }
+            break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index aecb4e3e05..fc64f58c23 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -136,6 +136,10 @@
 #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
 #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
 #define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
+#define TRC_CSCHED_RUNQ_NEXT     TRC_SCHED_CLASS_EVT(CSCHED, 12)
+#define TRC_CSCHED_COSCHED_DOM   TRC_SCHED_CLASS_EVT(CSCHED, 13)
+#define TRC_CSCHED_RUNQ_CHECK    TRC_SCHED_CLASS_EVT(CSCHED, 14)
+#define TRC_CSCHED_STEAL_DCHECK  TRC_SCHED_CLASS_EVT(CSCHED, 15)
 
 /*
  * Boot parameters
@@ -1785,6 +1789,8 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
     list_for_each( iter, &peer_pcpu->runq )
     {
         speer = __runq_elem(iter);
+        TRACE_3D(TRC_CSCHED_STEAL_DCHECK, speer->vcpu->domain->domain_id,
+                 speer->vcpu->vcpu_id, speer->pri);
 
         /*
          * If next available VCPU here is not of strictly higher
@@ -2024,10 +2030,17 @@ csched_schedule(
     {
         struct {
             unsigned cpu:16, tasklet:8, idle:8;
+            unsigned vcpu:16, dom:16;
+            int pri, cosc_dom;
         } d;
         d.cpu = cpu;
         d.tasklet = tasklet_work_scheduled;
         d.idle = is_idle_vcpu(current);
+        d.dom = scurr->vcpu->domain->domain_id;
+        d.vcpu = scurr->vcpu->vcpu_id;
+        d.pri = scurr->pri;
+        d.cosc_dom = !sched_smt_cosched || !spc->core->sdom ?
+                     -1 : spc->core->sdom->dom->domain_id;
         __trace_var(TRC_CSCHED_SCHEDULE, 1, sizeof(d),
                     (unsigned char *)&d);
     }
@@ -2135,6 +2148,8 @@ csched_schedule(
         spc->core->sdom = NULL;
 
     snext = __runq_elem(runq->next);
+    TRACE_3D(TRC_CSCHED_RUNQ_NEXT, snext->vcpu->domain->domain_id,
+             snext->vcpu->vcpu_id, snext->pri);
 
     /*
      * If domain co-scheduling is enabled, and a domain is running already
@@ -2152,6 +2167,8 @@ csched_schedule(
         list_for_each( iter, runq )
         {
             siter = __runq_elem(iter);
+            TRACE_3D(TRC_CSCHED_RUNQ_CHECK, siter->vcpu->domain->domain_id,
+                     siter->vcpu->vcpu_id, siter->pri);
 
             /*
              * Don't pick up a vcpu which has lower priority than snext, or
@@ -2219,6 +2236,9 @@ csched_schedule(
         }
     }
 
+    TRACE_1D(TRC_CSCHED_COSCHED_DOM, spc->core->sdom == NULL ?
+             -1 : spc->core->sdom->dom->domain_id);
+
     spin_unlock(&spc->core->lock);
 
     /*


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (15 preceding siblings ...)
  2018-08-24 23:37 ` [RFC PATCH v1 16/16] xen/tools: tracing of Credit1 SMT domain co-scheduling support Dario Faggioli
@ 2018-09-07 16:00 ` Juergen Gross
  2018-10-11 17:37   ` Dario Faggioli
  2018-10-17 21:36 ` Tamas K Lengyel
  17 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2018-09-07 16:00 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich

On 25/08/18 01:35, Dario Faggioli wrote:
> Hello,
> 
> As anticipated here:
> https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02052.html
> 
> Here's a preliminary version of my work, trying to implement
> core-scheduling in Xen.
> 
> First of all, this deals with Credit1 only. I have patches for Credit2,
> and I've been working on having them ready by today, but I could not
> defeat the latest bugs. :-/
> I'll post them when back from vacation. Just let me anticipate, that
> doing something like this in Credit2, is a lot simpler than what you
> see here for Credit1.
> 
> Even these patches that I'm posting are not perfect, and In fact there
> are some TODOs and XXXs --both in the changelogs and in the code.
> 
> They give me a system that boots, where I can do basic stuff (like
> playing with dom0, creating guests, etc), and where the constraint of
> only scheduling vcpus from one domain at a time on pcpus that are part
> of the same core is, as far as I've seen, respected.
> 
> There are still cases where the behavior is unideal, e.g., we could
> make a better use of some of the cores which are, some of the times,
> left idle.
> 
> There are git branches here:
>  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-RFCv1
>  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
> 
> Any comment is more than welcome.

Have you thought about a more generic approach?

Instead of trying to schedule only vcpus of the same domain on a core
I'd rather switch from vcpu scheduling to real core scheduling. The
scheduler would see guest cores to be scheduled on physical cores. A
guest core consists of "guest threads" being vcpus (vcpus are bound
to their guest cores, so that part of the topology could even be used
by the guest for performance tuning). The state of a guest core
(running, runnable, blocked) is defined by the combination of all of its
vcpus (at least one vcpu running -> core running, all vcpus blocked ->
core blocked, all runnable -> core runnable). This means we would need
to introduce the new vcpu state "idle" (in case the other vcpu of the
core is running).

The state machine determining the core state from its vcpus would be
scheduler agnostic (schedule.c), same for switching guest cores on a
physical core.
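
Something like this, e.g. (only a sketch, with made up names):

/* Sketch: deriving a guest core's state from its vcpus. */
enum gcore_state { GCORE_RUNNING, GCORE_RUNNABLE, GCORE_BLOCKED };

static enum gcore_state gcore_state(const struct guest_core *gc)
{
    unsigned int i;
    bool all_blocked = true;

    for ( i = 0; i < gc->nr_vcpus; i++ )
    {
        switch ( gc->vcpus[i]->runstate.state )
        {
        case RUNSTATE_running:
            return GCORE_RUNNING;    /* one running vcpu is enough */
        case RUNSTATE_blocked:
            break;                   /* might still be all-blocked */
        default:
            all_blocked = false;     /* at least one runnable vcpu */
            break;
        }
    }

    return all_blocked ? GCORE_BLOCKED : GCORE_RUNNABLE;
}

The new "idle" vcpu state would need its own case, but it can only occur
while a sibling is running, i.e., while the core is running anyway.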

This scheme could even be expanded for socket scheduling.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-09-07 16:00 ` [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Juergen Gross
@ 2018-10-11 17:37   ` Dario Faggioli
  2018-10-12  5:15     ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Dario Faggioli @ 2018-10-11 17:37 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 3112 bytes --]

Hey,

Sorry if replying took some time. :-P

On Fri, 2018-09-07 at 18:00 +0200, Juergen Gross wrote:
> On 25/08/18 01:35, Dario Faggioli wrote:
> > 
> > There are git branches here:
> >  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-
> > RFCv1
> >  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
> > 
> > Any comment is more than welcome.
> 
> Have you thought about a more generic approach?
> 
I had. And I have thought about it more since this email. :-)

> Instead of trying to schedule only vcpus of the same domain on a core
> I'd rather switch form vcpu scheduling to real core scheduling. The
> scheduler would see guest cores to be scheduled on physical cores. A
> guest core consists of "guest threads" being vcpus (vcpus are bound
> to their guest cores, so that part of the topology could even be used
> by the guest for performance tuning). 
>
Right, so I think I got the big picture. And it was something that, as
I've said above, I've been thinking too, and we've also talked about
something similar with Andrew in Nanjing.

I'm still missing how something like this would work in detail,
perhaps because I'm really used to reasoning within the boundaries of the
model we currently have.

So, for example:
- domain A has vCore0 and vCore1
- each vCore has 2 threads ({vCore0.0, vCore0.1} and
  {vCore1.0, vCore1.1})
- we're on a 2-way SMT host
- vCore1 is running on physical core 3 on the host
- more specifically, vCore1.0 is currently executing on thread 0 of
  physical core 3 of the host, and vCore1.1 is currently executing on
  thread 1 of core 3 of the host
- say that both vCore1.0 and vCore1.1 are in guest context

Now:
* vCore1.0 blocks. What happens?
* vCore1.0 makes a hypercall. What happens?
* vCore1.0 VMEXITs. What happens?

> The state machine determining the core state from its vcpus would be
> scheduler agnostic (schedule.c), same for switching guest cores on a
> physical core.
> 
What do you mean by "same for switching guest cores on a physical
core"?

All in all, I like the idea, because it is about introducing nice
abstractions, it is general, etc., but it looks like a major rework of
the scheduler.

And it's not that I am not up for major reworks, but I'd like to
understand properly what that is buying us.

Note that, while this series which tries to implement core-scheduling
for Credit1 is rather long and messy, doing the same (and with a
similar approach) for Credit2 is a lot easier and nicer. I have it
almost ready, and will send it soon.

> This scheme could even be expanded for socket scheduling.
> 
Right. But again, in Credit2, I've been able to implement socket-wise
coscheduling with this approach (I mean, an approach similar to the one
in this series, but adapted to Credit2).

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-11 17:37   ` Dario Faggioli
@ 2018-10-12  5:15     ` Juergen Gross
  2018-10-12  7:49       ` Dario Faggioli
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2018-10-12  5:15 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich

On 11/10/2018 19:37, Dario Faggioli wrote:
> Hey,
> 
> Sorry if replying took some time. :-P
> 
> On Fri, 2018-09-07 at 18:00 +0200, Juergen Gross wrote:
>> On 25/08/18 01:35, Dario Faggioli wrote:
>>>
>>> There are git branches here:
>>>  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-
>>> RFCv1
>>>  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
>>>
>>> Any comment is more than welcome.
>>
>> Have you thought about a more generic approach?
>>
> I had. And I have thought about it more since this email. :-)
> 
>> Instead of trying to schedule only vcpus of the same domain on a core
>> I'd rather switch from vcpu scheduling to real core scheduling. The
>> scheduler would see guest cores to be scheduled on physical cores. A
>> guest core consists of "guest threads" being vcpus (vcpus are bound
>> to their guest cores, so that part of the topology could even be used
>> by the guest for performance tuning). 
>>
> Right, so I think I got the big picture. And it was something that, as
> I've said above, I've been thinking too, and we've also talked about
> something similar with Andrew in Nanjing.
> 
> I'm still missing how something like this would work in detail,
> perhaps because I'm really used to reasoning within the boundaries of the
> model we currently have.
> 
> So, for example:
> - domain A has vCore0 and vCore1
> - each vCore has 2 threads ({vCore0.0, vCore0.1} and
>   {vCore1.0, vCore1.1})
> - we're on a 2-way SMT host
> - vCore1 is running on physical core 3 on the host
> - more specifically, vCore1.0 is currently executing on thread 0 of
>   physical core 3 of the host, and vCore1.1 is currently executing on
>   thread 1 of core 3 of the host
> - say that both vCore1.0 and vCore1.1 are in guest context
> 
> Now:
> * vCore1.0 blocks. What happens?

It is going to vBlocked (the physical thread is sitting in the
hypervisor waiting for either a (core-)scheduling event or for
unblocking vCore1.0). vCore1.1 keeps running. Or, if vCore1.1
is already vIdle/vBlocked, vCore1 is switching to blocked and the
scheduler is looking for another vCore to schedule on the physical
core.

> * vCore1.0 makes a hypercall. What happens?

Same as today. The hypercall is being executed.

> * vCore1.0 VMEXITs. What happens?

Same as today. The VMEXIT is handled.

In case you are referring to a potential rendezvous for e.g. L1TF
mitigation: this would be handled in a scheduler-agnostic way.

>> The state machine determining the core state from its vcpus would be
>> scheduler agnostic (schedule.c), same for switching guest cores on a
>> physical core.
>>
>> What do you mean by "same for switching guest cores on a physical
> core"?

No per-scheduler handling, but a common schedule.c function (maybe with
new per-scheduler hooks if needed). So schedule() would be modified to
work on scheduling entities (threads/cores/sockets).
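
E.g., very roughly (all names invented):

/* Sketch: what schedule.c would schedule, instead of single vcpus. */
struct sched_entity {
    struct domain *domain;   /* all member vcpus belong to it         */
    struct vcpu  **vcpus;    /* the "guest threads" of this entity    */
    unsigned int nr_vcpus;   /* 1 = thread granularity, 2 = core on a
                              * 2-way SMT host, more for socket       */
    int state;               /* derived from the member vcpus' states */
};

The per-scheduler code would then only ever deal with such entities,
while the thread-level bookkeeping stays in schedule.c.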

> All in all, I like the idea, because it is about introducing nice
> abstractions, it is general, etc., but it looks like a major rework of
> the scheduler.

Correct. Finally something to do :-p

> And it's not that I am not up for major reworks, but I'd like to
> understand properly what that is buying us.

I would hope so!

> Note that, while this series which tries to implement core-scheduling
> for Credit1 is rather long and messy, doing the same (and with a
> similar approach) for Credit2 is a lot easier and nicer. I have it
> almost ready, and will send it soon.

Okay, but would it keep vThreads of the same vCore always running
together on the same physical core?

>> This scheme could even be expanded for socket scheduling.
>>
> Right. But again, in Credit2, I've been able to implement socket-wise
> coscheduling with this approach (I mean, an approach similar to the one
> in this series, but adapted to Credit2).

And then there still is sched_rt.c


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-12  5:15     ` Juergen Gross
@ 2018-10-12  7:49       ` Dario Faggioli
  2018-10-12  8:35         ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Dario Faggioli @ 2018-10-12  7:49 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 4602 bytes --]

On Fri, 2018-10-12 at 07:15 +0200, Juergen Gross wrote:
> On 11/10/2018 19:37, Dario Faggioli wrote:
> > 
> > So, for example:
> > - domain A has vCore0 and vCore1
> > - each vCore has 2 threads ({vCore0.0, vCore0.1} and
> >   {vCore1.0, vCore1.1})
> > - we're on a 2-way SMT host
> > - vCore1 is running on physical core 3 on the host
> > - more specifically, vCore1.0 is currently executing on thread 0 of
> >   physical core 3 of the host, and vCore1.1 is currently executing
> > on
> >   thread 1 of core 3 of the host
> > - say that both vCore1.0 and vCore1.1 are in guest context
> > 
> > Now:
> > * vCore1.0 blocks. What happens?
> 
> It is going to vBlocked (the physical thread is sitting in the
> hypervisor waiting for either a (core-)scheduling event or for
> unblocking vCore1.0). vCore1.1 keeps running. Or, if vCore1.1
> is already vIdle/vBlocked, vCore1 is switching to blocked and the
> scheduler is looking for another vCore to schedule on the physical
> core.
> 
Ok. And then we'll have one thread in guest context, and one thread in
Xen (albeit, idle, in this case). In these other cases...

> > * vCore1.0 makes a hypercall. What happens?
> 
> Same as today. The hypercall is being executed.
> 
> > * vCore1.0 VMEXITs. What happens?
> 
> Same as today. The VMEXIT is handled.
> 
... we have one thread in guest context, and one thread in Xen, and the
one in Xen is not just staying idle, it's doing hypercalls and VMEXIT
handling.

> In case you are referring to a potential rendezvous for e.g. L1TF
> mitigation: this would be handled in a scheduler-agnostic way.
> 
Yes, that was what I was thinking of. I.e., in order to be able to use
core-scheduling as a _fully_effective_ mitigation for stuff like L1TF,
we'd need something like that.

In fact, core-scheduling "per-se" mitigates leaks among guests, but if
we want to fully avoid for two threads to ever be in different security
contexts (like one in guest and one in Xen, to prevent Xen data leaking
to a guest), we do need some kind of synchronized Xen enters/exits,
AFAIUI.

What I'm trying to understand right now, is whether implementing things
in this way you're proposing, would help achieving that. And what I've
understood so far is that, no it doesn't.

The main difference between the two approaches would be that we
implement it once in schedule.c, for all schedulers. But this, I see it
as something having both up and down sides (yeah, like everything on
Earth, I know! :-P). More on this later.

> > All in all, I like the idea, because it is about introducing nice
> > abstractions, it is general, etc., but it looks like a major rework
> > of
> > the scheduler.
> 
> Correct. Finally something to do :-p
> 
Indeed! :-)

> > Note that, while this series which tries to implement core-
> > scheduling
> > for Credit1 is rather long and messy, doing the same (and with a
> > similar approach) for Credit2 is a lot easier and nicer. I have it
> > almost ready, and will send it soon.
> 
> Okay, but would it keep vThreads of the same vCore always running
> together on the same physical core?
> 
It doesn't right now, as we don't have a way to expose such information
to the guest, yet. And since without such a mechanism, the guest can't
take advantage of something like this (neither from a performance nor
from a vuln. mitigation point of view), I kept that out.

But I certainly can see about making it do so (I was already planning
to).

> > Right. But again, in Credit2, I've been able to implement socket-
> > wise
> > coscheduling with this approach (I mean, an approach similar to the
> > one
> > in this series, but adapted to Credit2).
> 
> And then there still is sched_rt.c
> 
Ok, so I think this is the main benefit of this approach. We do the
thing once, and all schedulers get core-scheduling (or whatever
granularity of group scheduling we implement/allow).

But how easy is it to opt out, if one doesn't want it? E.g., in the
context of L1TF, what if I'm not affected, and hence am not interested
in core-scheduling? What if I want a cpupool with core-scheduling and
one without?

I may be wrong but, off the top of my head, it seems to me that
doing things in schedule.c makes this a lot harder, if possible at all.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-12  7:49       ` Dario Faggioli
@ 2018-10-12  8:35         ` Juergen Gross
  2018-10-12  9:15           ` Dario Faggioli
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2018-10-12  8:35 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich

On 12/10/2018 09:49, Dario Faggioli wrote:
> On Fri, 2018-10-12 at 07:15 +0200, Juergen Gross wrote:
>> On 11/10/2018 19:37, Dario Faggioli wrote:
>>>
>>> So, for example:
>>> - domain A has vCore0 and vCore1
>>> - each vCore has 2 threads ({vCore0.0, vCore0.1} and
>>>   {vCore1.0, vCore1.1})
>>> - we're on a 2-way SMT host
>>> - vCore1 is running on physical core 3 on the host
>>> - more specifically, vCore1.0 is currently executing on thread 0 of
>>>   physical core 3 of the host, and vCore1.1 is currently executing
>>> on
>>>   thread 1 of core 3 of the host
>>> - say that both vCore1.0 and vCore1.1 are in guest context
>>>
>>> Now:
>>> * vCore1.0 blocks. What happens?
>>
>> It is going to vBlocked (the physical thread is sitting in the
>> hypervisor waiting for either a (core-)scheduling event or for
>> unblocking vCore1.0). vCore1.1 keeps running. Or, if vCore1.1
>> is already vIdle/vBlocked, vCore1 is switching to blocked and the
>> scheduler is looking for another vCore to schedule on the physical
>> core.
>>
> Ok. And then we'll have one thread in guest context, and one thread in
> Xen (albeit, idle, in this case). In these other cases...
> 
>>> * vCore1.0 makes a hypercall. What happens?
>>
>> Same as today. The hypercall is being executed.
>>
>>> * vCore1.0 VMEXITs. What happens?
>>
>> Same as today. The VMEXIT is handled.
>>
> ... we have one thread in guest context, and one thread in Xen, and the
> one in Xen is not just staying idle, it's doing hypercalls and VMEXIT
> handling.
> 
>> In case you are referring to a potential rendezvous for e.g. L1TF
>> mitigation: this would be handled in a scheduler-agnostic way.
>>
> Yes, that was what I was thinking of. I.e., in order to be able to use
> core-scheduling as a _fully_effective_ mitigation for stuff like L1TF,
> we'd need something like that.
> 
> In fact, core-scheduling per se mitigates leaks among guests, but if
> we want to fully avoid two threads ever being in different security
> contexts (like one in guest and one in Xen, to prevent Xen data leaking
> to a guest), we do need some kind of synchronized Xen entries/exits,
> AFAIUI.
> 
> What I'm trying to understand right now is whether implementing things
> in the way you're proposing would help achieve that. And what I've
> understood so far is that, no, it doesn't.

This aspect will need about the same effort in both solutions.
Maybe my proposal would make it easier to decide whether such a
rendezvous is needed, as there would be only one instance to ask
(schedule.c) instead of multiple instances (sched_*.c).
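
To make that concrete: below is a minimal sketch of what such a
scheduler-agnostic rendezvous could look like. All the names are made
up (this is not existing Xen code), and it ignores the IPI needed to
actually kick the sibling out of guest context, interrupts, and the
case where the sibling is idle:

    /* Per-core synchronization state; threads_per_core would be 2 on
     * current SMT hardware. */
    struct core_rendezvous {
        unsigned int threads_per_core;
        atomic_t in_xen;     /* siblings currently inside Xen */
    };

    /* On hypercall/VMEXIT entry: don't touch sensitive data until
     * every sibling has left guest context. */
    void rendezvous_enter(struct core_rendezvous *r)
    {
        atomic_inc(&r->in_xen);
        while ( atomic_read(&r->in_xen) < r->threads_per_core )
            cpu_relax();     /* a sibling is still in guest context */
    }

    /* Right before resuming the guest: leave together, so that we
     * never have one thread in guest while the other runs Xen code. */
    void rendezvous_exit(struct core_rendezvous *r)
    {
        atomic_dec(&r->in_xen);
        while ( atomic_read(&r->in_xen) > 0 )
            cpu_relax();     /* wait for the sibling(s) to be done too */
    }

And indeed, nothing in there depends on which scheduler is in use.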

> 
> The main difference between the two approaches would be that we
> implement it once in schedule.c, for all schedulers. But this I see
> as something having both up and down sides (yeah, like everything on
> Earth, I know! :-P). More on this later.
> 
>>> All in all, I like the idea, because it is about introducing nice
>>> abstractions, it is general, etc., but it looks like a major rework
>>> of the scheduler.
>>
>> Correct. Finally something to do :-p
>>
> Indeed! :-)
> 
>>> Note that, while this series which tries to implement
>>> core-scheduling for Credit1 is rather long and messy, doing the
>>> same (and with a
>>> similar approach) for Credit2 is a lot easier and nicer. I have it
>>> almost ready, and will send it soon.
>>
>> Okay, but would it keep vThreads of the same vCore always running
>> together on the same physical core?
>>
> It doesn't right now, as we don't have a way to expose such information
> to the guest, yet. And since, without such a mechanism, the guest can't
> take advantage of something like this (neither from a performance nor
> from a vuln. mitigation point of view), I kept that out.
> 
> But I certainly can see about making it do so (I was already planning
> to).
> 
>>> Right. But again, in Credit2, I've been able to implement
>>> socket-wise coscheduling with this approach (I mean, an approach
>>> similar to the one in this series, but adapted to Credit2).
>>
>> And then there still is sched_rt.c
>>
> Ok, so I think this is the main benefit of this approach. We do the
> thing once, and all schedulers get core-scheduling (or whatever
> granularity of group scheduling we implement/allow).
> 
> But how easy is it to opt out, if one doesn't want it? E.g., in the
> context of L1TF, what if I'm not affected, and hence am not interested
> in core-scheduling? What if I want a cpupool with core-scheduling and
> one without?
> 
> I may be wrong, but off the top of my head it seems to me that doing
> things in schedule.c makes this a lot harder, if possible at all.

Why? This would be a per-cpupool setting, so the scheduling granularity
would live in struct scheduler.
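
Something like this (just a sketch, with made-up names; the point
being that each cpupool already gets its own instance of struct
scheduler, so a field in there is per-pool by construction):

    enum sched_gran {
        SCHED_GRAN_CPU,     /* schedule single threads, as today */
        SCHED_GRAN_CORE,    /* group the threads of a core */
        SCHED_GRAN_SOCKET,  /* group all the cores of a socket */
    };

    struct scheduler {
        char *name;             /* ... existing fields ... */
        /* ... */
        enum sched_gran gran;   /* granularity of this instance, i.e.,
                                 * of the cpupool it is attached to */
    };

A pool with gran == SCHED_GRAN_CPU would then behave exactly as today,
while another pool could use SCHED_GRAN_CORE at the same time.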


Juergen



* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-12  8:35         ` Juergen Gross
@ 2018-10-12  9:15           ` Dario Faggioli
  2018-10-12  9:23             ` Juergen Gross
  0 siblings, 1 reply; 29+ messages in thread
From: Dario Faggioli @ 2018-10-12  9:15 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich


On Fri, 2018-10-12 at 10:35 +0200, Juergen Gross wrote:
> On 12/10/2018 09:49, Dario Faggioli wrote:
> > 
> > But how easy is it to opt out, if one doesn't want it? E.g., in the
> > context of L1TF, what if I'm not affected, and hence am not
> > interested in core-scheduling? What if I want a cpupool with
> > core-scheduling and one without?
> > 
> > I may be wrong, but off the top of my head it seems to me that
> > doing things in schedule.c makes this a lot harder, if possible at
> > all.
> 
> Why? This would be a per-cpupool setting, so the scheduling
> granularity would live in struct scheduler.
> 
Mmm... it's already starting to be a bit hard to reason, without
looking at at least a prototype of the code. :-/

But, anyway, if I'm in sched_dario.c, and schedule.c makes me reason in
terms of vCores, how do I deal with single CPUs for a particular
cpupool that does not want core-scheduling?

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-12  9:15           ` Dario Faggioli
@ 2018-10-12  9:23             ` Juergen Gross
  2018-10-18 10:40               ` Dario Faggioli
  0 siblings, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2018-10-12  9:23 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich

On 12/10/2018 11:15, Dario Faggioli wrote:
> On Fri, 2018-10-12 at 10:35 +0200, Juergen Gross wrote:
>> On 12/10/2018 09:49, Dario Faggioli wrote:
>>>
>>> But how easy is it to opt out, if one doesn't want it? E.g., in the
>>> context of L1TF, what if I'm not affected, and hence am not
>>> interested in core-scheduling? What if I want a cpupool with
>>> core-scheduling and one without?
>>>
>>> I may be wrong, but off the top of my head it seems to me that
>>> doing things in schedule.c makes this a lot harder, if possible at
>>> all.
>>
>> Why? This would be a per-cpupool setting, so the scheduling
>> granularity would live in struct scheduler.
>>
> Mmm... it's already starting to be a bit hard to reason, without
> looking at at least a prototype of the code. :-/
> 
> But, anyway, if I'm in sched_dario.c, and schedule.c makes me reason in
> terms of vCores, how do I deal with single CPUs for a particular
> cpupool that does not want core-scheduling?

sched_dario.c would only know of scheduling entities. Mapping of vcpus
to scheduling entities is part of the infrastructure.

schedule.c would receive vcpu state changes. In case such a vcpu
state change leads to a scheduling entity state change the related
scheduler is informed about that to be able to react.

In case the scheduling entity is a thread (no core scheduling) each
vcpu state change will result in a scheduling entity state change,
so it will be as today.
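
In (made-up) code, the aggregation in schedule.c could be as simple as
the sketch below, assuming hypothetical sched_entity_wake() and
sched_entity_sleep() hooks implemented by the various sched_*.c
(locking omitted):

    /* An entity groups 1 (thread granularity), 2 (core granularity on
     * 2-way SMT), ... vcpus; sched_*.c only ever sees entities. */
    struct sched_entity {
        unsigned int nr_vcpus;     /* vcpus grouped in this entity */
        unsigned int nr_runnable;  /* how many of them are runnable */
    };

    void sched_entity_wake(struct sched_entity *e);   /* in sched_*.c */
    void sched_entity_sleep(struct sched_entity *e);  /* in sched_*.c */

    /* schedule.c, on every vcpu wakeup. */
    static void entity_vcpu_wake(struct sched_entity *e)
    {
        if ( e->nr_runnable++ == 0 )
            sched_entity_wake(e);   /* the entity became runnable */
    }

    /* schedule.c, on every vcpu blocking. */
    static void entity_vcpu_block(struct sched_entity *e)
    {
        if ( --e->nr_runnable == 0 )
            sched_entity_sleep(e);  /* the whole entity went idle */
    }

With nr_vcpus == 1 (no core-scheduling), every vcpu state change is an
entity state change, which is exactly the behavior of today.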


Juergen



* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-08-24 23:35 [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Dario Faggioli
                   ` (16 preceding siblings ...)
  2018-09-07 16:00 ` [RFC PATCH v1 00/16] xen: sched: implement core-scheduling Juergen Gross
@ 2018-10-17 21:36 ` Tamas K Lengyel
  2018-10-18  8:16   ` Dario Faggioli
  17 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2018-10-17 21:36 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	bhavesh.davda, Jan Beulich, Xen-devel

On Fri, Aug 24, 2018 at 5:36 PM Dario Faggioli <dfaggioli@suse.com> wrote:
>
> Hello,
>
> As anticipated here:
> https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02052.html
>
> Here's a preliminary version of my work, trying to implement
> core-scheduling in Xen.
>
> First of all, this deals with Credit1 only. I have patches for Credit2,
> and I've been working on having them ready by today, but I could not
> defeat the latest bugs. :-/
> I'll post them when back from vacation. Just let me anticipate, that
> doing something like this in Credit2, is a lot simpler than what you
> see here for Credit1.
>
> Even these patches that I'm posting are not perfect, and In fact there
> are some TODOs and XXXs --both in the changelogs and in the code.
>
> They give me a system that boots, where I can do basic stuff (like
> playing with dom0, creating guests, etc), and where the constraint of
> only scheduling vcpus from one domain at a time on pcpus that are part
> of the same core is, as far as I've seen, respected.
>
> There are still cases where the behavior is unideal, e.g., we could
> make a better use of some of the cores which are, some of the times,
> left idle.
>
> There are git branches here:
>  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-RFCv1
>  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
>
> Any comment is more than welcome.

Hi Dario,
thanks for the series, we are in the process of evaluating it in terms
of performance. Our test is to setup 2 VMs each being assigned enough
vCPUs to completely saturate all hyperthreads and then we fire up CPU
benchmarking inside the VMs to spin each vCPU 100% (using swet). The
idea is to force the scheduler to move vCPUs in-and-out constantly to
see how much performance hit there would be with core-scheduling vs
plain credit1 vs disabling hyperthreading. After running the test on a
handful of machines it looks like we get the best performance with
hyperthreading completely disabled, which is a bit unexpected. Have
you or anyone else encountered this?

Thanks,
Tamas


* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-17 21:36 ` Tamas K Lengyel
@ 2018-10-18  8:16   ` Dario Faggioli
  2018-10-18 12:55     ` Tamas K Lengyel
  0 siblings, 1 reply; 29+ messages in thread
From: Dario Faggioli @ 2018-10-18  8:16 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	bhavesh.davda, Jan Beulich, Xen-devel


On Wed, 2018-10-17 at 15:36 -0600, Tamas K Lengyel wrote:
> On Fri, Aug 24, 2018 at 5:36 PM Dario Faggioli <dfaggioli@suse.com>
> wrote:
> > 
> > They give me a system that boots, where I can do basic stuff (like
> > playing with dom0, creating guests, etc), and where the constraint
> > of
> > only scheduling vcpus from one domain at a time on pcpus that are
> > part
> > of the same core is, as far as I've seen, respected.
> > 
> > There are still cases where the behavior is unideal, e.g., we could
> > make a better use of some of the cores which are, some of the
> > times,
> > left idle.
> > 
> > There are git branches here:
> >  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-
> > RFCv1
> >  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
> > 
> > Any comment is more than welcome.
> 
> Hi Dario,
>
Hi,

> thanks for the series, we are in the process of evaluating it in terms
> of performance. Our test is to set up 2 VMs each being assigned enough
> vCPUs to completely saturate all hyperthreads and then we fire up CPU
> benchmarking inside the VMs to spin each vCPU 100% (using swet). The
> idea is to force the scheduler to move vCPUs in-and-out constantly to
> see how much performance hit there would be with core-scheduling vs
> plain credit1 vs disabling hyperthreading. After running the test on a
> handful of machines it looks like we get the best performance with
> hyperthreading completely disabled, which is a bit unexpected. Have
> you or anyone else encountered this?
>
Do you mean that no-hyperthreading is better than core-scheduling, as
implemented by this series?

Or do you mean that no-hyperthreading is better than plain Credit1,
with SMT enabled and *without* this series?

If the former, well, this series is not at a stage where it makes sense
to run performance benchmarks. Not even close, I would say.

If the latter, it's a bit weird, as I've often seen hyperthreading
causing seemingly weird performance figures, but it does not happen
very often that it actually slows things down (although, I wouldn't
rule it out! :-P).

Of course, this could also be a problem in the scheduler. Something I'd
do is (leaving this series aside):
- try to run the benchmark with half the number of vCPUs wrt the total
  number of CPUs (i.e., with only 1 vCPU per core);
- try to run the benchmark as above, but explicitly pinning 1 vCPU on
  each core (one way of doing that is shown right after this list);
- go back to nr vCPUs == nr pCPUs, but explicitly pin each vCPU to one
  pCPU (i.e., to one thread, in this case).
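
For instance, for the second configuration above (pinning 1 vCPU per
core), on a host where --as is often the case-- threads 2c and 2c+1
are the two siblings of core c (check the actual topology with
`xl info -n`), and for a 4-vCPU guest called vm1, that would be
something like:

    xl vcpu-pin vm1 0 0
    xl vcpu-pin vm1 1 2
    xl vcpu-pin vm1 2 4
    xl vcpu-pin vm1 3 6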

You can do all the above with Credit, and results should already tell
us whether we may be dealing with some scheduling anomaly. If you want,
since you don't seem to have oversubscription, you can also try the
null scheduler, to have even more data points (in fact, one of the
purposes of having it was exactly this, i.e., making it possible to
have a reference for other schedulers to compare against).

BTW, when you say "2 VMs with enough vCPUs to saturate all
hyperthreads", does that include dom0 vCPUs?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-12  9:23             ` Juergen Gross
@ 2018-10-18 10:40               ` Dario Faggioli
  0 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-10-18 10:40 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	Bhavesh Davda, Jan Beulich


On Fri, 2018-10-12 at 11:23 +0200, Juergen Gross wrote:
> On 12/10/2018 11:15, Dario Faggioli wrote:
> > 
> > But, anyway, if I'm in sched_dario.c, and schedule.c makes me
> > reason in
> > terms of vCores, how do I deal with single CPUs for a particular
> > cpupool that does not want core-scheduling?
> 
> sched_dario.c would only know of scheduling entities. Mapping of
> vcpus to scheduling entities is part of the infrastructure.
> 
> schedule.c would receive vcpu state changes. In case such a vcpu
> state change leads to a scheduling entity state change the related
> scheduler is informed about that to be able to react.
> 
> In case the scheduling entity is a thread (no core scheduling) each
> vcpu state change will result in a scheduling entity state change,
> so it will be as today.
> 
I think I'm starting to understand your idea better, although not
completely (or at least, I think I understand the idea itself, but I
still can't form, in my mind, a fully complete picture of how the
code will look, and of how it will work).

I think I still see reasons why something like group-scheduling belongs
in sched_dario.c, rather than in schedule.c, but that may partly be
because I've been working on doing it in _that_ way for some time, and
my brain is wrapped around that design. :-)

IAC, this really continues to look to me like a major rework of (not
only, probably) scheduling code. This, of course, does not mean it is
not worth doing, at least not per se. And although, as I said, I still
have some doubts, it is certainly possible that they'll be proven 
wrong by continuing the discussion, or by writing/seeing a prototype.

A little bit more on the practical side, there are patches out there
that let us achieve the goal, the Credit2 one being in the "not too
bad" state (at least IMO, and I'd be happy to have feedback on that).
In fact, with some effort, the Credit2 series could even be 4.12
material. I don't think the same would be true for the rework (and I'm
talking both in general, and considering my current situation, assuming
it would be me doing this --which is not necessarily the case, of
course :-D ).

So, basically, the way I personally see it, the most likely path for us
to get core-scheduling not too far away in time is an approach that
does not require rewriting, testing, and benchmarking large portions
of the scheduler (because it's scheduling: look at what the process of
switching to Credit2 has been/is being!).

And then maybe we can focus on how to make core-scheduling useful and
effective in really mitigating things like TLBLeed (via more topology
awareness in both guest and Xen) and L1TF (by doing the "panopticon"
[1] and/or truly synch'd context switches).

Others? Thoughts?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-18  8:16   ` Dario Faggioli
@ 2018-10-18 12:55     ` Tamas K Lengyel
  2018-10-18 13:48       ` Dario Faggioli
  0 siblings, 1 reply; 29+ messages in thread
From: Tamas K Lengyel @ 2018-10-18 12:55 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	bhavesh.davda, Jan Beulich, Xen-devel

On Thu, Oct 18, 2018 at 2:16 AM Dario Faggioli <dfaggioli@suse.com> wrote:
>
> On Wed, 2018-10-17 at 15:36 -0600, Tamas K Lengyel wrote:
> > On Fri, Aug 24, 2018 at 5:36 PM Dario Faggioli <dfaggioli@suse.com>
> > wrote:
> > >
> > > They give me a system that boots, where I can do basic stuff (like
> > > playing with dom0, creating guests, etc), and where the constraint
> > > of
> > > only scheduling vcpus from one domain at a time on pcpus that are
> > > part
> > > of the same core is, as far as I've seen, respected.
> > >
> > > There are still cases where the behavior is unideal, e.g., we could
> > > make a better use of some of the cores which are, some of the
> > > times,
> > > left idle.
> > >
> > > There are git branches here:
> > >  https://gitlab.com/dfaggioli/xen.git rel/sched/core-scheduling-
> > > RFCv1
> > >  https://github.com/fdario/xen.git rel/sched/core-scheduling-RFCv1
> > >
> > > Any comment is more than welcome.
> >
> > Hi Dario,
> >
> Hi,
>
> > thanks for the series, we are in the process of evaluating it in terms
> > of performance. Our test is to set up 2 VMs each being assigned enough
> > vCPUs to completely saturate all hyperthreads and then we fire up CPU
> > benchmarking inside the VMs to spin each vCPU 100% (using swet). The
> > idea is to force the scheduler to move vCPUs in-and-out constantly to
> > see how much performance hit there would be with core-scheduling vs
> > plain credit1 vs disabling hyperthreading. After running the test on a
> > handful of machines it looks like we get the best performance with
> > hyperthreading completely disabled, which is a bit unexpected. Have
> > you or anyone else encountered this?
> >
> Do you mean that no-hyperthreading is better than core-scheduling, as
> implemented by this series?
>
> Or do you mean that no-hyperthreading is better than plain Credit1,
> with SMT enabled and *without* this series?
>
> If the former, well, this series is not at a stage where it makes sense
> to run performance benchmarks. Not even close, I would say.

Understood, just wanted to get a rough idea of how much it affects
performance in the worst case. We haven't got to actually run the tests
on core-scheduling yet. On my laptop, Xen crashed when I tried to
create a VM after booting with sched_smt_cosched=1. On my desktop,
which has serial access, creating VMs works, but when I fired up swet
in both VMs the whole system froze - no crash or anything reported on
the serial. I suspect a deadlock, because everything froze:
display/keyboard/serial/ping. But no crash and no reboot.

> If the latter, it's a bit weird, as I've often seen hyperthreading
> causing seemingly weird performance figures, but it does not happen
> very often that it actually slows things down (although, I wouldn't
> rule it out! :-P).

Well, we ran it on at least 4 machines thus far (laptops, NUC,
desktops) and it is consistently better, and quite significantly. I
can post our detailed results if interested.

>
> Of course, this could also be a problem in the scheduler. Something I'd
> do is (leaving this series aside):
> - try to run the benchmark with half the number of vCPUs wrt the total
> number of CPUs (i.e., with only 1 vCPU per core);
> - try to run the benchmark as above, but explicitly pinning 1 vCPU on
> each core;
> - go back to nr vCPUs == nr pCPUs, but explicitly pin each vCPU to one
> pCPU (i.e., to one thread, in this case).
>
> You can do all the above with Credit, and results should already tell
> us whether we may be dealing with some scheduling anomaly. If you want,
> since you don't seem to have oversubscription, you can also try the
> null scheduler, to have even more data points (in fact, one of the
> purposes of having it was exactly this, i.e., making it possible to
> have a reference for other schedulers to compare against).

We do oversubscribe. Say we have 4 physical cores, showing as 8
logical CPUs with hyperthreading enabled. We run 2 VMs, each with 8
vCPUs (16 total), and in each VM we create 8 swet processes, each
spinning at 100% at the same time. The idea is to create a scenario
that's the "worst case", so that we can pinpoint the effect of the
scheduler changes. With pinning, or fewer vCPUs than hyperthreads, I
believe it would be harder to spot the effect of the scheduler
changes.
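
(To be clear about the math: 2 VMs x 8 vCPUs = 16 busy vCPUs on 8
logical CPUs, a 2:1 overcommit, so with perfect fairness each vCPU
should get roughly 50% of a thread.)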

>
> BTW, when you say "2 VMs with enough vCPUs to saturate all
> hyperthreads", does that include dom0 vCPUs?

No, dom0 vCPUs are on top of those but dom0 is largely idle for the
duration of the test.

Tamas


* Re: [RFC PATCH v1 00/16] xen: sched: implement core-scheduling
  2018-10-18 12:55     ` Tamas K Lengyel
@ 2018-10-18 13:48       ` Dario Faggioli
  0 siblings, 0 replies; 29+ messages in thread
From: Dario Faggioli @ 2018-10-18 13:48 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson,
	bhavesh.davda, Jan Beulich, Xen-devel


On Thu, 2018-10-18 at 06:55 -0600, Tamas K Lengyel wrote:
> On Thu, Oct 18, 2018 at 2:16 AM Dario Faggioli <dfaggioli@suse.com>
> wrote:
> > 
> > On Wed, 2018-10-17 at 15:36 -0600, Tamas K Lengyel wrote:
> > > On Fri, Aug 24, 2018 at 5:36 PM Dario Faggioli <
> > > dfaggioli@suse.com>
> > > wrote:
> > > > 
> > > > They give me a system that boots, where I can do basic stuff
> > > > (like
> > > > playing with dom0, creating guests, etc), and where the
> > > > constraint
> > > > of
> > > > only scheduling vcpus from one domain at a time on pcpus that
> > > > are
> > > > part
> > > > of the same core is, as far as I've seen, respected.
> > > > 
> > > > There are still cases where the behavior is unideal, e.g., we
> > > > could
> > > > make a better use of some of the cores which are, some of the
> > > > times,
> > > > left idle.
> > > > 
> > > > There are git branches here:
> > > >  https://gitlab.com/dfaggioli/xen.git rel/sched/core-
> > > > scheduling-
> > > > RFCv1
> > > >  https://github.com/fdario/xen.git rel/sched/core-scheduling-
> > > > RFCv1
> > > > 
> > > > Any comment is more than welcome.
> > > 
> > > Hi Dario,
> > > 
> > 
> > Hi,
> > 
> > > thanks for the series, we are in the process of evaluating it in
> > > terms of performance. Our test is to set up 2 VMs each being
> > > assigned enough vCPUs to completely saturate all hyperthreads and
> > > then we fire up CPU benchmarking inside the VMs to spin each vCPU
> > > 100% (using swet). The idea is to force the scheduler to move
> > > vCPUs in-and-out constantly to see how much performance hit there
> > > would be with core-scheduling vs plain credit1 vs disabling
> > > hyperthreading. After running the test on a handful of machines
> > > it looks like we get the best performance with hyperthreading
> > > completely disabled, which is a bit unexpected. Have you or
> > > anyone else encountered this?
> > > 
> > 
> > Do you mean that no-hyperthreading is better than core-scheduling,
> > as implemented by this series?
> >
> > Or do you mean that no-hyperthreading is better than plain Credit1,
> > with SMT enabled and *without* this series?
> >
> > If the former, well, this series is not at a stage where it makes
> > sense to run performance benchmarks. Not even close, I would say.
> 
> Understood, just wanted to get a rough idea of how much it affects
> performance in the worst case. We haven't got to actually run the
> tests on core-scheduling yet. On my laptop, Xen crashed when I tried
> to create a VM after booting with sched_smt_cosched=1.
>
Ah, interesting! :-)

> On my desktop, which has serial access, creating VMs works, but when
> I fired up swet in both VMs the whole system froze - no crash or
> anything reported on the serial. I suspect a deadlock, because
> everything froze: display/keyboard/serial/ping. But no crash and no
> reboot.
>
Right. This is not something I've seen during my tests. But, as said,
the code as it stands in this series does have fairness issues, which
may escalate into starvation issues.

And if the load you're creating in the VMs is enough to let the VMs
starve dom0 out of the host CPUs for long enough, then you may get
something like what you're seeing.

> > If the latter, it's a bit weird, as I've often seen hyperthreading
> > causing seemingly weird performance figures, but it does not happen
> > very often that it actually slows things down (although, I wouldn't
> > rule it out! :-P).
> 
> Well, we ran it on at least 4 machines thus far (laptops, NUC,
> desktops) and it is consistently better, and quite significantly. I
> can post our detailed results if interested.
> 
Ok.

> > You can do all the above with Credit, and results should already
> > tell us whether we may be dealing with some scheduling anomaly. If
> > you want,
> > since you don't seem to have oversubscription, you can also try the
> > null scheduler, to have even more data points (in fact, one of the
> > purposes of having it was exactly this, i.e., making it possible to
> > have a reference for other schedulers to compare against).
> 
> We do oversubscribe. Say we have 4 physical cores, showing as 8
> logical CPUs with hyperthreading enabled. We run 2 VMs each with 8
> vCPUs (16 total) and in
> each VM we create 8 swet processes each spinning at 100% at the same
> time. The idea is to create a scenario that's the "worst case", so
> that we can pinpoint the effect of the scheduler changes. 
>
So, basically, you have 16 CPU-hog vCPUs on 8 pCPUs. How do you
measure the actual performance? Does swet print out something like the
cycles it's doing, or stuff like that (sorry, I'm not familiar with it)?

IAC, in order to understand things better, I would run the experiment
gradually increasing the load.

Like (assuming 8 pCPUs, i.e., 4 cores with hyperthreading) you start
with one VM with 4 vCPUs, and then also check the case of 2 VMs with 2
vCPUs each. Without pinning. This would tell you what the performance
is when only 1 thread of each core is busy.

You can also try pinning, just as a check. The numbers you get should
be similar to the non-pinned case.

Then you go up with, say, 1 VM with 6 and then 8 vCPUs (or 2 VMs with 3
and then 4 vCPUs), and check what the trend is. Finally, you go all the
way toward oversubscription.
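
Summing it up in a (purely indicative) table:

    total vCPUs | VM split    | busy threads per core
    ------------+-------------+---------------------------------
         4      | 1x4 or 2x2  | 1 (also try the pinned variant)
         6      | 1x6 or 2x3  | 1 to 2
         8      | 1x8 or 2x4  | 2 (fully subscribed)
        16      | 2x8         | 2, oversubscribed (your setup)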

Feel free to post the actual numbers; I'll be happy to have a look at
them.

What is this, BTW, staging?

> With pinning
> or fewer vCPUs than hyperthreads I believe it would be harder to spot
> the effect of the scheduler changes.
> 
Absolutely, I was suggesting using pinning to try to understand why, in
this benchmark, it seems that hyperthreading has such a negative
effect, and whether or not this could be a scheduler bug or anomaly (I
mean in existing, checked-in code, not in the patch).

Basically, the pinning cases would act as a reference for (some of) the
unpinned ones. And if we see that there are significant differences,
then this may mean we have a bug.

To properly evaluate a change like core-scheduling, indeed we don't
want pinning... but we're not there yet. :-)

> > BTW, when you say "2 VMs with enough vCPUs to saturate all
> > hyperthreads", does that include dom0 vCPUs?
> 
> No, dom0 vCPUs are on top of those but dom0 is largely idle for the
> duration of the test.
>
Ok, then don't use null, not even in the "undersubscribed" cases (for
now).

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

