* [RFC PATCH v1 0/8] Series short description
From: Dario Faggioli @ 2018-10-12 17:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Bhavesh Davda, Wei Liu, George Dunlap

Hello,

Here it comes, core-scheduling for Credit2 as well. Well, this time,
it's actually group-scheduling (see below).

 git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2/group-scheduling-RFCv1
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/rel/sched/credit2/group-scheduling-RFCv1

 (Or https://github.com/fdario/xen/tree/rel/sched/credit2/group-scheduling-RFCv1 ,
  Or https://gitlab.com/dfaggioli/xen/tree/rel/sched/credit2/group-scheduling-RFCv1 )

An RFC series implementing the same feature for Credit1 is here:
https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02164.html

The two series, however, are completely independent, and I'd recommend
focusing on this one first. In fact, implementing the feature here in
Credit2 was waaay simpler, and the result is, IMO, already a lot better.

Therefore, I expect the amount of effort required to make this very
series upstreamable to be much smaller than for the Credit1 one.
When this is in, we'll have one scheduler that supports
group-scheduling, and we can focus on what to do with the others.

Let me also point out that there is some discussion (in the thread of
the Credit1 RFC series [1]) about whether a different approach toward
implementing core/group-scheduling wouldn't be better. I had this code
almost ready already, so I decided to send it out anyway. If it then
turns out that we have to throw it away, then fine. But, so far, nothing
has convinced me that the way things are done in this series is not our
current best solution to the problems we have at hand.

So, what's in here? Well, we have a generic group scheduling
implementation which seems to me to work reasonably well... For an
RFC. ;-P

I call it generic because, although the main aim is core-scheduling, it
can be made to work (and in fact, it already kind of does) with
different groupings (like node, socket, or arbitrary sets of CPUs).

It does not have the fairness and starvation issues that the RFC series
for Credit1, linked above, has. I.e., it already sort-of works. :-D

Some improvements are necessary, mostly because Credit2 is not a fully
work conserving scheduler, and this hurts when we do things like group
scheduling. So we need to add logic for doing some quick load balancing,
or work stealing, when a CPU goes idle, but that is not that much of a
big deal (I was already thinking of adding it anyway).

Finding a way of considering group-scheduling while doing proper load
balancing is also on my todo list. It is less easy than the work
conserving-ification described above, but also less important, IMO.

What's not there? Well, mainly, documentation updates and tracing are
missing. About the latter, I have an unfinished patch which adds
tracepoints that will be useful to observe, understand and debug whether
the code behaves as we expect. I'll send that out soon, too.

Some notes on the actual patches:
- patches 1 and 2 have been submitted already, but they're necessary
  for testing this series, so I've included them;
- credit2_group_sched=core has not only been boot tested, I've also
  thrown a couple of (basic) workloads at it.
  credit2_group_sched=no has also been tested, to verify it does not
  break things (but, e.g., I haven't measured the overhead the series
  introduces).
  credit2_group_sched=node has been boot tested, but not much else
  (see the usage sketch right after this list);
- cpupool and CPU hotplug have _not_ been tested.
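
For reference, here's a sketch of how the feature is meant to be used
(these are the parameters introduced by patches 4 and 5; per-core
groups are the default of this series):

  # on the Xen boot command line:
  credit2_group_sched=core    # core-scheduling, i.e., per-core groups
  credit2_group_sched=node    # per-NUMA-node groups
  credit2_group_sched=no      # group scheduling disabled

Note that a group can't span more than one runqueue. E.g., asking for
credit2_group_sched=node together with credit2_runqueue=core is
inconsistent; the series detects that, and falls back to per-core
groups (or disables the feature, if even that doesn't fit).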

Thanks and Regards,
Dario

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg00707.html
    https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01010.html

PS. I'm Cc-ing a few people with whom we've discussed these issues, but
    only on this cover letter, to avoid spamming you with scheduling
    code. Find the patches on the list/git, or ping me, and I'll mail
    them to you... :-)
---
Dario Faggioli (8):
      xen: sched: Credit2: during scheduling, update the idle mask before using it
      xen: sched: Credit2: avoid looping too much (over runqueues) during load balancing
      xen: sched: Credit2: show runqueue id during runqueue dump
      xen: sched: Credit2: generalize topology related bootparam handling
      xen: sched: Credit2 group-scheduling: data structures
      xen: sched: Credit2 group-scheduling: selecting next vcpu to run
      xen: sched: Credit2 group-scheduling: tickling
      xen: sched: Credit2 group-scheduling: anti-starvation measures


 xen/common/sched_credit2.c |  492 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 439 insertions(+), 53 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* [RFC PATCH v1 1/8] xen: sched: Credit2: during scheduling, update the idle mask before using it
From: Dario Faggioli @ 2018-10-12 17:43 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Load balancing, which happens at the end of a "scheduler epoch", can
trigger vcpu migration, which in turn may call runq_tickle(). If the cpu
where this happens was idle, but we're now going to schedule a vcpu on
it, let's update the runqueue's idle cpus mask accordingly _before_
doing load balancing.

Not doing that may cause runq_tickle() to think that the cpu is still
idle, and tickle it to go pick up a vcpu from the runqueue, which would
be wrong/suboptimal.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched_credit2.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 2b16bcea21..72fed2dd18 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3554,6 +3554,13 @@ csched2_schedule(
             __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
+        /* Clear the idle mask if necessary */
+        if ( cpumask_test_cpu(cpu, &rqd->idle) )
+        {
+            __cpumask_clear_cpu(cpu, &rqd->idle);
+            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+        }
+
         /*
          * The reset condition is "has a scheduler epoch come to an end?".
          * The way this is enforced is checking whether the vcpu at the top
@@ -3574,13 +3581,6 @@ csched2_schedule(
             balance_load(ops, cpu, now);
         }
 
-        /* Clear the idle mask if necessary */
-        if ( cpumask_test_cpu(cpu, &rqd->idle) )
-        {
-            __cpumask_clear_cpu(cpu, &rqd->idle);
-            smt_idle_mask_clear(cpu, &rqd->smt_idle);
-        }
-
         snext->start_time = now;
         snext->tickled_cpu = -1;
 



* [RFC PATCH v1 2/8] xen: sched: Credit2: avoid looping too much (over runqueues) during load balancing
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

To do load balancing between runqueues, we check the load of each
runqueue, select the one whose load is most "distant" from our own, and
then take the proper runqueue lock and attempt vcpu migrations.

If we fail to take such lock, we try again. The idea was to give up and
bail if, during the checking phase, we couldn't take the lock of any
runqueue (see the comment near the 'goto retry;', in the middle of
balance_load()).

However, the variable that controls the "give up and bail" part is not
reset upon retries. Therefore, provided we managed to check the load of
at least one runqueue during the first pass, if we can't get any runq
lock afterwards, we don't bail, but keep trying to take the lock of
that same runqueue (possibly even more than once).
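
Here is a self-contained sketch of the failure mode, with the fix
applied (all the names are stand-ins for this sketch, not the actual
Credit2 ones; the real checking phase looks at the other runqueues'
loads, under their locks):

  #include <stdbool.h>
  #include <stdio.h>

  static int pass;

  /* Pretend trylock: only succeeds during the very first pass. */
  static bool trylock(int rq) { (void)rq; return pass == 0; }

  int main(void)
  {
      int i, max_delta_rqi;

   retry:
      /*
       * The fix: without this reset, the candidate computed during
       * pass 0 survives into pass 1, 2, ..., and we never bail.
       */
      max_delta_rqi = -1;

      /* Checking phase: find the most "distant" runqueue we can peek at. */
      for ( i = 0; i < 4; i++ )
          if ( trylock(i) )
              max_delta_rqi = i;

      if ( max_delta_rqi == -1 )
      {
          printf("no candidate, bailing at pass %d\n", pass);
          return 0;
      }

      /* Migration phase: take the chosen runqueue's lock "for real". */
      pass++;
      if ( !trylock(max_delta_rqi) )
          goto retry;

      printf("balancing against runqueue %d\n", max_delta_rqi);
      return 0;
  }

With the reset in place, the second pass finds no candidate and bails;
comment it out, and the stale candidate from pass 0 keeps sending us
back to 'retry' forever, which is exactly the bug being fixed.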

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched_credit2.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 72fed2dd18..06b45725fa 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -2554,7 +2554,7 @@ static bool vcpu_is_migrateable(struct csched2_vcpu *svc,
 static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
 {
     struct csched2_private *prv = csched2_priv(ops);
-    int i, max_delta_rqi = -1;
+    int i, max_delta_rqi;
     struct list_head *push_iter, *pull_iter;
     bool inner_load_updated = 0;
 
@@ -2573,6 +2573,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     update_runq_load(ops, st.lrqd, 0, now);
 
 retry:
+    max_delta_rqi = -1;
     if ( !read_trylock(&prv->lock) )
         return;
 



* [RFC PATCH v1 3/8] xen: sched: Credit2: show runqueue id during runqueue dump
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Instead of just a sequence number, and consistently
with what's shown when the runqueues are created.

Also add some more pretty printing here and there.

No functional change intended.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched_credit2.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 06b45725fa..617a7ece6e 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3658,7 +3658,7 @@ dump_pcpu(const struct scheduler *ops, int cpu)
 #define cpustr keyhandler_scratch
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] runq=%d, sibling=%s, ", cpu, c2r(cpu), cpustr);
+    printk(" CPU[%02d] runq=%d, sibling=%s, ", cpu, c2r(cpu), cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
@@ -3760,7 +3760,8 @@ csched2_dump(const struct scheduler *ops)
         /* We need the lock to scan the runqueue. */
         spin_lock(&rqd->lock);
 
-        printk("Runqueue %d:\n", i);
+        printk("Runqueue %d:\n", rqd->id);
+        printk("CPUs:\n");
 
         for_each_cpu(j, &rqd->active)
             dump_pcpu(ops, j);



* [RFC PATCH v1 4/8] xen: sched: Credit2: generalize topology related bootparam handling
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Right now, runqueue organization is the only bit of the scheduler that
uses such topology related information. But that may not be true
forever, i.e., there may be other boot parameters which take the same
"core", "socket", etc., strings as argument.

In fact, this is the case for the credit2_group_sched parameter,
introduced in later patches.

Therefore, let's:
- make the #define-s more general, i.e., RUNQUEUE -> TOPOLOGY;
- do the parsing outside of the specific function handling the
  credit2_runqueue param.

While there, we also move "node" before "socket", so that we have them
ordered from the smallest to the largest, and we can do math with their
values. (Yes, I know, the relationship between node and socket is not
always that clear but, e.g., I've found boxes, like EPYC ones, with more
than one node in one socket, while I've never found one where two
sockets are in the same node, so...)

No functional change intended.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched_credit2.c |   61 +++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 617a7ece6e..9550503b5b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -434,34 +434,43 @@ integer_param("credit2_cap_period_ms", opt_cap_period);
  * either the same physical core, the same physical socket, the same NUMA
  * node, or just all of them, will be put together to form runqueues.
  */
-#define OPT_RUNQUEUE_CPU    0
-#define OPT_RUNQUEUE_CORE   1
-#define OPT_RUNQUEUE_SOCKET 2
-#define OPT_RUNQUEUE_NODE   3
-#define OPT_RUNQUEUE_ALL    4
-static const char *const opt_runqueue_str[] = {
-    [OPT_RUNQUEUE_CPU] = "cpu",
-    [OPT_RUNQUEUE_CORE] = "core",
-    [OPT_RUNQUEUE_SOCKET] = "socket",
-    [OPT_RUNQUEUE_NODE] = "node",
-    [OPT_RUNQUEUE_ALL] = "all"
+#define OPT_TOPOLOGY_CPU    0
+#define OPT_TOPOLOGY_CORE   1
+#define OPT_TOPOLOGY_NODE   2
+#define OPT_TOPOLOGY_SOCKET 3
+#define OPT_TOPOLOGY_ALL    4
+static const char *const opt_topospan_str[] = {
+    [OPT_TOPOLOGY_CPU] = "cpu",
+    [OPT_TOPOLOGY_CORE] = "core",
+    [OPT_TOPOLOGY_NODE] = "node",
+    [OPT_TOPOLOGY_SOCKET] = "socket",
+    [OPT_TOPOLOGY_ALL] = "all"
 };
-static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
 
-static int __init parse_credit2_runqueue(const char *s)
+static int __init parse_topology_span(const char *s)
 {
     unsigned int i;
 
-    for ( i = 0; i < ARRAY_SIZE(opt_runqueue_str); i++ )
+    for ( i = 0; i < ARRAY_SIZE(opt_topospan_str); i++ )
     {
-        if ( !strcmp(s, opt_runqueue_str[i]) )
-        {
-            opt_runqueue = i;
-            return 0;
-        }
+        if ( !strcmp(s, opt_topospan_str[i]) )
+            return i;
     }
 
-    return -EINVAL;
+    return -1;
+}
+
+static int __read_mostly opt_runqueue = OPT_TOPOLOGY_SOCKET;
+
+static int __init parse_credit2_runqueue(const char *s)
+{
+    opt_runqueue = parse_topology_span(s);
+
+    if ( opt_runqueue < 0 )
+        return -EINVAL;
+
+    ASSERT(opt_runqueue <= OPT_TOPOLOGY_ALL);
+    return 0;
 }
 custom_param("credit2_runqueue", parse_credit2_runqueue);
 
@@ -883,12 +892,12 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
 
-        if (opt_runqueue == OPT_RUNQUEUE_CPU)
+        if (opt_runqueue == OPT_TOPOLOGY_CPU)
             continue;
-        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
-             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
-             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
-             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
+        if ( opt_runqueue == OPT_TOPOLOGY_ALL ||
+             (opt_runqueue == OPT_TOPOLOGY_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_TOPOLOGY_SOCKET && same_socket(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_TOPOLOGY_NODE && same_node(peer_cpu, cpu)) )
             break;
     }
 
@@ -4021,7 +4030,7 @@ csched2_init(struct scheduler *ops)
            opt_load_window_shift,
            opt_underload_balance_tolerance,
            opt_overload_balance_tolerance,
-           opt_runqueue_str[opt_runqueue],
+           opt_topospan_str[opt_runqueue],
            opt_cap_period);
 
     printk(XENLOG_INFO "load tracking window length %llu ns\n",



* [RFC PATCH v1 5/8] xen: sched: Credit2 group-scheduling: data structures
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Group scheduling is, for us, when a certain group of CPUs can only
execute the vcpus of one domain, at any given time. Which CPUs form the
groups can be defined pretty much arbitrarily, but groups are usually
built after the system topology. E.g., core-scheduling is a pretty
popular form of group scheduling, where the CPUs that are SMT sibling
threads within one core are in the same group.

So, basically, core-scheduling means that, if we have one core with two
threads, we will never run dAv0 (i.e., vcpu 0 of domain A) and dBv2 on
those two threads. We either run dAv0 and dAv3 on them or, if only one
of dA's vcpus can run, one of the threads stays idle.

Making Credit2 support core-scheduling is the main aim of this patch
series, but the implementation is general, and allows the user to
choose a different granularity/arrangement of the groups (such as
per-NUMA-node groups).

This commit only implements the boot command line parameter (to enable,
disable and configure the feature), the data structures, and the domain
tracking logic.

This means that, until the group scheduling logic is implemented in
later commits, the result of the "what domain is running in this group"
tracking (which can be seen via `xl debug-keys r') is not to be
considered correct.
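
A tiny userspace illustration of the grouping idea (purely
hypothetical: real cpu numbering is not guaranteed to be contiguous
like this, which is exactly why the patch matches cpus via
same_core()/same_node() instead of doing arithmetic):

  #include <stdio.h>

  /* A fictional box: 2 SMT threads per core, 4 cpus per NUMA node. */
  #define THREADS_PER_CORE 2
  #define CPUS_PER_NODE    4

  static int group_of(int cpu, int span)
  {
      return cpu / span;
  }

  int main(void)
  {
      int cpu;

      for ( cpu = 0; cpu < 8; cpu++ )
          printf("cpu %d: core-group %d, node-group %d\n", cpu,
                 group_of(cpu, THREADS_PER_CORE),
                 group_of(cpu, CPUS_PER_NODE));
      return 0;
  }

With credit2_group_sched=core, cpus {0,1} of such a box can, at any
given time, only run vcpus of one domain; with credit2_group_sched=node,
the same constraint extends to cpus {0,1,2,3}.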

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
TODO:
- document credit2_group_sched in docs/misc/xen-command-line.markdown;
---
 xen/common/sched_credit2.c |  262 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 255 insertions(+), 7 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 9550503b5b..b11713e244 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -171,6 +171,36 @@
  *   pool, must occur only when holding the 'budget_lock'.
  */
 
+/*
+ * Group scheduling:
+ *
+ * A group of physical cpus is said to be coscheduling a domain if only
+ * virtual cpus of that same domain are running on any of the cpus. If there
+ * are not enough (ready to run) vcpus from the domain, some of the pCPUs in
+ * the coscheduling group stay idle.
+ *
+ * So, basically, after we've divided the cpus into coscheduling groups, each
+ * group will run a domain at a time. For instance, the cpus of coscheduling
+ * group A, at any given time, either run the vcpus of a specific guest, or are
+ * idle.
+ *
+ * Typically, coscheduling groups are formed after topology considerations,
+ * e.g., the SMT topology. I.e., all the cpus that share a core live in the
+ * same coscheduling group. This has started to go under the name of
+ * 'core-scheduling'.
+ *
+ * Enabling a group scheduling like behavior may, depending on a number of
+ * factors, bring benefits from a performance or security point of view.
+ * E.g., core-scheduling could be of help in limiting information leak through
+ * side channel attacks on some SMT systems.
+ *
+ * NB: the term 'gang scheduling' also exists and is used, sometimes as a
+ * synonym of coscheduling and group scheduling. However, strictly speaking,
+ * gang scheduling means that a certain set of vcpus (typically the vcpus
+ * of a guest) either run together, each on one cpu, or do not run at all.
+ * We therefore do not use that term here.
+ */
+
 /*
  * Locking:
  *
@@ -184,6 +214,9 @@
  *  + protects runqueue-wide data in csched2_runqueue_data;
  *  + protects vcpu parameters in csched2_vcpu for the vcpu in the
  *    runqueue.
+ *  + protects group-scheduling wide data in csched2_grpsched_data. This
+ *    is because we force cpus that are in the same coscheduling group, to
+ *    also share the same runqueue.
  *
  * - Private scheduler lock
  *  + protects scheduler-wide data in csched2_private, such as:
@@ -474,6 +507,60 @@ static int __init parse_credit2_runqueue(const char *s)
 }
 custom_param("credit2_runqueue", parse_credit2_runqueue);
 
+/*
+ * Group scheduling.
+ *
+ * We support flexible coscheduling grouping strategies, such as:
+ *
+ * - cpu: meaning no group scheduling happens (i.e., this is how group
+ *        scheduling is disabled);
+ *
+ * - core: pCPUs are grouped at the core level. This means pCPUs that are
+ *         sibling hyperthreads within the same core are made part of the
+ *         same group. Therefore, each core only executes one domain at a
+ *         time. The number of vCPUs of such domain running on each core
+ *         depends on how many threads the core itself has (typically 2,
+ *         but systems with 4 threads per core exist already);
+ *
+ * - node: pCPUs are grouped at the NUMA node level. This means all the pCPUs
+ *         within a NUMA node are made part of one group, and hence execute
+ *         the vCPUs of one domain at a time. On SMT systems, this of course
+ *         means that all the threads of all the cores inside a node are in
+ *         the same group.
+ *
+ * Per-socket --which often is the same as per-node, but not always-- and
+ * even global group scheduling is certainly possible, but not currently
+ * implemented. Well, in theory it should "just work"^TM, but it hasn't been
+ * tested thoroughly, so let's not offer it to users.
+ *
+ * pCPUs that are part of the same group, must also share the runqueue.
+ */
+static int __read_mostly opt_grpsched = OPT_TOPOLOGY_CORE;
+
+static int __init parse_credit2_group_sched(const char *s)
+{
+    if ( !strcmp(s, "no") || !strcmp(s, "false") )
+    {
+        opt_grpsched = 0;
+        return 0;
+    }
+
+    opt_grpsched = parse_topology_span(s);
+
+    /* We're limiting group scheduling to node granularity, for now. */
+    if ( opt_grpsched < 0 || opt_grpsched > OPT_TOPOLOGY_NODE )
+        return -EINVAL;
+
+    return 0;
+}
+custom_param("credit2_group_sched", parse_credit2_group_sched);
+
+/* Returns false if opt_grpsched is OPT_TOPOLOGY_CPU, which is 0 */
+static inline bool grpsched_enabled(void)
+{
+    return opt_grpsched;
+}
+
 /*
  * Per-runqueue data
  */
@@ -498,6 +585,17 @@ struct csched2_runqueue_data {
     unsigned int pick_bias;    /* Last picked pcpu. Start from it next time  */
 };
 
+/*
+ * Per-coscheduling group data
+ */
+struct csched2_grpsched_data {
+    /* No locking necessary, we use runqueue lock for serialization.         */
+    struct csched2_dom *sdom;  /* domain running on the cpus of the group    */
+    int id;                    /* ID of this group (-1 if invalid)           */
+    unsigned int nr_running;   /* vcpus currently running in this group      */
+    cpumask_t cpus;            /* cpus that are part of this group           */
+};
+
 /*
  * System-wide private data
  */
@@ -510,6 +608,7 @@ struct csched2_private {
 
     cpumask_t active_queues;           /* Runqueues with (maybe) active cpus */
     struct csched2_runqueue_data *rqd; /* Data of the various runqueues      */
+    struct csched2_grpsched_data *gscd;/* Data of the coscheduling groups    */
 
     cpumask_t initialized;             /* CPUs part of this scheduler        */
     struct list_head sdom;             /* List of domains (for debug key)    */
@@ -519,6 +618,7 @@ struct csched2_private {
  * Physical CPU
  */
 struct csched2_pcpu {
+    struct csched2_grpsched_data *gscd;
     int runq_id;
 };
 
@@ -607,6 +707,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops,
     return &csched2_priv(ops)->rqd[c2r(cpu)];
 }
 
+/* CPU to coscheduling group data */
+static inline struct csched2_grpsched_data *c2gscd(unsigned int cpu)
+{
+    return csched2_pcpu(cpu)->gscd;
+}
+
 /* Does the domain of this vCPU have a cap? */
 static inline bool has_cap(const struct csched2_vcpu *svc)
 {
@@ -1624,6 +1730,46 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     new->tickled_cpu = ipid;
 }
 
+/*
+ * Group scheduling code.
+ */
+
+static unsigned int
+cpu_to_cosched_group(struct csched2_private *prv, unsigned int cpu)
+{
+    struct csched2_grpsched_data *gscd;
+    unsigned int peer_cpu, gsci;
+
+    ASSERT(opt_runqueue >= opt_grpsched);
+    ASSERT(opt_grpsched > 0 && opt_grpsched <= OPT_TOPOLOGY_NODE);
+
+    for ( gsci = 0; gsci < nr_cpu_ids; gsci++ )
+    {
+        /* As soon as we come across an uninitialized group, use it. */
+        if ( prv->gscd[gsci].id == -1 )
+            break;
+
+        /*
+         * We've found an element of the gscd array which has been initialized,
+         * already, and hence has at least one CPU in it. Check if this CPU
+         * belongs there too.
+         */
+
+        gscd = prv->gscd + gsci;
+        BUG_ON(cpumask_empty(&gscd->cpus));
+        peer_cpu = cpumask_first(&gscd->cpus);
+
+        if ( (opt_grpsched == OPT_TOPOLOGY_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_grpsched == OPT_TOPOLOGY_NODE && same_node(peer_cpu, cpu)) )
+            break;
+    }
+
+    /* We really expect that each cpu will be in a coscheduling group. */
+    BUG_ON(gsci >= nr_cpu_ids);
+
+    return gsci;
+}
+
 /*
  * Credit-related code
  */
@@ -3460,6 +3606,7 @@ csched2_schedule(
 {
     const int cpu = smp_processor_id();
     struct csched2_runqueue_data *rqd;
+    struct csched2_grpsched_data * const gscd = c2gscd(cpu);
     struct csched2_vcpu * const scurr = csched2_vcpu(current);
     struct csched2_vcpu *snext = NULL;
     unsigned int skipped_vcpus = 0;
@@ -3474,6 +3621,11 @@ csched2_schedule(
     rqd = c2rqd(ops, cpu);
     BUG_ON(!cpumask_test_cpu(cpu, &rqd->active));
 
+    /*
+     * We're holding the runqueue lock already. For group-scheduling data,
+     * cpus that are in the same group, also share the runqueue, so serializing
+     * them on the runqueue lock is enough, and no further locking is necessary.
+     */
     ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
 
     BUG_ON(!is_idle_vcpu(scurr->vcpu) && scurr->rqd != rqd);
@@ -3562,6 +3714,14 @@ csched2_schedule(
 
             runq_remove(snext);
             __set_bit(__CSFLAG_scheduled, &snext->flags);
+
+            /* Track which domain is running in the coscheduling group */
+            gscd->sdom = snext->sdom;
+            if ( is_idle_vcpu(scurr->vcpu) )
+            {
+                gscd->nr_running++;
+                ASSERT(gscd->nr_running <= cpumask_weight(&gscd->cpus));
+            }
         }
 
         /* Clear the idle mask if necessary */
@@ -3623,10 +3783,21 @@ csched2_schedule(
             cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
             smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
         }
+        if ( !is_idle_vcpu(scurr->vcpu) )
+        {
+            ASSERT(gscd->nr_running >= 1);
+            if ( --gscd->nr_running == 0 )
+            {
+                /* There's no domain running on this coscheduling group */
+                gscd->sdom = NULL;
+            }
+        }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
         update_load(ops, rqd, NULL, 0, now);
     }
+    ASSERT(gscd->sdom != NULL || gscd->nr_running == 0);
+    ASSERT(gscd->nr_running != 0 || gscd->sdom == NULL);
 
     /*
      * Return task to run next...
@@ -3667,7 +3838,8 @@ dump_pcpu(const struct scheduler *ops, int cpu)
 #define cpustr keyhandler_scratch
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk(" CPU[%02d] runq=%d, sibling=%s, ", cpu, c2r(cpu), cpustr);
+    printk(" %sCPU[%02d] runq=%d, sibling=%s, ", grpsched_enabled() ? " " : "",
+           cpu, c2r(cpu), cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
@@ -3772,8 +3944,32 @@ csched2_dump(const struct scheduler *ops)
         printk("Runqueue %d:\n", rqd->id);
         printk("CPUs:\n");
 
-        for_each_cpu(j, &rqd->active)
-            dump_pcpu(ops, j);
+        cpumask_copy(cpumask_scratch, &rqd->active);
+        while ( !cpumask_empty(cpumask_scratch) )
+        {
+            int c = cpumask_first(cpumask_scratch);
+            struct csched2_grpsched_data * const cgscd = c2gscd(c);
+            cpumask_t *cpus;
+
+            if ( grpsched_enabled() )
+            {
+                cpus = &cgscd->cpus;
+                printk(" cosched_group=%d, ", cgscd->id);
+                if ( cgscd->sdom )
+                    printk("sdom=d%d, ", cgscd->sdom->dom->domain_id);
+                else
+                   printk("sdom=/, ");
+                printk("nr_running=%u\n", cgscd->nr_running);
+            }
+            else
+                cpus = cpumask_scratch;
+
+            for_each_cpu(j, cpus)
+            {
+                cpumask_clear_cpu(j, cpumask_scratch);
+                dump_pcpu(ops, j);
+            }
+        }
 
         printk("RUNQ:\n");
         list_for_each( iter, runq )
@@ -3814,11 +4010,13 @@ init_pdata(struct csched2_private *prv, struct csched2_pcpu *spc,
            unsigned int cpu)
 {
     struct csched2_runqueue_data *rqd;
+    struct csched2_grpsched_data *gscd;
+    unsigned int grpsched_id;
 
     ASSERT(rw_is_write_locked(&prv->lock));
     ASSERT(!cpumask_test_cpu(cpu, &prv->initialized));
     /* CPU data needs to be allocated, but still uninitialized. */
-    ASSERT(spc && spc->runq_id == -1);
+    ASSERT(spc && spc->runq_id == -1 && spc->gscd == NULL);
 
     /* Figure out which runqueue to put it in */
     spc->runq_id = cpu_to_runqueue(prv, cpu);
@@ -3831,7 +4029,7 @@ init_pdata(struct csched2_private *prv, struct csched2_pcpu *spc,
         printk(XENLOG_INFO " First cpu on runqueue, activating\n");
         activate_runqueue(prv, spc->runq_id);
     }
-    
+
     __cpumask_set_cpu(cpu, &rqd->idle);
     __cpumask_set_cpu(cpu, &rqd->active);
     __cpumask_set_cpu(cpu, &prv->initialized);
@@ -3840,6 +4038,27 @@ init_pdata(struct csched2_private *prv, struct csched2_pcpu *spc,
     if ( cpumask_weight(&rqd->active) == 1 )
         rqd->pick_bias = cpu;
 
+    /* Figure out in which coscheduling group this belongs */
+    if ( grpsched_enabled() )
+    {
+        grpsched_id = cpu_to_cosched_group(prv, cpu);
+
+        printk("Adding cpu %d to cosched. group %d\n", cpu, grpsched_id);
+        spc->gscd = gscd = &prv->gscd[grpsched_id];
+        if ( cpumask_empty(&gscd->cpus) )
+        {
+            printk("First cpu in group, activating\n");
+            ASSERT(gscd->sdom == NULL && gscd->nr_running == 0);
+            gscd->id = grpsched_id;
+        }
+        cpumask_set_cpu(cpu, &gscd->cpus);
+    }
+    else
+    {
+        spc->gscd = &prv->gscd[cpu];
+        cpumask_set_cpu(cpu, &prv->gscd[cpu].cpus);
+    }
+
     return spc->runq_id;
 }
 
@@ -4009,6 +4228,23 @@ csched2_global_init(void)
         opt_cap_period = 10; /* ms */
     }
 
+    if ( opt_grpsched > opt_runqueue )
+    {
+        printk("WARNING: %s: can't have %s group scheduling with per-%s runqueue\n",
+               __func__, opt_topospan_str[opt_grpsched],
+               opt_topospan_str[opt_runqueue]);
+        if ( opt_runqueue >= OPT_TOPOLOGY_CORE )
+        {
+            printk(" resorting to per-core group scheduling\n");
+            opt_grpsched = OPT_TOPOLOGY_CORE;
+        }
+        else
+        {
+            printk(" disabling group scheduling\n");
+            opt_grpsched = 0;
+        }
+    }
+
     return 0;
 }
 
@@ -4025,12 +4261,14 @@ csched2_init(struct scheduler *ops)
            XENLOG_INFO " underload_balance_tolerance: %d\n"
            XENLOG_INFO " overload_balance_tolerance: %d\n"
            XENLOG_INFO " runqueues arrangement: %s\n"
+           XENLOG_INFO " group scheduling: %s\n"
            XENLOG_INFO " cap enforcement granularity: %dms\n",
            opt_load_precision_shift,
            opt_load_window_shift,
            opt_underload_balance_tolerance,
            opt_overload_balance_tolerance,
            opt_topospan_str[opt_runqueue],
+           opt_grpsched ? opt_topospan_str[opt_grpsched] : "disabled",
            opt_cap_period);
 
     printk(XENLOG_INFO "load tracking window length %llu ns\n",
@@ -4050,15 +4288,25 @@ csched2_init(struct scheduler *ops)
     rwlock_init(&prv->lock);
     INIT_LIST_HEAD(&prv->sdom);
 
-    /* Allocate all runqueues and mark them as un-initialized */
+    /*
+     * Allocate all runqueues and coscheduling group data structures,
+     * and mark them all as un-initialized.
+     */
     prv->rqd = xzalloc_array(struct csched2_runqueue_data, nr_cpu_ids);
     if ( !prv->rqd )
     {
         xfree(prv);
         return -ENOMEM;
     }
+    prv->gscd = xzalloc_array(struct csched2_grpsched_data, nr_cpu_ids);
+    if ( !prv->gscd )
+    {
+        xfree(prv->rqd);
+        xfree(prv);
+        return -ENOMEM;
+    }
     for ( i = 0; i < nr_cpu_ids; i++ )
-        prv->rqd[i].id = -1;
+        prv->rqd[i].id = prv->gscd[i].id = -1;
 
     /* initialize ratelimit */
     prv->ratelimit_us = sched_ratelimit_us;



* [RFC PATCH v1 6/8] xen: sched: Credit2 group-scheduling: selecting next vcpu to run
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

When choosing which vcpu to run next on a CPU which is in a group where
other vcpus are already running, only consider vcpus of the same domain
as those already running vcpus.

This is as easy as skipping, while traversing the runqueue in
runq_candidate(), the vcpus that do not satisfy the group-scheduling
constraints.

And now that such constraints are actually enforced, also add an
ASSERT() that checks that we really respect them.
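
For clarity, here is the same constraint restated as a self-contained
snippet (the struct and function names are stand-ins for this sketch,
not the actual Credit2 ones; the real check is open-coded in
runq_candidate()):

  #include <stdbool.h>
  #include <stdio.h>

  /* Stand-ins for csched2_grpsched_data and csched2_vcpu. */
  struct grp  { const void *sdom; unsigned int nr_running; };
  struct vcpu { const void *sdom; };  /* sdom == NULL <=> idle vcpu */

  /*
   * Can 'svc' run on this cpu, given the state of its coscheduling
   * group 'g' and the currently running vcpu 'scurr'?
   */
  static bool grpsched_allows(const struct grp *g, const struct vcpu *scurr,
                              const struct vcpu *svc)
  {
      return g->sdom == NULL ||            /* whole group is idle       */
             g->sdom == svc->sdom ||       /* same domain runs already  */
             (g->nr_running == 1 &&        /* only this cpu is busy,    */
              scurr->sdom != NULL);        /* and svc would preempt it  */
  }

  int main(void)
  {
      int domA, domB;
      struct grp g = { &domA, 2 };          /* 2 dA vcpus running       */
      struct vcpu scurr = { &domA }, svcA = { &domA }, svcB = { &domB };

      printf("dA vcpu: %d, dB vcpu: %d\n",
             grpsched_allows(&g, &scurr, &svcA),   /* 1: allowed */
             grpsched_allows(&g, &scurr, &svcB));  /* 0: skipped */
      return 0;
  }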

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
TODO:
- Consider better the interactions between group-scheduling and
  soft-affinity (in runq_candidate() @3481);
---
 xen/common/sched_credit2.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index b11713e244..052e050394 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3414,7 +3414,7 @@ csched2_runtime(const struct scheduler *ops, int cpu,
 /*
  * Find a candidate.
  */
-static struct csched2_vcpu *
+static noinline struct csched2_vcpu *
 runq_candidate(struct csched2_runqueue_data *rqd,
                struct csched2_vcpu *scurr,
                int cpu, s_time_t now,
@@ -3423,8 +3423,19 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     struct list_head *iter, *temp;
     struct csched2_vcpu *snext = NULL;
     struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
+    struct csched2_grpsched_data *gscd = c2gscd(cpu);
     bool yield = false, soft_aff_preempt = false;
 
+    /*
+     * Some more sanity checking. With group scheduling enabled, either:
+     * - the whole coscheduling group is currently idle. Or,
+     * - this CPU is currently idle. Or,
+     * - this CPU is running a vcpu from the same domain as all the
+     *   other ones that are running in the group (if any).
+     */
+    ASSERT(!grpsched_enabled() || gscd->sdom == NULL ||
+           scurr->sdom == NULL || gscd->sdom == scurr->sdom);
+
     *skipped = 0;
 
     if ( unlikely(is_idle_vcpu(scurr->vcpu)) )
@@ -3473,6 +3484,8 @@ runq_candidate(struct csched2_runqueue_data *rqd,
         {
             cpumask_t *online = cpupool_domain_cpumask(scurr->vcpu->domain);
 
+            /* XXX deal with grpsched_enabled() == true */
+
             /* Ok, is any of the pcpus in scurr soft-affinity idle? */
             cpumask_and(cpumask_scratch, cpumask_scratch, &rqd->idle);
             cpumask_andnot(cpumask_scratch, cpumask_scratch, &rqd->tickled);
@@ -3528,6 +3541,23 @@ runq_candidate(struct csched2_runqueue_data *rqd,
             continue;
         }
 
+        /*
+         * If group scheduling is enabled, only consider svc if:
+         * - the whole group is idle. Or,
+         * - one or more other svc->sdom's vcpus are running already in the
+         *   pCPUs of the coscheduling group. Or,
+         * - there is only one vcpu running in the whole coscheduling group,
+         *   and it is running here on this CPU (and svc would preempt it).
+         */
+        if ( grpsched_enabled() &&
+             gscd->sdom != NULL && gscd->sdom != svc->sdom &&
+             !(gscd->nr_running == 1 && scurr->sdom != NULL) )
+        {
+            ASSERT(gscd->nr_running != 0);
+            (*skipped)++;
+            continue;
+        }
+
         /*
          * If a vcpu is meant to be picked up by another processor, and such
          * processor has not scheduled yet, leave it in the runqueue for him.
@@ -3715,6 +3745,18 @@ csched2_schedule(
             runq_remove(snext);
             __set_bit(__CSFLAG_scheduled, &snext->flags);
 
+            /*
+             * If group scheduling is enabled, and we're switching to
+             * a non-idle vcpu, either:
+             * - they're from the same domain,
+             * - the whole coscheduling group was idle,
+             * - there was only 1 vcpu running in the whole scheduling group,
+             *   and it was running on this CPU (i.e., this CPU was not idle).
+             */
+            ASSERT(!grpsched_enabled() || gscd->sdom == snext->sdom ||
+                   (gscd->nr_running == 0 && gscd->sdom == NULL) ||
+                   (gscd->nr_running == 1 && !is_idle_vcpu(scurr->vcpu)));
+
             /* Track which domain is running in the coscheduling group */
             gscd->sdom = snext->sdom;
             if ( is_idle_vcpu(scurr->vcpu) )



* [RFC PATCH v1 7/8] xen: sched: Credit2 group-scheduling: tickling
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

When choosing which CPU should be poked to go pick up a vcpu from the
runqueue, take group-scheduling into account, if it is enabled.

Basically, we avoid tickling CPUs that, even if they are idle, are part
of coscheduling groups where vcpus of other domains (wrt the one waking
up) are already running. Instead, we actively try to tickle the idle
CPUs within the coscheduling groups where vcpus of the same domain are
currently running.
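
To illustrate the rule with a minimal standalone sketch (the types and
the pick_cpu() helper are stand-ins, not the actual Credit2 ones; the
real code works on cpumask_t and struct csched2_grpsched_data, and also
factors in affinity, already-tickled CPUs, etc.):

  #include <stdio.h>

  struct grp { int sdom; };  /* domain running in the group; -1 if none */

  /*
   * Among the idle cpus 0..ncpus-1, pick one for a vcpu of domain
   * 'dom': prefer groups already running 'dom', then fall back to
   * fully idle groups; -1 if no cpu qualifies.
   */
  static int pick_cpu(const struct grp *grp[], int ncpus, int dom)
  {
      int c;

      for ( c = 0; c < ncpus; c++ )
          if ( grp[c]->sdom == dom )
              return c;
      for ( c = 0; c < ncpus; c++ )
          if ( grp[c]->sdom == -1 )
              return c;
      return -1;
  }

  int main(void)
  {
      struct grp g0 = { 7 }, g1 = { -1 };   /* g0 already runs domain 7 */
      const struct grp *cpus[4] = { &g0, &g0, &g1, &g1 };

      printf("tickle for d7 -> cpu %d\n", pick_cpu(cpus, 4, 7));  /* 0 */
      printf("tickle for d3 -> cpu %d\n", pick_cpu(cpus, 4, 3));  /* 2 */
      return 0;
  }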

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
TODO:
- deal with sched_smt_power_savings==true;
- optimize the search of appropriate CPUs to be tickled, most likely
  using a per-domain data structure. That will spare us having to do
  a loop.
---
 xen/common/sched_credit2.c |   73 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 64 insertions(+), 9 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 052e050394..d2b4c907dc 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1558,6 +1558,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     s_time_t max = 0;
     unsigned int bs, cpu = new->vcpu->processor;
     struct csched2_runqueue_data *rqd = c2rqd(ops, cpu);
+    struct csched2_grpsched_data *gscd = c2gscd(cpu);
     cpumask_t *online = cpupool_domain_cpumask(new->vcpu->domain);
     cpumask_t mask;
 
@@ -1593,10 +1594,19 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                   cpumask_test_cpu(cpu, &rqd->idle) &&
                   !cpumask_test_cpu(cpu, &rqd->tickled)) )
     {
-        ASSERT(cpumask_cycle(cpu, new->vcpu->cpu_hard_affinity) == cpu);
-        SCHED_STAT_CRANK(tickled_idle_cpu_excl);
-        ipid = cpu;
-        goto tickle;
+        /*
+         * If group scheduling is enabled, what's running on the
+         * other pCPUs of the coscheduling group also matters.
+         */
+        if ( !grpsched_enabled() || gscd->sdom == NULL ||
+             gscd->sdom == new->sdom )
+        {
+            ASSERT(gscd->nr_running < cpumask_weight(&gscd->cpus));
+            ASSERT(cpumask_cycle(cpu, new->vcpu->cpu_hard_affinity) == cpu);
+            SCHED_STAT_CRANK(tickled_idle_cpu_excl);
+            ipid = cpu;
+            goto tickle;
+        }
     }
 
     for_each_affinity_balance_step( bs )
@@ -1617,6 +1627,8 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
          */
         if ( unlikely(sched_smt_power_savings) )
         {
+            /* XXX deal with grpsched_enabled() == true */
+
             cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
             cpumask_and(&mask, &mask, online);
         }
@@ -1626,6 +1638,15 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         i = cpumask_test_or_cycle(cpu, &mask);
         if ( i < nr_cpu_ids )
         {
+            struct csched2_grpsched_data *igscd = c2gscd(i);
+
+            ASSERT(igscd->nr_running < cpumask_weight(&igscd->cpus));
+            /*
+             * If we're doing core-scheduling, the CPU being in smt_idle also
+             * means that there are no other vcpus running in the group.
+             */
+            ASSERT(opt_grpsched != OPT_TOPOLOGY_CORE ||
+                   (igscd->sdom == NULL && igscd->nr_running == 0));
             SCHED_STAT_CRANK(tickled_idle_cpu);
             ipid = i;
             goto tickle;
@@ -1640,11 +1661,36 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu), online);
         cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
         i = cpumask_test_or_cycle(cpu, &mask);
-        if ( i < nr_cpu_ids )
+        /*
+         * If we don't have group scheduling enabled, any CPU in the mask
+         * is fine. And in fact, during the very first iteration, we take
+         * the 'if', and go to tickling.
+         *
+         * If, OTOH, that is enabled, we want to tickle CPUs that are in
+         * groups where other vcpus of new's domain are running already.
+         *
+         * XXX Potential optimization: if we use a data structure where we
+         *     keep track, for each domain, on what pCPUs the vcpus of the
+         *     domain itself are currently running, we can probably avoid
+         *     the loop.
+         */
+        while ( !cpumask_empty(&mask) )
         {
-            SCHED_STAT_CRANK(tickled_idle_cpu);
-            ipid = i;
-            goto tickle;
+            struct csched2_grpsched_data *igscd = c2gscd(i);
+
+            ASSERT(i < nr_cpu_ids);
+            ASSERT(is_idle_vcpu(curr_on_cpu(i)) &&
+                   csched2_vcpu(curr_on_cpu(i))->sdom == NULL);
+            ASSERT(igscd->nr_running < cpumask_weight(&igscd->cpus));
+            if ( !grpsched_enabled() || igscd->sdom == NULL ||
+                 igscd->sdom == new->sdom)
+            {
+                SCHED_STAT_CRANK(tickled_idle_cpu);
+                ipid = i;
+                goto tickle;
+            }
+            __cpumask_clear_cpu(i, &mask);
+            i = cpumask_cycle(i, &mask);
         }
     }
 
@@ -1667,7 +1713,10 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
     cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
-    if ( __cpumask_test_and_clear_cpu(cpu, &mask) )
+    if ( __cpumask_test_and_clear_cpu(cpu, &mask) &&
+         (!grpsched_enabled() || gscd->sdom == NULL ||
+          gscd->sdom == new->sdom ||
+          (gscd->nr_running == 1 && !is_idle_vcpu(curr_on_cpu(cpu)))) )
     {
         s_time_t score = tickle_score(ops, now, new, cpu);
 
@@ -1687,11 +1736,17 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
 
     for_each_cpu(i, &mask)
     {
+        struct csched2_grpsched_data *igscd = c2gscd(i);
         s_time_t score;
 
         /* Already looked at this one above */
         ASSERT(i != cpu);
 
+        if ( grpsched_enabled() && igscd->sdom != NULL &&
+             igscd->sdom != new->sdom &&
+             !(igscd->nr_running == 1 && !is_idle_vcpu(curr_on_cpu(i))) )
+            continue;
+
         score = tickle_score(ops, now, new, i);
 
         if ( score > max )



* [RFC PATCH v1 8/8] xen: sched: Credit2 group-scheduling: anti-starvation measures
From: Dario Faggioli @ 2018-10-12 17:44 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

With group scheduling enabled, if a vcpu of, say, domain A, is already
running on a CPU, the other CPUs of the group can only run vcpus of
that same domain. And in fact, we scan the runqueue and look for one.

But then what can happen is that vcpus of domain A take turns at
switching between idle/blocked and running, and manage to keep the
vcpus of every other domain out of a group of CPUs for a long time, or
even indefinitely (impacting fairness, or causing starvation).

To avoid this, let's limit how deep we go along the runqueue in search
of a vcpu of domain A. That is, if the only vcpus of domain A that we
find are more than a certain amount of credits behind the vcpu at the
top of the runqueue, give up and keep the CPU idle.
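
To make the rule concrete (the numbers here are made up, and, as per
the TODO below, CSCHED2_MIN_TIMER is just a first stab at the
threshold): say the vcpu at the top of the runqueue has credits worth
10ms of CPU time, and the threshold is 2ms. A domain A vcpu with 8.5ms
worth of credits can still be coscheduled, while one with 7ms can't:
in the latter case, the CPU is kept idle.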

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
TODO:
- for now, CSCHED2_MIN_TIMER is what's used as threshold, but this can
  use some tuning (e.g., it probably wants to be adaptive, depending on
  how wide the coscheduling group of CPUs is, etc.)
---
 xen/common/sched_credit2.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index d2b4c907dc..a23c8f18d6 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -3476,7 +3476,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
                unsigned int *skipped)
 {
     struct list_head *iter, *temp;
-    struct csched2_vcpu *snext = NULL;
+    struct csched2_vcpu *first_svc, *snext = NULL;
     struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
     struct csched2_grpsched_data *gscd = c2gscd(cpu);
     bool yield = false, soft_aff_preempt = false;
@@ -3568,11 +3568,28 @@ runq_candidate(struct csched2_runqueue_data *rqd,
      * Of course, we also default to idle also if scurr is not runnable.
      */
     if ( vcpu_runnable(scurr->vcpu) && !soft_aff_preempt )
         snext = scurr;
     else
         snext = csched2_vcpu(idle_vcpu[cpu]);
 
  check_runq:
+    /*
+     * To retain fairness, and avoid starvation issues, we don't let
+     * group scheduling make us run vcpus which are too far behind (i.e.,
+     * have fewer credits) than what is currently in the runqueue.
+     *
+     * XXX Just use MIN_TIMER as the threshold, for now.
+     */
+    first_svc = list_entry(rqd->runq.next, struct csched2_vcpu, runq_elem);
+    if ( grpsched_enabled() && !is_idle_vcpu(scurr->vcpu) &&
+         !list_empty(&rqd->runq) )
+    {
+        ASSERT(gscd->sdom != NULL);
+        if ( scurr->credit < first_svc->credit - CSCHED2_MIN_TIMER )
+            snext = csched2_vcpu(idle_vcpu[cpu]);
+    }
+
     list_for_each_safe( iter, temp, &rqd->runq )
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
@@ -3637,6 +3654,19 @@ runq_candidate(struct csched2_runqueue_data *rqd,
             continue;
         }
 
+        /*
+         * As stated above, let's not go too far and risk picking up
+         * a vcpu which has much lower credits than the one we would
+         * have picked if group scheduling was not enabled.
+         *
+         * There's a risk that this means leaving the CPU idle (if we don't
+         * find vcpus that satisfy this rule, and also the group scheduling
+         * constraints)... but that's what coscheduling is all about!
+         */
+        if ( grpsched_enabled() && gscd->sdom != NULL &&
+             svc->credit < first_svc->credit - CSCHED2_MIN_TIMER )
+            break;
+
         /*
          * If the one in the runqueue has more credit than current (or idle,
          * if current is not runnable), or if current is yielding, and also



* Re: [RFC PATCH v1 0/8] Series short description
From: Dario Faggioli @ 2018-11-07 17:58 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Bhavesh Davda, Wei Liu, George Dunlap


On Fri, 2018-10-12 at 19:43 +0200, Dario Faggioli wrote:
> Hello,
> 
> Here it comes, core-scheduling for Credit2 as well. Well, this time,
> it's actually group-scheduling (see below).
> 
> [...]
> 
> Finding a way of considering group-scheduling while doing proper load
> balancing is also on my todo list. It is less easy than the work
> conserving-ification described above, but also less important, IMO.
> 
So... Any idea? Thoughts? First impressions? :-D

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


* Re: [RFC PATCH v1 6/8] xen: sched: Credit2 group-scheduling: selecting next vcpu to run
From: George Dunlap @ 2018-11-21 16:15 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel

On 10/12/18 6:44 PM, Dario Faggioli wrote:
> When choosing which vcpu to run next on a CPU which is in a group where
> other vcpus are already running, only consider vcpus of the same domain
> as those already running vcpus.
> 
> This is as easy as skipping, while traversing the runqueue in
> runq_candidate(), the vcpus that do not satisfy the group-scheduling
> constraints.
> 
> And now that such constraints are actually enforced, also add an ASSERT()
> that checks that we really respect them.
> 
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

As a data point in the "number of tags" question:
1. my normal way of importing a series is to use `stg import` on a
single mbox file;
2. if something doesn't apply cleanly, I often fix it up and re-apply
using `-i` to say, "ignore already-applied-patches"
3. stgit seems to use the name of the patch to determine if the patch
has been applied or not
4. For 'name', it only uses the first four words it can see.

So, after fixing up some trivial porting issues in earlier patches, I
got this:

$ stg import -i --reject -M "/tmp/dariof.credit2-core-scheduling.rfc-v1"
Checking for changes in the working directory ... done
Ignoring already applied patch "xen-sched-credit2-during"
Ignoring already applied patch "xen-sched-credit2-avoid"
Ignoring already applied patch "xen-sched-credit2-show"
Ignoring already applied patch "xen-sched-credit2-generalize"
Ignoring already applied patch "xen-sched-credit2-group"
Ignoring already applied patch "xen-sched-credit2-group"
Ignoring already applied patch "xen-sched-credit2-group"
Ignoring already applied patch "xen-sched-credit2-group"
Now at patch "xen-sched-credit2-group"

That is, it only applied the first of the last four patches, because
they all look the same to it.

Obviously that's somewhat of a deficiency in stackgit, but it
demonstrates the weird issues you run into when your description line
has too many tags. :-)

I'll pull the branch from xenbits mentioned in the cover letter.

 -George

