xen-devel.lists.xenproject.org archive mirror
* [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
@ 2020-04-29 17:36 Dario Faggioli
  2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

In Credit2, the CPUs are assigned to runqueues according to the host
topology. For instance, if we want per-socket runqueues (which is the
default), the CPUs that are in the same socket will end up in the same
runqueue.

This is generally good for scalability, at least as long as the number
of CPUs that end up in the same runqueue is not too high. In fact, all
these CPUs compete for the same spinlock when making scheduling
decisions and manipulating the scheduler data structures. Therefore, if
there are too many of them, that lock can become a bottleneck.

This has not been an issue so far, but architectures with 128 CPUs per
socket are now available, and it is certainly far from ideal to have so
many CPUs in the same runqueue, competing for the same locks, etc.

Let's therefore set a cap on the total number of CPUs that can share a
runqueue. The value is 16 by default, but it can be changed with a boot
command line parameter.
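
For example, to raise the limit to 32 CPUs per runqueue, it will be
possible to boot Xen with (this is the parameter name as it is
introduced in patch 2):

    sched_credit2_max_cpus_runqueue=32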

Note that this is orthogonal to, and independent from, the work of
introducing new runqueue arrangement mechanisms and criteria. In fact,
we may very well want to implement the capability of having, say,
per-LLC runqueues, or per-die (where that is defined) runqueues. That
would work well on current systems, but nothing guarantees that, in
future, there won't be an architecture with hundreds of CPUs per die or
LLC... and we would be back to where we are now.

Therefore, even if we are actually planning to add some new runqueue
organization criteria, setting a limit on the number of CPUs that will
ever share a runqueue, whatever the underlying topology is, remains
useful.

Note also that, if the host has hyperthreading (and we are not using
core-scheduling), additional care is taken to avoid splitting CPUs that
are hyperthread siblings among different runqueues.
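
For instance, with the default limit of 16 and 2 threads per core, a
runqueue that already has 15 CPUs in it will not accept the first
thread of a core whose sibling is not there yet: both threads will,
instead, end up together in another (possibly new) runqueue.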

---
Dario Faggioli (2):
      xen: credit2: factor cpu to runqueue matching in a function
      xen: credit2: limit the max number of CPUs in a runqueue

 xen/common/sched/cpupool.c |    2 -
 xen/common/sched/credit2.c |  130 +++++++++++++++++++++++++++++++++++++++-----
 xen/common/sched/private.h |    2 +
 3 files changed, 119 insertions(+), 15 deletions(-)

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function
  2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
@ 2020-04-29 17:36 ` Dario Faggioli
  2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap

Just move the big if() condition into an inline function.

No functional change intended.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
---
 xen/common/sched/credit2.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index 34f05c3e2a..697c9f917d 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -838,6 +838,20 @@ static inline bool same_core(unsigned int cpua, unsigned int cpub)
            cpu_to_core(cpua) == cpu_to_core(cpub);
 }
 
+static inline bool
+cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
+{
+    unsigned int peer_cpu = rqd->pick_bias;
+
+    BUG_ON(cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
+
+    /* OPT_RUNQUEUE_CPU will never find an existing runqueue. */
+    return opt_runqueue == OPT_RUNQUEUE_ALL ||
+           (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
+           (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
+           (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
+}
+
 static struct csched2_runqueue_data *
 cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -855,21 +869,11 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
     rqd_ins = &prv->rql;
     list_for_each_entry ( rqd, &prv->rql, rql )
     {
-        unsigned int peer_cpu;
-
         /* Remember first unused queue index. */
         if ( !rqi_unused && rqd->id > rqi )
             rqi_unused = true;
 
-        peer_cpu = rqd->pick_bias;
-        BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
-               cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
-
-        /* OPT_RUNQUEUE_CPU will never find an existing runqueue. */
-        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
-             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
-             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
-             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
+        if ( cpu_runqueue_match(rqd, cpu) )
         {
             rqd_valid = true;
             break;
@@ -3744,6 +3748,8 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
     struct csched2_pcpu *spc;
     struct csched2_runqueue_data *rqd;
 
+    BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID);
+
     spc = xzalloc(struct csched2_pcpu);
     if ( spc == NULL )
         return ERR_PTR(-ENOMEM);




* [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
@ 2020-04-29 17:36 ` Dario Faggioli
  2020-04-30  6:45   ` Jan Beulich
  2020-04-30  7:35   ` Jürgen Groß
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2 siblings, 2 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

In Credit2 CPUs (can) share runqueues, depending on the topology. For
instance, with per-socket runqueues (the default) all the CPUs that are
part of the same socket share a runqueue.

On platforms with a huge number of CPUs per socket, that could be a
problem. An example is AMD EPYC2 servers, where we can have up to 128
CPUs in a socket.

It is of course possible to define other, still topology-based, runqueue
arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
runqueues with too many CPUs on other/future platforms.

Therefore, let's set a limit to the max number of CPUs that can share a
Credit2 runqueue. The actual value is configurable (at boot time), the
default being 16. If, for instance,  there are more than 16 CPUs in a
socket, they'll be split among two (or more) runqueues.
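
For example, with the default limit of 16, the 128 CPUs of one of the
sockets mentioned above will be spread across 8 runqueues, with 16 CPUs
each (assuming they are all in the same cpupool).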

Note: with core scheduling enabled, this parameter sets the max number
of *scheduling resources* that can share a runqueue. Therefore, with
granularity set to core (and assuming 2 threads per core), we will have
at most 16 cores per runqueue, which corresponds to 32 threads. But that
is fine, considering how core scheduling works.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Juergen Gross <jgross@suse.com>
---
 xen/common/sched/cpupool.c |    2 -
 xen/common/sched/credit2.c |  104 ++++++++++++++++++++++++++++++++++++++++++--
 xen/common/sched/private.h |    2 +
 3 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
index d40345b585..0227457285 100644
--- a/xen/common/sched/cpupool.c
+++ b/xen/common/sched/cpupool.c
@@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
 
 static DEFINE_SPINLOCK(cpupool_lock);
 
-static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
+enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
 static unsigned int __read_mostly sched_granularity = 1;
 
 #ifdef CONFIG_HAS_SCHED_GRANULARITY
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index 697c9f917d..abe4d048c8 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -471,6 +471,16 @@ static int __init parse_credit2_runqueue(const char *s)
 }
 custom_param("credit2_runqueue", parse_credit2_runqueue);
 
+/*
+ * How many CPUs will be put, at most, in the same runqueue.
+ * Runqueues are still arranged according to the host topology (and
+ * according to the value of the 'credit2_runqueue' parameter). But
+ * we also have a cap to the number of CPUs that share runqueues.
+ * As soon as we reach the limit, a new runqueue will be created.
+ */
+static unsigned int __read_mostly opt_max_cpus_runqueue = 16;
+integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue);
+
 /*
  * Per-runqueue data
  */
@@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
            (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
 }
 
+/* Additional checks, to avoid separating siblings in different runqueues. */
+static bool
+cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
+{
+    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
+    unsigned int rcpu, nr_smts = 0;
+
+    /*
+     * If we put the CPU in this runqueue, we must be sure that there will
+     * be enough room for accepting its hyperthread sibling(s) here as well.
+     */
+    cpumask_clear(cpumask_scratch_cpu(cpu));
+    for_each_cpu ( rcpu, &rqd->active )
+    {
+        ASSERT(rcpu != cpu);
+        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
+        {
+            /*
+             * For each CPU already in the runqueue, account for it and for
+             * its sibling(s), independently from whether such sibling(s) are
+             * in the runqueue already or not.
+             *
+             * Of course, if there are sibling CPUs in the runqueue already,
+             * only count them once.
+             */
+            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
+                       per_cpu(cpu_sibling_mask, rcpu));
+            nr_smts += nr_sibl;
+        }
+    }
+    /*
+     * We know that neither the CPU, nor any of its sibling are here,
+     * or we wouldn't even have entered the function.
+     */
+    ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu),
+                               per_cpu(cpu_sibling_mask, cpu)));
+
+    /* Try adding CPU and its sibling(s) to the count and check... */
+    nr_smts += nr_sibl;
+
+    if ( nr_smts <= opt_max_cpus_runqueue )
+        return true;
+
+    return false;
+}
+
 static struct csched2_runqueue_data *
 cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
     struct csched2_runqueue_data *rqd, *rqd_new;
+    struct csched2_runqueue_data *rqd_valid = NULL;
     struct list_head *rqd_ins;
     unsigned long flags;
     int rqi = 0;
-    bool rqi_unused = false, rqd_valid = false;
+    bool rqi_unused = false;
 
     /* Prealloc in case we need it - not allowed with interrupts off. */
     rqd_new = xzalloc(struct csched2_runqueue_data);
@@ -873,11 +930,44 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         if ( !rqi_unused && rqd->id > rqi )
             rqi_unused = true;
 
-        if ( cpu_runqueue_match(rqd, cpu) )
+        /*
+         * Check whether the CPU should (according to the topology) and also
+         * can (if we there aren't too many already) go in this runqueue.
+         */
+        if ( rqd->refcnt < opt_max_cpus_runqueue &&
+             cpu_runqueue_match(rqd, cpu) )
         {
-            rqd_valid = true;
-            break;
+            cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);
+
+            dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n",
+                    cpu, rqd->id, CPUMASK_PR(&rqd->active),
+                    opt_max_cpus_runqueue);
+
+            /*
+             * If we're using core (or socket!) scheduling, or we don't have
+             * hyperthreading, no need to do any further checking.
+             *
+             * If no (to both), but our sibling is already in this runqueue,
+             * then it's also ok for the CPU to stay in this runqueue..
+             *
+             * Otherwise, do some more checks, to better account for SMT.
+             */
+            if ( opt_sched_granularity != SCHED_GRAN_cpu ||
+                 cpumask_weight(siblings) <= 1 ||
+                 cpumask_intersects(&rqd->active, siblings) )
+            {
+                dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id);
+                rqd_valid = rqd;
+                break;
+            }
+            else if ( cpu_runqueue_smt_match(rqd, cpu) )
+            {
+                dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id);
+                rqd_valid = rqd;
+            }
         }
+	else
+            dprintk(XENLOG_DEBUG, "ignoring runq %d\n", rqd->id);
 
         if ( !rqi_unused )
         {
@@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         rqd->pick_bias = cpu;
         rqd->id = rqi;
     }
+    else
+        rqd = rqd_valid;
+
+    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n",
+           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id,
+           CPUMASK_PR(&rqd->active));
 
     rqd->refcnt++;
 
diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
index 367811a12f..e964e3f407 100644
--- a/xen/common/sched/private.h
+++ b/xen/common/sched/private.h
@@ -30,6 +30,8 @@ enum sched_gran {
     SCHED_GRAN_socket
 };
 
+extern enum sched_gran opt_sched_granularity;
+
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
  * we have a per-cpu pointer, along with a pre-allocated set of




* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
@ 2020-04-30  6:45   ` Jan Beulich
  2020-05-26 22:00     ` Dario Faggioli
  2020-04-30  7:35   ` Jürgen Groß
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-04-30  6:45 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

On 29.04.2020 19:36, Dario Faggioli wrote:
> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
>             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
>  }
>  
> +/* Additional checks, to avoid separating siblings in different runqueues. */
> +static bool
> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
> +{
> +    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
> +    unsigned int rcpu, nr_smts = 0;
> +
> +    /*
> +     * If we put the CPU in this runqueue, we must be sure that there will
> +     * be enough room for accepting its hyperthread sibling(s) here as well.
> +     */
> +    cpumask_clear(cpumask_scratch_cpu(cpu));
> +    for_each_cpu ( rcpu, &rqd->active )
> +    {
> +        ASSERT(rcpu != cpu);
> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> +        {
> +            /*
> +             * For each CPU already in the runqueue, account for it and for
> +             * its sibling(s), independently from whether such sibling(s) are
> +             * in the runqueue already or not.
> +             *
> +             * Of course, if there are sibling CPUs in the runqueue already,
> +             * only count them once.
> +             */
> +            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
> +                       per_cpu(cpu_sibling_mask, rcpu));
> +            nr_smts += nr_sibl;

This being common code, is it appropriate to assume all CPUs having
the same number of siblings? Even beyond that, iirc the sibling mask
represents the online or parked siblings, but not offline ones. For
the purpose here, don't you rather care about the full set?

What about HT vs AMD Fam15's CUs? Do you want both to be treated
the same here?

Also could you outline the intentions with this logic in the
description, to be able to match the goal with what gets done?

> +        }
> +    }
> +    /*
> +     * We know that neither the CPU, nor any of its sibling are here,
> +     * or we wouldn't even have entered the function.
> +     */
> +    ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu),
> +                               per_cpu(cpu_sibling_mask, cpu)));
> +
> +    /* Try adding CPU and its sibling(s) to the count and check... */
> +    nr_smts += nr_sibl;
> +
> +    if ( nr_smts <= opt_max_cpus_runqueue )
> +        return true;
> +
> +    return false;

Fold these into

    return nr_smts <= opt_max_cpus_runqueue;

?

> @@ -873,11 +930,44 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          if ( !rqi_unused && rqd->id > rqi )
>              rqi_unused = true;
>  
> -        if ( cpu_runqueue_match(rqd, cpu) )
> +        /*
> +         * Check whether the CPU should (according to the topology) and also
> +         * can (if we there aren't too many already) go in this runqueue.

Nit: Stray "we"?

> +         */
> +        if ( rqd->refcnt < opt_max_cpus_runqueue &&
> +             cpu_runqueue_match(rqd, cpu) )
>          {
> -            rqd_valid = true;
> -            break;
> +            cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);
> +
> +            dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n",
> +                    cpu, rqd->id, CPUMASK_PR(&rqd->active),
> +                    opt_max_cpus_runqueue);
> +
> +            /*
> +             * If we're using core (or socket!) scheduling, or we don't have
> +             * hyperthreading, no need to do any further checking.
> +             *
> +             * If no (to both), but our sibling is already in this runqueue,
> +             * then it's also ok for the CPU to stay in this runqueue..

Nit: Stray full 2nd stop?

> +             * Otherwise, do some more checks, to better account for SMT.
> +             */
> +            if ( opt_sched_granularity != SCHED_GRAN_cpu ||
> +                 cpumask_weight(siblings) <= 1 ||
> +                 cpumask_intersects(&rqd->active, siblings) )
> +            {
> +                dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id);
> +                rqd_valid = rqd;
> +                break;
> +            }
> +            else if ( cpu_runqueue_smt_match(rqd, cpu) )
> +            {
> +                dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id);
> +                rqd_valid = rqd;
> +            }
>          }
> +	else

Hard tab slipped in.

> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          rqd->pick_bias = cpu;
>          rqd->id = rqi;
>      }
> +    else
> +        rqd = rqd_valid;
> +
> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n",
> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id,
> +           CPUMASK_PR(&rqd->active));

Iirc there's one per-CPU printk() already. On large systems this isn't
very nice, so I'd like to ask that their total number at least not get
further grown. Ideally there would be a less verbose summary after all
CPUs have been brought up at boot, with per-CPU info be logged only
during CPU hot online.

Jan



* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
  2020-04-30  6:45   ` Jan Beulich
@ 2020-04-30  7:35   ` Jürgen Groß
  2020-04-30 12:28     ` Dario Faggioli
  1 sibling, 1 reply; 21+ messages in thread
From: Jürgen Groß @ 2020-04-30  7:35 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Andrew Cooper, George Dunlap, Jan Beulich

On 29.04.20 19:36, Dario Faggioli wrote:
> In Credit2 CPUs (can) share runqueues, depending on the topology. For
> instance, with per-socket runqueues (the default) all the CPUs that are
> part of the same socket share a runqueue.
> 
> On platform with a huge number of CPUs per socket, that could be a
> problem. An example is AMD EPYC2 servers, where we can have up to 128
> CPUs in a socket.
> 
> It is of course possible to define other, still topology-based, runqueue
> arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
> runqueues with too many CPUs on other/future platforms.
> 
> Therefore, let's set a limit to the max number of CPUs that can share a
> Credit2 runqueue. The actual value is configurable (at boot time), the
> default being 16. If, for instance,  there are more than 16 CPUs in a
> socket, they'll be split among two (or more) runqueues.

Did you think about balancing the runqueues regarding the number of
cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in each
runqueue? I know this will need more logic as cpus are added one by one,
but the result would be much better IMO.

> 
> Note: with core scheduling enabled, this parameter sets the max number
> of *scheduling resources* that can share a runqueue. Therefore, with
> granularity set to core (and assumint 2 threads per core), we will have
> at most 16 cores per runqueue, which corresponds to 32 threads. But that
> is fine, considering how core scheduling works.
> 
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
>   xen/common/sched/cpupool.c |    2 -
>   xen/common/sched/credit2.c |  104 ++++++++++++++++++++++++++++++++++++++++++--
>   xen/common/sched/private.h |    2 +
>   3 files changed, 103 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
> index d40345b585..0227457285 100644
> --- a/xen/common/sched/cpupool.c
> +++ b/xen/common/sched/cpupool.c
> @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
>   
>   static DEFINE_SPINLOCK(cpupool_lock);
>   
> -static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
> +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;

Please don't use the global option value, but the per-cpupool one.

>   static unsigned int __read_mostly sched_granularity = 1;
>   
>   #ifdef CONFIG_HAS_SCHED_GRANULARITY
> diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
> index 697c9f917d..abe4d048c8 100644
> --- a/xen/common/sched/credit2.c
> +++ b/xen/common/sched/credit2.c
> @@ -471,6 +471,16 @@ static int __init parse_credit2_runqueue(const char *s)
>   }
>   custom_param("credit2_runqueue", parse_credit2_runqueue);
>   
> +/*
> + * How many CPUs will be put, at most, in the same runqueue.
> + * Runqueues are still arranged according to the host topology (and
> + * according to the value of the 'credit2_runqueue' parameter). But
> + * we also have a cap to the number of CPUs that share runqueues.
> + * As soon as we reach the limit, a new runqueue will be created.
> + */
> +static unsigned int __read_mostly opt_max_cpus_runqueue = 16;
> +integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue);
> +
>   /*
>    * Per-runqueue data
>    */
> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
>              (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
>   }
>   
> +/* Additional checks, to avoid separating siblings in different runqueues. */
> +static bool
> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
> +{
> +    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));

Shouldn't you mask away siblings not in the cpupool?

> +    unsigned int rcpu, nr_smts = 0;
> +
> +    /*
> +     * If we put the CPU in this runqueue, we must be sure that there will
> +     * be enough room for accepting its hyperthread sibling(s) here as well.
> +     */
> +    cpumask_clear(cpumask_scratch_cpu(cpu));
> +    for_each_cpu ( rcpu, &rqd->active )
> +    {
> +        ASSERT(rcpu != cpu);
> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> +        {
> +            /*
> +             * For each CPU already in the runqueue, account for it and for
> +             * its sibling(s), independently from whether such sibling(s) are
> +             * in the runqueue already or not.
> +             *
> +             * Of course, if there are sibling CPUs in the runqueue already,
> +             * only count them once.
> +             */
> +            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
> +                       per_cpu(cpu_sibling_mask, rcpu));

Again, local cpupool only!


Juergen



* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-30  7:35   ` Jürgen Groß
@ 2020-04-30 12:28     ` Dario Faggioli
  2020-04-30 12:52       ` Jürgen Groß
  0 siblings, 1 reply; 21+ messages in thread
From: Dario Faggioli @ 2020-04-30 12:28 UTC (permalink / raw)
  To: Jürgen Groß, xen-devel
  Cc: Andrew Cooper, George Dunlap, Jan Beulich


On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
> On 29.04.20 19:36, Dario Faggioli wrote:
> > 
> > Therefore, let's set a limit to the max number of CPUs that can
> > share a
> > Credit2 runqueue. The actual value is configurable (at boot time),
> > the
> > default being 16. If, for instance,  there are more than 16 CPUs in
> > a
> > socket, they'll be split among two (or more) runqueues.
> 
> Did you think about balancing the runqueues regarding the number of
> cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in
> each
> runqueue? I know this will need more logic as cpus are added one by
> one,
> but the result would be much better IMO.
>
I know. Point is, CPUs not only are added one by one, but they can,
once the system is running, be offlined/onlined or moved among
cpupools.

Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at
boot, if the admin removes 4 CPUs that happened to be all in the same
runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd
indeed need full runqueue online rebalancing logic, which will probably
end up being quite complex and I'm not sure it's worth it.

That being said, I can try to make things a bit more fair, when CPUs
come up and are added to the pool. Something around the line of adding
them to the runqueue with the least number of CPUs in it (among the
suitable ones, of course).
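
Concretely, I am thinking of something along these lines (just a
sketch, not even compile tested; it reuses cpu_runqueue_match() and the
refcnt field that this very patch already maintains, and leaves the SMT
checks out for brevity):

    /*
     * Instead of stopping at the first runqueue that matches the
     * topology, keep scanning and remember the suitable one (i.e.,
     * matching and still below the cap) with the fewest CPUs in it.
     */
    struct csched2_runqueue_data *rqd, *rqd_best = NULL;

    list_for_each_entry ( rqd, &prv->rql, rql )
    {
        if ( rqd->refcnt < opt_max_cpus_runqueue &&
             cpu_runqueue_match(rqd, cpu) &&
             (!rqd_best || rqd->refcnt < rqd_best->refcnt) )
            rqd_best = rqd;
    }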

With that, when the user removes 4 CPUs, we will have the 6 vs 10
situation. But we would make sure that, when she adds them back, we
will go back to 10 vs. 10, instead than, say, 6 vs 14 or something like
that.

Was something like this that you had in mind? And in any case, what do
you think about it?

> > --- a/xen/common/sched/cpupool.c
> > +++ b/xen/common/sched/cpupool.c
> > @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
> >   
> >   static DEFINE_SPINLOCK(cpupool_lock);
> >   
> > -static enum sched_gran __read_mostly opt_sched_granularity =
> > SCHED_GRAN_cpu;
> > +enum sched_gran __read_mostly opt_sched_granularity =
> > SCHED_GRAN_cpu;
> 
> Please don't use the global option value, but the per-cpupool one.
> 
Yep, you're right. Will do.

> > +/* Additional checks, to avoid separating siblings in different
> > runqueues. */
> > +static bool
> > +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd,
> > unsigned int cpu)
> > +{
> > +    unsigned int nr_sibl =
> > cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
> 
> Shouldn't you mask away siblings not in the cpupool?
> 
So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues
and CPU 0 is in Pool-1, when I add CPU 1 --which is CPU 0's sibling--
to Pool-0, I always want to make sure that there is room for both CPUs
0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if
CPU 1 is currently in another pool.

This way if, in future, CPU 1 is removed from Pool-1 and added to
Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I
don't consider CPUs which currently are in another pool, we risk that
when/if they're added to this very pool, they'll end up in a different
runqueue.

And we don't want that.

Makes sense?

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-30 12:28     ` Dario Faggioli
@ 2020-04-30 12:52       ` Jürgen Groß
  2020-04-30 14:01         ` Dario Faggioli
  2020-05-26 21:18         ` Dario Faggioli
  0 siblings, 2 replies; 21+ messages in thread
From: Jürgen Groß @ 2020-04-30 12:52 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Andrew Cooper, George Dunlap, Jan Beulich

On 30.04.20 14:28, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
>> On 29.04.20 19:36, Dario Faggioli wrote:
>>>
>>> Therefore, let's set a limit to the max number of CPUs that can
>>> share a
>>> Credit2 runqueue. The actual value is configurable (at boot time),
>>> the
>>> default being 16. If, for instance,  there are more than 16 CPUs in
>>> a
>>> socket, they'll be split among two (or more) runqueues.
>>
>> Did you think about balancing the runqueues regarding the number of
>> cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in
>> each
>> runqueue? I know this will need more logic as cpus are added one by
>> one,
>> but the result would be much better IMO.
>>
> I know. Point is, CPUs not only are added one by one, but they can,
> once the system is running, be offlined/onlined or moved among
> cpupools.
> 
> Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at
> boot, if the admin removes 4 CPUs that happened to be all in the same
> runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd
> indeed need full runqueue online rebalancing logic, which will probably
> end up being quite complex and I'm not sure it's worth it.
> 
> That being said, I can try to make things a bit more fair, when CPUs
> come up and are added to the pool. Something around the line of adding
> them to the runqueue with the least number of CPUs in it (among the
> suitable ones, of course).
> 
> With that, when the user removes 4 CPUs, we will have the 6 vs 10
> situation. But we would make sure that, when she adds them back, we
> will go back to 10 vs. 10, instead than, say, 6 vs 14 or something like
> that.
> 
> Was something like this that you had in mind? And in any case, what do
> you think about it?

Yes, this would be better already.

> 
>>> --- a/xen/common/sched/cpupool.c
>>> +++ b/xen/common/sched/cpupool.c
>>> @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
>>>    
>>>    static DEFINE_SPINLOCK(cpupool_lock);
>>>    
>>> -static enum sched_gran __read_mostly opt_sched_granularity =
>>> SCHED_GRAN_cpu;
>>> +enum sched_gran __read_mostly opt_sched_granularity =
>>> SCHED_GRAN_cpu;
>>
>> Please don't use the global option value, but the per-cpupool one.
>>
> Yep, you're right. Will do.
> 
>>> +/* Additional checks, to avoid separating siblings in different
>>> runqueues. */
>>> +static bool
>>> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd,
>>> unsigned int cpu)
>>> +{
>>> +    unsigned int nr_sibl =
>>> cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
>>
>> Shouldn't you mask away siblings not in the cpupool?
>>
> So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues
> and CPU 0 is in Pool-1, when I add CPU 1 --which is CPU 0's sibling--
> to Pool-0, I always want to make sure that there is room for both CPUs
> 0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if
> CPU 1 is currently in another pool.
> 
> This way if, in future, CPU 1 is removed from Pool-1 and added to
> Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I
> don't consider CPUs which currently are in another pool, we risk that
> when/if they're added to this very pool, they'll end up in a different
> runqueue.
> 
> And we don't want that.
> 
> Makes sense?

Yes.

You should add a comment in this regard.

And you should either reject the case of less cpus per queue than
siblings per core, or you should handle this situation. Otherwise you
won't ever find a suitable run-queue. :-)


Juergen



* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-30 12:52       ` Jürgen Groß
@ 2020-04-30 14:01         ` Dario Faggioli
  2020-05-26 21:18         ` Dario Faggioli
  1 sibling, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-04-30 14:01 UTC (permalink / raw)
  To: Jürgen Groß, xen-devel
  Cc: Andrew Cooper, George Dunlap, Jan Beulich


On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
> On 30.04.20 14:28, Dario Faggioli wrote:
> > On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
> > > 
> > With that, when the user removes 4 CPUs, we will have the 6 vs 10
> > situation. But we would make sure that, when she adds them back, we
> > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something
> > like
> > that.
> > 
> > Was something like this that you had in mind? And in any case, what
> > do
> > you think about it?
> 
> Yes, this would be better already.
> 
Right, I'll give a try at this, and let's see how it ends up looking
like.

> > This way if, in future, CPU 1 is removed from Pool-1 and added to
> > Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If
> > I
> > don't consider CPUs which currently are in another pool, we risk
> > that
> > when/if they're added to this very pool, they'll end up in a
> > different
> > runqueue.
> > 
> > And we don't want that.
> > 
> Yes.
> 
Cool. :-)

> You should add a comment in this regard.
> 
Sure.

> And you should either reject the case of less cpus per queue than
> siblings per core, or you should handle this situation. Otherwise you
> won't ever find a suitable run-queue. :-)
> 
Right, and in fact I had a check for that, rejecting a
max-cpus-per-runqueue value smaller than the number of siblings, in the
place where we also validate the other scheduling parameters. But
indeed it's not there now, so I guess I removed it by mistake before
sending.

And now that I double check, I also notice that I haven't documented
the new parameters yet.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-30 12:52       ` Jürgen Groß
  2020-04-30 14:01         ` Dario Faggioli
@ 2020-05-26 21:18         ` Dario Faggioli
  2020-05-27  6:22           ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Dario Faggioli @ 2020-05-26 21:18 UTC (permalink / raw)
  To: Jürgen Groß, xen-devel
  Cc: Andrew Cooper, George Dunlap, Jan Beulich, paul


On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
> On 30.04.20 14:28, Dario Faggioli wrote:
> > That being said, I can try to make things a bit more fair, when
> > CPUs
> > come up and are added to the pool. Something around the line of
> > adding
> > them to the runqueue with the least number of CPUs in it (among the
> > suitable ones, of course).
> > 
> > With that, when the user removes 4 CPUs, we will have the 6 vs 10
> > situation. But we would make sure that, when she adds them back, we
> > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something
> > like
> > that.
> > 
> > Was something like this that you had in mind? And in any case, what
> > do
> > you think about it?
> 
> Yes, this would be better already.
> 
So, a couple of thoughts. Doing something like what I tried to describe
above is not too bad, and I have it pretty much ready.

With that, on an Intel system with 96 CPUs on two sockets, and
max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2
giant runqueues with 48 CPUs in each:

 - 96 CPUs online, split in 6 runqueues (3 for each socket), with 16
   CPUs in each of them

I can also "tweak" it in such a way that, if one for instance boots
with "smt=no", we get to a point where we have:

 - 48 CPUs online, split in 6 runqueues, with 8 CPUs in each

Now, I think this is good, and actually better than the current
situation where on such a system, we only have two very big runqueues
(and let me repeat that introducing a per-LLC runqueue arrangement, on
which I'm also working, won't help in this case, as NUMA node == LLC).

So, problem is that if one starts to fiddle with cpupools and cpu on
and offlining, things can get pretty unbalanced. E.g., I can end up
with 2 runqueues on a socket, one with 16 CPUs and the other with just
a couple of them.

Now, this is all possible as of now (I mean, without this patch)
already, although at a different level. In fact, I can very well remove
all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue
with a lot of CPUs and another with just one.

It looks like we need a way to rebalance the runqueues, which should be
doable... But despite having spent a couple of days trying to come up
with something decent, that I could include in v2 of this series, I
couldn't get it to work sensibly.

So, this looks to me like an improvement, one that would need to be
improved further, but I'm not sure we have the time for it right now
(Cc-ing Paul). Should we still go for what's ready? I think yes, but
I'd be interested in opinions...

Also, if anyone has any clever ideas on how to implement a mechanism
that rebalances the CPUs among the runqueues, I'm all ears and am up
for trying to implement it. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-04-30  6:45   ` Jan Beulich
@ 2020-05-26 22:00     ` Dario Faggioli
  2020-05-27  4:26       ` Jürgen Groß
  2020-05-27  6:17       ` Jan Beulich
  0 siblings, 2 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-26 22:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper


Hey,

thanks for the review, and sorry for replying late... I was busy with
something and then was trying to implement a better balancing logic, as
discussed with Juergen, but with only partial success...

On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
> On 29.04.2020 19:36, Dario Faggioli wrote:
> > @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct 
> > [...]
> > +        ASSERT(rcpu != cpu);
> > +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> > +        {
> > +            /*
> > +             * For each CPU already in the runqueue, account for
> > it and for
> > +             * its sibling(s), independently from whether such
> > sibling(s) are
> > +             * in the runqueue already or not.
> > +             *
> > +             * Of course, if there are sibling CPUs in the
> > runqueue already,
> > +             * only count them once.
> > +             */
> > +            cpumask_or(cpumask_scratch_cpu(cpu),
> > cpumask_scratch_cpu(cpu),
> > +                       per_cpu(cpu_sibling_mask, rcpu));
> > +            nr_smts += nr_sibl;
> 
> This being common code, is it appropriate to assume all CPUs having
> the same number of siblings? 
>
You mention common code because you are thinking of differences between
x86 and ARM? In ARM --although there might be (I'm not sure)-- chips
that have SMT, or that we may want to identify and treat as if it were
SMT, we currently have no support for that, so I don't think it is a
problem.

On x86, I'm not sure I am aware of cases where the number of threads is
different among cores or sockets... are there any?

Besides, we have some SMT specific code around (especially in
scheduling) already.

> Even beyond that, iirc the sibling mask
> represents the online or parked siblings, but not offline ones. For
> the purpose here, don't you rather care about the full set?
> 
This is actually a good point. I indeed care about the number of
siblings a thread has, in general, not only about the ones that are
currently online.

In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
in a helper that just returns 1 for ARM. What do you think, is this
better?
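
Something like this is what I have in mind (just a sketch; the name of
the helper is made up, and its exact placement is still to be decided):

    /* How many hyperthread siblings a CPU can have, online or not. */
    static inline unsigned int cpu_max_smt_siblings(void)
    {
    #ifdef CONFIG_X86
        return boot_cpu_data.x86_num_siblings;
    #else
        return 1;
    #endif
    }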

> What about HT vs AMD Fam15's CUs? Do you want both to be treated
> the same here?
> 
Are you referring to the cores that, AFAIUI, share the L1i cache? If
yes, I thought about it, and ended up _not_ dealing with them here, but
I'm still a bit unsure.

Cache oriented runqueue organization will be the subject of another
patch series, and that's why I kept them out. However, that's a rather
special case with a lot in common to SMT... Just in case, is there a
way to identify them easily, like with a mask or something, in the code
already?

> Also could you outline the intentions with this logic in the
> description, to be able to match the goal with what gets done?
> 
Sure, I will try state it more clearly.

> > @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
> > *prv, unsigned int cpu)
> >          rqd->pick_bias = cpu;
> >          rqd->id = rqi;
> >      }
> > +    else
> > +        rqd = rqd_valid;
> > +
> > +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
> > runqueue %d with {%*pbl}\n",
> > +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
> > >id,
> > +           CPUMASK_PR(&rqd->active));
> 
> Iirc there's one per-CPU printk() already. On large systems this
> isn't
> very nice, so I'd like to ask that their total number at least not
> get
> further grown. Ideally there would be a less verbose summary after
> all
> CPUs have been brought up at boot, with per-CPU info be logged only
> during CPU hot online.
> 
Understood. Problem is that, here in the scheduling code, I don't see
an easy way to tell when we have finished bringing up CPUs... And it's
probably not worth looking too hard (even less adding logic) only for
the sake of printing this message.

So I think I will demote this printk as a XENLOG_DEBUG one (and will
also get rid of others that were already DEBUG, but not super useful,
after some more thinking).

The idea is that, after all, exactly on which runqueue a CPU ends up is
not information that should matter much to a user/administrator.
For now, it will be possible to know where they ended up via debug
keys. And even if we feel like making it more visible, that is better
achieved via some toolstack command that queries and prints the
configuration of the scheduler, rather than a line like this in the
boot log.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)







* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-26 22:00     ` Dario Faggioli
@ 2020-05-27  4:26       ` Jürgen Groß
  2020-05-28  9:32         ` Dario Faggioli
  2020-05-27  6:17       ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Jürgen Groß @ 2020-05-27  4:26 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich; +Cc: xen-devel, George Dunlap, Andrew Cooper

On 27.05.20 00:00, Dario Faggioli wrote:
> Hey,
> 
> thanks for the review, and sorry for replying late... I was busy with
> something and then was trying to implement a better balancing logic, as
> discussed with Juergen, but with only partial success...
> 
> On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
>> On 29.04.2020 19:36, Dario Faggioli wrote:
>>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct
>>> [...]
>>> +        ASSERT(rcpu != cpu);
>>> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
>>> +        {
>>> +            /*
>>> +             * For each CPU already in the runqueue, account for
>>> it and for
>>> +             * its sibling(s), independently from whether such
>>> sibling(s) are
>>> +             * in the runqueue already or not.
>>> +             *
>>> +             * Of course, if there are sibling CPUs in the
>>> runqueue already,
>>> +             * only count them once.
>>> +             */
>>> +            cpumask_or(cpumask_scratch_cpu(cpu),
>>> cpumask_scratch_cpu(cpu),
>>> +                       per_cpu(cpu_sibling_mask, rcpu));
>>> +            nr_smts += nr_sibl;
>>
>> This being common code, is it appropriate to assume all CPUs having
>> the same number of siblings?
>>
> You mention common code because you are thinking of differences between
> x86 and ARM? In ARM --althought there might be (I'm not sure)-- chips
> that have SMT, or that we may want to identify and treat like if it was
> SMT, we currently have no support for that, so I don't think it is a
> problem.
> 
> On x86, I'm not sure I am aware of cases where the number of threads is
> different among cores or sockets... are there any?
> 
> Besides, we have some SMT specific code around (especially in
> scheduling) already.
> 
>> Even beyond that, iirc the sibling mask
>> represents the online or parked siblings, but not offline ones. For
>> the purpose here, don't you rather care about the full set?
>>
> This is actually a good point. I indeed care about the number of
> siblings a thread has, in general, not only about the ones that are
> currently online.
> 
> In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
> in an helper that just returns 1 for ARM. What do you think, is this
> better?
> 
>> What about HT vs AMD Fam15's CUs? Do you want both to be treated
>> the same here?
>>
> Are you referring to the cores that, AFAIUI, share the L1i cache? If
> yes, I thought about it, and ended up _not_ dealing with them here, but
> I'm still a bit unsure.
> 
> Cache oriented runqueue organization will be the subject of another
> patch series, and that's why I kept them out. However, that's a rather
> special case with a lot in common to SMT... Just in case, is there a
> way to identify them easily, like with a mask or something, in the code
> already?
> 
>> Also could you outline the intentions with this logic in the
>> description, to be able to match the goal with what gets done?
>>
> Sure, I will try state it more clearly.
> 
>>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
>>> *prv, unsigned int cpu)
>>>           rqd->pick_bias = cpu;
>>>           rqd->id = rqi;
>>>       }
>>> +    else
>>> +        rqd = rqd_valid;
>>> +
>>> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
>>> runqueue %d with {%*pbl}\n",
>>> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
>>>> id,
>>> +           CPUMASK_PR(&rqd->active));
>>
>> Iirc there's one per-CPU printk() already. On large systems this
>> isn't
>> very nice, so I'd like to ask that their total number at least not
>> get
>> further grown. Ideally there would be a less verbose summary after
>> all
>> CPUs have been brought up at boot, with per-CPU info be logged only
>> during CPU hot online.
>>
> Understood. Problem is that, here in the scheduling code, I don't see
> an easy way to tell when we have finished bringing up CPUs... And it's
> probably not worth looking too hard (even less adding logic) only for
> the sake of printing this message.

cpupool_init() is the perfect place for that.


Juergen



* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-26 22:00     ` Dario Faggioli
  2020-05-27  4:26       ` Jürgen Groß
@ 2020-05-27  6:17       ` Jan Beulich
  2020-05-28 14:55         ` Dario Faggioli
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-05-27  6:17 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

On 27.05.2020 00:00, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
>> On 29.04.2020 19:36, Dario Faggioli wrote:
>>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct 
>>> [...]
>>> +        ASSERT(rcpu != cpu);
>>> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
>>> +        {
>>> +            /*
>>> +             * For each CPU already in the runqueue, account for
>>> it and for
>>> +             * its sibling(s), independently from whether such
>>> sibling(s) are
>>> +             * in the runqueue already or not.
>>> +             *
>>> +             * Of course, if there are sibling CPUs in the
>>> runqueue already,
>>> +             * only count them once.
>>> +             */
>>> +            cpumask_or(cpumask_scratch_cpu(cpu),
>>> cpumask_scratch_cpu(cpu),
>>> +                       per_cpu(cpu_sibling_mask, rcpu));
>>> +            nr_smts += nr_sibl;
>>
>> This being common code, is it appropriate to assume all CPUs having
>> the same number of siblings? 
>>
> You mention common code because you are thinking of differences between
> x86 and ARM? In ARM --althought there might be (I'm not sure)-- chips
> that have SMT, or that we may want to identify and treat like if it was
> SMT, we currently have no support for that, so I don't think it is a
> problem.
> 
> On x86, I'm not sure I am aware of cases where the number of threads is
> different among cores or sockets... are there any?

I'm not aware of any either, but in common code it should also
matter what might be, not just what there is. Unless you wrap
things in e.g. CONFIG_X86.

>> Even beyond that, iirc the sibling mask
>> represents the online or parked siblings, but not offline ones. For
>> the purpose here, don't you rather care about the full set?
>>
> This is actually a good point. I indeed care about the number of
> siblings a thread has, in general, not only about the ones that are
> currently online.
> 
> In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
> in an helper that just returns 1 for ARM. What do you think, is this
> better?

As per above, cpu_data[rcpu] then please.

>> What about HT vs AMD Fam15's CUs? Do you want both to be treated
>> the same here?
>>
> Are you referring to the cores that, AFAIUI, share the L1i cache? If
> yes, I thought about it, and ended up _not_ dealing with them here, but
> I'm still a bit unsure.
> 
> Cache oriented runqueue organization will be the subject of another
> patch series, and that's why I kept them out. However, that's a rather
> special case with a lot in common to SMT...

I didn't think of cache sharing in particular, but about the
concept of compute units vs hyperthreads in general.

> Just in case, is there a
> way to identify them easily, like with a mask or something, in the code
> already?

cpu_sibling_mask still gets used for both, so there's no mask
to use. As per set_cpu_sibling_map() you can look at
cpu_data[].compute_unit_id to tell, but that's of course x86-
specific (as is the entire compute unit concept).

>>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
>>> *prv, unsigned int cpu)
>>>          rqd->pick_bias = cpu;
>>>          rqd->id = rqi;
>>>      }
>>> +    else
>>> +        rqd = rqd_valid;
>>> +
>>> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
>>> runqueue %d with {%*pbl}\n",
>>> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
>>>> id,
>>> +           CPUMASK_PR(&rqd->active));
>>
>> Iirc there's one per-CPU printk() already. On large systems this
>> isn't
>> very nice, so I'd like to ask that their total number at least not
>> get
>> further grown. Ideally there would be a less verbose summary after
>> all
>> CPUs have been brought up at boot, with per-CPU info be logged only
>> during CPU hot online.
>>
> Understood. Problem is that, here in the scheduling code, I don't see
> an easy way to tell when we have finished bringing up CPUs... And it's
> probably not worth looking too hard (even less adding logic) only for
> the sake of printing this message.
> 
> So I think I will demote this printk as a XENLOG_DEBUG one (and will
> also get rid of others that were already DEBUG, but not super useful,
> after some more thinking).

Having seen Jürgen's reply as well as what you further wrote below,
I'd still like to point out that even XENLOG_DEBUG isn't quiet
enough: There may be some value to such a debugging message to you,
but it may be mainly spam to e.g. me, who still has a need to run
with loglvl=all in the common case. Let's not forget, the context
in which the underlying topic came up was pretty-many-core AMD CPUs.

We had this issue elsewhere as well, and as CPU counts grew we
started hiding such messages behind a command line option, besides
a log level qualification. Similarly we hide e.g. details of the
IOMMU arrangements behind a command line option.

> The idea is that, after all, exactly on which runqueue a CPU ends up is
> not an information that should matter much from an user/administrator.
> For now, it will be possible to know where they ended up via debug
> keys. And even if we feel like making it more visible, that is better
> achieved via some toolstack command that queries and prints the
> configuration of the scheduler, rather than a line like this in the
> boot log.

Good.

Jan



* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-26 21:18         ` Dario Faggioli
@ 2020-05-27  6:22           ` Jan Beulich
  2020-05-28  9:36             ` Dario Faggioli
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-05-27  6:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Jürgen Groß, xen-devel, paul, George Dunlap, Andrew Cooper

On 26.05.2020 23:18, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
>> On 30.04.20 14:28, Dario Faggioli wrote:
>>> That being said, I can try to make things a bit more fair, when
>>> CPUs come up and are added to the pool. Something along the lines
>>> of adding them to the runqueue with the fewest CPUs in it (among
>>> the suitable ones, of course).
>>>
>>> With that, when the user removes 4 CPUs, we will have the 6 vs 10
>>> situation. But we would make sure that, when she adds them back, we
>>> will go back to 10 vs 10, rather than, say, 6 vs 14 or something
>>> like that.
>>>
>>> Was it something like this that you had in mind? And in any case,
>>> what do you think about it?
>>
>> Yes, this would be better already.
>>
> So, a couple of thoughts. Doing something like what I tried to describe
> above is not too bad, and I have it pretty much ready.
> 
> With that, on an Intel system with 96 CPUs on two sockets, and
> max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2
> giant runqueues with 48 CPUs in each:
> 
>  - 96 CPUs online, split into 6 runqueues (3 for each socket) with 16
>    CPUs in each of them
> 
> I can also "tweak" it in such a way that, if one for instance boots
> with "smt=no", we get to a point where we have:
> 
>  - 48 CPUs online, split in 6 runqueues, with 8 CPUs in each
> 
> Now, I think this is good, and actually better than the current
> situation where on such a system, we only have two very big runqueues
> (and let me repeat that introducing a per-LLC runqueue arrangement, on
> which I'm also working, won't help in this case, as NUMA node == LLC).
> 
> So, the problem is that if one starts to fiddle with cpupools and CPU
> on- and offlining, things can get pretty unbalanced. E.g., I can end up
> with 2 runqueues on a socket, one with 16 CPUs and the other with just
> a couple of them.
> 
> Now, this is all possible as of now (I mean, without this patch)
> already, although at a different level. In fact, I can very well remove
> all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue
> with a lot of CPUs and another with just one.
> 
> It looks like we need a way to rebalance the runqueues, which should be
> doable... But despite having spent a couple of days trying to come up
> with something decent that I could include in v2 of this series, I
> couldn't get it to work sensibly.
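
For illustration, the "least populated suitable runqueue" policy
described above boils down to something like the following toy model
(all names and numbers are made up for the example; this is not the
actual patch):

/* Toy model: a CPU (re)joining the pool goes to the least populated
 * suitable runqueue, subject to a per-runqueue cap. */
#include <stdio.h>

#define NR_RUNQS  2
#define CAP       16

static unsigned int runq_ncpus[NR_RUNQS];

/* Pick the suitable runqueue with the fewest CPUs, or -1 if all full. */
static int pick_runq(void)
{
    int i, best = -1;

    for ( i = 0; i < NR_RUNQS; i++ )
        if ( runq_ncpus[i] < CAP &&
             (best < 0 || runq_ncpus[i] < runq_ncpus[best]) )
            best = i;

    return best;
}

int main(void)
{
    int n, rq;

    /* Start from the unbalanced "6 vs 10" situation mentioned above. */
    runq_ncpus[0] = 6;
    runq_ncpus[1] = 10;

    /* Re-add the 4 removed CPUs: they all end up in runqueue 0. */
    for ( n = 0; n < 4; n++ )
    {
        rq = pick_runq();
        if ( rq >= 0 )
            runq_ncpus[rq]++;
    }

    /* Prints "runqueue 0: 10 CPUs, runqueue 1: 10 CPUs". */
    printf("runqueue 0: %u CPUs, runqueue 1: %u CPUs\n",
           runq_ncpus[0], runq_ncpus[1]);

    return 0;
}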

CPU on-/offlining may not need considering here at all imo. I think it
would be quite reasonable as a first step to have a static model where
from system topology (and perhaps a command line option) one can
predict which runqueue a particular CPU will end up in, no matter when
it gets brought online.
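
Roughly speaking, such a static model could be little more than a pure
function of the topology and the per-runqueue cap, along the lines of
the sketch below (names and the exact splitting rule are only an
illustration, not the actual implementation):

/* Standalone model: map a CPU to a runqueue from topology and cap only,
 * so the answer never depends on the order CPUs are brought online. */
#include <stdio.h>

static unsigned int runq_of_cpu(unsigned int cpu,
                                unsigned int cpus_per_socket,
                                unsigned int max_cpus_per_runqueue)
{
    unsigned int socket = cpu / cpus_per_socket;
    unsigned int idx = cpu % cpus_per_socket;
    /* How many runqueues each socket is split into, given the cap. */
    unsigned int runqs_per_socket =
        (cpus_per_socket + max_cpus_per_runqueue - 1) / max_cpus_per_runqueue;
    /* Spread the socket's CPUs evenly over those runqueues. */
    unsigned int cpus_per_runq =
        (cpus_per_socket + runqs_per_socket - 1) / runqs_per_socket;

    return socket * runqs_per_socket + idx / cpus_per_runq;
}

int main(void)
{
    unsigned int cpu;

    /* 2 sockets x 48 CPUs, cap 16: CPUs map to runqueues 0-5. */
    for ( cpu = 0; cpu < 96; cpu += 16 )
        printf("CPU %u -> runqueue %u\n", cpu, runq_of_cpu(cpu, 48, 16));

    return 0;
}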

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-27  4:26       ` Jürgen Groß
@ 2020-05-28  9:32         ` Dario Faggioli
  0 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-28  9:32 UTC (permalink / raw)
  To: Jürgen Groß, Jan Beulich
  Cc: xen-devel, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

On Wed, 2020-05-27 at 06:26 +0200, Jürgen Groß wrote:
> On 27.05.20 00:00, Dario Faggioli wrote:
> > 
> > Understood. Problem is that, here in the scheduling code, I don't
> > see
> > an easy way to tell when we have finished bringing up CPUs... And
> > it's
> > probably not worth looking too hard (even less adding logic) only
> > for
> > the sake of printing this message.
> 
> cpupool_init() is the perfect place for that.
> 
Yes, at least for boot time, it is indeed, so I'll go for it.

OTOH, when for instance one creates a new cpupool (or adds a bunch of
CPUs to one, with `xl cpupool-add Pool-X 7,3-14-22,18`), CPUs are added
one by one, and in Xen we don't really know which one will be the last,
so we can't print a summary at the end.

But yeah, I'll go for cpupool_init() and get rid of the rest, for now.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-27  6:22           ` Jan Beulich
@ 2020-05-28  9:36             ` Dario Faggioli
  0 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-28  9:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jürgen Groß, xen-devel, paul, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1348 bytes --]

On Wed, 2020-05-27 at 08:22 +0200, Jan Beulich wrote:
> On 26.05.2020 23:18, Dario Faggioli wrote:
> > 
> > It looks like we need a way to rebalance the runqueues, which
> > should be
> > doable... But despite having spent a couple of days trying to come
> > up
> > with something decent, that I could include in v2 of this series, I
> > couldn't get it to work sensibly.
> 
> CPU on-/offlining may not need considering here at all imo. I think
> it
> would be quite reasonable as a first step to have a static model
> where
> from system topology (and perhaps a command line option) one can
> predict which runqueue a particular CPU will end up in, no matter
> when
> it gets brought online.
> 
Right.

IAC, just FYI, after talking to Juergen --who suggested a nice solution
to overcome the problem I was stuck on-- I now have a patch that
successfully implements dynamic online rebalancing of runqueues.

I'm polishing up the comments and changelog and will send it ASAP, as
I'd really like for this series to go in... :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-27  6:17       ` Jan Beulich
@ 2020-05-28 14:55         ` Dario Faggioli
  2020-05-29  9:58           ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Dario Faggioli @ 2020-05-28 14:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 2814 bytes --]

On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote:
> On 27.05.2020 00:00, Dario Faggioli wrote:
> > 
> > Cache oriented runqueue organization will be the subject of another
> > patch series, and that's why I kept them out. However, that's a
> > rather
> > special case with a lot in common to SMT...
> 
> I didn't think of cache sharing in particular, but about the
> concept of compute units vs hyperthreads in general.
> 
Ok.

> > Just in case, is there a
> > way to identify them easily, like with a mask or something, in the
> > code
> > already?
> 
> cpu_sibling_mask still gets used for both, so there's no mask
> to use. As per set_cpu_sibling_map() you can look at
> cpu_data[].compute_unit_id to tell, but that's of course x86-
> specific (as is the entire compute unit concept).
> 
Right. And thanks for the pointers.

But then, what I understand from having a look there is that I can
indeed use (again, appropriately wrapped) x86_num_siblings to tell,
in this function, whether a CPU has any HT (Intel) or CU (AMD)
siblings and, if so, how many in total, although some of them may
currently be offline.

Which means I will be treating HTs and CUs the same, which, thinking
more about it (and thinking specifically of CUs, rather than of any
cache sharing relationship), does make sense for this feature.

Does this make sense, or am I missing or misinterpreting anything?
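
Just to make that concrete, the "appropriately wrapped" helper I have
in mind would be as small as the sketch below (name and exact shape are
still to be decided, so take it as purely illustrative):

/*
 * Total number of hardware siblings of @cpu: hyperthreads on Intel,
 * compute-unit cores on AMD Fam15h, counting offline siblings too.
 */
static unsigned int cpu_max_siblings(unsigned int cpu)
{
#ifdef CONFIG_X86
    return cpu_data[cpu].x86_num_siblings;
#else
    return 1; /* no HT/CU information wrapped for other architectures */
#endif
}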

> > So I think I will demote this printk to a XENLOG_DEBUG one (and
> > will also get rid of others that were already DEBUG, but not super
> > useful, after some more thinking).
> 
> Having seen Jürgen's reply as well as what you further wrote below,
> I'd still like to point out that even XENLOG_DEBUG isn't quiet
> enough: There may be some value to such a debugging message to you,
> but it may be mainly spam to e.g. me, who still needs to run with
> loglvl=all in the common case. Let's not forget, the context in
> which the underlying topic came up was pretty-many-core AMD CPUs.
> 
Good point indeed about DEBUG potentially being an issue as well. So
yes, as announced in my reply to Juergen, I was going with the recap in
cpupool_init().

However, that looks like it requires a new hook in the scheduler's
interface, as the information is scheduler specific but at the same
time I don't think we want to have the exact same information from
either dump_settings() or dump_cpu_state(), which is all we have... :-/

I'll see about it.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-28 14:55         ` Dario Faggioli
@ 2020-05-29  9:58           ` Jan Beulich
  2020-05-29 10:19             ` Dario Faggioli
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2020-05-29  9:58 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

On 28.05.2020 16:55, Dario Faggioli wrote:
> On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote:
>> On 27.05.2020 00:00, Dario Faggioli wrote:
>>> Just in case, is there a
>>> way to identify them easily, like with a mask or something, in the
>>> code
>>> already?
>>
>> cpu_sibling_mask still gets used for both, so there's no mask
>> to use. As per set_cpu_sibling_map() you can look at
>> cpu_data[].compute_unit_id to tell, but that's of course x86-
>> specific (as is the entire compute unit concept).
>>
> Right. And thanks for the pointers.
> 
> But then, what I understand from having a look there is that I can
> indeed use (again, appropriately wrapped) x86_num_siblings to tell,
> in this function, whether a CPU has any HT (Intel) or CU (AMD)
> siblings and, if so, how many in total, although some of them may
> currently be offline.
> 
> Which means I will be treating HTs and CUs the same, which, thinking
> more about it (and thinking specifically of CUs, rather than of any
> cache sharing relationship), does make sense for this feature.
> 
> Does this make sense, or am I missing or misinterpreting anything?

Well, it effectively answers the question I had raised: "What about HT
vs AMD Fam15's CUs? Do you want both to be treated the same here?"

Jan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
  2020-05-29  9:58           ` Jan Beulich
@ 2020-05-29 10:19             ` Dario Faggioli
  0 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 10:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1150 bytes --]

On Fri, 2020-05-29 at 11:58 +0200, Jan Beulich wrote:
> On 28.05.2020 16:55, Dario Faggioli wrote:
> > 
> > Which means I will be treating HTs and CUs the same, which, thinking
> > more about it (and thinking specifically of CUs, rather than of any
> > cache sharing relationship), does make sense for this feature.
> > 
> > Does this make sense, or am I missing or misinterpreting anything?
> 
> Well, it effectively answers the question I had raised: "What about
> HT
> vs AMD Fam15's CUs? Do you want both to be treated the same here?"
> 
Exactly! :-)

And, in fact, the answer is "Yes, I want them to be treated the same,
and the patches are actually doing so".

Sorry for not getting it right away, and just replying like that in the
first place.

At least, this led to better (more generic) comments and variable names
in v2 of this patch.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
  2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
@ 2020-05-29 11:46 ` Dario Faggioli
  2020-05-29 15:06   ` Dario Faggioli
  2020-05-29 16:15   ` George Dunlap
  2 siblings, 2 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 11:46 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich, Paul Durrant

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

So,

I felt like providing some additional thoughts about this series, from
a release point of view (adding Paul).

Timing is *beyond tight* so if this series, entirely or partly, has any
chance to go in, it would be through some form of exception, which of
course comes with some risks, etc.

I did work hard to submit the full series, because I wanted people to
be able to see the complete solution. However, I think the series
itself can be logically split in two parts.

Basically, if we just consider patches 1 and 4 we will end up, right
after boot, with a system that has smaller runqueues. They will most
likely be balanced in terms of how many CPUs each one has, so a good
setup. This will likely (actual differences seem to depend *quite a
bit* on the actual workload) be an improvement for very large systems.

This is a path that will get a decent amount of testing in OSSTests, from
now until the day of the release, I think, because booting with the
default CPU configuration and setup is what most (all?) OSSTests' jobs
do.

If the user starts to create pools, we can get to a point where the
different runqueues are unbalanced, i.e., each one has a different
number of CPUs in it than the others. This, however:
 * can happen already, as of today, without these patches. Whether these
   patches make things "better" or "worse", from this point of view,
   is impossible to tell, because it actually depends on which CPUs
   the user moves among pools or puts offline, etc.
 * means that the scheduler has to deal with unbalanced runqueues
  anyway, and if it doesn't, it's a bug and, again, it seems to me 
  that these patches don't make things particularly better or worse.

So, again, for patches 1 and 4 alone, it looks to me that we'd get
improvements, at least in some cases; the codepath will get some
testing; and they do not introduce issues additional to, or different
from, what we already have right now.

They are also at their second iteration, as the original patch series
consisted of exactly those two patches.

So, I think it would be interesting if these two patches were given a
chance, even just to get some reviews... And I would be fine going
through the formal process necessary to make that happen myself.

Then, there's the rest, the runqueue rebalancing thing. As said above,
I personally would love for us to release with it, but I see one rather
big issue. In fact, such a mechanism is triggered and stressed by the
dynamic creation and manipulation of cpupools (and CPU on/off-lining),
and we don't have an OSSTests test for that. Therefore, we are not in
the best position to catch issues it may have introduced.

I can commit to doing some testing myself, but it's not the same thing
as having them in our CI, I know that very well. So, I'd be interested in
hearing what others think about these other patches as well, and I am
happy to do my best to make sure that they are working fine, if we
decide to try to include them too, but I do see this as much more of a
risk myself.

So, any thoughts? :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
@ 2020-05-29 15:06   ` Dario Faggioli
  2020-05-29 16:15   ` George Dunlap
  1 sibling, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 15:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich, Paul Durrant

[-- Attachment #1: Type: text/plain, Size: 5283 bytes --]

On Fri, 2020-05-29 at 13:46 +0200, Dario Faggioli wrote:
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues.
>
Actually, to be fully precise, given how I reorganized the series, it's
not patches 1 and 4, it's patches 1, 3 and 4.

Hopefully, that is not a big deal, but if patch 3 really is a problem,
I can re-arrange patch 4 to work without it.

Apart from this, and to add more information, on a system with 96
CPUs in 2 sockets, this is how the runqueues look (with these
patches):

(XEN) Online Cpus: 0-95
(XEN) Cpupool 0:
(XEN) Cpus: 0-95
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 6
(XEN)   default-weight     = 256
(XEN) Runqueue 0:
(XEN)   ncpus              = 16
(XEN)   cpus               = 0-15
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 0
(XEN)   instload           = 0
(XEN)   aveload            = 1223 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN) Runqueue 1:
(XEN)   ncpus              = 16
(XEN)   cpus               = 16-31
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 16
(XEN)   instload           = 0
(XEN)   aveload            = 3324 (~1%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN) Runqueue 2:
(XEN)   ncpus              = 16
(XEN)   cpus               = 32-47
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 32
(XEN)   instload           = 1
(XEN)   aveload            = 8996 (~3%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,0000feff,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,0000fcff,00000000
(XEN) Runqueue 3:
(XEN)   ncpus              = 16
(XEN)   cpus               = 48-63
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 48
(XEN)   instload           = 0
(XEN)   aveload            = 2424 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN) Runqueue 4:
(XEN)   ncpus              = 16
(XEN)   cpus               = 64-79
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 66
(XEN)   instload           = 0
(XEN)   aveload            = 1070 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN) Runqueue 5:
(XEN)   ncpus              = 16
(XEN)   cpus               = 80-95
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 82
(XEN)   instload           = 0
(XEN)   aveload            = 425 (~0%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000

Without the patches, there would be just 2 of them (one with CPUs 0-47
and another with CPUs 48-95).

On a system with "just" 16 CPUs, in 2 sockets, they look like this:

(XEN) Online Cpus: 0-15
(XEN) Cpupool 0:
(XEN) Cpus: 0-15
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 2
(XEN) 	default-weight     = 256
(XEN) Runqueue 0:
(XEN) 	ncpus              = 8
(XEN) 	cpus               = 0-7
(XEN) 	max_weight         = 256
(XEN) 	pick_bias          = 0
(XEN) 	instload           = 0
(XEN) 	aveload            = 7077 (~2%)
(XEN) 	idlers: 00000000,000000ff
(XEN) 	tickled: 00000000,00000000
(XEN) 	fully idle cores: 00000000,000000ff
(XEN) Runqueue 1:
(XEN) 	ncpus              = 8
(XEN) 	cpus               = 8-15
(XEN) 	max_weight         = 256
(XEN) 	pick_bias          = 8
(XEN) 	instload           = 1
(XEN) 	aveload            = 11848 (~4%)
(XEN) 	idlers: 00000000,0000fe00
(XEN) 	tickled: 00000000,00000000
(XEN) 	fully idle cores: 00000000,0000fc00

There are still 2, because there are 2 sockets, and we still honor the
topology (and 8 CPUs in a runqueue is fine, because it is lower than 16).

I'll share the same output on a 256 CPU system, as soon as I finish
installing it.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2020-05-29 15:06   ` Dario Faggioli
@ 2020-05-29 16:15   ` George Dunlap
  1 sibling, 0 replies; 21+ messages in thread
From: George Dunlap @ 2020-05-29 16:15 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Juergen Gross, xen-devel, Paul Durrant, Jan Beulich, Andrew Cooper



> On May 29, 2020, at 12:46 PM, Dario Faggioli <dfaggioli@suse.com> wrote:
> 
> So,
> 
> I felt like providing some additional thoughts about this series, from
> a release point of view (adding Paul).
> 
> Timing is *beyond tight* so if this series, entirely or partly, has any
> chance to go in, it would be through some form of exception, which of
> course comes with some risks, etc.
> 
> I did work hard to submit the full series, because I wanted people to
> be able to see the complete solution. However, I think the series
> itself can be logically split in two parts.
> 
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues. They will most
> likely be balanced in terms of how many CPUs each one has, so a good
> setup. This will likely (actual differences seems to depend *quite a
> bit* on the actual workload) be an improvement for very large systems.

Fundamentally, I feel like the reason we have the feature freeze is exactly to avoid having to deal with questions like this.  Something very much like patch 4 was posted before the last posting date; patches 1-4 received R-b’s before the feature freeze.  I think they should probably go in.

The rebalancing patches I’m inclined to say should wait until they’ve had a bit more time to be thought about.

 -George


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-05-29 16:15 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
2020-04-30  6:45   ` Jan Beulich
2020-05-26 22:00     ` Dario Faggioli
2020-05-27  4:26       ` Jürgen Groß
2020-05-28  9:32         ` Dario Faggioli
2020-05-27  6:17       ` Jan Beulich
2020-05-28 14:55         ` Dario Faggioli
2020-05-29  9:58           ` Jan Beulich
2020-05-29 10:19             ` Dario Faggioli
2020-04-30  7:35   ` Jürgen Groß
2020-04-30 12:28     ` Dario Faggioli
2020-04-30 12:52       ` Jürgen Groß
2020-04-30 14:01         ` Dario Faggioli
2020-05-26 21:18         ` Dario Faggioli
2020-05-27  6:22           ` Jan Beulich
2020-05-28  9:36             ` Dario Faggioli
2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
2020-05-29 15:06   ` Dario Faggioli
2020-05-29 16:15   ` George Dunlap
