* [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue @ 2020-04-29 17:36 Dario Faggioli 2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw) To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich In Credit2, the CPUs are assigned to runqueues according to the host topology. For instance, if we want per-socket runqueues (which is the default), the CPUs that are in the same socket will end up in the same runqueue. This is generally good for scalability, at least as long as the number of CPUs that end up in the same runqueue is not too high. In fact, all these CPUs will compete for the same spinlock when making scheduling decisions and manipulating the scheduler data structures. Therefore, if there are too many of them, that can become a bottleneck. This has not been an issue so far, but architectures with 128 CPUs per socket are now available, and it is certainly not ideal to have so many CPUs in the same runqueue, competing for the same locks, etc. Let's therefore set a cap on the total number of CPUs that can share a runqueue. The value is set to 16, by default, but can be changed with a boot command line parameter. Note that this is orthogonal to, and independent from, the work of introducing new runqueue arrangement mechanisms and criteria. In fact, we very well may want to implement the capability of having, say, per-LLC runqueues, or per-die (where that is defined) runqueues. Such arrangements would work well on current systems, but nothing guarantees that, in future, there won't be an architecture with hundreds of CPUs per die or LLC... And we'll be back to where we are now. 
Therefore, even if we are actually planning to add some new runqueue organization criteria, setting a limit on the number of CPUs that will ever share a runqueue, whatever the underlying topology is, is still useful. Note also that, if the host has hyperthreading (and we are not using core-scheduling), additional care is taken to avoid splitting CPUs that are hyperthread siblings among different runqueues. --- Dario Faggioli (2): xen: credit2: factor cpu to runqueue matching in a function xen: credit2: limit the max number of CPUs in a runqueue xen/common/sched/cpupool.c | 2 - xen/common/sched/credit2.c | 130 +++++++++++++++++++++++++++++++++++++++----- xen/common/sched/private.h | 2 + 3 files changed, 119 insertions(+), 15 deletions(-) -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) ^ permalink raw reply [flat|nested] 21+ messages in thread
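To make the effect of the cap concrete, here is a minimal standalone model (plain C, not Xen code; the function name and the assumption that CPUs of a topology group are filled greedily into runqueues are mine): a group of n CPUs with a per-runqueue cap ends up split across ceil(n / cap) runqueues.

```c
#include <assert.h>

/*
 * Standalone model of the cap described above (not Xen code; the
 * name nr_runqueues() is hypothetical).  CPUs of one topology group
 * (e.g. a socket) are assigned to runqueues greedily, so once the
 * cap is hit a new runqueue is created: n CPUs with a per-runqueue
 * cap end up spread across ceil(n / cap) runqueues.
 */
static unsigned int nr_runqueues(unsigned int nr_cpus, unsigned int cap)
{
    return (nr_cpus + cap - 1) / cap;
}
```

With the default cap of 16, a 128-CPU EPYC socket would thus get 8 runqueues instead of a single one.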
* [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function 2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli @ 2020-04-29 17:36 ` Dario Faggioli 2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli 2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli 2 siblings, 0 replies; 21+ messages in thread From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw) To: xen-devel; +Cc: George Dunlap Just move the big if() condition in an inline function. No functional change intended. Signed-off-by: Dario Faggioli <dfaggioli@suse.com> --- Cc: George Dunlap <george.dunlap@citrix.com> --- xen/common/sched/credit2.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c index 34f05c3e2a..697c9f917d 100644 --- a/xen/common/sched/credit2.c +++ b/xen/common/sched/credit2.c @@ -838,6 +838,20 @@ static inline bool same_core(unsigned int cpua, unsigned int cpub) cpu_to_core(cpua) == cpu_to_core(cpub); } +static inline bool +cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) +{ + unsigned int peer_cpu = rqd->pick_bias; + + BUG_ON(cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID); + + /* OPT_RUNQUEUE_CPU will never find an existing runqueue. 
*/ + return opt_runqueue == OPT_RUNQUEUE_ALL || + (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) || + (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) || + (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)); +} + static struct csched2_runqueue_data * cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) { @@ -855,21 +869,11 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) rqd_ins = &prv->rql; list_for_each_entry ( rqd, &prv->rql, rql ) { - unsigned int peer_cpu; - /* Remember first unused queue index. */ if ( !rqi_unused && rqd->id > rqi ) rqi_unused = true; - peer_cpu = rqd->pick_bias; - BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID || - cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID); - - /* OPT_RUNQUEUE_CPU will never find an existing runqueue. */ - if ( opt_runqueue == OPT_RUNQUEUE_ALL || - (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) || - (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) || - (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) ) + if ( cpu_runqueue_match(rqd, cpu) ) { rqd_valid = true; break; @@ -3744,6 +3748,8 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu) struct csched2_pcpu *spc; struct csched2_runqueue_data *rqd; + BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID); + spc = xzalloc(struct csched2_pcpu); if ( spc == NULL ) return ERR_PTR(-ENOMEM); ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli 2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli @ 2020-04-29 17:36 ` Dario Faggioli 2020-04-30 6:45 ` Jan Beulich 2020-04-30 7:35 ` Jürgen Groß 2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli 2 siblings, 2 replies; 21+ messages in thread From: Dario Faggioli @ 2020-04-29 17:36 UTC (permalink / raw) To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich In Credit2, CPUs (can) share runqueues, depending on the topology. For instance, with per-socket runqueues (the default) all the CPUs that are part of the same socket share a runqueue. On platforms with a huge number of CPUs per socket, that could be a problem. An example is AMD EPYC2 servers, where we can have up to 128 CPUs in a socket. It is of course possible to define other, still topology-based, runqueue arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in runqueues with too many CPUs on other/future platforms. Therefore, let's set a limit on the max number of CPUs that can share a Credit2 runqueue. The actual value is configurable (at boot time), the default being 16. If, for instance, there are more than 16 CPUs in a socket, they'll be split among two (or more) runqueues. Note: with core scheduling enabled, this parameter sets the max number of *scheduling resources* that can share a runqueue. Therefore, with granularity set to core (and assuming 2 threads per core), we will have at most 16 cores per runqueue, which corresponds to 32 threads. But that is fine, considering how core scheduling works. 
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> --- Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <george.dunlap@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Juergen Gross <jgross@suse.com> --- xen/common/sched/cpupool.c | 2 - xen/common/sched/credit2.c | 104 ++++++++++++++++++++++++++++++++++++++++++-- xen/common/sched/private.h | 2 + 3 files changed, 103 insertions(+), 5 deletions(-) diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c index d40345b585..0227457285 100644 --- a/xen/common/sched/cpupool.c +++ b/xen/common/sched/cpupool.c @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus; static DEFINE_SPINLOCK(cpupool_lock); -static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu; +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu; static unsigned int __read_mostly sched_granularity = 1; #ifdef CONFIG_HAS_SCHED_GRANULARITY diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c index 697c9f917d..abe4d048c8 100644 --- a/xen/common/sched/credit2.c +++ b/xen/common/sched/credit2.c @@ -471,6 +471,16 @@ static int __init parse_credit2_runqueue(const char *s) } custom_param("credit2_runqueue", parse_credit2_runqueue); +/* + * How many CPUs will be put, at most, in the same runqueue. + * Runqueues are still arranged according to the host topology (and + * according to the value of the 'credit2_runqueue' parameter). But + * we also have a cap to the number of CPUs that share runqueues. + * As soon as we reach the limit, a new runqueue will be created. + */ +static unsigned int __read_mostly opt_max_cpus_runqueue = 16; +integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue); + /* * Per-runqueue data */ @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)); } +/* Additional checks, to avoid separating siblings in different runqueues. 
*/ +static bool +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) +{ + unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu)); + unsigned int rcpu, nr_smts = 0; + + /* + * If we put the CPU in this runqueue, we must be sure that there will + * be enough room for accepting its hyperthread sibling(s) here as well. + */ + cpumask_clear(cpumask_scratch_cpu(cpu)); + for_each_cpu ( rcpu, &rqd->active ) + { + ASSERT(rcpu != cpu); + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) + { + /* + * For each CPU already in the runqueue, account for it and for + * its sibling(s), independently from whether such sibling(s) are + * in the runqueue already or not. + * + * Of course, if there are sibling CPUs in the runqueue already, + * only count them once. + */ + cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu), + per_cpu(cpu_sibling_mask, rcpu)); + nr_smts += nr_sibl; + } + } + /* + * We know that neither the CPU, nor any of its sibling are here, + * or we wouldn't even have entered the function. + */ + ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu), + per_cpu(cpu_sibling_mask, cpu))); + + /* Try adding CPU and its sibling(s) to the count and check... */ + nr_smts += nr_sibl; + + if ( nr_smts <= opt_max_cpus_runqueue ) + return true; + + return false; +} + static struct csched2_runqueue_data * cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) { struct csched2_runqueue_data *rqd, *rqd_new; + struct csched2_runqueue_data *rqd_valid = NULL; struct list_head *rqd_ins; unsigned long flags; int rqi = 0; - bool rqi_unused = false, rqd_valid = false; + bool rqi_unused = false; /* Prealloc in case we need it - not allowed with interrupts off. 
*/ rqd_new = xzalloc(struct csched2_runqueue_data); @@ -873,11 +930,44 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) if ( !rqi_unused && rqd->id > rqi ) rqi_unused = true; - if ( cpu_runqueue_match(rqd, cpu) ) + /* + * Check whether the CPU should (according to the topology) and also + * can (if we there aren't too many already) go in this runqueue. + */ + if ( rqd->refcnt < opt_max_cpus_runqueue && + cpu_runqueue_match(rqd, cpu) ) { - rqd_valid = true; - break; + cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu); + + dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n", + cpu, rqd->id, CPUMASK_PR(&rqd->active), + opt_max_cpus_runqueue); + + /* + * If we're using core (or socket!) scheduling, or we don't have + * hyperthreading, no need to do any further checking. + * + * If no (to both), but our sibling is already in this runqueue, + * then it's also ok for the CPU to stay in this runqueue.. + * + * Otherwise, do some more checks, to better account for SMT. 
+ */ + if ( opt_sched_granularity != SCHED_GRAN_cpu || + cpumask_weight(siblings) <= 1 || + cpumask_intersects(&rqd->active, siblings) ) + { + dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id); + rqd_valid = rqd; + break; + } + else if ( cpu_runqueue_smt_match(rqd, cpu) ) + { + dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id); + rqd_valid = rqd; + } } + else + dprintk(XENLOG_DEBUG, "ignoring runq %d\n", rqd->id); if ( !rqi_unused ) { @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) rqd->pick_bias = cpu; rqd->id = rqi; } + else + rqd = rqd_valid; + + printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n", + cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id, + CPUMASK_PR(&rqd->active)); rqd->refcnt++; diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h index 367811a12f..e964e3f407 100644 --- a/xen/common/sched/private.h +++ b/xen/common/sched/private.h @@ -30,6 +30,8 @@ enum sched_gran { SCHED_GRAN_socket }; +extern enum sched_gran opt_sched_granularity; + /* * In order to allow a scheduler to remap the lock->cpu mapping, * we have a per-cpu pointer, along with a pre-allocated set of ^ permalink raw reply related [flat|nested] 21+ messages in thread
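The sibling-aware admission logic of cpu_runqueue_smt_match() can be modelled in isolation as follows. This is a sketch under the simplifying assumption of a uniform number of threads per core (an assumption the review discussion that follows picks up on); it is not the hypervisor function itself, and the name smt_fits() is mine.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of the admission check (not the actual Xen
 * function).  nr_cores is the number of distinct cores that already
 * have at least one thread in the runqueue; every such core is
 * charged all of its threads_per_core siblings, whether they are in
 * the runqueue yet or not, and the incoming CPU brings its own full
 * sibling set on top.  The CPU is admitted only if the resulting
 * thread count still fits under the cap.
 */
static bool smt_fits(unsigned int nr_cores, unsigned int threads_per_core,
                     unsigned int cap)
{
    unsigned int nr_smts = (nr_cores + 1) * threads_per_core;

    return nr_smts <= cap;
}
```

With the default cap of 16 and 2 threads per core, an 8th core still fits ((7 + 1) * 2 == 16), while a 9th one would not; this is how siblings avoid being split across runqueues.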
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli @ 2020-04-30 6:45 ` Jan Beulich 2020-05-26 22:00 ` Dario Faggioli 2020-04-30 7:35 ` Jürgen Groß 1 sibling, 1 reply; 21+ messages in thread From: Jan Beulich @ 2020-04-30 6:45 UTC (permalink / raw) To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper On 29.04.2020 19:36, Dario Faggioli wrote: > @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) > (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)); > } > > +/* Additional checks, to avoid separating siblings in different runqueues. */ > +static bool > +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) > +{ > + unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu)); > + unsigned int rcpu, nr_smts = 0; > + > + /* > + * If we put the CPU in this runqueue, we must be sure that there will > + * be enough room for accepting its hyperthread sibling(s) here as well. > + */ > + cpumask_clear(cpumask_scratch_cpu(cpu)); > + for_each_cpu ( rcpu, &rqd->active ) > + { > + ASSERT(rcpu != cpu); > + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) > + { > + /* > + * For each CPU already in the runqueue, account for it and for > + * its sibling(s), independently from whether such sibling(s) are > + * in the runqueue already or not. > + * > + * Of course, if there are sibling CPUs in the runqueue already, > + * only count them once. > + */ > + cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu), > + per_cpu(cpu_sibling_mask, rcpu)); > + nr_smts += nr_sibl; This being common code, is it appropriate to assume all CPUs having the same number of siblings? Even beyond that, iirc the sibling mask represents the online or parked siblings, but not offline ones. 
For the purpose here, don't you rather care about the full set? What about HT vs AMD Fam15's CUs? Do you want both to be treated the same here? Also could you outline the intentions with this logic in the description, to be able to match the goal with what gets done? > + } > + } > + /* > + * We know that neither the CPU, nor any of its sibling are here, > + * or we wouldn't even have entered the function. > + */ > + ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu), > + per_cpu(cpu_sibling_mask, cpu))); > + > + /* Try adding CPU and its sibling(s) to the count and check... */ > + nr_smts += nr_sibl; > + > + if ( nr_smts <= opt_max_cpus_runqueue ) > + return true; > + > + return false; Fold these into return nr_smts <= opt_max_cpus_runqueue; ? > @@ -873,11 +930,44 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) > if ( !rqi_unused && rqd->id > rqi ) > rqi_unused = true; > > - if ( cpu_runqueue_match(rqd, cpu) ) > + /* > + * Check whether the CPU should (according to the topology) and also > + * can (if we there aren't too many already) go in this runqueue. Nit: Stray "we"? > + */ > + if ( rqd->refcnt < opt_max_cpus_runqueue && > + cpu_runqueue_match(rqd, cpu) ) > { > - rqd_valid = true; > - break; > + cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu); > + > + dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n", > + cpu, rqd->id, CPUMASK_PR(&rqd->active), > + opt_max_cpus_runqueue); > + > + /* > + * If we're using core (or socket!) scheduling, or we don't have > + * hyperthreading, no need to do any further checking. > + * > + * If no (to both), but our sibling is already in this runqueue, > + * then it's also ok for the CPU to stay in this runqueue.. Nit: Stray full 2nd stop? > + * Otherwise, do some more checks, to better account for SMT. 
> + */ > + if ( opt_sched_granularity != SCHED_GRAN_cpu || > + cpumask_weight(siblings) <= 1 || > + cpumask_intersects(&rqd->active, siblings) ) > + { > + dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id); > + rqd_valid = rqd; > + break; > + } > + else if ( cpu_runqueue_smt_match(rqd, cpu) ) > + { > + dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id); > + rqd_valid = rqd; > + } > } > + else Hard tab slipped in. > @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu) > rqd->pick_bias = cpu; > rqd->id = rqi; > } > + else > + rqd = rqd_valid; > + > + printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n", > + cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id, > + CPUMASK_PR(&rqd->active)); Iirc there's one per-CPU printk() already. On large systems this isn't very nice, so I'd like to ask that their total number at least not get further grown. Ideally there would be a less verbose summary after all CPUs have been brought up at boot, with per-CPU info be logged only during CPU hot online. Jan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-30 6:45 ` Jan Beulich @ 2020-05-26 22:00 ` Dario Faggioli 2020-05-27 4:26 ` Jürgen Groß 2020-05-27 6:17 ` Jan Beulich 0 siblings, 2 replies; 21+ messages in thread From: Dario Faggioli @ 2020-05-26 22:00 UTC (permalink / raw) To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 4894 bytes --] Hey, thanks for the review, and sorry for replying late... I was busy with something and then was trying to implement a better balancing logic, as discussed with Juergen, but with only partial success... On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote: > On 29.04.2020 19:36, Dario Faggioli wrote: > > @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct > > [...] > > + ASSERT(rcpu != cpu); > > + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) > > + { > > + /* > > + * For each CPU already in the runqueue, account for > > it and for > > + * its sibling(s), independently from whether such > > sibling(s) are > > + * in the runqueue already or not. > > + * > > + * Of course, if there are sibling CPUs in the > > runqueue already, > > + * only count them once. > > + */ > > + cpumask_or(cpumask_scratch_cpu(cpu), > > cpumask_scratch_cpu(cpu), > > + per_cpu(cpu_sibling_mask, rcpu)); > > + nr_smts += nr_sibl; > > This being common code, is it appropriate to assume all CPUs having > the same number of siblings? > You mention common code because you are thinking of differences between x86 and ARM? In ARM --although there might be (I'm not sure) chips that have SMT, or that we may want to identify and treat as if it were SMT-- we currently have no support for that, so I don't think it is a problem. On x86, I'm not sure I am aware of cases where the number of threads is different among cores or sockets... are there any? Besides, we have some SMT-specific code around (especially in scheduling) already. 
> Even beyond that, iirc the sibling mask > represents the online or parked siblings, but not offline ones. For > the purpose here, don't you rather care about the full set? > This is actually a good point. I indeed care about the number of siblings a thread has, in general, not only about the ones that are currently online. In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped in a helper that just returns 1 for ARM. What do you think, is this better? > What about HT vs AMD Fam15's CUs? Do you want both to be treated > the same here? > Are you referring to the cores that, AFAIUI, share the L1i cache? If yes, I thought about it, and ended up _not_ dealing with them here, but I'm still a bit unsure. Cache-oriented runqueue organization will be the subject of another patch series, and that's why I kept them out. However, that's a rather special case with a lot in common with SMT... Just in case, is there a way to identify them easily, like with a mask or something, in the code already? > Also could you outline the intentions with this logic in the > description, to be able to match the goal with what gets done? > Sure, I will try to state it more clearly. > > @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private > > *prv, unsigned int cpu) > > rqd->pick_bias = cpu; > > rqd->id = rqi; > > } > > + else > > + rqd = rqd_valid; > > + > > + printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to > > runqueue %d with {%*pbl}\n", > > + cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd- > > >id, > > + CPUMASK_PR(&rqd->active)); > > Iirc there's one per-CPU printk() already. On large systems this > isn't > very nice, so I'd like to ask that their total number at least not > get > further grown. Ideally there would be a less verbose summary after > all > CPUs have been brought up at boot, with per-CPU info be logged only > during CPU hot online. > Understood. 
Problem is that, here in the scheduling code, I don't see an easy way to tell when we have finished bringing up CPUs... And it's probably not worth looking too hard (even less adding logic) only for the sake of printing this message. So I think I will demote this printk to a XENLOG_DEBUG one (and will also get rid of others that were already DEBUG, but not super useful, after some more thinking). The idea is that, after all, exactly on which runqueue a CPU ends up is not information that should matter much to a user/administrator. For now, it will be possible to know where they ended up via debug keys. And even if we feel like making it more visible, that is better achieved via some toolstack command that queries and prints the configuration of the scheduler, rather than a line like this in the boot log. Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
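The helper mentioned above (return boot_cpu_data.x86_num_siblings on x86, 1 where SMT is not supported) might look roughly like this. Everything here is hypothetical: the stub structure stands in for Xen's real boot_cpu_data, and the name cpu_nr_siblings() is mine.

```c
#include <assert.h>

/*
 * Sketch of the helper discussed above (not Xen code).  On x86 it
 * would read the real boot_cpu_data.x86_num_siblings, i.e. the full
 * count of hardware siblings including offline ones; on
 * architectures without SMT support it would just return 1.
 */
struct cpuinfo_stub {
    unsigned int x86_num_siblings;  /* total siblings, incl. offline */
};

static const struct cpuinfo_stub boot_cpu_data = { .x86_num_siblings = 2 };
static const int arch_has_smt = 1;  /* pretend this is an x86 build */

static unsigned int cpu_nr_siblings(void)
{
    return arch_has_smt ? boot_cpu_data.x86_num_siblings : 1;
}
```

The point of using the total sibling count, rather than the sibling mask, is exactly the one raised in the review: the mask only covers online or parked siblings.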
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-26 22:00 ` Dario Faggioli @ 2020-05-27 4:26 ` Jürgen Groß 2020-05-28 9:32 ` Dario Faggioli 2020-05-27 6:17 ` Jan Beulich 1 sibling, 1 reply; 21+ messages in thread From: Jürgen Groß @ 2020-05-27 4:26 UTC (permalink / raw) To: Dario Faggioli, Jan Beulich; +Cc: xen-devel, George Dunlap, Andrew Cooper On 27.05.20 00:00, Dario Faggioli wrote: > Hey, > > thanks for the review, and sorry for replying late... I was busy with > something and then was trying to implement a better balancing logic, as > discussed with Juergen, but with only partial success... > > On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote: >> On 29.04.2020 19:36, Dario Faggioli wrote: >>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct >>> [...] >>> + ASSERT(rcpu != cpu); >>> + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) >>> + { >>> + /* >>> + * For each CPU already in the runqueue, account for >>> it and for >>> + * its sibling(s), independently from whether such >>> sibling(s) are >>> + * in the runqueue already or not. >>> + * >>> + * Of course, if there are sibling CPUs in the >>> runqueue already, >>> + * only count them once. >>> + */ >>> + cpumask_or(cpumask_scratch_cpu(cpu), >>> cpumask_scratch_cpu(cpu), >>> + per_cpu(cpu_sibling_mask, rcpu)); >>> + nr_smts += nr_sibl; >> >> This being common code, is it appropriate to assume all CPUs having >> the same number of siblings? >> > You mention common code because you are thinking of differences between > x86 and ARM? In ARM --althought there might be (I'm not sure)-- chips > that have SMT, or that we may want to identify and treat like if it was > SMT, we currently have no support for that, so I don't think it is a > problem. > > On x86, I'm not sure I am aware of cases where the number of threads is > different among cores or sockets... are there any? > > Besides, we have some SMT specific code around (especially in > scheduling) already. 
> >> Even beyond that, iirc the sibling mask >> represents the online or parked siblings, but not offline ones. For >> the purpose here, don't you rather care about the full set? >> > This is actually a good point. I indeed care about the number of > siblings a thread has, in general, not only about the ones that are > currently online. > > In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped > in an helper that just returns 1 for ARM. What do you think, is this > better? > >> What about HT vs AMD Fam15's CUs? Do you want both to be treated >> the same here? >> > Are you referring to the cores that, AFAIUI, share the L1i cache? If > yes, I thought about it, and ended up _not_ dealing with them here, but > I'm still a bit unsure. > > Cache oriented runqueue organization will be the subject of another > patch series, and that's why I kept them out. However, that's a rather > special case with a lot in common to SMT... Just in case, is there a > way to identify them easily, like with a mask or something, in the code > already? > >> Also could you outline the intentions with this logic in the >> description, to be able to match the goal with what gets done? >> > Sure, I will try state it more clearly. > >>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private >>> *prv, unsigned int cpu) >>> rqd->pick_bias = cpu; >>> rqd->id = rqi; >>> } >>> + else >>> + rqd = rqd_valid; >>> + >>> + printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to >>> runqueue %d with {%*pbl}\n", >>> + cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd- >>>> id, >>> + CPUMASK_PR(&rqd->active)); >> >> Iirc there's one per-CPU printk() already. On large systems this >> isn't >> very nice, so I'd like to ask that their total number at least not >> get >> further grown. Ideally there would be a less verbose summary after >> all >> CPUs have been brought up at boot, with per-CPU info be logged only >> during CPU hot online. >> > Understood. 
Problem is that, here in the scheduling code, I don't see > an easy way to tell when we have finished bringing up CPUs... And it's > probably not worth looking too hard (even less adding logic) only for > the sake of printing this message. cpupool_init() is the perfect place for that. Juergen ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-27 4:26 ` Jürgen Groß @ 2020-05-28 9:32 ` Dario Faggioli 0 siblings, 0 replies; 21+ messages in thread From: Dario Faggioli @ 2020-05-28 9:32 UTC (permalink / raw) To: Jürgen Groß, Jan Beulich Cc: xen-devel, George Dunlap, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 1124 bytes --] On Wed, 2020-05-27 at 06:26 +0200, Jürgen Groß wrote: > On 27.05.20 00:00, Dario Faggioli wrote: > > > > Understood. Problem is that, here in the scheduling code, I don't > > see > > an easy way to tell when we have finished bringing up CPUs... And > > it's > > probably not worth looking too hard (even less adding logic) only > > for > > the sake of printing this message. > > cpupool_init() is the perfect place for that. > Yes, at least for boot time, it is indeed, so I'll go for it. OTOH, when for instance one creates a new cpupool (or adds a bunch of CPUs to one, with `xl cpupool-add Pool-X 7,3-14-22,18`), CPUs are added one by one, and we don't really know in Xen which one will be the last, so we can't print a summary at that point. But yeah, I'll go for cpupool_init() and get rid of the rest, for now. Thanks and Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-26 22:00 ` Dario Faggioli 2020-05-27 4:26 ` Jürgen Groß @ 2020-05-27 6:17 ` Jan Beulich 2020-05-28 14:55 ` Dario Faggioli 1 sibling, 1 reply; 21+ messages in thread From: Jan Beulich @ 2020-05-27 6:17 UTC (permalink / raw) To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper On 27.05.2020 00:00, Dario Faggioli wrote: > On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote: >> On 29.04.2020 19:36, Dario Faggioli wrote: >>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct >>> [...] >>> + ASSERT(rcpu != cpu); >>> + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) >>> + { >>> + /* >>> + * For each CPU already in the runqueue, account for >>> it and for >>> + * its sibling(s), independently from whether such >>> sibling(s) are >>> + * in the runqueue already or not. >>> + * >>> + * Of course, if there are sibling CPUs in the >>> runqueue already, >>> + * only count them once. >>> + */ >>> + cpumask_or(cpumask_scratch_cpu(cpu), >>> cpumask_scratch_cpu(cpu), >>> + per_cpu(cpu_sibling_mask, rcpu)); >>> + nr_smts += nr_sibl; >> >> This being common code, is it appropriate to assume all CPUs having >> the same number of siblings? >> > You mention common code because you are thinking of differences between > x86 and ARM? In ARM --althought there might be (I'm not sure)-- chips > that have SMT, or that we may want to identify and treat like if it was > SMT, we currently have no support for that, so I don't think it is a > problem. > > On x86, I'm not sure I am aware of cases where the number of threads is > different among cores or sockets... are there any? I'm not aware of any either, but in common code it should also matter what might be, not just what there is. Unless you wrap things in e.g. CONFIG_X86. >> Even beyond that, iirc the sibling mask >> represents the online or parked siblings, but not offline ones. 
For >> the purpose here, don't you rather care about the full set? >> > This is actually a good point. I indeed care about the number of > siblings a thread has, in general, not only about the ones that are > currently online. > > In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped > in an helper that just returns 1 for ARM. What do you think, is this > better? As per above, cpu_data[rcpu] then please. >> What about HT vs AMD Fam15's CUs? Do you want both to be treated >> the same here? >> > Are you referring to the cores that, AFAIUI, share the L1i cache? If > yes, I thought about it, and ended up _not_ dealing with them here, but > I'm still a bit unsure. > > Cache oriented runqueue organization will be the subject of another > patch series, and that's why I kept them out. However, that's a rather > special case with a lot in common to SMT... I didn't think of cache sharing in particular, but about the concept of compute units vs hyperthreads in general. > Just in case, is there a > way to identify them easily, like with a mask or something, in the code > already? cpu_sibling_mask still gets used for both, so there's no mask to use. As per set_cpu_sibling_map() you can look at cpu_data[].compute_unit_id to tell, but that's of course x86-specific (as is the entire compute unit concept). >>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private >>> *prv, unsigned int cpu) >>> rqd->pick_bias = cpu; >>> rqd->id = rqi; >>> } >>> + else >>> + rqd = rqd_valid; >>> + >>> + printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to >>> runqueue %d with {%*pbl}\n", >>> + cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd- >>>> id, >>> + CPUMASK_PR(&rqd->active)); >> >> Iirc there's one per-CPU printk() already. On large systems this >> isn't >> very nice, so I'd like to ask that their total number at least not >> get >> further grown. 
>> Ideally there would be a less verbose summary after all
>> CPUs have been brought up at boot, with per-CPU info being logged
>> only during CPU hot online.
>>
> Understood. Problem is that, here in the scheduling code, I don't see
> an easy way to tell when we have finished bringing up CPUs... And it's
> probably not worth looking too hard (even less adding logic) only for
> the sake of printing this message.
>
> So I think I will demote this printk as a XENLOG_DEBUG one (and will
> also get rid of others that were already DEBUG, but not super useful,
> after some more thinking).

Having seen Jürgen's reply as well as what you further wrote below,
I'd still like to point out that even XENLOG_DEBUG isn't quiet enough:
There may be some value to such a debugging message to you, but it may
be mainly spam to e.g. me, who still needs to run with loglvl=all in
the common case. Let's not forget, the context in which the underlying
topic came up was pretty-many-core AMD CPUs.

We had this issue elsewhere as well and, as CPU counts grew, we
started hiding such messages behind a command line option, besides a
log level qualification. Similarly we hide e.g. details of the IOMMU
arrangements behind a command line option.

> The idea is that, after all, exactly on which runqueue a CPU ends up
> is not information that should matter much to a user/administrator.
> For now, it will be possible to know where they ended up via debug
> keys. And even if we feel like making it more visible, that is better
> achieved via some toolstack command that queries and prints the
> configuration of the scheduler, rather than a line like this in the
> boot log.

Good.

Jan

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-27 6:17 ` Jan Beulich @ 2020-05-28 14:55 ` Dario Faggioli 2020-05-29 9:58 ` Jan Beulich 0 siblings, 1 reply; 21+ messages in thread From: Dario Faggioli @ 2020-05-28 14:55 UTC (permalink / raw) To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 2814 bytes --] On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote: > On 27.05.2020 00:00, Dario Faggioli wrote: > > > > Cache oriented runqueue organization will be the subject of another > > patch series, and that's why I kept them out. However, that's a > > rather > > special case with a lot in common to SMT... > > I didn't think of cache sharing in particular, but about the > concept of compute units vs hyperthreads in general. > Ok. > > Just in case, is there a > > way to identify them easily, like with a mask or something, in the > > code > > already? > > cpu_sibling_mask still gets used for both, so there's no mask > to use. As per set_cpu_sibling_map() you can look at > cpu_data[].compute_unit_id to tell, but that's of course x86- > specific (as is the entire compute unit concept). > Right. And thanks for the pointers. But then, what I am understanding by having a look there is that I indeed can use (again, appropriately wrapped) x86_num_siblings, for telling, in this function, whether a CPU has any, and if yes how many, HT (Intel) or CU (AMD) siblings in total, although some of them may currently be offline. Which means I will be treating HTs and CUs the same which, thinking more about it (and thinking actually to CUs, rather than to any cache sharing relationship), does make sense for this feature. Does this make sense, or am I missing or misinterpreting anything? > > So I think I will demote this printk as a XENLOG_DEBUG one (and > > will > > also get rid of others that were already DEBUG, but not super > > useful, > > after some more thinking). 
>
> Having seen Jürgen's reply as well as what you further wrote below,
> I'd still like to point out that even XENLOG_DEBUG isn't quiet
> enough: There may be some value to such a debugging message to you,
> but it may be mainly spam to e.g. me, who still needs to run with
> loglvl=all in the common case. Let's not forget, the context in which
> the underlying topic came up was pretty-many-core AMD CPUs.
>
Good point indeed about DEBUG potentially being an issue as well.

So yes, as announced in my reply to Juergen, I was going with the
recap in cpupool_init(). However, that looks like it requires a new
hook in the scheduler's interface, as the information is scheduler
specific; but, at the same time, I don't think we want to have the
exact same information from either dump_settings() or
dump_cpu_state(), which is all we have... :-/

I'll see about it.

Thanks and Regards
--
Dario Faggioli, Ph.D http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-28 14:55 ` Dario Faggioli @ 2020-05-29 9:58 ` Jan Beulich 2020-05-29 10:19 ` Dario Faggioli 0 siblings, 1 reply; 21+ messages in thread From: Jan Beulich @ 2020-05-29 9:58 UTC (permalink / raw) To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper On 28.05.2020 16:55, Dario Faggioli wrote: > On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote: >> On 27.05.2020 00:00, Dario Faggioli wrote: >>> Just in case, is there a >>> way to identify them easily, like with a mask or something, in the >>> code >>> already? >> >> cpu_sibling_mask still gets used for both, so there's no mask >> to use. As per set_cpu_sibling_map() you can look at >> cpu_data[].compute_unit_id to tell, but that's of course x86- >> specific (as is the entire compute unit concept). >> > Right. And thanks for the pointers. > > But then, what I am understanding by having a look there is that I > indeed can use (again, appropriately wrapped) x86_num_siblings, for > telling, in this function, whether a CPU has any, and if yes how many, > HT (Intel) or CU (AMD) siblings in total, although some of them may > currently be offline. > > Which means I will be treating HTs and CUs the same which, thinking > more about it (and thinking actually to CUs, rather than to any cache > sharing relationship), does make sense for this feature. > > Does this make sense, or am I missing or misinterpreting anything? Well, it effectively answers the question I had raised: "What about HT vs AMD Fam15's CUs? Do you want both to be treated the same here?" Jan ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
2020-05-29 9:58 ` Jan Beulich
@ 2020-05-29 10:19 ` Dario Faggioli
0 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 10:19 UTC (permalink / raw)
To: Jan Beulich; +Cc: Juergen Gross, xen-devel, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1150 bytes --]

On Fri, 2020-05-29 at 11:58 +0200, Jan Beulich wrote:
> On 28.05.2020 16:55, Dario Faggioli wrote:
> >
> > Which means I will be treating HTs and CUs the same which, thinking
> > more about it (and thinking actually to CUs, rather than to any
> > cache sharing relationship), does make sense for this feature.
> >
> > Does this make sense, or am I missing or misinterpreting anything?
>
> Well, it effectively answers the question I had raised: "What about
> HT vs AMD Fam15's CUs? Do you want both to be treated the same here?"
>
Exactly! :-) And, in fact, the answer is "Yes, I want them being
treated the same, and that is what the patches are actually doing".

Sorry for not getting it right away, and just replying like that in
the first place. At least, this led to better (more generic) comments
and variable names in v2 of this patch.

Regards
--
Dario Faggioli, Ph.D http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
2020-04-30 6:45 ` Jan Beulich
@ 2020-04-30 7:35 ` Jürgen Groß
2020-04-30 12:28 ` Dario Faggioli
1 sibling, 1 reply; 21+ messages in thread
From: Jürgen Groß @ 2020-04-30 7:35 UTC (permalink / raw)
To: Dario Faggioli, xen-devel; +Cc: Andrew Cooper, George Dunlap, Jan Beulich

On 29.04.20 19:36, Dario Faggioli wrote:
> In Credit2, CPUs (can) share runqueues, depending on the topology. For
> instance, with per-socket runqueues (the default) all the CPUs that are
> part of the same socket share a runqueue.
>
> On platforms with a huge number of CPUs per socket, that could be a
> problem. An example is AMD EPYC2 servers, where we can have up to 128
> CPUs in a socket.
>
> It is of course possible to define other, still topology-based, runqueue
> arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
> runqueues with too many CPUs on other/future platforms.
>
> Therefore, let's set a limit to the max number of CPUs that can share a
> Credit2 runqueue. The actual value is configurable (at boot time), the
> default being 16. If, for instance, there are more than 16 CPUs in a
> socket, they'll be split among two (or more) runqueues.

Did you think about balancing the runqueues regarding the number of
cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in each
runqueue? I know this will need more logic as cpus are added one by one,
but the result would be much better IMO.

> Note: with core scheduling enabled, this parameter sets the max number
> of *scheduling resources* that can share a runqueue. Therefore, with
> granularity set to core (and assuming 2 threads per core), we will have
> at most 16 cores per runqueue, which corresponds to 32 threads. But that
> is fine, considering how core scheduling works.
> > Signed-off-by: Dario Faggioli <dfaggioli@suse.com> > --- > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Cc: George Dunlap <george.dunlap@citrix.com> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Juergen Gross <jgross@suse.com> > --- > xen/common/sched/cpupool.c | 2 - > xen/common/sched/credit2.c | 104 ++++++++++++++++++++++++++++++++++++++++++-- > xen/common/sched/private.h | 2 + > 3 files changed, 103 insertions(+), 5 deletions(-) > > diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c > index d40345b585..0227457285 100644 > --- a/xen/common/sched/cpupool.c > +++ b/xen/common/sched/cpupool.c > @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus; > > static DEFINE_SPINLOCK(cpupool_lock); > > -static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu; > +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu; Please don't use the global option value, but the per-cpupool one. > static unsigned int __read_mostly sched_granularity = 1; > > #ifdef CONFIG_HAS_SCHED_GRANULARITY > diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c > index 697c9f917d..abe4d048c8 100644 > --- a/xen/common/sched/credit2.c > +++ b/xen/common/sched/credit2.c > @@ -471,6 +471,16 @@ static int __init parse_credit2_runqueue(const char *s) > } > custom_param("credit2_runqueue", parse_credit2_runqueue); > > +/* > + * How many CPUs will be put, at most, in the same runqueue. > + * Runqueues are still arranged according to the host topology (and > + * according to the value of the 'credit2_runqueue' parameter). But > + * we also have a cap to the number of CPUs that share runqueues. > + * As soon as we reach the limit, a new runqueue will be created. 
> + */ > +static unsigned int __read_mostly opt_max_cpus_runqueue = 16; > +integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue); > + > /* > * Per-runqueue data > */ > @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) > (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)); > } > > +/* Additional checks, to avoid separating siblings in different runqueues. */ > +static bool > +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu) > +{ > + unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu)); Shouldn't you mask away siblings not in the cpupool? > + unsigned int rcpu, nr_smts = 0; > + > + /* > + * If we put the CPU in this runqueue, we must be sure that there will > + * be enough room for accepting its hyperthread sibling(s) here as well. > + */ > + cpumask_clear(cpumask_scratch_cpu(cpu)); > + for_each_cpu ( rcpu, &rqd->active ) > + { > + ASSERT(rcpu != cpu); > + if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) ) > + { > + /* > + * For each CPU already in the runqueue, account for it and for > + * its sibling(s), independently from whether such sibling(s) are > + * in the runqueue already or not. > + * > + * Of course, if there are sibling CPUs in the runqueue already, > + * only count them once. > + */ > + cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu), > + per_cpu(cpu_sibling_mask, rcpu)); Again, local cpupool only! Juergen ^ permalink raw reply [flat|nested] 21+ messages in thread
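The sibling-accounting loop reviewed above can be modeled in miniature. The following is a standalone toy sketch, not the Xen code: plain 64-bit masks stand in for cpumask_t, and all names (rq_smt_room, seen, smt2_sibling_mask) are hypothetical. It checks whether a runqueue has room for a CPU together with all of its hyperthread siblings:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model (NOT the Xen code) of the SMT-aware room check discussed
 * in this review.  sibling_mask[c] is the mask of CPU c's hyperthread
 * siblings, including c itself.
 */
static bool rq_smt_room(uint64_t rq_active, const uint64_t *sibling_mask,
                        unsigned int nr_cpus, unsigned int cpu,
                        unsigned int max_cpus_rq)
{
    /* Siblings of the CPU being placed (the patch assumes a uniform count). */
    unsigned int nr_sibl = __builtin_popcountll(sibling_mask[cpu]);
    unsigned int nr_smts = 0, rcpu;
    uint64_t seen = 0;   /* plays the role of cpumask_scratch_cpu(cpu) */

    for ( rcpu = 0; rcpu < nr_cpus; rcpu++ )
    {
        if ( !(rq_active & (1ULL << rcpu)) )
            continue;
        if ( !(seen & (1ULL << rcpu)) )
        {
            /*
             * Account for rcpu and all of its siblings at once, whether
             * or not those siblings are in the runqueue yet; 'seen'
             * ensures each sibling group is counted only once.
             */
            seen |= sibling_mask[rcpu];
            nr_smts += nr_sibl;
        }
    }

    /*
     * If one of cpu's siblings is already in the runqueue, cpu's group
     * has been accounted for above; otherwise add room for the whole
     * group now.
     */
    if ( !(seen & (1ULL << cpu)) )
        nr_smts += nr_sibl;

    return nr_smts <= max_cpus_rq;
}

/* Example topology: 8 CPUs, 2-way SMT, siblings adjacent ({0,1}, {2,3}, ...). */
static const uint64_t smt2_sibling_mask[8] = {
    0x03, 0x03, 0x0c, 0x0c, 0x30, 0x30, 0xc0, 0xc0,
};
```

With CPUs 0-5 (three full sibling pairs) already in a runqueue, placing CPU 6 needs room for 8 threads in total, so it fits under a cap of 8 but not under a cap of 7.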
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-30 7:35 ` Jürgen Groß @ 2020-04-30 12:28 ` Dario Faggioli 2020-04-30 12:52 ` Jürgen Groß 0 siblings, 1 reply; 21+ messages in thread From: Dario Faggioli @ 2020-04-30 12:28 UTC (permalink / raw) To: Jürgen Groß, xen-devel Cc: Andrew Cooper, George Dunlap, Jan Beulich [-- Attachment #1: Type: text/plain, Size: 3528 bytes --] On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote: > On 29.04.20 19:36, Dario Faggioli wrote: > > > > Therefore, let's set a limit to the max number of CPUs that can > > share a > > Credit2 runqueue. The actual value is configurable (at boot time), > > the > > default being 16. If, for instance, there are more than 16 CPUs in > > a > > socket, they'll be split among two (or more) runqueues. > > Did you think about balancing the runqueues regarding the number of > cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in > each > runqueue? I know this will need more logic as cpus are added one by > one, > but the result would be much better IMO. > I know. Point is, CPUs not only are added one by one, but they can, once the system is running, be offlined/onlined or moved among cpupools. Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at boot, if the admin removes 4 CPUs that happened to be all in the same runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd indeed need full runqueue online rebalancing logic, which will probably end up being quite complex and I'm not sure it's worth it. That being said, I can try to make things a bit more fair, when CPUs come up and are added to the pool. Something around the line of adding them to the runqueue with the least number of CPUs in it (among the suitable ones, of course). With that, when the user removes 4 CPUs, we will have the 6 vs 10 situation. But we would make sure that, when she adds them back, we will go back to 10 vs. 
10, instead than, say, 6 vs 14 or something like that. Was something like this that you had in mind? And in any case, what do you think about it? > > --- a/xen/common/sched/cpupool.c > > +++ b/xen/common/sched/cpupool.c > > @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus; > > > > static DEFINE_SPINLOCK(cpupool_lock); > > > > -static enum sched_gran __read_mostly opt_sched_granularity = > > SCHED_GRAN_cpu; > > +enum sched_gran __read_mostly opt_sched_granularity = > > SCHED_GRAN_cpu; > > Please don't use the global option value, but the per-cpupool one. > Yep, you're right. Will do. > > +/* Additional checks, to avoid separating siblings in different > > runqueues. */ > > +static bool > > +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, > > unsigned int cpu) > > +{ > > + unsigned int nr_sibl = > > cpumask_weight(per_cpu(cpu_sibling_mask, cpu)); > > Shouldn't you mask away siblings not in the cpupool? > So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues and CPU 0 is in Pool-1, when I add CPU 1 --which is CPU 0's sibling-- to Pool-0, I always want to make sure that there is room for both CPUs 0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if CPU 1 is currently in another pool. This way if, in future, CPU 1 is removed from Pool-1 and added to Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I don't consider CPUs which currently are in another pool, we risk that when/if they're added to this very pool, they'll end up in a different runqueue. And we don't want that. Makes sense? 
Thanks and Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
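The placement rule proposed above, i.e. adding a CPU to the suitable runqueue currently holding the fewest CPUs, is essentially an arg-min scan. A hypothetical standalone sketch (an illustration only, not the Xen code; names and the fixed runqueue count are assumptions):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_RQS 4

/*
 * Hypothetical sketch (not the Xen code) of the balancing rule
 * discussed above: among the runqueues a CPU may join ("suitable",
 * e.g. same socket and with room for the CPU and its siblings), pick
 * the one currently holding the fewest CPUs, so that adding CPUs one
 * by one keeps the runqueues as even as possible.
 */
static int pick_least_loaded(const unsigned int nr_cpus[NR_RQS],
                             const bool suitable[NR_RQS])
{
    int best = -1;
    unsigned int i;

    for ( i = 0; i < NR_RQS; i++ )
        if ( suitable[i] && (best < 0 || nr_cpus[i] < nr_cpus[best]) )
            best = (int)i;

    return best;   /* -1: no suitable runqueue, the caller creates a new one */
}
```

For example, with per-runqueue populations {10, 6, 14, 2} and only the first two runqueues suitable, the CPU goes to runqueue 1.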
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-30 12:28 ` Dario Faggioli @ 2020-04-30 12:52 ` Jürgen Groß 2020-04-30 14:01 ` Dario Faggioli 2020-05-26 21:18 ` Dario Faggioli 0 siblings, 2 replies; 21+ messages in thread From: Jürgen Groß @ 2020-04-30 12:52 UTC (permalink / raw) To: Dario Faggioli, xen-devel; +Cc: Andrew Cooper, George Dunlap, Jan Beulich On 30.04.20 14:28, Dario Faggioli wrote: > On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote: >> On 29.04.20 19:36, Dario Faggioli wrote: >>> >>> Therefore, let's set a limit to the max number of CPUs that can >>> share a >>> Credit2 runqueue. The actual value is configurable (at boot time), >>> the >>> default being 16. If, for instance, there are more than 16 CPUs in >>> a >>> socket, they'll be split among two (or more) runqueues. >> >> Did you think about balancing the runqueues regarding the number of >> cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in >> each >> runqueue? I know this will need more logic as cpus are added one by >> one, >> but the result would be much better IMO. >> > I know. Point is, CPUs not only are added one by one, but they can, > once the system is running, be offlined/onlined or moved among > cpupools. > > Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at > boot, if the admin removes 4 CPUs that happened to be all in the same > runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd > indeed need full runqueue online rebalancing logic, which will probably > end up being quite complex and I'm not sure it's worth it. > > That being said, I can try to make things a bit more fair, when CPUs > come up and are added to the pool. Something around the line of adding > them to the runqueue with the least number of CPUs in it (among the > suitable ones, of course). > > With that, when the user removes 4 CPUs, we will have the 6 vs 10 > situation. 
But we would make sure that, when she adds them back, we > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something like > that. > > Was something like this that you had in mind? And in any case, what do > you think about it? Yes, this would be better already. > >>> --- a/xen/common/sched/cpupool.c >>> +++ b/xen/common/sched/cpupool.c >>> @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus; >>> >>> static DEFINE_SPINLOCK(cpupool_lock); >>> >>> -static enum sched_gran __read_mostly opt_sched_granularity = >>> SCHED_GRAN_cpu; >>> +enum sched_gran __read_mostly opt_sched_granularity = >>> SCHED_GRAN_cpu; >> >> Please don't use the global option value, but the per-cpupool one. >> > Yep, you're right. Will do. > >>> +/* Additional checks, to avoid separating siblings in different >>> runqueues. */ >>> +static bool >>> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, >>> unsigned int cpu) >>> +{ >>> + unsigned int nr_sibl = >>> cpumask_weight(per_cpu(cpu_sibling_mask, cpu)); >> >> Shouldn't you mask away siblings not in the cpupool? >> > So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues > and CPU 0 is in Pool-1, when I add CPU 1 --which is CPU 0's sibling-- > to Pool-0, I always want to make sure that there is room for both CPUs > 0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if > CPU 1 is currently in another pool. > > This way if, in future, CPU 1 is removed from Pool-1 and added to > Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I > don't consider CPUs which currently are in another pool, we risk that > when/if they're added to this very pool, they'll end up in a different > runqueue. > > And we don't want that. > > Makes sense? Yes. You should add a comment in this regard. And you should either reject the case of less cpus per queue than siblings per core, or you should handle this situation. Otherwise you won't ever find a suitable run-queue. 
:-) Juergen ^ permalink raw reply [flat|nested] 21+ messages in thread
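The parameter check asked for above can be handled by either rejecting the bad value or adjusting it: a per-runqueue CPU cap smaller than the number of sibling threads per core could never be satisfied, because siblings are kept in the same runqueue. A hypothetical sketch of the clamping variant (not the actual Xen code; names are assumptions):

```c
#include <assert.h>

/*
 * Hypothetical sketch (not the actual Xen code) of validating the
 * max-CPUs-per-runqueue parameter.  Clamping the value up is one
 * option; printing a warning and/or rejecting the configuration
 * would be the alternative.
 */
static unsigned int sanitize_max_cpus_rq(unsigned int opt_max_cpus_rq,
                                         unsigned int nr_siblings_per_core)
{
    if ( opt_max_cpus_rq < nr_siblings_per_core )
        return nr_siblings_per_core;

    return opt_max_cpus_rq;
}
```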
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-30 12:52 ` Jürgen Groß @ 2020-04-30 14:01 ` Dario Faggioli 2020-05-26 21:18 ` Dario Faggioli 1 sibling, 0 replies; 21+ messages in thread From: Dario Faggioli @ 2020-04-30 14:01 UTC (permalink / raw) To: Jürgen Groß, xen-devel Cc: Andrew Cooper, George Dunlap, Jan Beulich [-- Attachment #1: Type: text/plain, Size: 1901 bytes --] On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote: > On 30.04.20 14:28, Dario Faggioli wrote: > > On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote: > > > > > With that, when the user removes 4 CPUs, we will have the 6 vs 10 > > situation. But we would make sure that, when she adds them back, we > > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something > > like > > that. > > > > Was something like this that you had in mind? And in any case, what > > do > > you think about it? > > Yes, this would be better already. > Right, I'll give a try at this, and let's see how it ends up looking like. > > This way if, in future, CPU 1 is removed from Pool-1 and added to > > Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If > > I > > don't consider CPUs which currently are in another pool, we risk > > that > > when/if they're added to this very pool, they'll end up in a > > different > > runqueue. > > > > And we don't want that. > > > Yes. > Cool. :-) > You should add a comment in this regard. > Sure. > And you should either reject the case of less cpus per queue than > siblings per core, or you should handle this situation. Otherwise you > won't ever find a suitable run-queue. :-) > Right, and in fact I had a check for that, rejecting smaller max-cpus-per-runqueue than siblings, where we validate also the other scheduling parameters. But indeed it's not there now, so I guess I removed it by mistake before sending. And now that I double check, I'm also missing documenting the new parameters. 
Thanks and Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-04-30 12:52 ` Jürgen Groß 2020-04-30 14:01 ` Dario Faggioli @ 2020-05-26 21:18 ` Dario Faggioli 2020-05-27 6:22 ` Jan Beulich 1 sibling, 1 reply; 21+ messages in thread From: Dario Faggioli @ 2020-05-26 21:18 UTC (permalink / raw) To: Jürgen Groß, xen-devel Cc: Andrew Cooper, George Dunlap, Jan Beulich, paul [-- Attachment #1: Type: text/plain, Size: 3077 bytes --] On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote: > On 30.04.20 14:28, Dario Faggioli wrote: > > That being said, I can try to make things a bit more fair, when > > CPUs > > come up and are added to the pool. Something around the line of > > adding > > them to the runqueue with the least number of CPUs in it (among the > > suitable ones, of course). > > > > With that, when the user removes 4 CPUs, we will have the 6 vs 10 > > situation. But we would make sure that, when she adds them back, we > > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something > > like > > that. > > > > Was something like this that you had in mind? And in any case, what > > do > > you think about it? > > Yes, this would be better already. > So, a couple of thoughts. Doing something like what I tried to describe above is not too bad, and I have it pretty much ready. 
With that, on an Intel system with 96 CPUs on two sockets, and
max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2
giant runqueues with 48 CPUs in each:

- 96 CPUs online, split into 6 runqueues (3 for each socket), with 16
  CPUs in each of them

I can also "tweak" it in such a way that, if one for instance boots
with "smt=no", we get to a point where we have:

- 48 CPUs online, split into 6 runqueues, with 8 CPUs in each

Now, I think this is good, and actually better than the current
situation where, on such a system, we only have two very big runqueues
(and let me repeat that introducing a per-LLC runqueue arrangement, on
which I'm also working, won't help in this case, as NUMA node == LLC).

So, the problem is that if one starts to fiddle with cpupools and CPU
on- and offlining, things can get pretty unbalanced. E.g., I can end up
with 2 runqueues on a socket, one with 16 CPUs and the other with just
a couple of them.

Now, this is all possible as of now (I mean, without this patch)
already, although at a different level. In fact, I can very well remove
all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue
with a lot of CPUs and another with just one.

It looks like we need a way to rebalance the runqueues, which should be
doable... But despite having spent a couple of days trying to come up
with something decent, that I could include in v2 of this series, I
couldn't get it to work sensibly.

So, this looks to me like an improvement that would itself need further
improvement, but I'm not sure we have the time for that right now
(Cc-ing Paul). Should we still go for what's ready? I think yes, but
I'd be interested in opinions... Also, if anyone has any clever ideas
on how to implement a mechanism that rebalances the CPUs among the
runqueues, I'm all ears and am up for trying to implement it.
:-) Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
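The figures reported above can be checked with two ceiling divisions: a socket's CPUs are split into as few runqueues as the cap allows, and then spread evenly across them. The sketch below is an illustration under that even-spread assumption, with hypothetical names, not the actual Xen code:

```c
#include <assert.h>

/* Ceiling division. */
static unsigned int div_round_up(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

/* How many runqueues a socket gets split into, given the per-runqueue cap. */
static unsigned int rqs_per_socket(unsigned int cpus_per_socket,
                                   unsigned int max_cpus_rq)
{
    return div_round_up(cpus_per_socket, max_cpus_rq);
}

/* CPUs per runqueue when a socket's CPUs are spread evenly. */
static unsigned int cpus_per_rq(unsigned int cpus_per_socket,
                                unsigned int max_cpus_rq)
{
    return div_round_up(cpus_per_socket,
                        rqs_per_socket(cpus_per_socket, max_cpus_rq));
}
```

48 CPUs per socket with a cap of 16 gives 3 runqueues of 16 per socket (6 of 16 on the two-socket box), and Jürgen's 20-CPU example gives 2 runqueues of 10. The "smt=no" figure (6 runqueues of 8) follows if the 3-runqueues-per-socket layout is kept from the full topology: ceil(24/3) = 8.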
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue 2020-05-26 21:18 ` Dario Faggioli @ 2020-05-27 6:22 ` Jan Beulich 2020-05-28 9:36 ` Dario Faggioli 0 siblings, 1 reply; 21+ messages in thread From: Jan Beulich @ 2020-05-27 6:22 UTC (permalink / raw) To: Dario Faggioli Cc: Jürgen Groß, xen-devel, paul, George Dunlap, Andrew Cooper On 26.05.2020 23:18, Dario Faggioli wrote: > On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote: >> On 30.04.20 14:28, Dario Faggioli wrote: >>> That being said, I can try to make things a bit more fair, when >>> CPUs >>> come up and are added to the pool. Something around the line of >>> adding >>> them to the runqueue with the least number of CPUs in it (among the >>> suitable ones, of course). >>> >>> With that, when the user removes 4 CPUs, we will have the 6 vs 10 >>> situation. But we would make sure that, when she adds them back, we >>> will go back to 10 vs. 10, instead than, say, 6 vs 14 or something >>> like >>> that. >>> >>> Was something like this that you had in mind? And in any case, what >>> do >>> you think about it? >> >> Yes, this would be better already. >> > So, a couple of thoughts. Doing something like what I tried to describe > above is not too bad, and I have it pretty much ready. 
> > With that, on an Intel system with 96 CPUs on two sockets, and > max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2 > giant runqueues with 48 CPUs in each: > > - 96 CPUs online, split in 6 runqueues (3 for each socket) with 16 > runqueues in each of them > > I can also "tweak" it in such a way that, if one for instance boots > with "smt=no", we get to a point where we have: > > - 48 CPUs online, split in 6 runqueues, with 8 CPUs in each > > Now, I think this is good, and actually better than the current > situation where on such a system, we only have two very big runqueues > (and let me repeat that introducing a per-LLC runqueue arrangement, on > which I'm also working, won't help in this case, as NUMA node == LLC). > > So, problem is that if one starts to fiddle with cpupools and cpu on > and offlining, things can get pretty unbalanced. E.g., I can end up > with 2 runqueues on a socket, one with 16 CPUs and the other with just > a couple of them. > > Now, this is all possible as of now (I mean, without this patch) > already, although at a different level. In fact, I can very well remove > all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue > with a lot of CPUs and another with just one. > > It looks like we need a way to rebalance the runqueues, which should be > doable... But despite having spent a couple of days trying to come up > with something decent, that I could include in v2 of this series, I > couldn't get it to work sensibly. CPU on-/offlining may not need considering here at all imo. I think it would be quite reasonable as a first step to have a static model where from system topology (and perhaps a command line option) one can predict which runqueue a particular CPU will end up in, no matter when it gets brought online. Jan ^ permalink raw reply [flat|nested] 21+ messages in thread
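The static model suggested above, where the runqueue of a CPU is a pure function of its topology position and the configured cap, could look like the following sketch. The names, the linear CPU numbering, and the even split are assumptions for illustration, not the actual Xen code:

```c
#include <assert.h>

/*
 * Hypothetical sketch of a static CPU-to-runqueue mapping: the answer
 * is the same no matter when the CPU is brought online, because it
 * depends only on the topology position and the per-runqueue cap.
 */
static unsigned int static_rq_of_cpu(unsigned int cpu,
                                     unsigned int cpus_per_socket,
                                     unsigned int max_cpus_rq)
{
    unsigned int socket = cpu / cpus_per_socket;
    unsigned int idx    = cpu % cpus_per_socket;   /* position within socket */
    unsigned int nr_rqs = (cpus_per_socket + max_cpus_rq - 1) / max_cpus_rq;
    unsigned int per_rq = (cpus_per_socket + nr_rqs - 1) / nr_rqs;

    return socket * nr_rqs + idx / per_rq;
}
```

On the 96-CPU, two-socket example with a cap of 16, CPUs 0-15 map to runqueue 0, CPUs 16-31 to runqueue 1, and so on up to CPUs 80-95 in runqueue 5. As long as sibling IDs are adjacent and per_rq is a multiple of the sibling count, hyperthread siblings land in the same runqueue.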
* Re: [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue
2020-05-27 6:22 ` Jan Beulich
@ 2020-05-28 9:36 ` Dario Faggioli
0 siblings, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-28 9:36 UTC (permalink / raw)
To: Jan Beulich
Cc: Jürgen Groß, xen-devel, paul, George Dunlap, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 1348 bytes --]

On Wed, 2020-05-27 at 08:22 +0200, Jan Beulich wrote:
> On 26.05.2020 23:18, Dario Faggioli wrote:
> >
> > It looks like we need a way to rebalance the runqueues, which
> > should be doable... But despite having spent a couple of days
> > trying to come up with something decent, that I could include in
> > v2 of this series, I couldn't get it to work sensibly.
>
> CPU on-/offlining may not need considering here at all imo. I think
> it would be quite reasonable as a first step to have a static model
> where from system topology (and perhaps a command line option) one
> can predict which runqueue a particular CPU will end up in, no
> matter when it gets brought online.
>
Right. IAC, just FYI, after talking to Juergen --who suggested a nice
solution to overcome the problem where I was stuck-- I now have a
patch that successfully implements dynamic online rebalancing of
runqueues.

I'm polishing up the comments and changelog and will send it ASAP, as
I'd really like for this series to go in... :-)

Thanks and Regards
--
Dario Faggioli, Ph.D http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
  2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
@ 2020-05-29 11:46 ` Dario Faggioli
  2020-05-29 15:06 ` Dario Faggioli
  2020-05-29 16:15 ` George Dunlap
  2 siblings, 2 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 11:46 UTC (permalink / raw)
To: xen-devel
Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich, Paul Durrant

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

So,

I felt like providing some additional thoughts about this series, from
a release point of view (adding Paul).

Timing is *beyond tight*, so if this series, entirely or partly, has
any chance to go in, it would be through some form of exception, which
of course comes with some risks, etc.

I did work hard to submit the full series, because I wanted people to
be able to see the complete solution. However, I think the series
itself can be logically split in two parts.

Basically, if we just consider patches 1 and 4, we will end up, right
after boot, with a system that has smaller runqueues. They will most
likely be balanced in terms of how many CPUs each one has, so a good
setup. This will likely (actual differences seem to depend *quite a
bit* on the actual workload) be an improvement for very large systems.

This is a path that will get a decent amount of testing in OSSTests,
from now until the day of the release, I think, because booting with
the default CPU configuration and setup is what most (all?) OSSTests
jobs do.

If the user starts to create pools, we can get to a point where the
different runqueues are unbalanced, i.e., each one has a different
number of CPUs in it, wrt the others.
This, however:

* can happen already, as of today, without these patches. Whether these
  patches may make things "better" or "worse", from this point of view,
  it's impossible to tell, because it actually depends on what CPUs the
  user moves among pools or puts offline, etc.

* this means that the scheduler has to deal with unbalanced runqueues
  anyway, and if it doesn't, it's a bug and, again, it seems to me that
  these patches don't make things particularly better or worse.

So, again, for patches 1 and 4 alone, it looks to me that we'd get
improvements, at least in some cases, the codepath will get some
testing, and they do not introduce additional or different issues than
what we have already right now. They are also at their second
iteration, as the original patch series was comprised of exactly those
two patches.

So, I think it would be interesting if these two patches were given a
chance, even just of getting some reviews... And I would be fine going
through the formal process necessary for making that happen myself.

Then, there's the rest, the runqueue rebalancing thing. As said above,
I personally would love it if we'd release with it, but I see one
rather big issue. In fact, such a mechanism is triggered and stressed
by the dynamic creation and manipulation of cpupools (and CPU
on/off-lining), and we don't have an OSSTests test for that. Therefore,
we are not in the best position for catching issues it may have
introduced.

I can commit to doing some testing myself, but it's not the same thing
as having them in our CI, I know that very well.

So, I'd be interested in hearing what others think about these other
patches as well, and I am happy to do my best to make sure that they
are working fine, if we decide to try to include them too, but I do see
this as much more of a risk myself.

So, any thoughts?
:-) Thanks and Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
@ 2020-05-29 15:06 ` Dario Faggioli
  2020-05-29 16:15 ` George Dunlap
  1 sibling, 0 replies; 21+ messages in thread
From: Dario Faggioli @ 2020-05-29 15:06 UTC (permalink / raw)
To: xen-devel
Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich, Paul Durrant

[-- Attachment #1: Type: text/plain, Size: 5283 bytes --]

On Fri, 2020-05-29 at 13:46 +0200, Dario Faggioli wrote:
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues.
>
Actually, to be fully precise, given how I reorganized the series, it's
not patches 1 and 4, it's patches 1, 3 and 4. Hopefully, that is not a
big deal, but if patch 3 really is a problem, I can re-arrange patch 4
to work without it.

Apart from this, and to add more information, on a system with 96 CPUs
in 2 sockets, this is how the runqueues look with these patches:

(XEN) Online Cpus: 0-95
(XEN) Cpupool 0:
(XEN)   Cpus: 0-95
(XEN)   Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN)   Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN)   Active queues: 6
(XEN)   default-weight = 256
(XEN)   Runqueue 0:
(XEN)     ncpus      = 16
(XEN)     cpus       = 0-15
(XEN)     max_weight = 256
(XEN)     pick_bias  = 0
(XEN)     instload   = 0
(XEN)     aveload    = 1223 (~0%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN)   Runqueue 1:
(XEN)     ncpus      = 16
(XEN)     cpus       = 16-31
(XEN)     max_weight = 256
(XEN)     pick_bias  = 16
(XEN)     instload   = 0
(XEN)     aveload    = 3324 (~1%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN)   Runqueue 2:
(XEN)     ncpus      = 16
(XEN)     cpus       = 32-47
(XEN)     max_weight = 256
(XEN)     pick_bias  = 32
(XEN)     instload   = 1
(XEN)     aveload    = 8996 (~3%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,00000000,0000feff,00000000
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,00000000,0000fcff,00000000
(XEN)   Runqueue 3:
(XEN)     ncpus      = 16
(XEN)     cpus       = 48-63
(XEN)     max_weight = 256
(XEN)     pick_bias  = 48
(XEN)     instload   = 0
(XEN)     aveload    = 2424 (~0%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN)   Runqueue 4:
(XEN)     ncpus      = 16
(XEN)     cpus       = 64-79
(XEN)     max_weight = 256
(XEN)     pick_bias  = 66
(XEN)     instload   = 0
(XEN)     aveload    = 1070 (~0%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN)   Runqueue 5:
(XEN)     ncpus      = 16
(XEN)     cpus       = 80-95
(XEN)     max_weight = 256
(XEN)     pick_bias  = 82
(XEN)     instload   = 0
(XEN)     aveload    = 425 (~0%)
(XEN)     idlers: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000
(XEN)     tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)     fully idle cores: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000

Without the patches, there would be just 2 of them (one with CPUs 0-47
and another with CPUs 48-95).

On a system with "just" 16 CPUs, in 2 sockets, they look like this:

(XEN) Online Cpus: 0-15
(XEN) Cpupool 0:
(XEN)   Cpus: 0-15
(XEN)   Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN)   Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN)   Active queues: 2
(XEN)   default-weight = 256
(XEN)   Runqueue 0:
(XEN)     ncpus      = 8
(XEN)     cpus       = 0-7
(XEN)     max_weight = 256
(XEN)     pick_bias  = 0
(XEN)     instload   = 0
(XEN)     aveload    = 7077 (~2%)
(XEN)     idlers: 00000000,000000ff
(XEN)     tickled: 00000000,00000000
(XEN)     fully idle cores: 00000000,000000ff
(XEN)   Runqueue 1:
(XEN)     ncpus      = 8
(XEN)     cpus       = 8-15
(XEN)     max_weight = 256
(XEN)     pick_bias  = 8
(XEN)     instload   = 1
(XEN)     aveload    = 11848 (~4%)
(XEN)     idlers: 00000000,0000fe00
(XEN)     tickled: 00000000,00000000
(XEN)     fully idle cores: 00000000,0000fc00

There are still 2, because there are 2 sockets, and we still honor the
topology (and 8 CPUs in a runqueue is fine, because it is lower than
16).

I'll share the same output on a 256 CPU system, as soon as I finish
installing it.

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread
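[Editorial note: the idlers/tickled/fully-idle-cores bitmaps in the dumps above are comma-separated 32-bit hex words, most significant word first. A small decoder like the following turns them back into CPU numbers; the format is inferred from the output itself, and this is not an official Xen tool.]

```python
def parse_xen_cpumask(mask):
    """Decode a Xen debug-key cpumask dump string into a set of CPU
    numbers. Assumes comma-separated 32-bit hex words, most
    significant word first (format inferred from the dumps above)."""
    cpus = set()
    # The last word covers CPUs 0-31, the one before it CPUs 32-63, ...
    for i, word in enumerate(reversed(mask.split(","))):
        value = int(word, 16)
        for bit in range(32):
            if value & (1 << bit):
                cpus.add(i * 32 + bit)
    return cpus

# Runqueue 0's idlers on the 16-CPU box: CPUs 0-7 are idle
print(sorted(parse_xen_cpumask("00000000,000000ff")))
# -> [0, 1, 2, 3, 4, 5, 6, 7]

# Runqueue 1's idlers: CPUs 9-15 (CPU 8 is busy, matching instload = 1)
print(sorted(parse_xen_cpumask("00000000,0000fe00")))
# -> [9, 10, 11, 12, 13, 14, 15]
```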
* Re: [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue
  2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
  2020-05-29 15:06 ` Dario Faggioli
@ 2020-05-29 16:15 ` George Dunlap
  1 sibling, 0 replies; 21+ messages in thread
From: George Dunlap @ 2020-05-29 16:15 UTC (permalink / raw)
To: Dario Faggioli
Cc: Juergen Gross, xen-devel, Paul Durrant, Jan Beulich, Andrew Cooper

> On May 29, 2020, at 12:46 PM, Dario Faggioli <dfaggioli@suse.com> wrote:
>
> So,
>
> I felt like providing some additional thoughts about this series,
> from a release point of view (adding Paul).
>
> Timing is *beyond tight* so if this series, entirely or partly, has
> any chance to go in, it would be through some form of exception,
> which of course comes with some risks, etc.
>
> I did work hard to submit the full series, because I wanted people to
> be able to see the complete solution. However, I think the series
> itself can be logically split in two parts.
>
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues. They will most
> likely be balanced in terms of how many CPUs each one has, so a good
> setup. This will likely (actual differences seems to depend *quite a
> bit* on the actual workload) be an improvement for very large
> systems.

Fundamentally, I feel like the reason we have the feature freeze is
exactly to avoid questions like this.

Something very much like patch 4 was posted before the last posting
date; patches 1-4 received R-b's before the feature freeze. I think
they should probably go in.

The rebalancing patches I'm inclined to say should wait until they've
had a bit more time to be thought about.

 -George

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2020-05-29 16:15 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-29 17:36 [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
2020-04-29 17:36 ` [PATCH 1/2] xen: credit2: factor cpu to runqueue matching in a function Dario Faggioli
2020-04-29 17:36 ` [PATCH 2/2] xen: credit2: limit the max number of CPUs in a runqueue Dario Faggioli
2020-04-30  6:45 ` Jan Beulich
2020-05-26 22:00 ` Dario Faggioli
2020-05-27  4:26 ` Jürgen Groß
2020-05-28  9:32 ` Dario Faggioli
2020-05-27  6:17 ` Jan Beulich
2020-05-28 14:55 ` Dario Faggioli
2020-05-29  9:58 ` Jan Beulich
2020-05-29 10:19 ` Dario Faggioli
2020-04-30  7:35 ` Jürgen Groß
2020-04-30 12:28 ` Dario Faggioli
2020-04-30 12:52 ` Jürgen Groß
2020-04-30 14:01 ` Dario Faggioli
2020-05-26 21:18 ` Dario Faggioli
2020-05-27  6:22 ` Jan Beulich
2020-05-28  9:36 ` Dario Faggioli
2020-05-29 11:46 ` [PATCH 0/2] xen: credit2: limit the number of CPUs per runqueue Dario Faggioli
2020-05-29 15:06 ` Dario Faggioli
2020-05-29 16:15 ` George Dunlap