* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2022-02-04 1:30 ` kernel test robot
2022-02-04 7:06 ` Srikar Dronamraju
2022-02-04 15:07 ` Nayak, KPrateek (K Prateek)
2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2022-02-04 1:30 UTC (permalink / raw)
To: kbuild-all
Hi Mel,
I love your patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.17-rc2 next-20220203]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ec2444530612a886b406e2830d7f314d1a07d4bb
config: parisc-randconfig-s031-20220131 (https://download.01.org/0day-ci/archive/20220204/202202040638.UP73q8Dl-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 11.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129
git checkout 81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc SHELL=/bin/bash kernel/sched/
If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *pd @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:461:19: sparse: expected struct perf_domain *pd
kernel/sched/topology.c:461:19: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:623:49: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:623:49: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:694:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:694:50: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:694:50: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:701:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@
kernel/sched/topology.c:701:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
kernel/sched/topology.c:701:55: sparse: got struct sched_domain *[assigned] tmp
kernel/sched/topology.c:711:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:711:29: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:711:29: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:716:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:716:20: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:716:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:737:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@
kernel/sched/topology.c:737:13: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:737:13: sparse: got struct sched_domain [noderef] __rcu *sd
kernel/sched/topology.c:899:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:899:70: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:899:70: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:928:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:928:59: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:928:59: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:974:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:974:57: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:974:57: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:976:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:976:25: sparse: expected struct sched_domain *sibling
kernel/sched/topology.c:976:25: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:984:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:984:55: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:984:55: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:986:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:986:25: sparse: expected struct sched_domain *sibling
kernel/sched/topology.c:986:25: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1056:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1056:62: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:1056:62: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1160:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1160:40: sparse: expected struct sched_domain *child
kernel/sched/topology.c:1160:40: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1571:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@
kernel/sched/topology.c:1571:43: sparse: expected struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1571:43: sparse: got struct sched_domain *child
kernel/sched/topology.c:2130:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@
kernel/sched/topology.c:2130:31: sparse: expected struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2130:31: sparse: got struct sched_domain *sd
kernel/sched/topology.c:2233:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2233:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2233:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2254:56: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:2254:56: sparse: expected struct sched_domain *child
kernel/sched/topology.c:2254:56: sparse: got struct sched_domain [noderef] __rcu *child
>> kernel/sched/topology.c:2284:39: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *top_p @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2284:39: sparse: expected struct sched_domain *top_p
kernel/sched/topology.c:2284:39: sparse: got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/topology.c:2286:45: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] top @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2286:45: sparse: expected struct sched_domain *[assigned] top
kernel/sched/topology.c:2286:45: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2287:47: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *top_p @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2287:47: sparse: expected struct sched_domain *top_p
kernel/sched/topology.c:2287:47: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2253:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2253:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2253:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2303:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2303:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2303:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c: note: in included file:
kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:929:31: sparse: sparse: dereference of noderef expression
kernel/sched/topology.c:1592:19: sparse: sparse: dereference of noderef expression
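Most of these warnings are pre-existing noise from topology.c walking __rcu-annotated parent pointers through plain pointers; the two marked >> are new and correspond to the top/top_p walk added by this patch (the arrowed lines 2284 and 2286 in the dump below). As a minimal standalone sketch (hypothetical type and function names, not the actual kernel code or a proposed fix), the pattern sparse objects to, and the rcu_dereference() form it accepts, look like:

#include <linux/rcupdate.h>

/* Hypothetical stand-in for sched_domain and its __rcu parent link. */
struct demo_domain {
	struct demo_domain __rcu *parent;
};

static void demo_walk(struct demo_domain *sd)
{
	struct demo_domain *top_p;

	top_p = sd->parent;			/* sparse: different address spaces */
	top_p = rcu_dereference(sd->parent);	/* annotated access; needs rcu_read_lock() */
	(void)top_p;
}

Whether these sites want rcu_dereference() or some other annotation depends on the locking context: build_sched_domains() constructs domains that are not yet published to readers.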
vim +2284 kernel/sched/topology.c
2186
2187 /*
2188 * Build sched domains for a given set of CPUs and attach the sched domains
2189 * to the individual CPUs
2190 */
2191 static int
2192 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
2193 {
2194 enum s_alloc alloc_state = sa_none;
2195 struct sched_domain *sd;
2196 struct s_data d;
2197 struct rq *rq = NULL;
2198 int i, ret = -ENOMEM;
2199 bool has_asym = false;
2200
2201 if (WARN_ON(cpumask_empty(cpu_map)))
2202 goto error;
2203
2204 alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
2205 if (alloc_state != sa_rootdomain)
2206 goto error;
2207
2208 /* Set up domains for CPUs specified by the cpu_map: */
2209 for_each_cpu(i, cpu_map) {
2210 struct sched_domain_topology_level *tl;
2211
2212 sd = NULL;
2213 for_each_sd_topology(tl) {
2214
2215 if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
2216 goto error;
2217
2218 sd = build_sched_domain(tl, cpu_map, attr, sd, i);
2219
2220 has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
2221
2222 if (tl == sched_domain_topology)
2223 *per_cpu_ptr(d.sd, i) = sd;
2224 if (tl->flags & SDTL_OVERLAP)
2225 sd->flags |= SD_OVERLAP;
2226 if (cpumask_equal(cpu_map, sched_domain_span(sd)))
2227 break;
2228 }
2229 }
2230
2231 /* Build the groups for the domains */
2232 for_each_cpu(i, cpu_map) {
2233 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
2234 sd->span_weight = cpumask_weight(sched_domain_span(sd));
2235 if (sd->flags & SD_OVERLAP) {
2236 if (build_overlap_sched_groups(sd, i))
2237 goto error;
2238 } else {
2239 if (build_sched_groups(sd, i))
2240 goto error;
2241 }
2242 }
2243 }
2244
2245 /*
2246 * Calculate an allowed NUMA imbalance such that LLCs do not get
2247 * imbalanced.
2248 */
2249 for_each_cpu(i, cpu_map) {
2250 unsigned int imb = 0;
2251 unsigned int imb_span = 1;
2252
2253 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
2254 struct sched_domain *child = sd->child;
2255
2256 if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
2257 (child->flags & SD_SHARE_PKG_RESOURCES)) {
2258 struct sched_domain *top, *top_p;
2259 unsigned int nr_llcs;
2260
2261 /*
2262 * For a single LLC per node, allow an
2263 * imbalance up to 25% of the node. This is an
2264 * arbitrary cutoff based on SMT-2 to balance
2265 * between memory bandwidth and avoiding
2266 * premature sharing of HT resources and SMT-4
2267 * or SMT-8 *may* benefit from a different
2268 * cutoff.
2269 *
2270 * For multiple LLCs, allow an imbalance
2271 * until multiple tasks would share an LLC
2272 * on one node while LLCs on another node
2273 * remain idle.
2274 */
2275 nr_llcs = sd->span_weight / child->span_weight;
2276 if (nr_llcs == 1)
2277 imb = sd->span_weight >> 2;
2278 else
2279 imb = nr_llcs;
2280 sd->imb_numa_nr = imb;
2281
2282 /* Set span based on the first NUMA domain. */
2283 top = sd;
> 2284 top_p = top->parent;
2285 while (top_p && !(top_p->flags & SD_NUMA)) {
> 2286 top = top->parent;
2287 top_p = top->parent;
2288 }
2289 imb_span = top_p ? top_p->span_weight : sd->span_weight;
2290 } else {
2291 int factor = max(1U, (sd->span_weight / imb_span));
2292
2293 sd->imb_numa_nr = imb * factor;
2294 }
2295 }
2296 }
2297
2298 /* Calculate CPU capacity for physical packages and nodes */
2299 for (i = nr_cpumask_bits-1; i >= 0; i--) {
2300 if (!cpumask_test_cpu(i, cpu_map))
2301 continue;
2302
2303 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
2304 claim_allocations(i, sd);
2305 init_sched_groups_capacity(i, sd);
2306 }
2307 }
2308
2309 /* Attach the domains */
2310 rcu_read_lock();
2311 for_each_cpu(i, cpu_map) {
2312 rq = cpu_rq(i);
2313 sd = *per_cpu_ptr(d.sd, i);
2314
2315 /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
2316 if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
2317 WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
2318
2319 cpu_attach_domain(sd, d.rd, i);
2320 }
2321 rcu_read_unlock();
2322
2323 if (has_asym)
2324 static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
2325
2326 if (rq && sched_debug_verbose) {
2327 pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
2328 cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
2329 }
2330
2331 ret = 0;
2332 error:
2333 __free_domain_allocs(&d, alloc_state, cpu_map);
2334
2335 return ret;
2336 }
2337
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2022-02-04 1:30 ` kernel test robot
@ 2022-02-04 7:06 ` Srikar Dronamraju
2022-02-04 9:04 ` Mel Gorman
2022-02-04 15:07 ` Nayak, KPrateek (K Prateek)
2 siblings, 1 reply; 8+ messages in thread
From: Srikar Dronamraju @ 2022-02-04 7:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML
* Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> }
> }
>
> + /*
> + * Calculate an allowed NUMA imbalance such that LLCs do not get
> + * imbalanced.
> + */
We seem to be adding this hunk before the sched domains may get degenerated.
Wondering if we really want to do it before degeneration.
Let's say we have 3 sched domains and we calculated sd->imb_numa_nr for
all 3 domains, and then the middle sched_domain gets degenerated.
Would the sd->imb_numa_nr values still be relevant?
> + for_each_cpu(i, cpu_map) {
> + unsigned int imb = 0;
> + unsigned int imb_span = 1;
> +
> + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *child = sd->child;
> +
> + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> + (child->flags & SD_SHARE_PKG_RESOURCES)) {
> + struct sched_domain *top, *top_p;
> + unsigned int nr_llcs;
> +
> + /*
> + * For a single LLC per node, allow an
> + * imbalance up to 25% of the node. This is an
> + * arbitrary cutoff based on SMT-2 to balance
> + * between memory bandwidth and avoiding
> + * premature sharing of HT resources and SMT-4
> + * or SMT-8 *may* benefit from a different
> + * cutoff.
> + *
> + * For multiple LLCs, allow an imbalance
> + * until multiple tasks would share an LLC
> + * on one node while LLCs on another node
> + * remain idle.
> + */
> + nr_llcs = sd->span_weight / child->span_weight;
> + if (nr_llcs == 1)
> + imb = sd->span_weight >> 2;
> + else
> + imb = nr_llcs;
> + sd->imb_numa_nr = imb;
> +
> + /* Set span based on the first NUMA domain. */
> + top = sd;
> + top_p = top->parent;
> + while (top_p && !(top_p->flags & SD_NUMA)) {
> + top = top->parent;
> + top_p = top->parent;
> + }
> + imb_span = top_p ? top_p->span_weight : sd->span_weight;
I am getting confused by imb_span.
Let's say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with the
SMT and MC domains having the SD_SHARE_PKG_RESOURCES flag set.
We come here only for the DIE domain.
The imb_span set here is then used for both subsequent sched domains,
which will most likely be NUMA domains. Right?
> + } else {
> + int factor = max(1U, (sd->span_weight / imb_span));
> +
> + sd->imb_numa_nr = imb * factor;
For SMT (or any sched domain below the LLC), factor would be
sd->span_weight, but imb and hence imb_numa_nr would be 0.
For NUMA (or any sched domain just above DIE), factor would be 1, so
sd->imb_numa_nr would be nr_llcs.
For subsequent sched_domains, sd->imb_numa_nr would be some multiple of
nr_llcs. Right?
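To make that concrete, here is a standalone userspace sketch that mimics the
hunk's arithmetic for this topology (hypothetical span_weights: SMT=2, MC=8,
DIE=16, NUMA=32, NUMA=64; not from any real machine):

#include <stdio.h>

struct dom {
	const char *name;
	unsigned int span;
	int share_pkg;	/* SD_SHARE_PKG_RESOURCES */
	int numa;	/* SD_NUMA */
};

int main(void)
{
	struct dom d[] = {
		{ "SMT",    2, 1, 0 },
		{ "MC",     8, 1, 0 },	/* LLC */
		{ "DIE",   16, 0, 0 },
		{ "NUMA0", 32, 0, 1 },
		{ "NUMA1", 64, 0, 1 },
	};
	unsigned int imb = 0, imb_span = 1;
	int i, n = 5;

	for (i = 0; i < n; i++) {
		struct dom *sd = &d[i];
		struct dom *child = i > 0 ? &d[i - 1] : NULL;
		unsigned int imb_numa_nr;

		if (!sd->share_pkg && child && child->share_pkg) {
			unsigned int nr_llcs = sd->span / child->span;
			int t = i + 1;

			imb = (nr_llcs == 1) ? sd->span >> 2 : nr_llcs;
			imb_numa_nr = imb;

			/* walk up to the first SD_NUMA parent, as the patch does */
			while (t < n && !d[t].numa)
				t++;
			imb_span = (t < n) ? d[t].span : sd->span;
		} else {
			unsigned int factor = sd->span / imb_span;

			if (factor < 1)
				factor = 1;
			imb_numa_nr = imb * factor;
		}
		printf("%-5s span=%2u imb_numa_nr=%u\n",
		       sd->name, sd->span, imb_numa_nr);
	}
	return 0;
}

With these numbers it prints imb_numa_nr = 0 for SMT and MC, 2 for DIE
(nr_llcs), 2 for the first NUMA level (factor = 32/32 = 1) and 4 for the
second (factor = 64/32 = 2), so the imb_span of 32 computed at DIE is indeed
reused by both NUMA levels.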
> + }
> + }
> + }
> +
> /* Calculate CPU capacity for physical packages and nodes */
> for (i = nr_cpumask_bits-1; i >= 0; i--) {
> if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2022-02-04 1:30 ` kernel test robot
2022-02-04 7:06 ` Srikar Dronamraju
@ 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek)
2022-02-04 16:45 ` Mel Gorman
2 siblings, 1 reply; 8+ messages in thread
From: Nayak, KPrateek (K Prateek) @ 2022-02-04 15:07 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
LKML
Hello Mel,
On 2/3/2022 8:16 PM, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
>
> 5.17.0-rc0 5.17.0-rc0
> vanilla sched-numaimb-v5
> MB/sec copy-16 162596.94 ( 0.00%) 501967.12 ( 208.72%)
> MB/sec scale-16 136901.28 ( 0.00%) 376531.50 ( 175.04%)
> MB/sec add-16 157300.70 ( 0.00%) 569997.42 ( 262.36%)
> MB/sec triad-16 151446.88 ( 0.00%) 553204.54 ( 265.28%)
>
> STREAM can use directives to force the spread if the OpenMP implementation
> is new enough, but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
>
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch:
>
> 5.17.0-rc0 5.17.0-rc0
> vanilla sched-numaimb-v5
> Min Score-16 368239.36 ( 0.00%) 400876.92 ( 8.86%)
> Hmean Score-16 388607.33 ( 0.00%) 441447.30 * 13.60%*
> Max Score-16 408945.69 ( 0.00%) 478826.87 ( 17.09%)
> Stddev Score-16 15247.04 ( 0.00%) 34061.76 (-123.40%)
> CoeffVar Score-16 3.92 ( 0.00%) 7.67 ( -95.82%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb, which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
>
> 5.17.0-rc0 5.17.0-rc0
> vanilla sched-numaimb-v5
> Hmean tput-1 71631.55 ( 0.00%) 70383.46 ( -1.74%)
> Hmean tput-8 582758.78 ( 0.00%) 607290.89 * 4.21%*
> Hmean tput-16 1020372.75 ( 0.00%) 1031257.25 ( 1.07%)
> Hmean tput-24 1416430.67 ( 0.00%) 1587576.33 * 12.08%*
> Hmean tput-32 1687702.72 ( 0.00%) 1724207.51 ( 2.16%)
> Hmean tput-40 1798094.90 ( 0.00%) 1983053.56 * 10.29%*
> Hmean tput-48 1972731.77 ( 0.00%) 2157461.70 ( 9.36%)
> Hmean tput-56 2386872.38 ( 0.00%) 2193237.42 ( -8.11%)
> Hmean tput-64 2536954.17 ( 0.00%) 2588741.08 ( 2.04%)
> Hmean tput-72 2585071.36 ( 0.00%) 2654776.36 ( 2.70%)
> Hmean tput-80 2960523.94 ( 0.00%) 2894657.12 ( -2.22%)
> Hmean tput-88 3061408.57 ( 0.00%) 2903167.72 ( -5.17%)
> Hmean tput-96 3052394.82 ( 0.00%) 2872605.46 ( -5.89%)
> Hmean tput-104 2997814.76 ( 0.00%) 3013660.26 ( 0.53%)
> Hmean tput-112 2955353.29 ( 0.00%) 3029122.16 ( 2.50%)
> Hmean tput-120 2889770.71 ( 0.00%) 2957739.88 ( 2.35%)
> Hmean tput-128 2871713.84 ( 0.00%) 2912410.18 ( 1.42%)
>
> In general, the standard deviation figures also are a lot more
> stable.
>
> Similarly, for embarrassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLCs when the machine is not
> fully utilised.
>
> vanilla sched-numaimb-v5r12
> Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%)
> Amean ep.D 31.86 ( 0.00%) 26.26 * 17.58%*
> Stddev ep.D 0.07 ( 0.00%) 0.18 (-157.54%)
> CoeffVar ep.D 0.22 ( 0.00%) 0.69 (-212.46%)
> Max ep.D 31.93 ( 0.00%) 26.46 ( 17.13%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c | 22 +++++++-------
> kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++
> 3 files changed, 66 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
> unsigned int busy_factor; /* less balancing by factor if busy */
> unsigned int imbalance_pct; /* No balance until over watermark */
> unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
> + unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */
>
> int nohz_idle; /* NOHZ IDLE status */
> int flags; /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..86abf97a8df6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
> int src_cpu, src_nid;
> int dst_cpu, dst_nid;
> + int imb_numa_nr;
>
> struct numa_stats src_stats, dst_stats;
>
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
> static unsigned long cpu_load(struct rq *rq);
> static unsigned long cpu_runnable(struct rq *rq);
> static inline long adjust_numa_imbalance(int imbalance,
> - int dst_running, int dst_weight);
> + int dst_running, int imb_numa_nr);
>
> static inline enum
> numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> dst_running = env->dst_stats.nr_running + 1;
> imbalance = max(0, dst_running - src_running);
> imbalance = adjust_numa_imbalance(imbalance, dst_running,
> - env->dst_stats.weight);
> + env->imb_numa_nr);
>
> /* Use idle CPU if there is no imbalance */
> if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
> */
> rcu_read_lock();
> sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> - if (sd)
> + if (sd) {
> env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> + env.imb_numa_nr = sd->imb_numa_nr;
> + }
> rcu_read_unlock();
>
> /*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> * This is an approximation as the number of running tasks may not be
> * related to the number of busy CPUs due to sched_setaffinity.
> */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
> {
> - return (running < (weight >> 2));
> + return running < imb_numa_nr;
> }
>
> /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * allowed. If there is a real need of migration,
> * periodic load balance will take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
Could you please clarify why we are adding 1 to local_sgs.sum_nr_running while allowing the imbalance?
allow_numa_imbalance allows the imbalance based on the following inequality:
running < imb_numa_nr
Consider a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
Assume the group is running 7 tasks and we are finding the idlest group for the 8th task:
sd->imb_numa_nr = 8
local_sgs.sum_nr_running = 7
In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr, and if we allow the NUMA imbalance
and place the task in the same group, each task can be given one LLC.
However, allow_numa_imbalance returns 0 for the above case, which can lead to the task being placed on a
different NUMA group.
In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
used the inequality "<=" to allow the NUMA imbalance, where the additional load the CPU had to bear needed
to be considered. However, that doesn't seem to be the case here.
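A minimal sketch of the boundary being discussed (the _v4/_v5 names are illustrative, with
imb_numa_nr = 8 for the hypothetical 8-LLC node above):

/* With 7 tasks running, the new task makes running = 7 + 1 = 8: */
static inline int allow_numa_imbalance_v5(int running, int imb_numa_nr)
{
	return running < imb_numa_nr;	/* 8 < 8 -> false: imbalance not allowed */
}

static inline int allow_numa_imbalance_v4(int running, int imb_numa_nr)
{
	return running <= imb_numa_nr;	/* 8 <= 8 -> true: the 8th task may stay */
}

So with "<" the eighth task is spread away even though a free LLC remains on the local node.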
> return NULL;
> }
>
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> #define NUMA_IMBALANCE_MIN 2
>
> static inline long adjust_numa_imbalance(int imbalance,
> - int dst_running, int dst_weight)
> + int dst_running, int imb_numa_nr)
> {
> - if (!allow_numa_imbalance(dst_running, dst_weight))
> + if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> return imbalance;
>
> /*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> /* Consider allowing a small imbalance between NUMA groups */
> if (env->sd->flags & SD_NUMA) {
> env->imbalance = adjust_numa_imbalance(env->imbalance,
> - local->sum_nr_running + 1, local->group_weight);
> + local->sum_nr_running + 1, env->sd->imb_numa_nr);
> }
>
> return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> }
> }
>
> + /*
> + * Calculate an allowed NUMA imbalance such that LLCs do not get
> + * imbalanced.
> + */
> + for_each_cpu(i, cpu_map) {
> + unsigned int imb = 0;
> + unsigned int imb_span = 1;
> +
> + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *child = sd->child;
> +
> + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> + (child->flags & SD_SHARE_PKG_RESOURCES)) {
> + struct sched_domain *top, *top_p;
> + unsigned int nr_llcs;
> +
> + /*
> + * For a single LLC per node, allow an
> + * imbalance up to 25% of the node. This is an
> + * arbitrary cutoff based on SMT-2 to balance
> + * between memory bandwidth and avoiding
> + * premature sharing of HT resources and SMT-4
> + * or SMT-8 *may* benefit from a different
> + * cutoff.
> + *
> + * For multiple LLCs, allow an imbalance
> + * until multiple tasks would share an LLC
> + * on one node while LLCs on another node
> + * remain idle.
> + */
To add to my point above, the comment here says -
"allow an imbalance until multiple tasks would share an LLC on one node"
whereas, in the case I highlighted above, we see balancing kick in with possibly
one LLC still unaccounted for.
> + nr_llcs = sd->span_weight / child->span_weight;
> + if (nr_llcs == 1)
> + imb = sd->span_weight >> 2;
> + else
> + imb = nr_llcs;
> + sd->imb_numa_nr = imb;
> +
> + /* Set span based on the first NUMA domain. */
> + top = sd;
> + top_p = top->parent;
> + while (top_p && !(top_p->flags & SD_NUMA)) {
> + top = top->parent;
> + top_p = top->parent;
> + }
> + imb_span = top_p ? top_p->span_weight : sd->span_weight;
> + } else {
> + int factor = max(1U, (sd->span_weight / imb_span));
> +
> + sd->imb_numa_nr = imb * factor;
> + }
> + }
> + }
> +
> /* Calculate CPU capacity for physical packages and nodes */
> for (i = nr_cpumask_bits-1; i >= 0; i--) {
> if (!cpumask_test_cpu(i, cpu_map))
Please correct me if I'm wrong.
Thanks and Regards
Prateek