* [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
@ 2019-12-20  8:42 Mel Gorman
  2019-12-20 12:40 ` Valentin Schneider
  ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Mel Gorman @ 2019-12-20  8:42 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
    quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton, parth,
    riel, LKML, Mel Gorman

Changelog since V1
o Alter code flow						vincent.guittot
o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
o Note that the division is still in place. Without it and taking
  imbalance_adj into account before the cutoff, two NUMA domains do not
  converge as being equally balanced when the number of busy tasks
  equals the size of one domain (50% of the sum). Some data is in the
  changelog.

The CPU load balancer balances between different domains to spread load
and strives for equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other, but these
decisions are independent. On a lightly loaded NUMA machine, two
communicating tasks pulled together at wakeup time can be pushed apart
by the load balancer. In isolation, the load balancer decision is fine,
but it ignores the tasks' data locality, and the wakeup/LB paths
continually conflict. NUMA balancing is also a factor, but it simply
conflicts with the load balancer as well.

This patch allows a degree of imbalance to exist between NUMA domains
based on the imbalance_pct defined by the scheduler domain. This slight
imbalance is allowed until the scheduler domain reaches almost 50%
utilisation, at which point other factors like HT utilisation and memory
bandwidth come into play. While not commented upon in the code, the
cutoff is important for memory-bound, parallelised, non-communicating
workloads that do not fully utilise the entire machine.
This is not necessarily the best universal cut-off point, but it
appeared appropriate for a variety of workloads and machines. The most
obvious impact is on netperf TCP_STREAM -- two simple communicating
tasks with some softirq offloaded depending on the transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                          baseline            lbnuma-v1
Hmean     64      666.68 (   0.00%)      667.31 (   0.09%)
Hmean     128    1276.18 (   0.00%)     1288.92 *   1.00%*
Hmean     256    2366.78 (   0.00%)     2422.22 *   2.34%*
Hmean     1024   8123.94 (   0.00%)     8464.15 *   4.19%*
Hmean     2048  12962.45 (   0.00%)    13693.79 *   5.64%*
Hmean     3312  17709.24 (   0.00%)    17494.23 (  -1.21%)
Hmean     4096  19756.01 (   0.00%)    19472.58 (  -1.43%)
Hmean     8192  27469.59 (   0.00%)    27787.32 (   1.16%)
Hmean     16384 30062.82 (   0.00%)    30657.62 *   1.98%*
Stddev    64        2.64 (   0.00%)        2.09 (  20.76%)
Stddev    128       6.22 (   0.00%)        6.48 (  -4.28%)
Stddev    256       9.75 (   0.00%)       22.85 (-134.30%)
Stddev    1024     69.62 (   0.00%)       58.41 (  16.11%)
Stddev    2048     72.73 (   0.00%)       83.47 ( -14.77%)
Stddev    3312    412.35 (   0.00%)       75.77 (  81.63%)
Stddev    4096    345.02 (   0.00%)      297.01 (  13.91%)
Stddev    8192    280.09 (   0.00%)      485.36 ( -73.29%)
Stddev    16384   452.99 (   0.00%)      250.21 (  44.76%)

Fairly small impact on average performance, but note how much the
standard deviation is reduced in many cases. A clearer story is visible
from the NUMA balancing stats

Ops NUMA base-page range updates    21596.00     282.00
Ops NUMA PTE updates                21596.00     282.00
Ops NUMA PMD updates                    0.00       0.00
Ops NUMA hint faults                17786.00     137.00
Ops NUMA hint local faults %         9916.00     137.00
Ops NUMA hint local percent            55.75     100.00
Ops NUMA pages migrated              4231.00       0.00

Without the patch, only 55.75% of sampled accesses are local. With the
patch, 100% of sampled accesses are local. A 2-socket Broadwell showed
better results on average, but they are not presented for brevity.
The patch holds up for 4-socket boxes as well.

4-socket Haswell machine, 144 core, HT enabled
netperf-tcp
                          baseline            lbnuma-v1
Hmean     64      953.51 (   0.00%)      977.27 *   2.49%*
Hmean     128    1826.48 (   0.00%)     1863.37 *   2.02%*
Hmean     256    3295.19 (   0.00%)     3329.37 (   1.04%)
Hmean     1024  10915.40 (   0.00%)    11339.60 *   3.89%*
Hmean     2048  17833.82 (   0.00%)    19066.12 *   6.91%*
Hmean     3312  22690.72 (   0.00%)    24048.92 *   5.99%*
Hmean     4096  24422.23 (   0.00%)    26606.60 *   8.94%*
Hmean     8192  31250.11 (   0.00%)    33374.62 *   6.80%*
Hmean     16384 37033.70 (   0.00%)    38684.28 *   4.46%*
Hmean     16384 37033.70 (   0.00%)    38732.22 *   4.59%*

On this machine, the baseline measured 58.11% locality for sampled
accesses; with the patch, accesses were 100% local. Similarly, the patch
holds up for 2-socket machines with multiple L3 caches such as the AMD
Epyc 2.

2-socket EPYC-2 machine, 256 cores
netperf-tcp
Hmean     64     1564.63 (   0.00%)     1550.59 (  -0.90%)
Hmean     128    3028.83 (   0.00%)     3030.48 (   0.05%)
Hmean     256    5733.47 (   0.00%)     5769.51 (   0.63%)
Hmean     1024  18936.04 (   0.00%)    19216.15 *   1.48%*
Hmean     2048  27589.77 (   0.00%)    28200.45 *   2.21%*
Hmean     3312  35361.97 (   0.00%)    35881.94 *   1.47%*
Hmean     4096  37965.59 (   0.00%)    38702.01 *   1.94%*
Hmean     8192  48499.92 (   0.00%)    49530.62 *   2.13%*
Hmean     16384 54249.96 (   0.00%)    55937.24 *   3.11%*

For amusement purposes, here are two graphs showing CPU utilisation on
the 2-socket Haswell machine over time based on mpstat with the ordering
of the CPUs based on topology.

http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png
http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png

The lines on the left match up CPUs that are HT siblings or on the same
node. The machine has only one L3 cache per NUMA node or that would also
be shown. It should be very clear from the images that the baseline
kernel spread the load with lighter utilisation across nodes, while the
patched kernel had heavy utilisation of fewer CPUs on one node.
Hackbench generally shows good results across machines with some
differences depending on whether threads or processes are used as well
as pipes or sockets. This is the *worst* result from the 2-socket
Haswell machine.

2-socket Haswell machine 48 core, HT enabled
hackbench-process-pipes -- mmtests config config-scheduler-unbound
                       5.5.0-rc1             5.5.0-rc1
                        baseline             lbnuma-v1
Amean     1        1.2580 (   0.00%)      1.2393 (   1.48%)
Amean     4        5.3293 (   0.00%)      5.2683 *   1.14%*
Amean     7        8.9067 (   0.00%)      8.7130 *   2.17%*
Amean     12      14.9577 (   0.00%)     14.5773 *   2.54%*
Amean     21      25.9570 (   0.00%)     25.6657 *   1.12%*
Amean     30      37.7287 (   0.00%)     37.1277 *   1.59%*
Amean     48      61.6757 (   0.00%)     60.0433 *   2.65%*
Amean     79     100.4740 (   0.00%)     98.4507 (   2.01%)
Amean     110    141.2450 (   0.00%)    136.8900 *   3.08%*
Amean     141    179.7747 (   0.00%)    174.5110 *   2.93%*
Amean     172    221.0700 (   0.00%)    214.7857 *   2.84%*
Amean     192    245.2007 (   0.00%)    238.3680 *   2.79%*

An earlier prototype of the patch showed major regressions for NAS
C-class when running with only half of the available CPUs -- 20-30%
performance hits were measured at the time. With this version of the
patch, the impact is marginal. In this case, the patch is lbnuma-v2,
whereas nodivide is a patch discussed during review that avoids a divide
by putting the cutoff at exactly 50% instead of accounting for
imbalance_adj.
NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half
                  baseline              nodivide               lbnuma-v1
Amean  bt.C   64.29 (   0.00%)   76.33 * -18.72%*   69.55 *  -8.17%*
Amean  cg.C   26.33 (   0.00%)   26.26 (   0.27%)   26.36 (  -0.11%)
Amean  ep.C   10.26 (   0.00%)   10.29 (  -0.31%)   10.26 (  -0.04%)
Amean  ft.C   17.98 (   0.00%)   19.73 *  -9.71%*   19.51 *  -8.52%*
Amean  is.C    0.99 (   0.00%)    0.99 (   0.40%)    0.99 (   0.00%)
Amean  lu.C   51.72 (   0.00%)   48.57 (   6.09%)   48.68 *   5.88%*
Amean  mg.C    8.12 (   0.00%)    8.27 (  -1.82%)    8.24 (  -1.50%)
Amean  sp.C   82.76 (   0.00%)   86.06 *  -3.99%*   83.42 (  -0.80%)
Amean  ua.C   58.64 (   0.00%)   57.66 (   1.67%)   57.79 (   1.45%)

There is some impact, but there is a degree of variability, and the ones
showing impact are mainly workloads that are mostly parallelised and
communicate infrequently between tests. It's a corner case where the
workload benefits heavily from spreading wide and early, which is not
common. This is intended to illustrate the worst case measured.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations when a machine is lightly loaded but shows benefits for other
workloads. While tests are still running, so far it seems to benefit
light-utilisation smaller workloads on large machines and does not
appear to do any harm to larger or parallelised workloads.

[valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..60a780e1420e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			return;
 		}
 
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. It's calculated
+			 * relative to imbalance_pct with a minimum of two
+			 * tasks or idle CPUs. The choice of two is due to
+			 * the most basic case of two communicating tasks
+			 * that should remain on the same NUMA node after
+			 * wakeup.
+			 */
+			imbalance_adj = max(2U, (busiest->group_weight *
+				(env->sd->imbalance_pct - 100) / 100) >> 1);
+
+			/*
+			 * Ignore small imbalances unless the busiest sd has
+			 * almost half as many busy CPUs as there are
+			 * available CPUs in the busiest group. Note that
+			 * it is not exactly half as imbalance_adj must be
+			 * accounted for or the two domains do not converge
+			 * as equally balanced if the number of busy tasks is
+			 * roughly the size of one NUMA domain.
+			 */
+			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
+			if (env->imbalance <= imbalance_adj &&
+			    busiest->idle_cpus >= imbalance_max) {
+				env->imbalance = 0;
+				return;
+			}
+		}
+
 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
 			unsigned int nr_diff = busiest->sum_nr_running;
 			/*

^ permalink raw reply related	[flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-20 8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman @ 2019-12-20 12:40 ` Valentin Schneider 2019-12-20 14:22 ` Mel Gorman 2019-12-21 11:25 ` Mel Gorman 2019-12-22 12:00 ` Srikar Dronamraju 2019-12-23 13:31 ` Vincent Guittot 2 siblings, 2 replies; 28+ messages in thread From: Valentin Schneider @ 2019-12-20 12:40 UTC (permalink / raw) To: Mel Gorman, Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML On 20/12/2019 08:42, Mel Gorman wrote: > In general, the patch simply seeks to avoid unnecessarily cross-node > migrations when a machine is lightly loaded but shows benefits for other > workloads. While tests are still running, so far it seems to benefit > light-utilisation smaller workloads on large machines and does not appear > to do any harm to larger or parallelised workloads. > > [valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus] I think only the comment bit is still there in this version and it's not really worth mentioning (but I do thank you for doing it!). > @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > return; > } > > + /* Consider allowing a small imbalance between NUMA groups */ > + if (env->sd->flags & SD_NUMA) { > + unsigned int imbalance_adj, imbalance_max; > + > + /* > + * imbalance_adj is the allowable degree of imbalance > + * to exist between two NUMA domains. It's calculated > + * relative to imbalance_pct with a minimum of two > + * tasks or idle CPUs. The choice of two is due to > + * the most basic case of two communicating tasks > + * that should remain on the same NUMA node after > + * wakeup. 
> + */ > + imbalance_adj = max(2U, (busiest->group_weight * > + (env->sd->imbalance_pct - 100) / 100) >> 1); > + > + /* > + * Ignore small imbalances unless the busiest sd has > + * almost half as many busy CPUs as there are > + * available CPUs in the busiest group. Note that This is all on the busiest group, so this should be more like: * Ignore small imbalances unless almost half of the * busiest sg's CPUs are busy. > + * it is not exactly half as imbalance_adj must be > + * accounted for or the two domains do not converge > + * as equally balanced if the number of busy tasks is > + * roughly the size of one NUMA domain. > + */ > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > + if (env->imbalance <= imbalance_adj && I'm confused now, have we set env->imbalance to anything at this point? AIUI Vincent's suggestion was to hinge this purely on the weight vs idle_cpus / nr_running, IOW not use imbalance: if (sd->flags & SD_NUMA) { imbalance_adj = max(2U, (busiest->group_weight * (env->sd->imbalance_pct - 100) / 100) >> 1); imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; if (busiest->idle_cpus >= imbalance_max) { env->imbalance = 0; return; } } Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather use the number of runnable tasks. We are setting up a threshold for how far we are willing to ignore imbalances; if we have overloaded CPUs we *really* should try to solve this. Number of tasks is the safer option IMO: when we do have one task per CPU, it'll be the same as if we had used idle_cpus, and when we don't have one task per CPU we'll load-balance more often that we would have with idle_cpus. > + busiest->idle_cpus >= imbalance_max) { > + env->imbalance = 0; > + return; > + } > + } > + > if (busiest->group_weight == 1 || sds->prefer_sibling) { > unsigned int nr_diff = busiest->sum_nr_running; > /* > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-20 12:40 ` Valentin Schneider @ 2019-12-20 14:22 ` Mel Gorman 2019-12-20 15:32 ` Valentin Schneider 2019-12-21 11:25 ` Mel Gorman 1 sibling, 1 reply; 28+ messages in thread From: Mel Gorman @ 2019-12-20 14:22 UTC (permalink / raw) To: Valentin Schneider Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote: > > + /* Consider allowing a small imbalance between NUMA groups */ > > + if (env->sd->flags & SD_NUMA) { > > + unsigned int imbalance_adj, imbalance_max; > > + > > + /* > > + * imbalance_adj is the allowable degree of imbalance > > + * to exist between two NUMA domains. It's calculated > > + * relative to imbalance_pct with a minimum of two > > + * tasks or idle CPUs. The choice of two is due to > > + * the most basic case of two communicating tasks > > + * that should remain on the same NUMA node after > > + * wakeup. > > + */ > > + imbalance_adj = max(2U, (busiest->group_weight * > > + (env->sd->imbalance_pct - 100) / 100) >> 1); > > + > > + /* > > + * Ignore small imbalances unless the busiest sd has > > + * almost half as many busy CPUs as there are > > + * available CPUs in the busiest group. Note that > > This is all on the busiest group, so this should be more like: > > * Ignore small imbalances unless almost half of the > * busiest sg's CPUs are busy. > Updated. > > + * it is not exactly half as imbalance_adj must be > > + * accounted for or the two domains do not converge > > + * as equally balanced if the number of busy tasks is > > + * roughly the size of one NUMA domain. > > + */ > > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > + if (env->imbalance <= imbalance_adj && > > I'm confused now, have we set env->imbalance to anything at this point? 
No, it's a no-op in this context and I meant to take the check out.

> AIUI Vincent's suggestion was to hinge this purely on the weight vs
> idle_cpus / nr_running, IOW not use imbalance:
>
>   if (sd->flags & SD_NUMA) {
>           imbalance_adj = max(2U, (busiest->group_weight *
>                           (env->sd->imbalance_pct - 100) / 100) >> 1);
>           imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
>
>           if (busiest->idle_cpus >= imbalance_max) {
>                   env->imbalance = 0;
>                   return;
>           }
>   }
>

And that's what it is now. Functionally it's similar, other than
env->imbalance could potentially be anything (but should be zero).

> Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather
> use the number of runnable tasks. We are setting up a threshold for how
> far we are willing to ignore imbalances; if we have overloaded CPUs we
> *really* should try to solve this. Number of tasks is the safer option
> IMO: when we do have one task per CPU, it'll be the same as if we had
> used idle_cpus, and when we don't have one task per CPU we'll
> load-balance more often than we would have with idle_cpus.
>

I couldn't convince myself to really push back hard on sum_nr_running
versus idle_cpus. If the local group has spare capacity and the busiest
group has multiple tasks stacked on CPUs, then it's most likely due to
CPU affinity. In that case, there is no guarantee tasks can move to the
local group either, and the difference between sum_nr_running and
idle_cpus is almost moot. There may be common use cases where the
distinction really matters but right now, I think such a change could be
a separate patch with the use case included and supporting data on why
it must be sum_nr_running. Right now, I feel it's mostly a cosmetic
issue given the context and intent of the patch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-20 14:22 ` Mel Gorman @ 2019-12-20 15:32 ` Valentin Schneider 0 siblings, 0 replies; 28+ messages in thread From: Valentin Schneider @ 2019-12-20 15:32 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML On 20/12/2019 14:22, Mel Gorman wrote: >> Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather use >> the number of runnable tasks. We are setting up a threshold for how far we >> are willing to ignore imbalances; if we have overloaded CPUs we *really* >> should try to solve this. Number of tasks is the safer option IMO: when we >> do have one task per CPU, it'll be the same as if we had used idle_cpus, and >> when we don't have one task per CPU we'll load-balance more often that we >> would have with idle_cpus. >> > > I couldn't convince myself to really push back hard on the sum_nr_runnable > versus idle_cpus. If the local group has spare capacity and the busiest > group has multiple tasks stacked on CPUs then it's most likely due to > CPU affinity. Not necessarily, for instance wakeup balancing (select_idle_sibling()) could end up packing stuff within a node if said node spans more than one LLC domain, which IIRC is the case on some AMD chips. Or, still with the same LLC < node topology, you could start with the node being completely utilized, then some tasks on some LLC domains terminate but there's an LLC that still has a bunch of tasks running, and then you're left with an imbalance between LLC domains that the wakeup balance cannot solve. > In that case, there is no guarantee tasks can move to the > local group either. In that case, the difference between sum_nr_running > and idle_cpus is almost moot. 
There may be common use cases where the > distinction really matters but right now, I'm at the point where I think > such a change could be a separate patch with the use case included and > supporting data on why it must be sum_nr_running. Right now, I feel it's > mostly a cosmetic issue given the context and intent of the patch. > Let me spin it this way: do we need to push this ignoring of the imbalance as far as possible, or are we okay with it only happening when there's just a few tasks running? The latter is achieved with sum_nr_running and is the safer option IMO. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-20 12:40 ` Valentin Schneider 2019-12-20 14:22 ` Mel Gorman @ 2019-12-21 11:25 ` Mel Gorman 1 sibling, 0 replies; 28+ messages in thread From: Mel Gorman @ 2019-12-21 11:25 UTC (permalink / raw) To: Valentin Schneider Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote: > > + * it is not exactly half as imbalance_adj must be > > + * accounted for or the two domains do not converge > > + * as equally balanced if the number of busy tasks is > > + * roughly the size of one NUMA domain. > > + */ > > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > + if (env->imbalance <= imbalance_adj && > > I'm confused now, have we set env->imbalance to anything at this point? AIUI > Vincent's suggestion was to hinge this purely on the weight vs idle_cpus / > nr_running, IOW not use imbalance: > > if (sd->flags & SD_NUMA) { > imbalance_adj = max(2U, (busiest->group_weight * > (env->sd->imbalance_pct - 100) / 100) >> 1); > imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > if (busiest->idle_cpus >= imbalance_max) { > env->imbalance = 0; > return; > } > } > Ok, I tried this just in case and it does not work out when the number of CPUs per NUMA node gets very large. It allows very large imbalances between the number of CPUs calculated based on imbalance_pct and based on group_weight >> 1 and caused some very large regressions on a modern AMD machine. The version I had that calculated an imbalance first and decided whether to ignore it still has been consistently the best approach across multiple machines and workloads. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
  2019-12-20 12:40 ` Valentin Schneider
@ 2019-12-22 12:00 ` Srikar Dronamraju
  2019-12-23 13:31 ` Vincent Guittot
  2 siblings, 0 replies; 28+ messages in thread
From: Srikar Dronamraju @ 2019-12-22 12:00 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
    valentin.schneider, quentin.perret, dietmar.eggemann,
    Morten.Rasmussen, hdanton, parth, riel, LKML

Hi Mel,

* Mel Gorman <mgorman@techsingularity.net> [2019-12-20 08:42:52]:

> Changelog since V1
> o Alter code flow					vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
>   Some data is in the changelog.

I still plan to run SpecJBB on the same setup, but for now I have a few
perf bench test results. The setup and results are below. We might
notice that, except for numa01, in all other cases the system was not
fully loaded.

Summary: We seem to be regressing in all the 5 situations I tried.
lscpu o/p
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           32
NUMA node(s):        8
Model:               2.1 (pvr 004b 0201)
Model name:          POWER8 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
NUMA node2 CPU(s):   64-95
NUMA node3 CPU(s):   96-127
NUMA node4 CPU(s):   128-159
NUMA node5 CPU(s):   160-191
NUMA node6 CPU(s):   192-223
NUMA node7 CPU(s):   224-255

numa01 is a set of 2 process each running 128 threads; each thread doing
50 loops on 3GB process shared memory operations.

numa02 is a single process with 32 threads; each thread doing 800 loops
on 32MB thread local memory operations.

numa03 is a single process with 32 threads; each thread doing 50 loops
on 3GB process shared memory operations.

numa04 is a set of 8 process (as many nodes) each running 4 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa05 is a set of 16 process (twice as many nodes) each running 2
threads; each thread doing 50 loops on 3GB process shared memory
operations.
(Since we are measuring duration in seconds, lower is better; a -ve
%Change means duration has increased.)

v5.5-rc2
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      479.46      501.07      490.26       10.80
numa01.sh       Sys:       79.74      128.24      103.99       24.25
numa01.sh      User:   118577.01   121002.82   119789.92     1212.90
numa02.sh      Real:       12.53       14.80       13.66        1.14
numa02.sh       Sys:        0.89        1.49        1.19        0.30
numa02.sh      User:      318.74      354.86      336.80       18.06
numa03.sh      Real:       80.19       83.64       81.91        1.73
numa03.sh       Sys:        3.87        3.91        3.89        0.02
numa03.sh      User:     2501.34     2574.68     2538.01       36.67
numa04.sh      Real:       59.62       65.79       62.70        3.09
numa04.sh       Sys:       26.99       29.36       28.17        1.19
numa04.sh      User:     1637.57     1657.55     1647.56        9.99
numa05.sh      Real:       58.03       60.54       59.28        1.26
numa05.sh       Sys:       66.01       66.48       66.25        0.24
numa05.sh      User:     1522.61     1526.41     1524.51        1.90

v5.5-rc2 + patch
Testcase       Time:         Min         Max         Avg      StdDev     %Change
numa01.sh      Real:      518.80      526.75      522.77        3.97    -6.2188%
numa01.sh       Sys:      133.63      154.35      143.99       10.36   -27.7797%
numa01.sh      User:   123964.16   127670.38   125817.27     1853.11   -4.79056%
numa02.sh      Real:       23.70       24.58       24.14        0.44   -43.4134%
numa02.sh       Sys:        1.66        1.99        1.82        0.17   -34.6154%
numa02.sh      User:      525.45      550.45      537.95       12.50    -37.392%
numa03.sh      Real:      111.58      139.86      125.72       14.14   -34.8473%
numa03.sh       Sys:        4.32        4.34        4.33        0.01   -10.1617%
numa03.sh      User:     2837.44     3581.76     3209.60      372.16   -20.9244%
numa04.sh      Real:      102.01      110.14      106.08        4.06   -40.8937%
numa04.sh       Sys:       41.62       41.63       41.62        0.01   -32.3162%
numa04.sh      User:     2402.66     2465.31     2433.98       31.33     -32.31%
numa05.sh      Real:       97.96      109.59      103.78        5.81   -42.8792%
numa05.sh       Sys:       76.57       84.83       80.70        4.13   -17.9058%
numa05.sh      User:     2196.67     2689.63     2443.15      246.48   -37.6006%

vmstat counters (a +ve %Change means hits have increased)

vmstat counters for numa01
param                     v5.5-rc2  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults           2075794     2162993    4.20075%
numa_hint_faults_local      363105      393400    8.34332%
numa_hit                   1333086     1190972   -10.6605%
numa_local                 1332945     1190789   -10.6648%
numa_other                     141         183    29.7872%
numa_pages_migrated         303268      316709    4.43205%
numa_pte_updates           2076930     2164327    4.20799%
pgfault                    4773768     4378411   -8.28186%
pgmajfault                     288         266   -7.63889%
pgmigrate_success           303268      316709    4.43205%

vmstat counters for numa02
param                     v5.5-rc2  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults             61234       90455    47.7202%
numa_hint_faults_local       31024       69953     125.48%
numa_hit                     78931       68528   -13.1799%
numa_local                   78931       68528   -13.1799%
numa_pages_migrated          30005       20281   -32.4079%
numa_pte_updates             63694       92503    45.2303%
pgfault                     139196      161149    15.7713%
pgmajfault                      65          70    7.69231%
pgmigrate_success            30005       20281   -32.4079%

vmstat counters for numa03
param                     v5.5-rc2  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults            692956      799162    15.3265%
numa_hint_faults_local       95121      167989    76.6056%
numa_hit                    256729      223528   -12.9323%
numa_local                  256729      223528   -12.9323%
numa_pages_migrated         106499      116825    9.69587%
numa_pte_updates            693020      803373    15.9235%
pgfault                    1003178     1004960   0.177635%
pgmajfault                      84          79   -5.95238%
pgmigrate_success           106499      116825    9.69587%

vmstat counters for numa04
param                     v5.5-rc2  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults           4766686     5512065    15.6373%
numa_hint_faults_local     2574830     3040466    18.0841%
numa_hit                   1797339     1975390    9.90637%
numa_local                 1797330     1975357    9.90508%
numa_pages_migrated         958674     1144133    19.3454%
numa_pte_updates           4784502     5533075    15.6458%
pgfault                    5855953     6546520    11.7926%
pgmajfault                     124         118   -4.83871%
pgmigrate_success           958674     1144133    19.3454%

vmstat counters for numa05
param                     v5.5-rc2  with_patch     %Change
-----                   ----------  ----------     -------
numa_hint_faults          11252825    10988104   -2.35248%
numa_hint_faults_local     7379141     8290124    12.3454%
numa_hit                   3669519     2964756   -19.2059%
numa_local                 3669518     2964702   -19.2073%
numa_pages_migrated        2050570     1696810   -17.2518%
numa_pte_updates          11326394    11003544   -2.85042%
pgfault                   13059535    12747105   -2.39235%
pgmajfault                      83         433    421.687%
pgmigrate_success          2050570     1696810   -17.2518%

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-20 8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman 2019-12-20 12:40 ` Valentin Schneider 2019-12-22 12:00 ` Srikar Dronamraju @ 2019-12-23 13:31 ` Vincent Guittot 2019-12-23 13:41 ` Vincent Guittot 2020-01-03 14:31 ` Mel Gorman 2 siblings, 2 replies; 28+ messages in thread From: Vincent Guittot @ 2019-12-23 13:31 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Fri, 20 Dec 2019 at 09:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > Changelog since V1 > o Alter code flow vincent.guittot > o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot > o Note that the division is still in place. Without it and taking > imbalance_adj into account before the cutoff, two NUMA domains > do not converage as being equally balanced when the number of > busy tasks equals the size of one domain (50% of the sum). > Some data is in the changelog. > > The CPU load balancer balances between different domains to spread load > and strives to have equal balance everywhere. Communicating tasks can > migrate so they are topologically close to each other but these decisions > are independent. On a lightly loaded NUMA machine, two communicating tasks > pulled together at wakeup time can be pushed apart by the load balancer. > In isolation, the load balancer decision is fine but it ignores the tasks > data locality and the wakeup/LB paths continually conflict. NUMA balancing > is also a factor but it also simply conflicts with the load balancer. > > This patch allows a degree of imbalance to exist between NUMA domains > based on the imbalance_pct defined by the scheduler domain. 
This slight > imbalance is allowed until the scheduler domain reaches almost 50% > utilisation at which point other factors like HT utilisation and memory > bandwidth come into play. While not commented upon in the code, the cutoff > is important for memory-bound parallelised non-communicating workloads > that do not fully utilise the entire machine. This is not necessarily the > best universal cut-off point but it appeared appropriate for a variety > of workloads and machines. > > The most obvious impact is on netperf TCP_STREAM -- two simple > communicating tasks with some softirq offloaded depending on the > transmission rate. > > 2-socket Haswell machine 48 core, HT enabled > netperf-tcp -- mmtests config config-network-netperf-unbound > baseline lbnuma-v1 > Hmean 64 666.68 ( 0.00%) 667.31 ( 0.09%) > Hmean 128 1276.18 ( 0.00%) 1288.92 * 1.00%* > Hmean 256 2366.78 ( 0.00%) 2422.22 * 2.34%* > Hmean 1024 8123.94 ( 0.00%) 8464.15 * 4.19%* > Hmean 2048 12962.45 ( 0.00%) 13693.79 * 5.64%* > Hmean 3312 17709.24 ( 0.00%) 17494.23 ( -1.21%) > Hmean 4096 19756.01 ( 0.00%) 19472.58 ( -1.43%) > Hmean 8192 27469.59 ( 0.00%) 27787.32 ( 1.16%) > Hmean 16384 30062.82 ( 0.00%) 30657.62 * 1.98%* > Stddev 64 2.64 ( 0.00%) 2.09 ( 20.76%) > Stddev 128 6.22 ( 0.00%) 6.48 ( -4.28%) > Stddev 256 9.75 ( 0.00%) 22.85 (-134.30%) > Stddev 1024 69.62 ( 0.00%) 58.41 ( 16.11%) > Stddev 2048 72.73 ( 0.00%) 83.47 ( -14.77%) > Stddev 3312 412.35 ( 0.00%) 75.77 ( 81.63%) > Stddev 4096 345.02 ( 0.00%) 297.01 ( 13.91%) > Stddev 8192 280.09 ( 0.00%) 485.36 ( -73.29%) > Stddev 16384 452.99 ( 0.00%) 250.21 ( 44.76%) > > Fairly small impact on average performance but note how much the standard > deviation is reduced in many cases. 
A clearer story is visible from the > NUMA Balancing stats > > Ops NUMA base-page range updates 21596.00 282.00 > Ops NUMA PTE updates 21596.00 282.00 > Ops NUMA PMD updates 0.00 0.00 > Ops NUMA hint faults 17786.00 137.00 > Ops NUMA hint local faults % 9916.00 137.00 > Ops NUMA hint local percent 55.75 100.00 > Ops NUMA pages migrated 4231.00 0.00 > > Without the patch, only 55.75% of sampled accesses are local. > With the patch, 100% of sampled accesses are local. A 2-socket > Broadwell showed better results on average but are not presented > for brevity. The patch holds up for 4-socket boxes as well > > 4-socket Haswell machine, 144 core, HT enabled > netperf-tcp > > baseline lbnuma-v1 > Hmean 64 953.51 ( 0.00%) 977.27 * 2.49%* > Hmean 128 1826.48 ( 0.00%) 1863.37 * 2.02%* > Hmean 256 3295.19 ( 0.00%) 3329.37 ( 1.04%) > Hmean 1024 10915.40 ( 0.00%) 11339.60 * 3.89%* > Hmean 2048 17833.82 ( 0.00%) 19066.12 * 6.91%* > Hmean 3312 22690.72 ( 0.00%) 24048.92 * 5.99%* > Hmean 4096 24422.23 ( 0.00%) 26606.60 * 8.94%* > Hmean 8192 31250.11 ( 0.00%) 33374.62 * 6.80%* > Hmean 16384 37033.70 ( 0.00%) 38684.28 * 4.46%* > Hmean 16384 37033.70 ( 0.00%) 38732.22 * 4.59%* > > On this machine, the baseline measured 58.11% locality for sampled accesses > and 100% local accesses with the patch. 
Similarly, the patch holds up > for 2-socket machines with multiple L3 caches such as the AMD Epyc 2 > > 2-socket EPYC-2 machine, 256 cores > netperf-tcp > Hmean 64 1564.63 ( 0.00%) 1550.59 ( -0.90%) > Hmean 128 3028.83 ( 0.00%) 3030.48 ( 0.05%) > Hmean 256 5733.47 ( 0.00%) 5769.51 ( 0.63%) > Hmean 1024 18936.04 ( 0.00%) 19216.15 * 1.48%* > Hmean 2048 27589.77 ( 0.00%) 28200.45 * 2.21%* > Hmean 3312 35361.97 ( 0.00%) 35881.94 * 1.47%* > Hmean 4096 37965.59 ( 0.00%) 38702.01 * 1.94%* > Hmean 8192 48499.92 ( 0.00%) 49530.62 * 2.13%* > Hmean 16384 54249.96 ( 0.00%) 55937.24 * 3.11%* > > For amusement purposes, here are two graphs showing CPU utilisation on > the 2-socket Haswell machine over time based on mpstat with the ordering > of the CPUs based on topology. > > http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png > http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png > > The lines on the left match up CPUs that are HT siblings or on the same > node. The machine has only one L3 cache per NUMA node or that would also > be shown. It should be very clear from the images that the baseline > kernel spread the load with lighter utilisation across nodes while the > patched kernel had heavy utilisation of fewer CPUs on one node. > > Hackbench generally shows good results across machines with some > differences depending on whether threads or sockets are used as well as > pipes or sockets. 
This is the *worst* result from the 2-socket Haswell > machine > > 2-socket Haswell machine 48 core, HT enabled > hackbench-process-pipes -- mmtests config config-scheduler-unbound > 5.5.0-rc1 5.5.0-rc1 > baseline lbnuma-v1 > Amean 1 1.2580 ( 0.00%) 1.2393 ( 1.48%) > Amean 4 5.3293 ( 0.00%) 5.2683 * 1.14%* > Amean 7 8.9067 ( 0.00%) 8.7130 * 2.17%* > Amean 12 14.9577 ( 0.00%) 14.5773 * 2.54%* > Amean 21 25.9570 ( 0.00%) 25.6657 * 1.12%* > Amean 30 37.7287 ( 0.00%) 37.1277 * 1.59%* > Amean 48 61.6757 ( 0.00%) 60.0433 * 2.65%* > Amean 79 100.4740 ( 0.00%) 98.4507 ( 2.01%) > Amean 110 141.2450 ( 0.00%) 136.8900 * 3.08%* > Amean 141 179.7747 ( 0.00%) 174.5110 * 2.93%* > Amean 172 221.0700 ( 0.00%) 214.7857 * 2.84%* > Amean 192 245.2007 ( 0.00%) 238.3680 * 2.79%* > > An earlier prototype of the patch showed major regressions for NAS C-class > when running with only half of the available CPUs -- 20-30% performance > hits were measured at the time. With this version of the patch, the impact > is marginal. In this case, the patch is lbnuma-v2 whereas nodivide is a > patch discussed during review that avoids a divide by putting the cutoff > at exactly 50% instead of accounting for imbalance_adj.
> > NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half > baseline nodivide lbnuma-v1 > Amean bt.C 64.29 ( 0.00%) 76.33 * -18.72%* 69.55 * -8.17%* > Amean cg.C 26.33 ( 0.00%) 26.26 ( 0.27%) 26.36 ( -0.11%) > Amean ep.C 10.26 ( 0.00%) 10.29 ( -0.31%) 10.26 ( -0.04%) > Amean ft.C 17.98 ( 0.00%) 19.73 * -9.71%* 19.51 * -8.52%* > Amean is.C 0.99 ( 0.00%) 0.99 ( 0.40%) 0.99 ( 0.00%) > Amean lu.C 51.72 ( 0.00%) 48.57 ( 6.09%) 48.68 * 5.88%* > Amean mg.C 8.12 ( 0.00%) 8.27 ( -1.82%) 8.24 ( -1.50%) > Amean sp.C 82.76 ( 0.00%) 86.06 * -3.99%* 83.42 ( -0.80%) > Amean ua.C 58.64 ( 0.00%) 57.66 ( 1.67%) 57.79 ( 1.45%) > > There is some impact but there is a degree of variability and the ones > showing impact are mainly workloads that are mostly parallelised > and communicate infrequently between tests. It's a corner case where > the workload benefits heavily from spreading wide and early which is > not common. This is intended to illustrate the worst case measured. > > In general, the patch simply seeks to avoid unnecessarily cross-node > migrations when a machine is lightly loaded but shows benefits for other > workloads. While tests are still running, so far it seems to benefit > light-utilisation smaller workloads on large machines and does not appear > to do any harm to larger or parallelised workloads. > > [valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus] > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++---- > 1 file changed, 33 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 08a233e97a01..60a780e1420e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* > * Try to use spare capacity of local group without overloading it or > * emptying busiest. 
> - * XXX Spreading tasks across NUMA nodes is not always the best policy > - * and special care should be taken for SD_NUMA domain level before > - * spreading the tasks. For now, load_balance() fully relies on > - * NUMA_BALANCING and fbq_classify_group/rq to override the decision. > */ > if (local->group_type == group_has_spare) { > if (busiest->group_type > group_fully_busy) { > @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > return; > } > > + /* Consider allowing a small imbalance between NUMA groups */ > + if (env->sd->flags & SD_NUMA) { > + unsigned int imbalance_adj, imbalance_max; > + > + /* > + * imbalance_adj is the allowable degree of imbalance > + * to exist between two NUMA domains. It's calculated > + * relative to imbalance_pct with a minimum of two > + * tasks or idle CPUs. The choice of two is due to > + * the most basic case of two communicating tasks > + * that should remain on the same NUMA node after > + * wakeup. > + */ > + imbalance_adj = max(2U, (busiest->group_weight * > + (env->sd->imbalance_pct - 100) / 100) >> 1); > + > + /* > + * Ignore small imbalances unless the busiest sd has > + * almost half as many busy CPUs as there are > + * available CPUs in the busiest group. Note that > + * it is not exactly half as imbalance_adj must be > + * accounted for or the two domains do not converge > + * as equally balanced if the number of busy tasks is > + * roughly the size of one NUMA domain. > + */ > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > + if (env->imbalance <= imbalance_adj && AFAICT, env->imbalance is undefined there. I have tried your patch with the below instead - if (env->imbalance <= imbalance_adj && - busiest->idle_cpus >= imbalance_max) { + if (busiest->idle_cpus >= imbalance_max) { Sorry for the delay but running tests took more time than expected.
I have applied your patch on top of v5.5-rc3 + apparmor fix. I can see an improvement for hackbench -l (256000/#grp) -g #grp 1 groups 14.197 +/-0.95% 12.127 +/-1.19% (+14.58%) I haven't seen any difference otherwise > + busiest->idle_cpus >= imbalance_max) { > + env->imbalance = 0; > + return; > + } > + } > + > if (busiest->group_weight == 1 || sds->prefer_sibling) { > unsigned int nr_diff = busiest->sum_nr_running; > /* ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-23 13:31 ` Vincent Guittot @ 2019-12-23 13:41 ` Vincent Guittot 2020-01-03 14:31 ` Mel Gorman 1 sibling, 0 replies; 28+ messages in thread From: Vincent Guittot @ 2019-12-23 13:41 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Mon, 23 Dec 2019 at 14:31, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Fri, 20 Dec 2019 at 09:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > Changelog since V1 > > o Alter code flow vincent.guittot > > o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot > > o Note that the division is still in place. Without it and taking > > imbalance_adj into account before the cutoff, two NUMA domains > > do not converage as being equally balanced when the number of > > busy tasks equals the size of one domain (50% of the sum). > > Some data is in the changelog. > > > > The CPU load balancer balances between different domains to spread load > > and strives to have equal balance everywhere. Communicating tasks can > > migrate so they are topologically close to each other but these decisions > > are independent. On a lightly loaded NUMA machine, two communicating tasks > > pulled together at wakeup time can be pushed apart by the load balancer. > > In isolation, the load balancer decision is fine but it ignores the tasks > > data locality and the wakeup/LB paths continually conflict. NUMA balancing > > is also a factor but it also simply conflicts with the load balancer. > > > > This patch allows a degree of imbalance to exist between NUMA domains > > based on the imbalance_pct defined by the scheduler domain. 
This slight > > imbalance is allowed until the scheduler domain reaches almost 50% > > utilisation at which point other factors like HT utilisation and memory > > bandwidth come into play. While not commented upon in the code, the cutoff > > is important for memory-bound parallelised non-communicating workloads > > that do not fully utilise the entire machine. This is not necessarily the > > best universal cut-off point but it appeared appropriate for a variety > > of workloads and machines. > > > > The most obvious impact is on netperf TCP_STREAM -- two simple > > communicating tasks with some softirq offloaded depending on the > > transmission rate. > > > > 2-socket Haswell machine 48 core, HT enabled > > netperf-tcp -- mmtests config config-network-netperf-unbound > > baseline lbnuma-v1 > > Hmean 64 666.68 ( 0.00%) 667.31 ( 0.09%) > > Hmean 128 1276.18 ( 0.00%) 1288.92 * 1.00%* > > Hmean 256 2366.78 ( 0.00%) 2422.22 * 2.34%* > > Hmean 1024 8123.94 ( 0.00%) 8464.15 * 4.19%* > > Hmean 2048 12962.45 ( 0.00%) 13693.79 * 5.64%* > > Hmean 3312 17709.24 ( 0.00%) 17494.23 ( -1.21%) > > Hmean 4096 19756.01 ( 0.00%) 19472.58 ( -1.43%) > > Hmean 8192 27469.59 ( 0.00%) 27787.32 ( 1.16%) > > Hmean 16384 30062.82 ( 0.00%) 30657.62 * 1.98%* > > Stddev 64 2.64 ( 0.00%) 2.09 ( 20.76%) > > Stddev 128 6.22 ( 0.00%) 6.48 ( -4.28%) > > Stddev 256 9.75 ( 0.00%) 22.85 (-134.30%) > > Stddev 1024 69.62 ( 0.00%) 58.41 ( 16.11%) > > Stddev 2048 72.73 ( 0.00%) 83.47 ( -14.77%) > > Stddev 3312 412.35 ( 0.00%) 75.77 ( 81.63%) > > Stddev 4096 345.02 ( 0.00%) 297.01 ( 13.91%) > > Stddev 8192 280.09 ( 0.00%) 485.36 ( -73.29%) > > Stddev 16384 452.99 ( 0.00%) 250.21 ( 44.76%) > > > > Fairly small impact on average performance but note how much the standard > > deviation is reduced in many cases. 
A clearer story is visible from the > > NUMA Balancing stats > > > > Ops NUMA base-page range updates 21596.00 282.00 > > Ops NUMA PTE updates 21596.00 282.00 > > Ops NUMA PMD updates 0.00 0.00 > > Ops NUMA hint faults 17786.00 137.00 > > Ops NUMA hint local faults % 9916.00 137.00 > > Ops NUMA hint local percent 55.75 100.00 > > Ops NUMA pages migrated 4231.00 0.00 > > > > Without the patch, only 55.75% of sampled accesses are local. > > With the patch, 100% of sampled accesses are local. A 2-socket > > Broadwell showed better results on average but are not presented > > for brevity. The patch holds up for 4-socket boxes as well > > > > 4-socket Haswell machine, 144 core, HT enabled > > netperf-tcp > > > > baseline lbnuma-v1 > > Hmean 64 953.51 ( 0.00%) 977.27 * 2.49%* > > Hmean 128 1826.48 ( 0.00%) 1863.37 * 2.02%* > > Hmean 256 3295.19 ( 0.00%) 3329.37 ( 1.04%) > > Hmean 1024 10915.40 ( 0.00%) 11339.60 * 3.89%* > > Hmean 2048 17833.82 ( 0.00%) 19066.12 * 6.91%* > > Hmean 3312 22690.72 ( 0.00%) 24048.92 * 5.99%* > > Hmean 4096 24422.23 ( 0.00%) 26606.60 * 8.94%* > > Hmean 8192 31250.11 ( 0.00%) 33374.62 * 6.80%* > > Hmean 16384 37033.70 ( 0.00%) 38684.28 * 4.46%* > > Hmean 16384 37033.70 ( 0.00%) 38732.22 * 4.59%* > > > > On this machine, the baseline measured 58.11% locality for sampled accesses > > and 100% local accesses with the patch. 
Similarly, the patch holds up > > for 2-socket machines with multiple L3 caches such as the AMD Epyc 2 > > > > 2-socket EPYC-2 machine, 256 cores > > netperf-tcp > > Hmean 64 1564.63 ( 0.00%) 1550.59 ( -0.90%) > > Hmean 128 3028.83 ( 0.00%) 3030.48 ( 0.05%) > > Hmean 256 5733.47 ( 0.00%) 5769.51 ( 0.63%) > > Hmean 1024 18936.04 ( 0.00%) 19216.15 * 1.48%* > > Hmean 2048 27589.77 ( 0.00%) 28200.45 * 2.21%* > > Hmean 3312 35361.97 ( 0.00%) 35881.94 * 1.47%* > > Hmean 4096 37965.59 ( 0.00%) 38702.01 * 1.94%* > > Hmean 8192 48499.92 ( 0.00%) 49530.62 * 2.13%* > > Hmean 16384 54249.96 ( 0.00%) 55937.24 * 3.11%* > > > > For amusement purposes, here are two graphs showing CPU utilisation on > > the 2-socket Haswell machine over time based on mpstat with the ordering > > of the CPUs based on topology. > > > > http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png > > http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png > > > > The lines on the left match up CPUs that are HT siblings or on the same > > node. The machine has only one L3 cache per NUMA node or that would also > > be shown. It should be very clear from the images that the baseline > > kernel spread the load with lighter utilisation across nodes while the > > patched kernel had heavy utilisation of fewer CPUs on one node. > > > > Hackbench generally shows good results across machines with some > > differences depending on whether threads or sockets are used as well as > > pipes or sockets. 
This is the *worst* result from the 2-socket Haswell > > machine > > > > 2-socket Haswell machine 48 core, HT enabled > > hackbench-process-pipes -- mmtests config config-scheduler-unbound > > 5.5.0-rc1 5.5.0-rc1 > > baseline lbnuma-v1 > > Amean 1 1.2580 ( 0.00%) 1.2393 ( 1.48%) > > Amean 4 5.3293 ( 0.00%) 5.2683 * 1.14%* > > Amean 7 8.9067 ( 0.00%) 8.7130 * 2.17%* > > Amean 12 14.9577 ( 0.00%) 14.5773 * 2.54%* > > Amean 21 25.9570 ( 0.00%) 25.6657 * 1.12%* > > Amean 30 37.7287 ( 0.00%) 37.1277 * 1.59%* > > Amean 48 61.6757 ( 0.00%) 60.0433 * 2.65%* > > Amean 79 100.4740 ( 0.00%) 98.4507 ( 2.01%) > > Amean 110 141.2450 ( 0.00%) 136.8900 * 3.08%* > > Amean 141 179.7747 ( 0.00%) 174.5110 * 2.93%* > > Amean 172 221.0700 ( 0.00%) 214.7857 * 2.84%* > > Amean 192 245.2007 ( 0.00%) 238.3680 * 2.79%* > > > > An earlier prototype of the patch showed major regressions for NAS C-class > > when running with only half of the available CPUs -- 20-30% performance > > hits were measured at the time. With this version of the patch, the impact > > is marginal. In this case, the patch is lbnuma-v2 where as nodivide is a > > patch discussed during review that avoids a divide by putting the cutoff > > at exactly 50% instead of accounting for imbalance_adj. 
> > > > NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half > > baseline nodivide lbnuma-v1 > > Amean bt.C 64.29 ( 0.00%) 76.33 * -18.72%* 69.55 * -8.17%* > > Amean cg.C 26.33 ( 0.00%) 26.26 ( 0.27%) 26.36 ( -0.11%) > > Amean ep.C 10.26 ( 0.00%) 10.29 ( -0.31%) 10.26 ( -0.04%) > > Amean ft.C 17.98 ( 0.00%) 19.73 * -9.71%* 19.51 * -8.52%* > > Amean is.C 0.99 ( 0.00%) 0.99 ( 0.40%) 0.99 ( 0.00%) > > Amean lu.C 51.72 ( 0.00%) 48.57 ( 6.09%) 48.68 * 5.88%* > > Amean mg.C 8.12 ( 0.00%) 8.27 ( -1.82%) 8.24 ( -1.50%) > > Amean sp.C 82.76 ( 0.00%) 86.06 * -3.99%* 83.42 ( -0.80%) > > Amean ua.C 58.64 ( 0.00%) 57.66 ( 1.67%) 57.79 ( 1.45%) > > > > There is some impact but there is a degree of variability and the ones > > showing impact are mainly workloads that are mostly parallelised > > and communicate infrequently between tests. It's a corner case where > > the workload benefits heavily from spreading wide and early which is > > not common. This is intended to illustrate the worst case measured. > > > > In general, the patch simply seeks to avoid unnecessarily cross-node > > migrations when a machine is lightly loaded but shows benefits for other > > workloads. While tests are still running, so far it seems to benefit > > light-utilisation smaller workloads on large machines and does not appear > > to do any harm to larger or parallelised workloads. 
> > > > [valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus] > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > --- > > kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++---- > > 1 file changed, 33 insertions(+), 4 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 08a233e97a01..60a780e1420e 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* > > * Try to use spare capacity of local group without overloading it or > > * emptying busiest. > > - * XXX Spreading tasks across NUMA nodes is not always the best policy > > - * and special care should be taken for SD_NUMA domain level before > > - * spreading the tasks. For now, load_balance() fully relies on > > - * NUMA_BALANCING and fbq_classify_group/rq to override the decision. > > */ > > if (local->group_type == group_has_spare) { > > if (busiest->group_type > group_fully_busy) { > > @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > return; > > } > > > > + /* Consider allowing a small imbalance between NUMA groups */ > > + if (env->sd->flags & SD_NUMA) { > > + unsigned int imbalance_adj, imbalance_max; > > + > > + /* > > + * imbalance_adj is the allowable degree of imbalance > > + * to exist between two NUMA domains. It's calculated > > + * relative to imbalance_pct with a minimum of two > > + * tasks or idle CPUs. The choice of two is due to > > + * the most basic case of two communicating tasks > > + * that should remain on the same NUMA node after > > + * wakeup. > > + */ > > + imbalance_adj = max(2U, (busiest->group_weight * > > + (env->sd->imbalance_pct - 100) / 100) >> 1); > > + > > + /* > > + * Ignore small imbalances unless the busiest sd has > > + * almost half as many busy CPUs as there are > > + * available CPUs in the busiest group. 
Note that > > + * it is not exactly half as imbalance_adj must be > > + * accounted for or the two domains do not converge > > + * as equally balanced if the number of busy tasks is > > + * roughly the size of one NUMA domain. > > + */ > > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > + if (env->imbalance <= imbalance_adj && > > AFAICT, env->imbalance is undefined there. I have tried your patch > with the below instead > > - if (env->imbalance <= imbalance_adj && > - busiest->idle_cpus >= imbalance_max) { > + if (busiest->idle_cpus >= imbalance_max) { > > Sorry for the delay but running tests took more time than expected. I > have applied your patch on top of v5.5-rc3+apparmor fix I forgot to mention that the platform used for the tests is a 2-node, 224-CPU arm64 system. > I can see an improvement for > hackbench -l (256000/#grp) -g #grp > 1 groups 14.197 +/-0.95% 12.127 +/-1.19% (+14.58%) > > I haven't seen any difference otherwise > > > + busiest->idle_cpus >= imbalance_max) { > > + env->imbalance = 0; > > + return; > > + } > > + } > > + > > if (busiest->group_weight == 1 || sds->prefer_sibling) { > > unsigned int nr_diff = busiest->sum_nr_running; > > /* ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2019-12-23 13:31 ` Vincent Guittot 2019-12-23 13:41 ` Vincent Guittot @ 2020-01-03 14:31 ` Mel Gorman 2020-01-06 13:55 ` Vincent Guittot 1 sibling, 1 reply; 28+ messages in thread From: Mel Gorman @ 2020-01-03 14:31 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Mon, Dec 23, 2019 at 02:31:40PM +0100, Vincent Guittot wrote: > > + /* Consider allowing a small imbalance between NUMA groups */ > > + if (env->sd->flags & SD_NUMA) { > > + unsigned int imbalance_adj, imbalance_max; > > + > > + /* > > + * imbalance_adj is the allowable degree of imbalance > > + * to exist between two NUMA domains. It's calculated > > + * relative to imbalance_pct with a minimum of two > > + * tasks or idle CPUs. The choice of two is due to > > + * the most basic case of two communicating tasks > > + * that should remain on the same NUMA node after > > + * wakeup. > > + */ > > + imbalance_adj = max(2U, (busiest->group_weight * > > + (env->sd->imbalance_pct - 100) / 100) >> 1); > > + > > + /* > > + * Ignore small imbalances unless the busiest sd has > > + * almost half as many busy CPUs as there are > > + * available CPUs in the busiest group. Note that > > + * it is not exactly half as imbalance_adj must be > > + * accounted for or the two domains do not converge > > + * as equally balanced if the number of busy tasks is > > + * roughly the size of one NUMA domain. > > + */ > > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > + if (env->imbalance <= imbalance_adj && > > AFAICT, env->imbalance is undefined there. I have tried your patch > with the below instead > Even when fixed, other corner cases were hit for parallelised loads that benefit from spreading early. 
At the moment I'm testing a variant of the first approach except it adjusts small imbalances at low utilisation and otherwise balances as normal. It appears to work for basic communicating tasks at relatively low utilisation that fit within a node without obviously impacting higher utilisation non-communicating workloads but more testing time will be needed to be sure. It's still based on sum_nr_running as a cut-off instead of idle_cpus as at low utilisation, there is not much of a material difference in the cut-offs given that either approach can be misleading depending on the load of the individual tasks, any cpu binding configurations and the degree the tasks are memory-bound. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ba749f579714..58ba2f0c6363 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* * Try to use spare capacity of local group without overloading it or * emptying busiest. - * XXX Spreading tasks across NUMA nodes is not always the best policy - * and special care should be taken for SD_NUMA domain level before - * spreading the tasks. For now, load_balance() fully relies on - * NUMA_BALANCING and fbq_classify_group/rq to override the decision. */ if (local->group_type == group_has_spare) { if (busiest->group_type > group_fully_busy) { @@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s env->migration_type = migrate_task; lsub_positive(&nr_diff, local->sum_nr_running); env->imbalance = nr_diff >> 1; - return; - } + } else { - /* - * If there is no overload, we just want to even the number of - * idle cpus.
+ */ + env->migration_type = migrate_task; + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1); + } + + /* Consider allowing a small imbalance between NUMA groups */ + if (env->sd->flags & SD_NUMA) { + long imbalance_adj, imbalance_max; + + /* + * imbalance_adj is the allowable degree of imbalance + * to exist between two NUMA domains. imbalance_pct + * is used to estimate the number of active tasks + * needed before memory bandwidth may be as important + * as memory locality. + */ + imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1; + + /* + * Allow small imbalances when the busiest group has + * low utilisation. + */ + imbalance_max = imbalance_adj << 1; + if (busiest->sum_nr_running < imbalance_max) + env->imbalance -= min(env->imbalance, imbalance_adj); + } + return; } -- Mel Gorman SUSE Labs ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-03 14:31 ` Mel Gorman @ 2020-01-06 13:55 ` Vincent Guittot 2020-01-06 14:52 ` Mel Gorman 0 siblings, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-06 13:55 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Fri, 3 Jan 2020 at 15:31, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Mon, Dec 23, 2019 at 02:31:40PM +0100, Vincent Guittot wrote: > > > + /* Consider allowing a small imbalance between NUMA groups */ > > > + if (env->sd->flags & SD_NUMA) { > > > + unsigned int imbalance_adj, imbalance_max; > > > + > > > + /* > > > + * imbalance_adj is the allowable degree of imbalance > > > + * to exist between two NUMA domains. It's calculated > > > + * relative to imbalance_pct with a minimum of two > > > + * tasks or idle CPUs. The choice of two is due to > > > + * the most basic case of two communicating tasks > > > + * that should remain on the same NUMA node after > > > + * wakeup. > > > + */ > > > + imbalance_adj = max(2U, (busiest->group_weight * > > > + (env->sd->imbalance_pct - 100) / 100) >> 1); > > > + > > > + /* > > > + * Ignore small imbalances unless the busiest sd has > > > + * almost half as many busy CPUs as there are > > > + * available CPUs in the busiest group. Note that > > > + * it is not exactly half as imbalance_adj must be > > > + * accounted for or the two domains do not converge > > > + * as equally balanced if the number of busy tasks is > > > + * roughly the size of one NUMA domain. > > > + */ > > > + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj; > > > + if (env->imbalance <= imbalance_adj && > > > > AFAICT, env->imbalance is undefined there. 
I have tried your patch > > with the below instead > > > > Even when fixed, other corner cases were hit for parallelised loads that > benefit from spreading early. At the moment I'm testing a variant of the > first approach except it adjust small balances at low utilisation and > otherwise balances at normal. It appears to work for basic communicating > tasks at relatively low utitisation that fits within a node without > obviously impacting higher utilisation non-communicating workloads but > more testing time will be needed to be sure. > > It's still based on sum_nr_running as a cut-off instead of idle_cpus as > at low utilisation, there is not much of a material difference in the > cut-offs given that either approach can be misleading depending on the > load of the individual tasks, any cpu binding configurations and the > degree the tasks are memory-bound. > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ba749f579714..58ba2f0c6363 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* > * Try to use spare capacity of local group without overloading it or > * emptying busiest. > - * XXX Spreading tasks across NUMA nodes is not always the best policy > - * and special care should be taken for SD_NUMA domain level before > - * spreading the tasks. For now, load_balance() fully relies on > - * NUMA_BALANCING and fbq_classify_group/rq to override the decision. > */ > if (local->group_type == group_has_spare) { > if (busiest->group_type > group_fully_busy) { > @@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > env->migration_type = migrate_task; > lsub_positive(&nr_diff, local->sum_nr_running); > env->imbalance = nr_diff >> 1; > - return; > - } > + } else { > > - /* > - * If there is no overload, we just want to even the number of > - * idle cpus. 
> - */ > - env->migration_type = migrate_task; > - env->imbalance = max_t(long, 0, (local->idle_cpus - > + /* > + * If there is no overload, we just want to even the number of > + * idle cpus. > + */ > + env->migration_type = migrate_task; > + env->imbalance = max_t(long, 0, (local->idle_cpus - > busiest->idle_cpus) >> 1); > + } > + > + /* Consider allowing a small imbalance between NUMA groups */ > + if (env->sd->flags & SD_NUMA) { > + long imbalance_adj, imbalance_max; > + > + /* > + * imbalance_adj is the allowable degree of imbalance > + * to exist between two NUMA domains. imbalance_pct > + * is used to estimate the number of active tasks > + * needed before memory bandwidth may be as important > + * as memory locality. > + */ > + imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1; This looks weird to me because you use imbalance_pct, which is meaningful only to compare a ratio, to define a number that will then be compared to a number of tasks without taking into account the weight of the node. So whatever the node size, 32 or 128 CPUs, the imbalance_adj will be the same: 3 with the default imbalance_pct of NUMA level which is 125 AFAICT > + > + /* > + * Allow small imbalances when the busiest group has > + * low utilisation. > + */ > + imbalance_max = imbalance_adj << 1; Why do you add this shift? So according to the above, imbalance_max = 6 whatever the size of the node. Regards, Vincent > + if (busiest->sum_nr_running < imbalance_max) > + env->imbalance -= min(env->imbalance, imbalance_adj); > + } > + > return; > } > > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-06 13:55 ` Vincent Guittot @ 2020-01-06 14:52 ` Mel Gorman 2020-01-07 8:38 ` Vincent Guittot 0 siblings, 1 reply; 28+ messages in thread From: Mel Gorman @ 2020-01-06 14:52 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML Sorry I sent out v3 before seeing this email as my mail only synchronises periodically. On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote: > > - return; > > - } > > + } else { > > > > - /* > > - * If there is no overload, we just want to even the number of > > - * idle cpus. > > - */ > > - env->migration_type = migrate_task; > > - env->imbalance = max_t(long, 0, (local->idle_cpus - > > + /* > > + * If there is no overload, we just want to even the number of > > + * idle cpus. > > + */ > > + env->migration_type = migrate_task; > > + env->imbalance = max_t(long, 0, (local->idle_cpus - > > busiest->idle_cpus) >> 1); > > + } > > + > > + /* Consider allowing a small imbalance between NUMA groups */ > > + if (env->sd->flags & SD_NUMA) { > > + long imbalance_adj, imbalance_max; > > + > > + /* > > + * imbalance_adj is the allowable degree of imbalance > > + * to exist between two NUMA domains. imbalance_pct > > + * is used to estimate the number of active tasks > > + * needed before memory bandwidth may be as important > > + * as memory locality. > > + */ > > + imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1; > > This looks weird to me because you use imbalance_pct, which is > meaningful only compare a ratio, to define a number that will be then > compared to a number of tasks without taking into account the weight > of the node. 
> So whatever the node size, 32 or 128 CPUs, the
> imbalance_adj will be the same: 3 with the default imbalance_pct of
> NUMA level which is 125 AFAICT
>

The intent in this version was to only cover the low utilisation case
regardless of the NUMA node size. There were too many corner cases where
the tradeoff of local memory latency versus local memory bandwidth
cannot be quantified. See Srikar's report as an example, but it's a
general problem. The use of imbalance_pct was simply to find the
smallest number of running tasks where (imbalance_pct - 100) would be 1
running task, and to limit the patch to addressing the low utilisation
case only. It could simply be hard-coded, but that would ignore cases
where an architecture overrides imbalance_pct. I'm open to suggestions
on how we could identify the point where imbalances can be ignored
without hitting other corner cases.

> > +
> > +		/*
> > +		 * Allow small imbalances when the busiest group has
> > +		 * low utilisation.
> > +		 */
> > +		imbalance_max = imbalance_adj << 1;
>
> Why do you add this shift ?
>

For very low utilisation, there is no balancing between nodes. For
slightly above that, there is limited balancing. After that, the load
balancing behaviour is unchanged as I believe we cannot determine
whether memory latency or memory bandwidth is more important for
arbitrary workloads.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-06 14:52 ` Mel Gorman @ 2020-01-07 8:38 ` Vincent Guittot 2020-01-07 9:56 ` Mel Gorman 0 siblings, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-07 8:38 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Mon, 6 Jan 2020 at 15:52, Mel Gorman <mgorman@techsingularity.net> wrote: > > Sorry I sent out v3 before seeing this email as my mail only synchronises > periodically. > > On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote: > > > - return; > > > - } > > > + } else { > > > > > > - /* > > > - * If there is no overload, we just want to even the number of > > > - * idle cpus. > > > - */ > > > - env->migration_type = migrate_task; > > > - env->imbalance = max_t(long, 0, (local->idle_cpus - > > > + /* > > > + * If there is no overload, we just want to even the number of > > > + * idle cpus. > > > + */ > > > + env->migration_type = migrate_task; > > > + env->imbalance = max_t(long, 0, (local->idle_cpus - > > > busiest->idle_cpus) >> 1); > > > + } > > > + > > > + /* Consider allowing a small imbalance between NUMA groups */ > > > + if (env->sd->flags & SD_NUMA) { > > > + long imbalance_adj, imbalance_max; > > > + > > > + /* > > > + * imbalance_adj is the allowable degree of imbalance > > > + * to exist between two NUMA domains. imbalance_pct > > > + * is used to estimate the number of active tasks > > > + * needed before memory bandwidth may be as important > > > + * as memory locality. 
> > > +		 */
> > > +		imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> >
> > This looks weird to me because you use imbalance_pct, which is
> > meaningful only compare a ratio, to define a number that will be then
> > compared to a number of tasks without taking into account the weight
> > of the node. So whatever the node size, 32 or 128 CPUs, the
> > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > NUMA level which is 125 AFAICT
> >
>
> The intent in this version was to only cover the low utilisation case
> regardless of the NUMA node size. There were too many corner cases
> where the tradeoff of local memory latency versus local memory bandwidth
> cannot be quantified. See Srikar's report as an example but it's a general
> problem. The use of imbalance_pct was simply to find the smallest number of
> running tasks were (imbalance_pct - 100) would be 1 running task and limit

But using imbalance_pct alone doesn't mean anything. Using something
similar to the below would be more meaningful:

    busiest->group_weight * (env->sd->imbalance_pct - 100) / 100

Or you could use util_avg, so you would take into account whether the
tasks are short-running or long-running ones.

> the patch to address the low utilisation case only. It could be simply
> hard-coded but that would ignore cases where an architecture overrides
> imbalance_pct. I'm open to suggestion on how we could identify the point
> where imbalances can be ignored without hitting other corner cases.
>
> > > +
> > > +		/*
> > > +		 * Allow small imbalances when the busiest group has
> > > +		 * low utilisation.
> > > +		 */
> > > +		imbalance_max = imbalance_adj << 1;
> >
> > Why do you add this shift ?
> >
>
> For very low utilisation, there is no balancing between nodes. For slightly
> above that, there is limited balancing. After that, the load balancing
> behaviour is unchanged as I believe we cannot determine if memory latency
> or memory bandwidth is more important for arbitrary workloads.
> > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 8:38 ` Vincent Guittot @ 2020-01-07 9:56 ` Mel Gorman 2020-01-07 11:17 ` Vincent Guittot ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Mel Gorman @ 2020-01-07 9:56 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote: > > > This looks weird to me because you use imbalance_pct, which is > > > meaningful only compare a ratio, to define a number that will be then > > > compared to a number of tasks without taking into account the weight > > > of the node. So whatever the node size, 32 or 128 CPUs, the > > > imbalance_adj will be the same: 3 with the default imbalance_pct of > > > NUMA level which is 125 AFAICT > > > > > > > The intent in this version was to only cover the low utilisation case > > regardless of the NUMA node size. There were too many corner cases > > where the tradeoff of local memory latency versus local memory bandwidth > > cannot be quantified. See Srikar's report as an example but it's a general > > problem. The use of imbalance_pct was simply to find the smallest number of > > running tasks were (imbalance_pct - 100) would be 1 running task and limit > > But using imbalance_pct alone doesn't mean anything. Other than figuring out "how many running tasks are required before imbalance_pct is roughly equivalent to one fully active CPU?". Even then, it's a bit weak as imbalance_pct makes hard-coded assumptions on what the tradeoff of cross-domain migration is without considering the hardware. 
> Using similar to the below
>
> busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
>
> would be more meaningful
>

It's meaningful in some sense from a conceptual point of view, but
setting the low utilisation cutoff depending on the number of CPUs on
the node does not account for any local memory latency vs bandwidth,
i.e. on a small or mid-range machine the cutoff will make sense. On
larger machines, the cutoff could be at the point where memory bandwidth
is saturated, leading to a scenario whereby upgrading to a larger
machine performs worse than the smaller machine.

Much more importantly, doing what you suggest allows an imbalance of
more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
machines, busiest->group_weight scaled by imbalance_pct spans multiple
L3 caches. That is going to have side-effects. While I also do not
account for the LLC group_weight, it's unlikely the cut-off I used would
be larger than an LLC cache on a large machine.

These two points are why I didn't take the group weight into account.

Now if you want, I can do what you suggest anyway as long as you are
happy that the child domain weight is also taken into account and to
bound the largest possible allowed imbalance to deal with the case of a
node having multiple small LLC caches. That means that some machines
will be using the size of the node and some machines will use the size
of an LLC. It's less predictable overall as some machines will be
"special" relative to others, making it harder to reproduce certain
problems locally, but it would take imbalance_pct into account in a way
that you're happy with.

Also bear in mind that whether LLC is accounted for or not, the final
result should be halved similarly to the other imbalance calculations to
avoid over or under load balancing.

> Or you could use the util_avg so you will take into account if the
> tasks are short running ones or long running ones
>

util_avg can be skewed if there are big outliers.
Even then, it's not a great metric for the low utilisation cutoff. Large
numbers of mostly idle but running tasks would be treated similarly to
small numbers of fully active tasks. It's less predictable and harder to
reason about how load balancing behaves across a variety of workloads.

Based on what you suggest, the result looks like this (build tested
only)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..1b2c7bed2db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {

-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 				busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			struct sched_domain *child = env->sd->child;
+			unsigned int imbalance_adj;
+
+			/*
+			 * Calculate an acceptable degree of imbalance based
+			 * on imbalance_adj. However, do not allow a greater
+			 * imbalance than the child domains weight to avoid
+			 * a case where the allowed imbalance spans multiple
+			 * LLCs.
+			 */
+			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
+			imbalance_adj = min(imbalance_adj, child->span_weight);
+			imbalance_adj >>= 1;
+
+			/*
+			 * Ignore small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			if (busiest->sum_nr_running < imbalance_adj)
+				env->imbalance = 0;
+		}
+
 		return;
 	}

^ permalink raw reply related	[flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 9:56 ` Mel Gorman @ 2020-01-07 11:17 ` Vincent Guittot 2020-01-07 11:56 ` Mel Gorman 2020-01-07 11:22 ` Peter Zijlstra 2020-01-07 19:26 ` Phil Auld 2 siblings, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-07 11:17 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote: > > > > This looks weird to me because you use imbalance_pct, which is > > > > meaningful only compare a ratio, to define a number that will be then > > > > compared to a number of tasks without taking into account the weight > > > > of the node. So whatever the node size, 32 or 128 CPUs, the > > > > imbalance_adj will be the same: 3 with the default imbalance_pct of > > > > NUMA level which is 125 AFAICT > > > > > > > > > > The intent in this version was to only cover the low utilisation case > > > regardless of the NUMA node size. There were too many corner cases > > > where the tradeoff of local memory latency versus local memory bandwidth > > > cannot be quantified. See Srikar's report as an example but it's a general > > > problem. The use of imbalance_pct was simply to find the smallest number of > > > running tasks were (imbalance_pct - 100) would be 1 running task and limit > > > > But using imbalance_pct alone doesn't mean anything. > > Other than figuring out "how many running tasks are required before > imbalance_pct is roughly equivalent to one fully active CPU?". 
> Even then,

sorry, I don't see how you deduce this from only using imbalance_pct,
which is mainly there to say what percentage of difference is
significant

> it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> the tradeoff of cross-domain migration is without considering the hardware.
>
> > Using similar to the below
> >
> > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> >
> > would be more meaningful
> >
>
> It's meaningful to some sense from a conceptual point of view but
> setting the low utilisation cutoff depending on the number of CPUs on
> the node does not account for any local memory latency vs bandwidth.
> i.e. on a small or mid-range machine the cutoff will make sense. On
> larger machines, the cutoff could be at the point where memory bandwidth
> is saturated leading to a scenario whereby upgrading to a larger
> machine performs worse than the smaller machine.
>
> Much more importantly, doing what you suggest allows an imbalance
> of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> caches. That is going to have side-effects. While I also do not account
> for the LLC group_weight, it's unlikely the cut-off I used would be
> smaller than an LLC cache on a large machine as the cache.
>
> These two points are why I didn't take the group weight into account.
>
> Now if you want, I can do what you suggest anyway as long as you are happy
> that the child domain weight is also taken into account and to bound the

Taking the child domain into account makes sense to me, but shouldn't we
take into account the number of child groups instead? This should
reflect the number of different LLC caches.
IIUC your reasoning, we want to make sure that tasks will not start to
fight for the same resources, which in this case is the connection
between the LLC cache and memory

> largest possible allowed imbalance to deal with the case of a node having
> multiple small LLC caches. That means that some machines will be using the
> size of the node and some machines will use the size of an LLC. It's less
> predictable overall as some machines will be "special" relative to others
> making it harder to reproduce certain problems locally but it would take
> imbalance_pct into account in a way that you're happy with.
>
> Also bear in mind that whether LLC is accounted for or not, the final
> result should be halved similar to the other imbalance calculations to
> avoid over or under load balancing.
>
> > Or you could use the util_avg so you will take into account if the
> > tasks are short running ones or long running ones
> >
>
> util_avg can be skewed if there are big outliers. Even then, it's not
> a great metric for the low utilisation cutoff. Large numbers of mostly
> idle but running tasks would be treated similarly to small numbers of
> fully active tasks. It's less predictable and harder to reason about how

Yes, but this also has the advantage of reflecting more accurately how
the system is used. With nr_running, we consider that mostly idle and
fully active tasks will have the exact same impact on the memory

> load balancing behaves across a variety of workloads.
>
> Based on what you suggest, the result looks like this (build tested
> only)

I'm going to give this patch a try

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..1b2c7bed2db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> 	/*
> 	 * Try to use spare capacity of local group without overloading it or
> 	 * emptying busiest.
> - * XXX Spreading tasks across NUMA nodes is not always the best policy > - * and special care should be taken for SD_NUMA domain level before > - * spreading the tasks. For now, load_balance() fully relies on > - * NUMA_BALANCING and fbq_classify_group/rq to override the decision. > */ > if (local->group_type == group_has_spare) { > if (busiest->group_type > group_fully_busy) { > @@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > env->migration_type = migrate_task; > lsub_positive(&nr_diff, local->sum_nr_running); > env->imbalance = nr_diff >> 1; > - return; > - } > + } else { > > - /* > - * If there is no overload, we just want to even the number of > - * idle cpus. > - */ > - env->migration_type = migrate_task; > - env->imbalance = max_t(long, 0, (local->idle_cpus - > + /* > + * If there is no overload, we just want to even the number of > + * idle cpus. > + */ > + env->migration_type = migrate_task; > + env->imbalance = max_t(long, 0, (local->idle_cpus - > busiest->idle_cpus) >> 1); > + } > + > + /* Consider allowing a small imbalance between NUMA groups */ > + if (env->sd->flags & SD_NUMA) { > + struct sched_domain *child = env->sd->child; > + unsigned int imbalance_adj; > + > + /* > + * Calculate an acceptable degree of imbalance based > + * on imbalance_adj. However, do not allow a greater > + * imbalance than the child domains weight to avoid > + * a case where the allowed imbalance spans multiple > + * LLCs. > + */ > + imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100; > + imbalance_adj = min(imbalance_adj, child->span_weight); > + imbalance_adj >>= 1; > + > + /* > + * Ignore small imbalances when the busiest group has > + * low utilisation. > + */ > + if (busiest->sum_nr_running < imbalance_adj) > + env->imbalance = 0; > + } > + > return; > } > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 11:17 ` Vincent Guittot @ 2020-01-07 11:56 ` Mel Gorman 2020-01-07 16:00 ` Vincent Guittot 0 siblings, 1 reply; 28+ messages in thread From: Mel Gorman @ 2020-01-07 11:56 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote: > On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote: > > > > > This looks weird to me because you use imbalance_pct, which is > > > > > meaningful only compare a ratio, to define a number that will be then > > > > > compared to a number of tasks without taking into account the weight > > > > > of the node. So whatever the node size, 32 or 128 CPUs, the > > > > > imbalance_adj will be the same: 3 with the default imbalance_pct of > > > > > NUMA level which is 125 AFAICT > > > > > > > > > > > > > The intent in this version was to only cover the low utilisation case > > > > regardless of the NUMA node size. There were too many corner cases > > > > where the tradeoff of local memory latency versus local memory bandwidth > > > > cannot be quantified. See Srikar's report as an example but it's a general > > > > problem. The use of imbalance_pct was simply to find the smallest number of > > > > running tasks were (imbalance_pct - 100) would be 1 running task and limit > > > > > > But using imbalance_pct alone doesn't mean anything. > > > > Other than figuring out "how many running tasks are required before > > imbalance_pct is roughly equivalent to one fully active CPU?". 
Even then, > > sorry, I don't see how you deduct this from only using imbalance_pct > which is mainly there to say how much percent of difference is > significant > Because if the difference is 25% then 1 CPU out of 4 active is enough for imbalance_pct to potentially be a factor. Anyway, the approach seems almost universally disliked so even if I had reasons for not scaling this by the group_weight, no one appears to agree with them :P > > it's a bit weak as imbalance_pct makes hard-coded assumptions on what > > the tradeoff of cross-domain migration is without considering the hardware. > > > > > Using similar to the below > > > > > > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100 > > > > > > would be more meaningful > > > > > > > It's meaningful to some sense from a conceptual point of view but > > setting the low utilisation cutoff depending on the number of CPUs on > > the node does not account for any local memory latency vs bandwidth. > > i.e. on a small or mid-range machine the cutoff will make sense. On > > larger machines, the cutoff could be at the point where memory bandwidth > > is saturated leading to a scenario whereby upgrading to a larger > > machine performs worse than the smaller machine. > > > > Much more importantly, doing what you suggest allows an imbalance > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2 > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3 > > caches. That is going to have side-effects. While I also do not account > > for the LLC group_weight, it's unlikely the cut-off I used would be > > smaller than an LLC cache on a large machine as the cache. > > > > These two points are why I didn't take the group weight into account. 
> >
> > Now if you want, I can do what you suggest anyway as long as you are happy
> > that the child domain weight is also taken into account and to bound the
>
> Taking into account child domain makes sense to me, but shouldn't we
> take into account the number of child group instead ? This should
> reflect the number of different LLC caches.

I guess it would, but why is it inherently better? The number of domains
would yield a similar result if we assume that all the lower domains
have equal weight, since it is simply the weight of the SD_NUMA domain
divided by the number of child domains.

Now, I could be missing something with asymmetric setups. I don't know
if it's possible for child domains of a NUMA domain to have different
sizes. I would be somewhat surprised if they did, but I also do not work
on such machines nor have I ever accessed one (to my knowledge).

> IIUC your reasoning, we want to make sure that tasks will not start to
> fight for using same resources which is the connection between LLC
> cache and memory in this case
>

Yep. I don't want a case where the allowed imbalance causes the load
balancer to have to balance between the lower domains. *Maybe* that is
actually better in some cases, but it's far from intuitive, so I would
prefer that change be a patch on its own with a big fat comment
explaining the reasoning behind the additional complexity.

> > largest possible allowed imbalance to deal with the case of a node having
> > multiple small LLC caches.
> > > > Also bear in mind that whether LLC is accounted for or not, the final > > result should be halved similar to the other imbalance calculations to > > avoid over or under load balancing. > > > > > Or you could use the util_avg so you will take into account if the > > > tasks are short running ones or long running ones > > > > > > > util_avg can be skewed if there are big outliers. Even then, it's not > > a great metric for the low utilisation cutoff. Large numbers of mostly > > idle but running tasks would be treated similarly to small numbers of > > fully active tasks. It's less predictable and harder to reason about how > > Yes but this also have the advantage of reflecting more accurately how > the system is used. > with nr_running, we consider that mostly idle and fully active tasks > will have the exact same impact on the memory > Maybe, maybe not. When there is spare capacity in the domain overall and we are only interested in the low utilisation case, it seems to me that we should consider the most obvious and understandable metric. Now, if we were talking about a nearly fully loaded domain or an overloaded domain then I would fully agree with you as balancing utilisation in that case becomes critical. > > load balancing behaves across a variety of workloads. > > > > Based on what you suggest, the result looks like this (build tested > > only) > > I'm going to make a try of this patch > Thanks. I've queued the same patch on one machine to see what falls out. I don't want to tie up half my test grid until we get some sort of consensus. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 11:56 ` Mel Gorman @ 2020-01-07 16:00 ` Vincent Guittot 2020-01-07 20:24 ` Mel Gorman 0 siblings, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-07 16:00 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, 7 Jan 2020 at 12:56, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote: > > On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote: > > > > > > This looks weird to me because you use imbalance_pct, which is > > > > > > meaningful only compare a ratio, to define a number that will be then > > > > > > compared to a number of tasks without taking into account the weight > > > > > > of the node. So whatever the node size, 32 or 128 CPUs, the > > > > > > imbalance_adj will be the same: 3 with the default imbalance_pct of > > > > > > NUMA level which is 125 AFAICT > > > > > > > > > > > > > > > > The intent in this version was to only cover the low utilisation case > > > > > regardless of the NUMA node size. There were too many corner cases > > > > > where the tradeoff of local memory latency versus local memory bandwidth > > > > > cannot be quantified. See Srikar's report as an example but it's a general > > > > > problem. The use of imbalance_pct was simply to find the smallest number of > > > > > running tasks were (imbalance_pct - 100) would be 1 running task and limit > > > > > > > > But using imbalance_pct alone doesn't mean anything.
> > > > > > Other than figuring out "how many running tasks are required before > > > imbalance_pct is roughly equivalent to one fully active CPU?". Even then, > > > > sorry, I don't see how you deduct this from only using imbalance_pct > > which is mainly there to say how much percent of difference is > > significant > > > > Because if the difference is 25% then 1 CPU out of 4 active is enough > for imbalance_pct to potentially be a factor. Anyway, the approach seems > almost universally disliked so even if I had reasons for not scaling > this by the group_weight, no one appears to agree with them :P > > > > it's a bit weak as imbalance_pct makes hard-coded assumptions on what > > > the tradeoff of cross-domain migration is without considering the hardware. > > > > > > > Using similar to the below > > > > > > > > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100 > > > > > > > > would be more meaningful > > > > > > > > > > It's meaningful to some sense from a conceptual point of view but > > > setting the low utilisation cutoff depending on the number of CPUs on > > > the node does not account for any local memory latency vs bandwidth. > > > i.e. on a small or mid-range machine the cutoff will make sense. On > > > larger machines, the cutoff could be at the point where memory bandwidth > > > is saturated leading to a scenario whereby upgrading to a larger > > > machine performs worse than the smaller machine. > > > > > > Much more importantly, doing what you suggest allows an imbalance > > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2 > > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3 > > > caches. That is going to have side-effects. While I also do not account > > > for the LLC group_weight, it's unlikely the cut-off I used would be > > > smaller than an LLC cache on a large machine as the cache. > > > > > > These two points are why I didn't take the group weight into account. 
> > >
> > > Now if you want, I can do what you suggest anyway as long as you are happy
> > > that the child domain weight is also taken into account and to bound the
> >
> > Taking into account child domain makes sense to me, but shouldn't we
> > take into account the number of child group instead ? This should
> > reflect the number of different LLC caches.
>
> I guess it would but why is it inherently better? The number of domains
> would yield a similar result if we assume that all the lower domains
> have equal weight so it simply because the weight of the SD_NUMA domain
> divided by the number of child domains.

But that's not what you are doing in your proposal. You are using
child->span_weight directly, which reflects the number of CPUs in the
child and not the number of groups.

You should do something like

    sds->busiest->span_weight / sds->busiest->child->span_weight

which gives you an approximation of the number of independent groups
inside the busiest NUMA node from a shared-resource point of view.

>
> Now, I could be missing something with asymmetric setups. I don't know
> if it's possible for child domains of a NUMA domain to have different
> sizes. I would be somewhat surprised if they did but I also do not work
> on such machines nor have I ever accessed one (to my knowledge).
>
> > IIUC your reasoning, we want to make sure that tasks will not start to
> > fight for using same resources which is the connection between LLC
> > cache and memory in this case
> >
>
> Yep. I don't want a case where the allowed imbalance causes the load
> balancer to have to balance between the lower domains. *Maybe* that is
> actually better in some cases but it's far from intuitive so I would
> prefer that change would be a patch on its own with a big fat comment
> explaining the reasoning behind the additional complexity.
>
> > > largest possible allowed imbalance to deal with the case of a node having
> > > multiple small LLC caches.
That means that some machines will be using the > > > size of the node and some machines will use the size of an LLC. It's less > > > predictable overall as some machines will be "special" relative to others > > > making it harder to reproduce certain problems locally but it would take > > > imbalance_pct into account in a way that you're happy with. > > > > > > Also bear in mind that whether LLC is accounted for or not, the final > > > result should be halved similar to the other imbalance calculations to > > > avoid over or under load balancing. > > > > > > > Or you could use the util_avg so you will take into account if the > > > > tasks are short running ones or long running ones > > > > > > > > > > util_avg can be skewed if there are big outliers. Even then, it's not > > > a great metric for the low utilisation cutoff. Large numbers of mostly > > > idle but running tasks would be treated similarly to small numbers of > > > fully active tasks. It's less predictable and harder to reason about how > > > > Yes but this also have the advantage of reflecting more accurately how > > the system is used. > > with nr_running, we consider that mostly idle and fully active tasks > > will have the exact same impact on the memory > > > > Maybe, maybe not. When there is spare capacity in the domain overall and > we are only interested in the low utilisation case, it seems to me that > we should consider the most obvious and understandable metric. Now, if we > were talking about a nearly fully loaded domain or an overloaded domain > then I would fully agree with you as balancing utilisation in that case > becomes critical. > > > > load balancing behaves across a variety of workloads. > > > > > > Based on what you suggest, the result looks like this (build tested > > > only) > > > > I'm going to make a try of this patch > > > > Thanks. I've queued the same patch on one machine to see what falls out. 
> I don't want to tie up half my test grid until we get some sort of > consensus. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 16:00 ` Vincent Guittot @ 2020-01-07 20:24 ` Mel Gorman 2020-01-08 8:25 ` Vincent Guittot 2020-01-08 13:18 ` Peter Zijlstra 0 siblings, 2 replies; 28+ messages in thread From: Mel Gorman @ 2020-01-07 20:24 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote: > > > Taking into account child domain makes sense to me, but shouldn't we > > > take into account the number of child group instead ? This should > > > reflect the number of different LLC caches. > > > > I guess it would but why is it inherently better? The number of domains > > would yield a similar result if we assume that all the lower domains > > have equal weight so it simply because the weight of the SD_NUMA domain > > divided by the number of child domains. > > but that's not what you are doing in your proposal. You are using > directly child->span_weight which reflects the number of CPUs in the > child and not the number of group > > you should do something like sds->busiest->span_weight / > sds->busiest->child->span_weight which gives you an approximation of > the number of independent group inside the busiest numa node from a > shared resource pov > Now I get you, but unfortunately it also would not work out. The number of groups is not related to the LLC except in some specific cases. It's possible to use the first CPU to find the size of an LLC but now I worry that it would lead to unpredictable behaviour. AMD has different numbers of LLCs per node depending on the CPU family and while Intel generally has one LLC per node, I imagine there are counter examples. 
This means that load balancing on different machines with similar core counts will behave differently due to the LLC size. It might be possible to infer it if the intermediate domain was DIE instead of MC but I doubt that's guaranteed and it would still be unpredictable. It may be the type of complexity that should only be introduced with a separate patch with clear rationale as to why it's necessary and we are not at that threshold so I withdraw the suggestion. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
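The ratio Vincent describes — sds->busiest->span_weight / sds->busiest->child->span_weight — can be sketched outside the kernel. A minimal illustration follows; the toy struct and function names are invented for this sketch (the kernel's sched_domain/sched_group are far richer), only the division mirrors the suggestion:

```c
#include <assert.h>

/*
 * Toy model of the approximation discussed above: estimate the number
 * of independent child groups (LLCs, in the best case) inside the
 * busiest NUMA node as the node's span divided by one child domain's
 * span. As noted in the thread, this assumes all child domains have
 * equal weight, which may not hold on asymmetric topologies.
 */
struct toy_domain {
	unsigned int span_weight;	/* number of CPUs the domain covers */
};

static unsigned int nr_child_groups(const struct toy_domain *busiest,
				    const struct toy_domain *child)
{
	/* e.g. a 64-CPU node with 4-CPU LLCs -> 16 independent groups */
	return busiest->span_weight / child->span_weight;
}
```

As Mel points out, the result only corresponds to the LLC count on topologies where the child domain actually is an LLC, which is exactly the unpredictability under discussion.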
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 20:24 ` Mel Gorman @ 2020-01-08 8:25 ` Vincent Guittot 2020-01-08 8:49 ` Mel Gorman 2020-01-08 13:18 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-08 8:25 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, 7 Jan 2020 at 21:24, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote: > > > > Taking into account child domain makes sense to me, but shouldn't we > > > > take into account the number of child group instead ? This should > > > > reflect the number of different LLC caches. > > > > > > I guess it would but why is it inherently better? The number of domains > > > would yield a similar result if we assume that all the lower domains > > > have equal weight so it simply because the weight of the SD_NUMA domain > > > divided by the number of child domains. > > > > but that's not what you are doing in your proposal. You are using > > directly child->span_weight which reflects the number of CPUs in the > > child and not the number of group > > > > you should do something like sds->busiest->span_weight / > > sds->busiest->child->span_weight which gives you an approximation of > > the number of independent group inside the busiest numa node from a > > shared resource pov > > > > Now I get you, but unfortunately it also would not work out. The number > of groups is not related to the LLC except in some specific cases. > It's possible to use the first CPU to find the size of an LLC but now I > worry that it would lead to unpredictable behaviour. 
AMD has different > > numbers of LLCs per node depending on the CPU family and while Intel > > generally has one LLC per node, I imagine there are counter examples. > > This means that load balancing on different machines with similar core > > counts will behave differently due to the LLC size. It might be possible But the degree of allowed imbalance is related to this topology, so using the same value for those different machines will generate different behavior: they don't have the same HW topology but we would use the same threshold. > to infer it if the intermediate domain was DIE instead of MC but I doubt > that's guaranteed and it would still be unpredictable. It may be the type > of complexity that should only be introduced with a separate patch with > clear rationale as to why it's necessary and we are not at that threshold > so I withdraw the suggestion. The problem is that your proposal is not aligned with what you would like to do: you want to take into account the number of groups but you use the number of CPUs per group instead. Vincent > > -- > Mel Gorman > SUSE Labs
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-08 8:25 ` Vincent Guittot @ 2020-01-08 8:49 ` Mel Gorman 0 siblings, 0 replies; 28+ messages in thread From: Mel Gorman @ 2020-01-08 8:49 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Wed, Jan 08, 2020 at 09:25:38AM +0100, Vincent Guittot wrote: > On Tue, 7 Jan 2020 at 21:24, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote: > > > > > Taking into account child domain makes sense to me, but shouldn't we > > > > > take into account the number of child group instead ? This should > > > > > reflect the number of different LLC caches. > > > > > > > > I guess it would but why is it inherently better? The number of domains > > > > would yield a similar result if we assume that all the lower domains > > > > have equal weight so it simply because the weight of the SD_NUMA domain > > > > divided by the number of child domains. > > > > > > but that's not what you are doing in your proposal. You are using > > > directly child->span_weight which reflects the number of CPUs in the > > > child and not the number of group > > > > > > you should do something like sds->busiest->span_weight / > > > sds->busiest->child->span_weight which gives you an approximation of > > > the number of independent group inside the busiest numa node from a > > > shared resource pov > > > > > > > Now I get you, but unfortunately it also would not work out. The number > > of groups is not related to the LLC except in some specific cases. > > It's possible to use the first CPU to find the size of an LLC but now I > > worry that it would lead to unpredictable behaviour. 
AMD has different > > numbers of LLCs per node depending on the CPU family and while Intel > > generally has one LLC per node, I imagine there are counter examples. > > This means that load balancing on different machines with similar core > > counts will behave differently due to the LLC size. It might be possible > > But the degree of allowed imbalance is related to this topology so > using the same value for those different machine will generate a > different behavior because they don't have the same HW topology but we > use the same threshold > The differences in behaviour would be marginal given that the original fixed value for the v3 patch would generally be smaller than an LLC. For the moment, I'm assuming that v4 will be based on the number of CPUs in the node. > > to infer it if the intermediate domain was DIE instead of MC but I doubt > > that's guaranteed and it would still be unpredictable. It may be the type > > of complexity that should only be introduced with a separate patch with > > clear rationale as to why it's necessary and we are not at that threshold > > so I withdraw the suggestion. > > The problem is that you proposal is not aligned to what you would like > to do: You want to take into account the number of groups but you use > the number of CPUs per group instead > I'm dropping the check of the child domain entirely. The lookups to get the LLC size are relatively expensive without any data indicating it's worthwhile. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 20:24 ` Mel Gorman 2020-01-08 8:25 ` Vincent Guittot @ 2020-01-08 13:18 ` Peter Zijlstra 2020-01-08 14:03 ` Mel Gorman 1 sibling, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2020-01-08 13:18 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote: > Now I get you, but unfortunately it also would not work out. The number > of groups is not related to the LLC except in some specific cases. > It's possible to use the first CPU to find the size of an LLC but now I > worry that it would lead to unpredictable behaviour. AMD has different > numbers of LLCs per node depending on the CPU family and while Intel > generally has one LLC per node, I imagine there are counter examples. Intel has the 'fun' case of an LLC spanning nodes :-), although Linux pretends this isn't so and truncates the LLC topology information to be the node again -- see arch/x86/kernel/smpboot.c:match_llc(). And of course, in the Core2 era we had the Core2Quad chips which was a dual-die solution and therefore also had multiple LLCs, and I think the Xeon variant of that would allow the multiple LLC per node situation too, although this is of course ancient hardware nobody really cares about anymore. > This means that load balancing on different machines with similar core > counts will behave differently due to the LLC size. That sounds like perfectly fine/expected behaviour to me. > It might be possible > to infer it if the intermediate domain was DIE instead of MC but I doubt > that's guaranteed and it would still be unpredictable. 
It may be the type > of complexity that should only be introduced with a separate patch with > clear rationale as to why it's necessary and we are not at that threshold > so I withdraw the suggestion. So IIRC the initial patch(es) had the idea to allow for 1 extra task imbalance to get 1-1 pairs on the same node, instead of across nodes. I don't immediately see that in these later patches. Would that be something to go back to? Would that not side-step much of the issues under discussion here? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-08 13:18 ` Peter Zijlstra @ 2020-01-08 14:03 ` Mel Gorman 2020-01-08 16:46 ` Vincent Guittot 0 siblings, 1 reply; 28+ messages in thread From: Mel Gorman @ 2020-01-08 14:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote: > On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote: > > Now I get you, but unfortunately it also would not work out. The number > > of groups is not related to the LLC except in some specific cases. > > It's possible to use the first CPU to find the size of an LLC but now I > > worry that it would lead to unpredictable behaviour. AMD has different > > numbers of LLCs per node depending on the CPU family and while Intel > > generally has one LLC per node, I imagine there are counter examples. > > Intel has the 'fun' case of an LLC spanning nodes :-), although Linux > pretends this isn't so and truncates the LLC topology information to be > the node again -- see arch/x86/kernel/smpboot.c:match_llc(). > That's the Sub-NUMA Clustering, right? I thought that manifested as a separate NUMA node in Linux. Not exactly right from a topology point of view, but close enough. Still, you're right in that identifying the LLC specifically would not necessarily be the way to go. FWIW, I have one machine with SNC enabled and didn't notice anything problematic in the versions of the patch released to date. > And of course, in the Core2 era we had the Core2Quad chips which was a > dual-die solution and therefore also had multiple LLCs, and I think the > Xeon variant of that would allow the multiple LLC per node situation > too, although this is of course ancient hardware nobody really cares > about anymore. 
> Ok, even though they're older, it's enough counter-examples to be concerned about. I'm definitely dropping the LLC considerations for the moment :) > > This means that load balancing on different machines with similar core > > counts will behave differently due to the LLC size. > > That sounds like perfectly fine/expected behaviour to me. > Even so, I think it should not be part of the initial patch. I would only be happy if I had enough different machine types to prove that specific special casing is required and we cannot simply rely on standard load balancing of the domains below SD_NUMA. > > It might be possible > > to infer it if the intermediate domain was DIE instead of MC but I doubt > > that's guaranteed and it would still be unpredictable. It may be the type > > of complexity that should only be introduced with a separate patch with > > clear rationale as to why it's necessary and we are not at that threshold > > so I withdraw the suggestion. > > So IIRC the initial patch(es) had the idea to allow for 1 extra task > imbalance to get 1-1 pairs on the same node, instead of across nodes. I > don't immediately see that in these later patches. > Not quite -- I had a minimum allowed imbalance of 2 tasks -- mostly for very small NUMA domains. By v3, it was not necessary because the value was hard-coded regardless of the number of CPUs. I think I'll leave it out because I don't think it's worth worrying about the imbalance between NUMA domains of less than 4 CPUs (do they even exist any more?) > Would that be something to go back to? Would that not side-step much of > the issues under discussion here? Allowing just 1 extra task would work for netperf in some cases except when softirq is involved. It would partially work for IO on ext4 as it's only communicating with one journal thread but a bit more borderline for XFS due to workqueue usage. 
XFS is not a massive concern in this context as the workqueue is close to the IO issuer and short-lived so I don't think it would crop up much for load balancing unlike ext4 where jbd2 could be very active. If v4 of the patch fails to meet approval then I'll try a patch that allows a hard-coded imbalance of 2 tasks (one communicating task and one kthread) regardless of NUMA domain span up to 50% of utilisation and see what that gets. One way or the other, I would like to get basic NUMA balancing issue out of the way before even looking at NUMA balancing and whether it needs to be adjusted based on the load balancing rewrite. NUMA balancing already has some logic that fights load balancer decisions and I don't want to make that worse, I would be delighted if we could even delete that migration retry check that overrides the load balancer because it always was a bit nasty. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-08 14:03 ` Mel Gorman @ 2020-01-08 16:46 ` Vincent Guittot 2020-01-08 18:03 ` Mel Gorman 0 siblings, 1 reply; 28+ messages in thread From: Vincent Guittot @ 2020-01-08 16:46 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML Le Wednesday 08 Jan 2020 à 14:03:49 (+0000), Mel Gorman a écrit : > On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote: > > On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote: > > > Now I get you, but unfortunately it also would not work out. The number > > > of groups is not related to the LLC except in some specific cases. > > > It's possible to use the first CPU to find the size of an LLC but now I > > > worry that it would lead to unpredictable behaviour. AMD has different > > > numbers of LLCs per node depending on the CPU family and while Intel > > > generally has one LLC per node, I imagine there are counter examples. > > > > Intel has the 'fun' case of an LLC spanning nodes :-), although Linux > > pretends this isn't so and truncates the LLC topology information to be > > the node again -- see arch/x86/kernel/smpboot.c:match_llc(). > > > > That's the Sub-NUMA Clustering, right? I thought that manifested as a > separate NUMA node in Linux. Not exactly right from a topology point of > view, but close enough. Still, you're right in that identifying the LLC > specifically would not necessarily be the way to go. FWIW, I have one > machine with SNC enabled and didn't notice anything problematic in the > versions of the patch released to date. 
> > > And of course, in the Core2 era we had the Core2Quad chips which was a > > dual-die solution and therefore also had multiple LLCs, and I think the > > Xeon variant of that would allow the multiple LLC per node situation > > too, although this is of course ancient hardware nobody really cares > > about anymore. > > > > Ok, even though they're older, it's enough counter-examples to be concerned > about. I'm definitely dropping the LLC considerations for the moment :) > > > > This means that load balancing on different machines with similar core > > > counts will behave differently due to the LLC size. > > > > That sounds like perfectly fine/expected behaviour to me. > > > > Even so, I think it should not be part of the initial patch. I would only > be happy if I had enough different machine types to prove that specific > special casing is required and we cannot simply rely on standard load > balancing of the domains below SD_NUMA. > > > > It might be possible > > > to infer it if the intermediate domain was DIE instead of MC but I doubt > > > that's guaranteed and it would still be unpredictable. It may be the type > > > of complexity that should only be introduced with a separate patch with > > > clear rationale as to why it's necessary and we are not at that threshold > > > so I withdraw the suggestion. > > > > So IIRC the initial patch(es) had the idea to allow for 1 extra task > > imbalance to get 1-1 pairs on the same node, instead of across nodes. I > > don't immediately see that in these later patches. > > > > Not quite -- I had a minimum allowed imbalance of 2 tasks -- mostly for > very small NUMA domains. By v3, it was not necessary because the value > was hard-coded regardless of the number of CPUs. I think I'll leave it > out because I don't think it's worth worrying about the imbalance between > NUMA domains of less than 4 CPUs (do they even exist any more?) > > > Would that be something to go back to? 
Would that not side-step much of > > the issues under discussion here? > > Allowing just 1 extra task would work for netperf in some cases except when > softirq is involved. It would partially work for IO on ext4 as it's only > communicating with one journal thread but a bit more borderline for XFS > due to workqueue usage. XFS is not a massive concern in this context as > the workqueue is close to the IO issuer and short-lived so I don't think > it would crop up much for load balancing unlike ext4 where jbd2 could be > very active. > > If v4 of the patch fails to meet approval then I'll try a patch that My main concern with v4 was the mismatch between the computed value and the goal to not overload the LLCs > allows a hard-coded imbalance of 2 tasks (one communicating task and If there is no good way to compute the allowed imbalance, a hard-coded value of 2 is probably a simple value to start with > one kthread) regardless of NUMA domain span up to 50% of utilisation Are you sure that it's necessary? This degree of imbalance already applies only if the group has spare capacity. Something like:

+	/* Consider allowing a small imbalance between NUMA groups */
+	if (env->sd->flags & SD_NUMA) {
+
+		/*
+		 * Until we find a good way to compute an acceptable
+		 * degree of imbalance linked to the system topology,
+		 * one that will not impact memory bandwidth and
+		 * latency, let's start with a fixed small value.
+		 */
+		imbalance_adj = 2;
+
+		/*
+		 * Ignore small imbalances when the busiest group has
+		 * low utilisation.
+		 */
+		env->imbalance -= min(env->imbalance, imbalance_adj);
+	}

> and see what that gets. One way or the other, I would like to get the basic > NUMA balancing issue out of the way before even looking at NUMA balancing > and whether it needs to be adjusted based on the load balancing rewrite. 
> NUMA balancing already has some logic that fights load balancer decisions > and I don't want to make that worse, I would be delighted if we could > even delete that migration retry check that overrides the load balancer > because it always was a bit nasty. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
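Vincent's fixed-value adjustment can be modelled as a tiny standalone helper. The sketch below is invented for illustration (the wrapper function and parameter names are not the kernel's); only the arithmetic — subtracting min(imbalance, 2) on NUMA domains — mirrors the snippet:

```c
#include <assert.h>

/*
 * Standalone sketch of the proposed adjustment: on an SD_NUMA domain,
 * forgive an imbalance of up to two tasks so that a tiny imbalance
 * (e.g. a communicating pair plus a kthread) does not trigger a
 * cross-node migration. On non-NUMA domains the imbalance is
 * returned unchanged.
 */
static unsigned long numa_adjusted_imbalance(unsigned long imbalance,
					     int sd_is_numa)
{
	const unsigned long imbalance_adj = 2;	/* fixed small value */

	if (sd_is_numa)
		imbalance -= imbalance < imbalance_adj ? imbalance
						       : imbalance_adj;
	return imbalance;
}
```

An imbalance of one or two tasks is reduced to zero (no migration), while larger imbalances are still mostly corrected.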
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-08 16:46 ` Vincent Guittot @ 2020-01-08 18:03 ` Mel Gorman 0 siblings, 0 replies; 28+ messages in thread From: Mel Gorman @ 2020-01-08 18:03 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Wed, Jan 08, 2020 at 05:46:57PM +0100, Vincent Guittot wrote: > > Allowing just 1 extra task would work for netperf in some cases except when > > softirq is involved. It would partially work for IO on ext4 as it's only > > communicating with one journal thread but a bit more borderline for XFS > > due to workqueue usage. XFS is not a massive concern in this context as > > the workqueue is close to the IO issuer and short-lived so I don't think > > it would crop up much for load balancing unlike ext4 where jbd2 could be > > very active. > > > > If v4 of the patch fails to meet approval then I'll try a patch that > > My main concern with v4 was the mismatch between the computed value and the goal to not overload the LLCs > Fair enough. > > allows a hard-coded imbalance of 2 tasks (one communicating task and > > If there is no good way to compute the allowed imbalance, a hard coded > value of 2 is probably simple value to start with Indeed. > > > one kthread) regardless of NUMA domain span up to 50% of utilisation > > Are you sure that it's necessary ? This degree of imbalance already applies only if the group has spare capacity > > something like > > + /* Consider allowing a small imbalance between NUMA groups */ > + if (env->sd->flags & SD_NUMA) { > + > + /* > + * Until we found a good way to compute an acceptable > + * degree of imbalance linked to the system topology > + * and that will not impatc mem bandwith and latency, > + * let start with a fixed small value. 
> + */ > + imbalance_adj = 2; > + > + /* > + * Ignore small imbalances when the busiest group has > + * low utilisation. > + */ > + env->imbalance -= min(env->imbalance, imbalance_adj); > + } > This is more or less what I had in mind with the exception that the "low utilisation" part of the comment would go away. The 50% utilisation may be unnecessary and was based simply on the idea that at that point memory bandwidth, HT considerations or both would also be dominating factors. I can leave out the check and add it in as a separate patch if proven to be necessary. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 9:56 ` Mel Gorman 2020-01-07 11:17 ` Vincent Guittot @ 2020-01-07 11:22 ` Peter Zijlstra 2020-01-07 11:42 ` Mel Gorman 2020-01-07 12:28 ` Peter Zijlstra 2020-01-07 19:26 ` Phil Auld 2 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2020-01-07 11:22 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 09:56:55AM +0000, Mel Gorman wrote: > Much more importantly, doing what you suggest allows an imbalance > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2 > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3 > caches. That is going to have side-effects. While I also do not account > for the LLC group_weight, it's unlikely the cut-off I used would be > smaller than an LLC cache on a large machine as the cache. > > These two points are why I didn't take the group weight into account. > > Now if you want, I can do what you suggest anyway as long as you are happy > that the child domain weight is also taken into account and to bound the > largest possible allowed imbalance to deal with the case of a node having > multiple small LLC caches. That means that some machines will be using the > size of the node and some machines will use the size of an LLC. It's less > predictable overall as some machines will be "special" relative to others > making it harder to reproduce certain problems locally but it would take > imbalance_pct into account in a way that you're happy with. > > Also bear in mind that whether LLC is accounted for or not, the final > result should be halved similar to the other imbalance calculations to > avoid over or under load balancing. 
> +	/* Consider allowing a small imbalance between NUMA groups */
> +	if (env->sd->flags & SD_NUMA) {
> +		struct sched_domain *child = env->sd->child;

This assumes sd->child exists, which should be true for NUMA domains I suppose.

> +		unsigned int imbalance_adj;
> +
> +		/*
> +		 * Calculate an acceptable degree of imbalance based
> +		 * on imbalance_adj. However, do not allow a greater
> +		 * imbalance than the child domain's weight to avoid
> +		 * a case where the allowed imbalance spans multiple
> +		 * LLCs.
> +		 */

That comment is a wee bit misleading, @child is not an LLC per se. This could be the NUMA distance 2 domain, in which case @child is the NUMA distance 1 group.

That said, even then it probably makes sense to ensure you don't idle a whole smaller distance group.

> +		imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +		imbalance_adj = min(imbalance_adj, child->span_weight);
> +		imbalance_adj >>= 1;
> +
> +		/*
> +		 * Ignore small imbalances when the busiest group has
> +		 * low utilisation.
> +		 */
> +		if (busiest->sum_nr_running < imbalance_adj)
> +			env->imbalance = 0;
> +	}
> +
> 	return;
> }
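The arithmetic in the hunk under review can be followed in isolation. The sketch below is a hedged standalone model — the function and its scalar inputs are invented stand-ins for the struct fields quoted above:

```c
#include <assert.h>

/*
 * Model of the v4 cutoff calculation: scale the busiest group's
 * weight by the margin encoded in imbalance_pct, cap the result at
 * the child domain's span so the allowed imbalance never covers more
 * CPUs than one child group, then halve it like the other imbalance
 * calculations. A busiest group running fewer tasks than the cutoff
 * has its imbalance ignored entirely.
 */
static unsigned int numa_imbalance_cutoff(unsigned int group_weight,
					  unsigned int imbalance_pct,
					  unsigned int child_span_weight)
{
	unsigned int imbalance_adj;

	/* imbalance_pct == 125 allows a 25% margin of the group weight... */
	imbalance_adj = group_weight * (imbalance_pct - 100) / 100;
	/* ...bounded so it does not span more than one child domain... */
	if (imbalance_adj > child_span_weight)
		imbalance_adj = child_span_weight;
	/* ...then halved to avoid over or under load balancing. */
	return imbalance_adj >> 1;
}
```

For example, with group_weight of 48 and imbalance_pct of 125, the cutoff is 6 tasks; with a small 4-CPU child domain the cap dominates and the cutoff falls to 2.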
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 11:22 ` Peter Zijlstra @ 2020-01-07 11:42 ` Mel Gorman 2020-01-07 12:29 ` Peter Zijlstra 2020-01-07 12:28 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Mel Gorman @ 2020-01-07 11:42 UTC (permalink / raw) To: Peter Zijlstra Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote: > > Much more importantly, doing what you suggest allows an imbalance > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2 > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3 > > caches. That is going to have side-effects. While I also do not account > > for the LLC group_weight, it's unlikely the cut-off I used would be > > smaller than an LLC cache on a large machine as the cache. > > > > These two points are why I didn't take the group weight into account. > > > > Now if you want, I can do what you suggest anyway as long as you are happy > > that the child domain weight is also taken into account and to bound the > > largest possible allowed imbalance to deal with the case of a node having > > multiple small LLC caches. That means that some machines will be using the > > size of the node and some machines will use the size of an LLC. It's less > > predictable overall as some machines will be "special" relative to others > > making it harder to reproduce certain problems locally but it would take > > imbalance_pct into account in a way that you're happy with. > > > > Also bear in mind that whether LLC is accounted for or not, the final > > result should be halved similar to the other imbalance calculations to > > avoid over or under load balancing. 
> > > + /* Consider allowing a small imbalance between NUMA groups */ > > + if (env->sd->flags & SD_NUMA) { > > + struct sched_domain *child = env->sd->child; > > This assumes sd-child exists, which should be true for NUMA domains I > suppose. > I would be stunned if it was not. What sort of NUMA domain would not have child domains? Does a memory-only NUMA node with no CPUs even generate a scheduler domain? If it does, then I guess the check is necessary. > > + unsigned int imbalance_adj; > > + > > + /* > > + * Calculate an acceptable degree of imbalance based > > + * on imbalance_adj. However, do not allow a greater > > + * imbalance than the child domains weight to avoid > > + * a case where the allowed imbalance spans multiple > > + * LLCs. > > + */ > > That comment is a wee misleading, @child is not an LLC per se. This > could be the NUMA distance 2 domain, in which case @child is the NUMA > distance 1 group. > > That said, even then it probably makes sense to ensure you don't idle a > whole smaller distance group. > I hadn't considered that case but even then, it's just a comment fix. Thanks. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 2020-01-07 11:42 ` Mel Gorman @ 2020-01-07 12:29 ` Peter Zijlstra 0 siblings, 0 replies; 28+ messages in thread From: Peter Zijlstra @ 2020-01-07 12:29 UTC (permalink / raw) To: Mel Gorman Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML On Tue, Jan 07, 2020 at 11:42:11AM +0000, Mel Gorman wrote: > On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote: > > > + /* Consider allowing a small imbalance between NUMA groups */ > > > + if (env->sd->flags & SD_NUMA) { > > > + struct sched_domain *child = env->sd->child; > > > > This assumes sd->child exists, which should be true for NUMA domains I > > suppose. > > > > I would be stunned if it was not. What sort of NUMA domain would not have > child domains? Does a memory-only NUMA node with no CPUs even generate > a scheduler domain? If it does, then I guess the check is necessary. I think it's fine, it was just my paranoia triggering. At the very least we'll have the single CPU domain there.
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:22         ` Peter Zijlstra
  2020-01-07 11:42           ` Mel Gorman
@ 2020-01-07 12:28         ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-01-07 12:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
    Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
    Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2020 at 09:56:55AM +0000, Mel Gorman wrote:
> > +		unsigned int imbalance_adj;
> > +
> > +		/*
> > +		 * Calculate an acceptable degree of imbalance based
> > +		 * on imbalance_adj. However, do not allow a greater
> > +		 * imbalance than the child domains weight to avoid
> > +		 * a case where the allowed imbalance spans multiple
> > +		 * LLCs.
> > +		 */
> 
> That comment is a wee misleading, @child is not an LLC per se. This
> could be the NUMA distance 2 domain, in which case @child is the NUMA
> distance 1 group.
> 
> That said, even then it probably makes sense to ensure you don't idle a
> whole smaller distance group.

Hmm, one more thing. On AMD EPYC, with its multiple LLCs, you'll have
the single NODE domain in between, and that is not marked with SD_NUMA
(iirc).

So specifically the case you want to handle is not in fact handled. The
first SD_NUMA (distance-1) will have all NODE children, which on EPYC
are not LLCs.
* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 9:56        ` Mel Gorman
  2020-01-07 11:17         ` Vincent Guittot
  2020-01-07 11:22         ` Peter Zijlstra
@ 2020-01-07 19:26       ` Phil Auld
  2 siblings, 0 replies; 28+ messages in thread
From: Phil Auld @ 2020-01-07 19:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
    Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
    Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

Hi,

On Tue, Jan 07, 2020 at 09:56:55AM +0000 Mel Gorman wrote:
> 
> util_avg can be skewed if there are big outliers. Even then, it's not
> a great metric for the low utilisation cutoff. Large numbers of mostly
> idle but running tasks would be treated similarly to small numbers of
> fully active tasks. It's less predictable and harder to reason about how
> load balancing behaves across a variety of workloads.
> 
> Based on what you suggest, the result looks like this (build tested
> only)

(Here I'm calling the below patch v4 for lack of a better name.)

One of my concerns is to have the group imbalance issue addressed. This
is the one remaining issue from the wasted cores paper. I have a setup
that is designed to illustrate this case. I ran a number of tests with
the small imbalance patches (v3 and v4 in this case) and both before
and after Vincent's load balancing rework.

The basic test is to run an LU.c benchmark from the NAS parallel
benchmark suite along with a couple of other cpu burning tasks. The
GROUP case is LU and each cpu hog in separate cgroups. The NORMAL case
is all of these in one cgroup. This shows one of the problems of
averaging the group scheduling load: the jobs fail to balance across
the NUMA nodes. It ends up with idle CPUs in the nodes where the cpu
hogs are running while overloading LU.c threads on others, with a big
impact on the benchmark's performance. This test benefits from getting
balanced well quickly.
The test machine is a 4-node 80 cpu x86_64 system (smt on). There are
76 threads in the LU.c test and 2 stress cpu jobs. Each row shows the
numbers for 10 runs to smooth it out and make the mean more, well,
meaningful. It's still got a fair bit of variance as you can see from
the 3 sets of data points for each kernel.

5.4.0   is before the load balancing rework (the really bad case).
5.5-rc2 is with the load balancing rework.
lbv3    is Mel's posted v3 patch on top of 5.5-rc2.
lbv4    is Mel's experimental v4, from email discussion with Vincent.

lbv4 appears a little worse for the GROUP case. v3 and 5.5-rc2 are
pretty close to the same. All of the post-5.4.0 cases lose a little on
the NORMAL case. lbv3 seems to get a fair bit of that loss back on
average but with a bit higher variability. This test can be pretty
variable though so the minor differences probably don't mean that much.

In all the post-rework cases we are still showing vast improvement in
the GROUP case, which given the common use of cgroups in modern
workloads is a good thing.
----------------------------------

GROUP - LU.c and cpu hogs in separate cgroups

Mop/s - Higher is better
============76_GROUP========Mop/s===================================
         min        q1         median     q3         max
5.4.0    1671.8     4211.2     6103.0     6934.1     7865.4
5.4.0    1777.1     3719.9     4861.8     5822.5     13479.6
5.4.0    2015.3     2716.2     5007.1     6214.5     9491.7
5.5-rc2  27641.0    30684.7    32091.8    33417.3    38118.1
5.5-rc2  27386.0    29795.2    32484.1    36004.0    37704.3
5.5-rc2  26649.6    29485.0    30379.7    33116.0    36832.8
lbv3     28496.3    29716.0    30634.8    32998.4    40945.2
lbv3     27294.7    29336.4    30186.0    31888.3    35839.1
lbv3     27099.3    29325.3    31680.1    35973.5    39000.0
lbv4     27936.4    30109.0    31724.8    33150.7    35905.1
lbv4     26431.0    29355.6    29850.1    32704.4    36060.3
lbv4     27436.6    29945.9    31076.9    32207.8    35401.5

Runtime - Lower is better
============76_GROUP========time====================================
         min        q1         median     q3         max
5.4.0    259.2      294.92     335.39     484.33     1219.61
5.4.0    151.3      351.1      419.4      551.99     1147.3
5.4.0    214.8      328.16     407.27     751.03     1011.77
5.5-rc2  53.49      61.03      63.56      66.46      73.77
5.5-rc2  54.08      56.67      62.78      68.44      74.45
5.5-rc2  55.36      61.61      67.14      69.16      76.51
lbv3     49.8       61.8       66.59      68.62      71.55
lbv3     56.89      63.95      67.55      69.51      74.7
lbv3     52.28      56.68      64.38      69.54      75.24
lbv4     56.79      61.52      64.3       67.73      72.99
lbv4     56.54      62.36      68.31      69.47      77.14
lbv4     57.6       63.33      65.64      68.11      74.32

NORMAL - LU.c and cpu hogs all in one cgroup

Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
         min        q1         median     q3         max
5.4.0    32912.6    34047.5    36739.4    39124.1    41592.5
5.4.0    29937.7    33060.6    34860.8    39528.8    43328.1
5.4.0    31851.2    34281.1    35284.4    36016.8    38847.4
5.5-rc2  30475.6    32505.1    33977.3    34876      36233.8
5.5-rc2  30657.7    31301.1    32059.4    34396.7    38661.8
5.5-rc2  31022      32247.6    32628.9    33245      38572.3
lbv3     30606.4    32794.4    34258.6    35699      38669.2
lbv3     29722.7    30558.9    32731.2    36412      40752.3
lbv3     30297.7    32568.3    36654.6    38066.2    38988.3
lbv4     30084.9    31227.5    32312.8    33222.8    36039.7
lbv4     29875.9    32903.6    33803.1    34519.3    38663.5
lbv4     27923.3    30631.1    32666.9    33516.7    36663.4

Runtime - Lower is better
============76_NORMAL========time====================================
         min        q1         median     q3         max
5.4.0    49.02      52.115     55.58      59.89      61.95
5.4.0    47.06      51.615     58.57      61.68      68.11
5.4.0    52.49      56.615     57.795     59.48      64.02
5.5-rc2  56.27      58.47      60.02      62.735     66.91
5.5-rc2  52.74      59.295     63.605     65.145     66.51
5.5-rc2  52.86      61.335     62.495     63.23      65.73
lbv3     52.73      57.12      59.52      62.19      66.62
lbv3     50.03      56.02      62.39      66.725     68.6
lbv3     52.3       53.565     55.65      62.645     67.3
lbv4     56.58      61.375     63.135     65.3       67.77
lbv4     52.74      59.07      60.335     61.97      68.25
lbv4     55.61      60.835     62.42      66.635     73.02

So aside from the theoretical disputes, the posted v3 seems reasonable.
When a final version comes together I'll have the perf team run a
fuller set of tests.

Cheers,
Phil

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..1b2c7bed2db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> 	/*
> 	 * Try to use spare capacity of local group without overloading it or
> 	 * emptying busiest.
> -	 * XXX Spreading tasks across NUMA nodes is not always the best policy
> -	 * and special care should be taken for SD_NUMA domain level before
> -	 * spreading the tasks. For now, load_balance() fully relies on
> -	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
> 	 */
> 	if (local->group_type == group_has_spare) {
> 		if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> 			env->migration_type = migrate_task;
> 			lsub_positive(&nr_diff, local->sum_nr_running);
> 			env->imbalance = nr_diff >> 1;
> -			return;
> -		}
> +		} else {
> 
> -		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> -		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
> 						busiest->idle_cpus) >> 1);
> +		}
> +
> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			struct sched_domain *child = env->sd->child;
> +			unsigned int imbalance_adj;
> +
> +			/*
> +			 * Calculate an acceptable degree of imbalance based
> +			 * on imbalance_adj. However, do not allow a greater
> +			 * imbalance than the child domains weight to avoid
> +			 * a case where the allowed imbalance spans multiple
> +			 * LLCs.
> +			 */
> +			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +			imbalance_adj = min(imbalance_adj, child->span_weight);
> +			imbalance_adj >>= 1;
> +
> +			/*
> +			 * Ignore small imbalances when the busiest group has
> +			 * low utilisation.
> +			 */
> +			if (busiest->sum_nr_running < imbalance_adj)
> +				env->imbalance = 0;
> +		}
> +
> 		return;
> 	}
> 
> --
end of thread, other threads:[~2020-01-08 18:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
2019-12-20 12:40 ` Valentin Schneider
2019-12-20 14:22   ` Mel Gorman
2019-12-20 15:32     ` Valentin Schneider
2019-12-21 11:25       ` Mel Gorman
2019-12-22 12:00 ` Srikar Dronamraju
2019-12-23 13:31 ` Vincent Guittot
2019-12-23 13:41   ` Vincent Guittot
2020-01-03 14:31     ` Mel Gorman
2020-01-06 13:55       ` Vincent Guittot
2020-01-06 14:52         ` Mel Gorman
2020-01-07  8:38           ` Vincent Guittot
2020-01-07  9:56             ` Mel Gorman
2020-01-07 11:17               ` Vincent Guittot
2020-01-07 11:56                 ` Mel Gorman
2020-01-07 16:00                   ` Vincent Guittot
2020-01-07 20:24                     ` Mel Gorman
2020-01-08  8:25                       ` Vincent Guittot
2020-01-08  8:49                         ` Mel Gorman
2020-01-08 13:18                           ` Peter Zijlstra
2020-01-08 14:03                             ` Mel Gorman
2020-01-08 16:46                           ` Vincent Guittot
2020-01-08 18:03                             ` Mel Gorman
2020-01-07 11:22               ` Peter Zijlstra
2020-01-07 11:42                 ` Mel Gorman
2020-01-07 12:29                   ` Peter Zijlstra
2020-01-07 12:28                 ` Peter Zijlstra
2020-01-07 19:26               ` Phil Auld