* [PATCH v3 0/7] Improve scheduler scalability for fast path
@ 2019-06-09  1:49 subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
  ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

The current select_idle_sibling first tries to find a fully idle core
using select_idle_core, which can potentially search all cores; if that
fails, it finds any idle cpu using select_idle_cpu, which in turn can
potentially search all cpus in the LLC domain. This doesn't scale for
large LLC domains and will only get worse as core counts grow.

This patch series solves the scalability problem by:

- Setting an upper and lower limit on the idle cpu search in
  select_idle_cpu to keep the search time low and constant
- Adding a new sched feature SIS_CORE to disable select_idle_core

It also introduces a new per-cpu variable, next_cpu, to track how far
the previous search got, so that each search starts where the last one
ended. This rotating search window over the cpus in the LLC domain
ensures that idle cpus are eventually found under high load.

Following are the performance numbers with various benchmarks with
SIS_CORE true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline        %stdev  patch             %stdev
1       0.5816          8.94    0.5903 (-1.5%)    11.28
2       0.6428          10.64   0.5843 (9.1%)     4.93
4       1.0152          1.99    0.9965 (1.84%)    1.83
8       1.8128          1.4     1.7921 (1.14%)    1.76
16      3.1666          0.8     3.1345 (1.01%)    0.81
32      5.6084          0.83    5.5677 (0.73%)    0.8

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline        %stdev  patch             %stdev
8       2095.45         1.82    2102.6 (0.34%)    2.11
16      4218.45         0.06    4221.35 (0.07%)   0.38
32      7531.36         0.49    7607.18 (1.01%)   0.25
48      10206.42        0.21    10324.26 (1.15%)  0.13
64      12053.73        0.1     12158.3 (0.87%)   0.24
128     14810.33        0.04    14840.4 (0.2%)    0.38

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch             %stdev
20      1               0.9     1.0068 (0.68%)    0.27
40      1               0.8     1.0103 (1.03%)    1.24
60      1               0.34    1.0178 (1.78%)    0.49
80      1               0.53    1.0092 (0.92%)    1.5
100     1               0.79    1.0090 (0.9%)     0.88
120     1               0.06    1.0048 (0.48%)    0.72
140     1               0.22    1.0116 (1.16%)    0.05
160     1               0.57    1.0264 (2.64%)    0.67
180     1               0.81    1.0194 (1.94%)    0.91
200     1               0.44    1.028 (2.8%)      3.09
220     1               1.74    1.0229 (2.29%)    0.21

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine
with message size = 8k (higher is better):
threads baseline        %stdev  patch             %stdev
8       45.36           0.43    46.28 (2.01%)     0.29
16      87.81           0.82    89.67 (2.12%)     0.38
32      151.19          0.02    153.5 (1.53%)     0.41
48      190.2           0.21    194.79 (2.41%)    0.07
64      190.42          0.35    202.9 (6.55%)     1.66
128     323.86          0.28    343.56 (6.08%)    1.34

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       629.8           603.83 (-4.12%)
2       1159.65         1155.75 (-0.34%)
4       2121.61         2093.99 (-1.3%)
8       2620.52         2641.51 (0.8%)
16      2879.31         2897.6 (0.64%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       256.41          255.8 (-0.24%)
2       509.89          504.52 (-1.05%)
4       999.44          1003.74 (0.43%)
8       1982.7          1976.42 (-0.32%)
16      3891.51         3916.04 (0.63%)
32      6819.24         6845.06 (0.38%)
64      8542.95         8568.28 (0.3%)
128     15277.6         15754.6 (3.12%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline        %stdev  patch               %stdev
50              94              2.82    92 (2.13%)          2.17
75              124             2.13    122 (1.61%)         1.42
90              152             1.74    151 (0.66%)         0.66
95              171             2.11    170 (0.58%)         0
99              512.67          104.96  208.33 (59.36%)     1.2
99.5            2296            82.55   3674.66 (-60.05%)   22.19
99.9            12517.33        2.38    12784 (-2.13%)      0.66

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline        %stdev  patch             %stdev
1       1.3085          6.65    1.2213 (6.66%)    10.32
2       1.4559          8.55    1.5048 (-3.36%)   4.72
4       2.6271          1.74    2.5532 (2.81%)    2.02
8       4.7089          3.01    4.5118 (4.19%)    2.74
16      8.7406          2.25    8.6801 (0.69%)    4.78
32      17.7835         1.01    16.759 (5.76%)    1.38
64      36.1901         0.65    34.6652 (4.21%)   1.24
128     72.6585         0.51    70.9762 (2.32%)   0.9

Following are the performance numbers with various benchmarks with
SIS_CORE false (idle core search disabled). This improves the
throughput of some workloads but increases the latency of others.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline        %stdev  patch             %stdev
1       0.5816          8.94    0.5835 (-0.33%)   8.21
2       0.6428          10.64   0.5752 (10.52%)   4.05
4       1.0152          1.99    0.9946 (2.03%)    2.56
8       1.8128          1.4     1.7619 (2.81%)    1.88
16      3.1666          0.8     3.1275 (1.23%)    0.42
32      5.6084          0.83    5.5856 (0.41%)    0.89

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline        %stdev  patch             %stdev
8       2095.45         1.82    2084.72 (-0.51%)  1.65
16      4218.45         0.06    4179.69 (-0.92%)  0.18
32      7531.36         0.49    7623.18 (1.22%)   0.39
48      10206.42        0.21    10159.16 (-0.46%) 0.21
64      12053.73        0.1     12087.21 (0.28%)  0.19
128     14810.33        0.04    14894.08 (0.57%)  0.08

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch             %stdev
20      1               0.9     1.0056 (0.56%)    0.34
40      1               0.8     1.0173 (1.73%)    0.13
60      1               0.34    0.9995 (-0.05%)   0.85
80      1               0.53    1.0175 (1.75%)    1.56
100     1               0.79    1.0151 (1.51%)    1.31
120     1               0.06    1.0244 (2.44%)    0.5
140     1               0.22    1.034 (3.4%)      0.66
160     1               0.57    1.0362 (3.62%)    0.07
180     1               0.81    1.041 (4.1%)      0.8
200     1               0.44    1.0233 (2.33%)    1.4
220     1               1.74    1.0125 (1.25%)    1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine
with message size = 8k (higher is better):
threads baseline        %stdev  patch             %stdev
8       45.36           0.43    46.94 (3.48%)     0.2
16      87.81           0.82    91.75 (4.49%)     0.43
32      151.19          0.02    167.74 (10.95%)   1.29
48      190.2           0.21    200.57 (5.45%)    0.89
64      190.42          0.35    226.74 (19.07%)   1.79
128     323.86          0.28    348.12 (7.49%)    0.77

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       629.8           600.19 (-4.7%)
2       1159.65         1162.07 (0.21%)
4       2121.61         2112.27 (-0.44%)
8       2620.52         2645.55 (0.96%)
16      2879.31         2828.87 (-1.75%)
32      2791.24         2760.97 (-1.08%)
64      1853.07         1747.66 (-5.69%)
128     1484.95         1459.81 (-1.69%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       256.41          258.11 (0.67%)
2       509.89          509.13 (-0.15%)
4       999.44          1016.58 (1.72%)
8       1982.7          2006.53 (1.2%)
16      3891.51         3964.43 (1.87%)
32      6819.24         7376.92 (8.18%)
64      8542.95         9660.45 (13.08%)
128     15277.6         15438.4 (1.05%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline        %stdev  patch               %stdev
50              94              2.82    94.67 (-0.71%)      2.2
75              124             2.13    124.67 (-0.54%)     1.67
90              152             1.74    154.33 (-1.54%)     0.75
95              171             2.11    176.67 (-3.31%)     0.86
99              512.67          104.96  4130.33 (-705.65%)  79.41
99.5            2296            82.55   10066.67 (-338.44%) 26.15
99.9            12517.33        2.38    12869.33 (-2.81%)   0.8

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline        %stdev  patch             %stdev
1       1.3085          6.65    1.2514 (4.36%)    11.1
2       1.4559          8.55    1.5433 (-6%)      3.05
4       2.6271          1.74    2.5626 (2.5%)     2.69
8       4.7089          3.01    4.5316 (3.77%)    2.95
16      8.7406          2.25    8.6585 (0.94%)    2.91
32      17.7835         1.01    17.175 (3.42%)    1.38
64      36.1901         0.65    35.5294 (1.83%)   1.02
128     72.6585         0.51    71.8821 (1.07%)   1.05

Following are the schbench numbers with both SIS_CORE false and
SIS_PROP false. This recovers the latency increase seen with SIS_CORE
false alone.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline        %stdev  patch               %stdev
50              94              2.82    93.33 (0.71%)       1.24
75              124             2.13    122.67 (1.08%)      1.7
90              152             1.74    149.33 (1.75%)      2.35
95              171             2.11    167 (2.34%)         2.74
99              512.67          104.96  206 (59.82%)        8.86
99.5            2296            82.55   3121.67 (-35.96%)   97.37
99.9            12517.33        2.38    12592 (-0.6%)       1.67

Changes from v2->v3:
- Use a shift operator instead of multiplication to compute the limit
- Use a per-CPU variable to precompute the number of SMT siblings on x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/topology.h        |  4 ++++
 kernel/sched/core.c             |  2 ++
 kernel/sched/fair.c             | 31 +++++++++++++++++++++++--------
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

--
2.9.3
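As a standalone illustration of the clamping arithmetic the series adds
to select_idle_cpu() (patch 1/7 below), here is a minimal userspace C
sketch; the helper name and the example numbers are illustrative only,
not kernel code:

/*
 * Sketch (not kernel code) of the search-budget clamp from patch 1/7:
 * the number of CPUs scanned is proportional to avg_idle/avg_cost,
 * clamped to [smt_weight, 2 * smt_weight].
 */
#include <stdio.h>
#include <stdint.h>

static int search_budget(uint64_t span_weight, uint64_t avg_idle,
			 uint64_t avg_cost, int smt_weight)
{
	int floor = smt_weight < 2 ? 2 : smt_weight;
	int limit = floor << 1;			/* twice the SMT width */
	uint64_t span_avg = span_weight * avg_idle;
	int nr;

	if (span_avg > floor * avg_cost) {
		nr = (int)(span_avg / avg_cost);
		if (nr > limit)
			nr = limit;
	} else {
		nr = floor;
	}
	return nr;
}

int main(void)
{
	/* 88-CPU LLC, SMT2: the budget always lands in [2, 4] CPUs */
	printf("%d\n", search_budget(88, 500, 20000, 2));	/* -> 2 */
	printf("%d\n", search_budget(88, 2000, 20000, 2));	/* -> 4 (clamped) */
	return 0;
}

With an SMT2 sibling mask the scanned-CPU budget stays in [2, 4] no
matter how large the LLC is, which is what keeps the search cost low
and constant.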
* [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Put upper and lower limits on the cpu search of select_idle_cpu. The
lower limit is the number of cpus in a core, while the upper limit is
twice that. This ensures that for any architecture we will usually
search beyond a core. The upper limit also helps keep the search cost
low and constant.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, limit, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		limit = floor << 1;
+		if (span_avg > floor*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > limit)
+				nr = limit;
+		} else {
+			nr = floor;
+		}
 	}
 
 	time = local_clock();
--
2.9.3
* [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
  ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Introduce a per-cpu variable to track the limit up to which the idle
cpu search was done in select_idle_cpu(). This allows the next search
to start from there, and is necessary for rotating the search window
over the entire LLC domain.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include <trace/events/sched.h>
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
--
2.9.3
* [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 4/7] sched: add sched feature to disable idle core search subhra mazumdar
  ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Rotate the cpu search window for a better spread of threads. This
ensures that an idle cpu is quickly found if one exists.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, limit, floor, nr = INT_MAX;
+	int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
--
2.9.3
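The interaction of the per-cpu next_cpu bookkeeping from patch 2/7 with
the wrapped iteration above can be simulated in plain userspace C. This
is a sketch only; a single global stands in for the kernel's per-cpu
variable, and the CPU array replaces the sched-domain cpumask:

/*
 * Sketch of the rotating search window (patches 2/7 and 3/7): each
 * search resumes at the CPU where the previous one stopped, so even a
 * small per-search budget eventually sweeps the whole LLC domain.
 */
#include <stdio.h>

#define NR_CPUS 8

static int next_cpu = -1;	/* per-cpu variable in the kernel */

/* Scan at most 'budget' CPUs, wrapping, resuming at the last stop. */
static int scan_window(const int idle[NR_CPUS], int target, int budget)
{
	int start = (next_cpu != -1) ? next_cpu : target;

	for (int i = 0; i < budget; i++) {
		int cpu = (start + i) % NR_CPUS;

		next_cpu = cpu;		/* remember how far we got */
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

int main(void)
{
	int idle[NR_CPUS] = { 0, 0, 0, 0, 0, 0, 1, 0 };

	/*
	 * Budget of 3, starting at CPU 0: the window covers 0-2, then
	 * 2-4, then 4-6, finding the idle CPU 6 on the third search.
	 */
	for (int i = 0; i < 3; i++)
		printf("found: %d\n", scan_window(idle, 0, 3));
	return 0;
}

Each failed search advances the window by budget - 1 CPUs (the last
examined CPU is re-scanned), so under sustained load the window sweeps
the entire domain even though each individual search stays bounded.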
* [PATCH v3 4/7] sched: add sched feature to disable idle core search
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  ` (2 preceding siblings ...)
  2019-06-09  1:49 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 5/7] sched: SIS_CORE " subhra mazumdar
  ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Add a new sched feature SIS_CORE to provide an option to disable idle
core search (select_idle_core).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.9.3
* [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  ` (3 preceding siblings ...)
  2019-06-09  1:49 ` [PATCH v3 4/7] sched: add sched feature to disable idle core search subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling subhra mazumdar
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it
improves throughput. There are also workloads where disabling it hurts
latency, so an option is needed.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
--
2.9.3
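The control flow this patch introduces into select_idle_sibling() can
be summarized with a small runnable sketch; the stub functions and
their return values are made up for illustration, and in the kernel the
flag is sched_feat(SIS_CORE), toggled through
/sys/kernel/debug/sched_features with CONFIG_SCHED_DEBUG:

/*
 * Sketch of the SIS_CORE gating added in patch 5/7: a runtime flag
 * decides whether the potentially expensive whole-core search runs
 * before the bounded per-cpu search.
 */
#include <stdio.h>
#include <stdbool.h>

static bool sis_core = true;	/* sched_feat(SIS_CORE) stand-in */

static int select_idle_core_stub(void) { return 1; }	/* pretend core 1 is fully idle */
static int select_idle_cpu_stub(void)  { return 3; }	/* bounded search finds cpu 3 */

static int select_idle_sibling_sketch(void)
{
	if (sis_core) {
		int i = select_idle_core_stub();

		if (i >= 0)
			return i;
	}
	return select_idle_cpu_stub();
}

int main(void)
{
	printf("picked cpu %d\n", select_idle_sibling_sketch()); /* 1 */
	sis_core = false;	/* echo NO_SIS_CORE > sched_features */
	printf("picked cpu %d\n", select_idle_sibling_sketch()); /* 3 */
	return 0;
}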
* [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  ` (4 preceding siblings ...)
  2019-06-09  1:49 ` [PATCH v3 5/7] sched: SIS_CORE to disable idle core search subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  2019-06-09  1:49 ` [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling subhra mazumdar
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Introduce a per-cpu variable that holds the number of HT siblings of a
cpu. This will be used for quick lookup in select_idle_cpu to determine
the search limits. This patch does it only for x86.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/topology.h        |  4 ++++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)	(per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		per_cpu(cpumask_weight_sibling, cpu) =
+				cpumask_weight(topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);
 
-		if ((i == cpu) || (has_smt && match_smt(c, o)))
+		if ((i == cpu) || (has_smt && match_smt(c, o))) {
 			link_mask(topology_sibling_cpumask, cpu, i);
+			threads = cpumask_weight(topology_sibling_cpumask(cpu));
+			per_cpu(cpumask_weight_sibling, cpu) = threads;
+			per_cpu(cpumask_weight_sibling, i) = threads;
+		}
 
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
 	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	per_cpu(cpumask_weight_sibling, 0) =
+				cpumask_weight(topology_sibling_cpumask(0));
 	cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+		per_cpu(cpumask_weight_sibling, sibling) =
+				cpumask_weight(topology_sibling_cpumask(sibling));
 		/*
 		 * last thread sibling in this cpu core going down
 		 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
+	per_cpu(cpumask_weight_sibling, cpu) = 0;
 	cpumask_clear(topology_core_cpumask(cpu));
 	c->cpu_core_id = 0;
 	c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)	\
+		cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
--
2.9.3
* [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
  2019-06-09  1:49 [PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  ` (5 preceding siblings ...)
  2019-06-09  1:49 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
@ 2019-06-09  1:49 ` subhra mazumdar
  6 siblings, 0 replies; 11+ messages in thread
From: subhra mazumdar @ 2019-06-09  1:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Use the per-cpu variable cpumask_weight_sibling for quick lookup in
select_idle_cpu. This is the scheduler fast path, where every cycle is
worth saving; cpumask_weight has to iterate over the mask words each
time it is called.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		floor = cpumask_weight(topology_sibling_cpumask(target));
+		floor = topology_sibling_weight(target);
 		if (floor < 2)
 			floor = 2;
 		limit = floor << 1;
--
2.9.3
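The trade-off behind caching the sibling weight (patches 6/7 and 7/7)
is a one-time popcount at topology-setup time versus a popcount on
every fast-path call. A rough userspace approximation, using a 64-bit
integer in place of a struct cpumask (all names here are illustrative):

/*
 * Sketch of why the SMT-sibling count is cached: cpumask_weight() is
 * a popcount over the whole mask on every call, while the cached
 * per-cpu value is a single load.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 64

static uint64_t sibling_mask[NR_CPUS];		/* topology_sibling_cpumask() */
static unsigned int sibling_weight[NR_CPUS];	/* cached weight (patch 6/7) */

/* Done once at topology-setup time (set_cpu_sibling_map() in the patch). */
static void cache_sibling_weights(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sibling_weight[cpu] =
			(unsigned int)__builtin_popcountll(sibling_mask[cpu]);
}

int main(void)
{
	/* SMT2 pairing: cpu N is sibling with cpu N^1 */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sibling_mask[cpu] = (1ULL << cpu) | (1ULL << (cpu ^ 1));

	cache_sibling_weights();

	/* The fast path (patch 7/7) reads the cache instead of popcounting. */
	printf("siblings of cpu 5: %u\n", sibling_weight[5]);	/* 2 */
	return 0;
}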
* [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
@ 2019-06-27  1:29 subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
  0 siblings, 1 reply; 11+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

Hi,

Resending this patchset; it would be good to get some feedback. Any
suggestions that would make it more acceptable are welcome. We have
been shipping this with the Unbreakable Enterprise Kernel in Oracle
Linux.

[The rest of the cover letter -- description, benchmark results,
v2->v3 changelog, patch list, and diffstat -- is identical to the
original v3 posting of 2019-06-09 above.]

--
2.9.3
* [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-28 18:47   ` Parth Shah
  0 siblings, 1 reply; 11+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

[Commit message and diff are identical to the 2019-06-09 posting of
patch 1/7 above.]

--
2.9.3
* Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2019-06-28 18:47   ` Parth Shah
  2019-06-28 22:21     ` Subhra Mazumdar
  0 siblings, 1 reply; 11+ messages in thread
From: Parth Shah @ 2019-06-28 18:47 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

On 6/27/19 6:59 AM, subhra mazumdar wrote:
> Put upper and lower limits on the cpu search of select_idle_cpu. The
> lower limit is the number of cpus in a core, while the upper limit is
> twice that. This ensures that for any architecture we will usually
> search beyond a core. The upper limit also helps keep the search cost
> low and constant.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b58f08f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, nr = INT_MAX;
> +	int cpu, limit, floor, nr = INT_MAX;
> 
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> 
>  	if (sched_feat(SIS_PROP)) {
>  		u64 span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> +		floor = cpumask_weight(topology_sibling_cpumask(target));
> +		if (floor < 2)
> +			floor = 2;
> +		limit = floor << 1;

Is the upper limit only an experimental value, or does it have some
arch-specific significance? Because, AFAIU, systems like POWER9 might
benefit from searching across 4 cores due to its different cache model,
so it could then be tuned for arch-specific builds.

Also, the variable names could be changed for better readability:
floor -> weight_clamp_min
limit -> weight_clamp_max
or something similar.

> +		if (span_avg > floor*avg_cost) {
>  			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> +			if (nr > limit)
> +				nr = limit;
> +		} else {
> +			nr = floor;
> +		}
>  	}
> 
>  	time = local_clock();
> 

Best,
Parth
* Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-28 18:47   ` Parth Shah
@ 2019-06-28 22:21     ` Subhra Mazumdar
  0 siblings, 0 replies; 11+ messages in thread
From: Subhra Mazumdar @ 2019-06-28 22:21 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani, daniel.lezcano,
      vincent.guittot, viresh.kumar, tim.c.chen, mgorman

On 6/28/19 11:47 AM, Parth Shah wrote:
> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>> Put upper and lower limits on the cpu search of select_idle_cpu. The
>> lower limit is the number of cpus in a core, while the upper limit is
>> twice that. This ensures that for any architecture we will usually
>> search beyond a core. The upper limit also helps keep the search cost
>> low and constant.
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>  kernel/sched/fair.c | 15 +++++++++++----
>>  1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f35930f..b58f08f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>  	u64 avg_cost, avg_idle;
>>  	u64 time, cost;
>>  	s64 delta;
>> -	int cpu, nr = INT_MAX;
>> +	int cpu, limit, floor, nr = INT_MAX;
>>
>>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>  	if (!this_sd)
>> @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>  	if (sched_feat(SIS_PROP)) {
>>  		u64 span_avg = sd->span_weight * avg_idle;
>> -		if (span_avg > 4*avg_cost)
>> +		floor = cpumask_weight(topology_sibling_cpumask(target));
>> +		if (floor < 2)
>> +			floor = 2;
>> +		limit = floor << 1;
> Is the upper limit only an experimental value, or does it have some
> arch-specific significance? Because, AFAIU, systems like POWER9 might
> benefit from searching across 4 cores due to its different cache model,
> so it could then be tuned for arch-specific builds.
The lower and upper bounds are one core and two cores respectively.
That is done so as to search beyond one core while at the same time not
searching too much. It is a heuristic that, coupled with the moving
window mechanism, seemed to work well on all archs. Does 4 vs 2 make
any difference on your POWER9? AFAIR it didn't on SPARC SMT8.
>
> Also, the variable names could be changed for better readability:
> floor -> weight_clamp_min
> limit -> weight_clamp_max
> or something similar.
OK.

Thanks,
Subhra
>
>> +		if (span_avg > floor*avg_cost) {
>>  			nr = div_u64(span_avg, avg_cost);
>> -		else
>> -			nr = 4;
>> +			if (nr > limit)
>> +				nr = limit;
>> +		} else {
>> +			nr = floor;
>> +		}
>>  	}
>>
>>  	time = local_clock();
>>
>
> Best,
> Parth
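If the arch-specific tuning suggested in this exchange were pursued,
one possible shape for it is an overridable scaling factor; this is
purely hypothetical, and nothing like it exists in the posted series:

/*
 * Hypothetical sketch of the reviewer's suggestion: let architectures
 * scale the upper clamp (e.g. 4x the SMT width instead of 2x on a
 * machine like POWER9).
 */
#include <stdio.h>

#ifndef ARCH_SIS_LIMIT_SHIFT
#define ARCH_SIS_LIMIT_SHIFT 1		/* default: limit = floor << 1 */
#endif

static int search_limit(int smt_weight)
{
	int floor = smt_weight < 2 ? 2 : smt_weight;

	return floor << ARCH_SIS_LIMIT_SHIFT;
}

int main(void)
{
	printf("SMT4 limit: %d\n", search_limit(4));	/* 8 with the default shift */
	return 0;
}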