* [PATCH RFC 0/2] Fix race window during idle cpu selection. @ 2017-10-31 5:27 Atish Patra 2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra 2017-10-31 5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra 0 siblings, 2 replies; 25+ messages in thread From: Atish Patra @ 2017-10-31 5:27 UTC (permalink / raw) To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

The 1st patch actually fixes the issue. The 2nd patch adds a new element in schedstat, intended only for testing.

Atish Patra (2):
  sched: Minimize the idle cpu selection race window.
  sched: Add a stat for idle cpu selection race window.

 kernel/sched/core.c      |  6 ++++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stats.c     |  5 +++--
 5 files changed, 21 insertions(+), 5 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-10-31 5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra @ 2017-10-31 5:27 ` Atish Patra 2017-10-31 8:20 ` Peter Zijlstra 2017-10-31 5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra 1 sibling, 1 reply; 25+ messages in thread From: Atish Patra @ 2017-10-31 5:27 UTC (permalink / raw) To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

Currently, multiple tasks can wake up on the same cpu from the select_idle_sibling() path if they wake up simultaneously and last ran on the same llc. This happens because an idle cpu is not updated until the idle task is scheduled out, so any task waking during that period may select that cpu as a wakeup candidate.

Introduce a per-cpu variable that is set as soon as a cpu is selected for wakeup for any task. This prevents other tasks from selecting the same cpu again. Note: this does not close the race window, but narrows it to the access to the per-cpu variable. If two wakee tasks access the per-cpu variable at the same time, they may still select the same cpu; but it reduces the race window considerably.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
 kernel/sched/core.c      |  4 ++++
 kernel/sched/fair.c      | 12 +++++++++---
 kernel/sched/idle_task.c |  1 +
 kernel/sched/sched.h     |  1 +
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2288a14..d9d501c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3917,6 +3918,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13393bb..885023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				idle = false;
 		}
 
-		if (idle)
+		if (idle) {
+			per_cpu(claim_wakeup, core) = 1;
 			return core;
+		}
 	}
 
 	/*
@@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			return cpu;
+		}
 	}
 
 	return -1;
@@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
+		if (idle_cpu(cpu)) {
+			per_cpu(claim_wakeup, cpu) = 1;
 			break;
+		}
 	}
 
 	time = local_clock() - time;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..64d6495 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b4..5f70b98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread
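To make the race concrete outside the kernel, here is a minimal standalone C sketch of the mechanism the patch adds. Everything in it is a toy model written for illustration: the toy_* functions, the fixed cpu_idle[] states, and the two-thread wakeup are assumptions, not kernel code. Only the shape of the logic mirrors the patch (check idle, then claim with a plain, non-atomic store), which is also why the window is narrowed rather than closed.

/*
 * Toy model of the claim_wakeup scheme (illustration only, not kernel
 * code). Two wakers scan for an idle cpu concurrently; without the
 * claim flag both would pick cpu 0, since the "idle" state here is
 * never cleared, just as a real cpu stays "idle" until its idle task
 * is scheduled out.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4

static int cpu_idle[NR_CPUS] = { 1, 1, 1, 1 };	/* toy idle states */
static int claim_wakeup[NR_CPUS];		/* per-cpu claim flags */

/* Mirrors idle_cpu() after the patch: a claimed cpu no longer looks idle. */
static int toy_idle_cpu(int cpu)
{
	return cpu_idle[cpu] && !claim_wakeup[cpu];
}

/* Mirrors the select_idle_cpu() loop: claim the cpu as soon as it is chosen. */
static int toy_select_idle_cpu(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (toy_idle_cpu(cpu)) {
			claim_wakeup[cpu] = 1;	/* plain store: a tiny race remains */
			return cpu;
		}
	}
	return -1;
}

static void *waker(void *arg)
{
	printf("waker %ld picked cpu %d\n", (long)arg, toy_select_idle_cpu());
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, waker, (void *)0L);
	pthread_create(&b, NULL, waker, (void *)1L);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Usually prints two different cpus; remove the claim store and
	 * both wakers always report cpu 0. */
	return 0;
}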
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-10-31 5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra @ 2017-10-31 8:20 ` Peter Zijlstra 2017-10-31 8:48 ` Mike Galbraith 2017-11-05 0:58 ` Joel Fernandes 0 siblings, 2 replies; 25+ messages in thread From: Peter Zijlstra @ 2017-10-31 8:20 UTC (permalink / raw) To: Atish Patra; +Cc: linux-kernel, joelaf, peterz, brendan.jackman, jbacik, mingo On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: > Currently, multiple tasks can wakeup on same cpu from > select_idle_sibiling() path in case they wakeup simulatenously > and last ran on the same llc. This happens because an idle cpu > is not updated until idle task is scheduled out. Any task waking > during that period may potentially select that cpu for a wakeup > candidate. > > Introduce a per cpu variable that is set as soon as a cpu is > selected for wakeup for any task. This prevents from other tasks > to select the same cpu again. Note: This does not close the race > window but minimizes it to accessing the per-cpu variable. If two > wakee tasks access the per cpu variable at the same time, they may > select the same cpu again. But it minimizes the race window > considerably. The most important question: does it actually help? What benchmarks give what numbers? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-10-31 8:20 ` Peter Zijlstra @ 2017-10-31 8:48 ` Mike Galbraith 2017-11-01 6:08 ` Atish Patra 2017-11-05 0:58 ` Joel Fernandes 1 sibling, 1 reply; 25+ messages in thread From: Mike Galbraith @ 2017-10-31 8:48 UTC (permalink / raw) To: Peter Zijlstra, Atish Patra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote: > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: > > Currently, multiple tasks can wakeup on same cpu from > > select_idle_sibiling() path in case they wakeup simulatenously > > and last ran on the same llc. This happens because an idle cpu > > is not updated until idle task is scheduled out. Any task waking > > during that period may potentially select that cpu for a wakeup > > candidate. > > > > Introduce a per cpu variable that is set as soon as a cpu is > > selected for wakeup for any task. This prevents from other tasks > > to select the same cpu again. Note: This does not close the race > > window but minimizes it to accessing the per-cpu variable. If two > > wakee tasks access the per cpu variable at the same time, they may > > select the same cpu again. But it minimizes the race window > > considerably. > > The very most important question; does it actually help? What > benchmarks, give what numbers? I played with something ~similar (cmpxchg() idle cpu reservation) a while back in the context of schbench, and it did help that, but for generic fast mover benchmarks, the added overhead had the expected effect: it shaved throughput a wee bit (rob Peter, pay Paul, repeat). I still have the patch lying about in my rubbish heap, but didn't bother to save any of the test results. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
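For comparison with the per-cpu flag in the patch, here is a userspace C11 sketch of the cmpxchg()-style reservation Mike mentions. The try_reserve_cpu()/release_cpu() names and the userspace atomics are illustrative assumptions; a kernel version would use cmpxchg() on a per-cpu word. Unlike the plain store, the atomic read-modify-write closes the race completely, which is exactly where the extra overhead Mike observed comes from.

/* Userspace sketch, not kernel code: an atomic reservation guarantees
 * that only one waker can win a given cpu. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static atomic_int cpu_reserved[NR_CPUS];

/* A waker tries to reserve a cpu; the compare-and-swap lets exactly
 * one racing caller succeed. */
static bool try_reserve_cpu(int cpu)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&cpu_reserved[cpu],
					      &expected, 1);
}

/* The idle task would drop the reservation when it is switched out. */
static void release_cpu(int cpu)
{
	atomic_store(&cpu_reserved[cpu], 0);
}

int main(void)
{
	/* Two back-to-back wakers racing for cpu 0: exactly one wins. */
	printf("first waker reserved cpu 0: %d\n", try_reserve_cpu(0));  /* 1 */
	printf("second waker reserved cpu 0: %d\n", try_reserve_cpu(0)); /* 0 */
	release_cpu(0);
	return 0;
}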
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-10-31 8:48 ` Mike Galbraith @ 2017-11-01 6:08 ` Atish Patra 2017-11-01 6:54 ` Mike Galbraith 0 siblings, 1 reply; 25+ messages in thread From: Atish Patra @ 2017-11-01 6:08 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On 10/31/2017 03:48 AM, Mike Galbraith wrote: > On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote: >> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: >>> Currently, multiple tasks can wakeup on same cpu from >>> select_idle_sibiling() path in case they wakeup simulatenously >>> and last ran on the same llc. This happens because an idle cpu >>> is not updated until idle task is scheduled out. Any task waking >>> during that period may potentially select that cpu for a wakeup >>> candidate. >>> >>> Introduce a per cpu variable that is set as soon as a cpu is >>> selected for wakeup for any task. This prevents from other tasks >>> to select the same cpu again. Note: This does not close the race >>> window but minimizes it to accessing the per-cpu variable. If two >>> wakee tasks access the per cpu variable at the same time, they may >>> select the same cpu again. But it minimizes the race window >>> considerably. >> The very most important question; does it actually help? What >> benchmarks, give what numbers?

Here are the numbers from one of the OLTP configurations on an 8-socket x86 machine:

kernel     txn/minute (normalized)   user/sys
baseline   1.0                       80/5
pcpu       1.021                     84/5

The throughput gains are not very high and are close to the run-to-run variation. The schedstat data (added for testing in patch 2/2) indicates that there are many instances of the race condition being addressed, but maybe not enough to trigger a significant throughput change. All the other benchmarks I tested (TPCC, hackbench, schbench, swingbench) did not show any regression. I will let Joel post numbers from Android benchmarks.

> I played with something ~similar (cmpxchg() idle cpu reservation)

I had an atomic version earlier as well. Peter's suggestion for per cpu seems to perform slightly better than atomic. Thus, this patch has the per cpu version.

> a > while back in the context of schbench, and it did help that,

Do you have the schbench configuration somewhere that I can test? I tried various configurations but did not see any improvement or regression.

> but for > generic fast mover benchmarks, the added overhead had the expected > effect, it shaved throughput a wee bit (rob Peter, pay Paul, repeat).

Which benchmark? Is it hackbench or something else? I have not found any regression yet in my testing. I would be happy to test any other benchmark or a different configuration for hackbench.

Regards, Atish

> I still have the patch lying about in my rubbish heap, but didn't > bother to save any of the test results. > > -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-01 6:08 ` Atish Patra @ 2017-11-01 6:54 ` Mike Galbraith 2017-11-01 7:18 ` Mike Galbraith 0 siblings, 1 reply; 25+ messages in thread From: Mike Galbraith @ 2017-11-01 6:54 UTC (permalink / raw) To: Atish Patra, Peter Zijlstra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote: > > On 10/31/2017 03:48 AM, Mike Galbraith wrote: > > > I played with something ~similar (cmpxchg() idle cpu reservation) > I had an atomic version earlier as well. Peter's suggestion for per cpu > seems to perform slightly better than atomic. Yeah, no doubt about that. > Thus, this patch has the per cpu version. > > a > > while back in the context of schbench, and it did help that, > Do you have the schbench configuration somewhere that I can test? I > tried various configurations but did not > see any improvement or regression. No, as noted, I didn't save anything. I watched and fiddled with several configurations on various sized boxen, doing this/that on top of the reservation thing to try to improve the "it's 100% about perfect spread" schbench, but didn't like the pricetag, so tossed the lot over my shoulder and walked away. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-01 6:54 ` Mike Galbraith @ 2017-11-01 7:18 ` Mike Galbraith 2017-11-01 16:36 ` Atish Patra 0 siblings, 1 reply; 25+ messages in thread From: Mike Galbraith @ 2017-11-01 7:18 UTC (permalink / raw) To: Atish Patra, Peter Zijlstra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote: > On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote: > > > Do you have the schbench configuration somewhere that I can test? I > > tried various configurations but did not > > see any improvement or regression. > > No, as noted, I didn't save anything. I watched and fiddled with > several configurations on various sized boxen, doing this/that on top > of the reservation thing to try to improve the "it's 100% about perfect > spread" schbench, but didn't like the pricetag, so tossed the lot over > my shoulder and walked away. BTW, the biggest trouble with schbench at that time was not so much about wakeup time stacking (though it was visible); it was using load averages instead of instantaneous state, and sawtoothing therein. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-01 7:18 ` Mike Galbraith @ 2017-11-01 16:36 ` Atish Patra 2017-11-01 20:20 ` Mike Galbraith 0 siblings, 1 reply; 25+ messages in thread From: Atish Patra @ 2017-11-01 16:36 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On 11/01/2017 02:18 AM, Mike Galbraith wrote: > On Wed, 2017-11-01 at 07:54 +0100, Mike Galbraith wrote: >> On Wed, 2017-11-01 at 01:08 -0500, Atish Patra wrote: >> >>> Do you have the schbench configuration somewhere that I can test? I >>> tried various configurations but did not >>> see any improvement or regression. >> No, as noted, I didn't save anything. I watched and fiddled with >> several configurations on various sized boxen, doing this/that on top >> of the reservation thing to try to improve the "it's 100% about perfect >> spread" schbench, but didn't like the pricetag, so tossed the lot over >> my shoulder and walked away. :) > BTW, the biggest trouble with schbench at that time was not so much > about wakeup time stacking (though it was visible), it was using load > averages instead of instantaneous state, and sawtoothing therein. Ahh, I see. Any other benchmark suggestions that may benefit from this fix? Regards, Atish > -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-01 16:36 ` Atish Patra @ 2017-11-01 20:20 ` Mike Galbraith 0 siblings, 0 replies; 25+ messages in thread From: Mike Galbraith @ 2017-11-01 20:20 UTC (permalink / raw) To: Atish Patra, Peter Zijlstra Cc: linux-kernel, joelaf, brendan.jackman, jbacik, mingo On Wed, 2017-11-01 at 11:36 -0500, Atish Patra wrote: > > Any other benchmark suggestion that may benefit from this fix ? Nope. I'll be kinda surprised if you find anything where you don't have to squint to see a dinky signal modulating white noise :) -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-10-31 8:20 ` Peter Zijlstra 2017-10-31 8:48 ` Mike Galbraith @ 2017-11-05 0:58 ` Joel Fernandes 2017-11-22 5:23 ` Atish Patra 1 sibling, 1 reply; 25+ messages in thread From: Joel Fernandes @ 2017-11-05 0:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Atish Patra, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar Hi Peter, On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: >> Currently, multiple tasks can wakeup on same cpu from >> select_idle_sibiling() path in case they wakeup simulatenously >> and last ran on the same llc. This happens because an idle cpu >> is not updated until idle task is scheduled out. Any task waking >> during that period may potentially select that cpu for a wakeup >> candidate. >> >> Introduce a per cpu variable that is set as soon as a cpu is >> selected for wakeup for any task. This prevents from other tasks >> to select the same cpu again. Note: This does not close the race >> window but minimizes it to accessing the per-cpu variable. If two >> wakee tasks access the per cpu variable at the same time, they may >> select the same cpu again. But it minimizes the race window >> considerably. > > The very most important question; does it actually help? What > benchmarks, give what numbers?

I collected some numbers with an Android benchmark called Jankbench. Most tests didn't show an improvement or degradation with the patch. However, one of the tests, called "list view", consistently shows an improvement. Particularly striking is the improvement at the mean and the 25th percentile.

For the list_view test, Jankbench pulls up a list of text and scrolls the list; this exercises the display pipeline in Android to render and display the animation as the scroll happens. For Android, lower frame times are considered quite important, as that means we are less likely to drop frames and more likely to give the user a good experience versus a perceivably poor one.

For each frame, Jankbench measures the total time the frame takes and stores it in a DB (the time from when the app starts drawing to when the rendering completes and the frame is submitted for display). Following is the distribution of frame times in ms.

count 16304 (@60 fps, 4.5 minutes)

       Without patch   With patch
mean   5.196633        4.429641 (+14.75%)
std    2.030054        2.310025
25%    5.606810        1.991017 (+64.48%)
50%    5.824013        5.716631 (+1.84%)
75%    5.987102        5.932751 (+0.90%)
95%    6.461230        6.301318 (+2.47%)
99%    9.828959        9.697076 (+1.34%)

Note that although Android uses energy-aware scheduling patches, I turned those off to bring the test as close to mainline as possible. I also backported Vincent's and Brendan's slow-path fixes to the 4.4 kernel that the Pixel 2 uses.

Personally, I am in favor of this patch considering this test data, but also because, in the past, I remember our teams had to deal with the same race issue and used cpusets to avoid it (although they probably tested with "energy aware" CPU selection kept on).

thanks, - Joel ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-05 0:58 ` Joel Fernandes @ 2017-11-22 5:23 ` Atish Patra 2017-11-23 10:52 ` Uladzislau Rezki 0 siblings, 1 reply; 25+ messages in thread From: Atish Patra @ 2017-11-22 5:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

Here are the results of schbench (scheduler latency benchmark) and uperf (networking benchmark).

Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
schbench config: message threads = 2; time = 180s; worker threads = variable
uperf config: ping-pong test on the loopback interface with message size = 8k

Overall, both benchmarks seem to be happiest when the number of threads is close to the number of cpus.

--------------------------------------------------------------------------------------------------------------------------
schbench Maximum Latency (lower is better):

            Base (4.14)          Base+pcpu
Num
Worker      Mean      Stdev      Mean      Stdev     Improvement (%)
10          3026.8    4987.12    523       474.35    82.7210255055
18          13854.6   1841.61    12945.6   125.19    6.5609977913
19          16457     2046.51    12985.4   48.46     21.0949747828
20          14995     2368.84    15838     2038.82   -5.621873958
25          29952.2   107.72     29673.6   337.57    0.9301487036
30          30084     19.768     30096.2   7.782     -0.0405531179
-------------------------------------------------------------------------------------------------------------------

The proposed fix seems to improve the maximum latency for lower numbers of threads. It also seems to reduce the variation (lower stdev). If the number of threads is equal to or higher than the number of cpus, the result is significantly higher latencies because of the nature of the benchmark. Results for the higher-thread cases are presented to provide a complete picture, but it is difficult to conclude anything from them.

Next, individual percentile results are presented for each case. The proposed fix also improves latency across all percentiles for the configuration (19 worker threads) that should saturate the system.
---------------------------------------------------------------------------------------------------------------------
schbench Latency in usec (lower is better)

            Baseline (4.14)      Base+pcpu
Num
Worker      Mean      stdev      Mean      stdev     Improvement (%)

50th
10          64.2      2.039      63.6      1.743     0.934
18          57.6      5.388      57        4.939     1.041
19          63        4.774      58        4         7.936
20          59.6      4.127      60.2      5.153     -1.006
25          78.4      0.489      78.2      0.748     0.255
30          96.2      0.748      96.4      1.019     -0.207

75th
10          72        3.033      71.6      2.939     0.555
18          78        2.097      77.2      2.135     1.025
19          81.6      1.2        79.4      0.8       2.696
20          81        1.264      80.4      2.332     0.740
25          109.6     1.019      110       0         -0.364
30          781.4     50.902     731.8     70.6382   6.3475

90th
10          80.4      3.666      80.6      2.576     -0.248
18          87.8      1.469      88        1.673     -0.227
19          92.8      0.979      90.6      0.489     2.370
20          92.6      1.019      92        2         0.647
25          8977.6    1277.160   9014.4    467.857   -0.409
30          9558.4    334.641    9507.2    320.383   0.5356

95th
10          86.8      3.867      87.6      4.409     -0.921
18          95.4      1.496      95.2      2.039     0.209
19          102.6     1.624      99        0.894     3.508
20          103.2     1.326      102.2     2.481     0.968
25          12400     78.383     12406.4   37.318    -0.051
30          12336     40.477     12310.4   12.8      0.207

99th
10          99.2      5.418      103.4     6.887     -4.233
18          115.2     2.561      114.6     3.611     0.5208
19          126.25    4.573      120.4     3.872     4.6336
20          145.4     3.09       133       1.41      8.5281
25          12988.8   15.676     12924.8   25.6      0.4927
30          12988.8   15.676     12956.8   32.633    0.2463

99.50th
10          104.4     5.161      109.8     7.909     -5.172
18          127.6     7.391      124.2     4.214     2.6645
19          2712.2    4772.883   133.6     5.571     95.074
20          3707.8    2831.954   2844.2    4708.345  23.291
25          14032     1283.834   13008     0         7.2976
30          16550.4   886.382    13840     1218.355  16.376
------------------------------------------------------------------------------------------------------------------------

Results from uperf
uperf config: loopback ping-pong test with message size = 8k

          Baseline (4.14)      Baseline+pcpu
Threads   Mean      stdev      Mean      stdev    Improvement (%)
1         9.056     0.02       8.966     0.083    -0.993
2         17.664    0.13       17.448    0.303    -1.222
4         32.03     0.22       31.972    0.129    -0.181
8         58.198    0.31       58.588    0.198    0.670
16        101.018   0.67       100.056   0.455    -0.952
32        148.1     15.41      164.494   2.312    11.069
64        203.66    1.16       203.042   1.348    -0.3073
128       197.12    1.04       194.722   1.174    -1.2165

The race window fix seems to help uperf for 32 threads (closest to the number of cpus) as well.

Regards,
Atish

On 11/04/2017 07:58 PM, Joel Fernandes wrote: > Hi Peter, > > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: >>> Currently, multiple tasks can wakeup on same cpu from >>> select_idle_sibiling() path in case they wakeup simulatenously >>> and last ran on the same llc. This happens because an idle cpu >>> is not updated until idle task is scheduled out. Any task waking >>> during that period may potentially select that cpu for a wakeup >>> candidate. >>> >>> Introduce a per cpu variable that is set as soon as a cpu is >>> selected for wakeup for any task. This prevents from other tasks >>> to select the same cpu again. Note: This does not close the race >>> window but minimizes it to accessing the per-cpu variable. If two >>> wakee tasks access the per cpu variable at the same time, they may >>> select the same cpu again. But it minimizes the race window >>> considerably. >> The very most important question; does it actually help? What >> benchmarks, give what numbers? > I collected some numbers with an Android benchmark called Jankbench. > Most tests didn't show an improvement or degradation with the patch. > However, one of the tests called "list view", consistently shows an > improvement. Particularly striking is the improvement at mean and 25 > percentile.
> > For list_view test, Jankbench pulls up a list of text and scrolls the > list, this exercises the display pipeline in Android to render and > display the animation as the scroll happens. For Android, lower frame > times is considered quite important as that means we are less likely > to drop frames and give the user a good experience vs a perceivable > poor experience. > > For each frame, Jankbench measures the total time a frame takes and > stores it in a DB (the time from which the app starts drawing, to when > the rendering completes and the frame is submitted for display). > Following is the distribution of frame times in ms. > > count 16304 (@60 fps, 4.5 minutes) > > Without patch With patch > mean 5.196633 4.429641 (+14.75%) > std 2.030054 2.310025 > 25% 5.606810 1.991017 (+64.48%) > 50% 5.824013 5.716631 (+1.84%) > 75% 5.987102 5.932751 (+0.90%) > 95% 6.461230 6.301318 (+2.47%) > 99% 9.828959 9.697076 (+1.34%) > > Note that although Android uses energy aware scheduling patches, I > turned those off to bring the test as close to mainline as possible. I > also backported Vincent's and Brendan's slow path fixes to the 4.4 > kernel that the Pixel 2 uses. > > Personally I am in favor of this patch considering this test data but > also that in the past, I remember that our teams had to deal with the > same race issue and used cpusets to avoid it (although they probably > tested with "energy aware" CPU selection kept on). > > thanks, > > - Joel ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-22 5:23 ` Atish Patra @ 2017-11-23 10:52 ` Uladzislau Rezki 2017-11-23 13:13 ` Mike Galbraith 0 siblings, 1 reply; 25+ messages in thread From: Uladzislau Rezki @ 2017-11-23 10:52 UTC (permalink / raw) To: Atish Patra, Peter Zijlstra Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

Hello, Atish, Peter, all.

I have a question about the case where a task's nr_cpus_allowed is 1. In that scenario we do not call select_task_rq(). Therefore, even though a task "p" is placed on an idle CPU, that CPU will not be marked as claimed for wake-up.

What do you think about adding per_cpu(claim_wakeup, cpu) = 1; to select_task_rq() instead, and possibly getting rid of it in the other places (it increases the race window a bit)? Also, it would help improve the ILB decision for NO_HZ idle balancing, because as it is, find_new_ilb() simply returns the first CPU in the nohz mask. To be more clear, I mean:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
+	per_cpu(claim_wakeup, cpu) = 1;
 	return cpu;
 }
 
@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+	/*
+	 * Probably not a good approach in case of many CPUs
+	 */
+	for_each_cpu(ilb, &cpumask)
+		if (!per_cpu(claim_wakeup, ilb))
+			break;
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;

--
Uladzislau Rezki

On Tue, Nov 21, 2017 at 11:23:18PM -0600, Atish Patra wrote: > Here are the results of schbench(scheduler latency benchmark) and uperf > (networking benchmark). > > Hardware config: 20 core (40 hyperthreaded cpus) x86 box. > schbench config: message threads = 2; time = 180s, worker thread = variable > uperf config:ping pong test on loopback interface with message size = 8k > > Overall, both benchmark seems to happiest when number of threads are closer > to number of cpus.
> -------------------------------------------------------------------------------------------------------------------------- > schbench Maximum Latency(lower is better): > Base(4.14) Base+pcpu > Num > Worker Mean Stdev Mean Stdev Improvement (%) > 10 3026.8 4987.12 523 474.35 82.7210255055 > 18 13854.6 1841.61 12945.6 125.19 6.5609977913 > 19 16457 2046.51 12985.4 48.46 21.0949747828 > 20 14995 2368.84 15838 2038.82 -5.621873958 > 25 29952.2 107.72 29673.6 337.57 0.9301487036 > 30 30084 19.768 30096.2 7.782 -0.0405531179 > > ------------------------------------------------------------------------------------------------------------------- > > The proposed fix seems to improve the maximum latency for lower number > of threads. It also seems to reduce the variation(lower stdev) as well. > > If number of threads are equal or higher than number of cpus, it results in > significantly > higher latencies in because of the nature of the benchmark. Results for > higher > threads use case are presented to provide a complete picture but it is > difficult to conclude > anything from that. > > Next individual percentile results are present for each use case. The > proposed fix also > improves latency across all percentiles for configuration(19 worker threads) > which > should saturate the system. > --------------------------------------------------------------------------------------------------------------------- > schbench Latency in usec(lower is better) > Baseline(4.14) Base+pcpu > Num > Worker Mean stdev Mean stdev Improvement(%) > > 50th > 10 64.2 2.039 63.6 1.743 0.934 > 18 57.6 5.388 57 4.939 1.041 > 19 63 4.774 58 4 7.936 > 20 59.6 4.127 60.2 5.153 -1.006 > 25 78.4 0.489 78.2 0.748 0.255 > 30 96.2 0.748 96.4 1.019 -0.207 > > 75th > 10 72 3.033 71.6 2.939 0.555 > 18 78 2.097 77.2 2.135 1.025 > 19 81.6 1.2 79.4 0.8 2.696 > 20 81 1.264 80.4 2.332 0.740 > 25 109.6 1.019 110 0 -0.364 > 30 781.4 50.902 731.8 70.6382 6.3475 > > 90th > 10 80.4 3.666 80.6 2.576 -0.248 > 18 87.8 1.469 88 1.673 -0.227 > 19 92.8 0.979 90.6 0.489 2.370 > 20 92.6 1.019 92 2 0.647 > 25 8977.6 1277.160 9014.4 467.857 -0.409 > 30 9558.4 334.641 9507.2 320.383 0.5356 > > 95th > 10 86.8 3.867 87.6 4.409 -0.921 > 18 95.4 1.496 95.2 2.039 0.209 > 19 102.6 1.624 99 0.894 3.508 > 20 103.2 1.326 102.2 2.481 0.968 > 25 12400 78.383 12406.4 37.318 -0.051 > 30 12336 40.477 12310.4 12.8 0.207 > > 99th > 10 99.2 5.418 103.4 6.887 -4.233 > 18 115.2 2.561 114.6 3.611 0.5208 > 19 126.25 4.573 120.4 3.872 4.6336 > 20 145.4 3.09 133 1.41 8.5281 > 25 12988.8 15.676 12924.8 25.6 0.4927 > 30 12988.8 15.676 12956.8 32.633 0.2463 > > 99.50th > 10 104.4 5.161 109.8 7.909 -5.172 > 18 127.6 7.391 124.2 4.214 2.6645 > 19 2712.2 4772.883 133.6 5.571 95.074 > 20 3707.8 2831.954 2844.2 4708.345 23.291 > 25 14032 1283.834 13008 0 7.2976 > 30 16550.4 886.382 13840 1218.355 16.376 > ------------------------------------------------------------------------------------------------------------------------ > > > Results from uperf > uperf config: Loopback ping pong test with message size = 8k > > Baseline (4.14) Baseline +pcpu > Mean stdev Mean stdev Improvement(%) > 1 9.056 0.02 8.966 0.083 -0.993 > 2 17.664 0.13 17.448 0.303 -1.222 > 4 32.03 0.22 31.972 0.129 -0.181 > 8 58.198 0.31 58.588 0.198 0.670 > 16 101.018 0.67 100.056 0.455 -0.952 > 32 148.1 15.41 164.494 2.312 11.069 > 64 203.66 1.16 203.042 1.348 -0.3073 > 128 197.12 1.04 194.722 1.174 -1.2165 > > The race window fix seems to help uperf for 32 threads (closest to number of > cpus) as well. 
> > Regards, > Atish > > On 11/04/2017 07:58 PM, Joel Fernandes wrote: > > Hi Peter, > > > > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote: > > > > Currently, multiple tasks can wakeup on same cpu from > > > > select_idle_sibiling() path in case they wakeup simulatenously > > > > and last ran on the same llc. This happens because an idle cpu > > > > is not updated until idle task is scheduled out. Any task waking > > > > during that period may potentially select that cpu for a wakeup > > > > candidate. > > > > > > > > Introduce a per cpu variable that is set as soon as a cpu is > > > > selected for wakeup for any task. This prevents from other tasks > > > > to select the same cpu again. Note: This does not close the race > > > > window but minimizes it to accessing the per-cpu variable. If two > > > > wakee tasks access the per cpu variable at the same time, they may > > > > select the same cpu again. But it minimizes the race window > > > > considerably. > > > The very most important question; does it actually help? What > > > benchmarks, give what numbers? > > I collected some numbers with an Android benchmark called Jankbench. > > Most tests didn't show an improvement or degradation with the patch. > > However, one of the tests called "list view", consistently shows an > > improvement. Particularly striking is the improvement at mean and 25 > > percentile. > > > > For list_view test, Jankbench pulls up a list of text and scrolls the > > list, this exercises the display pipeline in Android to render and > > display the animation as the scroll happens. For Android, lower frame > > times is considered quite important as that means we are less likely > > to drop frames and give the user a good experience vs a perceivable > > poor experience. > > > > For each frame, Jankbench measures the total time a frame takes and > > stores it in a DB (the time from which the app starts drawing, to when > > the rendering completes and the frame is submitted for display). > > Following is the distribution of frame times in ms. > > > > count 16304 (@60 fps, 4.5 minutes) > > > > Without patch With patch > > mean 5.196633 4.429641 (+14.75%) > > std 2.030054 2.310025 > > 25% 5.606810 1.991017 (+64.48%) > > 50% 5.824013 5.716631 (+1.84%) > > 75% 5.987102 5.932751 (+0.90%) > > 95% 6.461230 6.301318 (+2.47%) > > 99% 9.828959 9.697076 (+1.34%) > > > > Note that although Android uses energy aware scheduling patches, I > > turned those off to bring the test as close to mainline as possible. I > > also backported Vincent's and Brendan's slow path fixes to the 4.4 > > kernel that the Pixel 2 uses. > > > > Personally I am in favor of this patch considering this test data but > > also that in the past, I remember that our teams had to deal with the > > same race issue and used cpusets to avoid it (although they probably > > tested with "energy aware" CPU selection kept on). > > > > thanks, > > > > - Joel > ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-23 10:52 ` Uladzislau Rezki @ 2017-11-23 13:13 ` Mike Galbraith 2017-11-23 16:00 ` Josef Bacik 2017-11-24 10:26 ` Uladzislau Rezki 0 siblings, 2 replies; 25+ messages in thread From: Mike Galbraith @ 2017-11-23 13:13 UTC (permalink / raw) To: Uladzislau Rezki, Atish Patra, Peter Zijlstra Cc: Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote: > Hello, Atish, Peter, all. > > I have a question about if a task's nr_cpus_allowed is 1. > In that scenario we do not call select_task_rq. Therefore > even thought a task "p" is placed on idle CPU that CPU > will not be marked as claimed for wake-up. > > What do you think about adding per_cpu(claim_wakeup, cpu) = 1; > to select_task_rq() instead and possibly get rid of them from > other places (increases a race window a bit)? My thoughts on all of this are that we need less SIS, not more. Rather than trying so hard for the absolute lowest wakeup latency, which induces throughput/efficiency robbing bouncing, I think we'd be better off considering leaving an already llc-affine task where it is if the average cycle time is sufficiently low that it will likely hit the CPU RSN. Completely ignoring low-utilization kernel threads would go a long way toward getting rid of bouncing userspace (which tends to have a meaningful footprint) all over hell and creation. You could also periodically send mobile kthreads down the slow path to try to keep them the hell away from partially busy CPUs, as well as anything else that hasn't run for a while, to keep background cruft from continually injecting itself into the middle of a cross-core cyber-sex. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
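As a rough illustration of the "will likely hit the CPU real soon now" test Mike suggests, here is a standalone C sketch that tracks a per-task average cycle time and leaves the task on its previous cpu when that average is short. The toy_task structure and both of its fields are hypothetical (nothing like them exists in mainline; Mike's own hack further down uses se.exec_start age instead of an average), and the 1/8 EWMA weight is only borrowed from the style of other scheduler averages.

/* Toy model, not kernel code: decide "stay put" from an EWMA of the
 * task's wakeup-to-wakeup cycle time. */
#include <stdio.h>

struct toy_task {
	unsigned long long avg_cycle_ns;	/* EWMA of time between wakeups */
	unsigned long long last_wake_ns;
};

/* Update the EWMA at each wakeup, with 1/8 weight for the new sample. */
static void update_cycle(struct toy_task *t, unsigned long long now)
{
	long long cycle = (long long)(now - t->last_wake_ns);

	t->avg_cycle_ns += (cycle - (long long)t->avg_cycle_ns) >> 3;
	t->last_wake_ns = now;
}

/* Leave the task on prev_cpu when it cycles fast enough that prev_cpu
 * will likely be free again before the task's cache footprint goes cold. */
static int stay_put(const struct toy_task *t, unsigned long long cold_ns)
{
	return t->avg_cycle_ns < cold_ns;
}

int main(void)
{
	struct toy_task t = { 0, 0 };
	unsigned long long now;

	/* A task waking every ~100us stays put against a 500us cutoff. */
	for (now = 100000; now <= 1000000; now += 100000)
		update_cycle(&t, now);

	printf("avg cycle %lluns -> stay_put: %d\n",
	       t.avg_cycle_ns, stay_put(&t, 500000));
	return 0;
}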
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-23 13:13 ` Mike Galbraith @ 2017-11-23 16:00 ` Josef Bacik 2017-11-23 17:40 ` Mike Galbraith 2017-11-23 21:11 ` Atish Patra 2017-11-24 10:26 ` Uladzislau Rezki 1 sibling, 2 replies; 25+ messages in thread From: Josef Bacik @ 2017-11-23 16:00 UTC (permalink / raw) To: Mike Galbraith Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote: > On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote: > > Hello, Atish, Peter, all. > > > > I have a question about if a task's nr_cpus_allowed is 1. > > In that scenario we do not call select_task_rq. Therefore > > even thought a task "p" is placed on idle CPU that CPU > > will not be marked as claimed for wake-up. > > > > What do you think about adding per_cpu(claim_wakeup, cpu) = 1; > > to select_task_rq() instead and possibly get rid of them from > > other places (increases a race window a bit)? > > My thoughts on all of this is that we need less SIS, not more. Rather > than trying so hard for the absolute lowest wakeup latency, which > induces throughput/efficiency robbing bouncing, I think we'd be better > of considering leaving an already llc affine task where it is if the > average cycle time is sufficiently low that it will likely hit the CPU > RSN. Completely ignoring low utilization kernel threads would go a > long way to getting rid of bouncing userspace (which tends to have a > meaningful footprint), all over hell and creation. > > You could also periodically send mobile kthreads down the slow path to > try to keep them the hell away from partially busy CPUs, as well as > anything else that hasn't run for a while, to keep background cruft > from continually injecting itself into the middle of a cross core > cyber-sex. > And on this thanksgiving I'm thankful for Mike, and his entertaining early morning emails. Josef ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-23 16:00 ` Josef Bacik @ 2017-11-23 17:40 ` Mike Galbraith 2017-11-23 21:11 ` Atish Patra 1 sibling, 0 replies; 25+ messages in thread From: Mike Galbraith @ 2017-11-23 17:40 UTC (permalink / raw) To: Josef Bacik Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Thu, 2017-11-23 at 11:00 -0500, Josef Bacik wrote: > > And on this thanksgiving I'm thankful for Mike, and his entertaining early > morning emails. Read it again tomorrow. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-23 16:00 ` Josef Bacik 2017-11-23 17:40 ` Mike Galbraith @ 2017-11-23 21:11 ` Atish Patra 1 sibling, 0 replies; 25+ messages in thread From: Atish Patra @ 2017-11-23 21:11 UTC (permalink / raw) To: Josef Bacik, Mike Galbraith Cc: Uladzislau Rezki, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On 2017/11/23 10:00 AM, Josef Bacik wrote: > On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote: >> On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote: >>> Hello, Atish, Peter, all. >>> >>> I have a question about if a task's nr_cpus_allowed is 1. >>> In that scenario we do not call select_task_rq. Therefore >>> even thought a task "p" is placed on idle CPU that CPU >>> will not be marked as claimed for wake-up. >>> >>> What do you think about adding per_cpu(claim_wakeup, cpu) = 1; >>> to select_task_rq() instead and possibly get rid of them from >>> other places (increases a race window a bit)? >> My thoughts on all of this is that we need less SIS, not more. Rather >> than trying so hard for the absolute lowest wakeup latency, which >> induces throughput/efficiency robbing bouncing, I think we'd be better >> of considering leaving an already llc affine task where it is if the >> average cycle time is sufficiently low that it will likely hit the CPU >> RSN. Completely ignoring low utilization kernel threads would go a >> long way to getting rid of bouncing userspace (which tends to have a >> meaningful footprint), all over hell and creation. >> >> You could also periodically send mobile kthreads down the slow path to >> try to keep them the hell away from partially busy CPUs, as well as >> anything else that hasn't run for a while, to keep background cruft >> from continually injecting itself into the middle of a cross core >> cyber-sex. >> > And on this thanksgiving I'm thankful for Mike, and his entertaining early > morning emails. :) :). > Josef ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-23 13:13 ` Mike Galbraith 2017-11-23 16:00 ` Josef Bacik @ 2017-11-24 10:26 ` Uladzislau Rezki 2017-11-24 18:46 ` Mike Galbraith 1 sibling, 1 reply; 25+ messages in thread From: Uladzislau Rezki @ 2017-11-24 10:26 UTC (permalink / raw) To: Mike Galbraith Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Thu, Nov 23, 2017 at 02:13:01PM +0100, Mike Galbraith wrote: > On Thu, 2017-11-23 at 11:52 +0100, Uladzislau Rezki wrote: > > Hello, Atish, Peter, all. > > > > I have a question about if a task's nr_cpus_allowed is 1. > > In that scenario we do not call select_task_rq. Therefore > > even thought a task "p" is placed on idle CPU that CPU > > will not be marked as claimed for wake-up. > > > > What do you think about adding per_cpu(claim_wakeup, cpu) = 1; > > to select_task_rq() instead and possibly get rid of them from > > other places (increases a race window a bit)? > > My thoughts on all of this is that we need less SIS, not more. Rather > than trying so hard for the absolute lowest wakeup latency, which > induces throughput/efficiency robbing bouncing, I think we'd be better > of considering leaving an already llc affine task where it is if the > average cycle time is sufficiently low that it will likely hit the CPU > RSN. I guess there is a misunderstanding here. The main goal is not to cover the pinned case, for sure. I was thinking more about the points below: - Extend the claim_wakeup logic to make the ILB/NO_HZ decision more predictable (which is good for mobile workloads). As it is right now, find_new_ilb() simply returns the first CPU in the "nohz" mask; if we know that CPU has been claimed, I think it is worth going with another ILB core, since waking up on a remote CPU and then doing nohz_idle_balance() does not improve wake-up latency and is a miss from the ILB point of view. - Get rid of duplication. - Not be limited to the pinned case. If you have any proposal, I would appreciate it if you could share your specific view. > Completely ignoring low utilization kernel threads would go a > long way to getting rid of bouncing userspace (which tends to have a > meaningful footprint), all over hell and creation. > > You could also periodically send mobile kthreads down the slow path to > try to keep them the hell away from partially busy CPUs, as well as > anything else that hasn't run for a while, to keep background cruft > from continually injecting itself into the middle of a cross core > cyber-sex. A CPU is not considered idle (in terms of idle_cpu()) until it passes the first check, which tests whether current is the idle thread. If we hit the last check while a claim wake-up is set, it means that the CPU is switching from the idle state to a non-idle one, and that will happen quite soon, depending on wake-up latency. Considering a core as not idle when somebody is about to wake up a task on it is a good point. If you have a specific example of when it is bad, please share it. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-24 10:26 ` Uladzislau Rezki @ 2017-11-24 18:46 ` Mike Galbraith 2017-11-26 20:58 ` Mike Galbraith 2017-11-28 9:34 ` Uladzislau Rezki 0 siblings, 2 replies; 25+ messages in thread From: Mike Galbraith @ 2017-11-24 18:46 UTC (permalink / raw) To: Uladzislau Rezki Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Fri, 2017-11-24 at 11:26 +0100, Uladzislau Rezki wrote: > > I guess there is misunderstanding here. The main goal is not to cover > pinned case, for sure. I was thinking more about below points: > > - Extend a claim_wake_up logic for making an ILB/NO_HZ decision more > predictable (that is good for mobile workloads). Because as it is > right now it simply returns a first CPU in a "nohz" mask and if we > know that CPU has been claimed i think it is worth to go with another > ILB core, since waking up on a remote CPU + doing nohz_idle_balance > does not improve wake-up latency and is a miss from ilb point of view. Using unsynchronized access of a flag set by chaotic events all over the box? > If you have any proposal, i would be appreciated if you could > share your specific view. My view is you're barking up the wrong tree: you're making the idle data SIS is using more accurate, but I question the benefit. That it makes an imperfect placement decision occasionally due to raciness is nearly meaningless compared to the cost of frequent bounce. Better (but still not perfect, that requires atomics) state information doesn't meaningfully improve decision quality. > Considering a core as not-idle when somebody tends to wake up a task on > it is a good point. If you have any specific example when it is bad, > please share it. I'm not sure how that will work out. Probably like most everything wrt SIS, first comes "woohoo" then "aw crap", and off to the trash it goes. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-24 18:46 ` Mike Galbraith @ 2017-11-26 20:58 ` Mike Galbraith 0 siblings, 0 replies; 25+ messages in thread From: Mike Galbraith @ 2017-11-26 20:58 UTC (permalink / raw) To: Uladzislau Rezki Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Fri, 2017-11-24 at 19:46 +0100, Mike Galbraith wrote: > > My view is you're barking up the wrong tree: you're making the idle > data SIS is using more accurate, but I question the benefit. That it > makes an imperfect placement decision occasionally due to raciness is > nearly meaningless compared to the cost of frequent bounce.

Playing with SIS (yet again), below is a hack that illustrates why I think the occasional races are nearly meaningless. Box = i4790 desktop.

                      master   masterx
TCP_STREAM-1   Avg:    70495     71295
TCP_STREAM-2   Avg:    54388     66202
TCP_STREAM-4   Avg:    19316     21413
TCP_STREAM-8   Avg:     9678      8894
TCP_STREAM-16  Avg:     4286      4360
TCP_MAERTS-1   Avg:    69238     71799
TCP_MAERTS-2   Avg:    50729     65612
TCP_MAERTS-4   Avg:    19095     21984
TCP_MAERTS-8   Avg:     9405      8888
TCP_MAERTS-16  Avg:     4891      4371
TCP_RR-1       Avg:   198617    203291
TCP_RR-2       Avg:   152862    191761
TCP_RR-4       Avg:   112241    117888
TCP_RR-8       Avg:   104453    113260
TCP_RR-16      Avg:    50897     55280
UDP_RR-1       Avg:   250738    264214
UDP_RR-2       Avg:   196250    253352
UDP_RR-4       Avg:   152862    158819
UDP_RR-8       Avg:   143781    154071
UDP_RR-16      Avg:    68605     76492

tbench     1      2      4      8      16
master     772    1207   1829   3516   3440
masterx    811    1466   1959   3737   3670

hackbench -l 10000
5.917 5.990 5.957  avg 5.954  NO_SIS_DEBOUNCE
5.886 5.808 5.826  avg 5.840  SIS_DEBOUNCE

echo 0 > tracing_on
echo 1 > events/sched/sched_migrate_task/enable
start endless tbench 2
for i in `seq 3`
do
	echo > trace
	echo 1 > tracing_on
	sleep 10
	echo 0 > tracing_on
	cat trace|grep tbench|wc -l
done

kde desktop idling
NO_SIS_DEBOUNCE    261    208    199
SIS_DEBOUNCE         8      6      0

add firefox playing youtube documentary
NO_SIS_DEBOUNCE  10906  10094  10774
SIS_DEBOUNCE        34     34     34

tbench 2 throughput as firefox runs
NO_SIS_DEBOUNCE  1129.63 MB/sec
SIS_DEBOUNCE     1462.53 MB/sec

Advisory: welding goggles.
---
 include/linux/sched.h   |  3 ++-
 kernel/sched/fair.c     | 41 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  1 +
 3 files changed, 42 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -541,7 +541,6 @@ struct task_struct {
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
-	struct llist_node		wake_entry;
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */
@@ -549,8 +548,10 @@ struct task_struct {
 #endif
 	unsigned int			wakee_flips;
 	unsigned long			wakee_flip_decay_ts;
+	unsigned long			wakee_placed;
 	struct task_struct		*last_wakee;
 
+	struct llist_node		wake_entry;
 	int				wake_cpu;
 #endif
 	int				on_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6174,6 +6174,9 @@ static int select_idle_sibling(struct ta
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (sched_feat(SIS_DEBOUNCE))
+		p->wakee_placed = jiffies;
+
 	return target;
 }
 
@@ -6258,6 +6261,22 @@ static int wake_cap(struct task_struct *
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+static bool task_placed(struct task_struct *p)
+{
+	return p->wakee_placed == jiffies;
+}
+
+static bool task_llc_affine_and_cold(struct task_struct *p, int cpu, int prev)
+{
+	int cold = sysctl_sched_migration_cost;
+
+	if (!cpus_share_cache(cpu, prev))
+		return false;
+	if (cold > 0 && rq_clock_task(cpu_rq(prev)) - p->se.exec_start > cold)
+		return true;
+	return false;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6276,16 +6295,26 @@ select_task_rq_fair(struct task_struct *
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_debounce = 0;
 	int sync = wake_flags & WF_SYNC;
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_debounce = sched_feat(SIS_DEBOUNCE);
+		if (task_placed(p))
+			goto out_unlock;
+		/* Balance cold tasks to reduce hot task bounce tendency. */
+		if (want_debounce && task_llc_affine_and_cold(p, cpu, prev_cpu)) {
+			sd_flag |= SD_SHARE_PKG_RESOURCES;
+			sd = highest_flag_domain(prev_cpu, SD_SHARE_PKG_RESOURCES);
+			p->wakee_placed = jiffies;
+			goto pick_cpu_cold;
+		}
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6315,6 +6344,7 @@ select_task_rq_fair(struct task_struct *
 		new_cpu = cpu;
 	}
 
+pick_cpu_cold:
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
 		/*
 		 * We're going to need the task's util for capacity_spare_wake
@@ -6329,9 +6359,13 @@ select_task_rq_fair(struct task_struct *
 	if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
+		if (want_debounce && new_cpu == prev_cpu)
+			p->wakee_placed = jiffies;
+
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}
+out_unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
@@ -6952,6 +6986,9 @@ static int task_hot(struct task_struct *
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
+	if (task_placed(p))
+		return 1;
+
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_DEBOUNCE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 25+ messages in thread
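For anyone reproducing the NO_SIS_DEBOUNCE/SIS_DEBOUNCE comparison above: flags declared with SCHED_FEAT() can be toggled at runtime through /sys/kernel/debug/sched_features (writing the flag name enables it, writing it with a NO_ prefix disables it), which is presumably how the two configurations were switched between runs.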
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-24 18:46 ` Mike Galbraith 2017-11-26 20:58 ` Mike Galbraith @ 2017-11-28 9:34 ` Uladzislau Rezki 2017-11-28 10:49 ` Mike Galbraith 1 sibling, 1 reply; 25+ messages in thread From: Uladzislau Rezki @ 2017-11-28 9:34 UTC (permalink / raw) To: Mike Galbraith Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote: > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote: > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote: > > > > > My view is you're barking up the wrong tree: you're making the idle > > > data SIS is using more accurate, but I question the benefit. That it > > > makes an imperfect placement decision occasionally due to raciness is > > > nearly meaningless compared to the cost of frequent bounce. > > > Before sitting down and start testing, i just illustrated how we can > > apply claim_wake_up to ilb asking community a specific view on it: > > drawbacks, pros/cons, proposals etc. > > Using unsynchronized access of a flag set by chaotic events all over > the box?

Well, entering NO_HZ mode and checking nohz.idle_cpus_mask in nohz_balancer_kick() -> find_new_ilb() are obviously asynchronous events and, like you say, chaotic, except for the access to the cpumask variable (I mean, that is atomic). So I do not see any issues with that.

> > If you have any proposal, i would be appreciated if you could > > share your specific view. > > My view is you're barking up the wrong tree: you're making the idle > data SIS is using more accurate, but I question the benefit. That it > makes an imperfect placement decision occasionally due to raciness is > nearly meaningless compared to the cost of frequent bounce.

Before sitting down and starting to test, I just illustrated how we can apply claim_wakeup to the ILB, asking the community for a specific view on it: drawbacks, pros/cons, proposals, etc.

> Better (but still not perfect, that requires atomics) state information > doesn't meaningfully improve decision quality.

I agree, it should be atomic, something like below:

core.c

+/* cpus with claim wake-up set bit */
+cpumask_t cpu_claim_wake_up_map;
...
@@ -1559,6 +1562,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
+	cpumask_test_and_set_cpu(cpu, &cpu_claim_wake_up_map);
 	return cpu;
 }
...
+	if (cpumask_test_cpu(cpu, &cpu_claim_wake_up_map))
+		return 0;
+
 	return 1;
 }

fair.c:

 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_andnot(&cpumask, nohz.idle_cpus_mask,
+		&cpu_claim_wake_up_map);
+
+	ilb = cpumask_first(&cpumask);

> > Considering a core as not-idle when somebody tends to wake up a task on > > it is a good point. If you have any specific example when it is bad, > > please share it. > > I'm not sure how that will work out. Probably like most everything wrt > SIS, first comes "woohoo" then "aw crap", and off to the trash it goes. >

--
Uladzislau Rezki ^ permalink raw reply [flat|nested] 25+ messages in thread
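A standalone illustration of why the atomic variant closes the window: a fetch-or on a shared bitmask lets exactly one of two racing wakers win a cpu. The claim_cpu()/unclaim_cpu() helpers and the single unsigned long map are userspace stand-ins invented for this sketch; the kernel-side equivalent would be cpumask_test_and_set_cpu() as in the diff above, with pick_next_task_idle() clearing the bit.

/* Userspace sketch of the atomic claim mask (not kernel code). */
#include <stdatomic.h>
#include <stdio.h>

/* One bit per cpu; stands in for cpu_claim_wake_up_map above. */
static atomic_ulong claim_map;

/* Atomically set the cpu's bit; wins (returns 1) only if the bit was
 * previously clear, so two racing wakers cannot both win. */
static int claim_cpu(int cpu)
{
	unsigned long bit = 1UL << cpu;

	return !(atomic_fetch_or(&claim_map, bit) & bit);
}

/* The idle path would clear the bit when idle is reentered. */
static void unclaim_cpu(int cpu)
{
	atomic_fetch_and(&claim_map, ~(1UL << cpu));
}

int main(void)
{
	printf("waker A claims cpu 2: %d\n", claim_cpu(2));     /* 1: won  */
	printf("waker B claims cpu 2: %d\n", claim_cpu(2));     /* 0: lost */
	unclaim_cpu(2);
	printf("after idle, claim cpu 2: %d\n", claim_cpu(2));  /* 1 again */
	return 0;
}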
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window. 2017-11-28 9:34 ` Uladzislau Rezki @ 2017-11-28 10:49 ` Mike Galbraith 2017-11-29 10:41 ` Uladzislau Rezki 0 siblings, 1 reply; 25+ messages in thread From: Mike Galbraith @ 2017-11-28 10:49 UTC (permalink / raw) To: Uladzislau Rezki Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML, Brendan Jackman, Josef Bacik, Ingo Molnar On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote: > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote: > > > My view is you're barking up the wrong tree: you're making the idle > > data SIS is using more accurate, but I question the benefit. That it > > makes an imperfect placement decision occasionally due to raciness is > > nearly meaningless compared to the cost of frequent bounce. > Before sitting down and start testing, i just illustrated how we can > apply claim_wake_up to ilb asking community a specific view on it: > drawbacks, pros/cons, proposals etc. Even if you make the thing atomic, what is ILB supposed to do, look over its shoulder every step of the way and sh*t its pants if somebody touches claim_wake_up as it's about to or just after it did something? If you intend to make all of LB race-free, good luck. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-28 10:49           ` Mike Galbraith
@ 2017-11-29 10:41             ` Uladzislau Rezki
  2017-11-29 18:15               ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-29 10:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
    LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> >
> > > My view is you're barking up the wrong tree: you're making the idle
> > > data SIS is using more accurate, but I question the benefit. That it
> > > makes an imperfect placement decision occasionally due to raciness is
> > > nearly meaningless compared to the cost of frequent bounce.
> >
> > Before sitting down and starting to test, I just illustrated how we
> > could apply claim_wake_up to the ILB, asking the community for specific
> > views on it: drawbacks, pros/cons, proposals, etc.
>
> Even if you make the thing atomic, what is the ILB supposed to do, look
> over its shoulder every step of the way and sh*t its pants if somebody
> touches claim_wake_up as it's about to do something or just after it
> did?

If nohz.idle_cpus_mask is set for a particular CPU together with the
claim mask, it means that TIF_NEED_RESCHED is coming or is already in
place. When a CPU hits the idle thread, the claim bit gets reset and
the CPU proceeds to nohz mode, unless it runs into scheduler_ipi or
similar.

>
> If you intend to make all of LB race free, good luck.
>
> 	-Mike

Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread
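For concreteness, the reset point being described corresponds to pick_next_task_idle(), where patch 1/2 already writes claim_wakeup back to zero. A sketch of the same hook under the proposed cpumask variant (an assumption; this hunk was not posted in the thread) could look like:

 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
-	this_cpu_write(claim_wakeup, 0);
+	/* cpumask variant: drop this cpu's claim bit as it enters idle,
+	 * making it visible again to both SIS and find_new_ilb() */
+	cpumask_clear_cpu(cpu_of(rq), &cpu_claim_wake_up_map);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }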
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-29 10:41             ` Uladzislau Rezki
@ 2017-11-29 18:15               ` Mike Galbraith
  2017-11-30 12:30                 ` Uladzislau Rezki
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2017-11-29 18:15 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Atish Patra, Peter Zijlstra, Joel Fernandes, LKML,
    Brendan Jackman, Josef Bacik, Ingo Molnar

On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > > >
> > > > My view is you're barking up the wrong tree: you're making the idle
> > > > data SIS is using more accurate, but I question the benefit. That it
> > > > makes an imperfect placement decision occasionally due to raciness is
> > > > nearly meaningless compared to the cost of frequent bounce.
> > >
> > > Before sitting down and starting to test, I just illustrated how we
> > > could apply claim_wake_up to the ILB, asking the community for
> > > specific views on it: drawbacks, pros/cons, proposals, etc.
> >
> > Even if you make the thing atomic, what is the ILB supposed to do, look
> > over its shoulder every step of the way and sh*t its pants if somebody
> > touches claim_wake_up as it's about to do something or just after it
> > did?
>
> If nohz.idle_cpus_mask is set for a particular CPU together with the
> claim mask, it means that TIF_NEED_RESCHED is coming or is already in
> place. When a CPU hits the idle thread, the claim bit gets reset and
> the CPU proceeds to nohz mode, unless it runs into scheduler_ipi or
> similar.

Which means nothing to an LB operation in progress.

But whatever, I'm not going to argue endlessly about something I think
should be blatantly obvious. IMO, this is a couple points shy of
pointless. That's my 'C' to this RFC in a nutshell. I'm done.

	-Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.
  2017-11-29 18:15               ` Mike Galbraith
@ 2017-11-30 12:30                 ` Uladzislau Rezki
  0 siblings, 0 replies; 25+ messages in thread
From: Uladzislau Rezki @ 2017-11-30 12:30 UTC (permalink / raw)
To: Mike Galbraith
Cc: Uladzislau Rezki, Atish Patra, Peter Zijlstra, Joel Fernandes,
    LKML, Brendan Jackman, Josef Bacik, Ingo Molnar

On Wed, Nov 29, 2017 at 07:15:21PM +0100, Mike Galbraith wrote:
> On Wed, 2017-11-29 at 11:41 +0100, Uladzislau Rezki wrote:
> > On Tue, Nov 28, 2017 at 11:49:11AM +0100, Mike Galbraith wrote:
> > > On Tue, 2017-11-28 at 10:34 +0100, Uladzislau Rezki wrote:
> > > > On Fri, Nov 24, 2017 at 07:46:30PM +0100, Mike Galbraith wrote:
> > > > >
> > > > > My view is you're barking up the wrong tree: you're making the idle
> > > > > data SIS is using more accurate, but I question the benefit. That it
> > > > > makes an imperfect placement decision occasionally due to raciness is
> > > > > nearly meaningless compared to the cost of frequent bounce.
> > > >
> > > > Before sitting down and starting to test, I just illustrated how we
> > > > could apply claim_wake_up to the ILB, asking the community for
> > > > specific views on it: drawbacks, pros/cons, proposals, etc.
> > >
> > > Even if you make the thing atomic, what is the ILB supposed to do, look
> > > over its shoulder every step of the way and sh*t its pants if somebody
> > > touches claim_wake_up as it's about to do something or just after it
> > > did?
> >
> > If nohz.idle_cpus_mask is set for a particular CPU together with the
> > claim mask, it means that TIF_NEED_RESCHED is coming or is already in
> > place. When a CPU hits the idle thread, the claim bit gets reset and
> > the CPU proceeds to nohz mode, unless it runs into scheduler_ipi or
> > similar.
>
> Which means nothing to an LB operation in progress.

<snip>
static void nohz_idle_balance(...)
...
		/*
		 * If this cpu gets work to do, stop the load balancing
		 * work being done for other cpus. Next load
		 * balancing owner will pick it up.
		 */
		if (need_resched())
			break;
...
<snip>

> But whatever, I'm not going to argue endlessly about something I think
> should be blatantly obvious. IMO, this is a couple points shy of
> pointless. That's my 'C' to this RFC in a nutshell. I'm done.
>

Thank you.

Vlad

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH DEBUG 2/2] sched: Add a stat for idle cpu selection race window.
  2017-10-31  5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
  2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
@ 2017-10-31  5:27 ` Atish Patra
  1 sibling, 0 replies; 25+ messages in thread
From: Atish Patra @ 2017-10-31  5:27 UTC (permalink / raw)
To: linux-kernel; +Cc: atish.patra, joelaf, peterz, brendan.jackman, jbacik, mingo

This is a **Debug** only patch, not intended for merging.

A new stat is added to schedstat that counts the number of times a cpu
had already been claimed for a wakeup while some other cpu tried to
select it for scheduling again. It helps to verify whether the issue in
question is present in a specific benchmark.

Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Atish Patra <atish.patra@oracle.com>
---
 kernel/sched/core.c  | 4 +++-
 kernel/sched/sched.h | 1 +
 kernel/sched/stats.c | 5 +++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9d501c..d12b8c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3918,8 +3918,10 @@ int idle_cpu(int cpu)
 		return 0;
 #endif

-	if (per_cpu(claim_wakeup, cpu))
+	if (per_cpu(claim_wakeup, cpu)) {
+		schedstat_inc(rq->ttwu_claimed);
 		return 0;
+	}

 	return 1;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f70b98..8f3047c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -801,6 +801,7 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+	unsigned int ttwu_claimed;
 #endif

 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 87e2c9f..2290aa7 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -30,12 +30,13 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay,
+		    rq->rq_sched_info.pcount, rq->ttwu_claimed);

 		seq_printf(seq, "\n");
--
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread
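With this patch applied, ttwu_claimed appears as the last field of each per-cpu line in /proc/schedstat, right after rq_sched_info.pcount. A minimal userspace reader (a sketch; it assumes the exact 11-field cpu line layout printed by the hunk above) could be:

/* Sketch: dump the new ttwu_claimed counter per cpu.
 * Assumes the cpu line format from the stats.c hunk above. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/schedstat", "r");
	char line[512];

	if (!f) {
		perror("/proc/schedstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned int cpu, claimed;

		/* only "cpuN ..." lines match; version, timestamp and
		 * domain lines fail the literal "cpu" prefix */
		if (sscanf(line,
			   "cpu%u %*u %*u %*u %*u %*u %*u %*llu %*llu %*lu %u",
			   &cpu, &claimed) == 2)
			printf("cpu%u: ttwu_claimed=%u\n", cpu, claimed);
	}
	fclose(f);
	return 0;
}

A rising counter under a wakeup-heavy benchmark indicates how often a second wakeup would have piled onto an already-claimed cpu, i.e. how often the race window described in the cover letter is actually hit.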
end of thread, other threads:[~2017-11-30 12:30 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31  5:27 [PATCH RFC 0/2] Fix race window during idle cpu selection Atish Patra
2017-10-31  5:27 ` [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window Atish Patra
2017-10-31  8:20   ` Peter Zijlstra
2017-10-31  8:48     ` Mike Galbraith
2017-11-01  6:08       ` Atish Patra
2017-11-01  6:54         ` Mike Galbraith
2017-11-01  7:18           ` Mike Galbraith
2017-11-01 16:36             ` Atish Patra
2017-11-01 20:20               ` Mike Galbraith
2017-11-05  0:58           ` Joel Fernandes
2017-11-22  5:23             ` Atish Patra
2017-11-23 10:52   ` Uladzislau Rezki
2017-11-23 13:13     ` Mike Galbraith
2017-11-23 16:00       ` Josef Bacik
2017-11-23 17:40         ` Mike Galbraith
2017-11-23 21:11       ` Atish Patra
2017-11-24 10:26     ` Uladzislau Rezki
2017-11-24 18:46       ` Mike Galbraith
2017-11-26 20:58         ` Mike Galbraith
2017-11-28  9:34         ` Uladzislau Rezki
2017-11-28 10:49           ` Mike Galbraith
2017-11-29 10:41             ` Uladzislau Rezki
2017-11-29 18:15               ` Mike Galbraith
2017-11-30 12:30                 ` Uladzislau Rezki
2017-10-31  5:27 ` [PATCH DEBUG 2/2] sched: Add a stat for " Atish Patra