* weird loadavg on idle machine post 5.7
@ 2020-07-02 17:15 Dave Jones

From: Dave Jones @ 2020-07-02 17:15 UTC (permalink / raw)
To: Linux Kernel; +Cc: peterz, mgorman, mingo, Linus Torvalds

When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
idle machine (that usually sees loadavg hover in the 0.xx range)
it was consistently above 1.00 even when there was nothing running.
All that perf showed was the kernel spending time in the idle loop
(and running perf).

For the first hour or so after boot, everything seems fine, but over
time loadavg creeps up, and once it has established a new baseline, it
never seems to drop below that again.

One morning I woke up to find loadavg at '7.xx', after almost as many
hours of uptime, which makes me wonder if perhaps this is triggered
by something in cron.  I have a bunch of scripts that fire off
every hour that involve thousands of short-lived runs of iptables/ipset,
but running them manually didn't seem to trigger the bug.

Given it took a few hours of runtime to confirm good/bad, bisecting this
took the last two weeks.  I did it four different times, the first
producing bogus results from an over-eager 'good', but the last two runs
both implicated this commit:

    commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
    Author: Peter Zijlstra <peterz@infradead.org>
    Date:   Sun May 24 21:29:55 2020 +0100

        sched/core: Optimize ttwu() spinning on p->on_cpu

        Both Rik and Mel reported seeing ttwu() spend significant time on:

          smp_cond_load_acquire(&p->on_cpu, !VAL);

        Attempt to avoid this by queueing the wakeup on the CPU that owns the
        p->on_cpu value. This will then allow the ttwu() to complete without
        further waiting.

        Since we run schedule() with interrupts disabled, the IPI is
        guaranteed to happen after p->on_cpu is cleared, this is what makes it
        safe to queue early.

        Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
        Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
        Signed-off-by: Ingo Molnar <mingo@kernel.org>
        Cc: Jirka Hladky <jhladky@redhat.com>
        Cc: Vincent Guittot <vincent.guittot@linaro.org>
        Cc: valentin.schneider@arm.com
        Cc: Hillf Danton <hdanton@sina.com>
        Cc: Rik van Riel <riel@surriel.com>
        Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net

Unfortunately it doesn't revert cleanly on top of rc3, so I haven't
confirmed 100% that it's the cause yet, but the two separate bisects
seem promising.

I don't see any obvious correlation between what's changing there and
the symptoms (other than "scheduler magic"), but maybe those closer to
this have ideas what could be going awry?

	Dave

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7

From: Dave Jones @ 2020-07-02 19:46 UTC (permalink / raw)
To: Linux Kernel, peterz, mgorman, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
 > When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
 > idle machine (that usually sees loadavg hover in the 0.xx range)
 > that it was consistently above 1.00 even when there was nothing running.
 > All that perf showed was the kernel was spending time in the idle loop
 > (and running perf).

Unfortunate typo there, I meant 5.8-rc2, and just confirmed the bug
persists in 5.8-rc3.

	Dave
* Re: weird loadavg on idle machine post 5.7

From: Paul Gortmaker @ 2020-07-02 21:15 UTC (permalink / raw)
To: Dave Jones, Linux Kernel, peterz, mgorman, mingo, Linus Torvalds
Cc: Michal Kubecek, Paul E. McKenney, Thomas Gleixner

[weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
>
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
>
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
>
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks.  I did it four different times, the first

I've seen pretty much the same thing - I was helping paulmck test
rcu-dev for something hopefully unrelated when I first saw it, and
assumed it came in with the sched-core merge and was using one under
that as "good" to attempt bisect.

> producing bogus results from over-eager 'good', but the last two runs

Yeah - it sucks.  I was using Paul's TREE03 rcu-torture for loading and
even after a two hour test I'd still get "false good" results.  Only
after 7h was I quite confident that good was really good.

> both implicated this commit:
>
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
>
>     sched/core: Optimize ttwu() spinning on p->on_cpu

I was down to 10 commits roughly above and below this guy before hearing
you were working the same problem.

I just got this guy to reveal a false load after a 2h test as well.
I want to let the one underneath soak overnight just to also confirm it
is "good" - so that is pending.

What I can add is that it looks like we are "leaking" an instance into
calc_load_tasks -- which isn't anything new -- see when tglx fixed it
before in d60585c5766.  Unfortunately we don't have some low-overhead
leak checks on that... ?

Anyway, if I "fix" the leak, then everything seems back to normal:

(gdb) p calc_load_tasks
$2 = {counter = 1}
(gdb) set variable calc_load_tasks = { 0 }
(gdb) p calc_load_tasks
$4 = {counter = 0}
(gdb) continue
Continuing.

[ ... watching decay on resumed target .... ]

 10:13:14 up  9:54,  4 users,  load average: 0.92, 0.98, 1.15
 10:13:54 up  9:55,  4 users,  load average: 0.47, 0.86, 1.10
 10:15:17 up  9:56,  4 users,  load average: 0.12, 0.65, 1.00
 10:19:20 up 10:00,  4 users,  load average: 0.00, 0.28, 0.76
 10:26:07 up 10:07,  4 users,  load average: 0.00, 0.06, 0.48
 10:32:48 up 10:14,  4 users,  load average: 0.00, 0.00, 0.29

Obviously that isn't a fix, but it shows it is an accounting thing.
I've also used gdb to snoop all the cfs->avg fields and they look as
expected for a completely idle machine.  Nothing hiding in avg_rt or
avg_dl either.

>
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
>     Cc: Jirka Hladky <jhladky@redhat.com>
>     Cc: Vincent Guittot <vincent.guittot@linaro.org>
>     Cc: valentin.schneider@arm.com
>     Cc: Hillf Danton <hdanton@sina.com>
>     Cc: Rik van Riel <riel@surriel.com>
>     Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
>
> Unfortunately it doesn't revert cleanly on top of rc3 so I haven't
> confirmed 100% that it's the cause yet, but the two separate bisects
> seem promising.

I've not tried the revert (yet) - but Kyle saw me boring people on
#kernel with the details of bisecting this and gave me the heads-up you
were looking at it too (thanks Kyle!).  So I figured I'd better add what
I'd seen so far.

I'm testing with what is largely a defconfig, plus KVM_INTEL (needed for
paulmck's TREE03 rcu-torture), plus I enabled KGDB and DEBUG_INFO after a
while so I could poke and prod - but was reproducing it before that.

For completeness, the test was:

  tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 120 \
      --configs TREE03 --trust-make

...on a 24-core 2013-vintage xeon-v2 COTS box.  As above, the 120m
seemed to give between 60-75% confidence on not getting a false good.

Anyway - so that is all I know so far.

Paul.
--

>
> I don't see any obvious correlation between what's changing there and
> the symptoms (other than "scheduler magic") but maybe those closer to
> this have ideas what could be going awry ?
>
> 	Dave
* Re: weird loadavg on idle machine post 5.7

From: Paul Gortmaker @ 2020-07-03 13:23 UTC (permalink / raw)
To: Dave Jones, Linux Kernel, peterz, mgorman, mingo, Linus Torvalds
Cc: Michal Kubecek, Paul E. McKenney, Thomas Gleixner

[Re: weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 17:15) Paul Gortmaker wrote:

> [weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

[...]

> > both implicated this commit:
> >
> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Sun May 24 21:29:55 2020 +0100
> >
> >     sched/core: Optimize ttwu() spinning on p->on_cpu
>
> I was down to 10 commits roughly above and below this guy before hearing
> you were working the same problem.
>
> I just got this guy to reveal a false load after a 2h test as well.
> I want to let the one underneath soak overnight just to also confirm it
> is "good" - so that is pending.

As per above, I ran a 12h test overnight on 505b8af5891 and it seems
fine.  Every other "bad" bisect point failed in 7h or less.  So my
testing seems to give the same result as Dave.

Paul.
--

root@t5610:/home/paul/git/linux-head#
[1]-  Done  nohup tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 720 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
root@t5610:/home/paul/git/linux-head# uptime
 09:10:56 up 13:12,  2 users,  load average: 0.00, 0.00, 0.14
root@t5610:/home/paul/git/linux-head# cat /proc/version
Linux version 5.7.0-rc6-00029-gd505b8af5891 (paul@t5610) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2), GNU ld (GNU Binutils for Ubuntu) 2.34) #2 SMP Thu Jul 2 18:55:40 EDT 2020
* Re: weird loadavg on idle machine post 5.7

From: Mel Gorman @ 2020-07-02 21:36 UTC (permalink / raw)
To: Dave Jones, Linux Kernel, peterz, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
>
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
>
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
>
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks.  I did it four different times, the first
> producing bogus results from over-eager 'good', but the last two runs
> both implicated this commit:
>
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
>
>     sched/core: Optimize ttwu() spinning on p->on_cpu
>
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>

Peter, I'm not supremely confident about this, but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued?  In the final stages of a
task switch we have

	prev_state = prev->state;
	vtime_task_switch(prev);
	perf_event_task_sched_in(prev, current);
	finish_task(prev);

finish_task() is when p->on_cpu is cleared after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient, so if the check below is true and ttwu_queue_wakelist
is used, then sched_contributes_to_load was based on a transient value
and is potentially wrong.

	if (smp_load_acquire(&p->on_cpu) &&
	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
		goto unlock;

sched_contributes_to_load determines the value of rq->nr_uninterruptible,
which is used in the load value, so it's a partial fit.  The race would
not happen very often on a normal desktop, which would explain why it
takes a long time for the value to get screwed up, and appears to fit.

I'm thinking that the !!task_contributes_to_load(p) should still happen
after smp_cond_load_acquire(), when on_cpu is stable and the pi_lock is
held to stabilise p->state against a parallel wakeup or an update of the
task's rq.  I do not see any hazards with respect to smp_rmb and the value
of p->state in this particular path, but I've confused myself enough in
the various scheduler and wakeup paths that I don't want to bet money on
it late in the evening.

It builds, not booted; it's for discussion, but maybe Dave is feeling brave!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	/*
+	 * p is off the cpu and pi_lock is held so p->state is stable
+	 * for calculating whether it contributes to load.
+	 */
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;

-- 
Mel Gorman
SUSE Labs
* Re: weird loadavg on idle machine post 5.7

From: Michal Kubecek @ 2020-07-02 23:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Jones, Linux Kernel, peterz, mingo, Linus Torvalds, Paul E. McKenney, Paul Gortmaker, Thomas Gleixner

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
>
> It builds, not booted, it's for discussion but maybe Dave is feeling brave!
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca5db40392d4..52c73598b18a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	}
>  
>  #ifdef CONFIG_SMP
> -	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> -	p->state = TASK_WAKING;
> -
>  	/*
>  	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
>  	 * possible to, falsely, observe p->on_cpu == 0.
> @@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
> +	/*
> +	 * p is off the cpu and pi_lock is held to p->state is stable
> +	 * for calculating whether it contributes to load.
> +	 */
> +	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> +	p->state = TASK_WAKING;
> +
>  	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>  	if (task_cpu(p) != cpu) {
>  		wake_flags |= WF_MIGRATED;
>
> --

I felt brave, but something is probably wrong: I tried to boot three
times and all three attempts crashed during boot, all three in
scheduler-related functions:

  set_next_entity()
  check_preempt_wakeup()
  finish_task_switch()

I left my USB-to-serial adapter in the office but I'll try again
tomorrow on a test machine with serial console.

Michal
* Re: weird loadavg on idle machine post 5.7

From: Dave Jones @ 2020-07-02 23:24 UTC (permalink / raw)
To: Mel Gorman; +Cc: Linux Kernel, peterz, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
> I'm thinking that the !!task_contributes_to_load(p) should still happen
> after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
> held to stabilised p->state against a parallel wakeup or updating the
> task rq.  I do not see any hazards with respect to smp_rmb and the value
> of p->state in this particular path but I've confused myself enough in
> the various scheduler and wakeup paths that I don't want to bet money on
> it late in the evening
>
> It builds, not booted, it's for discussion but maybe Dave is feeling brave!

stalls, and then panics during boot :(

[   16.933212] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   69.572840] watchdog: BUG: soft lockup - CPU#3 stuck for 44s! [kworker/u8:0:7]
[   69.572849] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Not tainted 5.8.0-rc3-firewall+ #2
[   69.572852] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[   69.572861] Workqueue:  0x0 (events_power_efficient)
[   69.572877] RIP: 0010:finish_task_switch+0x71/0x1a0
[   69.572884] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[   69.572887] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[   69.572893] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[   69.572895] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[   69.572898] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[   69.572901] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[   69.572904] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[   69.572909] FS:  0000000000000000(0000) GS:ffff946557580000(0000) knlGS:0000000000000000
[   69.572912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   69.572917] CR2: 00007f29b26abc30 CR3: 000000020812d001 CR4: 00000000001606e0
[   69.572919] Call Trace:
[   69.572937]  __schedule+0x28d/0x570
[   69.572946]  ? _cond_resched+0x15/0x30
[   69.572954]  schedule+0x38/0xa0
[   69.572962]  worker_thread+0xaa/0x3c0
[   69.572968]  ? process_one_work+0x3c0/0x3c0
[   69.572972]  kthread+0x116/0x130
[   69.572977]  ? __kthread_create_on_node+0x180/0x180
[   69.572982]  ret_from_fork+0x22/0x30
[   69.572988] Kernel panic - not syncing: softlockup: hung tasks
[   69.572993] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Tainted: G             L    5.8.0-rc3-firewall+ #2
[   69.572995] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[   69.572998] Workqueue:  0x0 (events_power_efficient)
[   69.573001] Call Trace:
[   69.573004]  <IRQ>
[   69.573010]  dump_stack+0x57/0x70
[   69.573016]  panic+0xfb/0x2cb
[   69.573024]  watchdog_timer_fn.cold.12+0x7d/0x96
[   69.573030]  ? softlockup_fn+0x30/0x30
[   69.573035]  __hrtimer_run_queues+0x100/0x280
[   69.573041]  hrtimer_interrupt+0xf4/0x210
[   69.573049]  __sysvec_apic_timer_interrupt+0x5d/0xf0
[   69.573055]  asm_call_on_stack+0x12/0x20
[   69.573058]  </IRQ>
[   69.573064]  sysvec_apic_timer_interrupt+0x6d/0x80
[   69.573069]  asm_sysvec_apic_timer_interrupt+0xf/0x20
[   69.573078] RIP: 0010:finish_task_switch+0x71/0x1a0
[   69.573082] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[   69.573085] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[   69.573088] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[   69.573090] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[   69.573092] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[   69.573094] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[   69.573096] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[   69.573106]  __schedule+0x28d/0x570
[   69.573113]  ? _cond_resched+0x15/0x30
[   69.573119]  schedule+0x38/0xa0
[   69.573125]  worker_thread+0xaa/0x3c0
[   69.573130]  ? process_one_work+0x3c0/0x3c0
[   69.573134]  kthread+0x116/0x130
[   69.573149]  ? __kthread_create_on_node+0x180/0x180
[   69.792344]  ret_from_fork+0x22/0x30
* Re: weird loadavg on idle machine post 5.7

From: Peter Zijlstra @ 2020-07-03 9:02 UTC (permalink / raw)
To: Mel Gorman; +Cc: Dave Jones, Linux Kernel, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <peterz@infradead.org>
>
> Peter, I'm not supremely confident about this but could it be because
> "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> happens while a task is still being dequeued?  In the final stages of a
> task switch we have
>
> 	prev_state = prev->state;
> 	vtime_task_switch(prev);
> 	perf_event_task_sched_in(prev, current);
> 	finish_task(prev);
>
> finish_task is when p->on_cpu is cleared after the state is updated.
> With the patch, we potentially update sched_contributes_to_load while
> p->state is transient so if the check below is true and ttwu_queue_wakelist
> is used then sched_contributes_to_load was based on a transient value
> and potentially wrong.

I'm not seeing it.  Once a task hits schedule(), p->state doesn't change,
except through wakeup.

And while dequeue depends on p->state, it doesn't change it.

At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
started, which means we've (at least) stopped executing the task -- we
started or finished schedule().

Let me stare at this more...
* Re: weird loadavg on idle machine post 5.7

From: Peter Zijlstra @ 2020-07-03 10:40 UTC (permalink / raw)
To: Mel Gorman; +Cc: Dave Jones, Linux Kernel, mingo, Linus Torvalds

On Fri, Jul 03, 2020 at 11:02:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
> >
> > > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > > Author: Peter Zijlstra <peterz@infradead.org>
> >
> > Peter, I'm not supremely confident about this but could it be because
> > "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> > happens while a task is still being dequeued?  In the final stages of a
> > task switch we have
> >
> > 	prev_state = prev->state;
> > 	vtime_task_switch(prev);
> > 	perf_event_task_sched_in(prev, current);
> > 	finish_task(prev);
> >
> > finish_task is when p->on_cpu is cleared after the state is updated.
> > With the patch, we potentially update sched_contributes_to_load while
> > p->state is transient so if the check below is true and ttwu_queue_wakelist
> > is used then sched_contributes_to_load was based on a transient value
> > and potentially wrong.
>
> I'm not seeing it.  Once a task hits schedule(), p->state doesn't change,
> except through wakeup.
>
> And while dequeue depends on p->state, it doesn't change it.
>
> At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
> started, which means we've (at least) stopped executing the task -- we
> started or finished schedule().
>
> Let me stare at this more...

So ARM/Power/etc.. can speculate the load such that the
task_contributes_to_load() value is from before ->on_rq.

The compiler might similarly re-order things -- although I've not found
it doing so with the few builds I looked at.

So I think at the very least we should do something like this.  But I've
no idea how to reproduce this problem.

Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
this, and by not resetting p->sched_contributes_to_load it would skew
accounting even worse.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcd56f04b706..cba8a56d0f7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2799,9 +2799,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2823,6 +2820,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
 	 * this task as prev, considering queueing p on the remote CPUs wake_list
* Re: weird loadavg on idle machine post 5.7

From: Dave Jones @ 2020-07-03 20:51 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mel Gorman, Linux Kernel, mingo, Linus Torvalds

On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
> So ARM/Power/etc.. can speculate the load such that the
> task_contributes_to_load() value is from before ->on_rq.
>
> The compiler might similar re-order things -- although I've not found it
> doing so with the few builds I looked at.
>
> So I think at the very least we should do something like this.  But i've
> no idea how to reproduce this problem.
>
> Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
> this, and by not resetting p->sched_contributes_to_load it would skew
> accounting even worse.

looked promising the first few hours, but as soon as it hit four hours
of uptime, loadavg spiked and is now pinned to at least 1.00

	Dave
* Re: weird loadavg on idle machine post 5.7

From: Peter Zijlstra @ 2020-07-06 14:59 UTC (permalink / raw)
To: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds
Cc: paul.gortmaker, valentin.schneider

On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
> > So ARM/Power/etc.. can speculate the load such that the
> > task_contributes_to_load() value is from before ->on_rq.
> >
> > The compiler might similar re-order things -- although I've not found it
> > doing so with the few builds I looked at.
> >
> > So I think at the very least we should do something like this.  But i've
> > no idea how to reproduce this problem.
> >
> > Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
> > this, and by not resetting p->sched_contributes_to_load it would skew
> > accounting even worse.
>
> looked promising the first few hours, but as soon as it hit four hours
> of uptime, loadavg spiked and is now pinned to at least 1.00

OK, lots of cursing later, I now have the below...

The TL;DR is that while schedule() doesn't change p->state once it
starts, it does read it quite a bit, and ttwu() will actually change it
to TASK_WAKING.  So if ttwu() changes it to WAKING before schedule()
reads it to do loadavg accounting, things go sideways.

The below is extra complicated by the fact that I've had to scrounge up
a bunch of load-store ordering without actually adding barriers.  It
adds yet another control dependency to ttwu(), so take that, C standard :-)

I've booted it, and built a few kernels with it and checked loadavg
drops to 0 after each build, so from that pov all is well, but since I'm
not confident I can reproduce the issue, I can't tell if this actually
fixes anything, except maybe phantoms of my imagination.

---
 include/linux/sched.h |  4 ---
 kernel/sched/core.c   | 67 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bd073a10224..e26c8bbeda00 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-					 (task->flags & PF_FROZEN) == 0 && \
-					 (task->state & TASK_NOLOAD) == 0)
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..f245444b4b15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule(). See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about it's own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(bool preempt)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	prev_state = prev->state;
+
 	/*
-	 * Make sure that signal_pending_state()->signal_pending() below
-	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * __set_current_state(@state)
+	 * schedule()				signal_wake_up()
+	 *   prev_state = p->state		  set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *					  wake_up_state()
+	 *   LOCK rq->lock			    LOCK p->pi_state
+	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
+	 *   if (signal_pending_state()		    if (p->state & @state)
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+
+	/*
+	 * We must re-load p->state in case ttwu_runnable() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)) == TASK_UNINTERRUPTIBLE &&
+				(prev->flags & PF_FROZEN) == 0;
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()			ttwu()
+			 *   prev_state = prev->state;	  if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock		    goto out;
+			 *   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;		  p->state = TASK_WAKING;
+			 *
			 * After this, schedule() must not care about p->state any more.
			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {
* Re: weird loadavg on idle machine post 5.7 2020-07-06 14:59 ` Peter Zijlstra @ 2020-07-06 21:20 ` Dave Jones 2020-07-07 7:48 ` Peter Zijlstra 2020-07-06 23:56 ` Valentin Schneider 2020-07-07 9:20 ` weird loadavg on idle machine post 5.7 Qais Yousef 2 siblings, 1 reply; 20+ messages in thread From: Dave Jones @ 2020-07-06 21:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker, valentin.schneider On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote: > On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote: > > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote: > > > > looked promising the first few hours, but as soon as it hit four hours > > of uptime, loadavg spiked and is now pinned to at least 1.00 > > OK, lots of cursing later, I now have the below... > > The TL;DR is that while schedule() doesn't change p->state once it > starts, it does read it quite a bit, and ttwu() will actually change it > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule() > reads it to do loadavg accounting, things go sideways. > > The below is extra complicated by the fact that I've had to scrounge up > a bunch of load-store ordering without actually adding barriers. It adds > yet another control dependency to ttwu(), so take that C standard :-) Man this stuff is subtle. I could've read this a hundred times and not even come close to approaching this. Basically me reading scheduler code: http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg > I've booted it, and build a few kernels with it and checked loadavg > drops to 0 after each build, so from that pov all is well, but since > I'm not confident I can reproduce the issue, I can't tell this actually > fixes anything, except maybe phantoms of my imagination. Five hours in, looking good so far. I think you nailed it. Dave ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7 2020-07-06 21:20 ` Dave Jones @ 2020-07-07 7:48 ` Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2020-07-07 7:48 UTC (permalink / raw) To: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker, valentin.schneider On Mon, Jul 06, 2020 at 05:20:57PM -0400, Dave Jones wrote: > On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote: > > On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote: > > > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote: > > > > > > looked promising the first few hours, but as soon as it hit four hours > > > of uptime, loadavg spiked and is now pinned to at least 1.00 > > > > OK, lots of cursing later, I now have the below... > > > > The TL;DR is that while schedule() doesn't change p->state once it > > starts, it does read it quite a bit, and ttwu() will actually change it > > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule() > > reads it to do loadavg accounting, things go sideways. > > > > The below is extra complicated by the fact that I've had to scrounge up > > a bunch of load-store ordering without actually adding barriers. It adds > > yet another control dependency to ttwu(), so take that C standard :-) > > Man this stuff is subtle. I could've read this a hundred times and not > even come close to approaching this. > > Basically me reading scheduler code: > http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg Heh, that one made me nearly spill my tea, much funnies :-) But yes, Dave Chinner also complained about this for the previous fix. I've written this: https://lore.kernel.org/lkml/20200703133259.GE4781@hirez.programming.kicks-ass.net/ to help with that. But clearly I'll need to update that patch again after this little adventure. 
> > I've booted it, and build a few kernels with it and checked loadavg > > drops to 0 after each build, so from that pov all is well, but since > > I'm not confident I can reproduce the issue, I can't tell this actually > > fixes anything, except maybe phantoms of my imagination. > > Five hours in, looking good so far. I think you nailed it. \o/ hooray! Thanks for testing Dave! ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7 2020-07-06 14:59 ` Peter Zijlstra 2020-07-06 21:20 ` Dave Jones @ 2020-07-06 23:56 ` Valentin Schneider 2020-07-07 8:17 ` Peter Zijlstra 2020-07-07 9:20 ` weird loadavg on idle machine post 5.7 Qais Yousef 2 siblings, 1 reply; 20+ messages in thread From: Valentin Schneider @ 2020-07-06 23:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker On 06/07/20 15:59, Peter Zijlstra wrote: > OK, lots of cursing later, I now have the below... > > The TL;DR is that while schedule() doesn't change p->state once it > starts, it does read it quite a bit, and ttwu() will actually change it > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule() > reads it to do loadavg accounting, things go sideways. > > The below is extra complicated by the fact that I've had to scrounge up > a bunch of load-store ordering without actually adding barriers. It adds > yet another control dependency to ttwu(), so take that C standard :-) > > I've booted it, and build a few kernels with it and checked loadavg > drops to 0 after each build, so from that pov all is well, but since > I'm not confident I can reproduce the issue, I can't tell this actually > fixes anything, except maybe phantoms of my imagination. > As you said on IRC, the one apparent race would lead into "negative" rq->nr_uninterruptible accounting, i.e. we'd skip some increments so would end up with more decrements. As for the described issue, I think we were both expecting "positive" accounting, i.e. more increments than decrements, leading into an artificially inflated loadavg. In any case, this should get rid of any existing race. I'll need some more time (and more aperol spritz) to go through it all though. > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > * > * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in > * __schedule(). 
See the comment for smp_mb__after_spinlock(). > + * > + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure > + * schedule()'s deactivate_task() has 'happened' and p will no longer > + * care about it's own p->state. See the comment in __schedule(). > */ > - smp_rmb(); > + smp_acquire__after_ctrl_dep(); Apologies for asking again, but I'm foolishly hopeful I'll someday be able to grok those things without half a dozen tabs open with documentation and Paul McKenney papers. Do I get it right that the 'acquire' part hints this is equivalent to issuing a load-acquire on whatever was needed to figure out whether or not to take the branch (in this case, p->on_rq, amongst other things); IOW ensures any memory access appearing later in program order has to happen after the load? That at least explains to me the load->{load,store} wording in smp_acquire__after_ctrl_dep(). > + > + /* > + * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq > + * == 0), which means we need to do an enqueue, change p->state to > + * TASK_WAKING such that we can unlock p->pi_lock before doing the > + * enqueue, such as ttwu_queue_wakelist(). > + */ > + p->state = TASK_WAKING; > > /* > * If the owning (remote) CPU is still in the middle of schedule() with ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7 2020-07-06 23:56 ` Valentin Schneider @ 2020-07-07 8:17 ` Peter Zijlstra 2020-07-07 10:20 ` Valentin Schneider 2020-07-07 10:29 ` Peter Zijlstra 0 siblings, 2 replies; 20+ messages in thread From: Peter Zijlstra @ 2020-07-07 8:17 UTC (permalink / raw) To: Valentin Schneider Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker, paulmck On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote: > > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > > * > > * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in > > * __schedule(). See the comment for smp_mb__after_spinlock(). > > + * > > + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure > > + * schedule()'s deactivate_task() has 'happened' and p will no longer > > + * care about it's own p->state. See the comment in __schedule(). > > */ > > - smp_rmb(); > > + smp_acquire__after_ctrl_dep(); > > Apologies for asking again, but I'm foolishly hopeful I'll someday be able > to grok those things without half a dozen tabs open with documentation and > Paul McKenney papers. > > Do I get it right that the 'acquire' part hints this is equivalent to > issuing a load-acquire on whatever was needed to figure out whether or not > the take the branch (in this case, p->on_rq, amongst other things); IOW > ensures any memory access appearing later in program order has to happen > after the load? > > That at least explains to me the load->{load,store} wording in > smp_acquire__after_ctrl_dep(). Yes. So the thing is that hardware MUST NOT speculate stores, or rather, if it does, it must take extreme measures to ensure they do not become visible in any way shape or form, since speculative stores lead to instant OOTA problems. Therefore we can say that branches order stores and if the branch condition depends on a load, we get a load->store order. 
IOW the load must complete before we can resolve the branch, which in turn enables the store to become visible/happen. If we then add an smp_rmb() to the branch to order load->load, we end up with a load->{load,store} ordering, which is equivalent to a load-acquire. The reason to do it like that, is that load-acquire would otherwise require an smp_mb(), since for many platforms that's the only barrier that has load->store ordering. The down-side of doing it like this, as Paul will be quick to point out, is that the C standard doesn't recognise control dependencies and thus the compiler would be in its right to 'optimize' our conditional away. We're relying on the compilers not having done this in the past and there being sufficient compiler people interested in compiling Linux to avoid this from happening. Anyway, this patch is basically: LOAD p->state LOAD-ACQUIRE p->on_rq == 0 MB STORE p->on_rq, 0 STORE p->state, TASK_WAKING which ensures the TASK_WAKING store happens after the p->state load. Just a wee bit complicated due to not actually adding any barriers while adding additional ordering. Anyway, let me now endeavour to write a coherent Changelog for this mess :-( ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7 2020-07-07 8:17 ` Peter Zijlstra @ 2020-07-07 10:20 ` Valentin Schneider 2020-07-07 10:29 ` Peter Zijlstra 1 sibling, 0 replies; 20+ messages in thread From: Valentin Schneider @ 2020-07-07 10:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker, paulmck On 07/07/20 09:17, Peter Zijlstra wrote: > On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote: > >> > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) >> > * >> > * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in >> > * __schedule(). See the comment for smp_mb__after_spinlock(). >> > + * >> > + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure >> > + * schedule()'s deactivate_task() has 'happened' and p will no longer >> > + * care about it's own p->state. See the comment in __schedule(). >> > */ >> > - smp_rmb(); >> > + smp_acquire__after_ctrl_dep(); >> >> Apologies for asking again, but I'm foolishly hopeful I'll someday be able >> to grok those things without half a dozen tabs open with documentation and >> Paul McKenney papers. >> >> Do I get it right that the 'acquire' part hints this is equivalent to >> issuing a load-acquire on whatever was needed to figure out whether or not >> the take the branch (in this case, p->on_rq, amongst other things); IOW >> ensures any memory access appearing later in program order has to happen >> after the load? >> >> That at least explains to me the load->{load,store} wording in >> smp_acquire__after_ctrl_dep(). > > Yes. > > So the thing is that hardware MUST NOT speculate stores, or rather, if > it does, it must take extreme measures to ensure they do not become > visible in any way shape or form, since speculative stores lead to > instant OOTA problems. 
> > Therefore we can say that branches order stores and if the branch > condition depends on a load, we get a load->store order. IOW the load > must complete before we can resolve the branch, which in turn enables > the store to become visible/happen. > Right, I think that point is made clear in memory-barriers.txt. > If we then add an smp_rmb() to the branch to order load->load, we end up > with a load->{load,store} ordering, which is equivalent to a > load-acquire. > > The reason to do it like that, is that load-aquire would otherwise > require an smp_mb(), since for many platforms that's the only barrier > that has load->store ordering. > > The down-side of doing it like this, as Paul will be quick to point out, > is that the C standard doesn't recognise control dependencies and thus > the compiler would be in its right to 'optimize' our conditional away. > Yikes! > We're relying on the compilers not having done this in the past and > there being sufficient compiler people interested in compiling Linux to > avoid this from happening. > > > Anyway, this patch is basically: > > LOAD p->state LOAD-ACQUIRE p->on_rq == 0 > MB > STORE p->on_rq, 0 STORE p->state, TASK_WAKING > > which ensures the TASK_WAKING store happens after the p->state load. > Just a wee bit complicated due to not actually adding any barriers while > adding additional ordering. > Your newer changelog also helped in that regards. Thanks a ton for the writeup! > Anyway, let me now endeavour to write a coherent Changelog for this mess > :-( ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7 2020-07-07 8:17 ` Peter Zijlstra 2020-07-07 10:20 ` Valentin Schneider @ 2020-07-07 10:29 ` Peter Zijlstra 2020-07-08 9:46 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra 1 sibling, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2020-07-07 10:29 UTC (permalink / raw) To: Valentin Schneider Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker, paulmck On Tue, Jul 07, 2020 at 10:17:19AM +0200, Peter Zijlstra wrote: > Anyway, let me now endeavour to write a coherent Changelog for this mess I'll go stick this in sched/urgent and update that other documentation patch (again).. --- Subject: sched: Fix loadavg accounting race From: Peter Zijlstra <peterz@infradead.org> Date: Fri, 3 Jul 2020 12:40:33 +0200 The recent commit: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") moved these lines in ttwu(): p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; up before: smp_cond_load_acquire(&p->on_cpu, !VAL); into the 'p->on_rq == 0' block, with the thinking that once we hit schedule() the current task cannot change it's ->state anymore. And while this is true, it is both incorrect and flawed. It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0' to avoid weak hardware from re-ordering things for us. This can fairly easily be achieved by relying on the control-dependency already in place. The second problem, which makes the flaw in the original argument, is that while schedule() will not change prev->state, it will read it a number of times (arguably too many times since it's marked volatile). The previous condition 'p->on_cpu == 0' was sufficient because that indicates schedule() has completed, and will no longer read prev->state. So now the trick is to make this same true for the (much) earlier 'prev->on_rq == 0' case. 
Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0' assignment needs to be a RELEASE, but adding additional ordering to schedule() is an unwelcome proposition at the best of times, doubly so for mere accounting. Luckily we can push the prev->state load up before rq->lock, with the only caveat that we then have to re-read the state after. However, we know that if it changed, we no longer have to worry about the blocking path. This gives us the required ordering, if we block, we did the prev->state load before an (effective) smp_mb() and the p->on_rq store needs not change. With this we end up with the effective ordering: LOAD p->state LOAD-ACQUIRE p->on_rq == 0 MB STORE p->on_rq, 0 STORE p->state, TASK_WAKING which ensures the TASK_WAKING store happens after the prev->state load, and all is well again. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> --- include/linux/sched.h | 4 -- kernel/sched/core.c | 67 ++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 51 insertions(+), 20 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -114,10 +114,6 @@ struct task_group; #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0) -#define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ - (task->flags & PF_FROZEN) == 0 && \ - (task->state & TASK_NOLOAD) == 0) - #ifdef CONFIG_DEBUG_ATOMIC_SLEEP /* --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct r void activate_task(struct rq *rq, struct task_struct *p, int flags) { - if (task_contributes_to_load(p)) - rq->nr_uninterruptible--; -
enqueue_task(rq, p, flags); p->on_rq = TASK_ON_RQ_QUEUED; @@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, stru { p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; - if (task_contributes_to_load(p)) - rq->nr_uninterruptible++; - dequeue_task(rq, p, flags); } @@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct t lockdep_assert_held(&rq->lock); -#ifdef CONFIG_SMP if (p->sched_contributes_to_load) rq->nr_uninterruptible--; +#ifdef CONFIG_SMP if (wake_flags & WF_MIGRATED) en_flags |= ENQUEUE_MIGRATED; #endif @@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, un * A similar smb_rmb() lives in try_invoke_on_locked_down_task(). */ smp_rmb(); - if (p->on_rq && ttwu_remote(p, wake_flags)) + if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags)) goto unlock; if (p->in_iowait) { @@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, un } #ifdef CONFIG_SMP - p->sched_contributes_to_load = !!task_contributes_to_load(p); - p->state = TASK_WAKING; - /* * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be * possible to, falsely, observe p->on_cpu == 0. @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, un * * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in * __schedule(). See the comment for smp_mb__after_spinlock(). + * + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure + * schedule()'s deactivate_task() has 'happened' and p will no longer + * care about it's own p->state. See the comment in __schedule(). */ - smp_rmb(); + smp_acquire__after_ctrl_dep(); + + /* + * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq + * == 0), which means we need to do an enqueue, change p->state to + * TASK_WAKING such that we can unlock p->pi_lock before doing the + * enqueue, such as ttwu_queue_wakelist(). 
+ */ + p->state = TASK_WAKING; /* * If the owning (remote) CPU is still in the middle of schedule() with @@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(b { struct task_struct *prev, *next; unsigned long *switch_count; + unsigned long prev_state; struct rq_flags rf; struct rq *rq; int cpu; @@ -4104,12 +4108,22 @@ static void __sched notrace __schedule(b local_irq_disable(); rcu_note_context_switch(preempt); + /* See deactivate_task() below. */ + prev_state = prev->state; + /* * Make sure that signal_pending_state()->signal_pending() below * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) - * done by the caller to avoid the race with signal_wake_up(). + * done by the caller to avoid the race with signal_wake_up(): + * + * __set_current_state(@state) signal_wake_up() + * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING) + * wake_up_state(p, state) + * LOCK rq->lock LOCK p->pi_state + * smp_mb__after_spinlock() smp_mb__after_spinlock() + * if (signal_pending_state()) if (p->state & @state) * - * The membarrier system call requires a full memory barrier + * Also, the membarrier system call requires a full memory barrier * after coming from user-space, before storing to rq->curr. */ rq_lock(rq, &rf); @@ -4120,10 +4134,31 @@ static void __sched notrace __schedule(b update_rq_clock(rq); switch_count = &prev->nivcsw; - if (!preempt && prev->state) { - if (signal_pending_state(prev->state, prev)) { + /* + * We must re-load prev->state in case ttwu_remote() changed it + * before we acquired rq->lock. + */ + if (!preempt && prev_state && prev_state == prev->state) { + if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { + prev->sched_contributes_to_load = + (prev_state & TASK_UNINTERRUPTIBLE) && + !(prev_state & TASK_NOLOAD) && + !(prev->flags & PF_FROZEN); + + if (prev->sched_contributes_to_load) + rq->nr_uninterruptible++; + + /* + * __schedule() ttwu() + * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...) 
+ * LOCK rq->lock goto out; + * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep(); + * p->on_rq = 0; p->state = TASK_WAKING; + * + * After this, schedule() must not care about p->state any more. + */ deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); if (prev->in_iowait) { ^ permalink raw reply [flat|nested] 20+ messages in thread
* [tip: sched/urgent] sched: Fix loadavg accounting race 2020-07-07 10:29 ` Peter Zijlstra @ 2020-07-08 9:46 ` tip-bot2 for Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: tip-bot2 for Peter Zijlstra @ 2020-07-08 9:46 UTC (permalink / raw) To: linux-tip-commits Cc: Dave Jones, Paul Gortmaker, Peter Zijlstra (Intel), x86, LKML The following commit has been merged into the sched/urgent branch of tip: Commit-ID: dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7 Gitweb: https://git.kernel.org/tip/dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Fri, 03 Jul 2020 12:40:33 +02:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Wed, 08 Jul 2020 11:38:49 +02:00 sched: Fix loadavg accounting race The recent commit: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") moved these lines in ttwu(): p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; up before: smp_cond_load_acquire(&p->on_cpu, !VAL); into the 'p->on_rq == 0' block, with the thinking that once we hit schedule() the current task cannot change it's ->state anymore. And while this is true, it is both incorrect and flawed. It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0' to avoid weak hardware from re-ordering things for us. This can fairly easily be achieved by relying on the control-dependency already in place. The second problem, which makes the flaw in the original argument, is that while schedule() will not change prev->state, it will read it a number of times (arguably too many times since it's marked volatile). The previous condition 'p->on_cpu == 0' was sufficient because that indicates schedule() has completed, and will no longer read prev->state. So now the trick is to make this same true for the (much) earlier 'prev->on_rq == 0' case. 
Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0' assignment needs to he a RELEASE, but adding additional ordering to schedule() is an unwelcome proposition at the best of times, doubly so for mere accounting. Luckily we can push the prev->state load up before rq->lock, with the only caveat that we then have to re-read the state after. However, we know that if it changed, we no longer have to worry about the blocking path. This gives us the required ordering, if we block, we did the prev->state load before an (effective) smp_mb() and the p->on_rq store needs not change. With this we end up with the effective ordering: LOAD p->state LOAD-ACQUIRE p->on_rq == 0 MB STORE p->on_rq, 0 STORE p->state, TASK_WAKING which ensures the TASK_WAKING store happens after the prev->state load, and all is well again. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dave Jones <davej@codemonkey.org.uk> Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net --- include/linux/sched.h | 4 +--- kernel/sched/core.c | 67 +++++++++++++++++++++++++++++++----------- 2 files changed, 51 insertions(+), 20 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 692e327..6833729 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -114,10 +114,6 @@ struct task_group; #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0) -#define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ - (task->flags & PF_FROZEN) == 0 && \ - (task->state & TASK_NOLOAD) == 0) - #ifdef CONFIG_DEBUG_ATOMIC_SLEEP /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca5db40..950ac45 100644 --- 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1311,9 +1311,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1323,9 +1320,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2236,10 +2230,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2583,7 +2577,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2592,9 +2586,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2613,8 +2604,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule(). See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about its own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4097,6 +4100,7 @@ static void __sched notrace __schedule(bool preempt)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4113,12 +4117,22 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	/* See deactivate_task() below. */
+	prev_state = prev->state;
+
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * done by the caller to avoid the race with signal_wake_up():
+	 *
+	 * __set_current_state(@state)		signal_wake_up()
+	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *					  wake_up_state(p, state)
+	 *   LOCK rq->lock			    LOCK p->pi_state
+	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
+	 *   if (signal_pending_state())	    if (p->state & @state)
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4129,10 +4143,31 @@ static void __sched notrace __schedule(bool preempt)
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+	/*
+	 * We must re-load prev->state in case ttwu_remote() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & TASK_UNINTERRUPTIBLE) &&
+				!(prev_state & TASK_NOLOAD) &&
+				!(prev->flags & PF_FROZEN);
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()			ttwu()
+			 *   prev_state = prev->state;	  if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock		    goto out;
+			 *   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;		  p->state = TASK_WAKING;
+			 *
			 * After this, schedule() must not care about p->state any more.
			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7
  2020-07-06 14:59 ` Peter Zijlstra
  2020-07-06 21:20   ` Dave Jones
  2020-07-06 23:56   ` Valentin Schneider
@ 2020-07-07  9:20 ` Qais Yousef
  2020-07-07  9:47   ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2020-07-07  9:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, valentin.schneider

On 07/06/20 16:59, Peter Zijlstra wrote:

[...]

> @@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
>  	local_irq_disable();
>  	rcu_note_context_switch(preempt);
>  
> +	prev_state = prev->state;
> +
>  	/*
> -	 * Make sure that signal_pending_state()->signal_pending() below
> -	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> -	 * done by the caller to avoid the race with signal_wake_up().
> +	 * __set_current_state(@state)
> +	 * schedule()				signal_wake_up()
> +	 *   prev_state = p->state		  set_tsk_thread_flag(p, TIF_SIGPENDING)
> +	 *					  wake_up_state(p, state)
> +	 *   LOCK rq->lock			    LOCK p->pi_state
> +	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
> +	 *   if (signal_pending_state()		    if (p->state & @state)
> +	 *
>  	 *
> -	 * The membarrier system call requires a full memory barrier
> +	 * Also, the membarrier system call requires a full memory barrier
>  	 * after coming from user-space, before storing to rq->curr.
>  	 */
>  	rq_lock(rq, &rf);
> @@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
>  	update_rq_clock(rq);
>  
>  	switch_count = &prev->nivcsw;
> -	if (!preempt && prev->state) {
> -		if (signal_pending_state(prev->state, prev)) {
> +	/*
> +	 * We must re-load p->state in case ttwu_runnable() changed it
> +	 * before we acquired rq->lock.
> +	 */
> +	if (!preempt && prev_state && prev_state == prev->state) {

I think the compiler won't optimize `prev_state == prev->state` out
because of the smp_mb__after_spinlock() which implies a compiler
barrier. Still not sure if it's worth making prev->state accesses a
READ_ONCE()?

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: weird loadavg on idle machine post 5.7
  2020-07-07  9:20 ` weird loadavg on idle machine post 5.7 Qais Yousef
@ 2020-07-07  9:47   ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-07  9:47 UTC (permalink / raw)
To: Qais Yousef
Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, valentin.schneider

On Tue, Jul 07, 2020 at 10:20:05AM +0100, Qais Yousef wrote:
> On 07/06/20 16:59, Peter Zijlstra wrote:
> > +	if (!preempt && prev_state && prev_state == prev->state) {
> 
> I think the compiler won't optimize `prev_state == prev->state` out because of
> the smp_mb__after_spinlock() which implies a compiler barrier. Still not sure
> if it's worth making prev->state accesses a READ_ONCE()?

task_struct::state is one of the very rare (and ancient) variables
that's declared volatile. We should probably clean that up some day,
but so far I've not attempted to do such a thing.

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2020-07-08  9:46 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
2020-07-02 19:46 ` Dave Jones
2020-07-02 21:15 ` Paul Gortmaker
2020-07-03 13:23 ` Paul Gortmaker
2020-07-02 21:36 ` Mel Gorman
2020-07-02 23:11 ` Michal Kubecek
2020-07-02 23:24 ` Dave Jones
2020-07-03  9:02 ` Peter Zijlstra
2020-07-03 10:40 ` Peter Zijlstra
2020-07-03 20:51 ` Dave Jones
2020-07-06 14:59 ` Peter Zijlstra
2020-07-06 21:20 ` Dave Jones
2020-07-07  7:48 ` Peter Zijlstra
2020-07-06 23:56 ` Valentin Schneider
2020-07-07  8:17 ` Peter Zijlstra
2020-07-07 10:20 ` Valentin Schneider
2020-07-07 10:29 ` Peter Zijlstra
2020-07-08  9:46 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra
2020-07-07  9:20 ` weird loadavg on idle machine post 5.7 Qais Yousef
2020-07-07  9:47 ` Peter Zijlstra