* Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16  9:10 UTC
To: Peter Zijlstra, Will Deacon
Cc: Davidlohr Bueso, linux-arm-kernel, linux-kernel

Hi,

I got cc'd an internal bug report filed against 5.8 and 5.9 kernels
that loadavg was "exploding" on arm64 on machines acting as build
servers. It happened on at least two different arm64 variants. That setup
is complex to replicate but fortunately can be reproduced by running
hackbench-process-pipes while heavily overcommitting a machine with 96
logical CPUs and then checking whether loadavg drops afterwards. With an
MMTests clone, I reproduced it as follows

  ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
        for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done

Load should drop to 10 after about 10 minutes and it does on x86-64 but
remained at around 200+ on arm64.

The reproduction case simply hammers the case where a task can be
descheduling while also being woken by another task at the same time. It
takes a long time to run but it makes the problem very obvious. The
expectation is that loadavg drops back down after hackbench has been
running and saturating the machine for a long time.

Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
accounting race in the generic case. Later, it was documented why the
ordering of when p->sched_contributes_to_load is read/updated relative
to p->on_cpu matters. This is critical when a task is descheduling at the
same time it is being activated on another CPU. While the loads/stores
happen under the RQ lock, the RQ lock on its own does not give any
guarantees on the task state.

Over the weekend I convinced myself that it must be because the
implementations of smp_load_acquire and smp_store_release do not appear
to implement acquire/release semantics, because I didn't find anything in
arm64 that was playing with p->state behind the scheduler's back (I could
have missed it if it was in an assembly portion as I can't reliably read
arm assembler). Similarly, it's not clear why the arm64 implementation
does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
implementation. Even when it was introduced, the arm64 implementation
differed significantly from the arm implementation in terms of what
barriers it used, for non-obvious reasons. Unfortunately, making that work
similarly to the arch-independent version did not help, and it does not
help that I know nothing about the arm64 memory model.

I'll be looking again today to see whether I can find a mistake in the
ordering for how sched_contributes_to_load is handled but again, the lack
of knowledge on the arm64 memory model means I'm a bit stuck and a second
set of eyes would be nice :(

-- 
Mel Gorman
SUSE Labs
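For context on why a stuck nr_uninterruptible count shows up as a load
average that never decays, below is a simplified userspace model of the
calculation. It is not the kernel's fixed-point code in
kernel/sched/loadavg.c; the decay constant and the leaked count of 234
uninterruptible tasks are assumptions chosen purely for illustration.

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		/* 1-minute loadavg: EWMA of runnable + uninterruptible, sampled every 5s */
		double exp_1 = exp(-5.0 / 60.0);
		double load = 990.0;			/* load at the end of the hackbench run */
		long nr_running = 10;			/* what is genuinely runnable afterwards */
		long leaked_uninterruptible = 234;	/* hypothetical stuck count */

		for (int minute = 1; minute <= 10; minute++) {
			for (int tick = 0; tick < 12; tick++) {
				long active = nr_running + leaked_uninterruptible;
				load = load * exp_1 + (double)active * (1.0 - exp_1);
			}
			printf("minute %2d: loadavg %6.2f\n", minute, load);
		}
		return 0;
	}

Instead of decaying towards 10, the model settles around 244, which is the
shape of the failure being reported: the average is correct arithmetic over
an incorrect task count.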
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 11:49 UTC
To: Peter Zijlstra, Will Deacon
Cc: Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> I'll be looking again today to see whether I can find a mistake in the
> ordering for how sched_contributes_to_load is handled but again, the lack
> of knowledge on the arm64 memory model means I'm a bit stuck and a second
> set of eyes would be nice :(
> 

This morning, it's not particularly clear what orders the visibility of
sched_contributes_to_load exactly like other task fields in the schedule
vs try_to_wake_up paths. I thought the rq lock would have ordered them but
something is clearly off or loadavg would not be getting screwed. It could
be done with an rmb and wmb (testing and it hasn't blown up so far) but
that's far too heavy. smp_load_acquire/smp_store_release might be
sufficient on it, although it's less clear whether arm64 gives the
necessary guarantees.

(This is still at the chucking-out-ideas stage as I haven't context
switched back in all the memory barrier rules).

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 12:00 UTC
To: Peter Zijlstra, Will Deacon
Cc: Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > I'll be looking again today to see whether I can find a mistake in the
> > ordering for how sched_contributes_to_load is handled but again, the lack
> > of knowledge on the arm64 memory model means I'm a bit stuck and a second
> > set of eyes would be nice :(
> > 
> 
> This morning, it's not particularly clear what orders the visibility of
> sched_contributes_to_load exactly like other task fields in the schedule
> vs try_to_wake_up paths. I thought the rq lock would have ordered them but
> something is clearly off or loadavg would not be getting screwed. It could
> be done with an rmb and wmb (testing and it hasn't blown up so far) but
> that's far too heavy. smp_load_acquire/smp_store_release might be
> sufficient on it, although it's less clear whether arm64 gives the
> necessary guarantees.
> 

And smp_* can't be used anyway because sched_contributes_to_load is a
bit field that is not protected by a specific lock, so it's "special".

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64
From: Peter Zijlstra @ 2020-11-16 12:53 UTC
To: Mel Gorman
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > I'll be looking again today to see whether I can find a mistake in the
> > ordering for how sched_contributes_to_load is handled but again, the lack
> > of knowledge on the arm64 memory model means I'm a bit stuck and a second
> > set of eyes would be nice :(
> > 
> 
> This morning, it's not particularly clear what orders the visibility of
> sched_contributes_to_load exactly like other task fields in the schedule
> vs try_to_wake_up paths. I thought the rq lock would have ordered them but
> something is clearly off or loadavg would not be getting screwed. It could
> be done with an rmb and wmb (testing and it hasn't blown up so far) but
> that's far too heavy. smp_load_acquire/smp_store_release might be
> sufficient on it, although it's less clear whether arm64 gives the
> necessary guarantees.
> 
> (This is still at the chucking-out-ideas stage as I haven't context
> switched back in all the memory barrier rules).

IIRC it should be so ordered by ->on_cpu.

We have:

	schedule()
	  prev->sched_contributes_to_load = X;
	  smp_store_release(&prev->on_cpu, 0);

on the one hand, and:

	sched_ttwu_pending()
	  if (WARN_ON_ONCE(p->on_cpu))
	    smp_cond_load_acquire(&p->on_cpu, !VAL)

	  ttwu_do_activate()
	    if (p->sched_contributes_to_load)
	      ...

on the other (for the remote case, which is the only 'interesting' one).
* Re: Loadavg accounting error on arm64
From: Peter Zijlstra @ 2020-11-16 12:58 UTC
To: Mel Gorman
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 01:53:55PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> > On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > > I'll be looking again today to see whether I can find a mistake in the
> > > ordering for how sched_contributes_to_load is handled but again, the lack
> > > of knowledge on the arm64 memory model means I'm a bit stuck and a second
> > > set of eyes would be nice :(
> > > 
> > 
> > This morning, it's not particularly clear what orders the visibility of
> > sched_contributes_to_load exactly like other task fields in the schedule
> > vs try_to_wake_up paths. I thought the rq lock would have ordered them but
> > something is clearly off or loadavg would not be getting screwed. It could
> > be done with an rmb and wmb (testing and it hasn't blown up so far) but
> > that's far too heavy. smp_load_acquire/smp_store_release might be
> > sufficient on it, although it's less clear whether arm64 gives the
> > necessary guarantees.
> > 
> > (This is still at the chucking-out-ideas stage as I haven't context
> > switched back in all the memory barrier rules).
> 
> IIRC it should be so ordered by ->on_cpu.
> 
> We have:
> 
> 	schedule()
> 	  prev->sched_contributes_to_load = X;
> 	  smp_store_release(&prev->on_cpu, 0);
> 
> on the one hand, and:

Ah, my bad, ttwu() itself will of course wait for !p->on_cpu before we
even get here.

> 	sched_ttwu_pending()
> 	  if (WARN_ON_ONCE(p->on_cpu))
> 	    smp_cond_load_acquire(&p->on_cpu, !VAL)
> 
> 	  ttwu_do_activate()
> 	    if (p->sched_contributes_to_load)
> 	      ...
> 
> on the other (for the remote case, which is the only 'interesting' one).

Also see the "Notes on Program-Order guarantees on SMP systems."
comment.
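To make the pairing Peter is describing concrete, below is a userspace
analogue using C11 atomics and pthreads. It is not kernel code: the atomic
int stands in for p->on_cpu and the plain int for the
sched_contributes_to_load bit, but the ordering argument is the same. A
store-release of the flag after the payload write pairs with an acquire
load of the flag, so a reader that observes on_cpu == 0 must also observe
the payload.

	#include <assert.h>
	#include <pthread.h>
	#include <stdatomic.h>

	static int sched_contributes_to_load;	/* plain payload, like the bitfield */
	static atomic_int on_cpu = 1;

	/* Analogue of the tail of __schedule(): payload write, then release. */
	static void *scheduler_side(void *arg)
	{
		sched_contributes_to_load = 1;
		atomic_store_explicit(&on_cpu, 0, memory_order_release);
		return NULL;
	}

	/* Analogue of smp_cond_load_acquire(&p->on_cpu, !VAL) in the ttwu path. */
	static void *ttwu_side(void *arg)
	{
		while (atomic_load_explicit(&on_cpu, memory_order_acquire))
			;
		/* The acquire load orders this read after the payload write. */
		assert(sched_contributes_to_load == 1);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, scheduler_side, NULL);
		pthread_create(&b, NULL, ttwu_side, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}

The open question in the rest of the thread is not whether this pairing is
sound in the abstract, but whether every path that reads
sched_contributes_to_load really sits behind such an acquire of ->on_cpu.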
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 15:29 UTC
To: Peter Zijlstra
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 01:58:03PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 01:53:55PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> > > On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > > > I'll be looking again today to see whether I can find a mistake in the
> > > > ordering for how sched_contributes_to_load is handled but again, the lack
> > > > of knowledge on the arm64 memory model means I'm a bit stuck and a second
> > > > set of eyes would be nice :(
> > > > 
> > > 
> > > This morning, it's not particularly clear what orders the visibility of
> > > sched_contributes_to_load exactly like other task fields in the schedule
> > > vs try_to_wake_up paths. I thought the rq lock would have ordered them but
> > > something is clearly off or loadavg would not be getting screwed.
> > 
> > IIRC it should be so ordered by ->on_cpu.
> > 
> > We have:
> > 
> > 	schedule()
> > 	  prev->sched_contributes_to_load = X;
> > 	  smp_store_release(&prev->on_cpu, 0);
> > 
> > on the one hand, and:
> 
> Ah, my bad, ttwu() itself will of course wait for !p->on_cpu before we
> even get here.
> 

Sortof, it will either have called smp_load_acquire(&p->on_cpu) or
smp_cond_load_acquire(&p->on_cpu, !VAL) before hitting one of the paths
leading to ttwu_do_activate. Either way, it's covered.

> > 	sched_ttwu_pending()
> > 	  if (WARN_ON_ONCE(p->on_cpu))
> > 	    smp_cond_load_acquire(&p->on_cpu, !VAL)
> > 
> > 	  ttwu_do_activate()
> > 	    if (p->sched_contributes_to_load)
> > 	      ...
> > 
> > on the other (for the remote case, which is the only 'interesting' one).
> 

But this side is interesting because I'm having trouble convincing
myself it's 100% correct for sched_contributes_to_load. The write of
prev->sched_contributes_to_load in the schedule() path has a big gap
before it hits the smp_store_release(&prev->on_cpu, 0).

On the ttwu path, we have

	if (smp_load_acquire(&p->on_cpu) &&
	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
		goto unlock;

ttwu_queue_wakelist queues the task on the wakelist, sends the IPI
and on the receiver side it calls ttwu_do_activate and reads
sched_contributes_to_load.

sched_ttwu_pending() is not necessarily using the same rq lock so no
protection there. The smp_load_acquire() has just been hit but it still
leaves a gap between when sched_contributes_to_load is written and a
parallel read of sched_contributes_to_load.

So while we might be able to avoid a smp_rmb() before the read of
sched_contributes_to_load and rely on p->on_cpu ordering there,
we may still need a smp_wmb() after rq->nr_uninterruptible increments
instead of waiting until the smp_store_release() is hit while a task
is scheduling. That would be a real memory barrier on arm64 and a plain
compiler barrier on x86-64.

> Also see the "Notes on Program-Order guarantees on SMP systems."
> comment.

I did, it was the on_cpu ordering for the blocking case that had me
looking at the smp_store_release and smp_cond_load_acquire in arm64 in
the first place, thinking that something in there must be breaking the
on_cpu ordering. I'm re-reading it every so often while trying to figure
out where the gap is or whether I'm imagining things.

Not fully tested but did not instantly break either

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..877eaeba45ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
 		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
-			prev->sched_contributes_to_load =
+			int acct_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
 				!(prev->flags & PF_FROZEN);
 
-			if (prev->sched_contributes_to_load)
+			prev->sched_contributes_to_load = acct_load;
+			if (acct_load) {
 				rq->nr_uninterruptible++;
 
+				/*
+				 * Pairs with p->on_cpu ordering, either a
+				 * smp_load_acquire or smp_cond_load_acquire
+				 * in the ttwu path before ttwu_do_activate
+				 * reads p->sched_contributes_to_load. It's
+				 * only after the nr_uninterruptible update
+				 * happens that the ordering is critical.
+				 */
+				smp_wmb();
+			}
+
 			/*
 			 * __schedule()			ttwu()
 			 *   prev_state = prev->state;    if (p->on_rq && ...)

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 16:42 UTC
To: Peter Zijlstra
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 03:29:46PM +0000, Mel Gorman wrote:
> I did, it was the on_cpu ordering for the blocking case that had me
> looking at the smp_store_release and smp_cond_load_acquire in arm64 in
> the first place, thinking that something in there must be breaking the
> on_cpu ordering. I'm re-reading it every so often while trying to figure
> out where the gap is or whether I'm imagining things.
> 
> Not fully tested but did not instantly break either
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..877eaeba45ac 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
>  		if (signal_pending_state(prev_state, prev)) {
>  			prev->state = TASK_RUNNING;
>  		} else {
> -			prev->sched_contributes_to_load =
> +			int acct_load =
>  				(prev_state & TASK_UNINTERRUPTIBLE) &&
>  				!(prev_state & TASK_NOLOAD) &&
>  				!(prev->flags & PF_FROZEN);
>  
> -			if (prev->sched_contributes_to_load)
> +			prev->sched_contributes_to_load = acct_load;
> +			if (acct_load) {
>  				rq->nr_uninterruptible++;
>  
> +				/*
> +				 * Pairs with p->on_cpu ordering, either a
> +				 * smp_load_acquire or smp_cond_load_acquire
> +				 * in the ttwu path before ttwu_do_activate
> +				 * reads p->sched_contributes_to_load. It's
> +				 * only after the nr_uninterruptible update
> +				 * happens that the ordering is critical.
> +				 */
> +				smp_wmb();
> +			}
> +
>  			/*
>  			 * __schedule()			ttwu()
>  			 *   prev_state = prev->state;    if (p->on_rq && ...)
> 

This passed the test. Load averages taken once a minute after the test
completed showed

	950.21 977.17 990.69 1/853 2117
	349.00 799.32 928.69 1/859 2439
	128.18 653.85 870.56 1/861 2736
	47.08 534.84 816.08 1/860 3029
	17.29 437.50 765.00 1/865 3357
	6.35 357.87 717.13 1/865 3653
	2.33 292.74 672.24 1/861 3709
	0.85 239.46 630.17 1/859 3711
	0.31 195.87 590.73 1/857 3713
	0.11 160.22 553.76 1/853 3715

With 5.10-rc3, it got stuck with a load average of 244 after the test
completed even though the machine was idle.

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64
From: Peter Zijlstra @ 2020-11-16 16:49 UTC
To: Mel Gorman
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 03:29:46PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:58:03PM +0100, Peter Zijlstra wrote:
> > > 	sched_ttwu_pending()
> > > 	  if (WARN_ON_ONCE(p->on_cpu))
> > > 	    smp_cond_load_acquire(&p->on_cpu, !VAL)
> > > 
> > > 	  ttwu_do_activate()
> > > 	    if (p->sched_contributes_to_load)
> > > 	      ...
> > > 
> > > on the other (for the remote case, which is the only 'interesting' one).
> > 
> 
> But this side is interesting because I'm having trouble convincing
> myself it's 100% correct for sched_contributes_to_load. The write of
> prev->sched_contributes_to_load in the schedule() path has a big gap
> before it hits the smp_store_release(&prev->on_cpu, 0).
> 
> On the ttwu path, we have
> 
> 	if (smp_load_acquire(&p->on_cpu) &&
> 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> 		goto unlock;
> 
> ttwu_queue_wakelist queues the task on the wakelist, sends the IPI
> and on the receiver side it calls ttwu_do_activate and reads
> sched_contributes_to_load.
> 
> sched_ttwu_pending() is not necessarily using the same rq lock so no
> protection there. The smp_load_acquire() has just been hit but it still
> leaves a gap between when sched_contributes_to_load is written and a
> parallel read of sched_contributes_to_load.
> 
> So while we might be able to avoid a smp_rmb() before the read of
> sched_contributes_to_load and rely on p->on_cpu ordering there,
> we may still need a smp_wmb() after rq->nr_uninterruptible increments
> instead of waiting until the smp_store_release() is hit while a task
> is scheduling. That would be a real memory barrier on arm64 and a plain
> compiler barrier on x86-64.

I'm mighty confused by your words here; and the patch below. What actual
scenario are you worried about?

If we take the WF_ON_CPU path, we IPI the CPU the task is ->on_cpu on.
So the IPI lands after the schedule() that clears ->on_cpu on the very
same CPU.

> > Also see the "Notes on Program-Order guarantees on SMP systems."
> > comment.
> 
> I did, it was the on_cpu ordering for the blocking case that had me
> looking at the smp_store_release and smp_cond_load_acquire in arm64 in
> the first place, thinking that something in there must be breaking the
> on_cpu ordering. I'm re-reading it every so often while trying to figure
> out where the gap is or whether I'm imagining things.
> 
> Not fully tested but did not instantly break either
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..877eaeba45ac 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
>  		if (signal_pending_state(prev_state, prev)) {
>  			prev->state = TASK_RUNNING;
>  		} else {
> -			prev->sched_contributes_to_load =
> +			int acct_load =
>  				(prev_state & TASK_UNINTERRUPTIBLE) &&
>  				!(prev_state & TASK_NOLOAD) &&
>  				!(prev->flags & PF_FROZEN);
>  
> -			if (prev->sched_contributes_to_load)
> +			prev->sched_contributes_to_load = acct_load;
> +			if (acct_load) {
>  				rq->nr_uninterruptible++;
>  
> +				/*
> +				 * Pairs with p->on_cpu ordering, either a
> +				 * smp_load_acquire or smp_cond_load_acquire
> +				 * in the ttwu path before ttwu_do_activate
> +				 * reads p->sched_contributes_to_load. It's
> +				 * only after the nr_uninterruptible update
> +				 * happens that the ordering is critical.
> +				 */
> +				smp_wmb();
> +			}

Sorry, I can't follow, at all.
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 17:24 UTC
To: Peter Zijlstra
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 05:49:28PM +0100, Peter Zijlstra wrote:
> > So while we might be able to avoid a smp_rmb() before the read of
> > sched_contributes_to_load and rely on p->on_cpu ordering there,
> > we may still need a smp_wmb() after rq->nr_uninterruptible increments
> > instead of waiting until the smp_store_release() is hit while a task
> > is scheduling. That would be a real memory barrier on arm64 and a plain
> > compiler barrier on x86-64.
> 

Wish I'd read this before sending the changelog

> I'm mighty confused by your words here; and the patch below. What actual
> scenario are you worried about?
> 

The wrong one apparently. Even if the IRQ is released, the IPI would
be delivered to the CPU that should observe the correct value, or the
other path is taken when smp_cond_load_acquire(&p->on_cpu, !VAL) waits
for the schedule to finish, so I'm now both confused and wondering why
smp_wmb made a difference at all.

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64
From: Will Deacon @ 2020-11-16 17:41 UTC
To: Mel Gorman
Cc: Peter Zijlstra, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 05:24:44PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 05:49:28PM +0100, Peter Zijlstra wrote:
> > > So while we might be able to avoid a smp_rmb() before the read of
> > > sched_contributes_to_load and rely on p->on_cpu ordering there,
> > > we may still need a smp_wmb() after rq->nr_uninterruptible increments
> > > instead of waiting until the smp_store_release() is hit while a task
> > > is scheduling. That would be a real memory barrier on arm64 and a plain
> > > compiler barrier on x86-64.
> > 
> 
> Wish I'd read this before sending the changelog
> 
> > I'm mighty confused by your words here; and the patch below. What actual
> > scenario are you worried about?
> > 
> 
> The wrong one apparently. Even if the IRQ is released, the IPI would
> be delivered to the CPU that should observe the correct value, or the
> other path is taken when smp_cond_load_acquire(&p->on_cpu, !VAL) waits
> for the schedule to finish, so I'm now both confused and wondering why
> smp_wmb made a difference at all.

Probably still worth trying Peter's hack to pad the bitfields though, as
I think that's still a plausible issue (and one which would appear to be
fixed by that smp_wmb() too).

Will
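The "pad the bitfields" hack referred to above is not shown in this thread,
so the fragment below is only a guess at its shape: the idea is to give the
bit that the wakeup path writes without holding the rq lock its own storage
unit, so the read-modify-write implied by a bitfield store cannot clobber
the neighbouring rq-lock-protected bits. The zero-width bitfield forcing the
split is a standard C trick; the exact field layout here is an assumption,
not Peter's actual patch.

	struct task_struct {
		/* ... */
		/* Updated while holding the rq lock. */
		unsigned			sched_contributes_to_load:1;
		unsigned			sched_migrated:1;
		unsigned			:0;	/* force a new storage unit */
		/* Updated from the wakeup path without the rq lock. */
		unsigned			sched_remote_wakeup:1;
		/* ... */
	};

Once sched_remote_wakeup no longer shares a memory location with the other
flags, a racy store to it can no longer silently corrupt unrelated bits such
as sched_contributes_to_load.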
* Re: Loadavg accounting error on arm64
From: Peter Zijlstra @ 2020-11-16 12:46 UTC
To: Mel Gorman
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> Similarly, it's not clear why the arm64 implementation
> does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> implementation. Even when it was introduced, the arm64 implementation
> differed significantly from the arm implementation in terms of what
> barriers it used, for non-obvious reasons.

This is because ARM64's smp_cond_load_acquire() implementation uses
smp_load_acquire() directly, as opposed to the generic version that uses
READ_ONCE().

This is because ARM64 has a load-acquire instruction, which is highly
optimized, and generally considered cheaper than the smp_rmb() from
smp_acquire__after_ctrl_dep().

Or so I've been led to believe.
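For readers who do not have both implementations in their head, the two
shapes being contrasted look roughly like the sketch below. These are
simplified paraphrases rather than the literal macros in
include/asm-generic/barrier.h and arch/arm64/include/asm/barrier.h, so
argument handling and the event-based wait are elided.

	/* Generic flavour: spin with a plain load, upgrade to acquire at the end. */
	#define generic_smp_cond_load_acquire(ptr, cond_expr) ({	\
		typeof(*(ptr)) VAL;					\
		for (;;) {						\
			VAL = READ_ONCE(*(ptr));			\
			if (cond_expr)					\
				break;					\
			cpu_relax();					\
		}							\
		smp_acquire__after_ctrl_dep();	/* smp_rmb() generically */ \
		VAL;							\
	})

	/* arm64 flavour: every iteration already uses a load-acquire. */
	#define arm64_smp_cond_load_acquire(ptr, cond_expr) ({		\
		typeof(*(ptr)) VAL;					\
		for (;;) {						\
			VAL = smp_load_acquire(ptr);	/* LDAR */	\
			if (cond_expr)					\
				break;					\
			__cmpwait_relaxed(ptr, VAL);	/* wait for a change */ \
		}							\
		VAL;							\
	})

Because the arm64 loop exits immediately after an acquire load of the value
that satisfied the condition, there is nothing left for a trailing
smp_acquire__after_ctrl_dep() to add, which is the point Peter is making.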
* Re: Loadavg accounting error on arm64
From: Mel Gorman @ 2020-11-16 12:58 UTC
To: Peter Zijlstra
Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel

On Mon, Nov 16, 2020 at 01:46:57PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > Similarly, it's not clear why the arm64 implementation
> > does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> > implementation. Even when it was introduced, the arm64 implementation
> > differed significantly from the arm implementation in terms of what
> > barriers it used, for non-obvious reasons.
> 
> This is because ARM64's smp_cond_load_acquire() implementation uses
> smp_load_acquire() directly, as opposed to the generic version that uses
> READ_ONCE().
> 
> This is because ARM64 has a load-acquire instruction, which is highly
> optimized, and generally considered cheaper than the smp_rmb() from
> smp_acquire__after_ctrl_dep().
> 
> Or so I've been led to believe.

Fair enough. Either way, barriering sched_contributes_to_load "works"
but it's clumsy and may not be guaranteed to be correct. The bits should
have been protected by the rq lock but sched_remote_wakeup is updated
outside of the lock, which might be leading to the adjacent fields (like
sched_contributes_to_load) getting corrupted as per the "anti-guarantees"
in memory-barriers.txt.

The rq lock could be conditionally acquired in __ttwu_queue_wakelist for
WF_MIGRATED and explicitly cleared in sched_ttwu_pending (not tested if
this works) but it would also suck to acquire a remote lock when that's
what we're explicitly trying to avoid in that path.

-- 
Mel Gorman
SUSE Labs
* Re: Loadavg accounting error on arm64 2020-11-16 9:10 ` Mel Gorman @ 2020-11-16 13:11 ` Will Deacon -1 siblings, 0 replies; 70+ messages in thread From: Will Deacon @ 2020-11-16 13:11 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote: > I got cc'd internal bug report filed against a 5.8 and 5.9 kernel > that loadavg was "exploding" on arch64 on a machines acting as a build > servers. It happened on at least two different arm64 variants. That setup > is complex to replicate but fortunately can be reproduced by running > hackbench-process-pipes while heavily overcomitting a machine with 96 > logical CPUs and then checking if loadavg drops afterwards. With an > MMTests clone, I reproduced it as follows > > ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \ > for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done > > Load should drop to 10 after about 10 minutes and it does on x86-64 but > remained at around 200+ on arm64. Do you think you could use this to bisect the problem? Also, are you able to reproduce the issue on any other arm64 machines, or just this one? > The reproduction case simply hammers the case where a task can be > descheduling while also being woken by another task at the same time. It > takes a long time to run but it makes the problem very obvious. The > expectation is that after hackbench has been running and saturating the > machine for a long time. > > Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg > accounting race in the generic case. Later it was documented why the > ordering of when p->sched_contributes_to_load is read/updated relative > to p->on_cpu. This is critical when a task is descheduling at the same > time it is being activated on another CPU. While the load/stores happen > under the RQ lock, the RQ lock on its own does not give any guarantees > on the task state. > > Over the weekend I convinced myself that it must be because the > implementation of smp_load_acquire and smp_store_release do not appear > to implement acquire/release semantics because I didn't find something > arm64 that was playing with p->state behind the schedulers back (I could > have missed it if it was in an assembly portion as I can't reliablyh read > arm assembler). Similarly, it's not clear why the arm64 implementation > does not call smp_acquire__after_ctrl_dep in the smp_load_acquire > implementation. Even when it was introduced, the arm64 implementation > differed significantly from the arm implementation in terms of what > barriers it used for non-obvious reasons. Why would you expect smp_acquire__after_ctrl_dep() to be called as part of the smp_load_acquire() implementation? FWIW, arm64 has special instructions for acquire and release (and they actually provide more order than is strictly needed by Linux), so we just map acquire/release to those instructions directly. Since these instructions are not available on most 32-bit cores, the arm implementation just uses the fence-based implementation. Anyway, setting all that aside, I do agree with you that the bitfield usage in task_struct looks pretty suspicious. For example, in __schedule() we have: rq_lock(rq, &rf); smp_mb__after_spinlock(); ... 
prev_state = prev->state; if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && !(prev->flags & PF_FROZEN); ... deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); where deactivate_task() updates p->on_rq directly: p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; so this is _not_ ordered wrt sched_contributes_to_load. But then over in __ttwu_queue_wakelist() we have: p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); which can be invoked on the try_to_wake_up() path if p->on_rq is first read as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield updates can race and cause the flags to be corrupted? Then again, I went through the list of observed KCSAN splats and don't see this race showing up in there, so perhaps it's serialised by something I haven't spotted. Will ^ permalink raw reply [flat|nested] 70+ messages in thread
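To see why the non-atomic bitfield updates pointed at above can lose a flag, here is a deliberately racy user-space sketch (assumptions: pthreads, compile with -pthread; this is intentionally a data race used only as an illustration, so whether and how often a loss is observed depends entirely on the CPU and compiler, and some runs may report zero):

    /*
     * Two threads store to different one-bit fields of the same word with
     * no common lock. Each store is a read-modify-write of the shared
     * word, so one store can overwrite the other. Field names are
     * illustrative stand-ins, not the kernel's.
     */
    #include <pthread.h>
    #include <stdio.h>

    static volatile struct {
        unsigned contributes_to_load:1;  /* written by the "scheduler" thread */
        unsigned remote_wakeup:1;        /* written by the "waker" (main) */
    } flags;

    static void *scheduler_thread(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000L; i++)
            flags.contributes_to_load = i & 1;  /* RMW of the whole word */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        long lost = 0;

        pthread_create(&tid, NULL, scheduler_thread, NULL);

        for (long i = 0; i < 1000000L; i++) {
            flags.remote_wakeup = 0;
            flags.remote_wakeup = 1;             /* the "waker" store */
            for (int j = 0; j < 100; j++) {
                if (flags.remote_wakeup == 0) {  /* clobbered by the other RMW */
                    lost++;
                    break;
                }
            }
        }

        pthread_join(tid, NULL);
        printf("remote_wakeup stores lost: %ld\n", lost);
        return 0;
    }

A non-zero "lost" count corresponds to the clobbering suspected in the thread: the kernel-side equivalent is sched_remote_wakeup overwriting sched_contributes_to_load, or vice versa.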
* Re: Loadavg accounting error on arm64 2020-11-16 13:11 ` Will Deacon @ 2020-11-16 13:37 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-16 13:37 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote: > On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote: > > I got cc'd internal bug report filed against a 5.8 and 5.9 kernel > > that loadavg was "exploding" on arch64 on a machines acting as a build > > servers. It happened on at least two different arm64 variants. That setup > > is complex to replicate but fortunately can be reproduced by running > > hackbench-process-pipes while heavily overcomitting a machine with 96 > > logical CPUs and then checking if loadavg drops afterwards. With an > > MMTests clone, I reproduced it as follows > > > > ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \ > > for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done > > > > Load should drop to 10 after about 10 minutes and it does on x86-64 but > > remained at around 200+ on arm64. > > Do you think you could use this to bisect the problem? Also, are you able > to reproduce the issue on any other arm64 machines, or just this one? > I didn't bisect it as I was assuming it was related to c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") which is something I would still like to preserve and was responsible for a loadavg glitch fixed by dbfb089d360b ("sched: Fix loadavg accounting race") and d136122f5845 ("sched: Fix race against ptrace_freeze_trace()"). While *I* can only reproduce it on one machine, I have a bug report saying it affects others. It's not a single machine issue or a single ARM variant. > > The reproduction case simply hammers the case where a task can be > > descheduling while also being woken by another task at the same time. It > > takes a long time to run but it makes the problem very obvious. The > > expectation is that after hackbench has been running and saturating the > > machine for a long time. > > > > Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg > > accounting race in the generic case. Later it was documented why the > > ordering of when p->sched_contributes_to_load is read/updated relative > > to p->on_cpu. This is critical when a task is descheduling at the same > > time it is being activated on another CPU. While the load/stores happen > > under the RQ lock, the RQ lock on its own does not give any guarantees > > on the task state. > > > > Over the weekend I convinced myself that it must be because the > > implementation of smp_load_acquire and smp_store_release do not appear > > to implement acquire/release semantics because I didn't find something > > arm64 that was playing with p->state behind the schedulers back (I could > > have missed it if it was in an assembly portion as I can't reliablyh read > > arm assembler). Similarly, it's not clear why the arm64 implementation > > does not call smp_acquire__after_ctrl_dep in the smp_load_acquire > > implementation. Even when it was introduced, the arm64 implementation > > differed significantly from the arm implementation in terms of what > > barriers it used for non-obvious reasons. > > Why would you expect smp_acquire__after_ctrl_dep() to be called as part of > the smp_load_acquire() implementation? > I wouldn't; I should have said smp_cond_load_acquire.
> FWIW, arm64 has special instructions for acquire and release (and they > actually provide more order than is strictly needed by Linux), so we > just map acquire/release to those instructions directly. Since these > instructions are not available on most 32-bit cores, the arm implementation > just uses the fence-based implementation. > Ok, makes sense. I think this was a red herring anyway as it's now looking more like a sched_contributes_to_load ordering issue. > Anyway, setting all that aside, I do agree with you that the bitfield usage > in task_struct looks pretty suspicious. For example, in __schedule() we > have: > > rq_lock(rq, &rf); > smp_mb__after_spinlock(); > ... > prev_state = prev->state; > > if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else { > prev->sched_contributes_to_load = > (prev_state & TASK_UNINTERRUPTIBLE) && > !(prev_state & TASK_NOLOAD) && > !(prev->flags & PF_FROZEN); > ... > deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); > > where deactivate_task() updates p->on_rq directly: > > p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better document ttwu()") which changed it. Not sure why that is and didn't think about it too deeply as it didn't appear to be directly related to the problem and didn't have ordering consequences. > so this is _not_ ordered wrt sched_contributes_to_load. But then over in > __ttwu_queue_wakelist() we have: > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield > updates can race and cause the flags to be corrupted? > I think this is at least one possibility. I think at least that one should only be explicitly set on WF_MIGRATED and explicitly cleared in sched_ttwu_pending. While I haven't audited it fully, it might be enough to avoid a double write outside of the rq lock on the bitfield but I still need to think more about the ordering of sched_contributes_to_load and whether it's ordered by p->on_cpu or not. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Loadavg accounting error on arm64 2020-11-16 13:37 ` Mel Gorman @ 2020-11-16 14:20 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-16 14:20 UTC (permalink / raw) To: Mel Gorman; +Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote: > On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote: > > Anyway, setting all that aside, I do agree with you that the bitfield usage > > in task_struct looks pretty suspicious. For example, in __schedule() we > > have: > > > > rq_lock(rq, &rf); > > smp_mb__after_spinlock(); > > ... > > prev_state = prev->state; > > > > if (!preempt && prev_state) { > > if (signal_pending_state(prev_state, prev)) { > > prev->state = TASK_RUNNING; > > } else { > > prev->sched_contributes_to_load = > > (prev_state & TASK_UNINTERRUPTIBLE) && > > !(prev_state & TASK_NOLOAD) && > > !(prev->flags & PF_FROZEN); > > ... > > deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); > > > > where deactivate_task() updates p->on_rq directly: > > > > p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; > > > > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better > document ttwu()") which changed it. Not sure why that is and didn't > think about it too deep as it didn't appear to be directly related to > the problem and didn't have ordering consequences. I'm confused; that commit didn't change deactivate_task(). Anyway, ->on_rq should be strictly under rq->lock. That said, since there is a READ_ONCE() consumer of ->on_rq it makes sense to have the stores as WRITE_ONCE(). > > __ttwu_queue_wakelist() we have: > > > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); > > > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield > > updates can race and cause the flags to be corrupted? > > > > I think this is at least one possibility. I think at least that one > should only be explicitly set on WF_MIGRATED and explicitly cleared in > sched_ttwu_pending. While I haven't audited it fully, it might be enough > to avoid a double write outside of the rq lock on the bitfield but I > still need to think more about the ordering of sched_contributes_to_load > and whether it's ordered by p->on_cpu or not. The scenario you're worried about is something like: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); And then the stores of X and Y clobber one another.. Hummph, seems reasonable. One quick thing to test would be something like this: diff --git a/include/linux/sched.h b/include/linux/sched.h index 7abbdd7f3884..9844e541c94c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -775,7 +775,9 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; + unsigned :0; unsigned sched_remote_wakeup:1; + unsigned :0; #ifdef CONFIG_PSI unsigned sched_psi_wake_requeue:1; #endif ^ permalink raw reply related [flat|nested] 70+ messages in thread
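The two "unsigned :0;" members in the test diff rely on standard C behaviour: a zero-width bitfield closes the current storage unit, so the next bitfield starts in a fresh word. A small user-space sketch of the effect (illustration only, with made-up struct names):

    /*
     * A zero-width bitfield forces the following bitfield into a new
     * storage unit, so the isolated flag no longer shares a word with
     * its neighbours.
     */
    #include <stdio.h>

    struct packed_flags {
        unsigned contributes_to_load:1;
        unsigned migrated:1;
        unsigned remote_wakeup:1;
    };

    struct padded_flags {
        unsigned contributes_to_load:1;
        unsigned migrated:1;
        unsigned :0;                    /* close the current unsigned */
        unsigned remote_wakeup:1;       /* starts a fresh unsigned */
        unsigned :0;                    /* nothing after it shares the word */
    };

    int main(void)
    {
        /* Typically 4 vs 8 bytes with a 32-bit unsigned int. */
        printf("packed: %zu bytes, padded: %zu bytes\n",
               sizeof(struct packed_flags), sizeof(struct padded_flags));
        return 0;
    }

With the padding, sched_remote_wakeup would no longer share a word with the flags written under the rq lock, so an unlocked store to it cannot clobber them; the cost is the padding bits the patch spends on isolation.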
* Re: Loadavg accounting error on arm64 2020-11-16 14:20 ` Peter Zijlstra @ 2020-11-16 15:52 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-16 15:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote: > > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better > > document ttwu()") which changed it. Not sure why that is and didn't > > think about it too deep as it didn't appear to be directly related to > > the problem and didn't have ordering consequences. > > I'm confused; that commit didn't change deactivate_task(). Anyway, > ->on_rq should be strictly under rq->lock. That said, since there is a > READ_ONCE() consumer of ->on_rq it makes sense to have the stores as > WRITE_ONCE(). > It didn't change deactivate_task but it did this - WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); - dequeue_task(rq, p, DEQUEUE_NOCLOCK); + deactivate_task(rq, p, DEQUEUE_NOCLOCK); which makes that write a plain p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; As the store in activate_task is also plain and I didn't spot a relevant ordering problem that would impact loadavg, I concluded it was not immediately relevant, just a curiosity. > > > > __ttwu_queue_wakelist() we have: > > > > > > > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); > > > > > > > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read > > > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield > > > > updates can race and cause the flags to be corrupted? > > > > > > > > > > I think this is at least one possibility. I think at least that one > > > should only be explicitly set on WF_MIGRATED and explicitly cleared in > > > sched_ttwu_pending. While I haven't audited it fully, it might be enough > > > to avoid a double write outside of the rq lock on the bitfield but I > > > still need to think more about the ordering of sched_contributes_to_load > > > and whether it's ordered by p->on_cpu or not. > > The scenario you're worried about is something like: > > CPU0 CPU1 > > schedule() > prev->sched_contributes_to_load = X; > deactivate_task(prev); > > try_to_wake_up() > if (p->on_rq &&) // false > if (smp_load_acquire(&p->on_cpu) && // true > ttwu_queue_wakelist()) > p->sched_remote_wakeup = Y; > > smp_store_release(prev->on_cpu, 0); > Yes, mostly because of what memory-barriers.txt warns about for bitfields if they are not protected by the same lock. > And then the stores of X and Y clobber one another.. Hummph, seems reasonable. One quick thing to test would be something like this: > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 7abbdd7f3884..9844e541c94c 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -775,7 +775,9 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > + unsigned :0; > unsigned sched_remote_wakeup:1; > + unsigned :0; > #ifdef CONFIG_PSI > unsigned sched_psi_wake_requeue:1; > #endif I'll test this after the smp_wmb() test completes. While a clobbering may be the issue, I also think the gap between the rq->nr_uninterruptible++ and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
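For completeness, the ordering property being leaned on here, that stores made before a release-store of on_cpu are visible to anyone who observes that store with an acquire-load, can be sketched in user space with C11 atomics (these are stand-ins for smp_store_release()/smp_load_acquire(), not the kernel primitives; compile with -pthread):

    /*
     * Release/acquire pairing: the plain store to contributes_to_load is
     * guaranteed visible to the waker once it observes on_cpu == 0 via an
     * acquire load.
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int contributes_to_load;     /* stands in for the bitfield */
    static atomic_int on_cpu = 1;

    static void *scheduler_side(void *arg)
    {
        (void)arg;
        contributes_to_load = 1;                                  /* plain store */
        atomic_store_explicit(&on_cpu, 0, memory_order_release);  /* like smp_store_release() */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, scheduler_side, NULL);

        /* Waker side: acquire load of on_cpu, like smp_load_acquire(). */
        while (atomic_load_explicit(&on_cpu, memory_order_acquire))
            ;
        /* Guaranteed to print 1: the release/acquire pair orders the flag. */
        printf("contributes_to_load = %d\n", contributes_to_load);

        pthread_join(tid, NULL);
        return 0;
    }

The point of the thread, of course, is that this guarantee only helps if the flag store itself is not clobbered by an unrelated read-modify-write of the same word.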
* Re: Loadavg accounting error on arm64 2020-11-16 15:52 ` Mel Gorman @ 2020-11-16 16:54 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-16 16:54 UTC (permalink / raw) To: Mel Gorman; +Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 03:52:32PM +0000, Mel Gorman wrote: > On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote: > > > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better > > > document ttwu()") which changed it. Not sure why that is and didn't > > > think about it too deep as it didn't appear to be directly related to > > > the problem and didn't have ordering consequences. > > > > I'm confused; that commit didn't change deactivate_task(). Anyway, > > ->on_rq should be strictly under rq->lock. That said, since there is a > > READ_ONCE() consumer of ->on_rq it makes sense to have the stores as > > WRITE_ONCE(). > > > > It didn't change deactivate_task but it did this > > - WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); > - dequeue_task(rq, p, DEQUEUE_NOCLOCK); > + deactivate_task(rq, p, DEQUEUE_NOCLOCK); > > which makes that write a > > p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; > > As activate_task is also a plain store and I didn't spot a relevant > ordering problem that would impact loadavg, I concluded it was not > immediately relevant, just a curiousity. That's move_queued_task() case, which is irrelevant for the issue at hand. > > > > __ttwu_queue_wakelist() we have: > > > > > > > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED); > > > > > > > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read > > > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield > > > > updates can race and cause the flags to be corrupted? > > > > > > > > > > I think this is at least one possibility. I think at least that one > > > should only be explicitly set on WF_MIGRATED and explicitly cleared in > > > sched_ttwu_pending. While I haven't audited it fully, it might be enough > > > to avoid a double write outside of the rq lock on the bitfield but I > > > still need to think more about the ordering of sched_contributes_to_load > > > and whether it's ordered by p->on_cpu or not. > > > > The scenario you're worried about is something like: > > > > CPU0 CPU1 > > > > schedule() > > prev->sched_contributes_to_load = X; > > deactivate_task(prev); > > > > try_to_wake_up() > > if (p->on_rq &&) // false > > if (smp_load_acquire(&p->on_cpu) && // true > > ttwu_queue_wakelist()) > > p->sched_remote_wakeup = Y; > > > > smp_store_release(prev->on_cpu, 0); > > > > Yes, mostly because of what memory-barriers.txt warns about for bitfields > if they are not protected by the same lock. I'm not sure memory-barriers.txt is relevant; that's simply two racing stores and 'obviously' buggered. > > And then the stores of X and Y clobber one another.. Hummph, seems > > reasonable. 
One quick thing to test would be something like this: > > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > index 7abbdd7f3884..9844e541c94c 100644 > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -775,7 +775,9 @@ struct task_struct { > > unsigned sched_reset_on_fork:1; > > unsigned sched_contributes_to_load:1; > > unsigned sched_migrated:1; > > + unsigned :0; > > unsigned sched_remote_wakeup:1; > > + unsigned :0; > > #ifdef CONFIG_PSI > > unsigned sched_psi_wake_requeue:1; > > #endif > > I'll test this after the smp_wmb() test completes. While a clobbering may > be the issue, I also think the gap between the rq->nr_uninterruptible++ > and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate. I really don't understand what you wrote in that email... ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Loadavg accounting error on arm64 2020-11-16 16:54 ` Peter Zijlstra @ 2020-11-16 17:16 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-16 17:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 05:54:15PM +0100, Peter Zijlstra wrote: > > > And then the stores of X and Y clobber one another.. Hummph, seems > > > reasonable. One quick thing to test would be something like this: > > > > > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > > index 7abbdd7f3884..9844e541c94c 100644 > > > --- a/include/linux/sched.h > > > +++ b/include/linux/sched.h > > > @@ -775,7 +775,9 @@ struct task_struct { > > > unsigned sched_reset_on_fork:1; > > > unsigned sched_contributes_to_load:1; > > > unsigned sched_migrated:1; > > > + unsigned :0; > > > unsigned sched_remote_wakeup:1; > > > + unsigned :0; > > > #ifdef CONFIG_PSI > > > unsigned sched_psi_wake_requeue:1; > > > #endif > > > > I'll test this after the smp_wmb() test completes. While a clobbering may > > be the issue, I also think the gap between the rq->nr_uninterruptible++ > > and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate. > > I really don't understand what you wrote in that email... Sorry :(. I tried writing a changelog showing where I think the race might be. I'll queue up your patch for the case where sched_migrated and sched_remote_wakeup are potentially being clobbered. --8<-- sched: Fix loadavg accounting race on arm64 An internal bug report was filed against a 5.8 and 5.9 kernel stating that loadavg was "exploding" on arm64 on machines acting as build servers. It happened on at least two different arm64 variants. That setup is complex to replicate but can be reproduced by running hackbench-process-pipes while heavily overcommitting a machine with 96 logical CPUs and then checking if loadavg drops afterwards. With an MMTests clone, reproduce it as follows ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \ for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done The reproduction case simply hammers the case where a task can be descheduling while also being woken by another task at the same time. After the test completes, load avg should reach 0 within a few minutes. Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg accounting race in the generic case. Later it was documented why the ordering of when p->sched_contributes_to_load is read/updated relative to p->on_cpu matters. This is critical when a task is descheduling at the same time it is being activated on another CPU. While the load/stores happen under the RQ lock, the RQ lock on its own does not give any guarantees on the task state. The problem appears to be because the schedule and wakeup paths rely on being ordered by ->on_cpu for some fields as documented in core.c under "Notes on Program-Order guarantees on SMP systems". However, this can happen CPU 0 CPU 1 CPU 2 __schedule() prev->sched_contributes_to_load = 1 rq->nr_uninterruptible++; rq_unlock_irq try_to_wake_up smp_load_acquire(&p->on_cpu) && ttwu_queue_wakelist(p) == true ttwu_queue_wakelist ttwu_queue_cond (true) __ttwu_queue_wakelist sched_ttwu_pending ttwu_do_activate if (p->sched_contributes_to_load) (wrong value observed, load drifts) finish_task smp_store_release(X->on_cpu, 0) There is a gap between when rq->nr_uninterruptible is written and when p->on_cpu is updated with smp_store_release().
During that window, a parallel waker may observe an incorrect value for p->sched_contributes_to_load and fail to decrement rq->nr_uninterruptible, so loadavg starts drifting. This patch adds a write barrier after nr_uninterruptible is updated, pairing with the acquire when p->on_cpu is read in the ttwu path. With the patch applied, the load averages taken every minute after the hackbench test case completes are 950.21 977.17 990.69 1/853 2117 349.00 799.32 928.69 1/859 2439 128.18 653.85 870.56 1/861 2736 47.08 534.84 816.08 1/860 3029 17.29 437.50 765.00 1/865 3357 6.35 357.87 717.13 1/865 3653 2.33 292.74 672.24 1/861 3709 0.85 239.46 630.17 1/859 3711 0.31 195.87 590.73 1/857 3713 0.11 160.22 553.76 1/853 3715 Without the patch, the load average stabilised at 244 on an otherwise idle machine. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: stable@vger.kernel.org # v5.8+ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d2003a7d5ab5..877eaeba45ac 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt) if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { - prev->sched_contributes_to_load = + int acct_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && !(prev->flags & PF_FROZEN); - if (prev->sched_contributes_to_load) + prev->sched_contributes_to_load = acct_load; + if (acct_load) { rq->nr_uninterruptible++; + /* + * Pairs with p->on_cpu ordering, either a + * smp_load_acquire or smp_cond_load_acquire + * in the ttwu path before ttwu_do_activate + * reads p->sched_contributes_to_load. It's only + * after the nr_uninterruptible update happens + * that the ordering is critical. + */ + smp_wmb(); + } + /* * __schedule() ttwu() * prev_state = prev->state; if (p->on_rq && ...) -- Mel Gorman SUSE Labs ^ permalink raw reply related [flat|nested] 70+ messages in thread
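As a sanity check on the numbers above: once nothing is runnable, the kernel ages the 1-minute load average by roughly exp(-5s/60s) every 5 seconds, i.e. about exp(-1) (roughly 0.368) per minute, so each 1-minute sample should be about a third of the previous one. A quick check of the first column against that prediction (illustration only; compile with -lm):

    /* Compare the post-patch 1-minute loadavg samples with exp(-1) decay. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* 1-minute loadavg samples from the run quoted above. */
        const double sample[] = {
            950.21, 349.00, 128.18, 47.08, 17.29,
            6.35, 2.33, 0.85, 0.31, 0.11
        };
        const int n = sizeof(sample) / sizeof(sample[0]);

        for (int i = 1; i < n; i++)
            printf("minute %d: measured %.2f, exp(-1) prediction %.2f\n",
                   i, sample[i], sample[i - 1] * exp(-1.0));
        return 0;
    }

The measured samples track the exp(-1) prediction closely, which is what a healthy decay to idle should look like, as opposed to the pre-patch behaviour where the average settles at a large constant.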
* Re: Loadavg accounting error on arm64 2020-11-16 14:20 ` Peter Zijlstra @ 2020-11-16 19:31 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-16 19:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote: > > I think this is at least one possibility. I think at least that one > > should only be explicitly set on WF_MIGRATED and explicitly cleared in > > sched_ttwu_pending. While I haven't audited it fully, it might be enough > > to avoid a double write outside of the rq lock on the bitfield but I > > still need to think more about the ordering of sched_contributes_to_load > > and whether it's ordered by p->on_cpu or not. > > The scenario you're worried about is something like: > > CPU0 CPU1 > > schedule() > prev->sched_contributes_to_load = X; > deactivate_task(prev); > > try_to_wake_up() > if (p->on_rq &&) // false > if (smp_load_acquire(&p->on_cpu) && // true > ttwu_queue_wakelist()) > p->sched_remote_wakeup = Y; > > smp_store_release(prev->on_cpu, 0); > Yes. > And then the stores of X and Y clobber one another.. Hummph, seems > reasonable. One quick thing to test would be something like this: > > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 7abbdd7f3884..9844e541c94c 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -775,7 +775,9 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > + unsigned :0; > unsigned sched_remote_wakeup:1; > + unsigned :0; > #ifdef CONFIG_PSI > unsigned sched_psi_wake_requeue:1; > #endif And this works. 986.01 1008.17 1013.15 2/855 1212 362.19 824.70 949.75 1/856 1564 133.19 674.65 890.32 1/864 1958 49.04 551.89 834.61 2/871 2339 18.33 451.54 782.41 1/867 2686 6.77 369.37 733.45 1/866 2929 2.55 302.16 687.55 1/864 2931 0.97 247.18 644.52 1/860 2933 0.48 202.23 604.20 1/849 2935 I should have gone with this after rereading the warning about bit fields having to be protected by the same lock in the "anti-guarantees" section of memory-barriers.txt :( sched_psi_wake_requeue can probably stay with the other three fields given they are under the rq lock but sched_remote_wakeup needs to move out. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH] sched: Fix data-race in wakeup 2020-11-16 19:31 ` Mel Gorman @ 2020-11-17 8:30 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-17 8:30 UTC (permalink / raw) To: Mel Gorman; +Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote: > And this works. Yay! > sched_psi_wake_requeue can probably stay with the other three fields > given they are under the rq lock but sched_remote_wakeup needs to move > out. I _think_ we can move the bit into the unserialized section below. It's a bit cheeky, but it should work I think because the only time we actually use this bit, we're guaranteed the task isn't actually running, so current doesn't exist. I suppose the question is whether this is worth saving 31 bits over... How's this? --- Subject: sched: Fix data-race in wakeup From: Peter Zijlstra <peterz@infradead.org> Date: Tue Nov 17 09:08:41 CET 2020 Mel reported that on some ARM64 platforms loadavg goes bananas and tracked it down to the following data race: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data). Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load), that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0. However, due to p->XXX ttwu() schedule() if (p->on_rq && ...) // false smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && deactivate_task() ttwu_queue_wakelist()) p->on_rq = 0; p->sched_remote_wakeup = X; We can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word. Note: while the observed failure was loadavg accounting gone wrong due to ttwu() clobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup; this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/sched.h | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -775,7 +775,6 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; - unsigned sched_remote_wakeup:1; #ifdef CONFIG_PSI unsigned sched_psi_wake_requeue:1; #endif @@ -785,6 +784,18 @@ struct task_struct { /* Unserialized, strictly 'current' */ + /* + * p->in_iowait = 1; ttwu() + * schedule() if (p->on_rq && ..) // false + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true + * deactivate_task() ttwu_queue_wakelist()) + * p->on_rq = 0; p->sched_remote_wakeup = X; + * + * Guarantees all stores of 'current' are visible before + * ->sched_remote_wakeup gets used.
+ */ + unsigned sched_remote_wakeup:1; + /* Bit to tell LSMs we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; ^ permalink raw reply [flat|nested] 70+ messages in thread
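To make the clobbering described in the changelog above concrete, here is a small self-contained userspace analogue. It is a sketch only: plain C bit-fields and two pthreads stand in for the task_struct flags and the two CPUs, the names are invented, and because the program contains a deliberate data race it should be built without optimisation (for example gcc -O0 -pthread), otherwise the compiler is entitled to fold the re-load away.

    #include <pthread.h>
    #include <stdio.h>

    /* Two flags sharing one word, as sched_contributes_to_load and
     * sched_remote_wakeup did before the patch above. */
    struct flags {
            unsigned contributes_to_load:1;
            unsigned remote_wakeup:1;
    };

    static struct flags f;
    static unsigned long lost;

    /* "CPU0": sets and clears its own flag and notices when the value it
     * just wrote was overwritten by the other thread's read-modify-write. */
    static void *cpu0(void *arg)
    {
            (void)arg;
            for (long i = 0; i < 10000000; i++) {
                    f.contributes_to_load = 1;
                    if (!f.contributes_to_load)
                            lost++;            /* our store was clobbered */
                    f.contributes_to_load = 0;
            }
            return NULL;
    }

    /* "CPU1": hammers the adjacent bit in the same word. */
    static void *cpu1(void *arg)
    {
            (void)arg;
            for (long i = 0; i < 10000000; i++) {
                    f.remote_wakeup = 1;
                    f.remote_wakeup = 0;
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, cpu0, NULL);
            pthread_create(&b, NULL, cpu1, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            printf("stores clobbered by the adjacent bit-field writer: %lu\n", lost);
            return 0;
    }

Moving one of the bits into a different word, as the patch does, makes the two writers touch distinct memory locations, at which point they are no longer conflicting accesses at all.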
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 8:30 ` Peter Zijlstra @ 2020-11-17 9:15 ` Will Deacon -1 siblings, 0 replies; 70+ messages in thread From: Will Deacon @ 2020-11-17 9:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote: > On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote: > > > And this works. > > Yay! > > > sched_psi_wake_requeue can probably stay with the other three fields > > given they are under the rq lock but sched_remote_wakeup needs to move > > out. > > I _think_ we can move the bit into the unserialized section below. > > It's a bit cheecky, but it should work I think because the only time we > actually use this bit, we're guaranteed the task isn't actually running, > so current doesn't exist. > > I suppose the question is wether this is worth saving 31 bits over... > > How's this? > > --- > Subject: sched: Fix data-race in wakeup > From: Peter Zijlstra <peterz@infradead.org> > Date: Tue Nov 17 09:08:41 CET 2020 > > Mel reported that on some ARM64 platforms loadavg goes bananas and > tracked it down to the following data race: > > CPU0 CPU1 > > schedule() > prev->sched_contributes_to_load = X; > deactivate_task(prev); > > try_to_wake_up() > if (p->on_rq &&) // false > if (smp_load_acquire(&p->on_cpu) && // true > ttwu_queue_wakelist()) > p->sched_remote_wakeup = Y; > > smp_store_release(prev->on_cpu, 0); (nit: I suggested this race over at [1] ;) > where both p->sched_contributes_to_load and p->sched_remote_wakeup are > in the same word, and thus the stores X and Y race (and can clobber > one another's data). > > Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu() > spinning on p->on_cpu") the p->on_cpu handoff serialized access to > p->sched_remote_wakeup (just as it still does with > p->sched_contributes_to_load) that commit broke that by calling > ttwu_queue_wakelist() with p->on_cpu != 0. > > However, due to > > p->XXX ttwu() > schedule() if (p->on_rq && ...) // false > smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && > deactivate_task() ttwu_queue_wakelist()) > p->on_rq = 0; p->sched_remote_wakeup = X; > > We can be sure any 'current' store is complete and 'current' is > guaranteed asleep. Therefore we can move p->sched_remote_wakeup into > the current flags word. > > Note: while the observed failure was loadavg accounting gone wrong due > to ttwu() cobbering p->sched_contributes_to_load, the reverse problem > is also possible where schedule() clobbers p->sched_remote_wakeup, > this could result in enqueue_entity() wrecking ->vruntime and causing > scheduling artifacts. > > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") > Reported-by: Mel Gorman <mgorman@techsingularity.net> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > include/linux/sched.h | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -775,7 +775,6 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > - unsigned sched_remote_wakeup:1; > #ifdef CONFIG_PSI > unsigned sched_psi_wake_requeue:1; > #endif > @@ -785,6 +784,18 @@ struct task_struct { > > /* Unserialized, strictly 'current' */ > > + /* > + * p->in_iowait = 1; ttwu() > + * schedule() if (p->on_rq && ..) 
// false > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > + * deactivate_task() ttwu_queue_wakelist()) > + * p->on_rq = 0; p->sched_remote_wakeup = X; > + * > + * Guarantees all stores of 'current' are visible before > + * ->sched_remote_wakeup gets used. I'm still not sure this is particularly clear -- don't we want to highlight that the store of p->on_rq is unordered wrt the update to p->sched_contributes_to_load() in deactivate_task()? I dislike bitfields with a passion, but the fix looks good: Acked-by: Will Deacon <will@kernel.org> Now the million dollar question is why KCSAN hasn't run into this. Hrmph. Will [1] https://lore.kernel.org/r/20201116131102.GA29992@willie-the-truck ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 9:15 ` Will Deacon @ 2020-11-17 9:29 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-17 9:29 UTC (permalink / raw) To: Will Deacon; +Cc: Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote: > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote: > > Subject: sched: Fix data-race in wakeup > > From: Peter Zijlstra <peterz@infradead.org> > > Date: Tue Nov 17 09:08:41 CET 2020 > > > > Mel reported that on some ARM64 platforms loadavg goes bananas and > > tracked it down to the following data race: > > > > CPU0 CPU1 > > > > schedule() > > prev->sched_contributes_to_load = X; > > deactivate_task(prev); > > > > try_to_wake_up() > > if (p->on_rq &&) // false > > if (smp_load_acquire(&p->on_cpu) && // true > > ttwu_queue_wakelist()) > > p->sched_remote_wakeup = Y; > > > > smp_store_release(prev->on_cpu, 0); > > (nit: I suggested this race over at [1] ;) Ah, I'll amend and get you a Debugged-by line or something ;-) > > where both p->sched_contributes_to_load and p->sched_remote_wakeup are > > in the same word, and thus the stores X and Y race (and can clobber > > one another's data). > > > > Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu() > > spinning on p->on_cpu") the p->on_cpu handoff serialized access to > > p->sched_remote_wakeup (just as it still does with > > p->sched_contributes_to_load) that commit broke that by calling > > ttwu_queue_wakelist() with p->on_cpu != 0. > > > > However, due to > > > > p->XXX ttwu() > > schedule() if (p->on_rq && ...) // false > > smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && > > deactivate_task() ttwu_queue_wakelist()) > > p->on_rq = 0; p->sched_remote_wakeup = X; > > > > We can be sure any 'current' store is complete and 'current' is > > guaranteed asleep. Therefore we can move p->sched_remote_wakeup into > > the current flags word. > > > > Note: while the observed failure was loadavg accounting gone wrong due > > to ttwu() cobbering p->sched_contributes_to_load, the reverse problem > > is also possible where schedule() clobbers p->sched_remote_wakeup, > > this could result in enqueue_entity() wrecking ->vruntime and causing > > scheduling artifacts. > > > > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") > > Reported-by: Mel Gorman <mgorman@techsingularity.net> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > --- > > include/linux/sched.h | 13 ++++++++++++- > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -775,7 +775,6 @@ struct task_struct { > > unsigned sched_reset_on_fork:1; > > unsigned sched_contributes_to_load:1; > > unsigned sched_migrated:1; > > - unsigned sched_remote_wakeup:1; > > #ifdef CONFIG_PSI > > unsigned sched_psi_wake_requeue:1; > > #endif > > @@ -785,6 +784,18 @@ struct task_struct { > > > > /* Unserialized, strictly 'current' */ > > > > + /* > > + * p->in_iowait = 1; ttwu() > > + * schedule() if (p->on_rq && ..) // false > > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > > + * deactivate_task() ttwu_queue_wakelist()) > > + * p->on_rq = 0; p->sched_remote_wakeup = X; > > + * > > + * Guarantees all stores of 'current' are visible before > > + * ->sched_remote_wakeup gets used.
> > I'm still not sure this is particularly clear -- don't we want to highlight > that the store of p->on_rq is unordered wrt the update to > p->sched_contributes_to_load() in deactivate_task()? I can explicitly call that out I suppose. > I dislike bitfields with a passion, but the fix looks good: I don't particularly hate them, they're just a flag field with names on (in this case). > Acked-by: Will Deacon <will@kernel.org> Thanks! > Now the million dollar question is why KCSAN hasn't run into this. Hrmph. kernel/sched/Makefile:KCSAN_SANITIZE := n might have something to do with that, I suppose. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 9:29 ` Peter Zijlstra @ 2020-11-17 9:46 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-17 9:46 UTC (permalink / raw) To: Will Deacon; +Cc: Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote: > On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote: > > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote: > > > /* Unserialized, strictly 'current' */ > > > > > > + /* > > > + * p->in_iowait = 1; ttwu() > > > + * schedule() if (p->on_rq && ..) // false > > > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > > > + * deactivate_task() ttwu_queue_wakelist()) > > > + * p->on_rq = 0; p->sched_remote_wakeup = X; > > > + * > > > + * Guarantees all stores of 'current' are visible before > > > + * ->sched_remote_wakeup gets used. > > > > I'm still not sure this is particularly clear -- don't we want to highlight > > that the store of p->on_rq is unordered wrt the update to > > p->sched_contributes_to_load() in deactivate_task()? How's this then? It still doesn't explicitly call out the specific race, but does mention the more fundamental issue that wakelist queueing doesn't respect the regular rules anymore. --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -775,7 +775,6 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; - unsigned sched_remote_wakeup:1; #ifdef CONFIG_PSI unsigned sched_psi_wake_requeue:1; #endif @@ -785,6 +784,21 @@ struct task_struct { /* Unserialized, strictly 'current' */ + /* + * This field must not be in the scheduler word above due to wakelist + * queueing no longer being serialized by p->on_cpu. However: + * + * p->XXX = X; ttwu() + * schedule() if (p->on_rq && ..) // false + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true + * deactivate_task() ttwu_queue_wakelist()) + * p->on_rq = 0; p->sched_remote_wakeup = Y; + * + * guarantees all stores of 'current' are visible before + * ->sched_remote_wakeup gets used, so it can be in this word. + */ + unsigned sched_remote_wakeup:1; + /* Bit to tell LSMs we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 9:46 ` Peter Zijlstra @ 2020-11-17 10:36 ` Will Deacon -1 siblings, 0 replies; 70+ messages in thread From: Will Deacon @ 2020-11-17 10:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 10:46:21AM +0100, Peter Zijlstra wrote: > On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote: > > On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote: > > > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote: > > > > /* Unserialized, strictly 'current' */ > > > > > > > > + /* > > > > + * p->in_iowait = 1; ttwu() > > > > + * schedule() if (p->on_rq && ..) // false > > > > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > > > > + * deactivate_task() ttwu_queue_wakelist()) > > > > + * p->on_rq = 0; p->sched_remote_wakeup = X; > > > > + * > > > > + * Guarantees all stores of 'current' are visible before > > > > + * ->sched_remote_wakeup gets used. > > > > > > I'm still not sure this is particularly clear -- don't we want to highlight > > > that the store of p->on_rq is unordered wrt the update to > > > p->sched_contributes_to_load() in deactivate_task()? > > How's this then? It still doesn't explicitly call out the specific race, > but does mention the more fundamental issue that wakelist queueing > doesn't respect the regular rules anymore. > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -775,7 +775,6 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > - unsigned sched_remote_wakeup:1; > #ifdef CONFIG_PSI > unsigned sched_psi_wake_requeue:1; > #endif > @@ -785,6 +784,21 @@ struct task_struct { > > /* Unserialized, strictly 'current' */ > > + /* > + * This field must not be in the scheduler word above due to wakelist > + * queueing no longer being serialized by p->on_cpu. However: > + * > + * p->XXX = X; ttwu() > + * schedule() if (p->on_rq && ..) // false > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > + * deactivate_task() ttwu_queue_wakelist()) > + * p->on_rq = 0; p->sched_remote_wakeup = Y; > + * > + * guarantees all stores of 'current' are visible before > + * ->sched_remote_wakeup gets used, so it can be in this word. > + */ > + unsigned sched_remote_wakeup:1; Much better, thanks! Will ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 9:46 ` Peter Zijlstra @ 2020-11-17 12:52 ` Valentin Schneider -1 siblings, 0 replies; 70+ messages in thread From: Valentin Schneider @ 2020-11-17 12:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On 17/11/20 09:46, Peter Zijlstra wrote: > How's this then? It still doesn't explicitly call out the specific race, > but does mention the more fundamental issue that wakelist queueing > doesn't respect the regular rules anymore. > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -775,7 +775,6 @@ struct task_struct { > unsigned sched_reset_on_fork:1; > unsigned sched_contributes_to_load:1; > unsigned sched_migrated:1; > - unsigned sched_remote_wakeup:1; > #ifdef CONFIG_PSI > unsigned sched_psi_wake_requeue:1; > #endif > @@ -785,6 +784,21 @@ struct task_struct { > > /* Unserialized, strictly 'current' */ > > + /* > + * This field must not be in the scheduler word above due to wakelist > + * queueing no longer being serialized by p->on_cpu. However: > + * > + * p->XXX = X; ttwu() > + * schedule() if (p->on_rq && ..) // false > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > + * deactivate_task() ttwu_queue_wakelist()) > + * p->on_rq = 0; p->sched_remote_wakeup = Y; > + * > + * guarantees all stores of 'current' are visible before > + * ->sched_remote_wakeup gets used, so it can be in this word. > + */ Isn't the control dep between that ttwu() p->on_rq read and p->sched_remote_wakeup write "sufficient"? That should be giving the right ordering for the rest of ttwu() wrt. those 'current' bits, considering they are written before that smp_mb__after_spinlock(). In any case, consider me convinced: Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> > + unsigned sched_remote_wakeup:1; > + > /* Bit to tell LSMs we're in execve(): */ > unsigned in_execve:1; > unsigned in_iowait:1; ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 12:52 ` Valentin Schneider @ 2020-11-17 15:37 ` Valentin Schneider -1 siblings, 0 replies; 70+ messages in thread From: Valentin Schneider @ 2020-11-17 15:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On 17/11/20 12:52, Valentin Schneider wrote: > On 17/11/20 09:46, Peter Zijlstra wrote: >> How's this then? It still doesn't explicitly call out the specific race, >> but does mention the more fundamental issue that wakelist queueing >> doesn't respect the regular rules anymore. >> >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -775,7 +775,6 @@ struct task_struct { >> unsigned sched_reset_on_fork:1; >> unsigned sched_contributes_to_load:1; >> unsigned sched_migrated:1; >> - unsigned sched_remote_wakeup:1; >> #ifdef CONFIG_PSI >> unsigned sched_psi_wake_requeue:1; >> #endif >> @@ -785,6 +784,21 @@ struct task_struct { >> >> /* Unserialized, strictly 'current' */ >> >> + /* >> + * This field must not be in the scheduler word above due to wakelist >> + * queueing no longer being serialized by p->on_cpu. However: >> + * >> + * p->XXX = X; ttwu() >> + * schedule() if (p->on_rq && ..) // false >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true >> + * deactivate_task() ttwu_queue_wakelist()) >> + * p->on_rq = 0; p->sched_remote_wakeup = Y; >> + * >> + * guarantees all stores of 'current' are visible before >> + * ->sched_remote_wakeup gets used, so it can be in this word. >> + */ > > Isn't the control dep between that ttwu() p->on_rq read and > p->sched_remote_wakeup write "sufficient"? smp_acquire__after_ctrl_dep() that is, since we need ->on_rq load => 'current' bits load + store > That should be giving the right > ordering for the rest of ttwu() wrt. those 'current' bits, considering they > are written before that smp_mb__after_spinlock(). > > In any case, consider me convinced: > > Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> > >> + unsigned sched_remote_wakeup:1; >> + >> /* Bit to tell LSMs we're in execve(): */ >> unsigned in_execve:1; >> unsigned in_iowait:1; ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 15:37 ` Valentin Schneider @ 2020-11-17 16:13 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-17 16:13 UTC (permalink / raw) To: Valentin Schneider Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote: > >> + /* > >> + * This field must not be in the scheduler word above due to wakelist > >> + * queueing no longer being serialized by p->on_cpu. However: > >> + * > >> + * p->XXX = X; ttwu() > >> + * schedule() if (p->on_rq && ..) // false > >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > >> + * deactivate_task() ttwu_queue_wakelist()) > >> + * p->on_rq = 0; p->sched_remote_wakeup = Y; > >> + * > >> + * guarantees all stores of 'current' are visible before > >> + * ->sched_remote_wakeup gets used, so it can be in this word. > >> + */ > > > > Isn't the control dep between that ttwu() p->on_rq read and > > p->sched_remote_wakeup write "sufficient"? > > smp_acquire__after_ctrl_dep() that is, since we need > ->on_rq load => 'current' bits load + store I don't think we need that extra barrier; after all, there will be a complete schedule() between waking the task and it actually becoming current. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 16:13 ` Peter Zijlstra @ 2020-11-17 19:32 ` Valentin Schneider -1 siblings, 0 replies; 70+ messages in thread From: Valentin Schneider @ 2020-11-17 19:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On 17/11/20 16:13, Peter Zijlstra wrote: > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote: > >> >> + /* >> >> + * This field must not be in the scheduler word above due to wakelist >> >> + * queueing no longer being serialized by p->on_cpu. However: >> >> + * >> >> + * p->XXX = X; ttwu() >> >> + * schedule() if (p->on_rq && ..) // false >> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true >> >> + * deactivate_task() ttwu_queue_wakelist()) >> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y; >> >> + * >> >> + * guarantees all stores of 'current' are visible before >> >> + * ->sched_remote_wakeup gets used, so it can be in this word. >> >> + */ >> > >> > Isn't the control dep between that ttwu() p->on_rq read and >> > p->sched_remote_wakeup write "sufficient"? >> >> smp_acquire__after_ctrl_dep() that is, since we need >> ->on_rq load => 'current' bits load + store > > I don't think we need that extra barrier; after all, there will be a > complete schedule() between waking the task and it actually becoming > current. Apologies for the messy train of thought; what I was trying to say is that we have already the following, which AIUI is sufficient: * p->XXX = X; ttwu() * schedule() if (p->on_rq && ..) // false * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep(); * deactivate_task() ttwu_queue_wakelist() * p->on_rq = 0; p->sched_remote_wakeup = Y; ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 19:32 ` Valentin Schneider @ 2020-11-18 8:05 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-18 8:05 UTC (permalink / raw) To: Valentin Schneider Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote: > > On 17/11/20 16:13, Peter Zijlstra wrote: > > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote: > > > >> >> + /* > >> >> + * This field must not be in the scheduler word above due to wakelist > >> >> + * queueing no longer being serialized by p->on_cpu. However: > >> >> + * > >> >> + * p->XXX = X; ttwu() > >> >> + * schedule() if (p->on_rq && ..) // false > >> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true > >> >> + * deactivate_task() ttwu_queue_wakelist()) > >> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y; > >> >> + * > >> >> + * guarantees all stores of 'current' are visible before > >> >> + * ->sched_remote_wakeup gets used, so it can be in this word. > >> >> + */ > >> > > >> > Isn't the control dep between that ttwu() p->on_rq read and > >> > p->sched_remote_wakeup write "sufficient"? > >> > >> smp_acquire__after_ctrl_dep() that is, since we need > >> ->on_rq load => 'current' bits load + store > > > > I don't think we need that extra barrier; after all, there will be a > > complete schedule() between waking the task and it actually becoming > > current. > > Apologies for the messy train of thought; what I was trying to say is that > we have already the following, which AIUI is sufficient: > > * p->XXX = X; ttwu() > * schedule() if (p->on_rq && ..) // false > * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep(); > * deactivate_task() ttwu_queue_wakelist() > * p->on_rq = 0; p->sched_remote_wakeup = Y; > Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's not required here either ;-) The reason I had the ->on_cpu thing in there is because it shows we violate the regular ->on_cpu handoff rules, not for the acquire. The only ordering that matters on the RHS of that thing is the ->on_rq load to p->sched_remote_wakeup store ctrl dep. That, combined with the LHS, guarantees there is a strict order on the stores. Makes sense? ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-18 8:05 ` Peter Zijlstra @ 2020-11-18 9:51 ` Valentin Schneider -1 siblings, 0 replies; 70+ messages in thread From: Valentin Schneider @ 2020-11-18 9:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel On 18/11/20 08:05, Peter Zijlstra wrote: > On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote: >> >> On 17/11/20 16:13, Peter Zijlstra wrote: >> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote: >> > >> >> >> + /* >> >> >> + * This field must not be in the scheduler word above due to wakelist >> >> >> + * queueing no longer being serialized by p->on_cpu. However: >> >> >> + * >> >> >> + * p->XXX = X; ttwu() >> >> >> + * schedule() if (p->on_rq && ..) // false >> >> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true >> >> >> + * deactivate_task() ttwu_queue_wakelist()) >> >> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y; >> >> >> + * >> >> >> + * guarantees all stores of 'current' are visible before >> >> >> + * ->sched_remote_wakeup gets used, so it can be in this word. >> >> >> + */ >> >> > >> >> > Isn't the control dep between that ttwu() p->on_rq read and >> >> > p->sched_remote_wakeup write "sufficient"? >> >> >> >> smp_acquire__after_ctrl_dep() that is, since we need >> >> ->on_rq load => 'current' bits load + store >> > >> > I don't think we need that extra barrier; after all, there will be a >> > complete schedule() between waking the task and it actually becoming >> > current. >> >> Apologies for the messy train of thought; what I was trying to say is that >> we have already the following, which AIUI is sufficient: >> >> * p->XXX = X; ttwu() >> * schedule() if (p->on_rq && ..) // false >> * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep(); >> * deactivate_task() ttwu_queue_wakelist() >> * p->on_rq = 0; p->sched_remote_wakeup = Y; >> > > Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's > not required here either ;-) > > The reason I had the ->on_cpu thing in there is because it shows we > violate the regular ->on_cpu handoff rules, not for the acquire. > Gotcha > The only ordering that matters on the RHS of that thing is the ->on_rq > load to p->sched_remote_wakeup store ctrl dep. That, combined with the > LHS, guarantees there is a strict order on the stores. > > Makes sense? Yep, thanks! ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 9:29 ` Peter Zijlstra @ 2020-11-18 13:33 ` Marco Elver -1 siblings, 0 replies; 70+ messages in thread From: Marco Elver @ 2020-11-18 13:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Mel Gorman, Davidlohr Bueso, linux-arm-kernel, linux-kernel, paulmck On Tue, Nov 17, 2020 at 10:29AM +0100, Peter Zijlstra wrote: [...] > > Now the million dollar question is why KCSAN hasn't run into this. Hrmph. > > kernel/sched/Makefile:KCSAN_SANITIZE := n > > might have something to do with that, I suppose. For the record, I tried to reproduce this data race. I found a read/write race on this bitfield, but not yet that write/write race (perhaps I wasn't running the right workload). | read to 0xffff8d4e2ce39aac of 1 bytes by task 5269 on cpu 3: | __sched_setscheduler+0x4a9/0x1070 kernel/sched/core.c:5297 | sched_setattr kernel/sched/core.c:5512 [inline] | ... | | write to 0xffff8d4e2ce39aac of 1 bytes by task 5268 on cpu 1: | __schedule+0x296/0xab0 kernel/sched/core.c:4462 prev->sched_contributes_to_load = | schedule+0xd1/0x130 kernel/sched/core.c:4601 | ... | | Full report: https://paste.debian.net/hidden/07a50732/ Getting to the above race also required some effort as 1) I kept hitting other unrelated data races in the scheduler and had to silence those first to be able to make progress, and 2) only enable KCSAN for scheduler code to just ignore all other data races. Then I let syzkaller run for a few minutes. Also note our default KCSAN config is suboptimal. For serious debugging, I'd recommend the same config that rcutorture uses with the --kcsan flag, specifically: CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n, CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n to get the full picture. However, as a first step, it'd be nice to eventually remove the KCSAN_SANITIZE := n from kernel/sched/Makefile when things are less noisy (so that syzbot and default builds can start finding more serious issues, too). Thanks, -- Marco ^ permalink raw reply [flat|nested] 70+ messages in thread
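For anyone who wants to repeat Marco's experiment, the recipe implied above amounts to dropping the KCSAN_SANITIZE := n line from kernel/sched/Makefile and building with a debug configuration along these lines. This is a sketch only: CONFIG_KCSAN is the top-level switch, the other two option names are the ones Marco recommends turning off, and exact dependencies may differ between trees.

    CONFIG_KCSAN=y
    # CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY is not set
    # CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC is not set

As noted above, the resulting reports still need triage, since the scheduler triggers a number of unrelated data races that have to be silenced before the interesting ones stand out.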
* [PATCH] sched: Fix rq->nr_iowait ordering 2020-11-17 8:30 ` Peter Zijlstra @ 2020-11-17 9:38 ` Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: Peter Zijlstra @ 2020-11-17 9:38 UTC (permalink / raw) To: Mel Gorman Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel, Tejun Heo And poking at this reminded me of an older email from TJ that seems to have stagnated. --- Subject: sched: Fix rq->nr_iowait ordering From: Peter Zijlstra <peterz@infradead.org> Date: Thu, 24 Sep 2020 13:50:42 +0200 schedule() ttwu() deactivate_task(); if (p->on_rq && ...) // false atomic_dec(&task_rq(p)->nr_iowait); if (prev->in_iowait) atomic_inc(&rq->nr_iowait); Allows nr_iowait to be decremented before it gets incremented, resulting in more dodgy IO-wait numbers than usual. Note that because we can now do ttwu_queue_wakelist() before p->on_cpu==0, we lose the natural ordering and have to further delay the decrement. Fixes: Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2949,7 +2949,12 @@ ttwu_do_activate(struct rq *rq, struct t #ifdef CONFIG_SMP if (wake_flags & WF_MIGRATED) en_flags |= ENQUEUE_MIGRATED; + else #endif + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } activate_task(rq, p, en_flags); ttwu_do_wakeup(rq, p, wake_flags, rf); @@ -3336,11 +3341,6 @@ try_to_wake_up(struct task_struct *p, un if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) goto unlock; - if (p->in_iowait) { - delayacct_blkio_end(p); - atomic_dec(&task_rq(p)->nr_iowait); - } - #ifdef CONFIG_SMP /* * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be @@ -3411,6 +3411,11 @@ try_to_wake_up(struct task_struct *p, un cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU); if (task_cpu(p) != cpu) { + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } + wake_flags |= WF_MIGRATED; psi_ttwu_dequeue(p); set_task_cpu(p, cpu); ^ permalink raw reply [flat|nested] 70+ messages in thread
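The race the patch closes is simply that the decrement can be observed before the matching increment. A throwaway userspace model of that interleaving (illustrative names, not the scheduler code) shows a sampler catching the counter at a bogus, transiently negative value; exact counts vary by machine and run:

/*
 * Toy model of the nr_iowait problem: per "sleep/wake" round, one thread
 * increments and the other decrements with no ordering between them, so a
 * concurrent sampler can see the counter dip below its true value.
 *
 * Build (illustrative): cc -O2 -pthread iowait.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define ROUNDS 200000

static atomic_int nr_iowait;
static atomic_bool done;
static pthread_barrier_t round_start;

static void *schedule_side(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_barrier_wait(&round_start);
        atomic_fetch_add(&nr_iowait, 1);   /* if (prev->in_iowait) atomic_inc(...) */
    }
    atomic_store(&done, true);
    return NULL;
}

static void *ttwu_side(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_barrier_wait(&round_start);
        atomic_fetch_sub(&nr_iowait, 1);   /* atomic_dec(&task_rq(p)->nr_iowait) */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    long negative_samples = 0;

    pthread_barrier_init(&round_start, NULL, 2);
    pthread_create(&a, NULL, schedule_side, NULL);
    pthread_create(&b, NULL, ttwu_side, NULL);

    /* Sample like a /proc reader would; count bogus (negative) sightings. */
    while (!atomic_load(&done))
        if (atomic_load(&nr_iowait) < 0)
            negative_samples++;

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("samples that saw a negative iowait count: %ld (final=%d)\n",
           negative_samples, atomic_load(&nr_iowait));
    pthread_barrier_destroy(&round_start);
    return 0;
}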
* Re: [PATCH] sched: Fix rq->nr_iowait ordering 2020-11-17 9:38 ` Peter Zijlstra @ 2020-11-17 11:43 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-17 11:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel, Tejun Heo On Tue, Nov 17, 2020 at 10:38:29AM +0100, Peter Zijlstra wrote: > Subject: sched: Fix rq->nr_iowait ordering > From: Peter Zijlstra <peterz@infradead.org> > Date: Thu, 24 Sep 2020 13:50:42 +0200 > > schedule() ttwu() > deactivate_task(); if (p->on_rq && ...) // false > atomic_dec(&task_rq(p)->nr_iowait); > if (prev->in_iowait) > atomic_inc(&rq->nr_iowait); > > Allows nr_iowait to be decremented before it gets incremented, > resulting in more dodgy IO-wait numbers than usual. > > Note that because we can now do ttwu_queue_wakelist() before > p->on_cpu==0, we lose the natural ordering and have to further delay > the decrement. > > Fixes: Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") > Reported-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> s/Fixes: Fixes:/Fixes:/ Ok, there is a very minor hazard in that the same logic is now duplicated and someone might later try to "fix" that, but git blame should help. Otherwise, it makes sense: I've received more than one "bug" report complaining that a number was larger than expected even when no other problem was present, so Acked-by: Mel Gorman <mgorman@techsingularity.net> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
* [tip: sched/urgent] sched: Fix rq->nr_iowait ordering 2020-11-17 9:38 ` Peter Zijlstra (?) (?) @ 2020-11-19 9:55 ` tip-bot2 for Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: tip-bot2 for Peter Zijlstra @ 2020-11-19 9:55 UTC (permalink / raw) To: linux-tip-commits Cc: Tejun Heo, Peter Zijlstra (Intel), Mel Gorman, x86, linux-kernel The following commit has been merged into the sched/urgent branch of tip: Commit-ID: ec618b84f6e15281cc3660664d34cd0dd2f2579e Gitweb: https://git.kernel.org/tip/ec618b84f6e15281cc3660664d34cd0dd2f2579e Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 24 Sep 2020 13:50:42 +02:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Tue, 17 Nov 2020 13:15:28 +01:00 sched: Fix rq->nr_iowait ordering schedule() ttwu() deactivate_task(); if (p->on_rq && ...) // false atomic_dec(&task_rq(p)->nr_iowait); if (prev->in_iowait) atomic_inc(&rq->nr_iowait); Allows nr_iowait to be decremented before it gets incremented, resulting in more dodgy IO-wait numbers than usual. Note that because we can now do ttwu_queue_wakelist() before p->on_cpu==0, we lose the natural ordering and have to further delay the decrement. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net --- kernel/sched/core.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d2003a7..9f0ebfb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2501,7 +2501,12 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, #ifdef CONFIG_SMP if (wake_flags & WF_MIGRATED) en_flags |= ENQUEUE_MIGRATED; + else #endif + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } activate_task(rq, p, en_flags); ttwu_do_wakeup(rq, p, wake_flags, rf); @@ -2888,11 +2893,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) goto unlock; - if (p->in_iowait) { - delayacct_blkio_end(p); - atomic_dec(&task_rq(p)->nr_iowait); - } - #ifdef CONFIG_SMP /* * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be @@ -2963,6 +2963,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); if (task_cpu(p) != cpu) { + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } + wake_flags |= WF_MIGRATED; psi_ttwu_dequeue(p); set_task_cpu(p, cpu); ^ permalink raw reply related [flat|nested] 70+ messages in thread
* Re: [PATCH] sched: Fix data-race in wakeup 2020-11-17 8:30 ` Peter Zijlstra @ 2020-11-17 12:40 ` Mel Gorman -1 siblings, 0 replies; 70+ messages in thread From: Mel Gorman @ 2020-11-17 12:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Davidlohr Bueso, linux-arm-kernel, linux-kernel On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote: > > sched_psi_wake_requeue can probably stay with the other three fields > > given they are under the rq lock but sched_remote_wakeup needs to move > > out. > > I _think_ we can move the bit into the unserialized section below. > > It's a bit cheeky, but it should work I think because the only time we > actually use this bit, we're guaranteed the task isn't actually running, > so current doesn't exist. > Putting the bit there has the added advantage that the bit does not end up on its own; a bit sitting on its own would be very special in terms of how it should be treated, and adding another bit adjacent to it later would be potentially hazardous. > --- > Subject: sched: Fix data-race in wakeup > From: Peter Zijlstra <peterz@infradead.org> > Date: Tue Nov 17 09:08:41 CET 2020 > > <SNIP> > > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") > Reported-by: Mel Gorman <mgorman@techsingularity.net> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Thanks, testing completed successfully! With the suggested alternative comment above sched_remote_wakeup: Reviewed-by: Mel Gorman <mgorman@techsingularity.net> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 70+ messages in thread
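The adjacency hazard Mel mentions comes down to C bit-field allocation units: bits that share a unit can only be updated by a read-modify-write of that unit, so they all need the same serialization. The mainline fix moves the bit into an existing group that only 'current' writes; purely as an illustration of the underlying language rule (not something the patch does), a zero-width bit-field is the portable way to force a following bit into a fresh unit:

/*
 * Illustration only: a :0 bit-field closes the current allocation unit,
 * so the next bit lands in a new word and no longer shares RMWs with the
 * first one.  Printed sizes are what common LP64 ABIs produce, not a
 * guarantee of the standard.
 *
 * Build (illustrative): cc -O2 split.c
 */
#include <stdio.h>

struct packed_flags {            /* both bits share one unsigned int */
    unsigned contributes_to_load:1;
    unsigned remote_wakeup:1;
};

struct split_flags {             /* the :0 closes the first unit ...  */
    unsigned contributes_to_load:1;
    unsigned :0;
    unsigned remote_wakeup:1;    /* ... so this bit starts a new one  */
};

int main(void)
{
    printf("packed: %zu bytes, split: %zu bytes\n",
           sizeof(struct packed_flags), sizeof(struct split_flags));
    return 0;
}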
* [tip: sched/urgent] sched: Fix data-race in wakeup 2020-11-17 8:30 ` Peter Zijlstra ` (3 preceding siblings ...) (?) @ 2020-11-19 9:55 ` tip-bot2 for Peter Zijlstra -1 siblings, 0 replies; 70+ messages in thread From: tip-bot2 for Peter Zijlstra @ 2020-11-19 9:55 UTC (permalink / raw) To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the sched/urgent branch of tip: Commit-ID: f97bb5272d9e95d400d6c8643ebb146b3e3e7842 Gitweb: https://git.kernel.org/tip/f97bb5272d9e95d400d6c8643ebb146b3e3e7842 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Tue, 17 Nov 2020 09:08:41 +01:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Tue, 17 Nov 2020 13:15:27 +01:00 sched: Fix data-race in wakeup Mel reported that on some ARM64 platforms loadavg goes bananas and Will tracked it down to the following race: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data). Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load) that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0. However, due to p->XXX = X ttwu() schedule() if (p->on_rq && ...) // false smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && deactivate_task() ttwu_queue_wakelist()) p->on_rq = 0; p->sched_remote_wakeup = Y; We can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word. Note: while the observed failure was loadavg accounting gone wrong due to ttwu() clobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup; this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Mel Gorman <mgorman@techsingularity.net> Debugged-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net --- include/linux/sched.h | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index d383cf0..0e91b45 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -769,7 +769,6 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; - unsigned sched_remote_wakeup:1; #ifdef CONFIG_PSI unsigned sched_psi_wake_requeue:1; #endif @@ -779,6 +778,21 @@ struct task_struct { /* Unserialized, strictly 'current' */ + /* + * This field must not be in the scheduler word above due to wakelist + * queueing no longer being serialized by p->on_cpu. However: + * + * p->XXX = X; ttwu() + * schedule() if (p->on_rq && ..)
// false + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true + * deactivate_task() ttwu_queue_wakelist()) + * p->on_rq = 0; p->sched_remote_wakeup = Y; + * + * guarantees all stores of 'current' are visible before + * ->sched_remote_wakeup gets used, so it can be in this word. + */ + unsigned sched_remote_wakeup:1; + /* Bit to tell LSMs we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; ^ permalink raw reply related [flat|nested] 70+ messages in thread
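For reference, the original report watched /proc/loadavg decay after the workload finished. A small self-contained observer in the same spirit could look like the sketch below; it is purely illustrative (sysinfo(2) returns the 1/5/15 minute averages as fixed-point values scaled by 2^SI_LOAD_SHIFT), and a plain "cat /proc/loadavg; sleep 60" loop does the same job:

/*
 * Watch whether load decays once the workload has stopped, sampling once
 * a minute as in the original report.
 *
 * Build (illustrative): cc -O2 watch-load.c
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/sysinfo.h>

#ifndef SI_LOAD_SHIFT
#define SI_LOAD_SHIFT 16   /* kernel fixed-point scale for loads[] */
#endif

int main(void)
{
    const double scale = 1.0 / (1 << SI_LOAD_SHIFT);

    for (int i = 0; i < 60; i++) {
        struct sysinfo si;

        if (sysinfo(&si) != 0) {
            perror("sysinfo");
            return 1;
        }
        printf("load: %.2f %.2f %.2f procs: %u\n",
               si.loads[0] * scale, si.loads[1] * scale,
               si.loads[2] * scale, (unsigned)si.procs);
        sleep(60);
    }
    return 0;
}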
end of thread, other threads:[~2020-11-19 9:55 UTC | newest] Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-11-16 9:10 Loadavg accounting error on arm64 Mel Gorman 2020-11-16 9:10 ` Mel Gorman 2020-11-16 11:49 ` Mel Gorman 2020-11-16 11:49 ` Mel Gorman 2020-11-16 12:00 ` Mel Gorman 2020-11-16 12:00 ` Mel Gorman 2020-11-16 12:53 ` Peter Zijlstra 2020-11-16 12:53 ` Peter Zijlstra 2020-11-16 12:58 ` Peter Zijlstra 2020-11-16 12:58 ` Peter Zijlstra 2020-11-16 15:29 ` Mel Gorman 2020-11-16 15:29 ` Mel Gorman 2020-11-16 16:42 ` Mel Gorman 2020-11-16 16:42 ` Mel Gorman 2020-11-16 16:49 ` Peter Zijlstra 2020-11-16 16:49 ` Peter Zijlstra 2020-11-16 17:24 ` Mel Gorman 2020-11-16 17:24 ` Mel Gorman 2020-11-16 17:41 ` Will Deacon 2020-11-16 17:41 ` Will Deacon 2020-11-16 12:46 ` Peter Zijlstra 2020-11-16 12:46 ` Peter Zijlstra 2020-11-16 12:58 ` Mel Gorman 2020-11-16 12:58 ` Mel Gorman 2020-11-16 13:11 ` Will Deacon 2020-11-16 13:11 ` Will Deacon 2020-11-16 13:37 ` Mel Gorman 2020-11-16 13:37 ` Mel Gorman 2020-11-16 14:20 ` Peter Zijlstra 2020-11-16 14:20 ` Peter Zijlstra 2020-11-16 15:52 ` Mel Gorman 2020-11-16 15:52 ` Mel Gorman 2020-11-16 16:54 ` Peter Zijlstra 2020-11-16 16:54 ` Peter Zijlstra 2020-11-16 17:16 ` Mel Gorman 2020-11-16 17:16 ` Mel Gorman 2020-11-16 19:31 ` Mel Gorman 2020-11-16 19:31 ` Mel Gorman 2020-11-17 8:30 ` [PATCH] sched: Fix data-race in wakeup Peter Zijlstra 2020-11-17 8:30 ` Peter Zijlstra 2020-11-17 9:15 ` Will Deacon 2020-11-17 9:15 ` Will Deacon 2020-11-17 9:29 ` Peter Zijlstra 2020-11-17 9:29 ` Peter Zijlstra 2020-11-17 9:46 ` Peter Zijlstra 2020-11-17 9:46 ` Peter Zijlstra 2020-11-17 10:36 ` Will Deacon 2020-11-17 10:36 ` Will Deacon 2020-11-17 12:52 ` Valentin Schneider 2020-11-17 12:52 ` Valentin Schneider 2020-11-17 15:37 ` Valentin Schneider 2020-11-17 15:37 ` Valentin Schneider 2020-11-17 16:13 ` Peter Zijlstra 2020-11-17 16:13 ` Peter Zijlstra 2020-11-17 19:32 ` Valentin Schneider 2020-11-17 19:32 ` Valentin Schneider 2020-11-18 8:05 ` Peter Zijlstra 2020-11-18 8:05 ` Peter Zijlstra 2020-11-18 9:51 ` Valentin Schneider 2020-11-18 9:51 ` Valentin Schneider 2020-11-18 13:33 ` Marco Elver 2020-11-18 13:33 ` Marco Elver 2020-11-17 9:38 ` [PATCH] sched: Fix rq->nr_iowait ordering Peter Zijlstra 2020-11-17 9:38 ` Peter Zijlstra 2020-11-17 11:43 ` Mel Gorman 2020-11-17 11:43 ` Mel Gorman 2020-11-19 9:55 ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra 2020-11-17 12:40 ` [PATCH] sched: Fix data-race in wakeup Mel Gorman 2020-11-17 12:40 ` Mel Gorman 2020-11-19 9:55 ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra