* NULL pointer dereference in pick_next_task_fair
@ 2019-10-28 17:46 Quentin Perret
  2019-10-28 21:49 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Quentin Perret @ 2019-10-28 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, aaron.lwe, valentin.schneider, mingo, pauld, jdesfossez,
	naravamudan, vincent.guittot, dietmar.eggemann, juri.lelli,
	rostedt, bsegall, mgorman, kernel-team, john.stultz

Hi all,

The following issue has been observed with 5.4-rc*-based android
kernels, both on an x86 VM and on native arm64:

---8<---
[  222.515957] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  222.518871] #PF: supervisor read access in kernel mode
[  222.518871] #PF: error_code(0x0000) - not-present page
[  222.518871] PGD 800000007cdf7067 P4D 800000007cdf7067 PUD 7c868067 PMD 0 
[  222.518871] Oops: 0000 [#1] PREEMPT SMP PTI
[  222.518871] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.0-rc4-mainline-g6a9b34166c05 #1
[  222.518871] Hardware name: ChromiumOS crosvm, BIOS 0 
[  222.518871] RIP: 0010:set_next_entity+0xf/0xf0
[  222.518871] Code: 42 0b e5 85 31 c0 e8 90 1a fc ff 0f 0b eb 9b 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 53 48 89 f3 49 89 fe <83> 7e 40 00 74 3d 4c 89 f7 48 89 de e8 80 f9 ff ff 4c 8d 7b 18 4d
[  222.518871] RSP: 0018:ffffb95040073de8 EFLAGS: 00010046
[  222.518871] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff85ec0c30
[  222.518871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ec44c728500
[  222.518871] RBP: ffffb95040073e00 R08: 0000000000000000 R09: 0000000000000000
[  222.518871] R10: 0000000000000000 R11: ffffffff84ef0a30 R12: 0000000000000000
[  222.518871] R13: 0000000000000000 R14: ffff9ec44c728500 R15: ffffb95040073e60
[  222.518871] FS:  0000000000000000(0000) GS:ffff9ec44c700000(0000) knlGS:0000000000000000
[  222.518871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.518871] CR2: 0000000000000040 CR3: 000000007c87a002 CR4: 00000000003606e0
[  222.518871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.518871] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.518871] Call Trace:
[  222.518871]  pick_next_task_fair+0xe8/0x410
[  222.518871]  ? rcu_note_context_switch+0x3ef/0x520
[  222.518871]  ? finish_task_switch+0x10d/0x250
[  222.518871]  __schedule+0x1f8/0x6e0
[  222.518871]  schedule_idle+0x27/0x40
[  222.518871]  do_idle+0x21c/0x240
[  222.518871]  cpu_startup_entry+0x25/0x30
[  222.518871]  start_secondary+0x18d/0x1b0
[  222.518871]  secondary_startup_64+0xa4/0xb0
[  222.518871] Modules linked in: vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock_diag vsock can_raw can snd_intel8x0 snd_ac97_codec ac97_bus virtio_input virtio_pci virtio_mmio virtio_crypto mmc_block mmc_core dummy_cpufreq rtc_test dummy_hcd can_dev vcan virtio_net net_failover failover virt_wifi virtio_blk virtio_gpu ttm virtio_rng virtio_ring virtio 8250_of crypto_engine
[  222.518871] CR2: 0000000000000040
[  222.518871] ---[ end trace ae741809965b22f2 ]---
[  222.518871] RIP: 0010:set_next_entity+0xf/0xf0
[  222.518871] Code: 42 0b e5 85 31 c0 e8 90 1a fc ff 0f 0b eb 9b 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 53 48 89 f3 49 89 fe <83> 7e 40 00 74 3d 4c 89 f7 48 89 de e8 80 f9 ff ff 4c 8d 7b 18 4d
[  222.518871] RSP: 0018:ffffb95040073de8 EFLAGS: 00010046
[  222.518871] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff85ec0c30
[  222.518871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ec44c728500
[  222.518871] RBP: ffffb95040073e00 R08: 0000000000000000 R09: 0000000000000000
[  222.518871] R10: 0000000000000000 R11: ffffffff84ef0a30 R12: 0000000000000000
[  222.518871] R13: 0000000000000000 R14: ffff9ec44c728500 R15: ffffb95040073e60
[  222.518871] FS:  0000000000000000(0000) GS:ffff9ec44c700000(0000) knlGS:0000000000000000
[  222.518871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.518871] CR2: 0000000000000040 CR3: 000000007c87a002 CR4: 00000000003606e0
[  222.518871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.518871] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.518871] Kernel panic - not syncing: Fatal exception
[  222.518871] Kernel Offset: 0x3e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
--->8---

The issue is very transient and relatively hard to reproduce.

After digging a bit, the offending commit seems to be:

    67692435c411 ("sched: Rework pick_next_task() slow-path")

By 'offending' I mean that reverting it makes the issue go away. The
issue comes from the fact that pick_next_entity() returns a NULL se in
the 'simple' path of pick_next_task_fair(), which causes obvious
problems in the subsequent call to set_next_entity().
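
For reference, the 'simple' path boils down to roughly the following
(paraphrased from memory of the 5.4-era pick_next_task_fair(), so treat it
as a sketch rather than a verbatim quote):

	if (prev)
		put_prev_task(rq, prev);

	do {
		se = pick_next_entity(cfs_rq, NULL);
		set_next_entity(cfs_rq, se);	/* blows up here when se == NULL */
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);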

I'll dig more, but if anybody understands the issue in the meantime feel
free to send me a patch to try out :)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-28 17:46 NULL pointer dereference in pick_next_task_fair Quentin Perret
@ 2019-10-28 21:49 ` Peter Zijlstra
  2019-10-29 11:34 ` Peter Zijlstra
  2019-11-06 12:05 ` Peter Zijlstra
  2 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-10-28 21:49 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:

> The issue is very transient and relatively hard to reproduce.
> 
> After digging a bit, the offending commit seems to be:
> 
>     67692435c411 ("sched: Rework pick_next_task() slow-path")
> 
> By 'offending' I mean that reverting it makes the issue go away. The
> issue comes from the fact that pick_next_entity() returns a NULL se in
> the 'simple' path of pick_next_task_fair(), which causes obvious
> problems in the subsequent call to set_next_entity().
> 
> I'll dig more, but if anybody understands the issue in the meantime feel
> free to send me a patch to try out :)

The only way for pick_next_entity() to return NULL is if the tree is
empty and !cfs_rq->curr. But in that case, cfs_rq->nr_running _should_
be 0 and/or its related se should not be enqueued in the parent cfs_rq.
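
Paraphrasing the start of pick_next_entity() from memory (a sketch of where
that NULL can come from, not a verbatim quote):

	struct sched_entity *left = __pick_first_entity(cfs_rq); /* leftmost in the rbtree */

	if (!left || (curr && entity_before(curr, left)))
		left = curr;

	se = left;	/* NULL only when the tree is empty and no curr is passed in */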

Now for the root cfs_rq we check nr_running and jump to the idle
path; however, if this occurs in the middle of the hierarchy, we're up a
creek without a paddle. This is something that really should not
happen (because an empty cfs_rq should not be enqueued).

Also, if we take the simple path, as you say, then we'll have done a
put_prev_task(), regardless of how we got there, so we know cfs_rq->curr
must be NULL. Which, with the above, means the tree really is empty.

And as stated above, when the tree is empty and !cfs_rq->curr, the
cfs_rq's se should not be enqueued in the parent cfs_rq so we should not
be getting here.

Clearly something is buggered with the cgroup state. What is your cgroup
setup? Are you using cpu-bandwidth?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-28 17:46 NULL pointer dereference in pick_next_task_fair Quentin Perret
  2019-10-28 21:49 ` Peter Zijlstra
@ 2019-10-29 11:34 ` Peter Zijlstra
  2019-10-29 11:50   ` Quentin Perret
  2019-11-06 12:05 ` Peter Zijlstra
  2 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-10-29 11:34 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> The issue is very transient and relatively hard to reproduce.
> 
> After digging a bit, the offending commit seems to be:
> 
>     67692435c411 ("sched: Rework pick_next_task() slow-path")
> 
> By 'offending' I mean that reverting it makes the issue go away. The
> issue comes from the fact that pick_next_entity() returns a NULL se in
> the 'simple' path of pick_next_task_fair(), which causes obvious
> problems in the subsequent call to set_next_entity().
> 
> I'll dig more, but if anybody understands the issue in the meantime feel
> free to send me a patch to try out :)

Can you please see if this makes any difference?

---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 2 +-
 kernel/sched/idle.c | 3 +--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..abd2d4f80381 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3922,8 +3922,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			goto restart;

 		/* Assumes fair_sched_class->next == idle_sched_class */
-		if (unlikely(!p))
-			p = idle_sched_class.pick_next_task(rq, prev, rf);
+		if (unlikely(!p)) {
+			prev->sched_class->put_prev_task(rq, prev, rf);
+			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
+		}

 		return p;
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e2374f..2aad94bb7165 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6820,7 +6820,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 simple:
 #endif
 	if (prev)
-		put_prev_task(rq, prev);
+		prev->sched_class->put_prev_task(rq, prev, rf);

 	do {
 		se = pick_next_entity(cfs_rq, NULL);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8dad5aa600ea..e8dfc84f375a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -390,8 +390,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *next = rq->idle;

-	if (prev)
-		put_prev_task(rq, prev);
+	WARN_ON_ONCE(prev || rf);

 	set_next_task_idle(rq, next);



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-29 11:34 ` Peter Zijlstra
@ 2019-10-29 11:50   ` Quentin Perret
  2019-10-30 22:50     ` Ram Muthiah
  0 siblings, 1 reply; 38+ messages in thread
From: Quentin Perret @ 2019-10-29 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On Tuesday 29 Oct 2019 at 12:34:11 (+0100), Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> > The issue is very transient and relatively hard to reproduce.
> > 
> > After digging a bit, the offending commit seems to be:
> > 
> >     67692435c411 ("sched: Rework pick_next_task() slow-path")
> > 
> > By 'offending' I mean that reverting it makes the issue go away. The
> > issue comes from the fact that pick_next_entity() returns a NULL se in
> > the 'simple' path of pick_next_task_fair(), which causes obvious
> > problems in the subsequent call to set_next_entity().
> > 
> > I'll dig more, but if anybody understands the issue in the meantime feel
> > free to send me a patch to try out :)
> 
> Can you please see if this makes any difference?
> 
> ---
>  kernel/sched/core.c | 6 ++++--
>  kernel/sched/fair.c | 2 +-
>  kernel/sched/idle.c | 3 +--
>  3 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7880f4f64d0e..abd2d4f80381 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3922,8 +3922,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  			goto restart;
> 
>  		/* Assumes fair_sched_class->next == idle_sched_class */
> -		if (unlikely(!p))
> -			p = idle_sched_class.pick_next_task(rq, prev, rf);
> +		if (unlikely(!p)) {
> +			prev->sched_class->put_prev_task(rq, prev, rf);
> +			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
> +		}
> 
>  		return p;
>  	}
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e2374f..2aad94bb7165 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6820,7 +6820,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  simple:
>  #endif
>  	if (prev)
> -		put_prev_task(rq, prev);
> +		prev->sched_class->put_prev_task(rq, prev, rf);
> 
>  	do {
>  		se = pick_next_entity(cfs_rq, NULL);
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 8dad5aa600ea..e8dfc84f375a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -390,8 +390,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  {
>  	struct task_struct *next = rq->idle;
> 
> -	if (prev)
> -		put_prev_task(rq, prev);
> +	WARN_ON_ONCE(prev || rf);
> 
>  	set_next_task_idle(rq, next);
> 
> 

Unfortunately the issue went into hiding this morning and I'm struggling to
reproduce it (this is turning into a borderline nightmare now TBH).

I'll try the patch once my reproducer is fixed :/

Thank you very much for the help,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-29 11:50   ` Quentin Perret
@ 2019-10-30 22:50     ` Ram Muthiah
  2019-10-31  1:33       ` Valentin Schneider
  2019-10-31 22:15       ` Valentin Schneider
  0 siblings, 2 replies; 38+ messages in thread
From: Ram Muthiah @ 2019-10-30 22:50 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Tue, Oct 29, 2019 at 4:50 AM 'Quentin Perret' via kernel-team
<kernel-team@android.com> wrote:
>
> On Tuesday 29 Oct 2019 at 12:34:11 (+0100), Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> > > The issue is very transient and relatively hard to reproduce.
> > >
> > > After digging a bit, the offending commit seems to be:
> > >
> > >     67692435c411 ("sched: Rework pick_next_task() slow-path")
> > >
> > > By 'offending' I mean that reverting it makes the issue go away. The
> > > issue comes from the fact that pick_next_entity() returns a NULL se in
> > > the 'simple' path of pick_next_task_fair(), which causes obvious
> > > problems in the subsequent call to set_next_entity().
> > >
> > > I'll dig more, but if anybody understands the issue in the meantime feel
> > > free to send me a patch to try out :)
> >
> > Can you please see if this makes any difference?
> >
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  kernel/sched/fair.c | 2 +-
> >  kernel/sched/idle.c | 3 +--
> >  3 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7880f4f64d0e..abd2d4f80381 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3922,8 +3922,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >                       goto restart;
> >
> >               /* Assumes fair_sched_class->next == idle_sched_class */
> > -             if (unlikely(!p))
> > -                     p = idle_sched_class.pick_next_task(rq, prev, rf);
> > +             if (unlikely(!p)) {
> > +                     prev->sched_class->put_prev_task(rq, prev, rf);
> > +                     p = idle_sched_class.pick_next_task(rq, NULL, NULL);
> > +             }
> >
> >               return p;
> >       }
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 83ab35e2374f..2aad94bb7165 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6820,7 +6820,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  simple:
> >  #endif
> >       if (prev)
> > -             put_prev_task(rq, prev);
> > +             prev->sched_class->put_prev_task(rq, prev, rf);
> >
> >       do {
> >               se = pick_next_entity(cfs_rq, NULL);
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index 8dad5aa600ea..e8dfc84f375a 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -390,8 +390,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >  {
> >       struct task_struct *next = rq->idle;
> >
> > -     if (prev)
> > -             put_prev_task(rq, prev);
> > +     WARN_ON_ONCE(prev || rf);
> >
> >       set_next_task_idle(rq, next);
> >
> >
>
> Unfortunately the issue went into hiding this morning and I'm struggling to
> reproduce it (this is turning into a borderline nightmare now TBH).
>
> I'll try the patch once my reproducer is fixed :/
>
> Thank you very much for the help,
> Quentin
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

Quentin and I were able to create a setup which reproduces the issue.

Given this, I tried Peter's proposed fix and was still able to reproduce the
issue unfortunately. Current patch is located here -
https://android-review.googlesource.com/c/kernel/common/+/1153487

Our mitigation for this issue on the android-mainline branch has been to
revert 67692435c411 ("sched: Rework pick_next_task() slow-path").
https://android-review.googlesource.com/c/kernel/common/+/1152564

I'll spend some time detailing repro steps next. I should be able to
provide an update on those details early next week.

We appreciate the help so far.
Thanks,
Ram

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-30 22:50     ` Ram Muthiah
@ 2019-10-31  1:33       ` Valentin Schneider
  2019-10-31 10:54         ` Valentin Schneider
  2019-10-31 22:15       ` Valentin Schneider
  1 sibling, 1 reply; 38+ messages in thread
From: Valentin Schneider @ 2019-10-31  1:33 UTC (permalink / raw)
  To: Ram Muthiah, Quentin Perret
  Cc: Peter Zijlstra, linux-kernel, aaron.lwe, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On 30/10/2019 23:50, Ram Muthiah wrote:
> 
> Quentin and I were able to create a setup which reproduces the issue.
> 
> Given this, I tried Peter's proposed fix and was still able to reproduce the
> issue unfortunately. Current patch is located here -
> https://android-review.googlesource.com/c/kernel/common/+/1153487
> 
> Our mitigation for this issue on the android-mainline branch has been to
> revert 67692435c411 ("sched: Rework pick_next_task() slow-path").
> https://android-review.googlesource.com/c/kernel/common/+/1152564
> 
> I'll spend some time detailing repro steps next. I should be able to
> provide an update on those details early next week.
> 
> We appreciate the help so far.
> Thanks,
> Ram
> 

The splat Quentin posted happens at secondary startup; is that always
the case? I'm trying to think of what could make rq.cfs_rq.nr_running
non-zero at secondary bringup time. It might not explain the NULL pointer, but
I'm still curious as to how we can get something there this early, as it could
point towards something. Be warned, I might bring up stuff I know nothing
about, but this looks "fun" so I can't help myself :)


sched domains are only setup after smp_init() in sched_init_smp(), thus after
we've booted all secondaries. This should take load balance out of the
picture.

For wakeups, select_task_rq_fair() can only ever pick prev_cpu or this_cpu
since there are no sched domains. I don't see many candidates that could
wakeup on a secondary (thus have non-zero this_cpu) this early there. Perhaps
the smpboot threads, but from a quick look they are first created *after*
sched_init_smp(), so they couldn't exist during (boot-time) secondary bringup.
Seems to be the same for IRQ threads (and they're setscheduler'd to FIFO
anyway).

So now I'm even more curious as to what CFS task could be enqueued on a
secondary CPU rq before sched_init_smp(). Have you been sending stuff to space
without any shielding lately?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-31  1:33       ` Valentin Schneider
@ 2019-10-31 10:54         ` Valentin Schneider
  2019-10-31 14:24           ` Valentin Schneider
  0 siblings, 1 reply; 38+ messages in thread
From: Valentin Schneider @ 2019-10-31 10:54 UTC (permalink / raw)
  To: Ram Muthiah, Quentin Perret
  Cc: Peter Zijlstra, linux-kernel, aaron.lwe, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On 31/10/2019 02:33, Valentin Schneider wrote:
> For wakeups, select_task_rq_fair() can only ever pick prev_cpu or this_cpu
> since there are no sched domains. I don't see many candidates that could
> wakeup on a secondary (thus have non-zero this_cpu) this early there. Perhaps
> the smpboot threads, but from a quick look they are first created *after*
> sched_init_smp(), so they couldn't exist during (boot-time) secondary bringup.

Scratch that, I can't read. The registration is done in early initcalls (and
we have the unpark in the secondary bringup anyway), so when we spool up the
secondaries we'll get wakeups for the smpboot threads. AFAIR only softirqd
and cpuhp are CFS, but while this satisfied some of my curiosity this doesn't
seem super helpful on its own.

> Seems to be the same for IRQ threads (and they're setscheduler'd to FIFO
> anyway).
> 
> So now I'm even more curious as to what CFS task could be enqueued on a
> secondary CPU rq before sched_init_smp(). Have you been sending stuff to space
> without any shielding lately?
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-31 10:54         ` Valentin Schneider
@ 2019-10-31 14:24           ` Valentin Schneider
  0 siblings, 0 replies; 38+ messages in thread
From: Valentin Schneider @ 2019-10-31 14:24 UTC (permalink / raw)
  To: Ram Muthiah, Quentin Perret
  Cc: Peter Zijlstra, linux-kernel, aaron.lwe, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On 31/10/2019 11:54, Valentin Schneider wrote:
> On 31/10/2019 02:33, Valentin Schneider wrote:
>> For wakeups, select_task_rq_fair() can only ever pick prev_cpu or this_cpu
>> since there are no sched domains. I don't see many candidates that could
>> wakeup on a secondary (thus have non-zero this_cpu) this early there. 

And another fail here: this doesn't have to be at secondary bringup; it's
just that the bringup will always be in the idle task's call stack (if the
timestamps weren't enough of a clue already). I guess I'll stop writing
nonsense and keep staring in silence.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-30 22:50     ` Ram Muthiah
  2019-10-31  1:33       ` Valentin Schneider
@ 2019-10-31 22:15       ` Valentin Schneider
  1 sibling, 0 replies; 38+ messages in thread
From: Valentin Schneider @ 2019-10-31 22:15 UTC (permalink / raw)
  To: Ram Muthiah, Quentin Perret
  Cc: Peter Zijlstra, linux-kernel, aaron.lwe, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On 30/10/2019 23:50, Ram Muthiah wrote:
> Quentin and I were able to create a setup which reproduces the issue.
> 
> Given this, I tried Peter's proposed fix and was still able to reproduce the
> issue unfortunately. Current patch is located here -
> https://android-review.googlesource.com/c/kernel/common/+/1153487
> 
> Our mitigation for this issue on the android-mainline branch has been to
> revert 67692435c411 ("sched: Rework pick_next_task() slow-path").
> https://android-review.googlesource.com/c/kernel/common/+/1152564
> 

Still no cigar, but one thing to note is that we got a similar splat on a
Juno just yesterday; here it is fed through decode_stacktrace.sh:

[   22.930829] Mem abort info:
[   22.935904]   ESR = 0x96000006
[   22.938924]   EC = 0x25: DABT (current EL), IL = 32 bits
[   22.944178]   SET = 0, FnV = 0
[   22.947197]   EA = 0, S1PTW = 0
[   22.950300] Data abort info:
[   22.953145]   ISV = 0, ISS = 0x00000006
[   22.956937]   CM = 0, WnR = 0
[   22.959870] user pgtable: 4k pages, 48-bit VAs, pgdp=00000009f3905000
[   22.966243] [0000000000000040] pgd=00000009efa6e003, pud=00000009f41c3003, pmd=0000000000000000
[   22.974858] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[   22.980369] Modules linked in: tda998x drm_kms_helper drm crct10dif_ce ip_tables x_tables ipv6 nf_defrag_ipv6
[   22.990200] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.4.0-rc2-00001-gaa57157be69f #1
[   22.998036] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct 19 2018
[   23.008710] pstate: 60000085 (nZCv daIf -PAN -UAO)
[   23.013457] pc : set_next_entity (kernel/sched/fair.c:4156)
[   23.017511] lr : pick_next_task_fair (kernel/sched/fair.c:6829 (discriminator 1))
[   23.021991] sp : ffff800011a13e10
[   23.025267] x29: ffff800011a13e10 x28: 0000000000000000
[   23.030525] x27: ffff800011a13f10 x26: ffff800010c205ec
[   23.035782] x25: ffff000975cea108 x24: ffff800011789000
[   23.041038] x23: ffff8000113cf000 x22: ffff000975ce9b80
[   23.046294] x21: ffff00097ef78d40 x20: ffff000974e21a00
[   23.051550] x19: 0000000000000000 x18: 0000000000000000
[   23.056806] x17: 0000000000000000 x16: 0000000000000000
[   23.062062] x15: 0000000000000000 x14: 00000000000001ad
[   23.067317] x13: 0000000000000001 x12: 071c71c71c71c71c
[   23.072573] x11: 0000000000000500 x10: ffff00097ef77dc8
[   23.077829] x9 : 0000000000000087 x8 : ffff00097ef77de8
[   23.083085] x7 : 0000000000000003 x6 : 0000000000000000
[   23.088340] x5 : 0000000000000000 x4 : 0000000000000000
[   23.093596] x3 : 000000054f6e7800 x2 : 0000000000000000
[   23.098851] x1 : 0000000000000000 x0 : ffff000974e21a00
[   23.104107] Call trace:
[   23.106527] set_next_entity (kernel/sched/fair.c:4156)
[   23.110236] pick_next_task_fair (kernel/sched/fair.c:6829 (discriminator 1))
[   23.114377] __schedule (kernel/sched/core.c:3920 kernel/sched/core.c:4039)
[   23.117828] schedule_idle (kernel/sched/core.c:4165 (discriminator 1))
[   23.121365] do_idle (./arch/arm64/include/asm/current.h:19 ./arch/arm64/include/asm/preempt.h:31 kernel/sched/idle.c:275)
[   23.124557] cpu_startup_entry (kernel/sched/idle.c:355 (discriminator 1))
[   23.128439] secondary_start_kernel (arch/arm64/kernel/smp.c:262)

The faulty line is the very first se dereference in set_next_entity().
As Peter pointed out on IRC, this happens in the 'simple' path (prev is the
idle task).
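
For reference, the top of set_next_entity() as of ~v5.4 (paraphrased, may not
be line-accurate); the 0x40 fault address in both splats looks consistent with
the se->on_rq load off a NULL se:

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) {		/* <- first dereference, faults when se == NULL */
		update_stats_wait_end(cfs_rq, se);
		__dequeue_entity(cfs_rq, se);
		update_load_avg(cfs_rq, se, UPDATE_TG);
	}
	/* ... rest of the function elided ... */
}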

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-10-28 17:46 NULL pointer dereference in pick_next_task_fair Quentin Perret
  2019-10-28 21:49 ` Peter Zijlstra
  2019-10-29 11:34 ` Peter Zijlstra
@ 2019-11-06 12:05 ` Peter Zijlstra
  2019-11-06 13:08   ` Peter Zijlstra
  2019-11-06 15:51   ` Kirill Tkhai
  2 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-06 12:05 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> 
> After digging a bit, the offending commit seems to be:
> 
>     67692435c411 ("sched: Rework pick_next_task() slow-path")
> 
> By 'offending' I mean that reverting it makes the issue go away. The
> issue comes from the fact that pick_next_entity() returns a NULL se in
> the 'simple' path of pick_next_task_fair(), which causes obvious
> problems in the subsequent call to set_next_entity().
> 
> I'll dig more, but if anybody understands the issue in the meantime feel
> free to send me a patch to try out :)

So for all those who didn't follow along on IRC, the below seems to cure
things.

The only thing I'm now considering is if we shouldn't be setting
->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
cases.

---
Subject: sched: Fix pick_next_task() vs 'change' pattern race
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 4 22:18:14 CET 2019

Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertently introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

The order in pick_next_task() is mandated by the fact that RT/DL
put_prev_task() can pull other RT tasks, in which case we should not
call newidle_balance() since we'll not be going idle. Similarly, we
cannot put newidle_balance() in put_prev_task_fair() because it should
be called every time we'll end up selecting the idle task.

Given that we're stuck with this order, the only solution is fixing
the 'change' pattern. The simplest fix seems to be to 'abuse'
p->on_cpu to carry more state. Adding more state to p->on_rq is
possible but is far more invasive and also ends up duplicating much of
the state we already carry in p->on_cpu.

Introduce task_on_rq_curr() to indicate if
sched_class::set_next_task() has been called -- and we thus need to
call put_prev_task().

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c  |   22 +++++++++++++++-------
 kernel/sched/sched.h |   16 ++++++++++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1595,7 +1595,7 @@ void do_set_cpus_allowed(struct task_str
 	lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued) {
 		/*
@@ -3934,8 +3934,16 @@ pick_next_task(struct rq *rq, struct tas
 	 * can PULL higher prio tasks when we lower the RQ 'priority'.
 	 */
 	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
+#ifdef CONFIG_SMP
+	if (!rq->nr_running) {
+		/*
+		 * Make sure task_on_rq_curr() fails, such that we don't do
+		 * put_prev_task() + set_next_task() on this task again.
+		 */
+		prev->on_cpu = 2;
 		newidle_balance(rq, rf);
+	}
+#endif
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -4422,7 +4430,7 @@ void rt_mutex_setprio(struct task_struct
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -4509,7 +4517,7 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -4957,7 +4965,7 @@ static int __sched_setscheduler(struct t
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -6141,7 +6149,7 @@ void sched_setnuma(struct task_struct *p
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -7031,7 +7039,7 @@ void sched_move_task(struct task_struct
 	rq = task_rq_lock(tsk, &rf);
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_on_rq_curr(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1628,6 +1628,22 @@ static inline int task_running(struct rq
 #endif
 }
 
+/*
+ * If true, @p has had sched_class::set_next_task() called on it.
+ * See pick_next_task().
+ */
+static inline bool task_on_rq_curr(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SMP
+	return rq->curr == p && p->on_cpu == 1;
+#else
+	return rq->curr == p;
+#endif
+}
+
+/*
+ * If true, @p has had sched_class::enqueue_task() called on it.
+ */
 static inline int task_on_rq_queued(struct task_struct *p)
 {
 	return p->on_rq == TASK_ON_RQ_QUEUED;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 12:05 ` Peter Zijlstra
@ 2019-11-06 13:08   ` Peter Zijlstra
  2019-11-06 15:04     ` Qais Yousef
  2019-11-06 15:51   ` Kirill Tkhai
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-06 13:08 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On Wed, Nov 06, 2019 at 01:05:25PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> > 
> > After digging a bit, the offending commit seems to be:
> > 
> >     67692435c411 ("sched: Rework pick_next_task() slow-path")
> > 
> > By 'offending' I mean that reverting it makes the issue go away. The
> > issue comes from the fact that pick_next_entity() returns a NULL se in
> > the 'simple' path of pick_next_task_fair(), which causes obvious
> > problems in the subsequent call to set_next_entity().
> > 
> > I'll dig more, but if anybody understands the issue in the meantime feel
> > free to send me a patch to try out :)
> 
> So for all those who didn't follow along on IRC, the below seems to cure
> things.
> 
> The only thing I'm now considering is if we shouldn't be setting
> ->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
> cases.

So I think it all works, but that's more by accident than anything else.
I'll move the ->on_cpu=2 assignment earlier. That clearly avoids calling
put_prev_task() while we're in put_prev_task().

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 13:08   ` Peter Zijlstra
@ 2019-11-06 15:04     ` Qais Yousef
  2019-11-06 16:57       ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Qais Yousef @ 2019-11-06 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On 11/06/19 14:08, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 01:05:25PM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> > > 
> > > After digging a bit, the offending commit seems to be:
> > > 
> > >     67692435c411 ("sched: Rework pick_next_task() slow-path")
> > > 
> > > By 'offending' I mean that reverting it makes the issue go away. The
> > > issue comes from the fact that pick_next_entity() returns a NULL se in
> > > the 'simple' path of pick_next_task_fair(), which causes obvious
> > > problems in the subsequent call to set_next_entity().
> > > 
> > > I'll dig more, but if anybody understands the issue in the meantime feel
> > > free to send me a patch to try out :)
> > 
> > So for all those who didn't follow along on IRC, the below seems to cure
> > things.
> > 
> > The only thing I'm now considering is if we shouldn't be setting
> > ->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
> > cases.
> 
> So I think it all works, but that's more by accident than anything else.
> I'll move the ->on_cpu=2 assignment earlier. That clearly avoids calling
> put_prev_task() while we're in put_prev_task().

Did you mean avoids calling *set_next_task()* while we're in put_prev_task()?

So what you're saying is that put_prev_task_{rt,dl}() could drop the rq_lock()
too and the race could happen while we're inside these functions, correct? Or
is it a different reason?

By the way, do all reads/writes to ->on_cpu happen while a lock is held? I.e., we
don't need to use any smp read/write barriers?

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 12:05 ` Peter Zijlstra
  2019-11-06 13:08   ` Peter Zijlstra
@ 2019-11-06 15:51   ` Kirill Tkhai
  2019-11-06 16:54     ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Kirill Tkhai @ 2019-11-06 15:51 UTC (permalink / raw)
  To: Peter Zijlstra, Quentin Perret
  Cc: linux-kernel, aaron.lwe, valentin.schneider, mingo, pauld,
	jdesfossez, naravamudan, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman, kernel-team, john.stultz

On 06.11.2019 15:05, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
>>
>> After digging a bit, the offending commit seems to be:
>>
>>     67692435c411 ("sched: Rework pick_next_task() slow-path")
>>
>> By 'offending' I mean that reverting it makes the issue go away. The
>> issue comes from the fact that pick_next_entity() returns a NULL se in
>> the 'simple' path of pick_next_task_fair(), which causes obvious
>> problems in the subsequent call to set_next_entity().
>>
>> I'll dig more, but if anybody understands the issue in the meantime feel
>> free to send me a patch to try out :)
> 
> So for all those who didn't follow along on IRC, the below seems to cure
> things.
> 
> The only thing I'm now considering is if we shouldn't be setting
> ->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
> cases.
> 
> ---
> Subject: sched: Fix pick_next_task() vs 'change' pattern race
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Nov 4 22:18:14 CET 2019
> 
> Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
> inadvertently introduced a race because it changed a previously
> unexplored dependency between dropping the rq->lock and
> sched_class::put_prev_task().
> 
> The comments about dropping rq->lock, in for example
> newidle_balance(), only mentions the task being current and ->on_cpu
> being set. But when we look at the 'change' pattern (in for example
> sched_setnuma()):
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 
> It becomes obvious that if we do this after put_prev_task() has
> already been called on @p, things go sideways. This is exactly what
> the commit in question allows to happen when it does:
> 
> 	prev->sched_class->put_prev_task(rq, prev, rf);
> 	if (!rq->nr_running)
> 		newidle_balance(rq, rf);
> 
> The newidle_balance() call will drop rq->lock after we've called
> put_prev_task() and that allows the above 'change' pattern to
> interleave and mess up the state.
> 
> The order in pick_next_task() is mandated by the fact that RT/DL
> put_prev_task() can pull other RT tasks, in which case we should not
> call newidle_balance() since we'll not be going idle. Similarly, we
> cannot put newidle_balance() in put_prev_task_fair() because it should
> be called every time we'll end up selecting the idle task.
> 
> Given that we're stuck with this order, the only solution is fixing
> the 'change' pattern. The simplest fix seems to be to 'abuse'
> p->on_cpu to carry more state. Adding more state to p->on_rq is
> possible but is far more invasive and also ends up duplicating much of
> the state we already carry in p->on_cpu.
> 
> Introduce task_on_rq_curr() to indicate if
> sched_class::set_next_task() has been called -- and we thus need to
> call put_prev_task().
> 
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Quentin Perret <qperret@google.com>
> Tested-by: Qais Yousef <qais.yousef@arm.com>
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/core.c  |   22 +++++++++++++++-------
>  kernel/sched/sched.h |   16 ++++++++++++++++
>  2 files changed, 31 insertions(+), 7 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1595,7 +1595,7 @@ void do_set_cpus_allowed(struct task_str
>  	lockdep_assert_held(&p->pi_lock);
>  
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  
>  	if (queued) {
>  		/*
> @@ -3934,8 +3934,16 @@ pick_next_task(struct rq *rq, struct tas
>  	 * can PULL higher prio tasks when we lower the RQ 'priority'.
>  	 */
>  	prev->sched_class->put_prev_task(rq, prev, rf);
> -	if (!rq->nr_running)
> +#ifdef CONFIG_SMP
> +	if (!rq->nr_running) {
> +		/*
> +		 * Make sure task_on_rq_curr() fails, such that we don't do
> +		 * put_prev_task() + set_next_task() on this task again.
> +		 */
> +		prev->on_cpu = 2;
>  		newidle_balance(rq, rf);

Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
become pickable again after newidle_balance() releases rq->lock, and we
take it again, so this on_cpu == 2 never will be cleared?

> +	}
> +#endif
>  
>  	for_each_class(class) {
>  		p = class->pick_next_task(rq, NULL, NULL);
> @@ -4422,7 +4430,7 @@ void rt_mutex_setprio(struct task_struct
>  
>  	prev_class = p->sched_class;
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, queue_flag);
>  	if (running)
> @@ -4509,7 +4517,7 @@ void set_user_nice(struct task_struct *p
>  		goto out_unlock;
>  	}
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
>  	if (running)
> @@ -4957,7 +4965,7 @@ static int __sched_setscheduler(struct t
>  	}
>  
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, queue_flags);
>  	if (running)
> @@ -6141,7 +6149,7 @@ void sched_setnuma(struct task_struct *p
>  
>  	rq = task_rq_lock(p, &rf);
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  
>  	if (queued)
>  		dequeue_task(rq, p, DEQUEUE_SAVE);
> @@ -7031,7 +7039,7 @@ void sched_move_task(struct task_struct
>  	rq = task_rq_lock(tsk, &rf);
>  	update_rq_clock(rq);
>  
> -	running = task_current(rq, tsk);
> +	running = task_on_rq_curr(rq, tsk);
>  	queued = task_on_rq_queued(tsk);
>  
>  	if (queued)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1628,6 +1628,22 @@ static inline int task_running(struct rq
>  #endif
>  }
>  
> +/*
> + * If true, @p has had sched_class::set_next_task() called on it.
> + * See pick_next_task().
> + */
> +static inline bool task_on_rq_curr(struct rq *rq, struct task_struct *p)
> +{
> +#ifdef CONFIG_SMP
> +	return rq->curr == p && p->on_cpu == 1;
> +#else
> +	return rq->curr == p;
> +#endif
> +}
> +
> +/*
> + * If true, @p has had sched_class::enqueue_task() called on it.
> + */
>  static inline int task_on_rq_queued(struct task_struct *p)
>  {
>  	return p->on_rq == TASK_ON_RQ_QUEUED;
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 15:51   ` Kirill Tkhai
@ 2019-11-06 16:54     ` Peter Zijlstra
  2019-11-06 17:27       ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-06 16:54 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
> > +#ifdef CONFIG_SMP
> > +	if (!rq->nr_running) {
> > +		/*
> > +		 * Make sure task_on_rq_curr() fails, such that we don't do
> > +		 * put_prev_task() + set_next_task() on this task again.
> > +		 */
> > +		prev->on_cpu = 2;
> >  		newidle_balance(rq, rf);
> 
> Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
> become pickable again after newidle_balance() releases rq->lock, and we
> take it again, so this on_cpu == 2 never will be cleared?

Indeed so.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 15:04     ` Qais Yousef
@ 2019-11-06 16:57       ` Peter Zijlstra
  2019-11-06 17:26         ` Qais Yousef
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-06 16:57 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Wed, Nov 06, 2019 at 03:04:50PM +0000, Qais Yousef wrote:
> On 11/06/19 14:08, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2019 at 01:05:25PM +0100, Peter Zijlstra wrote:

> > > The only thing I'm now considering is if we shouldn't be setting
> > > ->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
> > > cases.
> > 
> > So I think it all works, but that's more by accident than anything else.
> > I'll move the ->on_cpu=2 assignment earlier. That clearly avoids calling
> > put_prev_task() while we're in put_prev_task().
> 
> Did you mean avoids calling *set_next_task()* while we're in put_prev_task()?

Either, really. The change pattern does put_prev_task() first, and then
restores state by calling set_next_task(). And it can do that while
we're in put_prev_task(), unless we're setting ->on_cpu=2.
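
To spell that out, the 'change' pattern ends up behaving like this once
task_on_rq_curr() is in place (a sketch based on the proposed patch, not on
mainline code):

	queued  = task_on_rq_queued(p);		/* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_on_rq_curr(rq, p);	/* rq->curr == p && p->on_cpu == 1 */

	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE);
	if (running)	/* stays false while pick_next_task() has set ->on_cpu = 2 */
		put_prev_task(rq, p);

	/* change task properties */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE);
	if (running)
		set_next_task(rq, p);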

> So what you're saying is that put_prev_task_{rt,dl}() could drop the rq_lock()
> too and the race could happen while we're inside these functions, correct? Or
> is it a different reason?

Indeed, except it looks like that actually works (mostly by accident).

> > By the way, do all reads/writes to ->on_cpu happen while a lock is held? I.e., we
> don't need to use any smp read/write barriers?

Yes, ->on_cpu is fully serialized by rq->lock. We use
smp_store_release() in finish_task() due to ttwu spin-waiting on it
(which reminds me, riel was seeing lots of that).
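
For completeness, finish_task() is roughly the following (paraphrased from the
5.4-era kernel/sched/core.c, comments abridged):

static inline void finish_task(struct task_struct *prev)
{
#ifdef CONFIG_SMP
	/*
	 * Must be the last reference to @prev from this CPU; once ->on_cpu is
	 * cleared the task can be moved elsewhere. The release pairs with the
	 * smp_cond_load_acquire() spin-wait in try_to_wake_up().
	 */
	smp_store_release(&prev->on_cpu, 0);
#endif
}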

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 16:57       ` Peter Zijlstra
@ 2019-11-06 17:26         ` Qais Yousef
  0 siblings, 0 replies; 38+ messages in thread
From: Qais Yousef @ 2019-11-06 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On 11/06/19 17:57, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 03:04:50PM +0000, Qais Yousef wrote:
> > On 11/06/19 14:08, Peter Zijlstra wrote:
> > > On Wed, Nov 06, 2019 at 01:05:25PM +0100, Peter Zijlstra wrote:
> 
> > > > The only thing I'm now considering is if we shouldn't be setting
> > > > ->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
> > > > cases.
> > > 
> > > So I think it all works, but that's more by accident than anything else.
> > > I'll move the ->on_cpu=2 assignment earlier. That clearly avoids calling
> > > put_prev_task() while we're in put_prev_task().
> > 
> > Did you mean avoids calling *set_next_task()* while we're in put_prev_task()?
> 
> Either, really. The change pattern does put_prev_task() first, and then
> restores state by calling set_next_task(). And it can do that while
> we're in put_prev_task(), unless we're setting ->on_cpu=2.

*head starts spinning*

I can't see how we can have double put_prev_task() in a row. Let me stare more
at the code.

> 
> > So what you're saying is that put_prev_task_{rt,dl}() could drop the rq_lock()
> > too and the race could happen while we're inside these functions, correct? Or
> > is it a different reason?
> 
> Indeed, except it looks like that actually works (mostly by accident).

+1

I think I got it now, it's the double_lock_balance() that can drop the lock.
It even has a comment above it!
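
For anyone following along, that's the non-PREEMPT _double_lock_balance() in
kernel/sched/sched.h, roughly (paraphrased; the in-tree comment above it
spells out the unlock):

static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	int ret = 0;

	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
		if (busiest < this_rq) {
			/* lock ordering: drop this_rq->lock to take busiest->lock first */
			raw_spin_unlock(&this_rq->lock);
			raw_spin_lock(&busiest->lock);
			raw_spin_lock_nested(&this_rq->lock,
					     SINGLE_DEPTH_NESTING);
			ret = 1;
		} else {
			raw_spin_lock_nested(&busiest->lock,
					     SINGLE_DEPTH_NESTING);
		}
	}
	return ret;
}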

> 
> > By the way, do all reads/writes to ->on_cpu happen while a lock is held? I.e., we
> > don't need to use any smp read/write barriers?
> 
> Yes, ->on_cpu is fully serialized by rq->lock. We use
> smp_store_release() in finish_task() due to ttwu spin-waiting on it
> (which reminds me, riel was seeing lots of that).

Thanks. I had to ask as it was hard to walk all the paths.

Sometimes I get tempted to sprinkle comments or lockdep_assert() but then
I think that can easily get ugly and out of hand. I guess one just has to know
the code.

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 16:54     ` Peter Zijlstra
@ 2019-11-06 17:27       ` Peter Zijlstra
  2019-11-07  8:36         ` Kirill Tkhai
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-06 17:27 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Wed, Nov 06, 2019 at 05:54:37PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
> > > +#ifdef CONFIG_SMP
> > > +	if (!rq->nr_running) {
> > > +		/*
> > > +		 * Make sure task_on_rq_curr() fails, such that we don't do
> > > +		 * put_prev_task() + set_next_task() on this task again.
> > > +		 */
> > > +		prev->on_cpu = 2;
> > >  		newidle_balance(rq, rf);
> > 
> > Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
> > become pickable again after newidle_balance() releases rq->lock, and we
> > take it again, so this on_cpu == 2 never will be cleared?
> 
> Indeed so.

Oh wait, the way it was written this is not possible. Because
rq->nr_running == 0 and prev->on_cpu > 0 it means the current task is
going to sleep and cannot be woken back up.

But if I move the ->on_cpu=2 thing earlier, as I wrote I'd do, then yes,
we have to set it back to 1. Because in that case we can get here for a
spurious schedule and we'll pick the same task again.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-06 17:27       ` Peter Zijlstra
@ 2019-11-07  8:36         ` Kirill Tkhai
  2019-11-07 13:26           ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill Tkhai @ 2019-11-07  8:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On 06.11.2019 20:27, Peter Zijlstra wrote:
> On Wed, Nov 06, 2019 at 05:54:37PM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
>>>> +#ifdef CONFIG_SMP
>>>> +	if (!rq->nr_running) {
>>>> +		/*
>>>> +		 * Make sure task_on_rq_curr() fails, such that we don't do
>>>> +		 * put_prev_task() + set_next_task() on this task again.
>>>> +		 */
>>>> +		prev->on_cpu = 2;
>>>>  		newidle_balance(rq, rf);
>>>
>>> Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
>>> become pickable again after newidle_balance() releases rq->lock, and we
>>> take it again, so this on_cpu == 2 never will be cleared?
>>
>> Indeed so.
> 
> Oh wait, the way it was written this is not possible. Because
> rq->nr_running == 0 and prev->on_cpu > 0 it means the current task is
> going to sleep and cannot be woken back up.

I mostly mean throttling. AFAIR, tasks of a throttled rt_rq are not accounted
in rq->nr_running. But it seems the rt_rq may become unthrottled again after
newidle_balance() unlocks the rq lock, and prev will become pickable again.
 
> But if I move the ->on_cpu=2 thing earlier, as I wrote I'd do, then yes,
> we have to set it back to 1. Because in that case we can get here for a
> spurious schedule and we'll pick the same task again.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07  8:36         ` Kirill Tkhai
@ 2019-11-07 13:26           ` Peter Zijlstra
  2019-11-07 15:12             ` Kirill Tkhai
                               ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-07 13:26 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 11:36:50AM +0300, Kirill Tkhai wrote:
> On 06.11.2019 20:27, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2019 at 05:54:37PM +0100, Peter Zijlstra wrote:
> >> On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
> >>>> +#ifdef CONFIG_SMP
> >>>> +	if (!rq->nr_running) {
> >>>> +		/*
> >>>> +		 * Make sure task_on_rq_curr() fails, such that we don't do
> >>>> +		 * put_prev_task() + set_next_task() on this task again.
> >>>> +		 */
> >>>> +		prev->on_cpu = 2;
> >>>>  		newidle_balance(rq, rf);
> >>>
> >>> Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
> >>> become pickable again after newidle_balance() releases rq->lock, and we
> >>> take it again, so this on_cpu == 2 never will be cleared?
> >>
> >> Indeed so.
> > 
> > Oh wait, the way it was written this is not possible. Because
> > rq->nr_running == 0 and prev->on_cpu > 0 it means the current task is
> > going to sleep and cannot be woken back up.
> 
> I mostly mean throttling. AFAIR, tasks of a throttled rt_rq are not accounted
> in rq->nr_running. But it seems the rt_rq may become unthrottled again after
> newidle_balance() unlocks the rq lock, and prev will become pickable again.

Urgh... throttling.

Bah.. I had just named the "->on_cpu = 2" thing leave_task(), to match
prepare_task() and finish_task(), but when we have to flip it back to 1
that doesn't really work.

Also, I'm still flip-flopping on where to place it. Yesterday I
suggested placing it before put_prev_task(), but then I went to write a
comment, and either way around put_prev_task() needs to be very careful.

So I went back to placing it after and putting lots of comments on.

How does the below look?

---
Subject: sched: Fix pick_next_task() vs 'change' pattern race
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 4 22:18:14 CET 2019

Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertently introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

The order in pick_next_task() is mandated by the fact that RT/DL
put_prev_task() can pull other RT tasks, in which case we should not
call newidle_balance() since we'll not be going idle. Similarly, we
cannot put newidle_balance() in put_prev_task_fair() because it should
be called every time we'll end up selecting the idle task.

Given that we're stuck with this order, the only solution is fixing
the 'change' pattern. The simplest fix seems to be to 'abuse'
p->on_cpu to carry more state. Adding more state to p->on_rq is
possible but is far more invasive and also ends up duplicating much of
the state we already carry in p->on_cpu.

Introduce task_on_rq_curr() to indicate if
sched_class::set_next_task() has been called -- and we thus need to
call put_prev_task().

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c     |   35 ++++++++++++++++++++++++++++-------
 kernel/sched/deadline.c |    9 +++++++++
 kernel/sched/rt.c       |    9 +++++++++
 kernel/sched/sched.h    |   21 +++++++++++++++++++++
 4 files changed, 67 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1595,7 +1595,7 @@ void do_set_cpus_allowed(struct task_str
 	lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued) {
 		/*
@@ -3078,6 +3078,19 @@ static inline void prepare_task(struct t
 #endif
 }
 
+static inline void leave_task(struct task_struct *prev)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * The task is on its way out, we'll have called put_prev_task() on it.
+	 *
+	 * Make sure task_on_rq_curr() fails, such that we don't do
+	 * put_prev_task() + set_next_task() on this task again.
+	 */
+	prev->on_cpu = 2;
+#endif
+}
+
 static inline void finish_task(struct task_struct *prev)
 {
 #ifdef CONFIG_SMP
@@ -3934,8 +3947,16 @@ pick_next_task(struct rq *rq, struct tas
 	 * can PULL higher prio tasks when we lower the RQ 'priority'.
 	 */
 	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
+	if (!rq->nr_running) {
+		leave_task(prev);
 		newidle_balance(rq, rf);
+		/*
+		 * When the below pick loop results in @p == @prev, then we
+		 * will not go through context_switch() but the
+		 * pick_next_task() will have done set_next_task() again.
+		 */
+		prepare_task(prev);
+	}
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -4422,7 +4443,7 @@ void rt_mutex_setprio(struct task_struct
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -4509,7 +4530,7 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -4957,7 +4978,7 @@ static int __sched_setscheduler(struct t
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -6141,7 +6162,7 @@ void sched_setnuma(struct task_struct *p
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -7031,7 +7052,7 @@ void sched_move_task(struct task_struct
 	rq = task_rq_lock(tsk, &rf);
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_on_rq_curr(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1778,6 +1778,15 @@ pick_next_task_dl(struct rq *rq, struct
 	return p;
 }
 
+/*
+ * As per the note for sched_class::put_prev_task() we must only drop
+ * the rq->lock before any permanent state change.
+ *
+ * In this case, the state change and the pull action are mutually exclusive.
+ * If @prev is still on the runqueue, the priority will not have dropped and we
+ * don't need to pull. If @prev is no longer on the runqueue we don't need
+ * to add it back to the pushable list.
+ */
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1566,6 +1566,15 @@ pick_next_task_rt(struct rq *rq, struct
 	return p;
 }
 
+/*
+ * As per the note for sched_class::put_prev_task() we must only drop
+ * the rq->lock before any permanent state change.
+ *
+ * In this case, the state change and the pull action are mutually exclusive.
+ * If @prev is still on the runqueue, the priority will not have dropped and we
+ * don't need to pull. If @prev is no longer on the runqueue we don't need
+ * to add it back to the pushable list.
+ */
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1628,6 +1628,22 @@ static inline int task_running(struct rq
 #endif
 }
 
+/*
+ * If true, @p has had sched_class::set_next_task() called on it.
+ * See pick_next_task().
+ */
+static inline bool task_on_rq_curr(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SMP
+	return rq->curr == p && p->on_cpu == 1;
+#else
+	return rq->curr == p;
+#endif
+}
+
+/*
+ * If true, @p has had sched_class::enqueue_task() called on it.
+ */
 static inline int task_on_rq_queued(struct task_struct *p)
 {
 	return p->on_rq == TASK_ON_RQ_QUEUED;
@@ -1727,6 +1743,11 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
+	/*
+	 * When put_prev_task() drops the rq->lock (RT/DL) it must do this
+	 * before any effective state change, such that a nested
+	 * 'put_prev_task() + set_next_task' pair will work correctly.
+	 */
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 13:26           ` Peter Zijlstra
@ 2019-11-07 15:12             ` Kirill Tkhai
  2019-11-07 15:42               ` Peter Zijlstra
  2019-11-07 15:38             ` Quentin Perret
  2019-11-07 16:09             ` Qais Yousef
  2 siblings, 1 reply; 38+ messages in thread
From: Kirill Tkhai @ 2019-11-07 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On 07.11.2019 16:26, Peter Zijlstra wrote:
> On Thu, Nov 07, 2019 at 11:36:50AM +0300, Kirill Tkhai wrote:
>> On 06.11.2019 20:27, Peter Zijlstra wrote:
>>> On Wed, Nov 06, 2019 at 05:54:37PM +0100, Peter Zijlstra wrote:
>>>> On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
>>>>>> +#ifdef CONFIG_SMP
>>>>>> +	if (!rq->nr_running) {
>>>>>> +		/*
>>>>>> +		 * Make sure task_on_rq_curr() fails, such that we don't do
>>>>>> +		 * put_prev_task() + set_next_task() on this task again.
>>>>>> +		 */
>>>>>> +		prev->on_cpu = 2;
>>>>>>  		newidle_balance(rq, rf);
>>>>>
>>>>> Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
>>>>> become pickable again after newidle_balance() releases rq->lock, and we
>>>>> take it again, so this on_cpu == 2 never will be cleared?
>>>>
>>>> Indeed so.
>>>
>>> Oh wait, the way it was written this is not possible. Because
>>> rq->nr_running == 0 and prev->on_cpu > 0 it means the current task is
>>> going to sleep and cannot be woken back up.
>>
>> I mostly mean throttling. AFAIR, tasks of throttled rt_rq are not accounted
>> in rq->nr_running. But it seems rt_rq may become unthrottled again after
>> newidle_balance() unlocks rq lock, and prev will become pickable again.
> 
> Urgh... throttling.
> 
> Bah.. I had just named the "->on_cpu = 2" thing leave_task(), to match
> prepare_task() and finish_task(), but when we have to flip it back to 1
> that doesn't really work.
> 
> Also, I'm still flip-flopping on where to place it. Yesterday I
> suggested placing it before put_prev_task(), but then I went to write a
> comment, and either way around put_prev_task() needs to be very careful.
> 
> So I went back to placing it after and putting lots of comments on.
> 
> How does the below look?

For me the below patch looks OK.

One more thing about the current code in git. After the commit that allows
rq->lock to be dropped after put_prev_task(), we got a new corner case.
We used to always have the same order for a running task:

  dequeue_task()
  put_prev_task()

Now the order may be reversed (this also happens in the throttling case):

  put_prev_task() (called from pick_next_task())
  dequeue_task()  (called from another cpu)

This is more theoretical, since I don't see a problem here. But there are
so many statistics and counters in the sched_class methods that it is
impossible to be sure all of them work as expected.

Kirill

> ---
> Subject: sched: Fix pick_next_task() vs 'change' pattern race
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Nov 4 22:18:14 CET 2019
> 
> Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
> inadvertently introduced a race because it changed a previously
> unexplored dependency between dropping the rq->lock and
> sched_class::put_prev_task().
> 
> The comments about dropping rq->lock, in for example
> newidle_balance(), only mention the task being current and ->on_cpu
> being set. But when we look at the 'change' pattern (in for example
> sched_setnuma()):
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 
> It becomes obvious that if we do this after put_prev_task() has
> already been called on @p, things go sideways. This is exactly what
> the commit in question allows to happen when it does:
> 
> 	prev->sched_class->put_prev_task(rq, prev, rf);
> 	if (!rq->nr_running)
> 		newidle_balance(rq, rf);
> 
> The newidle_balance() call will drop rq->lock after we've called
> put_prev_task() and that allows the above 'change' pattern to
> interleave and mess up the state.
> 
> The order in pick_next_task() is mandated by the fact that RT/DL
> put_prev_task() can pull other RT tasks, in which case we should not
> call newidle_balance() since we'll not be going idle. Similarly, we
> cannot put newidle_balance() in put_prev_task_fair() because it should
> be called every time we'll end up selecting the idle task.
> 
> Given that we're stuck with this order, the only solution is fixing
> the 'change' pattern. The simplest fix seems to be to 'abuse'
> p->on_cpu to carry more state. Adding more state to p->on_rq is
> possible but is far more invasive and also ends up duplicating much of
> the state we already carry in p->on_cpu.
> 
> Introduce task_on_rq_curr() to indicate if
> sched_class::set_next_task() has been called -- and we thus need to
> call put_prev_task().
> 
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Quentin Perret <qperret@google.com>
> Tested-by: Qais Yousef <qais.yousef@arm.com>
> Tested-by: Valentin Schneider <valentin.schneider@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
>  kernel/sched/core.c     |   35 ++++++++++++++++++++++++++++-------
>  kernel/sched/deadline.c |    9 +++++++++
>  kernel/sched/rt.c       |    9 +++++++++
>  kernel/sched/sched.h    |   21 +++++++++++++++++++++
>  4 files changed, 67 insertions(+), 7 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1595,7 +1595,7 @@ void do_set_cpus_allowed(struct task_str
>  	lockdep_assert_held(&p->pi_lock);
>  
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  
>  	if (queued) {
>  		/*
> @@ -3078,6 +3078,19 @@ static inline void prepare_task(struct t
>  #endif
>  }
>  
> +static inline void leave_task(struct task_struct *prev)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * The task is on its way out, we'll have called put_prev_task() on it.
> +	 *
> +	 * Make sure task_on_rq_curr() fails, such that we don't do
> +	 * put_prev_task() + set_next_task() on this task again.
> +	 */
> +	prev->on_cpu = 2;
> +#endif
> +}
> +
>  static inline void finish_task(struct task_struct *prev)
>  {
>  #ifdef CONFIG_SMP
> @@ -3934,8 +3947,16 @@ pick_next_task(struct rq *rq, struct tas
>  	 * can PULL higher prio tasks when we lower the RQ 'priority'.
>  	 */
>  	prev->sched_class->put_prev_task(rq, prev, rf);
> -	if (!rq->nr_running)
> +	if (!rq->nr_running) {
> +		leave_task(prev);
>  		newidle_balance(rq, rf);
> +		/*
> +		 * When the below pick loop results in @p == @prev, then we
> +		 * will not go through context_switch() but the
> +		 * pick_next_task() will have done set_next_task() again.
> +		 */
> +		prepare_task(prev);
> +	}
>  
>  	for_each_class(class) {
>  		p = class->pick_next_task(rq, NULL, NULL);
> @@ -4422,7 +4443,7 @@ void rt_mutex_setprio(struct task_struct
>  
>  	prev_class = p->sched_class;
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, queue_flag);
>  	if (running)
> @@ -4509,7 +4530,7 @@ void set_user_nice(struct task_struct *p
>  		goto out_unlock;
>  	}
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
>  	if (running)
> @@ -4957,7 +4978,7 @@ static int __sched_setscheduler(struct t
>  	}
>  
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  	if (queued)
>  		dequeue_task(rq, p, queue_flags);
>  	if (running)
> @@ -6141,7 +6162,7 @@ void sched_setnuma(struct task_struct *p
>  
>  	rq = task_rq_lock(p, &rf);
>  	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_on_rq_curr(rq, p);
>  
>  	if (queued)
>  		dequeue_task(rq, p, DEQUEUE_SAVE);
> @@ -7031,7 +7052,7 @@ void sched_move_task(struct task_struct
>  	rq = task_rq_lock(tsk, &rf);
>  	update_rq_clock(rq);
>  
> -	running = task_current(rq, tsk);
> +	running = task_on_rq_curr(rq, tsk);
>  	queued = task_on_rq_queued(tsk);
>  
>  	if (queued)
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1778,6 +1778,15 @@ pick_next_task_dl(struct rq *rq, struct
>  	return p;
>  }
>  
> +/*
> + * As per the note for sched_class::put_prev_task() we must only drop
> + * the rq->lock before any permanent state change.
> + *
> + * In this case, the state change and the pull action are mutually exclusive.
> + * If @prev is still on the runqueue, the priority will not have dropped and we
> + * don't need to pull. If @prev is no longer on the runqueue we don't need
> + * to add it back to the pushable list.
> + */
>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>  {
>  	update_curr_dl(rq);
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1566,6 +1566,15 @@ pick_next_task_rt(struct rq *rq, struct
>  	return p;
>  }
>  
> +/*
> + * As per the note for sched_class::put_prev_task() we must only drop
> + * the rq->lock before any permanent state change.
> + *
> + * In this case, the state change and the pull action are mutually exclusive.
> + * If @prev is still on the runqueue, the priority will not have dropped and we
> + * don't need to pull. If @prev is no longer on the runqueue we don't need
> + * to add it back to the pushable list.
> + */
>  static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>  {
>  	update_curr_rt(rq);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1628,6 +1628,22 @@ static inline int task_running(struct rq
>  #endif
>  }
>  
> +/*
> + * If true, @p has had sched_class::set_next_task() called on it.
> + * See pick_next_task().
> + */
> +static inline bool task_on_rq_curr(struct rq *rq, struct task_struct *p)
> +{
> +#ifdef CONFIG_SMP
> +	return rq->curr == p && p->on_cpu == 1;
> +#else
> +	return rq->curr == p;
> +#endif
> +}
> +
> +/*
> + * If true, @p has had sched_class::enqueue_task() called on it.
> + */
>  static inline int task_on_rq_queued(struct task_struct *p)
>  {
>  	return p->on_rq == TASK_ON_RQ_QUEUED;
> @@ -1727,6 +1743,11 @@ struct sched_class {
>  	struct task_struct * (*pick_next_task)(struct rq *rq,
>  					       struct task_struct *prev,
>  					       struct rq_flags *rf);
> +	/*
> +	 * When put_prev_task() drops the rq->lock (RT/DL) it must do this
> +	 * before any effective state change, such that a nested
> +	 * 'put_prev_task() + set_next_task' pair will work correctly.
> +	 */
>  	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
>  	void (*set_next_task)(struct rq *rq, struct task_struct *p);
>  
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 13:26           ` Peter Zijlstra
  2019-11-07 15:12             ` Kirill Tkhai
@ 2019-11-07 15:38             ` Quentin Perret
  2019-11-07 18:43               ` Peter Zijlstra
  2019-11-07 16:09             ` Qais Yousef
  2 siblings, 1 reply; 38+ messages in thread
From: Quentin Perret @ 2019-11-07 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thursday 07 Nov 2019 at 14:26:28 (+0100), Peter Zijlstra wrote:
> Given that we're stuck with this order, the only solution is fixing
> the 'change' pattern. The simplest fix seems to be to 'abuse'
> p->on_cpu to carry more state. Adding more state to p->on_rq is
> possible but is far more invasive and also ends up duplicating much of
> the state we already carry in p->on_cpu.

I think there is another solution, which is to 'de-factorize' the call
to put_prev_task() (that is, have each class do it). I gave it a go and
I basically ended up with something equivalent to reverting 67692435c411
("sched: Rework pick_next_task() slow-path"), which isn't the worst
solution IMO. I'm thinking at least we should consider it.

Now, 67692435c411 _is_ a nice clean-up, it's just a shame that the fix
on top isn't as nice (IMO). It might just be a matter of personal taste,
so I don't have a strong opinion on this :)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 15:12             ` Kirill Tkhai
@ 2019-11-07 15:42               ` Peter Zijlstra
  2019-11-07 15:53                 ` Kirill Tkhai
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-07 15:42 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 06:12:07PM +0300, Kirill Tkhai wrote:
> On 07.11.2019 16:26, Peter Zijlstra wrote:

> > Urgh... throttling.

> One more thing about current code in git. After rq->lock became able to
> be unlocked after put_prev_task() is commited, we got a new corner case.
> We usually had the same order for running task:
> 
>   dequeue_task()
>   put_prev_task()
> 
> Now the order may be reversed (this is also in case of throttling):
> 
>   put_prev_task() (called from pick_next_task())
>   dequeue_task()  (called from another cpu)
> 
> This is more theoretically, since I don't see a problem here. But there are
> too many statistics and counters in sched_class methods, that it is impossible
> to be sure all of them work as expected.

Hmm,.. where does throttling happen on a remote CPU? I thought both
cfs-bandwidth and dl throttle locally.

Or are you talking about NO_HZ_FULL's sched_remote_tick() ?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 15:42               ` Peter Zijlstra
@ 2019-11-07 15:53                 ` Kirill Tkhai
  0 siblings, 0 replies; 38+ messages in thread
From: Kirill Tkhai @ 2019-11-07 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quentin Perret, linux-kernel, aaron.lwe, valentin.schneider,
	mingo, pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On 07.11.2019 18:42, Peter Zijlstra wrote:
> On Thu, Nov 07, 2019 at 06:12:07PM +0300, Kirill Tkhai wrote:
>> On 07.11.2019 16:26, Peter Zijlstra wrote:
> 
>>> Urgh... throttling.
> 
>> One more thing about current code in git. After rq->lock became able to
>> be unlocked after put_prev_task() is commited, we got a new corner case.
>> We usually had the same order for running task:
>>
>>   dequeue_task()
>>   put_prev_task()
>>
>> Now the order may be reversed (this is also in case of throttling):
>>
>>   put_prev_task() (called from pick_next_task())
>>   dequeue_task()  (called from another cpu)
>>
>> This is more theoretically, since I don't see a problem here. But there are
>> too many statistics and counters in sched_class methods, that it is impossible
>> to be sure all of them work as expected.
> 
> Hmm,.. where does throttling happen on a remote CPU? I through both
> cfs-bandwidth and dl throttle locally.
> 
> Or are you talking about NO_HZ_FULL's sched_remote_tick() ?

I mean the ordinary path: local throttling -> resched_curr() -> schedule().
Then rq->nr_running == 0, but the task is still on the rq. We call
put_prev_task() and newidle_balance().

On another cpu someone calls set_user_nice(), and that does dequeue_task()
in the middle of the local cpu's newidle_balance().

Thus, we first do put_prev_task() and only afterwards dequeue_task().
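
Spelled out as a timeline (the CPU numbers and the set_user_nice() caller
are just for illustration):

  CPU0 (local)                                CPU1
  ------------                                ----
  rt throttling -> resched_curr()
  schedule()
    pick_next_task()
      put_prev_task(rq, prev)
      newidle_balance(rq, rf)
        /* drops rq->lock */
                                              set_user_nice(prev, nice)
                                                task_rq_lock(prev)
                                                dequeue_task(rq, prev, ...)
                                                task_rq_unlock(...)
        /* re-takes rq->lock */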

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 13:26           ` Peter Zijlstra
  2019-11-07 15:12             ` Kirill Tkhai
  2019-11-07 15:38             ` Quentin Perret
@ 2019-11-07 16:09             ` Qais Yousef
  2 siblings, 0 replies; 38+ messages in thread
From: Qais Yousef @ 2019-11-07 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, Quentin Perret, linux-kernel, aaron.lwe,
	valentin.schneider, mingo, pauld, jdesfossez, naravamudan,
	vincent.guittot, dietmar.eggemann, juri.lelli, rostedt, bsegall,
	mgorman, kernel-team, john.stultz

On 11/07/19 14:26, Peter Zijlstra wrote:
> On Thu, Nov 07, 2019 at 11:36:50AM +0300, Kirill Tkhai wrote:
> > On 06.11.2019 20:27, Peter Zijlstra wrote:
> > > On Wed, Nov 06, 2019 at 05:54:37PM +0100, Peter Zijlstra wrote:
> > >> On Wed, Nov 06, 2019 at 06:51:40PM +0300, Kirill Tkhai wrote:
> > >>>> +#ifdef CONFIG_SMP
> > >>>> +	if (!rq->nr_running) {
> > >>>> +		/*
> > >>>> +		 * Make sure task_on_rq_curr() fails, such that we don't do
> > >>>> +		 * put_prev_task() + set_next_task() on this task again.
> > >>>> +		 */
> > >>>> +		prev->on_cpu = 2;
> > >>>>  		newidle_balance(rq, rf);
> > >>>
> > >>> Shouldn't we restore prev->on_cpu = 1 after newidle_balance()? Can't prev
> > >>> become pickable again after newidle_balance() releases rq->lock, and we
> > >>> take it again, so this on_cpu == 2 never will be cleared?
> > >>
> > >> Indeed so.
> > > 
> > > Oh wait, the way it was written this is not possible. Because
> > > rq->nr_running == 0 and prev->on_cpu > 0 it means the current task is
> > > going to sleep and cannot be woken back up.
> > 
> > I mostly mean throttling. AFAIR, tasks of throttled rt_rq are not accounted
> > in rq->nr_running. But it seems rt_rq may become unthrottled again after
> > newidle_balance() unlocks rq lock, and prev will become pickable again.
> 
> Urgh... throttling.
> 
> Bah.. I had just named the "->on_cpu = 2" thing leave_task(), to match
> prepare_task() and finish_task(), but when we have to flip it back to 1
> that doesn't really work.
> 
> Also, I'm still flip-flopping on where to place it. Yesterday I
> suggested placing it before put_prev_task(), but then I went to write a
> comment, and either way around put_prev_task() needs to be very careful.
> 
> So I went back to placing it after and putting lots of comments on.
> 
> How does the below look?

This looks good to me. But it makes me wonder: if there's no penalty to adding
the leave_task() before put_prev_task(), and if it results in relaxing the
requirement of 'no permanent state change before releasing the rq lock',
shouldn't we just do it?

Anyways. I'll pick this version and spin it through my reproducer.

Thanks for fixing this!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 15:38             ` Quentin Perret
@ 2019-11-07 18:43               ` Peter Zijlstra
  2019-11-07 19:27                 ` Quentin Perret
  2019-11-07 19:29                 ` Peter Zijlstra
  0 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-07 18:43 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 03:38:48PM +0000, Quentin Perret wrote:
> On Thursday 07 Nov 2019 at 14:26:28 (+0100), Peter Zijlstra wrote:
> > Given that we're stuck with this order, the only solution is fixing
> > the 'change' pattern. The simplest fix seems to be to 'abuse'
> > p->on_cpu to carry more state. Adding more state to p->on_rq is
> > possible but is far more invasive and also ends up duplicating much of
> > the state we already carry in p->on_cpu.
> 
> I think there is another solution, which is to 'de-factorize' the call
> to put_prev_task() (that is, have each class do it). I gave it a go and
> I basically end up with something equivalent to reverting 67692435c411
> ("sched: Rework pick_next_task() slow-path"), which isn't the worst
> solution IMO. I'm thinking at least we should consider it.

The purpose of 67692435c411 is to get rid of the RETRY_TASK logic
restarting the pick.

But you mean something like:

	for (class = prev->sched_class; class; class = class->next) {
		if (class->balance(rq, rf))
			break;
	}

	put_prev_task(rq, prev);

	for_each_class(class) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
	}

	BUG();

like?

I had convinced myself we didn't need that, but that DL to RT case is
pesky and might require it after all.

> Now, 67692435c411 _is_ a nice clean-up, it's just a shame that the fix
> on top isn't as nice (IMO). It might just be a matter of personal taste,
> so I don't have a strong opinion on this :)

Yeah, it does rather make a mess of things.

I'll try and code up the above after dinner.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 18:43               ` Peter Zijlstra
@ 2019-11-07 19:27                 ` Quentin Perret
  2019-11-07 19:31                   ` Peter Zijlstra
  2019-11-07 19:29                 ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Quentin Perret @ 2019-11-07 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thursday 07 Nov 2019 at 19:43:56 (+0100), Peter Zijlstra wrote:
> But you mean something like:
> 
> 	for (class = prev->sched_class; class; class = class->next) {
> 		if (class->balance(rq, rf))
> 			break;
> 	}
> 
> 	put_prev_task(rq, prev);
> 
> 	for_each_class(class) {
> 		p = class->pick_next_task(rq);
> 		if (p)
> 			return p;
> 	}
> 
> 	BUG();
> 
> like?

Right, something like that, though what I had was basically doing the
pull from within the pick_next_task_*() functions directly, like we were
doing before. I'm now seeing how easy it is to get this wrong, and that
even good-looking code in this area can be broken in very subtle ways,
so I didn't feel comfortable refactoring again so close to rc7. If you
feel more confident, I'm more than happy to test a patch implementing the
above :)

> I had convinced myself we didn't need that, but that DL to RT case is
> pesky and might require it after all.

Yep, I don't see a way to avoid iterating all classes to do the balance,
one way or another ...

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 18:43               ` Peter Zijlstra
  2019-11-07 19:27                 ` Quentin Perret
@ 2019-11-07 19:29                 ` Peter Zijlstra
  2019-11-08 11:02                   ` Quentin Perret
  2019-11-08 11:55                   ` Peter Zijlstra
  1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-07 19:29 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 07:43:56PM +0100, Peter Zijlstra wrote:
> But you mean something like:
> 
> 	for (class = prev->sched_class; class; class = class->next) {
> 		if (class->balance(rq, rf))
> 			break;
> 	}
> 
> 	put_prev_task(rq, prev);
> 
> 	for_each_class(class) {
> 		p = class->pick_next_task(rq);
> 		if (p)
> 			return p;
> 	}
> 
> 	BUG();
> 
> like?

I still haven't had food, but this here compiles...

---
 kernel/sched/core.c      | 20 ++++++++++++--------
 kernel/sched/deadline.c  | 23 +++++++++++++++--------
 kernel/sched/fair.c      | 12 ++++++++++--
 kernel/sched/idle.c      |  8 +++++++-
 kernel/sched/rt.c        | 32 +++++++++++++++++++-------------
 kernel/sched/sched.h     |  5 +++--
 kernel/sched/stop_task.c |  8 +++++++-
 7 files changed, 73 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb42b71faab9..d9909606806d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3929,13 +3929,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	}

 restart:
-	/*
-	 * Ensure that we put DL/RT tasks before the pick loop, such that they
-	 * can PULL higher prio tasks when we lower the RQ 'priority'.
-	 */
-	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
-		newidle_balance(rq, rf);
+#ifdef CONFIG_SMP
+	for (class = prev->sched_class;
+	     class != &idle_sched_class;
+	     class = class->next) {
+
+		if (class->balance(rq, prev, rf))
+			break;
+	}
+#endif
+
+	put_prev_task(rq, prev);

 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -6201,7 +6205,7 @@ static struct task_struct *__pick_migrate_task(struct rq *rq)
 	for_each_class(class) {
 		next = class->pick_next_task(rq, NULL, NULL);
 		if (next) {
-			next->sched_class->put_prev_task(rq, next, NULL);
+			next->sched_class->put_prev_task(rq, next);
 			return next;
 		}
 	}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2dc48720f189..b6c3fb10cf57 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1778,15 +1778,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }

-static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
-	update_curr_dl(rq);
-
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
-	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_dl_task(rq, p);
-
-	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
 		 * picked for load-balance and preemption/IRQs are still
@@ -1797,6 +1791,18 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_fla
 		pull_dl_task(rq);
 		rq_repin_lock(rq, rf);
 	}
+
+	return rq->dl.dl_nr_running > 0;
+}
+
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+
+	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
+	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }

 /*
@@ -2438,6 +2444,7 @@ const struct sched_class dl_sched_class = {
 	.check_preempt_curr	= check_preempt_curr_dl,

 	.pick_next_task		= pick_next_task_dl,
+	.balance		= balance_dl,
 	.put_prev_task		= put_prev_task_dl,
 	.set_next_task		= set_next_task_dl,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a14487462b6c..6b983214e00f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6746,10 +6746,18 @@ done: __maybe_unused;
 	return NULL;
 }

+static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	if (rq->cfs.nr_running)
+		return 1;
+
+	return newidle_balance(rq, rf) != 0;
+}
+
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
@@ -10609,7 +10617,7 @@ const struct sched_class fair_sched_class = {
 	.check_preempt_curr	= check_preempt_wakeup,

 	.pick_next_task		= pick_next_task_fair,
-
+	.balance		= balance_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8dad5aa600ea..2ff65fd2fe7e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -375,7 +375,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }

-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static int balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return 1; /* always runnable */
+}
+
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
 }

@@ -456,6 +461,7 @@ const struct sched_class idle_sched_class = {
 	.check_preempt_curr	= check_preempt_curr_idle,

 	.pick_next_task		= pick_next_task_idle,
+	.balance		= balance_idle,
 	.put_prev_task		= put_prev_task_idle,
 	.set_next_task          = set_next_task_idle,

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ebaa4e619684..3aa3e0fca796 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1566,20 +1566,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }

-static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
-	update_curr_rt(rq);
-
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
-
-	/*
-	 * The previous task needs to be made eligible for pushing
-	 * if it is still active
-	 */
-	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
-
-	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
 		/*
 		 * This is OK, because current is on_cpu, which avoids it being
 		 * picked for load-balance and preemption/IRQs are still
@@ -1590,6 +1579,22 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_fla
 		pull_rt_task(rq);
 		rq_repin_lock(rq, rf);
 	}
+
+	return rq->rt.rt_queued > 0;
+}
+
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+{
+	update_curr_rt(rq);
+
+	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
+
+	/*
+	 * The previous task needs to be made eligible for pushing
+	 * if it is still active
+	 */
+	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_task(rq, p);
 }

 #ifdef CONFIG_SMP
@@ -2362,6 +2367,7 @@ const struct sched_class rt_sched_class = {
 	.check_preempt_curr	= check_preempt_curr_rt,

 	.pick_next_task		= pick_next_task_rt,
+	.balance		= balance_rt,
 	.put_prev_task		= put_prev_task_rt,
 	.set_next_task          = set_next_task_rt,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b3361e..3b50450428dc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1727,7 +1727,8 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
+	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);

 #ifdef CONFIG_SMP
@@ -1773,7 +1774,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev, NULL);
+	prev->sched_class->put_prev_task(rq, prev);
 }

 static inline void set_next_task(struct rq *rq, struct task_struct *next)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 7e1cee4e65b2..8a5b3e5540ee 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -60,7 +60,12 @@ static void yield_task_stop(struct rq *rq)
 	BUG(); /* the stop task should never yield, its pointless. */
 }

-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static int balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
@@ -125,6 +130,7 @@ const struct sched_class stop_sched_class = {
 	.check_preempt_curr	= check_preempt_curr_stop,

 	.pick_next_task		= pick_next_task_stop,
+	.balance		= balance_stop,
 	.put_prev_task		= put_prev_task_stop,
 	.set_next_task          = set_next_task_stop,



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 19:27                 ` Quentin Perret
@ 2019-11-07 19:31                   ` Peter Zijlstra
  2019-11-07 19:42                     ` Quentin Perret
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-07 19:31 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 07:27:53PM +0000, Quentin Perret wrote:
> On Thursday 07 Nov 2019 at 19:43:56 (+0100), Peter Zijlstra wrote:
> > But you mean something like:
> > 
> > 	for (class = prev->sched_class; class; class = class->next) {
> > 		if (class->balance(rq, rf))
> > 			break;
> > 	}
> > 
> > 	put_prev_task(rq, prev);
> > 
> > 	for_each_class(class) {
> > 		p = class->pick_next_task(rq);
> > 		if (p)
> > 			return p;
> > 	}
> > 
> > 	BUG();
> > 
> > like?
> 
> Right, something like that, though what I had was basically doing the
> pull from within the pick_next_task_*() functions directly, like we were
> doing before. I'm now seeing how easy it is to get this wrong, and that
> even good-looking code in this area can be broken in very subtle ways,
> so I didn't feel comfortable refactoring again so close to rc7. If you
> feel more confident, I'm more than happy to test a patch implemeting the
> above :)

Thing is, if we revert (and we might have to), we'll have to revert more
than just the one patch due to that other (__pick_migrate_task) borkage
that got reported today.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 19:31                   ` Peter Zijlstra
@ 2019-11-07 19:42                     ` Quentin Perret
  0 siblings, 0 replies; 38+ messages in thread
From: Quentin Perret @ 2019-11-07 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thursday 07 Nov 2019 at 20:31:34 (+0100), Peter Zijlstra wrote:
> Thing is, if we revert (and we might have to), we'll have to revert more
> than just the one patch due to that other (__pick_migrate_task) borkage
> that got reported today.

Fair enough, let's try and avoid reverting the world if possible ...
I'll give a go to your patch tomorrow first thing.

Cheers,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 19:29                 ` Peter Zijlstra
@ 2019-11-08 11:02                   ` Quentin Perret
  2019-11-08 11:47                     ` Valentin Schneider
  2019-11-08 12:00                     ` Peter Zijlstra
  2019-11-08 11:55                   ` Peter Zijlstra
  1 sibling, 2 replies; 38+ messages in thread
From: Quentin Perret @ 2019-11-08 11:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thursday 07 Nov 2019 at 20:29:07 (+0100), Peter Zijlstra wrote:
> I still haven't had food, but this here compiles...

And it seems to work, too :)

> @@ -3929,13 +3929,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	}
> 
>  restart:
> -	/*
> -	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> -	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> -	 */
> -	prev->sched_class->put_prev_task(rq, prev, rf);
> -	if (!rq->nr_running)
> -		newidle_balance(rq, rf);
> +#ifdef CONFIG_SMP
> +	for (class = prev->sched_class;
> +	     class != &idle_sched_class;
> +	     class = class->next) {
> +
> +		if (class->balance(rq, prev, rf))
> +			break;
> +	}
> +#endif
> +
> +	put_prev_task(rq, prev);

Right, that looks much cleaner IMO. I'm thinking if we killed the
special case for CFS above we could do with a single loop to iterate the
classes, and you could fold ->balance() into ->pick_next_task() ...
That would remove one call site to newidle_balance() too, which I think
is good. Hackbench probably won't like that, though.
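
Roughly, something like the below is what I was picturing (very rough
sketch, hand-waving over where prev gets put and whether RETRY_TASK can
still happen once each class pulls for itself):

	for_each_class(class) {
		p = class->pick_next_task(rq, prev, rf);
		if (p)
			return p;
	}

	BUG();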

>  	for_each_class(class) {
>  		p = class->pick_next_task(rq, NULL, NULL);
> @@ -6201,7 +6205,7 @@ static struct task_struct *__pick_migrate_task(struct rq *rq)
>  	for_each_class(class) {
>  		next = class->pick_next_task(rq, NULL, NULL);
>  		if (next) {
> -			next->sched_class->put_prev_task(rq, next, NULL);
> +			next->sched_class->put_prev_task(rq, next);
>  			return next;
>  		}
>  	}
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 2dc48720f189..b6c3fb10cf57 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1778,15 +1778,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	return p;
>  }
> 
> -static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> +static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>  {
> -	update_curr_dl(rq);
> -
> -	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
> -	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
> -		enqueue_pushable_dl_task(rq, p);
> -

Ah, and this can actually be done after the pull because the two are in
fact mutually exclusive. And same thing for RT. Good :)

> -	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
> +	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
>  		/*
>  		 * This is OK, because current is on_cpu, which avoids it being
>  		 * picked for load-balance and preemption/IRQs are still
> @@ -1797,6 +1791,18 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_fla
>  		pull_dl_task(rq);
>  		rq_repin_lock(rq, rf);
>  	}
> +
> +	return rq->dl.dl_nr_running > 0;
> +}
> +
> +
> +static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> +{
> +	update_curr_dl(rq);
> +
> +	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
> +		enqueue_pushable_dl_task(rq, p);
>  }
> 
>  /*
> @@ -2438,6 +2444,7 @@ const struct sched_class dl_sched_class = {
>  	.check_preempt_curr	= check_preempt_curr_dl,
> 
>  	.pick_next_task		= pick_next_task_dl,
> +	.balance		= balance_dl,
>  	.put_prev_task		= put_prev_task_dl,
>  	.set_next_task		= set_next_task_dl,
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a14487462b6c..6b983214e00f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6746,10 +6746,18 @@ done: __maybe_unused;
>  	return NULL;
>  }
> 
> +static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +	if (rq->cfs.nr_running)
> +		return 1;
> +
> +	return newidle_balance(rq, rf) != 0;

And you can ignore the RETRY_TASK case here under the assumption that
we must have tried to pull from RT/DL before ending up here ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 11:02                   ` Quentin Perret
@ 2019-11-08 11:47                     ` Valentin Schneider
  2019-11-08 11:58                       ` Quentin Perret
  2019-11-08 12:00                     ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Valentin Schneider @ 2019-11-08 11:47 UTC (permalink / raw)
  To: Quentin Perret, Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, mingo, pauld, jdesfossez,
	naravamudan, vincent.guittot, dietmar.eggemann, juri.lelli,
	rostedt, bsegall, mgorman, kernel-team, john.stultz

On 08/11/2019 11:02, Quentin Perret wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a14487462b6c..6b983214e00f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6746,10 +6746,18 @@ done: __maybe_unused;
>>  	return NULL;
>>  }
>>
>> +static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> +{
>> +	if (rq->cfs.nr_running)
>> +		return 1;
>> +
>> +	return newidle_balance(rq, rf) != 0;
> 
> And you can ignore the RETRY_TASK case here under the assumption that
> we must have tried to pull from RT/DL before ending up here ?
> 

I think we can ignore RETRY_TASK because this happens before the picking loop,
so we'll observe any new DL/RT task that got enqueued while newidle released
the lock. This also means we can safely break the balance loop in
pick_next_task() when we get RETRY_TASK, because we've got something to pick
(some new RT/DL task). This wants a comment though, methinks.
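
Maybe something along these lines, e.g. above balance_fair() (wording and
placement just a suggestion):

	/*
	 * newidle_balance() may drop and re-take rq->lock; by the time we
	 * return, a higher-priority (RT/DL) task may have been enqueued.
	 * That is fine: balance() runs before the pick loop, so the pick
	 * loop will simply see and pick that task -- no RETRY_TASK needed.
	 */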

Other than that I agree with Quentin, it's a much cleaner approach and I quite
like it.

> Thanks,
> Quentin
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-07 19:29                 ` Peter Zijlstra
  2019-11-08 11:02                   ` Quentin Perret
@ 2019-11-08 11:55                   ` Peter Zijlstra
  2019-11-08 12:52                     ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-08 11:55 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Thu, Nov 07, 2019 at 08:29:07PM +0100, Peter Zijlstra wrote:
> I still haven't had food, but this here compiles...

A more polished patch.

---
Subject: sched: Fix pick_next_task() vs 'change' pattern race
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 8 11:11:52 CET 2019

Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertently introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mention the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   17 +++++++++--------
 kernel/sched/deadline.c  |   40 ++++++++++++++++++++--------------------
 kernel/sched/fair.c      |   15 ++++++++++++---
 kernel/sched/idle.c      |    9 ++++++++-
 kernel/sched/rt.c        |   37 +++++++++++++++++++------------------
 kernel/sched/sched.h     |   30 +++++++++++++++++++++++++++---
 kernel/sched/stop_task.c |   18 +++++++++++-------
 7 files changed, 106 insertions(+), 60 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3929,13 +3929,14 @@ pick_next_task(struct rq *rq, struct tas
 	}
 
 restart:
-	/*
-	 * Ensure that we put DL/RT tasks before the pick loop, such that they
-	 * can PULL higher prio tasks when we lower the RQ 'priority'.
-	 */
-	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
-		newidle_balance(rq, rf);
+#ifdef CONFIG_SMP
+	for_class_range(class, prev->sched_class, &idle_sched_class) {
+		if (class->balance(rq, prev, rf))
+			break;
+	}
+#endif
+
+	put_prev_task(rq, prev);
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -6201,7 +6202,7 @@ static struct task_struct *__pick_migrat
 	for_each_class(class) {
 		next = class->pick_next_task(rq, NULL, NULL);
 		if (next) {
-			next->sched_class->put_prev_task(rq, next, NULL);
+			next->sched_class->put_prev_task(rq, next);
 			return next;
 		}
 	}
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1691,6 +1691,22 @@ static void check_preempt_equal_dl(struc
 	resched_curr(rq);
 }
 
+static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+{
+	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+
+	return sched_stop_runnable(rq) || sched_dl_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -1758,45 +1774,28 @@ static struct task_struct *
 pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_dl_entity *dl_se;
+	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
-	struct dl_rq *dl_rq;
 
 	WARN_ON_ONCE(prev || rf);
 
-	dl_rq = &rq->dl;
-
-	if (unlikely(!dl_rq->dl_nr_running))
+	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
-
 	p = dl_task_of(dl_se);
-
 	set_next_task_dl(rq, p);
-
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
-
-	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we've
-		 * not yet started the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-	}
 }
 
 /*
@@ -2442,6 +2441,7 @@ const struct sched_class dl_sched_class
 	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6423,6 +6423,15 @@ static void task_dead_fair(struct task_s
 {
 	remove_entity_load_avg(&p->se);
 }
+
+static int
+balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	if (rq->nr_running)
+		return 1;
+
+	return newidle_balance(rq, rf) != 0;
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -6599,7 +6608,7 @@ pick_next_task_fair(struct rq *rq, struc
 	int new_tasks;
 
 again:
-	if (!cfs_rq->nr_running)
+	if (!sched_fair_runnable(rq))
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6737,7 +6746,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
@@ -10597,11 +10606,11 @@ const struct sched_class fair_sched_clas
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
-
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -365,6 +365,12 @@ select_task_rq_idle(struct task_struct *
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
+
+static int
+balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return WARN_ON_ONCE(1);
+}
 #endif
 
 /*
@@ -375,7 +381,7 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
 }
 
@@ -460,6 +466,7 @@ const struct sched_class idle_sched_clas
 	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1469,6 +1469,22 @@ static void check_preempt_equal_prio(str
 	resched_curr(rq);
 }
 
+static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+{
+	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+
+	return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -1552,21 +1568,18 @@ static struct task_struct *
 pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *p;
-	struct rt_rq *rt_rq = &rq->rt;
 
 	WARN_ON_ONCE(prev || rf);
 
-	if (!rt_rq->rt_queued)
+	if (!sched_rt_runnable(rq))
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-
 	set_next_task_rt(rq, p);
-
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
@@ -1578,18 +1591,6 @@ static void put_prev_task_rt(struct rq *
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
-
-	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we've
-		 * not yet started the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-	}
 }
 
 #ifdef CONFIG_SMP
@@ -2366,8 +2367,8 @@ const struct sched_class rt_sched_class
 	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_rt,
 	.select_task_rq		= select_task_rq_rt,
-
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
 	.rq_offline             = rq_offline_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1727,10 +1727,11 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
+	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
@@ -1773,7 +1774,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev, NULL);
+	prev->sched_class->put_prev_task(rq, prev);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
@@ -1787,8 +1788,12 @@ static inline void set_next_task(struct
 #else
 #define sched_class_highest (&dl_sched_class)
 #endif
+
+#define for_class_range(class, _from, _to) \
+	for (class = (_from); class != (_to); class = class->next)
+
 #define for_each_class(class) \
-   for (class = sched_class_highest; class; class = class->next)
+	for_class_range(class, sched_class_highest, NULL)
 
 extern const struct sched_class stop_sched_class;
 extern const struct sched_class dl_sched_class;
@@ -1796,6 +1801,25 @@ extern const struct sched_class rt_sched
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 
+static inline bool sched_stop_runnable(struct rq *rq)
+{
+	return rq->stop && task_on_rq_queued(rq->stop);
+}
+
+static inline bool sched_dl_runnable(struct rq *rq)
+{
+	return rq->dl.dl_nr_running > 0;
+}
+
+static inline bool sched_rt_runnable(struct rq *rq)
+{
+	return rq->rt.rt_queued > 0;
+}
+
+static inline bool sched_fair_runnable(struct rq *rq)
+{
+	return rq->cfs.nr_running > 0;
+}
 
 #ifdef CONFIG_SMP
 
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -15,6 +15,12 @@ select_task_rq_stop(struct task_struct *
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
+
+static int
+balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return sched_stop_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 static void
@@ -31,16 +37,13 @@ static void set_next_task_stop(struct rq
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *stop = rq->stop;
-
 	WARN_ON_ONCE(prev || rf);
 
-	if (!stop || !task_on_rq_queued(stop))
+	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, stop);
-
-	return stop;
+	set_next_task_stop(rq, rq->stop);
+	return rq->stop;
 }
 
 static void
@@ -60,7 +63,7 @@ static void yield_task_stop(struct rq *r
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
@@ -129,6 +132,7 @@ const struct sched_class stop_sched_clas
 	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 11:47                     ` Valentin Schneider
@ 2019-11-08 11:58                       ` Quentin Perret
  0 siblings, 0 replies; 38+ messages in thread
From: Quentin Perret @ 2019-11-08 11:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Kirill Tkhai, linux-kernel, aaron.lwe, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Friday 08 Nov 2019 at 11:47:44 (+0000), Valentin Schneider wrote:
> I think we can ignore RETRY_TASK because this happens before the picking loop,
> so we'll observe any new DL/RT task that got enqueued while newidle released
> the lock. This also means we can safely break the balance loop in
> pick_next_task() when we get RETRY_TASK, because we've got something to pick
> (some new RT/DL task).

Ah right, the second loop always iterates from DL, so that works.
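
Concretely, with the ->balance() patch the restart path composes roughly
like this (just a sketch to spell it out, comments mine):

restart:
	/*
	 * Balance pass: runs before anything has been picked, starting at
	 * prev's class. balance_fair() returns newidle_balance(rq, rf) != 0,
	 * so the old RETRY_TASK case simply reads as "something is runnable
	 * now" and terminates the pass early.
	 */
	for_class_range(class, prev->sched_class, &idle_sched_class) {
		if (class->balance(rq, prev, rf))
			break;
	}

	put_prev_task(rq, prev);

	/*
	 * Pick pass: always walks down from the highest class, so any DL/RT
	 * task enqueued while newidle had dropped the lock is seen here.
	 */
	for_each_class(class) {
		p = class->pick_next_task(rq, NULL, NULL);
		if (p)
			return p;
	}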

> This wants a comment though, methinks.

+1 :)

Thanks,
Quentin

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 11:02                   ` Quentin Perret
  2019-11-08 11:47                     ` Valentin Schneider
@ 2019-11-08 12:00                     ` Peter Zijlstra
  2019-11-08 12:15                       ` Quentin Perret
  2019-11-08 12:24                       ` Peter Zijlstra
  1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-08 12:00 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Fri, Nov 08, 2019 at 11:02:12AM +0000, Quentin Perret wrote:
> On Thursday 07 Nov 2019 at 20:29:07 (+0100), Peter Zijlstra wrote:
> > I still haven't had food, but this here compiles...
> 
> And it seems to work, too :)

Excellent!

> > @@ -3929,13 +3929,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  	}
> > 
> >  restart:
> > -	/*
> > -	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> > -	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> > -	 */
> > -	prev->sched_class->put_prev_task(rq, prev, rf);
> > -	if (!rq->nr_running)
> > -		newidle_balance(rq, rf);
> > +#ifdef CONFIG_SMP
> > +	for (class = prev->sched_class;
> > +	     class != &idle_sched_class;
> > +	     class = class->next) {
> > +
> > +		if (class->balance(rq, prev, rf))
> > +			break;
> > +	}
> > +#endif
> > +
> > +	put_prev_task(rq, prev);
> 
> Right, that looks much cleaner IMO. I'm thinking if we killed the
> special case for CFS above we could do with a single loop to iterate the
> classes, and you could fold ->balance() in ->pick_next_task() ...

No, you can't, because then you're back to having to restart the pick
when something happens while we've dropped the rq lock halfway down the
pick, which is something I really wanted to get rid of.

> That would remove one call site to newidle_balance() too, which I think
> is good. Hackbench probably won't like that, though.

Yeah, that fast path really is important. I've got a few patches pending
there, fixing a few things and that gets me 2% extra on a sched-yield
benchmark.

> > +static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +{
> > +	if (rq->cfs.nr_running)

FWIW that must test rq->nr_running.

> > +		return 1;
> > +
> > +	return newidle_balance(rq, rf) != 0;
> 
> And you can ignore the RETRY_TASK case here under the assumption that
> we must have tried to pull from RT/DL before ending up here?

Well, the balance callback can always return 0 and be correct. The point
is mostly to avoid more work.

In this case, since fair is the last class and we've already excluded
idle for balancing, the win is minimal. But since we have the code,
might as well dtrt.
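
That is, the only contract is: return nonzero once you know a task of
@class priority or higher is runnable. A hypothetical class that can't
be bothered could even do the below and still be correct; it would just
make pick_next_task() call ->balance() on every lower class as well:

/*
 * Hypothetical, maximally lazy implementation: never claims to have work,
 * so the caller's balance loop carries on to the lower classes before
 * picking. Correct, just more work than strictly needed.
 */
static int
balance_lazy(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	return 0;
}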

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 12:00                     ` Peter Zijlstra
@ 2019-11-08 12:15                       ` Quentin Perret
  2019-11-08 12:35                         ` Peter Zijlstra
  2019-11-08 12:24                       ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Quentin Perret @ 2019-11-08 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Friday 08 Nov 2019 at 13:00:35 (+0100), Peter Zijlstra wrote:
> On Fri, Nov 08, 2019 at 11:02:12AM +0000, Quentin Perret wrote:
> > On Thursday 07 Nov 2019 at 20:29:07 (+0100), Peter Zijlstra wrote:
> > > I still haven't had food, but this here compiles...
> > 
> > And it seems to work, too :)
> 
> Excellent!
> 
> > > @@ -3929,13 +3929,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >  	}
> > > 
> > >  restart:
> > > -	/*
> > > -	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> > > -	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> > > -	 */
> > > -	prev->sched_class->put_prev_task(rq, prev, rf);
> > > -	if (!rq->nr_running)
> > > -		newidle_balance(rq, rf);
> > > +#ifdef CONFIG_SMP
> > > +	for (class = prev->sched_class;
> > > +	     class != &idle_sched_class;
> > > +	     class = class->next) {
> > > +
> > > +		if (class->balance(rq, prev, rf))
> > > +			break;
> > > +	}
> > > +#endif
> > > +
> > > +	put_prev_task(rq, prev);
> > 
> > Right, that looks much cleaner IMO. I'm thinking if we killed the
> > special case for CFS above we could do with a single loop to iterate the
> > classes, and you could fold ->balance() in ->pick_next_task() ...
> 
> No, you can't, because then you're back to having to restart the pick
> when something happens while we've dropped the rq lock halfway down the
> pick, which is something I really wanted to get rid of.

Right, with a single loop you'll have to re-iterate the classes from
the start in case of RETRY_TASK, but you're re-iterating all the classes
too with this patch. You're doing a little less work in the second loop
though, so maybe it's worth it. And I was the one worried about
refactoring the code too much this close to the release, so maybe that's for
another time ;)
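
FWIW the folded variant I had in mind is roughly the below (hypothetical,
not what your patch does), and the goto is exactly the restart you want
to kill:

restart:
	for_each_class(class) {
		/*
		 * Hypothetical: each class balances *and* picks in one go,
		 * so dropping rq->lock halfway down the pick can invalidate
		 * the classes already walked...
		 */
		p = class->pick_next_task(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto restart;	/* ...and we re-iterate from the top */
		if (p)
			return p;
	}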

Thanks,
Quentin

* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 12:00                     ` Peter Zijlstra
  2019-11-08 12:15                       ` Quentin Perret
@ 2019-11-08 12:24                       ` Peter Zijlstra
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-08 12:24 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Fri, Nov 08, 2019 at 01:00:34PM +0100, Peter Zijlstra wrote:
> > That would remove one call site to newidle_balance() too, which I think
> > is good. Hackbench probably won't like that, though.
> 
> Yeah, that fast path really is important. I've got a few patches pending
> there, fixing a few things and that gets me 2% extra on a sched-yield
> benchmark.

That is, the fast path also allows a cpu-cgroup optimization that wins
something in the order of 3% for cgroup workloads.

The cgroup optimization is basically that when we schedule from
fair->fair, we can avoid having to put/set the whole cgroup hierarchy
but only have to update the part that changed.

Couple that with the set_next_buddy() from dequeue_task_fair(), which
results in the next task being more likely to be from the same cgroup,
and you've got a win.
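
For reference, the hierarchy part is the existing walk in
pick_next_task_fair() that only does put_prev_entity()/set_next_entity()
up to the first cfs_rq that prev and next share, roughly (trimmed):

	struct sched_entity *pse = &prev->se;

	/*
	 * Walk both hierarchies upwards until they meet; everything above
	 * the common ancestor stays in place and is neither put nor set.
	 */
	while (!(cfs_rq = is_same_group(se, pse))) {
		int se_depth = se->depth;
		int pse_depth = pse->depth;

		if (se_depth <= pse_depth) {
			put_prev_entity(cfs_rq_of(pse), pse);
			pse = parent_entity(pse);
		}
		if (se_depth >= pse_depth) {
			set_next_entity(cfs_rq_of(se), se);
			se = parent_entity(se);
		}
	}

	put_prev_entity(cfs_rq, pse);
	set_next_entity(cfs_rq, se);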



* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 12:15                       ` Quentin Perret
@ 2019-11-08 12:35                         ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-08 12:35 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Fri, Nov 08, 2019 at 12:15:26PM +0000, Quentin Perret wrote:
> Right, with a single loop you'll have to re-iterate the classes from
> the start in case of RETRY_TASK, but you're re-iterating all the classes
> too with this patch. You're doing a little less work in the second loop
> though, so maybe it's worth it. 

The thing with the restart is that it'll make your head explode when we
go do core-scheduling.

> And I was the one worried about
> refactoring the code too much close to the release, so maybe that's for
> another time ;)

It is either this patch or reverting a bunch of patches :/



* Re: NULL pointer dereference in pick_next_task_fair
  2019-11-08 11:55                   ` Peter Zijlstra
@ 2019-11-08 12:52                     ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2019-11-08 12:52 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Kirill Tkhai, linux-kernel, aaron.lwe, valentin.schneider, mingo,
	pauld, jdesfossez, naravamudan, vincent.guittot,
	dietmar.eggemann, juri.lelli, rostedt, bsegall, mgorman,
	kernel-team, john.stultz

On Fri, Nov 08, 2019 at 12:55:57PM +0100, Peter Zijlstra wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3929,13 +3929,14 @@ pick_next_task(struct rq *rq, struct tas
>  	}
>  
>  restart:
> -	/*
> -	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> -	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> -	 */
> -	prev->sched_class->put_prev_task(rq, prev, rf);
> -	if (!rq->nr_running)
> -		newidle_balance(rq, rf);
> +#ifdef CONFIG_SMP
	/*
	 * We must do the balancing pass before put_prev_task(), such
	 * that when we release the rq->lock the task is in the same
	 * state as before we took rq->lock.
	 *
	 * We can terminate the balance pass as soon as we know there is
	 * a runnable task of @class priority or higher.
	 */
> +	for_class_range(class, prev->sched_class, &idle_sched_class) {
> +		if (class->balance(rq, prev, rf))
> +			break;
> +	}
> +#endif
> +
> +	put_prev_task(rq, prev);
>  
>  	for_each_class(class) {
>  		p = class->pick_next_task(rq, NULL, NULL);
