linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RT] BUG in sched/cpupri.c
@ 2021-12-18 14:25 John Keeping
  2021-12-20 17:35 ` Dietmar Eggemann
  0 siblings, 1 reply; 16+ messages in thread
From: John Keeping @ 2021-12-18 14:25 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

Hi,

On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
occasional BUG at cpupri.c:151 (full trace below).

Having added extra logging, it seems that p->prio == 120 which isn't
handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").

This happens maybe half the time as userspace is starting up, but if the
system boots to a login prompt I haven't seen any issues after that.
The process isn't always the same, I've seen systemd-udevd as well as
pr/ttyS2.

I can easily "fix" this by handling normal priority tasks in
convert_prio() but I guess there's some wider reason why that's not an
expected value there, so perhaps the real problem lies elsewhere.


Thanks,
John

------------[ cut here ]------------
kernel BUG at kernel/sched/cpupri.c:151!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP ARM
Modules linked in:
CPU: 1 PID: 117 Comm: pr/ttyS2 Tainted: G    B             5.15.10-rt24 #1
Hardware name: Rockchip (Device Tree)
PC is at cpupri_find_fitness+0x78/0x1a4
LR is at cpupri_find_fitness+0x28/0x1a4
pc : [<c0183be8>]    lr : [<c0183b98>]    psr: 20030193
sp : c3ea38f8  ip : c38461bc  fp : 00000001
r10: e7db8b00  r9 : 00000001  r8 : c1ccd880
r7 : c3846180  r6 : c3846180  r5 : 00000000  r4 : e7db13dc
r3 : c0183b90  r2 : 00000000  r1 : 00000007  r0 : 00000078
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 0296406a  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: non-slab/vmalloc memory
Register r4 information: non-slab/vmalloc memory
Register r5 information: NULL pointer
Register r6 information: slab task_struct start c3846180 pointer offset 0
Register r7 information: slab task_struct start c3846180 pointer offset 0
Register r8 information: slab kmalloc-1k start c1ccd800 pointer offset 128 size 1024
Register r9 information: non-paged memory
Register r10 information: non-slab/vmalloc memory
Register r11 information: non-paged memory
Register r12 information: slab task_struct start c3846180 pointer offset 60
Process pr/ttyS2 (pid: 117, stack limit = 0x(ptrval))
Stack: (0xc3ea38f8 to 0xc3ea4000)
38e0:                                                       187d4724 00000078
3900: 41b58ab3 c11a824a c016ffcc 00000001 e7db13dc c1388b00 c3846180 c1408e04
3920: 00000001 e7db8b00 00000001 c017a378 c017bcd0 e7db9098 e7db8b00 c1d84b00
3940: 00000001 e7db9290 e7db9098 c1ccd858 c1ccd858 c017bcd8 e7db8b00 c1ccd858
3960: e7db9008 c1ccd800 c1408e04 00000008 c1ccd858 c017cfc8 c1ccd85c c1ccd858
3980: 0000002b 0000002a c1ccd85c c026718c 00000000 00000000 c1ccd858 00000000
39a0: c1ccd85c 00000020 c1ccd858 c01e7300 c17a5524 e7db8b40 00000000 0181b72f
39c0: c3846288 00000003 c17a4fe0 00000001 00000013 c0d0ab00 f0802000 26a30000
39e0: c15c4dc0 c01114e0 c1cc7000 c1c3db80 c1409064 00000013 c0111664 f0802000
3a00: 26a30000 c0111678 c1cc7000 c01a05d4 c1cc7000 c3ea0000 00000003 c1409064
3a20: 00000003 f0802000 26a30000 c0198080 00000000 00000000 00000003 c019894c
3a40: c1387850 00000003 c3ea3a80 c06417a0 c3ea3a80 40030193 00000000 c0cda8ec
3a60: 60030013 ffffffff c3ea3ab4 c40893f8 c3ea0000 c1408e04 c3ea3b8c c0100b00
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff 00000051 c0cda8e8 c4088f00 c0167624 c122da30 187d4760
3ae0: c1388b00 80030013 e7db8b00 e7db8b00 c17a5dc0 c02353f0 00000000 00000000
3b00: 41b58ab3 c11a63b8 c01673fc c016580c 00000001 c3ea0004 26a30000 c3ea0000
3b20: c3ea3b84 c023523c c1d84b00 c0cda838 c17a5dc0 c02353f0 00000000 00000000
3b40: 00000000 c3ea0000 c1387020 c0cda8e8 00000001 c01640b4 00000001 c3ea0000
3b60: c3ea3c20 f6251f9e c3ea3b84 c181a338 b77d4774 c3ea3c20 c3ea0000 c3ea0000
3b80: c3ea3be0 c4073d18 c181a344 c0cd5ba8 80030013 c3ea000c 00000001 c3ea0000
3ba0: 41b58ab3 c11a889d c0cd59a8 00000003 c17a5dc0 c02353f0 ffffc000 c0235580
3bc0: c01641c8 c01640c4 00000000 00000001 c3ea0000 40030013 c38463c4 c14b7044
3be0: 00000001 c3ea3be0 c4088f00 c01641dc 40030013 c38463c4 40030013 c3846180
3c00: c14b7098 c0cda91c c14b7040 c01b5cc4 c01b5f2c 00000000 00000000 00000000
3c20: 00000001 00000000 00000000 00000001 c0d39640 f6251f9e 00000000 b77d4790
3c40: c181a338 c3ea3ce0 c3ea3ca0 c181a344 00000000 00000007 c181a4b8 c0cd5dac
3c60: 41b58ab3 c11e5383 c061d930 c023523c c17a5dc0 c02353f0 00000000 00000000
3c80: 41b58ab3 c11a8997 c0cd5cf0 c02353f0 00000000 00000000 00000000 c3ea0000
3ca0: c3846180 c06d95c4 00000001 c3ea0004 26a30000 00000007 c181a4b8 c02340c4
3cc0: c181a338 60030013 00000001 00000007 00000036 00000000 00000007 f6251f9e
3ce0: c181a338 b77d47a4 c181a4c4 c3f59000 00000036 c06ddb94 00000035 00000035
3d00: c3f59000 c3f59000 00000035 c0192a98 00000035 187d47a8 c3ea3e44 000003fe
3d20: 41b58ab3 c11eea53 c06dd92c c01b5eec b77d47ac c1448740 c3ea3dc0 c3ea3d80
3d40: 41b58ab3 c11a8f2c c0192900 c0cd5d9c c161de70 c0197ba4 26a20000 0000003e
3d60: 41b58ab3 c023523c c0cd5cf0 c0cda838 ffffffff c0cda838 00000000 c3ea0000
3d80: c17a5dc0 c02353f0 ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001
3da0: c3ea0000 c3ea3f40 00000000 00000000 00000243 00000036 c3ea3dd4 f6251f9e
3dc0: c3ea3f40 c161de40 c3efed00 c3ea3f40 00000000 00000000 00000243 00000036
3de0: c161de70 c0194fd8 c3ea3e40 000016ca c14d1250 00000000 c3f59000 00000000
3e00: 187d47c4 c3f59000 00000017 c3ea0000 c1387020 c016580c 00000001 c3ea0004
3e20: 41b58ab3 c11a9104 c0194bb0 c02340c4 c1d80000 e7db8b00 00000000 c3ea0000
3e40: c3ea3ea0 c3f59000 00000400 c0165830 00000002 c01b5eec 00000000 00000002
3e60: 00000000 c3846180 c01828b4 c3ea3e6c c3ea3e6c c1d80000 c1512280 c3846188
3e80: c384640c 00000001 c3ea3f54 c0cd01b0 c1d64000 b77d47d8 c3ea3f54 c0167508
3ea0: 00000242 00000000 8bb3f10e 00000001 c2030035 00000001 00000000 00000000
3ec0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3ee0: 00000000 00000000 00000000 00000000 00000000 00000000 c17a5dc0 c02353f0
3f00: ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001 c3ea0000 c3efec80
3f20: 60000013 c3ea000c 000004f8 c3846180 c3ea3f4c c01641dc c3efec80 60000013
3f40: 60000013 ffffc000 c3846180 f6251f9e c3ea0000 c3846180 c3efec80 00000000
3f60: c3846180 c0194bb0 c161de40 c1d67d20 c3846180 c0157eb0 00000000 c3846180
3f80: c3efeca0 c3efec0c 00000000 c3efec00 c0157c74 00000000 00000000 00000000
3fa0: 00000000 00000000 00000000 c01000fc 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[<c0183be8>] (cpupri_find_fitness) from [<c017a378>] (find_lowest_rq+0x1bc/0x258)
[<c017a378>] (find_lowest_rq) from [<c017bcd8>] (push_rt_task.part.0+0xe8/0x23c)
[<c017bcd8>] (push_rt_task.part.0) from [<c017cfc8>] (rto_push_irq_work_func+0x7c/0xd8)
[<c017cfc8>] (rto_push_irq_work_func) from [<c026718c>] (irq_work_single+0x8c/0x140)
[<c026718c>] (irq_work_single) from [<c01e7300>] (flush_smp_call_function_queue+0x238/0x31c)
[<c01e7300>] (flush_smp_call_function_queue) from [<c01114e0>] (do_handle_IPI+0x29c/0x420)
[<c01114e0>] (do_handle_IPI) from [<c0111678>] (ipi_handler+0x14/0x20)
[<c0111678>] (ipi_handler) from [<c01a05d4>] (handle_percpu_devid_irq+0x8c/0x140)
[<c01a05d4>] (handle_percpu_devid_irq) from [<c0198080>] (handle_irq_desc+0x38/0x48)
[<c0198080>] (handle_irq_desc) from [<c019894c>] (handle_domain_irq+0x40/0x54)
[<c019894c>] (handle_domain_irq) from [<c06417a0>] (gic_handle_irq+0x88/0xa0)
[<c06417a0>] (gic_handle_irq) from [<c0100b00>] (__irq_svc+0x60/0xac)
Exception stack(0xc3ea3a80 to 0xc3ea3ac8)
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff
[<c0100b00>] (__irq_svc) from [<c0cda8ec>] (_raw_spin_unlock_irqrestore+0x1c/0x70)
[<c0cda8ec>] (_raw_spin_unlock_irqrestore) from [<c0167624>] (try_to_wake_up+0x228/0x468)
[<c0167624>] (try_to_wake_up) from [<c0cd5ba8>] (rt_mutex_slowunlock+0x200/0x348)
[<c0cd5ba8>] (rt_mutex_slowunlock) from [<c0cd5dac>] (rt_spin_unlock+0xbc/0x104)
[<c0cd5dac>] (rt_spin_unlock) from [<c06ddb94>] (serial8250_console_write+0x268/0x39c)
[<c06ddb94>] (serial8250_console_write) from [<c0194fd8>] (printk_kthread_func+0x428/0x4c4)
[<c0194fd8>] (printk_kthread_func) from [<c0157eb0>] (kthread+0x23c/0x25c)
[<c0157eb0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc3ea3fb0 to 0xc3ea3ff8)
3fa0:                                     00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e1a00008 e28dd014 e8bd4ff0 ea00004a (e7f001f2) 
---[ end trace 0000000000000002 ]---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
@ 2021-12-20 17:35 ` Dietmar Eggemann
  2021-12-21 16:11   ` Valentin Schneider
  0 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2021-12-20 17:35 UTC (permalink / raw)
  To: John Keeping, linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 18.12.21 15:25, John Keeping wrote:
> Hi,
> 
> On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
> occasional BUG at cpupri.c:151 (full trace below).
> 
> Having added extra logging, it seems that p->prio == 120 which isn't
> handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
> Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").
> 
> This happens maybe half the time as userspace is starting up, but if the
> system boots to a login prompt I haven't seen any issues after that.
> The process isn't always the same, I've seen systemd-udevd as well as
> pr/ttyS2.
> 
> I can easily "fix" this by handling normal priority tasks in
> convert_prio() but I guess there's some wider reason why that's not an
> expected value there, so perhaps the real problem lies elsewhere.

find_lowest_rq() -> [ cpupri_find() ] -> cpupri_find_fitness() ->
convert_prio(p->prio) can only be called for an RT task (p->prio=[0..98])

I can recreate this on my Juno-r0 Arm64 machine with v5.15.10-rt24,
CONFIG_PREEMPT_RT=y .

------------[ cut here ]------------
[   15.256482] WARNING: CPU: 3 PID: 35 at kernel/sched/rt.c:1898 push_rt_task.part.0+0x190/0x364
[   15.256521] Modules linked in:
[   15.256532] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #14
[   15.256546] Hardware name: ARM Juno development board (r0) (DT)
[   15.256552] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   15.256567] pc : push_rt_task.part.0+0x190/0x364
[   15.256582] lr : push_rt_task.part.0+0x20/0x364
[   15.256597] sp : ffff800011ff3e20
[   15.256602] x29: ffff800011ff3e20 x28: ffff0008001ca940 x27: 0000000000000000
[   15.256622] x26: 0000000000000000 x25: ffff800011f61da0 x24: 0000000000000001
[   15.256639] x23: 0000000000000000 x22: ffff00097ef923c0 x21: ffff0008000e0dc0
[   15.256657] x20: ffff0008000e0dc0 x19: ffff00097ef923c0 x18: 0000000000000002
[   15.256674] x17: ffff80096d7e7000 x16: ffff800011ff4000 x15: 0000000000004000
[   15.256692] x14: 00000000000000c1 x13: 0000000000000001 x12: 0000000000000000
[   15.256709] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400
[   15.256725] x8 : ffff00097ef924c0 x7 : ffff00097ef92440 x6 : 00000000356c5f2b
[   15.256743] x5 : ffff80096d7e7000 x4 : 0000000000000000 x3 : 0000000000000003
[   15.256760] x2 : ffff00097ef923c0 x1 : 0000000000000062 x0 : 0000000000000078
[   15.256777] Call trace:
[   15.256783]  push_rt_task.part.0+0x190/0x364
[   15.256799]  rto_push_irq_work_func+0x180/0x190
[   15.256815]  irq_work_single+0x34/0xa0
[   15.256829]  flush_smp_call_function_queue+0x138/0x244
[   15.256845]  generic_smp_call_function_single_interrupt+0x18/0x24
[   15.256860]  ipi_handler+0xb0/0x15c
[   15.256875]  handle_percpu_devid_irq+0x88/0x140
[   15.256890]  handle_domain_irq+0x64/0x94
[   15.256907]  gic_handle_irq+0x50/0xf0
[   15.256924]  call_on_irq_stack+0x2c/0x60
[   15.256937]  do_interrupt_handler+0x54/0x60
[   15.256951]  el1_interrupt+0x30/0x80
[   15.256965]  el1h_64_irq_handler+0x1c/0x30
[   15.256978]  el1h_64_irq+0x78/0x7c
[   15.256989]  preempt_schedule_irq+0x40/0x150
[   15.257004]  el1_interrupt+0x60/0x80
[   15.257016]  el1h_64_irq_handler+0x1c/0x30
[   15.257029]  el1h_64_irq+0x78/0x7c
[   15.257039]  run_ksoftirqd+0x94/0xc0
[   15.257053]  smpboot_thread_fn+0x2d8/0x320
[   15.257066]  kthread+0x18c/0x1a0
[   15.257081]  ret_from_fork+0x10/0x20
[   15.257094] ---[ end trace 0000000000000002 ]---
[   16.257349] next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/3 35]

root@juno:~# ps -eTo comm,pid,lwp,pri,rtprio,nice,class | more
COMMAND           PID   LWP PRI RTPRIO  NI CLS
..
rcu_preempt        11    11  41      1   - FF
...
ksoftirqd/3        35    35  19      -   0 TS
...

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..798887f1eeff 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
                struct task_struct *push_task = NULL;
                int cpu;
 
+               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
+                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
+                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
+               }
+
                if (!pull || rq->push_busy)
                        return 0;
 
+               if (!rt_task(rq->curr))
+                       return 0;
+
                cpu = find_lowest_rq(rq->curr);
                if (cpu == -1 || cpu == rq->cpu)
                        return 0;

The 2. condition prevents the BUG_ON() in cpupri_find_fitness().

[...]

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-20 17:35 ` Dietmar Eggemann
@ 2021-12-21 16:11   ` Valentin Schneider
  2021-12-21 16:45     ` John Keeping
  0 siblings, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2021-12-21 16:11 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping, linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 20/12/21 18:35, Dietmar Eggemann wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..798887f1eeff 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 struct task_struct *push_task = NULL;
>                 int cpu;
>
> +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> +               }
> +
>                 if (!pull || rq->push_busy)
>                         return 0;
>
> +               if (!rt_task(rq->curr))
> +                       return 0;
> +

If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
case), that's buggered: we shouldn't be trying to pull RT tasks when we
have queued RT tasks and a less-than-RT current, we should be rescheduling
right now.

I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
CFS task (or straight up sched_setscheduler()):
check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
so I suspect we get to the push/pull callback before getting a
resched (I actually don't see where we'd get a resched in that case other
than at the next tick).

IOW, feels like we want the below. Unfortunately I can't reproduce the
issue locally (yet), so that's untested.

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded CPU, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	deadline_queue_pull_task(rq);
+	if (!rq->dl.dl_nr_running)
+		deadline_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+		resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..1ea2567612fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!rq->rt.rt_nr_running)
+		rt_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+		resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:11   ` Valentin Schneider
@ 2021-12-21 16:45     ` John Keeping
  2021-12-21 17:22       ` Valentin Schneider
  2021-12-22 17:46       ` Dietmar Eggemann
  0 siblings, 2 replies; 16+ messages in thread
From: John Keeping @ 2021-12-21 16:45 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Tue, 21 Dec 2021 16:11:34 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> On 20/12/21 18:35, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..798887f1eeff 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> >                 struct task_struct *push_task = NULL;
> >                 int cpu;
> >
> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> > +               }
> > +
> >                 if (!pull || rq->push_busy)
> >                         return 0;
> >
> > +               if (!rt_task(rq->curr))
> > +                       return 0;
> > +  
> 
> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> have queued RT tasks and a less-than-RT current, we should be rescheduling
> right now.
> 
> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> CFS task (or straight up sched_setscheduler()):
> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> so I suspect we get to the push/pull callback before getting a
> resched (I actually don't see where we'd get a resched in that case other
> than at the next tick).
> 
> IOW, feels like we want the below. Unfortunately I can't reproduce the
> issue locally (yet), so that's untested.

This patch doesn't make any difference for me - I hit the BUG on the
first boot with this applied.

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * this is the right place to try to pull some other one
>  	 * from an overloaded CPU, if any.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	deadline_queue_pull_task(rq);
> +	if (!rq->dl.dl_nr_running)
> +		deadline_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> +		resched_curr(rq);
>  }
>  
>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..1ea2567612fb 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>  	 * we may need to handle the pulling of RT tasks
>  	 * now.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	rt_queue_pull_task(rq);
> +	if (!rq->rt.rt_nr_running)
> +		rt_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> +		resched_curr(rq);
>  }
>  
>  void __init init_sched_rt_class(void)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:45     ` John Keeping
@ 2021-12-21 17:22       ` Valentin Schneider
  2021-12-21 17:42         ` John Keeping
  2021-12-22 17:46       ` Dietmar Eggemann
  1 sibling, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2021-12-21 17:22 UTC (permalink / raw)
  To: John Keeping
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 21/12/21 16:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <valentin.schneider@arm.com> wrote:
>
>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> > index ef8228d19382..798887f1eeff 100644
>> > --- a/kernel/sched/rt.c
>> > +++ b/kernel/sched/rt.c
>> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>> >                 struct task_struct *push_task = NULL;
>> >                 int cpu;
>> >
>> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
>> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
>> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
>> > +               }
>> > +
>> >                 if (!pull || rq->push_busy)
>> >                         return 0;
>> >
>> > +               if (!rt_task(rq->curr))
>> > +                       return 0;
>> > +  
>> 
>> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
>> case), that's buggered: we shouldn't be trying to pull RT tasks when we
>> have queued RT tasks and a less-than-RT current, we should be rescheduling
>> right now.
>> 
>> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
>> CFS task (or straight up sched_setscheduler()):
>> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
>> so I suspect we get to the push/pull callback before getting a
>> resched (I actually don't see where we'd get a resched in that case other
>> than at the next tick).
>> 
>> IOW, feels like we want the below. Unfortunately I can't reproduce the
>> issue locally (yet), so that's untested.
>
> This patch doesn't make any difference for me - I hit the BUG on the
> first boot with this applied.
>

Thanks for the swift testing!

Did you give Dietmar's patch a try? ITSM it lacks a resched_curr(), but if
we can somehow get to the push IRQ work before rescheduling (which I think
might happen if we try to resched_curr(this_rq)), then we need his
bailout.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 17:22       ` Valentin Schneider
@ 2021-12-21 17:42         ` John Keeping
  0 siblings, 0 replies; 16+ messages in thread
From: John Keeping @ 2021-12-21 17:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Tue, 21 Dec 2021 17:22:34 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> On 21/12/21 16:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <valentin.schneider@arm.com> wrote:
> >  
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:  
> >> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> > index ef8228d19382..798887f1eeff 100644
> >> > --- a/kernel/sched/rt.c
> >> > +++ b/kernel/sched/rt.c
> >> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> >> >                 struct task_struct *push_task = NULL;
> >> >                 int cpu;
> >> >
> >> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> >> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> >> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> >> > +               }
> >> > +
> >> >                 if (!pull || rq->push_busy)
> >> >                         return 0;
> >> >
> >> > +               if (!rt_task(rq->curr))
> >> > +                       return 0;
> >> > +    
> >> 
> >> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> >> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> >> have queued RT tasks and a less-than-RT current, we should be rescheduling
> >> right now.
> >> 
> >> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> >> CFS task (or straight up sched_setscheduler()):
> >> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> >> so I suspect we get to the push/pull callback before getting a
> >> resched (I actually don't see where we'd get a resched in that case other
> >> than at the next tick).
> >> 
> >> IOW, feels like we want the below. Unfortunately I can't reproduce the
> >> issue locally (yet), so that's untested.  
> >
> > This patch doesn't make any difference for me - I hit the BUG on the
> > first boot with this applied.
> >  
> 
> Thanks for the swift testing!
> 
> Did you give Dietmar's patch a try? ITSM it lacks a resched_curr(), but if
> we can somehow get to the push IRQ work before rescheduling (which I think
> might happen if we try to resched_curr(this_rq)), then we need his
> bailout.

With Dietmar's patch I hit the added WARN_ON_ONCE with:

	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]
	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]

	# ps -eTo comm,pid,lwp,pri,rtprio,nice,class
	...
	rcu_preempt        11    11  41      1   - FF
	...
	ksoftirqd/1        21    21  19      -   0 TS

Out of three reproductions, rcu_preempt as next_task is consistent, but
I've seen three different tasks in rq->curr (although all with
SCHED_OTHER).

And as expected, the added early return does stop the BUG.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:45     ` John Keeping
  2021-12-21 17:22       ` Valentin Schneider
@ 2021-12-22 17:46       ` Dietmar Eggemann
  2021-12-22 18:45         ` John Keeping
  2021-12-22 19:48         ` Valentin Schneider
  1 sibling, 2 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2021-12-22 17:46 UTC (permalink / raw)
  To: John Keeping, Valentin Schneider
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 21.12.21 17:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <valentin.schneider@arm.com> wrote:
> 
>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index fd7c4f972aaf..7d61ceec1a3b 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>  	 * this is the right place to try to pull some other one
>>  	 * from an overloaded CPU, if any.
>>  	 */
>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>> +	if (!task_on_rq_queued(p))
>>  		return;
>>  
>> -	deadline_queue_pull_task(rq);
>> +	if (!rq->dl.dl_nr_running)
>> +		deadline_queue_pull_task(rq);
>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>> +		resched_curr(rq);
>>  }
>>  
>>  /*
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..1ea2567612fb 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>  	 * we may need to handle the pulling of RT tasks
>>  	 * now.
>>  	 */
>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> +	if (!task_on_rq_queued(p))
>>  		return;
>>  
>> -	rt_queue_pull_task(rq);
>> +	if (!rq->rt.rt_nr_running)
>> +		rt_queue_pull_task(rq);
>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>> +		resched_curr(rq);

switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
  pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
    rto_push_irq_work_func() -> push_rt_task(rq, true)

seems to be the only way with pull=true.

In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.

[   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
[   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
[   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
                                                         ^^^^^^^^^^^^^^^^^^^^^^ 
[   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
[   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
                                                             ^^^^^^^^^^                           ^^^^^^ 
[   22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
[   22.288711] Hardware name: ARM Juno development board (r0) (DT)
[   22.288718] Call trace:
[   22.288722]  dump_backtrace+0x0/0x1ac
[   22.288747]  show_stack+0x1c/0x70
[   22.288763]  dump_stack_lvl+0x68/0x84
[   22.288777]  dump_stack+0x1c/0x38
[   22.288788]  push_rt_task.part.0+0x364/0x370
[   22.288805]  rto_push_irq_work_func+0x180/0x190
[   22.288821]  irq_work_single+0x34/0xa0
[   22.288836]  flush_smp_call_function_queue+0x138/0x244
[   22.288852]  generic_smp_call_function_single_interrupt+0x18/0x24
[   22.288867]  ipi_handler+0xb0/0x15c
...

What about slightly changing the layout in switched_from_rt() (only lightly tested):


@@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
         * we may need to handle the pulling of RT tasks
         * now.
         */
-       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+       if (!task_on_rq_queued(p))
+               return;
+
+       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
+               resched_curr(rq);
+               return;
+       }
+
+       if (rq->rt.rt_nr_running)
                return;
 
        rt_queue_pull_task(rq);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 17:46       ` Dietmar Eggemann
@ 2021-12-22 18:45         ` John Keeping
  2021-12-22 19:48         ` Valentin Schneider
  1 sibling, 0 replies; 16+ messages in thread
From: John Keeping @ 2021-12-22 18:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Wed, 22 Dec 2021 18:46:57 +0100
Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

> On 21.12.21 17:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <valentin.schneider@arm.com> wrote:
> >   
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:  
> 
> [...]
> 
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index fd7c4f972aaf..7d61ceec1a3b 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>  	 * this is the right place to try to pull some other one
> >>  	 * from an overloaded CPU, if any.
> >>  	 */
> >> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >> +	if (!task_on_rq_queued(p))
> >>  		return;
> >>  
> >> -	deadline_queue_pull_task(rq);
> >> +	if (!rq->dl.dl_nr_running)
> >> +		deadline_queue_pull_task(rq);
> >> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >> +		resched_curr(rq);
> >>  }
> >>  
> >>  /*
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index ef8228d19382..1ea2567612fb 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >>  	 * we may need to handle the pulling of RT tasks
> >>  	 * now.
> >>  	 */
> >> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >> +	if (!task_on_rq_queued(p))
> >>  		return;
> >>  
> >> -	rt_queue_pull_task(rq);
> >> +	if (!rq->rt.rt_nr_running)
> >> +		rt_queue_pull_task(rq);
> >> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >> +		resched_curr(rq);  
> 
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>     rto_push_irq_work_func() -> push_rt_task(rq, true)
> 
> seems to be the only way with pull=true.
> 
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
> 
> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>                                                          ^^^^^^^^^^^^^^^^^^^^^^ 
> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>                                                              ^^^^^^^^^^                           ^^^^^^ 
> [   22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
> [   22.288711] Hardware name: ARM Juno development board (r0) (DT)
> [   22.288718] Call trace:
> [   22.288722]  dump_backtrace+0x0/0x1ac
> [   22.288747]  show_stack+0x1c/0x70
> [   22.288763]  dump_stack_lvl+0x68/0x84
> [   22.288777]  dump_stack+0x1c/0x38
> [   22.288788]  push_rt_task.part.0+0x364/0x370
> [   22.288805]  rto_push_irq_work_func+0x180/0x190
> [   22.288821]  irq_work_single+0x34/0xa0
> [   22.288836]  flush_smp_call_function_queue+0x138/0x244
> [   22.288852]  generic_smp_call_function_single_interrupt+0x18/0x24
> [   22.288867]  ipi_handler+0xb0/0x15c
> ...
> 
> What about slightly changing the layout in switched_from_rt() (only lightly tested):

I still see the BUG splat with the patch below applied :-(

> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>          * we may need to handle the pulling of RT tasks
>          * now.
>          */
> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +       if (!task_on_rq_queued(p))
> +               return;
> +
> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> +               resched_curr(rq);
> +               return;
> +       }
> +
> +       if (rq->rt.rt_nr_running)
>                 return;
>  
>         rt_queue_pull_task(rq);


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 17:46       ` Dietmar Eggemann
  2021-12-22 18:45         ` John Keeping
@ 2021-12-22 19:48         ` Valentin Schneider
  2021-12-23 11:58           ` John Keeping
  2022-01-07 10:46           ` Dietmar Eggemann
  1 sibling, 2 replies; 16+ messages in thread
From: Valentin Schneider @ 2021-12-22 19:48 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 22/12/21 18:46, Dietmar Eggemann wrote:
> On 21.12.21 17:45, John Keeping wrote:
>> On Tue, 21 Dec 2021 16:11:34 +0000
>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>
>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index fd7c4f972aaf..7d61ceec1a3b 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>>      * this is the right place to try to pull some other one
>>>      * from an overloaded CPU, if any.
>>>      */
>>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>>> +	if (!task_on_rq_queued(p))
>>>             return;
>>>
>>> -	deadline_queue_pull_task(rq);
>>> +	if (!rq->dl.dl_nr_running)
>>> +		deadline_queue_pull_task(rq);
>>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>>> +		resched_curr(rq);
>>>  }
>>>
>>>  /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..1ea2567612fb 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>>      * we may need to handle the pulling of RT tasks
>>>      * now.
>>>      */
>>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>>> +	if (!task_on_rq_queued(p))
>>>             return;
>>>
>>> -	rt_queue_pull_task(rq);
>>> +	if (!rq->rt.rt_nr_running)
>>> +		rt_queue_pull_task(rq);
>>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>>> +		resched_curr(rq);
>
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>
> seems to be the only way with pull=true.
>
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>
> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>                                                          ^^^^^^^^^^^^^^^^^^^^^^
> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>                                                              ^^^^^^^^^^                           ^^^^^^

mark_wakeup_next_waiter() first deboosts the previous owner and then
wakeups the next top waiter. Seems like you somehow have the wakeup happen
before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
only pick a CPU that is in rq->rd->rto_mask, which requires having at least
2 RT tasks there...

Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
check_preempt_curr() if required, which is good, though I think we are
still missing some for sched_setscheduler() (there are no wakeups
there). So if we just have to live with an IRQ work popping in before we
get to preempt_schedule_irq() (or somesuch), then perhaps the below would
be sufficient.

> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>
>
> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>          * we may need to handle the pulling of RT tasks
>          * now.
>          */
> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +       if (!task_on_rq_queued(p))
> +               return;
> +
> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> +               resched_curr(rq);
> +               return;
> +       }
> +
> +       if (rq->rt.rt_nr_running)
>                 return;
>
>         rt_queue_pull_task(rq);

If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
least from RT's perspective; p->sched_class->switched_to() takes care of
the rest)

---

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded CPU, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	deadline_queue_pull_task(rq);
+	if (!rq->dl.dl_nr_running)
+		deadline_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+		resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..8f3e3a1367b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (!next_task)
 		return 0;
 
+	/*
+	 * It's possible that the next_task slipped in of higher priority than
+	 * current, or current has *just* changed priority.  If that's the case
+	 * just reschedule current.
+	 */
+	if (unlikely(next_task->prio < rq->curr->prio)) {
+		resched_curr(rq);
+		return 0;
+	}
+
 retry:
 	if (is_migration_disabled(next_task)) {
 		struct task_struct *push_task = NULL;
@@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (WARN_ON(next_task == rq->curr))
 		return 0;
 
-	/*
-	 * It's possible that the next_task slipped in of
-	 * higher priority than current. If that's the case
-	 * just reschedule current.
-	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
-		resched_curr(rq);
-		return 0;
-	}
-
 	/* We might release rq lock */
 	get_task_struct(next_task);
 
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!rq->rt.rt_nr_running)
+		rt_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+		resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 19:48         ` Valentin Schneider
@ 2021-12-23 11:58           ` John Keeping
  2021-12-23 14:05             ` Valentin Schneider
  2022-01-07 10:46           ` Dietmar Eggemann
  1 sibling, 1 reply; 16+ messages in thread
From: John Keeping @ 2021-12-23 11:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
> > On 21.12.21 17:45, John Keeping wrote:
> >> On Tue, 21 Dec 2021 16:11:34 +0000
> >> Valentin Schneider <valentin.schneider@arm.com> wrote:
> >>
> >>> On 20/12/21 18:35, Dietmar Eggemann wrote:
> >
> > [...]
> >
> >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >>> index fd7c4f972aaf..7d61ceec1a3b 100644
> >>> --- a/kernel/sched/deadline.c
> >>> +++ b/kernel/sched/deadline.c
> >>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>>      * this is the right place to try to pull some other one
> >>>      * from an overloaded CPU, if any.
> >>>      */
> >>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >>> +	if (!task_on_rq_queued(p))
> >>>             return;
> >>>
> >>> -	deadline_queue_pull_task(rq);
> >>> +	if (!rq->dl.dl_nr_running)
> >>> +		deadline_queue_pull_task(rq);
> >>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >>> +		resched_curr(rq);
> >>>  }
> >>>
> >>>  /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..1ea2567612fb 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >>>      * we may need to handle the pulling of RT tasks
> >>>      * now.
> >>>      */
> >>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >>> +	if (!task_on_rq_queued(p))
> >>>             return;
> >>>
> >>> -	rt_queue_pull_task(rq);
> >>> +	if (!rq->rt.rt_nr_running)
> >>> +		rt_queue_pull_task(rq);
> >>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >>> +		resched_curr(rq);
> >
> > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
> >   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
> >     rto_push_irq_work_func() -> push_rt_task(rq, true)
> >
> > seems to be the only way with pull=true.
> >
> > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
> >
> > [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> > [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> > [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
> >                                                          ^^^^^^^^^^^^^^^^^^^^^^
> > [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> > [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
> >                                                              ^^^^^^^^^^                           ^^^^^^
> 
> mark_wakeup_next_waiter() first deboosts the previous owner and then
> wakeups the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...
> 
> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

With this patch I ran 100 reboots without hitting the BUG, so it looks
like this is the solution!

If you post this as a patch, feel free to add:

	Tested-by: John Keeping <john@metanate.com>

> ---
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * this is the right place to try to pull some other one
>  	 * from an overloaded CPU, if any.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	deadline_queue_pull_task(rq);
> +	if (!rq->dl.dl_nr_running)
> +		deadline_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> +		resched_curr(rq);
>  }
>  
>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (!next_task)
>  		return 0;
>  
> +	/*
> +	 * It's possible that the next_task slipped in of higher priority than
> +	 * current, or current has *just* changed priority.  If that's the case
> +	 * just reschedule current.
> +	 */
> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> +		resched_curr(rq);
> +		return 0;
> +	}
> +
>  retry:
>  	if (is_migration_disabled(next_task)) {
>  		struct task_struct *push_task = NULL;
> @@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (WARN_ON(next_task == rq->curr))
>  		return 0;
>  
> -	/*
> -	 * It's possible that the next_task slipped in of
> -	 * higher priority than current. If that's the case
> -	 * just reschedule current.
> -	 */
> -	if (unlikely(next_task->prio < rq->curr->prio)) {
> -		resched_curr(rq);
> -		return 0;
> -	}
> -
>  	/* We might release rq lock */
>  	get_task_struct(next_task);
>  
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>  	 * we may need to handle the pulling of RT tasks
>  	 * now.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	rt_queue_pull_task(rq);
> +	if (!rq->rt.rt_nr_running)
> +		rt_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> +		resched_curr(rq);
>  }
>  
>  void __init init_sched_rt_class(void)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-23 11:58           ` John Keeping
@ 2021-12-23 14:05             ` Valentin Schneider
  0 siblings, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2021-12-23 14:05 UTC (permalink / raw)
  To: John Keeping
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 23/12/21 11:58, John Keeping wrote:
> On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
>> mark_wakeup_next_waiter() first deboosts the previous owner and then
>> wakeups the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>> 
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> With this patch I ran 100 reboots without hitting the BUG, so it looks
> like this is the solution!
>
> If you post this as a patch, feel free to add:
>
> 	Tested-by: John Keeping <john@metanate.com>

Thanks!

I still need to convince myself beyond reasonable doubt that this is really
what is happening (esp the sched_setscheduler()) part, so the next episode
will probably air after the break :-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 19:48         ` Valentin Schneider
  2021-12-23 11:58           ` John Keeping
@ 2022-01-07 10:46           ` Dietmar Eggemann
  2022-01-07 11:49             ` John Keeping
  2022-01-14 18:25             ` Valentin Schneider
  1 sibling, 2 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2022-01-07 10:46 UTC (permalink / raw)
  To: Valentin Schneider, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 22/12/2021 20:48, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
>> On 21.12.21 17:45, John Keeping wrote:
>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>>
>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>>
>> seems to be the only way with pull=true.
>>
>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>
>> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>>                                                          ^^^^^^^^^^^^^^^^^^^^^^
>> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>>                                                              ^^^^^^^^^^                           ^^^^^^
> 
> mark_wakeup_next_waiter() first deboosts the previous owner and then
> wakeups the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...

True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:

  mark_wakeup_next_waiter()

    (1) /* deboost */
    rt_mutex_adjust_prio()

      rt_mutex_setprio(current, ...)

        rq = __task_rq_lock(current, );
        check_class_changed(rq, p, prev_class, oldprio)

          switched_from_rt()

            if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
              return;

            rt_queue_pull_task(rq)

              queue_balance_callback(rq, ..., pull_rt_task);

                pull_rt_task()

                  tell_cpu_to_push()

                    *cpu* = rto_next_cpu(rq->rd)
                    irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)

                      rto_push_irq_work_func()
                        push_rt_task(rq, true) <-- !!!

    (2) /* waking the top waiter */
    rt_mutex_wake_q_add(wqh, waiter);

> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

I think that's the case here but we are on the RT overloaded CPU (*cpu*).

> 
>> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>>
>>
>> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>          * we may need to handle the pulling of RT tasks
>>          * now.
>>          */
>> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> +       if (!task_on_rq_queued(p))
>> +               return;
>> +
>> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
>> +               resched_curr(rq);
>> +               return;
>> +       }
>> +
>> +       if (rq->rt.rt_nr_running)
>>                 return;
>>
>>         rt_queue_pull_task(rq);
> 
> If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
> least from RT's perspective; p->sched_class->switched_to() takes care of
> the rest)

Right.

[...]

>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (!next_task)
>  		return 0;
>  
> +	/*
> +	 * It's possible that the next_task slipped in of higher priority than
> +	 * current, or current has *just* changed priority.  If that's the case
> +	 * just reschedule current.
> +	 */
> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> +		resched_curr(rq);
> +		return 0;
> +	}

IMHO, that's the bit which prevents the BUG.

But this would also prevent the case in which rq->curr is an RT task
with lower prio than next_task.

Also `rq->curr = migration/X` goes still though which is somehow fine
since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).

And DL tasks (like sugov:X go through and they can have
task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
freuency domains with schedutil). cpupri_find_fitness()->convert_prio()
can handle  task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow
by coincidence.

So maybe something like this:

@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
                if (!pull || rq->push_busy)
                        return 0;

+               if (rq->curr->sched_class != &rt_sched_class) {
+                       resched_curr(rq);
+                       return 0;
+               }
+
                cpu = find_lowest_rq(rq->curr);

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 10:46           ` Dietmar Eggemann
@ 2022-01-07 11:49             ` John Keeping
  2022-01-07 14:25               ` Dietmar Eggemann
  2022-01-14 18:25             ` Valentin Schneider
  1 sibling, 1 reply; 16+ messages in thread
From: John Keeping @ 2022-01-07 11:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
> >  /*
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..8f3e3a1367b6 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> >  	if (!next_task)
> >  		return 0;
> >  
> > +	/*
> > +	 * It's possible that the next_task slipped in of higher priority than
> > +	 * current, or current has *just* changed priority.  If that's the case
> > +	 * just reschedule current.
> > +	 */
> > +	if (unlikely(next_task->prio < rq->curr->prio)) {
> > +		resched_curr(rq);
> > +		return 0;
> > +	}
> 
> IMHO, that's the bit which prevents the BUG.
> 
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
> 
> Also `rq->curr = migration/X` goes still though which is somehow fine
> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).
> 
> And DL tasks (like sugov:X go through and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> freuency domains with schedutil). cpupri_find_fitness()->convert_prio()
> can handle  task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow
> by coincidence.
> 
> So maybe something like this:

Do you mean to replace just the one hunk from Valentin's patch with the
change below (keeping the rest), or are you saying that only the change
below is needed?

> @ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 if (!pull || rq->push_busy)
>                         return 0;
> 
> +               if (rq->curr->sched_class != &rt_sched_class) {
> +                       resched_curr(rq);
> +                       return 0;
> +               }
> +
>                 cpu = find_lowest_rq(rq->curr);
> 
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 11:49             ` John Keeping
@ 2022-01-07 14:25               ` Dietmar Eggemann
  2022-01-07 18:35                 ` John Keeping
  0 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2022-01-07 14:25 UTC (permalink / raw)
  To: John Keeping
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 07/01/2022 12:49, John Keeping wrote:
> On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
>> On 22/12/2021 20:48, Valentin Schneider wrote:
>>>  /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..8f3e3a1367b6 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>>>  	if (!next_task)
>>>  		return 0;
>>>  
>>> +	/*
>>> +	 * It's possible that the next_task slipped in of higher priority than
>>> +	 * current, or current has *just* changed priority.  If that's the case
>>> +	 * just reschedule current.
>>> +	 */
>>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
>>> +		resched_curr(rq);
>>> +		return 0;
>>> +	}
>>
>> IMHO, that's the bit which prevents the BUG.
>>
>> But this would also prevent the case in which rq->curr is an RT task
>> with lower prio than next_task.
>>
>> Also `rq->curr = migration/X` goes still though which is somehow fine
>> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).
>>
>> And DL tasks (like sugov:X go through and they can have
>> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
>> freuency domains with schedutil). cpupri_find_fitness()->convert_prio()
>> can handle  task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow
>> by coincidence.
>>
>> So maybe something like this:
> 
> Do you mean to replace just the one hunk from Valentin's patch with the
> change below (keeping the rest), or are you saying that only the change
> below is needed?

The latter.

I think Valentin wanted to see if something like this can also occur via
sched_setscheduler() and maybe for this changes in switched_from_[rt/dl]
will be necessary.

>> @ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>>                 if (!pull || rq->push_busy)
>>                         return 0;
>>
>> +               if (rq->curr->sched_class != &rt_sched_class) {
>> +                       resched_curr(rq);
>> +                       return 0;
>> +               }
>> +
>>                 cpu = find_lowest_rq(rq->curr);
>>
>> [...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 14:25               ` Dietmar Eggemann
@ 2022-01-07 18:35                 ` John Keeping
  0 siblings, 0 replies; 16+ messages in thread
From: John Keeping @ 2022-01-07 18:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Fri, Jan 07, 2022 at 03:25:21PM +0100, Dietmar Eggemann wrote:
> On 07/01/2022 12:49, John Keeping wrote:
> > On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> >> On 22/12/2021 20:48, Valentin Schneider wrote:
> >>>  /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..8f3e3a1367b6 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> >>>  	if (!next_task)
> >>>  		return 0;
> >>>  
> >>> +	/*
> >>> +	 * It's possible that the next_task slipped in of higher priority than
> >>> +	 * current, or current has *just* changed priority.  If that's the case
> >>> +	 * just reschedule current.
> >>> +	 */
> >>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> >>> +		resched_curr(rq);
> >>> +		return 0;
> >>> +	}
> >>
> >> IMHO, that's the bit which prevents the BUG.
> >>
> >> But this would also prevent the case in which rq->curr is an RT task
> >> with lower prio than next_task.
> >>
> >> Also `rq->curr = migration/X` goes still though which is somehow fine
> >> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).
> >>
> >> And DL tasks (like sugov:X go through and they can have
> >> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> >> freuency domains with schedutil). cpupri_find_fitness()->convert_prio()
> >> can handle  task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow
> >> by coincidence.
> >>
> >> So maybe something like this:
> > 
> > Do you mean to replace just the one hunk from Valentin's patch with the
> > change below (keeping the rest), or are you saying that only the change
> > below is needed?
> 
> The latter.

Thanks!  I tested the patch below and can confirm that I no longer see
any BUGs with this applied.

Tested-By: John Keeping <john@metanate.com>

--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
                if (!pull || rq->push_busy)
                        return 0;
 
+               if (rq->curr->sched_class != &rt_sched_class) {
+                       resched_curr(rq);
+                       return 0;
+               }
+
                cpu = find_lowest_rq(rq->curr);
                if (cpu == -1 || cpu == rq->cpu)
                        return 0;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 10:46           ` Dietmar Eggemann
  2022-01-07 11:49             ` John Keeping
@ 2022-01-14 18:25             ` Valentin Schneider
  1 sibling, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2022-01-14 18:25 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel


Trying to page this back in...

On 07/01/22 11:46, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
>> On 22/12/21 18:46, Dietmar Eggemann wrote:
>>> On 21.12.21 17:45, John Keeping wrote:
>>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>>>
>>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>>>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>>>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>>>
>>> seems to be the only way with pull=true.
>>>
>>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>>
>>> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>>> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>>> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>>>                                                          ^^^^^^^^^^^^^^^^^^^^^^
>>> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>>> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>>>                                                              ^^^^^^^^^^                           ^^^^^^
>> 
>> mark_wakeup_next_waiter() first deboosts the previous owner and then
>> wakeups the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>
> True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:
>
>   mark_wakeup_next_waiter()
>
>     (1) /* deboost */
>     rt_mutex_adjust_prio()
>
>       rt_mutex_setprio(current, ...)
>
>         rq = __task_rq_lock(current, );
>         check_class_changed(rq, p, prev_class, oldprio)
>
>           switched_from_rt()
>
>             if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>               return;
>
>             rt_queue_pull_task(rq)
>
>               queue_balance_callback(rq, ..., pull_rt_task);
>
>                 pull_rt_task()
>
>                   tell_cpu_to_push()
>
>                     *cpu* = rto_next_cpu(rq->rd)
>                     irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)
>
>                       rto_push_irq_work_func()
>                         push_rt_task(rq, true) <-- !!!
>
>     (2) /* waking the top waiter */
>     rt_mutex_wake_q_add(wqh, waiter);
>
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> I think that's the case here but we are on the RT overloaded CPU (*cpu*).
>

So one thing I wasn't entirely clear on (and holidays didn't fix that) is
the rt_queue_pull_task() from switched_from_rt() only happens if that rq
has no other runnable RT tasks, so I don't quite see how the
irq_work_queue_on() can end up as a self-IPI...

>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..8f3e3a1367b6 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>>  	if (!next_task)
>>  		return 0;
>>  
>> +	/*
>> +	 * It's possible that the next_task slipped in of higher priority than
>> +	 * current, or current has *just* changed priority.  If that's the case
>> +	 * just reschedule current.
>> +	 */
>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
>> +		resched_curr(rq);
>> +		return 0;
>> +	}
>
> IMHO, that's the bit which prevents the BUG.
>
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
>

I think that's what we want - if current isn't the (logical) highest
priority task on the runqueue, we should forgo push/pull and reschedule
ASAP.

> Also `rq->curr = migration/X` goes still though which is somehow fine
> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).
>
> And DL tasks (like sugov:X go through and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> freuency domains with schedutil). cpupri_find_fitness()->convert_prio()
> can handle  task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow
> by coincidence.
>

Right. This reminds me of:
https://lore.kernel.org/lkml/jhjblbx7glh.mognet@arm.com/


> So maybe something like this:
>

Ah, so you explicitely prevent rt.c::find_lowest_rq() invocations with a
non-RT task... But what if current is an RT task that just got deboosted,
so that next_task->prio < rq->curr->prio? IMO we should reschedule ASAP (as
I already blabbered about above). If next_task is migration_disabled but
higher (logical) prio than current, we don't need to do any of the
migration_disabled specific crud, we just reschedule.

> @ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 if (!pull || rq->push_busy)
>                         return 0;
>
> +               if (rq->curr->sched_class != &rt_sched_class) {
> +                       resched_curr(rq);
> +                       return 0;
> +               }
> +
>                 cpu = find_lowest_rq(rq->curr);
>
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-01-14 18:25 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
2021-12-20 17:35 ` Dietmar Eggemann
2021-12-21 16:11   ` Valentin Schneider
2021-12-21 16:45     ` John Keeping
2021-12-21 17:22       ` Valentin Schneider
2021-12-21 17:42         ` John Keeping
2021-12-22 17:46       ` Dietmar Eggemann
2021-12-22 18:45         ` John Keeping
2021-12-22 19:48         ` Valentin Schneider
2021-12-23 11:58           ` John Keeping
2021-12-23 14:05             ` Valentin Schneider
2022-01-07 10:46           ` Dietmar Eggemann
2022-01-07 11:49             ` John Keeping
2022-01-07 14:25               ` Dietmar Eggemann
2022-01-07 18:35                 ` John Keeping
2022-01-14 18:25             ` Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).