* [RT] BUG in sched/cpupri.c
@ 2021-12-18 14:25 John Keeping
  2021-12-20 17:35 ` Dietmar Eggemann
  0 siblings, 1 reply; 18+ messages in thread
From: John Keeping @ 2021-12-18 14:25 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

Hi,

On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
occasional BUG at cpupri.c:151 (full trace below).

Having added extra logging, it seems that p->prio == 120 which isn't
handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").

This happens maybe half the time as userspace is starting up, but if the
system boots to a login prompt I haven't seen any issues after that.
The process isn't always the same, I've seen systemd-udevd as well as
pr/ttyS2.

I can easily "fix" this by handling normal priority tasks in
convert_prio() but I guess there's some wider reason why that's not an
expected value there, so perhaps the real problem lies elsewhere.
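For reference, the failure mode can be modeled outside the kernel: after the remap commit, convert_prio() only accepts prio values in [-1, MAX_RT_PRIO], and a CFS task's prio of 120 falls outside that range and trips the BUG. A simplified standalone model follows; the index arithmetic here is illustrative, not the kernel's exact mapping.

```c
#include <assert.h>

#define MAX_RT_PRIO	100
#define CPUPRI_INVALID	(-1)
#define WOULD_BUG	(-2)	/* stand-in for the BUG hit at cpupri.c:151 */

/*
 * Simplified model of convert_prio() after commit 934fc3314b39:
 * only prio values in [-1, MAX_RT_PRIO] map to a cpupri index
 * (lower prio value = more urgent = higher index here); anything
 * else, such as a CFS task's default prio of 120, is unexpected.
 */
static int convert_prio(int prio)
{
	if (prio == CPUPRI_INVALID)
		return CPUPRI_INVALID;
	if (prio >= 0 && prio <= MAX_RT_PRIO)
		return MAX_RT_PRIO - prio;
	return WOULD_BUG;
}
```

With this model, convert_prio(98) (a boosted RT prio) yields a valid index, while convert_prio(120) returns the WOULD_BUG sentinel at the point where the kernel panics.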


Thanks,
John

------------[ cut here ]------------
kernel BUG at kernel/sched/cpupri.c:151!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP ARM
Modules linked in:
CPU: 1 PID: 117 Comm: pr/ttyS2 Tainted: G    B             5.15.10-rt24 #1
Hardware name: Rockchip (Device Tree)
PC is at cpupri_find_fitness+0x78/0x1a4
LR is at cpupri_find_fitness+0x28/0x1a4
pc : [<c0183be8>]    lr : [<c0183b98>]    psr: 20030193
sp : c3ea38f8  ip : c38461bc  fp : 00000001
r10: e7db8b00  r9 : 00000001  r8 : c1ccd880
r7 : c3846180  r6 : c3846180  r5 : 00000000  r4 : e7db13dc
r3 : c0183b90  r2 : 00000000  r1 : 00000007  r0 : 00000078
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 0296406a  DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: non-slab/vmalloc memory
Register r4 information: non-slab/vmalloc memory
Register r5 information: NULL pointer
Register r6 information: slab task_struct start c3846180 pointer offset 0
Register r7 information: slab task_struct start c3846180 pointer offset 0
Register r8 information: slab kmalloc-1k start c1ccd800 pointer offset 128 size 1024
Register r9 information: non-paged memory
Register r10 information: non-slab/vmalloc memory
Register r11 information: non-paged memory
Register r12 information: slab task_struct start c3846180 pointer offset 60
Process pr/ttyS2 (pid: 117, stack limit = 0x(ptrval))
Stack: (0xc3ea38f8 to 0xc3ea4000)
38e0:                                                       187d4724 00000078
3900: 41b58ab3 c11a824a c016ffcc 00000001 e7db13dc c1388b00 c3846180 c1408e04
3920: 00000001 e7db8b00 00000001 c017a378 c017bcd0 e7db9098 e7db8b00 c1d84b00
3940: 00000001 e7db9290 e7db9098 c1ccd858 c1ccd858 c017bcd8 e7db8b00 c1ccd858
3960: e7db9008 c1ccd800 c1408e04 00000008 c1ccd858 c017cfc8 c1ccd85c c1ccd858
3980: 0000002b 0000002a c1ccd85c c026718c 00000000 00000000 c1ccd858 00000000
39a0: c1ccd85c 00000020 c1ccd858 c01e7300 c17a5524 e7db8b40 00000000 0181b72f
39c0: c3846288 00000003 c17a4fe0 00000001 00000013 c0d0ab00 f0802000 26a30000
39e0: c15c4dc0 c01114e0 c1cc7000 c1c3db80 c1409064 00000013 c0111664 f0802000
3a00: 26a30000 c0111678 c1cc7000 c01a05d4 c1cc7000 c3ea0000 00000003 c1409064
3a20: 00000003 f0802000 26a30000 c0198080 00000000 00000000 00000003 c019894c
3a40: c1387850 00000003 c3ea3a80 c06417a0 c3ea3a80 40030193 00000000 c0cda8ec
3a60: 60030013 ffffffff c3ea3ab4 c40893f8 c3ea0000 c1408e04 c3ea3b8c c0100b00
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff 00000051 c0cda8e8 c4088f00 c0167624 c122da30 187d4760
3ae0: c1388b00 80030013 e7db8b00 e7db8b00 c17a5dc0 c02353f0 00000000 00000000
3b00: 41b58ab3 c11a63b8 c01673fc c016580c 00000001 c3ea0004 26a30000 c3ea0000
3b20: c3ea3b84 c023523c c1d84b00 c0cda838 c17a5dc0 c02353f0 00000000 00000000
3b40: 00000000 c3ea0000 c1387020 c0cda8e8 00000001 c01640b4 00000001 c3ea0000
3b60: c3ea3c20 f6251f9e c3ea3b84 c181a338 b77d4774 c3ea3c20 c3ea0000 c3ea0000
3b80: c3ea3be0 c4073d18 c181a344 c0cd5ba8 80030013 c3ea000c 00000001 c3ea0000
3ba0: 41b58ab3 c11a889d c0cd59a8 00000003 c17a5dc0 c02353f0 ffffc000 c0235580
3bc0: c01641c8 c01640c4 00000000 00000001 c3ea0000 40030013 c38463c4 c14b7044
3be0: 00000001 c3ea3be0 c4088f00 c01641dc 40030013 c38463c4 40030013 c3846180
3c00: c14b7098 c0cda91c c14b7040 c01b5cc4 c01b5f2c 00000000 00000000 00000000
3c20: 00000001 00000000 00000000 00000001 c0d39640 f6251f9e 00000000 b77d4790
3c40: c181a338 c3ea3ce0 c3ea3ca0 c181a344 00000000 00000007 c181a4b8 c0cd5dac
3c60: 41b58ab3 c11e5383 c061d930 c023523c c17a5dc0 c02353f0 00000000 00000000
3c80: 41b58ab3 c11a8997 c0cd5cf0 c02353f0 00000000 00000000 00000000 c3ea0000
3ca0: c3846180 c06d95c4 00000001 c3ea0004 26a30000 00000007 c181a4b8 c02340c4
3cc0: c181a338 60030013 00000001 00000007 00000036 00000000 00000007 f6251f9e
3ce0: c181a338 b77d47a4 c181a4c4 c3f59000 00000036 c06ddb94 00000035 00000035
3d00: c3f59000 c3f59000 00000035 c0192a98 00000035 187d47a8 c3ea3e44 000003fe
3d20: 41b58ab3 c11eea53 c06dd92c c01b5eec b77d47ac c1448740 c3ea3dc0 c3ea3d80
3d40: 41b58ab3 c11a8f2c c0192900 c0cd5d9c c161de70 c0197ba4 26a20000 0000003e
3d60: 41b58ab3 c023523c c0cd5cf0 c0cda838 ffffffff c0cda838 00000000 c3ea0000
3d80: c17a5dc0 c02353f0 ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001
3da0: c3ea0000 c3ea3f40 00000000 00000000 00000243 00000036 c3ea3dd4 f6251f9e
3dc0: c3ea3f40 c161de40 c3efed00 c3ea3f40 00000000 00000000 00000243 00000036
3de0: c161de70 c0194fd8 c3ea3e40 000016ca c14d1250 00000000 c3f59000 00000000
3e00: 187d47c4 c3f59000 00000017 c3ea0000 c1387020 c016580c 00000001 c3ea0004
3e20: 41b58ab3 c11a9104 c0194bb0 c02340c4 c1d80000 e7db8b00 00000000 c3ea0000
3e40: c3ea3ea0 c3f59000 00000400 c0165830 00000002 c01b5eec 00000000 00000002
3e60: 00000000 c3846180 c01828b4 c3ea3e6c c3ea3e6c c1d80000 c1512280 c3846188
3e80: c384640c 00000001 c3ea3f54 c0cd01b0 c1d64000 b77d47d8 c3ea3f54 c0167508
3ea0: 00000242 00000000 8bb3f10e 00000001 c2030035 00000001 00000000 00000000
3ec0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3ee0: 00000000 00000000 00000000 00000000 00000000 00000000 c17a5dc0 c02353f0
3f00: ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001 c3ea0000 c3efec80
3f20: 60000013 c3ea000c 000004f8 c3846180 c3ea3f4c c01641dc c3efec80 60000013
3f40: 60000013 ffffc000 c3846180 f6251f9e c3ea0000 c3846180 c3efec80 00000000
3f60: c3846180 c0194bb0 c161de40 c1d67d20 c3846180 c0157eb0 00000000 c3846180
3f80: c3efeca0 c3efec0c 00000000 c3efec00 c0157c74 00000000 00000000 00000000
3fa0: 00000000 00000000 00000000 c01000fc 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[<c0183be8>] (cpupri_find_fitness) from [<c017a378>] (find_lowest_rq+0x1bc/0x258)
[<c017a378>] (find_lowest_rq) from [<c017bcd8>] (push_rt_task.part.0+0xe8/0x23c)
[<c017bcd8>] (push_rt_task.part.0) from [<c017cfc8>] (rto_push_irq_work_func+0x7c/0xd8)
[<c017cfc8>] (rto_push_irq_work_func) from [<c026718c>] (irq_work_single+0x8c/0x140)
[<c026718c>] (irq_work_single) from [<c01e7300>] (flush_smp_call_function_queue+0x238/0x31c)
[<c01e7300>] (flush_smp_call_function_queue) from [<c01114e0>] (do_handle_IPI+0x29c/0x420)
[<c01114e0>] (do_handle_IPI) from [<c0111678>] (ipi_handler+0x14/0x20)
[<c0111678>] (ipi_handler) from [<c01a05d4>] (handle_percpu_devid_irq+0x8c/0x140)
[<c01a05d4>] (handle_percpu_devid_irq) from [<c0198080>] (handle_irq_desc+0x38/0x48)
[<c0198080>] (handle_irq_desc) from [<c019894c>] (handle_domain_irq+0x40/0x54)
[<c019894c>] (handle_domain_irq) from [<c06417a0>] (gic_handle_irq+0x88/0xa0)
[<c06417a0>] (gic_handle_irq) from [<c0100b00>] (__irq_svc+0x60/0xac)
Exception stack(0xc3ea3a80 to 0xc3ea3ac8)
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff
[<c0100b00>] (__irq_svc) from [<c0cda8ec>] (_raw_spin_unlock_irqrestore+0x1c/0x70)
[<c0cda8ec>] (_raw_spin_unlock_irqrestore) from [<c0167624>] (try_to_wake_up+0x228/0x468)
[<c0167624>] (try_to_wake_up) from [<c0cd5ba8>] (rt_mutex_slowunlock+0x200/0x348)
[<c0cd5ba8>] (rt_mutex_slowunlock) from [<c0cd5dac>] (rt_spin_unlock+0xbc/0x104)
[<c0cd5dac>] (rt_spin_unlock) from [<c06ddb94>] (serial8250_console_write+0x268/0x39c)
[<c06ddb94>] (serial8250_console_write) from [<c0194fd8>] (printk_kthread_func+0x428/0x4c4)
[<c0194fd8>] (printk_kthread_func) from [<c0157eb0>] (kthread+0x23c/0x25c)
[<c0157eb0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc3ea3fb0 to 0xc3ea3ff8)
3fa0:                                     00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e1a00008 e28dd014 e8bd4ff0 ea00004a (e7f001f2) 
---[ end trace 0000000000000002 ]---

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
@ 2021-12-20 17:35 ` Dietmar Eggemann
  2021-12-21 16:11   ` Valentin Schneider
  0 siblings, 1 reply; 18+ messages in thread
From: Dietmar Eggemann @ 2021-12-20 17:35 UTC (permalink / raw)
  To: John Keeping, linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 18.12.21 15:25, John Keeping wrote:
> Hi,
> 
> On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
> occasional BUG at cpupri.c:151 (full trace below).
> 
> Having added extra logging, it seems that p->prio == 120 which isn't
> handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
> Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").
> 
> This happens maybe half the time as userspace is starting up, but if the
> system boots to a login prompt I haven't seen any issues after that.
> The process isn't always the same, I've seen systemd-udevd as well as
> pr/ttyS2.
> 
> I can easily "fix" this by handling normal priority tasks in
> convert_prio() but I guess there's some wider reason why that's not an
> expected value there, so perhaps the real problem lies elsewhere.

find_lowest_rq() -> [ cpupri_find() ] -> cpupri_find_fitness() ->
convert_prio(p->prio) can only be called for an RT task (p->prio=[0..98])

I can recreate this on my Juno-r0 Arm64 machine with v5.15.10-rt24,
CONFIG_PREEMPT_RT=y .

------------[ cut here ]------------
[   15.256482] WARNING: CPU: 3 PID: 35 at kernel/sched/rt.c:1898 push_rt_task.part.0+0x190/0x364
[   15.256521] Modules linked in:
[   15.256532] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #14
[   15.256546] Hardware name: ARM Juno development board (r0) (DT)
[   15.256552] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   15.256567] pc : push_rt_task.part.0+0x190/0x364
[   15.256582] lr : push_rt_task.part.0+0x20/0x364
[   15.256597] sp : ffff800011ff3e20
[   15.256602] x29: ffff800011ff3e20 x28: ffff0008001ca940 x27: 0000000000000000
[   15.256622] x26: 0000000000000000 x25: ffff800011f61da0 x24: 0000000000000001
[   15.256639] x23: 0000000000000000 x22: ffff00097ef923c0 x21: ffff0008000e0dc0
[   15.256657] x20: ffff0008000e0dc0 x19: ffff00097ef923c0 x18: 0000000000000002
[   15.256674] x17: ffff80096d7e7000 x16: ffff800011ff4000 x15: 0000000000004000
[   15.256692] x14: 00000000000000c1 x13: 0000000000000001 x12: 0000000000000000
[   15.256709] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400
[   15.256725] x8 : ffff00097ef924c0 x7 : ffff00097ef92440 x6 : 00000000356c5f2b
[   15.256743] x5 : ffff80096d7e7000 x4 : 0000000000000000 x3 : 0000000000000003
[   15.256760] x2 : ffff00097ef923c0 x1 : 0000000000000062 x0 : 0000000000000078
[   15.256777] Call trace:
[   15.256783]  push_rt_task.part.0+0x190/0x364
[   15.256799]  rto_push_irq_work_func+0x180/0x190
[   15.256815]  irq_work_single+0x34/0xa0
[   15.256829]  flush_smp_call_function_queue+0x138/0x244
[   15.256845]  generic_smp_call_function_single_interrupt+0x18/0x24
[   15.256860]  ipi_handler+0xb0/0x15c
[   15.256875]  handle_percpu_devid_irq+0x88/0x140
[   15.256890]  handle_domain_irq+0x64/0x94
[   15.256907]  gic_handle_irq+0x50/0xf0
[   15.256924]  call_on_irq_stack+0x2c/0x60
[   15.256937]  do_interrupt_handler+0x54/0x60
[   15.256951]  el1_interrupt+0x30/0x80
[   15.256965]  el1h_64_irq_handler+0x1c/0x30
[   15.256978]  el1h_64_irq+0x78/0x7c
[   15.256989]  preempt_schedule_irq+0x40/0x150
[   15.257004]  el1_interrupt+0x60/0x80
[   15.257016]  el1h_64_irq_handler+0x1c/0x30
[   15.257029]  el1h_64_irq+0x78/0x7c
[   15.257039]  run_ksoftirqd+0x94/0xc0
[   15.257053]  smpboot_thread_fn+0x2d8/0x320
[   15.257066]  kthread+0x18c/0x1a0
[   15.257081]  ret_from_fork+0x10/0x20
[   15.257094] ---[ end trace 0000000000000002 ]---
[   16.257349] next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/3 35]

root@juno:~# ps -eTo comm,pid,lwp,pri,rtprio,nice,class | more
COMMAND           PID   LWP PRI RTPRIO  NI CLS
..
rcu_preempt        11    11  41      1   - FF
...
ksoftirqd/3        35    35  19      -   0 TS
...

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..798887f1eeff 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
                struct task_struct *push_task = NULL;
                int cpu;
 
+               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
+                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
+                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
+               }
+
                if (!pull || rq->push_busy)
                        return 0;
 
+               if (!rt_task(rq->curr))
+                       return 0;
+
                cpu = find_lowest_rq(rq->curr);
                if (cpu == -1 || cpu == rq->cpu)
                        return 0;

The second condition prevents the BUG_ON() in cpupri_find_fitness().
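The bailout works because rt_task() boils down to a prio-range test. A sketch of that relationship (modeled after the kernel's rt_prio(); note the real helper also returns true for deadline tasks' negative prio):

```c
#include <assert.h>

#define MAX_RT_PRIO	100

/*
 * Model of rt_prio()/rt_task(): any prio below MAX_RT_PRIO counts
 * as RT.  A boosted ksoftirqd at prio 98 passes; once demoted back
 * to the CFS default of 120 it fails, so the added early return in
 * push_rt_task() skips find_lowest_rq() -> cpupri_find_fitness().
 */
static int rt_prio_model(int prio)
{
	return prio < MAX_RT_PRIO;
}
```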

[...]


* Re: [RT] BUG in sched/cpupri.c
  2021-12-20 17:35 ` Dietmar Eggemann
@ 2021-12-21 16:11   ` Valentin Schneider
  2021-12-21 16:45     ` John Keeping
  0 siblings, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2021-12-21 16:11 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping, linux-rt-users
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 20/12/21 18:35, Dietmar Eggemann wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..798887f1eeff 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 struct task_struct *push_task = NULL;
>                 int cpu;
>
> +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> +               }
> +
>                 if (!pull || rq->push_busy)
>                         return 0;
>
> +               if (!rt_task(rq->curr))
> +                       return 0;
> +

If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
case), that's buggered: we shouldn't be trying to pull RT tasks when we
have queued RT tasks and a less-than-RT current, we should be rescheduling
right now.

I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
CFS task (or straight up sched_setscheduler()):
check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
so I suspect we get to the push/pull callback before getting a
resched (I actually don't see where we'd get a resched in that case other
than at the next tick).
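As an aside on the `p->sched_class < &rt_sched_class` test used in the patch below: in kernels of this vintage the sched_class descriptors are emitted contiguously by the linker in ascending priority order, so "runs below RT" is a plain pointer comparison. A toy model of that layout (names and array are illustrative, not kernel types):

```c
#include <assert.h>

struct sched_class { const char *name; };

/*
 * Toy model of the linker-ordered sched_class descriptors: the
 * lowest-priority class sits at the lowest address, so comparing
 * two class pointers compares scheduling priority.
 */
static const struct sched_class sched_classes[] = {
	{ "idle" }, { "fair" }, { "rt" }, { "dl" }, { "stop" },
};

#define fair_sched_class_p	(&sched_classes[1])
#define rt_sched_class_p	(&sched_classes[2])
#define dl_sched_class_p	(&sched_classes[3])
```

Here a demoted-to-CFS current satisfies `fair_sched_class_p < rt_sched_class_p`, while a DL current does not.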

IOW, feels like we want the below. Unfortunately I can't reproduce the
issue locally (yet), so that's untested.

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded CPU, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	deadline_queue_pull_task(rq);
+	if (!rq->dl.dl_nr_running)
+		deadline_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+		resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..1ea2567612fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!rq->rt.rt_nr_running)
+		rt_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+		resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)


* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:11   ` Valentin Schneider
@ 2021-12-21 16:45     ` John Keeping
  2021-12-21 17:22       ` Valentin Schneider
  2021-12-22 17:46       ` Dietmar Eggemann
  0 siblings, 2 replies; 18+ messages in thread
From: John Keeping @ 2021-12-21 16:45 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Tue, 21 Dec 2021 16:11:34 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> On 20/12/21 18:35, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..798887f1eeff 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> >                 struct task_struct *push_task = NULL;
> >                 int cpu;
> >
> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> > +               }
> > +
> >                 if (!pull || rq->push_busy)
> >                         return 0;
> >
> > +               if (!rt_task(rq->curr))
> > +                       return 0;
> > +  
> 
> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> have queued RT tasks and a less-than-RT current, we should be rescheduling
> right now.
> 
> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> CFS task (or straight up sched_setscheduler()):
> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> so I suspect we get to the push/pull callback before getting a
> resched (I actually don't see where we'd get a resched in that case other
> than at the next tick).
> 
> IOW, feels like we want the below. Unfortunately I can't reproduce the
> issue locally (yet), so that's untested.

This patch doesn't make any difference for me - I hit the BUG on the
first boot with this applied.

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * this is the right place to try to pull some other one
>  	 * from an overloaded CPU, if any.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	deadline_queue_pull_task(rq);
> +	if (!rq->dl.dl_nr_running)
> +		deadline_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> +		resched_curr(rq);
>  }
>  
>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..1ea2567612fb 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>  	 * we may need to handle the pulling of RT tasks
>  	 * now.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	rt_queue_pull_task(rq);
> +	if (!rq->rt.rt_nr_running)
> +		rt_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> +		resched_curr(rq);
>  }
>  
>  void __init init_sched_rt_class(void)



* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:45     ` John Keeping
@ 2021-12-21 17:22       ` Valentin Schneider
  2021-12-21 17:42         ` John Keeping
  2021-12-22 17:46       ` Dietmar Eggemann
  1 sibling, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2021-12-21 17:22 UTC (permalink / raw)
  To: John Keeping
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 21/12/21 16:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <valentin.schneider@arm.com> wrote:
>
>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> > index ef8228d19382..798887f1eeff 100644
>> > --- a/kernel/sched/rt.c
>> > +++ b/kernel/sched/rt.c
>> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>> >                 struct task_struct *push_task = NULL;
>> >                 int cpu;
>> >
>> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
>> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
>> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
>> > +               }
>> > +
>> >                 if (!pull || rq->push_busy)
>> >                         return 0;
>> >
>> > +               if (!rt_task(rq->curr))
>> > +                       return 0;
>> > +  
>> 
>> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
>> case), that's buggered: we shouldn't be trying to pull RT tasks when we
>> have queued RT tasks and a less-than-RT current, we should be rescheduling
>> right now.
>> 
>> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
>> CFS task (or straight up sched_setscheduler()):
>> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
>> so I suspect we get to the push/pull callback before getting a
>> resched (I actually don't see where we'd get a resched in that case other
>> than at the next tick).
>> 
>> IOW, feels like we want the below. Unfortunately I can't reproduce the
>> issue locally (yet), so that's untested.
>
> This patch doesn't make any difference for me - I hit the BUG on the
> first boot with this applied.
>

Thanks for the swift testing!

Did you give Dietmar's patch a try? ISTM it lacks a resched_curr(), but if
we can somehow get to the push IRQ work before rescheduling (which I think
might happen if we try to resched_curr(this_rq)), then we need his
bailout.


* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 17:22       ` Valentin Schneider
@ 2021-12-21 17:42         ` John Keeping
  0 siblings, 0 replies; 18+ messages in thread
From: John Keeping @ 2021-12-21 17:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Tue, 21 Dec 2021 17:22:34 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> On 21/12/21 16:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <valentin.schneider@arm.com> wrote:
> >  
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:  
> >> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> > index ef8228d19382..798887f1eeff 100644
> >> > --- a/kernel/sched/rt.c
> >> > +++ b/kernel/sched/rt.c
> >> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> >> >                 struct task_struct *push_task = NULL;
> >> >                 int cpu;
> >> >
> >> > +               if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> >> > +                       printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> >> > +                              next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> >> > +               }
> >> > +
> >> >                 if (!pull || rq->push_busy)
> >> >                         return 0;
> >> >
> >> > +               if (!rt_task(rq->curr))
> >> > +                       return 0;
> >> > +    
> >> 
> >> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> >> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> >> have queued RT tasks and a less-than-RT current, we should be rescheduling
> >> right now.
> >> 
> >> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> >> CFS task (or straight up sched_setscheduler()):
> >> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> >> so I suspect we get to the push/pull callback before getting a
> >> resched (I actually don't see where we'd get a resched in that case other
> >> than at the next tick).
> >> 
> >> IOW, feels like we want the below. Unfortunately I can't reproduce the
> >> issue locally (yet), so that's untested.  
> >
> > This patch doesn't make any difference for me - I hit the BUG on the
> > first boot with this applied.
> >  
> 
> Thanks for the swift testing!
> 
> Did you give Dietmar's patch a try? ITSM it lacks a resched_curr(), but if
> we can somehow get to the push IRQ work before rescheduling (which I think
> might happen if we try to resched_curr(this_rq)), then we need his
> bailout.

With Dietmar's patch I hit the added WARN_ON_ONCE with:

	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]
	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]

	# ps -eTo comm,pid,lwp,pri,rtprio,nice,class
	...
	rcu_preempt        11    11  41      1   - FF
	...
	ksoftirqd/1        21    21  19      -   0 TS

Out of three reproductions, rcu_preempt as next_task is consistent, but
I've seen three different tasks in rq->curr (although all with
SCHED_OTHER).

And as expected, the added early return does stop the BUG.


* Re: [RT] BUG in sched/cpupri.c
  2021-12-21 16:45     ` John Keeping
  2021-12-21 17:22       ` Valentin Schneider
@ 2021-12-22 17:46       ` Dietmar Eggemann
  2021-12-22 18:45         ` John Keeping
  2021-12-22 19:48         ` Valentin Schneider
  1 sibling, 2 replies; 18+ messages in thread
From: Dietmar Eggemann @ 2021-12-22 17:46 UTC (permalink / raw)
  To: John Keeping, Valentin Schneider
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 21.12.21 17:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <valentin.schneider@arm.com> wrote:
> 
>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index fd7c4f972aaf..7d61ceec1a3b 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>  	 * this is the right place to try to pull some other one
>>  	 * from an overloaded CPU, if any.
>>  	 */
>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>> +	if (!task_on_rq_queued(p))
>>  		return;
>>  
>> -	deadline_queue_pull_task(rq);
>> +	if (!rq->dl.dl_nr_running)
>> +		deadline_queue_pull_task(rq);
>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>> +		resched_curr(rq);
>>  }
>>  
>>  /*
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..1ea2567612fb 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>  	 * we may need to handle the pulling of RT tasks
>>  	 * now.
>>  	 */
>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> +	if (!task_on_rq_queued(p))
>>  		return;
>>  
>> -	rt_queue_pull_task(rq);
>> +	if (!rq->rt.rt_nr_running)
>> +		rt_queue_pull_task(rq);
>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>> +		resched_curr(rq);

switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
  pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
    rto_push_irq_work_func() -> push_rt_task(rq, true)

seems to be the only way with pull=true.

In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.

[   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
[   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
[   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
                                                         ^^^^^^^^^^^^^^^^^^^^^^ 
[   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
[   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
                                                             ^^^^^^^^^^                           ^^^^^^ 
[   22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
[   22.288711] Hardware name: ARM Juno development board (r0) (DT)
[   22.288718] Call trace:
[   22.288722]  dump_backtrace+0x0/0x1ac
[   22.288747]  show_stack+0x1c/0x70
[   22.288763]  dump_stack_lvl+0x68/0x84
[   22.288777]  dump_stack+0x1c/0x38
[   22.288788]  push_rt_task.part.0+0x364/0x370
[   22.288805]  rto_push_irq_work_func+0x180/0x190
[   22.288821]  irq_work_single+0x34/0xa0
[   22.288836]  flush_smp_call_function_queue+0x138/0x244
[   22.288852]  generic_smp_call_function_single_interrupt+0x18/0x24
[   22.288867]  ipi_handler+0xb0/0x15c
...

What about slightly changing the layout in switched_from_rt() (only lightly tested):


@@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
         * we may need to handle the pulling of RT tasks
         * now.
         */
-       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+       if (!task_on_rq_queued(p))
+               return;
+
+       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
+               resched_curr(rq);
+               return;
+       }
+
+       if (rq->rt.rt_nr_running)
                return;
 
        rt_queue_pull_task(rq);
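The difference between the two orderings can be checked with a toy decision model (the struct and its fields are stand-ins for the real rq/task state, not kernel types). With the state observed above (p still queued and current, rq->rt.rt_nr_running == 0), the earlier patch still queues the pull irq_work, while this reordering reschedules first:

```c
#include <assert.h>

enum action { NOTHING, PULL, RESCHED };

/* rq state relevant to switched_from_rt(), heavily simplified */
struct rq_model {
	int p_queued;           /* task_on_rq_queued(p) */
	int rt_nr_running;      /* rq->rt.rt_nr_running */
	int p_is_current;       /* task_current(rq, p) */
	int p_class_below_rt;   /* p->sched_class < &rt_sched_class */
};

/* Earlier ordering: rt_nr_running checked before the class of current */
static enum action switched_from_rt_v1(const struct rq_model *rq)
{
	if (!rq->p_queued)
		return NOTHING;
	if (!rq->rt_nr_running)
		return PULL;	/* rt_queue_pull_task() -> push_rt_task(rq, true) */
	if (rq->p_is_current && rq->p_class_below_rt)
		return RESCHED;
	return NOTHING;
}

/* Reordered version: a demoted current is rescheduled first */
static enum action switched_from_rt_v2(const struct rq_model *rq)
{
	if (!rq->p_queued)
		return NOTHING;
	if (rq->p_is_current && rq->p_class_below_rt)
		return RESCHED;
	if (rq->rt_nr_running)
		return NOTHING;
	return PULL;
}
```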


* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 17:46       ` Dietmar Eggemann
@ 2021-12-22 18:45         ` John Keeping
  2021-12-22 19:48         ` Valentin Schneider
  1 sibling, 0 replies; 18+ messages in thread
From: John Keeping @ 2021-12-22 18:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Wed, 22 Dec 2021 18:46:57 +0100
Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

> On 21.12.21 17:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <valentin.schneider@arm.com> wrote:
> >   
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:  
> 
> [...]
> 
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index fd7c4f972aaf..7d61ceec1a3b 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>  	 * this is the right place to try to pull some other one
> >>  	 * from an overloaded CPU, if any.
> >>  	 */
> >> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >> +	if (!task_on_rq_queued(p))
> >>  		return;
> >>  
> >> -	deadline_queue_pull_task(rq);
> >> +	if (!rq->dl.dl_nr_running)
> >> +		deadline_queue_pull_task(rq);
> >> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >> +		resched_curr(rq);
> >>  }
> >>  
> >>  /*
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index ef8228d19382..1ea2567612fb 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >>  	 * we may need to handle the pulling of RT tasks
> >>  	 * now.
> >>  	 */
> >> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >> +	if (!task_on_rq_queued(p))
> >>  		return;
> >>  
> >> -	rt_queue_pull_task(rq);
> >> +	if (!rq->rt.rt_nr_running)
> >> +		rt_queue_pull_task(rq);
> >> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >> +		resched_curr(rq);  
> 
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>     rto_push_irq_work_func() -> push_rt_task(rq, true)
> 
> seems to be the only way with pull=true.
> 
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
> 
> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>                                                          ^^^^^^^^^^^^^^^^^^^^^^ 
> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>                                                              ^^^^^^^^^^                           ^^^^^^ 
> [   22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
> [   22.288711] Hardware name: ARM Juno development board (r0) (DT)
> [   22.288718] Call trace:
> [   22.288722]  dump_backtrace+0x0/0x1ac
> [   22.288747]  show_stack+0x1c/0x70
> [   22.288763]  dump_stack_lvl+0x68/0x84
> [   22.288777]  dump_stack+0x1c/0x38
> [   22.288788]  push_rt_task.part.0+0x364/0x370
> [   22.288805]  rto_push_irq_work_func+0x180/0x190
> [   22.288821]  irq_work_single+0x34/0xa0
> [   22.288836]  flush_smp_call_function_queue+0x138/0x244
> [   22.288852]  generic_smp_call_function_single_interrupt+0x18/0x24
> [   22.288867]  ipi_handler+0xb0/0x15c
> ...
> 
> What about slightly changing the layout in switched_from_rt() (only lightly tested):

I still see the BUG splat with the patch below applied :-(

> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>          * we may need to handle the pulling of RT tasks
>          * now.
>          */
> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +       if (!task_on_rq_queued(p))
> +               return;
> +
> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> +               resched_curr(rq);
> +               return;
> +       }
> +
> +       if (rq->rt.rt_nr_running)
>                 return;
>  
>         rt_queue_pull_task(rq);


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 17:46       ` Dietmar Eggemann
  2021-12-22 18:45         ` John Keeping
@ 2021-12-22 19:48         ` Valentin Schneider
  2021-12-23 11:58           ` John Keeping
  2022-01-07 10:46           ` Dietmar Eggemann
  1 sibling, 2 replies; 18+ messages in thread
From: Valentin Schneider @ 2021-12-22 19:48 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 22/12/21 18:46, Dietmar Eggemann wrote:
> On 21.12.21 17:45, John Keeping wrote:
>> On Tue, 21 Dec 2021 16:11:34 +0000
>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>
>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index fd7c4f972aaf..7d61ceec1a3b 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>>      * this is the right place to try to pull some other one
>>>      * from an overloaded CPU, if any.
>>>      */
>>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>>> +	if (!task_on_rq_queued(p))
>>>             return;
>>>
>>> -	deadline_queue_pull_task(rq);
>>> +	if (!rq->dl.dl_nr_running)
>>> +		deadline_queue_pull_task(rq);
>>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>>> +		resched_curr(rq);
>>>  }
>>>
>>>  /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..1ea2567612fb 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>>      * we may need to handle the pulling of RT tasks
>>>      * now.
>>>      */
>>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>>> +	if (!task_on_rq_queued(p))
>>>             return;
>>>
>>> -	rt_queue_pull_task(rq);
>>> +	if (!rq->rt.rt_nr_running)
>>> +		rt_queue_pull_task(rq);
>>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>>> +		resched_curr(rq);
>
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>
> seems to be the only way with pull=true.
>
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>
> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>                                                          ^^^^^^^^^^^^^^^^^^^^^^
> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>                                                              ^^^^^^^^^^                           ^^^^^^

mark_wakeup_next_waiter() first deboosts the previous owner and then
wakeups the next top waiter. Seems like you somehow have the wakeup happen
before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
only pick a CPU that is in rq->rd->rto_mask, which requires having at least
2 RT tasks there...

Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
check_preempt_curr() if required, which is good, though I think we are
still missing some for sched_setscheduler() (there are no wakeups
there). So if we just have to live with an IRQ work popping in before we
get to preempt_schedule_irq() (or somesuch), then perhaps the below would
be sufficient.

> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>
>
> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>          * we may need to handle the pulling of RT tasks
>          * now.
>          */
> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +       if (!task_on_rq_queued(p))
> +               return;
> +
> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> +               resched_curr(rq);
> +               return;
> +       }
> +
> +       if (rq->rt.rt_nr_running)
>                 return;
>
>         rt_queue_pull_task(rq);

If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
least from RT's perspective; p->sched_class->switched_to() takes care of
the rest)

---

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded CPU, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	deadline_queue_pull_task(rq);
+	if (!rq->dl.dl_nr_running)
+		deadline_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+		resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..8f3e3a1367b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (!next_task)
 		return 0;
 
+	/*
+	 * It's possible that the next_task slipped in of higher priority than
+	 * current, or current has *just* changed priority.  If that's the case
+	 * just reschedule current.
+	 */
+	if (unlikely(next_task->prio < rq->curr->prio)) {
+		resched_curr(rq);
+		return 0;
+	}
+
 retry:
 	if (is_migration_disabled(next_task)) {
 		struct task_struct *push_task = NULL;
@@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (WARN_ON(next_task == rq->curr))
 		return 0;
 
-	/*
-	 * It's possible that the next_task slipped in of
-	 * higher priority than current. If that's the case
-	 * just reschedule current.
-	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
-		resched_curr(rq);
-		return 0;
-	}
-
 	/* We might release rq lock */
 	get_task_struct(next_task);
 
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!rq->rt.rt_nr_running)
+		rt_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+		resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 19:48         ` Valentin Schneider
@ 2021-12-23 11:58           ` John Keeping
  2021-12-23 14:05             ` Valentin Schneider
  2022-01-07 10:46           ` Dietmar Eggemann
  1 sibling, 1 reply; 18+ messages in thread
From: John Keeping @ 2021-12-23 11:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
> > On 21.12.21 17:45, John Keeping wrote:
> >> On Tue, 21 Dec 2021 16:11:34 +0000
> >> Valentin Schneider <valentin.schneider@arm.com> wrote:
> >>
> >>> On 20/12/21 18:35, Dietmar Eggemann wrote:
> >
> > [...]
> >
> >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >>> index fd7c4f972aaf..7d61ceec1a3b 100644
> >>> --- a/kernel/sched/deadline.c
> >>> +++ b/kernel/sched/deadline.c
> >>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>>      * this is the right place to try to pull some other one
> >>>      * from an overloaded CPU, if any.
> >>>      */
> >>> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >>> +	if (!task_on_rq_queued(p))
> >>>             return;
> >>>
> >>> -	deadline_queue_pull_task(rq);
> >>> +	if (!rq->dl.dl_nr_running)
> >>> +		deadline_queue_pull_task(rq);
> >>> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >>> +		resched_curr(rq);
> >>>  }
> >>>
> >>>  /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..1ea2567612fb 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >>>      * we may need to handle the pulling of RT tasks
> >>>      * now.
> >>>      */
> >>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >>> +	if (!task_on_rq_queued(p))
> >>>             return;
> >>>
> >>> -	rt_queue_pull_task(rq);
> >>> +	if (!rq->rt.rt_nr_running)
> >>> +		rt_queue_pull_task(rq);
> >>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >>> +		resched_curr(rq);
> >
> > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
> >   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
> >     rto_push_irq_work_func() -> push_rt_task(rq, true)
> >
> > seems to be the only way with pull=true.
> >
> > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
> >
> > [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> > [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> > [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
> >                                                          ^^^^^^^^^^^^^^^^^^^^^^
> > [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> > [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
> >                                                              ^^^^^^^^^^                           ^^^^^^
> 
> mark_wakeup_next_waiter() first deboosts the previous owner and then
> wakeups the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...
> 
> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

With this patch I ran 100 reboots without hitting the BUG, so it looks
like this is the solution!

If you post this as a patch, feel free to add:

	Tested-by: John Keeping <john@metanate.com>

> ---
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * this is the right place to try to pull some other one
>  	 * from an overloaded CPU, if any.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	deadline_queue_pull_task(rq);
> +	if (!rq->dl.dl_nr_running)
> +		deadline_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> +		resched_curr(rq);
>  }
>  
>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (!next_task)
>  		return 0;
>  
> +	/*
> +	 * It's possible that the next_task slipped in of higher priority than
> +	 * current, or current has *just* changed priority.  If that's the case
> +	 * just reschedule current.
> +	 */
> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> +		resched_curr(rq);
> +		return 0;
> +	}
> +
>  retry:
>  	if (is_migration_disabled(next_task)) {
>  		struct task_struct *push_task = NULL;
> @@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (WARN_ON(next_task == rq->curr))
>  		return 0;
>  
> -	/*
> -	 * It's possible that the next_task slipped in of
> -	 * higher priority than current. If that's the case
> -	 * just reschedule current.
> -	 */
> -	if (unlikely(next_task->prio < rq->curr->prio)) {
> -		resched_curr(rq);
> -		return 0;
> -	}
> -
>  	/* We might release rq lock */
>  	get_task_struct(next_task);
>  
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>  	 * we may need to handle the pulling of RT tasks
>  	 * now.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>  
> -	rt_queue_pull_task(rq);
> +	if (!rq->rt.rt_nr_running)
> +		rt_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> +		resched_curr(rq);
>  }
>  
>  void __init init_sched_rt_class(void)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-23 11:58           ` John Keeping
@ 2021-12-23 14:05             ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2021-12-23 14:05 UTC (permalink / raw)
  To: John Keeping
  Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 23/12/21 11:58, John Keeping wrote:
> On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
>> mark_wakeup_next_waiter() first deboosts the previous owner and then
>> wakeups the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>> 
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> With this patch I ran 100 reboots without hitting the BUG, so it looks
> like this is the solution!
>
> If you post this as a patch, feel free to add:
>
> 	Tested-by: John Keeping <john@metanate.com>

Thanks!

I still need to convince myself beyond reasonable doubt that this is really
what is happening (esp the sched_setscheduler()) part, so the next episode
will probably air after the break :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2021-12-22 19:48         ` Valentin Schneider
  2021-12-23 11:58           ` John Keeping
@ 2022-01-07 10:46           ` Dietmar Eggemann
  2022-01-07 11:49             ` John Keeping
  2022-01-14 18:25             ` Valentin Schneider
  1 sibling, 2 replies; 18+ messages in thread
From: Dietmar Eggemann @ 2022-01-07 10:46 UTC (permalink / raw)
  To: Valentin Schneider, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel

On 22/12/2021 20:48, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
>> On 21.12.21 17:45, John Keeping wrote:
>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>>
>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>>
>> seems to be the only way with pull=true.
>>
>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>
>> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>>                                                          ^^^^^^^^^^^^^^^^^^^^^^
>> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>>                                                              ^^^^^^^^^^                           ^^^^^^
> 
> mark_wakeup_next_waiter() first deboosts the previous owner and then
> wakeups the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...

True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:

  mark_wakeup_next_waiter()

    (1) /* deboost */
    rt_mutex_adjust_prio()

      rt_mutex_setprio(current, ...)

        rq = __task_rq_lock(current, );
        check_class_changed(rq, p, prev_class, oldprio)

          switched_from_rt()

            if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
              return;

            rt_queue_pull_task(rq)

              queue_balance_callback(rq, ..., pull_rt_task);

                pull_rt_task()

                  tell_cpu_to_push()

                    *cpu* = rto_next_cpu(rq->rd)
                    irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)

                      rto_push_irq_work_func()
                        push_rt_task(rq, true) <-- !!!

    (2) /* waking the top waiter */
    rt_mutex_wake_q_add(wqh, waiter);

> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

I think that's the case here but we are on the RT overloaded CPU (*cpu*).

> 
>> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>>
>>
>> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>          * we may need to handle the pulling of RT tasks
>>          * now.
>>          */
>> -       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> +       if (!task_on_rq_queued(p))
>> +               return;
>> +
>> +       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
>> +               resched_curr(rq);
>> +               return;
>> +       }
>> +
>> +       if (rq->rt.rt_nr_running)
>>                 return;
>>
>>         rt_queue_pull_task(rq);
> 
> If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
> least from RT's perspective; p->sched_class->switched_to() takes care of
> the rest)

Right.

[...]

>  /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	if (!next_task)
>  		return 0;
>  
> +	/*
> +	 * It's possible that the next_task slipped in of higher priority than
> +	 * current, or current has *just* changed priority.  If that's the case
> +	 * just reschedule current.
> +	 */
> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> +		resched_curr(rq);
> +		return 0;
> +	}

IMHO, that's the bit which prevents the BUG.

But this would also prevent the case in which rq->curr is an RT task
with lower prio than next_task.

Also `rq->curr = migration/X` still goes through, which is somehow fine
since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).

And DL tasks (like sugov:X) go through, and they can have
task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
can handle task_pri = p->prio = -1 (CPUPRI_INVALID), although that's
somewhat by coincidence.

So maybe something like this:

@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
                if (!pull || rq->push_busy)
                        return 0;

+               if (rq->curr->sched_class != &rt_sched_class) {
+                       resched_curr(rq);
+                       return 0;
+               }
+
                cpu = find_lowest_rq(rq->curr);

[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 10:46           ` Dietmar Eggemann
@ 2022-01-07 11:49             ` John Keeping
  2022-01-07 14:25               ` Dietmar Eggemann
  2022-01-14 18:25             ` Valentin Schneider
  1 sibling, 1 reply; 18+ messages in thread
From: John Keeping @ 2022-01-07 11:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
> >  /*
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..8f3e3a1367b6 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> >  	if (!next_task)
> >  		return 0;
> >  
> > +	/*
> > +	 * It's possible that the next_task slipped in of higher priority than
> > +	 * current, or current has *just* changed priority.  If that's the case
> > +	 * just reschedule current.
> > +	 */
> > +	if (unlikely(next_task->prio < rq->curr->prio)) {
> > +		resched_curr(rq);
> > +		return 0;
> > +	}
> 
> IMHO, that's the bit which prevents the BUG.
> 
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
> 
> Also `rq->curr = migration/X` still goes through, which is somehow fine
> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
> 
> And DL tasks (like sugov:X) go through, and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
> can handle task_pri = p->prio = -1 (CPUPRI_INVALID), although that's
> somewhat by coincidence.
> 
> So maybe something like this:

Do you mean to replace just the one hunk from Valentin's patch with the
change below (keeping the rest), or are you saying that only the change
below is needed?

> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 if (!pull || rq->push_busy)
>                         return 0;
> 
> +               if (rq->curr->sched_class != &rt_sched_class) {
> +                       resched_curr(rq);
> +                       return 0;
> +               }
> +
>                 cpu = find_lowest_rq(rq->curr);
> 
> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 11:49             ` John Keeping
@ 2022-01-07 14:25               ` Dietmar Eggemann
  2022-01-07 18:35                 ` John Keeping
  0 siblings, 1 reply; 18+ messages in thread
From: Dietmar Eggemann @ 2022-01-07 14:25 UTC (permalink / raw)
  To: John Keeping
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On 07/01/2022 12:49, John Keeping wrote:
> On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
>> On 22/12/2021 20:48, Valentin Schneider wrote:
>>>  /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..8f3e3a1367b6 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>>>  	if (!next_task)
>>>  		return 0;
>>>  
>>> +	/*
>>> +	 * It's possible that the next_task slipped in of higher priority than
>>> +	 * current, or current has *just* changed priority.  If that's the case
>>> +	 * just reschedule current.
>>> +	 */
>>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
>>> +		resched_curr(rq);
>>> +		return 0;
>>> +	}
>>
>> IMHO, that's the bit which prevents the BUG.
>>
>> But this would also prevent the case in which rq->curr is an RT task
>> with lower prio than next_task.
>>
>> Also `rq->curr = migration/X` still goes through, which is somehow fine
>> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
>>
>> And DL tasks (like sugov:X) go through, and they can have
>> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
>> frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
>> can handle task_pri = p->prio = -1 (CPUPRI_INVALID), although that's
>> somewhat by coincidence.
>>
>> So maybe something like this:
> 
> Do you mean to replace just the one hunk from Valentin's patch with the
> change below (keeping the rest), or are you saying that only the change
> below is needed?

The latter.

I think Valentin wanted to see if something like this can also occur via
sched_setscheduler(), and maybe changes in switched_from_[rt/dl] will be
necessary for that.

>> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>>                 if (!pull || rq->push_busy)
>>                         return 0;
>>
>> +               if (rq->curr->sched_class != &rt_sched_class) {
>> +                       resched_curr(rq);
>> +                       return 0;
>> +               }
>> +
>>                 cpu = find_lowest_rq(rq->curr);
>>
>> [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 14:25               ` Dietmar Eggemann
@ 2022-01-07 18:35                 ` John Keeping
  0 siblings, 0 replies; 18+ messages in thread
From: John Keeping @ 2022-01-07 18:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel

On Fri, Jan 07, 2022 at 03:25:21PM +0100, Dietmar Eggemann wrote:
> On 07/01/2022 12:49, John Keeping wrote:
> > On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> >> On 22/12/2021 20:48, Valentin Schneider wrote:
> >>>  /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..8f3e3a1367b6 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> >>>  	if (!next_task)
> >>>  		return 0;
> >>>  
> >>> +	/*
> >>> +	 * It's possible that the next_task slipped in of higher priority than
> >>> +	 * current, or current has *just* changed priority.  If that's the case
> >>> +	 * just reschedule current.
> >>> +	 */
> >>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> >>> +		resched_curr(rq);
> >>> +		return 0;
> >>> +	}
> >>
> >> IMHO, that's the bit which prevents the BUG.
> >>
> >> But this would also prevent the case in which rq->curr is an RT task
> >> with lower prio than next_task.
> >>
>> Also `rq->curr = migration/X` still goes through, which is somewhat
>> fine since find_lowest_rq() bails out when task->nr_cpus_allowed == 1.
>>
>> And DL tasks (like sugov:X) go through too, and they can have
>> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
>> frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
>> can handle task_pri = p->prio = -1 (CPUPRI_INVALID), although that's
>> somewhat by coincidence.
> >>
> >> So maybe something like this:
> > 
> > Do you mean to replace just the one hunk from Valentin's patch with the
> > change below (keeping the rest), or are you saying that only the change
> > below is needed?
> 
> The latter.

Thanks!  I tested the patch below and can confirm that I no longer see
any BUGs with this applied.

Tested-by: John Keeping <john@metanate.com>

--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
                if (!pull || rq->push_busy)
                        return 0;
 
+               if (rq->curr->sched_class != &rt_sched_class) {
+                       resched_curr(rq);
+                       return 0;
+               }
+
                cpu = find_lowest_rq(rq->curr);
                if (cpu == -1 || cpu == rq->cpu)
                        return 0;


* Re: [RT] BUG in sched/cpupri.c
  2022-01-07 10:46           ` Dietmar Eggemann
  2022-01-07 11:49             ` John Keeping
@ 2022-01-14 18:25             ` Valentin Schneider
  2022-06-27  2:02               ` andreadaoud6
  1 sibling, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2022-01-14 18:25 UTC (permalink / raw)
  To: Dietmar Eggemann, John Keeping
  Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel


Trying to page this back in...

On 07/01/22 11:46, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
>> On 22/12/21 18:46, Dietmar Eggemann wrote:
>>> On 21.12.21 17:45, John Keeping wrote:
>>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>>> Valentin Schneider <valentin.schneider@arm.com> wrote:
>>>>
>>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>>>   pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>>>     rto_push_irq_work_func() -> push_rt_task(rq, true)
>>>
>>> seems to be the only way with pull=true.
>>>
>>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>>
>>> [   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>>> [   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>>> [   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>>>                                                          ^^^^^^^^^^^^^^^^^^^^^^
>>> [   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>>> [   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>>>                                                              ^^^^^^^^^^                           ^^^^^^
>> 
>> mark_wakeup_next_waiter() first deboosts the previous owner and then
>> wakeups the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>
> True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:
>
>   mark_wakeup_next_waiter()
>
>     (1) /* deboost */
>     rt_mutex_adjust_prio()
>
>       rt_mutex_setprio(current, ...)
>
>         rq = __task_rq_lock(current, );
>         check_class_changed(rq, p, prev_class, oldprio)
>
>           switched_from_rt()
>
>             if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>               return;
>
>             rt_queue_pull_task(rq)
>
>               queue_balance_callback(rq, ..., pull_rt_task);
>
>                 pull_rt_task()
>
>                   tell_cpu_to_push()
>
>                     *cpu* = rto_next_cpu(rq->rd)
>                     irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)
>
>                       rto_push_irq_work_func()
>                         push_rt_task(rq, true) <-- !!!
>
>     (2) /* waking the top waiter */
>     rt_mutex_wake_q_add(wqh, waiter);
>
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> I think that's the case here but we are on the RT overloaded CPU (*cpu*).
>

So one thing I wasn't entirely clear on (and the holidays didn't fix that)
is that the rt_queue_pull_task() from switched_from_rt() only happens if
that rq has no other runnable RT tasks, so I don't quite see how the
irq_work_queue_on() can end up as a self-IPI...

>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..8f3e3a1367b6 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>>  	if (!next_task)
>>  		return 0;
>>  
>> +	/*
>> +	 * It's possible that the next_task slipped in of higher priority than
>> +	 * current, or current has *just* changed priority.  If that's the case
>> +	 * just reschedule current.
>> +	 */
>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
>> +		resched_curr(rq);
>> +		return 0;
>> +	}
>
> IMHO, that's the bit which prevents the BUG.
>
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
>

I think that's what we want - if current isn't the (logical) highest
priority task on the runqueue, we should forgo push/pull and reschedule
ASAP.

> Also `rq->curr = migration/X` still goes through, which is somewhat fine
> since find_lowest_rq() bails out when task->nr_cpus_allowed == 1.
>
> And DL tasks (like sugov:X) go through too, and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
> can handle task_pri = p->prio = -1 (CPUPRI_INVALID), although that's
> somewhat by coincidence.
>

Right. This reminds me of:
https://lore.kernel.org/lkml/jhjblbx7glh.mognet@arm.com/


> So maybe something like this:
>

Ah, so you explicitly prevent rt.c::find_lowest_rq() invocations with a
non-RT task... But what if current is an RT task that just got deboosted,
so that next_task->prio < rq->curr->prio? IMO we should reschedule ASAP (as
I already blabbered about above). If next_task is migration_disabled but
higher (logical) prio than current, we don't need to do any of the
migration_disabled specific crud, we just reschedule.

> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>                 if (!pull || rq->push_busy)
>                         return 0;
>
> +               if (rq->curr->sched_class != &rt_sched_class) {
> +                       resched_curr(rq);
> +                       return 0;
> +               }
> +
>                 cpu = find_lowest_rq(rq->curr);
>
> [...]


* Re: [RT] BUG in sched/cpupri.c
  2022-01-14 18:25             ` Valentin Schneider
@ 2022-06-27  2:02               ` andreadaoud6
  2022-06-27  9:51                 ` John Keeping
  0 siblings, 1 reply; 18+ messages in thread
From: andreadaoud6 @ 2022-06-27  2:02 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: Dietmar Eggemann, John Keeping, linux-rt-users

Hi all,

I'm encountering this problem today with the Linux 5.15.5-rt22 kernel.
The bug leads to an unstable system. Having read the discussion, I know
there is a workaround, but it doesn't seem that everybody agrees on it.
I'd like to know whether this problem is solved in a later rt patch series.
If yes, could someone point to the specific patch that solves it?
If not, is applying the workaround patch mentioned above the best solution?

Regards,
Andrea


* Re: [RT] BUG in sched/cpupri.c
  2022-06-27  2:02               ` andreadaoud6
@ 2022-06-27  9:51                 ` John Keeping
  0 siblings, 0 replies; 18+ messages in thread
From: John Keeping @ 2022-06-27  9:51 UTC (permalink / raw)
  To: andreadaoud6; +Cc: Valentin Schneider, Dietmar Eggemann, linux-rt-users

On Mon, Jun 27, 2022 at 10:02:04AM +0800, andreadaoud6@gmail.com wrote:
> I'm encountering this problem today with the Linux 5.15.5-rt22 kernel.
> The bug leads to an unstable system. Having read the discussion, I know
> there is a workaround, but it doesn't seem that everybody agrees on it.
> I'd like to know whether this problem is solved in a later rt patch series.
> If yes, could someone point to the specific patch that solves it?
> If not, is applying the workaround patch mentioned above the best solution?

Valentin fixed this in 49bef33e4b87 ("sched/rt: Plug
rt_mutex_setprio() vs push_rt_task() race"), which was backported to
v5.15.33, so updating to the latest v5.15 stable release will pull in
the fix.


Thread overview: 18+ messages
2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
2021-12-20 17:35 ` Dietmar Eggemann
2021-12-21 16:11   ` Valentin Schneider
2021-12-21 16:45     ` John Keeping
2021-12-21 17:22       ` Valentin Schneider
2021-12-21 17:42         ` John Keeping
2021-12-22 17:46       ` Dietmar Eggemann
2021-12-22 18:45         ` John Keeping
2021-12-22 19:48         ` Valentin Schneider
2021-12-23 11:58           ` John Keeping
2021-12-23 14:05             ` Valentin Schneider
2022-01-07 10:46           ` Dietmar Eggemann
2022-01-07 11:49             ` John Keeping
2022-01-07 14:25               ` Dietmar Eggemann
2022-01-07 18:35                 ` John Keeping
2022-01-14 18:25             ` Valentin Schneider
2022-06-27  2:02               ` andreadaoud6
2022-06-27  9:51                 ` John Keeping
