* [RT] BUG in sched/cpupri.c
@ 2021-12-18 14:25 John Keeping
  2021-12-20 17:35 ` Dietmar Eggemann
  0 siblings, 1 reply; 16+ messages in thread

From: John Keeping @ 2021-12-18 14:25 UTC (permalink / raw)
To: linux-rt-users
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel

Hi,

On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
occasional BUG at cpupri.c:151 (full trace below).

Having added extra logging, it seems that p->prio == 120, which isn't
handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").

This happens maybe half the time as userspace is starting up, but if the
system boots to a login prompt I haven't seen any issues after that.
The process isn't always the same; I've seen systemd-udevd as well as
pr/ttyS2.

I can easily "fix" this by handling normal-priority tasks in
convert_prio(), but I guess there's some wider reason why that's not an
expected value there, so perhaps the real problem lies elsewhere.

Thanks,
John

------------[ cut here ]------------
kernel BUG at kernel/sched/cpupri.c:151!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP ARM Modules linked in: CPU: 1 PID: 117 Comm: pr/ttyS2 Tainted: G B 5.15.10-rt24 #1 Hardware name: Rockchip (Device Tree) PC is at cpupri_find_fitness+0x78/0x1a4 LR is at cpupri_find_fitness+0x28/0x1a4 pc : [<c0183be8>] lr : [<c0183b98>] psr: 20030193 sp : c3ea38f8 ip : c38461bc fp : 00000001 r10: e7db8b00 r9 : 00000001 r8 : c1ccd880 r7 : c3846180 r6 : c3846180 r5 : 00000000 r4 : e7db13dc r3 : c0183b90 r2 : 00000000 r1 : 00000007 r0 : 00000078 Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 0296406a DAC: 00000051 Register r0 information: non-paged memory Register r1 information: non-paged memory Register r2 information: NULL pointer Register r3 information: non-slab/vmalloc memory Register r4 information: non-slab/vmalloc memory Register r5 information: NULL pointer Register r6 information: slab task_struct start c3846180 pointer offset 0 Register r7 information: slab task_struct start c3846180 pointer offset 0 Register r8 information: slab kmalloc-1k start c1ccd800 pointer offset 128 size 1024 Register r9 information: non-paged memory Register r10 information: non-slab/vmalloc memory Register r11 information: non-paged memory Register r12 information: slab task_struct start c3846180 pointer offset 60 Process pr/ttyS2 (pid: 117, stack limit = 0x(ptrval)) Stack: (0xc3ea38f8 to 0xc3ea4000) 38e0: 187d4724 00000078 3900: 41b58ab3 c11a824a c016ffcc 00000001 e7db13dc c1388b00 c3846180 c1408e04 3920: 00000001 e7db8b00 00000001 c017a378 c017bcd0 e7db9098 e7db8b00 c1d84b00 3940: 00000001 e7db9290 e7db9098 c1ccd858 c1ccd858 c017bcd8 e7db8b00 c1ccd858 3960: e7db9008 c1ccd800 c1408e04 00000008 c1ccd858 c017cfc8 c1ccd85c c1ccd858 3980: 0000002b 0000002a c1ccd85c c026718c 00000000 00000000 c1ccd858 00000000 39a0: c1ccd85c 00000020 c1ccd858 c01e7300 c17a5524 e7db8b40 00000000 0181b72f 39c0: c3846288 00000003 c17a4fe0 00000001 00000013 c0d0ab00 f0802000 26a30000 39e0: c15c4dc0 c01114e0 c1cc7000 
c1c3db80 c1409064 00000013 c0111664 f0802000 3a00: 26a30000 c0111678 c1cc7000 c01a05d4 c1cc7000 c3ea0000 00000003 c1409064 3a20: 00000003 f0802000 26a30000 c0198080 00000000 00000000 00000003 c019894c 3a40: c1387850 00000003 c3ea3a80 c06417a0 c3ea3a80 40030193 00000000 c0cda8ec 3a60: 60030013 ffffffff c3ea3ab4 c40893f8 c3ea0000 c1408e04 c3ea3b8c c0100b00 3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001 3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec 3ac0: 60030013 ffffffff 00000051 c0cda8e8 c4088f00 c0167624 c122da30 187d4760 3ae0: c1388b00 80030013 e7db8b00 e7db8b00 c17a5dc0 c02353f0 00000000 00000000 3b00: 41b58ab3 c11a63b8 c01673fc c016580c 00000001 c3ea0004 26a30000 c3ea0000 3b20: c3ea3b84 c023523c c1d84b00 c0cda838 c17a5dc0 c02353f0 00000000 00000000 3b40: 00000000 c3ea0000 c1387020 c0cda8e8 00000001 c01640b4 00000001 c3ea0000 3b60: c3ea3c20 f6251f9e c3ea3b84 c181a338 b77d4774 c3ea3c20 c3ea0000 c3ea0000 3b80: c3ea3be0 c4073d18 c181a344 c0cd5ba8 80030013 c3ea000c 00000001 c3ea0000 3ba0: 41b58ab3 c11a889d c0cd59a8 00000003 c17a5dc0 c02353f0 ffffc000 c0235580 3bc0: c01641c8 c01640c4 00000000 00000001 c3ea0000 40030013 c38463c4 c14b7044 3be0: 00000001 c3ea3be0 c4088f00 c01641dc 40030013 c38463c4 40030013 c3846180 3c00: c14b7098 c0cda91c c14b7040 c01b5cc4 c01b5f2c 00000000 00000000 00000000 3c20: 00000001 00000000 00000000 00000001 c0d39640 f6251f9e 00000000 b77d4790 3c40: c181a338 c3ea3ce0 c3ea3ca0 c181a344 00000000 00000007 c181a4b8 c0cd5dac 3c60: 41b58ab3 c11e5383 c061d930 c023523c c17a5dc0 c02353f0 00000000 00000000 3c80: 41b58ab3 c11a8997 c0cd5cf0 c02353f0 00000000 00000000 00000000 c3ea0000 3ca0: c3846180 c06d95c4 00000001 c3ea0004 26a30000 00000007 c181a4b8 c02340c4 3cc0: c181a338 60030013 00000001 00000007 00000036 00000000 00000007 f6251f9e 3ce0: c181a338 b77d47a4 c181a4c4 c3f59000 00000036 c06ddb94 00000035 00000035 3d00: c3f59000 c3f59000 00000035 c0192a98 00000035 187d47a8 c3ea3e44 000003fe 
3d20: 41b58ab3 c11eea53 c06dd92c c01b5eec b77d47ac c1448740 c3ea3dc0 c3ea3d80 3d40: 41b58ab3 c11a8f2c c0192900 c0cd5d9c c161de70 c0197ba4 26a20000 0000003e 3d60: 41b58ab3 c023523c c0cd5cf0 c0cda838 ffffffff c0cda838 00000000 c3ea0000 3d80: c17a5dc0 c02353f0 ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001 3da0: c3ea0000 c3ea3f40 00000000 00000000 00000243 00000036 c3ea3dd4 f6251f9e 3dc0: c3ea3f40 c161de40 c3efed00 c3ea3f40 00000000 00000000 00000243 00000036 3de0: c161de70 c0194fd8 c3ea3e40 000016ca c14d1250 00000000 c3f59000 00000000 3e00: 187d47c4 c3f59000 00000017 c3ea0000 c1387020 c016580c 00000001 c3ea0004 3e20: 41b58ab3 c11a9104 c0194bb0 c02340c4 c1d80000 e7db8b00 00000000 c3ea0000 3e40: c3ea3ea0 c3f59000 00000400 c0165830 00000002 c01b5eec 00000000 00000002 3e60: 00000000 c3846180 c01828b4 c3ea3e6c c3ea3e6c c1d80000 c1512280 c3846188 3e80: c384640c 00000001 c3ea3f54 c0cd01b0 c1d64000 b77d47d8 c3ea3f54 c0167508 3ea0: 00000242 00000000 8bb3f10e 00000001 c2030035 00000001 00000000 00000000 3ec0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3ee0: 00000000 00000000 00000000 00000000 00000000 00000000 c17a5dc0 c02353f0 3f00: ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001 c3ea0000 c3efec80 3f20: 60000013 c3ea000c 000004f8 c3846180 c3ea3f4c c01641dc c3efec80 60000013 3f40: 60000013 ffffc000 c3846180 f6251f9e c3ea0000 c3846180 c3efec80 00000000 3f60: c3846180 c0194bb0 c161de40 c1d67d20 c3846180 c0157eb0 00000000 c3846180 3f80: c3efeca0 c3efec0c 00000000 c3efec00 c0157c74 00000000 00000000 00000000 3fa0: 00000000 00000000 00000000 c01000fc 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 [<c0183be8>] (cpupri_find_fitness) from [<c017a378>] (find_lowest_rq+0x1bc/0x258) [<c017a378>] (find_lowest_rq) from [<c017bcd8>] (push_rt_task.part.0+0xe8/0x23c) [<c017bcd8>] (push_rt_task.part.0) from 
[<c017cfc8>] (rto_push_irq_work_func+0x7c/0xd8) [<c017cfc8>] (rto_push_irq_work_func) from [<c026718c>] (irq_work_single+0x8c/0x140) [<c026718c>] (irq_work_single) from [<c01e7300>] (flush_smp_call_function_queue+0x238/0x31c) [<c01e7300>] (flush_smp_call_function_queue) from [<c01114e0>] (do_handle_IPI+0x29c/0x420) [<c01114e0>] (do_handle_IPI) from [<c0111678>] (ipi_handler+0x14/0x20) [<c0111678>] (ipi_handler) from [<c01a05d4>] (handle_percpu_devid_irq+0x8c/0x140) [<c01a05d4>] (handle_percpu_devid_irq) from [<c0198080>] (handle_irq_desc+0x38/0x48) [<c0198080>] (handle_irq_desc) from [<c019894c>] (handle_domain_irq+0x40/0x54) [<c019894c>] (handle_domain_irq) from [<c06417a0>] (gic_handle_irq+0x88/0xa0) [<c06417a0>] (gic_handle_irq) from [<c0100b00>] (__irq_svc+0x60/0xac) Exception stack(0xc3ea3a80 to 0xc3ea3ac8) 3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001 3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec 3ac0: 60030013 ffffffff [<c0100b00>] (__irq_svc) from [<c0cda8ec>] (_raw_spin_unlock_irqrestore+0x1c/0x70) [<c0cda8ec>] (_raw_spin_unlock_irqrestore) from [<c0167624>] (try_to_wake_up+0x228/0x468) [<c0167624>] (try_to_wake_up) from [<c0cd5ba8>] (rt_mutex_slowunlock+0x200/0x348) [<c0cd5ba8>] (rt_mutex_slowunlock) from [<c0cd5dac>] (rt_spin_unlock+0xbc/0x104) [<c0cd5dac>] (rt_spin_unlock) from [<c06ddb94>] (serial8250_console_write+0x268/0x39c) [<c06ddb94>] (serial8250_console_write) from [<c0194fd8>] (printk_kthread_func+0x428/0x4c4) [<c0194fd8>] (printk_kthread_func) from [<c0157eb0>] (kthread+0x23c/0x25c) [<c0157eb0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc3ea3fb0 to 0xc3ea3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 Code: e1a00008 e28dd014 e8bd4ff0 ea00004a (e7f001f2) ---[ end trace 0000000000000002 ]--- ^ permalink raw reply 
[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c
  2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
@ 2021-12-20 17:35 ` Dietmar Eggemann
  2021-12-21 16:11   ` Valentin Schneider
  0 siblings, 1 reply; 16+ messages in thread

From: Dietmar Eggemann @ 2021-12-20 17:35 UTC (permalink / raw)
To: John Keeping, linux-rt-users
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel

On 18.12.21 15:25, John Keeping wrote:
> Hi,
>
> On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
> occasional BUG at cpupri.c:151 (full trace below).
>
> Having added extra logging, it seems that p->prio == 120 which isn't
> handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
> Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").
>
> This happens maybe half the time as userspace is starting up, but if the
> system boots to a login prompt I haven't seen any issues after that.
> The process isn't always the same, I've seen systemd-udevd as well as
> pr/ttyS2.
>
> I can easily "fix" this by handling normal priority tasks in
> convert_prio() but I guess there's some wider reason why that's not an
> expected value there, so perhaps the real problem lies elsewhere.

find_lowest_rq() -> [ cpupri_find() ] -> cpupri_find_fitness() ->
convert_prio(p->prio) can only be called for an RT task
(p->prio=[0..98]).

I can recreate this on my Juno-r0 Arm64 machine with v5.15.10-rt24,
CONFIG_PREEMPT_RT=y.
------------[ cut here ]------------ [ 15.256482] WARNING: CPU: 3 PID: 35 at kernel/sched/rt.c:1898 push_rt_task.part.0+0x190/0x364 [ 15.256521] Modules linked in: [ 15.256532] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #14 [ 15.256546] Hardware name: ARM Juno development board (r0) (DT) [ 15.256552] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 15.256567] pc : push_rt_task.part.0+0x190/0x364 [ 15.256582] lr : push_rt_task.part.0+0x20/0x364 [ 15.256597] sp : ffff800011ff3e20 [ 15.256602] x29: ffff800011ff3e20 x28: ffff0008001ca940 x27: 0000000000000000 [ 15.256622] x26: 0000000000000000 x25: ffff800011f61da0 x24: 0000000000000001 [ 15.256639] x23: 0000000000000000 x22: ffff00097ef923c0 x21: ffff0008000e0dc0 [ 15.256657] x20: ffff0008000e0dc0 x19: ffff00097ef923c0 x18: 0000000000000002 [ 15.256674] x17: ffff80096d7e7000 x16: ffff800011ff4000 x15: 0000000000004000 [ 15.256692] x14: 00000000000000c1 x13: 0000000000000001 x12: 0000000000000000 [ 15.256709] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400 [ 15.256725] x8 : ffff00097ef924c0 x7 : ffff00097ef92440 x6 : 00000000356c5f2b [ 15.256743] x5 : ffff80096d7e7000 x4 : 0000000000000000 x3 : 0000000000000003 [ 15.256760] x2 : ffff00097ef923c0 x1 : 0000000000000062 x0 : 0000000000000078 [ 15.256777] Call trace: [ 15.256783] push_rt_task.part.0+0x190/0x364 [ 15.256799] rto_push_irq_work_func+0x180/0x190 [ 15.256815] irq_work_single+0x34/0xa0 [ 15.256829] flush_smp_call_function_queue+0x138/0x244 [ 15.256845] generic_smp_call_function_single_interrupt+0x18/0x24 [ 15.256860] ipi_handler+0xb0/0x15c [ 15.256875] handle_percpu_devid_irq+0x88/0x140 [ 15.256890] handle_domain_irq+0x64/0x94 [ 15.256907] gic_handle_irq+0x50/0xf0 [ 15.256924] call_on_irq_stack+0x2c/0x60 [ 15.256937] do_interrupt_handler+0x54/0x60 [ 15.256951] el1_interrupt+0x30/0x80 [ 15.256965] el1h_64_irq_handler+0x1c/0x30 [ 15.256978] el1h_64_irq+0x78/0x7c [ 15.256989] preempt_schedule_irq+0x40/0x150 [ 
15.257004]  el1_interrupt+0x60/0x80
[   15.257016]  el1h_64_irq_handler+0x1c/0x30
[   15.257029]  el1h_64_irq+0x78/0x7c
[   15.257039]  run_ksoftirqd+0x94/0xc0
[   15.257053]  smpboot_thread_fn+0x2d8/0x320
[   15.257066]  kthread+0x18c/0x1a0
[   15.257081]  ret_from_fork+0x10/0x20
[   15.257094] ---[ end trace 0000000000000002 ]---
[   16.257349] next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/3 35]

root@juno:~# ps -eTo comm,pid,lwp,pri,rtprio,nice,class | more
COMMAND             PID     LWP PRI RTPRIO  NI CLS
..
rcu_preempt          11      11  41      1   - FF
...
ksoftirqd/3          35      35  19      -   0 TS
...

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..798887f1eeff 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
 	struct task_struct *push_task = NULL;
 	int cpu;
 
+	if (WARN_ON_ONCE(!rt_task(rq->curr))) {
+		printk("next_task=[%s %d] rq->curr=[%s %d]\n",
+		       next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
+	}
+
 	if (!pull || rq->push_busy)
 		return 0;
 
+	if (!rt_task(rq->curr))
+		return 0;
+
 	cpu = find_lowest_rq(rq->curr);
 	if (cpu == -1 || cpu == rq->cpu)
 		return 0;

The second condition prevents the BUG_ON() in cpupri_find_fitness().

[...]

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c
  2021-12-20 17:35 ` Dietmar Eggemann
@ 2021-12-21 16:11   ` Valentin Schneider
  2021-12-21 16:45     ` John Keeping
  0 siblings, 1 reply; 16+ messages in thread

From: Valentin Schneider @ 2021-12-21 16:11 UTC (permalink / raw)
To: Dietmar Eggemann, John Keeping, linux-rt-users
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, linux-kernel

On 20/12/21 18:35, Dietmar Eggemann wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..798887f1eeff 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>  	struct task_struct *push_task = NULL;
>  	int cpu;
>
> +	if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> +		printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> +		       next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> +	}
> +
>  	if (!pull || rq->push_busy)
>  		return 0;
>
> +	if (!rt_task(rq->curr))
> +		return 0;
> +

If current is a DL/stopper task, why not; if that's CFS (which IIUC is
your case), that's buggered: we shouldn't be trying to pull RT tasks
when we have queued RT tasks and a less-than-RT current, we should be
rescheduling right now.

I'm thinking this can happen via rt_mutex_setprio() when we demote an
RT-boosted CFS task (or straight up sched_setscheduler()):
check_class_changed()->switched_from_rt() doesn't trigger a
resched_curr(), so I suspect we get to the push/pull callback before
getting a resched (I actually don't see where we'd get a resched in
that case other than at the next tick).

IOW, feels like we want the below. Unfortunately I can't reproduce the
issue locally (yet), so that's untested.
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded CPU, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	deadline_queue_pull_task(rq);
+	if (!rq->dl.dl_nr_running)
+		deadline_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+		resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..1ea2567612fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
 		return;
 
-	rt_queue_pull_task(rq);
+	if (!rq->rt.rt_nr_running)
+		rt_queue_pull_task(rq);
+	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+		resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-21 16:11 ` Valentin Schneider @ 2021-12-21 16:45 ` John Keeping 2021-12-21 17:22 ` Valentin Schneider 2021-12-22 17:46 ` Dietmar Eggemann 0 siblings, 2 replies; 16+ messages in thread From: John Keeping @ 2021-12-21 16:45 UTC (permalink / raw) To: Valentin Schneider Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Tue, 21 Dec 2021 16:11:34 +0000 Valentin Schneider <valentin.schneider@arm.com> wrote: > On 20/12/21 18:35, Dietmar Eggemann wrote: > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > > index ef8228d19382..798887f1eeff 100644 > > --- a/kernel/sched/rt.c > > +++ b/kernel/sched/rt.c > > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull) > > struct task_struct *push_task = NULL; > > int cpu; > > > > + if (WARN_ON_ONCE(!rt_task(rq->curr))) { > > + printk("next_task=[%s %d] rq->curr=[%s %d]\n", > > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid); > > + } > > + > > if (!pull || rq->push_busy) > > return 0; > > > > + if (!rt_task(rq->curr)) > > + return 0; > > + > > If current is a DL/stopper task, why not; if that's CFS (which IIUC is your > case), that's buggered: we shouldn't be trying to pull RT tasks when we > have queued RT tasks and a less-than-RT current, we should be rescheduling > right now. > > I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted > CFS task (or straight up sched_setscheduler()): > check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(), > so I suspect we get to the push/pull callback before getting a > resched (I actually don't see where we'd get a resched in that case other > than at the next tick). > > IOW, feels like we want the below. Unfortunately I can't reproduce the > issue locally (yet), so that's untested. 
This patch doesn't make any difference for me - I hit the BUG on the
first boot with this applied.

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * this is the right place to try to pull some other one
>  	 * from an overloaded CPU, if any.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>
> -	deadline_queue_pull_task(rq);
> +	if (!rq->dl.dl_nr_running)
> +		deadline_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> +		resched_curr(rq);
> }
>
> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..1ea2567612fb 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>  	 * we may need to handle the pulling of RT tasks
>  	 * now.
>  	 */
> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> +	if (!task_on_rq_queued(p))
>  		return;
>
> -	rt_queue_pull_task(rq);
> +	if (!rq->rt.rt_nr_running)
> +		rt_queue_pull_task(rq);
> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> +		resched_curr(rq);
> }
>
> void __init init_sched_rt_class(void)

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-21 16:45 ` John Keeping @ 2021-12-21 17:22 ` Valentin Schneider 2021-12-21 17:42 ` John Keeping 2021-12-22 17:46 ` Dietmar Eggemann 1 sibling, 1 reply; 16+ messages in thread From: Valentin Schneider @ 2021-12-21 17:22 UTC (permalink / raw) To: John Keeping Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 21/12/21 16:45, John Keeping wrote: > On Tue, 21 Dec 2021 16:11:34 +0000 > Valentin Schneider <valentin.schneider@arm.com> wrote: > >> On 20/12/21 18:35, Dietmar Eggemann wrote: >> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c >> > index ef8228d19382..798887f1eeff 100644 >> > --- a/kernel/sched/rt.c >> > +++ b/kernel/sched/rt.c >> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull) >> > struct task_struct *push_task = NULL; >> > int cpu; >> > >> > + if (WARN_ON_ONCE(!rt_task(rq->curr))) { >> > + printk("next_task=[%s %d] rq->curr=[%s %d]\n", >> > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid); >> > + } >> > + >> > if (!pull || rq->push_busy) >> > return 0; >> > >> > + if (!rt_task(rq->curr)) >> > + return 0; >> > + >> >> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your >> case), that's buggered: we shouldn't be trying to pull RT tasks when we >> have queued RT tasks and a less-than-RT current, we should be rescheduling >> right now. >> >> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted >> CFS task (or straight up sched_setscheduler()): >> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(), >> so I suspect we get to the push/pull callback before getting a >> resched (I actually don't see where we'd get a resched in that case other >> than at the next tick). >> >> IOW, feels like we want the below. 
Unfortunately I can't reproduce the
>> issue locally (yet), so that's untested.
>
> This patch doesn't make any difference for me - I hit the BUG on the
> first boot with this applied.
>

Thanks for the swift testing!

Did you give Dietmar's patch a try? ITSM it lacks a resched_curr(), but
if we can somehow get to the push IRQ work before rescheduling (which I
think might happen if we try to resched_curr(this_rq)), then we need
his bailout.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-21 17:22 ` Valentin Schneider @ 2021-12-21 17:42 ` John Keeping 0 siblings, 0 replies; 16+ messages in thread From: John Keeping @ 2021-12-21 17:42 UTC (permalink / raw) To: Valentin Schneider Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Tue, 21 Dec 2021 17:22:34 +0000 Valentin Schneider <valentin.schneider@arm.com> wrote: > On 21/12/21 16:45, John Keeping wrote: > > On Tue, 21 Dec 2021 16:11:34 +0000 > > Valentin Schneider <valentin.schneider@arm.com> wrote: > > > >> On 20/12/21 18:35, Dietmar Eggemann wrote: > >> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > >> > index ef8228d19382..798887f1eeff 100644 > >> > --- a/kernel/sched/rt.c > >> > +++ b/kernel/sched/rt.c > >> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull) > >> > struct task_struct *push_task = NULL; > >> > int cpu; > >> > > >> > + if (WARN_ON_ONCE(!rt_task(rq->curr))) { > >> > + printk("next_task=[%s %d] rq->curr=[%s %d]\n", > >> > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid); > >> > + } > >> > + > >> > if (!pull || rq->push_busy) > >> > return 0; > >> > > >> > + if (!rt_task(rq->curr)) > >> > + return 0; > >> > + > >> > >> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your > >> case), that's buggered: we shouldn't be trying to pull RT tasks when we > >> have queued RT tasks and a less-than-RT current, we should be rescheduling > >> right now. > >> > >> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted > >> CFS task (or straight up sched_setscheduler()): > >> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(), > >> so I suspect we get to the push/pull callback before getting a > >> resched (I actually don't see where we'd get a resched in that case other > >> than at the next tick). 
> >>
> >> IOW, feels like we want the below. Unfortunately I can't reproduce the
> >> issue locally (yet), so that's untested.
> >
> > This patch doesn't make any difference for me - I hit the BUG on the
> > first boot with this applied.
> >
>
> Thanks for the swift testing!
>
> Did you give Dietmar's patch a try? ITSM it lacks a resched_curr(), but
> if we can somehow get to the push IRQ work before rescheduling (which I
> think might happen if we try to resched_curr(this_rq)), then we need
> his bailout.

With Dietmar's patch I hit the added WARN_ON_ONCE with:

	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]
	next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]

	# ps -eTo comm,pid,lwp,pri,rtprio,nice,class
	...
	rcu_preempt          11      11  41      1   - FF
	...
	ksoftirqd/1          21      21  19      -   0 TS

Out of three reproductions, rcu_preempt as next_task is consistent, but
I've seen three different tasks in rq->curr (although all with
SCHED_OTHER).

And as expected, the added early return does stop the BUG.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-21 16:45 ` John Keeping 2021-12-21 17:22 ` Valentin Schneider @ 2021-12-22 17:46 ` Dietmar Eggemann 2021-12-22 18:45 ` John Keeping 2021-12-22 19:48 ` Valentin Schneider 1 sibling, 2 replies; 16+ messages in thread From: Dietmar Eggemann @ 2021-12-22 17:46 UTC (permalink / raw) To: John Keeping, Valentin Schneider Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 21.12.21 17:45, John Keeping wrote: > On Tue, 21 Dec 2021 16:11:34 +0000 > Valentin Schneider <valentin.schneider@arm.com> wrote: > >> On 20/12/21 18:35, Dietmar Eggemann wrote: [...] >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >> index fd7c4f972aaf..7d61ceec1a3b 100644 >> --- a/kernel/sched/deadline.c >> +++ b/kernel/sched/deadline.c >> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) >> * this is the right place to try to pull some other one >> * from an overloaded CPU, if any. >> */ >> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) >> + if (!task_on_rq_queued(p)) >> return; >> >> - deadline_queue_pull_task(rq); >> + if (!rq->dl.dl_nr_running) >> + deadline_queue_pull_task(rq); >> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) >> + resched_curr(rq); >> } >> >> /* >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c >> index ef8228d19382..1ea2567612fb 100644 >> --- a/kernel/sched/rt.c >> +++ b/kernel/sched/rt.c >> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) >> * we may need to handle the pulling of RT tasks >> * now. 
>>  	 */
>> -	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> +	if (!task_on_rq_queued(p))
>>  		return;
>>
>> -	rt_queue_pull_task(rq);
>> +	if (!rq->rt.rt_nr_running)
>> +		rt_queue_pull_task(rq);
>> +	else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>> +		resched_curr(rq);

switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
pull_rt_task() -> tell_cpu_to_push() -> irq_work_queue_on(&rq->rd->rto_push_work,)
rto_push_irq_work_func() -> push_rt_task(rq, true)

seems to be the only way with pull=true.

In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.

[   22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
[   22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
[   22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
                                                         ^^^^^^^^^^^^^^^^^^^^^^
[   22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
[   22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
                                              ^^^^^^^^^^^                                         ^^^^^^
[   22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
[   22.288711] Hardware name: ARM Juno development board (r0) (DT)
[   22.288718] Call trace:
[   22.288722]  dump_backtrace+0x0/0x1ac
[   22.288747]  show_stack+0x1c/0x70
[   22.288763]  dump_stack_lvl+0x68/0x84
[   22.288777]  dump_stack+0x1c/0x38
[   22.288788]  push_rt_task.part.0+0x364/0x370
[   22.288805]  rto_push_irq_work_func+0x180/0x190
[   22.288821]  irq_work_single+0x34/0xa0
[   22.288836]  flush_smp_call_function_queue+0x138/0x244
[   22.288852]  generic_smp_call_function_single_interrupt+0x18/0x24
[   22.288867]  ipi_handler+0xb0/0x15c
...

What about slightly changing the layout in switched_from_rt() (only
lightly tested):

@@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
 	 * we may need to handle the pulling of RT tasks
 	 * now.
 	 */
-	if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+	if (!task_on_rq_queued(p))
+		return;
+
+	if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
+		resched_curr(rq);
+		return;
+	}
+
+	if (rq->rt.rt_nr_running)
 		return;
 
 	rt_queue_pull_task(rq);

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-22 17:46 ` Dietmar Eggemann @ 2021-12-22 18:45 ` John Keeping 2021-12-22 19:48 ` Valentin Schneider 1 sibling, 0 replies; 16+ messages in thread From: John Keeping @ 2021-12-22 18:45 UTC (permalink / raw) To: Dietmar Eggemann Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Wed, 22 Dec 2021 18:46:57 +0100 Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 21.12.21 17:45, John Keeping wrote: > > On Tue, 21 Dec 2021 16:11:34 +0000 > > Valentin Schneider <valentin.schneider@arm.com> wrote: > > > >> On 20/12/21 18:35, Dietmar Eggemann wrote: > > [...] > > >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > >> index fd7c4f972aaf..7d61ceec1a3b 100644 > >> --- a/kernel/sched/deadline.c > >> +++ b/kernel/sched/deadline.c > >> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) > >> * this is the right place to try to pull some other one > >> * from an overloaded CPU, if any. > >> */ > >> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) > >> + if (!task_on_rq_queued(p)) > >> return; > >> > >> - deadline_queue_pull_task(rq); > >> + if (!rq->dl.dl_nr_running) > >> + deadline_queue_pull_task(rq); > >> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) > >> + resched_curr(rq); > >> } > >> > >> /* > >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > >> index ef8228d19382..1ea2567612fb 100644 > >> --- a/kernel/sched/rt.c > >> +++ b/kernel/sched/rt.c > >> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) > >> * we may need to handle the pulling of RT tasks > >> * now. 
> >> */ > >> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > >> + if (!task_on_rq_queued(p)) > >> return; > >> > >> - rt_queue_pull_task(rq); > >> + if (!rq->rt.rt_nr_running) > >> + rt_queue_pull_task(rq); > >> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) > >> + resched_curr(rq); > > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task) > pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,) > rto_push_irq_work_func() -> push_rt_task(rq, true) > > seems to be the only way with pull=true. > > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens. > > [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35] > [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120 > [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0 > ^^^^^^^^^^^^^^^^^^^^^^ > [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98 > [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1 > ^^^^^^^^^^ ^^^^^^ > [ 22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36 > [ 22.288711] Hardware name: ARM Juno development board (r0) (DT) > [ 22.288718] Call trace: > [ 22.288722] dump_backtrace+0x0/0x1ac > [ 22.288747] show_stack+0x1c/0x70 > [ 22.288763] dump_stack_lvl+0x68/0x84 > [ 22.288777] dump_stack+0x1c/0x38 > [ 22.288788] push_rt_task.part.0+0x364/0x370 > [ 22.288805] rto_push_irq_work_func+0x180/0x190 > [ 22.288821] irq_work_single+0x34/0xa0 > [ 22.288836] flush_smp_call_function_queue+0x138/0x244 > [ 22.288852] generic_smp_call_function_single_interrupt+0x18/0x24 > [ 22.288867] ipi_handler+0xb0/0x15c > ... 
> > What about slightly changing the layout in switched_from_rt() (only lightly tested): I still see the BUG splat with the patch below applied :-( > @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) > * we may need to handle the pulling of RT tasks > * now. > */ > - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > + if (!task_on_rq_queued(p)) > + return; > + > + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) { > + resched_curr(rq); > + return; > + } > + > + if (rq->rt.rt_nr_running) > return; > > rt_queue_pull_task(rq); ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-22 17:46 ` Dietmar Eggemann 2021-12-22 18:45 ` John Keeping @ 2021-12-22 19:48 ` Valentin Schneider 2021-12-23 11:58 ` John Keeping 2022-01-07 10:46 ` Dietmar Eggemann 1 sibling, 2 replies; 16+ messages in thread From: Valentin Schneider @ 2021-12-22 19:48 UTC (permalink / raw) To: Dietmar Eggemann, John Keeping Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 22/12/21 18:46, Dietmar Eggemann wrote: > On 21.12.21 17:45, John Keeping wrote: >> On Tue, 21 Dec 2021 16:11:34 +0000 >> Valentin Schneider <valentin.schneider@arm.com> wrote: >> >>> On 20/12/21 18:35, Dietmar Eggemann wrote: > > [...] > >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >>> index fd7c4f972aaf..7d61ceec1a3b 100644 >>> --- a/kernel/sched/deadline.c >>> +++ b/kernel/sched/deadline.c >>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) >>> * this is the right place to try to pull some other one >>> * from an overloaded CPU, if any. >>> */ >>> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) >>> + if (!task_on_rq_queued(p)) >>> return; >>> >>> - deadline_queue_pull_task(rq); >>> + if (!rq->dl.dl_nr_running) >>> + deadline_queue_pull_task(rq); >>> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) >>> + resched_curr(rq); >>> } >>> >>> /* >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c >>> index ef8228d19382..1ea2567612fb 100644 >>> --- a/kernel/sched/rt.c >>> +++ b/kernel/sched/rt.c >>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) >>> * we may need to handle the pulling of RT tasks >>> * now. 
>>> */ >>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) >>> + if (!task_on_rq_queued(p)) >>> return; >>> >>> - rt_queue_pull_task(rq); >>> + if (!rq->rt.rt_nr_running) >>> + rt_queue_pull_task(rq); >>> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) >>> + resched_curr(rq); > > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task) > pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,) > rto_push_irq_work_func() -> push_rt_task(rq, true) > > seems to be the only way with pull=true. > > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens. > > [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35] > [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120 > [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0 > ^^^^^^^^^^^^^^^^^^^^^^ > [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98 > [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1 > ^^^^^^^^^^ ^^^^^^ mark_wakeup_next_waiter() first deboosts the previous owner and then wakeups the next top waiter. Seems like you somehow have the wakeup happen before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should only pick a CPU that is in rq->rd->rto_mask, which requires having at least 2 RT tasks there... Now, that wakeup from the rtmutex unlock would give us a resched_curr() via check_preempt_curr() if required, which is good, though I think we are still missing some for sched_setscheduler() (there are no wakeups there). So if we just have to live with an IRQ work popping in before we get to preempt_schedule_irq() (or somesuch), then perhaps the below would be sufficient. 
> What about slightly changing the layout in switched_from_rt() (only lightly tested): > > > @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) > * we may need to handle the pulling of RT tasks > * now. > */ > - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > + if (!task_on_rq_queued(p)) > + return; > + > + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) { > + resched_curr(rq); > + return; > + } > + > + if (rq->rt.rt_nr_running) > return; > > rt_queue_pull_task(rq); If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at least from RT's perspective; p->sched_class->switched_to() takes care of the rest) --- diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index fd7c4f972aaf..7d61ceec1a3b 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) * this is the right place to try to pull some other one * from an overloaded CPU, if any. */ - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) + if (!task_on_rq_queued(p)) return; - deadline_queue_pull_task(rq); + if (!rq->dl.dl_nr_running) + deadline_queue_pull_task(rq); + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) + resched_curr(rq); } /* diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index ef8228d19382..8f3e3a1367b6 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) if (!next_task) return 0; + /* + * It's possible that the next_task slipped in of higher priority than + * current, or current has *just* changed priority. If that's the case + * just reschedule current. 
+ */ + if (unlikely(next_task->prio < rq->curr->prio)) { + resched_curr(rq); + return 0; + } + retry: if (is_migration_disabled(next_task)) { struct task_struct *push_task = NULL; @@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull) if (WARN_ON(next_task == rq->curr)) return 0; - /* - * It's possible that the next_task slipped in of - * higher priority than current. If that's the case - * just reschedule current. - */ - if (unlikely(next_task->prio < rq->curr->prio)) { - resched_curr(rq); - return 0; - } - /* We might release rq lock */ get_task_struct(next_task); @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) * we may need to handle the pulling of RT tasks * now. */ - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) + if (!task_on_rq_queued(p)) return; - rt_queue_pull_task(rq); + if (!rq->rt.rt_nr_running) + rt_queue_pull_task(rq); + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) + resched_curr(rq); } void __init init_sched_rt_class(void) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-22 19:48 ` Valentin Schneider @ 2021-12-23 11:58 ` John Keeping 2021-12-23 14:05 ` Valentin Schneider 2022-01-07 10:46 ` Dietmar Eggemann 1 sibling, 1 reply; 16+ messages in thread From: John Keeping @ 2021-12-23 11:58 UTC (permalink / raw) To: Valentin Schneider Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote: > On 22/12/21 18:46, Dietmar Eggemann wrote: > > On 21.12.21 17:45, John Keeping wrote: > >> On Tue, 21 Dec 2021 16:11:34 +0000 > >> Valentin Schneider <valentin.schneider@arm.com> wrote: > >> > >>> On 20/12/21 18:35, Dietmar Eggemann wrote: > > > > [...] > > > >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > >>> index fd7c4f972aaf..7d61ceec1a3b 100644 > >>> --- a/kernel/sched/deadline.c > >>> +++ b/kernel/sched/deadline.c > >>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) > >>> * this is the right place to try to pull some other one > >>> * from an overloaded CPU, if any. > >>> */ > >>> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) > >>> + if (!task_on_rq_queued(p)) > >>> return; > >>> > >>> - deadline_queue_pull_task(rq); > >>> + if (!rq->dl.dl_nr_running) > >>> + deadline_queue_pull_task(rq); > >>> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) > >>> + resched_curr(rq); > >>> } > >>> > >>> /* > >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > >>> index ef8228d19382..1ea2567612fb 100644 > >>> --- a/kernel/sched/rt.c > >>> +++ b/kernel/sched/rt.c > >>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) > >>> * we may need to handle the pulling of RT tasks > >>> * now. 
> >>> */ > >>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > >>> + if (!task_on_rq_queued(p)) > >>> return; > >>> > >>> - rt_queue_pull_task(rq); > >>> + if (!rq->rt.rt_nr_running) > >>> + rt_queue_pull_task(rq); > >>> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) > >>> + resched_curr(rq); > > > > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task) > > pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,) > > rto_push_irq_work_func() -> push_rt_task(rq, true) > > > > seems to be the only way with pull=true. > > > > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens. > > > > [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35] > > [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120 > > [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0 > > ^^^^^^^^^^^^^^^^^^^^^^ > > [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98 > > [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1 > > ^^^^^^^^^^ ^^^^^^ > > mark_wakeup_next_waiter() first deboosts the previous owner and then > wakeups the next top waiter. Seems like you somehow have the wakeup happen > before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should > only pick a CPU that is in rq->rd->rto_mask, which requires having at least > 2 RT tasks there... > > Now, that wakeup from the rtmutex unlock would give us a resched_curr() via > check_preempt_curr() if required, which is good, though I think we are > still missing some for sched_setscheduler() (there are no wakeups > there). So if we just have to live with an IRQ work popping in before we > get to preempt_schedule_irq() (or somesuch), then perhaps the below would > be sufficient. With this patch I ran 100 reboots without hitting the BUG, so it looks like this is the solution! 
If you post this as a patch, feel free to add: Tested-by: John Keeping <john@metanate.com> > --- > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index fd7c4f972aaf..7d61ceec1a3b 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) > * this is the right place to try to pull some other one > * from an overloaded CPU, if any. > */ > - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) > + if (!task_on_rq_queued(p)) > return; > > - deadline_queue_pull_task(rq); > + if (!rq->dl.dl_nr_running) > + deadline_queue_pull_task(rq); > + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class)) > + resched_curr(rq); > } > > /* > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > index ef8228d19382..8f3e3a1367b6 100644 > --- a/kernel/sched/rt.c > +++ b/kernel/sched/rt.c > @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) > if (!next_task) > return 0; > > + /* > + * It's possible that the next_task slipped in of higher priority than > + * current, or current has *just* changed priority. If that's the case > + * just reschedule current. > + */ > + if (unlikely(next_task->prio < rq->curr->prio)) { > + resched_curr(rq); > + return 0; > + } > + > retry: > if (is_migration_disabled(next_task)) { > struct task_struct *push_task = NULL; > @@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull) > if (WARN_ON(next_task == rq->curr)) > return 0; > > - /* > - * It's possible that the next_task slipped in of > - * higher priority than current. If that's the case > - * just reschedule current. > - */ > - if (unlikely(next_task->prio < rq->curr->prio)) { > - resched_curr(rq); > - return 0; > - } > - > /* We might release rq lock */ > get_task_struct(next_task); > > @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) > * we may need to handle the pulling of RT tasks > * now. 
> */ > - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > + if (!task_on_rq_queued(p)) > return; > > - rt_queue_pull_task(rq); > + if (!rq->rt.rt_nr_running) > + rt_queue_pull_task(rq); > + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) > + resched_curr(rq); > } > > void __init init_sched_rt_class(void) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-23 11:58 ` John Keeping @ 2021-12-23 14:05 ` Valentin Schneider 0 siblings, 0 replies; 16+ messages in thread From: Valentin Schneider @ 2021-12-23 14:05 UTC (permalink / raw) To: John Keeping Cc: Dietmar Eggemann, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 23/12/21 11:58, John Keeping wrote: > On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote: >> mark_wakeup_next_waiter() first deboosts the previous owner and then >> wakeups the next top waiter. Seems like you somehow have the wakeup happen >> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should >> only pick a CPU that is in rq->rd->rto_mask, which requires having at least >> 2 RT tasks there... >> >> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via >> check_preempt_curr() if required, which is good, though I think we are >> still missing some for sched_setscheduler() (there are no wakeups >> there). So if we just have to live with an IRQ work popping in before we >> get to preempt_schedule_irq() (or somesuch), then perhaps the below would >> be sufficient. > > With this patch I ran 100 reboots without hitting the BUG, so it looks > like this is the solution! > > If you post this as a patch, feel free to add: > > Tested-by: John Keeping <john@metanate.com> Thanks! I still need to convince myself beyond reasonable doubt that this is really what is happening (esp the sched_setscheduler()) part, so the next episode will probably air after the break :-) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2021-12-22 19:48 ` Valentin Schneider 2021-12-23 11:58 ` John Keeping @ 2022-01-07 10:46 ` Dietmar Eggemann 2022-01-07 11:49 ` John Keeping 2022-01-14 18:25 ` Valentin Schneider 1 sibling, 2 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-07 10:46 UTC (permalink / raw) To: Valentin Schneider, John Keeping Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 22/12/2021 20:48, Valentin Schneider wrote: > On 22/12/21 18:46, Dietmar Eggemann wrote: >> On 21.12.21 17:45, John Keeping wrote: >>> On Tue, 21 Dec 2021 16:11:34 +0000 >>> Valentin Schneider <valentin.schneider@arm.com> wrote: >>> >>>> On 20/12/21 18:35, Dietmar Eggemann wrote: [...] >> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task) >> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,) >> rto_push_irq_work_func() -> push_rt_task(rq, true) >> >> seems to be the only way with pull=true. >> >> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens. >> >> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35] >> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120 >> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0 >> ^^^^^^^^^^^^^^^^^^^^^^ >> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98 >> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1 >> ^^^^^^^^^^ ^^^^^^ > > mark_wakeup_next_waiter() first deboosts the previous owner and then > wakeups the next top waiter. Seems like you somehow have the wakeup happen > before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should > only pick a CPU that is in rq->rd->rto_mask, which requires having at least > 2 RT tasks there... 
True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2: mark_wakeup_next_waiter() (1) /* deboost */ rt_mutex_adjust_prio() rt_mutex_setprio(current, ...) rq = __task_rq_lock(current, ); check_class_changed(rq, p, prev_class, oldprio) switched_from_rt() if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) return; rt_queue_pull_task(rq) queue_balance_callback(rq, ..., pull_rt_task); pull_rt_task() tell_cpu_to_push() *cpu* = rto_next_cpu(rq->rd) irq_work_queue_on(&rq->rd->rto_push_work, *cpu*) rto_push_irq_work_func() push_rt_task(rq, true) <-- !!! (2) /* waking the top waiter */ rt_mutex_wake_q_add(wqh, waiter); > Now, that wakeup from the rtmutex unlock would give us a resched_curr() via > check_preempt_curr() if required, which is good, though I think we are > still missing some for sched_setscheduler() (there are no wakeups > there). So if we just have to live with an IRQ work popping in before we > get to preempt_schedule_irq() (or somesuch), then perhaps the below would > be sufficient. I think that's the case here but we are on the RT overloaded CPU (*cpu*). > >> What about slightly changing the layout in switched_from_rt() (only lightly tested): >> >> >> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p) >> * we may need to handle the pulling of RT tasks >> * now. >> */ >> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) >> + if (!task_on_rq_queued(p)) >> + return; >> + >> + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) { >> + resched_curr(rq); >> + return; >> + } >> + >> + if (rq->rt.rt_nr_running) >> return; >> >> rt_queue_pull_task(rq); > > If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at > least from RT's perspective; p->sched_class->switched_to() takes care of > the rest) Right. [...] 
> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> 	if (!next_task)
> 		return 0;
>
> +	/*
> +	 * It's possible that the next_task slipped in of higher priority than
> +	 * current, or current has *just* changed priority. If that's the case
> +	 * just reschedule current.
> +	 */
> +	if (unlikely(next_task->prio < rq->curr->prio)) {
> +		resched_curr(rq);
> +		return 0;
> +	}

IMHO, that's the bit which prevents the BUG.

But this would also prevent the case in which rq->curr is an RT task
with lower prio than next_task.

Also `rq->curr = migration/X` still goes through, which is somehow fine
since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).

And DL tasks (like sugov:X) go through and they can have
task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
can handle task_pri, p->prio = -1 (CPUPRI_INVALID) although it's somewhat
by coincidence.

So maybe something like this:

@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (!pull || rq->push_busy)
 		return 0;
 
+	if (rq->curr->sched_class != &rt_sched_class) {
+		resched_curr(rq);
+		return 0;
+	}
+
 	cpu = find_lowest_rq(rq->curr);

[...]
* Re: [RT] BUG in sched/cpupri.c 2022-01-07 10:46 ` Dietmar Eggemann @ 2022-01-07 11:49 ` John Keeping 2022-01-07 14:25 ` Dietmar Eggemann 2022-01-14 18:25 ` Valentin Schneider 1 sibling, 1 reply; 16+ messages in thread From: John Keeping @ 2022-01-07 11:49 UTC (permalink / raw) To: Dietmar Eggemann Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote: > On 22/12/2021 20:48, Valentin Schneider wrote: > > /* > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > > index ef8228d19382..8f3e3a1367b6 100644 > > --- a/kernel/sched/rt.c > > +++ b/kernel/sched/rt.c > > @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) > > if (!next_task) > > return 0; > > > > + /* > > + * It's possible that the next_task slipped in of higher priority than > > + * current, or current has *just* changed priority. If that's the case > > + * just reschedule current. > > + */ > > + if (unlikely(next_task->prio < rq->curr->prio)) { > > + resched_curr(rq); > > + return 0; > > + } > > IMHO, that's the bit which prevents the BUG. > > But this would also prevent the case in which rq->curr is an RT task > with lower prio than next_task. > > Also `rq->curr = migration/X` goes still though which is somehow fine > since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1). > > And DL tasks (like sugov:X go through and they can have > task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared > freuency domains with schedutil). cpupri_find_fitness()->convert_prio() > can handle task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow > by coincidence. > > So maybe something like this: Do you mean to replace just the one hunk from Valentin's patch with the change below (keeping the rest), or are you saying that only the change below is needed? 
> @ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull) > if (!pull || rq->push_busy) > return 0; > > + if (rq->curr->sched_class != &rt_sched_class) { > + resched_curr(rq); > + return 0; > + } > + > cpu = find_lowest_rq(rq->curr); > > [...] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2022-01-07 11:49 ` John Keeping @ 2022-01-07 14:25 ` Dietmar Eggemann 2022-01-07 18:35 ` John Keeping 0 siblings, 1 reply; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-07 14:25 UTC (permalink / raw) To: John Keeping Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On 07/01/2022 12:49, John Keeping wrote: > On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote: >> On 22/12/2021 20:48, Valentin Schneider wrote: >>> /* >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c >>> index ef8228d19382..8f3e3a1367b6 100644 >>> --- a/kernel/sched/rt.c >>> +++ b/kernel/sched/rt.c >>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) >>> if (!next_task) >>> return 0; >>> >>> + /* >>> + * It's possible that the next_task slipped in of higher priority than >>> + * current, or current has *just* changed priority. If that's the case >>> + * just reschedule current. >>> + */ >>> + if (unlikely(next_task->prio < rq->curr->prio)) { >>> + resched_curr(rq); >>> + return 0; >>> + } >> >> IMHO, that's the bit which prevents the BUG. >> >> But this would also prevent the case in which rq->curr is an RT task >> with lower prio than next_task. >> >> Also `rq->curr = migration/X` goes still though which is somehow fine >> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1). >> >> And DL tasks (like sugov:X go through and they can have >> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared >> freuency domains with schedutil). cpupri_find_fitness()->convert_prio() >> can handle task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow >> by coincidence. >> >> So maybe something like this: > > Do you mean to replace just the one hunk from Valentin's patch with the > change below (keeping the rest), or are you saying that only the change > below is needed? 
The latter.

I think Valentin wanted to see if something like this can also occur via
sched_setscheduler() and, if so, maybe changes in switched_from_[rt/dl]
will be necessary.

>> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>> 	if (!pull || rq->push_busy)
>> 		return 0;
>>
>> +	if (rq->curr->sched_class != &rt_sched_class) {
>> +		resched_curr(rq);
>> +		return 0;
>> +	}
>> +
>> 	cpu = find_lowest_rq(rq->curr);
>>
>> [...]
* Re: [RT] BUG in sched/cpupri.c 2022-01-07 14:25 ` Dietmar Eggemann @ 2022-01-07 18:35 ` John Keeping 0 siblings, 0 replies; 16+ messages in thread From: John Keeping @ 2022-01-07 18:35 UTC (permalink / raw) To: Dietmar Eggemann Cc: Valentin Schneider, linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel On Fri, Jan 07, 2022 at 03:25:21PM +0100, Dietmar Eggemann wrote: > On 07/01/2022 12:49, John Keeping wrote: > > On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote: > >> On 22/12/2021 20:48, Valentin Schneider wrote: > >>> /* > >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > >>> index ef8228d19382..8f3e3a1367b6 100644 > >>> --- a/kernel/sched/rt.c > >>> +++ b/kernel/sched/rt.c > >>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) > >>> if (!next_task) > >>> return 0; > >>> > >>> + /* > >>> + * It's possible that the next_task slipped in of higher priority than > >>> + * current, or current has *just* changed priority. If that's the case > >>> + * just reschedule current. > >>> + */ > >>> + if (unlikely(next_task->prio < rq->curr->prio)) { > >>> + resched_curr(rq); > >>> + return 0; > >>> + } > >> > >> IMHO, that's the bit which prevents the BUG. > >> > >> But this would also prevent the case in which rq->curr is an RT task > >> with lower prio than next_task. > >> > >> Also `rq->curr = migration/X` goes still though which is somehow fine > >> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1). > >> > >> And DL tasks (like sugov:X go through and they can have > >> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared > >> freuency domains with schedutil). cpupri_find_fitness()->convert_prio() > >> can handle task_pri, p->prio = -1 (CPUPRI_INVALID) although its somehow > >> by coincidence. 
> >> > >> So maybe something like this: > > > > Do you mean to replace just the one hunk from Valentin's patch with the > > change below (keeping the rest), or are you saying that only the change > > below is needed? > > The latter. Thanks! I tested the patch below and can confirm that I no longer see any BUGs with this applied. Tested-By: John Keeping <john@metanate.com> --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull) if (!pull || rq->push_busy) return 0; + if (rq->curr->sched_class != &rt_sched_class) { + resched_curr(rq); + return 0; + } + cpu = find_lowest_rq(rq->curr); if (cpu == -1 || cpu == rq->cpu) return 0; ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RT] BUG in sched/cpupri.c 2022-01-07 10:46 ` Dietmar Eggemann 2022-01-07 11:49 ` John Keeping @ 2022-01-14 18:25 ` Valentin Schneider 1 sibling, 0 replies; 16+ messages in thread From: Valentin Schneider @ 2022-01-14 18:25 UTC (permalink / raw) To: Dietmar Eggemann, John Keeping Cc: linux-rt-users, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel Trying to page this back in... On 07/01/22 11:46, Dietmar Eggemann wrote: > On 22/12/2021 20:48, Valentin Schneider wrote: >> On 22/12/21 18:46, Dietmar Eggemann wrote: >>> On 21.12.21 17:45, John Keeping wrote: >>>> On Tue, 21 Dec 2021 16:11:34 +0000 >>>> Valentin Schneider <valentin.schneider@arm.com> wrote: >>>> >>>>> On 20/12/21 18:35, Dietmar Eggemann wrote: > > [...] > >>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task) >>> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,) >>> rto_push_irq_work_func() -> push_rt_task(rq, true) >>> >>> seems to be the only way with pull=true. >>> >>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens. >>> >>> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35] >>> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120 >>> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0 >>> ^^^^^^^^^^^^^^^^^^^^^^ >>> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98 >>> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1 >>> ^^^^^^^^^^ ^^^^^^ >> >> mark_wakeup_next_waiter() first deboosts the previous owner and then >> wakeups the next top waiter. Seems like you somehow have the wakeup happen >> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should >> only pick a CPU that is in rq->rd->rto_mask, which requires having at least >> 2 RT tasks there... 
> > True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2: > > mark_wakeup_next_waiter() > > (1) /* deboost */ > rt_mutex_adjust_prio() > > rt_mutex_setprio(current, ...) > > rq = __task_rq_lock(current, ); > check_class_changed(rq, p, prev_class, oldprio) > > switched_from_rt() > > if (!task_on_rq_queued(p) || rq->rt.rt_nr_running) > return; > > rt_queue_pull_task(rq) > > queue_balance_callback(rq, ..., pull_rt_task); > > pull_rt_task() > > tell_cpu_to_push() > > *cpu* = rto_next_cpu(rq->rd) > irq_work_queue_on(&rq->rd->rto_push_work, *cpu*) > > rto_push_irq_work_func() > push_rt_task(rq, true) <-- !!! > > (2) /* waking the top waiter */ > rt_mutex_wake_q_add(wqh, waiter); > >> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via >> check_preempt_curr() if required, which is good, though I think we are >> still missing some for sched_setscheduler() (there are no wakeups >> there). So if we just have to live with an IRQ work popping in before we >> get to preempt_schedule_irq() (or somesuch), then perhaps the below would >> be sufficient. > > I think that's the case here but we are on the RT overloaded CPU (*cpu*). > So one thing I wasn't entirely clear on (and holidays didn't fix that) is the rt_queue_pull_task() from switched_from_rt() only happens if that rq has no other runnable RT tasks, so I don't quite see how the irq_work_queue_on() can end up as a self-IPI... >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c >> index ef8228d19382..8f3e3a1367b6 100644 >> --- a/kernel/sched/rt.c >> +++ b/kernel/sched/rt.c >> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull) >> if (!next_task) >> return 0; >> >> + /* >> + * It's possible that the next_task slipped in of higher priority than >> + * current, or current has *just* changed priority. If that's the case >> + * just reschedule current. 
>> +	 */
>> +	if (unlikely(next_task->prio < rq->curr->prio)) {
>> +		resched_curr(rq);
>> +		return 0;
>> +	}
>
> IMHO, that's the bit which prevents the BUG.
>
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.

I think that's what we want - if current isn't the (logical) highest
priority task on the runqueue, we should forgo push/pull and reschedule
ASAP.

> Also `rq->curr = migration/X` still goes through, which is somehow fine
> since find_lowest_rq() bails out for if (task->nr_cpus_allowed == 1).
>
> And DL tasks (like sugov:X) go through and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> frequency domains with schedutil). cpupri_find_fitness()->convert_prio()
> can handle task_pri, p->prio = -1 (CPUPRI_INVALID) although it's somewhat
> by coincidence.

Right. This reminds me of:
https://lore.kernel.org/lkml/jhjblbx7glh.mognet@arm.com/

> So maybe something like this:

Ah, so you explicitly prevent rt.c::find_lowest_rq() invocations with a
non-RT task... But what if current is an RT task that just got deboosted,
so that next_task->prio < rq->curr->prio? IMO we should reschedule ASAP
(as I already blabbered about above).

If next_task is migration_disabled but higher (logical) prio than
current, we don't need to do any of the migration_disabled specific
crud, we just reschedule.

> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
> 	if (!pull || rq->push_busy)
> 		return 0;
>
> +	if (rq->curr->sched_class != &rt_sched_class) {
> +		resched_curr(rq);
> +		return 0;
> +	}
> +
> 	cpu = find_lowest_rq(rq->curr);
>
> [...]
Thread overview: 16+ messages
2021-12-18 14:25 [RT] BUG in sched/cpupri.c John Keeping
2021-12-20 17:35 ` Dietmar Eggemann
2021-12-21 16:11 ` Valentin Schneider
2021-12-21 16:45 ` John Keeping
2021-12-21 17:22 ` Valentin Schneider
2021-12-21 17:42 ` John Keeping
2021-12-22 17:46 ` Dietmar Eggemann
2021-12-22 18:45 ` John Keeping
2021-12-22 19:48 ` Valentin Schneider
2021-12-23 11:58 ` John Keeping
2021-12-23 14:05 ` Valentin Schneider
2022-01-07 10:46 ` Dietmar Eggemann
2022-01-07 11:49 ` John Keeping
2022-01-07 14:25 ` Dietmar Eggemann
2022-01-07 18:35 ` John Keeping
2022-01-14 18:25 ` Valentin Schneider