From: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
To: "peterz@infradead.org" <peterz@infradead.org>
Cc: "dietmar.eggemann@arm.com" <dietmar.eggemann@arm.com>,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    "linux-mediatek@lists.infradead.org" <linux-mediatek@lists.infradead.org>,
    "rostedt@goodmis.org" <rostedt@goodmis.org>,
    wsd_upstream <wsd_upstream@mediatek.com>,
    "vschneid@redhat.com" <vschneid@redhat.com>,
    "bristot@redhat.com" <bristot@redhat.com>,
    "juri.lelli@redhat.com" <juri.lelli@redhat.com>,
    "mingo@redhat.com" <mingo@redhat.com>,
    "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>,
    "bsegall@google.com" <bsegall@google.com>,
    "mgorman@suse.de" <mgorman@suse.de>,
    "matthias.bgg@gmail.com" <matthias.bgg@gmail.com>,
    "vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
    "angelogioacchino.delregno@collabora.com" <angelogioacchino.delregno@collabora.com>
Subject: Re: [PATCH 1/1] sched/core: Fix stuck on completion for affine_move_task() when stopper disable
Date: Wed, 27 Sep 2023 15:57:35 +0000
Message-ID: <b9def8f3d9426bc158b302f4474b6e643b46d206.camel@mediatek.com>
In-Reply-To: <20230927080850.GB21824@noisy.programming.kicks-ass.net>

On Wed, 2023-09-27 at 10:08 +0200, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 11:34:28AM +0800, Kuyo Chang wrote:
> > From: kuyo chang <kuyo.chang@mediatek.com>
> >
> > [Syndrome] hung detect shows below warning msg
> > [ 4320.666557] [ T56] khungtaskd: [name:hung_task&]INFO: task stressapptest:17803 blocked for more than 3600 seconds.
> > [ 4320.666589] [ T56] khungtaskd: [name:core&]task:stressapptest state:D stack:0 pid:17803 ppid:17579 flags:0x04000008
> > [ 4320.666601] [ T56] khungtaskd: Call trace:
> > [ 4320.666607] [ T56] khungtaskd: __switch_to+0x17c/0x338
> > [ 4320.666642] [ T56] khungtaskd: __schedule+0x54c/0x8ec
> > [ 4320.666651] [ T56] khungtaskd: schedule+0x74/0xd4
> > [ 4320.666656] [ T56] khungtaskd: schedule_timeout+0x34/0x108
> > [ 4320.666672] [ T56] khungtaskd: do_wait_for_common+0xe0/0x154
> > [ 4320.666678] [ T56] khungtaskd: wait_for_completion+0x44/0x58
> > [ 4320.666681] [ T56] khungtaskd: __set_cpus_allowed_ptr_locked+0x344/0x730
> > [ 4320.666702] [ T56] khungtaskd: __sched_setaffinity+0x118/0x160
> > [ 4320.666709] [ T56] khungtaskd: sched_setaffinity+0x10c/0x248
> > [ 4320.666715] [ T56] khungtaskd: __arm64_sys_sched_setaffinity+0x15c/0x1c0
> > [ 4320.666719] [ T56] khungtaskd: invoke_syscall+0x3c/0xf8
> > [ 4320.666743] [ T56] khungtaskd: el0_svc_common+0xb0/0xe8
> > [ 4320.666749] [ T56] khungtaskd: do_el0_svc+0x28/0xa8
> > [ 4320.666755] [ T56] khungtaskd: el0_svc+0x28/0x9c
> > [ 4320.666761] [ T56] khungtaskd: el0t_64_sync_handler+0x7c/0xe4
> > [ 4320.666766] [ T56] khungtaskd: el0t_64_sync+0x18c/0x190
> >
> > [Analysis]
> >
> > After add some debug footprint massage, this issue happened at stopper
> > disable case.
> > It cannot exec migration_cpu_stop fun to complete migration.
> > This will cause stuck on wait_for_completion.
>
> How did you get in this situation?
>

This issue occurs during a CPU hotplug/set_affinity stress test.
The reproduction rate is very low (about once a week).

So I added some debug messages to snapshot the task status while it is
stuck on wait_for_completion.

Below is the snapshot taken when the issue happened:

cpu_active_mask is 0xFC
new_mask is 0x8
pending->arg.dest_cpu is 0x3
task_on_cpu(rq,p) is 1
task_cpu is 0x2
p->__state = TASK_RUNNING
flag is SCA_CHECK|SCA_USER
stop_one_cpu_nowait(stopper->enabled) return value is false.

I also recorded a footprint in migration_cpu_stop.
It shows that migration_cpu_stop was not executed.

> > Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
> > ---
> >  kernel/sched/core.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1dc0b0287e30..98c217a1caa0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3041,8 +3041,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> >  		task_rq_unlock(rq, p, rf);
> >
> >  		if (!stop_pending) {
> > -			stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > -					    &pending->arg, &pending->stop_work);
> > +			if (!stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > +					    &pending->arg, &pending->stop_work))
> > +				return -ENOENT;
>
> And -ENOENT is the right return code for when the target CPU is not
> available?
>
> I suspect you're missing more than halp the picture and this is a
> band-aid solution at best. Please try harder.
>

My thinking was that -ENOENT indicates the stopper did not execute.
Perhaps that misuses the error code; could you kindly give me some
suggestions?

Thanks,
Kuyo

> >  		}
> >
> >  		if (flags & SCA_MIGRATE_ENABLE)
> > --
> > 2.18.0
> >
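
For readers following the thread, here is a minimal user-space sketch of the failure mode described above. It is not kernel code: every toy_* name is hypothetical, the "queued" work runs immediately instead of on a stopper thread, and the -ENOENT return only mirrors the patch under review, which the reply above questions. The one behaviour it models is that stop_one_cpu_nowait() returns false and queues nothing when the stopper is disabled, so a caller that ignores the return value never sees its completion signalled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for kernel objects; all names here are hypothetical. */
struct toy_stopper {
	bool enabled;	/* models stopper->enabled for one CPU */
};

struct toy_pending {
	bool done;	/* models the completion the caller waits on */
};

/* True if bit `cpu` is set in a 64-bit CPU mask. */
static bool toy_cpu_in_mask(uint64_t mask, unsigned int cpu)
{
	assert(cpu < 64);
	return (mask >> cpu) & 1u;
}

/* Models the stop callback: the only place the completion is signalled. */
static void toy_migration_stop(struct toy_pending *p)
{
	p->done = true;
}

/*
 * Models stop_one_cpu_nowait(): returns false and queues nothing when the
 * stopper is disabled. Simplification: "queued" work runs immediately.
 */
static bool toy_stop_one_cpu_nowait(struct toy_stopper *s,
				    struct toy_pending *p)
{
	if (!s->enabled)
		return false;
	toy_migration_stop(p);
	return true;
}

/*
 * Models the patched call site: report an error when the work could not
 * be queued. An unpatched caller would go on to wait on p->done, which
 * nobody will ever set.
 */
static int toy_affine_move(struct toy_stopper *s, struct toy_pending *p)
{
	if (!toy_stop_one_cpu_nowait(s, p))
		return -ENOENT;
	return 0;
}
```

Calling toy_affine_move() with enabled == false returns -ENOENT and leaves done unset, which corresponds to the reported hang in wait_for_completion. The mask helper decodes the snapshot: toy_cpu_in_mask(0xFC, 2) and toy_cpu_in_mask(0xFC, 3) are both true, so the task's current CPU (2) and the destination CPU (3) were both still set in cpu_active_mask when the stopper reported itself disabled.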