From: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
To: "peterz@infradead.org" <peterz@infradead.org>
Cc: "dietmar.eggemann@arm.com" <dietmar.eggemann@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mediatek@lists.infradead.org" 
	<linux-mediatek@lists.infradead.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	wsd_upstream <wsd_upstream@mediatek.com>,
	"vschneid@redhat.com" <vschneid@redhat.com>,
	"bristot@redhat.com" <bristot@redhat.com>,
	"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	"bsegall@google.com" <bsegall@google.com>,
	"mgorman@suse.de" <mgorman@suse.de>,
	"matthias.bgg@gmail.com" <matthias.bgg@gmail.com>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"angelogioacchino.delregno@collabora.com" 
	<angelogioacchino.delregno@collabora.com>
Subject: Re: [PATCH 1/1] sched/core: Fix stuck on completion for affine_move_task() when stopper disable
Date: Wed, 27 Sep 2023 15:57:35 +0000
Message-ID: <b9def8f3d9426bc158b302f4474b6e643b46d206.camel@mediatek.com>
In-Reply-To: <20230927080850.GB21824@noisy.programming.kicks-ass.net>

On Wed, 2023-09-27 at 10:08 +0200, Peter Zijlstra wrote:
> On Wed, Sep 27, 2023 at 11:34:28AM +0800, Kuyo Chang wrote:
> > From: kuyo chang <kuyo.chang@mediatek.com>
> > 
> > [Symptom] The hung task detector shows the warning below:
> > [ 4320.666557] [   T56] khungtaskd: [name:hung_task&]INFO: task stressapptest:17803 blocked for more than 3600 seconds.
> > [ 4320.666589] [   T56] khungtaskd: [name:core&]task:stressapptest   state:D stack:0     pid:17803 ppid:17579  flags:0x04000008
> > [ 4320.666601] [   T56] khungtaskd: Call trace:
> > [ 4320.666607] [   T56] khungtaskd:  __switch_to+0x17c/0x338
> > [ 4320.666642] [   T56] khungtaskd:  __schedule+0x54c/0x8ec
> > [ 4320.666651] [   T56] khungtaskd:  schedule+0x74/0xd4
> > [ 4320.666656] [   T56] khungtaskd:  schedule_timeout+0x34/0x108
> > [ 4320.666672] [   T56] khungtaskd:  do_wait_for_common+0xe0/0x154
> > [ 4320.666678] [   T56] khungtaskd:  wait_for_completion+0x44/0x58
> > [ 4320.666681] [   T56] khungtaskd:  __set_cpus_allowed_ptr_locked+0x344/0x730
> > [ 4320.666702] [   T56] khungtaskd:  __sched_setaffinity+0x118/0x160
> > [ 4320.666709] [   T56] khungtaskd:  sched_setaffinity+0x10c/0x248
> > [ 4320.666715] [   T56] khungtaskd:  __arm64_sys_sched_setaffinity+0x15c/0x1c0
> > [ 4320.666719] [   T56] khungtaskd:  invoke_syscall+0x3c/0xf8
> > [ 4320.666743] [   T56] khungtaskd:  el0_svc_common+0xb0/0xe8
> > [ 4320.666749] [   T56] khungtaskd:  do_el0_svc+0x28/0xa8
> > [ 4320.666755] [   T56] khungtaskd:  el0_svc+0x28/0x9c
> > [ 4320.666761] [   T56] khungtaskd:  el0t_64_sync_handler+0x7c/0xe4
> > [ 4320.666766] [   T56] khungtaskd:  el0t_64_sync+0x18c/0x190
> > 
> > [Analysis]
> > 
> > After adding some debug footprint messages, we found that this issue
> > happens when the stopper is disabled.
> > In that case migration_cpu_stop() cannot be executed to complete the
> > migration, so the task is stuck in wait_for_completion().
> 
> How did you get in this situation?
> 

This issue occurs during a CPU hotplug/set_affinity stress test.
The reproduction rate is very low (about once a week).

So I added some debug messages to snapshot the task status while it is
stuck in wait_for_completion().

Below is the snapshot taken when the issue happened:

cpu_active_mask is 0xFC
new_mask is 0x8
pending->arg.dest_cpu is 0x3
task_on_cpu(rq,p) is 1
task_cpu is 0x2
p->__state = TASK_RUNNING
flags is SCA_CHECK|SCA_USER
stop_one_cpu_nowait(stopper->enabled) return value is false.

I also recorded a footprint in migration_cpu_stop().
It shows that migration_cpu_stop() was not executed.
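
As a cross-check of the snapshot values, the masks can be decoded bit by
bit. The following is a minimal userspace sketch, not kernel code:
mask_has_cpu() is a hypothetical helper and the constants are simply
copied from the snapshot. It shows that CPUs 2-7 are set in
cpu_active_mask, that new_mask contains only CPU 3, and that the task's
current CPU (2) is outside new_mask, which is why a migration to
dest_cpu 3 is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the snapshot in this message. */
#define CPU_ACTIVE_MASK 0xFCu	/* binary 1111 1100: CPUs 2..7 */
#define NEW_MASK        0x08u	/* binary 0000 1000: CPU 3 only */
#define DEST_CPU        3u
#define TASK_CPU        2u

/*
 * Hypothetical helper (userspace only): true if bit `cpu` is set in
 * `mask`. The kernel uses cpumask_test_cpu() on struct cpumask instead.
 */
static bool mask_has_cpu(uint32_t mask, unsigned int cpu)
{
	assert(cpu < 32);
	return (mask >> cpu) & 1u;
}
```

With these values, mask_has_cpu(CPU_ACTIVE_MASK, TASK_CPU) is true while
mask_has_cpu(NEW_MASK, TASK_CPU) is false, and DEST_CPU is set in both
masks.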


> > Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
> > ---
> >  kernel/sched/core.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1dc0b0287e30..98c217a1caa0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3041,8 +3041,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> >  task_rq_unlock(rq, p, rf);
> >  
> >  if (!stop_pending) {
> > -stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > -    &pending->arg, &pending->stop_work);
> > +if (!stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > +    &pending->arg, &pending->stop_work))
> > +return -ENOENT;
> 
> And -ENOENT is the right return code for when the target CPU is not
> available?
> 
> I suspect you're missing more than half the picture and this is a
> band-aid solution at best. Please try harder.
> 

I thought -ENOENT would indicate that the stopper did not execute.
Perhaps the error code is misused here; could you kindly give me some
suggestions?
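
To illustrate why an ignored false return can leave the waiter stuck,
here is a simplified userspace model. It is only a sketch, under the
assumption that the completion is signalled solely by the queued stop
function; every model_* name is hypothetical and none of this is the
kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's data structures. */
struct model_stopper {
	bool enabled;			/* false while the stopper is disabled */
	int (*queued_fn)(void *);	/* pending stop function, if any */
	void *queued_arg;
};

struct model_completion {
	bool done;
};

/* Models the queueing step: work is accepted only if the stopper is enabled. */
static bool model_stop_one_cpu_nowait(struct model_stopper *s,
				      int (*fn)(void *), void *arg)
{
	if (!s->enabled)
		return false;	/* nothing queued, fn will never run */
	s->queued_fn = fn;
	s->queued_arg = arg;
	return true;
}

/* Stand-in for the stop function; in this model only it signals completion. */
static int model_migration_stop(void *arg)
{
	struct model_completion *c = arg;

	c->done = true;
	return 0;
}

/* Runs whatever was queued, as a stopper thread would. */
static void model_run_stopper(struct model_stopper *s)
{
	if (s->queued_fn) {
		s->queued_fn(s->queued_arg);
		s->queued_fn = NULL;
	}
}

/* Returns true if a waiter on `c` would ever be woken in this model. */
static bool model_affine_move(struct model_stopper *s,
			      struct model_completion *c)
{
	bool queued = model_stop_one_cpu_nowait(s, model_migration_stop, c);

	model_run_stopper(s);
	/* In this model the waiter is woken exactly when the work was queued. */
	assert(queued == c->done);
	return c->done;
}
```

In this model, calling model_affine_move() with enabled set to false
returns false and leaves done unset, which corresponds to the caller
blocking in wait_for_completion() indefinitely.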

Thanks,
Kuyo

> >  }
> >  
> >  if (flags & SCA_MIGRATE_ENABLE)
> > -- 
> > 2.18.0
> > 

Thread overview: 25+ messages
2023-09-27  3:34 [PATCH 1/1] sched/core: Fix stuck on completion for affine_move_task() when stopper disable Kuyo Chang
2023-09-27  3:34 ` Kuyo Chang
2023-09-27  8:08 ` Peter Zijlstra
2023-09-27  8:08   ` Peter Zijlstra
2023-09-27 15:57   ` Kuyo Chang (張建文) [this message]
2023-09-27 15:57     ` Kuyo Chang (張建文)
2023-09-28 15:16     ` Peter Zijlstra
2023-09-28 15:16       ` Peter Zijlstra
2023-09-28 15:19       ` Peter Zijlstra
2023-09-28 15:19         ` Peter Zijlstra
2023-09-29 10:21     ` Peter Zijlstra
2023-09-29 10:21       ` Peter Zijlstra
2023-10-01 15:15       ` Kuyo Chang (張建文)
2023-10-01 15:15         ` Kuyo Chang (張建文)
2023-10-10 14:40       ` Kuyo Chang (張建文)
2023-10-10 14:40         ` Kuyo Chang (張建文)
2023-10-10 14:57         ` Peter Zijlstra
2023-10-10 14:57           ` Peter Zijlstra
2023-10-10 20:04           ` [PATCH] sched: Fix stop_one_cpu_nowait() vs hotplug Peter Zijlstra
2023-10-10 20:04             ` Peter Zijlstra
2023-10-11  3:24             ` Kuyo Chang (張建文)
2023-10-11  3:24               ` Kuyo Chang (張建文)
2023-10-11 13:26               ` Peter Zijlstra
2023-10-11 13:26                 ` Peter Zijlstra
2023-10-13  8:06             ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
