From: Jing-Ting Wu <jing-ting.wu@mediatek.com>
To: Mukesh Ojha <quic_mojha@quicinc.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Tejun Heo <tj@kernel.org>
Cc: <wsd_upstream@mediatek.com>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-mediatek@lists.infradead.org>,
	<Jonathan.JMChen@mediatek.com>,
	"chris.redpath@arm.com" <chris.redpath@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"Vincent Donnefort" <vdonnefort@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Christian Brauner <brauner@kernel.org>, <cgroups@vger.kernel.org>,
	<lixiong.liu@mediatek.com>, <wenju.xu@mediatek.com>
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
Date: Mon, 5 Sep 2022 16:22:29 +0800
Message-ID: <203d4614c1b2a498a240ace287156e9f401d5395.camel@mediatek.com>
In-Reply-To: <b605c3ec-94ab-a55f-5825-9b370d77ecf3@quicinc.com>

Hi, Mukesh



https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T@slm.duckdns.org/ fixes the
cgroup_threadgroup_rwsem <-> cpus_read_lock() deadlock.
This issue, however, is a cgroup_threadgroup_rwsem <-> cpuset_rwsem deadlock.

I don't think they are the same issue.
Is that patch useful for this issue?



Best regards,
Jing-Ting Wu


On Mon, 2022-09-05 at 12:14 +0530, Mukesh Ojha wrote:
> This is fixed by this.
> 
> https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T@slm.duckdns.org/
> 
> -Mukesh
> 
> On 9/5/2022 8:17 AM, Jing-Ting Wu wrote:
> > Hi,
> > 
> > We hit a HANG_DETECT on the T SW version with kernel-5.15.
> > Many tasks had been blocked for a long time.
> > 
> > 
> > Root cause:
> > migration_cpu_stop() does not signal completion: because
> > is_migration_disabled(p) is true, complete stays false and
> > complete_all() is never executed.
> > This leaves other tasks waiting on the rwsems.
> > 
> > Detail:
> > system_server is waiting for cgroup_threadgroup_rwsem.
> > OomAdjuster is holding cgroup_threadgroup_rwsem and waiting for
> > cpuset_rwsem.
> > cpuset_hotplug_workfn is holding cpuset_rwsem and waiting for
> > affine_move_task() to complete.
> > affine_move_task() is waiting for migration_cpu_stop() to complete.
> > 
> > The backtrace of system_server:
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > cgroup_css_set_fork => wait for cgroup_threadgroup_rwsem
> > cgroup_can_fork
> > copy_process
> > kernel_clone
> > 
> > The backtrace of OomAdjuster:
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > percpu_down_write
> > cpuset_can_attach => wait for cpuset_rwsem
> > cgroup_migrate_execute
> > cgroup_attach_task
> > __cgroup1_procs_write => hold cgroup_threadgroup_rwsem
> > cgroup1_procs_write
> > cgroup_file_write
> > kernfs_fop_write_iter
> > vfs_write
> > ksys_write
> > 
> > The backtrace of cpuset_hotplug_workfn:
> > __switch_to
> > __schedule
> > schedule
> > schedule_timeout
> > wait_for_common
> > affine_move_task => wait for complete
> > __set_cpus_allowed_ptr_locked
> > update_tasks_cpumask
> > cpuset_hotplug_update_tasks => hold cpuset_rwsem
> > cpuset_hotplug_workfn
> > process_one_work
> > worker_thread
> > kthread
> > 
> > 
> > affine_move_task() calls migration_cpu_stop() and waits for it to
> > complete.
> > In the normal case, when migration_cpu_stop() completes, it notifies
> > every waiter that it is done.
> > But there is an exceptional case in which it does not notify.
> > If is_migration_disabled(p) is true, complete remains false and
> > complete_all() is never executed.
> > 
> > static int migration_cpu_stop(void *data)
> > {
> > ...
> >      bool complete = false;
> > ...
> > 
> >      if (task_rq(p) == rq) {
> >          if (is_migration_disabled(p))
> >              goto out; /* migration disabled: complete stays false */
> >          ...
> >      }
> > ...
> > 
> > out:
> > ...
> >      if (complete) /* false here, so complete_all() is never executed */
> >          complete_all(&pending->done);
> > 
> >      return 0;
> > }
> > 
> > 
> > Reviewing the code, we found many places that can change the value
> > returned by is_migration_disabled()
> > (such as __rt_spin_lock(), rt_read_lock(), rt_write_lock(), ...).
> > 
> > Do you have any suggestions for this issue?
> > Thank you.
> > 
> > 
> > 
> > 
> > Best regards,
> > Jing-Ting Wu
> > 
> > 

