From: "Ryan Xiao (肖水林)" <Ryan.Xiao@mediatek.com>
To: "tj@kernel.org" <tj@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"vschneid@redhat.com" <vschneid@redhat.com>
Cc: "linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	wsd_upstream <wsd_upstream@mediatek.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mediatek@lists.infradead.org"
	<linux-mediatek@lists.infradead.org>,
	"Yongjun Luo (罗勇军)" <Yongjun.Luo@mediatek.com>
Subject: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
Date: Wed, 22 Mar 2023 09:37:01 +0000
Message-ID: <a578623ecbae88433876381d3b28ec494f479ab3.camel@mediatek.com>

Hi,

We hit a HANG_DETECT on our T SW version with kernel-5.15.
Many tasks were blocked for a long time.


Root cause:
migration_cpu_stop() never signals completion: because
is_migration_disabled(p) is true, complete stays false and
complete_all() is never executed.
As a result, other tasks are left waiting on the rwsems.

Detail:
system_server is waiting for cgroup_threadgroup_rwsem.
OomAdjuster holds cgroup_threadgroup_rwsem and is waiting for
cpuset_rwsem.
cpuset_hotplug_workfn holds cpuset_rwsem and is waiting for
affine_move_task() to complete.
affine_move_task() is waiting for migration_cpu_stop() to complete.

The backtrace of system_server:
__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
cgroup_css_set_fork => wait for cgroup_threadgroup_rwsem
cgroup_can_fork
copy_process
kernel_clone

The backtrace of OomAdjuster:
__switch_to
__schedule
schedule
percpu_rwsem_wait
percpu_down_write
cpuset_can_attach => wait for cpuset_rwsem
cgroup_migrate_execute
cgroup_attach_task
__cgroup1_procs_write => hold cgroup_threadgroup_rwsem
cgroup1_procs_write
cgroup_file_write
kernfs_fop_write_iter
vfs_write
ksys_write

The backtrace of cpuset_hotplug_workfn:
__switch_to
__schedule
schedule
schedule_timeout
wait_for_common
affine_move_task => wait for complete
__set_cpus_allowed_ptr_locked
update_tasks_cpumask
cpuset_hotplug_update_tasks => hold cpuset_rwsem
cpuset_hotplug_workfn
process_one_work
worker_thread
kthread


affine_move_task() triggers migration_cpu_stop() and waits for it to
complete.
In the normal case, when migration_cpu_stop() finishes it calls
complete_all() to tell every waiter that it is done.
But there is one case in which no waiter is notified:
if is_migration_disabled(p) is true, complete stays false
and complete_all() is never executed.

static int migration_cpu_stop(void *data)
{
...
    bool complete = false;
...

    if (task_rq(p) == rq) {
        if (is_migration_disabled(p))
            goto out;   /* is_migration_disabled(p) is true,
                         * so complete stays false */
        ...
    }
...

out:
...
    if (complete)       /* complete is false, so complete_all()
                         * is never executed */
        complete_all(&pending->done);

    return 0;
}

Reviewing the code, we found that many places can change the value
returned by is_migration_disabled()
(for example: __rt_spin_lock(), rt_read_lock(), rt_write_lock(), ...).

Do you have any suggestions for this issue?
Thank you.
