All of lore.kernel.org
 help / color / mirror / Atom feed
From: Waiman Long <longman@redhat.com>
To: Tejun Heo <tj@kernel.org>, Jing-Ting Wu <jing-ting.wu@mediatek.com>
Cc: Mukesh Ojha <quic_mojha@quicinc.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Valentin Schneider <vschneid@redhat.com>,
	wsd_upstream@mediatek.com, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org, Jonathan.JMChen@mediatek.com,
	"chris.redpath@arm.com" <chris.redpath@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Vincent Donnefort <vdonnefort@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Christian Brauner <brauner@kernel.org>,
	cgroups@vger.kernel.org, lixiong.liu@mediatek.com,
	wenju.xu@mediatek.com
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
Date: Tue, 6 Sep 2022 16:40:03 -0400	[thread overview]
Message-ID: <36a73401-7011-834a-7949-c65a2f66246c@redhat.com> (raw)
In-Reply-To: <02b8e7b3-941d-8bb9-cd0e-992738893ba3@redhat.com>


On 9/6/22 16:01, Waiman Long wrote:
> On 9/6/22 14:30, Tejun Heo wrote:
>> Hello,
>>
>> (cc'ing Waiman in case he has a better idea)
>>
>> On Mon, Sep 05, 2022 at 04:22:29PM +0800, Jing-Ting Wu wrote:
>>> https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T@slm.duckdns.org/ is for
>>> fix cgroup_threadgroup_rwsem <-> cpus_read_lock() deadlock.
>>> But this issue is cgroup_threadgroup_rwsem <-> cpuset_rwsem deadlock.
>> If I'm understanding what you're writing correctly, this isn't a 
>> deadlock.
>> The cpuset_hotplug_workfn simply isn't being woken up while holding
>> cpuset_rwsem and others are just waiting for that lock to be released.
>
> I believe it is probably a bug in the scheduler core code. 
> __set_cpus_allowed_ptr_locked() calls affine_move_task() to move to a 
> random cpu within the new set allowable CPUs. However, if migration is 
> disabled, it shouldn't call affine_move_task() at all. Instead, I 
> would suggest that if the current cpu is within the new allowable 
> cpus, it should just skip doing affine_move_task(). Otherwise, it 
> should fail __set_cpus_allowed_ptr_locked().

Maybe like

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 838623b68031..5d9ea1553ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2794,9 +2794,9 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
                 if (cpumask_equal(&p->cpus_mask, new_mask))
                         goto out;

-               if (WARN_ON_ONCE(p == current &&
-                                is_migration_disabled(p) &&
-                                !cpumask_test_cpu(task_cpu(p), 
new_mask))) {
+               if (is_migration_disabled(p) &&
+                   !cpumask_test_cpu(task_cpu(p), new_mask)) {
+                       WARN_ON_ONCE(p == current);
                         ret = -EBUSY;
                         goto out;
                 }
@@ -2818,7 +2818,11 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
         if (flags & SCA_USER)
                 user_mask = clear_user_cpus_ptr(p);

-       ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       if (!is_migration_disabled(p) || (flags & SCA_MIGRATE_ENABLE)) {
+               ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       } else {
+               task_rq_unlock(rq, p, rf);
+       }

         kfree(user_mask);

I haven't tested it myself, though.

Cheers,
Longman


WARNING: multiple messages have this Message-ID (diff)
From: Waiman Long <longman@redhat.com>
To: Tejun Heo <tj@kernel.org>, Jing-Ting Wu <jing-ting.wu@mediatek.com>
Cc: Mukesh Ojha <quic_mojha@quicinc.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Valentin Schneider <vschneid@redhat.com>,
	wsd_upstream@mediatek.com, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org, Jonathan.JMChen@mediatek.com,
	"chris.redpath@arm.com" <chris.redpath@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Vincent Donnefort <vdonnefort@gmail.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Christian Brauner <brauner@kernel.org>,
	cgroups@vger.kernel.org, lixiong.liu@mediatek.com,
	wenju.xu@mediatek.com
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
Date: Tue, 6 Sep 2022 16:40:03 -0400	[thread overview]
Message-ID: <36a73401-7011-834a-7949-c65a2f66246c@redhat.com> (raw)
In-Reply-To: <02b8e7b3-941d-8bb9-cd0e-992738893ba3@redhat.com>


On 9/6/22 16:01, Waiman Long wrote:
> On 9/6/22 14:30, Tejun Heo wrote:
>> Hello,
>>
>> (cc'ing Waiman in case he has a better idea)
>>
>> On Mon, Sep 05, 2022 at 04:22:29PM +0800, Jing-Ting Wu wrote:
>>> https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T@slm.duckdns.org/ is for
>>> fix cgroup_threadgroup_rwsem <-> cpus_read_lock() deadlock.
>>> But this issue is cgroup_threadgroup_rwsem <-> cpuset_rwsem deadlock.
>> If I'm understanding what you're writing correctly, this isn't a 
>> deadlock.
>> The cpuset_hotplug_workfn simply isn't being woken up while holding
>> cpuset_rwsem and others are just waiting for that lock to be released.
>
> I believe it is probably a bug in the scheduler core code. 
> __set_cpus_allowed_ptr_locked() calls affine_move_task() to move to a 
> random cpu within the new set allowable CPUs. However, if migration is 
> disabled, it shouldn't call affine_move_task() at all. Instead, I 
> would suggest that if the current cpu is within the new allowable 
> cpus, it should just skip doing affine_move_task(). Otherwise, it 
> should fail __set_cpus_allowed_ptr_locked().

Maybe like

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 838623b68031..5d9ea1553ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2794,9 +2794,9 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
                 if (cpumask_equal(&p->cpus_mask, new_mask))
                         goto out;

-               if (WARN_ON_ONCE(p == current &&
-                                is_migration_disabled(p) &&
-                                !cpumask_test_cpu(task_cpu(p), 
new_mask))) {
+               if (is_migration_disabled(p) &&
+                   !cpumask_test_cpu(task_cpu(p), new_mask)) {
+                       WARN_ON_ONCE(p == current);
                         ret = -EBUSY;
                         goto out;
                 }
@@ -2818,7 +2818,11 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
         if (flags & SCA_USER)
                 user_mask = clear_user_cpus_ptr(p);

-       ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       if (!is_migration_disabled(p) || (flags & SCA_MIGRATE_ENABLE)) {
+               ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       } else {
+               task_rq_unlock(rq, p, rf);
+       }

         kfree(user_mask);

I haven't tested it myself, though.

Cheers,
Longman


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Jing-Ting Wu
	<jing-ting.wu-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org>
Cc: Mukesh Ojha <quic_mojha-jfJNa2p1gH1BDgjK7y7TUQ@public.gmane.org>,
	Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
	Valentin Schneider
	<vschneid-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	wsd_upstream-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Jonathan.JMChen-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org,
	"chris.redpath-5wv7dgnIgG8@public.gmane.org"
	<chris.redpath-5wv7dgnIgG8@public.gmane.org>,
	Dietmar Eggemann <dietmar.eggemann-5wv7dgnIgG8@public.gmane.org>,
	Vincent Donnefort
	<vdonnefort-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Juri Lelli <juri.lelli-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Vincent Guittot
	<vincent.guittot-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>,
	Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>,
	Ben Segall <bsegall-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>,
	Christian Brauner
	<brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lixiong.liu-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org,
	wenju.xu-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
Date: Tue, 6 Sep 2022 16:40:03 -0400	[thread overview]
Message-ID: <36a73401-7011-834a-7949-c65a2f66246c@redhat.com> (raw)
In-Reply-To: <02b8e7b3-941d-8bb9-cd0e-992738893ba3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


On 9/6/22 16:01, Waiman Long wrote:
> On 9/6/22 14:30, Tejun Heo wrote:
>> Hello,
>>
>> (cc'ing Waiman in case he has a better idea)
>>
>> On Mon, Sep 05, 2022 at 04:22:29PM +0800, Jing-Ting Wu wrote:
>>> https://lore.kernel.org/lkml/YvrWaml3F+x9Dk+T-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org/ is for
>>> fix cgroup_threadgroup_rwsem <-> cpus_read_lock() deadlock.
>>> But this issue is cgroup_threadgroup_rwsem <-> cpuset_rwsem deadlock.
>> If I'm understanding what you're writing correctly, this isn't a 
>> deadlock.
>> The cpuset_hotplug_workfn simply isn't being woken up while holding
>> cpuset_rwsem and others are just waiting for that lock to be released.
>
> I believe it is probably a bug in the scheduler core code. 
> __set_cpus_allowed_ptr_locked() calls affine_move_task() to move to a 
> random cpu within the new set allowable CPUs. However, if migration is 
> disabled, it shouldn't call affine_move_task() at all. Instead, I 
> would suggest that if the current cpu is within the new allowable 
> cpus, it should just skip doing affine_move_task(). Otherwise, it 
> should fail __set_cpus_allowed_ptr_locked().

Maybe like

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 838623b68031..5d9ea1553ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2794,9 +2794,9 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
                 if (cpumask_equal(&p->cpus_mask, new_mask))
                         goto out;

-               if (WARN_ON_ONCE(p == current &&
-                                is_migration_disabled(p) &&
-                                !cpumask_test_cpu(task_cpu(p), 
new_mask))) {
+               if (is_migration_disabled(p) &&
+                   !cpumask_test_cpu(task_cpu(p), new_mask)) {
+                       WARN_ON_ONCE(p == current);
                         ret = -EBUSY;
                         goto out;
                 }
@@ -2818,7 +2818,11 @@ static int __set_cpus_allowed_ptr_locked(struct 
task_struct *p,
         if (flags & SCA_USER)
                 user_mask = clear_user_cpus_ptr(p);

-       ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       if (!is_migration_disabled(p) || (flags & SCA_MIGRATE_ENABLE)) {
+               ret = affine_move_task(rq, p, rf, dest_cpu, flags);
+       } else {
+               task_rq_unlock(rq, p, rf);
+       }

         kfree(user_mask);

I haven't tested it myself, though.

Cheers,
Longman


  reply	other threads:[~2022-09-06 20:40 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-05  2:47 BUG: HANG_DETECT waiting for migration_cpu_stop() complete Jing-Ting Wu
2022-09-05  2:47 ` Jing-Ting Wu
2022-09-05  2:47 ` Jing-Ting Wu
2022-09-05  6:44 ` Mukesh Ojha
2022-09-05  6:44   ` Mukesh Ojha
2022-09-05  6:44   ` Mukesh Ojha
2022-09-05  8:22   ` Jing-Ting Wu
2022-09-05  8:22     ` Jing-Ting Wu
2022-09-05  8:22     ` Jing-Ting Wu
2022-09-06 18:30     ` Tejun Heo
2022-09-06 18:30       ` Tejun Heo
2022-09-06 18:30       ` Tejun Heo
2022-09-06 20:01       ` Waiman Long
2022-09-06 20:01         ` Waiman Long
2022-09-06 20:40         ` Waiman Long [this message]
2022-09-06 20:40           ` Waiman Long
2022-09-06 20:40           ` Waiman Long
2022-09-06 20:50           ` Peter Zijlstra
2022-09-06 20:50             ` Peter Zijlstra
2022-09-06 20:50             ` Peter Zijlstra
2022-09-06 21:02             ` Waiman Long
2022-09-06 21:02               ` Waiman Long
2022-09-06 21:02               ` Waiman Long
2022-09-23 14:20             ` Mukesh Ojha
2022-09-23 14:20               ` Mukesh Ojha
2022-09-23 14:20               ` Mukesh Ojha
2022-09-29 15:13               ` Mukesh Ojha
2022-09-29 15:13                 ` Mukesh Ojha
2022-09-29 15:13                 ` Mukesh Ojha
2022-09-07  0:07 ` Hillf Danton
2022-09-22  5:40   ` Jing-Ting Wu
2022-09-22  5:40     ` Jing-Ting Wu
2022-09-22  5:40     ` Jing-Ting Wu
2022-09-22 12:02     ` Hillf Danton
2023-03-22  9:37 Ryan Xiao (肖水林)
2023-03-22  9:37 ` Ryan Xiao (肖水林)
2023-03-27  4:05 ` Ryan Xiao (肖水林)
2023-03-27  4:05   ` Ryan Xiao (肖水林)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=36a73401-7011-834a-7949-c65a2f66246c@redhat.com \
    --to=longman@redhat.com \
    --cc=Jonathan.JMChen@mediatek.com \
    --cc=brauner@kernel.org \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chris.redpath@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jing-ting.wu@mediatek.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mediatek@lists.infradead.org \
    --cc=lixiong.liu@mediatek.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quic_mojha@quicinc.com \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vdonnefort@gmail.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wenju.xu@mediatek.com \
    --cc=wsd_upstream@mediatek.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.