[problem] Hung task caused by memory migration when cpuset.mems changes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [problem] Hung task caused by memory migration when cpuset.mems changes
@ 2024-03-25 14:46 Chuyi Zhou
  2024-03-26 17:26 ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Chuyi Zhou @ 2024-03-25 14:46 UTC (permalink / raw)
  To: cgroups, longman, tj, hughd
  Cc: wuyun.abel, hezhongkun.hzk, chenying.kernel, zhanghaoyu.zhy, Chuyi Zhou

In our production environment, we have observed several cases of hung tasks
blocked on the cgroup_mutex. The underlying cause is that when user modify
the cpuset.mems, memory migration operations are performed in the
work_queue. However, the duration of these operations depends on the memory
size of workloads and can consume a significant amount of time.

In the __cgroup_procs_write operation, there is a flush_workqueue operation
that waits for the migration to complete while holding the cgroup_mutex.
As a result, most cgroup-related operations have the potential to
experience blocking.

We have noticed the commit "cgroup/cpuset: Enable memory migration for
cpuset v2"[1]. This commit enforces memory migration when modifying the
cpuset. Furthermore, in cgroup v2, there is no option available for
users to disable CS_MEMORY_MIGRATE.

In our scenario, we do need to perform memory migration when cpuset.mems
changes, while ensuring that other tasks are not blocked on cgroup_mutex
for an extended period of time.

One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
trigger memory migration, and we can manually perform memory migration
using migrate_pages()/move_pages() syscalls.

Another solution is to use a lazy approach for memory migration[2]. In
this way we only walk through all the pages and sets pages to protnone,
and numa faults triggered by later touch will handle the movement. That
would significantly reduce the time spent in cpuset_migrate_mm_workfn.
But MPOL_MF_LAZY was disabled by commit 2cafb582173f ("mempolicy: remove
confusing MPOL_MF_LAZY dead code")

Do you have any better suggestions?

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee9707e8593dfb9a375cf4793c3fd03d4142b463
[2] https://lore.kernel.org/lkml/20210426065946.40491-1-wuyun.abel@bytedance.com/T/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-25 14:46 [problem] Hung task caused by memory migration when cpuset.mems changes Chuyi Zhou
@ 2024-03-26 17:26 ` Tejun Heo
  2024-03-27 14:07   ` Chuyi Zhou
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Tejun Heo @ 2024-03-26 17:26 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: cgroups, longman, tj, hughd, wuyun.abel, hezhongkun.hzk,
	chenying.kernel, zhanghaoyu.zhy

Hello,

On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
> In our production environment, we have observed several cases of hung tasks
> blocked on the cgroup_mutex. The underlying cause is that when user modify
> the cpuset.mems, memory migration operations are performed in the
> work_queue. However, the duration of these operations depends on the memory
> size of workloads and can consume a significant amount of time.
> 
> In the __cgroup_procs_write operation, there is a flush_workqueue operation
> that waits for the migration to complete while holding the cgroup_mutex.
> As a result, most cgroup-related operations have the potential to
> experience blocking.
> 
> We have noticed the commit "cgroup/cpuset: Enable memory migration for
> cpuset v2"[1]. This commit enforces memory migration when modifying the
> cpuset. Furthermore, in cgroup v2, there is no option available for
> users to disable CS_MEMORY_MIGRATE.
> 
> In our scenario, we do need to perform memory migration when cpuset.mems
> changes, while ensuring that other tasks are not blocked on cgroup_mutex
> for an extended period of time.
> 
> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
> trigger memory migration, and we can manually perform memory migration
> using migrate_pages()/move_pages() syscalls.
> 
> Another solution is to use a lazy approach for memory migration[2]. In
> this way we only walk through all the pages and sets pages to protnone,
> and numa faults triggered by later touch will handle the movement. That
> would significantly reduce the time spent in cpuset_migrate_mm_workfn.
> But MPOL_MF_LAZY was disabled by commit 2cafb582173f ("mempolicy: remove
> confusing MPOL_MF_LAZY dead code")

One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
task_work so that it happens after cpuset mutex is dropped. That way we
maintain the operation synchronicity for the issuer while avoiding bothering
anyone else.

Can you see whether the following patch fixes the issue for you? Thanks.

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ba36c073304a..8a8bd3f157ab 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -42,6 +42,7 @@
 #include <linux/spinlock.h>
 #include <linux/oom.h>
 #include <linux/sched/isolation.h>
+#include <linux/task_work.h>
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
@@ -2696,6 +2697,26 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
 	kfree(mwork);
 }
 
+static void flush_migrate_mm_task_workfn(struct callback_head *head)
+{
+	flush_workqueue(cpuset_migrate_mm_wq);
+}
+
+static int schedule_flush_migrate_mm(void)
+{
+	struct callback_head *flush_cb;
+
+	flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
+	if (!flush_cb)
+		return -ENOMEM;
+
+	flush_cb->func = flush_migrate_mm_task_workfn;
+	if (task_work_add(current, flush_cb, TWA_RESUME))
+		kfree(flush_cb);
+
+	return 0;
+}
+
 static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 							const nodemask_t *to)
 {
@@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 	}
 }
 
-static void cpuset_post_attach(void)
-{
-	flush_workqueue(cpuset_migrate_mm_wq);
-}
-
 /*
  * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
  * @tsk: the task to change
@@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	bool cpus_updated, mems_updated;
 	int ret;
 
+	ret = schedule_flush_migrate_mm();
+	if (ret)
+		return ret;
+
 	/* used later by cpuset_attach() */
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	oldcs = cpuset_attach_old_cs;
@@ -3584,7 +3604,11 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 {
 	struct cpuset *cs = css_cs(of_css(of));
 	struct cpuset *trialcs;
-	int retval = -ENODEV;
+	int retval;
+
+	retval = schedule_flush_migrate_mm();
+	if (retval)
+		return retval;
 
 	buf = strstrip(buf);
 
@@ -3613,8 +3637,10 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
-	if (!is_cpuset_online(cs))
+	if (!is_cpuset_online(cs)) {
+		retval = -ENODEV;
 		goto out_unlock;
+	}
 
 	trialcs = alloc_trial_cpuset(cs);
 	if (!trialcs) {
@@ -3643,7 +3669,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	cpus_read_unlock();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
-	flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
 
@@ -4283,7 +4308,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.can_attach	= cpuset_can_attach,
 	.cancel_attach	= cpuset_cancel_attach,
 	.attach		= cpuset_attach,
-	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.can_fork	= cpuset_can_fork,
 	.cancel_fork	= cpuset_cancel_fork,

-- 
tejun

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-26 17:26 ` Tejun Heo
@ 2024-03-27 14:07   ` Chuyi Zhou
  2024-03-27 16:13     ` Tejun Heo
  2024-03-27 17:14   ` Waiman Long
  2024-03-28  7:53   ` Abel Wu
  2 siblings, 1 reply; 8+ messages in thread
From: Chuyi Zhou @ 2024-03-27 14:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, longman, hughd, wuyun.abel, hezhongkun.hzk,
	chenying.kernel, zhanghaoyu.zhy



在 2024/3/27 01:26, Tejun Heo 写道:
> Hello,
> 
> On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
>> In our production environment, we have observed several cases of hung tasks
>> blocked on the cgroup_mutex. The underlying cause is that when user modify
>> the cpuset.mems, memory migration operations are performed in the
>> work_queue. However, the duration of these operations depends on the memory
>> size of workloads and can consume a significant amount of time.
>>
>> In the __cgroup_procs_write operation, there is a flush_workqueue operation
>> that waits for the migration to complete while holding the cgroup_mutex.
>> As a result, most cgroup-related operations have the potential to
>> experience blocking.
>>
>> We have noticed the commit "cgroup/cpuset: Enable memory migration for
>> cpuset v2"[1]. This commit enforces memory migration when modifying the
>> cpuset. Furthermore, in cgroup v2, there is no option available for
>> users to disable CS_MEMORY_MIGRATE.
>>
>> In our scenario, we do need to perform memory migration when cpuset.mems
>> changes, while ensuring that other tasks are not blocked on cgroup_mutex
>> for an extended period of time.
>>
>> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
>> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
>> trigger memory migration, and we can manually perform memory migration
>> using migrate_pages()/move_pages() syscalls.
>>
>> Another solution is to use a lazy approach for memory migration[2]. In
>> this way we only walk through all the pages and sets pages to protnone,
>> and numa faults triggered by later touch will handle the movement. That
>> would significantly reduce the time spent in cpuset_migrate_mm_workfn.
>> But MPOL_MF_LAZY was disabled by commit 2cafb582173f ("mempolicy: remove
>> confusing MPOL_MF_LAZY dead code")
> 
> One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> task_work so that it happens after cpuset mutex is dropped. That way we
> maintain the operation synchronicity for the issuer while avoiding bothering
> anyone else.
> 
> Can you see whether the following patch fixes the issue for you? Thanks.


Thanks for the reply! I think it would help to fix this issue.

BTW, will you merge this patch to the mainline?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-27 14:07   ` Chuyi Zhou
@ 2024-03-27 16:13     ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2024-03-27 16:13 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: cgroups, longman, hughd, wuyun.abel, hezhongkun.hzk,
	chenying.kernel, zhanghaoyu.zhy

On Wed, Mar 27, 2024 at 10:07:23PM +0800, Chuyi Zhou wrote:
> > One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> > task_work so that it happens after cpuset mutex is dropped. That way we
> > maintain the operation synchronicity for the issuer while avoiding bothering
> > anyone else.
> > 
> > Can you see whether the following patch fixes the issue for you? Thanks.
> 
> 
> Thanks for the reply! I think it would help to fix this issue.
> 
> BTW, will you merge this patch to the mainline?

Sure but I'd really like to see that it actually solves the issues you're
experiencing before merging it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-26 17:26 ` Tejun Heo
  2024-03-27 14:07   ` Chuyi Zhou
@ 2024-03-27 17:14   ` Waiman Long
  2024-03-27 21:43     ` Tejun Heo
  2024-03-28  7:53   ` Abel Wu
  2 siblings, 1 reply; 8+ messages in thread
From: Waiman Long @ 2024-03-27 17:14 UTC (permalink / raw)
  To: Tejun Heo, Chuyi Zhou
  Cc: cgroups, hughd, wuyun.abel, hezhongkun.hzk, chenying.kernel,
	zhanghaoyu.zhy

On 3/26/24 13:26, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
>> In our production environment, we have observed several cases of hung tasks
>> blocked on the cgroup_mutex. The underlying cause is that when user modify
>> the cpuset.mems, memory migration operations are performed in the
>> work_queue. However, the duration of these operations depends on the memory
>> size of workloads and can consume a significant amount of time.
>>
>> In the __cgroup_procs_write operation, there is a flush_workqueue operation
>> that waits for the migration to complete while holding the cgroup_mutex.
>> As a result, most cgroup-related operations have the potential to
>> experience blocking.
>>
>> We have noticed the commit "cgroup/cpuset: Enable memory migration for
>> cpuset v2"[1]. This commit enforces memory migration when modifying the
>> cpuset. Furthermore, in cgroup v2, there is no option available for
>> users to disable CS_MEMORY_MIGRATE.
>>
>> In our scenario, we do need to perform memory migration when cpuset.mems
>> changes, while ensuring that other tasks are not blocked on cgroup_mutex
>> for an extended period of time.
>>
>> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
>> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
>> trigger memory migration, and we can manually perform memory migration
>> using migrate_pages()/move_pages() syscalls.
>>
>> Another solution is to use a lazy approach for memory migration[2]. In
>> this way we only walk through all the pages and sets pages to protnone,
>> and numa faults triggered by later touch will handle the movement. That
>> would significantly reduce the time spent in cpuset_migrate_mm_workfn.
>> But MPOL_MF_LAZY was disabled by commit 2cafb582173f ("mempolicy: remove
>> confusing MPOL_MF_LAZY dead code")
> One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> task_work so that it happens after cpuset mutex is dropped. That way we
> maintain the operation synchronicity for the issuer while avoiding bothering
> anyone else.
I think it is a good idea to use task_work() to wait for mm migration to 
finish before returning to user space.
>
> Can you see whether the following patch fixes the issue for you? Thanks.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ba36c073304a..8a8bd3f157ab 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -42,6 +42,7 @@
>   #include <linux/spinlock.h>
>   #include <linux/oom.h>
>   #include <linux/sched/isolation.h>
> +#include <linux/task_work.h>
>   #include <linux/cgroup.h>
>   #include <linux/wait.h>
>   #include <linux/workqueue.h>
> @@ -2696,6 +2697,26 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
>   	kfree(mwork);
>   }
>   
> +static void flush_migrate_mm_task_workfn(struct callback_head *head)
> +{
> +	flush_workqueue(cpuset_migrate_mm_wq);
> +}
> +
> +static int schedule_flush_migrate_mm(void)
> +{
> +	struct callback_head *flush_cb;
> +
> +	flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
> +	if (!flush_cb)
> +		return -ENOMEM;
> +
> +	flush_cb->func = flush_migrate_mm_task_workfn;
> +	if (task_work_add(current, flush_cb, TWA_RESUME))
> +		kfree(flush_cb);
> +
> +	return 0;
> +}
> +
>   static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   							const nodemask_t *to)
>   {
> @@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   	}
>   }
>   
> -static void cpuset_post_attach(void)
> -{
> -	flush_workqueue(cpuset_migrate_mm_wq);
> -}
> -
>   /*
>    * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
>    * @tsk: the task to change
> @@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	bool cpus_updated, mems_updated;
>   	int ret;
>   
> +	ret = schedule_flush_migrate_mm();
> +	if (ret)
> +		return ret;
> +

It may be too early to initiate the task_work at cpuset_can_attach() as 
no mm migration may happen. My suggestion is to do it at cpuset_attach() 
with at least one cpuset_migrate_mm() call.

Cheers,
Longman

>   	/* used later by cpuset_attach() */
>   	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>   	oldcs = cpuset_attach_old_cs;
> @@ -3584,7 +3604,11 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   {
>   	struct cpuset *cs = css_cs(of_css(of));
>   	struct cpuset *trialcs;
> -	int retval = -ENODEV;
> +	int retval;
> +
> +	retval = schedule_flush_migrate_mm();
> +	if (retval)
> +		return retval;
>   
>   	buf = strstrip(buf);
>   
> @@ -3613,8 +3637,10 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   
>   	cpus_read_lock();
>   	mutex_lock(&cpuset_mutex);
> -	if (!is_cpuset_online(cs))
> +	if (!is_cpuset_online(cs)) {
> +		retval = -ENODEV;
>   		goto out_unlock;
> +	}
>   
>   	trialcs = alloc_trial_cpuset(cs);
>   	if (!trialcs) {
> @@ -3643,7 +3669,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	cpus_read_unlock();
>   	kernfs_unbreak_active_protection(of->kn);
>   	css_put(&cs->css);
> -	flush_workqueue(cpuset_migrate_mm_wq);
>   	return retval ?: nbytes;
>   }
>   
> @@ -4283,7 +4308,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
>   	.can_attach	= cpuset_can_attach,
>   	.cancel_attach	= cpuset_cancel_attach,
>   	.attach		= cpuset_attach,
> -	.post_attach	= cpuset_post_attach,
>   	.bind		= cpuset_bind,
>   	.can_fork	= cpuset_can_fork,
>   	.cancel_fork	= cpuset_cancel_fork,
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-27 17:14   ` Waiman Long
@ 2024-03-27 21:43     ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2024-03-27 21:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chuyi Zhou, cgroups, hughd, wuyun.abel, hezhongkun.hzk,
	chenying.kernel, zhanghaoyu.zhy

Hello,

On Wed, Mar 27, 2024 at 01:14:49PM -0400, Waiman Long wrote:
...
> > @@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> >   	}
> >   }
> > -static void cpuset_post_attach(void)
> > -{
> > -	flush_workqueue(cpuset_migrate_mm_wq);
> > -}
> > -
> >   /*
> >    * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
> >    * @tsk: the task to change
> > @@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> >   	bool cpus_updated, mems_updated;
> >   	int ret;
> > +	ret = schedule_flush_migrate_mm();
> > +	if (ret)
> > +		return ret;
> > +
> 
> It may be too early to initiate the task_work at cpuset_can_attach() as no
> mm migration may happen. My suggestion is to do it at cpuset_attach() with
> at least one cpuset_migrate_mm() call.

Yeah, we can do that too. The downside is that we lose the ability to return
-ENOMEM unless we separate out allocation and queueing. Given that
flush_workqueue() when migration is not in progress is really cheap and the
existing code always flushes from post_attach(), I don't think it's too bad
but yeah it widens the scope of unnecessary waits. So, yeah, what you're
suggesting sounds good too especially given that migration is best effort
anyway and already depends on memory allocation.

Let's see whether this works for Chuyi and I'll post an update version
later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-26 17:26 ` Tejun Heo
  2024-03-27 14:07   ` Chuyi Zhou
  2024-03-27 17:14   ` Waiman Long
@ 2024-03-28  7:53   ` Abel Wu
  2024-03-28 17:19     ` Tejun Heo
  2 siblings, 1 reply; 8+ messages in thread
From: Abel Wu @ 2024-03-28  7:53 UTC (permalink / raw)
  To: Tejun Heo, Chuyi Zhou
  Cc: cgroups, longman, tj, hughd, hezhongkun.hzk, chenying.kernel,
	zhanghaoyu.zhy

On 3/27/24 1:26 AM, Tejun Heo Wrote:
> Hello,
> 
> On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
>> In our production environment, we have observed several cases of hung tasks
>> blocked on the cgroup_mutex. The underlying cause is that when user modify
>> the cpuset.mems, memory migration operations are performed in the
>> work_queue. However, the duration of these operations depends on the memory
>> size of workloads and can consume a significant amount of time.
>>
>> In the __cgroup_procs_write operation, there is a flush_workqueue operation
>> that waits for the migration to complete while holding the cgroup_mutex.
>> As a result, most cgroup-related operations have the potential to
>> experience blocking.
>>
>> We have noticed the commit "cgroup/cpuset: Enable memory migration for
>> cpuset v2"[1]. This commit enforces memory migration when modifying the
>> cpuset. Furthermore, in cgroup v2, there is no option available for
>> users to disable CS_MEMORY_MIGRATE.
>>
>> In our scenario, we do need to perform memory migration when cpuset.mems
>> changes, while ensuring that other tasks are not blocked on cgroup_mutex
>> for an extended period of time.
>>
>> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
>> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
>> trigger memory migration, and we can manually perform memory migration
>> using migrate_pages()/move_pages() syscalls.
>>
>> Another solution is to use a lazy approach for memory migration[2]. In
>> this way we only walk through all the pages and sets pages to protnone,
>> and numa faults triggered by later touch will handle the movement. That
>> would significantly reduce the time spent in cpuset_migrate_mm_workfn.
>> But MPOL_MF_LAZY was disabled by commit 2cafb582173f ("mempolicy: remove
>> confusing MPOL_MF_LAZY dead code")
> 
> One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> task_work so that it happens after cpuset mutex is dropped. That way we
> maintain the operation synchronicity for the issuer while avoiding bothering
> anyone else.

Good idea!

> 
> Can you see whether the following patch fixes the issue for you? Thanks.

We use move_pages() when cpuset memory migration disabled, which is proved
fine. Given that the way you proposed is kind of like what we have done but
inside kernel before return to userspace, I think this patch will help.

> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ba36c073304a..8a8bd3f157ab 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -42,6 +42,7 @@
>   #include <linux/spinlock.h>
>   #include <linux/oom.h>
>   #include <linux/sched/isolation.h>
> +#include <linux/task_work.h>
>   #include <linux/cgroup.h>
>   #include <linux/wait.h>
>   #include <linux/workqueue.h>
> @@ -2696,6 +2697,26 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
>   	kfree(mwork);
>   }
>   
> +static void flush_migrate_mm_task_workfn(struct callback_head *head)
> +{
> +	flush_workqueue(cpuset_migrate_mm_wq);
> +}
> +
> +static int schedule_flush_migrate_mm(void)
> +{
> +	struct callback_head *flush_cb;
> +
> +	flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
> +	if (!flush_cb)
> +		return -ENOMEM;
> +
> +	flush_cb->func = flush_migrate_mm_task_workfn;
> +	if (task_work_add(current, flush_cb, TWA_RESUME))
> +		kfree(flush_cb);

It seems we will lose track of flush_cb and causes memleak here. Did I miss
anything?

Thanks & BR,
	Abel

> +
> +	return 0;
> +}
> +
>   static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   							const nodemask_t *to)
>   {
> @@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   	}
>   }
>   
> -static void cpuset_post_attach(void)
> -{
> -	flush_workqueue(cpuset_migrate_mm_wq);
> -}
> -
>   /*
>    * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
>    * @tsk: the task to change
> @@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	bool cpus_updated, mems_updated;
>   	int ret;
>   
> +	ret = schedule_flush_migrate_mm();
> +	if (ret)
> +		return ret;
> +
>   	/* used later by cpuset_attach() */
>   	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>   	oldcs = cpuset_attach_old_cs;
> @@ -3584,7 +3604,11 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   {
>   	struct cpuset *cs = css_cs(of_css(of));
>   	struct cpuset *trialcs;
> -	int retval = -ENODEV;
> +	int retval;
> +
> +	retval = schedule_flush_migrate_mm();
> +	if (retval)
> +		return retval;
>   
>   	buf = strstrip(buf);
>   
> @@ -3613,8 +3637,10 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   
>   	cpus_read_lock();
>   	mutex_lock(&cpuset_mutex);
> -	if (!is_cpuset_online(cs))
> +	if (!is_cpuset_online(cs)) {
> +		retval = -ENODEV;
>   		goto out_unlock;
> +	}
>   
>   	trialcs = alloc_trial_cpuset(cs);
>   	if (!trialcs) {
> @@ -3643,7 +3669,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	cpus_read_unlock();
>   	kernfs_unbreak_active_protection(of->kn);
>   	css_put(&cs->css);
> -	flush_workqueue(cpuset_migrate_mm_wq);
>   	return retval ?: nbytes;
>   }
>   
> @@ -4283,7 +4308,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
>   	.can_attach	= cpuset_can_attach,
>   	.cancel_attach	= cpuset_cancel_attach,
>   	.attach		= cpuset_attach,
> -	.post_attach	= cpuset_post_attach,
>   	.bind		= cpuset_bind,
>   	.can_fork	= cpuset_can_fork,
>   	.cancel_fork	= cpuset_cancel_fork,
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: [problem] Hung task caused by memory migration when cpuset.mems changes
  2024-03-28  7:53   ` Abel Wu
@ 2024-03-28 17:19     ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2024-03-28 17:19 UTC (permalink / raw)
  To: Abel Wu
  Cc: Chuyi Zhou, cgroups, longman, tj, hughd, hezhongkun.hzk,
	chenying.kernel, zhanghaoyu.zhy

On Thu, Mar 28, 2024 at 03:53:30PM +0800, Abel Wu wrote:
> > +static int schedule_flush_migrate_mm(void)
> > +{
> > +	struct callback_head *flush_cb;
> > +
> > +	flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
> > +	if (!flush_cb)
> > +		return -ENOMEM;
> > +
> > +	flush_cb->func = flush_migrate_mm_task_workfn;
> > +	if (task_work_add(current, flush_cb, TWA_RESUME))
> > +		kfree(flush_cb);
> 
> It seems we will lose track of flush_cb and causes memleak here. Did I miss
> anything?

Oops, yeah, the work item needs to free itself. Thanks for spotting it.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-03-28 17:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25 14:46 [problem] Hung task caused by memory migration when cpuset.mems changes Chuyi Zhou
2024-03-26 17:26 ` Tejun Heo
2024-03-27 14:07   ` Chuyi Zhou
2024-03-27 16:13     ` Tejun Heo
2024-03-27 17:14   ` Waiman Long
2024-03-27 21:43     ` Tejun Heo
2024-03-28  7:53   ` Abel Wu
2024-03-28 17:19     ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.