All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28  0:58 ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28  0:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Waiman Long

It was found that any change to the current cpuset hierarchy may reset
the cpus_allowed list of the tasks in the affected cpusets to the
default cpuset value even if those tasks have cpus affinity explicitly
set by the users before. That is especially easy to trigger under a
cgroup v2 environment where writing "+cpuset" to the root cgroup's
cgroup.subtree_control file will reset the cpus affinity of all the
processes in the system.

That is especially problematic in a nohz_full environment where the
tasks running in the nohz_full CPUs usually have their cpus affinity
explicitly set and will behave incorrectly if cpus affinity changes.

Fix this problem by adding a flag in the task structure to indicate that
a task has their cpus affinity explicitly set before and make cpuset
code not to change their cpus_allowed list unless the user chosen cpu
list is no longer a subset of the cpus_allowed list of the cpuset itself.

With that change in place, it was verified that tasks that have its
cpus affinity explicitly set will not be affected by changes made to
the v2 cgroup.subtree_control files.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++--
 kernel/sched/core.c    |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46f3a63b758..60ae022fa842 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ struct task_struct {
 
 	unsigned int			policy;
 	int				nr_cpus_allowed;
+	int				cpus_affinity_set;
 	const cpumask_t			*cpus_ptr;
 	cpumask_t			*user_cpus_ptr;
 	cpumask_t			cpus_mask;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71a418858a5e..c47757c61f39 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,20 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }
 
+/*
+ * Don't change the cpus_allowed list if cpus affinity has been explicitly
+ * set before unless the current cpu list is not a subset of the new cpu list.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
+		return 0;
+
+	p->cpus_affinity_set = 0;
+	return set_cpus_allowed_ptr(p, new_mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1144,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
 
@@ -2303,7 +2317,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail.  TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da0bf6fe9ecd..ab8ea6fa92db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8034,6 +8034,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
 
+	p->cpus_affinity_set = 1;
 	cpuset_cpus_allowed(p, cpus_allowed);
 	if (!cpumask_subset(new_mask, cpus_allowed)) {
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28  0:58 ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28  0:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Waiman Long

It was found that any change to the current cpuset hierarchy may reset
the cpus_allowed list of the tasks in the affected cpusets to the
default cpuset value even if those tasks have cpus affinity explicitly
set by the users before. That is especially easy to trigger under a
cgroup v2 environment where writing "+cpuset" to the root cgroup's
cgroup.subtree_control file will reset the cpus affinity of all the
processes in the system.

That is especially problematic in a nohz_full environment where the
tasks running in the nohz_full CPUs usually have their cpus affinity
explicitly set and will behave incorrectly if cpus affinity changes.

Fix this problem by adding a flag in the task structure to indicate that
a task has their cpus affinity explicitly set before and make cpuset
code not to change their cpus_allowed list unless the user chosen cpu
list is no longer a subset of the cpus_allowed list of the cpuset itself.

With that change in place, it was verified that tasks that have its
cpus affinity explicitly set will not be affected by changes made to
the v2 cgroup.subtree_control files.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sched.h  |  1 +
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++--
 kernel/sched/core.c    |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46f3a63b758..60ae022fa842 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ struct task_struct {
 
 	unsigned int			policy;
 	int				nr_cpus_allowed;
+	int				cpus_affinity_set;
 	const cpumask_t			*cpus_ptr;
 	cpumask_t			*user_cpus_ptr;
 	cpumask_t			cpus_mask;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71a418858a5e..c47757c61f39 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,20 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }
 
+/*
+ * Don't change the cpus_allowed list if cpus affinity has been explicitly
+ * set before unless the current cpu list is not a subset of the new cpu list.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
+		return 0;
+
+	p->cpus_affinity_set = 0;
+	return set_cpus_allowed_ptr(p, new_mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1144,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
 
@@ -2303,7 +2317,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail.  TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da0bf6fe9ecd..ab8ea6fa92db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8034,6 +8034,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
 
+	p->cpus_affinity_set = 1;
 	cpuset_cpus_allowed(p, cpus_allowed);
 	if (!cpumask_subset(new_mask, cpus_allowed)) {
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28  0:58   ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28  0:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Waiman Long

The cgroup_update_dfl_csses() function updates css associations when a
cgroup's subtree_control file is modified. Any changes made to a cgroup's
subtree_control file, however, will only affect its descendants but not
the cgroup itself. So there is no point in migrating csses associated
with that cgroup. We can skip them instead.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 13c8e91d7862..1151ff44d578 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2960,6 +2960,15 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		struct cgrp_cset_link *link;
 
+		/*
+		 * As cgroup_update_dfl_csses() is only called by
+		 * cgroup_apply_control(). The csses associated with the
+		 * given cgrp will not be affected by changes made to
+		 * its subtree_control file. We can skip them.
+		 */
+		if (dsct == cgrp)
+			continue;
+
 		list_for_each_entry(link, &dsct->cset_links, cset_link)
 			cgroup_migrate_add_src(link->cset, dsct, &mgctx);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28  0:58   ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28  0:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Waiman Long

The cgroup_update_dfl_csses() function updates css associations when a
cgroup's subtree_control file is modified. Any changes made to a cgroup's
subtree_control file, however, will only affect its descendants but not
the cgroup itself. So there is no point in migrating csses associated
with that cgroup. We can skip them instead.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 kernel/cgroup/cgroup.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 13c8e91d7862..1151ff44d578 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2960,6 +2960,15 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		struct cgrp_cset_link *link;
 
+		/*
+		 * As cgroup_update_dfl_csses() is only called by
+		 * cgroup_apply_control(). The csses associated with the
+		 * given cgrp will not be affected by changes made to
+		 * its subtree_control file. We can skip them.
+		 */
+		if (dsct == cgrp)
+			continue;
+
 		list_for_each_entry(link, &dsct->cset_links, cset_link)
 			cgroup_migrate_add_src(link->cset, dsct, &mgctx);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 14:44   ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 14:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

Hello.

On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <longman@redhat.com> wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks have cpus affinity explicitly
> set by the users before.

I'm surprised this went so long unnoticed / unreported.

Could it be users relied on that implicit affinity reset?

> That is especially easy to trigger under a cgroup v2 environment where
> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
> will reset the cpus affinity of all the processes in the system.

This should apply only to tasks that were extracted out of the root
cgroup, no? (OK, those are all processes practically.)

(Even without your second patch, the scope should be limited because of
src_cset==dst_cset check in cgroup_migrate_prepare_dst().)

> That is especially problematic in a nohz_full environment where the
> tasks running in the nohz_full CPUs usually have their cpus affinity
> explicitly set and will behave incorrectly if cpus affinity changes.

One could also argue that for such processes, cgroup hierarchy should be
first configured and only then they start and set own affinity.

> Fix this problem by adding a flag in the task structure to indicate that
> a task has their cpus affinity explicitly set before and make cpuset
> code not to change their cpus_allowed list unless the user chosen cpu
> list is no longer a subset of the cpus_allowed list of the cpuset itself.

I'm uneasy with the occasional revert of this flag, i.e. the task who
set their affinity would sometimes have it overwritten and sometimes
not (which might have been relied on, especially with writes into
cpuset.cpus).
(But I have no better answer than the counter-argument above since
there's no easier way to detect the implicit migrations.)

Also, is there similar effect with memory binding?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 14:44   ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 14:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello.

On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks have cpus affinity explicitly
> set by the users before.

I'm surprised this went so long unnoticed / unreported.

Could it be users relied on that implicit affinity reset?

> That is especially easy to trigger under a cgroup v2 environment where
> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
> will reset the cpus affinity of all the processes in the system.

This should apply only to tasks that were extracted out of the root
cgroup, no? (OK, those are all processes practically.)

(Even without your second patch, the scope should be limited because of
src_cset==dst_cset check in cgroup_migrate_prepare_dst().)

> That is especially problematic in a nohz_full environment where the
> tasks running in the nohz_full CPUs usually have their cpus affinity
> explicitly set and will behave incorrectly if cpus affinity changes.

One could also argue that for such processes, cgroup hierarchy should be
first configured and only then they start and set own affinity.

> Fix this problem by adding a flag in the task structure to indicate that
> a task has their cpus affinity explicitly set before and make cpuset
> code not to change their cpus_allowed list unless the user chosen cpu
> list is no longer a subset of the cpus_allowed list of the cpuset itself.

I'm uneasy with the occasional revert of this flag, i.e. the task who
set their affinity would sometimes have it overwritten and sometimes
not (which might have been relied on, especially with writes into
cpuset.cpus).
(But I have no better answer than the counter-argument above since
there's no easier way to detect the implicit migrations.)

Also, is there similar effect with memory binding?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 14:44     ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 14:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long <longman@redhat.com> wrote:
> The cgroup_update_dfl_csses() function updates css associations when a
> cgroup's subtree_control file is modified. Any changes made to a cgroup's
> subtree_control file, however, will only affect its descendants but not
> the cgroup itself. 

I find this correct.

> So there is no point in migrating csses associated with that cgroup.
> We can skip them instead.

Alone it's not such a big win but it componds with the recent Tejun's
threadgroup_rwsem elision.

> ---
>  kernel/cgroup/cgroup.c | 9 +++++++++
>  1 file changed, 9 insertions(+)

Feel free to have
Reviewed-by: Michal Koutný <mkoutny@suse.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 14:44     ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 14:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> The cgroup_update_dfl_csses() function updates css associations when a
> cgroup's subtree_control file is modified. Any changes made to a cgroup's
> subtree_control file, however, will only affect its descendants but not
> the cgroup itself. 

I find this correct.

> So there is no point in migrating csses associated with that cgroup.
> We can skip them instead.

Alone it's not such a big win but it componds with the recent Tejun's
threadgroup_rwsem elision.

> ---
>  kernel/cgroup/cgroup.c | 9 +++++++++
>  1 file changed, 9 insertions(+)

Feel free to have
Reviewed-by: Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
  2022-07-28 14:44     ` Michal Koutný
  (?)
@ 2022-07-28 14:49     ` Waiman Long
  -1 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 14:49 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel


On 7/28/22 10:44, Michal Koutný wrote:
> On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long <longman@redhat.com> wrote:
>> The cgroup_update_dfl_csses() function updates css associations when a
>> cgroup's subtree_control file is modified. Any changes made to a cgroup's
>> subtree_control file, however, will only affect its descendants but not
>> the cgroup itself.
> I find this correct.
>
>> So there is no point in migrating csses associated with that cgroup.
>> We can skip them instead.
> Alone it's not such a big win but it componds with the recent Tejun's
> threadgroup_rwsem elision.
>
It is more an optimization patch trying to not waste cpu time doing 
unnecessary work.
>> ---
>>   kernel/cgroup/cgroup.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
> Feel free to have
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
>
Thanks for the review.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 14:59     ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 14:59 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

On 7/28/22 10:44, Michal Koutný wrote:
> Hello.
>
> On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <longman@redhat.com> wrote:
>> It was found that any change to the current cpuset hierarchy may reset
>> the cpus_allowed list of the tasks in the affected cpusets to the
>> default cpuset value even if those tasks have cpus affinity explicitly
>> set by the users before.
> I'm surprised this went so long unnoticed / unreported.
>
> Could it be users relied on that implicit affinity reset?

As said, it is more easily triggered in a cgroup v2 environment. 
Systemd, on a cgroup v2 environment, will write "+cpuset" to the root 
cgroup's subtree_control file when a new container is instantiated. I 
don't know why it is doing that, but that cause problem as it resets all 
the cpus_allowed list assignment. Cgroup v1 doesn't have this problem.


>> That is especially easy to trigger under a cgroup v2 environment where
>> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
>> will reset the cpus affinity of all the processes in the system.
> This should apply only to tasks that were extracted out of the root
> cgroup, no? (OK, those are all processes practically.)
The reset is done on all cgroups in a particular subtree. In the case of 
cgroup root, it is all the processes in the system.
>
> (Even without your second patch, the scope should be limited because of
> src_cset==dst_cset check in cgroup_migrate_prepare_dst().)
>
>> That is especially problematic in a nohz_full environment where the
>> tasks running in the nohz_full CPUs usually have their cpus affinity
>> explicitly set and will behave incorrectly if cpus affinity changes.
> One could also argue that for such processes, cgroup hierarchy should be
> first configured and only then they start and set own affinity.
>
>> Fix this problem by adding a flag in the task structure to indicate that
>> a task has their cpus affinity explicitly set before and make cpuset
>> code not to change their cpus_allowed list unless the user chosen cpu
>> list is no longer a subset of the cpus_allowed list of the cpuset itself.
> I'm uneasy with the occasional revert of this flag, i.e. the task who
> set their affinity would sometimes have it overwritten and sometimes
> not (which might have been relied on, especially with writes into
> cpuset.cpus).
> (But I have no better answer than the counter-argument above since
> there's no easier way to detect the implicit migrations.)
I also thought about that. Another possible alternative is to use the 
intersection of cpuset's cpu list and task's own cpu list as long as it 
is not empty. Reducing the number of cpus assigned to a task may produce 
some unexpected behavior too.
>
> Also, is there similar effect with memory binding?

I think so, but memory binding is less frequently used and its effect is 
less noticeable.

Cheers,
Longman

>
> Thanks,
> Michal
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 14:59     ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 14:59 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/28/22 10:44, Michal Koutný wrote:
> Hello.
>
> On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> It was found that any change to the current cpuset hierarchy may reset
>> the cpus_allowed list of the tasks in the affected cpusets to the
>> default cpuset value even if those tasks have cpus affinity explicitly
>> set by the users before.
> I'm surprised this went so long unnoticed / unreported.
>
> Could it be users relied on that implicit affinity reset?

As said, it is more easily triggered in a cgroup v2 environment. 
Systemd, on a cgroup v2 environment, will write "+cpuset" to the root 
cgroup's subtree_control file when a new container is instantiated. I 
don't know why it is doing that, but that cause problem as it resets all 
the cpus_allowed list assignment. Cgroup v1 doesn't have this problem.


>> That is especially easy to trigger under a cgroup v2 environment where
>> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
>> will reset the cpus affinity of all the processes in the system.
> This should apply only to tasks that were extracted out of the root
> cgroup, no? (OK, those are all processes practically.)
The reset is done on all cgroups in a particular subtree. In the case of 
cgroup root, it is all the processes in the system.
>
> (Even without your second patch, the scope should be limited because of
> src_cset==dst_cset check in cgroup_migrate_prepare_dst().)
>
>> That is especially problematic in a nohz_full environment where the
>> tasks running in the nohz_full CPUs usually have their cpus affinity
>> explicitly set and will behave incorrectly if cpus affinity changes.
> One could also argue that for such processes, cgroup hierarchy should be
> first configured and only then they start and set own affinity.
>
>> Fix this problem by adding a flag in the task structure to indicate that
>> a task has their cpus affinity explicitly set before and make cpuset
>> code not to change their cpus_allowed list unless the user chosen cpu
>> list is no longer a subset of the cpus_allowed list of the cpuset itself.
> I'm uneasy with the occasional revert of this flag, i.e. the task who
> set their affinity would sometimes have it overwritten and sometimes
> not (which might have been relied on, especially with writes into
> cpuset.cpus).
> (But I have no better answer than the counter-argument above since
> there's no easier way to detect the implicit migrations.)
I also thought about that. Another possible alternative is to use the 
intersection of cpuset's cpu list and task's own cpu list as long as it 
is not empty. Reducing the number of cpus assigned to a task may produce 
some unexpected behavior too.
>
> Also, is there similar effect with memory binding?

I think so, but memory binding is less frequently used and its effect is 
less noticeable.

Cheers,
Longman

>
> Thanks,
> Michal
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 15:23       ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 15:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <longman@redhat.com> wrote:
> Cgroup v1 doesn't have this problem.

v1 analogy would be:

	echo 2-3 >$dst/cpuset.cpus
	# job runs in $dst
	# one task T in $dst sets affinity just to one cpu
	# I rethink my config, I want to allow $dst more space
	echo 2-5 >$dst/cpuset.cpus

Most tasks in $dst happily utilize the new cpus but it breaks affinity
for T -- this must have been broken since ever.

(Or I'd argue that per-thread affinities are just recommendations, if I
have a task for nohz CPU, I should enforce its placement with cpuset
from the beginning.)

Michal

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 15:23       ` Michal Koutný
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Koutný @ 2022-07-28 15:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Cgroup v1 doesn't have this problem.

v1 analogy would be:

	echo 2-3 >$dst/cpuset.cpus
	# job runs in $dst
	# one task T in $dst sets affinity just to one cpu
	# I rethink my config, I want to allow $dst more space
	echo 2-5 >$dst/cpuset.cpus

Most tasks in $dst happily utilize the new cpus but it breaks affinity
for T -- this must have been broken since ever.

(Or I'd argue that per-thread affinities are just recommendations, if I
have a task for nohz CPU, I should enforce its placement with cpuset
from the beginning.)

Michal

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 15:35         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 15:35 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

On 7/28/22 11:23, Michal Koutný wrote:
> On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <longman@redhat.com> wrote:
>> Cgroup v1 doesn't have this problem.
> v1 analogy would be:
>
> 	echo 2-3 >$dst/cpuset.cpus
> 	# job runs in $dst
> 	# one task T in $dst sets affinity just to one cpu
> 	# I rethink my config, I want to allow $dst more space
> 	echo 2-5 >$dst/cpuset.cpus
>
> Most tasks in $dst happily utilize the new cpus but it breaks affinity
> for T -- this must have been broken since ever.
>
> (Or I'd argue that per-thread affinities are just recommendations, if I
> have a task for nohz CPU, I should enforce its placement with cpuset
> from the beginning.)

I should have clarified that what I meant is systemd on a cgroup v1 
environment doesn't cause this cpu list reset to happen. It doesn't mean 
that cgroup v1 has no similar problem. Sorry for the confusion.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 15:35         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 15:35 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/28/22 11:23, Michal Koutný wrote:
> On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> Cgroup v1 doesn't have this problem.
> v1 analogy would be:
>
> 	echo 2-3 >$dst/cpuset.cpus
> 	# job runs in $dst
> 	# one task T in $dst sets affinity just to one cpu
> 	# I rethink my config, I want to allow $dst more space
> 	echo 2-5 >$dst/cpuset.cpus
>
> Most tasks in $dst happily utilize the new cpus but it breaks affinity
> for T -- this must have been broken since ever.
>
> (Or I'd argue that per-thread affinities are just recommendations, if I
> have a task for nohz CPU, I should enforce its placement with cpuset
> from the beginning.)

I should have clarified that what I meant is systemd on a cgroup v1 
environment doesn't cause this cpu list reset to happen. It doesn't mean 
that cgroup v1 has no similar problem. Sorry for the confusion.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 16:50       ` Valentin Schneider
  0 siblings, 0 replies; 40+ messages in thread
From: Valentin Schneider @ 2022-07-28 16:50 UTC (permalink / raw)
  To: Waiman Long, Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Zefan Li, Johannes Weiner,
	cgroups, linux-kernel

On 28/07/22 10:59, Waiman Long wrote:
> On 7/28/22 10:44, Michal Koutný wrote:
>> This should apply only to tasks that were extracted out of the root
>> cgroup, no? (OK, those are all processes practically.)
> The reset is done on all cgroups in a particular subtree. In the case of
> cgroup root, it is all the processes in the system.
>>

I've been briefly playing with this, tasks in the cgroup root don't seem
affected on my end (QEMU + buildroot + latest tip/sched/core):

  $ mount -t cgroup2 none /sys/fs/cgroup
  $ /root/loop.sh &
  $ PID=$!
  $ taskset -pc 2-3 $PID
  pid 177's current affinity list: 0-3
  pid 177's new affinity list: 2,3
  $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  $ taskset -pc $PID
  pid 177's current affinity list: 2,3

However tasks extracted out as mentioned by Michal definitely are:

  $ mount -t cgroup2 none /sys/fs/cgroup
  $ /root/loop.sh &
  $ PID=$!
  $ taskset -pc 2-3 $PID
  pid 172's current affinity list: 0-3
  pid 172's new affinity list: 2,3
  $ mkdir /sys/fs/cgroup/foobar
  $ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
  $ taskset -pc $PID
  pid 172's current affinity list: 2,3
  $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  $ taskset -pc $PID
  pid 172's current affinity list: 0-3

IIUC this is just what happens anytime a task gets migrated to a new
cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
+cpuset migrates it to the /foobar one.

Does that match what you're seeing?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 16:50       ` Valentin Schneider
  0 siblings, 0 replies; 40+ messages in thread
From: Valentin Schneider @ 2022-07-28 16:50 UTC (permalink / raw)
  To: Waiman Long, Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 28/07/22 10:59, Waiman Long wrote:
> On 7/28/22 10:44, Michal Koutný wrote:
>> This should apply only to tasks that were extracted out of the root
>> cgroup, no? (OK, those are all processes practically.)
> The reset is done on all cgroups in a particular subtree. In the case of
> cgroup root, it is all the processes in the system.
>>

I've been briefly playing with this, tasks in the cgroup root don't seem
affected on my end (QEMU + buildroot + latest tip/sched/core):

  $ mount -t cgroup2 none /sys/fs/cgroup
  $ /root/loop.sh &
  $ PID=$!
  $ taskset -pc 2-3 $PID
  pid 177's current affinity list: 0-3
  pid 177's new affinity list: 2,3
  $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  $ taskset -pc $PID
  pid 177's current affinity list: 2,3

However tasks extracted out as mentioned by Michal definitely are:

  $ mount -t cgroup2 none /sys/fs/cgroup
  $ /root/loop.sh &
  $ PID=$!
  $ taskset -pc 2-3 $PID
  pid 172's current affinity list: 0-3
  pid 172's new affinity list: 2,3
  $ mkdir /sys/fs/cgroup/foobar
  $ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
  $ taskset -pc $PID
  pid 172's current affinity list: 2,3
  $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  $ taskset -pc $PID
  pid 172's current affinity list: 0-3

IIUC this is just what happens anytime a task gets migrated to a new
cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
+cpuset migrates it to the /foobar one.

Does that match what you're seeing?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 17:23   ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

Hello,

On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks have cpus affinity explicitly
> set by the users before. That is especially easy to trigger under a
> cgroup v2 environment where writing "+cpuset" to the root cgroup's
> cgroup.subtree_control file will reset the cpus affinity of all the
> processes in the system.
> 
> That is especially problematic in a nohz_full environment where the
> tasks running in the nohz_full CPUs usually have their cpus affinity
> explicitly set and will behave incorrectly if cpus affinity changes.
> 
> Fix this problem by adding a flag in the task structure to indicate that
> a task has their cpus affinity explicitly set before and make cpuset
> code not to change their cpus_allowed list unless the user chosen cpu
> list is no longer a subset of the cpus_allowed list of the cpuset itself.
> 
> With that change in place, it was verified that tasks that have its
> cpus affinity explicitly set will not be affected by changes made to
> the v2 cgroup.subtree_control files.

I think the underlying cause here is cpuset overwriting the cpumask the user
configured but that's a longer discussion.

> +/*
> + * Don't change the cpus_allowed list if cpus affinity has been explicitly
> + * set before unless the current cpu list is not a subset of the new cpu list.
> + */
> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> +				       const struct cpumask *new_mask)
> +{
> +	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
> +		return 0;
> +
> +	p->cpus_affinity_set = 0;
> +	return set_cpus_allowed_ptr(p, new_mask);
> +}

I wonder whether the more predictable behavior would be always not resetting
the cpumask if it's a subset of the new_mask. Also, shouldn't this check
p->cpus_mask instead of p->cpus_ptr?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 17:23   ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello,

On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks have cpus affinity explicitly
> set by the users before. That is especially easy to trigger under a
> cgroup v2 environment where writing "+cpuset" to the root cgroup's
> cgroup.subtree_control file will reset the cpus affinity of all the
> processes in the system.
> 
> That is especially problematic in a nohz_full environment where the
> tasks running in the nohz_full CPUs usually have their cpus affinity
> explicitly set and will behave incorrectly if cpus affinity changes.
> 
> Fix this problem by adding a flag in the task structure to indicate that
> a task has their cpus affinity explicitly set before and make cpuset
> code not to change their cpus_allowed list unless the user chosen cpu
> list is no longer a subset of the cpus_allowed list of the cpuset itself.
> 
> With that change in place, it was verified that tasks that have its
> cpus affinity explicitly set will not be affected by changes made to
> the v2 cgroup.subtree_control files.

I think the underlying cause here is cpuset overwriting the cpumask the user
configured but that's a longer discussion.

> +/*
> + * Don't change the cpus_allowed list if cpus affinity has been explicitly
> + * set before unless the current cpu list is not a subset of the new cpu list.
> + */
> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> +				       const struct cpumask *new_mask)
> +{
> +	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
> +		return 0;
> +
> +	p->cpus_affinity_set = 0;
> +	return set_cpus_allowed_ptr(p, new_mask);
> +}

I wonder whether the more predictable behavior would be always not resetting
the cpumask if it's a subset of the new_mask. Also, shouldn't this check
p->cpus_mask instead of p->cpus_ptr?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 17:26       ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:26 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Zefan Li, Johannes Weiner, cgroups, linux-kernel

On Thu, Jul 28, 2022 at 04:44:26PM +0200, Michal Koutný wrote:
> On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long <longman@redhat.com> wrote:
> > The cgroup_update_dfl_csses() function updates css associations when a
> > cgroup's subtree_control file is modified. Any changes made to a cgroup's
> > subtree_control file, however, will only affect its descendants but not
> > the cgroup itself. 
> 
> I find this correct.
> 
> > So there is no point in migrating csses associated with that cgroup.
> > We can skip them instead.
> 
> Alone it's not such a big win but it componds with the recent Tejun's
> threadgroup_rwsem elision.

The chance is that if you're writing to a cgroup's subtree_control, that
cgroup is gonna be empty. The only case I can think of that this would make
a difference is w/ threaded cgroups, but it does make sense.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 17:26       ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:26 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Zefan Li, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Jul 28, 2022 at 04:44:26PM +0200, Michal Koutný wrote:
> On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > The cgroup_update_dfl_csses() function updates css associations when a
> > cgroup's subtree_control file is modified. Any changes made to a cgroup's
> > subtree_control file, however, will only affect its descendants but not
> > the cgroup itself. 
> 
> I find this correct.
> 
> > So there is no point in migrating csses associated with that cgroup.
> > We can skip them instead.
> 
> Alone it's not such a big win but it componds with the recent Tejun's
> threadgroup_rwsem elision.

The chance is that if you're writing to a cgroup's subtree_control, that
cgroup is gonna be empty. The only case I can think of that this would make
a difference is w/ threaded cgroups, but it does make sense.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 17:27     ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long wrote:
> The cgroup_update_dfl_csses() function updates css associations when a
> cgroup's subtree_control file is modified. Any changes made to a cgroup's
> subtree_control file, however, will only affect its descendants but not
> the cgroup itself. So there is no point in migrating csses associated
> with that cgroup. We can skip them instead.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Applied to cgroup/for-5.20.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses()
@ 2022-07-28 17:27     ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 17:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Wed, Jul 27, 2022 at 08:58:15PM -0400, Waiman Long wrote:
> The cgroup_update_dfl_csses() function updates css associations when a
> cgroup's subtree_control file is modified. Any changes made to a cgroup's
> subtree_control file, however, will only affect its descendants but not
> the cgroup itself. So there is no point in migrating csses associated
> with that cgroup. We can skip them instead.
> 
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Applied to cgroup/for-5.20.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 17:42         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 17:42 UTC (permalink / raw)
  To: Valentin Schneider, Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Zefan Li, Johannes Weiner,
	cgroups, linux-kernel

On 7/28/22 12:50, Valentin Schneider wrote:
> On 28/07/22 10:59, Waiman Long wrote:
>> On 7/28/22 10:44, Michal Koutný wrote:
>>> This should apply only to tasks that were extracted out of the root
>>> cgroup, no? (OK, those are all processes practically.)
>> The reset is done on all cgroups in a particular subtree. In the case of
>> cgroup root, it is all the processes in the system.
> I've been briefly playing with this, tasks in the cgroup root don't seem
> affected on my end (QEMU + buildroot + latest tip/sched/core):
>
>    $ mount -t cgroup2 none /sys/fs/cgroup
>    $ /root/loop.sh &
>    $ PID=$!
>    $ taskset -pc 2-3 $PID
>    pid 177's current affinity list: 0-3
>    pid 177's new affinity list: 2,3
>    $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>    $ taskset -pc $PID
>    pid 177's current affinity list: 2,3
>
> However tasks extracted out as mentioned by Michal definitely are:
>
>    $ mount -t cgroup2 none /sys/fs/cgroup
>    $ /root/loop.sh &
>    $ PID=$!
>    $ taskset -pc 2-3 $PID
>    pid 172's current affinity list: 0-3
>    pid 172's new affinity list: 2,3
>    $ mkdir /sys/fs/cgroup/foobar
>    $ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
>    $ taskset -pc $PID
>    pid 172's current affinity list: 2,3
>    $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>    $ taskset -pc $PID
>    pid 172's current affinity list: 0-3
>
> IIUC this is just what happens anytime a task gets migrated to a new
> cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
> +cpuset migrates it to the /foobar one.
>
> Does that match what you're seeing?
>
Yes. echo "+cpuset" to subtree_control means tasks in the child cgroups 
will move to new cpusets. Those new cpusets will have the same cpu lists 
as the parent unless the cpuset.cpus files are explicitly written to. 
This patch will ensure that tasks that have explicitly set their cpu 
affinity won't be affected by this change.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 17:42         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 17:42 UTC (permalink / raw)
  To: Valentin Schneider, Michal Koutný
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Tejun Heo, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/28/22 12:50, Valentin Schneider wrote:
> On 28/07/22 10:59, Waiman Long wrote:
>> On 7/28/22 10:44, Michal Koutný wrote:
>>> This should apply only to tasks that were extracted out of the root
>>> cgroup, no? (OK, those are all processes practically.)
>> The reset is done on all cgroups in a particular subtree. In the case of
>> cgroup root, it is all the processes in the system.
> I've been briefly playing with this, tasks in the cgroup root don't seem
> affected on my end (QEMU + buildroot + latest tip/sched/core):
>
>    $ mount -t cgroup2 none /sys/fs/cgroup
>    $ /root/loop.sh &
>    $ PID=$!
>    $ taskset -pc 2-3 $PID
>    pid 177's current affinity list: 0-3
>    pid 177's new affinity list: 2,3
>    $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>    $ taskset -pc $PID
>    pid 177's current affinity list: 2,3
>
> However tasks extracted out as mentioned by Michal definitely are:
>
>    $ mount -t cgroup2 none /sys/fs/cgroup
>    $ /root/loop.sh &
>    $ PID=$!
>    $ taskset -pc 2-3 $PID
>    pid 172's current affinity list: 0-3
>    pid 172's new affinity list: 2,3
>    $ mkdir /sys/fs/cgroup/foobar
>    $ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
>    $ taskset -pc $PID
>    pid 172's current affinity list: 2,3
>    $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
>    $ taskset -pc $PID
>    pid 172's current affinity list: 0-3
>
> IIUC this is just what happens anytime a task gets migrated to a new
> cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
> +cpuset migrates it to the /foobar one.
>
> Does that match what you're seeing?
>
Yes. echo "+cpuset" to subtree_control means tasks in the child cgroups 
will move to new cpusets. Those new cpusets will have the same cpu lists 
as the parent unless the cpuset.cpus files are explicitly written to. 
This patch will ensure that tasks that have explicitly set their cpu 
affinity won't be affected by this change.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
  2022-07-28 17:23   ` Tejun Heo
  (?)
@ 2022-07-28 18:57   ` Waiman Long
  2022-07-28 19:02       ` Tejun Heo
  -1 siblings, 1 reply; 40+ messages in thread
From: Waiman Long @ 2022-07-28 18:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

On 7/28/22 13:23, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long wrote:
>> It was found that any change to the current cpuset hierarchy may reset
>> the cpus_allowed list of the tasks in the affected cpusets to the
>> default cpuset value even if those tasks have cpus affinity explicitly
>> set by the users before. That is especially easy to trigger under a
>> cgroup v2 environment where writing "+cpuset" to the root cgroup's
>> cgroup.subtree_control file will reset the cpus affinity of all the
>> processes in the system.
>>
>> That is especially problematic in a nohz_full environment where the
>> tasks running in the nohz_full CPUs usually have their cpus affinity
>> explicitly set and will behave incorrectly if cpus affinity changes.
>>
>> Fix this problem by adding a flag in the task structure to indicate that
>> a task has their cpus affinity explicitly set before and make cpuset
>> code not to change their cpus_allowed list unless the user chosen cpu
>> list is no longer a subset of the cpus_allowed list of the cpuset itself.
>>
>> With that change in place, it was verified that tasks that have its
>> cpus affinity explicitly set will not be affected by changes made to
>> the v2 cgroup.subtree_control files.
> I think the underlying cause here is cpuset overwriting the cpumask the user
> configured but that's a longer discussion.
>
>> +/*
>> + * Don't change the cpus_allowed list if cpus affinity has been explicitly
>> + * set before unless the current cpu list is not a subset of the new cpu list.
>> + */
>> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
>> +				       const struct cpumask *new_mask)
>> +{
>> +	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
>> +		return 0;
>> +
>> +	p->cpus_affinity_set = 0;
>> +	return set_cpus_allowed_ptr(p, new_mask);
>> +}
> I wonder whether the more predictable behavior would be always not resetting
> the cpumask if it's a subset of the new_mask.
There can be a counter argument that if a user found out that there is 
not enough cpus in a cpuset to meet its performance target, one can 
always increase the number of cpus in the cpuset. Generalizing this 
behavior to all the tasks irrespective if they have explicitly set cpus 
affinity before will disallow this use case.
> Also, shouldn't this check
> p->cpus_mask instead of p->cpus_ptr?

You are right. I should have used cpus_mask instead. Will send out a v2.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 19:02       ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 19:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

Hello,

On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
> There can be a counter argument that if a user found out that there is not
> enough cpus in a cpuset to meet its performance target, one can always
> increase the number of cpus in the cpuset. Generalizing this behavior to all
> the tasks irrespective if they have explicitly set cpus affinity before will
> disallow this use case.

This is nasty. The real solution here is separating out what user requested
and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
remember what the user requested in a separate cpumask and compute the
intersection into p->cpus_maks whenever something changes and apply
fallbacks on that final mask. Multiple parties updating the same variable is
never gonna lead to anything consistent and we're patching up for whatever
the immediate use case seems to need at the moment. That said, I'm not
necessarily against patching it up but if you're interested in delving into
it deeper, that'd be great.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 19:02       ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 19:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello,

On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
> There can be a counter argument that if a user found out that there is not
> enough cpus in a cpuset to meet its performance target, one can always
> increase the number of cpus in the cpuset. Generalizing this behavior to all
> the tasks irrespective if they have explicitly set cpus affinity before will
> disallow this use case.

This is nasty. The real solution here is separating out what user requested
and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
remember what the user requested in a separate cpumask and compute the
intersection into p->cpus_maks whenever something changes and apply
fallbacks on that final mask. Multiple parties updating the same variable is
never gonna lead to anything consistent and we're patching up for whatever
the immediate use case seems to need at the moment. That said, I'm not
necessarily against patching it up but if you're interested in delving into
it deeper, that'd be great.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 19:21         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 19:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

On 7/28/22 15:02, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
>> There can be a counter argument that if a user found out that there is not
>> enough cpus in a cpuset to meet its performance target, one can always
>> increase the number of cpus in the cpuset. Generalizing this behavior to all
>> the tasks irrespective if they have explicitly set cpus affinity before will
>> disallow this use case.
> This is nasty.

That is a nasty example, I know. There may be users depending on the 
existing behavior even if they don't know it. So I am a bit hesitant to 
change the default behavior like that. On the other hand, tasks that 
have explicitly set its cpu affinity certainly don't want to have 
unexpected change to that.

> The real solution here is separating out what user requested
> and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
> remember what the user requested in a separate cpumask and compute the
> intersection into p->cpus_maks whenever something changes and apply
> fallbacks on that final mask. Multiple parties updating the same variable is
> never gonna lead to anything consistent and we're patching up for whatever
> the immediate use case seems to need at the moment. That said, I'm not
> necessarily against patching it up but if you're interested in delving into
> it deeper, that'd be great.

I believe the current code is already restricting what cpu affinity that 
a user can request by limiting to those allowed by the current cpuset. 
Hotplug is another issue that may need to be addressed. I will update my 
patch to make it handle hotplug in a more graceful way.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 19:21         ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-28 19:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/28/22 15:02, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
>> There can be a counter argument that if a user found out that there is not
>> enough cpus in a cpuset to meet its performance target, one can always
>> increase the number of cpus in the cpuset. Generalizing this behavior to all
>> the tasks irrespective if they have explicitly set cpus affinity before will
>> disallow this use case.
> This is nasty.

That is a nasty example, I know. There may be users depending on the 
existing behavior even if they don't know it. So I am a bit hesitant to 
change the default behavior like that. On the other hand, tasks that 
have explicitly set its cpu affinity certainly don't want to have 
unexpected change to that.

> The real solution here is separating out what user requested
> and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
> remember what the user requested in a separate cpumask and compute the
> intersection into p->cpus_maks whenever something changes and apply
> fallbacks on that final mask. Multiple parties updating the same variable is
> never gonna lead to anything consistent and we're patching up for whatever
> the immediate use case seems to need at the moment. That said, I'm not
> necessarily against patching it up but if you're interested in delving into
> it deeper, that'd be great.

I believe the current code is already restricting what cpu affinity that 
a user can request by limiting to those allowed by the current cpuset. 
Hotplug is another issue that may need to be addressed. I will update my 
patch to make it handle hotplug in a more graceful way.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
  2022-07-28 19:21         ` Waiman Long
  (?)
@ 2022-07-28 20:44         ` Tejun Heo
  2022-07-28 21:04           ` Waiman Long
  -1 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 20:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

Hello,

On Thu, Jul 28, 2022 at 03:21:26PM -0400, Waiman Long wrote:
> On 7/28/22 15:02, Tejun Heo wrote:
> > On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
> > > There can be a counter argument that if a user found out that there is not
> > > enough cpus in a cpuset to meet its performance target, one can always
> > > increase the number of cpus in the cpuset. Generalizing this behavior to all
> > > the tasks irrespective if they have explicitly set cpus affinity before will
> > > disallow this use case.
> > This is nasty.
> 
> That is a nasty example, I know. There may be users depending on the
> existing behavior even if they don't know it. So I am a bit hesitant to
> change the default behavior like that. On the other hand, tasks that have
> explicitly set its cpu affinity certainly don't want to have unexpected
> change to that.

Yeah, I hear you. I'm on the same page.

> > The real solution here is separating out what user requested
> > and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
> > remember what the user requested in a separate cpumask and compute the
> > intersection into p->cpus_maks whenever something changes and apply
> > fallbacks on that final mask. Multiple parties updating the same variable is
> > never gonna lead to anything consistent and we're patching up for whatever
> > the immediate use case seems to need at the moment. That said, I'm not
> > necessarily against patching it up but if you're interested in delving into
> > it deeper, that'd be great.
> 
> I believe the current code is already restricting what cpu affinity that a
> user can request by limiting to those allowed by the current cpuset. Hotplug
> is another issue that may need to be addressed. I will update my patch to
> make it handle hotplug in a more graceful way.

So, the patch you proposed is making the code remember one special aspect of
user requested configuration - whether it configured it or not, and trying
to preserve that particular state as cpuset state changes. It addresses the
immediate problem but it is a very partial approach. Let's say a task wanna
be affined to one logical thread of each core and set its mask to 0x5555.
Now, let's say cpuset got enabled and enforced 0xff and affined the task to
0xff. After a while, the cgroup got more cpus allocated and its cpuset now
has 0xfff. Ideally, what should happen is the task now having the effective
mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
on which way we decide to misbehave.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
  2022-07-28 20:44         ` Tejun Heo
@ 2022-07-28 21:04           ` Waiman Long
  2022-07-28 21:39               ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Waiman Long @ 2022-07-28 21:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

On 7/28/22 16:44, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 03:21:26PM -0400, Waiman Long wrote:
>> On 7/28/22 15:02, Tejun Heo wrote:
>>> On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
>>>> There can be a counter argument that if a user found out that there is not
>>>> enough cpus in a cpuset to meet its performance target, one can always
>>>> increase the number of cpus in the cpuset. Generalizing this behavior to all
>>>> the tasks irrespective if they have explicitly set cpus affinity before will
>>>> disallow this use case.
>>> This is nasty.
>> That is a nasty example, I know. There may be users depending on the
>> existing behavior even if they don't know it. So I am a bit hesitant to
>> change the default behavior like that. On the other hand, tasks that have
>> explicitly set its cpu affinity certainly don't want to have unexpected
>> change to that.
> Yeah, I hear you. I'm on the same page.
>
>>> The real solution here is separating out what user requested
>>> and the mask that cpuset (or cpu hotplug) needs to apply on top. ie.
>>> remember what the user requested in a separate cpumask and compute the
>>> intersection into p->cpus_maks whenever something changes and apply
>>> fallbacks on that final mask. Multiple parties updating the same variable is
>>> never gonna lead to anything consistent and we're patching up for whatever
>>> the immediate use case seems to need at the moment. That said, I'm not
>>> necessarily against patching it up but if you're interested in delving into
>>> it deeper, that'd be great.
>> I believe the current code is already restricting what cpu affinity that a
>> user can request by limiting to those allowed by the current cpuset. Hotplug
>> is another issue that may need to be addressed. I will update my patch to
>> make it handle hotplug in a more graceful way.
> af
> So, the patch you proposed is making the code remember one special aspect of
> user requested configuration - whether it configured it or not, and trying
> to preserve that particular state as cpuset state changes. It addresses the
> immediate problem but it is a very partial approach. Let's say a task wanna
> be affined to one logical thread of each core and set its mask to 0x5555.
> Now, let's say cpuset got enabled and enforced 0xff and affined the task to
> 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
> has 0xfff. Ideally, what should happen is the task now having the effective
> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
> on which way we decide to misbehave.

OK, I see what you want to accomplish. To fully address this issue, we 
will need to have a new cpumask variable in the the task structure which 
will be allocated if sched_setaffinity() is ever called. I can rework my 
patch to use this approach.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 21:39               ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 21:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups, linux-kernel

Hello, Waiman.

On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
> > So, the patch you proposed is making the code remember one special aspect of
> > user requested configuration - whether it configured it or not, and trying
> > to preserve that particular state as cpuset state changes. It addresses the
> > immediate problem but it is a very partial approach. Let's say a task wanna
> > be affined to one logical thread of each core and set its mask to 0x5555.
> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
> > has 0xfff. Ideally, what should happen is the task now having the effective
> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
> > on which way we decide to misbehave.
> 
> OK, I see what you want to accomplish. To fully address this issue, we will
> need to have a new cpumask variable in the the task structure which will be
> allocated if sched_setaffinity() is ever called. I can rework my patch to
> use this approach.

Yeah, we'd need to track what user requested separately from the currently
effective cpumask. Let's make sure that the scheduler folks are on board
before committing to the idea tho. Peter, Ingo, what do you guys think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-28 21:39               ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2022-07-28 21:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Zefan Li,
	Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello, Waiman.

On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
> > So, the patch you proposed is making the code remember one special aspect of
> > user requested configuration - whether it configured it or not, and trying
> > to preserve that particular state as cpuset state changes. It addresses the
> > immediate problem but it is a very partial approach. Let's say a task wanna
> > be affined to one logical thread of each core and set its mask to 0x5555.
> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
> > has 0xfff. Ideally, what should happen is the task now having the effective
> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
> > on which way we decide to misbehave.
> 
> OK, I see what you want to accomplish. To fully address this issue, we will
> need to have a new cpumask variable in the the task structure which will be
> allocated if sched_setaffinity() is ever called. I can rework my patch to
> use this approach.

Yeah, we'd need to track what user requested separately from the currently
effective cpumask. Let's make sure that the scheduler folks are on board
before committing to the idea tho. Peter, Ingo, what do you guys think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 14:15                 ` Valentin Schneider
  0 siblings, 0 replies; 40+ messages in thread
From: Valentin Schneider @ 2022-07-29 14:15 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner, cgroups,
	linux-kernel

On 28/07/22 11:39, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>> > So, the patch you proposed is making the code remember one special aspect of
>> > user requested configuration - whether it configured it or not, and trying
>> > to preserve that particular state as cpuset state changes. It addresses the
>> > immediate problem but it is a very partial approach. Let's say a task wanna
>> > be affined to one logical thread of each core and set its mask to 0x5555.
>> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>> > has 0xfff. Ideally, what should happen is the task now having the effective
>> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>> > on which way we decide to misbehave.
>>
>> OK, I see what you want to accomplish. To fully address this issue, we will
>> need to have a new cpumask variable in the the task structure which will be
>> allocated if sched_setaffinity() is ever called. I can rework my patch to
>> use this approach.
>
> Yeah, we'd need to track what user requested separately from the currently
> effective cpumask. Let's make sure that the scheduler folks are on board
> before committing to the idea tho. Peter, Ingo, what do you guys think?
>

FWIW on a runtime overhead side of things I think it'll be OK as that
should be just an extra mask copy  in sched_setaffinity() and a subset
check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
less clear (when, if ever, do we clear the user-defined mask? Will it keep
haunting us even after moving a task to a disjoint cpuset partition?).

There's also if/how that new mask should be exposed, because attaching a
task to a cpuset will now yield a not-necessarily-obvious affinity -
e.g. in the thread affinity example above, if the initial affinity setting
was done ages ago by some system tool, IMO the user needs a way to be able
to expect/understand the result of 0x555 rather than 0xfff.

While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
then again that one is for "special" hardware...

> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 14:15                 ` Valentin Schneider
  0 siblings, 0 replies; 40+ messages in thread
From: Valentin Schneider @ 2022-07-29 14:15 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 28/07/22 11:39, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>> > So, the patch you proposed is making the code remember one special aspect of
>> > user requested configuration - whether it configured it or not, and trying
>> > to preserve that particular state as cpuset state changes. It addresses the
>> > immediate problem but it is a very partial approach. Let's say a task wanna
>> > be affined to one logical thread of each core and set its mask to 0x5555.
>> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>> > has 0xfff. Ideally, what should happen is the task now having the effective
>> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>> > on which way we decide to misbehave.
>>
>> OK, I see what you want to accomplish. To fully address this issue, we will
>> need to have a new cpumask variable in the the task structure which will be
>> allocated if sched_setaffinity() is ever called. I can rework my patch to
>> use this approach.
>
> Yeah, we'd need to track what user requested separately from the currently
> effective cpumask. Let's make sure that the scheduler folks are on board
> before committing to the idea tho. Peter, Ingo, what do you guys think?
>

FWIW on a runtime overhead side of things I think it'll be OK as that
should be just an extra mask copy  in sched_setaffinity() and a subset
check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
less clear (when, if ever, do we clear the user-defined mask? Will it keep
haunting us even after moving a task to a disjoint cpuset partition?).

There's also if/how that new mask should be exposed, because attaching a
task to a cpuset will now yield a not-necessarily-obvious affinity -
e.g. in the thread affinity example above, if the initial affinity setting
was done ages ago by some system tool, IMO the user needs a way to be able
to expect/understand the result of 0x555 rather than 0xfff.

While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
then again that one is for "special" hardware...

> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 14:50                   ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-29 14:50 UTC (permalink / raw)
  To: Valentin Schneider, Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner, cgroups,
	linux-kernel

On 7/29/22 10:15, Valentin Schneider wrote:
> On 28/07/22 11:39, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>> So, the patch you proposed is making the code remember one special aspect of
>>>> user requested configuration - whether it configured it or not, and trying
>>>> to preserve that particular state as cpuset state changes. It addresses the
>>>> immediate problem but it is a very partial approach. Let's say a task wanna
>>>> be affined to one logical thread of each core and set its mask to 0x5555.
>>>> Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>>>> 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>>>> has 0xfff. Ideally, what should happen is the task now having the effective
>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>>>> on which way we decide to misbehave.
>>> OK, I see what you want to accomplish. To fully address this issue, we will
>>> need to have a new cpumask variable in the the task structure which will be
>>> allocated if sched_setaffinity() is ever called. I can rework my patch to
>>> use this approach.
>> Yeah, we'd need to track what user requested separately from the currently
>> effective cpumask. Let's make sure that the scheduler folks are on board
>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>
> FWIW on a runtime overhead side of things I think it'll be OK as that
> should be just an extra mask copy  in sched_setaffinity() and a subset
> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
> less clear (when, if ever, do we clear the user-defined mask? Will it keep
> haunting us even after moving a task to a disjoint cpuset partition?).

The runtime overhead should be minimal. It is the behavioral side that 
we should be careful about. It is a change in existing behavior and we 
don't want to cause surprise to the users. Currently, a task that set 
its cpu affinity explicitly will have its affinity reset whenever there 
is any change to the cpuset it belongs to or a hotplug event touch any 
cpu in the current cpuset. The new behavior we are proposing here is 
that it will try its best to keep the cpu affinity that the user 
requested within the constraint of the current cpuset as well as the cpu 
hotplug state.


>
> There's also if/how that new mask should be exposed, because attaching a
> task to a cpuset will now yield a not-necessarily-obvious affinity -
> e.g. in the thread affinity example above, if the initial affinity setting
> was done ages ago by some system tool, IMO the user needs a way to be able
> to expect/understand the result of 0x555 rather than 0xfff.

Users can use sched_getaffinity(2) to retrieve the current cpu affinity. 
It is up to users to set another one if they don't like the current one. 
I don't think we need to return what the previous requested cpu affinity 
is. They are suppose to know that or they can set their own if they 
don't like it. \

Cheers,
Longman

>
> While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
> then again that one is for "special" hardware...
>
>> Thanks.
>>
>> --
>> tejun


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 14:50                   ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-29 14:50 UTC (permalink / raw)
  To: Valentin Schneider, Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 7/29/22 10:15, Valentin Schneider wrote:
> On 28/07/22 11:39, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>> So, the patch you proposed is making the code remember one special aspect of
>>>> user requested configuration - whether it configured it or not, and trying
>>>> to preserve that particular state as cpuset state changes. It addresses the
>>>> immediate problem but it is a very partial approach. Let's say a task wanna
>>>> be affined to one logical thread of each core and set its mask to 0x5555.
>>>> Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>>>> 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>>>> has 0xfff. Ideally, what should happen is the task now having the effective
>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>>>> on which way we decide to misbehave.
>>> OK, I see what you want to accomplish. To fully address this issue, we will
>>> need to have a new cpumask variable in the the task structure which will be
>>> allocated if sched_setaffinity() is ever called. I can rework my patch to
>>> use this approach.
>> Yeah, we'd need to track what user requested separately from the currently
>> effective cpumask. Let's make sure that the scheduler folks are on board
>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>
> FWIW on a runtime overhead side of things I think it'll be OK as that
> should be just an extra mask copy  in sched_setaffinity() and a subset
> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
> less clear (when, if ever, do we clear the user-defined mask? Will it keep
> haunting us even after moving a task to a disjoint cpuset partition?).

The runtime overhead should be minimal. It is the behavioral side that 
we should be careful about. It is a change in existing behavior and we 
don't want to cause surprise to the users. Currently, a task that set 
its cpu affinity explicitly will have its affinity reset whenever there 
is any change to the cpuset it belongs to or a hotplug event touch any 
cpu in the current cpuset. The new behavior we are proposing here is 
that it will try its best to keep the cpu affinity that the user 
requested within the constraint of the current cpuset as well as the cpu 
hotplug state.


>
> There's also if/how that new mask should be exposed, because attaching a
> task to a cpuset will now yield a not-necessarily-obvious affinity -
> e.g. in the thread affinity example above, if the initial affinity setting
> was done ages ago by some system tool, IMO the user needs a way to be able
> to expect/understand the result of 0x555 rather than 0xfff.

Users can use sched_getaffinity(2) to retrieve the current cpu affinity. 
It is up to users to set another one if they don't like the current one. 
I don't think we need to return what the previous requested cpu affinity 
is. They are suppose to know that or they can set their own if they 
don't like it. \

Cheers,
Longman

>
> While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
> then again that one is for "special" hardware...
>
>> Thanks.
>>
>> --
>> tejun


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 18:31                     ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-29 18:31 UTC (permalink / raw)
  To: Valentin Schneider, Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner, cgroups,
	linux-kernel, Will Deacon

On 7/29/22 10:50, Waiman Long wrote:
> On 7/29/22 10:15, Valentin Schneider wrote:
>> On 28/07/22 11:39, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>>> So, the patch you proposed is making the code remember one special 
>>>>> aspect of
>>>>> user requested configuration - whether it configured it or not, 
>>>>> and trying
>>>>> to preserve that particular state as cpuset state changes. It 
>>>>> addresses the
>>>>> immediate problem but it is a very partial approach. Let's say a 
>>>>> task wanna
>>>>> be affined to one logical thread of each core and set its mask to 
>>>>> 0x5555.
>>>>> Now, let's say cpuset got enabled and enforced 0xff and affined 
>>>>> the task to
>>>>> 0xff. After a while, the cgroup got more cpus allocated and its 
>>>>> cpuset now
>>>>> has 0xfff. Ideally, what should happen is the task now having the 
>>>>> effective
>>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 
>>>>> depending
>>>>> on which way we decide to misbehave.
>>>> OK, I see what you want to accomplish. To fully address this issue, 
>>>> we will
>>>> need to have a new cpumask variable in the the task structure which 
>>>> will be
>>>> allocated if sched_setaffinity() is ever called. I can rework my 
>>>> patch to
>>>> use this approach.
>>> Yeah, we'd need to track what user requested separately from the 
>>> currently
>>> effective cpumask. Let's make sure that the scheduler folks are on 
>>> board
>>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>>
>> FWIW on a runtime overhead side of things I think it'll be OK as that
>> should be just an extra mask copy  in sched_setaffinity() and a subset
>> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a 
>> bit
>> less clear (when, if ever, do we clear the user-defined mask? Will it 
>> keep
>> haunting us even after moving a task to a disjoint cpuset partition?).
>
> The runtime overhead should be minimal. It is the behavioral side that 
> we should be careful about. It is a change in existing behavior and we 
> don't want to cause surprise to the users. Currently, a task that set 
> its cpu affinity explicitly will have its affinity reset whenever 
> there is any change to the cpuset it belongs to or a hotplug event 
> touch any cpu in the current cpuset. The new behavior we are proposing 
> here is that it will try its best to keep the cpu affinity that the 
> user requested within the constraint of the current cpuset as well as 
> the cpu hotplug state.
>
>
>>
>> There's also if/how that new mask should be exposed, because attaching a
>> task to a cpuset will now yield a not-necessarily-obvious affinity -
>> e.g. in the thread affinity example above, if the initial affinity 
>> setting
>> was done ages ago by some system tool, IMO the user needs a way to be 
>> able
>> to expect/understand the result of 0x555 rather than 0xfff.
>
> Users can use sched_getaffinity(2) to retrieve the current cpu 
> affinity. It is up to users to set another one if they don't like the 
> current one. I don't think we need to return what the previous 
> requested cpu affinity is. They are suppose to know that or they can 
> set their own if they don't like it. \

Looking at Will's series that introduced user_cpus_ptr, I think we can 
overlay our proposal on top of that. So calling sched_setaffinity() will 
also update user_cpus_ptr. We may still need a flag to indicate whether 
user_cpus_ptr is set up because of sched_setaffinity() or due to a call 
to force_compatible_cpus_allowed_ptr() from arm64 arch code. That will 
make our work easier as some of the infrastructure is already there. I 
am looking forward for your feedback.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set
@ 2022-07-29 18:31                     ` Waiman Long
  0 siblings, 0 replies; 40+ messages in thread
From: Waiman Long @ 2022-07-29 18:31 UTC (permalink / raw)
  To: Valentin Schneider, Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Zefan Li, Johannes Weiner,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Will Deacon

On 7/29/22 10:50, Waiman Long wrote:
> On 7/29/22 10:15, Valentin Schneider wrote:
>> On 28/07/22 11:39, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>>> So, the patch you proposed is making the code remember one special 
>>>>> aspect of
>>>>> user requested configuration - whether it configured it or not, 
>>>>> and trying
>>>>> to preserve that particular state as cpuset state changes. It 
>>>>> addresses the
>>>>> immediate problem but it is a very partial approach. Let's say a 
>>>>> task wanna
>>>>> be affined to one logical thread of each core and set its mask to 
>>>>> 0x5555.
>>>>> Now, let's say cpuset got enabled and enforced 0xff and affined 
>>>>> the task to
>>>>> 0xff. After a while, the cgroup got more cpus allocated and its 
>>>>> cpuset now
>>>>> has 0xfff. Ideally, what should happen is the task now having the 
>>>>> effective
>>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 
>>>>> depending
>>>>> on which way we decide to misbehave.
>>>> OK, I see what you want to accomplish. To fully address this issue, 
>>>> we will
>>>> need to have a new cpumask variable in the the task structure which 
>>>> will be
>>>> allocated if sched_setaffinity() is ever called. I can rework my 
>>>> patch to
>>>> use this approach.
>>> Yeah, we'd need to track what user requested separately from the 
>>> currently
>>> effective cpumask. Let's make sure that the scheduler folks are on 
>>> board
>>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>>
>> FWIW on a runtime overhead side of things I think it'll be OK as that
>> should be just an extra mask copy  in sched_setaffinity() and a subset
>> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a 
>> bit
>> less clear (when, if ever, do we clear the user-defined mask? Will it 
>> keep
>> haunting us even after moving a task to a disjoint cpuset partition?).
>
> The runtime overhead should be minimal. It is the behavioral side that 
> we should be careful about. It is a change in existing behavior and we 
> don't want to cause surprise to the users. Currently, a task that set 
> its cpu affinity explicitly will have its affinity reset whenever 
> there is any change to the cpuset it belongs to or a hotplug event 
> touch any cpu in the current cpuset. The new behavior we are proposing 
> here is that it will try its best to keep the cpu affinity that the 
> user requested within the constraint of the current cpuset as well as 
> the cpu hotplug state.
>
>
>>
>> There's also if/how that new mask should be exposed, because attaching a
>> task to a cpuset will now yield a not-necessarily-obvious affinity -
>> e.g. in the thread affinity example above, if the initial affinity 
>> setting
>> was done ages ago by some system tool, IMO the user needs a way to be 
>> able
>> to expect/understand the result of 0x555 rather than 0xfff.
>
> Users can use sched_getaffinity(2) to retrieve the current cpu 
> affinity. It is up to users to set another one if they don't like the 
> current one. I don't think we need to return what the previous 
> requested cpu affinity is. They are suppose to know that or they can 
> set their own if they don't like it. \

Looking at Will's series that introduced user_cpus_ptr, I think we can 
overlay our proposal on top of that. So calling sched_setaffinity() will 
also update user_cpus_ptr. We may still need a flag to indicate whether 
user_cpus_ptr is set up because of sched_setaffinity() or due to a call 
to force_compatible_cpus_allowed_ptr() from arm64 arch code. That will 
make our work easier as some of the infrastructure is already there. I 
am looking forward for your feedback.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-07-29 18:31 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-28  0:58 [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set Waiman Long
2022-07-28  0:58 ` Waiman Long
2022-07-28  0:58 ` [PATCH 2/2] cgroup: Skip subtree root in cgroup_update_dfl_csses() Waiman Long
2022-07-28  0:58   ` Waiman Long
2022-07-28 14:44   ` Michal Koutný
2022-07-28 14:44     ` Michal Koutný
2022-07-28 14:49     ` Waiman Long
2022-07-28 17:26     ` Tejun Heo
2022-07-28 17:26       ` Tejun Heo
2022-07-28 17:27   ` Tejun Heo
2022-07-28 17:27     ` Tejun Heo
2022-07-28 14:44 ` [PATCH 1/2] cgroup/cpuset: Keep current cpus list if cpus affinity was explicitly set Michal Koutný
2022-07-28 14:44   ` Michal Koutný
2022-07-28 14:59   ` Waiman Long
2022-07-28 14:59     ` Waiman Long
2022-07-28 15:23     ` Michal Koutný
2022-07-28 15:23       ` Michal Koutný
2022-07-28 15:35       ` Waiman Long
2022-07-28 15:35         ` Waiman Long
2022-07-28 16:50     ` Valentin Schneider
2022-07-28 16:50       ` Valentin Schneider
2022-07-28 17:42       ` Waiman Long
2022-07-28 17:42         ` Waiman Long
2022-07-28 17:23 ` Tejun Heo
2022-07-28 17:23   ` Tejun Heo
2022-07-28 18:57   ` Waiman Long
2022-07-28 19:02     ` Tejun Heo
2022-07-28 19:02       ` Tejun Heo
2022-07-28 19:21       ` Waiman Long
2022-07-28 19:21         ` Waiman Long
2022-07-28 20:44         ` Tejun Heo
2022-07-28 21:04           ` Waiman Long
2022-07-28 21:39             ` Tejun Heo
2022-07-28 21:39               ` Tejun Heo
2022-07-29 14:15               ` Valentin Schneider
2022-07-29 14:15                 ` Valentin Schneider
2022-07-29 14:50                 ` Waiman Long
2022-07-29 14:50                   ` Waiman Long
2022-07-29 18:31                   ` Waiman Long
2022-07-29 18:31                     ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.