linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
@ 2023-02-03 16:40 Waiman Long
  2023-02-03 21:00 ` Tejun Heo
  2023-02-04 10:01 ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Waiman Long @ 2023-02-03 16:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Will Deacon, Peter Zijlstra
  Cc: linux-kernel, cgroups, kernel-team, Waiman Long

Since commit 8f9ea86fdf99 ("sched: Always preserve the user
requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
__sched_setaffinity() unconditionally. This helps to expose a bug in
the current cpuset hotplug code where the cpumasks of the tasks in
the top cpuset are not updated at all when some CPUs become online or
offline. It is likely caused by the fact that some of the tasks in the
top cpuset, like percpu kthreads, cannot have their cpu affinity changed.

One way to reproduce this as suggested by Peter is:
 - boot machine
 - offline all CPUs except one
 - taskset -p ffffffff $$
 - online all CPUs

Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
includes offline CPUs for those tasks that are in the top cpuset. For
tasks not in the top cpuset, the old rule applies and only online CPUs
will be returned in the mask since hotplug events will update their
cpumasks accordingly.

Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <will@kernel.org>
Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 207bafdb05e8..11554e5845f7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3707,15 +3707,38 @@ void __init cpuset_init_smp(void)
  * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
  * attached to the specified @tsk.  Guaranteed to return some non-empty
  * subset of cpu_online_mask, even if this means going outside the
- * tasks cpuset.
+ * tasks cpuset, except when the task is in the top cpuset.
  **/
 
 void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
 	unsigned long flags;
+	struct cpuset *cs;
 
 	spin_lock_irqsave(&callback_lock, flags);
-	guarantee_online_cpus(tsk, pmask);
+	rcu_read_lock();
+
+	cs = task_cs(tsk);
+	if (cs != &top_cpuset)
+		guarantee_online_cpus(tsk, pmask);
+	/*
+	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
+	 * when a hotplug online/offline event happens. So we include all
+	 * offline cpus in the allowed cpu list.
+	 */
+	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
+		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+
+		/*
+		 * We first exclude cpus allocated to partitions. If there is no
+		 * allowable online cpu left, we fall back to all possible cpus.
+		 */
+		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);
+		if (!cpumask_intersects(pmask, cpu_online_mask))
+			cpumask_copy(pmask, possible_mask);
+	}
+
+	rcu_read_unlock();
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-03 16:40 [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks Waiman Long
@ 2023-02-03 21:00 ` Tejun Heo
  2023-02-03 22:26   ` Waiman Long
  2023-02-04 13:32   ` Will Deacon
  2023-02-04 10:01 ` Peter Zijlstra
  1 sibling, 2 replies; 9+ messages in thread
From: Tejun Heo @ 2023-02-03 21:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Will Deacon, Peter Zijlstra,
	linux-kernel, cgroups, kernel-team

On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
> Since commit 8f9ea86fdf99 ("sched: Always preserve the user
> requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
> __sched_setaffinity() unconditionally. This helps to expose a bug in
> the current cpuset hotplug code where the cpumasks of the tasks in
> the top cpuset are not updated at all when some CPUs become online or
> offline. It is likely caused by the fact that some of the tasks in the
> top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
> 
> One way to reproduce this as suggested by Peter is:
>  - boot machine
>  - offline all CPUs except one
>  - taskset -p ffffffff $$
>  - online all CPUs
> 
> Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
> includes offline CPUs for those tasks that are in the top cpuset. For
> tasks not in the top cpuset, the old rule applies and only online CPUs
> will be returned in the mask since hotplug events will update their
> cpumasks accordingly.
> 
> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> Reported-by: Will Deacon <will@kernel.org>
> Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>

So, this is the replacement for the first patch[1] Will posted, right?

>  void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  {
>  	unsigned long flags;
> +	struct cpuset *cs;
>  
>  	spin_lock_irqsave(&callback_lock, flags);
> -	guarantee_online_cpus(tsk, pmask);
> +	rcu_read_lock();
> +
> +	cs = task_cs(tsk);
> +	if (cs != &top_cpuset)
> +		guarantee_online_cpus(tsk, pmask);
> +	/*
> +	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
> +	 * when a hotplug online/offline event happens. So we include all
> +	 * offline cpus in the allowed cpu list.
> +	 */
> +	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
> +		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> +		/*
> +		 * We first exclude cpus allocated to partitions. If there is no
> +		 * allowable online cpu left, we fall back to all possible cpus.
> +		 */
> +		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);

and the differences are that

* It's only applied to the root cgroup.

* Cpus taken up by partitions are excluded.

Is my understanding correct?

> +		if (!cpumask_intersects(pmask, cpu_online_mask))
> +			cpumask_copy(pmask, possible_mask);
> +	}
> +
> +	rcu_read_unlock();
>  	spin_unlock_irqrestore(&callback_lock, flags);

So, I suppose you're suggesting applying this patch instead of the one Will
Deacon posted[1] and we need Will's second patch[2] on top, right?

[1] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org
[2] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-03 21:00 ` Tejun Heo
@ 2023-02-03 22:26   ` Waiman Long
  2023-02-04 13:32   ` Will Deacon
  1 sibling, 0 replies; 9+ messages in thread
From: Waiman Long @ 2023-02-03 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Will Deacon, Peter Zijlstra,
	linux-kernel, cgroups, kernel-team

On 2/3/23 16:00, Tejun Heo wrote:
> On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user
>> requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
>> __sched_setaffinity() unconditionally. This helps to expose a bug in
>> the current cpuset hotplug code where the cpumasks of the tasks in
>> the top cpuset are not updated at all when some CPUs become online or
>> offline. It is likely caused by the fact that some of the tasks in the
>> top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
>>
>> One way to reproduce this as suggested by Peter is:
>>   - boot machine
>>   - offline all CPUs except one
>>   - taskset -p ffffffff $$
>>   - online all CPUs
>>
>> Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
>> includes offline CPUs for those tasks that are in the top cpuset. For
>> tasks not in the top cpuset, the old rule applies and only online CPUs
>> will be returned in the mask since hotplug events will update their
>> cpumasks accordingly.
>>
>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
>> Reported-by: Will Deacon <will@kernel.org>
>> Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> So, this is the replacement for the first patch[1] Will posted, right?

Yes, if Will and Peter has no objection. I think it is less risky and 
handle the partition case better.

With v1, Will's patch should get similar result as the existing 
guarantee_online_cpus() function since we can infer offline cpus from 
cpus_allowed. With v2, it does include offline cpus correctly, I 
believe, as long as no partition is enabled. However, the hotplug code 
is able to update the cpumasks when a CPU is onlined. So the presence of 
offline CPUs is nice to have, but not essential.

>
>>   void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>>   {
>>   	unsigned long flags;
>> +	struct cpuset *cs;
>>   
>>   	spin_lock_irqsave(&callback_lock, flags);
>> -	guarantee_online_cpus(tsk, pmask);
>> +	rcu_read_lock();
>> +
>> +	cs = task_cs(tsk);
>> +	if (cs != &top_cpuset)
>> +		guarantee_online_cpus(tsk, pmask);
>> +	/*
>> +	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
>> +	 * when a hotplug online/offline event happens. So we include all
>> +	 * offline cpus in the allowed cpu list.
>> +	 */
>> +	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
>> +		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
>> +
>> +		/*
>> +		 * We first exclude cpus allocated to partitions. If there is no
>> +		 * allowable online cpu left, we fall back to all possible cpus.
>> +		 */
>> +		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);
> and the differences are that
>
> * It's only applied to the root cgroup.
>
> * Cpus taken up by partitions are excluded.
>
> Is my understanding correct?
Yes, that is correct.
>
>> +		if (!cpumask_intersects(pmask, cpu_online_mask))
>> +			cpumask_copy(pmask, possible_mask);
>> +	}
>> +
>> +	rcu_read_unlock();
>>   	spin_unlock_irqrestore(&callback_lock, flags);
> So, I suppose you're suggesting applying this patch instead of the one Will
> Deacon posted[1] and we need Will's second patch[2] on top, right?
Right. Let hear if Will and Peter agree with this plan. I have tested 
this patch and it passed Peter's reproducer test correctly. During 
testing, I uncovered another bug in the cpu affinity code which results 
in a separate scheduler patch to fix it.
>
> [1] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org
> [2] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org
>
> Thanks.
Cheers,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-03 16:40 [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks Waiman Long
  2023-02-03 21:00 ` Tejun Heo
@ 2023-02-04 10:01 ` Peter Zijlstra
  2023-02-05  5:00   ` Waiman Long
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2023-02-04 10:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	cgroups, kernel-team

On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
> Since commit 8f9ea86fdf99 ("sched: Always preserve the user
> requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
> __sched_setaffinity() unconditionally. This helps to expose a bug in
> the current cpuset hotplug code where the cpumasks of the tasks in
> the top cpuset are not updated at all when some CPUs become online or
> offline. It is likely caused by the fact that some of the tasks in the
> top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
> 
> One way to reproduce this as suggested by Peter is:
>  - boot machine
>  - offline all CPUs except one
>  - taskset -p ffffffff $$
>  - online all CPUs
> 
> Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
> includes offline CPUs for those tasks that are in the top cpuset. For
> tasks not in the top cpuset, the old rule applies and only online CPUs
> will be returned in the mask since hotplug events will update their
> cpumasks accordingly.

So you get the task_cpu_possible_mask() interaction vs cpusets horribly
wrong here, but given the very sorry state of task_cpu_possible_mask()
correctness of cpuset as a whole that might just not matter at this
point.

I do very much hate how you add exceptions on exceptions instead of
looking to do something right :-(

Fixing that parition case in my patch is 1 extra line and then I think
it fundamentally does the right thing and can serve as a basis for
fixing cpuset as a whole.

> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> Reported-by: Will Deacon <will@kernel.org>
> Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 207bafdb05e8..11554e5845f7 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3707,15 +3707,38 @@ void __init cpuset_init_smp(void)
>   * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
>   * attached to the specified @tsk.  Guaranteed to return some non-empty
>   * subset of cpu_online_mask, even if this means going outside the
> - * tasks cpuset.
> + * tasks cpuset, except when the task is in the top cpuset.
>   **/
>  
>  void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  {
>  	unsigned long flags;
> +	struct cpuset *cs;
>  
>  	spin_lock_irqsave(&callback_lock, flags);
> -	guarantee_online_cpus(tsk, pmask);
> +	rcu_read_lock();
> +
> +	cs = task_cs(tsk);
> +	if (cs != &top_cpuset)
> +		guarantee_online_cpus(tsk, pmask);
> +	/*
> +	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
> +	 * when a hotplug online/offline event happens. So we include all
> +	 * offline cpus in the allowed cpu list.
> +	 */

I don't like TODO there, I really don't think CPUSET should update root
tasks, that means yet another fundamental difference between
CPUSET={y,n}.

> +	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
> +		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> +		/*
> +		 * We first exclude cpus allocated to partitions. If there is no
> +		 * allowable online cpu left, we fall back to all possible cpus.
> +		 */
> +		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);
> +		if (!cpumask_intersects(pmask, cpu_online_mask))
> +			cpumask_copy(pmask, possible_mask);
> +	}
> +
> +	rcu_read_unlock();
>  	spin_unlock_irqrestore(&callback_lock, flags);
>  }

I really detest this patch, but if you insist it might just do :-/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-03 21:00 ` Tejun Heo
  2023-02-03 22:26   ` Waiman Long
@ 2023-02-04 13:32   ` Will Deacon
  1 sibling, 0 replies; 9+ messages in thread
From: Will Deacon @ 2023-02-04 13:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Zefan Li, Johannes Weiner, Peter Zijlstra,
	linux-kernel, cgroups, kernel-team

On Fri, Feb 03, 2023 at 11:00:24AM -1000, Tejun Heo wrote:
> On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
> > Since commit 8f9ea86fdf99 ("sched: Always preserve the user
> > requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
> > __sched_setaffinity() unconditionally. This helps to expose a bug in
> > the current cpuset hotplug code where the cpumasks of the tasks in
> > the top cpuset are not updated at all when some CPUs become online or
> > offline. It is likely caused by the fact that some of the tasks in the
> > top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
> > 
> > One way to reproduce this as suggested by Peter is:
> >  - boot machine
> >  - offline all CPUs except one
> >  - taskset -p ffffffff $$
> >  - online all CPUs
> > 
> > Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
> > includes offline CPUs for those tasks that are in the top cpuset. For
> > tasks not in the top cpuset, the old rule applies and only online CPUs
> > will be returned in the mask since hotplug events will update their
> > cpumasks accordingly.
> > 
> > Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> > Reported-by: Will Deacon <will@kernel.org>
> > Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Waiman Long <longman@redhat.com>
> 
> So, this is the replacement for the first patch[1] Will posted, right?
> 
> >  void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >  {
> >  	unsigned long flags;
> > +	struct cpuset *cs;
> >  
> >  	spin_lock_irqsave(&callback_lock, flags);
> > -	guarantee_online_cpus(tsk, pmask);
> > +	rcu_read_lock();
> > +
> > +	cs = task_cs(tsk);
> > +	if (cs != &top_cpuset)
> > +		guarantee_online_cpus(tsk, pmask);
> > +	/*
> > +	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
> > +	 * when a hotplug online/offline event happens. So we include all
> > +	 * offline cpus in the allowed cpu list.
> > +	 */
> > +	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
> > +		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> > +
> > +		/*
> > +		 * We first exclude cpus allocated to partitions. If there is no
> > +		 * allowable online cpu left, we fall back to all possible cpus.
> > +		 */
> > +		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);
> 
> and the differences are that
> 
> * It's only applied to the root cgroup.
> 
> * Cpus taken up by partitions are excluded.
> 
> Is my understanding correct?
> 
> > +		if (!cpumask_intersects(pmask, cpu_online_mask))
> > +			cpumask_copy(pmask, possible_mask);
> > +	}
> > +
> > +	rcu_read_unlock();
> >  	spin_unlock_irqrestore(&callback_lock, flags);
> 
> So, I suppose you're suggesting applying this patch instead of the one Will
> Deacon posted[1] and we need Will's second patch[2] on top, right?
> 
> [1] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org
> [2] http://lkml.kernel.org/r/20230131221719.3176-3-will@kernel.org

FWIW, although I tend to share Peter's sentiments in this thread, I took
this (+ my second patch) for a spin and my tests are giving the same
results when compared with Peter's patch.

Tested-by: Will Deacon <will@kernel.org>

Will

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-04 10:01 ` Peter Zijlstra
@ 2023-02-05  5:00   ` Waiman Long
  2023-02-06 11:05     ` Will Deacon
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2023-02-05  5:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Will Deacon, linux-kernel,
	cgroups, kernel-team

On 2/4/23 05:01, Peter Zijlstra wrote:
> On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user
>> requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
>> __sched_setaffinity() unconditionally. This helps to expose a bug in
>> the current cpuset hotplug code where the cpumasks of the tasks in
>> the top cpuset are not updated at all when some CPUs become online or
>> offline. It is likely caused by the fact that some of the tasks in the
>> top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
>>
>> One way to reproduce this as suggested by Peter is:
>>   - boot machine
>>   - offline all CPUs except one
>>   - taskset -p ffffffff $$
>>   - online all CPUs
>>
>> Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
>> includes offline CPUs for those tasks that are in the top cpuset. For
>> tasks not in the top cpuset, the old rule applies and only online CPUs
>> will be returned in the mask since hotplug events will update their
>> cpumasks accordingly.
> So you get the task_cpu_possible_mask() interaction vs cpusets horribly
> wrong here, but given the very sorry state of task_cpu_possible_mask()
> correctness of cpuset as a whole that might just not matter at this
> point.
>
> I do very much hate how you add exceptions on exceptions instead of
> looking to do something right :-(
>
> Fixing that parition case in my patch is 1 extra line and then I think
> it fundamentally does the right thing and can serve as a basis for
> fixing cpuset as a whole.

I am not saying that your patch is incorrect other than handling the 
partition case. However, it is rather complex and is hard to understand 
especially for those that are not that familiar with the cpuset code. 
 From the maintainability point of view, a simpler solution that is 
easier to understand is better.

If we want to get it into the next merge windows, there isn't much time 
left for linux-next testing. So a lower risk solution is better from 
that perspective too.

>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
>> Reported-by: Will Deacon <will@kernel.org>
>> Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 207bafdb05e8..11554e5845f7 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3707,15 +3707,38 @@ void __init cpuset_init_smp(void)
>>    * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
>>    * attached to the specified @tsk.  Guaranteed to return some non-empty
>>    * subset of cpu_online_mask, even if this means going outside the
>> - * tasks cpuset.
>> + * tasks cpuset, except when the task is in the top cpuset.
>>    **/
>>   
>>   void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>>   {
>>   	unsigned long flags;
>> +	struct cpuset *cs;
>>   
>>   	spin_lock_irqsave(&callback_lock, flags);
>> -	guarantee_online_cpus(tsk, pmask);
>> +	rcu_read_lock();
>> +
>> +	cs = task_cs(tsk);
>> +	if (cs != &top_cpuset)
>> +		guarantee_online_cpus(tsk, pmask);
>> +	/*
>> +	 * TODO: Tasks in the top cpuset won't get update to their cpumasks
>> +	 * when a hotplug online/offline event happens. So we include all
>> +	 * offline cpus in the allowed cpu list.
>> +	 */
> I don't like TODO there, I really don't think CPUSET should update root
> tasks, that means yet another fundamental difference between
> CPUSET={y,n}.
OK, I can remove the "TODO". I have no objection to that.
>
>> +	if ((cs == &top_cpuset) || cpumask_empty(pmask)) {
>> +		const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
>> +
>> +		/*
>> +		 * We first exclude cpus allocated to partitions. If there is no
>> +		 * allowable online cpu left, we fall back to all possible cpus.
>> +		 */
>> +		cpumask_andnot(pmask, possible_mask, top_cpuset.subparts_cpus);
>> +		if (!cpumask_intersects(pmask, cpu_online_mask))
>> +			cpumask_copy(pmask, possible_mask);
>> +	}
>> +
>> +	rcu_read_unlock();
>>   	spin_unlock_irqrestore(&callback_lock, flags);
>>   }
> I really detest this patch, but if you insist it might just do :-/

If we decide that we should always try to keep possible offline cpus in 
a task's cpumask. We could adopt your solution or we can try to keep 
that information in the cpuset structure itself. At this point, I don't 
see any advantage in doing that except for tasks in the top cpuset 
because the hotplug code won't update their cpumasks. Also inferring 
offline cpus that should be in the cpuset is only possible with cgroup 
v2. It does not work for v1. So it is also not a complete solution. To 
be complete, we may need keep this information in the cpuset.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-05  5:00   ` Waiman Long
@ 2023-02-06 11:05     ` Will Deacon
  2023-02-06 20:17       ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2023-02-06 11:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Tejun Heo, Zefan Li, Johannes Weiner,
	linux-kernel, cgroups, kernel-team

On Sun, Feb 05, 2023 at 12:00:25AM -0500, Waiman Long wrote:
> On 2/4/23 05:01, Peter Zijlstra wrote:
> > On Fri, Feb 03, 2023 at 11:40:40AM -0500, Waiman Long wrote:
> > > Since commit 8f9ea86fdf99 ("sched: Always preserve the user
> > > requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
> > > __sched_setaffinity() unconditionally. This helps to expose a bug in
> > > the current cpuset hotplug code where the cpumasks of the tasks in
> > > the top cpuset are not updated at all when some CPUs become online or
> > > offline. It is likely caused by the fact that some of the tasks in the
> > > top cpuset, like percpu kthreads, cannot have their cpu affinity changed.
> > > 
> > > One way to reproduce this as suggested by Peter is:
> > >   - boot machine
> > >   - offline all CPUs except one
> > >   - taskset -p ffffffff $$
> > >   - online all CPUs
> > > 
> > > Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
> > > includes offline CPUs for those tasks that are in the top cpuset. For
> > > tasks not in the top cpuset, the old rule applies and only online CPUs
> > > will be returned in the mask since hotplug events will update their
> > > cpumasks accordingly.
> > So you get the task_cpu_possible_mask() interaction vs cpusets horribly
> > wrong here, but given the very sorry state of task_cpu_possible_mask()
> > correctness of cpuset as a whole that might just not matter at this
> > point.
> > 
> > I do very much hate how you add exceptions on exceptions instead of
> > looking to do something right :-(
> > 
> > Fixing that parition case in my patch is 1 extra line and then I think
> > it fundamentally does the right thing and can serve as a basis for
> > fixing cpuset as a whole.
> 
> I am not saying that your patch is incorrect other than handling the
> partition case. However, it is rather complex and is hard to understand
> especially for those that are not that familiar with the cpuset code. From
> the maintainability point of view, a simpler solution that is easier to
> understand is better.
> 
> If we want to get it into the next merge windows, there isn't much time left
> for linux-next testing. So a lower risk solution is better from that
> perspective too.

This needs to land for 6.2 to fix the regression. The next merge window is
too late. That's why I cooked the reverts [1] as an alternative.

Will

[1] https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=ssa-reverts

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-06 11:05     ` Will Deacon
@ 2023-02-06 20:17       ` Tejun Heo
  2023-02-07 11:32         ` Will Deacon
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2023-02-06 20:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, Peter Zijlstra, Zefan Li, Johannes Weiner,
	linux-kernel, cgroups, kernel-team

Hello,

On Mon, Feb 06, 2023 at 11:05:41AM +0000, Will Deacon wrote:
> > If we want to get it into the next merge windows, there isn't much time left
> > for linux-next testing. So a lower risk solution is better from that
> > perspective too.
> 
> This needs to land for 6.2 to fix the regression. The next merge window is
> too late. That's why I cooked the reverts [1] as an alternative.

Yeah, I think Waiman meant before the coming merge window. We have at least
one more week so let's try the two patches and see how that goes. We can
always revert if it doesn't work out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
  2023-02-06 20:17       ` Tejun Heo
@ 2023-02-07 11:32         ` Will Deacon
  0 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2023-02-07 11:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Peter Zijlstra, Zefan Li, Johannes Weiner,
	linux-kernel, cgroups, kernel-team

On Mon, Feb 06, 2023 at 10:17:05AM -1000, Tejun Heo wrote:
> On Mon, Feb 06, 2023 at 11:05:41AM +0000, Will Deacon wrote:
> > > If we want to get it into the next merge windows, there isn't much time left
> > > for linux-next testing. So a lower risk solution is better from that
> > > perspective too.
> > 
> > This needs to land for 6.2 to fix the regression. The next merge window is
> > too late. That's why I cooked the reverts [1] as an alternative.
> 
> Yeah, I think Waiman meant before the coming merge window. We have at least
> one more week so let's try the two patches and see how that goes. We can
> always revert if it doesn't work out.

I see this has landed in Linus' tree now, so thanks for that! I also tested
his branch just to make sure and it all seems fine in my asymmetric
environment.

Cheers,

Will

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-02-07 11:33 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-03 16:40 [PATCH] cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks Waiman Long
2023-02-03 21:00 ` Tejun Heo
2023-02-03 22:26   ` Waiman Long
2023-02-04 13:32   ` Will Deacon
2023-02-04 10:01 ` Peter Zijlstra
2023-02-05  5:00   ` Waiman Long
2023-02-06 11:05     ` Will Deacon
2023-02-06 20:17       ` Tejun Heo
2023-02-07 11:32         ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).