Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()

From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	qperret@google.com, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
Date: Fri, 27 Jan 2023 14:09:01 -0500
Message-ID: <0f8dec04-db47-d043-694f-601baa2ea615@redhat.com>
In-Reply-To: <Y9QZxVPLYpH/srMw@hirez.programming.kicks-ass.net>

On 1/27/23 13:36, Peter Zijlstra wrote:
> On Tue, Jan 17, 2023 at 04:08:26PM +0000, Will Deacon wrote:
>> Hi Waiman,
>>
>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>> setup. This patch extends its usage to save user provided cpumask
>>> when sched_setaffinity() is called for all arches. With this patch
>>> applied, user_cpus_ptr, once allocated after a successful call to
>>> sched_setaffinity(), will only be freed when the task exits.
>>>
>>> Since user_cpus_ptr is supposed to be used for "requested
>>> affinity", there is actually no point to save current cpu affinity in
>>> restrict_cpus_allowed_ptr() if sched_setaffinity() has never been called.
>>> Modify the logic to set user_cpus_ptr only in sched_setaffinity() and use
>>> it in restrict_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>>> if defined but not changing it.
>>>
>>> This will be some changes in behavior for arm64 systems with asymmetric
>>> CPUs in some corner cases. For instance, if sched_setaffinity()
>>> has never been called and there is a cpuset change before
>>> relax_compatible_cpus_allowed_ptr() is called, its subsequent call will
>>> follow what the cpuset allows but not what the previous cpu affinity
>>> setting allows.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
>>>   kernel/sched/sched.h |  7 ++++
>>>   2 files changed, 44 insertions(+), 45 deletions(-)
>> We've tracked this down as the cause of an arm64 regression in Android and I've
>> reproduced the issue with mainline.
>>
>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>> the command-line, then the arch code will (amongst other things) call
>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>> when exec()'ing a 32-bit or a 64-bit task respectively.
>>
>> If you consider a system where everything is 64-bit but the cmdline option
>> above is present, then the call to relax_compatible_cpus_allowed_ptr() isn't
>> expected to do anything in this case, and the old code made sure of that:
>>
>>> @@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
>>>   
>>>   /*
>>>    * Restore the affinity of a task @p which was previously restricted by a
>>> - * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
>>> - * @p->user_cpus_ptr.
>>> + * call to force_compatible_cpus_allowed_ptr().
>>>    *
>>>    * It is the caller's responsibility to serialise this with any calls to
>>>    * force_compatible_cpus_allowed_ptr(@p).
>>>    */
>>>   void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>>   {
>>> -	struct cpumask *user_mask = p->user_cpus_ptr;
>>> -	unsigned long flags;
>>> +	int ret;
>>>   
>>>   	/*
>>> -	 * Try to restore the old affinity mask. If this fails, then
>>> -	 * we free the mask explicitly to avoid it being inherited across
>>> -	 * a subsequent fork().
>>> +	 * Try to restore the old affinity mask with __sched_setaffinity().
>>> +	 * Cpuset masking will be done there too.
>>>   	 */
>>> -	if (!user_mask || !__sched_setaffinity(p, user_mask))
>>> -		return;
>> ... since it returned early here if '!user_mask' ...
>>
>>> -
>>> -	raw_spin_lock_irqsave(&p->pi_lock, flags);
>>> -	user_mask = clear_user_cpus_ptr(p);
>>> -	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>>> -
>>> -	kfree(user_mask);
>>> +	ret = __sched_setaffinity(p, task_user_cpus(p));
>>> +	WARN_ON_ONCE(ret);
>> ... however, now we end up going down into __sched_setaffinity() with
>> task_user_cpus(p) giving us the 'cpu_possible_mask'! This can lead to a mixture
>> of WARN_ON()s and incorrect affinity masks (for example, a newly exec'd task
>> ends up with the affinity mask of the online CPUs at the point of exec() and is
>> unable to run on anything onlined later).
>>
>> I've had a crack at fixing the code above to restore the old behaviour, and it
>> seems to work for my basic tests (still pending confirmation from others):
> This seems to cure things... cpuset is insane and insists on limiting
> things to online CPUs for no real reason. It is perfectly fine to have
> offline CPUs in the allowed mask (in fact, that's the default
> behaviour).
>
> With this on and "relax_compatible_cpus_allowed_ptr(current);" added to
> the exec() path things seem to work as expected for me.
>
> I'll clean up and post properly tomorrow (I think there's a simpler
> version hiding in there)...
>
> ---
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index a29c0b13706b..7a63416a46f3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -498,19 +498,33 @@ static inline bool partition_is_populated(struct cpuset *cs,
>    *
>    * Call with callback_lock or cpuset_rwsem held.
>    */
> -static void guarantee_online_cpus(struct task_struct *tsk,
> -				  struct cpumask *pmask)
> +static void guarantee_cs_cpus(struct task_struct *tsk, struct cpumask *pmask, bool online)
>   {
> -	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *task_possible_mask = task_cpu_possible_mask(tsk);
> +	const struct cpumask *possible_mask = cpu_possible_mask;
> +	const struct cpumask *cs_cpus;
>   	struct cpuset *cs;
>   
> -	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
> -		cpumask_copy(pmask, cpu_online_mask);
> +	if (online)
> +		possible_mask = cpu_online_mask;
> +
> +	if (WARN_ON(!cpumask_and(pmask, task_possible_mask, possible_mask)))
> +		cpumask_copy(pmask, possible_mask);
>   
>   	rcu_read_lock();
>   	cs = task_cs(tsk);
>   
> -	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
> +	if (!parent_cs(cs)) {
> +		cs_cpus = cpu_possible_mask;
> +		if (online)
> +			cs_cpus = cpu_online_mask;
> +	} else {
> +		cs_cpus = cs->cpus_allowed;
> +		if (online)
> +			cs_cpus = cs->effective_cpus;

Using cpus_allowed directly may not be the right thing to do in the
case of cgroup v2. In v2, cpus_allowed starts out empty and
effective_cpus is inherited from the parent. So we may have to walk up
the cpuset hierarchy to arrive at the proper cpus_allowed to use. We
may need another helper to do that.
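
As a rough illustration of the kind of helper meant here, the following is
a userspace model only, not kernel code. The names cpuset_model,
cpumask_model_t and cs_inherited_cpus_allowed() are hypothetical, a 64-bit
word stands in for struct cpumask, and falling back to the possible mask
at the root is an assumption that mirrors the !parent_cs(cs) branch in the
diff above.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model: one bit per CPU stands in for struct cpumask. */
typedef uint64_t cpumask_model_t;

/* Hypothetical, simplified stand-in for struct cpuset. */
struct cpuset_model {
	struct cpuset_model *parent;	/* NULL for the root cpuset */
	cpumask_model_t cpus_allowed;	/* 0 models "empty", as in cgroup v2 */
};

/*
 * Walk up from @cs until a non-root cpuset with a non-empty cpus_allowed
 * is found and return that mask. If the walk reaches the root, return
 * @possible, which models cpu_possible_mask.
 */
cpumask_model_t cs_inherited_cpus_allowed(const struct cpuset_model *cs,
					  cpumask_model_t possible)
{
	while (cs && cs->parent) {
		if (cs->cpus_allowed)
			return cs->cpus_allowed;
		cs = cs->parent;
	}
	return possible;
}
```

A real helper would of course operate on struct cpuset under the
appropriate locking and would have to decide what to do for the online
case as well; the sketch only shows the upward walk.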

Cheers,
Longman