Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()

From: Waiman Long <longman@redhat.com>
To: Will Deacon <will@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	qperret@google.com
Subject: Re: [PATCH v10 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
Date: Fri, 20 Jan 2023 13:10:38 -0500
Message-ID: <f269370f-d72c-08e3-da06-3cd70cdf75f2@redhat.com>
In-Reply-To: <20230120175931.GA22417@willie-the-truck>

On 1/20/23 12:59, Will Deacon wrote:
> Hey Waiman,
>
> Cheers for the quick reply.
>
> On Tue, Jan 17, 2023 at 01:13:31PM -0500, Waiman Long wrote:
>> On 1/17/23 11:08, Will Deacon wrote:
>>> On Thu, Sep 22, 2022 at 02:00:38PM -0400, Waiman Long wrote:
>>>> The user_cpus_ptr field is added by commit b90ca8badbd1 ("sched:
>>>> Introduce task_struct::user_cpus_ptr to track requested affinity"). It
>>>> is currently used only by arm64 arch due to possible asymmetric CPU
>>>> setup. This patch extends its usage to save user provided cpumask
>>>> when sched_setaffinity() is called for all arches. With this patch
>>>> applied, user_cpus_ptr, once allocated after a successful call to
>>>> sched_setaffinity(), will only be freed when the task exits.
> [...]
>
>>> We've tracked this down as the cause of an arm64 regression in Android and I've
>>> reproduced the issue with mainline.
>>>
>>> Basically, if an arm64 system is booted with "allow_mismatched_32bit_el0" on
>>> the command-line, then the arch code will (amongst other things) call
>>> force_compatible_cpus_allowed_ptr() and relax_compatible_cpus_allowed_ptr()
>>> when exec()'ing a 32-bit or a 64-bit task respectively.
>> IOW, relax_compatible_cpus_allowed_ptr() can be called without a previous
>> force_compatible_cpus_allowed_ptr(). Right?
> In practice, these functions are only called by arm64 during exec. As above,
> exec()'ing a 32-bit task calls force_compatible_cpus_allowed_ptr() and
> exec()'ing a 64-bit task calls relax_compatible_cpus_allowed_ptr(). So
> they don't come in pairs at all; it's just that calling relax_[...] should
> try to restore the affinity mask if it was previously clobbered by
> force_[...].
>
That was what I thought.

>> A possible optimization in this case is to add a bit flag in the task_struct
>> to indicate a previous call to force_compatible_cpus_allowed_ptr(). Without
>> that flag set, relax_compatible_cpus_allowed_ptr() can return immediately.
> How is this an optimisation over a pointer comparison?

The sched_setaffinity() patch repurposed user_cpus_ptr to hold the
user-requested CPU affinity mask, whether or not
force_compatible_cpus_allowed_ptr() has been called. So a check against
user_cpus_ptr no longer tells us whether
force_compatible_cpus_allowed_ptr() has been called.
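
To illustrate the point, here is a minimal userspace sketch, not kernel
code. Every pm_* name, the one-bit-per-CPU uint64_t mask and the mask
values are assumptions made up for the illustration; only the general
shape (sched_setaffinity() always saving the request, the force path
saving the old mask before restricting) follows the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model, one bit per CPU. Not kernel code. */
typedef uint64_t pm_mask_t;

struct pm_task {
	pm_mask_t cpus_mask;            /* effective affinity */
	pm_mask_t saved_mask;           /* storage behind user_cpus_ptr */
	const pm_mask_t *user_cpus_ptr; /* NULL until a mask is saved */
};

/* Models sched_setaffinity() with the patch: always records the request. */
static void pm_sched_setaffinity(struct pm_task *p, pm_mask_t req)
{
	p->saved_mask = req;
	p->user_cpus_ptr = &p->saved_mask;
	p->cpus_mask = req;
}

/*
 * Models force_compatible_cpus_allowed_ptr(): save the current mask if
 * nothing is saved yet, then restrict to the 32-bit capable CPUs.
 */
static void pm_force_compatible(struct pm_task *p, pm_mask_t compat)
{
	if (!p->user_cpus_ptr) {
		p->saved_mask = p->cpus_mask;
		p->user_cpus_ptr = &p->saved_mask;
	}
	p->cpus_mask &= compat;
}

/* Old-style inference: "was the affinity forced?" read from the pointer. */
static bool pm_pointer_says_forced(const struct pm_task *p)
{
	return p->user_cpus_ptr != NULL;
}

/* Walks both cases; the pointer check cannot tell them apart. */
void pm_pointer_check_demo(void)
{
	struct pm_task a = { .cpus_mask = 0xff };
	struct pm_task b = { .cpus_mask = 0xff };

	pm_sched_setaffinity(&a, 0x0f); /* user request only, never forced */
	pm_force_compatible(&b, 0x03);  /* really forced */

	assert(pm_pointer_says_forced(&a)); /* false positive */
	assert(pm_pointer_says_forced(&b));
}
```

In this model both tasks end up with a non-NULL user_cpus_ptr, so the
pointer alone cannot say which of them went through the force path.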

>>> I've had a crack at fixing the code above to restore the old behaviour, and it
>>> seems to work for my basic tests (still pending confirmation from others):
>>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index bb1ee6d7bdde..0d4a11384648 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3125,17 +3125,16 @@ __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx);
>>>    void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
>>>    {
>>>           struct affinity_context ac = {
>>> -               .new_mask  = task_user_cpus(p),
>>> +               .new_mask  = p->user_cpus_ptr,
>>>                   .flags     = 0,
>>>           };
>>> -       int ret;
>>>           /*
>>>            * Try to restore the old affinity mask with __sched_setaffinity().
>>>            * Cpuset masking will be done there too.
>>>            */
>>> -       ret = __sched_setaffinity(p, &ac);
>>> -       WARN_ON_ONCE(ret);
>>> +       if (ac.new_mask)
>>> +               WARN_ON_ONCE(__sched_setaffinity(p, &ac));
>>>    }
>>>    void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>>
>>>
>>> With this change, task_user_cpus() is only used by restrict_cpus_allowed_ptr()
>>> so I'd be inclined to remove it altogether tbh.
>>>
>>> What do you think?
>> The problem here is that force_compatible_cpus_allowed_ptr() can be called
>> without a matching relax_compatible_cpus_allowed_ptr() at the end. So we may
>> end up artificially restricting the number of cpus that can be used when
>> running a 64-bit binary.
> Hmm, is this because an intervening call to sched_setaffinity() could've
> set ->user_cpus_ptr? If so, I'd have thought that would also point to a
> superset of the effective affinity -- is that not the case?
>
>> What do you think about the idea of having a bit flag to track that?
> I'm not hugely happy with that approach because it's adding additional state
> which is only needed for arm64, and only when operating in this funny
> asymmetric mode. I also don't understand how it would interact with the new
> sched_setaffinity() behaviour; would we need to clear the flag when that
> function updates the mask?
The new flag bit will be independent of sched_setaffinity() calls. It is
set when restrict_cpus_allowed_ptr() is called, and
relax_compatible_cpus_allowed_ptr() clears it if it was set. I will post
a patch for your evaluation.
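
Until the patch is posted, the flag life cycle can be sketched roughly
as below. This is a userspace model under stated assumptions, not the
patch itself: the fm_* names, the "forced" field, FM_ALL_CPUS and the
fallback to all CPUs when no user mask exists are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model, one bit per CPU. Not kernel code. */
typedef uint64_t fm_mask_t;

#define FM_ALL_CPUS ((fm_mask_t)0xff) /* assumed 8-CPU system */

struct fm_task {
	fm_mask_t cpus_mask;  /* effective affinity */
	fm_mask_t user_mask;  /* last sched_setaffinity() request */
	bool has_user_mask;
	bool forced;          /* hypothetical flag bit */
};

/* Models sched_setaffinity(): records the request, never touches the flag. */
static void fm_sched_setaffinity(struct fm_task *p, fm_mask_t req)
{
	p->user_mask = req;
	p->has_user_mask = true;
	p->cpus_mask = req;
}

/* Models restrict_cpus_allowed_ptr(): restrict and set the flag. */
static void fm_restrict(struct fm_task *p, fm_mask_t compat)
{
	p->cpus_mask &= compat;
	p->forced = true;
}

/*
 * Models relax_compatible_cpus_allowed_ptr(): return immediately unless
 * the flag is set; otherwise clear it and restore the affinity.
 * Returns true if a restore was done.
 */
static bool fm_relax(struct fm_task *p)
{
	if (!p->forced)
		return false;

	p->forced = false;
	p->cpus_mask = p->has_user_mask ? p->user_mask : FM_ALL_CPUS;
	return true;
}

/* Walks the flag through one set/clear cycle. */
void fm_flag_demo(void)
{
	struct fm_task t = { .cpus_mask = FM_ALL_CPUS };

	fm_sched_setaffinity(&t, 0x0f);
	assert(!fm_relax(&t));       /* flag clear: nothing to undo */
	assert(t.cpus_mask == 0x0f);

	fm_restrict(&t, 0x03);       /* 32-bit exec in the model */
	assert(t.cpus_mask == 0x03);

	assert(fm_relax(&t));        /* 64-bit exec: restore user mask */
	assert(t.cpus_mask == 0x0f && !t.forced);
}
```

In this model a relax without a preceding restrict is a no-op, which is
the case that matters for the regression discussed in this thread.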
>
> Since I'm basically trying to re-instate the v6.1 behaviour to fix the arm64
> regression, I'm happy to review/test any proposal you have, but as we get
> closer to the 6.2 release I'm wondering whether it would make more sense to
> revert the sched_setaffinity() changes for now and I can help you with arm64
> review and testing if we bring the changes back for e.g. 6.4.

The purpose of the bit flag is to reinstate the 6.1 behavior.

Cheers,
Longman

>
> Will
>