Re: [PATCH] xen/arm: Ensure the vCPU context is seen before clearing the _VPF_down

From: Julien Grall <julien@xen.org>
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel@lists.xenproject.org, bertrand.marquis@arm.com,
	ash.j.wilding@gmail.com, Julien Grall <jgrall@amazon.com>,
	Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	George Dunlap <george.dunlap@citrix.com>
Subject: Re: [PATCH] xen/arm: Ensure the vCPU context is seen before clearing the _VPF_down
Date: Thu, 1 Apr 2021 16:09:06 +0100	[thread overview]
Message-ID: <c481a95a-c9d4-4c77-e9e2-8aee7bc347d5@xen.org> (raw)
In-Reply-To: <86165804-34a1-59e5-1b51-fecc60dbf796@xen.org>

Hi,

On 20/03/2021 11:47, Julien Grall wrote:
> 
> 
> On 20/03/2021 00:01, Stefano Stabellini wrote:
>> On Sat, 27 Feb 2021, Julien Grall wrote:
>>> (+ Dario and George)
>>>
>>> Hi Stefano,
>>>
>>> I have added Dario and George to get some inputs from the scheduling 
>>> part.
>>>
>>> On 27/02/2021 01:58, Stefano Stabellini wrote:
>>>> On Fri, 26 Feb 2021, Julien Grall wrote:
>>>>> From: Julien Grall <jgrall@amazon.com>
>>>>>
>>>>> A vCPU can get scheduled as soon as _VPF_down is cleared. As there is
>>>>> currently not ordering guarantee in arch_set_info_guest(), it may be
>>>>> possible that flag can be observed cleared before the new values of 
>>>>> vCPU
>>>>> registers are observed.
>>>>>
>>>>> Add an smp_mb() before the flag is cleared to prevent re-ordering.
>>>>>
>>>>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>>>>>
>>>>> ---
>>>>>
>>>>> Barriers should work in pair. However, I am not entirely sure 
>>>>> whether to
>>>>> put the other half. Maybe at the beginning of context_switch_to()?
>>>>
>>>> It should be right after VGCF_online is set or cleared, right?
>>>
>>> vcpu_guest_context_t is variable allocated on the heap just for the 
>>> purpose of
>>> this call. So an ordering with VGFC_online is not going to do anything.
>>>
>>>> So it
>>>> would be:
>>>>
>>>> xen/arch/arm/domctl.c:arch_get_info_guest
>>>> xen/arch/arm/vpsci.c:do_common_cpu_on
>>>>
>>>> But I think it is impossible that either of them get called at the same
>>>> time as arch_set_info_guest, which makes me wonder if we actually need
>>>> the barrier...
>>> arch_get_info_guest() is called without the domain lock held and I 
>>> can't see
>>> any other lock that could prevent it to be called in // of
>>> arch_set_info_guest().
>>>
>>> So you could technically get corrupted information from
>>> XEN_DOMCTL_getvcpucontext. For this case, we would want a smp_wmb() 
>>> before
>>> writing to v->is_initialised. The corresponding read barrier would be in
>>> vcpu_pause() -> vcpu_sleep_sync() -> sync_vcpu_execstate().
>>>
>>> But this is not the issue I was originally trying to solve. Currently,
>>> do_common_cpu_on() will roughly do:
>>>
>>>   1) domain_lock(d)
>>>
>>>   2) v->arch.sctlr = ...
>>>      v->arch.ttbr0 = ...
>>>
>>>   3) clear_bit(_VFP_down, &v->pause_flags);
>>>
>>>   4) domain_unlock(d)
>>>
>>>   5) vcpu_wake(v);
>>>
>>> If we had only one pCPU on the system, then we would only wake the 
>>> vCPU in
>>> step 5. We would be fine in this situation. But that's not the 
>>> interesting
>>> case.
>>>
>>> If you add a second pCPU in the story, it may be possible to have 
>>> vcpu_wake()
>>> happening in // (see more below). As there is no memory barrier, step 
>>> 3 may be
>>> observed before 2. So, assuming the vcpu is runnable, we could start to
>>> schedule a vCPU before any update to the registers (step 2) are 
>>> observed.
>>>
>>> This means that when context_switch_to() is called, we may end up to 
>>> restore
>>> some old values.
>>>
>>> Now the question is can vcpu_wake() be called in // from another 
>>> pCPU? AFAICT,
>>> it would be only called if a given flag in v->pause_flags is cleared 
>>> (e.g.
>>> _VFP_blocked). But can we rely on that?
>>>
>>> Even if we can rely on it, v->pause_flags has other flags in it. I 
>>> couldn't
>>> rule out that _VPF_down cannot be set at the same time as the other 
>>> _VPF_*.
>>>
>>> Therefore, I think a barrier is necessary to ensure the ordering.
>>>
>>> Do you agree with this analysis?
>> Yes, I think this makes sense. The corresponding barrier in the
>> scheduling code would have to be after reading _VPF_down and before
>> reading v->arch.sctlr, etc.
>>
>>
>>>>> The issues described here is also quite theoritical because there are
>>>>> hundreds of instructions executed between the time a vCPU is seen
>>>>> runnable and scheduled. But better be safe than sorry :).
>>>>> ---
>>>>>    xen/arch/arm/domain.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>>>>> index bdd3d3e5b5d5..2b705e66be81 100644
>>>>> --- a/xen/arch/arm/domain.c
>>>>> +++ b/xen/arch/arm/domain.c
>>>>> @@ -914,7 +914,14 @@ int arch_set_info_guest(
>>>>>        v->is_initialised = 1;
>>>>>          if ( ctxt->flags & VGCF_online )
>>>>> +    {
>>>>> +        /*
>>>>> +         * The vCPU can be scheduled as soon as _VPF_down is cleared.
>>>>> +         * So clear the bit *after* the context was loaded.
>>>>> +         */
>>>>> +        smp_mb();
>>>
>>>  From the discussion above, I would move this barrier before 
>>> v->is_initialised.
>>> So we also take care of the issue with arch_get_info_guest().
>>>
>>> This barrier also can be reduced to a smp_wmb() as we only expect an 
>>> ordering
>>> between writes.
>>>
>>> The barrier would be paired with the barrier in:
>>>     - sync_vcpu_execstate() in the case of arch_get_vcpu_info_guest().
>>>     - context_switch_to() in the case of scheduling (The exact 
>>> barrier is TDB).
>>
>> OK, this makes sense, but why before:
>>
>>    v->is_initialised = 1;
>>
>> instead of right after it? It is just v->pause_flags we care about,
>> right?
> 
> The issue I originally tried to address was a race with v->pause_flags. 
> But I also discovered one with v->initialised while answering to your 
> previous e-mail. This was only briefly mentioned so let me expand it.
> 
> A toolstack can take a snapshot of the vCPU context using 
> XEN_DOMCTL_get_vcpucontext. The helper will bail out if 
> v->is_initialized is 0.
> 
> If v->is_initialized is 1, it will temporarily pause the vCPU and then 
> call arch_get_info_guest().
> 
> AFAICT, arch_get_info_guest() and arch_set_info_guest() (called from 
> PSCI CPU on) can run concurrently.
> 
> If you put the barrier right after v->is_initialised, then a 
> processor/compiler is allowed to re-order the write with what comes 
> before. Therefore, the new value of v->is_initialised may be observed 
> before v->arch.{sctlr, ttbr0,...}.
> 
> Hence, we need a barrier before setting v->is_initialized so the new 
> value is observed *after* the changes to v->arch.{sctlr, ttbr0, ...) 
> have been observed.
> 
> A single smp_wmb() barrier before v->is_initialized should be sufficient 
> to cover the two problems discussed as I don't think we need to observe 
> v->is_initialized *before* v->pause_flags.

Gentle ping. Does this makes sense to you?

Cheers,

-- 
Julien Grall