Re: [PATCH] xen/arm: Ensure the vCPU context is seen before clearing the _VPF_down

From: Stefano Stabellini <sstabellini@kernel.org>
To: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	 xen-devel@lists.xenproject.org, bertrand.marquis@arm.com,
	 ash.j.wilding@gmail.com, Julien Grall <jgrall@amazon.com>,
	 Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
	 Dario Faggioli <dfaggioli@suse.com>,
	 George Dunlap <george.dunlap@citrix.com>
Subject: Re: [PATCH] xen/arm: Ensure the vCPU context is seen before clearing the _VPF_down
Date: Tue, 13 Apr 2021 15:43:54 -0700 (PDT)	[thread overview]
Message-ID: <alpine.DEB.2.21.2104131541370.4885@sstabellini-ThinkPad-T480s> (raw)
In-Reply-To: <86165804-34a1-59e5-1b51-fecc60dbf796@xen.org>

On Sat, 20 Mar 2021, Julien Grall wrote:
> On 20/03/2021 00:01, Stefano Stabellini wrote:
> > On Sat, 27 Feb 2021, Julien Grall wrote:
> > > (+ Dario and George)
> > > 
> > > Hi Stefano,
> > > 
> > > I have added Dario and George to get some inputs from the scheduling part.
> > > 
> > > On 27/02/2021 01:58, Stefano Stabellini wrote:
> > > > On Fri, 26 Feb 2021, Julien Grall wrote:
> > > > > From: Julien Grall <jgrall@amazon.com>
> > > > > 
> > > > > A vCPU can get scheduled as soon as _VPF_down is cleared. As there is
> > > > > currently not ordering guarantee in arch_set_info_guest(), it may be
> > > > > possible that flag can be observed cleared before the new values of
> > > > > vCPU
> > > > > registers are observed.
> > > > > 
> > > > > Add an smp_mb() before the flag is cleared to prevent re-ordering.
> > > > > 
> > > > > Signed-off-by: Julien Grall <jgrall@amazon.com>
> > > > > 
> > > > > ---
> > > > > 
> > > > > Barriers should work in pair. However, I am not entirely sure whether
> > > > > to
> > > > > put the other half. Maybe at the beginning of context_switch_to()?
> > > > 
> > > > It should be right after VGCF_online is set or cleared, right?
> > > 
> > > vcpu_guest_context_t is variable allocated on the heap just for the
> > > purpose of
> > > this call. So an ordering with VGFC_online is not going to do anything.
> > > 
> > > > So it
> > > > would be:
> > > > 
> > > > xen/arch/arm/domctl.c:arch_get_info_guest
> > > > xen/arch/arm/vpsci.c:do_common_cpu_on
> > > > 
> > > > But I think it is impossible that either of them get called at the same
> > > > time as arch_set_info_guest, which makes me wonder if we actually need
> > > > the barrier...
> > > arch_get_info_guest() is called without the domain lock held and I can't
> > > see
> > > any other lock that could prevent it to be called in // of
> > > arch_set_info_guest().
> > > 
> > > So you could technically get corrupted information from
> > > XEN_DOMCTL_getvcpucontext. For this case, we would want a smp_wmb() before
> > > writing to v->is_initialised. The corresponding read barrier would be in
> > > vcpu_pause() -> vcpu_sleep_sync() -> sync_vcpu_execstate().
> > > 
> > > But this is not the issue I was originally trying to solve. Currently,
> > > do_common_cpu_on() will roughly do:
> > > 
> > >   1) domain_lock(d)
> > > 
> > >   2) v->arch.sctlr = ...
> > >      v->arch.ttbr0 = ...
> > > 
> > >   3) clear_bit(_VFP_down, &v->pause_flags);
> > > 
> > >   4) domain_unlock(d)
> > > 
> > >   5) vcpu_wake(v);
> > > 
> > > If we had only one pCPU on the system, then we would only wake the vCPU in
> > > step 5. We would be fine in this situation. But that's not the interesting
> > > case.
> > > 
> > > If you add a second pCPU in the story, it may be possible to have
> > > vcpu_wake()
> > > happening in // (see more below). As there is no memory barrier, step 3
> > > may be
> > > observed before 2. So, assuming the vcpu is runnable, we could start to
> > > schedule a vCPU before any update to the registers (step 2) are observed.
> > > 
> > > This means that when context_switch_to() is called, we may end up to
> > > restore
> > > some old values.
> > > 
> > > Now the question is can vcpu_wake() be called in // from another pCPU?
> > > AFAICT,
> > > it would be only called if a given flag in v->pause_flags is cleared (e.g.
> > > _VFP_blocked). But can we rely on that?
> > > 
> > > Even if we can rely on it, v->pause_flags has other flags in it. I
> > > couldn't
> > > rule out that _VPF_down cannot be set at the same time as the other
> > > _VPF_*.
> > > 
> > > Therefore, I think a barrier is necessary to ensure the ordering.
> > > 
> > > Do you agree with this analysis?
> >   Yes, I think this makes sense. The corresponding barrier in the
> > scheduling code would have to be after reading _VPF_down and before
> > reading v->arch.sctlr, etc.
> > 
> > 
> > > > > The issues described here is also quite theoritical because there are
> > > > > hundreds of instructions executed between the time a vCPU is seen
> > > > > runnable and scheduled. But better be safe than sorry :).
> > > > > ---
> > > > >    xen/arch/arm/domain.c | 7 +++++++
> > > > >    1 file changed, 7 insertions(+)
> > > > > 
> > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> > > > > index bdd3d3e5b5d5..2b705e66be81 100644
> > > > > --- a/xen/arch/arm/domain.c
> > > > > +++ b/xen/arch/arm/domain.c
> > > > > @@ -914,7 +914,14 @@ int arch_set_info_guest(
> > > > >        v->is_initialised = 1;
> > > > >          if ( ctxt->flags & VGCF_online )
> > > > > +    {
> > > > > +        /*
> > > > > +         * The vCPU can be scheduled as soon as _VPF_down is cleared.
> > > > > +         * So clear the bit *after* the context was loaded.
> > > > > +         */
> > > > > +        smp_mb();
> > > 
> > >  From the discussion above, I would move this barrier before
> > > v->is_initialised.
> > > So we also take care of the issue with arch_get_info_guest().
> > > 
> > > This barrier also can be reduced to a smp_wmb() as we only expect an
> > > ordering
> > > between writes.
> > > 
> > > The barrier would be paired with the barrier in:
> > >     - sync_vcpu_execstate() in the case of arch_get_vcpu_info_guest().
> > >     - context_switch_to() in the case of scheduling (The exact barrier is
> > > TDB).
> > 
> > OK, this makes sense, but why before:
> > 
> >    v->is_initialised = 1;
> > 
> > instead of right after it? It is just v->pause_flags we care about,
> > right?
> 
> The issue I originally tried to address was a race with v->pause_flags. But I
> also discovered one with v->initialised while answering to your previous
> e-mail. This was only briefly mentioned so let me expand it.
> 
> A toolstack can take a snapshot of the vCPU context using
> XEN_DOMCTL_get_vcpucontext. The helper will bail out if v->is_initialized is
> 0.
> 
> If v->is_initialized is 1, it will temporarily pause the vCPU and then call
> arch_get_info_guest().
> 
> AFAICT, arch_get_info_guest() and arch_set_info_guest() (called from PSCI CPU
> on) can run concurrently.
> 
> If you put the barrier right after v->is_initialised, then a
> processor/compiler is allowed to re-order the write with what comes before.
> Therefore, the new value of v->is_initialised may be observed before
> v->arch.{sctlr, ttbr0,...}.
> 
> Hence, we need a barrier before setting v->is_initialized so the new value is
> observed *after* the changes to v->arch.{sctlr, ttbr0, ...) have been
> observed.
> 
> A single smp_wmb() barrier before v->is_initialized should be sufficient to
> cover the two problems discussed as I don't think we need to observe
> v->is_initialized *before* v->pause_flags.

I think your explanation is correct. However, don't we need a smp_rmb()
barrier after reading v->is_initialised in xen/common/domctl.c:do_domctl
? That would be the barrier that pairs with smp_wmb in regards to
v->is_initialised.