xen-devel.lists.xenproject.org archive mirror
From: "Jan Beulich" <JBeulich@suse.com>
To: <andrii.anisov@gmail.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>,
	Julien Grall <julien.grall@arm.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	"andrii_anisov@epam.com" <andrii_anisov@epam.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH v3] xen: introduce VCPUOP_register_runstate_phys_memory_area hypercall
Date: Fri, 07 Jun 2019 08:23:05 -0600	[thread overview]
Message-ID: <5CFA734902000078002364C0@prv1-mh.provo.novell.com> (raw)
In-Reply-To: <1558721577-13958-3-git-send-email-andrii.anisov@gmail.com>

>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote:
> From: Andrii Anisov <andrii_anisov@epam.com>
> 
> The existing interface for registering the runstate area by its
> virtual address is prone to issues which became more obvious with
> KPTI enablement in guests. The nature of those issues is that the
> guest can be interrupted by the hypervisor at any time, and there is
> no guarantee that the registered virtual address is translatable
> with the currently active guest page tables. Before KPTI such a
> situation was only possible when the guest was caught in the middle
> of PT processing (e.g. superpage shattering). With KPTI this also
> happens whenever the guest runs userspace, so with a pretty high
> probability.

Except when there's no need for KPTI in the guest in the first place,
as is the case for x86-64 PV guests. I think this is worth clarifying.

> So it was agreed to register runstate with the guest's physical address
> so that its mapping is permanent from the hypervisor point of view. [1]
> 
> The hypercall employs the same vcpu_register_runstate_memory_area
> structure for the interface, but requires a registered area to not
> cross a page boundary.
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html 
> 
> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>

I would have really hoped that you would outline the intended interaction
between the two interfaces, so that a reviewer can judge whether
you're actually meeting the goal. I think it is actually mandatory to make
explicit in the public header what level of mixing is permitted, what is not
permitted, and what mix of requests is simply undefined.

In particular, while we did work out during prior discussions that some
level of mixing has to be permitted, I'm unconvinced that arbitrary
mixing has to be allowed. For example, switching from one model to
another could be permitted only with just a single active vCPU. This
might allow doing away with the runstate_in_use field you add.

> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n)
>      virt_timer_restore(n);
>  }
>  
> -/* Update per-VCPU guest runstate shared memory area (if registered). */
> -static void update_runstate_area(struct vcpu *v)
> +static void update_runstate_by_gvaddr(struct vcpu *v)
>  {
>      void __user *guest_handle = NULL;
>  
> -    if ( guest_handle_is_null(runstate_guest(v)) )
> -        return;
> +    ASSERT(!guest_handle_is_null(runstate_guest_virt(v)));
>  
>      if ( VM_ASSIST(v->domain, runstate_update_flag) )
>      {
> -        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
> +        guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1;
>          guest_handle--;
>          v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>          __raw_copy_to_guest(guest_handle,
> @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v)
>          smp_wmb();
>      }
>  
> -    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
> +    __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1);

In a prior version you did the mechanical part of adjusting the VA-based
code in a prereq patch, aiding review. Is there a particular reason you
folded everything into one patch now?

> @@ -303,6 +301,53 @@ static void update_runstate_area(struct vcpu *v)
>      }
>  }
>  
> +static void update_runstate_by_gpaddr(struct vcpu *v)
> +{
> +    struct vcpu_runstate_info *runstate =
> +            (struct vcpu_runstate_info *)v->runstate_guest.phys;

What's the cast for here?

> @@ -1614,6 +1613,92 @@ bool update_runstate_area(struct vcpu *v)
>      return rc;
>  }
>  
> +static bool update_runstate_by_gpaddr_native(struct vcpu *v)
> +{
> +    struct vcpu_runstate_info *runstate =
> +            (struct vcpu_runstate_info *)v->runstate_guest.phys;
> +
> +    ASSERT(runstate != NULL);
> +
> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
> +    {
> +        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
> +        smp_wmb();
> +        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
> +    }
> +
> +    memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate));
> +
> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
> +    {
> +        runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
> +        smp_wmb();
> +        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
> +    }
> +
> +    return true;
> +}

I can't help thinking that this matches the Arm code. Can common things
please be put in common code? I may be asking too much, but if the
pre-existing implementations are similar enough (I didn't check) perhaps
they could be folded in a first step, too?

> +static bool update_runstate_by_gpaddr_compat(struct vcpu *v)
> +{
> +    struct compat_vcpu_runstate_info *runstate =
> +            (struct compat_vcpu_runstate_info *)v->runstate_guest.phys;
> +
> +    ASSERT(runstate != NULL);
> +
> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
> +    {
> +        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
> +        smp_wmb();
> +        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
> +    }
> +
> +    {
> +        struct compat_vcpu_runstate_info info;
> +        XLAT_vcpu_runstate_info(&info, &v->runstate);
> +        memcpy(v->runstate_guest.phys, &info, sizeof(info));
> +    }
> +    else
> +        memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate));

This "else" does not seem to be paired with an if(). Does this build
at all?
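This looks like a leftover from copying the native variant; in the _compat
function only the translated copy can be meant. A toy user-space model of the
presumably intended shape (the struct types and the hand-written translation
are stand-ins for Xen's compat_vcpu_runstate_info and
XLAT_vcpu_runstate_info, not the real definitions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy layouts: 64-bit native vs. 32-bit compat runstate info. */
struct runstate_native { uint64_t state_entry_time; };
struct runstate_compat { uint32_t state_entry_time; };

/* Models v->runstate_guest.phys for a compat (32-bit) guest. */
static struct runstate_compat guest_area;

static void update_by_gpaddr_compat(const struct runstate_native *r)
{
    /* Translate unconditionally: no "else" copying the native layout,
     * since this function only ever serves compat guests. */
    struct runstate_compat info = {
        .state_entry_time = (uint32_t)r->state_entry_time,
    };

    memcpy(&guest_area, &info, sizeof(info));
}
```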

> --- a/xen/arch/x86/x86_64/domain.c
> +++ b/xen/arch/x86/x86_64/domain.c
> @@ -12,6 +12,8 @@
>  CHECK_vcpu_get_physid;
>  #undef xen_vcpu_get_physid
>  
> +extern void discard_runstate_area(struct vcpu *v);

No, this is not allowed. The declaration must be visible to both consumer
and producer, so that when either side is changed things won't build until
the other side gets changed too.

> @@ -35,8 +37,16 @@ arch_compat_vcpu_op(
>               !compat_handle_okay(area.addr.h, 1) )
>              break;
>  
> +        while( xchg(&v->runstate_in_use, 1) == 0);

At the very least such loops want a cpu_relax() in their bodies.
But this being on a hypercall path - are there theoretical guarantees
that a guest can't abuse this to lock up a CPU?

Furthermore I don't understand how this is supposed to work in
the first place. The first xchg() will store 1, no matter what. Thus
on the second iteration you'll find the wanted value unless the
other side stored 0. Yet the other side is a simple xchg() too.
Hence its first attempt would fail, but its second attempt (while we
still haven't exited the critical region here) would succeed.

Also - style.
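For comparison, a conventional test-and-set acquire spins while the
*previous* value was 1, with cpu_relax() in the loop body. A minimal
user-space sketch (cpu_relax() modelled as a no-op, names illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for Xen's cpu_relax(); a pause/yield hint in real code. */
static inline void cpu_relax(void) { }

static atomic_int runstate_in_use;

static void runstate_acquire(atomic_int *lock)
{
    /* Spin while the previous value was 1 (lock already held).
     * The posted patch loops while xchg() returns 0, i.e. the
     * condition is inverted. */
    while ( atomic_exchange(lock, 1) == 1 )
        cpu_relax();
}

static void runstate_release(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

With the condition as posted, a free lock (value 0) makes the loop spin
forever, while a held lock lets the caller proceed immediately - the
opposite of mutual exclusion.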

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -698,6 +698,74 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
>      return 0;
>  }
>  
> +static void unmap_runstate_area(struct vcpu *v)
> +{
> +    mfn_t mfn;
> +
> +    if ( ! v->runstate_guest.phys )

Stray blank after ! .

> +        return;
> +
> +    mfn = domain_page_map_to_mfn(v->runstate_guest.phys);
> +
> +    unmap_domain_page_global((void *)
> +                             ((unsigned long)v->runstate_guest.phys &
> +                              PAGE_MASK));
> +
> +    v->runstate_guest.phys = NULL;

I think you would be better off storing NULL before unmapping.
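The suggested ordering - publish NULL first, tear the mapping down
afterwards - can be modelled in user space. Here toy_unmap() stands in for
unmap_domain_page_global(); the real code would additionally need a barrier
and the in-use protocol to make this safe against concurrent readers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Models v->runstate_guest.phys. */
static void *runstate_guest_phys;

/* Stand-in for unmap_domain_page_global(). */
static void toy_unmap(void *p)
{
    free(p);
}

static void unmap_runstate_area(void)
{
    void *p = runstate_guest_phys;

    if ( p == NULL )
        return;

    /* Clear the pointer first, so a concurrent observer sees NULL
     * rather than a dangling pointer to an already-unmapped page. */
    runstate_guest_phys = NULL;
    toy_unmap(p);
}
```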

> @@ -734,7 +802,10 @@ int domain_kill(struct domain *d)
>          if ( cpupool_move_domain(d, cpupool0) )
>              return -ERESTART;
>          for_each_vcpu ( d, v )
> +        {
> +            discard_runstate_area_locked(v);
>              unmap_vcpu_info(v);
> +        }

What is the "locked" aspect here about? The guest itself is dead (i.e.
fully paused) at this point, isn't it? And it won't ever be unpaused
again.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -163,17 +163,31 @@ struct vcpu
>      void            *sched_priv;    /* scheduler-specific data */
>  
>      struct vcpu_runstate_info runstate;
> +
> +    enum {
> +        RUNSTATE_NONE = 0,
> +        RUNSTATE_PADDR = 1,
> +        RUNSTATE_VADDR = 2,
> +    } runstate_guest_type;
> +
> +    unsigned long runstate_in_use;

Why "unsigned long"? Isn't a bool all you need?

Also these would now all want to be grouped in a sub-structure named
"runstate", rather than having "runstate_" prefixes.
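The grouping suggested here might look like the following (field and type
names are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: runstate-related vcpu fields gathered in one sub-structure,
 * so accesses read v->runstate_area.in_use etc. without prefixes. */
struct runstate_area {
    enum {
        RUNSTATE_NONE,
        RUNSTATE_PADDR,
        RUNSTATE_VADDR,
    } guest_type;

    bool in_use;          /* a bool suffices instead of unsigned long */

    union {
        void *phys;       /* PA-based registration: mapped area */
        /* VA-based registration would keep its guest handle here */
    } guest;
};
```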

> +        void*   phys;

Bad ordering between * and the blanks. There ought to be a blank
ahead of the *, and I don't see why you need any after it.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
