xen-devel.lists.xenproject.org archive mirror
From: Andrii Anisov <andrii.anisov@gmail.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>,
	Julien Grall <julien.grall@arm.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	"andrii_anisov@epam.com" <andrii_anisov@epam.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH v3] xen: introduce VCPUOP_register_runstate_phys_memory_area hypercall
Date: Tue, 11 Jun 2019 19:09:14 +0300
Message-ID: <25ea00e0-c5fc-6606-f3f5-12001675bb60@gmail.com>
In-Reply-To: <5CFA734902000078002364C0@prv1-mh.provo.novell.com>

Hello Jan,

On 07.06.19 17:23, Jan Beulich wrote:
>>>> On 24.05.19 at 20:12, <andrii.anisov@gmail.com> wrote:
>> From: Andrii Anisov <andrii_anisov@epam.com>
>>
>> The existing interface for registering the runstate area by its virtual
>> address is prone to issues which became more obvious with KPTI enabled
>> in guests. The root of those issues is that the guest can be interrupted
>> by the hypervisor at any time, and there is no guarantee that the
>> registered virtual address can be translated with the guest's currently
>> active page tables. Before KPTI such a situation was only possible when
>> the guest was caught in the middle of PT processing (e.g. superpage
>> shattering). With KPTI this also happens whenever the guest runs
>> userspace, so it has a pretty high probability.
> 
> Except when there's no need for KPTI in the guest in the first place,
> as is the case for x86-64 PV guests. I think this is worthwhile clarifying.
> 
>> So it was agreed to register runstate with the guest's physical address
>> so that its mapping is permanent from the hypervisor point of view. [1]
>>
>> The hypercall employs the same vcpu_register_runstate_memory_area
>> structure for the interface, but requires a registered area to not
>> cross a page boundary.
>>
>> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html
>>
>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
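The page-boundary restriction stated above boils down to a check along these lines (a sketch with placeholder names, not the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Illustrative check for "the registered area must not cross a page
 * boundary": the offset within the page plus the area size must fit
 * inside a single page. */
static int runstate_area_ok(uint64_t gaddr, size_t size)
{
    return ((gaddr & ~PAGE_MASK) + size) <= PAGE_SIZE;
}
```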
> 
> I would have really hoped that you would outline the intended interaction
> between both interfaces,

I would suppose there is no specific interaction between the interfaces. I can hardly imagine a realistic use-case where such interaction would be needed.

> such that while reviewing one can judge whether
> you're actually matching the goal. I think it is actually mandatory to make
> explicit in the public header what level of mixing is permitted, what is not
> permitted, and what mix of requests is simply undefined.
>
> In particular, while we did work out during prior discussions that some
> level of mixing has to be permitted, I'm unconvinced that arbitrary
> mixing has to be allowed. For example, switching from one model to
> another could be permitted only with just a single active vCPU. This
> might allow to do away again with the runstate_in_use field you add.

Well, my point here is to leave it as is, and maybe add more documentation. If one likes shooting oneself in the foot, we should only make sure the ricochet cannot harm the hypervisor or other guests.
If you disagree, please suggest your interaction model, and I'll be happy to implement it.

> 
>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -274,17 +274,15 @@ static void ctxt_switch_to(struct vcpu *n)
>>       virt_timer_restore(n);
>>   }
>>   
>> -/* Update per-VCPU guest runstate shared memory area (if registered). */
>> -static void update_runstate_area(struct vcpu *v)
>> +static void update_runstate_by_gvaddr(struct vcpu *v)
>>   {
>>       void __user *guest_handle = NULL;
>>   
>> -    if ( guest_handle_is_null(runstate_guest(v)) )
>> -        return;
>> +    ASSERT(!guest_handle_is_null(runstate_guest_virt(v)));
>>   
>>       if ( VM_ASSIST(v->domain, runstate_update_flag) )
>>       {
>> -        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
>> +        guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1;
>>           guest_handle--;
>>           v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>>           __raw_copy_to_guest(guest_handle,
>> @@ -292,7 +290,7 @@ static void update_runstate_area(struct vcpu *v)
>>           smp_wmb();
>>       }
>>   
>> -    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
>> +    __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1);
> 
> In a prior version you did the mechanical part of adjusting the VA-based
> code in a prereq patch, aiding review. Is there a particular reason you
> folded everything into one patch now?

I silently followed the suggestion from George [1]. Any objections?

> 
>> @@ -303,6 +301,53 @@ static void update_runstate_area(struct vcpu *v)
>>       }
>>   }
>>   
>> +static void update_runstate_by_gpaddr(struct vcpu *v)
>> +{
>> +    struct vcpu_runstate_info *runstate =
>> +            (struct vcpu_runstate_info *)v->runstate_guest.phys;
> 
> What's the cast for here?
> 
>> @@ -1614,6 +1613,92 @@ bool update_runstate_area(struct vcpu *v)
>>       return rc;
>>   }
>>   
>> +static bool update_runstate_by_gpaddr_native(struct vcpu *v)
>> +{
>> +    struct vcpu_runstate_info *runstate =
>> +            (struct vcpu_runstate_info *)v->runstate_guest.phys;
>> +
>> +    ASSERT(runstate != NULL);
>> +
>> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> +    {
>> +        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
>> +        smp_wmb();
>> +        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>> +    }
>> +
>> +    memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate));
>> +
>> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> +    {
>> +        runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>> +        smp_wmb();
>> +        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>> +    }
>> +
>> +    return true;
>> +}
> 
> I can't help thinking that this matches the Arm code. Can common things
> please be put in common code? I may be asking too much, but if the
> pre-existing implementations are similar enough (I didn't check) perhaps
> they could be folded in a first step, too?

The problem was thought to be the interface: on x86 update_runstate_area() returns bool, but on ARM it returns void.
But for a common update_runstate_by_gpaddr_~native~() function that would not be a problem, because we would return a proper bool from the refactored update_runstate_area().
Thank you for the pointer.
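If folded into common code, the GPA-based update both architectures share could look roughly like this (trimmed, illustrative types; the smp_wmb() barriers that belong after each flag write are elided; this is not the patch's literal code):

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

#define XEN_RUNSTATE_UPDATE (1ULL << 63)

/* Trimmed stand-in for the public vcpu_runstate_info layout. */
struct vcpu_runstate_info {
    int state;
    uint64_t state_entry_time;
};

/* Sketch of a common helper: publish the copy under the
 * XEN_RUNSTATE_UPDATE bit so the guest can detect a torn read.
 * 'mapped' is the permanently mapped guest area, 'runstate' the
 * hypervisor-side copy, 'update_flag' mirrors VM_ASSIST(...). */
static int update_runstate_by_gpaddr(struct vcpu_runstate_info *mapped,
                                     struct vcpu_runstate_info *runstate,
                                     int update_flag)
{
    if ( update_flag )
    {
        mapped->state_entry_time |= XEN_RUNSTATE_UPDATE;
        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
    }

    memcpy(mapped, runstate, sizeof(*runstate));

    if ( update_flag )
    {
        mapped->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
        runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
    }

    return 1;   /* always succeeds: the mapping is permanent */
}
```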

> 
>> +static bool update_runstate_by_gpaddr_compat(struct vcpu *v)
>> +{
>> +    struct compat_vcpu_runstate_info *runstate =
>> +            (struct compat_vcpu_runstate_info *)v->runstate_guest.phys;
>> +
>> +    ASSERT(runstate != NULL);
>> +
>> +    if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> +    {
>> +        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
>> +        smp_wmb();
>> +        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>> +    }
>> +
>> +    {
>> +        struct compat_vcpu_runstate_info info;
>> +        XLAT_vcpu_runstate_info(&info, &v->runstate);
>> +        memcpy(v->runstate_guest.phys, &info, sizeof(info));
>> +    }
>> +    else
>> +        memcpy(v->runstate_guest.phys, &v->runstate, sizeof(v->runstate));
> 
> This "else" does not seem to be paired with an if(). Does this build
> at all?

This particular one - no! And it is really strange; I remember checking the patch's compilation for x86.
I am looking through the git reflog to figure out at which amend it became broken.
But also I do not completely understand the meaning of "_compat" and whether it needs to be supported here.

> 
>> --- a/xen/arch/x86/x86_64/domain.c
>> +++ b/xen/arch/x86/x86_64/domain.c
>> @@ -12,6 +12,8 @@
>>   CHECK_vcpu_get_physid;
>>   #undef xen_vcpu_get_physid
>>   
>> +extern void discard_runstate_area(struct vcpu *v);
> 
> No, this is not allowed. The declaration must be visible to both consumer
> and producer, so that when either side is changed things won't build until
> the other side gets changed too.
> 
>> @@ -35,8 +37,16 @@ arch_compat_vcpu_op(
>>                !compat_handle_okay(area.addr.h, 1) )
>>               break;
>>   
>> +        while( xchg(&v->runstate_in_use, 1) == 0);
> 
> At the very least such loops want a cpu_relax() in their bodies.
> But this being on a hypercall path - are there theoretical guarantees
> that a guest can't abuse this to lock up a CPU?
> 
> Furthermore I don't understand how this is supposed to work in
> the first place. The first xchg() will store 1, no matter what. Thus
> on the second iteration you'll find the wanted value unless the
> other side stored 0. Yet the other side is a simple xchg() too.
> Hence its first attempt would fail, but its second attempt (which
> we didn't exit the critical region here yet) would succeed.
> 
> Also - style.

I will leave this part for the locking issue discussion.
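For reference, the conventional shape of such an acquire loop is to exchange in 1 and spin while the *old* value was non-zero (the patch's condition appears inverted). Sketched with C11 atomics standing in for Xen's xchg() and cpu_relax() (names hypothetical):

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-vCPU flag discussed above. */
static atomic_int runstate_in_use;

/* Spin until we move the flag from 0 to 1; the exchange itself
 * publishes our ownership, so seeing the old value 0 means we won. */
static void runstate_lock(void)
{
    while ( atomic_exchange(&runstate_in_use, 1) != 0 )
        ;   /* cpu_relax() would go here in hypervisor code */
}

static void runstate_unlock(void)
{
    atomic_store(&runstate_in_use, 0);
}
```

This still leaves Jan's fairness concern (a guest-triggerable unbounded spin on a hypercall path) unanswered; it only fixes the loop condition.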

> 
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -698,6 +698,74 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
>>       return 0;
>>   }
>>   
>> +static void unmap_runstate_area(struct vcpu *v)
>> +{
>> +    mfn_t mfn;
>> +
>> +    if ( ! v->runstate_guest.phys )
> 
> Stray blank after ! .

OK.

> 
>> +        return;
>> +
>> +    mfn = domain_page_map_to_mfn(v->runstate_guest.phys);
>> +
>> +    unmap_domain_page_global((void *)
>> +                             ((unsigned long)v->runstate_guest.phys &
>> +                              PAGE_MASK));
>> +
>> +    v->runstate_guest.phys = NULL;
> 
> I think you would better store NULL before unmapping.

Do you mean using a local variable to pass the address to unmap_domain_page_global(), and setting v->runstate_guest.phys to NULL prior to the unmap?
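If so, the ordering could be sketched like this (stub types and helpers below are stand-ins for the Xen ones, purely for illustration): latch the mapping into a local, clear the per-vCPU pointer first, and only then unmap, so no path can observe a pointer to an already-unmapped page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MASK (~((uintptr_t)0xfff))

/* Minimal stand-ins for the Xen types/helpers involved (hypothetical). */
struct vcpu {
    struct { void *phys; } runstate_guest;
};

static void *last_unmapped;   /* records the unmap call for this demo */
static void unmap_domain_page_global(void *va) { last_unmapped = va; }

static void unmap_runstate_area(struct vcpu *v)
{
    void *va = v->runstate_guest.phys;

    if ( !va )
        return;

    v->runstate_guest.phys = NULL;   /* nobody may use it past here */
    unmap_domain_page_global((void *)((uintptr_t)va & PAGE_MASK));
}
```

In real hypervisor code a write barrier would likely be needed between clearing the pointer and the unmap.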

> 
>> @@ -734,7 +802,10 @@ int domain_kill(struct domain *d)
>>           if ( cpupool_move_domain(d, cpupool0) )
>>               return -ERESTART;
>>           for_each_vcpu ( d, v )
>> +        {
>> +            discard_runstate_area_locked(v);
>>               unmap_vcpu_info(v);
>> +        }
> 
> What is the "locked" aspect here about? The guest itself is dead (i.e.
> fully paused) at this point, isn't it? And it won't ever be unpaused
> again.

You are right. All vCPUs are stopped at this point, so we can discard the runstate area without locking.

> 
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -163,17 +163,31 @@ struct vcpu
>>       void            *sched_priv;    /* scheduler-specific data */
>>   
>>       struct vcpu_runstate_info runstate;
>> +
>> +    enum {
>> +        RUNSTATE_NONE = 0,
>> +        RUNSTATE_PADDR = 1,
>> +        RUNSTATE_VADDR = 2,
>> +    } runstate_guest_type;
>> +
>> +    unsigned long runstate_in_use;
> 
> Why "unsigned long"? Isn't a bool all you need?

A bool should be enough. But it seems we will have a proper lock here instead.

> Also these would now all want to be grouped in a sub-structure named
> "runstate", rather than having "runstate_" prefixes.

The `runstate` member already has the type `struct vcpu_runstate_info`, which is an interface type.
`runstate_guest` is a union. I'd rather not move the `runstate_guest` union into another sub-structure, because we would end up with long lines like `v->struct.runstate_guest.virt.p->state_entry_time`.
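For what it's worth, the grouping under discussion could be sketched like this (purely illustrative layout; `guest_handle_t` stands in for the real XEN_GUEST_HANDLE type, and the names are not from the patch):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { void *p; } guest_handle_t;  /* stand-in for XEN_GUEST_HANDLE */

/* One tagged union covering both registration flavours. */
struct vcpu_runstate_registration {
    enum {
        RUNSTATE_NONE  = 0,
        RUNSTATE_PADDR = 1,
        RUNSTATE_VADDR = 2,
    } type;
    union {
        guest_handle_t virt;   /* VA-based registration (legacy) */
        void *phys;            /* permanently mapped GPA-based registration */
    };
};
```

With an anonymous union the access paths stay short (`r.phys`, `r.virt`), which may address the long-lines concern above.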

> 
>> +        void*   phys;
> 
> Bad ordering between * and the blanks. There ought to be a blank
> ahead of the *, and I don't see why you need any after it.

OK.


[1] https://lists.xenproject.org/archives/html/xen-devel/2019-05/msg00599.html

-- 
Sincerely,
Andrii Anisov.


