Re: [PATCH v9 15/17] vmx: VT-d posted-interrupt core logic handling

From: Dario Faggioli <dario.faggioli@citrix.com>
To: Feng Wu <feng.wu@intel.com>, xen-devel@lists.xen.org
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH v9 15/17] vmx: VT-d posted-interrupt core logic handling
Date: Tue, 10 Nov 2015 19:14:51 +0100	[thread overview]
Message-ID: <1447179291.6409.59.camel@citrix.com> (raw)
In-Reply-To: <1446540207-4806-16-git-send-email-feng.wu@intel.com>

[-- Attachment #1.1: Type: text/plain, Size: 12274 bytes --]

On Tue, 2015-11-03 at 16:43 +0800, Feng Wu wrote:
> This patch includes the following aspects:
> - Handling logic when vCPU is blocked:
>     * Add a global vector to wake up the blocked vCPU
>       when an interrupt is being posted to it (This part
>       was sugguested by Yang Zhang <yang.z.zhang@intel.com>).
>     * Define two per-cpu variables:
>           1. pi_blocked_vcpu:
>             A list storing the vCPUs which were blocked
>             on this pCPU.
> 
>           2. pi_blocked_vcpu_lock:
>             The spinlock to protect pi_blocked_vcpu.
> 
> - Add the following hooks, this part was suggested
>   by George Dunlap <george.dunlap@eu.citrix.com> and
>   Dario Faggioli <dario.faggioli@citrix.com>.
>     * arch_vcpu_block()
>       Called alled before vcpu is blocking and update the PID
>       (posted-interrupt descriptor).
> 
>     * vmx_pi_switch_from()
>       Called before context switch, we update the PID when the
>       vCPU is preempted or going to sleep.
> 
>     * vmx_pi_switch_to()
>       Called after context switch, we update the PID when the vCPU
>       is going to run.
> 
> - Before VM-entry, check the state of PI descriptor, make sure the
> 'NV' field is set to '&posted_intr_vector' when the guest is running
> in non-root mode. Suggested by Jan Beulich <jbeulich@suse.com>.
> 
So, TBH, I never felt in love with this changelog. I'm saying this now
because I'm starting to loose track of things, and the changelog is not
helpful as it could be, in explaining to me what this patch (and, in
this case, this particular version of the patch) does.

In fact, above, you're listing the functions and the data structure you
are adding, and some of the places from where they're called/used.
That, however, can all be figured out from the code.

I think this would want to become a bit more high level, and try to
describe the _reason_why_ the various bits and pieces have been put in
the shape they are. For instance, although you added a paragraph about
the VMENTRY check, it is not clear to me, by just reading this
description, that it is meant at avoiding having to add much more
hooks, as well as, as another example, what you put in the other email
about it needing to be called on ever VMENTRY (the "we can't use
vmx_do_resume() thing).

I'm not saying you should rewrite another design document in this
changelog, and I know it's hard to pick the best structure and content,
but, really, this isn't ideal.

Also, explicitly crediting people who suggested the various items is
not helpful too. In 5 years someone (either of us or new), won't care
much who actually came up with the idea of having hooks. What he'll
care, is for the changelog to provide guidance throughout what the code
does, and why it's been done that way. There is no trace of a similar
crediting scheme (which, in principle, it would apply to any, o at
least to major, comments you get at each review) in our history, and it
makes the changelog heavier to read, so I'd cut it all off.

> ---
> v9:
> - Remove arch_vcpu_block_cancel() and arch_vcpu_wake_prepare()
> - Add vmx_pi_state_change() and call it before VM Entry
> 
And this is why I said I'm losing track of what was your actual aim for
this round.

Since when the possibility of moving stuff to VMX helpers came out

Jan said, during v7 review (about the need to undo what
arch_vcpu_block() does in case local_evnts_need_delivery() is true):

1) "Couldn't this be taken care of by, if necessary, adjusting PI state
    in vmx_do_resume()?"

And George said:

2) "we could just clear SN on every vmexit, and set SN and NDST on 
    every vmentry, couldn't we?  Then we wouldn't need hooks on the 
    context switch path at all (which are only there to catch running
    -> runnable and runnable -> running) -- we could just have the 
    block/unblock hooks (to change NV and add / remove things from the 
    list)."

Then, during v8, review comments focused mostly on 1), because we felt
the need to try to get rid of the block-cancelling hook.

What you seem to have done in this patch, is something in the middle.
In fact, AFAIU:
 - the block-cancelling hook is no longer there,
 - the wakeup hooks are gone as well,
 - the context switch hooks are still there.

Now, I'm not saying that this is not satisfactory as a result, I'm
saying it's hard to keep track of what's going on and to figure out
what your aim was in this round (and hence check the aim against the
review comments, and the implementation against the aim).

If describing that does not fit in the changelog (e.g., because it
would be something about how the code is evolving, as it probably is),
then put it in the cover letter (this very patch is almost the only
thing being changed significantly in the series anyway), or in a few
paragraph, here, after the "---", above the v9 bulleted list of
changes.

> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index e448b31..09c9c08 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -83,7 +83,132 @@ static int vmx_msr_write_intercept(unsigned int
> msr, uint64_t msr_content);

> +void vmx_vcpu_block(struct vcpu *v)
> +{
> +    unsigned long flags;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> +
> +    /*
> +     * The vCPU is blocking, we need to add it to one of the per
> pCPU lists.
> +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use
> it for
> +     * the per-CPU list, we also save it to posted-interrupt
> descriptor and
> +     * make it as the destination of the wake-up notification event.
> +     */
> +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> +                  &per_cpu(pi_blocked_vcpu, v
> ->arch.hvm_vmx.pi_block_cpu));
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> +    ASSERT(!pi_test_sn(pi_desc));
> +
> +    /*
> +     * We don't need to set the 'NDST' field, since it should point
> to
> +     * the same pCPU as v->processor.
> +     */
> +
> +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
>
Glad to see the 'do { } while ( cpmxchg(xxx) );' not being necessary
any longer! :-)

> +}
> +
> +static void vmx_pi_switch_from(struct vcpu *v)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !has_arch_pdevs(v->domain) || !iommu_intpost ||
> +         test_bit(_VPF_blocked, &v->pause_flags) )
> +        return;
> +
The '!iommu_intpost' is probably the quickest condition to check. If it
is, it should be the first thing checked.

The same applies in a few other places.

> +static void vmx_pi_state_change(struct vcpu *v)
> +{
FWIW, I don't like the name. _state_change to what?

Looking at the code, it seems to me that what it does it to always sets
the PI descriptor to posted_intr_vector, and removes the vCPU from a
blocking list.

The reason why it does so is because we don't want the vCPU to run
while being in a blocked list, with posted_wakeup_vector in its PI
descriptor.

I'd go for a name that better reflect either the operations being done
or the aim.

> +    unsigned long flags;
> +    unsigned int pi_block_cpu;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !has_arch_pdevs(v->domain) || !iommu_intpost )
> +        return;
> +
> +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> +
> +    /*
> +     * Set 'NV' field back to posted_intr_vector, so the
> +     * Posted-Interrupts can be delivered to the vCPU when
> +     * it is running in non-root mode.
> +     */
> +    if ( pi_desc->nv != posted_intr_vector )
> +        write_atomic(&pi_desc->nv, posted_intr_vector);
> +
> +    /* the vCPU is not on any blocking list. */
> +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> +    if ( pi_block_cpu == NR_CPUS )
> +        return;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> flags);
> +
> +    /*
> +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU
> was
> +     * removed from the blocking list while we are acquiring the
> lock.
> +     */
> +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> +    {
> +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> pi_block_cpu), flags);
> +        return;
> +    }
> +
> +    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> pi_block_cpu), flags);
> +}
> +

> @@ -106,10 +231,18 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>  
>      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
>  
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
> +
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +
>      v->arch.schedule_tail    = vmx_do_resume;
>      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
>      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
>  
> +    if ( iommu_intpost && has_hvm_container_vcpu(v) )
> +        v->arch.vcpu_block = vmx_vcpu_block;
> +
Genuine question: is it really possible to be here, in
vmx_vcpu_initialise(), with has_hvm_container_vcpu()==false ?

> +/* Handle VT-d posted-interrupt when VCPU is blocked. */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx, *tmp;
> +    struct vcpu *v;
> +    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock,
> smp_processor_id());
> +    struct list_head *blocked_vcpus =
> +                       &per_cpu(pi_blocked_vcpu,
> smp_processor_id());
> +    LIST_HEAD(list);
> +
> +    ack_APIC_irq();
> +    this_cpu(irq_count)++;
> +
> +    spin_lock(lock);
> +
> +    /*
> +     * XXX: The length of the list depends on how many vCPU is
> current
> +     * blocked on this specific pCPU. This may hurt the interrupt
> latency
> +     * if the list grows to too many entries.
> +     */
> +    list_for_each_entry_safe(vmx, tmp, blocked_vcpus,
> pi_blocked_vcpu_list)
> +    {
> +        if ( pi_test_on(&vmx->pi_desc) )
> +        {
> +            list_del(&vmx->pi_blocked_vcpu_list);
> +            vmx->pi_block_cpu = NR_CPUS;
> +
> +            /*
> +             * We cannot call vcpu_unblock here, since it also needs
> +             * 'pi_blocked_vcpu_lock', we store the vCPUs with ON
> +             * set in another list and unblock them after we release
> +             * 'pi_blocked_vcpu_lock'.
> +             */
>
Is this still true? If yes, is it evident from this patch? I'm asking
because I don't see that here.

I see that it was like that in v7, before of the wakeup hooks... Is
that still the case?

> +            list_add_tail(&vmx->pi_vcpu_on_set_list, &list);
> +        }
> +    }
> +
> +    spin_unlock(lock);
> +
> +    list_for_each_entry_safe(vmx, tmp, &list, pi_vcpu_on_set_list)
> +    {
> +        v = container_of(vmx, struct vcpu, arch.hvm_vmx);
> +        list_del(&vmx->pi_vcpu_on_set_list);
> +        vcpu_unblock(v);
> +    }
> +}
> +

> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 3eefed7..6e5c2f9 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c

> @@ -854,6 +856,7 @@ static long do_poll(struct sched_poll
> *sched_poll)
>  #endif
>  
>      rc = 0;
> +
>      if ( local_events_need_delivery() )
>          goto out;
> 
Just don't add this. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel