Re: [PATCH v5 4/6] xen/arm: inflight irqs during migration

From: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: julien.grall@citrix.com, xen-devel@lists.xensource.com,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Subject: Re: [PATCH v5 4/6] xen/arm: inflight irqs during migration
Date: Thu, 19 Jun 2014 19:32:01 +0100	[thread overview]
Message-ID: <alpine.DEB.2.02.1406191909580.6594@kaball.uk.xensource.com> (raw)
In-Reply-To: <1403089443.24051.20.camel@kazak.uk.xensource.com>

On Wed, 18 Jun 2014, Ian Campbell wrote:
> On Wed, 2014-06-11 at 17:27 +0100, Stefano Stabellini wrote:
> > We need to take special care when migrating irqs that are already
> > inflight from one vcpu to another. See "The effect of changes to an
> > GICD_ITARGETSR", part of chapter 4.3.12 of the ARM Generic Interrupt
> > Controller Architecture Specification.
> > 
> > The main issue from the Xen point of view is that the lr_pending and
> > inflight lists are per-vcpu. The lock we take to protect them is also
> > per-vcpu.
> > 
> > In order to avoid issues, we set a new flag GIC_IRQ_GUEST_MIGRATING, so
> > that we can recognize when we receive an irq while the previous one is
> > still inflight (given that we are only dealing with hardware interrupts
> > here, it just means that its LR hasn't been cleared yet on the old vcpu).
> > 
> > If GIC_IRQ_GUEST_MIGRATING is set, we only set GIC_IRQ_GUEST_QUEUED and
> > interrupt the old vcpu. When clearing the LR on the old vcpu, we take
> > special care of injecting the interrupt into the new vcpu. To do that we
> > need to release the old vcpu lock before taking the new vcpu lock.
> > 
> > Using barriers and test_bit on GIC_IRQ_GUEST_MIGRATING, make sure that
> > vgic_migrate_irq can run at the same time as gic_update_one_lr on
> > different cpus with consistent results.
> > 
> > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > 
> > ---
> > 
> > Changes in v5:
> > - pass unsigned long to find_next_bit for alignment on aarch64;
> > - call vgic_get_target_vcpu instead of gic_add_to_lr_pending to add the
> > irq in the right inflight queue;
> > - add barrier and bit tests to make sure that vgic_migrate_irq and
> > gic_update_one_lr can run simultaneously on different cpus without
> > issues;
> > - rework the loop to identify the new vcpu when ITARGETSR is written;
> > - use find_first_bit instead of find_next_bit.
> > ---
> >  xen/arch/arm/gic.c           |   25 ++++++++++++++--
> >  xen/arch/arm/vgic.c          |   66 +++++++++++++++++++++++++++++++++++++-----
> >  xen/include/asm-arm/domain.h |    4 +++
> >  3 files changed, 85 insertions(+), 10 deletions(-)
> > 
> > diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
> > index 5e502df..8ed242e 100644
> > --- a/xen/arch/arm/gic.c
> > +++ b/xen/arch/arm/gic.c
> > @@ -677,10 +677,31 @@ static void gic_update_one_lr(struct vcpu *v, int i)
> >          clear_bit(GIC_IRQ_GUEST_ACTIVE, &p->status);
> >          p->lr = GIC_INVALID_LR;
> >          if ( test_bit(GIC_IRQ_GUEST_ENABLED, &p->status) &&
> > -             test_bit(GIC_IRQ_GUEST_QUEUED, &p->status) )
> > +             test_bit(GIC_IRQ_GUEST_QUEUED, &p->status) &&
> > +             !test_bit(GIC_IRQ_GUEST_MIGRATING, &p->status) )
> >              gic_raise_guest_irq(v, irq, p->priority);
> > -        else
> > +        else {
> >              list_del_init(&p->inflight);
> > +
> > +            /* inflight is cleared before clearing
> > +             * GIC_IRQ_GUEST_MIGRATING */
> > +            dsb(sy);
> 
> Is sy really necessary? THis is all about state stored in RAM not the
> actual GIC, right? I think "ish" is probably sufficient. POssibly even a
> dmb of some sort.

Yeah, it is only about the state in RAM. It only needs to be visible
across all the cores, not the devices. So I guess the right function to
call would be smp_wmb() that translates to dmb(ishst)?
That makes me think that I should add smp_rmb() to the corresponding
critical section: vgic_migrate_irq.

> > diff --git a/xen/arch/arm/vgic.c b/xen/arch/arm/vgic.c
> > index 4d655af..e640de9 100644
> > --- a/xen/arch/arm/vgic.c
> > +++ b/xen/arch/arm/vgic.c
> > @@ -543,6 +562,8 @@ static int vgic_distr_mmio_write(struct vcpu *v, mmio_info_t *info)
> >      int offset = (int)(info->gpa - v->domain->arch.vgic.dbase);
> >      int gicd_reg = REG(offset);
> >      uint32_t tr;
> > +    unsigned long trl;
> > +    int i;
> >  
> >      switch ( gicd_reg )
> >      {
> > @@ -626,22 +647,45 @@ static int vgic_distr_mmio_write(struct vcpu *v, mmio_info_t *info)
> >          rank = vgic_rank_offset(v, 8, gicd_reg - GICD_ITARGETSR);
> >          if ( rank == NULL) goto write_ignore;
> >          /* 8-bit vcpu mask for this domain */
> > -        tr = (1 << v->domain->max_vcpus) - 1;
> > -        tr = tr | (tr << 8) | (tr << 16) | (tr << 24);
> > -        tr &= *r;
> > +        trl = (1 << v->domain->max_vcpus) - 1;
> > +        if ( dabt.size == 2 )
> > +            trl = trl | (trl << 8) | (trl << 16) | (trl << 24);
> > +        else
> > +            trl = (trl << (8 * (offset & 0x3)));
> > +        trl &= *r;
> 
> What does trl stand for? Why is it an unsigned long when r and tr are
> uint32_t?

It is an unfortunate name. trl is exactly like tr but unsigned long
instead of uint32_t.
It is unsigned long so that it can be used as an argument of
find_next_bit below. uint32_t cannot be used because it doesn't have the
right alignment on armv8.

> >          /* ignore zero writes */
> > -        if ( !tr )
> > +        if ( !trl )
> >              goto write_ignore;
> >          if ( dabt.size == 2 &&
> > -            !((tr & 0xff) && (tr & (0xff << 8)) &&
> > -             (tr & (0xff << 16)) && (tr & (0xff << 24))))
> > +            !((trl & 0xff) && (trl & (0xff << 8)) &&
> > +             (trl & (0xff << 16)) && (trl & (0xff << 24))))
> >              goto write_ignore;
> 
> I'm still not sure what this is ;-)

All the bytes in trl needs to be non-zero for the write to be valid
(considering a 0 write to any of the byte fields as invalid).

> >          vgic_lock_rank(v, rank);
> > +        i = 0;
> > +        while ( (i = find_next_bit((const unsigned long *)&trl, 32, i)) < 32 )
> 
> Unnecessary cast (if trl really should be an unsigned long).

The argument to find_next_bit really needs to be unsigned long. I'll
remove the cast.

> > +        {
> > +            unsigned int irq, target, old_target;
> > +            unsigned long old_target_mask;
> > +            struct vcpu *v_target, *v_old;
> > +
> > +            target = i % 8;
> > +            old_target_mask = byte_read(rank->itargets[REG_RANK_INDEX(8, gicd_reg - GICD_ITARGETSR)], 0, i/8);
> > +            old_target = find_first_bit((const unsigned long *) &old_target_mask, 8);
> 
> Another unnecessary cast.

Right.

> > +
> > +            if ( target != old_target )
> 
> Given the implementation only supports a single target I'm wondering if
> maybe we should just store the configured target.
> 
> If we weren't required to read back the actual value then we could even
> dispense with rank->itargets completely and just fabricate it.

Yeah, that's a good point, I am not sure.
The spec says that it is valid to statically configure GICD_ITARGETSR
and ignore any writes to it, but it doesn't say that we can ignore
/some/ writes to it.

> > +            {
> > +                irq = offset + (i / 8);
> > +                v_target = v->domain->vcpu[target];
> > +                v_old = v->domain->vcpu[old_target];
> > +                vgic_migrate_irq(v_old, v_target, irq);
> > +            }
> > +            i += 8 - target;
> 
> I think this whole loop would be less subtle as:
> 
> 	for ( byte = 0 ; byte < 4 ; byte++ )
> 		target = find_first_bit((char *)(&tr)+byte, 8)

That would have alignment issues, wouldn't it?

> 		if ( target > 8 ) continue;
> 		old_target  =...
> 	}
> 
> The use of find_next_bit is not really semantically what is happening
> here.