* [RFC] ARM: New (Xen) VGIC design document
@ 2017-10-11 14:33 Andre Przywara
From: Andre Przywara @ 2017-10-11 14:33 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Marc Zyngier, xen-devel, Christoffer Dall, Eric Auger

Hi,

(CC:ing some KVM/ARM folks involved in the VGIC)

starting with the addition of the ITS support we were seeing more and
more issues with the current implementation of our ARM Generic Interrupt
Controller (GIC) emulation, the VGIC.
Among other approaches to fix those issues it was proposed to copy the
VGIC emulation used in KVM. That implementation was suffering from very
similar issues, and a clean design from scratch led to a very robust and
capable re-implementation. Interestingly this implementation is fairly
self-contained, so it seems feasible to copy it. Hopefully we only need
minor adjustments; possibly we can even copy it verbatim with some
additional glue layer code.
Stefano asked for a design overview, to assess the feasibility of copying
the KVM code without having to review tons of code in the first place.
So, to follow the Xen rules for new features, the design document below is
an attempt to describe the current KVM VGIC design in a hypervisor-agnostic
fashion. It is a bit of a retro-fit design description, as it is not
strictly forward-looking only, but actually describes the existing
implementation [1].

Please have a look and let me know:
1) if this document has the right scope
2) if this document has the right level of detail
3) if there are points missing from the document
4) if the design in general is a fit

Appreciate any feedback!

Cheers,
Andre.

---------------------------------------

VGIC design
===========

This document describes the design of an ARM Generic Interrupt Controller (GIC)
emulation. It is meant to emulate a GIC for a guest running in a virtual
machine; the common name for that is VGIC (from "virtual GIC").

This design was the result of a one-week-long design session with some
engineers in a room, triggered by ever-increasing difficulties in maintaining
the existing GIC emulation in the KVM hypervisor. The design eventually
materialised as an alternative VGIC implementation in the Linux kernel
(merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
was removed, so it is now the only VGIC code used by Linux.
Although it is used in KVM, the actual design of this VGIC is rather hypervisor
agnostic and can be used by other hypervisors as well, in particular by Xen.

GIC hardware virtualization support
-----------------------------------

The ARM Generic Interrupt Controller (since v2) supports the virtualization
extensions, which allow some parts of the interrupt life cycle to be handled
purely inside the guest without exiting into the hypervisor.
In the GICv2 and GICv3 architecture this covers mostly the "interrupt
acknowledgement", "priority drop" and "interrupt deactivate" actions.
So a guest can handle most of the interrupt processing without
leaving EL1 and trapping into the hypervisor. To accomplish
this, the GIC provides so-called "list registers" (LRs), which shadow the
interrupt state for any virtual interrupt. Injecting an interrupt to a guest
involves setting up one LR with the interrupt number, its priority and initial
state (mostly "pending"), then entering the guest. Any EOI related action
from within the guest just acts on those LRs, the hypervisor can later update
the virtual interrupt state when the guest exists the next time (for whatever
reason).
But despite the GIC hardware helping out here, the whole interrupt
configuration management is not virtualized at all and needs to be emulated
by the hypervisor - or another related software component, for instance a
userland emulator. This so-called "distributor" part of the GIC consists of
memory mapped registers, which can be trapped by the hypervisor, so any guest
access can be emulated in the usual way.
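
To make the LR mechanics more tangible, here is a minimal sketch of how a
hypervisor might encode a virtual interrupt into a GICv2 list register on
guest entry. The bit layout follows the GICv2 architecture (GICH_LR<n>); the
helper name, its arguments and the simplifications (no group handling, no SGI
source encoding) are purely illustrative and not part of the actual design.

    #include <stdbool.h>
    #include <stdint.h>

    /* GICv2 GICH_LR<n> fields (simplified sketch, see the GICv2 spec) */
    #define GICH_LR_VID_MASK    0x3ffU      /* virtual INTID, bits [9:0]    */
    #define GICH_LR_PID_SHIFT   10          /* physical INTID, bits [19:10] */
    #define GICH_LR_PRIO_SHIFT  23          /* priority, bits [27:23]       */
    #define GICH_LR_PENDING     (1U << 28)
    #define GICH_LR_ACTIVE      (1U << 29)
    #define GICH_LR_HW          (1U << 31)

    /* Hypothetical helper: build an LR value injecting 'vintid' as pending. */
    static uint32_t make_gicv2_lr(uint32_t vintid, uint8_t priority,
                                  bool hw, uint32_t hwintid)
    {
        uint32_t lr = vintid & GICH_LR_VID_MASK;

        /* The LR only holds the upper 5 bits of the 8-bit priority. */
        lr |= (uint32_t)(priority >> 3) << GICH_LR_PRIO_SHIFT;
        lr |= GICH_LR_PENDING;

        if (hw)
            /* Link to the physical INTID, so that the guest's "deactivate"
             * also deactivates the hardware interrupt without trapping. */
            lr |= GICH_LR_HW | ((hwintid & 0x3ffU) << GICH_LR_PID_SHIFT);

        return lr;
    }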

VGIC design motivation
----------------------

A GIC emulation thus needs to take care of those bits:

- trap GIC distributor MMIO accesses and shadow the configuration setup
  (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
- handle incoming hardware and virtual interrupt requests and inject the
  associated virtual interrupt by manipulating one of the list registers
- track the state of a virtual interrupt by inspecting the LRs after the
  guest has exited, possibly adjusting the shadowed virtual interrupt state

Despite the distributor MMIO register emulation being a sizeable chunk of
the emulation, it is actually not dominant when looking at the frequency at
which it is accessed. Normally the interrupt configuration is done at boot
time or upon initialising the device (driver), but rarely during the actual
run time of a system. Injecting and EOI-ing interrupts however happens much
more often. A good emulation approach should thus focus on tracking the virtual
interrupt state efficiently, allowing quick handling of incoming and EOI-ed
interrupts.

The actual interrupt state tracking can be quite tricky in parts. Interrupt
injections can happen independently of the guest entry/exit points, and MMIO
configuration accesses can be triggered by any VCPU at any point in time.
Changing the CPU affinity of an interrupt adds to the complication.
This leads to many code paths which can run in parallel and thus race against
each other, so proper locking becomes key to a good design.
But one has to consider that interrupts in general can be characterised
as rare events - otherwise a guest would be busy handling interrupts and could
not process actual computation tasks.
That's why the interrupt state tracking should focus on a clear and race-free
locking scheme, without needlessly optimising in this respect.
Experience shows that such optimisations complicate the code and lead to
undetected and hard-to-debug race conditions, which affect the stability of
the system in possibly untested corner cases.

VGIC design principles
----------------------

### Data structure

This VGIC design is based on the idea of having one structure per virtual
interrupt, protected by its own lock. In addition there is a list per VCPU,
which queues the interrupts which this VCPU should consider for injection.
One interrupt can only be on one VCPU list at any given point in time.
For private interrupts and SPIs a static allocation of this data structure
would be sufficient; however LPIs (triggered by a (virtual) ITS) have a very
dynamic and possibly very sparse allocation scheme, so we need to deal with
dynamic allocation and de-allocation of this struct. To accommodate this
there is an additional list header to link all LPIs.
Also LPI mapping and unmapping can happen asynchronously (triggered by ITS
commands issued by the guest at any time), so we need to properly
reference-count the structure (at least for LPIs), otherwise some code parts
could end up referencing an already freed structure.

The central data structure is called `struct vgic_irq` and, besides the
expected interrupt configuration data, contains at least the lock, a list
header (to be able to link it to a VCPU) and a refcount. It also contains
the interrupt number (to accommodate non-contiguous interrupt allocations,
for instance for LPIs).
Besides those essential elements it proves worthwhile to store (a reference
to) the VCPU this IRQ is associated with, as this allows the respective
VCPU list to be found easily.

    struct vgic_irq {
        spinlock_t irq_lock;            /* Protects the content of the struct */
        struct list_head lpi_list;      /* Used to link all LPIs together */
        struct list_head ap_list;

        struct vcpu *vcpu;              /* SGIs and PPIs: the VCPU they are
                                         * private to.
                                         * SPIs and LPIs: the VCPU whose ap_list
                                         * this IRQ is queued on.
                                         */

        struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
                                          * be sent to, as a result of the
                                          * targets reg (v2) or the
                                          * affinity reg (v3).
                                          */

        u32 intid;                      /* Guest visible INTID */
        bool line_level;                /* Level only */
        bool pending_latch;             /* The pending latch state used to
                                         * calculate the pending state for
                                         * both level and edge triggered IRQs.
                                         */

        bool active;                    /* not used for LPIs */
        bool enabled;
        bool hw;                        /* Tied to HW IRQ */
        struct kref refcount;           /* Used for LPIs */
        u32 hwintid;                    /* HW INTID number */
        union {
            u8 targets;                     /* GICv2 target VCPUs mask */
            u32 mpidr;                      /* GICv3 target VCPU */
        };
        u8 source;                      /* GICv2 SGIs only */
        u8 priority;
        enum vgic_irq_config config;    /* Level or edge */
    };

### VCPU list handling

Initially a virtual interrupt is not linked to any VCPU list and just lives
in its own structure. Guest MMIO accesses to the distributor will change the
state information in this structure.
When an interrupt is actually made pending (either by an associated hardware
IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
linked to the current target VCPU. The `vcpu` member in the structure will
be set to this VCPU. Any affinity change after this point will not affect
the current target VCPU anymore; it just updates the `target_vcpu` field in
the structure, which will be considered on the next injection.
This per-VCPU list is called the `ap_list`, since it holds interrupts which
are in a pending and/or active state.
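
For illustration, the per-VCPU side of this design could look like the sketch
below; the structure and field names are made up for this document and do not
claim to match the KVM (or a future Xen) definitions.

    /* Per-VCPU VGIC state (illustrative sketch only). The ap_list is
     * protected by the VCPU lock, as described in the "Locking" section. */
    struct vgic_cpu {
        struct list_head ap_list_head;  /* links the ap_list member of all
                                         * vgic_irq structures that are pending
                                         * and/or active on this VCPU */
    };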

### Virtual IRQ references

There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
given its number.
For private IRQs and SPIs it is expected that this just indexes a static array.
For LPIs (which are dynamically allocated at run time) this is expected to
iterate a data structure (like a linked list) to find the right structure.
In any case a call to `vgic_get_irq` will increase a refcount, which will
prevent LPIs from being de-allocated while another part of the VGIC is still
holding a reference. Thus any caller to `vgic_get_irq` shall call
`vgic_put_irq()` after it is done with handling this interrupt.
An exception is when the virtual IRQ is eventually injected into a VCPU: in
this case the VCPU (list) holds that reference, and it is kept as long as the
guest sees this virtual IRQ. The refcount is only decreased once the IRQ has
been EOIed by the guest and has been removed from the VCPU list.
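
The expected calling convention could look like this sketch; only the two
function names come from the text above, while the exact signatures (and the
`struct domain` argument) are assumptions made for illustration, relying on
the usual hypervisor spinlock primitives.

    /* Look up a virtual IRQ by INTID and take a reference (sketch). */
    struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu, u32 intid);
    /* Drop a reference; an unmapped LPI may be freed here (sketch). */
    void vgic_put_irq(struct domain *d, struct vgic_irq *irq);

    /* Typical caller pattern: */
    void example_set_enabled(struct domain *d, u32 intid, bool enable)
    {
        struct vgic_irq *irq = vgic_get_irq(d, NULL, intid);    /* refcount++ */

        spin_lock(&irq->irq_lock);
        irq->enabled = enable;
        spin_unlock(&irq->irq_lock);

        vgic_put_irq(d, irq);                                   /* refcount-- */
    }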

### Locking

To keep the `vgic_irq` structure consistent and to avoid races between
different parts of the VGIC, locking is essential whenever accessing a member
of this structure. It is expected that this lock is almost never contended
and is only held for brief periods of time, so taking it is considered cheap.
To keep the code clean and avoid nasty corner cases, there is no attempt at
lockless tricks here.
If for any reason the code needs to hold the locks for two virtual IRQs, the
one with the lower IRQ number is to be taken first, to avoid deadlocks.

Another lock to consider is the VCPU lock, which at first glance protects
the virtual CPU's list structure, but also synchronises additions of IRQs
to and removals from a VCPU's list. To add an IRQ, both the VCPU and the per-IRQ
lock need to be held. To avoid deadlocks, there is a strict locking order:

> The VCPU lock needs to be taken first, the per-IRQ lock after this.

Some operations (like migrating IRQs between two VCPUs) require two VCPU
locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
is to be taken first.

There are occasions where the locking order (VCPU first) is hard to observe,
because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
list. In this case the IRQ lock needs to be dropped, the respective VCPU
lock should be taken, then the per-IRQ lock needs to be re-taken.
Once both locks are held, we need to check whether the conditions which
originally mandated the list addition (or removal) still hold. This is
needed because the IRQ lock could have been taken by another entity in the
meantime and the state of this interrupt could have changed: for instance the
interrupt may no longer be pending, may have been disabled or may have changed
its CPU affinity. Some of those changes render the current action obsolete
(no longer pending), others lead to a retry of the re-locking scheme described
above. This re-locking scheme shall be implemented in a well-documented
function.
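
A sketch of such a function is shown below. It is modelled on what the KVM
implementation calls `vgic_queue_irq_unlock()`, but the helper and field names
(`vgic_target_oracle()`, the VCPU lock, the per-VCPU structure sketched
earlier being embedded as `vgic_cpu`) are illustrative assumptions. It is
called with the per-IRQ lock held and returns with all locks dropped.

    /* Queue irq on its target VCPU, observing the VCPU-first locking order.
     * Called with irq->irq_lock held; returns with all locks dropped. */
    static void vgic_queue_irq_unlock(struct vgic_irq *irq)
    {
        struct vcpu *vcpu;

    retry:
        /* Hypothetical helper: on which VCPU (if any) must this IRQ be queued? */
        vcpu = vgic_target_oracle(irq);

        if (!vcpu || irq->vcpu) {
            /* Not pending+enabled, or already on an ap_list: nothing to do. */
            spin_unlock(&irq->irq_lock);
            return;
        }

        /* Locking order is VCPU lock first, so drop and re-take the IRQ lock. */
        spin_unlock(&irq->irq_lock);
        spin_lock(&vcpu->vgic_lock);            /* hypothetical VCPU lock */
        spin_lock(&irq->irq_lock);

        /* The IRQ state may have changed while no lock was held: re-check. */
        if (vcpu != vgic_target_oracle(irq) || irq->vcpu) {
            spin_unlock(&irq->irq_lock);
            spin_unlock(&vcpu->vgic_lock);
            spin_lock(&irq->irq_lock);
            goto retry;
        }

        kref_get(&irq->refcount);               /* the ap_list holds a reference */
        list_add_tail(&irq->ap_list, &vcpu->vgic_cpu.ap_list_head);
        irq->vcpu = vcpu;

        spin_unlock(&irq->irq_lock);
        spin_unlock(&vcpu->vgic_lock);
    }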

### Level and edge triggered interrupts

The GIC knows two kinds of interrupt signalling:

- Edge triggered interrupts are triggered by a device once; their life cycle
ends when the guest has EOIed them, at which point we remove the pending state,
clear the LR and return the `vgic_irq` structure to a quiescent state.

- Level triggered interrupts are triggered when a device raises its interrupt
line, they stay pending as long as this line is held high. At some point the
driver in the guest is expected to program the device to explicitly or
implicitly lower this interrupt line. That means that we have to store the
state of the virtual interrupt line, which is only controlled by the (virtual)
device. This is done in the `line_level` member of `struct vgic_irq`.

To assert the interrupt condition, a (virtual) device calls a function exported
by the VGIC, which allows it to raise or lower an interrupt line. Lowering the
line of an edge triggered IRQ is ignored (and so is optional). Raising the
line asserts the pending state and potentially injects this virtual IRQ. Any
subsequent "raise" call might inject another IRQ, if the previous one has at
least been activated by the guest already; otherwise it is ignored.

For level triggered interrupts this function stores the new state into the
`line_level` variable, potentially injecting the interrupt if that line
changes from false to true. If the line is lowered before the guest has
seen it, this particular interrupt instance will be discarded. Successive
"raising" calls will not lead to multiple interrupts if the line has not
been lowered in between.
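
A sketch of this entry point is shown below; the function name, its signature
and the enum value for the edge configuration are assumptions made for
illustration, building on the `vgic_get_irq()`/`vgic_put_irq()` and
`vgic_queue_irq_unlock()` sketches above.

    /* Raise (level = true) or lower (level = false) the virtual interrupt
     * line of 'intid' (illustrative sketch). */
    void vgic_inject_irq(struct domain *d, u32 intid, bool level)
    {
        struct vgic_irq *irq = vgic_get_irq(d, NULL, intid);

        spin_lock(&irq->irq_lock);

        if (irq->config == VGIC_CONFIG_EDGE) {
            if (level)
                irq->pending_latch = true;  /* a lowered line is ignored */
        } else {
            irq->line_level = level;        /* track the virtual line state */
        }

        /* Queue on the target VCPU if now pending and enabled; this drops
         * the IRQ lock internally (see the "Locking" section above). */
        vgic_queue_irq_unlock(irq);

        vgic_put_irq(d, irq);
    }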

### Software triggered interrupts

Besides the inherently software triggered inter-processor interrupts
(SGIs in GIC speak), there is another way for software to raise an
interrupt condition.
The GIC distributor allows setting or clearing both the pending and the active
state of any interrupt via MMIO registers. This isn't used by many operating
systems, but is useful when saving and restoring the state of a machine.
So emulating these registers is required for architecture compliance;
however the implementation does not need to be very efficient given their rare
usage. In fact supporting the set-pending and clear-pending registers is
relatively straightforward, as long as one keeps this state separate from
the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
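
A handler for the set-pending registers could then be as simple as the
following sketch; the handler name, the MMIO plumbing and the use of the
`for_each_set_bit()` helper are illustrative assumptions.

    /* GICD_ISPENDR-style handler: one bit per interrupt, writing 1 sets the
     * pending latch (sketch). 'offset' is relative to the register group. */
    static void vgic_mmio_write_spending(struct vcpu *vcpu, unsigned int offset,
                                         unsigned int size, unsigned long val)
    {
        u32 intid_base = offset * 8;        /* 1 bit per interrupt */
        unsigned int i;

        for_each_set_bit(i, &val, size * 8) {
            struct vgic_irq *irq = vgic_get_irq(vcpu->domain, vcpu,
                                                intid_base + i);

            spin_lock(&irq->irq_lock);
            irq->pending_latch = true;      /* kept separate from line_level */
            vgic_queue_irq_unlock(irq);     /* drops the IRQ lock */

            vgic_put_irq(vcpu->domain, irq);
        }
    }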

The set-active and clear-active registers are much harder to emulate, though,
as normally the active state is of little concern to the GIC emulation. In
a normal interrupt life cycle the active state isn't even visible to the
hypervisor, as it might be set and cleared again entirely within the guest
in the list register, without exiting to the hypervisor.
So manipulating the active state via the MMIO registers requires some heavy
lifting: If this interrupt is currently injected into a running VCPU, this
VCPU must exit, the active state must be set or cleared in the LR, then
execution can continue. While this is expensive, as mentioned above it
should not happen too often, and a system using this feature for save and
restore is probably not very performance sensitive at that point anyway.

### MMIO emulation

As mentioned before, the distributor and redistributor parts of the VGIC need
to be fully emulated. Those parts are characterised by a range of MMIO
registers. The implementation shall provide a dispatcher function, which
takes the faulted address, relative to the beginning of the MMIO range, and
works out which actual register is affected. It then looks up the
respective handler function and calls it. Those functions are expected to
be listed in a struct initialiser, which connects the actual register
offset and its size to a particular handler. Having handler functions for
a register range seems beneficial over handling registers in a switch/case,
because it's easier to read and simplifies code sharing, for instance
between GICv2 distributor, GICv3 distributor and GICv3 redistributor registers
with the same semantics.
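
As an illustration, such a register table and dispatcher could look like the
sketch below; the structure, the handler names, the register range sizes and
the GICD_* offsets (assumed to come from the usual GIC header) are examples
only and do not describe an existing interface.

    /* One entry per register (range) of an MMIO frame (illustrative). */
    struct vgic_register_region {
        unsigned int reg_offset;    /* offset from the start of the frame */
        unsigned int len;           /* size of the register range in bytes */
        unsigned long (*read)(struct vcpu *vcpu, unsigned int offset,
                              unsigned int size);
        void (*write)(struct vcpu *vcpu, unsigned int offset,
                      unsigned int size, unsigned long val);
    };

    static const struct vgic_register_region vgic_v2_dist_registers[] = {
        { GICD_ISENABLER, 32 * 4, vgic_mmio_read_enable,  vgic_mmio_write_senable  },
        { GICD_ICENABLER, 32 * 4, vgic_mmio_read_enable,  vgic_mmio_write_cenable  },
        { GICD_ISPENDR,   32 * 4, vgic_mmio_read_pending, vgic_mmio_write_spending },
        /* ... */
    };

    /* Dispatcher: map a faulted offset to the region that handles it. The
     * caller would typically pass 'offset - region->reg_offset' on to the
     * handler. */
    static const struct vgic_register_region *
    vgic_find_region(const struct vgic_register_region *regions,
                     unsigned int nr_regions, unsigned int offset)
    {
        unsigned int i;

        for (i = 0; i < nr_regions; i++)
            if (offset >= regions[i].reg_offset &&
                offset < regions[i].reg_offset + regions[i].len)
                return &regions[i];

        return NULL;    /* RAZ/WI or fault, depending on policy */
    }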

### List register management

A list register (LR) holds the state of a virtual interrupt, which will
be used by the GIC hardware to simulate an IRQ life cycle for a guest.
Each GIC hardware implementation can choose how many LRs to implement;
four seems to be a common number. This design does not try to manage the
LRs very cleverly; instead, on every guest exit every LR in use is synced
back to the emulated state and then cleared. Upon guest entry
the top priority virtual IRQs will be inserted into the LRs. If there are
more pending or active IRQs than list registers, the GIC management IRQ
will be configured to notify the hypervisor of a free LR (once the guest
has EOIed one IRQ). This will trigger a normal exit, which will go through
the normal cleanup/repopulate scheme, possibly now queuing the leftover
interrupt(s).
To facilitate quick guest exit and entry times, the VGIC maintains the list
of pending or active interrupts (ap\_list) sorted by their priority. Active
interrupts always go first on the list, since a guest and the hardware GIC
expect those to stay until they have been explicitly deactivated. Failing
to keep active IRQs around will result in error conditions in the GIC.
The second sort criterion for the ap\_list is priority, so higher priority
pending interrupts always go into the LRs first.
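
The resulting ordering rule can be expressed as a small comparison function,
sketched below with illustrative names: with both IRQ locks held it decides
which of two interrupts should be placed earlier in the ap\_list and thus be
picked for an LR first.

    /* Returns < 0 if a should be sorted before b in the ap_list (sketch).
     * The caller must hold both IRQ locks. */
    static int vgic_irq_cmp(const struct vgic_irq *a, const struct vgic_irq *b)
    {
        /* Active interrupts always come first: they must keep their LR. */
        if (a->active != b->active)
            return a->active ? -1 : 1;

        /* Then sort by priority; a lower value means a higher priority. */
        return (int)a->priority - (int)b->priority;
    }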


* Re: [RFC] ARM: New (Xen) VGIC design document
From: Andre Przywara @ 2017-10-11 14:42 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Marc Zyngier, xen-devel, Christoffer Dall, Eric Auger

Hi,

On 11/10/17 15:33, Andre Przywara wrote:
> Hi,
> 
> (CC:ing some KVM/ARM folks involved in the VGIC)
> 
> starting with the addition of the ITS support we were seeing more and
> more issues with the current implementation of our ARM Generic Interrupt
> Controller (GIC) emulation, the VGIC.
> Among other approaches to fix those issues it was proposed to copy the
> VGIC emulation used in KVM. This one was suffering from very similar
> issues, and a clean design from scratch lead to a very robust and
> capable re-implementation. Interestingly this implementation is fairly
> self-contained, so it seems feasible to copy it. Hopefully we only need
> minor adjustments, possibly we can even copy it verbatim with some
> additional glue layer code.
> Stefano asked for getting a design overview, to assess the feasibility
> of copying the KVM code without reviewing tons of code in the first
> place.
> So to follow Xen rules for new features, this design document below is
> an attempt to describe the current KVM VGIC design - in a hypervisor
> agnostic session. It is a bit of a retro-fit design description, as it
> is not strictly forward-looking only, but actually describing the
> existing implemenation [1].

and that link should point to:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/vgic

Cheers,
Andre.


* Re: [RFC] ARM: New (Xen) VGIC design document
From: Christoffer Dall @ 2017-10-12 12:05 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Stefano Stabellini, Marc Zyngier, Eric Auger, Julien Grall,
	xen-devel, Christoffer Dall

Hi Andre,

On Wed, Oct 11, 2017 at 03:33:03PM +0100, Andre Przywara wrote:
> Hi,
> 
> (CC:ing some KVM/ARM folks involved in the VGIC)

Very nice writeup!

I added a bunch of comments, mostly for the writing and clarity, I hope
it helps.

> 
> starting with the addition of the ITS support we were seeing more and
> more issues with the current implementation of our ARM Generic Interrupt
> Controller (GIC) emulation, the VGIC.
> Among other approaches to fix those issues it was proposed to copy the
> VGIC emulation used in KVM. This one was suffering from very similar
> issues, and a clean design from scratch lead to a very robust and
> capable re-implementation. Interestingly this implementation is fairly
> self-contained, so it seems feasible to copy it. Hopefully we only need
> minor adjustments, possibly we can even copy it verbatim with some
> additional glue layer code.
> Stefano asked for getting a design overview, to assess the feasibility
> of copying the KVM code without reviewing tons of code in the first
> place.
> So to follow Xen rules for new features, this design document below is
> an attempt to describe the current KVM VGIC design - in a hypervisor
> agnostic session. It is a bit of a retro-fit design description, as it
> is not strictly forward-looking only, but actually describing the
> existing implemenation [1].
> 
> Please have a look and let me know:
> 1) if this document has the right scope
> 2) if this document has the right level of detail
> 3) if there are points missing from the document
> 3) if the design in general is a fit
> 
> Appreciate any feedback!
> 
> Cheers,
> Andre.
> 
> ---------------------------------------
> 
> VGIC design
> ===========
> 
> This document describes the design of an ARM Generic Interrupt Controller (GIC)
> emulation. It is meant to emulate a GIC for a guest in an virtual machine,
> the common name for that is VGIC (from "virtual GIC").
> 
> This design was the result of a one-week-long design session with some
> engineers in a room, triggered by ever-increasing difficulties in maintaining
> the existing GIC emulation in the KVM hypervisor. The design eventually
> materialised as an alternative VGIC implementation in the Linux kernel
> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
> was removed, so it is now the current code used by Linux.
> Although being used in KVM, the actual design of this VGIC is rather hypervisor
> agnostic and can be used by other hypervisors as well, in particular for Xen.
> 
> GIC hardware virtualization support
> -----------------------------------
> 
> The ARM Generic Interrupt Controller (since v2) supports the virtualization
> extensions, which allows some parts of the interrupt life cycle to be handled
> purely inside the guest without exiting into the hypervisor.
> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
> acknowledgement", "priority drop" and "interrupt deactivate" actions.
> So a guest can handle most of the interrupt processing code without
> leaving EL1 and trapping into the hypervisor. To accomplish
> this, the GIC holds so called "list registers" (LRs), which shadow the
> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
> involves setting up one LR with the interrupt number, its priority and initial
> state (mostly "pending"), then entering the guest. Any EOI related action
> from within the guest just acts on those LRs, the hypervisor can later update
> the virtual interrupt state when the guest exists the next time (for whatever
> reason).
> But despite the GIC hardware helping out here, the whole interrupt
> configuration management is not virtualized at all and needs to be emulated
> by the hypervisor - or another related software component, for instance a
> userland emulator. This so called "distributor" part of the GIC consists of
> memory mapped registers, which can be trapped by the hypervisor, so any guest
> access can be emulated in the usual way.
> 
> VGIC design motivation
> ----------------------
> 
> A GIC emulation thus needs to take care of those bits:
> 
> - trap GIC distributor MMIO accesses and shadow the configuration setup
>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
> - handle incoming hardware and virtual interrupt requests and inject the
>   associated virtual interrupt by manipulating one of the list registers
> - track the state of a virtual interrupt by inspecting the LRs after the
>   guest has exited, possibly adjusting the shadowed virtual interrupt state
> 
> Despite the distributor MMIO register emulation being a sizeable chunk of
> the emulation, it is actually not dominant if looking at the frequency at
> which it is accessed. Normally the interrupt configuration is done at boot
> time or upon initialising the device (driver), but rarely during the actual
> run time of a system. Injecting and EOI-ing interrupts however happens much
> more often. A good emulation approach should thus focus on tracking the virtual
> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
> interrupts.

I would also say that the architecture for the GIC includes a relatively
high number of corner cases and invariants that may not be violated, and
being completely architecture compliant was our first requirement, the
second requirement was to support efficient interrupt life cycle
management and to be able to quickly tell which (if any) interrupts must
be presented to a virtual CPU.

> 
> The actual interrupt state tracking can be quite tricky in parts. Interrupt
> injections can be independent from the guest entry/exit points, also MMIO
> configuration accesses could be triggered by any VCPU at any point in time.
> Changing interrupt CPU affinity adds to the complication.
> This leads to many code parts which could run in parallel and thus contains
> some race conditions, so proper locking becomes key of a good design.
> But one has to consider that interrupts in general can be characterised
> as a rare event - otherwise a guest would be busy handling interrupts and could

(across all virtual CPUs)

> not process actual computation tasks.
> That's why the interrupt state tracking should focus on a clear and race-free

nit: not sure it makes sense to talk about a race-free locking scheme.
You have locking inherently because you have races; locking just makes
sure that things that race and access data concurrently don't corrupt
state and that the races become benign.

> locking scheme, without needlessly optimising too much in this respect.
> Experience shows that this complicates the code and leads to undetected and
> hard-to-debug race conditions, which affect the stability of the system in
> possibly untested corner cases.

I think experience also shows that the expected performance bottlenecks
really weren't there at all, and any optimization efforts should be
driven by clear measurements of the pain points, falling back to clarity
of implementation and ease of maintenance for all other parts of the
implementation.

> 
> VGIC design principles
> ----------------------
> 
> ### Data structure
> 
> This VGIC design is based on the idea of having one structure per virtual
> interrupt, protected by its own lock. 

Even more high level: This VGIC design was based around having a very
clear data structure design, never duplicating state, and making it
abundantly clear how things are structured.  One way of achieving that
is to have a structure per interrupt, each having its own lock.

> In addition there is a list per VCPU,
> which queues the interrupts which this VCPU should consider for injection.

nit: Should you introduce the AP list name here, and say that it's protected
by the VCPU lock?

> One interrupt can only be on one VCPU list at any given point in time.

nit, wording: Any interrupt can be on at most one AP list at any point
in time.

> For private interrupts and SPIs a static allocation of this data structure

nit: PPIs and SPIs (or private and shared interrupts)

> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
> dynamic and possibly very sparse allocation scheme, so we need to deal with
> dynamic allocation and de-allocation of this struct. To accommodate this
> there is an additional list header to link all LPIs.
> Also the LPI mapping and unmapping can happen asynchronously, so we need to

asynchronously to what?

> properly ref-count the structure (at least for LPIs), otherwise some code parts

nit, wording: reference count.

> would potentially end up with referencing an already freed pointer.

It's not only that, it's that you need to know when to free things.
This is the basic idea of reference counting which I don't think you
need to argue for in this document.

> 
> The central data structure is called `struct vgic_irq`, and, beside the
> expected interrupt configuration data, contains at least the lock, a list
> header (to be able to link it to a VCPU) and a refcount. Also it contains
> the interrupt number (to accommodate for non-contiguous interrupt allocations,
> for instance for LPIs).
> Beside those essential elements it proves worth to store (a reference to) the
> VCPU this IRQ is associated with. This allows to easily find the respective
> VCPU list.
> 
>     struct vgic_irq {
>         spinlock_t irq_lock;            /* Protects the content of the struct */
>         struct list_head lpi_list;      /* Used to link all LPIs together */
>         struct list_head ap_list;
> 
>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>                                          * SPIs and LPIs: The VCPU whose ap_list
>                                          * this is queued on.
>                                          */
> 
>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
>                                           * be sent to, as a result of the
>                                           * targets reg (v2) or the
>                                           * affinity reg (v3).
>                                           */
> 
>         u32 intid;                      /* Guest visible INTID */
>         bool line_level;                /* Level only */
>         bool pending_latch;             /* The pending latch state used to
>                                          * calculate the pending state for
>                                          * both level and edge triggered IRQs.
>                                          */
> 
>         bool active;                    /* not used for LPIs */
>         bool enabled;
>         bool hw;                        /* Tied to HW IRQ */
>         struct kref refcount;           /* Used for LPIs */
>         u32 hwintid;                    /* HW INTID number */
>         union {
>             u8 targets;                     /* GICv2 target VCPUs mask */
>             u32 mpidr;                      /* GICv3 target VCPU */
>         };
>         u8 source;                      /* GICv2 SGIs only */
>         u8 priority;
>         enum vgic_irq_config config;    /* Level or edge */
>     };
> 
> ### VCPU list handling
> 
> Initially a virtual interrupt just lives on its own. 

not sure what this means, see if you can clarify by being more concrete.

> Guest MMIO accesses to
> the distributor will change the state information in this structure.
> When an interrupt is actually made pending (either by an associated hardware
> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be

I think the distinction of what causes an interrupt to be fired should
be reworked in the document.  The important bit is that the VGIC has a 
virtual interrupt input line, which can be raised and lowered, which the
hypervisor can use to signal virtual interrupts.  These may or may not
be tied to a physical interrupt, and they may therefore be marked as
hw=true or hw=false, respectively.

> linked to the current target VCPU. The `vcpu` member in the structure will
> be set to this VCPU. Any affinity change after this point will not affect
> the current target VCPU anymore, it just updates the `target_vpu` field in
> the structure, which will be considered on the next injection.

I think this description is a little vague.  There are clear semantics
associated with these two fields:

   vcpu: The VCPU whose ap_list this interrupt is queued on (which
         happens to be immutable for SGIs and PPIs)

   target_vcpu: For SPIs and LPIs, the configured target VCPU for an
                interrupt.

Once this is clearly defined, there are some rules in terms of when the
vcpu field can be changed; when queing a virtual interrupt for delivery
(because it's pending and/or active), the vcpu field field points to the
VCPU on which it is queud.  The target_vcpu field simply records the
configuration, and can be changed by the hypervisor or the VM itself at
any time, but only the VCPU on whose AP list the virtual interrupt is
already queued, can change a non-NULL vcpu field to NULL or to a
different value, i.e. migrate the virtual interrupt.

This is a requirement to ensure correct functionality; once you present
an active interrupt to a VCPU, you cannot take it away behind its back,
but you have to wait until the VCPU deactivates the interrupt.

> This per-VCPU list is called the `ap_list`, since it holds interrupts which
> are in a pending and/or active state.
> 
> ### Virtual IRQ references
> 
> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
> given its number.
> For private IRQs and SPIs it is expected that this just indexes a static array.
> For LPIs (which are dynamically allocated at run time) this is expected to
> iterate a data structure (like a linked list) to find the right structure.

(or allocate one?)

> In any case a call to `vgic_get_irq` will increase a refcount, which will
> prevent LPIs from being de-allocated while another part of the VGIC is still
> holding a reference. Thus any caller to `vgic_get_irq` shall call
> `vgic_put_irq()` after it is done with handling this interrupt.

Isn't this refcounting 101?  I assume it's already used in Xen and the
rationale could be skipped here in the interest of focus.

> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
> this case the VCPU holds that reference and it is kept as long as the guest
> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
> been EOIed by the guest and it having been removed from the VCPU list.

Again, this seems to just explain an example of one of the references.
Reference counting works by counting references, freeing the resource
when the reference count reaches zero.  That's about it.

> 
> ### Locking
> 
> To keep the `vgic_irq` structure consistent and to avoid races between
> different parts of the VGIC, locking is essential whenever accessing a member

nit: again a race cannot be avoided completely, but they can be made
benign...

> of this structure. It is expected that this lock is almost never contended,
> also held only for brief periods of time, so this is considered cheap.
> To keep the code clean and avoid nasty corner cases, there are no tricks on
> trying to be lockless here.
> If for any reason the code needs to hold the locks for two virtual IRQs, the
> one with the lower IRQ number is to be taken first, to avoid deadlocks.
> 
> Another lock to consider is the VCPU lock, which on the first glance protects
> the virtual CPU's list structure, but also synchronises additions and removals
> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
> lock need to be held. To avoid deadlocks, there is a strict locking order:
> 
> > The VCPU lock needs to be taken first, the per-IRQ lock after this.
> 
> Some operations (like migrating IRQs between two VCPUs) require two VCPU
> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
> is to be taken first.
> 
> There are occasions where the locking order (VCPU first) is hard to observe,
> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
> list. In this case the IRQ lock needs to be dropped, the respective VCPU
> lock should be taken, then the per-IRQ lock needs to be re-taken.
> After both the locks are held, we need to check if the conditions which
> originally mandated the list addition (or removal) are still true. This is
> needed because the IRQ lock could have been taken by another entity meanwhile
> and the state of this interrupt could have been changed. Examples are if the
> interrupt is no longer pending, got disabled or changed the CPU affinity.
> Some of those changes might render to current action obsolete (no longer
> pending), other will lead to a retry of the re-locking scheme described above.
> This re-locking scheme shall be implemented in a well-documented function.

Do we have this documentation on the KVM side that you could link to
here for people to have an understanding of how this can be explained?
It's not that bad when you look at it really.

> 
> ### Level and edge triggered interrupts
> 
> The GIC knows about two kinds of signalling interrupts:
> 
> - Edge triggered interrupts are triggered by a device once, their life cycle
> ends when the guest has EOIed them, at which point we remove the pending state,
> clear the LR and return the `vgic_irq` structure to a quiescent state.

For non-HW interrupts, you have the added potential complexity of
PENDING+ACTIVE.

> 
> - Level triggered interrupts are triggered when a device raises its interrupt
> line, they stay pending as long as this line is held high. At some point the
> driver in the guest is expected to program the device to explicitly or
> implicitly lower this interrupt line. That means that we have to store the
> state of the virtual interrupt line, which is only controlled by the (virtual)
> device. This is done in the `line_level` member of `struct vgic_irq`.
> 
> To assert the interrupt condition, a (virtual) device calls a function exported
> by the VGIC, which allows to raise or lower an interrupt line. Lowering the
> line for an edge triggered IRQ is ignored (and so is optional). Raising the
> line asserts the pending state and potentially injects this virtual IRQ. Any
> subsequent "raising" call might inject another IRQ, if the previous has at
> least been activated by the guest already, otherwise is ignored.
> 
> For level triggered interrupts this function stores the new state into the
> `line_level` variable, potentially injecting the interrupt if that line
> changes from false to true. If the line is lowered before the guest has
> seen it, this particular interrupt instance will be discarded. Successive
> "raising" calls will not lead to multiple interrupts if the line has not
> been lowered in between.

This is confusing:  Lowering or raising the line for a level triggered
interrupt doesn't make any difference.  The point is that as long as the
line is high, if you deactivate that interrupt, a new interrupt will hit
immediately again, unless the line has been lowered in the meantime.

> 
> ### Software triggered interrupts
> 
> Beside the naturally software triggered inter-processor-interrupts
> (SGIs in GIC speak), there is another way of letting software raise an
> interrupt condition.

These three lines appear to belong to the heading...

> The GIC distributor allows to set or clear both the pending and active state
> of any interrupt via MMIO registers. This isn't widely used by many operating
> systems, but is useful when saving and restoring the state of a machine.
> So emulating these functions is required for being architecture compliant,
> however the implementation might not need to be very efficient given its rare
> usage. In fact supporting the set-pending and clear-pending registers is
> relatively straight-forward, as long as one keeps this state separate from
> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
> 
> The set-active and clear-active registers are much harder to emulate, though,
> as normally the active state is of little concern to the GIC emulation. In
> a normal interrupt life cycle the active state isn't even visible to the
> hypervisor, as it might be set and cleared again entirely within the guest
> in the list register, without exiting to the hypervisor.
> So manipulating the active state via the MMIO registers requires some heavy
> lifting: If this interrupt is currently injected into a running VCPU, this
> VCPU must exit, the active state must be set or cleared in the LR, then
> execution can continue. While this is expensive, as mentioned above this
> should not happen too often, also probably the system isn't very performance
> sensitive when using this feature for save and restore anyway.

These two paragraphs not so much, they seem to belong to MMIO emulation,
and should probably follow the paragraph below.

> 
> ### MMIO emulation
> 
> As mentioned before, the distributor and redistributor part of the VGIC needs
> to be fully emulated. Those parts are characterised by a range of MMIO
> registers. The implementation shall provide a dispatcher function, which
> takes the faulted address, relative to the beginning of the MMIO range, and
> works out which actual register is affected. It then looks up the the
> respective handler function and calls it. Those functions are expected to
> be listed in a struct initialiser, which connects the actual register
> offset and its size to a particular handler. Having handler functions for
> a register range seems beneficial over handling registers in a switch/case,
> because it's easier to read and simplifies code sharing, for instance
> between the GICv2, GICv3 distributor and GICv3 redistributor registers
> with the same semantics.
> 
> ### List register management
> 
> A list register (LR) holds the state of a virtual interrupt, which will
> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
> Each GIC hardware implementation can choose to implement a number of LRs,
> having four of them seems to be a common value. This design here does not
> try to manage the LRs very cleverly, instead on every guest exit every LR
> in use will be synced to the emulated state, then cleared. 

In fact I think we came up with counter-examples for every model of
being clever with not reading back the LRs, because you simply have to
observe any change in state that happened in hardware while the guest is
running, to be able to properly emulate compliant functionality of being
able to inject interrupts or not.

> Upon guest entry
> the top priority virtual IRQs will be inserted into the LRs. If there are
> more pending or active IRQs than list registers, the GIC management IRQ
> will be configured to notify the hypervisor of a free LR (once the guest
> has EOIed one IRQ). This will trigger a normal exit, which will go through
> the normal cleanup/repopulate scheme, possibly now queuing the leftover
> interrupt(s).
> To facilitate quick guest exit and entry times, the VGIC maintains the list
> of pending or active interrupts (ap\_list) sorted by their priority. Active
> interrupts always go first on the list, since a guest and the hardware GIC
> expect those to stay until they have been explicitly deactivated. Failure
> in keeping active IRQs around will result in error conditions in the GIC.
> The second sort criteria for the ap\_list is their priority, so higher
> priority pending interrupt always go first into the LRs.


Otherwise, as I said, this is a really nice writeup.

Thanks,
-Christoffer


* Re: [RFC] ARM: New (Xen) VGIC design document
From: Stefano Stabellini @ 2017-11-01  1:58 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Stefano Stabellini, Marc Zyngier, Eric Auger, Julien Grall,
	xen-devel, Christoffer Dall

On Wed, 11 Oct 2017, Andre Przywara wrote:
> Hi,
> 
> (CC:ing some KVM/ARM folks involved in the VGIC)
> 
> starting with the addition of the ITS support we were seeing more and
> more issues with the current implementation of our ARM Generic Interrupt
> Controller (GIC) emulation, the VGIC.
> Among other approaches to fix those issues it was proposed to copy the
> VGIC emulation used in KVM. This one was suffering from very similar
> issues, and a clean design from scratch lead to a very robust and
> capable re-implementation. Interestingly this implementation is fairly
> self-contained, so it seems feasible to copy it. Hopefully we only need
> minor adjustments, possibly we can even copy it verbatim with some
> additional glue layer code.
>
> Stefano asked for getting a design overview, to assess the feasibility
> of copying the KVM code without reviewing tons of code in the first
> place.
> So to follow Xen rules for new features, this design document below is
> an attempt to describe the current KVM VGIC design - in a hypervisor
> agnostic session. It is a bit of a retro-fit design description, as it
> is not strictly forward-looking only, but actually describing the
> existing implemenation [1].
> 
> Please have a look and let me know:
> 1) if this document has the right scope
> 2) if this document has the right level of detail
> 3) if there are points missing from the document
> 3) if the design in general is a fit

Please read the following statements as genuine questions and concerns.
Most ideas in this document are good. Some of them I have even suggested
myself in the context of GIC improvements for Xen. I asked for a
couple of clarifications.

But I don't see why we cannot implement these ideas on top of the
existing code, rather than with a separate codebase, ending up with two
drivers. I would prefer a natural evolution. Specifically, the following
improvements would be simple and would give us most of the benefits on
top of the current codebase:
- adding the irq lock, and the refcount
- taking both vcpu locks when necessary (on migration code for example
  it would help a lot), the lower vcpu_id first
- level irq emulation


If we do end up with a second separate driver for technical or process
reasons, I would expect the regular Xen submission/review process to be
followed. The code style will be different, the hooks into the rest of
the hypervisor will be different and things will be generally changed.
The new V/GIC might be derived from KVM, but it should end up looking
and feeling like a 100% genuine Xen component. After all, we'll
maintain it going forward. I don't want a copy of a Linux driver with
glue code. The Xen community cannot be expected not to review the
submission, but if we review it, then we'll ask for changes. Once we
change the code, there will be no point in keeping the Linux code
separate with glue code. We should fully adapt it to Xen.

That is what was done in the past when KVM took code from Xen (for
example async shadow pagetables). I am eager to avoid a situation like
the current SMMU driver in Xen, which comes from Linux, and we are not
entirely sure how to maintain it.


> Appreciate any feedback!
> 
> Cheers,
> Andre.
> 
> ---------------------------------------
> 
> VGIC design
> ===========
> 
> This document describes the design of an ARM Generic Interrupt Controller (GIC)
> emulation. It is meant to emulate a GIC for a guest in an virtual machine,
> the common name for that is VGIC (from "virtual GIC").
> 
> This design was the result of a one-week-long design session with some
> engineers in a room, triggered by ever-increasing difficulties in maintaining
> the existing GIC emulation in the KVM hypervisor. The design eventually
> materialised as an alternative VGIC implementation in the Linux kernel
> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
> was removed, so it is now the current code used by Linux.
> Although being used in KVM, the actual design of this VGIC is rather hypervisor
> agnostic and can be used by other hypervisors as well, in particular for Xen.
> 
> GIC hardware virtualization support
> -----------------------------------
> 
> The ARM Generic Interrupt Controller (since v2) supports the virtualization
> extensions, which allows some parts of the interrupt life cycle to be handled
> purely inside the guest without exiting into the hypervisor.
> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
> acknowledgement", "priority drop" and "interrupt deactivate" actions.
> So a guest can handle most of the interrupt processing code without
> leaving EL1 and trapping into the hypervisor. To accomplish
> this, the GIC holds so called "list registers" (LRs), which shadow the
> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
> involves setting up one LR with the interrupt number, its priority and initial
> state (mostly "pending"), then entering the guest. Any EOI related action
> from within the guest just acts on those LRs, the hypervisor can later update
> the virtual interrupt state when the guest exists the next time (for whatever
> reason).
> But despite the GIC hardware helping out here, the whole interrupt
> configuration management is not virtualized at all and needs to be emulated
> by the hypervisor - or another related software component, for instance a
> userland emulator. This so called "distributor" part of the GIC consists of
> memory mapped registers, which can be trapped by the hypervisor, so any guest
> access can be emulated in the usual way.
> 
> VGIC design motivation
> ----------------------
> 
> A GIC emulation thus needs to take care of those bits:
> 
> - trap GIC distributor MMIO accesses and shadow the configuration setup
>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
> - handle incoming hardware and virtual interrupt requests and inject the
>   associated virtual interrupt by manipulating one of the list registers
> - track the state of a virtual interrupt by inspecting the LRs after the
>   guest has exited, possibly adjusting the shadowed virtual interrupt state
> 
> Despite the distributor MMIO register emulation being a sizeable chunk of
> the emulation, it is actually not dominant if looking at the frequency at
> which it is accessed. Normally the interrupt configuration is done at boot
> time or upon initialising the device (driver), but rarely during the actual
> run time of a system. Injecting and EOI-ing interrupts however happens much
> more often. A good emulation approach should thus focus on tracking the virtual
> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
> interrupts.
> 
> The actual interrupt state tracking can be quite tricky in parts. Interrupt
> injections can be independent from the guest entry/exit points, also MMIO
> configuration accesses could be triggered by any VCPU at any point in time.
> Changing interrupt CPU affinity adds to the complication.
> This leads to many code parts which could run in parallel and thus contains
> some race conditions, so proper locking becomes key of a good design.
> But one has to consider that interrupts in general can be characterised
> as a rare event - otherwise a guest would be busy handling interrupts and could
> not process actual computation tasks.
> That's why the interrupt state tracking should focus on a clear and race-free
> locking scheme, without needlessly optimising too much in this respect.
> Experience shows that this complicates the code and leads to undetected and
> hard-to-debug race conditions, which affect the stability of the system in
> possibly untested corner cases.
> 
> VGIC design principles
> ----------------------
> 
> ### Data structure
> 
> This VGIC design is based on the idea of having one structure per virtual
> interrupt, protected by its own lock. In addition there is a list per VCPU,
> which queues the interrupts which this VCPU should consider for injection.
> One interrupt can only be on one VCPU list at any given point in time.
> For private interrupts and SPIs a static allocation of this data structure
> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
> dynamic and possibly very sparse allocation scheme, so we need to deal with
> dynamic allocation and de-allocation of this struct. To accommodate this
> there is an additional list header to link all LPIs.
> Also the LPI mapping and unmapping can happen asynchronously, so we need to
> properly ref-count the structure (at least for LPIs), otherwise some code parts
> would potentially end up with referencing an already freed pointer.
> 
> The central data structure is called `struct vgic_irq`, and, beside the
> expected interrupt configuration data, contains at least the lock, a list
> header (to be able to link it to a VCPU) and a refcount. Also it contains
> the interrupt number (to accommodate for non-contiguous interrupt allocations,
> for instance for LPIs).
> Beside those essential elements it proves worth to store (a reference to) the
> VCPU this IRQ is associated with. This allows to easily find the respective
> VCPU list.
> 
>     struct vgic_irq {
>         spinlock_t irq_lock;            /* Protects the content of the struct */
>         struct list_head lpi_list;      /* Used to link all LPIs together */
>         struct list_head ap_list;
> 
>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>                                          * SPIs and LPIs: The VCPU whose ap_list
>                                          * this is queued on.
>                                          */
> 
>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
>                                           * be sent to, as a result of the
>                                           * targets reg (v2) or the
>                                           * affinity reg (v3).
>                                           */
> 
>         u32 intid;                      /* Guest visible INTID */
>         bool line_level;                /* Level only */
>         bool pending_latch;             /* The pending latch state used to
>                                          * calculate the pending state for
>                                          * both level and edge triggered IRQs.
>                                          */
> 
>         bool active;                    /* not used for LPIs */
>         bool enabled;
>         bool hw;                        /* Tied to HW IRQ */
>         struct kref refcount;           /* Used for LPIs */
>         u32 hwintid;                    /* HW INTID number */
>         union {
>             u8 targets;                     /* GICv2 target VCPUs mask */
>             u32 mpidr;                      /* GICv3 target VCPU */
>         };
>         u8 source;                      /* GICv2 SGIs only */
>         u8 priority;
>         enum vgic_irq_config config;    /* Level or edge */
>     };

The refcount and irq_lock are good ideas, let's have them.


> ### VCPU list handling
> 
> Initially a virtual interrupt just lives on its own. Guest MMIO accesses to
> the distributor will change the state information in this structure.
> When an interrupt is actually made pending (either by an associated hardware
> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
> linked to the current target VCPU. The `vcpu` member in the structure will
> be set to this VCPU. Any affinity change after this point will not affect
> the current target VCPU anymore, it just updates the `target_vcpu` field in
> the structure, which will be considered on the next injection.
> This per-VCPU list is called the `ap_list`, since it holds interrupts which
> are in a pending and/or active state.

The two vcpu lists sound like a good idea too, and Christoffer's
explanation helped. It is actually similar to what we do in Xen already.
I guess a vgic is always a vgic :-)

What happens when the irq is migrated while still in an LR on another
pcpu? When/How is the physical affinity changed?

What happens when a new irq is supposed to be injected when target_vcpu is
already set? Does target_vcpu simply get overwritten?

What happens when a vcpu is migrated from pcpu1 to pcpu2?



> ### Virtual IRQ references
> 
> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
> given its number.
> For private IRQs and SPIs it is expected that this just indexes a static array.
> For LPIs (which are dynamically allocated at run time) this is expected to
> iterate a data structure (like a linked list) to find the right structure.
> In any case a call to `vgic_get_irq` will increase a refcount, which will
> prevent LPIs from being de-allocated while another part of the VGIC is still
> holding a reference. Thus any caller to `vgic_get_irq` shall call
> `vgic_put_irq()` after it is done with handling this interrupt.
> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
> this case the VCPU holds that reference and it is kept as long as the guest
> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
> been EOIed by the guest and it having been removed from the VCPU list.

I understand the idea behind a refcount, and it sounds like a good thing to
have.

Let me ask you a couple of questions. How does it help with the issue
that an LPI could be discarded and remapped (MAPTI) from another
pcpu while it could still be in an LR? What happens if the MAPTI is
issued before, and what happens if it is issued after the irq has been
EOIed and cleared from the LR and ap_list?

I am referring to the case that we currently handling with the
GIC_IRQ_GUEST_PRISTINE_LPI flag in Xen.


> ### Locking
> 
> To keep the `vgic_irq` structure consistent and to avoid races between
> different parts of the VGIC, locking is essential whenever accessing a member
> of this structure. It is expected that this lock is almost never contended,
> and held only for brief periods of time, so this is considered cheap.
> To keep the code clean and avoid nasty corner cases, there are no tricks on
> trying to be lockless here.
> If for any reason the code needs to hold the locks for two virtual IRQs, the
> one with the lower IRQ number is to be taken first, to avoid deadlocks.
> 
> Another lock to consider is the VCPU lock, which at first glance protects
> the virtual CPU's list structure, but also synchronises additions and removals
> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
> lock need to be held. To avoid deadlocks, there is a strict locking order:
> > The VCPU lock needs to be taken first, the per-IRQ lock after this.

Sounds good (it is basically what I suggested to do in the past).
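
For illustration, a minimal sketch of the two ordering rules quoted above
(the helper names and the per-VCPU `ap_list_lock` field are assumptions, not
taken from the actual KVM or Xen code):

    /*
     * Sketch only: take two per-IRQ locks in a fixed order (lower INTID
     * first), and always take a VCPU's list lock before a per-IRQ lock.
     */
    static void vgic_lock_two_irqs(struct vgic_irq *a, struct vgic_irq *b)
    {
        struct vgic_irq *first = (a->intid < b->intid) ? a : b;
        struct vgic_irq *second = (first == a) ? b : a;

        spin_lock(&first->irq_lock);
        spin_lock(&second->irq_lock);
    }

    static void vgic_lock_vcpu_then_irq(struct vcpu *vcpu, struct vgic_irq *irq)
    {
        spin_lock(&vcpu->arch.vgic.ap_list_lock);   /* VCPU lock first, ... */
        spin_lock(&irq->irq_lock);                  /* ... per-IRQ lock second */
    }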


> Some operations (like migrating IRQs between two VCPUs) require two VCPU
> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
> is to be taken first.
> 
> There are occasions where the locking order (VCPU first) is hard to observe,
> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
> list. In this case the IRQ lock needs to be dropped, the respective VCPU
> lock should be taken, then the per-IRQ lock needs to be re-taken.
> After both the locks are held, we need to check if the conditions which
> originally mandated the list addition (or removal) are still true. This is
> needed because the IRQ lock could have been taken by another entity meanwhile
> and the state of this interrupt could have been changed. Examples are if the
> interrupt is no longer pending, got disabled or changed the CPU affinity.
> Some of those changes might render the current action obsolete (no longer
> pending), others will lead to a retry of the re-locking scheme described above.
> This re-locking scheme shall be implemented in a well-documented function.
> 
> ### Level and edge triggered interrupts
> 
> The GIC knows about two kinds of signalling interrupts:
> 
> - Edge triggered interrupts are triggered by a device once, their life cycle
> ends when the guest has EOIed them, at which point we remove the pending state,
> clear the LR and return the `vgic_irq` structure to a quiescent state.

I assume that "at which point" means at the next trap into the
hypervisor? We are not trapping guest EOIs on purpose, are we?

Is it possible to have active and pending irqs in an LR? How is that
handled?


> - Level triggered interrupts are triggered when a device raises its interrupt
> line, they stay pending as long as this line is held high. At some point the
> driver in the guest is expected to program the device to explicitly or
> implicitly lower this interrupt line. That means that we have to store the
> state of the virtual interrupt line, which is only controlled by the (virtual)
> device. This is done in the `line_level` member of `struct vgic_irq`.
> 
> To assert the interrupt condition, a (virtual) device calls a function exported
> by the VGIC, which allows raising or lowering an interrupt line. Lowering the
> line for an edge triggered IRQ is ignored (and so is optional). Raising the
> line asserts the pending state and potentially injects this virtual IRQ. Any
> subsequent "raising" call might inject another IRQ if the previous one has at
> least been activated by the guest already; otherwise it is ignored.

The irq becomes active and pending in the LR?


> For level triggered interrupts this function stores the new state into the
> `line_level` variable, potentially injecting the interrupt if that line
> changes from false to true. If the line is lowered before the guest has
> seen it, this particular interrupt instance will be discarded. Successive
> "raising" calls will not lead to multiple interrupts if the line has not
> been lowered in between.

This is something Xen needs too.
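
For reference, the raise/lower interface quoted above could look roughly like
the sketch below; `VGIC_CONFIG_LEVEL` and `vgic_queue_irq()` are assumed names,
and the queue helper would have to observe the VCPU-before-IRQ locking order
described earlier:

    /* Sketch only: a (virtual) device raises or lowers its interrupt line. */
    void vgic_update_irq_line(struct vgic_irq *irq, bool level)
    {
        spin_lock(&irq->irq_lock);

        if (irq->config == VGIC_CONFIG_LEVEL) {
            bool was_high = irq->line_level;

            irq->line_level = level;
            if (level && !was_high)         /* only a 0 -> 1 transition injects */
                vgic_queue_irq(irq);
        } else if (level) {                 /* lowering an edge IRQ is a no-op */
            irq->pending_latch = true;
            vgic_queue_irq(irq);
        }

        spin_unlock(&irq->irq_lock);
    }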


> ### Software triggered interrupts
> 
> Beside the naturally software triggered inter-processor-interrupts
> (SGIs in GIC speak), there is another way of letting software raise an
> interrupt condition.
> The GIC distributor allows setting or clearing both the pending and active state
> of any interrupt via MMIO registers. This isn't widely used by many operating
> systems, but is useful when saving and restoring the state of a machine.
> So emulating these functions is required for being architecture compliant,
> however the implementation might not need to be very efficient given its rare
> usage. In fact supporting the set-pending and clear-pending registers is
> relatively straight-forward, as long as one keeps this state separate from
> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
> 
> The set-active and clear-active registers are much harder to emulate, though,
> as normally the active state is of little concern to the GIC emulation. In
> a normal interrupt life cycle the active state isn't even visible to the
> hypervisor, as it might be set and cleared again entirely within the guest
> in the list register, without exiting to the hypervisor.
> So manipulating the active state via the MMIO registers requires some heavy
> lifting: If this interrupt is currently injected into a running VCPU, this
> VCPU must exit, the active state must be set or cleared in the LR, then
> execution can continue. While this is expensive, as mentioned above it
> should not happen too often, and the system probably isn't very performance
> sensitive when using this feature for save and restore anyway.

set-active and clear-active registers are not emulated in Xen today, it
would be nice to have them.

How does the locking/synchronization work in the case given that the
vCPU that needs to exit could be running on a different pCPU?


> ### MMIO emulation
> 
> As mentioned before, the distributor and redistributor part of the VGIC needs
> to be fully emulated. Those parts are characterised by a range of MMIO
> registers. The implementation shall provide a dispatcher function, which
> takes the faulted address, relative to the beginning of the MMIO range, and
> works out which actual register is affected. It then looks up the
> respective handler function and calls it. Those functions are expected to
> be listed in a struct initialiser, which connects the actual register
> offset and its size to a particular handler. Having handler functions for
> a register range seems beneficial over handling registers in a switch/case,
> because it's easier to read and simplifies code sharing, for instance
> between the GICv2, GICv3 distributor and GICv3 redistributor registers
> with the same semantics.

I am happy to replace the whole MMIO emulation bit.


> ### List register management
> 
> A list register (LR) holds the state of a virtual interrupt, which will
> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
> Each GIC hardware implementation can choose to implement a number of LRs,
> having four of them seems to be a common value. This design here does not
> try to manage the LRs very cleverly, instead on every guest exit every LR
> in use will be synced to the emulated state, then cleared. Upon guest entry
> the top priority virtual IRQs will be inserted into the LRs. If there are
> more pending or active IRQs than list registers, the GIC management IRQ
> will be configured to notify the hypervisor of a free LR (once the guest
> has EOIed one IRQ). This will trigger a normal exit, which will go through
> the normal cleanup/repopulate scheme, possibly now queuing the leftover
> interrupt(s).
> To facilitate quick guest exit and entry times, the VGIC maintains the list
> of pending or active interrupts (ap\_list) sorted by their priority. Active
> interrupts always go first on the list, since a guest and the hardware GIC
> expect those to stay until they have been explicitly deactivated. Failure
> in keeping active IRQs around will result in error conditions in the GIC.
> The second sort criterion for the ap\_list is the interrupt priority, so higher
> priority pending interrupts always go first into the LRs.

The suggestion of using this model in Xen was made in the past already.
I always objected for the reason that we don't actually know how many
LRs the hardware provides, potentially very many, and it is expensive
and needless to read/write them all every time on entry/exit.

I would prefer to avoid that, but I'll be honest: I can be convinced
that that model of handling LRs is so much simpler that it is worth it.
I am more concerned about the future maintenance of a separate new
driver developed elsewhere.


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01  1:58 ` Stefano Stabellini
@ 2017-11-01  4:31   ` Christoffer Dall
  2017-11-01  9:15     ` Andre Przywara
  2017-11-01 14:30   ` Andre Przywara
  1 sibling, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2017-11-01  4:31 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Marc Zyngier, Andre Przywara, Julien Grall, Eric Auger, xen-devel

On Wed, Nov 1, 2017 at 9:58 AM, Stefano Stabellini
<sstabellini@kernel.org> wrote:

[....]

>
>> ### List register management
>>
>> A list register (LR) holds the state of a virtual interrupt, which will
>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>> Each GIC hardware implementation can choose to implement a number of LRs,
>> having four of them seems to be a common value. This design here does not
>> try to manage the LRs very cleverly, instead on every guest exit every LR
>> in use will be synced to the emulated state, then cleared. Upon guest entry
>> the top priority virtual IRQs will be inserted into the LRs. If there are
>> more pending or active IRQs than list registers, the GIC management IRQ
>> will be configured to notify the hypervisor of a free LR (once the guest
>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>> interrupt(s).
>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>> interrupts always go first on the list, since a guest and the hardware GIC
>> expect those to stay until they have been explicitly deactivated. Failure
>> in keeping active IRQs around will result in error conditions in the GIC.
>> The second sort criteria for the ap\_list is their priority, so higher
>> priority pending interrupt always go first into the LRs.
>
> The suggestion of using this model in Xen was made in the past already.
> I always objected for the reason that we don't actually know how many
> LRs the hardware provides, potentially very many, and it is expensive
> and needless to read/write them all every time on entry/exit.
>
> I would prefer to avoid that, but I'll be honest: I can be convinced
> that that model of handling LRs is so much simpler that it is worth it.
> I am more concerned about the future maintainance of a separate new
> driver developed elsewhere.

[Having just spent a fair amount of time optimizing KVM/ARM and
measuring GIC interaction, I'll comment on this and leave it up to
Andre to drive the rest of the discussion].

In KVM we currently only ever touch an LR when we absolutely have to.
For example, if there are no interrupts, we do not touch an LR.

When you do have an interrupt in flight, and have programmed one or
more LRs, you have to either read back that LR, or read one of the
status registers to figure out if the interrupt has become inactive
(and should potentially be injected again).  I measured both on KVM
for various workloads and it was faster to never read the status
registers, but simply read back the LRs that were in use when entering
the guest.

You can potentially micro-optimize slightly by remembering the exit
value of an LR (and not clearing it on guest exit), but you have to
pay the cost in terms of additional logic during VCPU migration and
when you enter a VM again, maintaining a mapping of the LR and the
virtual state, to avoid rewriting the same value to the LR again.  We
tried that in KVM and could not measure any benefit using either a
pinned or oversubscribed workload; I speculate that the number of
times you exit with unprocessed interrupts in the LRs is extremely
rare.

In terms of the number of LRs, I still haven't seen an implementation
with anything other than 4 LRs.

-Christoffer


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01  4:31   ` Christoffer Dall
@ 2017-11-01  9:15     ` Andre Przywara
  2017-11-02  7:38       ` Christoffer Dall
  0 siblings, 1 reply; 13+ messages in thread
From: Andre Przywara @ 2017-11-01  9:15 UTC (permalink / raw)
  To: Christoffer Dall, Stefano Stabellini
  Cc: Marc Zyngier, xen-devel, Julien Grall, Eric Auger

Hi,

On 01/11/17 04:31, Christoffer Dall wrote:
> On Wed, Nov 1, 2017 at 9:58 AM, Stefano Stabellini
> <sstabellini@kernel.org> wrote:
> 
> [....]

Christoffer, many thanks for answering this!
I think we have a lot of assumptions about the whole VGIC life cycle
floating around, but it would indeed be good to get some numbers behind it.
I would be all too happy to trace some workloads on Xen again and
get some metrics, though this sounds time-consuming if done properly.

Do you have any numbers on VGIC performance available somewhere?

....

>>> ### List register management
>>>
>>> A list register (LR) holds the state of a virtual interrupt, which will
>>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>>> Each GIC hardware implementation can choose to implement a number of LRs,
>>> having four of them seems to be a common value. This design here does not
>>> try to manage the LRs very cleverly, instead on every guest exit every LR
>>> in use will be synced to the emulated state, then cleared. Upon guest entry
>>> the top priority virtual IRQs will be inserted into the LRs. If there are
>>> more pending or active IRQs than list registers, the GIC management IRQ
>>> will be configured to notify the hypervisor of a free LR (once the guest
>>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>>> interrupt(s).
>>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>>> interrupts always go first on the list, since a guest and the hardware GIC
>>> expect those to stay until they have been explicitly deactivated. Failure
>>> in keeping active IRQs around will result in error conditions in the GIC.
>>> The second sort criteria for the ap\_list is their priority, so higher
>>> priority pending interrupt always go first into the LRs.
>>
>> The suggestion of using this model in Xen was made in the past already.
>> I always objected for the reason that we don't actually know how many
>> LRs the hardware provides, potentially very many, and it is expensive
>> and needless to read/write them all every time on entry/exit.
>>
>> I would prefer to avoid that, but I'll be honest: I can be convinced
>> that that model of handling LRs is so much simpler that it is worth it.
>> I am more concerned about the future maintainance of a separate new
>> driver developed elsewhere.
> 
> [Having just spent a fair amount of time optimizing KVM/ARM and
> measuring GIC interaction, I'll comment on this and leave it up to
> Andre to drive the rest of the discussion].
> 
> In KVM we currently only ever touch an LR when we absolutely have to.
> For example, if there are no interrupts, we do not touch an LR.

Yes, I think this is a key point. We only touch LRs that we need to
touch: On guest entry we iterate our per-VCPU list of pending IRQs
(the ap_list, which could be empty!), and store that number in a variable.
On exit we just sync back the first <n> LRs.
I think the code in KVM explains it quite well:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/vgic/vgic.c#n677
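
Roughly, and with made-up helper and field names (this is a sketch, not a
verbatim copy of the KVM functions linked above), the idea is:

    /* Sketch: write the head of the sorted ap_list into LRs on entry and
     * remember how many were used; on exit only those LRs are read back. */
    static void vgic_populate_lrs(struct vcpu *vcpu)
    {
        struct vgic_irq *irq;
        unsigned int count = 0;

        list_for_each_entry(irq, &vcpu->arch.vgic.ap_list_head, ap_list) {
            if (count >= gic_get_nr_lrs())      /* leftovers stay on the list */
                break;
            gic_write_lr(count++, vgic_irq_to_lr(irq));
        }
        vcpu->arch.vgic.used_lrs = count;       /* may well be 0 */
    }

    static void vgic_fold_lrs(struct vcpu *vcpu)
    {
        unsigned int i;

        for (i = 0; i < vcpu->arch.vgic.used_lrs; i++) {
            vgic_fold_lr_state(vcpu, gic_read_lr(i));   /* update struct vgic_irq */
            gic_write_lr(i, 0);                         /* clear the LR */
        }
    }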

> When you do have an interrupt in flight, and have programmed one or
> more LRs, you have to either read back that LR, or read one of the
> status registers to figure out if the interrupt has become inactive
> (and should potentially be injected again).  I measured both on KVM
> for various workloads and it was faster to never read the status
> registers, but simply read back the LRs that were in use when entering
> the guest.
> 
> You can potentially micro-optimize slightly by remembering the exit
> value of an LR (and not clearing it on guest exit), but you have to
> pay the cost in terms of additional logic during VCPU migration and
> when you enter a VM again, maintaining a mapping of the LR and the
> virtual state, to avoid rewriting the same value to the LR again.  We
> tried that in KVM and could not measure any benefit using either a
> pinned or oversubscribed workload; I speculate that the number of
> times you exit with unprocessed interrupts in the LRs is extremely
> rare.
> 
> In terms of the number of LRs, I stil haven't seen an implementation
> with anything else than 4 LRs.

Yes, that is what I know of as well. The fast model has 16, but I guess
this doesn't count - though it's useful for testing some code. I can try to
find out the figure for newer hardware.

In the past I traced some workloads and found only a small number of LRs
to be actually used, with 4 or more being extremely rare.

Cheers,
Andre.


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01  1:58 ` Stefano Stabellini
  2017-11-01  4:31   ` Christoffer Dall
@ 2017-11-01 14:30   ` Andre Przywara
  2017-11-01 21:54     ` Stefano Stabellini
  1 sibling, 1 reply; 13+ messages in thread
From: Andre Przywara @ 2017-11-01 14:30 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Marc Zyngier, xen-devel, Julien Grall, Christoffer Dall, Eric Auger

Hi Stefano,


On 01/11/17 01:58, Stefano Stabellini wrote:
> On Wed, 11 Oct 2017, Andre Przywara wrote:

many thanks for going through all of this!

>> (CC:ing some KVM/ARM folks involved in the VGIC)
>>
>> starting with the addition of the ITS support we were seeing more and
>> more issues with the current implementation of our ARM Generic Interrupt
>> Controller (GIC) emulation, the VGIC.
>> Among other approaches to fix those issues it was proposed to copy the
>> VGIC emulation used in KVM. This one was suffering from very similar
>> issues, and a clean design from scratch lead to a very robust and
>> capable re-implementation. Interestingly this implementation is fairly
>> self-contained, so it seems feasible to copy it. Hopefully we only need
>> minor adjustments, possibly we can even copy it verbatim with some
>> additional glue layer code.
>>
>> Stefano asked for getting a design overview, to assess the feasibility
>> of copying the KVM code without reviewing tons of code in the first
>> place.
>> So to follow Xen rules for new features, this design document below is
>> an attempt to describe the current KVM VGIC design - in a hypervisor
>> agnostic session. It is a bit of a retro-fit design description, as it
>> is not strictly forward-looking only, but actually describing the
>> existing implemenation [1].
>>
>> Please have a look and let me know:
>> 1) if this document has the right scope
>> 2) if this document has the right level of detail
>> 3) if there are points missing from the document
>> 3) if the design in general is a fit
> 
> Please read the following statements as genuine questions and concerns.
> Most ideas on this document are good. Some of them I have even suggested
> them myself in the context of GIC improvements for Xen. I asked for a
> couple of clarifications.
> 
> But I don't see why we cannot implement these ideas on top of the
> existing code, rather than with a separate codebase, ending up with two
> drivers. I would prefer a natual evolution. Specifically, the following
> improvements would be simple and would give us most of the benefits on
> top of the current codebase:
> - adding the irq lock, and the refcount
> - taking both vcpu locks when necessary (on migration code for example
>   it would help a lot), the lower vcpu_id first
> - level irq emulation

I think some of those points you mentioned are not easily implemented in
the current Xen. For instance I ran into locking order issues with those
*two* inflight and lr_queue lists, when trying to implement the lock and
the refcount.
Also this "put vIRQs into LRs early, but possibly rip them out again" is
really complicating things a lot.

I believe only level IRQs could be added in a relatively straightforward
manner.

So the problem with the evolutionary approach is that it generates a lot
of patches, some of them quite invasive, others creating hard-to-read
diffs, which are both hard to review.
And chances are that the actual result would be pretty close to the KVM
code. To be clear: I hacked the Xen VGIC into the KVM direction in a few
days some months ago, but it took me *weeks* to make sane patches of
only the first part of it.
And this would not cover all those general, tedious corner cases that
the VGIC comes with. Those would need to be fixed in a painful process,
which we could avoid by "lifting" the KVM code.

> If we do end up with a second separate driver for technical or process
> reasons, I would expect the regular Xen submission/review process to be
> followed. The code style will be different, the hooks into the rest of
> the hypervisors will be different and things will be generally changed.
> The new V/GIC might be derived from KVM, but it should end up looking
> and feeling like a 100% genuine Xen component. After all, we'll
> maintain it going forward. I don't want a copy of a Linux driver with
> glue code. The Xen community cannot be expected not to review the
> submission, but if we review it, then we'll ask for changes. Once we
> change the code, there will be no point in keeping the Linux code
> separate with glue code. We should fully adapt it to Xen.

I see your point, and this actually simplifies *my* work, but I am a bit
worried about the effects of having two separate implementations which
then diverge over time.
At the moment we have two separate implementations as well, but they are
quite different, which has the advantage of doing things differently
enough to help in finding bugs in the other one (something we should
actually exploit in testing, I believe).

So how is your feeling towards some shared "libvgic"? I understand that
people are not too happy about that extra maintenance cost of having a
separate repository, but I am curious what your, Marc's and
Christoffer's take is on this idea.

> That is what was done in the past when KVM took code from Xen (for
> example async shadow pagetables). I am eager to avoid a situation like
> the current SMMU driver in Xen, which comes from Linux, and we are not
> entirely sure how to maintain it.
> 
> 
>> Appreciate any feedback!
>>
>> Cheers,
>> Andre.
>>
>> ---------------------------------------
>>
>> VGIC design
>> ===========
>>
>> This document describes the design of an ARM Generic Interrupt Controller (GIC)
>> emulation. It is meant to emulate a GIC for a guest in an virtual machine,
>> the common name for that is VGIC (from "virtual GIC").
>>
>> This design was the result of a one-week-long design session with some
>> engineers in a room, triggered by ever-increasing difficulties in maintaining
>> the existing GIC emulation in the KVM hypervisor. The design eventually
>> materialised as an alternative VGIC implementation in the Linux kernel
>> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
>> was removed, so it is now the current code used by Linux.
>> Although being used in KVM, the actual design of this VGIC is rather hypervisor
>> agnostic and can be used by other hypervisors as well, in particular for Xen.
>>
>> GIC hardware virtualization support
>> -----------------------------------
>>
>> The ARM Generic Interrupt Controller (since v2) supports the virtualization
>> extensions, which allows some parts of the interrupt life cycle to be handled
>> purely inside the guest without exiting into the hypervisor.
>> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
>> acknowledgement", "priority drop" and "interrupt deactivate" actions.
>> So a guest can handle most of the interrupt processing code without
>> leaving EL1 and trapping into the hypervisor. To accomplish
>> this, the GIC holds so called "list registers" (LRs), which shadow the
>> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
>> involves setting up one LR with the interrupt number, its priority and initial
>> state (mostly "pending"), then entering the guest. Any EOI related action
>> from within the guest just acts on those LRs, the hypervisor can later update
>> the virtual interrupt state when the guest exists the next time (for whatever
>> reason).
>> But despite the GIC hardware helping out here, the whole interrupt
>> configuration management is not virtualized at all and needs to be emulated
>> by the hypervisor - or another related software component, for instance a
>> userland emulator. This so called "distributor" part of the GIC consists of
>> memory mapped registers, which can be trapped by the hypervisor, so any guest
>> access can be emulated in the usual way.
>>
>> VGIC design motivation
>> ----------------------
>>
>> A GIC emulation thus needs to take care of those bits:
>>
>> - trap GIC distributor MMIO accesses and shadow the configuration setup
>>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
>> - handle incoming hardware and virtual interrupt requests and inject the
>>   associated virtual interrupt by manipulating one of the list registers
>> - track the state of a virtual interrupt by inspecting the LRs after the
>>   guest has exited, possibly adjusting the shadowed virtual interrupt state
>>
>> Despite the distributor MMIO register emulation being a sizeable chunk of
>> the emulation, it is actually not dominant if looking at the frequency at
>> which it is accessed. Normally the interrupt configuration is done at boot
>> time or upon initialising the device (driver), but rarely during the actual
>> run time of a system. Injecting and EOI-ing interrupts however happens much
>> more often. A good emulation approach should thus focus on tracking the virtual
>> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
>> interrupts.
>>
>> The actual interrupt state tracking can be quite tricky in parts. Interrupt
>> injections can be independent from the guest entry/exit points, also MMIO
>> configuration accesses could be triggered by any VCPU at any point in time.
>> Changing interrupt CPU affinity adds to the complication.
>> This leads to many code parts which could run in parallel and thus contains
>> some race conditions, so proper locking becomes key of a good design.
>> But one has to consider that interrupts in general can be characterised
>> as a rare event - otherwise a guest would be busy handling interrupts and could
>> not process actual computation tasks.
>> That's why the interrupt state tracking should focus on a clear and race-free
>> locking scheme, without needlessly optimising too much in this respect.
>> Experience shows that this complicates the code and leads to undetected and
>> hard-to-debug race conditions, which affect the stability of the system in
>> possibly untested corner cases.
>>
>> VGIC design principles
>> ----------------------
>>
>> ### Data structure
>>
>> This VGIC design is based on the idea of having one structure per virtual
>> interrupt, protected by its own lock. In addition there is a list per VCPU,
>> which queues the interrupts which this VCPU should consider for injection.
>> One interrupt can only be on one VCPU list at any given point in time.
>> For private interrupts and SPIs a static allocation of this data structure
>> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
>> dynamic and possibly very sparse allocation scheme, so we need to deal with
>> dynamic allocation and de-allocation of this struct. To accommodate this
>> there is an additional list header to link all LPIs.
>> Also the LPI mapping and unmapping can happen asynchronously, so we need to
>> properly ref-count the structure (at least for LPIs), otherwise some code parts
>> would potentially end up with referencing an already freed pointer.
>>
>> The central data structure is called `struct vgic_irq`, and, beside the
>> expected interrupt configuration data, contains at least the lock, a list
>> header (to be able to link it to a VCPU) and a refcount. Also it contains
>> the interrupt number (to accommodate for non-contiguous interrupt allocations,
>> for instance for LPIs).
>> Beside those essential elements it proves worth to store (a reference to) the
>> VCPU this IRQ is associated with. This allows to easily find the respective
>> VCPU list.
>>
>>     struct vgic_irq {
>>         spinlock_t irq_lock;            /* Protects the content of the struct */
>>         struct list_head lpi_list;      /* Used to link all LPIs together */
>>         struct list_head ap_list;
>>
>>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>>                                          * SPIs and LPIs: The VCPU whose ap_list
>>                                          * this is queued on.
>>                                          */
>>
>>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
>>                                           * be sent to, as a result of the
>>                                           * targets reg (v2) or the
>>                                           * affinity reg (v3).
>>                                           */
>>
>>         u32 intid;                      /* Guest visible INTID */
>>         bool line_level;                /* Level only */
>>         bool pending_latch;             /* The pending latch state used to
>>                                          * calculate the pending state for
>>                                          * both level and edge triggered IRQs.
>>                                          */
>>
>>         bool active;                    /* not used for LPIs */
>>         bool enabled;
>>         bool hw;                        /* Tied to HW IRQ */
>>         struct kref refcount;           /* Used for LPIs */
>>         u32 hwintid;                    /* HW INTID number */
>>         union {
>>             u8 targets;                     /* GICv2 target VCPUs mask */
>>             u32 mpidr;                      /* GICv3 target VCPU */
>>         };
>>         u8 source;                      /* GICv2 SGIs only */
>>         u8 priority;
>>         enum vgic_irq_config config;    /* Level or edge */
>>     };
> 
> The refcount and irq_lock are good ideas, let's have them.
> 
> 
>> ### VCPU list handling
>>
>> Initially a virtual interrupt just lives on its own. Guest MMIO accesses to
>> the distributor will change the state information in this structure.
>> When an interrupt is actually made pending (either by an associated hardware
>> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
>> linked to the current target VCPU. The `vcpu` member in the structure will
>> be set to this VCPU. Any affinity change after this point will not affect
>> the current target VCPU anymore, it just updates the `target_vcpu` field in
>> the structure, which will be considered on the next injection.
>> This per-VCPU list is called the `ap_list`, since it holds interrupts which
>> are in a pending and/or active state.
> 
> The two vcpu lists sound like a good idea too, and Christoffer's
> explanation helped. It is actually similar to what we do in Xen already.
> I guess a vgic is always a vgic :-)

Mmmh, I don't get where you see two VCPU *lists* here. There are two
VCPU *fields* in the structure, but they are completely different from
the lr_pending and lr_queue lists in Xen.
In fact I believe that these *two* lists in Xen are one of the major
pain points in the current VGIC.

> What happens when the irq is migrated while still in an LR on another
> pcpu? When/How is the physical affinity changed?

Per the architecture there is nothing like an IRQ "migration". There is
the CPU affinity, which determines to which core this IRQ is forwarded
*when it becomes pending*. Once it has been activated, it stays at this
core, even if you change the ITARGETSR or IROUTER register afterwards.
This is a benign race, you just came too late to change the affinity.

And at the moment we don't synchronize the physical affinity, simply
because most IRQs in the KVM world have been virtual so far. Now this is going to
change, so I guess we have to take a look at this at some point. But I
consider this an optimization, and would prefer correctness and
stability over performance.

> What happens when a new irq is supposed to be injected when target_vcpu is
> already set? Does target_vcpu simply get overwritten?

target_vcpu is simply a configuration storage. Anyone can update this
field at any time, without any side effects.
When an IRQ is going to be injected, the current value of target_vcpu is
written *once* to the "vcpu" field, which from now on determines the
responsible VCPU for the whole interrupt life cycle (queueing on lists,
putting into LRs, ...). This field cannot change anymore until the IRQ
is EOIed.
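
As a sketch (field names as in the struct from the document, everything else
is made up, and `target` is the VCPU read from `target_vcpu` before the locks
are taken, with the usual re-check afterwards):

    static void vgic_try_queue_irq(struct vcpu *target, struct vgic_irq *irq)
    {
        spin_lock(&target->arch.vgic.ap_list_lock);     /* VCPU lock first */
        spin_lock(&irq->irq_lock);

        /* Re-check the conditions, then latch the target exactly once. */
        if (!irq->vcpu && irq->enabled && irq_is_pending(irq)) {
            irq->vcpu = irq->target_vcpu;               /* fixed until EOI */
            list_add_tail(&irq->ap_list, &target->arch.vgic.ap_list_head);
            kref_get(&irq->refcount);                   /* the list holds a ref */
        }

        spin_unlock(&irq->irq_lock);
        spin_unlock(&target->arch.vgic.ap_list_lock);
    }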

> What happens when a vcpu is migrated from pcpu1 to pcpu2?

Nothing spectacular, I guess. We don't care about the physical IRQ
affinity. And since we clear all LRs on exit and (re-)populate them on
entry, doing this on two different CPUs is a total no-brainer.

>> ### Virtual IRQ references
>>
>> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
>> given its number.
>> For private IRQs and SPIs it is expected that this just indexes a static array.
>> For LPIs (which are dynamically allocated at run time) this is expected to
>> iterate a data structure (like a linked list) to find the right structure.
>> In any case a call to `vgic_get_irq` will increase a refcount, which will
>> prevent LPIs from being de-allocated while another part of the VGIC is still
>> holding a reference. Thus any caller to `vgic_get_irq` shall call
>> `vgic_put_irq()` after it is done with handling this interrupt.
>> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
>> this case the VCPU holds that reference and it is kept as long as the guest
>> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
>> been EOIed by the guest and it having been removed from the VCPU list.
> 
> I understand the idea behind a refcount and sounds like a good thing to
> have.
> 
> Let me ask you a couple of questions. How does it help with the issue
> that an LPI could be discarded and remapped (MAPTI) from another
> pcpu while it could still be in an LR?

On DISCARD we remove it from the list of mapped LPIs, but don't free the
structure. So any vgic_get_lpi(lpi_nr) won't find it anymore. But since
the interrupt is in an LR, the VCPU's ap_list still references the
vgic_irq structure, so we can do the whole IRQ life cycle management
just as normal (because being a list member is what counts when it comes
to a "live" interrupt).
Once this LPI is EOIed, we remove it from the VCPU list, which decreases
the refcount and most probably will free the memory, since the value has
become zero by then. Normally, without unmapping it before, the
reference held by the ITS list would make sure the refcount stays
greater than 0.

Now when there is a MAPTI to the same LPI number meanwhile, this will
allocate a new structure (this is a new interrupt!) and enter it into
the ITS list. So anyone asking for this new LPI *number* will get the
reference to the new IRQ. Think: deleting a file and creating a new one
with the same name on a UNIX system, any old users of an already opened
file descriptor will still use the deleted file, but an open() will
return a handle to the new file.
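
To put that analogy into (sketch) code - the array bounds, list and lock names
below are assumptions, and `vgic_irq_release()` would unlink and free the LPI:

    struct vgic_irq *vgic_get_irq(struct domain *d, u32 intid)
    {
        struct vgic_irq *irq;

        if (intid < VGIC_NR_PRIVATE_IRQS + d->arch.vgic.nr_spis)
            return &d->arch.vgic.irqs[intid];   /* statically allocated
                                                 * (private IRQs would need the
                                                 * VCPU, omitted here) */

        spin_lock(&d->arch.vgic.lpi_list_lock);
        list_for_each_entry(irq, &d->arch.vgic.lpi_list_head, lpi_list) {
            if (irq->intid == intid) {          /* only finds *mapped* LPIs */
                kref_get(&irq->refcount);
                spin_unlock(&d->arch.vgic.lpi_list_lock);
                return irq;
            }
        }
        spin_unlock(&d->arch.vgic.lpi_list_lock);

        return NULL;
    }

    void vgic_put_irq(struct domain *d, struct vgic_irq *irq)
    {
        if (irq->intid < VGIC_NR_PRIVATE_IRQS + d->arch.vgic.nr_spis)
            return;                             /* static IRQs are never freed */

        kref_put(&irq->refcount, vgic_irq_release); /* frees a DISCARDed LPI at 0 */
    }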

> What happens if the MAPTI is
> issued before, and what happens if it is issued after the irq has been
> EOIed and cleared from the LR and ap_list?

I believe the above description should answer this. If not, please let
me know.

> I am referring to the case that we currently handling with the
> GIC_IRQ_GUEST_PRISTINE_LPI flag in Xen.

... which was a hack of mine to work around the missing refcount ;-)

>> ### Locking
>>
>> To keep the `vgic_irq` structure consistent and to avoid races between
>> different parts of the VGIC, locking is essential whenever accessing a member
>> of this structure. It is expected that this lock is almost never contended,
>> also held only for brief periods of time, so this is considered cheap.
>> To keep the code clean and avoid nasty corner cases, there are no tricks on
>> trying to be lockless here.
>> If for any reason the code needs to hold the locks for two virtual IRQs, the
>> one with the lower IRQ number is to be taken first, to avoid deadlocks.
>>
>> Another lock to consider is the VCPU lock, which on the first glance protects
>> the virtual CPU's list structure, but also synchronises additions and removals
>> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
>> lock need to be held. To avoid deadlocks, there is a strict locking order:
>>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
> 
> Sounds good (it is basically what I suggested to do in the past).
> 
> 
>> Some operations (like migrating IRQs between two VCPUs) require two VCPU
>> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
>> is to be taken first.
>>
>> There are occasions where the locking order (VCPU first) is hard to observe,
>> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
>> list. In this case the IRQ lock needs to be dropped, the respective VCPU
>> lock should be taken, then the per-IRQ lock needs to be re-taken.
>> After both the locks are held, we need to check if the conditions which
>> originally mandated the list addition (or removal) are still true. This is
>> needed because the IRQ lock could have been taken by another entity meanwhile
>> and the state of this interrupt could have been changed. Examples are if the
>> interrupt is no longer pending, got disabled or changed the CPU affinity.
>> Some of those changes might render to current action obsolete (no longer
>> pending), other will lead to a retry of the re-locking scheme described above.
>> This re-locking scheme shall be implemented in a well-documented function.
>>
>> ### Level and edge triggered interrupts
>>
>> The GIC knows about two kinds of signalling interrupts:
>>
>> - Edge triggered interrupts are triggered by a device once, their life cycle
>> ends when the guest has EOIed them, at which point we remove the pending state,
>> clear the LR and return the `vgic_irq` structure to a quiescent state.
> 
> I assume that "at which point" means at the next trap into the
> hypervisor? We are not trapping on purpose guest EOIs, are we?

Correct. This means our data structures are not up-to-date all of the
time. But I believe this only matters for the ISPENDR/ISACTIVER
register accesses, which are handled in a special way to fix this.
And this is nothing implementation specific, but a general feature of
the GIC emulation architecture.

> Is it possible to have active and pending irqs in an LR? How is that
> handled?

Sure. This happens when there is a new interrupt triggered while the old
one has been activated, but not EOIed (yet).
This actually happens in the following scenario:
- IRQ triggers and gets injected as "pending".
- Guest acks vIRQ by reading virtual ICC_IAR, the LR state changes from
pending to active.
- During the further interrupt handling the guest triggers an MMIO fault
(because it wants to read data from the device or explicitly lowers the
IRQ line with a register access). The CPU exits, and the "active-only"
state becomes visible to the hypervisor.
- The HV syncs back the LR into our struct, clearing the pending latch, but
setting the active field. The struct is in sync now.
- For whatever reason the interrupt fires *again* while the HV is still
in charge. This sets the pending state in our struct.
- Upon guest entry we sync both the active and pending bit to the LR,
making it both active *and* pending.
- The guest's IRQ handler continues to handle the IRQ, the active bit
"shadows" the pending condition for now. Eventually the handler retires
the IRQ by EOIing it, dropping the active state (in the LR).
- Now, immediately after this drop, the virtual IRQ fires again, since
it is pending, but not blocked by the active state anymore.
- The guest's IRQ handler is invoked again and handles this second IRQ
as normal.

Does that make sense?
This is a slightly simplified description for the sake of clarity, as there
are corner cases with priority drops, for instance.
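
For reference, a sketch of the guest-entry side of this; the LR bit/field
macros are made up, the point is merely that one LR can end up with both
state bits set:

    /* Sketch: build one LR value from the emulated state of one IRQ. */
    static u64 vgic_irq_to_lr(struct vgic_irq *irq)
    {
        u64 lr = irq->intid;

        lr |= (u64)irq->priority << LR_PRIORITY_SHIFT;
        if (irq->active)
            lr |= LR_STATE_ACTIVE;                  /* ACKed, not yet EOIed */
        if (irq->pending_latch || irq->line_level)  /* fired again meanwhile */
            lr |= LR_STATE_PENDING;                 /* -> active *and* pending */

        return lr;
    }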

>> - Level triggered interrupts are triggered when a device raises its interrupt
>> line, they stay pending as long as this line is held high. At some point the
>> driver in the guest is expected to program the device to explicitly or
>> implicitly lower this interrupt line. That means that we have to store the
>> state of the virtual interrupt line, which is only controlled by the (virtual)
>> device. This is done in the `line_level` member of `struct vgic_irq`.
>>
>> To assert the interrupt condition, a (virtual) device calls a function exported
>> by the VGIC, which allows to raise or lower an interrupt line. Lowering the
>> line for an edge triggered IRQ is ignored (and so is optional). Raising the
>> line asserts the pending state and potentially injects this virtual IRQ. Any
>> subsequent "raising" call might inject another IRQ, if the previous has at
>> least been activated by the guest already, otherwise is ignored.
> 
> The irq becomes active and pending in the LR?

Yes, see above.

>> For level triggered interrupts this function stores the new state into the
>> `line_level` variable, potentially injecting the interrupt if that line
>> changes from false to true. If the line is lowered before the guest has
>> seen it, this particular interrupt instance will be discarded. Successive
>> "raising" calls will not lead to multiple interrupts if the line has not
>> been lowered in between.
> 
> This is something Xen needs too.
> 
> 
>> ### Software triggered interrupts
>>
>> Beside the naturally software triggered inter-processor-interrupts
>> (SGIs in GIC speak), there is another way of letting software raise an
>> interrupt condition.
>> The GIC distributor allows to set or clear both the pending and active state
>> of any interrupt via MMIO registers. This isn't widely used by many operating
>> systems, but is useful when saving and restoring the state of a machine.
>> So emulating these functions is required for being architecture compliant,
>> however the implementation might not need to be very efficient given its rare
>> usage. In fact supporting the set-pending and clear-pending registers is
>> relatively straight-forward, as long as one keeps this state separate from
>> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
>>
>> The set-active and clear-active registers are much harder to emulate, though,
>> as normally the active state is of little concern to the GIC emulation. In
>> a normal interrupt life cycle the active state isn't even visible to the
>> hypervisor, as it might be set and cleared again entirely within the guest
>> in the list register, without exiting to the hypervisor.
>> So manipulating the active state via the MMIO registers requires some heavy
>> lifting: If this interrupt is currently injected into a running VCPU, this
>> VCPU must exit, the active state must be set or cleared in the LR, then
>> execution can continue. While this is expensive, as mentioned above this
>> should not happen too often, also probably the system isn't very performance
>> sensitive when using this feature for save and restore anyway.
> 
> set-active and clear-active registers are not emulated in Xen today, it
> would be nice to have them.
> 
> How does the locking/synchronization work in the case given that the
> vCPU that needs to exit could be running on a different pCPU?

As I hinted at above, this is a bit of a sledgehammer: we call
kvm_arm_halt_guest() to force all VCPUs to exit and to make sure we are
in sync. Then we can safely update the status, and the normal entry
process takes care of writing this into the LRs.
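
As a sketch, assuming the KVM-style halt/resume pair (the wrapper name is
invented):

    /* Force all VCPUs out of the guest, so no stale LR can hide the active
     * state we are about to change; the normal entry path then rewrites the
     * LRs from the updated struct vgic_irq. */
    static void vgic_mmio_change_active(struct kvm *kvm, struct vgic_irq *irq,
                                        bool new_active)
    {
        kvm_arm_halt_guest(kvm);

        spin_lock(&irq->irq_lock);
        irq->active = new_active;
        spin_unlock(&irq->irq_lock);

        kvm_arm_resume_guest(kvm);
    }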

>> ### MMIO emulation
>>
>> As mentioned before, the distributor and redistributor part of the VGIC needs
>> to be fully emulated. Those parts are characterised by a range of MMIO
>> registers. The implementation shall provide a dispatcher function, which
>> takes the faulted address, relative to the beginning of the MMIO range, and
>> works out which actual register is affected. It then looks up the the
>> respective handler function and calls it. Those functions are expected to
>> be listed in a struct initialiser, which connects the actual register
>> offset and its size to a particular handler. Having handler functions for
>> a register range seems beneficial over handling registers in a switch/case,
>> because it's easier to read and simplifies code sharing, for instance
>> between the GICv2, GICv3 distributor and GICv3 redistributor registers
>> with the same semantics.
> 
> I am happy to replace the whole MMIO emulation bit.

To be honest, this is my least concern ;-) I find the Xen way easier to
read, since everything is in one place. The KVM version spreads the
handlers out and it's not easy to chase them (with ctags, for instance).
But the KVM way is cleaner and more flexible, since it naturally allows
reusing handler functions.
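
To make the table-driven dispatch concrete, a sketch (the region struct and
handler names are invented; the offsets are the usual GICD_CTLR/ISENABLER/
ICENABLER ones):

    struct vgic_register_region {
        unsigned int offset;        /* relative to the start of the MMIO frame */
        unsigned int len;           /* size of the register (range) in bytes */
        unsigned long (*read)(struct vcpu *vcpu, paddr_t addr, unsigned int len);
        void (*write)(struct vcpu *vcpu, paddr_t addr, unsigned int len,
                      unsigned long val);
    };

    static const struct vgic_register_region vgic_dist_registers[] = {
        { 0x0000,  4, vgic_mmio_read_ctlr,   vgic_mmio_write_ctlr    },
        { 0x0100, 32, vgic_mmio_read_enable, vgic_mmio_write_senable },
        { 0x0180, 32, vgic_mmio_read_enable, vgic_mmio_write_cenable },
        /* ... the dispatcher finds the entry covering the faulting offset ... */
    };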

>> ### List register management
>>
>> A list register (LR) holds the state of a virtual interrupt, which will
>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>> Each GIC hardware implementation can choose to implement a number of LRs,
>> having four of them seems to be a common value. This design here does not
>> try to manage the LRs very cleverly, instead on every guest exit every LR
>> in use will be synced to the emulated state, then cleared. Upon guest entry
>> the top priority virtual IRQs will be inserted into the LRs. If there are
>> more pending or active IRQs than list registers, the GIC management IRQ
>> will be configured to notify the hypervisor of a free LR (once the guest
>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>> interrupt(s).
>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>> interrupts always go first on the list, since a guest and the hardware GIC
>> expect those to stay until they have been explicitly deactivated. Failure
>> in keeping active IRQs around will result in error conditions in the GIC.
>> The second sort criteria for the ap\_list is their priority, so higher
>> priority pending interrupt always go first into the LRs.
> 
> The suggestion of using this model in Xen was made in the past already.
> I always objected for the reason that we don't actually know how many
> LRs the hardware provides, potentially very many, and it is expensive
> and needless to read/write them all every time on entry/exit.
> 
> I would prefer to avoid that, but I'll be honest: I can be convinced
> that that model of handling LRs is so much simpler that it is worth it.
> I am more concerned about the future maintainance of a separate new
> driver developed elsewhere.

I think this LR topic should have been covered in that other email.

Beside being a strong supporter of the KISS principle in general, I
believe in case of the GIC emulation we should avoid (premature)
optimizations like the plague, as there are quite some corner cases in
any VGIC, and handling all of them explicitly with some hacks will not
fly (been there, done that).
So I can just support Christoffer's point: having an architecture
compliant VGIC emulation in a maintainable manner requires a
straightforward and clear design. Everything else should be secondary,
and can be evaluated later, if there are good reasons (numbers!).

Cheers,
Andre.


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-10-12 12:05 ` Christoffer Dall
@ 2017-11-01 17:54   ` Andre Przywara
  0 siblings, 0 replies; 13+ messages in thread
From: Andre Przywara @ 2017-11-01 17:54 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Stefano Stabellini, Marc Zyngier, Julien Grall, Eric Auger,
	xen-devel, Christoffer Dall

Hi Christoffer,

On 12/10/17 13:05, Christoffer Dall wrote:
> Hi Andre,
> 
> On Wed, Oct 11, 2017 at 03:33:03PM +0100, Andre Przywara wrote:
>> Hi,
>>
>> (CC:ing some KVM/ARM folks involved in the VGIC)
> 
> Very nice writeup!
> 
> I added a bunch of comments, mostly for the writing and clarity, I hope
> it helps.

Thank you very much for the response and the comments! I really
appreciate your precise (academic) language here.
I held back the response since Stefano was the actual addressee of this
write-up, so: sorry for the delay.

>> starting with the addition of the ITS support we were seeing more and
>> more issues with the current implementation of our ARM Generic Interrupt
>> Controller (GIC) emulation, the VGIC.
>> Among other approaches to fix those issues it was proposed to copy the
>> VGIC emulation used in KVM. This one was suffering from very similar
>> issues, and a clean design from scratch lead to a very robust and
>> capable re-implementation. Interestingly this implementation is fairly
>> self-contained, so it seems feasible to copy it. Hopefully we only need
>> minor adjustments, possibly we can even copy it verbatim with some
>> additional glue layer code.
>> Stefano asked for getting a design overview, to assess the feasibility
>> of copying the KVM code without reviewing tons of code in the first
>> place.
>> So to follow Xen rules for new features, this design document below is
>> an attempt to describe the current KVM VGIC design - in a hypervisor
>> agnostic session. It is a bit of a retro-fit design description, as it
>> is not strictly forward-looking only, but actually describing the
>> existing implemenation [1].
>>
>> Please have a look and let me know:
>> 1) if this document has the right scope
>> 2) if this document has the right level of detail
>> 3) if there are points missing from the document
>> 3) if the design in general is a fit
>>
>> Appreciate any feedback!
>>
>> Cheers,
>> Andre.
>>
>> ---------------------------------------
>>
>> VGIC design
>> ===========
>>
>> This document describes the design of an ARM Generic Interrupt Controller (GIC)
>> emulation. It is meant to emulate a GIC for a guest in an virtual machine,
>> the common name for that is VGIC (from "virtual GIC").
>>
>> This design was the result of a one-week-long design session with some
>> engineers in a room, triggered by ever-increasing difficulties in maintaining
>> the existing GIC emulation in the KVM hypervisor. The design eventually
>> materialised as an alternative VGIC implementation in the Linux kernel
>> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
>> was removed, so it is now the current code used by Linux.
>> Although being used in KVM, the actual design of this VGIC is rather hypervisor
>> agnostic and can be used by other hypervisors as well, in particular for Xen.
>>
>> GIC hardware virtualization support
>> -----------------------------------
>>
>> The ARM Generic Interrupt Controller (since v2) supports the virtualization
>> extensions, which allows some parts of the interrupt life cycle to be handled
>> purely inside the guest without exiting into the hypervisor.
>> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
>> acknowledgement", "priority drop" and "interrupt deactivate" actions.
>> So a guest can handle most of the interrupt processing code without
>> leaving EL1 and trapping into the hypervisor. To accomplish
>> this, the GIC holds so called "list registers" (LRs), which shadow the
>> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
>> involves setting up one LR with the interrupt number, its priority and initial
>> state (mostly "pending"), then entering the guest. Any EOI related action
>> from within the guest just acts on those LRs, the hypervisor can later update
>> the virtual interrupt state when the guest exists the next time (for whatever
>> reason).
>> But despite the GIC hardware helping out here, the whole interrupt
>> configuration management is not virtualized at all and needs to be emulated
>> by the hypervisor - or another related software component, for instance a
>> userland emulator. This so called "distributor" part of the GIC consists of
>> memory mapped registers, which can be trapped by the hypervisor, so any guest
>> access can be emulated in the usual way.
>>
>> VGIC design motivation
>> ----------------------
>>
>> A GIC emulation thus needs to take care of those bits:
>>
>> - trap GIC distributor MMIO accesses and shadow the configuration setup
>>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
>> - handle incoming hardware and virtual interrupt requests and inject the
>>   associated virtual interrupt by manipulating one of the list registers
>> - track the state of a virtual interrupt by inspecting the LRs after the
>>   guest has exited, possibly adjusting the shadowed virtual interrupt state
>>
>> Despite the distributor MMIO register emulation being a sizeable chunk of
>> the emulation, it is actually not dominant if looking at the frequency at
>> which it is accessed. Normally the interrupt configuration is done at boot
>> time or upon initialising the device (driver), but rarely during the actual
>> run time of a system. Injecting and EOI-ing interrupts however happens much
>> more often. A good emulation approach should thus focus on tracking the virtual
>> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
>> interrupts.
> 
> I would also say that the architecture for the GIC includes a relatively
> high number of corner cases and invariants that may not be violated, and
> being completely architecture compliant was our first requirement, the
> second requirement was to support efficient interrupt life cycle
> management and to be able to quickly tell which (if any) interrupts must
> be presented to a virtual CPU.

100% ACK. Will try to include this.

>> The actual interrupt state tracking can be quite tricky in parts. Interrupt
>> injections can be independent from the guest entry/exit points, also MMIO
>> configuration accesses could be triggered by any VCPU at any point in time.
>> Changing interrupt CPU affinity adds to the complication.
>> This leads to many code parts which could run in parallel and thus contain
>> some race conditions, so proper locking becomes key to a good design.
>> But one has to consider that interrupts in general can be characterised
>> as a rare event - otherwise a guest would be busy handling interrupts and could
> 
> (across all virtual CPUs)
> 
>> not process actual computation tasks.
>> That's why the interrupt state tracking should focus on a clear and race-free
> 
> nit: not sure it makes sense to talk about a race-free locking scheme.
> You have locking inherently because you have races; locking just makes
> sure that things that race and access data concurrently don't corrupt
> state and that the races become benign.

Indeed, good point.

>> locking scheme, without needlessly optimising too much in this respect.
>> Experience shows that this complicates the code and leads to undetected and
>> hard-to-debug race conditions, which affect the stability of the system in
>> possibly untested corner cases.
> 
> I think experience also shows that the expected performance bottlenecks
> really weren't there at all, and any optimization efforts should be
> driven by clear measurements of the pain points, falling back to clarity
> of implementation and ease of maintenance for all other parts of the
> implementation.

Yes, it seems obvious, but is indeed something that we need to remind
ourselves of from time to time.

>> VGIC design principles
>> ----------------------
>>
>> ### Data structure
>>
>> This VGIC design is based on the idea of having one structure per virtual
>> interrupt, protected by its own lock. 
> 
> Even more high level: This VGIC design was based around having a very
> clear data structure design, never duplicating state, and making it
> abundantly clear how things are structured.  One way of achieving that
> is to have a structure per interrupt, each having its own lock.
> 
>> In addition there is a list per VCPU,
>> which queues the interrupts which this VCPU should consider for injection.
> 
> nit: Should you introduce the AP list name here, and say that it's protected
> by the VCPU lock?

Yeah, good idea.

>> One interrupt can only be on one VCPU list at any given point in time.
> 
> nit, wording: Any interrupt can be on at most one AP list at any point
> in time.
> 
>> For private interrupts and SPIs a static allocation of this data structure
> 
> nit: PPIs and SPIs (or private and shared interrupts)

Well, with "private interrupts" I included SGIs as well, basically every
IRQ that is static and has a well bounded upper limit.
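
To make the difference between static and dynamic allocation a bit more
concrete, the layout I have in mind looks roughly like this (just a sketch,
the names and sizes are illustrative and not meant as the final Xen data
structures):

    /*
     * Sketch only: SGIs/PPIs and SPIs have a bounded number, so they can
     * live in statically sized arrays; only LPIs need dynamic allocation
     * and freeing.
     */
    #define VGIC_NR_PRIVATE_IRQS   32       /* 16 SGIs + 16 PPIs per VCPU */

    struct vgic_cpu {                       /* per-VCPU part */
        struct vgic_irq private_irqs[VGIC_NR_PRIVATE_IRQS];
        spinlock_t ap_list_lock;            /* the "VCPU lock" from the text */
        struct list_head ap_list_head;      /* IRQs pending and/or active here */
    };

    struct vgic_dist {                      /* per-domain part */
        struct vgic_irq *spis;              /* nr_spis entries, allocated at init */
        unsigned int nr_spis;
        struct list_head lpi_list_head;     /* all currently mapped LPIs */
    };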

>> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
>> dynamic and possibly very sparse allocation scheme, so we need to deal with
>> dynamic allocation and de-allocation of this struct. To accommodate this
>> there is an additional list header to link all LPIs.
>> Also the LPI mapping and unmapping can happen asynchronously, so we need to
> 
> asynchronously to what?

Asynchronously with respect to managing virtual interrupts, i.e. in the
middle of the HV running or even during the actual injection code.

>> properly ref-count the structure (at least for LPIs), otherwise some code parts
> 
> nit, wording: reference count.
> 
>> would potentially end up with referencing an already freed pointer.
> 
> It's not only that, it's that you need to know when to free things.
> This is the basic idea of reference counting which I don't think you
> need to argue for in this document.

Agreed, and since Stefano obviously does not need convincing on this
front ;-), I can shorten the argument here.

>>
>> The central data structure is called `struct vgic_irq`, and, beside the
>> expected interrupt configuration data, contains at least the lock, a list
>> header (to be able to link it to a VCPU) and a refcount. Also it contains
>> the interrupt number (to accommodate for non-contiguous interrupt allocations,
>> for instance for LPIs).
>> Beside those essential elements it proves worth to store (a reference to) the
>> VCPU this IRQ is associated with. This allows to easily find the respective
>> VCPU list.
>>
>>     struct vgic_irq {
>>         spinlock_t irq_lock;            /* Protects the content of the struct */
>>         struct list_head lpi_list;      /* Used to link all LPIs together */
>>         struct list_head ap_list;
>>
>>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>>                                          * SPIs and LPIs: The VCPU whose ap_list
>>                                          * this is queued on.
>>                                          */
>>
>>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
>>                                           * be sent to, as a result of the
>>                                           * targets reg (v2) or the
>>                                           * affinity reg (v3).
>>                                           */
>>
>>         u32 intid;                      /* Guest visible INTID */
>>         bool line_level;                /* Level only */
>>         bool pending_latch;             /* The pending latch state used to
>>                                          * calculate the pending state for
>>                                          * both level and edge triggered IRQs.
>>                                          */
>>
>>         bool active;                    /* not used for LPIs */
>>         bool enabled;
>>         bool hw;                        /* Tied to HW IRQ */
>>         struct kref refcount;           /* Used for LPIs */
>>         u32 hwintid;                    /* HW INTID number */
>>         union {
>>             u8 targets;                     /* GICv2 target VCPUs mask */
>>             u32 mpidr;                      /* GICv3 target VCPU */
>>         };
>>         u8 source;                      /* GICv2 SGIs only */
>>         u8 priority;
>>         enum vgic_irq_config config;    /* Level or edge */
>>     };
>>
>> ### VCPU list handling
>>
>> Initially a virtual interrupt just lives on its own. 
> 
> not sure what this means, see if you can clarify by being more concrete.

Yeah, I didn't find any better wording. Basically that struct vgic_irq
is not connected to anything, and is self-absorbed with playing with its
fields and the lock ;-)

>> Guest MMIO accesses to
>> the distributor will change the state information in this structure.
>> When an interrupt is actually made pending (either by an associated hardware
>> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
> 
> I think the distinction of what causes an interrupt to be fired should
> be reworked in the document.  The important bit is that the VGIC has a 
> virtual interrupt input line, which can be raised and lowered, which the
> hypervisor can use to signal virtual interrupts.  These may or may not
> be tied to a physical interrupt, and they may therefore be marked as
> hw=true or hw=false, respectively.

OK, that's a good point. With the Xen focus on hw=1 interrupts, I got a
bit confused, but indeed it's still a virtual interrupt, which needs to
be triggered by the hypervisor. The connection to a physical IRQ is
something on top of this, for later in the IRQ life cycle.
Thanks for spelling this out!
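
So I will rework that paragraph around the virtual input line. The entry
point would then look roughly like below (a simplified sketch, the function
and helper names are made up for illustration):

    /*
     * Sketch: the hypervisor (or an emulated device) signals a virtual IRQ
     * solely by changing the state of its virtual input line. Whether the
     * vIRQ is backed by a physical interrupt is a property of the struct
     * (irq->hw) and does not show up in this interface at all.
     */
    int vgic_inject_irq(struct domain *d, unsigned int intid, bool level)
    {
        struct vgic_irq *irq = vgic_get_irq(d, NULL /* vcpu, private IRQs only */, intid);

        if (!irq)
            return -EINVAL;

        spin_lock(&irq->irq_lock);

        if (irq->config == VGIC_CONFIG_LEVEL)
            irq->line_level = level;        /* remember the line state */
        else if (level)
            irq->pending_latch = true;      /* edge: only rising edges matter */

        /* Queues the IRQ on the target VCPU's ap_list if needed, drops irq_lock. */
        vgic_queue_irq_unlock(d, irq);

        vgic_put_irq(d, irq);

        return 0;
    }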

>> linked to the current target VCPU. The `vcpu` member in the structure will
>> be set to this VCPU. Any affinity change after this point will not affect
>> the current target VCPU anymore, it just updates the `target_vcpu` field in
>> the structure, which will be considered on the next injection.
> 
> I think this description is a little vague.  There are clear semantics
> associated with these two fields:
> 
>    vcpu: The VCPU whose ap_list this interrupt is queued on (which
>          happens to be immutable for SGIs and PPIs)
> 
>    target_vcpu: For SGIs and LPIs, the configured target VCPU for an
>                 interrupt.
> 
> Once this is clearly defined, there are some rules in terms of when the
> vcpu field can be changed; when queing a virtual interrupt for delivery
> (because it's pending and/or active), the vcpu field field points to the
> VCPU on which it is queud.  The target_vcpu field simply records the
> configuration, and can be changed by the hypervisor or the VM itself at
> any time, but only the VCPU on whose AP list the virtual interrupt is
> already queued, can change a non-NULL vcpu field to NULL or to a
> different value, i.e. migrate the virtual interrupt.
> 
> This is a requirement to ensure correct functionality; once you present
> an active interrupt to a VCPU, you cannot take it away behind its back,
> but you have to wait until the VCPU deactivates the interrupt.
> 
>> This per-VCPU list is called the `ap_list`, since it holds interrupts which
>> are in a pending and/or active state.
>>
>> ### Virtual IRQ references
>>
>> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
>> given its number.
>> For private IRQs and SPIs it is expected that this just indexes a static array.
>> For LPIs (which are dynamically allocated at run time) this is expected to
>> iterate a data structure (like a linked list) to find the right structure.
> 
> (or allocate one?)

Mmh, we don't in our code: vgic_get_lpi() returns NULL if the LPI is not
found, and needs vgic_add_lpi() to explicitly allocate one.
I guess an implementation could choose to automatically allocate one,
but this would be an implementation detail, wouldn't it?
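
For reference, the lookup itself is rather boring, only the LPI case needs
to walk a data structure. Roughly (a condensed sketch of what the KVM code
does, error handling omitted):

    /*
     * Sketch: SGIs/PPIs and SPIs index static arrays, LPIs are searched in
     * the list of currently mapped LPIs. vgic_get_lpi() takes the reference
     * under the LPI list lock and returns NULL if the LPI is not mapped.
     */
    struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu, u32 intid)
    {
        if (intid < VGIC_NR_PRIVATE_IRQS)                   /* SGIs and PPIs */
            return &vcpu->arch.vgic.private_irqs[intid];

        if (intid < VGIC_NR_PRIVATE_IRQS + d->arch.vgic.nr_spis)   /* SPIs */
            return &d->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

        if (intid >= VGIC_MIN_LPI)                          /* LPIs */
            return vgic_get_lpi(d, intid);

        return NULL;                                        /* reserved INTIDs */
    }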

>> In any case a call to `vgic_get_irq` will increase a refcount, which will
>> prevent LPIs from being de-allocated while another part of the VGIC is still
>> holding a reference. Thus any caller to `vgic_get_irq` shall call
>> `vgic_put_irq()` after it is done with handling this interrupt.
> 
> Isn't this refcounting 101?  I assume it's already used in Xen and the
> rationale could be skipped here in the interest of focus.

Yes, but this "call vgic_put_irq() after it is done" has consequences
for the code: you have to do it explicitly and cannot just return a
pointer from a function without passing the responsibility of "putting
it" to the caller. I found this noteworthy, since the Xen VGIC code does
violate this principle at the moment (which is fine since Xen does not
need it).

>> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
>> this case the VCPU holds that reference and it is kept as long as the guest
>> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
>> been EOIed by the guest and it having been removed from the VCPU list.
> 
> Again, this seems to just explain an example of one of the references.
> Reference counting works by counting references, freeing the resource
> when the reference reaches zero.  That's about it.

OK, I guess I can shorten the lengthy CS lecture here, just adding that
being on an ap_list requires us to keep the reference, even if we leave
the function and enter a VCPU.
Since one would normally pair a put() mechanically with a get() in a (leaf)
function, the need to spread gets and puts over different code parts
seems worth mentioning here explicitly.
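
Maybe a stripped-down sketch of the fold-back code makes the point clearer
(a sketch only; the real function also handles retargeting, which needs the
two-VCPU locking mentioned in the locking section):

    /*
     * Sketch: being queued on an ap_list counts as holding a reference, so
     * the get() done at injection time is not matched by a put() in the same
     * function. The matching put() only happens here, when the IRQ falls off
     * the list after the guest is done with it.
     */
    static void vgic_prune_ap_list(struct vcpu *vcpu)       /* on guest exit */
    {
        struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic;
        struct vgic_irq *irq, *tmp;

        spin_lock(&vgic_cpu->ap_list_lock);                 /* the "VCPU lock" */

        list_for_each_entry_safe(irq, tmp, &vgic_cpu->ap_list_head, ap_list) {
            spin_lock(&irq->irq_lock);

            /* irq_is_pending(): pending latch set, or line high for level IRQs */
            if (!irq_is_pending(irq) && !irq->active) {
                /* No longer of interest to the guest: unqueue it. */
                list_del(&irq->ap_list);
                irq->vcpu = NULL;
                spin_unlock(&irq->irq_lock);

                vgic_put_irq(vcpu->domain, irq);    /* drop the list's reference */
                continue;
            }

            spin_unlock(&irq->irq_lock);
        }

        spin_unlock(&vgic_cpu->ap_list_lock);
    }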

>> ### Locking
>>
>> To keep the `vgic_irq` structure consistent and to avoid races between
>> different parts of the VGIC, locking is essential whenever accessing a member
> 
> nit: again a race cannot be avoided completely, but they can be made
> benign...

Right.

>> of this structure. It is expected that this lock is almost never contended,
>> also held only for brief periods of time, so this is considered cheap.
>> To keep the code clean and avoid nasty corner cases, there are no tricks on
>> trying to be lockless here.
>> If for any reason the code needs to hold the locks for two virtual IRQs, the
>> one with the lower IRQ number is to be taken first, to avoid deadlocks.
>>
>> Another lock to consider is the VCPU lock, which on the first glance protects
>> the virtual CPU's list structure, but also synchronises additions and removals
>> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
>> lock need to be held. To avoid deadlocks, there is a strict locking order:
>>
>>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
>>
>> Some operations (like migrating IRQs between two VCPUs) require two VCPU
>> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
>> is to be taken first.
>>
>> There are occasions where the locking order (VCPU first) is hard to observe,
>> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
>> list. In this case the IRQ lock needs to be dropped, the respective VCPU
>> lock should be taken, then the per-IRQ lock needs to be re-taken.
>> After both the locks are held, we need to check if the conditions which
>> originally mandated the list addition (or removal) are still true. This is
>> needed because the IRQ lock could have been taken by another entity meanwhile
>> and the state of this interrupt could have been changed. Examples are if the
>> interrupt is no longer pending, got disabled or changed the CPU affinity.
>> Some of those changes might render the current action obsolete (no longer
>> pending), others will lead to a retry of the re-locking scheme described above.
>> This re-locking scheme shall be implemented in a well-documented function.
> 
> Do we have this documentation on the KVM side that you could link to
> here for people to have an understanding of how this can be explained?
> It's not that bad when you look at it really.

Well, there is this function with 80% comments ;-)
But I found it a bit cumbersome to refer to some existing implementation
in a design document. Maybe I could add some pseudo code here?
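
So let me try some pseudo code here, roughly following what the (heavily
commented) KVM function does; the helper names are invented for brevity:

    /*
     * Pseudo code sketch: queueing an IRQ whose irq_lock is already held,
     * which means the "VCPU lock first" order cannot be observed directly.
     * Called with irq->irq_lock held, returns with it dropped.
     */
    void vgic_queue_irq_unlock(struct domain *d, struct vgic_irq *irq)
    {
        struct vcpu *vcpu;

    retry:
        /* Returns the VCPU to queue on, or NULL if there is nothing to do. */
        vcpu = vgic_target_vcpu(irq);

        if (irq->vcpu || !vcpu) {
            /* Already queued somewhere, or not pending/enabled: just drop the lock. */
            spin_unlock(&irq->irq_lock);
            return;
        }

        /* Re-take the locks in the proper order: VCPU ("ap_list") lock first. */
        spin_unlock(&irq->irq_lock);
        spin_lock(&vcpu->arch.vgic.ap_list_lock);
        spin_lock(&irq->irq_lock);

        /*
         * The IRQ was unlocked for a moment, so somebody else might have
         * changed its state: it may no longer be pending, got disabled or
         * retargeted, or another VCPU queued it already. Check again and
         * start over if the original decision no longer holds.
         */
        if (irq->vcpu || vcpu != vgic_target_vcpu(irq)) {
            spin_unlock(&irq->irq_lock);
            spin_unlock(&vcpu->arch.vgic.ap_list_lock);

            spin_lock(&irq->irq_lock);
            goto retry;
        }

        vgic_get_irq_kref(irq);         /* the ap_list now holds a reference */
        list_add_tail(&irq->ap_list, &vcpu->arch.vgic.ap_list_head);
        irq->vcpu = vcpu;

        spin_unlock(&irq->irq_lock);
        spin_unlock(&vcpu->arch.vgic.ap_list_lock);

        vcpu_kick(vcpu);                /* make the VCPU pick up the new IRQ */
    }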

>>
>> ### Level and edge triggered interrupts
>>
>> The GIC knows about two kinds of signalling interrupts:
>>
>> - Edge triggered interrupts are triggered by a device once, their life cycle
>> ends when the guest has EOIed them, at which point we remove the pending state,
>> clear the LR and return the `vgic_irq` structure to a quiescent state.
> 
> For non-HW interrupts, you have the added potential complexity of
> PENDING+ACTIVE.

Well, I wanted to introduce level and edge triggered IRQs here, not
completely alienate the reader ;-)
I guess I'll shorten this to the textbook definition of level vs. edge,
and go into the details of those cases later.

>> - Level triggered interrupts are triggered when a device raises its interrupt
>> line, they stay pending as long as this line is held high. At some point the
>> driver in the guest is expected to program the device to explicitly or
>> implicitly lower this interrupt line. That means that we have to store the
>> state of the virtual interrupt line, which is only controlled by the (virtual)
>> device. This is done in the `line_level` member of `struct vgic_irq`.
>>
>> To assert the interrupt condition, a (virtual) device calls a function exported
>> by the VGIC, which allows to raise or lower an interrupt line. Lowering the
>> line for an edge triggered IRQ is ignored (and so is optional). Raising the
>> line asserts the pending state and potentially injects this virtual IRQ. Any
>> subsequent "raising" call might inject another IRQ, if the previous has at
>> least been activated by the guest already, otherwise it is ignored.
>>
>> For level triggered interrupts this function stores the new state into the
>> `line_level` variable, potentially injecting the interrupt if that line
>> changes from false to true. If the line is lowered before the guest has
>> seen it, this particular interrupt instance will be discarded. Successive
>> "raising" calls will not lead to multiple interrupts if the line has not
>> been lowered in between.
> 
> This is confusing:

Even for me, apparently ;-)

> Lowering or raising the line for a level triggered
> interrupt doesn't make any difference.  The point is that as long as the
> line is high, if you deactivate that interrupt, a new interrupt will hit
> immediately again, unless the line has been lowered in the meantime.

Yeah, it seems I need to seriously rework this paragraph.
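
When doing so I will probably phrase it around how the pending state is
computed, maybe with a small helper like this (sketch):

    /*
     * Sketch: an IRQ is considered pending if its software pending latch is
     * set, or, for level triggered IRQs, if its virtual input line is high.
     * Nothing special happens when the line is raised or lowered beyond
     * updating line_level; what matters is that after a deactivation the IRQ
     * is immediately pending again as long as line_level is still true.
     */
    static bool irq_is_pending(struct vgic_irq *irq)
    {
        if (irq->config == VGIC_CONFIG_EDGE)
            return irq->pending_latch;

        return irq->pending_latch || irq->line_level;
    }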

>>
>> ### Software triggered interrupts
>>
>> Beside the naturally software triggered inter-processor-interrupts
>> (SGIs in GIC speak), there is another way of letting software raise an
>> interrupt condition.
> 
> These three lines appear to belong to the heading...
> 
>> The GIC distributor allows to set or clear both the pending and active state
>> of any interrupt via MMIO registers. This isn't widely used by many operating
>> systems, but is useful when saving and restoring the state of a machine.
>> So emulating these functions is required for being architecture compliant,
>> however the implementation might not need to be very efficient given its rare
>> usage. In fact supporting the set-pending and clear-pending registers is
>> relatively straight-forward, as long as one keeps this state separate from
>> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
>>
>> The set-active and clear-active registers are much harder to emulate, though,
>> as normally the active state is of little concern to the GIC emulation. In
>> a normal interrupt life cycle the active state isn't even visible to the
>> hypervisor, as it might be set and cleared again entirely within the guest
>> in the list register, without exiting to the hypervisor.
>> So manipulating the active state via the MMIO registers requires some heavy
>> lifting: If this interrupt is currently injected into a running VCPU, this
>> VCPU must exit, the active state must be set or cleared in the LR, then
>> execution can continue. While this is expensive, as mentioned above this
>> should not happen too often, also probably the system isn't very performance
>> sensitive when using this feature for save and restore anyway.
> 
> These two paragraphs not so much, they seem to belong to MMIO emulation,
> and should probably follow the paragraph below.

Yes, indeed it makes sense to move them.

>> ### MMIO emulation
>>
>> As mentioned before, the distributor and redistributor part of the VGIC needs
>> to be fully emulated. Those parts are characterised by a range of MMIO
>> registers. The implementation shall provide a dispatcher function, which
>> takes the faulted address, relative to the beginning of the MMIO range, and
>> works out which actual register is affected. It then looks up the
>> respective handler function and calls it. Those functions are expected to
>> be listed in a struct initialiser, which connects the actual register
>> offset and its size to a particular handler. Having handler functions for
>> a register range seems beneficial over handling registers in a switch/case,
>> because it's easier to read and simplifies code sharing, for instance
>> between the GICv2, GICv3 distributor and GICv3 redistributor registers
>> with the same semantics.
>>
>> ### List register management
>>
>> A list register (LR) holds the state of a virtual interrupt, which will
>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>> Each GIC hardware implementation can choose to implement a number of LRs,
>> having four of them seems to be a common value. This design here does not
>> try to manage the LRs very cleverly, instead on every guest exit every LR
>> in use will be synced to the emulated state, then cleared. 
> 
> In fact I think we came up with counter-examples for every model of
> being clever with not reading back the LRs, because you simply have to
> observe any change in state that happened in hardware while the guest is
> running, to be able to properly emulate compliant functionality of being
> able to inject interrupts or not.

Adding another one:
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/arm/gic.c;hb=HEAD#l622

Sorry, Stefano, couldn't resist ;-)

>> Upon guest entry
>> the top priority virtual IRQs will be inserted into the LRs. If there are
>> more pending or active IRQs than list registers, the GIC management IRQ
>> will be configured to notify the hypervisor of a free LR (once the guest
>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>> interrupt(s).
>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>> interrupts always go first on the list, since a guest and the hardware GIC
>> expect those to stay until they have been explicitly deactivated. Failure
>> in keeping active IRQs around will result in error conditions in the GIC.
>> The second sort criterion for the ap\_list is the priority, so higher
>> priority pending interrupts always go first into the LRs.
> 
> 
> Otherwise, as I said, this is a really nice writeup.

Thanks, much appreciated!

Cheers,
Andre.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01 14:30   ` Andre Przywara
@ 2017-11-01 21:54     ` Stefano Stabellini
  2017-11-02  7:40       ` Christoffer Dall
  2017-11-02 16:00       ` Andre Przywara
  0 siblings, 2 replies; 13+ messages in thread
From: Stefano Stabellini @ 2017-11-01 21:54 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Stefano Stabellini, Marc Zyngier, Eric Auger, Julien Grall,
	xen-devel, Christoffer Dall

On Wed, 1 Nov 2017, Andre Przywara wrote:
> Hi Stefano,
> 
> 
> On 01/11/17 01:58, Stefano Stabellini wrote:
> > On Wed, 11 Oct 2017, Andre Przywara wrote:
> 
> many thanks for going through all of this!

No problems, and thanks for your work and for caring about doing the
best thing for the project.


> >> (CC:ing some KVM/ARM folks involved in the VGIC)
> >>
> >> starting with the addition of the ITS support we were seeing more and
> >> more issues with the current implementation of our ARM Generic Interrupt
> >> Controller (GIC) emulation, the VGIC.
> >> Among other approaches to fix those issues it was proposed to copy the
> >> VGIC emulation used in KVM. This one was suffering from very similar
> >> issues, and a clean design from scratch led to a very robust and
> >> capable re-implementation. Interestingly this implementation is fairly
> >> self-contained, so it seems feasible to copy it. Hopefully we only need
> >> minor adjustments, possibly we can even copy it verbatim with some
> >> additional glue layer code.
> >>
> >> Stefano asked for getting a design overview, to assess the feasibility
> >> of copying the KVM code without reviewing tons of code in the first
> >> place.
> >> So to follow Xen rules for new features, this design document below is
> >> an attempt to describe the current KVM VGIC design - in a hypervisor
> >> agnostic fashion. It is a bit of a retro-fit design description, as it
> >> is not strictly forward-looking only, but actually describing the
> >> existing implementation [1].
> >>
> >> Please have a look and let me know:
> >> 1) if this document has the right scope
> >> 2) if this document has the right level of detail
> >> 3) if there are points missing from the document
> >> 4) if the design in general is a fit
> > 
> > Please read the following statements as genuine questions and concerns.
> > Most ideas on this document are good. Some of them I have even suggested
> > them myself in the context of GIC improvements for Xen. I asked for a
> > couple of clarifications.
> > 
> > But I don't see why we cannot implement these ideas on top of the
> > existing code, rather than with a separate codebase, ending up with two
> > drivers. I would prefer a natual evolution. Specifically, the following
> > improvements would be simple and would give us most of the benefits on
> > top of the current codebase:
> > - adding the irq lock, and the refcount
> > - taking both vcpu locks when necessary (on migration code for example
> >   it would help a lot), the lower vcpu_id first
> > - level irq emulation
> 
> I think some of those points you mentioned are not easily implemented in
> the current Xen. For instance I ran into locking order issues with those
> *two* inflight and lr_queue lists, when trying to implement the lock and
> the refcount.
> Also this "put vIRQs into LRs early, but possibly rip them out again" is
> really complicating things a lot.
> 
> I believe only level IRQs could be added in a relatively
> straightforward manner.
> 
> So the problem with the evolutionary approach is that it generates a lot
> of patches, some of them quite invasive, others creating hard-to-read
> diffs, which are both hard to review.
> And chances are that the actual result would be pretty close to the KVM
> code. To be clear: I hacked the Xen VGIC into the KVM direction in a few
> days some months ago, but it took me *weeks* to make sane patches of
> only the first part of it.
> And this would not cover all those general, tedious corner cases that
> the VGIC comes with. Those would need to be fixed in a painful process,
> which we could avoid by "lifting" the KVM code.

I hear you, but the principal cost here is the review time, not the
development time. Julien told me that it would be pretty much the same
for him in terms of time it takes to review the changes, it doesn't
matter if it's a new driver or changes to the existing driver. For me,
it wouldn't be the same: I think it would take me far less time to
review them if they were against the existing codebase.

However, as I wrote, this is not my foremost concern. I would be up to
committing myself to review this even if we decide to go for a new
driver.


> > If we do end up with a second separate driver for technical or process
> > reasons, I would expect the regular Xen submission/review process to be
> > followed. The code style will be different, the hooks into the rest of
> > the hypervisors will be different and things will be generally changed.
> > The new V/GIC might be derived from KVM, but it should end up looking
> > and feeling like a 100% genuine Xen component. After all, we'll
> > maintain it going forward. I don't want a copy of a Linux driver with
> > glue code. The Xen community cannot be expected not to review the
> > submission, but if we review it, then we'll ask for changes. Once we
> > change the code, there will be no point in keeping the Linux code
> > separate with glue code. We should fully adapt it to Xen.
> 
> I see your point, and this actually simplifies *my* work, but I am a bit
> worried about the effects of having two separate implementations which
> then diverge over time.
> At the moment we have two separate implementations as well, but they are
> quite different, which has the advantage of doing things differently
> enough to help in finding bugs in the other one (something we should
> actually exploit in testing, I believe).

It is a matter of ownership and responsibilities. The gic and vgic
components are critical to the hypervisor functionalities, and as Xen
Project we need to take ownership of them. It means fixing bugs and
maintaining them going forward. It makes sense to have them fully
integrated into Xen.


> So how is your feeling towards some shared "libvgic"? I understand that
> people are not too happy about that extra maintenance cost of having a
> separate repository, but I am curious what your, Marc's and
> Christoffer's take is on this idea.

I am open to this discussion. It is nice in theory, but it is hard to
put into practice. I think neither Julien and I nor Christoffer and Marc
like the idea of a separate repository. It is a pain and it is ugly. But
if we don't have a single repository, how can we share the codebase?

Also keep in mind that Xen and Linux have different release cycles and
they go into freeze at different times. It affects when/how fixes can
get into the codebase.

Unless you come up with a clever idea on how to make this work, I think
we are better off with our own version of the driver.


> > That is what was done in the past when KVM took code from Xen (for
> > example async shadow pagetables). I am eager to avoid a situation like
> > the current SMMU driver in Xen, which comes from Linux, and we are not
> > entirely sure how to maintain it.
> > 
> > 
> >> Appreciate any feedback!
> >>
> >> Cheers,
> >> Andre.
> >>
> >> ---------------------------------------
> >>
> >> VGIC design
> >> ===========
> >>
> >> This document describes the design of an ARM Generic Interrupt Controller (GIC)
> >> emulation. It is meant to emulate a GIC for a guest in a virtual machine,
> >> the common name for that is VGIC (from "virtual GIC").
> >>
> >> This design was the result of a one-week-long design session with some
> >> engineers in a room, triggered by ever-increasing difficulties in maintaining
> >> the existing GIC emulation in the KVM hypervisor. The design eventually
> >> materialised as an alternative VGIC implementation in the Linux kernel
> >> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
> >> was removed, so it is now the current code used by Linux.
> >> Although being used in KVM, the actual design of this VGIC is rather hypervisor
> >> agnostic and can be used by other hypervisors as well, in particular for Xen.
> >>
> >> GIC hardware virtualization support
> >> -----------------------------------
> >>
> >> The ARM Generic Interrupt Controller (since v2) supports the virtualization
> >> extensions, which allow some parts of the interrupt life cycle to be handled
> >> purely inside the guest without exiting into the hypervisor.
> >> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
> >> acknowledgement", "priority drop" and "interrupt deactivate" actions.
> >> So a guest can handle most of the interrupt processing code without
> >> leaving EL1 and trapping into the hypervisor. To accomplish
> >> this, the GIC holds so called "list registers" (LRs), which shadow the
> >> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
> >> involves setting up one LR with the interrupt number, its priority and initial
> >> state (mostly "pending"), then entering the guest. Any EOI related action
> >> from within the guest just acts on those LRs, the hypervisor can later update
> >> the virtual interrupt state when the guest exits the next time (for whatever
> >> reason).
> >> But despite the GIC hardware helping out here, the whole interrupt
> >> configuration management is not virtualized at all and needs to be emulated
> >> by the hypervisor - or another related software component, for instance a
> >> userland emulator. This so called "distributor" part of the GIC consists of
> >> memory mapped registers, which can be trapped by the hypervisor, so any guest
> >> access can be emulated in the usual way.
> >>
> >> VGIC design motivation
> >> ----------------------
> >>
> >> A GIC emulation thus needs to take care of those bits:
> >>
> >> - trap GIC distributor MMIO accesses and shadow the configuration setup
> >>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
> >> - handle incoming hardware and virtual interrupt requests and inject the
> >>   associated virtual interrupt by manipulating one of the list registers
> >> - track the state of a virtual interrupt by inspecting the LRs after the
> >>   guest has exited, possibly adjusting the shadowed virtual interrupt state
> >>
> >> Despite the distributor MMIO register emulation being a sizeable chunk of
> >> the emulation, it is actually not dominant if looking at the frequency at
> >> which it is accessed. Normally the interrupt configuration is done at boot
> >> time or upon initialising the device (driver), but rarely during the actual
> >> run time of a system. Injecting and EOI-ing interrupts however happens much
> >> more often. A good emulation approach should thus focus on tracking the virtual
> >> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
> >> interrupts.
> >>
> >> The actual interrupt state tracking can be quite tricky in parts. Interrupt
> >> injections can be independent from the guest entry/exit points, also MMIO
> >> configuration accesses could be triggered by any VCPU at any point in time.
> >> Changing interrupt CPU affinity adds to the complication.
> >> This leads to many code parts which could run in parallel and thus contain
> >> some race conditions, so proper locking becomes key to a good design.
> >> But one has to consider that interrupts in general can be characterised
> >> as a rare event - otherwise a guest would be busy handling interrupts and could
> >> not process actual computation tasks.
> >> That's why the interrupt state tracking should focus on a clear and race-free
> >> locking scheme, without needlessly optimising too much in this respect.
> >> Experience shows that this complicates the code and leads to undetected and
> >> hard-to-debug race conditions, which affect the stability of the system in
> >> possibly untested corner cases.
> >>
> >> VGIC design principles
> >> ----------------------
> >>
> >> ### Data structure
> >>
> >> This VGIC design is based on the idea of having one structure per virtual
> >> interrupt, protected by its own lock. In addition there is a list per VCPU,
> >> which queues the interrupts which this VCPU should consider for injection.
> >> One interrupt can only be on one VCPU list at any given point in time.
> >> For private interrupts and SPIs a static allocation of this data structure
> >> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
> >> dynamic and possibly very sparse allocation scheme, so we need to deal with
> >> dynamic allocation and de-allocation of this struct. To accommodate this
> >> there is an additional list header to link all LPIs.
> >> Also the LPI mapping and unmapping can happen asynchronously, so we need to
> >> properly ref-count the structure (at least for LPIs), otherwise some code parts
> >> would potentially end up with referencing an already freed pointer.
> >>
> >> The central data structure is called `struct vgic_irq`, and, beside the
> >> expected interrupt configuration data, contains at least the lock, a list
> >> header (to be able to link it to a VCPU) and a refcount. Also it contains
> >> the interrupt number (to accommodate for non-contiguous interrupt allocations,
> >> for instance for LPIs).
> >> Beside those essential elements it proves worth to store (a reference to) the
> >> VCPU this IRQ is associated with. This allows to easily find the respective
> >> VCPU list.
> >>
> >>     struct vgic_irq {
> >>         spinlock_t irq_lock;            /* Protects the content of the struct */
> >>         struct list_head lpi_list;      /* Used to link all LPIs together */
> >>         struct list_head ap_list;
> >>
> >>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
> >>                                          * SPIs and LPIs: The VCPU whose ap_list
> >>                                          * this is queued on.
> >>                                          */
> >>
> >>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
> >>                                           * be sent to, as a result of the
> >>                                           * targets reg (v2) or the
> >>                                           * affinity reg (v3).
> >>                                           */
> >>
> >>         u32 intid;                      /* Guest visible INTID */
> >>         bool line_level;                /* Level only */
> >>         bool pending_latch;             /* The pending latch state used to
> >>                                          * calculate the pending state for
> >>                                          * both level and edge triggered IRQs.
> >>                                          */
> >>
> >>         bool active;                    /* not used for LPIs */
> >>         bool enabled;
> >>         bool hw;                        /* Tied to HW IRQ */
> >>         struct kref refcount;           /* Used for LPIs */
> >>         u32 hwintid;                    /* HW INTID number */
> >>         union {
> >>             u8 targets;                     /* GICv2 target VCPUs mask */
> >>             u32 mpidr;                      /* GICv3 target VCPU */
> >>         };
> >>         u8 source;                      /* GICv2 SGIs only */
> >>         u8 priority;
> >>         enum vgic_irq_config config;    /* Level or edge */
> >>     };
> > 
> > The refcount and irq_lock are good ideas, let's have them.
> > 
> > 
> >> ### VCPU list handling
> >>
> >> Initially a virtual interrupt just lives on its own. Guest MMIO accesses to
> >> the distributor will change the state information in this structure.
> >> When an interrupt is actually made pending (either by an associated hardware
> >> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
> >> linked to the current target VCPU. The `vcpu` member in the structure will
> >> be set to this VCPU. Any affinity change after this point will not affect
> >> the current target VCPU anymore, it just updates the `target_vcpu` field in
> >> the structure, which will be considered on the next injection.
> >> This per-VCPU list is called the `ap_list`, since it holds interrupts which
> >> are in a pending and/or active state.
> > 
> > The two vcpu lists sound like a good idea too, and Christoffer's
> > explanation helped. It is actually similar to what we do in Xen already.
> > I guess a vgic is always a vgic :-)
> 
> Mmmh, I don't get where you see two VCPU *lists* here. There are two
> VCPU *fields* in the structure, but they are completely different from
> the lr_pending and lr_queue lists in Xen.
> In fact I believe that these *two* lists in Xen are one of the major
> pain points in the current VGIC.

OK


> > What happens when the irq is migrated while still in an LR on another
> > pcpu? When/How is the physical affinity changed?
> 
> Per the architecture there is nothing like an IRQ "migration". There is
> the CPU affinity, which determines to which core this IRQ is forwarded
> *when it becomes pending*. Once it has been activated, it stays at this
> core, even if you change the ITARGETSR or IROUTER register afterwards.
> This is a benign race, you just came too late to change the affinity.
> 
> And at the moment we don't synchronize the physical affinity, simply
> because most IRQs in KVM world were virtual so far. Now this is going to
> change, so I guess we have to take a look at this at some point. But I
> consider this an optimization, and would prefer correctness and
> stability over performance.

This feature is a key enabler for Xen in embedded use cases; not having
it would certainly be a blocker. It needs to be part of the design doc.


> > What happens when a new irq is supposed to be injected when target_vcpu is
> > already set? Does target_vcpu simply get overwritten?
> 
> target_vcpu is simply a configuration storage. Anyone can update this
> field at any time, without any side effects.
> When an IRQ is going to be injected, the current value of target_vcpu is
> written *once* to the "vcpu" field, which from now on determines the
> responsible VCPU for the whole interrupt life cycle (queueing on lists,
> putting into LRs, ...). This field cannot change anymore until the IRQ
> is EOIed.

OK, I understand. Sounds good.


> > What happens when a vcpu is migrated from pcpu1 to pcpu2?
> 
> Nothing spectacular, I guess. We don't care about the physical IRQ
> affinity. And since we clear all LRs on exit and (re-)populate them on
> entry, doing this on two different CPUs is a total no-brainer.

We should care about physical IRQ affinity, and it should be part of
this design doc. This would be a regression.


> >> ### Virtual IRQ references
> >>
> >> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
> >> given its number.
> >> For private IRQs and SPIs it is expected that this just indexes a static array.
> >> For LPIs (which are dynamically allocated at run time) this is expected to
> >> iterate a data structure (like a linked list) to find the right structure.
> >> In any case a call to `vgic_get_irq` will increase a refcount, which will
> >> prevent LPIs from being de-allocated while another part of the VGIC is still
> >> holding a reference. Thus any caller to `vgic_get_irq` shall call
> >> `vgic_put_irq()` after it is done with handling this interrupt.
> >> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
> >> this case the VCPU holds that reference and it is kept as long as the guest
> >> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
> >> been EOIed by the guest and it having been removed from the VCPU list.
> > 
> > I understand the idea behind a refcount and sounds like a good thing to
> > have.
> > 
> > Let me ask you a couple of questions. How does it help with the issue
> > that an LPI could be discarded and remapped (MAPTI) from another
> > pcpu while it could still be in an LR?
> 
> On DISCARD we remove it from the list of mapped LPIs, but don't free the
> structure. So any vgic_get_lpi(lpi_nr) won't find it anymore. But since
> the interrupt is in an LR, the VCPU's ap_list still references the
> vgic_irq structure, so we can do the whole IRQ life cycle management
> just as normal (because being a list member is what counts when it comes
> to a "life" interrupt).
> Once this LPI is EOIed, we remove it from the VCPU list, which decreases
> the refcount and most probably will free the memory, since the value has
> become zero by then. Normally, without unmapping it before, the
> reference held by the ITS list would make sure the refcount stays
> greater than 0.
>
> Now when there is a MAPTI to the same LPI number meanwhile, this will
> allocate a new structure (this is a new interrupt!) and enters this into
> the ITS list. So anyone asking for this new LPI *number* will get the
> reference to the new IRQ. Think: deleting a file and creating a new one
> with the same name on a UNIX system, any old users of an already opened
> file descriptor will still use the deleted file, but an open() will
> return a handle to the new file.

This needs to be captured in the doc.

Are vgic_irq structs dynamically allocated? Is there a reuse
scheme to prevent a malicious guest from spamming Xen with LPI requests?
Multiple struct vgic_irq for the same LPI would cause even more memory
allocations.

If both the old and the new vgic_irq struct end up being written to LRs,
wouldn't it cause problems?


> > What happens if the MAPTI is
> > issued before and what happens if it is issed after the irq has been
> > EOId and cleared from the LR and ap_list?
> 
> I believe the above description should answer this. If not, please let
> me know.
> 
> > I am referring to the case that we currently handling with the
> > GIC_IRQ_GUEST_PRISTINE_LPI flag in Xen.
> 
> ... which was a hack of mine to work around the missing refcount ;-)

Yes, you can say that :)


> >> ### Locking
> >>
> >> To keep the `vgic_irq` structure consistent and to avoid races between
> >> different parts of the VGIC, locking is essential whenever accessing a member
> >> of this structure. It is expected that this lock is almost never contended,
> >> also held only for brief periods of time, so this is considered cheap.
> >> To keep the code clean and avoid nasty corner cases, there are no tricks on
> >> trying to be lockless here.
> >> If for any reason the code needs to hold the locks for two virtual IRQs, the
> >> one with the lower IRQ number is to be taken first, to avoid deadlocks.
> >>
> >> Another lock to consider is the VCPU lock, which on the first glance protects
> >> the virtual CPU's list structure, but also synchronises additions and removals
> >> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
> >> lock need to be held. To avoid deadlocks, there is a strict locking order:
> >>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
> > 
> > Sounds good (it is basically what I suggested to do in the past).
> > 
> > 
> >> Some operations (like migrating IRQs between two VCPUs) require two VCPU
> >> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
> >> is to be taken first.
> >>
> >> There are occasions where the locking order (VCPU first) is hard to observe,
> >> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
> >> list. In this case the IRQ lock needs to be dropped, the respective VCPU
> >> lock should be taken, then the per-IRQ lock needs to be re-taken.
> >> After both the locks are held, we need to check if the conditions which
> >> originally mandated the list addition (or removal) are still true. This is
> >> needed because the IRQ lock could have been taken by another entity meanwhile
> >> and the state of this interrupt could have been changed. Examples are if the
> >> interrupt is no longer pending, got disabled or changed the CPU affinity.
> >> Some of those changes might render the current action obsolete (no longer
> >> pending), others will lead to a retry of the re-locking scheme described above.
> >> This re-locking scheme shall be implemented in a well-documented function.
> >>
> >> ### Level and edge triggered interrupts
> >>
> >> The GIC knows about two kinds of signalling interrupts:
> >>
> >> - Edge triggered interrupts are triggered by a device once, their life cycle
> >> ends when the guest has EOIed them, at which point we remove the pending state,
> >> clear the LR and return the `vgic_irq` structure to a quiescent state.
> > 
> > I assume that "at which point" means at the next trap into the
> > hypervisor? We are not trapping on purpose guest EOIs, are we?
> 
> Correct. This means our data structures are not up-to-date all of the
> itme. But I believe this only matters for the ISPENDINGR/ISACTIVER
> register accesses, which are handled in a special way to fix this.
> And this is nothing implementation specific, but a general feature of
> the GIC emulation architecture.

Right. Xen structs today are also not up-to-date all the time for the
same reason I guess.


> > Is it possible to have active and pending irqs in an LR? How is that
> > handled?
> 
> Sure. This happens when there is a new interrupt triggered while the old
> one has been activated, but not EOIed (yet).
> This actually happens in the following scenario:
> - IRQ triggers and gets injected as "pending".
> - Guest acks vIRQ by reading virtual ICC_IAR, the LR state changes from
> pending to active.
> - During the further interrupt handling the guest triggers an MMIO fault
> (because it wants to read data from the device or explicitly lowers the
> IRQ line with a register access). The CPU exits, and the "active-only"
> state becomes visible to the hypervisor.
> - The HV syncs back the LR to our struct, clearing the pending latch, but
> setting the active field. The struct is in sync now.
> - For whatever reason the interrupt fires *again* while the HV is still
> in charge. This sets the pending state in our struct.
> - Upon guest entry we sync both the active and pending bit to the LR,
> making it both active *and* pending.
> - The guest's IRQ handler continues to handle the IRQ, the active bit
> "shadows" the pending condition for now. Eventually  the handler retires
> the IRQ by EOIing it, dropping the active state (in the LR).
> - Now immediately after this drop the virtual IRQ is firing again, since
> it is pending, but not blocked by the active state anymore.
> - The guest's IRQ handler is invoked again and handles this second IRQ
> as normal.
> 
> Does that make sense?
> This is a bit simplified description for the sake of clarity, as there
> are corner cases with priority drops, for instance.

OK


> >> - Level triggered interrupts are triggered when a device raises its interrupt
> >> line, they stay pending as long as this line is held high. At some point the
> >> driver in the guest is expected to program the device to explicitly or
> >> implicitly lower this interrupt line. That means that we have to store the
> >> state of the virtual interrupt line, which is only controlled by the (virtual)
> >> device. This is done in the `line_level` member of `struct vgic_irq`.
> >>
> >> To assert the interrupt condition, a (virtual) device calls a function exported
> >> by the VGIC, which allows to raise or lower an interrupt line. Lowering the
> >> line for an edge triggered IRQ is ignored (and so is optional). Raising the
> >> line asserts the pending state and potentially injects this virtual IRQ. Any
> >> subsequent "raising" call might inject another IRQ, if the previous has at
> >> least been activated by the guest already, otherwise it is ignored.
> > 
> > The irq becomes active and pending in the LR?
> 
> Yes, see above.
> 
> >> For level triggered interrupts this function stores the new state into the
> >> `line_level` variable, potentially injecting the interrupt if that line
> >> changes from false to true. If the line is lowered before the guest has
> >> seen it, this particular interrupt instance will be discarded. Successive
> >> "raising" calls will not lead to multiple interrupts if the line has not
> >> been lowered in between.
> > 
> > This is something Xen needs too.
> > 
> > 
> >> ### Software triggered interrupts
> >>
> >> Beside the naturally software triggered inter-processor-interrupts
> >> (SGIs in GIC speak), there is another way of letting software raise an
> >> interrupt condition.
> >> The GIC distributor allows to set or clear both the pending and active state
> >> of any interrupt via MMIO registers. This isn't widely used by many operating
> >> systems, but is useful when saving and restoring the state of a machine.
> >> So emulating these functions is required for being architecture compliant,
> >> however the implementation might not need to be very efficient given its rare
> >> usage. In fact supporting the set-pending and clear-pending registers is
> >> relatively straight-forward, as long as one keeps this state separate from
> >> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
> >>
> >> The set-active and clear-active registers are much harder to emulate, though,
> >> as normally the active state is of little concern to the GIC emulation. In
> >> a normal interrupt life cycle the active state isn't even visible to the
> >> hypervisor, as it might be set and cleared again entirely within the guest
> >> in the list register, without exiting to the hypervisor.
> >> So manipulating the active state via the MMIO registers requires some heavy
> >> lifting: If this interrupt is currently injected into a running VCPU, this
> >> VCPU must exit, the active state must be set or cleared in the LR, then
> >> execution can continue. While this is expensive, as mentioned above this
> >> should not happen too often, also probably the system isn't very performance
> >> sensitive when using this feature for save and restore anyway.
> > 
> > set-active and clear-active registers are not emulated in Xen today, it
> > would be nice to have them.
> > 
> > How does the locking/synchronization work in the case given that the
> > vCPU that needs to exit could be running on a different pCPU?
> 
> As I hinted above this is a bit of a sledge hammer: We call
> kvm_arm_halt_guest() to force all VCPUs to exit and to make sure we are
> in sync. Then we can safely update the status, and the normal entry
> process takes care of writing this into the LRs.

How does vcpu1/pcpu1 tell vcpu2/pcpu2 that after trapping into the
hypervisor, it should get rid of the active bit or set the active bit?
Is it done via the active field in struct vgic_irq?


> > The suggestion of using this model in Xen was made in the past already.
> > I always objected for the reason that we don't actually know how many
> > LRs the hardware provides, potentially very many, and it is expensive
> > and needless to read/write them all every time on entry/exit.
> > 
> > I would prefer to avoid that, but I'll be honest: I can be convinced
> > that that model of handling LRs is so much simpler that it is worth it.
> > I am more concerned about the future maintenance of a separate new
> > driver developed elsewhere.
> 
> I think this LR topic should have been covered in that other email.
> 
> Beside being a strong supporter of the KISS principle in general, I
> believe in case of the GIC emulation we should avoid (premature)
> optimizations like the plague, as there are quite some corner cases in
> any VGIC, and handling all of them explicitly with some hacks will not
> fly (been there, done that).
> So I can just support Christoffer's point: having an architecture
> compliant VGIC emulation in a maintainable manner requires a
> straight-forward and clear design. Everything else should be secondary,
> and can be evaluated later, if there are good reasons (numbers!).

The reason why I stated the above is that I ran the numbers back in the
day and reading or writing LRs on an XGene was so slow that it made
sense to avoid it as much as possible. But maybe things have changed if
Christoffer also ran the numbers and managed to demonstrate the
opposite.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01  9:15     ` Andre Przywara
@ 2017-11-02  7:38       ` Christoffer Dall
  0 siblings, 0 replies; 13+ messages in thread
From: Christoffer Dall @ 2017-11-02  7:38 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Marc Zyngier, xen-devel, Stefano Stabellini, Julien Grall, Eric Auger

On Wed, Nov 1, 2017 at 10:15 AM, Andre Przywara
<andre.przywara@linaro.org> wrote:
> Hi,
>
> On 01/11/17 04:31, Christoffer Dall wrote:
>> On Wed, Nov 1, 2017 at 9:58 AM, Stefano Stabellini
>> <sstabellini@kernel.org> wrote:
>>
>> [....]
>
> Christoffer, many thanks for answering this!
> I think we have a lot of assumptions about the whole VGIC life cycle
> floating around, but it would indeed be good to get some numbers behind it.
> I would be all too happy to trace some workloads on Xen again and
> getting some metrics, though this sounds time consuming if done properly.
>
> Do you have any numbers on VGIC performance available somewhere?
>

I ran the full set of workloads and micro numbers that I've used for
my papers with/without the optimization, and couldn't find a
measurable difference.

-Christoffer


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01 21:54     ` Stefano Stabellini
@ 2017-11-02  7:40       ` Christoffer Dall
  2017-11-02 16:00       ` Andre Przywara
  1 sibling, 0 replies; 13+ messages in thread
From: Christoffer Dall @ 2017-11-02  7:40 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Marc Zyngier, xen-devel, Julien Grall, Andre Przywara, Eric Auger

On Wed, Nov 1, 2017 at 10:54 PM, Stefano Stabellini
<sstabellini@kernel.org> wrote:

[...]

>
>> > The suggestion of using this model in Xen was made in the past already.
>> > I always objected for the reason that we don't actually know how many
>> > LRs the hardware provides, potentially very many, and it is expensive
>> > and needless to read/write them all every time on entry/exit.
>> >
>> > I would prefer to avoid that, but I'll be honest: I can be convinced
>> > that that model of handling LRs is so much simpler that it is worth it.
>> > I am more concerned about the future maintenance of a separate new
>> > driver developed elsewhere.
>>
>> I think this LR topic should have been covered in that other email.
>>
>> Beside being a strong supporter of the KISS principle in general, I
>> believe in case of the GIC emulation we should avoid (premature)
>> optimizations like the plague, as there are quite some corner cases in
>> any VGIC, and handling all of them explicitly with some hacks will not
>> fly (been there, done that).
>> So I can just support Christoffer's point: having an architecture
>> compliant VGIC emulation in a maintainable manner requires a
>> straight-forward and clear design. Everything else should be secondary,
>> and can be evaluated later, if there are good reasons (numbers!).
>
> The reason why I stated the above is that I ran the numbers back in the
> day and reading or writing LRs on an XGene was so slow that it made
> sense to avoid it as much as possible. But maybe things have changed if
> Christoffer also ran the numbers and managed to demonstrate the
> opposite.

Accessing LRs on GICv2 is terrible indeed, and we already optimize it
as much as it makes sense. It's just that with the current KVM code
base, reading/writing the same value almost never happens, so there's
no more room (in practice) for optimization.

Also, note that with GICv3 the cost goes down a lot, and potentially
also for other integrations of GICv2.

-Christoffer


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-01 21:54     ` Stefano Stabellini
  2017-11-02  7:40       ` Christoffer Dall
@ 2017-11-02 16:00       ` Andre Przywara
  2017-11-02 17:56         ` Stefano Stabellini
  1 sibling, 1 reply; 13+ messages in thread
From: Andre Przywara @ 2017-11-02 16:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Marc Zyngier, xen-devel, Julien Grall, Christoffer Dall, Eric Auger

Hi,

On 01/11/17 21:54, Stefano Stabellini wrote:
> On Wed, 1 Nov 2017, Andre Przywara wrote:
>> Hi Stefano,
>>
>>
>> On 01/11/17 01:58, Stefano Stabellini wrote:
>>> On Wed, 11 Oct 2017, Andre Przywara wrote:
>>
>> many thanks for going through all of this!
> 
> No problems, and thanks for your work and for caring about doing the
> best thing for the project.
> 
> 
>>>> (CC:ing some KVM/ARM folks involved in the VGIC)
>>>>
>>>> starting with the addition of the ITS support we were seeing more and
>>>> more issues with the current implementation of our ARM Generic Interrupt
>>>> Controller (GIC) emulation, the VGIC.
>>>> Among other approaches to fix those issues it was proposed to copy the
>>>> VGIC emulation used in KVM. This one was suffering from very similar
>>>> issues, and a clean design from scratch lead to a very robust and
>>>> capable re-implementation. Interestingly this implementation is fairly
>>>> self-contained, so it seems feasible to copy it. Hopefully we only need
>>>> minor adjustments, possibly we can even copy it verbatim with some
>>>> additional glue layer code.
>>>>
>>>> Stefano asked for getting a design overview, to assess the feasibility
>>>> of copying the KVM code without reviewing tons of code in the first
>>>> place.
>>>> So to follow Xen rules for new features, this design document below is
>>>> an attempt to describe the current KVM VGIC design - in a hypervisor
>>>> agnostic session. It is a bit of a retro-fit design description, as it
>>>> is not strictly forward-looking only, but actually describing the
>>>> existing implemenation [1].
>>>>
>>>> Please have a look and let me know:
>>>> 1) if this document has the right scope
>>>> 2) if this document has the right level of detail
>>>> 3) if there are points missing from the document
>>>> 3) if the design in general is a fit
>>>
>>> Please read the following statements as genuine questions and concerns.
>>> Most ideas on this document are good. Some of them I have even suggested
>>> myself in the context of GIC improvements for Xen. I asked for a
>>> couple of clarifications.
>>>
>>> But I don't see why we cannot implement these ideas on top of the
>>> existing code, rather than with a separate codebase, ending up with two
>>> drivers. I would prefer a natural evolution. Specifically, the following
>>> improvements would be simple and would give us most of the benefits on
>>> top of the current codebase:
>>> - adding the irq lock, and the refcount
>>> - taking both vcpu locks when necessary (on migration code for example
>>>   it would help a lot), the lower vcpu_id first
>>> - level irq emulation
>>
>> I think some of those points you mentioned are not easily implemented in
>> the current Xen. For instance I ran into locking order issues with those
>> *two* inflight and lr_queue lists, when trying to implement the lock and
>> the refcount.
>> Also this "put vIRQs into LRs early, but possibly rip them out again" is
>> really complicating things a lot.
>>
>> I believe only level IRQs could be added in a relatively straight
>> forward manner.
>>
>> So the problem with the evolutionary approach is that it generates a lot
>> of patches, some of them quite invasive, others creating hard-to-read
>> diffs, which are both hard to review.
>> And chances are that the actual result would be pretty close to the KVM
>> code. To be clear: I hacked the Xen VGIC into the KVM direction in a few
>> days some months ago, but it took me *weeks* to make sane patches of
>> only the first part of it.
>> And this would not cover all those general, tedious corner cases that
>> the VGIC comes with. Those would need to be fixed in a painful process,
>> which we could avoid by "lifting" the KVM code.
> 
> I hear you, but the principal cost here is the review time, not the
> development time. Julien told me that it would be pretty much the same
> for him in terms of time it takes to review the changes, it doesn't
> matter if it's a new driver or changes to the existing driver. For me,
> it wouldn't be the same: I think it would take me far less time to
> review them if they were against the existing codebase.

I am not so sure about this. The changes are quite dramatic, and such
changes tend to produce horrible diffs. Alternatively we could try to
mitigate this, but that comes at the cost of having *many* patches,
which take a while to produce.
But if we instantiate a new VGIC implementation from scratch, we can
provide very nice-to-review patches, because the patches can focus on
logical changes and don't need to care about bisectability.

> However, as I wrote, this is not my foremost concern. I would be up to
> committing myself to review this even if we decide to go for a new
> driver.
> 
> 
>>> If we do end up with a second separate driver for technical or process
>>> reasons, I would expect the regular Xen submission/review process to be
>>> followed. The code style will be different, the hooks into the rest of
>>> the hypervisors will be different and things will be generally changed.
>>> The new V/GIC might be derived from KVM, but it should end up looking
>>> and feeling like a 100% genuine Xen component. After all, we'll
>>> maintain it going forward. I don't want a copy of a Linux driver with
>>> glue code. The Xen community cannot be expected not to review the
>>> submission, but if we review it, then we'll ask for changes. Once we
>>> change the code, there will be no point in keeping the Linux code
>>> separate with glue code. We should fully adapt it to Xen.
>>
>> I see your point, and this actually simplifies *my* work, but I am a bit
>> worried about the effects of having two separate implementations which
>> then diverge over time.
>> At the moment we have two separate implementations as well, but they are
>> quite different, which has the advantage of doing things differently
>> enough to help in finding bugs in the other one (something we should
>> actually exploit in testing, I believe).
> 
> It is a matter of ownership and responsibilities. The gic and vgic
> components are critical to the hypervisor functionalities, and as Xen
> Project we need to take ownership of them. It means fixing bugs and
> maintaining them going forward. It makes sense to have them fully
> integrated into Xen.

Yes, I can see that. I have now come to believe that taking the KVM code
*verbatim* is not worth the effort: at the moment I am struggling with
tiny but nasty details.
If we allow the code to be changed, we get much more freedom.

>> So how is your feeling towards some shared "libvgic"? I understand that
>> people are not too happy about that extra maintenance cost of having a
>> separate repository, but I am curious what your, Marc's and
>> Christoffer's take is on this idea.
> 
> I am open to this discussion. It is nice in theory, but it is hard to
> put into practice. I think neither Julien and I nor Christoffer and Marc
> like the idea of a separate repository. It is a pain and it is ugly. But
> if we don't have a single repository, how can we share the codebase?
> 
> Also keep in mind that Xen and Linux have different release cycles and
> they go into freeze at different times. It affects when/how fixes can
> get into the codebase.
> 
> Unless you come up with a clever idea on how to make this work, I think
> we are better off with our own version of the driver.

Yeah, I agree, it would probably be quite some pain, which is hard to
justify, especially from the Linux side.

>>> That is what was done in the past when KVM took code from Xen (for
>>> example async shadow pagetables). I am eager to avoid a situation like
>>> the current SMMU driver in Xen, which comes from Linux, and we are not
>>> entirely sure how to maintain it.
>>>
>>>
>>>> Appreciate any feedback!
>>>>
>>>> Cheers,
>>>> Andre.
>>>>
>>>> ---------------------------------------
>>>>
>>>> VGIC design
>>>> ===========
>>>>
>>>> This document describes the design of an ARM Generic Interrupt Controller (GIC)
>>>> emulation. It is meant to emulate a GIC for a guest in an virtual machine,
>>>> the common name for that is VGIC (from "virtual GIC").
>>>>
>>>> This design was the result of a one-week-long design session with some
>>>> engineers in a room, triggered by ever-increasing difficulties in maintaining
>>>> the existing GIC emulation in the KVM hypervisor. The design eventually
>>>> materialised as an alternative VGIC implementation in the Linux kernel
>>>> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
>>>> was removed, so it is now the current code used by Linux.
>>>> Although being used in KVM, the actual design of this VGIC is rather hypervisor
>>>> agnostic and can be used by other hypervisors as well, in particular for Xen.
>>>>
>>>> GIC hardware virtualization support
>>>> -----------------------------------
>>>>
>>>> The ARM Generic Interrupt Controller (since v2) supports the virtualization
>>>> extensions, which allows some parts of the interrupt life cycle to be handled
>>>> purely inside the guest without exiting into the hypervisor.
>>>> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
>>>> acknowledgement", "priority drop" and "interrupt deactivate" actions.
>>>> So a guest can handle most of the interrupt processing code without
>>>> leaving EL1 and trapping into the hypervisor. To accomplish
>>>> this, the GIC holds so called "list registers" (LRs), which shadow the
>>>> interrupt state for any virtual interrupt. Injecting an interrupt to a guest
>>>> involves setting up one LR with the interrupt number, its priority and initial
>>>> state (mostly "pending"), then entering the guest. Any EOI related action
>>>> from within the guest just acts on those LRs, the hypervisor can later update
>>>> the virtual interrupt state when the guest exits the next time (for whatever
>>>> reason).
>>>> But despite the GIC hardware helping out here, the whole interrupt
>>>> configuration management is not virtualized at all and needs to be emulated
>>>> by the hypervisor - or another related software component, for instance a
>>>> userland emulator. This so called "distributor" part of the GIC consists of
>>>> memory mapped registers, which can be trapped by the hypervisor, so any guest
>>>> access can be emulated in the usual way.
>>>>
>>>> VGIC design motivation
>>>> ----------------------
>>>>
>>>> A GIC emulation thus needs to take care of those bits:
>>>>
>>>> - trap GIC distributor MMIO accesses and shadow the configuration setup
>>>>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
>>>> - handle incoming hardware and virtual interrupt requests and inject the
>>>>   associated virtual interrupt by manipulating one of the list registers
>>>> - track the state of a virtual interrupt by inspecting the LRs after the
>>>>   guest has exited, possibly adjusting the shadowed virtual interrupt state
>>>>
>>>> Despite the distributor MMIO register emulation being a sizeable chunk of
>>>> the emulation, it is actually not dominant if looking at the frequency at
>>>> which it is accessed. Normally the interrupt configuration is done at boot
>>>> time or upon initialising the device (driver), but rarely during the actual
>>>> run time of a system. Injecting and EOI-ing interrupts however happens much
>>>> more often. A good emulation approach should thus focus on tracking the virtual
>>>> interrupt state efficiently, allowing quick handling of incoming and EOI-ed
>>>> interrupts.
>>>>
>>>> The actual interrupt state tracking can be quite tricky in parts. Interrupt
>>>> injections can be independent from the guest entry/exit points, also MMIO
>>>> configuration accesses could be triggered by any VCPU at any point in time.
>>>> Changing interrupt CPU affinity adds to the complication.
>>>> This leads to many code parts which could run in parallel and thus contain
>>>> some race conditions, so proper locking becomes key to a good design.
>>>> But one has to consider that interrupts in general can be characterised
>>>> as a rare event - otherwise a guest would be busy handling interrupts and could
>>>> not process actual computation tasks.
>>>> That's why the interrupt state tracking should focus on a clear and race-free
>>>> locking scheme, without needlessly optimising too much in this respect.
>>>> Experience shows that this complicates the code and leads to undetected and
>>>> hard-to-debug race conditions, which affect the stability of the system in
>>>> possibly untested corner cases.
>>>>
>>>> VGIC design principles
>>>> ----------------------
>>>>
>>>> ### Data structure
>>>>
>>>> This VGIC design is based on the idea of having one structure per virtual
>>>> interrupt, protected by its own lock. In addition there is a list per VCPU,
>>>> which queues the interrupts which this VCPU should consider for injection.
>>>> One interrupt can only be on one VCPU list at any given point in time.
>>>> For private interrupts and SPIs a static allocation of this data structure
>>>> would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
>>>> dynamic and possibly very sparse allocation scheme, so we need to deal with
>>>> dynamic allocation and de-allocation of this struct. To accommodate this
>>>> there is an additional list header to link all LPIs.
>>>> Also the LPI mapping and unmapping can happen asynchronously, so we need to
>>>> properly ref-count the structure (at least for LPIs), otherwise some code parts
>>>> would potentially end up with referencing an already freed pointer.
>>>>
>>>> The central data structure is called `struct vgic_irq`, and, beside the
>>>> expected interrupt configuration data, contains at least the lock, a list
>>>> header (to be able to link it to a VCPU) and a refcount. Also it contains
>>>> the interrupt number (to accommodate for non-contiguous interrupt allocations,
>>>> for instance for LPIs).
>>>> Besides those essential elements it proves worthwhile to store (a reference
>>>> to) the VCPU this IRQ is associated with. This makes it easy to find the
>>>> respective VCPU list.
>>>>
>>>>     struct vgic_irq {
>>>>         spinlock_t irq_lock;            /* Protects the content of the struct */
>>>>         struct list_head lpi_list;      /* Used to link all LPIs together */
>>>>         struct list_head ap_list;
>>>>
>>>>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>>>>                                          * SPIs and LPIs: The VCPU whose ap_list
>>>>                                          * this is queued on.
>>>>                                          */
>>>>
>>>>         struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
>>>>                                           * be sent to, as a result of the
>>>>                                           * targets reg (v2) or the
>>>>                                           * affinity reg (v3).
>>>>                                           */
>>>>
>>>>         u32 intid;                      /* Guest visible INTID */
>>>>         bool line_level;                /* Level only */
>>>>         bool pending_latch;             /* The pending latch state used to
>>>>                                          * calculate the pending state for
>>>>                                          * both level and edge triggered IRQs.
>>>>                                          */
>>>>
>>>>         bool active;                    /* not used for LPIs */
>>>>         bool enabled;
>>>>         bool hw;                        /* Tied to HW IRQ */
>>>>         struct kref refcount;           /* Used for LPIs */
>>>>         u32 hwintid;                    /* HW INTID number */
>>>>         union {
>>>>             u8 targets;                     /* GICv2 target VCPUs mask */
>>>>             u32 mpidr;                      /* GICv3 target VCPU */
>>>>         };
>>>>         u8 source;                      /* GICv2 SGIs only */
>>>>         u8 priority;
>>>>         enum vgic_irq_config config;    /* Level or edge */
>>>>     };
>>>
>>> The refcount and irq_lock are good ideas, let's have them.
>>>
>>>
>>>> ### VCPU list handling
>>>>
>>>> Initially a virtual interrupt just lives on its own. Guest MMIO accesses to
>>>> the distributor will change the state information in this structure.
>>>> When an interrupt is actually made pending (either by an associated hardware
>>>> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
>>>> linked to the current target VCPU. The `vcpu` member in the structure will
>>>> be set to this VCPU. Any affinity change after this point will not affect
>>>> the current target VCPU anymore, it just updates the `target_vcpu` field in
>>>> the structure, which will be considered on the next injection.
>>>> This per-VCPU list is called the `ap_list`, since it holds interrupts which
>>>> are in a pending and/or active state.
>>>
>>> The two vcpu lists sound like a good idea too, and Christoffer's
>>> explanation helped. It is actually similar to what we do in Xen already.
>>> I guess a vgic is always a vgic :-)
>>
>> Mmmh, I don't get where you see two VCPU *lists* here. There are two
>> VCPU *fields* in the structure, but they are completely different from
>> the lr_pending and lr_queue lists in Xen.
>> In fact I believe that these *two* lists in Xen are one of the major
>> pain points in the current VGIC.
> 
> OK
> 
> 
>>> What happens when the irq is migrated while still in an LR on another
>>> pcpu? When/How is the physical affinity changed?
>>
>> Per the architecture there is nothing like an IRQ "migration". There is
>> the CPU affinity, which determines to which core this IRQ is forwarded
>> *when it becomes pending*. Once it has been activated, it stays at this
>> core, even if you change the ITARGETSR or IROUTER register afterwards.
>> This is a benign race, you just came too late to change the affinity.
>>
>> And at the moment we don't synchronize the physical affinity, simply
>> because most IRQs in KVM world were virtual so far. Now this is going to
>> change, so I guess we have to take a look at this at some point. But I
>> consider this an optimization, and would prefer correctness and
>> stability over performance.
> 
> This feature is a key enabler for Xen in embedded, it is certainly a
> blocker. It needs to be part of the design doc.

OK, can do this.

>>> What happens when a new irq is supposed to be injected when target_vcpu is
>>> already set? Does target_vcpu simply get overwritten?
>>
>> target_vcpu is simply a configuration storage. Anyone can update this
>> field at any time, without any side effects.
>> When an IRQ is going to be injected, the current value of target_vcpu is
>> written *once* to the "vcpu" field, which from now on determines the
>> responsible VCPU for the whole interrupt life cycle (queueing on lists,
>> putting into LRs, ...). This field cannot change anymore until the IRQ
>> is EOIed.
> 
> OK, I understand. Sounds good.
> 
> 
>>> What happens when a vcpu is migrated from pcpu1 to pcpu2?
>>
>> Nothing spectacular, I guess. We don't care about the physical IRQ
>> affinity. And since we clear all LRs on exit and (re-)populate them on
>> entry, doing this on two different CPUs is a total no-brainer.
> 
> We should care about physical IRQ affinity, and it should be part of
> this design doc. This would be a regression.

Understood.

>>>> ### Virtual IRQ references
>>>>
>>>> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
>>>> given its number.
>>>> For private IRQs and SPIs it is expected that this just indexes a static array.
>>>> For LPIs (which are dynamically allocated at run time) this is expected to
>>>> iterate a data structure (like a linked list) to find the right structure.
>>>> In any case a call to `vgic_get_irq` will increase a refcount, which will
>>>> prevent LPIs from being de-allocated while another part of the VGIC is still
>>>> holding a reference. Thus any caller to `vgic_get_irq` shall call
>>>> `vgic_put_irq()` after it is done with handling this interrupt.
>>>> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
>>>> this case the VCPU holds that reference and it is kept as long as the guest
>>>> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
>>>> been EOIed by the guest and it having been removed from the VCPU list.
>>>
>>> I understand the idea behind a refcount and sounds like a good thing to
>>> have.
>>>
>>> Let me ask you a couple of questions. How does it help with the issue
>>> that an LPI could be discarded and remapped (MAPTI) from another
>>> pcpu while it could still be in an LR?
>>
>> On DISCARD we remove it from the list of mapped LPIs, but don't free the
>> structure. So any vgic_get_lpi(lpi_nr) won't find it anymore. But since
>> the interrupt is in an LR, the VCPU's ap_list still references the
>> vgic_irq structure, so we can do the whole IRQ life cycle management
>> just as normal (because being a list member is what counts when it comes
>> to a "life" interrupt).
>> Once this LPI is EOIed, we remove it from the VCPU list, which decreases
>> the refcount and most probably will free the memory, since the value has
>> become zero by then. Normally, without unmapping it before, the
>> reference held by the ITS list would make sure the refcount stays
>> greater than 0.
>>
>> Now when there is a MAPTI to the same LPI number meanwhile, this will
>> allocate a new structure (this is a new interrupt!) and enters this into
>> the ITS list. So anyone asking for this new LPI *number* will get the
>> reference to the new IRQ. Think: deleting a file and creating a new one
>> with the same name on a UNIX system, any old users of an already opened
>> file descriptor will still use the deleted file, but an open() will
>> return a handle to the new file.
> 
> This needs to be captured in the doc.
> 
> Are vgic_irq structs dynamically allocated?
> Is there a reutilization
> scheme to avoid a malicious guest from spamming Xen with LPI requests?
> Multiple struct vgic_irq for the same LPI would cause even more memory
> allocations.

Interesting point. I need to think about a neat solution. For normal
cases I think we might want to stick with the current Xen scheme of
allocating the vIRQ structs when we map a device, then handing out
pointers to some array member on vgic_add_lpi(). Maybe we re-pointer the
existing vIRQ to point to some other location, and use the
device-provided storage.
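
To make that idea a bit more concrete, here is a rough sketch (all names
below are made up for illustration and don't exist like this in Xen or
KVM, and the "re-pointer" problem of a still-referenced old mapping is
deliberately left out): the vgic_irq structs for all events of a device
get allocated once at MAPD time, so a later MAPTI never allocates memory
and a guest cannot drive unbounded allocations:

    /* Illustrative sketch only: names and layout are made up. */
    struct its_device {
        struct list_head dev_list;
        u32 device_id;
        u32 nr_events;
        struct vgic_irq *lpis;          /* nr_events entries, allocated at MAPD */
    };

    static struct vgic_irq *vgic_add_lpi(struct vgic_dist *dist,
                                         struct its_device *dev,
                                         u32 event_id, u32 intid)
    {
        struct vgic_irq *irq;

        if (event_id >= dev->nr_events)
            return NULL;

        irq = &dev->lpis[event_id];
        memset(irq, 0, sizeof(*irq));
        spin_lock_init(&irq->irq_lock);
        INIT_LIST_HEAD(&irq->ap_list);
        irq->intid = intid;
        kref_init(&irq->refcount);      /* reference held by the ITS mapping */

        /* Link it up so a later vgic_get_irq() can find this LPI by number. */
        spin_lock(&dist->lpi_list_lock);
        list_add_tail(&irq->lpi_list, &dist->lpi_list_head);
        spin_unlock(&dist->lpi_list_lock);

        return irq;
    }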

> If both the old and the new vgic_irq struct end up being written to LRs,
> wouldn't it cause problems?

Can't happen. DISCARD removes the pending state. Since LPIs have no
active state, upon the VCPU exiting this LPI's life cycle has finished.
So we just keep it around as long as it's still in a VCPU, but it
vanishes as soon as this VCPU exits.
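
That "vanishing" is simply the last reference being dropped. As a
simplified sketch of the put side of the refcounting scheme described
above (argument types and details differ from the actual KVM code):

    /* The release callback is empty on purpose: the actual freeing happens
     * below, under the lpi_list_lock, so that lookup and free cannot race.
     */
    static void vgic_irq_release(struct kref *ref)
    {
    }

    void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq)
    {
        struct vgic_dist *dist = &kvm->arch.vgic;

        /* SGIs, PPIs and SPIs live in static arrays and are never freed. */
        if (irq->intid < VGIC_MIN_LPI)
            return;

        spin_lock(&dist->lpi_list_lock);
        if (!kref_put(&irq->refcount, vgic_irq_release)) {
            /* Someone else still holds a reference. */
            spin_unlock(&dist->lpi_list_lock);
            return;
        }

        /* That was the last reference: unlink the LPI and free it. */
        list_del(&irq->lpi_list);
        spin_unlock(&dist->lpi_list_lock);
        kfree(irq);
    }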

>>> What happens if the MAPTI is
>>> issued before and what happens if it is issued after the irq has been
>>> EOId and cleared from the LR and ap_list?
>>
>> I believe the above description should answer this. If not, please let
>> me know.
>>
>>> I am referring to the case that we currently handling with the
>>> GIC_IRQ_GUEST_PRISTINE_LPI flag in Xen.
>>
>> ... which was a hack of mine to work around the missing refcount ;-)
> 
> Yes, you can say that :)
> 
> 
>>>> ### Locking
>>>>
>>>> To keep the `vgic_irq` structure consistent and to avoid races between
>>>> different parts of the VGIC, locking is essential whenever accessing a member
>>>> of this structure. It is expected that this lock is almost never contended,
>>>> also held only for brief periods of time, so this is considered cheap.
>>>> To keep the code clean and avoid nasty corner cases, there are no tricks on
>>>> trying to be lockless here.
>>>> If for any reason the code needs to hold the locks for two virtual IRQs, the
>>>> one with the lower IRQ number is to be taken first, to avoid deadlocks.
>>>>
>>>> Another lock to consider is the VCPU lock, which on the first glance protects
>>>> the virtual CPU's list structure, but also synchronises additions and removals
>>>> of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
>>>> lock need to be held. To avoid deadlocks, there is a strict locking order:
>>>>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
>>>
>>> Sounds good (it is basically what I suggested to do in the past).
>>>
>>>
>>>> Some operations (like migrating IRQs between two VCPUs) require two VCPU
>>>> locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
>>>> is to be taken first.
>>>>
>>>> There are occasions where the locking order (VCPU first) is hard to observe,
>>>> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
>>>> list. In this case the IRQ lock needs to be dropped, the respective VCPU
>>>> lock should be taken, then the per-IRQ lock needs to be re-taken.
>>>> After both the locks are held, we need to check if the conditions which
>>>> originally mandated the list addition (or removal) are still true. This is
>>>> needed because the IRQ lock could have been taken by another entity meanwhile
>>>> and the state of this interrupt could have been changed. Examples are if the
>>>> interrupt is no longer pending, got disabled or changed the CPU affinity.
>>>> Some of those changes might render the current action obsolete (no longer
>>>> pending), others will lead to a retry of the re-locking scheme described above.
>>>> This re-locking scheme shall be implemented in a well-documented function.
>>>>
>>>> ### Level and edge triggered interrupts
>>>>
>>>> The GIC knows about two kinds of signalling interrupts:
>>>>
>>>> - Edge triggered interrupts are triggered by a device once, their life cycle
>>>> ends when the guest has EOIed them, at which point we remove the pending state,
>>>> clear the LR and return the `vgic_irq` structure to a quiescent state.
>>>
>>> I assume that "at which point" means at the next trap into the
>>> hypervisor? We are not trapping on purpose guest EOIs, are we?
>>
>> Correct. This means our data structures are not up-to-date all of the
>> time. But I believe this only matters for the ISPENDINGR/ISACTIVER
>> register accesses, which are handled in a special way to fix this.
>> And this is nothing implementation specific, but a general feature of
>> the GIC emulation architecture.
> 
> Right. Xen structs today are also not up-to-date all the time for the
> same reason I guess.
> 
> 
>>> Is it possible to have active and pending irqs in an LR? How is that
>>> handled?
>>
>> Sure. This happens when there is a new interrupt triggered while the old
>> one has been activated, but not EOIed (yet).
>> This actually happens in the following scenario:
>> - IRQ triggers and gets injected as "pending".
>> - Guest acks vIRQ by reading virtual ICC_IAR, the LR state changes from
>> pending to active.
>> - During the further interrupt handling the guest triggers an MMIO fault
>> (because it wants to read data from the device or explicitly lowers the
>> IRQ line with a register access). The CPU exits, and the "active-only"
>> state becomes visible to the hypervisor.
>> - The HV syncs back the LR to our struct, clearing the pending latch, but
>> setting the active field. The struct is in sync now.
>> - For whatever reason the interrupt fires *again* while the HV is still
>> in charge. This sets the pending state in our struct.
>> - Upon guest entry we sync both the active and pending bit to the LR,
>> making it both active *and* pending.
>> - The guest's IRQ handler continues to handle the IRQ, the active bit
>> "shadows" the pending condition for now. Eventually  the handler retires
>> the IRQ by EOIing it, dropping the active state (in the LR).
>> - Now immediately after this drop the virtual IRQ is firing again, since
>> it is pending, but not blocked by the active state anymore.
>> - The guest's IRQ handler is invoked again and handles this second IRQ
>> as normal.
>>
>> Does that make sense?
>> This is a bit simplified description for the sake of clarity, as there
>> are corner cases with priority drops, for instance.
> 
> OK
> 
> 
>>>> - Level triggered interrupts are triggered when a device raises its interrupt
>>>> line, they stay pending as long as this line is held high. At some point the
>>>> driver in the guest is expected to program the device to explicitly or
>>>> implicitly lower this interrupt line. That means that we have to store the
>>>> state of the virtual interrupt line, which is only controlled by the (virtual)
>>>> device. This is done in the `line_level` member of `struct vgic_irq`.
>>>>
>>>> To assert the interrupt condition, a (virtual) device calls a function exported
>>>> by the VGIC, which allows to raise or lower an interrupt line. Lowering the
>>>> line for an edge triggered IRQ is ignored (and so is optional). Raising the
>>>> line asserts the pending state and potentially injects this virtual IRQ. Any
>>>> subsequent "raising" call might inject another IRQ, if the previous one has at
>>>> least been activated by the guest already; otherwise it is ignored.
>>>
>>> The irq becomes active and pending in the LR?
>>
>> Yes, see above.
>>
>>>> For level triggered interrupts this function stores the new state into the
>>>> `line_level` variable, potentially injecting the interrupt if that line
>>>> changes from false to true. If the line is lowered before the guest has
>>>> seen it, this particular interrupt instance will be discarded. Successive
>>>> "raising" calls will not lead to multiple interrupts if the line has not
>>>> been lowered in between.
>>>
>>> This is something Xen needs too.
>>>
>>>
>>>> ### Software triggered interrupts
>>>>
>>>> Beside the naturally software triggered inter-processor-interrupts
>>>> (SGIs in GIC speak), there is another way of letting software raise an
>>>> interrupt condition.
>>>> The GIC distributor allows to set or clear both the pending and active state
>>>> of any interrupt via MMIO registers. This isn't widely used by many operating
>>>> systems, but is useful when saving and restoring the state of a machine.
>>>> So emulating these functions is required for being architecture compliant,
>>>> however the implementation might not need to be very efficient given its rare
>>>> usage. In fact supporting the set-pending and clear-pending registers is
>>>> relatively straight-forward, as long as one keeps this state separate from
>>>> the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.
>>>>
>>>> The set-active and clear-active registers are much harder to emulate, though,
>>>> as normally the active state is of little concern to the GIC emulation. In
>>>> a normal interrupt life cycle the active state isn't even visible to the
>>>> hypervisor, as it might be set and cleared again entirely within the guest
>>>> in the list register, without exiting to the hypervisor.
>>>> So manipulating the active state via the MMIO registers requires some heavy
>>>> lifting: If this interrupt is currently injected into a running VCPU, this
>>>> VCPU must exit, the active state must be set or cleared in the LR, then
>>>> execution can continue. While this is expensive, as mentioned above this
>>>> should not happen too often, and the system probably isn't very
>>>> performance-sensitive when using this feature for save and restore anyway.
>>>
>>> set-active and clear-active registers are not emulated in Xen today, it
>>> would be nice to have them.
>>>
>>> How does the locking/synchronization work in the case given that the
>>> vCPU that needs to exit could be running on a different pCPU?
>>
>> As I hinted above this is a bit of a sledge hammer: We call
>> kvm_arm_halt_guest() to force all VCPUs to exit and to make sure we are
>> in sync. Then we can safely update the status, and the normal entry
>> process takes care of writing this into the LRs.
> 
> How does vcpu1/pcpu1 tell vcpu2/pcpu2 that after trapping into the
> hypervisor, it should get rid of the active bit or set the active bit?
> Is it done via the active field in struct vgic_irq?

Precisely. We kick all VCPUs out, then update the respective bools in
struct vgic_irq. Then, upon entering the guest again, each VCPU
naturally presents the new state in an LR.
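
As a rough sketch of that flow (simplified and illustrative rather than
the literal KVM code; kvm_arm_halt_guest() is the helper mentioned above
and kvm_arm_resume_guest() its counterpart):

    /* Handle a guest write to GICD_ISACTIVER/GICD_ICACTIVER for one IRQ. */
    static void vgic_mmio_change_active(struct kvm *kvm, u32 intid, bool active)
    {
        struct vgic_irq *irq;

        kvm_arm_halt_guest(kvm);        /* all VCPUs exit, their LRs get synced */

        irq = vgic_get_irq(kvm, NULL, intid);
        spin_lock(&irq->irq_lock);
        irq->active = active;
        if (active)
            /* Make sure the IRQ sits on an ap_list; this drops the IRQ lock. */
            vgic_queue_irq_unlock(kvm, irq);
        else
            spin_unlock(&irq->irq_lock);
        vgic_put_irq(kvm, irq);

        kvm_arm_resume_guest(kvm);      /* guest entry writes the new state to LRs */
    }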


Cheers,
Andre.


>>> The suggestion of using this model in Xen was made in the past already.
>>> I always objected for the reason that we don't actually know how many
>>> LRs the hardware provides, potentially very many, and it is expensive
>>> and needless to read/write them all every time on entry/exit.
>>>
>>> I would prefer to avoid that, but I'll be honest: I can be convinced
>>> that that model of handling LRs is so much simpler that it is worth it.
> >>> I am more concerned about the future maintenance of a separate new
>>> driver developed elsewhere.
>>
>> I think this LR topic should have been covered in that other email.
>>
>> Beside being a strong supporter of the KISS principle in general, I
>> believe in case of the GIC emulation we should avoid (premature)
>> optimizations like the plague, as there are quite some corner cases in
>> any VGIC, and handling all of them explicitly with some hacks will not
>> fly (been there, done that).
>> So I can just support Christoffer's point: having an architecture
> >> compliant VGIC emulation in a maintainable manner requires a
>> straight-forward and clear design. Everything else should be secondary,
>> and can be evaluated later, if there are good reasons (numbers!).
> 
> The reason why I stated the above is that I ran the numbers back in the
> day and reading or writing LRs on an XGene was so slow that it made
> sense to avoid it as much as possible. But maybe things have changed if
> Christoffer also ran the numbers and managed to demonstrate the
> opposite.
> 


* Re: [RFC] ARM: New (Xen) VGIC design document
  2017-11-02 16:00       ` Andre Przywara
@ 2017-11-02 17:56         ` Stefano Stabellini
  0 siblings, 0 replies; 13+ messages in thread
From: Stefano Stabellini @ 2017-11-02 17:56 UTC (permalink / raw)
  To: Andre Przywara
  Cc: Stefano Stabellini, Marc Zyngier, Eric Auger, Julien Grall,
	xen-devel, Christoffer Dall

On Thu, 2 Nov 2017, Andre Przywara wrote:
> >>>> (CC:ing some KVM/ARM folks involved in the VGIC)
> >>>>
> >>>> starting with the addition of the ITS support we were seeing more and
> >>>> more issues with the current implementation of our ARM Generic Interrupt
> >>>> Controller (GIC) emulation, the VGIC.
> >>>> Among other approaches to fix those issues it was proposed to copy the
> >>>> VGIC emulation used in KVM. This one was suffering from very similar
> >>>> issues, and a clean design from scratch lead to a very robust and
> >>>> capable re-implementation. Interestingly this implementation is fairly
> >>>> self-contained, so it seems feasible to copy it. Hopefully we only need
> >>>> minor adjustments, possibly we can even copy it verbatim with some
> >>>> additional glue layer code.
> >>>>
> >>>> Stefano asked for getting a design overview, to assess the feasibility
> >>>> of copying the KVM code without reviewing tons of code in the first
> >>>> place.
> >>>> So to follow Xen rules for new features, this design document below is
> >>>> an attempt to describe the current KVM VGIC design - in a hypervisor
> >>>> agnostic session. It is a bit of a retro-fit design description, as it
> >>>> is not strictly forward-looking only, but actually describing the
> >>>> existing implemenation [1].
> >>>>
> >>>> Please have a look and let me know:
> >>>> 1) if this document has the right scope
> >>>> 2) if this document has the right level of detail
> >>>> 3) if there are points missing from the document
> >>>> 3) if the design in general is a fit
> >>>
> >>> Please read the following statements as genuine questions and concerns.
> >>> Most ideas on this document are good. Some of them I have even suggested
> >>> myself in the context of GIC improvements for Xen. I asked for a
> >>> couple of clarifications.
> >>>
> >>> But I don't see why we cannot implement these ideas on top of the
> >>> existing code, rather than with a separate codebase, ending up with two
> >>> drivers. I would prefer a natural evolution. Specifically, the following
> >>> improvements would be simple and would give us most of the benefits on
> >>> top of the current codebase:
> >>> - adding the irq lock, and the refcount
> >>> - taking both vcpu locks when necessary (on migration code for example
> >>>   it would help a lot), the lower vcpu_id first
> >>> - level irq emulation
> >>
> >> I think some of those points you mentioned are not easily implemented in
> >> the current Xen. For instance I ran into locking order issues with those
> >> *two* inflight and lr_queue lists, when trying to implement the lock and
> >> the refcount.
> >> Also this "put vIRQs into LRs early, but possibly rip them out again" is
> >> really complicating things a lot.
> >>
> >> I believe only level IRQs could be added in a relatively straight
> >> forward manner.
> >>
> >> So the problem with the evolutionary approach is that it generates a lot
> >> of patches, some of them quite invasive, others creating hard-to-read
> >> diffs, which are both hard to review.
> >> And chances are that the actual result would be pretty close to the KVM
> >> code. To be clear: I hacked the Xen VGIC into the KVM direction in a few
> >> days some months ago, but it took me *weeks* to make sane patches of
> >> only the first part of it.
> >> And this would not cover all those general, tedious corner cases that
> >> the VGIC comes with. Those would need to be fixed in a painful process,
> >> which we could avoid by "lifting" the KVM code.
> > 
> > I hear you, but the principal cost here is the review time, not the
> > development time. Julien told me that it would be pretty much the same
> > for him in terms of time it takes to review the changes, it doesn't
> > matter if it's a new driver or changes to the existing driver. For me,
> > it wouldn't be the same: I think it would take me far less time to
> > review them if they were against the existing codebase.
> 
> I am not so sure about this. The changes are quite dramatic, and such
> changes tend to produce horrible diffs. Alternatively we could try to
> mitigate this, but that comes at the cost of having *many* patches,
> which take a while to produce.
> But if we instantiate a new VGIC implementation from scratch, we can
> provide very nice-to-review patches, because the patches can focus on
> logical changes and don't need to care about bisectability.

All right


> > However, as I wrote, this is not my foremost concern. I would be up to
> > committing myself to review this even if we decide to go for a new
> > driver.
> > 
> > 
> >>> If we do end up with a second separate driver for technical or process
> >>> reasons, I would expect the regular Xen submission/review process to be
> >>> followed. The code style will be different, the hooks into the rest of
> >>> the hypervisors will be different and things will be generally changed.
> >>> The new V/GIC might be derived from KVM, but it should end up looking
> >>> and feeling like a 100% genuine Xen component. After all, we'll
> >>> maintain it going forward. I don't want a copy of a Linux driver with
> >>> glue code. The Xen community cannot be expected not to review the
> >>> submission, but if we review it, then we'll ask for changes. Once we
> >>> change the code, there will be no point in keeping the Linux code
> >>> separate with glue code. We should fully adapt it to Xen.
> >>
> >> I see your point, and this actually simplifies *my* work, but I am a bit
> >> worried about the effects of having two separate implementations which
> >> then diverge over time.
> >> At the moment we have two separate implementations as well, but they are
> >> quite different, which has the advantage of doing things differently
> >> enough to help in finding bugs in the other one (something we should
> >> actually exploit in testing, I believe).
> > 
> > It is a matter of ownership and responsibilities. The gic and vgic
> > components are critical to the hypervisor functionalities, and as Xen
> > Project we need to take ownership of them. It means fixing bugs and
> > maintaining them going forward. It makes sense to have them fully
> > integrated into Xen.
> 
> Yes, I can see that. I have now come to believe that taking the KVM code
> *verbatim* is not worth the effort: at the moment I am struggling with
> tiny but nasty details.
> If we allow the code to be changed, we get much more freedom.

Sounds good.


> >> So how is your feeling towards some shared "libvgic"? I understand that
> >> people are not too happy about that extra maintenance cost of having a
> >> separate repository, but I am curious what your, Marc's and
> >> Christoffer's take is on this idea.
> > 
> > I am open to this discussion. It is nice in theory, but it is hard to
> > put into practice. I think neither Julien and I nor Christoffer and Marc
> > like the idea of a separate repository. It is a pain and it is ugly. But
> > if we don't have a single repository, how can we share the codebase?
> > 
> > Also keep in mind that Xen and Linux have different release cycles and
> > they go into freeze at different times. It affects when/how fixes can
> > get into the codebase.
> > 
> > Unless you come up with a clever idea on how to make this work, I think
> > we are better off with our own version of the driver.
> 
> Yeah, I agree, it would probably be quite some pain, which is hard to
> justify, especially from the Linux side.

Right


> >>>> ### Virtual IRQ references
> >>>>
> >>>> There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
> >>>> given its number.
> >>>> For private IRQs and SPIs it is expected that this just indexes a static array.
> >>>> For LPIs (which are dynamically allocated at run time) this is expected to
> >>>> iterate a data structure (like a linked list) to find the right structure.
> >>>> In any case a call to `vgic_get_irq` will increase a refcount, which will
> >>>> prevent LPIs from being de-allocated while another part of the VGIC is still
> >>>> holding a reference. Thus any caller to `vgic_get_irq` shall call
> >>>> `vgic_put_irq()` after it is done with handling this interrupt.
> >>>> An exception would be if the virtual IRQ is eventually injected into a VCPU. In
> >>>> this case the VCPU holds that reference and it is kept as long as the guest
> >>>> sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
> >>>> been EOIed by the guest and it having been removed from the VCPU list.
> >>>
> >>> I understand the idea behind a refcount and sounds like a good thing to
> >>> have.
> >>>
> >>> Let me ask you a couple of questions. How does it help with the issue
> >>> that an LPI could be discarded and remapped (MAPTI) from another
> >>> pcpu while it could still be in an LR?
> >>
> >> On DISCARD we remove it from the list of mapped LPIs, but don't free the
> >> structure. So any vgic_get_lpi(lpi_nr) won't find it anymore. But since
> >> the interrupt is in an LR, the VCPU's ap_list still references the
> >> vgic_irq structure, so we can do the whole IRQ life cycle management
> >> just as normal (because being a list member is what counts when it comes
> >> to a "life" interrupt).
> >> Once this LPI is EOIed, we remove it from the VCPU list, which decreases
> >> the refcount and most probably will free the memory, since the value has
> >> become zero by then. Normally, without unmapping it before, the
> >> reference held by the ITS list would make sure the refcount stays
> >> greater than 0.
> >>
> >> Now when there is a MAPTI to the same LPI number meanwhile, this will
> >> allocate a new structure (this is a new interrupt!) and enters this into
> >> the ITS list. So anyone asking for this new LPI *number* will get the
> >> reference to the new IRQ. Think: deleting a file and creating a new one
> >> with the same name on a UNIX system, any old users of an already opened
> >> file descriptor will still use the deleted file, but an open() will
> >> return a handle to the new file.
> > 
> > This needs to be captured in the doc.
> > 
> > Are vgic_irq structs dynamically allocated?
> > Is there a reutilization
> > scheme to avoid a malicious guest from spamming Xen with LPI requests?
> > Multiple struct vgic_irq for the same LPI would cause even more memory
> > allocations.
> 
> Interesting point. I need to think about a neat solution. For normal
> cases I think we might want to stick with the current Xen scheme of
> allocating the vIRQ structs when we map a device, then handing out
> pointers to some array member on vgic_add_lpi(). Maybe we re-pointer the
> existing vIRQ to point to some other location, and use the
> device-provided storage.

Yes, I think we need to take some ideas from our long ITS design
sessions to minimize the chances of DOS from a malicious guest. For
example, we have a scheme to reuse pending_irq structs in the ITS that I
think is worth keeping.


> > If both the old and the new vgic_irq struct end up being written to LRs,
> > wouldn't it cause problems?
> 
> Can't happen. DISCARD removes the pending state. Since LPIs have no
> active state, upon the VCPU exiting this LPI's life cycle has finished.
> So we just keep it around as long as it's still in a VCPU, but it
> vanishes as soon as this VCPU exits.

Please add these details to the doc. In fact, let's turn this doc into
useful documentation for the new implementation.


end of thread

Thread overview: 13+ messages
2017-10-11 14:33 [RFC] ARM: New (Xen) VGIC design document Andre Przywara
2017-10-11 14:42 ` Andre Przywara
2017-10-12 12:05 ` Christoffer Dall
2017-11-01 17:54   ` Andre Przywara
2017-11-01  1:58 ` Stefano Stabellini
2017-11-01  4:31   ` Christoffer Dall
2017-11-01  9:15     ` Andre Przywara
2017-11-02  7:38       ` Christoffer Dall
2017-11-01 14:30   ` Andre Przywara
2017-11-01 21:54     ` Stefano Stabellini
2017-11-02  7:40       ` Christoffer Dall
2017-11-02 16:00       ` Andre Przywara
2017-11-02 17:56         ` Stefano Stabellini
