[RFC] ARM: New (Xen) VGIC design document

* [RFC] ARM: New (Xen) VGIC design document
@ 2017-10-11 14:33 Andre Przywara
  2017-10-11 14:42 ` Andre Przywara
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andre Przywara @ 2017-10-11 14:33 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Marc Zyngier, xen-devel, Christoffer Dall, Eric Auger

Hi,

(CC:ing some KVM/ARM folks involved in the VGIC)

starting with the addition of the ITS support we were seeing more and
more issues with the current implementation of our ARM Generic Interrupt
Controller (GIC) emulation, the VGIC.
Among other approaches to fix those issues it was proposed to copy the
VGIC emulation used in KVM. This one was suffering from very similar
issues, and a clean design from scratch lead to a very robust and
capable re-implementation. Interestingly this implementation is fairly
self-contained, so it seems feasible to copy it. Hopefully we only need
minor adjustments, possibly we can even copy it verbatim with some
additional glue layer code.
Stefano asked for getting a design overview, to assess the feasibility
of copying the KVM code without reviewing tons of code in the first
place.
So to follow Xen rules for new features, this design document below is
an attempt to describe the current KVM VGIC design - in a hypervisor
agnostic session. It is a bit of a retro-fit design description, as it
is not strictly forward-looking only, but actually describing the
existing implemenation [1].

Please have a look and let me know:
1) if this document has the right scope
2) if this document has the right level of detail
3) if there are points missing from the document
3) if the design in general is a fit

Appreciate any feedback!

Cheers,
Andre.

---------------------------------------

VGIC design
===========

This document describes the design of an ARM Generic Interrupt Controller (GIC)
emulation. It is meant to emulate a GIC for a guest in an virtual machine,
the common name for that is VGIC (from "virtual GIC").

This design was the result of a one-week-long design session with some
engineers in a room, triggered by ever-increasing difficulties in maintaining
the existing GIC emulation in the KVM hypervisor. The design eventually
materialised as an alternative VGIC implementation in the Linux kernel
(merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
was removed, so it is now the current code used by Linux.
Although being used in KVM, the actual design of this VGIC is rather hypervisor
agnostic and can be used by other hypervisors as well, in particular for Xen.

GIC hardware virtualization support
-----------------------------------

The ARM Generic Interrupt Controller (since v2) supports the virtualization
extensions, which allows some parts of the interrupt life cycle to be handled
purely inside the guest without exiting into the hypervisor.
In the GICv2 and GICv3 architecture this covers mostly the "interrupt
acknowledgement", "priority drop" and "interrupt deactivate" actions.
So a guest can handle most of the interrupt processing code without
leaving EL1 and trapping into the hypervisor. To accomplish
this, the GIC holds so called "list registers" (LRs), which shadow the
interrupt state for any virtual interrupt. Injecting an interrupt to a guest
involves setting up one LR with the interrupt number, its priority and initial
state (mostly "pending"), then entering the guest. Any EOI related action
from within the guest just acts on those LRs, the hypervisor can later update
the virtual interrupt state when the guest exists the next time (for whatever
reason).
But despite the GIC hardware helping out here, the whole interrupt
configuration management is not virtualized at all and needs to be emulated
by the hypervisor - or another related software component, for instance a
userland emulator. This so called "distributor" part of the GIC consists of
memory mapped registers, which can be trapped by the hypervisor, so any guest
access can be emulated in the usual way.

VGIC design motivation
----------------------

A GIC emulation thus needs to take care of those bits:

- trap GIC distributor MMIO accesses and shadow the configuration setup
  (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
- handle incoming hardware and virtual interrupt requests and inject the
  associated virtual interrupt by manipulating one of the list registers
- track the state of a virtual interrupt by inspecting the LRs after the
  guest has exited, possibly adjusting the shadowed virtual interrupt state

Despite the distributor MMIO register emulation being a sizeable chunk of
the emulation, it is actually not dominant if looking at the frequency at
which it is accessed. Normally the interrupt configuration is done at boot
time or upon initialising the device (driver), but rarely during the actual
run time of a system. Injecting and EOI-ing interrupts however happens much
more often. A good emulation approach should thus focus on tracking the virtual
interrupt state efficiently, allowing quick handling of incoming and EOI-ed
interrupts.

The actual interrupt state tracking can be quite tricky in parts. Interrupt
injections can be independent from the guest entry/exit points, also MMIO
configuration accesses could be triggered by any VCPU at any point in time.
Changing interrupt CPU affinity adds to the complication.
This leads to many code parts which could run in parallel and thus contains
some race conditions, so proper locking becomes key of a good design.
But one has to consider that interrupts in general can be characterised
as a rare event - otherwise a guest would be busy handling interrupts and could
not process actual computation tasks.
That's why the interrupt state tracking should focus on a clear and race-free
locking scheme, without needlessly optimising too much in this respect.
Experience shows that this complicates the code and leads to undetected and
hard-to-debug race conditions, which affect the stability of the system in
possibly untested corner cases.

VGIC design principles
----------------------

### Data structure

This VGIC design is based on the idea of having one structure per virtual
interrupt, protected by its own lock. In addition there is a list per VCPU,
which queues the interrupts which this VCPU should consider for injection.
One interrupt can only be on one VCPU list at any given point in time.
For private interrupts and SPIs a static allocation of this data structure
would be sufficient, however LPIs (triggered by a (virtual) ITS) have a very
dynamic and possibly very sparse allocation scheme, so we need to deal with
dynamic allocation and de-allocation of this struct. To accommodate this
there is an additional list header to link all LPIs.
Also the LPI mapping and unmapping can happen asynchronously, so we need to
properly ref-count the structure (at least for LPIs), otherwise some code parts
would potentially end up with referencing an already freed pointer.

The central data structure is called `struct vgic_irq`, and, beside the
expected interrupt configuration data, contains at least the lock, a list
header (to be able to link it to a VCPU) and a refcount. Also it contains
the interrupt number (to accommodate for non-contiguous interrupt allocations,
for instance for LPIs).
Beside those essential elements it proves worth to store (a reference to) the
VCPU this IRQ is associated with. This allows to easily find the respective
VCPU list.

    struct vgic_irq {
        spinlock_t irq_lock;            /* Protects the content of the struct */
        struct list_head lpi_list;      /* Used to link all LPIs together */
        struct list_head ap_list;

        struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
                                         * SPIs and LPIs: The VCPU whose ap_list
                                         * this is queued on.
                                         */

        struct vcpu *target_vcpu;        /* The VCPU that this interrupt should
                                          * be sent to, as a result of the
                                          * targets reg (v2) or the
                                          * affinity reg (v3).
                                          */

        u32 intid;                      /* Guest visible INTID */
        bool line_level;                /* Level only */
        bool pending_latch;             /* The pending latch state used to
                                         * calculate the pending state for
                                         * both level and edge triggered IRQs.
                                         */

        bool active;                    /* not used for LPIs */
        bool enabled;
        bool hw;                        /* Tied to HW IRQ */
        struct kref refcount;           /* Used for LPIs */
        u32 hwintid;                    /* HW INTID number */
        union {
            u8 targets;                     /* GICv2 target VCPUs mask */
            u32 mpidr;                      /* GICv3 target VCPU */
        };
        u8 source;                      /* GICv2 SGIs only */
        u8 priority;
        enum vgic_irq_config config;    /* Level or edge */
    };

### VCPU list handling

Initially a virtual interrupt just lives on its own. Guest MMIO accesses to
the distributor will change the state information in this structure.
When an interrupt is actually made pending (either by an associated hardware
IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
linked to the current target VCPU. The `vcpu` member in the structure will
be set to this VCPU. Any affinity change after this point will not affect
the current target VCPU anymore, it just updates the `target_vpu` field in
the structure, which will be considered on the next injection.
This per-VCPU list is called the `ap_list`, since it holds interrupts which
are in a pending and/or active state.

### Virtual IRQ references

There is a function `vgic_get_irq()` which returns a reference to a virtual IRQ
given its number.
For private IRQs and SPIs it is expected that this just indexes a static array.
For LPIs (which are dynamically allocated at run time) this is expected to
iterate a data structure (like a linked list) to find the right structure.
In any case a call to `vgic_get_irq` will increase a refcount, which will
prevent LPIs from being de-allocated while another part of the VGIC is still
holding a reference. Thus any caller to `vgic_get_irq` shall call
`vgic_put_irq()` after it is done with handling this interrupt.
An exception would be if the virtual IRQ is eventually injected into a VCPU. In
this case the VCPU holds that reference and it is kept as long as the guest
sees this virtual IRQ. The refcount would only be decreased upon the IRQ having
been EOIed by the guest and it having been removed from the VCPU list.

### Locking

To keep the `vgic_irq` structure consistent and to avoid races between
different parts of the VGIC, locking is essential whenever accessing a member
of this structure. It is expected that this lock is almost never contended,
also held only for brief periods of time, so this is considered cheap.
To keep the code clean and avoid nasty corner cases, there are no tricks on
trying to be lockless here.
If for any reason the code needs to hold the locks for two virtual IRQs, the
one with the lower IRQ number is to be taken first, to avoid deadlocks.

Another lock to consider is the VCPU lock, which on the first glance protects
the virtual CPU's list structure, but also synchronises additions and removals
of IRQs from a VCPU. To add an IRQ to a list, both the VCPU and the per-IRQ
lock need to be held. To avoid deadlocks, there is a strict locking order:

> The VCPU lock needs to be taken first, the per-IRQ lock after this.

Some operations (like migrating IRQs between two VCPUs) require two VCPU
locks to be held, in this case the lock for the VCPU with the smaller VCPU ID
is to be taken first.

There are occasions where the locking order (VCPU first) is hard to observe,
because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
list. In this case the IRQ lock needs to be dropped, the respective VCPU
lock should be taken, then the per-IRQ lock needs to be re-taken.
After both the locks are held, we need to check if the conditions which
originally mandated the list addition (or removal) are still true. This is
needed because the IRQ lock could have been taken by another entity meanwhile
and the state of this interrupt could have been changed. Examples are if the
interrupt is no longer pending, got disabled or changed the CPU affinity.
Some of those changes might render to current action obsolete (no longer
pending), other will lead to a retry of the re-locking scheme described above.
This re-locking scheme shall be implemented in a well-documented function.

### Level and edge triggered interrupts

The GIC knows about two kinds of signalling interrupts:

- Edge triggered interrupts are triggered by a device once, their life cycle
ends when the guest has EOIed them, at which point we remove the pending state,
clear the LR and return the `vgic_irq` structure to a quiescent state.

- Level triggered interrupts are triggered when a device raises its interrupt
line, they stay pending as long as this line is held high. At some point the
driver in the guest is expected to program the device to explicitly or
implicitly lower this interrupt line. That means that we have to store the
state of the virtual interrupt line, which is only controlled by the (virtual)
device. This is done in the `line_level` member of `struct vgic_irq`.

To assert the interrupt condition, a (virtual) device calls a function exported
by the VGIC, which allows to raise or lower an interrupt line. Lowering the
line for an edge triggered IRQ is ignored (and so is optional). Raising the
line asserts the pending state and potentially injects this virtual IRQ. Any
subsequent "raising" call might inject another IRQ, if the previous has at
least been activated by the guest already, otherwise is ignored.

For level triggered interrupts this function stores the new state into the
`line_level` variable, potentially injecting the interrupt if that line
changes from false to true. If the line is lowered before the guest has
seen it, this particular interrupt instance will be discarded. Successive
"raising" calls will not lead to multiple interrupts if the line has not
been lowered in between.

### Software triggered interrupts

Beside the naturally software triggered inter-processor-interrupts
(SGIs in GIC speak), there is another way of letting software raise an
interrupt condition.
The GIC distributor allows to set or clear both the pending and active state
of any interrupt via MMIO registers. This isn't widely used by many operating
systems, but is useful when saving and restoring the state of a machine.
So emulating these functions is required for being architecture compliant,
however the implementation might not need to be very efficient given its rare
usage. In fact supporting the set-pending and clear-pending registers is
relatively straight-forward, as long as one keeps this state separate from
the emulated interrupt line. `pending_latch` stores this state in `vgic_irq`.

The set-active and clear-active registers are much harder to emulate, though,
as normally the active state is of little concern to the GIC emulation. In
a normal interrupt life cycle the active state isn't even visible to the
hypervisor, as it might be set and cleared again entirely within the guest
in the list register, without exiting to the hypervisor.
So manipulating the active state via the MMIO registers requires some heavy
lifting: If this interrupt is currently injected into a running VCPU, this
VCPU must exit, the active state must be set or cleared in the LR, then
execution can continue. While this is expensive, as mentioned above this
should not happen too often, also probably the system isn't very performance
sensitive when using this feature for save and restore anyway.

### MMIO emulation

As mentioned before, the distributor and redistributor part of the VGIC needs
to be fully emulated. Those parts are characterised by a range of MMIO
registers. The implementation shall provide a dispatcher function, which
takes the faulted address, relative to the beginning of the MMIO range, and
works out which actual register is affected. It then looks up the the
respective handler function and calls it. Those functions are expected to
be listed in a struct initialiser, which connects the actual register
offset and its size to a particular handler. Having handler functions for
a register range seems beneficial over handling registers in a switch/case,
because it's easier to read and simplifies code sharing, for instance
between the GICv2, GICv3 distributor and GICv3 redistributor registers
with the same semantics.

### List register management

A list register (LR) holds the state of a virtual interrupt, which will
be used by the GIC hardware to simulate an IRQ life cycle for a guest.
Each GIC hardware implementation can choose to implement a number of LRs,
having four of them seems to be a common value. This design here does not
try to manage the LRs very cleverly, instead on every guest exit every LR
in use will be synced to the emulated state, then cleared. Upon guest entry
the top priority virtual IRQs will be inserted into the LRs. If there are
more pending or active IRQs than list registers, the GIC management IRQ
will be configured to notify the hypervisor of a free LR (once the guest
has EOIed one IRQ). This will trigger a normal exit, which will go through
the normal cleanup/repopulate scheme, possibly now queuing the leftover
interrupt(s).
To facilitate quick guest exit and entry times, the VGIC maintains the list
of pending or active interrupts (ap\_list) sorted by their priority. Active
interrupts always go first on the list, since a guest and the hardware GIC
expect those to stay until they have been explicitly deactivated. Failure
in keeping active IRQs around will result in error conditions in the GIC.
The second sort criteria for the ap\_list is their priority, so higher
priority pending interrupt always go first into the LRs.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread