All of lore.kernel.org
 help / color / mirror / Atom feed
* Scalable Event Channel ABI design (draft A)
@ 2013-02-04 17:52 David Vrabel
  2013-02-04 19:59 ` Keir Fraser
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: David Vrabel @ 2013-02-04 17:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu

Hi,

Here is a design for a scalable event channel ABI for Xen.  It has a
number of benefits over the design currently being proposed by Wei Lui.

* More event channels (>100,000).
* 16 event priorities.
* Reduced memory requirements (only 1 additional page per domU).
* The use of FIFOs for events ensures fairness, whereas it is difficult
to reason about the fairness with the current bitmap system.

The PDF version is easier to read and has diagrams and readable maths
but the original markdown format document is included below (for ease of
commenting).

http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf

Introduction
============

Purpose
-------

Xen uses event channels to signal events (interrupts) to (fully or
partially) paravirtualized guests.  The current event channel ABI
provided by Xen only supports up-to 1024 (for 32-bit guests) or 4096
(for 64-bit guests) event channels.  This is limiting scalability as
support for more VMs, VCPUs and devices is required.

The existing ABI does not easily allow events to have different
priorities.  Current Linux kernels prioritize the timer event by
special casing this but this is not generalizable to more events.
Event priorities may be useful for prioritizing MMIO emulation
requests over bulk data traffic (such as network or disk).

This design replaces the existing event channel ABI with one that:

- is scalable to more than 100,000 event channels, with scope for
increasing this further with minimal ABI changes.

- allows guests to use up-to 16 different event priorities.

System Overview
---------------

[FIXME: diagram showing Xen and guest and shared memory block for events?]

Design Map
----------

A new event channel ABI requires changes to Xen and the guest kernels.

References
----------

[FIXME: link to alternate proposal?]

Design Considerations
=====================

Assumptions
-----------

* Atomic read-modify-write of 32-bit words is possible on all
  supported platforms.  This can be with a linked-load /
  store-conditional (e.g., ARMv8's ldrx/strx) or a compare-and-swap
  (e.g., x86's cmpxchg).

Constraints
-----------

* The existing ABI must continue to be useable.  Compatibilty with
  existing guests is mandatory.

Risks and Volatile Areas
------------------------

* Should the 3-level proposal be merged into Xen then this design does
  not offer enough improvements to warrant the cost of maintaining
  three different event channel ABIs in Xen and guest kernels.

* The performance of some operations may be decreased.  Specifically,
  retriggering an event now always requires a hypercall.

Architecture
============

Overview
--------

The event channel ABI uses a data structure that is shared between Xen
and the guest.  Access to the structure is done with lock-less
operations (except for some less common operations where the guest
must use a hypercall).  The guest is responsible for allocating this
structure and registering it with Xen during VCPU bring-up.

Events are reported to a guest's VCPU using a FIFO queue.  There is a
queue for each priority level and each VCPU.

Each event has a _pending_ and a _masked_ bit.  The pending bit
indicates the event has been raised.  The masked bit is used by the
guest to prevent delivery of that specific event.


High Level Design
=================

Shared Event Data Structure
---------------------------

The shared event data structure has a per-domain _event array_, and a
per-VCPU _control block_.

- _event array_: A logical array of event words (one per event
  channel) which contains the pending and mask bits and the link index
  for next event in the queue.

![\label{fig_event-word}Event Array Word](event-word.pdf)

- _control block_: This contains the head and tail indexes for each
  priority level.  Each VCPU has its own control block and this is
  contained in the existing `struct vcpu_info`.

The pages within the event array need not be physically nor virtually
contiguous, but the guest or Xen may make the virtually contiguous for
ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
Linux.  Pages are added by the guest as required by the number of
bound event channels.

Only 17 bits are currently defined for the LINK field, allowing 2^17^
(131,072) events.  This limit can be trivially increased without any
other changes to the ABI.  Bits [28:17] are reserved for future
expansion or for other uses.

Event State Machine
-------------------

Event channels are bound to a domain using the existing ABI.

A bound event may be in one of three main states.

State    Abbrev.  Meaning
-----    -------  -------
BOUND    B        The event is bound but not pending.
PENDING  P        The event has been raised and not yet acknowledged.
LINKED   L        The event is on an event queue.

Additionally, events may be UNMASKED or MASKED (M).

![\label{events_sm}Event State Machine](event-sm.pdf)

The state of an event is tracked using 3 bits within the event word.
the M (masked), P (pending) and L (linked) bits.  Only state
transitions that change a single bit are valid.

Event Queues
------------

The event queues use a singly-linked list of event array words (see
figure \ref{fig_event-word} and \ref{fig_event-queue}).

![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)

Each event queue has a head and tail index stored in the control
block.  The head index is the index of the first element in the queue.
The tail index is the last element in the queue.  Every element within
the queue has the L bit set.

The LINK field in the event word indexes the next event in the queue.
LINK is zero for the last word in the queue.

The queue is empty when the head index is zero (zero is not a valid
event channel).

Hypercalls
----------

Four new EVTCHNOP hypercall sub-operations are added:

* `EVTCHNOP_init`

* `EVTCHNOP_expand`

* `EVTCHNOP_set_priority`

* `EVTCHNOP_set_limit`

### `EVTCHNOP_init`

This call initializes a single VCPU's event channel data structures,
adding one page for the event array.

A guest should call this during initial VCPU bring up (and on resume?).

    struct evtchnop_init {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

Field          Purpose
-----          -------
`vcpu`         [in] The VCPU number.
`array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
               of the event array.

Error code  Reason
----------  ------
EINVAL      `vcpu` is invalid or already initialized.
EINVAL      `array_pfn` is not a valid frame for the domain.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_expand`

This call expands the event array for a VCPU by appended an additional
page.

A guest should call this for all VCPUs when a new event channel is
required and there is insufficient space in the current event array.

It is not possible to shrink the event array once it has been
expanded.

    struct evtchnop_expand {
        uint32_t vcpu;
        uint64_t array_pfn;
    };

Field          Purpose
-----          -------
`vcpu`         [in] The VCPU number.
`array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
               of the event array.

Error code  Reason
----------  ------
EINVAL      `vcpu` is invalid or already initialized.
EINVAL      `array_pfn` is not a valid frames for the domain.
EINVAL      The event array already has the maximum number of pages.
ENOMEM      Insufficient memory to allocate internal structures.

### `EVTCHNOP_set_priority`

This call sets the priority for an event channel.  The event must be
unbound.

A guest may call this prior to binding an event channel. The meaning
and the use of the priority are up to the guest.  Valid priorities are
0 - 15 and the default is 7.

    struct evtchnop_set_priority {
        uint32_t port;
        uint32_t priority;
    };

Field       Purpose
-----       -------
`port`      [in] The event channel.
`priority`  [in] The priority for the event channel.

Error code  Reason
----------  ------
EINVAL      `port` is invalid.
EINVAL      `port` is currently bound.
EINVAL      `priority` is outside the range 0 - 15.

### `EVTCHNOP_set_limit`

This privileged call sets the maximum number of event channels a
domain can bind.  The default for dom0 is unlimited.  Other domains
default to 1024 events (requiring only a single page for their event
array).

    struct evtchnop_set_limit {
        uint32_t domid;
        uint32_t max_events;
    };

Field         Purpose
-----         -------
`domid`       [in] The domain ID.
`max_events`  [in] The maximum number of event channels that may be bound
              to the domain.

Error code  Reason
----------  ------
EINVAL      `domid` is invalid.
EPERM       The calling domain has insufficient privileges.

Memory Usage
------------

### Event Arrays

Xen needs to map every domains' event array into its address space.
The space reserved for these global mappings is limited to 1 GiB on
x86-64 (262144 pages) and is shared with other users.

It is non-trivial to calculate the maximum number of VMs that can be
supported as this depends on the system configuration (how many driver
domains etc.) and VM configuration.  We can make some assuptions and
derive an approximate limit.

Each page of the event array has space for 1024 events ($E_P$) so a
regular domU will only require a single page.  Since event channels
typically come in pairs, the upper bound on the total number of pages
is $2 \times\text{number of VMs}$.

If the guests are further restricted in the number of event channels
($E_V$) then this upper bound can be reduced further.

The number of VMs ($V$) with a limit of $P$ total event array pages is:
\begin{equation*}
V = P \div \left(1 + \frac{E_V}{E_P}\right)
\end{equation*}

Using only half the available pages and limiting guests to only 64
events gives:
\begin{eqnarray*}
V & = & (262144 / 2) \div (1 + 64 / 1024) \\
  & = & 123 \times 10^3\text{ VMs}
\end{eqnarray*}

Alternatively, we can consider a system with $D$ driver domains, each
of which requires $E_D$ events, and a dom0 using the maximum number of
pages (128).

\begin{eqnarray*}
V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
\end{eqnarray*}

With, for example, 16 driver domains each using the maximum number of pages:
\begin{eqnarray*}
V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
   & = & 129 \times 10^3\text{ VMs}
\end{eqnarray*}

In summary, there is space to map the event arrays for over 100,000
VMs.  This is more than the limit imposed by the 16 bit domain ID
(~32,000 VMs).

### Control Block

With $L$ priority levels and two 32-bit words for the head and tail
indexes, the amount of space ($S$) required in the `struct vcpu_info`
for the control block is:
\begin{eqnarray*}
S & = & L \times 2 \times 4 \\
  & = & 16 \times 2 \times 4 \\
  & = & 128\text{ bytes}
\end{eqnarray*}

There is more than enough space within `struct vcpu_info` for the
control block while still keeping plenty of space for future use.

Low Level Design
================

In the pseudo code in this section, all memory accesses are atomic,
including those to bit-fields within the event word.

These variables have a standard meaning:

Variable  Purpose
--------  -------
E         Event array.
p         A specific event.
H         The head index for a specific priority.
T         The tail index for a specific priority.

These memory barriers are required:

Function  Purpose
--------  -------
mb()      Full (read/write) memory barrier

Raising an Event
----------------

When Xen raises an event it marks it pending and (if it is not masked)
adds it tail of event queue.

    E[p].pending = 1
    if not E[p].linked and not E[n].masked
        E[p].linked = 1
        E[p].link = 0
        mb()
        if H == 0
            H = p
        else
            E[T].link = p
        T = p

Concurrent access by Xen to the event queue must be protected by a
per-event queue spin lock.

Consuming Events
----------------

The guests consumes events starting at the head until it reaches the
tail.  Events in the queue that are not pending or are masked are
consumed but not handled.

    while H != 0
        p = H
        H = E[p].link
        if H == 0
            mb()
            H = E[p].link
        E[H].linked = 0
        if not E[p].masked
            handle(p)

handle() clears `E[p].pending` and EOIs level-triggered PIRQs.

> Note: When the event queue contains a single event, removing the
> event may race with Xen appending another event because the load of
> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> the guest must recheck `E[p].link` if the list appeared empty.

Masking Events
--------------

Events are masked by setting the masked bit.  If the event is pending
and linked it does not need to be unlinked.

    E[p].masked = 1

Unmasking Events
----------------

Events are unmasked by the guest by clearing the masked bit.  If the
event is pending the guest must call the event channel unmask
hypercall so Xen can link the event into the correct event queue.

    E[p].masked = 0
    if E[p].pending
        hypercall(EVTCHN_unmask)

The expectation here is that unmasking a pending event will be rare,
so the performance hit of the hypercall is minimal.

> Note: that after clearing the mask bit, the event may be raised and
> thus it may already be linked by the time the hypercall is done.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 17:52 Scalable Event Channel ABI design (draft A) David Vrabel
@ 2013-02-04 19:59 ` Keir Fraser
  2013-02-05 14:48   ` David Vrabel
  2013-02-06 11:46   ` Jan Beulich
  2013-02-04 21:07 ` Wei Liu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Keir Fraser @ 2013-02-04 19:59 UTC (permalink / raw)
  To: David Vrabel, xen-devel; +Cc: Wei Liu

On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:

> Hi,
> 
> Here is a design for a scalable event channel ABI for Xen.  It has a
> number of benefits over the design currently being proposed by Wei Lui.
> 
> * More event channels (>100,000).
> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.

I have some sympathy for this design. It's primary downside compared with
the 3-level alternative is its greater space cost (32*#vcpus). However, as
you say the fairness and prioritisation features are rather nice. Also
having the data structures be per vcpu may well help avoid cacheline
contention on busy multi-vcpu guests.

Interested in what others think, but I may actually prefer a ground-up
redesign like this.

 -- Keir

> The PDF version is easier to read and has diagrams and readable maths
> but the original markdown format document is included below (for ease of
> commenting).
> 
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
> 
> Introduction
> ============
> 
> Purpose
> -------
> 
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests.  The current event channel ABI
> provided by Xen only supports up-to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels.  This is limiting scalability as
> support for more VMs, VCPUs and devices is required.
> 
> The existing ABI does not easily allow events to have different
> priorities.  Current Linux kernels prioritize the timer event by
> special casing this but this is not generalizable to more events.
> Event priorities may be useful for prioritizing MMIO emulation
> requests over bulk data traffic (such as network or disk).
> 
> This design replaces the existing event channel ABI with one that:
> 
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
> 
> - allows guests to use up-to 16 different event priorities.
> 
> System Overview
> ---------------
> 
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
> 
> Design Map
> ----------
> 
> A new event channel ABI requires changes to Xen and the guest kernels.
> 
> References
> ----------
> 
> [FIXME: link to alternate proposal?]
> 
> Design Considerations
> =====================
> 
> Assumptions
> -----------
> 
> * Atomic read-modify-write of 32-bit words is possible on all
>   supported platforms.  This can be with a linked-load /
>   store-conditional (e.g., ARMv8's ldrx/strx) or a compare-and-swap
>   (e.g., x86's cmpxchg).
> 
> Constraints
> -----------
> 
> * The existing ABI must continue to be useable.  Compatibilty with
>   existing guests is mandatory.
> 
> Risks and Volatile Areas
> ------------------------
> 
> * Should the 3-level proposal be merged into Xen then this design does
>   not offer enough improvements to warrant the cost of maintaining
>   three different event channel ABIs in Xen and guest kernels.
> 
> * The performance of some operations may be decreased.  Specifically,
>   retriggering an event now always requires a hypercall.
> 
> Architecture
> ============
> 
> Overview
> --------
> 
> The event channel ABI uses a data structure that is shared between Xen
> and the guest.  Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall).  The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.
> 
> Events are reported to a guest's VCPU using a FIFO queue.  There is a
> queue for each priority level and each VCPU.
> 
> Each event has a _pending_ and a _masked_ bit.  The pending bit
> indicates the event has been raised.  The masked bit is used by the
> guest to prevent delivery of that specific event.
> 
> 
> High Level Design
> =================
> 
> Shared Event Data Structure
> ---------------------------
> 
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
> 
> - _event array_: A logical array of event words (one per event
>   channel) which contains the pending and mask bits and the link index
>   for next event in the queue.
> 
> ![\label{fig_event-word}Event Array Word](event-word.pdf)
> 
> - _control block_: This contains the head and tail indexes for each
>   priority level.  Each VCPU has its own control block and this is
>   contained in the existing `struct vcpu_info`.
> 
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make the virtually contiguous for
> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
> Linux.  Pages are added by the guest as required by the number of
> bound event channels.
> 
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events.  This limit can be trivially increased without any
> other changes to the ABI.  Bits [28:17] are reserved for future
> expansion or for other uses.
> 
> Event State Machine
> -------------------
> 
> Event channels are bound to a domain using the existing ABI.
> 
> A bound event may be in one of three main states.
> 
> State    Abbrev.  Meaning
> -----    -------  -------
> BOUND    B        The event is bound but not pending.
> PENDING  P        The event has been raised and not yet acknowledged.
> LINKED   L        The event is on an event queue.
> 
> Additionally, events may be UNMASKED or MASKED (M).
> 
> ![\label{events_sm}Event State Machine](event-sm.pdf)
> 
> The state of an event is tracked using 3 bits within the event word.
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.
> 
> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue.
> The tail index is the last element in the queue.  Every element within
> the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
> 
> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appended an additional
> page.
> 
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
> 
> It is not possible to shrink the event array once it has been
> expanded.
> 
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frames for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 
> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domains' event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assuptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs, the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).
> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
> 
> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
> 
> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it tail of event queue.
> 
>     E[p].pending = 1
>     if not E[p].linked and not E[n].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
> 
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
> 
> Consuming Events
> ----------------
> 
> The guests consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[H].linked = 0
>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
> 
>> Note: When the event queue contains a single event, removing the
>> event may race with Xen appending another event because the load of
>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
>> the guest must recheck `E[p].link` if the list appeared empty.
> 
> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
> 
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 
>> Note: that after clearing the mask bit, the event may be raised and
>> thus it may already be linked by the time the hypercall is done.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 17:52 Scalable Event Channel ABI design (draft A) David Vrabel
  2013-02-04 19:59 ` Keir Fraser
@ 2013-02-04 21:07 ` Wei Liu
  2013-02-04 22:16   ` Keir Fraser
  2013-02-05 18:36   ` David Vrabel
  2013-02-05 16:10 ` Ian Campbell
  2013-02-06  9:13 ` Ian Campbell
  3 siblings, 2 replies; 28+ messages in thread
From: Wei Liu @ 2013-02-04 21:07 UTC (permalink / raw)
  To: David Vrabel; +Cc: wei.liu2, xen-devel

Nice work and interesting reading for the night! please see inline
comments.

On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> Hi,
> 
> Here is a design for a scalable event channel ABI for Xen.  It has a
> number of benefits over the design currently being proposed by Wei Lui.
> 
> * More event channels (>100,000).

This is not a benefit over the 3-level design. ;-)

> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.

These are! ;-)

> The PDF version is easier to read and has diagrams and readable maths
> but the original markdown format document is included below (for ease of
> commenting).
> 
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
> 
> Introduction
> ============
> 
> Purpose
> -------
> 
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests.  The current event channel ABI
> provided by Xen only supports up-to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels.  This is limiting scalability as
> support for more VMs, VCPUs and devices is required.
> 
> The existing ABI does not easily allow events to have different
> priorities.  Current Linux kernels prioritize the timer event by
> special casing this but this is not generalizable to more events.
> Event priorities may be useful for prioritizing MMIO emulation
> requests over bulk data traffic (such as network or disk).
> 
> This design replaces the existing event channel ABI with one that:
> 
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
> 
> - allows guests to use up-to 16 different event priorities.
> 
> System Overview
> ---------------
> 
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
> 
> Design Map
> ----------
> 
> A new event channel ABI requires changes to Xen and the guest kernels.
> 
> References
> ----------
> 
> [FIXME: link to alternate proposal?]
> 
> Design Considerations
> =====================
> 
> Assumptions
> -----------
> 
> * Atomic read-modify-write of 32-bit words is possible on all
>   supported platforms.  This can be with a linked-load /
>   store-conditional (e.g., ARMv8's ldrx/strx) or a compare-and-swap
>   (e.g., x86's cmpxchg).
> 
> Constraints
> -----------
> 
> * The existing ABI must continue to be useable.  Compatibilty with
>   existing guests is mandatory.
> 
> Risks and Volatile Areas
> ------------------------
> 
> * Should the 3-level proposal be merged into Xen then this design does
>   not offer enough improvements to warrant the cost of maintaining
>   three different event channel ABIs in Xen and guest kernels.
> 
> * The performance of some operations may be decreased.  Specifically,
>   retriggering an event now always requires a hypercall.
> 
> Architecture
> ============
> 
> Overview
> --------
> 
> The event channel ABI uses a data structure that is shared between Xen
> and the guest.  Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall).  The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.

Just for completeness, legacy guests might not register vcpu info - they
just use those vcpu info inside shared info. And a modern guest might
use vcpu info inside shared info along with registered per-vcpu vcpu
info if registrations fail half way.

But this will not be a big problem for this design.

> Events are reported to a guest's VCPU using a FIFO queue.  There is a
> queue for each priority level and each VCPU.
> 
> Each event has a _pending_ and a _masked_ bit.  The pending bit
> indicates the event has been raised.  The masked bit is used by the
> guest to prevent delivery of that specific event.
> 
> 
> High Level Design
> =================
> 
> Shared Event Data Structure
> ---------------------------
> 
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
> 
> - _event array_: A logical array of event words (one per event
>   channel) which contains the pending and mask bits and the link index
>   for next event in the queue.
> 
> ![\label{fig_event-word}Event Array Word](event-word.pdf)
> 
> - _control block_: This contains the head and tail indexes for each
>   priority level.  Each VCPU has its own control block and this is
>   contained in the existing `struct vcpu_info`.
> 
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make the virtually contiguous for
> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
> Linux.  Pages are added by the guest as required by the number of
> bound event channels.
> 
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events.  This limit can be trivially increased without any
> other changes to the ABI.  Bits [28:17] are reserved for future
> expansion or for other uses.
> 
> Event State Machine
> -------------------
> 
> Event channels are bound to a domain using the existing ABI.
> 
> A bound event may be in one of three main states.
> 
> State    Abbrev.  Meaning
> -----    -------  -------
> BOUND    B        The event is bound but not pending.
> PENDING  P        The event has been raised and not yet acknowledged.
> LINKED   L        The event is on an event queue.
> 
> Additionally, events may be UNMASKED or MASKED (M).
> 
> ![\label{events_sm}Event State Machine](event-sm.pdf)
> 
> The state of an event is tracked using 3 bits within the event word.
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.
> 

Where is the "priority" information stored? I think it should be
somewhere inside Xen, right? I presume a field in struct evtchn?

> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue.
> The tail index is the last element in the queue.  Every element within
> the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
> 

I think the registration for new event channels should always be done in
a batch. If not then you need to provide another OP to rollback previous
registered ones.

> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appended an additional
> page.
> 
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
> 
> It is not possible to shrink the event array once it has been
> expanded.
> 
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
> 
> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the next page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frames for the domain.
> EINVAL      The event array already has the maximum number of pages.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15.
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 

Is shrinking max_events legal? Should also deal with max_events >
design_limit case.

> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domains' event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assuptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs, the upper bound on the total number of pages
                               ^^^^^
                               upper or lower?

> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 

Can this bound really be reduced? Can you map memory on non-page
granularity?

> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).
> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
> 
> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
> 
> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it tail of event queue.
> 
>     E[p].pending = 1
>     if not E[p].linked and not E[n].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
> 
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
> 

I presume "E[n]" in the pseudo code is "E[p]"?

Is this spin lock really a good idea? How many threads / cpus will spin
on this lock? As [0] shows, contention on spin lock incurs heavy
performance penalty.

[0] https://lwn.net/Articles/530458/

> Consuming Events
> ----------------
> 
> The guests consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[H].linked = 0
>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
> 

How to synchronize access to the array and control blocks between Xen
and guest? I'm afraid I have no knowledge of a xen-guest spin lock...

AFAICT at least inter-domain event channel is a corner case in your
design - cpu A is handling events in Linux kernel while cpu B tries to
raise a inter-domain port to A. If this is not a problem, please also
clarify a bit. (sorry I'm a bit fuzzy after a day's work)

> > Note: When the event queue contains a single event, removing the
> > event may race with Xen appending another event because the load of
> > `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> > the guest must recheck `E[p].link` if the list appeared empty.
> 
> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
> 
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 

Currently unmask a "remote" port requires issuing hyercall as well, so
if unmasking is not very frequent, this is not a big problem.

But please take some interrupt-intensive workloads into consideration.
For example, 1G nic (e1000) even with NAPI enabled can generate 27k+
intrs per second under high packet load [1], 10G nic can surely generate
more. Can you give estimation on the performance hit on the context
switch?

[1] http://www.linuxfoundation.org/collaborate/workgroups/networking/napi

> > Note: that after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.

Even if the event has already been linked before you finish the
hypercall, you would still need to get hold of the a lock to serialize
access to event structure for checking, right? Or a test_bit on linked
field is sufficient? I think you need to write some pseudo code for this
as well.


Thanks
Wei.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 21:07 ` Wei Liu
@ 2013-02-04 22:16   ` Keir Fraser
  2013-02-05 18:36   ` David Vrabel
  1 sibling, 0 replies; 28+ messages in thread
From: Keir Fraser @ 2013-02-04 22:16 UTC (permalink / raw)
  To: Wei Liu, David Vrabel; +Cc: xen-devel

On 04/02/2013 21:07, "Wei Liu" <wei.liu2@citrix.com> wrote:

>> Concurrent access by Xen to the event queue must be protected by a
>> per-event queue spin lock.
>> 
> 
> I presume "E[n]" in the pseudo code is "E[p]"?
> 
> Is this spin lock really a good idea? How many threads / cpus will spin
> on this lock? As [0] shows, contention on spin lock incurs heavy
> performance penalty.
> 
> [0] https://lwn.net/Articles/530458/

Given that the critical region is small, the extra cache line contention for
the spinlock is probably not a big deal. Even in the current event-channel
design, we would get cache ping-pong on the event-channel bitmaps.

Consider 10k interrupts to a CPU would be a heavy amount. That's one every
100us. The event-channel delivery code described probably runs in less than
1us, even if memory accesses are horrible cache misses. The really highly
contended case shouldn't happen.

 -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 19:59 ` Keir Fraser
@ 2013-02-05 14:48   ` David Vrabel
  2013-02-05 15:16     ` Wei Liu
  2013-02-05 15:49     ` Keir Fraser
  2013-02-06 11:46   ` Jan Beulich
  1 sibling, 2 replies; 28+ messages in thread
From: David Vrabel @ 2013-02-05 14:48 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Wei Liu, xen-devel

On 04/02/13 19:59, Keir Fraser wrote:
> On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:
> 
>> Hi,
>>
>> Here is a design for a scalable event channel ABI for Xen.  It has a
>> number of benefits over the design currently being proposed by Wei Lui.
>>
>> * More event channels (>100,000).
>> * 16 event priorities.
>> * Reduced memory requirements (only 1 additional page per domU).
>> * The use of FIFOs for events ensures fairness, whereas it is difficult
>> to reason about the fairness with the current bitmap system.
> 
> I have some sympathy for this design. It's primary downside compared with
> the 3-level alternative is its greater space cost (32*#vcpus). However, as
> you say the fairness and prioritisation features are rather nice. Also
> having the data structures be per vcpu may well help avoid cacheline
> contention on busy multi-vcpu guests.

This design originally (before I posted it) did have per-VCPU event
arrays but I changed it to per-domain to reduce the memory footprint.

A hybrid approach might be useful.  Busy guests like dom0 or driver
domains could use per-VCPU event arrays but other guests could be
per-domain.  This would be controlled by the toolstack.

> Interested in what others think, but I may actually prefer a ground-up
> redesign like this.

Since the ABI needs to be changed to support more event channels anyway,
it seems an ideal point to revisit the design.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 14:48   ` David Vrabel
@ 2013-02-05 15:16     ` Wei Liu
  2013-02-05 18:05       ` George Dunlap
  2013-02-05 15:49     ` Keir Fraser
  1 sibling, 1 reply; 28+ messages in thread
From: Wei Liu @ 2013-02-05 15:16 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, wei.liu2, xen-devel

On Tue, 2013-02-05 at 14:48 +0000, David Vrabel wrote:
> On 04/02/13 19:59, Keir Fraser wrote:
> > On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:
> > 
> >> Hi,
> >>
> >> Here is a design for a scalable event channel ABI for Xen.  It has a
> >> number of benefits over the design currently being proposed by Wei Lui.
> >>
> >> * More event channels (>100,000).
> >> * 16 event priorities.
> >> * Reduced memory requirements (only 1 additional page per domU).
> >> * The use of FIFOs for events ensures fairness, whereas it is difficult
> >> to reason about the fairness with the current bitmap system.
> > 
> > I have some sympathy for this design. It's primary downside compared with
> > the 3-level alternative is its greater space cost (32*#vcpus). However, as
> > you say the fairness and prioritisation features are rather nice. Also
> > having the data structures be per vcpu may well help avoid cacheline
> > contention on busy multi-vcpu guests.
> 
> This design originally (before I posted it) did have per-VCPU event
> arrays but I changed it to per-domain to reduce the memory footprint.
> 
> A hybrid approach might be useful.  Busy guests like dom0 or driver
> domains could use per-VCPU event arrays but other guests could be
> per-domain.  This would be controlled by the toolstack.
> 
> > Interested in what others think, but I may actually prefer a ground-up
> > redesign like this.
> 
> Since the ABI needs to be changed to support more event channels anyway,
> it seems an ideal point to revisit the design.
> 

Right. I do care about better design and good implementation. Can we
build a prototype of this design? We are less than two months away from
4.3 feature freeze, and the event channel scalability is planned for
that release, which means we need to be hurried. :-)


Wei.

> David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 14:48   ` David Vrabel
  2013-02-05 15:16     ` Wei Liu
@ 2013-02-05 15:49     ` Keir Fraser
  2013-02-05 15:54       ` David Vrabel
  1 sibling, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2013-02-05 15:49 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel

On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:

>> I have some sympathy for this design. It's primary downside compared with
>> the 3-level alternative is its greater space cost (32*#vcpus). However, as
>> you say the fairness and prioritisation features are rather nice. Also
>> having the data structures be per vcpu may well help avoid cacheline
>> contention on busy multi-vcpu guests.
> 
> This design originally (before I posted it) did have per-VCPU event
> arrays but I changed it to per-domain to reduce the memory footprint.

Okay, I wonder how much it actually matters anyhow...

Oh by the way you say the control block is 128 bytes and will easily fit in
the existing struct vcpu_info. That existing structure is 64 bytes in total.
So how does that work then?

 -- Keir

> A hybrid approach might be useful.  Busy guests like dom0 or driver
> domains could use per-VCPU event arrays but other guests could be
> per-domain.  This would be controlled by the toolstack.
>
>> Interested in what others think, but I may actually prefer a ground-up
>> redesign like this.
> 
> Since the ABI needs to be changed to support more event channels anyway,
> it seems an ideal point to revisit the design.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 15:49     ` Keir Fraser
@ 2013-02-05 15:54       ` David Vrabel
  2013-02-05 16:11         ` Ian Campbell
  2013-02-05 16:11         ` Keir Fraser
  0 siblings, 2 replies; 28+ messages in thread
From: David Vrabel @ 2013-02-05 15:54 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Wei Liu, xen-devel

On 05/02/13 15:49, Keir Fraser wrote:
> On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:
> 
>>> I have some sympathy for this design. It's primary downside compared with
>>> the 3-level alternative is its greater space cost (32*#vcpus). However, as
>>> you say the fairness and prioritisation features are rather nice. Also
>>> having the data structures be per vcpu may well help avoid cacheline
>>> contention on busy multi-vcpu guests.
>>
>> This design originally (before I posted it) did have per-VCPU event
>> arrays but I changed it to per-domain to reduce the memory footprint.
> 
> Okay, I wonder how much it actually matters anyhow...
> 
> Oh by the way you say the control block is 128 bytes and will easily fit in
> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> So how does that work then?

I meant struct vcpu_info can be extended without it growing to more than
a page.  i.e., it fits into the guest page provided in the
VCPUOP_register_vcpu_info call so no additional pages need to be
globally mapped for the control block.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 17:52 Scalable Event Channel ABI design (draft A) David Vrabel
  2013-02-04 19:59 ` Keir Fraser
  2013-02-04 21:07 ` Wei Liu
@ 2013-02-05 16:10 ` Ian Campbell
  2013-02-05 18:18   ` David Vrabel
  2013-02-06  9:13 ` Ian Campbell
  3 siblings, 1 reply; 28+ messages in thread
From: Ian Campbell @ 2013-02-05 16:10 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel

On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> The pages within the event array need not be physically nor virtually
> contiguous, but the guest or Xen may make the virtually contiguous for
> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
> Linux.  Pages are added by the guest as required by the number of
> bound event channels.

Strictly speaking the size is related to the maximum bound evtchn, which
need not be the same as the number of bound evtchns.

> The state of an event is tracked using 3 bits within the event word.
> the M (masked), P (pending) and L (linked) bits.  Only state
> transitions that change a single bit are valid.

The diagram shows transitions B<->P and P<->L, which involve two bits in
each case. So do BM<->PM and LM<->BM now I think about it.

Is the L bit redundant with the LINK field == 0 or != 0?

> 
> Event Queues
> ------------
> 
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).`
> 
> ![\label{fig_event-queue}Empty and Non-empty Event Queues](event-queue.pdf)
> 
> Each event queue has a head and tail index stored in the control
> block.  The head index is the index of the first element in the queue.
> The tail index is the last element in the queue.  Every element within
> the queue has the L bit set.
> 
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
> 
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
> 
> Hypercalls
> ----------
> 
> Four new EVTCHNOP hypercall sub-operations are added:
> 
> * `EVTCHNOP_init`
> 
> * `EVTCHNOP_expand`
> 
> * `EVTCHNOP_set_priority`
> 
> * `EVTCHNOP_set_limit`
> 
> ### `EVTCHNOP_init`
> 
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.

Isn't the event array per-domain?

> A guest should call this during initial VCPU bring up (and on resume?).
> 
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };

I think this will have a different layout on 32 and 64 bit x86, if you
care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).

> Field          Purpose
> -----          -------
> `vcpu`         [in] The VCPU number.
> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>                of the event array.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `vcpu` is invalid or already initialized.
> EINVAL      `array_pfn` is not a valid frame for the domain.
> ENOMEM      Insufficient memory to allocate internal structures.
> 
> ### `EVTCHNOP_expand`
> 
> This call expands the event array for a VCPU by appended an additional
> page.

This doesn't seem all that different to _init, except the former handles
the transition from 0->1 event pages and this handles N->N+1?

> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.



> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.
> 
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.

EBUSY?

> EINVAL      `priority` is outside the range 0 - 15.
> 
> ### `EVTCHNOP_set_limit`
> 
> This privileged call sets the maximum number of event channels a
> domain can bind.  The default for dom0 is unlimited.  Other domains
> default to 1024 events (requiring only a single page for their event
> array).
> 
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
> 
> Field         Purpose
> -----         -------
> `domid`       [in] The domain ID.
> `max_events`  [in] The maximum number of event channels that may be bound
>               to the domain.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `domid` is invalid.
> EPERM       The calling domain has insufficient privileges.
> 
> Memory Usage
> ------------
> 
> ### Event Arrays
> 
> Xen needs to map every domains' event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
> 
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration.  We can make some assuptions and
> derive an approximate limit.
> 
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page.  Since event channels
> typically come in pairs,

I'm not sure this is the case, since evtchns are bidirectional (i.e.
either end can use it to signal the other) so one is often shared for
both rqst and rsp notifications.

 the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
> 
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
> 
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
> 
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
>   & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
> 
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
> 
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
> 
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>    & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}

This accounts for the driver domains and dom0 but not the domains which
they are serving, doesn't it?
 
> In summary, there is space to map the event arrays for over 100,000
> VMs.  This is more than the limit imposed by the 16 bit domain ID
> (~32,000 VMs).

Is there scope to reduce the maximum then?

> 
> ### Control Block
> 
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
>   & = & 16 \times 2 \times 4 \\
>   & = & 128\text{ bytes}
> \end{eqnarray*}
> 
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.

There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
which I suppose becomes redundant now.

I don't think it would be a problem to predicate use of this new
interface on the use of the VCPU_placement API and therefore give scope
to expand the vcpu_info.

> Low Level Design
> ================
> 
> In the pseudo code in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.

> These variables have a standard meaning:
> 
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
> 
> These memory barriers are required:
> 
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier
> 
> Raising an Event
> ----------------
> 
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it tail of event queue.

What are the conditions for actually performing the upcall when
returning from the hypervisor to the guest?

>     E[p].pending = 1
>     if not E[p].linked and not E[n].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p

Do you not need a barrier towards the end here to ensure that a consumer
who is currently processing interrupts sees the updated link when they
get there?

> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
> 
> Consuming Events
> ----------------
> 
> The guests consumes events starting at the head until it reaches the
> tail.  Events in the queue that are not pending or are masked are
> consumed but not handled.
> 
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()

What is this mb() needed for?

>             H = E[p].link
>         E[H].linked = 0

Did you mean E[p].linked here?

If at this point the interrupt is reraised then the if in the raising
pseudo code becomes true, linked will set again and don't we also race
with the clearing of E[p].pending below?

Is a barrier needed around here due to Xen also reading E[p].pending?

>         if not E[p].masked
>             handle(p)
> 
> handle() clears `E[p].pending` 

What barriers are required for this and where? Is it done at the start
or the end of the handler? (early of late EOI)

> and EOIs level-triggered PIRQs.
> 
> > Note: When the event queue contains a single event, removing the
> > event may race with Xen appending another event because the load of
> > `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> > the guest must recheck `E[p].link` if the list appeared empty.

It appears that both "Raising an Event" and "Consuming Events" can write
H? Is that correct? Likewise for the linked bit.

It's a bit unclear because the pseudo-code doesn't make it explicitly
which variables are par of the shared data structure and which are
private to the local routine.

> Masking Events
> --------------
> 
> Events are masked by setting the masked bit.  If the event is pending
> and linked it does not need to be unlinked.
> 
>     E[p].masked = 1
> 
> Unmasking Events
> ----------------
> 
> Events are unmasked by the guest by clearing the masked bit.  If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
> 
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)

Can the hypercall do the E[p].masked = 0?

> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
> 
> > Note: that after clearing the mask bit, the event may be raised and
> > thus it may already be linked by the time the hypercall is done.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 15:54       ` David Vrabel
@ 2013-02-05 16:11         ` Ian Campbell
  2013-02-05 18:02           ` Keir Fraser
  2013-02-05 16:11         ` Keir Fraser
  1 sibling, 1 reply; 28+ messages in thread
From: Ian Campbell @ 2013-02-05 16:11 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, Keir (Xen.org), xen-devel

On Tue, 2013-02-05 at 15:54 +0000, David Vrabel wrote:
> On 05/02/13 15:49, Keir Fraser wrote:
> > On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:
> > 
> >>> I have some sympathy for this design. It's primary downside compared with
> >>> the 3-level alternative is its greater space cost (32*#vcpus). However, as
> >>> you say the fairness and prioritisation features are rather nice. Also
> >>> having the data structures be per vcpu may well help avoid cacheline
> >>> contention on busy multi-vcpu guests.
> >>
> >> This design originally (before I posted it) did have per-VCPU event
> >> arrays but I changed it to per-domain to reduce the memory footprint.
> > 
> > Okay, I wonder how much it actually matters anyhow...
> > 
> > Oh by the way you say the control block is 128 bytes and will easily fit in
> > the existing struct vcpu_info. That existing structure is 64 bytes in total.
> > So how does that work then?
> 
> I meant struct vcpu_info can be extended without it growing to more than
> a page.  i.e., it fits into the guest page provided in the
> VCPUOP_register_vcpu_info call so no additional pages need to be
> globally mapped for the control block.

VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
by itself, even if that happens to be the Linux implementation today (I
haven't checked that).

Ian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 15:54       ` David Vrabel
  2013-02-05 16:11         ` Ian Campbell
@ 2013-02-05 16:11         ` Keir Fraser
  1 sibling, 0 replies; 28+ messages in thread
From: Keir Fraser @ 2013-02-05 16:11 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel

On 05/02/2013 15:54, "David Vrabel" <david.vrabel@citrix.com> wrote:

> On 05/02/13 15:49, Keir Fraser wrote:
>> On 05/02/2013 14:48, "David Vrabel" <david.vrabel@citrix.com> wrote:
>> 
>>>> I have some sympathy for this design. It's primary downside compared with
>>>> the 3-level alternative is its greater space cost (32*#vcpus). However, as
>>>> you say the fairness and prioritisation features are rather nice. Also
>>>> having the data structures be per vcpu may well help avoid cacheline
>>>> contention on busy multi-vcpu guests.
>>> 
>>> This design originally (before I posted it) did have per-VCPU event
>>> arrays but I changed it to per-domain to reduce the memory footprint.
>> 
>> Okay, I wonder how much it actually matters anyhow...
>> 
>> Oh by the way you say the control block is 128 bytes and will easily fit in
>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
>> So how does that work then?
> 
> I meant struct vcpu_info can be extended without it growing to more than
> a page.  i.e., it fits into the guest page provided in the
> VCPUOP_register_vcpu_info call so no additional pages need to be
> globally mapped for the control block.

Oh, I see so any guest that uses the new event-channel interface will
understand that vcpu_info is extended to contain it. Well, that makes sense.
It's not what your document says though.

 -- Keir

> David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 16:11         ` Ian Campbell
@ 2013-02-05 18:02           ` Keir Fraser
  2013-02-06  9:38             ` Ian Campbell
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2013-02-05 18:02 UTC (permalink / raw)
  To: Ian Campbell, David Vrabel; +Cc: Wei Liu, xen-devel

On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> Okay, I wonder how much it actually matters anyhow...
>>> 
>>> Oh by the way you say the control block is 128 bytes and will easily fit in
>>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
>>> So how does that work then?
>> 
>> I meant struct vcpu_info can be extended without it growing to more than
>> a page.  i.e., it fits into the guest page provided in the
>> VCPUOP_register_vcpu_info call so no additional pages need to be
>> globally mapped for the control block.
> 
> VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> by itself, even if that happens to be the Linux implementation today (I
> haven't checked that).

Having guest agree that vcpu_info grows by size of the per-vcpu control
block, if using this new event-channel interface, is reasonable though.

 -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 15:16     ` Wei Liu
@ 2013-02-05 18:05       ` George Dunlap
  2013-02-05 18:57         ` David Vrabel
  0 siblings, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-02-05 18:05 UTC (permalink / raw)
  To: Wei Liu; +Cc: Keir Fraser, David Vrabel, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 687 bytes --]

On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com> wrote:

> > Since the ABI needs to be changed to support more event channels anyway,
> > it seems an ideal point to revisit the design.
> >
>
> Right. I do care about better design and good implementation. Can we
> build a prototype of this design? We are less than two months away from
> 4.3 feature freeze, and the event channel scalability is planned for
> that release, which means we need to be hurried. :-)
>

I think the general consensus is that scaling event channels is pretty
important -- probably important enough to slip the release a bit if
necessary.  (Although it would certainly be better not to.)

 -George

[-- Attachment #1.2: Type: text/html, Size: 1133 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 16:10 ` Ian Campbell
@ 2013-02-05 18:18   ` David Vrabel
  2013-02-06  9:35     ` Ian Campbell
  0 siblings, 1 reply; 28+ messages in thread
From: David Vrabel @ 2013-02-05 18:18 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, xen-devel

On 05/02/13 16:10, Ian Campbell wrote:
> On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
>> The pages within the event array need not be physically nor virtually
>> contiguous, but the guest or Xen may make the virtually contiguous for
>> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
>> Linux.  Pages are added by the guest as required by the number of
>> bound event channels.
> 
> Strictly speaking the size is related to the maximum bound evtchn, which
> need not be the same as the number of bound evtchns.

Yes.

>> The state of an event is tracked using 3 bits within the event word.
>> the M (masked), P (pending) and L (linked) bits.  Only state
>> transitions that change a single bit are valid.
> 
> The diagram shows transitions B<->P and P<->L, which involve two bits in
> each case. So do BM<->PM and LM<->BM now I think about it.

B == 000b, P == 100b, L == 101b.

Similarly for the masked transitions:

BM == 010b, PM == 110b, LM == 111b.

> Is the L bit redundant with the LINK field == 0 or != 0?

LINK == 0 is the end of queue marker.  We could use all ones to mean
'unlinked' but using the L bit allows the guest to remove the event from
the queue by clearing a single bit, instead of writing the LINK field.

>> ### `EVTCHNOP_init`
>>
>> This call initializes a single VCPU's event channel data structures,
>> adding one page for the event array.
> 
> Isn't the event array per-domain?

Er, yes.  I changed this from per-VCPU and it looks like I didn't make
all the updates required.

>> A guest should call this during initial VCPU bring up (and on resume?).
>>
>>     struct evtchnop_init {
>>         uint32_t vcpu;
>>         uint64_t array_pfn;
>>     };
> 
> I think this will have a different layout on 32 and 64 bit x86, if you
> care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).

Ok.  I thought uint64_t was always 8 byte aligned on both.

>> Field          Purpose
>> -----          -------
>> `vcpu`         [in] The VCPU number.
>> `array_pfn`    [in] The PFN or GMFN of a page to be used for the first page
>>                of the event array.
>>
>> Error code  Reason
>> ----------  ------
>> EINVAL      `vcpu` is invalid or already initialized.
>> EINVAL      `array_pfn` is not a valid frame for the domain.
>> ENOMEM      Insufficient memory to allocate internal structures.
>>
>> ### `EVTCHNOP_expand`
>>
>> This call expands the event array for a VCPU by appended an additional
>> page.
> 
> This doesn't seem all that different to _init, except the former handles
> the transition from 0->1 event pages and this handles N->N+1?

Agreed, I'll fold the two together.

>> ### `EVTCHNOP_set_priority`
>>
>> This call sets the priority for an event channel.  The event must be
>> unbound.
>>
>> A guest may call this prior to binding an event channel. The meaning
>> and the use of the priority are up to the guest.  Valid priorities are
>> 0 - 15 and the default is 7.
>>
>>     struct evtchnop_set_priority {
>>         uint32_t port;
>>         uint32_t priority;
>>     };
>>
>> Field       Purpose
>> -----       -------
>> `port`      [in] The event channel.
>> `priority`  [in] The priority for the event channel.
>>
>> Error code  Reason
>> ----------  ------
>> EINVAL      `port` is invalid.
>> EINVAL      `port` is currently bound.
> 
> EBUSY?

Sure.

>> Memory Usage
>> ------------
>>
>> Alternatively, we can consider a system with $D$ driver domains, each
>> of which requires $E_D$ events, and a dom0 using the maximum number of
>> pages (128).
>>
>> \begin{eqnarray*}
>> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
>> \end{eqnarray*}
>>
>> With, for example, 16 driver domains each using the maximum number of pages:
>> \begin{eqnarray*}
>> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
>>    & = & 129 \times 10^3\text{ VMs}
>> \end{eqnarray*}
> 
> This accounts for the driver domains and dom0 but not the domains which
> they are serving, doesn't it?

This is calculating the number of pages left once those used for dom0
and the driver domains are used.  This is the same as the number of
supportable VMs since the other VMs only require a page.

>> In summary, there is space to map the event arrays for over 100,000
>> VMs.  This is more than the limit imposed by the 16 bit domain ID
>> (~32,000 VMs).
> 
> Is there scope to reduce the maximum then?

Maximum of what?

>> ### Control Block
>>
>> With $L$ priority levels and two 32-bit words for the head and tail
>> indexes, the amount of space ($S$) required in the `struct vcpu_info`
>> for the control block is:
>> \begin{eqnarray*}
>> S & = & L \times 2 \times 4 \\
>>   & = & 16 \times 2 \times 4 \\
>>   & = & 128\text{ bytes}
>> \end{eqnarray*}
>>
>> There is more than enough space within `struct vcpu_info` for the
>> control block while still keeping plenty of space for future use.
> 
> There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
> which I suppose becomes redundant now.
> 
> I don't think it would be a problem to predicate use of this new
> interface on the use of the VCPU_placement API and therefore give scope
> to expand the vcpu_info.

I should have taken a proper look at where the vcpu_info comes from...
Oops.

We can add a new VCPUOP_register_vcpu_info_and_control_block which will
add a struct vcpu_info and the control block.

>> Low Level Design
>> ================
>>
>> In the pseudo code in this section, all memory accesses are atomic,
>> including those to bit-fields within the event word.
> 
>> These variables have a standard meaning:
>>
>> Variable  Purpose
>> --------  -------
>> E         Event array.
>> p         A specific event.
>> H         The head index for a specific priority.
>> T         The tail index for a specific priority.
>>
>> These memory barriers are required:
>>
>> Function  Purpose
>> --------  -------
>> mb()      Full (read/write) memory barrier
>>
>> Raising an Event
>> ----------------
>>
>> When Xen raises an event it marks it pending and (if it is not masked)
>> adds it tail of event queue.
> 
> What are the conditions for actually performing the upcall when
> returning from the hypervisor to the guest?

Any H != 0 for that VCPU.  This means we should have a per-VCPU bitmap
of non empty queues.

>>     E[p].pending = 1
>>     if not E[p].linked and not E[n].masked
>>         E[p].linked = 1
>>         E[p].link = 0
>>         mb()
>>         if H == 0
>>             H = p
>>         else
>>             E[T].link = p
>>         T = p
> 
> Do you not need a barrier towards the end here to ensure that a consumer
> who is currently processing interrupts sees the updated link when they
> get there?

I need to take another look at the barriers -- I'll get back to you on
this and the others you highlighted.

>> Concurrent access by Xen to the event queue must be protected by a
>> per-event queue spin lock.
>>
>> Consuming Events
>> ----------------
>>
>> The guests consumes events starting at the head until it reaches the
>> tail.  Events in the queue that are not pending or are masked are
>> consumed but not handled.
>>
>>     while H != 0
>>         p = H
>>         H = E[p].link
>>         if H == 0
>>             mb()
>>             H = E[p].link
>>         E[H].linked = 0
> 
> Did you mean E[p].linked here?
> 
> If at this point the interrupt is reraised then the if in the raising
> pseudo code becomes true, linked will set again and don't we also race
> with the clearing of E[p].pending below?

The event will be correctly linked and then we clear P, when we consume
the linked event it will be ignored as P is already clear.

Perhaps we could also avoid linking the event if it is already pending?

>>> Note: When the event queue contains a single event, removing the
>>> event may race with Xen appending another event because the load of
>>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
>>> the guest must recheck `E[p].link` if the list appeared empty.
> 
> It appears that both "Raising an Event" and "Consuming Events" can write
> H? Is that correct? Likewise for the linked bit.

Xen only writes H when the queue is empty, the VM only writes H if it is
non-empty.

The linked bit is set by Xen and cleared by the guest.

Do you see a problem with this?

> It's a bit unclear because the pseudo-code doesn't make it explicitly
> which variables are par of the shared data structure and which are
> private to the local routine.

Capital variables are in the shared structures.  Lower-case are local.

>> Masking Events
>> --------------
>>
>> Events are masked by setting the masked bit.  If the event is pending
>> and linked it does not need to be unlinked.
>>
>>     E[p].masked = 1
>>
>> Unmasking Events
>> ----------------
>>
>> Events are unmasked by the guest by clearing the masked bit.  If the
>> event is pending the guest must call the event channel unmask
>> hypercall so Xen can link the event into the correct event queue.
>>
>>     E[p].masked = 0
>>     if E[p].pending
>>         hypercall(EVTCHN_unmask)
> 
> Can the hypercall do the E[p].masked = 0?

Sure.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 21:07 ` Wei Liu
  2013-02-04 22:16   ` Keir Fraser
@ 2013-02-05 18:36   ` David Vrabel
  1 sibling, 0 replies; 28+ messages in thread
From: David Vrabel @ 2013-02-05 18:36 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel

On 04/02/13 21:07, Wei Liu wrote:
> 
> Where is the "priority" information stored? I think it should be
> somewhere inside Xen, right? I presume a field in struct evtchn?

I think so. I've not really thought too much about the internal design.

>> ### `EVTCHNOP_init`
>>
>> This call initializes a single VCPU's event channel data structures,
>> adding one page for the event array.
>>
> 
> I think the registration for new event channels should always be done in
> a batch. If not then you need to provide another OP to rollback previous
> registered ones.

Hmmm. That's an interesting point.  I'd be inclined to have the guest
take the VCPU offline if it cannot initialize it fully.

>> Each page of the event array has space for 1024 events ($E_P$) so a
>> regular domU will only require a single page.  Since event channels
>> typically come in pairs, the upper bound on the total number of pages
>                                ^^^^^
>                                upper or lower?

I meant upper, but I am playing fast-and-loose with the maths here since
I was aiming for an estimate rather than a real upper bound.

>> is $2 \times\text{number of VMs}$.
>>
>> If the guests are further restricted in the number of event channels
>> ($E_V$) then this upper bound can be reduced further.
>>
> 
> Can this bound really be reduced? Can you map memory on non-page
> granularity?

The reasoning here is that event channels are in pairs (or rather they
have two ends).  One event is the domU, the other in dom0.  The events
in dom0 are packed into pages, whereas the domU events use 1 page no
matter how few events there are.

I wasn't entirely happy with this way of doing the estimate which is why
I did the second method, which gave a similar figure.

>> The number of VMs ($V$) with a limit of $P$ total event array pages is:
>> \begin{equation*}
>> V = P \div \left(1 + \frac{E_V}{E_P}\right)
                    ^         ^^^^^^^^^
The page in domU             The fraction of a page in dom0.


>> Raising an Event
>> ----------------
>>
>> When Xen raises an event it marks it pending and (if it is not masked)
>> adds it tail of event queue.
>>
>>     E[p].pending = 1
>>     if not E[p].linked and not E[n].masked
>>         E[p].linked = 1
>>         E[p].link = 0
>>         mb()
>>         if H == 0
>>             H = p
>>         else
>>             E[T].link = p
>>         T = p
>>
>> Concurrent access by Xen to the event queue must be protected by a
>> per-event queue spin lock.
>>
> 
> I presume "E[n]" in the pseudo code is "E[p]"?

Yes.

> Is this spin lock really a good idea? How many threads / cpus will spin
> on this lock? As [0] shows, contention on spin lock incurs heavy
> performance penalty.

In addition to Keir's comment, the spinlock itself won't reside in the
same cache line as the control block or event array so this will reduce
cache line bouncing.

> [0] https://lwn.net/Articles/530458/
> 
>> Consuming Events
>> ----------------
>>
>> The guests consumes events starting at the head until it reaches the
>> tail.  Events in the queue that are not pending or are masked are
>> consumed but not handled.
>>
>>     while H != 0
>>         p = H
>>         H = E[p].link
>>         if H == 0
>>             mb()
>>             H = E[p].link
>>         E[H].linked = 0
>>         if not E[p].masked
>>             handle(p)
>>
>> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
>>
> 
> How to synchronize access to the array and control blocks between Xen
> and guest? I'm afraid I have no knowledge of a xen-guest spin lock...

It's a lockless data structure on the consumer side (provided there is
only one consumer).

>> Unmasking Events
>> ----------------
>>
>> Events are unmasked by the guest by clearing the masked bit.  If the
>> event is pending the guest must call the event channel unmask
>> hypercall so Xen can link the event into the correct event queue.
>>
>>     E[p].masked = 0
>>     if E[p].pending
>>         hypercall(EVTCHN_unmask)
>>
>> The expectation here is that unmasking a pending event will be rare,
>> so the performance hit of the hypercall is minimal.
>>
> 
> Currently unmask a "remote" port requires issuing hyercall as well, so
> if unmasking is not very frequent, this is not a big problem.
> 
> But please take some interrupt-intensive workloads into consideration.
> For example, 1G nic (e1000) even with NAPI enabled can generate 27k+
> intrs per second under high packet load [1], 10G nic can surely generate
> more. Can you give estimation on the performance hit on the context
> switch?

I'm not sure how I would give an estimate, this is something that would
need to be measured I think.

Also, whilst the upcall itself might be reentrant, the processing of
each queue cannot be so the mask/unmask done by the irq_chip callbacks
isn't needed.  mask/unmask is then only needed occasionally (e.g., for
irq migration) and thus isn't so performance critical.

>>> Note: that after clearing the mask bit, the event may be raised and
>>> thus it may already be linked by the time the hypercall is done.
> 
> Even if the event has already been linked before you finish the
> hypercall, you would still need to get hold of the a lock to serialize
> access to event structure for checking, right? Or a test_bit on linked
> field is sufficient? I think you need to write some pseudo code for this
> as well.

Xen will need to take the per-queue lock for this, yes.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 18:05       ` George Dunlap
@ 2013-02-05 18:57         ` David Vrabel
  2013-02-05 19:03           ` Wei Liu
  2013-02-06 11:32           ` George Dunlap
  0 siblings, 2 replies; 28+ messages in thread
From: David Vrabel @ 2013-02-05 18:57 UTC (permalink / raw)
  To: George Dunlap; +Cc: Keir Fraser, Wei Liu, xen-devel

On 05/02/13 18:05, George Dunlap wrote:
> On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com
> <mailto:wei.liu2@citrix.com>> wrote:
> 
>     > Since the ABI needs to be changed to support more event channels
>     anyway,
>     > it seems an ideal point to revisit the design.
>     >
> 
>     Right. I do care about better design and good implementation. Can we
>     build a prototype of this design? We are less than two months away from
>     4.3 feature freeze, and the event channel scalability is planned for
>     that release, which means we need to be hurried. :-)

Two months doesn't seem possible even if I could work on this full time.

> I think the general consensus is that scaling event channels is pretty
> important -- probably important enough to slip the release a bit if
> necessary.  (Although it would certainly be better not to.)

What to do here is a non-trivial decision.  Possible options include:

1. Merge the 3-level ABI for 4.3.  Work on the FIFO-based ABI in
parallel, aiming to get this in for 4.4.  This means maintaining 3 event
channel ABIs in Xen.

2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in.

3. Status quo.  Defer extending the event channel ABI to 4.4.

The preferable option may be to:

4. Get the 3-level ABI to a mergable state. In parallel develop a
prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
freeze is here, evaluate it and make a decision then.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 18:57         ` David Vrabel
@ 2013-02-05 19:03           ` Wei Liu
  2013-02-06 11:32           ` George Dunlap
  1 sibling, 0 replies; 28+ messages in thread
From: Wei Liu @ 2013-02-05 19:03 UTC (permalink / raw)
  To: David Vrabel; +Cc: George Dunlap, Keir Fraser, wei.liu2, xen-devel

On Tue, 2013-02-05 at 18:57 +0000, David Vrabel wrote:
> On 05/02/13 18:05, George Dunlap wrote:
> > On Tue, Feb 5, 2013 at 3:16 PM, Wei Liu <wei.liu2@citrix.com
> > <mailto:wei.liu2@citrix.com>> wrote:
> > 
> >     > Since the ABI needs to be changed to support more event channels
> >     anyway,
> >     > it seems an ideal point to revisit the design.
> >     >
> > 
> >     Right. I do care about better design and good implementation. Can we
> >     build a prototype of this design? We are less than two months away from
> >     4.3 feature freeze, and the event channel scalability is planned for
> >     that release, which means we need to be hurried. :-)
> 
> Two months doesn't seem possible even if I could work on this full time.
> 
> > I think the general consensus is that scaling event channels is pretty
> > important -- probably important enough to slip the release a bit if
> > necessary.  (Although it would certainly be better not to.)
> 
> What to do here is a non-trivial decision.  Possible options include:
> 
> 1. Merge the 3-level ABI for 4.3.  Work on the FIFO-based ABI in
> parallel, aiming to get this in for 4.4.  This means maintaining 3 event
> channel ABIs in Xen.
> 
> 2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in.
> 3. Status quo.  Defer extending the event channel ABI to 4.4.
> 
> The preferable option may be to:
> 
> 4. Get the 3-level ABI to a mergable state. In parallel develop a
> prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
> freeze is here, evaluate it and make a decision then.
> 

Actually this is what I'm thinking now.


Wei.

> David

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 17:52 Scalable Event Channel ABI design (draft A) David Vrabel
                   ` (2 preceding siblings ...)
  2013-02-05 16:10 ` Ian Campbell
@ 2013-02-06  9:13 ` Ian Campbell
  3 siblings, 0 replies; 28+ messages in thread
From: Ian Campbell @ 2013-02-06  9:13 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel

On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> 
> ### `EVTCHNOP_set_priority`
> 
> This call sets the priority for an event channel.  The event must be
> unbound.

I suppose this restriction is because it is hard to change the priority
of a pending interrupt from either Xen or the guest in a lock-free
manner?

The problem is that while on one end you call EVTCHNOP_alloc_unbound and
can then change the priority on the other you call
EVTCHNOP_bind_interdomain passing (remote-domid,remote-evtcht) and
receive a local-evtchn which is already bound, which means you never get
the opportunity to set the priority. Likewise when binding virqs and
such you never get to see the unbound evtchn.

I don't think we want to go around adding priority to all of those
existing binding hypercalls.

Perhaps it would be acceptable to say that after having called this
hypercall the domain may still observe at most one upcall of the evtchn
with the old priority, which corresponds with the interrupt being raised
right before the priority change takes effect, but being delivered
after.

> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest.  Valid priorities are
> 0 - 15 and the default is 7.
> 
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
> 
> Field       Purpose
> -----       -------
> `port`      [in] The event channel.
> `priority`  [in] The priority for the event channel.
> 
> Error code  Reason
> ----------  ------
> EINVAL      `port` is invalid.
> EINVAL      `port` is currently bound.
> EINVAL      `priority` is outside the range 0 - 15. 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 18:18   ` David Vrabel
@ 2013-02-06  9:35     ` Ian Campbell
  0 siblings, 0 replies; 28+ messages in thread
From: Ian Campbell @ 2013-02-06  9:35 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel

On Tue, 2013-02-05 at 18:18 +0000, David Vrabel wrote:
> On 05/02/13 16:10, Ian Campbell wrote:
> > On Mon, 2013-02-04 at 17:52 +0000, David Vrabel wrote:
> >> The pages within the event array need not be physically nor virtually
> >> contiguous, but the guest or Xen may make the virtually contiguous for
> >> ease of implementation.  e.g., by using vmap() in Xen or vmalloc() in
> >> Linux.  Pages are added by the guest as required by the number of
> >> bound event channels.
> > 
> > Strictly speaking the size is related to the maximum bound evtchn, which
> > need not be the same as the number of bound evtchns.
> 
> Yes.
> 
> >> The state of an event is tracked using 3 bits within the event word.
> >> the M (masked), P (pending) and L (linked) bits.  Only state
> >> transitions that change a single bit are valid.
> > 
> > The diagram shows transitions B<->P and P<->L, which involve two bits in
> > each case. So do BM<->PM and LM<->BM now I think about it.
> 
> B == 000b, P == 100b, L == 101b.
> 
> Similarly for the masked transitions:
> 
> BM == 010b, PM == 110b, LM == 111b.

Ah, it wasn't clear that the states in the diagram didn't relate to the
bits directly, reusing the letters in that way is a bit confusing.

> 
> > Is the L bit redundant with the LINK field == 0 or != 0?
> 
> LINK == 0 is the end of queue marker.  We could use all ones to mean
> 'unlinked' but using the L bit allows the guest to remove the event from
> the queue by clearing a single bit, instead of writing the LINK field.

So you can use test/clear-bit instead of cmpxchg?

> >> A guest should call this during initial VCPU bring up (and on resume?).
> >>
> >>     struct evtchnop_init {
> >>         uint32_t vcpu;
> >>         uint64_t array_pfn;
> >>     };
> > 
> > I think this will have a different layout on 32 and 64 bit x86, if you
> > care (because uint64_t is align(4) on 32-bit and align(8) on 64-bit).
> 
> Ok.  I thought uint64_t was always 8 byte aligned on both.

Sadly not, it's the cause of a fair few 32- vs 64-bit ABI
differences :-(

> >> Memory Usage
> >> ------------
> >>
> >> Alternatively, we can consider a system with $D$ driver domains, each
> >> of which requires $E_D$ events, and a dom0 using the maximum number of
> >> pages (128).
> >>
> >> \begin{eqnarray*}
> >> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> >> \end{eqnarray*}
> >>
> >> With, for example, 16 driver domains each using the maximum number of pages:
> >> \begin{eqnarray*}
> >> V  & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
> >>    & = & 129 \times 10^3\text{ VMs}
> >> \end{eqnarray*}
> > 
> > This accounts for the driver domains and dom0 but not the domains which
> > they are serving, doesn't it?
> 
> This is calculating the number of pages left once those used for dom0
> and the driver domains are used.  This is the same as the number of
> supportable VMs since the other VMs only require a page.

Ah, makes sense.

> >> In summary, there is space to map the event arrays for over 100,000
> >> VMs.  This is more than the limit imposed by the 16 bit domain ID
> >> (~32,000 VMs).
> > 
> > Is there scope to reduce the maximum then?
> 
> Maximum of what?

I suppose the maximum number of pages a domain can use for evtchn
queues.

> 
> >> ### Control Block
> >>
> >> With $L$ priority levels and two 32-bit words for the head and tail
> >> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> >> for the control block is:
> >> \begin{eqnarray*}
> >> S & = & L \times 2 \times 4 \\
> >>   & = & 16 \times 2 \times 4 \\
> >>   & = & 128\text{ bytes}
> >> \end{eqnarray*}
> >>
> >> There is more than enough space within `struct vcpu_info` for the
> >> control block while still keeping plenty of space for future use.
> > 
> > There is? I can see 7 bytes of padding and 4 bytes of evtchn_pending_sel
> > which I suppose becomes redundant now.
> > 
> > I don't think it would be a problem to predicate use of this new
> > interface on the use of the VCPU_placement API and therefore give scope
> > to expand the vcpu_info.
> 
> I should have taken a proper look at where the vcpu_info comes from...
> Oops.
> 
> We can add a new VCPUOP_register_vcpu_info_and_control_block which will
> add a struct vcpu_info and the control block.

Better to either say that using the new ABI requires the guest to have
reserved space after vcpu_info or to have a separate call for the
control block (which could have the constraint that it is on the same
page as vcpu_nfo)

> >> Low Level Design
> >> ================
> >>
> >> In the pseudo code in this section, all memory accesses are atomic,
> >> including those to bit-fields within the event word.
> > 
> >> These variables have a standard meaning:
> >>
> >> Variable  Purpose
> >> --------  -------
> >> E         Event array.
> >> p         A specific event.
> >> H         The head index for a specific priority.
> >> T         The tail index for a specific priority.
> >>
> >> These memory barriers are required:
> >>
> >> Function  Purpose
> >> --------  -------
> >> mb()      Full (read/write) memory barrier
> >>
> >> Raising an Event
> >> ----------------
> >>
> >> When Xen raises an event it marks it pending and (if it is not masked)
> >> adds it tail of event queue.
> > 
> > What are the conditions for actually performing the upcall when
> > returning from the hypervisor to the guest?
> 
> Any H != 0 for that VCPU.  This means we should have a per-VCPU bitmap
> of non empty queues.

You need to be careful here since if two evtchn's are raised near
simultaneously then you will return to the guest with H == +2, and do an
upcall. If in the course of handling the first interrupt you make a
hypercall you will be returning with H == +1, but you won't want to
reinject.

Which makes me wonder about handling of IRQ priority, is the intention
that if a higher priority interrupt occurs while you are processing an
existing upcall then it will be interrupted for the new higher priority
one? In which case we need to have an IRQL concept which allows the
domain to re-enable interrupts while masking interrupts at the current
or lower priority.

Or perhaps it is acceptable to say that a domain will always finish
handling the current IRQ before it takes the higher priority one.

Should we have a bitmask of pending priorities so the guest doesn't have
to check all the queues? It's only 15/16 queues but since most evtchns
probably remain at level 7 that's still a bit wasteful, especially if
you need to restart the scan after every interrupt to implement
prioritisation.

> >> Concurrent access by Xen to the event queue must be protected by a
> >> per-event queue spin lock.
> >>
> >> Consuming Events
> >> ----------------
> >>
> >> The guests consumes events starting at the head until it reaches the
> >> tail.  Events in the queue that are not pending or are masked are
> >> consumed but not handled.
> >>
> >>     while H != 0
> >>         p = H
> >>         H = E[p].link
> >>         if H == 0
> >>             mb()
> >>             H = E[p].link
> >>         E[H].linked = 0
> > 
> > Did you mean E[p].linked here?
> > 
> > If at this point the interrupt is reraised then the if in the raising
> > pseudo code becomes true, linked will set again and don't we also race
> > with the clearing of E[p].pending below?
> 
> The event will be correctly linked and then we clear P,

By P you mean E[p].pending in the pseudo-code terminology?

>  when we consume
> the linked event it will be ignored as P is already clear.

So we will miss the second interrupt?

> Perhaps we could also avoid linking the event if it is already pending?
> 
> >>> Note: When the event queue contains a single event, removing the
> >>> event may race with Xen appending another event because the load of
> >>> `E[p].link` and the store of `H` is not atomic.  To avoid this race,
> >>> the guest must recheck `E[p].link` if the list appeared empty.
> > 
> > It appears that both "Raising an Event" and "Consuming Events" can write
> > H? Is that correct? Likewise for the linked bit.
> 
> Xen only writes H when the queue is empty, the VM only writes H if it is
> non-empty.
> 
> The linked bit is set by Xen and cleared by the guest.
> 
> Do you see a problem with this?

You'd need to be careful about the barriers and think hard about the
case where the VM is emptying the queue at the same time as Xen is
injecting a new interrupt and therefore unemptying it. And vice-versa.

> > It's a bit unclear because the pseudo-code doesn't make it explicitly
> > which variables are par of the shared data structure and which are
> > private to the local routine.
> 
> Capital variables are in the shared structures.  Lower-case are local.
> 
> >> Masking Events
> >> --------------
> >>
> >> Events are masked by setting the masked bit.  If the event is pending
> >> and linked it does not need to be unlinked.
> >>
> >>     E[p].masked = 1
> >>
> >> Unmasking Events
> >> ----------------
> >>
> >> Events are unmasked by the guest by clearing the masked bit.  If the
> >> event is pending the guest must call the event channel unmask
> >> hypercall so Xen can link the event into the correct event queue.
> >>
> >>     E[p].masked = 0
> >>     if E[p].pending
> >>         hypercall(EVTCHN_unmask)
> > 
> > Can the hypercall do the E[p].masked = 0?
> 
> Sure.

Perhaps I should have said "Should..." :-)

Ian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 18:02           ` Keir Fraser
@ 2013-02-06  9:38             ` Ian Campbell
  2013-02-06 10:41               ` Keir Fraser
  2013-02-06 10:42               ` Wei Liu
  0 siblings, 2 replies; 28+ messages in thread
From: Ian Campbell @ 2013-02-06  9:38 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Wei Liu, David Vrabel, xen-devel

On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:
> On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> 
> >>> Okay, I wonder how much it actually matters anyhow...
> >>> 
> >>> Oh by the way you say the control block is 128 bytes and will easily fit in
> >>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> >>> So how does that work then?
> >> 
> >> I meant struct vcpu_info can be extended without it growing to more than
> >> a page.  i.e., it fits into the guest page provided in the
> >> VCPUOP_register_vcpu_info call so no additional pages need to be
> >> globally mapped for the control block.
> > 
> > VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> > by itself, even if that happens to be the Linux implementation today (I
> > haven't checked that).
> 
> Having guest agree that vcpu_info grows by size of the per-vcpu control
> block, if using this new event-channel interface, is reasonable though.

Can only use this trick once though, so it might be blocking ourselves
into a future ABI corner.

Is there a downside to registering the control block separately? The
guest can always arrange for them to be contiguous if it wants, or if we
are worried about the number of global mappings then the hypervisor
could require it shares a page with the vcpu_info but allow the offset
to be specified separately.

Ian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06  9:38             ` Ian Campbell
@ 2013-02-06 10:41               ` Keir Fraser
  2013-02-06 10:42               ` Wei Liu
  1 sibling, 0 replies; 28+ messages in thread
From: Keir Fraser @ 2013-02-06 10:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, David Vrabel, xen-devel

On 06/02/2013 09:38, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
>>> by itself, even if that happens to be the Linux implementation today (I
>>> haven't checked that).
>> 
>> Having guest agree that vcpu_info grows by size of the per-vcpu control
>> block, if using this new event-channel interface, is reasonable though.
> 
> Can only use this trick once though, so it might be blocking ourselves
> into a future ABI corner.
> 
> Is there a downside to registering the control block separately? The
> guest can always arrange for them to be contiguous if it wants, or if we
> are worried about the number of global mappings then the hypervisor
> could require it shares a page with the vcpu_info but allow the offset
> to be specified separately.

I would say we consider vcpu_info to be versioned, and that using the new
event-channel interface requires use of at least v2 of the vcpu_info
structure. There is a rsvd field in register_vcpu_info hypercall that could
be used for specifying such a thing, although sadly it is not currently
checked to be zero, so we may not actually be able to make use of those
bits.

 -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06  9:38             ` Ian Campbell
  2013-02-06 10:41               ` Keir Fraser
@ 2013-02-06 10:42               ` Wei Liu
  2013-02-06 10:52                 ` Ian Campbell
  1 sibling, 1 reply; 28+ messages in thread
From: Wei Liu @ 2013-02-06 10:42 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Keir (Xen.org), wei.liu2, David Vrabel, xen-devel

On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote:
> On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:
> > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> > 
> > >>> Okay, I wonder how much it actually matters anyhow...
> > >>> 
> > >>> Oh by the way you say the control block is 128 bytes and will easily fit in
> > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> > >>> So how does that work then?
> > >> 
> > >> I meant struct vcpu_info can be extended without it growing to more than
> > >> a page.  i.e., it fits into the guest page provided in the
> > >> VCPUOP_register_vcpu_info call so no additional pages need to be
> > >> globally mapped for the control block.
> > > 
> > > VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> > > by itself, even if that happens to be the Linux implementation today (I
> > > haven't checked that).
> > 
> > Having guest agree that vcpu_info grows by size of the per-vcpu control
> > block, if using this new event-channel interface, is reasonable though.
> 
> Can only use this trick once though, so it might be blocking ourselves
> into a future ABI corner.
> 

In practice embedding control block in vcpu_info might not be feasible
because there is a legacy array of vcpu_info in shared_info page. It is
quite easy to bloat shared_info to exceed size limit.

> Is there a downside to registering the control block separately? The
> guest can always arrange for them to be contiguous if it wants, or if we
> are worried about the number of global mappings then the hypervisor
> could require it shares a page with the vcpu_info but allow the offset
> to be specified separately.
> 

IMHO the global mapping space is the main concern. Regarding sharing
page with vcpu_info, this requires us to control the way kernel handles
its per-cpu section. But how?


Wei.

> Ian.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06 10:42               ` Wei Liu
@ 2013-02-06 10:52                 ` Ian Campbell
  2013-02-06 11:09                   ` Wei Liu
  0 siblings, 1 reply; 28+ messages in thread
From: Ian Campbell @ 2013-02-06 10:52 UTC (permalink / raw)
  To: Wei Liu; +Cc: Keir (Xen.org), David Vrabel, xen-devel

On Wed, 2013-02-06 at 10:42 +0000, Wei Liu wrote:
> On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote:
> > On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:
> > > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> > > 
> > > >>> Okay, I wonder how much it actually matters anyhow...
> > > >>> 
> > > >>> Oh by the way you say the control block is 128 bytes and will easily fit in
> > > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> > > >>> So how does that work then?
> > > >> 
> > > >> I meant struct vcpu_info can be extended without it growing to more than
> > > >> a page.  i.e., it fits into the guest page provided in the
> > > >> VCPUOP_register_vcpu_info call so no additional pages need to be
> > > >> globally mapped for the control block.
> > > > 
> > > > VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> > > > by itself, even if that happens to be the Linux implementation today (I
> > > > haven't checked that).
> > > 
> > > Having guest agree that vcpu_info grows by size of the per-vcpu control
> > > block, if using this new event-channel interface, is reasonable though.
> > 
> > Can only use this trick once though, so it might be blocking ourselves
> > into a future ABI corner.
> > 
> 
> In practice embedding control block in vcpu_info might not be feasible
> because there is a legacy array of vcpu_info in shared_info page. It is
> quite easy to bloat shared_info to exceed size limit.

I don't think we need to literally add the control block to struct
vcpu_info, just require that it immediately follows the vpcu_info. This
new interface requires the use of vcpu_info placement so the legacy
array is not a concern.

> > Is there a downside to registering the control block separately? The
> > guest can always arrange for them to be contiguous if it wants, or if we
> > are worried about the number of global mappings then the hypervisor
> > could require it shares a page with the vcpu_info but allow the offset
> > to be specified separately.
> > 
> 
> IMHO the global mapping space is the main concern. Regarding sharing
> page with vcpu_info, this requires us to control the way kernel handles
> its per-cpu section. But how?

We simply require that the kernel do it right...

In this instance you would probably do:
	struct per_vcpu_info {
		struct vcpu_info ...;
		struct evtchn_control ...;
	}
and allocate that in per-cpu space. Or is there concern that this
allocation might cross a page boundary? I think it is required to be
naturally aligned (i.e. aligned to at least its own size) so that is ok.

Ian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06 10:52                 ` Ian Campbell
@ 2013-02-06 11:09                   ` Wei Liu
  0 siblings, 0 replies; 28+ messages in thread
From: Wei Liu @ 2013-02-06 11:09 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Keir (Xen.org), wei.liu2, David Vrabel, xen-devel

On Wed, 2013-02-06 at 10:52 +0000, Ian Campbell wrote:
> On Wed, 2013-02-06 at 10:42 +0000, Wei Liu wrote:
> > On Wed, 2013-02-06 at 09:38 +0000, Ian Campbell wrote:
> > > On Tue, 2013-02-05 at 18:02 +0000, Keir Fraser wrote:
> > > > On 05/02/2013 16:11, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> > > > 
> > > > >>> Okay, I wonder how much it actually matters anyhow...
> > > > >>> 
> > > > >>> Oh by the way you say the control block is 128 bytes and will easily fit in
> > > > >>> the existing struct vcpu_info. That existing structure is 64 bytes in total.
> > > > >>> So how does that work then?
> > > > >> 
> > > > >> I meant struct vcpu_info can be extended without it growing to more than
> > > > >> a page.  i.e., it fits into the guest page provided in the
> > > > >> VCPUOP_register_vcpu_info call so no additional pages need to be
> > > > >> globally mapped for the control block.
> > > > > 
> > > > > VCPUOP_register_vcpu_info doesn't require each vcpu_info to be on a page
> > > > > by itself, even if that happens to be the Linux implementation today (I
> > > > > haven't checked that).
> > > > 
> > > > Having guest agree that vcpu_info grows by size of the per-vcpu control
> > > > block, if using this new event-channel interface, is reasonable though.
> > > 
> > > Can only use this trick once though, so it might be blocking ourselves
> > > into a future ABI corner.
> > > 
> > 
> > In practice embedding control block in vcpu_info might not be feasible
> > because there is a legacy array of vcpu_info in shared_info page. It is
> > quite easy to bloat shared_info to exceed size limit.
> 
> I don't think we need to literally add the control block to struct
> vcpu_info, just require that it immediately follows the vpcu_info. This
> new interface requires the use of vcpu_info placement so the legacy
> array is not a concern.
> 
> > > Is there a downside to registering the control block separately? The
> > > guest can always arrange for them to be contiguous if it wants, or if we
> > > are worried about the number of global mappings then the hypervisor
> > > could require it shares a page with the vcpu_info but allow the offset
> > > to be specified separately.
> > > 
> > 
> > IMHO the global mapping space is the main concern. Regarding sharing
> > page with vcpu_info, this requires us to control the way kernel handles
> > its per-cpu section. But how?
> 
> We simply require that the kernel do it right...
> 
> In this instance you would probably do:
> 	struct per_vcpu_info {
> 		struct vcpu_info ...;
> 		struct evtchn_control ...;
> 	}
> and allocate that in per-cpu space. Or is there concern that this
> allocation might cross a page boundary? I think it is required to be
> naturally aligned (i.e. aligned to at least its own size) so that is ok.
> 

I get what you mean. ;-)

My other concern is, along this path we can save global mapping address
space, but now the burden of enabling the new interface somewhat slip to
the vcpu placement hypercall other than HYPERVISOR_event_channel_op. If
you're fine with this I will be OK too.


Wei.

> Ian.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-05 18:57         ` David Vrabel
  2013-02-05 19:03           ` Wei Liu
@ 2013-02-06 11:32           ` George Dunlap
  2013-02-06 13:53             ` Keir Fraser
  1 sibling, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-02-06 11:32 UTC (permalink / raw)
  To: David Vrabel; +Cc: Keir Fraser, Wei Liu, xen-devel

On 05/02/13 18:57, David Vrabel wrote:
> What to do here is a non-trivial decision.  Possible options include:
>
> 1. Merge the 3-level ABI for 4.3.  Work on the FIFO-based ABI in
> parallel, aiming to get this in for 4.4.  This means maintaining 3 event
> channel ABIs in Xen.
>
> 2. Slip the 4.3 release for 2-3 months and merge the FIFO-based ABI in.
>
> 3. Status quo.  Defer extending the event channel ABI to 4.4.
>
> The preferable option may be to:
>
> 4. Get the 3-level ABI to a mergable state. In parallel develop a
> prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
> freeze is here, evaluate it and make a decision then.

Just to clarify, the difference between #1 and #4 is that in #4 we hold 
off on the merge, to delay committing to a specific course of action 
until later?

That seems at first blush to be a pretty safe option.  But I think it's 
worth pointing out that in practice the end result is likely to be that 
we just go with #1 eventually anyway: if the FIFO ABI can't be finished 
in 4 months giving it all our effort, can we really expect it to be 
finished in any less time while polishing up the 3-level ABI?

I was going to say, "There's no particular reason to merge the 3-level 
ABI sooner rather than later", but of course there is: it allows 
considerably longer and wider testing.

No conclusion here, just adding to the mix of things to consider. :-)

  -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-04 19:59 ` Keir Fraser
  2013-02-05 14:48   ` David Vrabel
@ 2013-02-06 11:46   ` Jan Beulich
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2013-02-06 11:46 UTC (permalink / raw)
  To: David Vrabel, Keir Fraser; +Cc: Wei Liu, xen-devel

>>> On 04.02.13 at 20:59, Keir Fraser <keir.xen@gmail.com> wrote:
> On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:
>> Here is a design for a scalable event channel ABI for Xen.  It has a
>> number of benefits over the design currently being proposed by Wei Lui.
>> 
>> * More event channels (>100,000).
>> * 16 event priorities.
>> * Reduced memory requirements (only 1 additional page per domU).
>> * The use of FIFOs for events ensures fairness, whereas it is difficult
>> to reason about the fairness with the current bitmap system.
> 
> I have some sympathy for this design. It's primary downside compared with
> the 3-level alternative is its greater space cost (32*#vcpus). However, as
> you say the fairness and prioritisation features are rather nice. Also
> having the data structures be per vcpu may well help avoid cacheline
> contention on busy multi-vcpu guests.
> 
> Interested in what others think, but I may actually prefer a ground-up
> redesign like this.

So do I, fwiw.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06 11:32           ` George Dunlap
@ 2013-02-06 13:53             ` Keir Fraser
  2013-03-14 19:20               ` David Vrabel
  0 siblings, 1 reply; 28+ messages in thread
From: Keir Fraser @ 2013-02-06 13:53 UTC (permalink / raw)
  To: George Dunlap, David Vrabel; +Cc: Wei Liu, xen-devel

On 06/02/2013 11:32, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

>> 4. Get the 3-level ABI to a mergable state. In parallel develop a
>> prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
>> freeze is here, evaluate it and make a decision then.
> 
> Just to clarify, the difference between #1 and #4 is that in #4 we hold
> off on the merge, to delay committing to a specific course of action
> until later?
> 
> That seems at first blush to be a pretty safe option.  But I think it's
> worth pointing out that in practice the end result is likely to be that
> we just go with #1 eventually anyway: if the FIFO ABI can't be finished
> in 4 months giving it all our effort, can we really expect it to be
> finished in any less time while polishing up the 3-level ABI?
> 
> I was going to say, "There's no particular reason to merge the 3-level
> ABI sooner rather than later", but of course there is: it allows
> considerably longer and wider testing.
> 
> No conclusion here, just adding to the mix of things to consider. :-)

How many man-weeks do we think David's design would take to get to draft
implementation? I mean honestly I would have thought that a straight
two-week run at it would be a reasonable estimate -- the places it plugs in
in hypervisor and guest kernel are pretty clean and well defined.

This depends on a man having the weeks to spend on it of course!

 -- Keir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Scalable Event Channel ABI design (draft A)
  2013-02-06 13:53             ` Keir Fraser
@ 2013-03-14 19:20               ` David Vrabel
  0 siblings, 0 replies; 28+ messages in thread
From: David Vrabel @ 2013-03-14 19:20 UTC (permalink / raw)
  To: Keir Fraser; +Cc: George Dunlap, Wei Liu, xen-devel

On 06/02/13 13:53, Keir Fraser wrote:
> On 06/02/2013 11:32, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> 
>>> 4. Get the 3-level ABI to a mergable state. In parallel develop a
>>> prototype of the FIFO-based ABI.  When the prototype is ready or the 4.3
>>> freeze is here, evaluate it and make a decision then.
>>
>> Just to clarify, the difference between #1 and #4 is that in #4 we hold
>> off on the merge, to delay committing to a specific course of action
>> until later?
>>
>> That seems at first blush to be a pretty safe option.  But I think it's
>> worth pointing out that in practice the end result is likely to be that
>> we just go with #1 eventually anyway: if the FIFO ABI can't be finished
>> in 4 months giving it all our effort, can we really expect it to be
>> finished in any less time while polishing up the 3-level ABI?
>>
>> I was going to say, "There's no particular reason to merge the 3-level
>> ABI sooner rather than later", but of course there is: it allows
>> considerably longer and wider testing.
>>
>> No conclusion here, just adding to the mix of things to consider. :-)
> 
> How many man-weeks do we think David's design would take to get to draft
> implementation? I mean honestly I would have thought that a straight
> two-week run at it would be a reasonable estimate -- the places it plugs in
> in hypervisor and guest kernel are pretty clean and well defined.

Your estimate wasn't far off.  After 6 days I now have my first
prototype implementation sending and receiving its first events.

There needs to be some more work on the Xen side to add an explicit hook
for binding new event channels to VCPU0 (so we can hook them up to the
right queue) and the Linux side is limited to 1024 events as Linux isn't
expanding the event array yet.

And it needs more testing of course.

Should be able to post a patch series of the prototype of both the Xen
and Linux sides within a week or so.

David

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2013-03-14 19:20 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-04 17:52 Scalable Event Channel ABI design (draft A) David Vrabel
2013-02-04 19:59 ` Keir Fraser
2013-02-05 14:48   ` David Vrabel
2013-02-05 15:16     ` Wei Liu
2013-02-05 18:05       ` George Dunlap
2013-02-05 18:57         ` David Vrabel
2013-02-05 19:03           ` Wei Liu
2013-02-06 11:32           ` George Dunlap
2013-02-06 13:53             ` Keir Fraser
2013-03-14 19:20               ` David Vrabel
2013-02-05 15:49     ` Keir Fraser
2013-02-05 15:54       ` David Vrabel
2013-02-05 16:11         ` Ian Campbell
2013-02-05 18:02           ` Keir Fraser
2013-02-06  9:38             ` Ian Campbell
2013-02-06 10:41               ` Keir Fraser
2013-02-06 10:42               ` Wei Liu
2013-02-06 10:52                 ` Ian Campbell
2013-02-06 11:09                   ` Wei Liu
2013-02-05 16:11         ` Keir Fraser
2013-02-06 11:46   ` Jan Beulich
2013-02-04 21:07 ` Wei Liu
2013-02-04 22:16   ` Keir Fraser
2013-02-05 18:36   ` David Vrabel
2013-02-05 16:10 ` Ian Campbell
2013-02-05 18:18   ` David Vrabel
2013-02-06  9:35     ` Ian Campbell
2013-02-06  9:13 ` Ian Campbell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.