* [RFC] ivshmem v2: Shared memory device specification
@ 2020-05-25  7:58 ` Jan Kiszka
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-05-25  7:58 UTC (permalink / raw)
  To: virtio-comment
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, Michael S. Tsirkin

Hi all,

as requested by Michael, find below the current version of the Inter-VM 
Shared Memory device specification version 2 (as version 1 could be 
considered what is currently in QEMU).

This posting is intended to collect feedback from the virtio community 
before officially proposing it to become part of the spec. As you can 
see, the spec format is not yet integrated with the virtio documents at 
all.

IVSHMEM has value of its own, allowing unprivileged applications to 
establish links to other applications in VMs connected via this 
transport. In addition, and that is where virtio comes into play even 
more, it can be used to build virtio backend-frontend connections 
between two VMs. Prototypes have been developed (see e.g. [1]), but 
specifying that transport is still a to-do. I will try to reserve a few 
cycles in the upcoming weeks for a first draft.

My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
to propose them both to virtio while allowing virtio-shmem to also be 
mapped onto other shared memory channels for virtual machines.

The ivshmem2 device model is fairly stable now, having been in use in 
Jailhouse for quite a while. Still, there are some aspects that may be 
worth discussing in particular:

 - Do we also need a platform device model, in addition to PCI? My
   feelings are negative, but there has been at least one request.

 - Should we add some feature flags, or is using the PCI revision ID
   sufficient (...to do that later)? Currently, there is no need for
   communicating features this way.

 - Should we try to model the doorbell interface more flexibly, in a
   way that may allow mapping it on hardware-provided mailboxes (i.e. VM-
   exit free channels)? Unfortunately, I'm not yet aware of any hardware
   that could provide this feature and, thus, could act as a use case to
   design against.

Thanks,
Jan

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html

---

IVSHMEM Device Specification
============================

** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **

The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
define the minimally needed building blocks a hypervisor has to
provide for enabling guest-to-guest communication. The details of
communication protocols shall remain opaque to the hypervisor so that
guests are free to define them as simple or sophisticated as they
need.

For that purpose, the IVSHMEM device provides the following features
to its users:

- Interconnection between up to 65536 peers

- Multi-purpose shared memory region

    - common read/writable section

    - output sections that are read/writable for one peer and only
      readable for the others

    - section with peer states

- Event signaling via interrupt to remote sides

- Support for life-cycle management via state value exchange and
  interrupt notification on changes, backed by a shared memory
  section

- Free choice of protocol to be used on top

- Protocol type declaration

- Registers can be implemented either memory-mapped or via I/O,
  depending on platform support and lower VM-exit costs

- Unprivileged access to memory-mapped or I/O registers feasible

- Single discovery and configuration via standard PCI, no complexity
  by additionally defining a platform device model


Hypervisor Model
----------------

In order to provide a consistent link between peers, all connected
instances of IVSHMEM devices need to be configured, created and run
by the hypervisor according to the following requirements:

- The instances of the device shall appear as PCI devices to their
  users.

- The read/write shared memory section has to be of the same size for
  all peers. The size can be zero.

- If shared memory output sections are present (non-zero section
  size), there must be one reserved for each peer with exclusive
  write access. All output sections must have the same size and must
  be readable for all peers.

- The State Table must have the same size for all peers, must be
  large enough to hold the state values of all peers, and must be
  read-only for the user.

- State register changes (explicit writes, peer resets) have to be
  propagated to the other peers by updating the corresponding State
  Table entry and issuing an interrupt to all other peers if they
  enabled reception.

- Interrupt events triggered by a peer have to be delivered to the
  target peer, provided the receiving side is valid and has enabled
  the reception.

- All peers must have the same interrupt delivery features available,
  i.e. MSI-X with the same maximum number of vectors on platforms
  supporting this mechanism, otherwise INTx with one vector.


Guest-side Programming Model
----------------------------

An IVSHMEM device appears as a PCI device to its users. Unless
otherwise noted, it conforms to the PCI Local Bus Specification,
Revision 3.0. As such, it is discoverable via the PCI configuration
space and provides a number of standard and custom PCI configuration
registers.

### Shared Memory Region Layout

The shared memory region is divided into several sections.

    +-----------------------------+   -
    |                             |   :
    | Output Section for peer n-1 |   : Output Section Size
    |     (n = Maximum Peers)     |   :
    +-----------------------------+   -
    :                             :
    :                             :
    :                             :
    +-----------------------------+   -
    |                             |   :
    |  Output Section for peer 1  |   : Output Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |  Output Section for peer 0  |   : Output Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |     Read/Write Section      |   : R/W Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |         State Table         |   : State Table Size
    |                             |   :
    +-----------------------------+   <-- Shared memory base address

The first section consists of the mandatory State Table. Its size is
defined by the State Table Size register and cannot be zero. This
section is read-only for all peers.

The second section consists of shared memory that is read/writable
for all peers. Its size is defined by the R/W Section Size register.
A size of zero is permitted.

The third and following sections are output sections, one for each
peer. Their sizes are all identical. The size of a single output
section is defined by the Output Section Size register. An output
section is read/writable for the corresponding peer and read-only for
all other peers. E.g., only the peer with ID 3 can write to the
fourth output section, but all peers can read from this section.

All sizes have to be rounded up to multiples of a mappable page in
order to allow access control according to the section restrictions.
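The layout above can be sketched as offset arithmetic. This is an illustrative helper only, assuming the three size values have already been read from the vendor-specific capability; the function names are not part of the specification:

```c
#include <stdint.h>

/* Offset of the common read/write section relative to the shared
 * memory base address: it immediately follows the State Table. */
static uint64_t ivshmem_rw_offset(uint32_t state_table_size)
{
    return state_table_size;
}

/* Offset of a given peer's output section: output sections follow the
 * read/write section, one per peer, all of identical size. */
static uint64_t ivshmem_output_offset(uint32_t state_table_size,
                                      uint64_t rw_size,
                                      uint64_t output_size,
                                      uint16_t peer_id)
{
    return (uint64_t)state_table_size + rw_size + output_size * peer_id;
}
```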

### Configuration Space Registers

#### Header Registers

| Offset | Register               | Content                                              |
|-------:|:-----------------------|:-----------------------------------------------------|
|    00h | Vendor ID              | 110Ah                                                |
|    02h | Device ID              | 4106h                                                |
|    04h | Command Register       | 0000h on reset, writable bits are:                   |
|        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
|        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
|        |                        | Bit 3: Bus Master                                    |
|        |                        | Bit 10: INTx interrupt disable                       |
|        |                        | Writes to other bits are ignored                     |
|    06h | Status Register        | 0010h, static value                                  |
|        |                        | In deviation from the PCI specification, the Interrupt |
|        |                        | Status (bit 3) is never set                          |
|    08h | Revision ID            | 00h                                                  |
|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
|    0Bh | Class Code, Base Class | FFh                                                  |
|    0Eh | Header Type            | 00h                                                  |
|    10h | BAR 0                  | MMIO or I/O register region                          |
|    14h | BAR 1                  | MSI-X region                                         |
|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
|    34h | Capability Pointer     | First capability                                     |
|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |

The INTx status bit is never set by an implementation. Users of the
IVSHMEM device are instead expected to derive the event state from
protocol-specific information kept in the shared memory. This
approach is significantly faster, and the complexity of
register-based status tracking can be avoided.

If BAR 2 is not present, the shared memory region is not relocatable
by the user. In that case, the hypervisor has to implement the Base
Address register in the vendor-specific capability.

Subsystem IDs shall encode the provider (hypervisor) in order to
allow identifying potential deviating implementations in case this
should ever be required.

If its platform supports MSI-X, an implementation of the IVSHMEM
device must provide this interrupt model and must not expose INTx
support.

Other header registers may not be implemented. If not implemented,
they return 0 on read and ignore write accesses.

#### Vendor Specific Capability (ID 09h)

This capability must always be present.

| Offset | Register            | Content                                        |
|-------:|:--------------------|:-----------------------------------------------|
|    00h | ID                  | 09h                                            |
|    01h | Next Capability     | Pointer to next capability or 00h              |
|    02h | Length              | 20h if Base Address is present, 18h otherwise  |
|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
|        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
|    04h | State Table Size    | 32-bit size of read-only State Table           |
|    08h | R/W Section Size    | 64-bit size of common read/write section       |
|    10h | Output Section Size | 64-bit size of output sections                 |
|    18h | Base Address        | optional: 64-bit base address of shared memory |

All registers are read-only. Writes are ignored, except to bit 0 of
the Privileged Control register.

When bit 0 in the Privileged Control register is set to 1, the device
clears bit 0 in the Interrupt Control register on each interrupt
delivery. This enables automatic interrupt throttling when
re-enabling is left to a scheduled unprivileged instance on the user
side.

An IVSHMEM device may not support a relocatable shared memory region.
This supports the hypervisor in locking down the guest-to-host address
mapping and simplifies the runtime logic. In such a case, BAR 2 must
not be implemented by the hypervisor. Instead, the Base Address
register has to be implemented to report the location of the shared
memory region in the user's address space.

A non-existing shared memory section has to report zero in its
Section Size register.
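The capability layout maps naturally onto a packed C structure. This is a sketch, assuming a little-endian view of PCI configuration space; the struct and field names are illustrative, not normative:

```c
#include <stddef.h>
#include <stdint.h>

/* Vendor-specific capability as seen in PCI config space. The Base
 * Address field is only present when the Length register reads 20h. */
struct ivshmem_vendor_cap {
    uint8_t  id;                  /* 00h: 09h                         */
    uint8_t  next;                /* 01h: next capability or 00h      */
    uint8_t  length;              /* 02h: 20h or 18h                  */
    uint8_t  priv_ctrl;           /* 03h: bit 0 = one-shot int. mode  */
    uint32_t state_table_size;    /* 04h                              */
    uint64_t rw_section_size;     /* 08h                              */
    uint64_t output_section_size; /* 10h                              */
    uint64_t base_address;        /* 18h: optional                    */
} __attribute__((packed));
```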

#### MSI-X Capability (ID 11h)

On platforms supporting MSI-X, IVSHMEM has to provide interrupt
delivery via this mechanism. In that case, the MSI-X capability is
present while the legacy INTx delivery mechanism is not available,
and the Interrupt Pin configuration register returns 0.

The IVSHMEM device has no notion of pending interrupts. Therefore,
reading from the MSI-X Pending Bit Array will always return 0. Users
of the IVSHMEM device are instead expected to derive the event state
from protocol-specific information kept in the shared memory. This
approach is significantly faster, and the complexity of
register-based status tracking can be avoided.

The corresponding MSI-X MMIO region is configured via BAR 1.

The MSI-X table size reported by the MSI-X capability structure is
identical for all peers.

### Register Region

The register region may be implemented as MMIO or I/O.

When implementing it as MMIO, the hypervisor has to ensure that the
register region can be mapped as a single page into the address space
of the user, without causing potential overlaps with other resources.
Write accesses to MMIO region offsets that are not backed by
registers have to be ignored, read accesses have to return 0. This
enables the user to hand out the complete region, along with the
shared memory, to an unprivileged instance.

The region location in the user's physical address space is
configured via BAR 0. The following table visualizes the region
layout:

| Offset | Register                                                            |
|-------:|:--------------------------------------------------------------------|
|    00h | ID                                                                  |
|    04h | Maximum Peers                                                       |
|    08h | Interrupt Control                                                   |
|    0Ch | Doorbell                                                            |
|    10h | State                                                               |

All registers support only aligned 32-bit accesses.
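As a sketch, the register region can be described by the following structure; the names are illustrative, and each field must only be accessed with aligned 32-bit reads/writes:

```c
#include <stddef.h>
#include <stdint.h>

/* MMIO/I/O register region as configured via BAR 0. */
struct ivshmem_regs {
    uint32_t id;          /* 00h: read-only, local peer ID        */
    uint32_t max_peers;   /* 04h: read-only                       */
    uint32_t int_control; /* 08h: read/write, bit 0 = enable      */
    uint32_t doorbell;    /* 0Ch: write-only                      */
    uint32_t state;       /* 10h: read/write, local state value   */
};
```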

#### ID Register (Offset 00h)

Read-only register that reports the ID of the local device. It is
unique for all of the connected devices and remains unchanged over
their lifetime.

#### Maximum Peers Register (Offset 04h)

Read-only register that reports the maximum number of possible peers
(including the local one). The permitted range is between 2 and 65536
and remains constant over the lifetime of all peers.

#### Interrupt Control Register (Offset 08h)

This read/write register controls the generation of interrupts
whenever a peer writes to the Doorbell register or changes its state.

| Bits | Content                                                               |
|-----:|:----------------------------------------------------------------------|
|    0 | 1: Enable interrupt generation                                        |
| 1-31 | Reserved (0 on read, writes ignored)                                  |

Note that bit 0 is reset to 0 on interrupt delivery if one-shot
interrupt mode is enabled in the Privileged Control register.

The value of this register after device reset is 0.
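With one-shot mode enabled via the Privileged Control capability register, a handler re-arms the interrupt itself after consuming events. The following is a minimal sketch of that pattern, with illustrative naming; the actual event processing depends on the chosen protocol:

```c
#include <stdint.h>

/* One-shot handling pattern: the device already cleared bit 0 of the
 * Interrupt Control register before this handler ran, so the handler
 * processes pending events from shared memory and then writes 1 back
 * to re-enable interrupt generation. */
static void ivshmem_irq_handler(volatile uint32_t *int_control)
{
    /* ... consume doorbell/state events from shared memory ... */

    *int_control = 1; /* re-arm interrupt generation */
}
```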

#### Doorbell Register (Offset 0Ch)

Write-only register that triggers an interrupt vector in the target
device if it is enabled there.

| Bits  | Content                                                              |
|------:|:---------------------------------------------------------------------|
|  0-15 | Vector number                                                        |
| 16-31 | Target ID                                                            |

Writing a vector number that is not enabled by the target has no
effect. The peers can derive the number of available vectors from
their own device capabilities because the provider is required to
expose an identical number of vectors to all connected peers. The
peers are expected to define or negotiate the used ones via the
selected protocol.

Addressing a non-existing or inactive target has no effect. Peers can
identify active targets via the State Table.

Implementations of the Doorbell register must ensure that data written by the
CPU prior to issuing the register write is visible to the receiving peer before
the interrupt arrives.

The behavior on reading from this register is undefined.
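Composing a Doorbell value follows directly from the bit layout above. A minimal sketch, with an illustrative helper name:

```c
#include <stdint.h>

/* Build the 32-bit value for a Doorbell register write: vector number
 * in bits 0-15, target peer ID in bits 16-31. */
static uint32_t ivshmem_doorbell_value(uint16_t target_id, uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}
```

A caller would write the result to the Doorbell register at offset 0Ch of the register region, e.g. to ring vector 1 of peer 3.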

#### State Register (Offset 10h)

Read/write register that defines the state of the local device.
Writing to this register sets the state and triggers MSI-X vector 0
or the INTx interrupt, respectively, on the remote device if the
written state value differs from the previous one. Users of peer
devices can read the value written to this register from the State
Table. They are expected to differentiate state change interrupts from
doorbell events by comparing the new state value with a locally
stored copy.

The value of this register after device reset is 0. The semantic of
all other values can be defined freely by the chosen protocol.
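The comparison against a locally stored copy can be sketched as follows; the function name and the shadow-copy bookkeeping are illustrative, not mandated by the specification:

```c
#include <stdbool.h>
#include <stdint.h>

/* Distinguish a peer state change from a plain doorbell event by
 * comparing the peer's State Table entry with a cached copy.
 * state_table points at the read-only State Table in shared memory;
 * shadow holds the last state value observed for this peer. */
static bool ivshmem_peer_state_changed(const volatile uint32_t *state_table,
                                       uint16_t peer_id, uint32_t *shadow)
{
    uint32_t current = state_table[peer_id];

    if (current == *shadow)
        return false; /* unchanged state: treat as doorbell event */

    *shadow = current;
    return true;
}
```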

### State Table

The State Table is a read-only section at the beginning of the shared
memory region. It contains a 32-bit state value for each of the
peers. Locating the table in shared memory allows fast checking of
remote states without register accesses.

The table is updated on each state change of a peer. Whenever a user
of an IVSHMEM device writes a value to the State register, this
value is copied into the corresponding entry of the State Table. When
an IVSHMEM device is reset or disconnected from the other peers, zero
is written into the corresponding table entry. The initial content of
the table is all zeros.

    +--------------------------------+
    | 32-bit state value of peer n-1 |
    +--------------------------------+
    :                                :
    +--------------------------------+
    | 32-bit state value of peer 1   |
    +--------------------------------+
    | 32-bit state value of peer 0   |
    +--------------------------------+ <-- Shared memory base address


Protocols
---------

The IVSHMEM device shall support the peers of a connection in
agreeing on the protocol used over the shared memory device. For
that purpose, the interface byte (offset 09h) and the sub-class byte
(offset 0Ah) of the Class Code register encode a 16-bit protocol
type for the users. The following type values are defined:

| Protocol Type | Description                                                  |
|--------------:|:-------------------------------------------------------------|
|         0000h | Undefined type                                               |
|         0001h | Virtual peer-to-peer Ethernet                                |
|   0002h-3FFFh | Reserved                                                     |
|   4000h-7FFFh | User-defined protocols                                       |
|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |

Details of the protocols are not in the scope of this specification.
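Reconstructing and classifying the protocol type from the two class code bytes can be sketched as below; the helper names are illustrative:

```c
#include <stdint.h>

/* Rebuild the 16-bit protocol type: bits 0-7 come from the interface
 * byte (config offset 09h), bits 8-15 from the sub-class byte (0Ah). */
static uint16_t ivshmem_protocol_type(uint8_t interface, uint8_t sub_class)
{
    return (uint16_t)((uint16_t)sub_class << 8 | interface);
}

/* Virtio back-end peers use types C000h-FFFFh, i.e. both top bits set;
 * front-end peers use 8000h-BFFFh, i.e. only bit 15 set. */
static int ivshmem_is_virtio_backend(uint16_t proto)
{
    return (proto & 0xC000) == 0xC000;
}
```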

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
@ 2020-05-25  7:58 ` Jan Kiszka
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-05-25  7:58 UTC (permalink / raw)
  To: virtio-comment
  Cc: Jailhouse, qemu-devel, Michael S. Tsirkin, Alex Bennée, liang yan

Hi all,

as requested by Michael, find below the current version of the Inter-VM 
Shared Memory device specification version 2 (as version 1 could be 
considered what is currently in QEMU).

This posting is intended to collect feedback from the virtio community 
before officially proposing it to become part of the spec. As you can 
see, the spec format is not yet integrated with the virtio documents at 
all.

IVSHMEM has value of its own, allowing unprivileged applications to 
establish links to other applications in VMs connected via this 
transport. In addition, and that is where virtio comes into play even 
more, it can be used to build virtio backend-frontend connections 
between two VMs. Prototypes have been developed, see e.g. [1], 
specifying that transport is still a to-do. I will try to reserve a few 
cycle in the upcoming weeks for a first draft.

My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
propose them both to virtio but allowing virtio-shmem to also be mapped 
on other shared memory channels for virtual machines.

The ivshmem2 device model is fairly stable now, also being in use in 
Jailhouse for quite a while. Still there are some aspects that could be 
worth to discuss in particular:

 - Do we also need a platform device model, in addition to PCI? My
   feelings are negative, but there has been at least one request.

 - Should we add some feature flags, or is using the PCI revision ID
   sufficient (...to do that later)? Currently, there is no need for
   communicating features this way.

 - Should we try to model the doorbell interface more flexibly, in way
   that may allow mapping it on hardware-provided mailboxes (i.e. VM-
   exit free channels)? Unfortunately, I'm not yet aware of any hardware
   that could provide this feature and, thus, could act as a use case to
   design against.

Thanks,
Jan

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html

---

IVSHMEM Device Specification
============================

** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **

The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
define the minimally needed building blocks a hypervisor has to
provide for enabling guest-to-guest communication. The details of
communication protocols shall remain opaque to the hypervisor so that
guests are free to define them as simple or sophisticated as they
need.

For that purpose, the IVSHMEM provides the following features to its
users:

- Interconnection between up to 65536 peers

- Multi-purpose shared memory region

    - common read/writable section

    - output sections that are read/writable for one peer and only
      readable for the others

    - section with peer states

- Event signaling via interrupt to remote sides

- Support for life-cycle management via state value exchange and
  interrupt notification on changes, backed by a shared memory
  section

- Free choice of protocol to be used on top

- Protocol type declaration

- Register can be implemented either memory-mapped or via I/O,
  depending on platform support and lower VM-exit costs

- Unprivileged access to memory-mapped or I/O registers feasible

- Single discovery and configuration via standard PCI, no complexity
  by additionally defining a platform device model


Hypervisor Model
----------------

In order to provide a consistent link between peers, all connected
instances of IVSHMEM devices need to be configured, created and run
by the hypervisor according to the following requirements:

- The instances of the device shall appear as a PCI device to their
  users.

- The read/write shared memory section has to be of the same size for
  all peers. The size can be zero.

- If shared memory output sections are present (non-zero section
  size), there must be one reserved for each peer with exclusive
  write access. All output sections must have the same size and must
  be readable for all peers.

- The State Table must have the same size for all peers, must be
  large enough to hold the state values of all peers, and must be
  read-only for the user.

- State register changes (explicit writes, peer resets) have to be
  propagated to the other peers by updating the corresponding State
  Table entry and issuing an interrupt to all other peers if they
  enabled reception.

- Interrupts events triggered by a peer have to be delivered to the
  target peer, provided the receiving side is valid and has enabled
  the reception.

- All peers must have the same interrupt delivery features available,
  i.e. MSI-X with the same maximum number of vectors on platforms
  supporting this mechanism, otherwise INTx with one vector.


Guest-side Programming Model
----------------------------

An IVSHMEM device appears as a PCI device to its users. Unless
otherwise noted, it conforms to the PCI Local Bus Specification,
Revision 3.0. As such, it is discoverable via the PCI configuration
space and provides a number of standard and custom PCI configuration
registers.

### Shared Memory Region Layout

The shared memory region is divided into several sections.

    +-----------------------------+   -
    |                             |   :
    | Output Section for peer n-1 |   : Output Section Size
    |     (n = Maximum Peers)     |   :
    +-----------------------------+   -
    :                             :
    :                             :
    :                             :
    +-----------------------------+   -
    |                             |   :
    |  Output Section for peer 1  |   : Output Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |  Output Section for peer 0  |   : Output Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |     Read/Write Section      |   : R/W Section Size
    |                             |   :
    +-----------------------------+   -
    |                             |   :
    |         State Table         |   : State Table Size
    |                             |   :
    +-----------------------------+   <-- Shared memory base address

The first section consists of the mandatory State Table. Its size is
defined by the State Table Size register and cannot be zero. This
section is read-only for all peers.

The second section consists of shared memory that is read/writable
for all peers. Its size is defined by the R/W Section Size register.
A size of zero is permitted.

The third and following sections are output sections, one for each
peer. Their sizes are all identical. The size of a single output
section is defined by the Output Section Size register. An output
section is read/writable for the corresponding peer and read-only for
all other peers. E.g., only the peer with ID 3 can write to the
fourths output section, but all peers can read from this section.

All sizes have to be rounded up to multiples of a mappable page in
order to allow access control according to the section restrictions.

### Configuration Space Registers

#### Header Registers

| Offset | Register               | Content                                              |
|-------:|:-----------------------|:-----------------------------------------------------|
|    00h | Vendor ID              | 110Ah                                                |
|    02h | Device ID              | 4106h                                                |
|    04h | Command Register       | 0000h on reset, writable bits are:                   |
|        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
|        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
|        |                        | Bit 3: Bus Master                                    |
|        |                        | Bit 10: INTx interrupt disable                       |
|        |                        | Writes to other bits are ignored                     |
|    06h | Status Register        | 0010h, static value                                  |
|        |                        | In deviation to the PCI specification, the Interrupt |
|        |                        | Status (bit 3) is never set                          |
|    08h | Revision ID            | 00h                                                  |
|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
|    0Bh | Class Code, Base Class | FFh                                                  |
|    0Eh | Header Type            | 00h                                                  |
|    10h | BAR 0                  | MMIO or I/O register region                          |
|    14h | BAR 1                  | MSI-X region                                         |
|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
|    34h | Capability Pointer     | First capability                                     |
|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |

The INTx status bit is never set by an implementation. Users of the
IVSHMEM device are instead expected to derive the event state from
protocol-specific information kept in the shared memory. This
approach is significantly faster, and the complexity of
register-based status tracking can be avoided.

If BAR 2 is not present, the shared memory region is not relocatable
by the user. In that case, the hypervisor has to implement the Base
Address register in the vendor-specific capability.

Subsystem IDs shall encode the provider (hypervisor) in order to
allow identifying potential deviating implementations in case this
should ever be required.

If its platform supports MSI-X, an implementation of the IVSHMEM
device must provide this interrupt model and must not expose INTx
support.

Other header registers may not be implemented. If not implemented,
they return 0 on read and ignore write accesses.

#### Vendor Specific Capability (ID 09h)

This capability must always be present.

| Offset | Register            | Content                                        |
|-------:|:--------------------|:-----------------------------------------------|
|    00h | ID                  | 09h                                            |
|    01h | Next Capability     | Pointer to next capability or 00h              |
|    02h | Length              | 20h if Base Address is present, 18h otherwise  |
|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
|        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
|    04h | State Table Size    | 32-bit size of read-only State Table           |
|    08h | R/W Section Size    | 64-bit size of common read/write section       |
|    10h | Output Section Size | 64-bit size of output sections                 |
|    18h | Base Address        | optional: 64-bit base address of shared memory |

All registers are read-only. Writes are ignored, except to bit 0 of
the Privileged Control register.

When bit 0 in the Privileged Control register is set to 1, the device
clears bit 0 in the Interrupt Control register on each interrupt
delivery. This enables automatic interrupt throttling when
re-enabling shall be performed by a scheduled unprivileged instance
on the user side.

An IVSHMEM device need not support a relocatable shared memory
region. This helps the hypervisor lock down the guest-to-host address
mapping and simplifies the runtime logic. In such a case, BAR 2 must
not be implemented by the hypervisor. Instead, the Base Address
register has to be implemented to report the location of the shared
memory region in the user's address space.

A non-existent shared memory section has to report zero in its
Section Size register.

#### MSI-X Capability (ID 11h)

On platforms supporting MSI-X, IVSHMEM has to provide interrupt
delivery via this mechanism. In that case, the MSI-X capability is
present while the legacy INTx delivery mechanism is not available,
and the Interrupt Pin configuration register returns 0.

The IVSHMEM device has no notion of pending interrupts. Therefore,
reading from the MSI-X Pending Bit Array will always return 0. Users
of the IVSHMEM device are instead expected to derive the event state
from protocol-specific information kept in the shared memory. This
approach is significantly faster, and the complexity of
register-based status tracking can be avoided.

The corresponding MSI-X MMIO region is configured via BAR 1.

The MSI-X table size reported by the MSI-X capability structure is
identical for all peers.

### Register Region

The register region may be implemented as MMIO or I/O.

When implementing it as MMIO, the hypervisor has to ensure that the
register region can be mapped as a single page into the address space
of the user, without causing potential overlaps with other resources.
Write accesses to MMIO region offsets that are not backed by
registers have to be ignored, and read accesses have to return 0. This
enables the user to hand out the complete region, along with the
shared memory, to an unprivileged instance.

The region location in the user's physical address space is
configured via BAR 0. The following table visualizes the region
layout:

| Offset | Register                                                            |
|-------:|:--------------------------------------------------------------------|
|    00h | ID                                                                  |
|    04h | Maximum Peers                                                       |
|    08h | Interrupt Control                                                   |
|    0Ch | Doorbell                                                            |
|    10h | State                                                               |

All registers support only aligned 32-bit accesses.

#### ID Register (Offset 00h)

Read-only register that reports the ID of the local device. It is
unique for all of the connected devices and remains unchanged over
their lifetime.

#### Maximum Peers Register (Offset 04h)

Read-only register that reports the maximum number of possible peers
(including the local one). The permitted range is between 2 and 65536
and remains constant over the lifetime of all peers.

#### Interrupt Control Register (Offset 08h)

This read/write register controls the generation of interrupts
whenever a peer writes to the Doorbell register or changes its state.

| Bits | Content                                                               |
|-----:|:----------------------------------------------------------------------|
|    0 | 1: Enable interrupt generation                                        |
| 1-31 | Reserved (0 on read, writes ignored)                                  |

Note that bit 0 is reset to 0 on interrupt delivery if one-shot
interrupt mode is enabled in the Privileged Control register.

The value of this register after device reset is 0.
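As a sketch of the one-shot flow: after handling an event, an unprivileged handler re-arms interrupt generation with a single register write. This assumes the Interrupt Control register at offset 08h of the mapped register region; the helper name is my own:

```c
#include <assert.h>
#include <stdint.h>

#define IVSHMEM_REG_INT_CTRL 0x08  /* Interrupt Control register offset */

/* Re-enable interrupt generation; in one-shot mode the device cleared
 * bit 0 of Interrupt Control when it delivered the last interrupt.
 * 'regs' is the MMIO register region mapped as 32-bit words. */
static inline void ivshmem_irq_enable(volatile uint32_t *regs)
{
    regs[IVSHMEM_REG_INT_CTRL / 4] = 1;
}
```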

#### Doorbell Register (Offset 0Ch)

Write-only register that triggers an interrupt vector in the target
device if it is enabled there.

| Bits  | Content                                                              |
|------:|:---------------------------------------------------------------------|
|  0-15 | Vector number                                                        |
| 16-31 | Target ID                                                            |

Writing a vector number that is not enabled by the target has no
effect. The peers can derive the number of available vectors from
their own device capabilities because the provider is required to
expose an identical number of vectors to all connected peers. The
peers are expected to define or negotiate the used ones via the
selected protocol.

Addressing a non-existing or inactive target has no effect. Peers can
identify active targets via the State Table.

Implementations of the Doorbell register must ensure that data written by the
CPU prior to issuing the register write is visible to the receiving peer before
the interrupt arrives.

The behavior on reading from this register is undefined.
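A minimal sketch of ringing a peer, combining the bit layout and the visibility requirement above. The helper names are hypothetical, and the GCC/Clang builtin barrier stands in for whatever ordering primitive the environment provides:

```c
#include <assert.h>
#include <stdint.h>

#define IVSHMEM_REG_DOORBELL 0x0C  /* Doorbell register offset */

/* Compose the Doorbell value: vector in bits 0-15, target ID in 16-31. */
static inline uint32_t ivshmem_doorbell_value(uint16_t target_id,
                                              uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}

/* Trigger 'vector' at peer 'target_id'; 'regs' is the mapped register
 * region. The barrier makes prior shared-memory writes visible to the
 * receiving peer before its interrupt arrives. */
static inline void ivshmem_ring(volatile uint32_t *regs,
                                uint16_t target_id, uint16_t vector)
{
    __sync_synchronize();  /* full memory barrier (GCC/Clang builtin) */
    regs[IVSHMEM_REG_DOORBELL / 4] = ivshmem_doorbell_value(target_id, vector);
}
```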

#### State Register (Offset 10h)

Read/write register that defines the state of the local device.
Writing to this register sets the state and triggers MSI-X vector 0
(or the INTx interrupt when MSI-X is unavailable) on the remote
devices if the written state value differs from the previous one.
Users of peer devices can read the value written to this register
from the State Table. They are expected to differentiate state change
interrupts from doorbell events by comparing the new state value with
a locally stored copy.

The value of this register after device reset is 0. The semantics of
all other values can be defined freely by the chosen protocol.

### State Table

The State Table is a read-only section at the beginning of the shared
memory region. It contains a 32-bit state value for each of the
peers. Locating the table in shared memory allows fast checking of
remote states without register accesses.

The table is updated on each state change of a peer. Whenever a user
of an IVSHMEM device writes a value to the State register, this
value is copied into the corresponding entry of the State Table. When
an IVSHMEM device is reset or disconnected from the other peers, zero
is written into the corresponding table entry. The initial content of
the table is all zeros.

    +--------------------------------+
    | 32-bit state value of peer n-1 |
    +--------------------------------+
    :                                :
    +--------------------------------+
    | 32-bit state value of peer 1   |
    +--------------------------------+
    | 32-bit state value of peer 0   |
    +--------------------------------+ <-- Shared memory base address
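Given this layout, a peer's state is a single aligned 32-bit load from the table. A sketch, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Read the state of 'peer_id' from the read-only State Table, which
 * starts at the shared memory base address; each entry is 32 bits. */
static inline uint32_t ivshmem_peer_state(const volatile uint32_t *shmem_base,
                                          uint16_t peer_id)
{
    return shmem_base[peer_id];  /* entry at byte offset peer_id * 4 */
}
```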


Protocols
---------

The IVSHMEM device shall help the peers of a connection agree on the
protocol used over the shared memory device. For that purpose, the
interface byte (offset 09h) and the sub-class byte (offset 0Ah) of
the Class Code register encode a 16-bit protocol type for the users.
The following type values are defined:

| Protocol Type | Description                                                  |
|--------------:|:-------------------------------------------------------------|
|         0000h | Undefined type                                               |
|         0001h | Virtual peer-to-peer Ethernet                                |
|   0002h-3FFFh | Reserved                                                     |
|   4000h-7FFFh | User-defined protocols                                       |
|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |

Details of the protocols are not in the scope of this specification.
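To make the encoding concrete, here is a sketch of reassembling and classifying the protocol type from the two class-code bytes; the function names are my own, not part of the specification:

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble the 16-bit protocol type: the interface byte (config space
 * offset 09h) carries bits 0-7, the sub-class byte (0Ah) bits 8-15. */
static inline uint16_t ivshmem_proto_type(uint8_t iface, uint8_t sub_class)
{
    return (uint16_t)(((uint16_t)sub_class << 8) | iface);
}

/* Classification against the type table above. */
static inline int ivshmem_is_virtio_frontend(uint16_t type)
{
    return type >= 0x8000 && type <= 0xBFFF;
}

static inline int ivshmem_is_virtio_backend(uint16_t type)
{
    return type >= 0xC000;  /* C000h-FFFFh */
}
```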

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [RFC] ivshmem v2: Shared memory device specification
  2020-05-25  7:58 ` [virtio-comment] " Jan Kiszka
@ 2020-06-17 15:32   ` Alex Bennée
  -1 siblings, 0 replies; 26+ messages in thread
From: Alex Bennée @ 2020-06-17 15:32 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel, virtio-comment


Jan Kiszka <jan.kiszka@siemens.com> writes:

> Hi all,
>
> as requested by Michael, find below the current version of the Inter-VM 
> Shared Memory device specification version 2 (as version 1 could be 
> considered what is currently in QEMU).
>
> This posting is intended to collect feedback from the virtio community 
> before officially proposing it to become part of the spec. As you can 
> see, the spec format is not yet integrated with the virtio documents at 
> all.
>
> IVSHMEM has value of its own, allowing unprivileged applications to 
> establish links to other applications in VMs connected via this 
> transport. In addition, and that is where virtio comes into play even 
> more, it can be used to build virtio backend-frontend connections 
> between two VMs. Prototypes have been developed, see e.g. [1], 
> specifying that transport is still a to-do. I will try to reserve a few 
> cycle in the upcoming weeks for a first draft.
>
> My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
> propose them both to virtio but allowing virtio-shmem to also be mapped 
> on other shared memory channels for virtual machines.
>
> The ivshmem2 device model is fairly stable now, also being in use in 
> Jailhouse for quite a while. Still there are some aspects that could be 
> worth to discuss in particular:
>
>  - Do we also need a platform device model, in addition to PCI? My
>    feelings are negative, but there has been at least one request.
>
>  - Should we add some feature flags, or is using the PCI revision ID
>    sufficient (...to do that later)? Currently, there is no need for
>    communicating features this way.
>
>  - Should we try to model the doorbell interface more flexibly, in way
>    that may allow mapping it on hardware-provided mailboxes (i.e. VM-
>    exit free channels)? Unfortunately, I'm not yet aware of any hardware
>    that could provide this feature and, thus, could act as a use case to
>    design against.
>
> Thanks,
> Jan
>
> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html
>
> ---
>
> IVSHMEM Device Specification
> ============================
>
> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>
> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
> define the minimally needed building blocks a hypervisor has to
> provide for enabling guest-to-guest communication. The details of
> communication protocols shall remain opaque to the hypervisor so that
> guests are free to define them as simple or sophisticated as they
> need.
>
> For that purpose, the IVSHMEM provides the following features to its
> users:
>
> - Interconnection between up to 65536 peers
>
> - Multi-purpose shared memory region
>
>     - common read/writable section
>
>     - output sections that are read/writable for one peer and only
>       readable for the others
>
>     - section with peer states
>
> - Event signaling via interrupt to remote sides
>
> - Support for life-cycle management via state value exchange and
>   interrupt notification on changes, backed by a shared memory
>   section
>
> - Free choice of protocol to be used on top
>
> - Protocol type declaration
>
> - Register can be implemented either memory-mapped or via I/O,
>   depending on platform support and lower VM-exit costs
>
> - Unprivileged access to memory-mapped or I/O registers feasible
>
> - Single discovery and configuration via standard PCI, no complexity
>   by additionally defining a platform device model
>
>
> Hypervisor Model
> ----------------
>
> In order to provide a consistent link between peers, all connected
> instances of IVSHMEM devices need to be configured, created and run
> by the hypervisor according to the following requirements:
>
> - The instances of the device shall appear as a PCI device to their
>   users.

Would there ever be scope for a plain MMIO style device to be reported?
While I agree PCI is pretty useful from the point of view of easy
discovery I note a number of machine types prefer device tree reported
devices because they are faster to bring up for cloudy fast VM purposes
(c.f. microvm).

> - The read/write shared memory section has to be of the same size for
>   all peers. The size can be zero.
>
> - If shared memory output sections are present (non-zero section
>   size), there must be one reserved for each peer with exclusive
>   write access. All output sections must have the same size and must
>   be readable for all peers.
>
> - The State Table must have the same size for all peers, must be
>   large enough to hold the state values of all peers, and must be
>   read-only for the user.
>
> - State register changes (explicit writes, peer resets) have to be
>   propagated to the other peers by updating the corresponding State
>   Table entry and issuing an interrupt to all other peers if they
>   enabled reception.
>
> - Interrupts events triggered by a peer have to be delivered to the
>   target peer, provided the receiving side is valid and has enabled
>   the reception.
>
> - All peers must have the same interrupt delivery features available,
>   i.e. MSI-X with the same maximum number of vectors on platforms
>   supporting this mechanism, otherwise INTx with one vector.
>
>
> Guest-side Programming Model
> ----------------------------
>
> An IVSHMEM device appears as a PCI device to its users. Unless
> otherwise noted, it conforms to the PCI Local Bus Specification,
> Revision 3.0. As such, it is discoverable via the PCI configuration
> space and provides a number of standard and custom PCI configuration
> registers.
>
> ### Shared Memory Region Layout
>
> The shared memory region is divided into several sections.
>
>     +-----------------------------+   -
>     |                             |   :
>     | Output Section for peer n-1 |   : Output Section Size
>     |     (n = Maximum Peers)     |   :
>     +-----------------------------+   -
>     :                             :
>     :                             :
>     :                             :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 1  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 0  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -

There doesn't seem to be any mention of limits on the output section
size. If you wanted to limit visibility between peers there would have
to be an explicit page alignment requirement for these sections.

Or would the alternative be for the hypervisor to instantiate additional
IVSHMEM devices for each combination of peers that need visibility of
each other?

>     |                             |   :
>     |     Read/Write Section      |   : R/W Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |         State Table         |   : State Table Size
>     |                             |   :
>     +-----------------------------+   <-- Shared memory base address
>
> The first section consists of the mandatory State Table. Its size is
> defined by the State Table Size register and cannot be zero. This
> section is read-only for all peers.
>
> The second section consists of shared memory that is read/writable
> for all peers. Its size is defined by the R/W Section Size register.
> A size of zero is permitted.
>
> The third and following sections are output sections, one for each
> peer. Their sizes are all identical. The size of a single output
> section is defined by the Output Section Size register. An output
> section is read/writable for the corresponding peer and read-only for
> all other peers. E.g., only the peer with ID 3 can write to the
> fourths output section, but all peers can read from this section.
>
> All sizes have to be rounded up to multiples of a mappable page in
> order to allow access control according to the section restrictions.
>
> ### Configuration Space Registers
>
> #### Header Registers
>
> | Offset | Register               | Content                                              |
> |-------:|:-----------------------|:-----------------------------------------------------|
> |    00h | Vendor ID              | 110Ah                                                |
> |    02h | Device ID              | 4106h                                                |
> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
> |        |                        | Bit 3: Bus Master                                    |
> |        |                        | Bit 10: INTx interrupt disable                       |
> |        |                        | Writes to other bits are ignored                     |
> |    06h | Status Register        | 0010h, static value                                  |
> |        |                        | In deviation to the PCI specification, the Interrupt |
> |        |                        | Status (bit 3) is never set                          |
> |    08h | Revision ID            | 00h                                                  |
> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
> |    0Bh | Class Code, Base Class | FFh                                                  |
> |    0Eh | Header Type            | 00h                                                  |
> |    10h | BAR 0                  | MMIO or I/O register region                          |
> |    14h | BAR 1                  | MSI-X region                                         |
> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
> |    34h | Capability Pointer     | First capability                                     |

Is this a PCI thing? How are additional capabilities described?

> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>
> The INTx status bit is never set by an implementation. Users of the
> IVSHMEM device are instead expected to derive the event state from
> protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
>
> If BAR 2 is not present, the shared memory region is not relocatable
> by the user. In that case, the hypervisor has to implement the Base
> Address register in the vendor-specific capability.

Sorry I don't follow - if there is no BAR2 there is still a shared
memory region but you just can't move it around the guests address
space? I guess this ties in with my question above about capabilities,
is there just a value you read for the physical address? What about
its size?

> Subsystem IDs shall encode the provider (hypervisor) in order to
> allow identifying potential deviating implementations in case this
> should ever be required.
>
> If its platform supports MSI-X, an implementation of the IVSHMEM
> device must provide this interrupt model and must not expose INTx
> support.
>
> Other header registers may not be implemented. If not implemented,
> they return 0 on read and ignore write accesses.
>
> #### Vendor Specific Capability (ID 09h)
>
> This capability must always be present.
>
> | Offset | Register            | Content                                        |
> |-------:|:--------------------|:-----------------------------------------------|
> |    00h | ID                  | 09h                                            |
> |    01h | Next Capability     | Pointer to next capability or 00h              |
> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> |    04h | State Table Size    | 32-bit size of read-only State Table           |
> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> |    10h | Output Section Size | 64-bit size of output sections                 |
> |    18h | Base Address        | optional: 64-bit base address of shared memory |

ah so we follow Capability Pointer and then walk down the table in Next
Capability chunks? Next Capability is described as a pointer but only at
an offset of 01h with Length at 02h. Is it really a pointer or just the
size of this capability record?

I think this answers the question further up - no BAR2 and the optional
Base Address in the Vendor Specific Capability.

> All registers are read-only. Writes are ignored, except to bit 0 of
> the Privileged Control register.
>
> When bit 0 in the Privileged Control register is set to 1, the device
> clears bit 0 in the Interrupt Control register on each interrupt
> delivery. This enables automatic interrupt throttling when
> re-enabling shall be performed by a scheduled unprivileged instance
> on the user side.
>
> An IVSHMEM device may not support a relocatable shared memory region.
> This support the hypervisor in locking down the guest-to-host address
> mapping and simplifies the runtime logic. In such a case, BAR 2 must
> not be implemented by the hypervisor. Instead, the Base Address
> register has to be implemented to report the location of the shared
> memory region in the user's address space.

This seems useful - especially in the case of trying to keep the guest
using a fixed guest physical address and not potentially breaking up
mappings in the hypervisor by moving stuff around.

>
> A non-existing shared memory section has to report zero in its
> Section Size register.
>
> #### MSI-X Capability (ID 11h)
>
> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
> delivery via this mechanism. In that case, the MSI-X capability is
> present while the legacy INTx delivery mechanism is not available,
> and the Interrupt Pin configuration register returns 0.
>
> The IVSHMEM device has no notion of pending interrupts. Therefore,
> reading from the MSI-X Pending Bit Array will always return 0. Users
> of the IVSHMEM device are instead expected to derive the event state
> from protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
>
> The corresponding MSI-X MMIO region is configured via BAR 1.
>
> The MSI-X table size reported by the MSI-X capability structure is
> identical for all peers.
>
> ### Register Region
>
> The register region may be implemented as MMIO or I/O.
>
> When implementing it as MMIO, the hypervisor has to ensure that the
> register region can be mapped as a single page into the address space
> of the user, without causing potential overlaps with other resources.
> Write accesses to MMIO region offsets that are not backed by
> registers have to be ignored, read accesses have to return 0. This
> enables the user to hand out the complete region, along with the
> shared memory, to an unprivileged instance.
>
> The region location in the user's physical address space is
> configured via BAR 0. The following table visualizes the region
> layout:
>
> | Offset | Register                                                            |
> |-------:|:--------------------------------------------------------------------|
> |    00h | ID                                                                  |
> |    04h | Maximum Peers                                                       |
> |    08h | Interrupt Control                                                   |
> |    0Ch | Doorbell                                                            |
> |    10h | State                                                               |
>
> All registers support only aligned 32-bit accesses.
>
> #### ID Register (Offset 00h)
>
> Read-only register that reports the ID of the local device. It is
> unique for all of the connected devices and remains unchanged over
> their lifetime.
>
> #### Maximum Peers Register (Offset 04h)
>
> Read-only register that reports the maximum number of possible peers
> (including the local one). The permitted range is between 2 and 65536
> and remains constant over the lifetime of all peers.
>
> #### Interrupt Control Register (Offset 08h)
>
> This read/write register controls the generation of interrupts
> whenever a peer writes to the Doorbell register or changes its state.
>
> | Bits | Content                                                               |
> |-----:|:----------------------------------------------------------------------|
> |    0 | 1: Enable interrupt generation                                        |
> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>
> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
> interrupt mode is enabled in the Enhanced Features register.
>
> The value of this register after device reset is 0.
>
> #### Doorbell Register (Offset 0Ch)
>
> Write-only register that triggers an interrupt vector in the target
> device if it is enabled there.
>
> | Bits  | Content                                                              |
> |------:|:---------------------------------------------------------------------|
> |  0-15 | Vector number                                                        |
> | 16-31 | Target ID                                                            |
>
> Writing a vector number that is not enabled by the target has no
> effect. The peers can derive the number of available vectors from
> their own device capabilities because the provider is required to
> expose an identical number of vectors to all connected peers. The
> peers are expected to define or negotiate the used ones via the
> selected protocol.
>
> Addressing a non-existing or inactive target has no effect. Peers can
> identify active targets via the State Table.
>
> Implementations of the Doorbell register must ensure that data written by the
> CPU prior to issuing the register write is visible to the receiving peer before
> the interrupt arrives.
>
> The behavior on reading from this register is undefined.
>
> #### State Register (Offset 10h)
>
> Read/write register that defines the state of the local device.
> Writing to this register sets the state and triggers MSI-X vector 0
> or the INTx interrupt, respectively, on the remote device if the
> written state value differs from the previous one. Users of peer
> devices can read the value written to this register from the State
> Table. They are expected differentiate state change interrupts from
> doorbell events by comparing the new state value with a locally
> stored copy.
>
> The value of this register after device reset is 0. The semantic of
> all other values can be defined freely by the chosen protocol.
>
> ### State Table
>
> The State Table is a read-only section at the beginning of the shared
> memory region. It contains a 32-bit state value for each of the
> peers. Locating the table in shared memory allows fast checking of
> remote states without register accesses.
>
> The table is updated on each state change of a peers. Whenever a user
> of an IVSHMEM device writes a value to the Local State register, this
> value is copied into the corresponding entry of the State Table. When
> a IVSHMEM device is reset or disconnected from the other peers, zero
> is written into the corresponding table entry. The initial content of
> the table is all zeros.
>
>     +--------------------------------+
>     | 32-bit state value of peer n-1 |
>     +--------------------------------+
>     :                                :
>     +--------------------------------+
>     | 32-bit state value of peer 1   |
>     +--------------------------------+
>     | 32-bit state value of peer 0   |
>     +--------------------------------+ <-- Shared memory base address
>
>
> Protocols
> ---------
>
> The IVSHMEM device shall support the peers of a connection in
> agreeing on the protocol used over the shared memory devices. For
> that purpose, the interface byte (offset 09h) and the sub-class byte
> (offset 0Ah) of the Class Code register encodes a 16-bit protocol
> type for the users. The following type values are defined:
>
> | Protocol Type | Description                                                  |
> |--------------:|:-------------------------------------------------------------|
> |         0000h | Undefined type                                               |
> |         0001h | Virtual peer-to-peer Ethernet                                |
> |   0002h-3FFFh | Reserved                                                     |
> |   4000h-7FFFh | User-defined protocols                                       |
> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>
> Details of the protocols are not in the scope of this specification.

Otherwise it reads fine to me.  I'm sure some of my confusion is from
not being familiar with the details of PCI devices.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
@ 2020-06-17 15:32   ` Alex Bennée
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Bennée @ 2020-06-17 15:32 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-comment, Jailhouse, qemu-devel, Michael S. Tsirkin, liang yan


Jan Kiszka <jan.kiszka@siemens.com> writes:

> Hi all,
>
> as requested by Michael, find below the current version of the Inter-VM 
> Shared Memory device specification version 2 (as version 1 could be 
> considered what is currently in QEMU).
>
> This posting is intended to collect feedback from the virtio community 
> before officially proposing it to become part of the spec. As you can 
> see, the spec format is not yet integrated with the virtio documents at 
> all.
>
> IVSHMEM has value of its own, allowing unprivileged applications to 
> establish links to other applications in VMs connected via this 
> transport. In addition, and that is where virtio comes into play even 
> more, it can be used to build virtio backend-frontend connections 
> between two VMs. Prototypes have been developed, see e.g. [1], 
> specifying that transport is still a to-do. I will try to reserve a few 
> cycle in the upcoming weeks for a first draft.
>
> My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
> propose them both to virtio but allowing virtio-shmem to also be mapped 
> on other shared memory channels for virtual machines.
>
> The ivshmem2 device model is fairly stable now, also being in use in 
> Jailhouse for quite a while. Still, there are some aspects that could be 
> worth discussing in particular:
>
>  - Do we also need a platform device model, in addition to PCI? My
>    feelings are negative, but there has been at least one request.
>
>  - Should we add some feature flags, or is using the PCI revision ID
>    sufficient (...to do that later)? Currently, there is no need for
>    communicating features this way.
>
>  - Should we try to model the doorbell interface more flexibly, in a way
>    that may allow mapping it on hardware-provided mailboxes (i.e. VM-
>    exit free channels)? Unfortunately, I'm not yet aware of any hardware
>    that could provide this feature and, thus, could act as a use case to
>    design against.
>
> Thanks,
> Jan
>
> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html
>
> ---
>
> IVSHMEM Device Specification
> ============================
>
> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>
> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
> define the minimally needed building blocks a hypervisor has to
> provide for enabling guest-to-guest communication. The details of
> communication protocols shall remain opaque to the hypervisor so that
> guests are free to define them to be as simple or as sophisticated 
> as they need.
>
> For that purpose, the IVSHMEM device provides the following features to its 
> users:
>
> - Interconnection between up to 65536 peers
>
> - Multi-purpose shared memory region
>
>     - common read/writable section
>
>     - output sections that are read/writable for one peer and only
>       readable for the others
>
>     - section with peer states
>
> - Event signaling via interrupt to remote sides
>
> - Support for life-cycle management via state value exchange and
>   interrupt notification on changes, backed by a shared memory
>   section
>
> - Free choice of protocol to be used on top
>
> - Protocol type declaration
>
> - Registers can be implemented either memory-mapped or via I/O, 
>   depending on platform support and lower VM-exit costs
>
> - Unprivileged access to memory-mapped or I/O registers feasible
>
> - Single discovery and configuration via standard PCI, no complexity
>   by additionally defining a platform device model
>
>
> Hypervisor Model
> ----------------
>
> In order to provide a consistent link between peers, all connected
> instances of IVSHMEM devices need to be configured, created and run
> by the hypervisor according to the following requirements:
>
> - The instances of the device shall appear as a PCI device to their
>   users.

Would there ever be scope for a plain MMIO style device to be reported?
While I agree PCI is pretty useful from the point of view of easy
discovery I note a number of machine types prefer device tree reported
devices because they are faster to bring up for cloudy fast VM purposes
(c.f. microvm).

> - The read/write shared memory section has to be of the same size for
>   all peers. The size can be zero.
>
> - If shared memory output sections are present (non-zero section
>   size), there must be one reserved for each peer with exclusive
>   write access. All output sections must have the same size and must
>   be readable for all peers.
>
> - The State Table must have the same size for all peers, must be
>   large enough to hold the state values of all peers, and must be
>   read-only for the user.
>
> - State register changes (explicit writes, peer resets) have to be
>   propagated to the other peers by updating the corresponding State
>   Table entry and issuing an interrupt to all other peers if they
>   enabled reception.
>
> - Interrupt events triggered by a peer have to be delivered to the 
>   target peer, provided the receiving side is valid and has enabled
>   the reception.
>
> - All peers must have the same interrupt delivery features available,
>   i.e. MSI-X with the same maximum number of vectors on platforms
>   supporting this mechanism, otherwise INTx with one vector.
>
>
> Guest-side Programming Model
> ----------------------------
>
> An IVSHMEM device appears as a PCI device to its users. Unless
> otherwise noted, it conforms to the PCI Local Bus Specification,
> Revision 3.0. As such, it is discoverable via the PCI configuration
> space and provides a number of standard and custom PCI configuration
> registers.
>
> ### Shared Memory Region Layout
>
> The shared memory region is divided into several sections.
>
>     +-----------------------------+   -
>     |                             |   :
>     | Output Section for peer n-1 |   : Output Section Size
>     |     (n = Maximum Peers)     |   :
>     +-----------------------------+   -
>     :                             :
>     :                             :
>     :                             :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 1  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 0  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -

There doesn't seem to be any mention of limits on the output section
size. If you wanted to limit visibility between peers there would have
to be an explicit page alignment requirement for these sections.

Or would the alternative be for the hypervisor to instantiate additional
IVSHMEM devices for each combination of peers that need visibility of
each other?

>     |                             |   :
>     |     Read/Write Section      |   : R/W Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |         State Table         |   : State Table Size
>     |                             |   :
>     +-----------------------------+   <-- Shared memory base address
>
> The first section consists of the mandatory State Table. Its size is
> defined by the State Table Size register and cannot be zero. This
> section is read-only for all peers.
>
> The second section consists of shared memory that is read/writable
> for all peers. Its size is defined by the R/W Section Size register.
> A size of zero is permitted.
>
> The third and following sections are output sections, one for each
> peer. Their sizes are all identical. The size of a single output
> section is defined by the Output Section Size register. An output
> section is read/writable for the corresponding peer and read-only for
> all other peers. E.g., only the peer with ID 3 can write to the
> fourth output section, but all peers can read from this section.
>
> All sizes have to be rounded up to multiples of a mappable page in
> order to allow access control according to the section restrictions.
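As a concreteness check, the layout rules above pin down each section's start
once the three size registers are known; a minimal sketch (function names are
mine, not from the spec):

```c
#include <stdint.h>

/* Illustrative only: derive section offsets within the shared memory
 * region from the three size registers described in this spec. */
static uint64_t ivshmem_rw_offset(uint32_t state_table_size)
{
    return state_table_size; /* R/W section follows the State Table */
}

static uint64_t ivshmem_output_offset(uint32_t state_table_size,
                                      uint64_t rw_size,
                                      uint64_t output_size,
                                      uint16_t peer_id)
{
    /* output sections follow the R/W section, one per peer, in ID order */
    return state_table_size + rw_size + output_size * (uint64_t)peer_id;
}
```

With, e.g., a 4 KiB State Table, 64 KiB R/W section and 16 KiB output
sections, peer 3's output section would start at offset 1D000h.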
>
> ### Configuration Space Registers
>
> #### Header Registers
>
> | Offset | Register               | Content                                              |
> |-------:|:-----------------------|:-----------------------------------------------------|
> |    00h | Vendor ID              | 110Ah                                                |
> |    02h | Device ID              | 4106h                                                |
> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
> |        |                        | Bit 3: Bus Master                                    |
> |        |                        | Bit 10: INTx interrupt disable                       |
> |        |                        | Writes to other bits are ignored                     |
> |    06h | Status Register        | 0010h, static value                                  |
> |        |                        | In deviation from the PCI specification, the Interrupt |
> |        |                        | Status (bit 3) is never set                          |
> |    08h | Revision ID            | 00h                                                  |
> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
> |    0Bh | Class Code, Base Class | FFh                                                  |
> |    0Eh | Header Type            | 00h                                                  |
> |    10h | BAR 0                  | MMIO or I/O register region                          |
> |    14h | BAR 1                  | MSI-X region                                         |
> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
> |    34h | Capability Pointer     | First capability                                     |

Is this a PCI thing? How are additional capabilities described?

> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>
> The INTx status bit is never set by an implementation. Users of the
> IVSHMEM device are instead expected to derive the event state from
> protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
>
> If BAR 2 is not present, the shared memory region is not relocatable
> by the user. In that case, the hypervisor has to implement the Base
> Address register in the vendor-specific capability.

Sorry I don't follow - if there is no BAR2 there is still a shared
memory region but you just can't move it around the guests address
space? I guess this ties in with my question above about capabilities,
is there just a value you read for the physical address? What about its
size?

> Subsystem IDs shall encode the provider (hypervisor) in order to
> allow identifying potential deviating implementations in case this
> should ever be required.
>
> If the platform supports MSI-X, an implementation of the IVSHMEM 
> device must provide this interrupt model and must not expose INTx
> support.
>
> Other header registers need not be implemented. If not implemented, 
> they return 0 on read and ignore write accesses.
>
> #### Vendor Specific Capability (ID 09h)
>
> This capability must always be present.
>
> | Offset | Register            | Content                                        |
> |-------:|:--------------------|:-----------------------------------------------|
> |    00h | ID                  | 09h                                            |
> |    01h | Next Capability     | Pointer to next capability or 00h              |
> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> |    04h | State Table Size    | 32-bit size of read-only State Table           |
> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> |    10h | Output Section Size | 64-bit size of output sections                 |
> |    18h | Base Address        | optional: 64-bit base address of shared memory |

ah so we follow Capability Pointer and then walk down the table in Next
Capability chunks? Next Capability is described as a pointer but only at
an offset of 01h with Length at 02h. Is it really a pointer or just the
size of this capability record?

I think this answers the question further up - no BAR2 and the optional
Base Address in the Vendor Specific Capability.

> All registers are read-only. Writes are ignored, except to bit 0 of
> the Privileged Control register.
>
> When bit 0 in the Privileged Control register is set to 1, the device
> clears bit 0 in the Interrupt Control register on each interrupt
> delivery. This enables automatic interrupt throttling when
> re-enabling is to be performed by a scheduled, unprivileged instance
> on the user side.
>
> An IVSHMEM device may not support a relocatable shared memory region.
> This supports the hypervisor in locking down the guest-to-host address 
> mapping and simplifies the runtime logic. In such a case, BAR 2 must
> not be implemented by the hypervisor. Instead, the Base Address
> register has to be implemented to report the location of the shared
> memory region in the user's address space.

This seems useful - especially in the case of trying to keep the guest
using a fixed guest physical address and not potentially breaking up
mappings in the hypervisor by moving stuff around.

>
> A non-existing shared memory section has to report zero in its
> Section Size register.
>
> #### MSI-X Capability (ID 11h)
>
> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
> delivery via this mechanism. In that case, the MSI-X capability is
> present while the legacy INTx delivery mechanism is not available,
> and the Interrupt Pin configuration register returns 0.
>
> The IVSHMEM device has no notion of pending interrupts. Therefore,
> reading from the MSI-X Pending Bit Array will always return 0. Users
> of the IVSHMEM device are instead expected to derive the event state
> from protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
>
> The corresponding MSI-X MMIO region is configured via BAR 1.
>
> The MSI-X table size reported by the MSI-X capability structure is
> identical for all peers.
>
> ### Register Region
>
> The register region may be implemented as MMIO or I/O.
>
> When implementing it as MMIO, the hypervisor has to ensure that the
> register region can be mapped as a single page into the address space
> of the user, without causing potential overlaps with other resources.
> Write accesses to MMIO region offsets that are not backed by
> registers have to be ignored, read accesses have to return 0. This
> enables the user to hand out the complete region, along with the
> shared memory, to an unprivileged instance.
>
> The region location in the user's physical address space is
> configured via BAR 0. The following table visualizes the region
> layout:
>
> | Offset | Register                                                            |
> |-------:|:--------------------------------------------------------------------|
> |    00h | ID                                                                  |
> |    04h | Maximum Peers                                                       |
> |    08h | Interrupt Control                                                   |
> |    0Ch | Doorbell                                                            |
> |    10h | State                                                               |
>
> All registers support only aligned 32-bit accesses.
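The table maps naturally onto a C overlay; a sketch (struct name is mine),
assuming BAR 0 is mapped as MMIO:

```c
#include <stdint.h>

/* Register region layout per the table above; a driver would map BAR 0
 * (or use the I/O port range) and perform aligned 32-bit accesses only. */
struct ivshmem_regs {
    uint32_t id;          /* 00h: own peer ID (read-only)               */
    uint32_t max_peers;   /* 04h: maximum number of peers (read-only)   */
    uint32_t int_control; /* 08h: bit 0 enables interrupt delivery      */
    uint32_t doorbell;    /* 0Ch: target ID << 16 | vector (write-only) */
    uint32_t state;       /* 10h: own state (read/write)                */
};
```

In actual driver code the mapped pointer would of course be `volatile`.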
>
> #### ID Register (Offset 00h)
>
> Read-only register that reports the ID of the local device. It is
> unique for all of the connected devices and remains unchanged over
> their lifetime.
>
> #### Maximum Peers Register (Offset 04h)
>
> Read-only register that reports the maximum number of possible peers
> (including the local one). The permitted range is between 2 and 65536
> and remains constant over the lifetime of all peers.
>
> #### Interrupt Control Register (Offset 08h)
>
> This read/write register controls the generation of interrupts
> whenever a peer writes to the Doorbell register or changes its state.
>
> | Bits | Content                                                               |
> |-----:|:----------------------------------------------------------------------|
> |    0 | 1: Enable interrupt generation                                        |
> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>
> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
> interrupt mode is enabled in the Privileged Control register.
>
> The value of this register after device reset is 0.
>
> #### Doorbell Register (Offset 0Ch)
>
> Write-only register that triggers an interrupt vector in the target
> device if it is enabled there.
>
> | Bits  | Content                                                              |
> |------:|:---------------------------------------------------------------------|
> |  0-15 | Vector number                                                        |
> | 16-31 | Target ID                                                            |
>
> Writing a vector number that is not enabled by the target has no
> effect. The peers can derive the number of available vectors from
> their own device capabilities because the provider is required to
> expose an identical number of vectors to all connected peers. The
> peers are expected to define or negotiate the used ones via the
> selected protocol.
>
> Addressing a non-existing or inactive target has no effect. Peers can
> identify active targets via the State Table.
>
> Implementations of the Doorbell register must ensure that data written by the
> CPU prior to issuing the register write is visible to the receiving peer before
> the interrupt arrives.
>
> The behavior on reading from this register is undefined.
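On the driver side, the ordering requirement above translates into a release
fence before the triggering store; a hedged sketch (names are mine, not from
the spec):

```c
#include <stdint.h>

/* Ring a peer's doorbell. The release fence makes the payload written
 * to shared memory visible before the interrupt-triggering store, per
 * the visibility requirement stated for the Doorbell register. */
static inline void ivshmem_ring(volatile uint32_t *doorbell,
                                uint16_t target_id, uint16_t vector)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);
    *doorbell = ((uint32_t)target_id << 16) | vector;
}
```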
>
> #### State Register (Offset 10h)
>
> Read/write register that defines the state of the local device.
> Writing to this register sets the state and triggers MSI-X vector 0
> or the INTx interrupt, respectively, on the remote device if the
> written state value differs from the previous one. Users of peer
> devices can read the value written to this register from the State
> Table. They are expected to differentiate state change interrupts from 
> doorbell events by comparing the new state value with a locally
> stored copy.
>
> The value of this register after device reset is 0. The semantics of 
> all other values can be defined freely by the chosen protocol.
>
> ### State Table
>
> The State Table is a read-only section at the beginning of the shared
> memory region. It contains a 32-bit state value for each of the
> peers. Locating the table in shared memory allows fast checking of
> remote states without register accesses.
>
> The table is updated on each state change of a peer. Whenever a user 
> of an IVSHMEM device writes a value to the Local State register, this
> value is copied into the corresponding entry of the State Table. When
> an IVSHMEM device is reset or disconnected from the other peers, zero 
> is written into the corresponding table entry. The initial content of
> the table is all zeros.
>
>     +--------------------------------+
>     | 32-bit state value of peer n-1 |
>     +--------------------------------+
>     :                                :
>     +--------------------------------+
>     | 32-bit state value of peer 1   |
>     +--------------------------------+
>     | 32-bit state value of peer 0   |
>     +--------------------------------+ <-- Shared memory base address
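The state-change detection the spec expects from users (comparing a State
Table entry against a locally cached copy) might look like this sketch
(illustrative names, not from the spec):

```c
#include <stdint.h>

/* Distinguish a state-change interrupt from a doorbell event by
 * comparing the peer's State Table entry with a locally cached copy.
 * state_table points at the read-only table in shared memory. */
static int peer_state_changed(const volatile uint32_t *state_table,
                              uint16_t peer_id, uint32_t *cached)
{
    uint32_t now = state_table[peer_id];
    if (now != *cached) {
        *cached = now;  /* remember the new state */
        return 1;       /* state change */
    }
    return 0;           /* plain doorbell event */
}
```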
>
>
> Protocols
> ---------
>
> The IVSHMEM device shall support the peers of a connection in
> agreeing on the protocol used over the shared memory device. For
> that purpose, the interface byte (offset 09h) and the sub-class byte
> (offset 0Ah) of the Class Code register encode a 16-bit protocol
> type for the users. The following type values are defined:
>
> | Protocol Type | Description                                                  |
> |--------------:|:-------------------------------------------------------------|
> |         0000h | Undefined type                                               |
> |         0001h | Virtual peer-to-peer Ethernet                                |
> |   0002h-3FFFh | Reserved                                                     |
> |   4000h-7FFFh | User-defined protocols                                       |
> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>
> Details of the protocols are not in the scope of this specification.
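Reassembling the two class-code bytes is trivial but easy to get backwards;
a sketch (helper names are mine):

```c
#include <stdint.h>

/* The 16-bit protocol type is split across two class-code bytes:
 * interface byte (09h) carries bits 0-7, sub-class byte (0Ah) bits 8-15. */
static uint16_t ivshmem_protocol(uint8_t interface, uint8_t sub_class)
{
    return (uint16_t)((uint16_t)sub_class << 8 | interface);
}

/* Virtio-over-shared-memory front-end peers occupy 8000h-BFFFh. */
static int is_virtio_frontend(uint16_t proto)
{
    return proto >= 0x8000 && proto <= 0xBFFF;
}
```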

Otherwise it reads fine to me.  I'm sure some of my confusion is from
not being familiar with the details of PCI devices.

-- 
Alex Bennée

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] ivshmem v2: Shared memory device specification
  2020-06-17 15:32   ` [virtio-comment] " Alex Bennée
@ 2020-06-17 16:10     ` Jan Kiszka
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-06-17 16:10 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel, virtio-comment

On 17.06.20 17:32, Alex Bennée wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> Hi all,
>>
>> as requested by Michael, find below the current version of the Inter-VM 
>> Shared Memory device specification version 2 (as version 1 could be 
>> considered what is currently in QEMU).
>>
>> This posting is intended to collect feedback from the virtio community 
>> before officially proposing it to become part of the spec. As you can 
>> see, the spec format is not yet integrated with the virtio documents at 
>> all.
>>
>> IVSHMEM has value of its own, allowing unprivileged applications to 
>> establish links to other applications in VMs connected via this 
>> transport. In addition, and that is where virtio comes into play even 
>> more, it can be used to build virtio backend-frontend connections 
>> between two VMs. Prototypes have been developed, see e.g. [1], 
>> specifying that transport is still a to-do. I will try to reserve a few 
>> cycles in the upcoming weeks for a first draft.
>>
>> My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
>> to propose them both to virtio but allowing virtio-shmem to also be mapped 
>> on other shared memory channels for virtual machines.
>>
>> The ivshmem2 device model is fairly stable now, also being in use in 
>> Jailhouse for quite a while. Still, there are some aspects that could be 
>> worth discussing in particular:
>>
>>  - Do we also need a platform device model, in addition to PCI? My
>>    feelings are negative, but there has been at least one request.
>>
>>  - Should we add some feature flags, or is using the PCI revision ID
>>    sufficient (...to do that later)? Currently, there is no need for
>>    communicating features this way.
>>
>>  - Should we try to model the doorbell interface more flexibly, in a way
>>    that may allow mapping it on hardware-provided mailboxes (i.e. VM-
>>    exit free channels)? Unfortunately, I'm not yet aware of any hardware
>>    that could provide this feature and, thus, could act as a use case to
>>    design against.
>>
>> Thanks,
>> Jan
>>
>> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html
>>
>> ---
>>
>> IVSHMEM Device Specification
>> ============================
>>
>> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>>
>> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
>> define the minimally needed building blocks a hypervisor has to
>> provide for enabling guest-to-guest communication. The details of
>> communication protocols shall remain opaque to the hypervisor so that
>> guests are free to define them to be as simple or as sophisticated 
>> as they need.
>>
>> For that purpose, the IVSHMEM device provides the following features to its 
>> users:
>>
>> - Interconnection between up to 65536 peers
>>
>> - Multi-purpose shared memory region
>>
>>     - common read/writable section
>>
>>     - output sections that are read/writable for one peer and only
>>       readable for the others
>>
>>     - section with peer states
>>
>> - Event signaling via interrupt to remote sides
>>
>> - Support for life-cycle management via state value exchange and
>>   interrupt notification on changes, backed by a shared memory
>>   section
>>
>> - Free choice of protocol to be used on top
>>
>> - Protocol type declaration
>>
>> - Registers can be implemented either memory-mapped or via I/O, 
>>   depending on platform support and lower VM-exit costs
>>
>> - Unprivileged access to memory-mapped or I/O registers feasible
>>
>> - Single discovery and configuration via standard PCI, no complexity
>>   by additionally defining a platform device model
>>
>>
>> Hypervisor Model
>> ----------------
>>
>> In order to provide a consistent link between peers, all connected
>> instances of IVSHMEM devices need to be configured, created and run
>> by the hypervisor according to the following requirements:
>>
>> - The instances of the device shall appear as a PCI device to their
>>   users.
> 
> Would there ever be scope for a plain MMIO style device to be reported?

As pointed out above, there has already been a question targeting MMIO
binding.

> While I agree PCI is pretty useful from the point of view of easy
> discovery I note a number of machine types prefer device tree reported
> devices because they are faster to bring up for cloudy fast VM purposes
> (c.f. microvm).

Can't you avoid that probing delay by listing devices explicitly in the
device tree? I have no data on that, just heard about it. Maybe the same
works with ACPI.

What I heard so far is that the concerns were more about the complexity
of PCI driver stacks in the guests. But that complexity quickly grows
on the MMIO platform side as well once it becomes feature-wise similar
to PCI.

> 
>> - The read/write shared memory section has to be of the same size for
>>   all peers. The size can be zero.
>>
>> - If shared memory output sections are present (non-zero section
>>   size), there must be one reserved for each peer with exclusive
>>   write access. All output sections must have the same size and must
>>   be readable for all peers.
>>
>> - The State Table must have the same size for all peers, must be
>>   large enough to hold the state values of all peers, and must be
>>   read-only for the user.
>>
>> - State register changes (explicit writes, peer resets) have to be
>>   propagated to the other peers by updating the corresponding State
>>   Table entry and issuing an interrupt to all other peers if they
>>   enabled reception.
>>
>> - Interrupt events triggered by a peer have to be delivered to the
>>   target peer, provided the receiving side is valid and has enabled
>>   the reception.
>>
>> - All peers must have the same interrupt delivery features available,
>>   i.e. MSI-X with the same maximum number of vectors on platforms
>>   supporting this mechanism, otherwise INTx with one vector.
>>
>>
>> Guest-side Programming Model
>> ----------------------------
>>
>> An IVSHMEM device appears as a PCI device to its users. Unless
>> otherwise noted, it conforms to the PCI Local Bus Specification,
>> Revision 3.0. As such, it is discoverable via the PCI configuration
>> space and provides a number of standard and custom PCI configuration
>> registers.
>>
>> ### Shared Memory Region Layout
>>
>> The shared memory region is divided into several sections.
>>
>>     +-----------------------------+   -
>>     |                             |   :
>>     | Output Section for peer n-1 |   : Output Section Size
>>     |     (n = Maximum Peers)     |   :
>>     +-----------------------------+   -
>>     :                             :
>>     :                             :
>>     :                             :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |  Output Section for peer 1  |   : Output Section Size
>>     |                             |   :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |  Output Section for peer 0  |   : Output Section Size
>>     |                             |   :
>>     +-----------------------------+   -
> 
> There doesn't seem to be any mention of limits on the output section
> size. If you wanted to limit visibility between peers there would have
> to be an explicit page alignment requirement for these sections.

That is needed for managing write access, in fact. We should likely
mention that alignment requirement (which also depends on the guests' smallest
page size), but the actual value is out of scope for the spec.

> 
> Or would the alternative be for the hypervisor to instantiate additional
> IVSHMEM devices for each combination of peers that need visibility of
> each other?

If you link guests via a set of IVSHMEM devices, that implies the guests
are aware of who can see what. And it is actually the model to have
multiple IVSHMEM links, split per purpose (protocol) or split per
visibility constraints.

> 
>>     |                             |   :
>>     |     Read/Write Section      |   : R/W Section Size
>>     |                             |   :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |         State Table         |   : State Table Size
>>     |                             |   :
>>     +-----------------------------+   <-- Shared memory base address
>>
>> The first section consists of the mandatory State Table. Its size is
>> defined by the State Table Size register and cannot be zero. This
>> section is read-only for all peers.
>>
>> The second section consists of shared memory that is read/writable
>> for all peers. Its size is defined by the R/W Section Size register.
>> A size of zero is permitted.
>>
>> The third and following sections are output sections, one for each
>> peer. Their sizes are all identical. The size of a single output
>> section is defined by the Output Section Size register. An output
>> section is read/writable for the corresponding peer and read-only for
>> all other peers. E.g., only the peer with ID 3 can write to the
>> fourth output section, but all peers can read from this section.
>>
>> All sizes have to be rounded up to multiples of a mappable page in
>> order to allow access control according to the section restrictions.
>>
>> ### Configuration Space Registers
>>
>> #### Header Registers
>>
>> | Offset | Register               | Content                                              |
>> |-------:|:-----------------------|:-----------------------------------------------------|
>> |    00h | Vendor ID              | 110Ah                                                |
>> |    02h | Device ID              | 4106h                                                |
>> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
>> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
>> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
>> |        |                        | Bit 3: Bus Master                                    |
>> |        |                        | Bit 10: INTx interrupt disable                       |
>> |        |                        | Writes to other bits are ignored                     |
>> |    06h | Status Register        | 0010h, static value                                  |
>> |        |                        | In deviation from the PCI specification, the Interrupt |
>> |        |                        | Status (bit 3) is never set                          |
>> |    08h | Revision ID            | 00h                                                  |
>> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>> |    0Bh | Class Code, Base Class | FFh                                                  |
>> |    0Eh | Header Type            | 00h                                                  |
>> |    10h | BAR 0                  | MMIO or I/O register region                          |
>> |    14h | BAR 1                  | MSI-X region                                         |
>> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
>> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
>> |    34h | Capability Pointer     | First capability                                     |
> 
> Is this a PCI thing? How are additional capabilities described?

Here the link goes to the PCI spec: nothing special, the usual linked list.

> 
>> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>>
>> The INTx status bit is never set by an implementation. Users of the
>> IVSHMEM device are instead expected to derive the event state from
>> protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> If BAR 2 is not present, the shared memory region is not relocatable
>> by the user. In that case, the hypervisor has to implement the Base
>> Address register in the vendor-specific capability.
> 
> Sorry I don't follow - if there is no BAR2 there is still a shared
> memory region but you just can't move it around the guests address

Exactly. PCI unfortunately does not allow us to describe that with PCI
means, so we need a side-channel. I tried PCI EA once, but it did not
work out:
https://groups.google.com/d/msg/jailhouse-dev/H62ahr0_bRk/pyHyohSdDAAJ

> space? I guess this ties in with my question above about capabilities,
> is there just a value you read for the physical address? What about it's
> size?

See vendor cap below.

> 
>> Subsystem IDs shall encode the provider (hypervisor) in order to
>> allow identifying potential deviating implementations in case this
>> should ever be required.
>>
>> If the platform supports MSI-X, an implementation of the IVSHMEM
>> device must provide this interrupt model and must not expose INTx
>> support.
>>
>> Other header registers need not be implemented. If not implemented,
>> they return 0 on read and ignore write accesses.
>>
>> #### Vendor Specific Capability (ID 09h)
>>
>> This capability must always be present.
>>
>> | Offset | Register            | Content                                        |
>> |-------:|:--------------------|:-----------------------------------------------|
>> |    00h | ID                  | 09h                                            |
>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> |    10h | Output Section Size | 64-bit size of output sections                 |
>> |    18h | Base Address        | optional: 64-bit base address of shared memory |
> 
> ah so we follow Capability Pointer and then walk down the table in Next
> Capability chunks? Next Capability is described as a pointer but only at
> an offset of 01h with Length at 02h. Is it really a pointer or just the
> size of this capability record?

It's an offset in the config space region. PCI standard.
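For readers less familiar with PCI: the walk over the capability list boils
down to following the one-byte next pointers until ID 00h. A sketch,
operating on a raw copy of the 256-byte configuration space rather than on
live config reads (the helper name is invented):

```c
#include <stdint.h>

/* Find a capability by ID in a copy of the PCI configuration space.
 * Returns its config-space offset, or -1 if not present. */
static int pci_find_cap(const uint8_t cfg[256], uint8_t cap_id)
{
    uint8_t pos = cfg[0x34];     /* Capability Pointer register   */
    int guard = 48;              /* bail out on malformed loops   */

    while (pos >= 0x40 && guard--) {
        if (cfg[pos] == cap_id)  /* byte 0 of each capability: ID */
            return pos;
        pos = cfg[pos + 1];      /* byte 1: offset of next cap    */
    }
    return -1;
}
```

The Vendor Specific Capability (ID 09h) of this device would be located the
same way as any other capability.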

> 
> I think this answers the question further up - no BAR2 and the optional
> Base Address in the Vendor Specific Capability.
> 
>> All registers are read-only. Writes are ignored, except to bit 0 of
>> the Privileged Control register.
>>
>> When bit 0 in the Privileged Control register is set to 1, the device
>> clears bit 0 in the Interrupt Control register on each interrupt
>> delivery. This enables automatic interrupt throttling when interrupt
>> re-enabling is to be performed by a scheduled unprivileged instance
>> on the user side.
>>
>> An IVSHMEM device need not support a relocatable shared memory region.
>> This helps the hypervisor lock down the guest-to-host address
>> mapping and simplifies the runtime logic. In such a case, BAR 2 must
>> not be implemented by the hypervisor. Instead, the Base Address
>> register has to be implemented to report the location of the shared
>> memory region in the user's address space.
> 
> This seems useful - especially in the case of trying to keep the guest
> using a fixed guest physical address and not potentially breaking up
> mappings in the hypervisor by moving stuff around.

Exactly. One motivation in Jailhouse context is to allow for a
runtime-verifiable checksum over most of the stage-2 page tables (not
yet implemented but in mind). Permitting the guest to control where its
shared memory shall be would prevent that. It would also mean TLB
management for the second-stage page tables during guest runtime,
something we do not have in Jailhouse and would like to avoid.

> 
>>
>> A non-existing shared memory section has to report zero in its
>> Section Size register.
>>
>> #### MSI-X Capability (ID 11h)
>>
>> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
>> delivery via this mechanism. In that case, the MSI-X capability is
>> present while the legacy INTx delivery mechanism is not available,
>> and the Interrupt Pin configuration register returns 0.
>>
>> The IVSHMEM device has no notion of pending interrupts. Therefore,
>> reading from the MSI-X Pending Bit Array will always return 0. Users
>> of the IVSHMEM device are instead expected to derive the event state
>> from protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> The corresponding MSI-X MMIO region is configured via BAR 1.
>>
>> The MSI-X table size reported by the MSI-X capability structure is
>> identical for all peers.
>>
>> ### Register Region
>>
>> The register region may be implemented as MMIO or I/O.
>>
>> When implementing it as MMIO, the hypervisor has to ensure that the
>> register region can be mapped as a single page into the address space
>> of the user, without causing potential overlaps with other resources.
>> Write accesses to MMIO region offsets that are not backed by
>> registers have to be ignored, read accesses have to return 0. This
>> enables the user to hand out the complete region, along with the
>> shared memory, to an unprivileged instance.
>>
>> The region location in the user's physical address space is
>> configured via BAR 0. The following table visualizes the region
>> layout:
>>
>> | Offset | Register                                                            |
>> |-------:|:--------------------------------------------------------------------|
>> |    00h | ID                                                                  |
>> |    04h | Maximum Peers                                                       |
>> |    08h | Interrupt Control                                                   |
>> |    0Ch | Doorbell                                                            |
>> |    10h | State                                                               |
>>
>> All registers support only aligned 32-bit accesses.
>>
>> #### ID Register (Offset 00h)
>>
>> Read-only register that reports the ID of the local device. It is
>> unique for all of the connected devices and remains unchanged over
>> their lifetime.
>>
>> #### Maximum Peers Register (Offset 04h)
>>
>> Read-only register that reports the maximum number of possible peers
>> (including the local one). The permitted range is between 2 and 65536
>> and remains constant over the lifetime of all peers.
>>
>> #### Interrupt Control Register (Offset 08h)
>>
>> This read/write register controls the generation of interrupts
>> whenever a peer writes to the Doorbell register or changes its state.
>>
>> | Bits | Content                                                               |
>> |-----:|:----------------------------------------------------------------------|
>> |    0 | 1: Enable interrupt generation                                        |
>> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>>
>> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
>> interrupt mode is enabled via the Privileged Control register.
>>
>> The value of this register after device reset is 0.
>>
>> #### Doorbell Register (Offset 0Ch)
>>
>> Write-only register that triggers an interrupt vector in the target
>> device if it is enabled there.
>>
>> | Bits  | Content                                                              |
>> |------:|:---------------------------------------------------------------------|
>> |  0-15 | Vector number                                                        |
>> | 16-31 | Target ID                                                            |
>>
>> Writing a vector number that is not enabled by the target has no
>> effect. The peers can derive the number of available vectors from
>> their own device capabilities because the provider is required to
>> expose an identical number of vectors to all connected peers. The
>> peers are expected to define or negotiate the used ones via the
>> selected protocol.
>>
>> Addressing a non-existing or inactive target has no effect. Peers can
>> identify active targets via the State Table.
>>
>> Implementations of the Doorbell register must ensure that data written by the
>> CPU prior to issuing the register write is visible to the receiving peer before
>> the interrupt arrives.
>>
>> The behavior on reading from this register is undefined.
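The bit layout of a Doorbell write can be sketched as follows (the function
name is invented):

```c
#include <stdint.h>

/* Encode a Doorbell register value: vector number in bits 0-15,
 * target peer ID in bits 16-31. In a real driver, the result would be
 * stored to the Doorbell register only after ordering it against prior
 * writes to the shared memory, so that the data is visible to the
 * target when the interrupt arrives. */
static uint32_t ivshmem_doorbell_value(uint16_t target_id, uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}
```

On Linux, for instance, an MMIO store via `writel()` already provides the
required ordering against prior normal memory writes.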
>>
>> #### State Register (Offset 10h)
>>
>> Read/write register that defines the state of the local device.
>> Writing to this register sets the state and triggers MSI-X vector 0
>> or the INTx interrupt, respectively, on the other peer devices if the
>> written state value differs from the previous one. Users of peer
>> devices can read the value written to this register from the State
>> Table. They are expected to differentiate state change interrupts from
>> doorbell events by comparing the new state value with a locally
>> stored copy.
>>
>> The value of this register after device reset is 0. The semantic of
>> all other values can be defined freely by the chosen protocol.
>>
>> ### State Table
>>
>> The State Table is a read-only section at the beginning of the shared
>> memory region. It contains a 32-bit state value for each of the
>> peers. Locating the table in shared memory allows fast checking of
>> remote states without register accesses.
>>
>> The table is updated on each state change of a peer. Whenever a user
>> of an IVSHMEM device writes a value to the State register, this
>> value is copied into the corresponding entry of the State Table. When
>> an IVSHMEM device is reset or disconnected from the other peers, zero
>> is written into the corresponding table entry. The initial content of
>> the table is all zeros.
>>
>>     +--------------------------------+
>>     | 32-bit state value of peer n-1 |
>>     +--------------------------------+
>>     :                                :
>>     +--------------------------------+
>>     | 32-bit state value of peer 1   |
>>     +--------------------------------+
>>     | 32-bit state value of peer 0   |
>>     +--------------------------------+ <-- Shared memory base address
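The recommended way to demultiplex interrupts then follows directly from
this table: compare the peer's State Table entry against a locally cached
copy, and treat everything else as a doorbell event. A sketch (names are
invented):

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and updates the cache if peer `peer_id` changed its
 * state since the last call; false means the interrupt was (also) a
 * doorbell event. The table is written asynchronously by the device,
 * hence the volatile access. */
static bool ivshmem_peer_state_changed(const volatile uint32_t *state_table,
                                       uint32_t peer_id,
                                       uint32_t *cached_state)
{
    uint32_t now = state_table[peer_id];

    if (now == *cached_state)
        return false;
    *cached_state = now;
    return true;
}
```

A zero entry read this way additionally tells the user that the peer is
reset or disconnected.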
>>
>>
>> Protocols
>> ---------
>>
>> The IVSHMEM device shall support the peers of a connection in
>> agreeing on the protocol used over the shared memory device. For
>> that purpose, the interface byte (offset 09h) and the sub-class byte
>> (offset 0Ah) of the Class Code register encode a 16-bit protocol
>> type for the users. The following type values are defined:
>>
>> | Protocol Type | Description                                                  |
>> |--------------:|:-------------------------------------------------------------|
>> |         0000h | Undefined type                                               |
>> |         0001h | Virtual peer-to-peer Ethernet                                |
>> |   0002h-3FFFh | Reserved                                                     |
>> |   4000h-7FFFh | User-defined protocols                                       |
>> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>>
>> Details of the protocols are not in the scope of this specification.
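Assembling the protocol type from the two Class Code bytes is trivial but
worth spelling out once, including the virtio range check (function names
are invented):

```c
#include <stdint.h>

/* Build the 16-bit protocol type: the Interface byte (config offset
 * 09h) carries bits 0-7, the Sub-Class byte (offset 0Ah) bits 8-15. */
static uint16_t ivshmem_protocol(uint8_t interface, uint8_t sub_class)
{
    return (uint16_t)((uint16_t)sub_class << 8 | interface);
}

/* Nonzero if the type denotes a virtio-over-shared-memory peer,
 * i.e. falls into the 8000h-FFFFh range (front-end or back-end). */
static int ivshmem_protocol_is_virtio(uint16_t proto)
{
    return proto >= 0x8000;
}
```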
> 
> Otherwise it reads fine to me.  I'm sure some of my confusion is from
> not being familiar with the details of PCI devices.
> 

We could link into chapters of the PCI spec directly if that helps.

Thanks for the feedback!

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux



* [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
@ 2020-06-17 16:10     ` Jan Kiszka
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-06-17 16:10 UTC (permalink / raw)
  To: Alex Bennée
  Cc: virtio-comment, Jailhouse, qemu-devel, Michael S. Tsirkin, liang yan

On 17.06.20 17:32, Alex Bennée wrote:
> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> Hi all,
>>
>> as requested by Michael, find below the current version of the Inter-VM 
>> Shared Memory device specification version 2 (as version 1 could be 
>> considered what is currently in QEMU).
>>
>> This posting is intended to collect feedback from the virtio community 
>> before officially proposing it to become part of the spec. As you can 
>> see, the spec format is not yet integrated with the virtio documents at 
>> all.
>>
>> IVSHMEM has value of its own, allowing unprivileged applications to 
>> establish links to other applications in VMs connected via this 
>> transport. In addition, and that is where virtio comes into play even 
>> more, it can be used to build virtio backend-frontend connections 
>> between two VMs. Prototypes have been developed, see e.g. [1], 
>> specifying that transport is still a to-do. I will try to reserve a few 
>> cycles in the upcoming weeks for a first draft.
>>
>> My current strategy for these two pieces, ivshmem2 and virtio-shmem, is 
>> to propose them both to virtio, while allowing virtio-shmem to also be mapped 
>> on other shared memory channels for virtual machines.
>>
>> The ivshmem2 device model is fairly stable now, also being in use in 
>> Jailhouse for quite a while. Still there are some aspects that could be 
>> worth to discuss in particular:
>>
>>  - Do we also need a platform device model, in addition to PCI? My
>>    feelings are negative, but there has been at least one request.
>>
>>  - Should we add some feature flags, or is using the PCI revision ID
>>    sufficient (...to do that later)? Currently, there is no need for
>>    communicating features this way.
>>
>>  - Should we try to model the doorbell interface more flexibly, in a way
>>    that may allow mapping it on hardware-provided mailboxes (i.e. VM-
>>    exit free channels)? Unfortunately, I'm not yet aware of any hardware
>>    that could provide this feature and, thus, could act as a use case to
>>    design against.
>>
>> Thanks,
>> Jan
>>
>> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg668749.html
>>
>> ---
>>
>> IVSHMEM Device Specification
>> ============================
>>
>> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>>
>> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
>> define the minimally needed building blocks a hypervisor has to
>> provide for enabling guest-to-guest communication. The details of
>> communication protocols shall remain opaque to the hypervisor so that
>> guests are free to define them as simple or sophisticated as they
>> need.
>>
>> For that purpose, the IVSHMEM device provides the following features to its
>> users:
>>
>> - Interconnection between up to 65536 peers
>>
>> - Multi-purpose shared memory region
>>
>>     - common read/writable section
>>
>>     - output sections that are read/writable for one peer and only
>>       readable for the others
>>
>>     - section with peer states
>>
>> - Event signaling via interrupt to remote sides
>>
>> - Support for life-cycle management via state value exchange and
>>   interrupt notification on changes, backed by a shared memory
>>   section
>>
>> - Free choice of protocol to be used on top
>>
>> - Protocol type declaration
>>
>> - Registers can be implemented either memory-mapped or via I/O,
>>   depending on platform support and lower VM-exit costs
>>
>> - Unprivileged access to memory-mapped or I/O registers feasible
>>
>> - Single discovery and configuration via standard PCI, no complexity
>>   by additionally defining a platform device model
>>
>>
>> Hypervisor Model
>> ----------------
>>
>> In order to provide a consistent link between peers, all connected
>> instances of IVSHMEM devices need to be configured, created and run
>> by the hypervisor according to the following requirements:
>>
>> - The instances of the device shall appear as a PCI device to their
>>   users.
> 
> Would there ever be scope for a plain MMIO style device to be reported?

As pointed out above, there has been a question already targeting at
MMIO binding.

> While I agree PCI is pretty useful from the point of view of easy
> discovery I note a number of machine types prefer device tree reported
> devices because they are faster to bring up for cloudy fast VM purposes
> (c.f. microvm).

Can't you avoid that probing delay by listing devices explicitly in the
device tree? I have no data on that, just heard about it. Maybe the same
works with ACPI.

What I have heard so far is that the concerns were more about the
complexity of PCI driver stacks in the guests. But that complexity
quickly grows on the MMIO platform side as well once it reaches a
feature level similar to PCI.

> 
>> - The read/write shared memory section has to be of the same size for
>>   all peers. The size can be zero.
>>
>> - If shared memory output sections are present (non-zero section
>>   size), there must be one reserved for each peer with exclusive
>>   write access. All output sections must have the same size and must
>>   be readable for all peers.
>>
>> - The State Table must have the same size for all peers, must be
>>   large enough to hold the state values of all peers, and must be
>>   read-only for the user.
>>
>> - State register changes (explicit writes, peer resets) have to be
>>   propagated to the other peers by updating the corresponding State
>>   Table entry and issuing an interrupt to all other peers if they
>>   enabled reception.
>>
>> - Interrupt events triggered by a peer have to be delivered to the
>>   target peer, provided the receiving side is valid and has enabled
>>   the reception.
>>
>> - All peers must have the same interrupt delivery features available,
>>   i.e. MSI-X with the same maximum number of vectors on platforms
>>   supporting this mechanism, otherwise INTx with one vector.
>>
>>
>> Guest-side Programming Model
>> ----------------------------
>>
>> An IVSHMEM device appears as a PCI device to its users. Unless
>> otherwise noted, it conforms to the PCI Local Bus Specification,
>> Revision 3.0. As such, it is discoverable via the PCI configuration
>> space and provides a number of standard and custom PCI configuration
>> registers.
>>
>> ### Shared Memory Region Layout
>>
>> The shared memory region is divided into several sections.
>>
>>     +-----------------------------+   -
>>     |                             |   :
>>     | Output Section for peer n-1 |   : Output Section Size
>>     |     (n = Maximum Peers)     |   :
>>     +-----------------------------+   -
>>     :                             :
>>     :                             :
>>     :                             :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |  Output Section for peer 1  |   : Output Section Size
>>     |                             |   :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |  Output Section for peer 0  |   : Output Section Size
>>     |                             |   :
>>     +-----------------------------+   -
> 
> There doesn't seem to be any mention of limits on the output section
> size. If you wanted to limit visibility between peers there would have
> to be an explicit page alignment requirement for these sections.

That is needed for managing write access, in fact. We should likely
mention that alignment requirement (which also depends on the guests'
smallest page size), but the actual value is out of scope for the spec.

> 
> Or would the alternative be for the hypervisor to instantiate additional
> IVSHMEM devices for each combination of peers that need visibility of
> each other?

If you link guests via a set of IVSHMEM devices, that implies the guests
are aware of who can see what. And that is indeed the model: multiple
IVSHMEM links, split per purpose (protocol) or per visibility
constraints.

> 
>>     |                             |   :
>>     |     Read/Write Section      |   : R/W Section Size
>>     |                             |   :
>>     +-----------------------------+   -
>>     |                             |   :
>>     |         State Table         |   : State Table Size
>>     |                             |   :
>>     +-----------------------------+   <-- Shared memory base address
>>
>> The first section consists of the mandatory State Table. Its size is
>> defined by the State Table Size register and cannot be zero. This
>> section is read-only for all peers.
>>
>> The second section consists of shared memory that is read/writable
>> for all peers. Its size is defined by the R/W Section Size register.
>> A size of zero is permitted.
>>
>> The third and following sections are output sections, one for each
>> peer. Their sizes are all identical. The size of a single output
>> section is defined by the Output Section Size register. An output
>> section is read/writable for the corresponding peer and read-only for
>> all other peers. E.g., only the peer with ID 3 can write to the
>> fourths output section, but all peers can read from this section.
>>
>> All sizes have to be rounded up to multiples of a mappable page in
>> order to allow access control according to the section restrictions.
>>
>> ### Configuration Space Registers
>>
>> #### Header Registers
>>
>> | Offset | Register               | Content                                              |
>> |-------:|:-----------------------|:-----------------------------------------------------|
>> |    00h | Vendor ID              | 110Ah                                                |
>> |    02h | Device ID              | 4106h                                                |
>> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
>> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
>> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
>> |        |                        | Bit 3: Bus Master                                    |
>> |        |                        | Bit 10: INTx interrupt disable                       |
>> |        |                        | Writes to other bits are ignored                     |
>> |    06h | Status Register        | 0010h, static value                                  |
>> |        |                        | In deviation to the PCI specification, the Interrupt |
>> |        |                        | Status (bit 3) is never set                          |
>> |    08h | Revision ID            | 00h                                                  |
>> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>> |    0Bh | Class Code, Base Class | FFh                                                  |
>> |    0Eh | Header Type            | 00h                                                  |
>> |    10h | BAR 0                  | MMIO or I/O register region                          |
>> |    14h | BAR 1                  | MSI-X region                                         |
>> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
>> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
>> |    34h | Capability Pointer     | First capability                                     |
> 
> Is this a PCI thing? How are additional capabilities described?

Here the link goes to the PCI spec: nothing special, the usual linked list.

> 
>> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>>
>> The INTx status bit is never set by an implementation. Users of the
>> IVSHMEM device are instead expected to derive the event state from
>> protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> If BAR 2 is not present, the shared memory region is not relocatable
>> by the user. In that case, the hypervisor has to implement the Base
>> Address register in the vendor-specific capability.
> 
> Sorry I don't follow - if there is no BAR2 there is still a shared
> memory region but you just can't move it around the guests address

Exactly. PCI unfortunatly does not allow us to describe that with PCI
means, so we need a side-channel. If tried PCI EA once, but it did not
work out:
https://groups.google.com/d/msg/jailhouse-dev/H62ahr0_bRk/pyHyohSdDAAJ

> space? I guess this ties in with my question above about capabilities,
> is there just a value you read for the physical address? What about it's
> size?

See vendor cap below.

> 
>> Subsystem IDs shall encode the provider (hypervisor) in order to
>> allow identifying potential deviating implementations in case this
>> should ever be required.
>>
>> If its platform supports MSI-X, an implementation of the IVSHMEM
>> device must provide this interrupt model and must not expose INTx
>> support.
>>
>> Other header registers may not be implemented. If not implemented,
>> they return 0 on read and ignore write accesses.
>>
>> #### Vendor Specific Capability (ID 09h)
>>
>> This capability must always be present.
>>
>> | Offset | Register            | Content                                        |
>> |-------:|:--------------------|:-----------------------------------------------|
>> |    00h | ID                  | 09h                                            |
>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> |    10h | Output Section Size | 64-bit size of output sections                 |
>> |    18h | Base Address        | optional: 64-bit base address of
>> shared memory |
> 
> ah so we follow Capability Pointer and then walk down the table in Next
> Capability chunks? Next Capability is described as a pointer but only at
> an offset of 01h with Length at 02h. Is it really a pointer or just the
> size of this capability record?

It's an offset in the config space region. PCI standard.

> 
> I think this answers the question further up - no BAR2 and the optional
> Base Address in the Vendor Specific Capability.
> 
>> All registers are read-only. Writes are ignored, except to bit 0 of
>> the Privileged Control register.
>>
>> When bit 0 in the Privileged Control register is set to 1, the device
>> clears bit 0 in the Interrupt Control register on each interrupt
>> delivery. This enables automatic interrupt throttling when
>> re-enabling shall be performed by a scheduled unprivileged instance
>> on the user side.
>>
>> An IVSHMEM device may not support a relocatable shared memory region.
>> This support the hypervisor in locking down the guest-to-host address
>> mapping and simplifies the runtime logic. In such a case, BAR 2 must
>> not be implemented by the hypervisor. Instead, the Base Address
>> register has to be implemented to report the location of the shared
>> memory region in the user's address space.
> 
> This seems useful - especially in the case of trying to keep the guest
> using a fixed guest physical address and not potentially breaking up
> mappings in the hypervisor by moving stuff around.

Exactly. One motivation in Jailhouse context is to allow for a
runtime-verifiable checksum over most of the stage-2 page tables (not
yet implemented but in mind). Permitting the guest to control where its
shared memory shall be would prevent that. I would also mean TLB
management for the second-stage page tables during guest runtime,
something we do not have in Jailhouse and would like to avoid.

> 
>>
>> A non-existing shared memory section has to report zero in its
>> Section Size register.
>>
>> #### MSI-X Capability (ID 11h)
>>
>> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
>> delivery via this mechanism. In that case, the MSI-X capability is
>> present while the legacy INTx delivery mechanism is not available,
>> and the Interrupt Pin configuration register returns 0.
>>
>> The IVSHMEM device has no notion of pending interrupts. Therefore,
>> reading from the MSI-X Pending Bit Array will always return 0. Users
>> of the IVSHMEM device are instead expected to derive the event state
>> from protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> The corresponding MSI-X MMIO region is configured via BAR 1.
>>
>> The MSI-X table size reported by the MSI-X capability structure is
>> identical for all peers.
>>
>> ### Register Region
>>
>> The register region may be implemented as MMIO or I/O.
>>
>> When implementing it as MMIO, the hypervisor has to ensure that the
>> register region can be mapped as a single page into the address space
>> of the user, without causing potential overlaps with other resources.
>> Write accesses to MMIO region offsets that are not backed by
>> registers have to be ignored, read accesses have to return 0. This
>> enables the user to hand out the complete region, along with the
>> shared memory, to an unprivileged instance.
>>
>> The region location in the user's physical address space is
>> configured via BAR 0. The following table visualizes the region
>> layout:
>>
>> | Offset | Register                                                            |
>> |-------:|:--------------------------------------------------------------------|
>> |    00h | ID                                                                  |
>> |    04h | Maximum Peers                                                       |
>> |    08h | Interrupt Control                                                   |
>> |    0Ch | Doorbell                                                            |
>> |    10h | State                                                               |
>>
>> All registers support only aligned 32-bit accesses.
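The register map above translates directly into a driver-side sketch (the offsets come from the table; the enum and helper names are purely illustrative, not part of the spec):

```c
#include <assert.h>
#include <stdint.h>

/* Register region offsets, as listed in the table above. */
enum ivshmem_reg {
    IVSHMEM_REG_ID          = 0x00, /* ID */
    IVSHMEM_REG_MAX_PEERS   = 0x04, /* Maximum Peers */
    IVSHMEM_REG_INT_CONTROL = 0x08, /* Interrupt Control */
    IVSHMEM_REG_DOORBELL    = 0x0C, /* Doorbell */
    IVSHMEM_REG_STATE       = 0x10, /* State */
};

/* All registers support only aligned 32-bit accesses; this helper
 * models such a read against a memory-mapped base pointer. */
static inline uint32_t ivshmem_read32(const volatile void *base,
                                      enum ivshmem_reg reg)
{
    return *(const volatile uint32_t *)
        ((const volatile uint8_t *)base + reg);
}
```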
>>
>> #### ID Register (Offset 00h)
>>
>> Read-only register that reports the ID of the local device. It is
>> unique for all of the connected devices and remains unchanged over
>> their lifetime.
>>
>> #### Maximum Peers Register (Offset 04h)
>>
>> Read-only register that reports the maximum number of possible peers
>> (including the local one). The permitted range is between 2 and 65536
>> and remains constant over the lifetime of all peers.
>>
>> #### Interrupt Control Register (Offset 08h)
>>
>> This read/write register controls the generation of interrupts
>> whenever a peer writes to the Doorbell register or changes its state.
>>
>> | Bits | Content                                                               |
>> |-----:|:----------------------------------------------------------------------|
>> |    0 | 1: Enable interrupt generation                                        |
>> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>>
>> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
>> interrupt mode is enabled in the Enhanced Features register.
>>
>> The value of this register after device reset is 0.
>>
>> #### Doorbell Register (Offset 0Ch)
>>
>> Write-only register that triggers an interrupt vector in the target
>> device if it is enabled there.
>>
>> | Bits  | Content                                                              |
>> |------:|:---------------------------------------------------------------------|
>> |  0-15 | Vector number                                                        |
>> | 16-31 | Target ID                                                            |
>>
>> Writing a vector number that is not enabled by the target has no
>> effect. The peers can derive the number of available vectors from
>> their own device capabilities because the provider is required to
>> expose an identical number of vectors to all connected peers. The
>> peers are expected to define or negotiate the used ones via the
>> selected protocol.
>>
>> Addressing a non-existing or inactive target has no effect. Peers can
>> identify active targets via the State Table.
>>
>> Implementations of the Doorbell register must ensure that data written by the
>> CPU prior to issuing the register write is visible to the receiving peer before
>> the interrupt arrives.
>>
>> The behavior on reading from this register is undefined.
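For concreteness, the bit layout in the Doorbell table reduces to a trivial encoding helper (the bit positions are from the spec; the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell value: bits 0-15 carry the vector number,
 * bits 16-31 the target peer ID. */
static inline uint32_t ivshmem_doorbell_value(uint16_t vector,
                                              uint16_t target_id)
{
    return ((uint32_t)target_id << 16) | vector;
}
```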
>>
>> #### State Register (Offset 10h)
>>
>> Read/write register that defines the state of the local device.
>> Writing to this register sets the state and triggers MSI-X vector 0
>> or the INTx interrupt, respectively, on the remote device if the
>> written state value differs from the previous one. Users of peer
>> devices can read the value written to this register from the State
>> Table. They are expected to differentiate state change interrupts from
>> doorbell events by comparing the new state value with a locally
>> stored copy.
>>
>> The value of this register after device reset is 0. The semantic of
>> all other values can be defined freely by the chosen protocol.
>>
>> ### State Table
>>
>> The State Table is a read-only section at the beginning of the shared
>> memory region. It contains a 32-bit state value for each of the
>> peers. Locating the table in shared memory allows fast checking of
>> remote states without register accesses.
>>
>> The table is updated on each state change of a peer. Whenever a user
>> of an IVSHMEM device writes a value to the Local State register, this
>> value is copied into the corresponding entry of the State Table. When
>> an IVSHMEM device is reset or disconnected from the other peers, zero
>> is written into the corresponding table entry. The initial content of
>> the table is all zeros.
>>
>>     +--------------------------------+
>>     | 32-bit state value of peer n-1 |
>>     +--------------------------------+
>>     :                                :
>>     +--------------------------------+
>>     | 32-bit state value of peer 1   |
>>     +--------------------------------+
>>     | 32-bit state value of peer 0   |
>>     +--------------------------------+ <-- Shared memory base address
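Since the table is indexed by peer ID, checking a remote state or its activity is a plain array read; a sketch under the rules above (0 means reset/disconnected; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* The State Table is an array of 32-bit values at the start of the
 * shared memory region, one entry per peer, indexed by peer ID. */
static inline uint32_t ivshmem_peer_state(const volatile uint32_t *state_table,
                                          uint32_t peer_id)
{
    return state_table[peer_id];
}

/* A peer is active iff its entry is non-zero (0 is the reset value). */
static inline int ivshmem_peer_active(const volatile uint32_t *state_table,
                                      uint32_t peer_id)
{
    return ivshmem_peer_state(state_table, peer_id) != 0;
}
```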
>>
>>
>> Protocols
>> ---------
>>
>> The IVSHMEM device shall support the peers of a connection in
>> agreeing on the protocol used over the shared memory devices. For
>> that purpose, the interface byte (offset 09h) and the sub-class byte
>> (offset 0Ah) of the Class Code register encode a 16-bit protocol
>> type for the users. The following type values are defined:
>>
>> | Protocol Type | Description                                                  |
>> |--------------:|:-------------------------------------------------------------|
>> |         0000h | Undefined type                                               |
>> |         0001h | Virtual peer-to-peer Ethernet                                |
>> |   0002h-3FFFh | Reserved                                                     |
>> |   4000h-7FFFh | User-defined protocols                                       |
>> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>>
>> Details of the protocols are not in the scope of this specification.
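To make the split across the two Class Code bytes concrete, a driver could reassemble and classify the protocol type like this (byte positions are from the text above; function names are invented):

```c
#include <assert.h>
#include <stdint.h>

/* The 16-bit protocol type is split across two Class Code bytes:
 * the interface byte (config offset 09h) holds bits 0-7, the
 * sub-class byte (offset 0Ah) holds bits 8-15. */
static inline uint16_t ivshmem_protocol_type(uint8_t interface,
                                             uint8_t sub_class)
{
    return (uint16_t)(((uint16_t)sub_class << 8) | interface);
}

/* Range check for the Virtio-over-Shared-Memory front-end peers. */
static inline int ivshmem_is_virtio_frontend(uint16_t proto)
{
    return proto >= 0x8000 && proto <= 0xBFFF;
}
```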
> 
> Otherwise it reads fine to me.  I'm sure some of my confusion is from
> not being familiar with the details of PCI devices.
> 

We could link into chapters of PCI spec directly if that helps.

Thanks for the feedback!

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
  2020-05-25  7:58 ` [virtio-comment] " Jan Kiszka
@ 2020-07-15 13:27   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 13:27 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel,
	virtio-comment, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 22088 bytes --]

On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> IVSHMEM Device Specification
> ============================
> 
> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **

Hi Jan,
Thanks for posting this! I have posted comments where I wasn't sure
what the spec meant.

> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
> define the minimally needed building blocks a hypervisor has to
> provide for enabling guest-to-guest communication. The details of
> communication protocols shall remain opaque to the hypervisor so that
> guests are free to define them as simple or sophisticated as they
> need.
> 
> For that purpose, the IVSHMEM provides the following features to its
> users:
> 
> - Interconnection between up to 65536 peers
> 
> - Multi-purpose shared memory region
> 
>     - common read/writable section
> 
>     - output sections that are read/writable for one peer and only
>       readable for the others
> 
>     - section with peer states
> 
> - Event signaling via interrupt to remote sides
> 
> - Support for life-cycle management via state value exchange and
>   interrupt notification on changes, backed by a shared memory
>   section
> 
> - Free choice of protocol to be used on top
> 
> - Protocol type declaration
> 
> - Register can be implemented either memory-mapped or via I/O,
>   depending on platform support and lower VM-exit costs
> 
> - Unprivileged access to memory-mapped or I/O registers feasible
> 
> - Single discovery and configuration via standard PCI, no complexity
>   by additionally defining a platform device model
> 
> 
> Hypervisor Model
> ----------------
> 
> In order to provide a consistent link between peers, all connected
> instances of IVSHMEM devices need to be configured, created and run
> by the hypervisor according to the following requirements:
> 
> - The instances of the device shall appear as a PCI device to their
>   users.
> 
> - The read/write shared memory section has to be of the same size for
>   all peers. The size can be zero.
> 
> - If shared memory output sections are present (non-zero section
>   size), there must be one reserved for each peer with exclusive
>   write access. All output sections must have the same size and must
>   be readable for all peers.
> 
> - The State Table must have the same size for all peers, must be
>   large enough to hold the state values of all peers, and must be
>   read-only for the user.

Who/what is the "user"? I guess this simply means that the State Table
is read-only and only the hypervisor can update the table entries?

> - State register changes (explicit writes, peer resets) have to be
>   propagated to the other peers by updating the corresponding State
>   Table entry and issuing an interrupt to all other peers if they
>   enabled reception.
> 
> - Interrupt events triggered by a peer have to be delivered to the
>   target peer, provided the receiving side is valid and has enabled
>   the reception.
> 
> - All peers must have the same interrupt delivery features available,
>   i.e. MSI-X with the same maximum number of vectors on platforms
>   supporting this mechanism, otherwise INTx with one vector.
> 
> 
> Guest-side Programming Model
> ----------------------------
> 
> An IVSHMEM device appears as a PCI device to its users. Unless
> otherwise noted, it conforms to the PCI Local Bus Specification,
> Revision 3.0. As such, it is discoverable via the PCI configuration
> space and provides a number of standard and custom PCI configuration
> registers.
> 
> ### Shared Memory Region Layout
> 
> The shared memory region is divided into several sections.
> 
>     +-----------------------------+   -
>     |                             |   :
>     | Output Section for peer n-1 |   : Output Section Size
>     |     (n = Maximum Peers)     |   :
>     +-----------------------------+   -
>     :                             :
>     :                             :
>     :                             :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 1  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 0  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |     Read/Write Section      |   : R/W Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |         State Table         |   : State Table Size
>     |                             |   :
>     +-----------------------------+   <-- Shared memory base address
> 
> The first section consists of the mandatory State Table. Its size is
> defined by the State Table Size register and cannot be zero. This
> section is read-only for all peers.
> 
> The second section consists of shared memory that is read/writable
> for all peers. Its size is defined by the R/W Section Size register.
> A size of zero is permitted.
> 
> The third and following sections are output sections, one for each
> peer. Their sizes are all identical. The size of a single output
> section is defined by the Output Section Size register. An output
> section is read/writable for the corresponding peer and read-only for
> all other peers. E.g., only the peer with ID 3 can write to the
> fourths output section, but all peers can read from this section.

s/fourths/fourth/

> 
> All sizes have to be rounded up to multiples of a mappable page in
> order to allow access control according to the section restrictions.

The host's mappable page size? I guess the guest's page size doesn't
matter here.
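Either way, once the three sizes are known (and already rounded up to page multiples, as the spec requires), the section offsets follow mechanically from the layout diagram; a sketch with invented names:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the common read/write section: it starts right after the
 * State Table at the shared memory base. */
static inline uint64_t ivshmem_rw_offset(uint64_t state_table_size)
{
    return state_table_size;
}

/* Offset of a given peer's output section: output sections follow the
 * R/W section, one per peer, all of identical size. Sizes are assumed
 * to be page-rounded already. */
static inline uint64_t ivshmem_output_offset(uint64_t state_table_size,
                                             uint64_t rw_size,
                                             uint64_t output_size,
                                             uint32_t peer_id)
{
    return state_table_size + rw_size + (uint64_t)peer_id * output_size;
}
```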

> 
> ### Configuration Space Registers
> 
> #### Header Registers
> 
> | Offset | Register               | Content                                              |
> |-------:|:-----------------------|:-----------------------------------------------------|
> |    00h | Vendor ID              | 110Ah                                                |
> |    02h | Device ID              | 4106h                                                |
> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
> |        |                        | Bit 3: Bus Master                                    |
> |        |                        | Bit 10: INTx interrupt disable                       |
> |        |                        | Writes to other bits are ignored                     |
> |    06h | Status Register        | 0010h, static value                                  |
> |        |                        | In deviation from the PCI specification, the        |
> |        |                        | Status (bit 3) is never set                          |
> |    08h | Revision ID            | 00h                                                  |
> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
> |    0Bh | Class Code, Base Class | FFh                                                  |
> |    0Eh | Header Type            | 00h                                                  |
> |    10h | BAR 0                  | MMIO or I/O register region                          |
> |    14h | BAR 1                  | MSI-X region                                         |
> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
> |    34h | Capability Pointer     | First capability                                     |
> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
> 
> The INTx status bit is never set by an implementation. Users of the
> IVSHMEM device are instead expected to derive the event state from
> protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
> 
> If BAR 2 is not present, the shared memory region is not relocatable
> by the user. In that case, the hypervisor has to implement the Base
> Address register in the vendor-specific capability.

What does relocatable mean in this context?

> 
> Subsystem IDs shall encode the provider (hypervisor) in order to
> allow identifying potential deviating implementations in case this
> should ever be required.
> 
> If its platform supports MSI-X, an implementation of the IVSHMEM
> device must provide this interrupt model and must not expose INTx
> support.
> 
> Other header registers need not be implemented. If not implemented,
> they return 0 on read and ignore write accesses.
> 
> #### Vendor Specific Capability (ID 09h)
> 
> This capability must always be present.
> 
> | Offset | Register            | Content                                        |
> |-------:|:--------------------|:-----------------------------------------------|
> |    00h | ID                  | 09h                                            |
> |    01h | Next Capability     | Pointer to next capability or 00h              |
> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> |    04h | State Table Size    | 32-bit size of read-only State Table           |
> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> |    10h | Output Section Size | 64-bit size of output sections                 |
> |    18h | Base Address        | optional: 64-bit base address of shared memory |
> 
> All registers are read-only. Writes are ignored, except to bit 0 of
> the Privileged Control register.
> 
> When bit 0 in the Privileged Control register is set to 1, the device
> clears bit 0 in the Interrupt Control register on each interrupt
> delivery.

The Interrupt Control register has not been defined yet at this point in
the spec. Maybe you can rearrange this or include a reference to the
section on the Interrupt Control register.

> This enables automatic interrupt throttling when
> re-enabling shall be performed by a scheduled unprivileged instance
> on the user side.

This last sentence is hard to parse.

I guess the flow is:

1. Guest sets Interrupt Control Bit 0 to 1 to enable interrupts from a
   peer.
2. Peer writes doorbell and the hypervisor clears Interrupt Control Bit
   0 and raises the interrupt.
3. The guest's interrupt handler is scheduled at a later point in time.
   If any additional doorbell writes occur, they will not result in
   additional interrupts until the guest sets Interrupt Control again.

Does this mean the peer still takes a vmexit writing to the doorbell
register but the hypervisor ignores the write?

VIRTIO exposes whether notifications are desired in shared memory. That
way the peer can skip the doorbell write entirely (saving a vmexit).
This is a fast approach for software device implementations (slow for
hardware implementations because they would need to do a DMA read).
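The flow in steps 1-3 can be modeled in a few lines. This is a toy model only, with invented names, and it assumes the hypervisor clears the enable bit before raising the event in one-shot mode, which is my reading of the draft:

```c
#include <assert.h>
#include <stdint.h>

struct ivshmem_model {
    uint32_t int_control;   /* bit 0: interrupt enable */
    int one_shot;           /* Privileged Control bit 0 */
};

/* Model one doorbell write from a peer; returns 1 if an interrupt
 * is raised at the target. In one-shot mode, delivery auto-disables
 * further interrupts until the guest re-arms Interrupt Control. */
static int model_doorbell_write(struct ivshmem_model *dev)
{
    if (!(dev->int_control & 1))
        return 0;                 /* reception disabled: no effect */
    if (dev->one_shot)
        dev->int_control &= ~1u;  /* throttle until re-armed */
    return 1;
}

/* Deliver a burst of doorbell writes and count raised interrupts. */
static int model_count_interrupts(int one_shot, int writes)
{
    struct ivshmem_model dev = { .int_control = 1, .one_shot = one_shot };
    int raised = 0;
    for (int i = 0; i < writes; i++)
        raised += model_doorbell_write(&dev);
    return raised;
}
```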

> 
> An IVSHMEM device may not support a relocatable shared memory region.
> This supports the hypervisor in locking down the guest-to-host address
> mapping and simplifies the runtime logic. In such a case, BAR 2 must
> not be implemented by the hypervisor. Instead, the Base Address
> register has to be implemented to report the location of the shared
> memory region in the user's address space.

This paragraph is confusing. There seem to be two concepts:

1. Relocatable memory. Placing shared memory at a fixed addresses
   eliminates the need for address translation. But this seems like a
   security issue if a peer can make me access an arbitrary address, why
   not just use relative addresses from the start of the shared memory
   region?

2. Location of shared memory. BAR 2 is a regular PCI bar and the memory
   is located on the device. In the non-BAR 2 case I think the spec says
   the shared memory is located in guest RAM instead?

> A non-existing shared memory section has to report zero in its
> Section Size register.
> 
> #### MSI-X Capability (ID 11h)
> 
> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
> delivery via this mechanism. In that case, the MSI-X capability is
> present while the legacy INTx delivery mechanism is not available,
> and the Interrupt Pin configuration register returns 0.
> 
> The IVSHMEM device has no notion of pending interrupts. Therefore,
> reading from the MSI-X Pending Bit Array will always return 0. Users
> of the IVSHMEM device are instead expected to derive the event state
> from protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.

Isn't the PBA just used for masked MSI-X interrupts? That is a standard
MSI-X feature.

On a minimal system that doesn't use masking it makes sense to strip
down the implementation, but I think the spec should allow for PCI
compliance.

> The corresponding MSI-X MMIO region is configured via BAR 1.
> 
> The MSI-X table size reported by the MSI-X capability structure is
> identical for all peers.
> 
> ### Register Region
> 
> The register region may be implemented as MMIO or I/O.

Is that "either MMIO or I/O" or "MMIO and/or I/O"?

Drivers and devices will have to implement both anyway to be compatible
with all other devices and drivers, respectively. But allowing both MMIO
and I/O at the same time ensures that a device will work with any
driver.

> When implementing it as MMIO, the hypervisor has to ensure that the
> register region can be mapped as a single page into the address space
> of the user, without causing potential overlaps with other resources.

Can "page size" be replaced with a specific value like 4 KB? In general
devices shouldn't know about CPU MMU page sizes because they vary.

> Write accesses to MMIO region offsets that are not backed by
> registers have to be ignored, read accesses have to return 0. This
> enables the user to hand out the complete region, along with the
> shared memory, to an unprivileged instance.
> 
> The region location in the user's physical address space is
> configured via BAR 0. The following table visualizes the region
> layout:
> 
> | Offset | Register                                                            |
> |-------:|:--------------------------------------------------------------------|
> |    00h | ID                                                                  |
> |    04h | Maximum Peers                                                       |
> |    08h | Interrupt Control                                                   |
> |    0Ch | Doorbell                                                            |
> |    10h | State                                                               |
> 
> All registers support only aligned 32-bit accesses.
> 
> #### ID Register (Offset 00h)
> 
> Read-only register that reports the ID of the local device. It is
> unique for all of the connected devices and remains unchanged over
> their lifetime.

What is the purpose of the ID?

> #### Maximum Peers Register (Offset 04h)
> 
> Read-only register that reports the maximum number of possible peers
> (including the local one). The permitted range is between 2 and 65536
> and remains constant over the lifetime of all peers.
> 
> #### Interrupt Control Register (Offset 08h)
> 
> This read/write register controls the generation of interrupts
> whenever a peer writes to the Doorbell register or changes its state.
> 
> | Bits | Content                                                               |
> |-----:|:----------------------------------------------------------------------|
> |    0 | 1: Enable interrupt generation                                        |
> | 1-31 | Reserved (0 on read, writes ignored)                                  |
> 
> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
> interrupt mode is enabled in the Enhanced Features register.
> 
> The value of this register after device reset is 0.

This is a global interrupt enable/disable for all vectors?

> #### Doorbell Register (Offset 0Ch)
> 
> Write-only register that triggers an interrupt vector in the target
> device if it is enabled there.
> 
> | Bits  | Content                                                              |
> |------:|:---------------------------------------------------------------------|
> |  0-15 | Vector number                                                        |
> | 16-31 | Target ID                                                            |

s/Target ID/Peer/ ?

> 
> Writing a vector number that is not enabled by the target has no
> effect. The peers can derive the number of available vectors from
> their own device capabilities because the provider is required to

What is the "provider"? The hypervisor?

> expose an identical number of vectors to all connected peers. The
> peers are expected to define or negotiate the used ones via the

s/ones/vectors/ is clearer

> selected protocol.
> 
> Addressing a non-existing or inactive target has no effect. Peers can
> identify active targets via the State Table.
> 
> Implementations of the Doorbell register must ensure that data written by the
> CPU prior to issuing the register write is visible to the receiving peer before
> the interrupt arrives.

Physical devices have no control over CPU memory ordering. Normally the
drivers have the necessary memory barriers. Why is it the device's
responsibility to do this in IVSHMEM v2?
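On the driver side, the usual pattern for the ordering requirement quoted above would look roughly like this (an illustrative sketch with invented names; a C11 release fence stands in for the kernel's write barrier, and whether the device must additionally enforce the ordering is exactly the open question here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Ring the doorbell after making the payload in shared memory visible:
 * the release fence orders all prior stores before the register write
 * that triggers the peer's interrupt. */
static inline void ivshmem_ring_doorbell(volatile uint32_t *doorbell_reg,
                                         uint16_t vector, uint16_t target_id)
{
    atomic_thread_fence(memory_order_release);
    *doorbell_reg = ((uint32_t)target_id << 16) | vector;
}
```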

> 
> The behavior on reading from this register is undefined.
> 
> #### State Register (Offset 10h)
> 
> Read/write register that defines the state of the local device.
> Writing to this register sets the state and triggers MSI-X vector 0
> or the INTx interrupt, respectively, on the remote device if the
> written state value differs from the previous one. Users of peer
> devices can read the value written to this register from the State
> Table. They are expected to differentiate state change interrupts from
> doorbell events by comparing the new state value with a locally
> stored copy.

Does this mean drivers using INTx must compare the state table entries
for their peers on each interrupt to differentiate from doorbell events?
That seems inefficient.
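For reference, the comparison against a locally stored copy would look roughly like this (illustrative only, names invented; it diffs the State Table against a cache and updates the cache in place):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if any peer's state differs from the cached copy (i.e. the
 * interrupt was a state change), 0 otherwise (a doorbell event). */
static int ivshmem_detect_state_change(const volatile uint32_t *state_table,
                                       uint32_t *cached, uint32_t num_peers)
{
    int changed = 0;
    for (uint32_t i = 0; i < num_peers; i++) {
        uint32_t cur = state_table[i];
        if (cur != cached[i]) {
            cached[i] = cur;
            changed = 1;
        }
    }
    return changed;
}
```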

> 
> The value of this register after device reset is 0. The semantic of
> all other values can be defined freely by the chosen protocol.
> 
> ### State Table
> 
> The State Table is a read-only section at the beginning of the shared
> memory region. It contains a 32-bit state value for each of the
> peers. Locating the table in shared memory allows fast checking of
> remote states without register accesses.
> 
> The table is updated on each state change of a peer. Whenever a user
> of an IVSHMEM device writes a value to the Local State register, this
> value is copied into the corresponding entry of the State Table. When
> an IVSHMEM device is reset or disconnected from the other peers, zero
> is written into the corresponding table entry. The initial content of
> the table is all zeros.
> 
>     +--------------------------------+
>     | 32-bit state value of peer n-1 |
>     +--------------------------------+
>     :                                :
>     +--------------------------------+
>     | 32-bit state value of peer 1   |
>     +--------------------------------+
>     | 32-bit state value of peer 0   |
>     +--------------------------------+ <-- Shared memory base address
> 
> 
> Protocols
> ---------
> 
> The IVSHMEM device shall support the peers of a connection in
> agreeing on the protocol used over the shared memory devices. For
> that purpose, the interface byte (offset 09h) and the sub-class byte
> (offset 0Ah) of the Class Code register encode a 16-bit protocol
> type for the users. The following type values are defined:
> 
> | Protocol Type | Description                                                  |
> |--------------:|:-------------------------------------------------------------|
> |         0000h | Undefined type                                               |
> |         0001h | Virtual peer-to-peer Ethernet                                |
> |   0002h-3FFFh | Reserved                                                     |
> |   4000h-7FFFh | User-defined protocols                                       |
> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
> 
> Details of the protocols are not in the scope of this specification.

If I understand correctly the reason for creating IVSHMEM v2 is to
support non-VIRTIO devices? If you just want VIRTIO devices then IVSHMEM
v2 + VIRTIO is more complex than a pure VIRTIO solution. But if you
don't want VIRTIO then IVSHMEM + a minimal device can be simpler than
VIRTIO.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
@ 2020-07-15 13:27   ` Stefan Hajnoczi
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 13:27 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: virtio-comment, Jailhouse, qemu-devel, Michael S. Tsirkin,
	Alex Bennée, liang yan

[-- Attachment #1: Type: text/plain, Size: 22088 bytes --]

On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> IVSHMEM Device Specification
> ============================
> 
> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **

Hi Jan,
Thanks for posting this! I have a posted comments where I wasn't sure
what the spec meant.

> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
> define the minimally needed building blocks a hypervisor has to
> provide for enabling guest-to-guest communication. The details of
> communication protocols shall remain opaque to the hypervisor so that
> guests are free to define them as simple or sophisticated as they
> need.
> 
> For that purpose, the IVSHMEM provides the following features to its
> users:
> 
> - Interconnection between up to 65536 peers
> 
> - Multi-purpose shared memory region
> 
>     - common read/writable section
> 
>     - output sections that are read/writable for one peer and only
>       readable for the others
> 
>     - section with peer states
> 
> - Event signaling via interrupt to remote sides
> 
> - Support for life-cycle management via state value exchange and
>   interrupt notification on changes, backed by a shared memory
>   section
> 
> - Free choice of protocol to be used on top
> 
> - Protocol type declaration
> 
> - Register can be implemented either memory-mapped or via I/O,
>   depending on platform support and lower VM-exit costs
> 
> - Unprivileged access to memory-mapped or I/O registers feasible
> 
> - Single discovery and configuration via standard PCI, no complexity
>   by additionally defining a platform device model
> 
> 
> Hypervisor Model
> ----------------
> 
> In order to provide a consistent link between peers, all connected
> instances of IVSHMEM devices need to be configured, created and run
> by the hypervisor according to the following requirements:
> 
> - The instances of the device shall appear as a PCI device to their
>   users.
> 
> - The read/write shared memory section has to be of the same size for
>   all peers. The size can be zero.
> 
> - If shared memory output sections are present (non-zero section
>   size), there must be one reserved for each peer with exclusive
>   write access. All output sections must have the same size and must
>   be readable for all peers.
> 
> - The State Table must have the same size for all peers, must be
>   large enough to hold the state values of all peers, and must be
>   read-only for the user.

Who/what is the "user"? I guess this simply means that the State Table
is read-only and only the hypervisor can update the table entries?

> - State register changes (explicit writes, peer resets) have to be
>   propagated to the other peers by updating the corresponding State
>   Table entry and issuing an interrupt to all other peers if they
>   enabled reception.
> 
> - Interrupt events triggered by a peer have to be delivered to the
>   target peer, provided the receiving side is valid and has enabled
>   the reception.
> 
> - All peers must have the same interrupt delivery features available,
>   i.e. MSI-X with the same maximum number of vectors on platforms
>   supporting this mechanism, otherwise INTx with one vector.
> 
> 
> Guest-side Programming Model
> ----------------------------
> 
> An IVSHMEM device appears as a PCI device to its users. Unless
> otherwise noted, it conforms to the PCI Local Bus Specification,
> Revision 3.0. As such, it is discoverable via the PCI configuration
> space and provides a number of standard and custom PCI configuration
> registers.
> 
> ### Shared Memory Region Layout
> 
> The shared memory region is divided into several sections.
> 
>     +-----------------------------+   -
>     |                             |   :
>     | Output Section for peer n-1 |   : Output Section Size
>     |     (n = Maximum Peers)     |   :
>     +-----------------------------+   -
>     :                             :
>     :                             :
>     :                             :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 1  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |  Output Section for peer 0  |   : Output Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |     Read/Write Section      |   : R/W Section Size
>     |                             |   :
>     +-----------------------------+   -
>     |                             |   :
>     |         State Table         |   : State Table Size
>     |                             |   :
>     +-----------------------------+   <-- Shared memory base address
> 
> The first section consists of the mandatory State Table. Its size is
> defined by the State Table Size register and cannot be zero. This
> section is read-only for all peers.
> 
> The second section consists of shared memory that is read/writable
> for all peers. Its size is defined by the R/W Section Size register.
> A size of zero is permitted.
> 
> The third and following sections are output sections, one for each
> peer. Their sizes are all identical. The size of a single output
> section is defined by the Output Section Size register. An output
> section is read/writable for the corresponding peer and read-only for
> all other peers. E.g., only the peer with ID 3 can write to the
> fourths output section, but all peers can read from this section.

s/fourths/fourth/

> 
> All sizes have to be rounded up to multiples of a mappable page in
> order to allow access control according to the section restrictions.

The host's mappable page size? I guess the guest's page size doesn't
matter here.
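To make sure I read the layout right, here is a sketch of how a guest
might compute the section offsets from the three size registers (all
names and the 4 KiB page size are my assumptions, not from the spec):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size; the spec only says "a mappable page". */
#define IVSHMEM_PAGE_SIZE 4096ULL

static uint64_t page_align(uint64_t size)
{
    return (size + IVSHMEM_PAGE_SIZE - 1) & ~(IVSHMEM_PAGE_SIZE - 1);
}

/* Hypothetical helper struct: offsets from the shared memory base. */
struct ivshmem_layout {
    uint64_t state_table;   /* always at offset 0 */
    uint64_t rw_section;
    uint64_t output_base;   /* start of output section for peer 0 */
};

static struct ivshmem_layout compute_layout(uint32_t state_table_size,
                                            uint64_t rw_size,
                                            uint64_t out_size)
{
    struct ivshmem_layout l;

    l.state_table = 0;
    l.rw_section  = page_align(state_table_size);
    l.output_base = l.rw_section + page_align(rw_size);
    /* peer n's section would be output_base + n * page_align(out_size) */
    (void)out_size;
    return l;
}
```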

> 
> ### Configuration Space Registers
> 
> #### Header Registers
> 
> | Offset | Register               | Content                                              |
> |-------:|:-----------------------|:-----------------------------------------------------|
> |    00h | Vendor ID              | 110Ah                                                |
> |    02h | Device ID              | 4106h                                                |
> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
> |        |                        | Bit 3: Bus Master                                    |
> |        |                        | Bit 10: INTx interrupt disable                       |
> |        |                        | Writes to other bits are ignored                     |
> |    06h | Status Register        | 0010h, static value                                  |
> |        |                        | In deviation to the PCI specification, the Interrupt |
> |        |                        | Status (bit 3) is never set                          |
> |    08h | Revision ID            | 00h                                                  |
> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
> |    0Bh | Class Code, Base Class | FFh                                                  |
> |    0Eh | Header Type            | 00h                                                  |
> |    10h | BAR 0                  | MMIO or I/O register region                          |
> |    14h | BAR 1                  | MSI-X region                                         |
> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
> |    34h | Capability Pointer     | First capability                                     |
> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
> 
> The INTx status bit is never set by an implementation. Users of the
> IVSHMEM device are instead expected to derive the event state from
> protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.
> 
> If BAR 2 is not present, the shared memory region is not relocatable
> by the user. In that case, the hypervisor has to implement the Base
> Address register in the vendor-specific capability.

What does relocatable mean in this context?

> 
> Subsystem IDs shall encode the provider (hypervisor) in order to
> allow identifying potential deviating implementations in case this
> should ever be required.
> 
> If its platform supports MSI-X, an implementation of the IVSHMEM
> device must provide this interrupt model and must not expose INTx
> support.
> 
> Other header registers may not be implemented. If not implemented,
> they return 0 on read and ignore write accesses.
> 
> #### Vendor Specific Capability (ID 09h)
> 
> This capability must always be present.
> 
> | Offset | Register            | Content                                        |
> |-------:|:--------------------|:-----------------------------------------------|
> |    00h | ID                  | 09h                                            |
> |    01h | Next Capability     | Pointer to next capability or 00h              |
> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> |    04h | State Table Size    | 32-bit size of read-only State Table           |
> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> |    10h | Output Section Size | 64-bit size of output sections                 |
> |    18h | Base Address        | optional: 64-bit base address of shared memory |
> 
> All registers are read-only. Writes are ignored, except to bit 0 of
> the Privileged Control register.
> 
> When bit 0 in the Privileged Control register is set to 1, the device
> clears bit 0 in the Interrupt Control register on each interrupt
> delivery.

The Interrupt Control register has not been defined yet at this point in
the spec. Maybe you can rearrange this or include a reference to the
section on the Interrupt Control register.

> This enables automatic interrupt throttling when
> re-enabling shall be performed by a scheduled unprivileged instance
> on the user side.

This last sentence is hard to parse.

I guess the flow is:

1. Guest sets Interrupt Control Bit 0 to 1 to enable interrupts from a
   peer.
2. Peer writes doorbell and the hypervisor clears Interrupt Control Bit
   0 and raises the interrupt.
3. The guest's interrupt handler is scheduled at a later point in time.
   If any additional doorbell writes occur, they will not result in
   additional interrupts until the guest sets Interrupt Control again.

Does this mean the peer still takes a vmexit writing to the doorbell
register but the hypervisor ignores the write?
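A toy model of the flow above, just to check my understanding (all
names invented; irq_enabled stands for Interrupt Control bit 0,
one_shot for Privileged Control bit 0):

```c
#include <assert.h>

/* Minimal simulation of one device instance as seen by the hypervisor. */
struct ivshmem_model {
    int one_shot;            /* Privileged Control bit 0 */
    int irq_enabled;         /* Interrupt Control bit 0 */
    unsigned irqs_delivered;
};

/* Hypervisor side: a peer wrote the doorbell for this device. */
static void doorbell_write(struct ivshmem_model *d)
{
    if (!d->irq_enabled)
        return;              /* write is ignored, no interrupt raised */
    d->irqs_delivered++;
    if (d->one_shot)
        d->irq_enabled = 0;  /* throttled until the guest re-enables */
}

/* Guest side: the interrupt handler runs later and re-arms delivery. */
static void guest_irq_handler(struct ivshmem_model *d)
{
    /* ... process protocol state from shared memory ... */
    d->irq_enabled = 1;
}
```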

VIRTIO exposes whether notifications are desired in shared memory. That
way the peer can skip the doorbell write entirely (saving a vmexit).
This is a fast approach for software device implementations (slow for
hardware implementations because they would need to do a DMA read).

> 
> An IVSHMEM device may not support a relocatable shared memory region.
> This support the hypervisor in locking down the guest-to-host address
> mapping and simplifies the runtime logic. In such a case, BAR 2 must
> not be implemented by the hypervisor. Instead, the Base Address
> register has to be implemented to report the location of the shared
> memory region in the user's address space.

This paragraph is confusing. There seem to be two concepts:

1. Relocatable memory. Placing shared memory at a fixed address
   eliminates the need for address translation. But this seems like a
   security issue if a peer can make me access an arbitrary address; why
   not just use relative addresses from the start of the shared memory
   region?

2. Location of shared memory. BAR 2 is a regular PCI bar and the memory
   is located on the device. In the non-BAR 2 case I think the spec says
   the shared memory is located in guest RAM instead?

> A non-existing shared memory section has to report zero in its
> Section Size register.
> 
> #### MSI-X Capability (ID 11h)
> 
> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
> delivery via this mechanism. In that case, the MSI-X capability is
> present while the legacy INTx delivery mechanism is not available,
> and the Interrupt Pin configuration register returns 0.
> 
> The IVSHMEM device has no notion of pending interrupts. Therefore,
> reading from the MSI-X Pending Bit Array will always return 0. Users
> of the IVSHMEM device are instead expected to derive the event state
> from protocol-specific information kept in the shared memory. This
> approach is significantly faster, and the complexity of
> register-based status tracking can be avoided.

Isn't the PBA just used for masked MSI-X interrupts? That is a standard
MSI-X feature.

On a minimal system that doesn't use masking it makes sense to strip
down the implementation, but I think the spec should allow for PCI
compliance.

> The corresponding MSI-X MMIO region is configured via BAR 1.
> 
> The MSI-X table size reported by the MSI-X capability structure is
> identical for all peers.
> 
> ### Register Region
> 
> The register region may be implemented as MMIO or I/O.

Is that "either MMIO or I/O" or "MMIO and/or I/O"?

Drivers and devices will have to implement both anyway to be compatible
with all other devices and drivers, respectively. But allowing both MMIO
and I/O at the same time ensures that a device will work with any
driver.

> When implementing it as MMIO, the hypervisor has to ensure that the
> register region can be mapped as a single page into the address space
> of the user, without causing potential overlaps with other resources.

Can "page size" be replaced with a specific value like 4 KB? In general
devices shouldn't know about CPU MMU page sizes because they vary.

> Write accesses to MMIO region offsets that are not backed by
> registers have to be ignored, read accesses have to return 0. This
> enables the user to hand out the complete region, along with the
> shared memory, to an unprivileged instance.
> 
> The region location in the user's physical address space is
> configured via BAR 0. The following table visualizes the region
> layout:
> 
> | Offset | Register                                                            |
> |-------:|:--------------------------------------------------------------------|
> |    00h | ID                                                                  |
> |    04h | Maximum Peers                                                       |
> |    08h | Interrupt Control                                                   |
> |    0Ch | Doorbell                                                            |
> |    10h | State                                                               |
> 
> All registers support only aligned 32-bit accesses.
> 
> #### ID Register (Offset 00h)
> 
> Read-only register that reports the ID of the local device. It is
> unique for all of the connected devices and remains unchanged over
> their lifetime.

What is the purpose of the ID?

> #### Maximum Peers Register (Offset 04h)
> 
> Read-only register that reports the maximum number of possible peers
> (including the local one). The permitted range is between 2 and 65536
> and remains constant over the lifetime of all peers.
> 
> #### Interrupt Control Register (Offset 08h)
> 
> This read/write register controls the generation of interrupts
> whenever a peer writes to the Doorbell register or changes its state.
> 
> | Bits | Content                                                               |
> |-----:|:----------------------------------------------------------------------|
> |    0 | 1: Enable interrupt generation                                        |
> | 1-31 | Reserved (0 on read, writes ignored)                                  |
> 
> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
> interrupt mode is enabled in the Enhanced Features register.
> 
> The value of this register after device reset is 0.

This is a global interrupt enable/disable for all vectors?

> #### Doorbell Register (Offset 0Ch)
> 
> Write-only register that triggers an interrupt vector in the target
> device if it is enabled there.
> 
> | Bits  | Content                                                              |
> |------:|:---------------------------------------------------------------------|
> |  0-15 | Vector number                                                        |
> | 16-31 | Target ID                                                            |

s/Target ID/Peer/ ?

> 
> Writing a vector number that is not enabled by the target has no
> effect. The peers can derive the number of available vectors from
> their own device capabilities because the provider is required to

What is the "provider"? The hypervisor?

> expose an identical number of vectors to all connected peers. The
> peers are expected to define or negotiate the used ones via the

s/ones/vectors/ is clearer

> selected protocol.
> 
> Addressing a non-existing or inactive target has no effect. Peers can
> identify active targets via the State Table.
> 
> Implementations of the Doorbell register must ensure that data written by the
> CPU prior to issuing the register write is visible to the receiving peer before
> the interrupt arrives.

Physical devices have no control over CPU memory ordering. Normally the
drivers have the necessary memory barriers. Why is it the device's
responsibility to do this in IVSHMEM v2?
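For comparison, a driver-side sketch of the kick with the barrier on
the CPU side, which is where it normally lives (register offsets per
the table above; the fence and helper names are illustrative, not
anything the spec defines):

```c
#include <assert.h>
#include <stdint.h>

/* Encode a doorbell value: bits 16-31 target ID, bits 0-15 vector. */
static inline uint32_t doorbell_value(uint16_t target_id, uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}

/* regs would be the mapped register region from BAR 0. The driver,
 * not the device, ensures the shared memory payload is visible to the
 * peer before the doorbell write. */
static void kick_peer(volatile uint32_t *regs, uint16_t target, uint16_t vec)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);  /* order payload before kick */
    regs[0x0C / 4] = doorbell_value(target, vec);  /* Doorbell at 0Ch */
}
```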

> 
> The behavior on reading from this register is undefined.
> 
> #### State Register (Offset 10h)
> 
> Read/write register that defines the state of the local device.
> Writing to this register sets the state and triggers MSI-X vector 0
> or the INTx interrupt, respectively, on the remote device if the
> written state value differs from the previous one. Users of peer
> devices can read the value written to this register from the State
> Table. They are expected to differentiate state change interrupts from
> doorbell events by comparing the new state value with a locally
> stored copy.

Does this mean drivers using INTx must compare the state table entries
for their peers on each interrupt to differentiate from doorbell events?
That seems inefficient.
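In other words, an INTx handler would have to do something like this on
every interrupt (a sketch, names invented; returns 0 when the interrupt
was only a doorbell event):

```c
#include <assert.h>
#include <stdint.h>

/* Compare the read-only State Table against a locally cached copy to
 * tell state changes apart from doorbell events. Returns the number of
 * peers whose state changed since the last call. */
static unsigned check_state_changes(const volatile uint32_t *state_table,
                                    uint32_t *cached, unsigned max_peers)
{
    unsigned changed = 0;

    for (unsigned i = 0; i < max_peers; i++) {
        uint32_t s = state_table[i];
        if (s != cached[i]) {
            cached[i] = s;
            changed++;  /* peer i's state transition handled here */
        }
    }
    return changed;
}
```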

> 
> The value of this register after device reset is 0. The semantic of
> all other values can be defined freely by the chosen protocol.
> 
> ### State Table
> 
> The State Table is a read-only section at the beginning of the shared
> memory region. It contains a 32-bit state value for each of the
> peers. Locating the table in shared memory allows fast checking of
> remote states without register accesses.
> 
> The table is updated on each state change of a peer. Whenever a user
> of an IVSHMEM device writes a value to the Local State register, this
> value is copied into the corresponding entry of the State Table. When
> an IVSHMEM device is reset or disconnected from the other peers, zero
> is written into the corresponding table entry. The initial content of
> the table is all zeros.
> 
>     +--------------------------------+
>     | 32-bit state value of peer n-1 |
>     +--------------------------------+
>     :                                :
>     +--------------------------------+
>     | 32-bit state value of peer 1   |
>     +--------------------------------+
>     | 32-bit state value of peer 0   |
>     +--------------------------------+ <-- Shared memory base address
> 
> 
> Protocols
> ---------
> 
> The IVSHMEM device shall support the peers of a connection in
> agreeing on the protocol used over the shared memory devices. For
> that purpose, the interface byte (offset 09h) and the sub-class byte
> (offset 0Ah) of the Class Code register encodes a 16-bit protocol
> type for the users. The following type values are defined:
> 
> | Protocol Type | Description                                                  |
> |--------------:|:-------------------------------------------------------------|
> |         0000h | Undefined type                                               |
> |         0001h | Virtual peer-to-peer Ethernet                                |
> |   0002h-3FFFh | Reserved                                                     |
> |   4000h-7FFFh | User-defined protocols                                       |
> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
> 
> Details of the protocols are not in the scope of this specification.
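For reference, decoding the 16-bit protocol type from the two
class-code bytes would look like this (a sketch; function names are
made up):

```c
#include <assert.h>
#include <stdint.h>

/* interface = Class Code byte at 09h (bits 0-7),
 * sub_class = Class Code byte at 0Ah (bits 8-15). */
static uint16_t protocol_type(uint8_t interface, uint8_t sub_class)
{
    return ((uint16_t)sub_class << 8) | interface;
}

/* Range check per the table above: 8000h-BFFFh is a virtio front-end. */
static int is_virtio_frontend(uint16_t proto)
{
    return proto >= 0x8000 && proto <= 0xBFFF;
}
```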

If I understand correctly the reason for creating IVSHMEM v2 is to
support non-VIRTIO devices? If you just want VIRTIO devices then IVSHMEM
v2 + VIRTIO is more complex than a pure VIRTIO solution. But if you
don't want VIRTIO then IVSHMEM + a minimal device can be simpler than
VIRTIO.

Stefan


* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
  2020-07-15 13:27   ` Stefan Hajnoczi
@ 2020-07-17 16:15     ` Jan Kiszka
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-07-17 16:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel,
	virtio-comment, Alex Bennée

On 15.07.20 15:27, Stefan Hajnoczi wrote:
> On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
>> IVSHMEM Device Specification
>> ============================
>>
>> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
> 
> Hi Jan,
> Thanks for posting this! I have posted comments where I wasn't sure
> what the spec meant.
> 
>> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
>> define the minimally needed building blocks a hypervisor has to
>> provide for enabling guest-to-guest communication. The details of
>> communication protocols shall remain opaque to the hypervisor so that
>> guests are free to define them as simple or sophisticated as they
>> need.
>>
>> For that purpose, the IVSHMEM provides the following features to its
>> users:
>>
>> - Interconnection between up to 65536 peers
>>
>> - Multi-purpose shared memory region
>>
>>      - common read/writable section
>>
>>      - output sections that are read/writable for one peer and only
>>        readable for the others
>>
>>      - section with peer states
>>
>> - Event signaling via interrupt to remote sides
>>
>> - Support for life-cycle management via state value exchange and
>>    interrupt notification on changes, backed by a shared memory
>>    section
>>
>> - Free choice of protocol to be used on top
>>
>> - Protocol type declaration
>>
>> - Registers can be implemented either memory-mapped or via I/O,
>>    depending on platform support and lower VM-exit costs
>>
>> - Unprivileged access to memory-mapped or I/O registers feasible
>>
>> - Single discovery and configuration via standard PCI, no complexity
>>    by additionally defining a platform device model
>>
>>
>> Hypervisor Model
>> ----------------
>>
>> In order to provide a consistent link between peers, all connected
>> instances of IVSHMEM devices need to be configured, created and run
>> by the hypervisor according to the following requirements:
>>
>> - The instances of the device shall appear as a PCI device to their
>>    users.
>>
>> - The read/write shared memory section has to be of the same size for
>>    all peers. The size can be zero.
>>
>> - If shared memory output sections are present (non-zero section
>>    size), there must be one reserved for each peer with exclusive
>>    write access. All output sections must have the same size and must
>>    be readable for all peers.
>>
>> - The State Table must have the same size for all peers, must be
>>    large enough to hold the state values of all peers, and must be
>>    read-only for the user.
> 
> Who/what is the "user"? I guess this simply means that the State Table
> is read-only and only the hypervisor can update the table entries?

Read-only for the guest, right. I used the term "user" here and 
elsewhere; switching to "guest" might be better.

> 
>> - State register changes (explicit writes, peer resets) have to be
>>    propagated to the other peers by updating the corresponding State
>>    Table entry and issuing an interrupt to all other peers if they
>>    enabled reception.
>>
>> - Interrupt events triggered by a peer have to be delivered to the
>>    target peer, provided the receiving side is valid and has enabled
>>    the reception.
>>
>> - All peers must have the same interrupt delivery features available,
>>    i.e. MSI-X with the same maximum number of vectors on platforms
>>    supporting this mechanism, otherwise INTx with one vector.
>>
>>
>> Guest-side Programming Model
>> ----------------------------
>>
>> An IVSHMEM device appears as a PCI device to its users. Unless
>> otherwise noted, it conforms to the PCI Local Bus Specification,
>> Revision 3.0. As such, it is discoverable via the PCI configuration
>> space and provides a number of standard and custom PCI configuration
>> registers.
>>
>> ### Shared Memory Region Layout
>>
>> The shared memory region is divided into several sections.
>>
>>      +-----------------------------+   -
>>      |                             |   :
>>      | Output Section for peer n-1 |   : Output Section Size
>>      |     (n = Maximum Peers)     |   :
>>      +-----------------------------+   -
>>      :                             :
>>      :                             :
>>      :                             :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |  Output Section for peer 1  |   : Output Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |  Output Section for peer 0  |   : Output Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |     Read/Write Section      |   : R/W Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |         State Table         |   : State Table Size
>>      |                             |   :
>>      +-----------------------------+   <-- Shared memory base address
>>
>> The first section consists of the mandatory State Table. Its size is
>> defined by the State Table Size register and cannot be zero. This
>> section is read-only for all peers.
>>
>> The second section consists of shared memory that is read/writable
>> for all peers. Its size is defined by the R/W Section Size register.
>> A size of zero is permitted.
>>
>> The third and following sections are output sections, one for each
>> peer. Their sizes are all identical. The size of a single output
>> section is defined by the Output Section Size register. An output
>> section is read/writable for the corresponding peer and read-only for
>> all other peers. E.g., only the peer with ID 3 can write to the
>> fourths output section, but all peers can read from this section.
> 
> s/fourths/fourth/
> 
>>
>> All sizes have to be rounded up to multiples of a mappable page in
>> order to allow access control according to the section restrictions.
> 
> The host's mappable page size? I guess the guest's page size doesn't
> matter here.

It does, in fact. This is to allow unprivileged direct access to memory 
regions (as well as the register space), i.e. to enable the guest OS to 
map the regions without overlaps.

> 
>>
>> ### Configuration Space Registers
>>
>> #### Header Registers
>>
>> | Offset | Register               | Content                                              |
>> |-------:|:-----------------------|:-----------------------------------------------------|
>> |    00h | Vendor ID              | 110Ah                                                |
>> |    02h | Device ID              | 4106h                                                |
>> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
>> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
>> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
>> |        |                        | Bit 3: Bus Master                                    |
>> |        |                        | Bit 10: INTx interrupt disable                       |
>> |        |                        | Writes to other bits are ignored                     |
>> |    06h | Status Register        | 0010h, static value                                  |
>> |        |                        | In deviation to the PCI specification, the Interrupt |
>> |        |                        | Status (bit 3) is never set                          |
>> |    08h | Revision ID            | 00h                                                  |
>> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>> |    0Bh | Class Code, Base Class | FFh                                                  |
>> |    0Eh | Header Type            | 00h                                                  |
>> |    10h | BAR 0                  | MMIO or I/O register region                          |
>> |    14h | BAR 1                  | MSI-X region                                         |
>> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
>> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
>> |    34h | Capability Pointer     | First capability                                     |
>> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>>
>> The INTx status bit is never set by an implementation. Users of the
>> IVSHMEM device are instead expected to derive the event state from
>> protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> If BAR 2 is not present, the shared memory region is not relocatable
>> by the user. In that case, the hypervisor has to implement the Base
>> Address register in the vendor-specific capability.
> 
> What does relocatable mean in this context?

That the guest can decide (via BAR) where the resource should show up in 
the physical guest address space. We do not want to support this in 
setups like static partitioning hypervisors, and in those cases we use 
that side-channel read-only configuration instead.

> 
>>
>> Subsystem IDs shall encode the provider (hypervisor) in order to
>> allow identifying potential deviating implementations in case this
>> should ever be required.
>>
>> If its platform supports MSI-X, an implementation of the IVSHMEM
>> device must provide this interrupt model and must not expose INTx
>> support.
>>
>> Other header registers may not be implemented. If not implemented,
>> they return 0 on read and ignore write accesses.
>>
>> #### Vendor Specific Capability (ID 09h)
>>
>> This capability must always be present.
>>
>> | Offset | Register            | Content                                        |
>> |-------:|:--------------------|:-----------------------------------------------|
>> |    00h | ID                  | 09h                                            |
>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> |    10h | Output Section Size | 64-bit size of output sections                 |
>> |    18h | Base Address        | optional: 64-bit base address of shared memory |
>>
>> All registers are read-only. Writes are ignored, except to bit 0 of
>> the Privileged Control register.
>>
>> When bit 0 in the Privileged Control register is set to 1, the device
>> clears bit 0 in the Interrupt Control register on each interrupt
>> delivery.
> 
> The Interrupt Control register has not be defined yet at this point in
> the spec. Maybe you can rearrange this or include a reference to the
> section on the Interrupt Control register.

I can add a reference.

> 
>> This enables automatic interrupt throttling when
>> re-enabling shall be performed by a scheduled unprivileged instance
>> on the user side.
> 
> This last sentence is hard to parse.
> 
> I guess the flow is:
> 
> 1. Guest sets Interrupt Control Bit 0 to 1 to enable interrupts from a
>     peer.
> 2. Peer writes doorbell and the hypervisor clears Interrupt Control Bit
>     0 and raises the interrupt.
> 3. The guest's interrupt handler is scheduled at a later point in time.
>     If any additional doorbell writes occur then will not result in
>     additional interrupts until the guest sets Interrupt Control again.

Right, that is the behavior. The reasoning is in the second half of my 
short explanation: We want to use this for setups like UIO, where the 
final consumer is an unprivileged application. That application shall 
not be able to cause incoming interrupts at a rate higher than its own 
schedule. And that is achieved by forcing it to re-enable IRQs (via 
directly mapped MMIO registers) when it actually runs again.
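To make that flow concrete, here is a tiny simulation of the one-shot 
mode (purely illustrative; the struct and function names are made up, 
this is not Jailhouse code):

```c
#include <stdint.h>

/* Hypothetical model of the per-device interrupt state. */
struct ivshmem_dev {
    uint32_t intctrl;        /* Interrupt Control, bit 0 = enable */
    int      one_shot;       /* Privileged Control bit 0          */
    unsigned irqs_delivered; /* interrupts actually raised        */
};

/* Peer rings the doorbell: raise an interrupt only if enabled;
 * in one-shot mode the device clears the enable bit on delivery. */
void doorbell_from_peer(struct ivshmem_dev *d)
{
    if (d->intctrl & 1) {
        d->irqs_delivered++;
        if (d->one_shot)
            d->intctrl &= ~1u;
    }
}

/* Simulate: enable once, then three doorbell writes arrive before the
 * unprivileged consumer gets scheduled and re-enables. Returns the
 * number of interrupts that actually reached the guest. */
unsigned demo_one_shot(void)
{
    struct ivshmem_dev d = { .intctrl = 1, .one_shot = 1 };

    doorbell_from_peer(&d); /* delivered, enable auto-cleared */
    doorbell_from_peer(&d); /* throttled */
    doorbell_from_peer(&d); /* throttled */
    d.intctrl |= 1;         /* consumer runs again, re-enables */
    doorbell_from_peer(&d); /* delivered */
    return d.irqs_delivered;
}
```

The point being that the second and third doorbell writes are swallowed 
until the consumer re-enables the interrupt on its own schedule.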

> 
> Does this mean the peer still takes a vmexit writing to the doorbell
> register but the hypervisor ignores the write?

Yes, this brings no optimization for kick-related vmexits. I would 
consider that an application-level thing, doable - like in virtio - via 
protocol-specific fields in the shared memory.

> 
> VIRTIO exposes whether notifications are desired in shared memory. That
> way the peer can skip the doorbell write entirely (saving a vmexit).
> This is a fast approach for software device implementations (slow for
> hardware implementations because they would need to do a DMA read).

Which is fine and should actually work as-is when mapping virtio over 
ivshmem. Other ivshmem protocols may implement their own peer-to-peer 
hints for skipping kicks.

> 
>>
>> An IVSHMEM device may not support a relocatable shared memory region.
>> This support the hypervisor in locking down the guest-to-host address

Maybe better: "This allows the hypervisor to lock down..."

>> mapping and simplifies the runtime logic. In such a case, BAR 2 must
>> not be implemented by the hypervisor. Instead, the Base Address
>> register has to be implemented to report the location of the shared
>> memory region in the user's address space.
> 
> This paragraph is confusing. There seem to be two concepts:
> 
> 1. Relocatable memory. Placing shared memory at a fixed addresses
>     eliminates the need for address translation. But this seems like a
>     security issue if a peer can make me access an arbitrary address, why
>     not just use relative addresses from the start of the shared memory
>     region?

The interpretation of addresses in the shared memory is completely up to 
the protocol used on top of ivshmem, thus is out of scope for this spec. 
The protocol may use relative addresses or some other basis. In fact, it 
should use relative addresses, because this ivshmem spec does not provide 
means to tell all peers whether they see the shared memory at the same 
guest-physical base address.

> 
> 2. Location of shared memory. BAR 2 is a regular PCI bar and the memory
>     is located on the device. In the non-BAR 2 case I think the spec says
>     the shared memory is located in guest RAM instead?

Maybe we should clarify the second aspect: If shared memory is described 
via the side-channel, it is still a resource separate from guest RAM. 
That is at least how Jailhouse implements it, and I do not see a use 
case yet for doing it differently.

> 
>> A non-existing shared memory section has to report zero in its
>> Section Size register.
>>
>> #### MSI-X Capability (ID 11h)
>>
>> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
>> delivery via this mechanism. In that case, the MSI-X capability is
>> present while the legacy INTx delivery mechanism is not available,
>> and the Interrupt Pin configuration register returns 0.
>>
>> The IVSHMEM device has no notion of pending interrupts. Therefore,
>> reading from the MSI-X Pending Bit Array will always return 0. Users
>> of the IVSHMEM device are instead expected to derive the event state
>> from protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
> 
> Isn't the PBA just used for masked MSI-X interrupts? That is a standard
> MSI-X feature.
> 
> On a minimal system that doesn't used masking it makes sense to strip
> down the implementation, but I think the spec should allow for PCI
> compliance.

Well, we are already non-compliant when we allow the PBA to be 
non-functional or when we make INTx an edge interrupt. Both are meant to 
make things virtualization-friendly (low overhead for the guest, low 
effort for the host). If compliance has a strong use case, we can allow 
that, but I would like to see the use case first.

> 
>> The corresponding MSI-X MMIO region is configured via BAR 1.
>>
>> The MSI-X table size reported by the MSI-X capability structure is
>> identical for all peers.
>>
>> ### Register Region
>>
>> The register region may be implemented as MMIO or I/O.
> 
> Is that "either MMIO or I/O" or "MMIO and/or I/O"?

Was thought of as "either ... or". No implementation for I/O exists so 
far, though.

> 
> Drivers and devices will have to implement both anyway to be compatible
> with all other devices and drivers, respectively. But allowing both MMIO
> and I/O at the same time ensures that a device will work with any
> driver.

In fact, I would then rather consider dropping the (so far unused) I/O 
option to avoid having to implement that even more complex case.

> 
>> When implementing it as MMIO, the hypervisor has to ensure that the
>> register region can be mapped as a single page into the address space
>> of the user, without causing potential overlaps with other resources.
> 
> Can "page size" be replaced with a specific value like 4 KB? In general
> devices shouldn't know about CPU MMU page sizes because they vary.

In practice, it's a configuration matter of the hypervisor. The 
hypervisor should also allow the system configurator to specify larger 
sizes for the MMIO region. We just had the case that TI needed 64K for 
the kernel of their J721e SDK (which I suggested they avoid, but for 
different reasons).

> 
>> Write accesses to MMIO region offsets that are not backed by
>> registers have to be ignored, read accesses have to return 0. This
>> enables the user to hand out the complete region, along with the
>> shared memory, to an unprivileged instance.
>>
>> The region location in the user's physical address space is
>> configured via BAR 0. The following table visualizes the region
>> layout:
>>
>> | Offset | Register                                                            |
>> |-------:|:--------------------------------------------------------------------|
>> |    00h | ID                                                                  |
>> |    04h | Maximum Peers                                                       |
>> |    08h | Interrupt Control                                                   |
>> |    0Ch | Doorbell                                                            |
>> |    10h | State                                                               |
>>
>> All registers support only aligned 32-bit accesses.
>>
>> #### ID Register (Offset 00h)
>>
>> Read-only register that reports the ID of the local device. It is
>> unique for all of the connected devices and remains unchanged over
>> their lifetime.
> 
> What is the purpose of the ID?

- find out which is my output region
- differentiate from other peers (regarding input regions, state table
   fields, doorbell targets)
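For the first point, finding the own output region boils down to simple 
arithmetic over the section sizes from the vendor-specific capability, 
following the shared memory layout in the spec (hypothetical helper 
names, illustrative only):

```c
#include <stdint.h>

/* Section sizes as read from the vendor-specific capability
 * (values here are purely illustrative). */
struct ivshmem_layout {
    uint64_t state_table_size;
    uint64_t rw_size;
    uint64_t output_size;
};

/* Offset of peer `id`'s output section from the shared memory base:
 * the State Table comes first, then the common R/W section, then one
 * output section per peer in ID order. */
uint64_t output_section_offset(const struct ivshmem_layout *l, uint32_t id)
{
    return l->state_table_size + l->rw_size + (uint64_t)id * l->output_size;
}
```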

> 
>> #### Maximum Peers Register (Offset 04h)
>>
>> Read-only register that reports the maximum number of possible peers
>> (including the local one). The permitted range is between 2 and 65536
>> and remains constant over the lifetime of all peers.
>>
>> #### Interrupt Control Register (Offset 08h)
>>
>> This read/write register controls the generation of interrupts
>> whenever a peer writes to the Doorbell register or changes its state.
>>
>> | Bits | Content                                                               |
>> |-----:|:----------------------------------------------------------------------|
>> |    0 | 1: Enable interrupt generation                                        |
>> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>>
>> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
>> interrupt mode is enabled in the Enhanced Features register.
>>
>> The value of this register after device reset is 0.
> 
> This is a global interrupt enable/disable for all vectors?

Yes, that is the current model. I haven't considered per-vector control 
so far. That's probably because UIO does not support multiple vectors, 
and there was no other use case for this register yet. If we added 
per-vector support, we would have to limit the vector number or would 
need more registers...

> 
>> #### Doorbell Register (Offset 0Ch)
>>
>> Write-only register that triggers an interrupt vector in the target
>> device if it is enabled there.
>>
>> | Bits  | Content                                                              |
>> |------:|:---------------------------------------------------------------------|
>> |  0-15 | Vector number                                                        |
>> | 16-31 | Target ID                                                            |
> 
> s/Target ID/Peer/ ?

ID refers to the ID register. Target is what the doorbell is targeting.
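In driver terms, composing the doorbell value is trivial (hypothetical 
helpers, not from any existing implementation):

```c
#include <stdint.h>

/* Doorbell encoding per the register table: bits 0-15 carry the vector
 * number, bits 16-31 the target's ID (the value its ID register
 * reports). */
uint32_t doorbell_value(uint16_t target_id, uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}

/* Hypothetical send helper: kick the given target. The spec puts the
 * data-visibility burden on the doorbell implementation, so no explicit
 * barrier is strictly required here; a conservative driver might still
 * keep one (shown with a GCC builtin). */
void send_kick(volatile uint32_t *doorbell_reg, uint16_t target_id,
               uint16_t vector)
{
    __sync_synchronize();  /* optional write barrier */
    *doorbell_reg = doorbell_value(target_id, vector);
}

/* Pure demo against a plain variable standing in for the register. */
uint32_t demo_send(void)
{
    uint32_t reg = 0;
    send_kick(&reg, 2, 5);
    return reg;
}
```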

> 
>>
>> Writing a vector number that is not enabled by the target has no
>> effect. The peers can derive the number of available vectors from
>> their own device capabilities because the provider is required to
> 
> What is the "provider"? The hypervisor?

Yes. I think I didn't consistently switch from the former more abstract 
"provider of this device" term to "hypervisor".

> 
>> expose an identical number of vectors to all connected peers. The
>> peers are expected to define or negotiate the used ones via the
> 
> s/ones/vectors/ is clearer
> 
>> selected protocol.
>>
>> Addressing a non-existing or inactive target has no effect. Peers can
>> identify active targets via the State Table.
>>
>> Implementations of the Doorbell register must ensure that data written by the
>> CPU prior to issuing the register write is visible to the receiving peer before
>> the interrupt arrives.
> 
> Physical devices have no control over CPU memory ordering. Normally the
> drivers have the necessary memory barriers. Why is it the device's
> responsibility to do this in IVSHMEM v2?

It's an optimizing requirement: When the guest does not know whether the 
hypervisor uses a synchronizing mechanism to actually deliver the 
interrupt, it has to add an unconditional barrier in case one is actually 
needed. In contrast, it is easy for the hypervisor to add a barrier when 
the kick mechanism does not already come with an implicit one.

> 
>>
>> The behavior on reading from this register is undefined.
>>
>> #### State Register (Offset 10h)
>>
>> Read/write register that defines the state of the local device.
>> Writing to this register sets the state and triggers MSI-X vector 0
>> or the INTx interrupt, respectively, on the remote device if the
>> written state value differs from the previous one. Users of peer
>> devices can read the value written to this register from the State
>> Table. They are expected differentiate state change interrupts from
>> doorbell events by comparing the new state value with a locally
>> stored copy.
> 
> Does this mean drivers using INTx must compare the state table entries
> for their peers on each interrupt to differentiate from doorbell events?
> That seems inefficient.

Yes, single-vector INTx is not optimal. But the state table is in RAM, 
and it will only become noteworthy when you have dozens of peers. In 
that case, I would indeed recommend providing an MSI-X implementation.

Markus already asked if we couldn't drop INTx. In the long run, he is 
right, as most newer SoCs come with support for MSI-X, thus the guests 
can also handle virtually injected events. Existing SoCs do not always, 
which is why INTx is here.

> 
>>
>> The value of this register after device reset is 0. The semantic of
>> all other values can be defined freely by the chosen protocol.
>>
>> ### State Table
>>
>> The State Table is a read-only section at the beginning of the shared
>> memory region. It contains a 32-bit state value for each of the
>> peers. Locating the table in shared memory allows fast checking of
>> remote states without register accesses.
>>
>> The table is updated on each state change of a peers. Whenever a user
>> of an IVSHMEM device writes a value to the Local State register, this
>> value is copied into the corresponding entry of the State Table. When
>> a IVSHMEM device is reset or disconnected from the other peers, zero
>> is written into the corresponding table entry. The initial content of
>> the table is all zeros.
>>
>>      +--------------------------------+
>>      | 32-bit state value of peer n-1 |
>>      +--------------------------------+
>>      :                                :
>>      +--------------------------------+
>>      | 32-bit state value of peer 1   |
>>      +--------------------------------+
>>      | 32-bit state value of peer 0   |
>>      +--------------------------------+ <-- Shared memory base address
>>
>>
>> Protocols
>> ---------
>>
>> The IVSHMEM device shall support the peers of a connection in
>> agreeing on the protocol used over the shared memory devices. For
>> that purpose, the interface byte (offset 09h) and the sub-class byte
>> (offset 0Ah) of the Class Code register encodes a 16-bit protocol
>> type for the users. The following type values are defined:
>>
>> | Protocol Type | Description                                                  |
>> |--------------:|:-------------------------------------------------------------|
>> |         0000h | Undefined type                                               |
>> |         0001h | Virtual peer-to-peer Ethernet                                |
>> |   0002h-3FFFh | Reserved                                                     |
>> |   4000h-7FFFh | User-defined protocols                                       |
>> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>>
>> Details of the protocols are not in the scope of this specification.
> 
> If I understand correctly the reason for creating IVSHMEM v2 is to
> support non-VIRTIO devices? If you just want VIRTIO devices then IVSHMEM
> v2 + VIRTIO is more complex than a pure VIRTIO solution. But if you
> don't want VIRTIO then IVSHMEM + a minimal device can be simpler than
> VIRTIO.

One reason is to provide non-virtio "devices", or rather simple 
application-to-application protocols. Many embedded hypervisors offer 
this already today, as it is simple and "flexible" (hacky...). But the 
other reason is to provide the simplest possible transport for virtio 
(from the hypervisor POV) that does not require knowing all the 
countless details of virtio. Such a stack will allow keeping the trusted 
hypervisor core untouched when adding a new virtio device. Try that with 
the PCI or MMIO transport. The price of implementing virtio natively in 
a 5-10 thousand LoC hypervisor is simply too high. Or it ends up in 
non-robust, insecure, or even worse architectures.
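For completeness, decoding the 16-bit protocol type from the two class 
code bytes quoted above is straightforward (hypothetical helpers; the 
numeric kind codes are my own, not part of the spec):

```c
#include <stdint.h>

/* The 16-bit protocol type is spread over two class-code bytes:
 * interface (config offset 09h) holds bits 0-7, sub-class (0Ah)
 * holds bits 8-15. */
uint16_t protocol_type(uint8_t interface, uint8_t sub_class)
{
    return (uint16_t)(((uint16_t)sub_class << 8) | interface);
}

/* Coarse classification per the type table:
 * 0 = reserved/undefined, 1 = virtual peer-to-peer Ethernet,
 * 2 = user-defined, 3 = virtio front-end, 4 = virtio back-end. */
int protocol_kind(uint16_t type)
{
    if (type == 0x0001) return 1;
    if (type <  0x4000) return 0;
    if (type <  0x8000) return 2;
    if (type <  0xC000) return 3;
    return 4;
}
```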

Thanks for the feedback!

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux



* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
@ 2020-07-17 16:15     ` Jan Kiszka
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-07-17 16:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-comment, Jailhouse, qemu-devel, Michael S. Tsirkin,
	Alex Bennée, liang yan

On 15.07.20 15:27, Stefan Hajnoczi wrote:
> On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
>> IVSHMEM Device Specification
>> ============================
>>
>> ** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
> 
> Hi Jan,
> Thanks for posting this! I have a posted comments where I wasn't sure
> what the spec meant.
> 
>> The goal of the Inter-VM Shared Memory (IVSHMEM) device model is to
>> define the minimally needed building blocks a hypervisor has to
>> provide for enabling guest-to-guest communication. The details of
>> communication protocols shall remain opaque to the hypervisor so that
>> guests are free to define them as simple or sophisticated as they
>> need.
>>
>> For that purpose, the IVSHMEM provides the following features to its
>> users:
>>
>> - Interconnection between up to 65536 peers
>>
>> - Multi-purpose shared memory region
>>
>>      - common read/writable section
>>
>>      - output sections that are read/writable for one peer and only
>>        readable for the others
>>
>>      - section with peer states
>>
>> - Event signaling via interrupt to remote sides
>>
>> - Support for life-cycle management via state value exchange and
>>    interrupt notification on changes, backed by a shared memory
>>    section
>>
>> - Free choice of protocol to be used on top
>>
>> - Protocol type declaration
>>
>> - Register can be implemented either memory-mapped or via I/O,
>>    depending on platform support and lower VM-exit costs
>>
>> - Unprivileged access to memory-mapped or I/O registers feasible
>>
>> - Single discovery and configuration via standard PCI, no complexity
>>    by additionally defining a platform device model
>>
>>
>> Hypervisor Model
>> ----------------
>>
>> In order to provide a consistent link between peers, all connected
>> instances of IVSHMEM devices need to be configured, created and run
>> by the hypervisor according to the following requirements:
>>
>> - The instances of the device shall appear as a PCI device to their
>>    users.
>>
>> - The read/write shared memory section has to be of the same size for
>>    all peers. The size can be zero.
>>
>> - If shared memory output sections are present (non-zero section
>>    size), there must be one reserved for each peer with exclusive
>>    write access. All output sections must have the same size and must
>>    be readable for all peers.
>>
>> - The State Table must have the same size for all peers, must be
>>    large enough to hold the state values of all peers, and must be
>>    read-only for the user.
> 
> Who/what is the "user"? I guess this simply means that the State Table
> is read-only and only the hypervisor can update the table entries?

Read-only for the guest, right. I used the term "user" here and 
elsewhere; switching to "guest" might be better.

> 
>> - State register changes (explicit writes, peer resets) have to be
>>    propagated to the other peers by updating the corresponding State
>>    Table entry and issuing an interrupt to all other peers if they
>>    enabled reception.
>>
>> - Interrupts events triggered by a peer have to be delivered to the
>>    target peer, provided the receiving side is valid and has enabled
>>    the reception.
>>
>> - All peers must have the same interrupt delivery features available,
>>    i.e. MSI-X with the same maximum number of vectors on platforms
>>    supporting this mechanism, otherwise INTx with one vector.
>>
>>
>> Guest-side Programming Model
>> ----------------------------
>>
>> An IVSHMEM device appears as a PCI device to its users. Unless
>> otherwise noted, it conforms to the PCI Local Bus Specification,
>> Revision 3.0. As such, it is discoverable via the PCI configuration
>> space and provides a number of standard and custom PCI configuration
>> registers.
>>
>> ### Shared Memory Region Layout
>>
>> The shared memory region is divided into several sections.
>>
>>      +-----------------------------+   -
>>      |                             |   :
>>      | Output Section for peer n-1 |   : Output Section Size
>>      |     (n = Maximum Peers)     |   :
>>      +-----------------------------+   -
>>      :                             :
>>      :                             :
>>      :                             :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |  Output Section for peer 1  |   : Output Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |  Output Section for peer 0  |   : Output Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |     Read/Write Section      |   : R/W Section Size
>>      |                             |   :
>>      +-----------------------------+   -
>>      |                             |   :
>>      |         State Table         |   : State Table Size
>>      |                             |   :
>>      +-----------------------------+   <-- Shared memory base address
>>
>> The first section consists of the mandatory State Table. Its size is
>> defined by the State Table Size register and cannot be zero. This
>> section is read-only for all peers.
>>
>> The second section consists of shared memory that is read/writable
>> for all peers. Its size is defined by the R/W Section Size register.
>> A size of zero is permitted.
>>
>> The third and following sections are output sections, one for each
>> peer. Their sizes are all identical. The size of a single output
>> section is defined by the Output Section Size register. An output
>> section is read/writable for the corresponding peer and read-only for
>> all other peers. E.g., only the peer with ID 3 can write to the
>> fourths output section, but all peers can read from this section.
> 
> s/fourths/fourth/
> 
>>
>> All sizes have to be rounded up to multiples of a mappable page in
>> order to allow access control according to the section restrictions.
> 
> The host's mappable page size? I guess the guest's page size doesn't
> matter here.

It does, in fact. This is to allow unprivileged direct access to memory 
regions (as well as the register space), i.e. to enable the guest OS to 
map the regions without overlaps.
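The rounding itself is the usual power-of-two alignment, with the page 
size as a configuration input rather than something the device dictates 
(illustrative helper):

```c
#include <stdint.h>

/* Round a section size up to a multiple of the mappable page size so
 * each section can be mapped with its own access rights. Assumes a
 * power-of-two page size. */
uint64_t round_to_pages(uint64_t size, uint64_t page_size)
{
    return (size + page_size - 1) & ~(page_size - 1);
}
```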

> 
>>
>> ### Configuration Space Registers
>>
>> #### Header Registers
>>
>> | Offset | Register               | Content                                              |
>> |-------:|:-----------------------|:-----------------------------------------------------|
>> |    00h | Vendor ID              | 110Ah                                                |
>> |    02h | Device ID              | 4106h                                                |
>> |    04h | Command Register       | 0000h on reset, writable bits are:                   |
>> |        |                        | Bit 0: I/O Space (if Register Region uses I/O)       |
>> |        |                        | Bit 1: Memory Space (if Register Region uses Memory) |
>> |        |                        | Bit 3: Bus Master                                    |
>> |        |                        | Bit 10: INTx interrupt disable                       |
>> |        |                        | Writes to other bits are ignored                     |
>> |    06h | Status Register        | 0010h, static value                                  |
>> |        |                        | In deviation to the PCI specification, the Interrupt |
>> |        |                        | Status (bit 3) is never set                          |
>> |    08h | Revision ID            | 00h                                                  |
>> |    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>> |    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>> |    0Bh | Class Code, Base Class | FFh                                                  |
>> |    0Eh | Header Type            | 00h                                                  |
>> |    10h | BAR 0                  | MMIO or I/O register region                          |
>> |    14h | BAR 1                  | MSI-X region                                         |
>> |    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
>> |    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>> |    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
>> |    34h | Capability Pointer     | First capability                                     |
>> |    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>>
>> The INTx status bit is never set by an implementation. Users of the
>> IVSHMEM device are instead expected to derive the event state from
>> protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
>>
>> If BAR 2 is not present, the shared memory region is not relocatable
>> by the user. In that case, the hypervisor has to implement the Base
>> Address register in the vendor-specific capability.
> 
> What does relocatable mean in this context?

That the guest can decide (via the BAR) where the resource shows up in 
the guest-physical address space. We do not want to support this in 
setups like static partitioning hypervisors, and in those cases we use 
the read-only side-channel configuration instead.

> 
>>
>> Subsystem IDs shall encode the provider (hypervisor) in order to
>> allow identifying potential deviating implementations in case this
>> should ever be required.
>>
>> If its platform supports MSI-X, an implementation of the IVSHMEM
>> device must provide this interrupt model and must not expose INTx
>> support.
>>
>> Other header registers may not be implemented. If not implemented,
>> they return 0 on read and ignore write accesses.
>>
>> #### Vendor Specific Capability (ID 09h)
>>
>> This capability must always be present.
>>
>> | Offset | Register            | Content                                        |
>> |-------:|:--------------------|:-----------------------------------------------|
>> |    00h | ID                  | 09h                                            |
>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> |    10h | Output Section Size | 64-bit size of output sections                 |
>> |    18h | Base Address        | optional: 64-bit base address of shared memory |
>>
>> All registers are read-only. Writes are ignored, except to bit 0 of
>> the Privileged Control register.
>>
>> When bit 0 in the Privileged Control register is set to 1, the device
>> clears bit 0 in the Interrupt Control register on each interrupt
>> delivery.
> 
> The Interrupt Control register has not be defined yet at this point in
> the spec. Maybe you can rearrange this or include a reference to the
> section on the Interrupt Control register.

I can add a reference.

> 
>> This enables automatic interrupt throttling when
>> re-enabling shall be performed by a scheduled unprivileged instance
>> on the user side.
> 
> This last sentence is hard to parse.
> 
> I guess the flow is:
> 
> 1. Guest sets Interrupt Control Bit 0 to 1 to enable interrupts from a
>     peer.
> 2. Peer writes doorbell and the hypervisor clears Interrupt Control Bit
>     0 and raises the interrupt.
> 3. The guest's interrupt handler is scheduled at a later point in time.
>     If any additional doorbell writes occur then will not result in
>     additional interrupts until the guest sets Interrupt Control again.

Right, that is the behavior. The reasoning is in the second half of my 
short explanation: We want to use this for setups like UIO, where the 
final consumer is an unprivileged application. That application shall 
not be able to cause incoming interrupts at a rate higher than its own 
schedule. And that is achieved by forcing it to re-enable IRQs (via 
directly mapped MMIO registers) when it actually runs again.

> 
> Does this mean the peer still takes a vmexit writing to the doorbell
> register but the hypervisor ignores the write?

Yes, this brings no optimization for kick-related vmexits. I would 
consider that an application-level thing, doable - like in virtio - via 
protocol-specific fields in the shared memory.

> 
> VIRTIO exposes whether notifications are desired in shared memory. That
> way the peer can skip the doorbell write entirely (saving a vmexit).
> This is a fast approach for software device implementations (slow for
> hardware implementations because they would need to do a DMA read).

Which is fine and should actually work as-is when mapping virtio over 
ivshmem. Other ivshmem protocols may implement their own peer-to-peer 
hints for skipping kicks.

> 
>>
>> An IVSHMEM device may not support a relocatable shared memory region.
>> This support the hypervisor in locking down the guest-to-host address

Maybe better: "This allows the hypervisor to lock down..."

>> mapping and simplifies the runtime logic. In such a case, BAR 2 must
>> not be implemented by the hypervisor. Instead, the Base Address
>> register has to be implemented to report the location of the shared
>> memory region in the user's address space.
> 
> This paragraph is confusing. There seem to be two concepts:
> 
> 1. Relocatable memory. Placing shared memory at a fixed addresses
>     eliminates the need for address translation. But this seems like a
>     security issue if a peer can make me access an arbitrary address, why
>     not just use relative addresses from the start of the shared memory
>     region?

The interpretation of addresses in the shared memory is completely up to 
the protocol used on top of ivshmem, thus is out of scope for this spec. 
The protocol may use relative addresses or some other basis. In fact, it 
should use relative addresses, because this ivshmem spec does not provide 
means to tell all peers whether they see the shared memory at the same 
guest-physical base address.

> 
> 2. Location of shared memory. BAR 2 is a regular PCI bar and the memory
>     is located on the device. In the non-BAR 2 case I think the spec says
>     the shared memory is located in guest RAM instead?

Maybe we should clarify the second aspect: If shared memory is described 
via the side-channel, it is still a resource separate from guest RAM. 
That is at least how Jailhouse implements it, and I do not see a use 
case yet for doing it differently.

> 
>> A non-existing shared memory section has to report zero in its
>> Section Size register.
>>
>> #### MSI-X Capability (ID 11h)
>>
>> On platforms supporting MSI-X, IVSHMEM has to provide interrupt
>> delivery via this mechanism. In that case, the MSI-X capability is
>> present while the legacy INTx delivery mechanism is not available,
>> and the Interrupt Pin configuration register returns 0.
>>
>> The IVSHMEM device has no notion of pending interrupts. Therefore,
>> reading from the MSI-X Pending Bit Array will always return 0. Users
>> of the IVSHMEM device are instead expected to derive the event state
>> from protocol-specific information kept in the shared memory. This
>> approach is significantly faster, and the complexity of
>> register-based status tracking can be avoided.
> 
> Isn't the PBA just used for masked MSI-X interrupts? That is a standard
> MSI-X feature.
> 
> On a minimal system that doesn't used masking it makes sense to strip
> down the implementation, but I think the spec should allow for PCI
> compliance.

Well, we are already non-compliant when we allow the PBA to be 
non-functional or when we make INTx an edge interrupt. Both are meant to 
make things virtualization-friendly (low overhead for the guest, low 
effort for the host). If compliance has a strong use case, we can allow 
that, but I would like to see the use case first.

> 
>> The corresponding MSI-X MMIO region is configured via BAR 1.
>>
>> The MSI-X table size reported by the MSI-X capability structure is
>> identical for all peers.
>>
>> ### Register Region
>>
>> The register region may be implemented as MMIO or I/O.
> 
> Is that "either MMIO or I/O" or "MMIO and/or I/O"?

It was thought of as "either ... or". No implementation for I/O exists 
so far, though.

> 
> Drivers and devices will have to implement both anyway to be compatible
> with all other devices and drivers, respectively. But allowing both MMIO
> and I/O at the same time ensures that a device will work with any
> driver.

In fact, I would then rather consider dropping the (so far unused) I/O 
option to avoid having to implement that even more complex case.

> 
>> When implementing it as MMIO, the hypervisor has to ensure that the
>> register region can be mapped as a single page into the address space
>> of the user, without causing potential overlaps with other resources.
> 
> Can "page size" be replaced with a specific value like 4 KB? In general
> devices shouldn't know about CPU MMU page sizes because they vary.

In practice, this is a configuration matter of the hypervisor. The 
hypervisor should allow the system configurator to also specify larger 
sizes for the MMIO region. We just had the case that TI needed 64K for 
the kernel of their J721e SDK (which I suggested they avoid, but for 
different reasons).

> 
>> Write accesses to MMIO region offsets that are not backed by
>> registers have to be ignored, read accesses have to return 0. This
>> enables the user to hand out the complete region, along with the
>> shared memory, to an unprivileged instance.
>>
>> The region location in the user's physical address space is
>> configured via BAR 0. The following table visualizes the region
>> layout:
>>
>> | Offset | Register                                                            |
>> |-------:|:--------------------------------------------------------------------|
>> |    00h | ID                                                                  |
>> |    04h | Maximum Peers                                                       |
>> |    08h | Interrupt Control                                                   |
>> |    0Ch | Doorbell                                                            |
>> |    10h | State                                                               |
>>
>> All registers support only aligned 32-bit accesses.
>>
>> #### ID Register (Offset 00h)
>>
>> Read-only register that reports the ID of the local device. It is
>> unique for all of the connected devices and remains unchanged over
>> their lifetime.
> 
> What is the purpose of the ID?

- find out which is my output region
- differentiate from other peers (regarding input regions, state table
   fields, doorbell targets)
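For illustration, assuming the layout described by the vendor-specific capability (State Table, then the common read/write section, then one output section per peer), a driver could use its ID like this (function name hypothetical):

```c
#include <stdint.h>

/* Hypothetical sketch: locate this peer's output section inside the
 * shared memory, using sizes read from the vendor-specific capability.
 * Assumed layout: [State Table][R/W section][output 0]...[output n-1]. */
static uint64_t my_output_offset(uint32_t id,
                                 uint32_t state_table_size,
                                 uint64_t rw_section_size,
                                 uint64_t output_section_size)
{
    return (uint64_t)state_table_size + rw_section_size +
           id * output_section_size;
}
```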

> 
>> #### Maximum Peers Register (Offset 04h)
>>
>> Read-only register that reports the maximum number of possible peers
>> (including the local one). The permitted range is between 2 and 65536
>> and remains constant over the lifetime of all peers.
>>
>> #### Interrupt Control Register (Offset 08h)
>>
>> This read/write register controls the generation of interrupts
>> whenever a peer writes to the Doorbell register or changes its state.
>>
>> | Bits | Content                                                               |
>> |-----:|:----------------------------------------------------------------------|
>> |    0 | 1: Enable interrupt generation                                        |
>> | 1-31 | Reserved (0 on read, writes ignored)                                  |
>>
>> Note that bit 0 is reset to 0 on interrupt delivery if one-shot
>> interrupt mode is enabled in the Enhanced Features register.
>>
>> The value of this register after device reset is 0.
> 
> This is a global interrupt enable/disable for all vectors?

Yes, that is the current model. I didn't consider per-vector control so 
far. That's probably because UIO does not support multiple vectors, and 
there was no other use case for this register yet. If we added 
per-vector support, we would have to limit the vector number or would 
need more registers...

> 
>> #### Doorbell Register (Offset 0Ch)
>>
>> Write-only register that triggers an interrupt vector in the target
>> device if it is enabled there.
>>
>> | Bits  | Content                                                              |
>> |------:|:---------------------------------------------------------------------|
>> |  0-15 | Vector number                                                        |
>> | 16-31 | Target ID                                                            |
> 
> s/Target ID/Peer/ ?

ID refers to the ID register. Target is what the doorbell is targeting.
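So a write of `(target_id << 16) | vector` to offset 0Ch rings the doorbell; a minimal helper sketch (names are mine, field layout from the quoted table):

```c
#include <stdint.h>

#define IVSHMEM_REG_DOORBELL 0x0C  /* register offset from the spec table */

/* Compose the 32-bit doorbell value: vector number in bits 0-15,
 * target ID in bits 16-31. */
static inline uint32_t doorbell_value(uint16_t target_id, uint16_t vector)
{
    return ((uint32_t)target_id << 16) | vector;
}
```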

> 
>>
>> Writing a vector number that is not enabled by the target has no
>> effect. The peers can derive the number of available vectors from
>> their own device capabilities because the provider is required to
> 
> What is the "provider"? The hypervisor?

Yes. I think I didn't consistently switch from the former more abstract 
"provider of this device" term to "hypervisor".

> 
>> expose an identical number of vectors to all connected peers. The
>> peers are expected to define or negotiate the used ones via the
> 
> s/ones/vectors/ is clearer
> 
>> selected protocol.
>>
>> Addressing a non-existing or inactive target has no effect. Peers can
>> identify active targets via the State Table.
>>
>> Implementations of the Doorbell register must ensure that data written by the
>> CPU prior to issuing the register write is visible to the receiving peer before
>> the interrupt arrives.
> 
> Physical devices have no control over CPU memory ordering. Normally the
> drivers have the necessary memory barriers. Why is it the device's
> responsibility to do this in IVSHMEM v2?

It's an optimization requirement: When the guest does not know whether 
the hypervisor uses a synchronizing mechanism to actually submit the 
interrupt, it has to add an unconditional barrier in case one is 
actually needed. In contrast, it is easy for the hypervisor to add a 
barrier when the kick mechanism does not come with an implicit one.
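A sketch of the resulting guest-side kick path (accessor name hypothetical): because the device is required to order prior stores before interrupt delivery, the driver can omit its own unconditional barrier here:

```c
#include <stdint.h>

/* Hypothetical MMIO accessor; volatile prevents the compiler from
 * reordering or eliding the register write. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
}

/* Ring the target's doorbell. No explicit write barrier before the
 * MMIO store: the IVSHMEM v2 ordering requirement puts that burden on
 * the device/hypervisor side instead. */
static void notify_peer(volatile uint32_t *doorbell_reg,
                        uint16_t target_id, uint16_t vector)
{
    mmio_write32(doorbell_reg, ((uint32_t)target_id << 16) | vector);
}
```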

> 
>>
>> The behavior on reading from this register is undefined.
>>
>> #### State Register (Offset 10h)
>>
>> Read/write register that defines the state of the local device.
>> Writing to this register sets the state and triggers MSI-X vector 0
>> or the INTx interrupt, respectively, on the remote device if the
>> written state value differs from the previous one. Users of peer
>> devices can read the value written to this register from the State
>> Table. They are expected differentiate state change interrupts from
>> doorbell events by comparing the new state value with a locally
>> stored copy.
> 
> Does this mean drivers using INTx must compare the state table entries
> for their peers on each interrupt to differentiate from doorbell events?
> That seems inefficient.

Yes, single-vector INTx is not optimal. But the state table is in RAM, 
and it will only become noteworthy when you have dozens of peers. In 
that case, I would indeed recommend providing an MSI-X implementation.

Markus already asked if we couldn't drop INTx. In the long run, he is 
right, as most newer SoCs come with support for MSI-X, so the guests can 
also handle virtually injected events. Existing SoCs do not always, 
which is why INTx is here.
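The comparison against a cached copy of the State Table can be sketched as follows (a minimal illustration, not driver code from the spec):

```c
#include <stdint.h>

/* On each interrupt, compare the State Table in shared memory against
 * a locally cached copy: any mismatch is a state change, anything else
 * must have been a doorbell event. Refreshes the cache and returns the
 * number of peers whose state changed. */
static unsigned check_state_changes(const volatile uint32_t *state_table,
                                    uint32_t *cache, unsigned num_peers)
{
    unsigned changed = 0;
    for (unsigned i = 0; i < num_peers; i++) {
        uint32_t s = state_table[i];
        if (s != cache[i]) {
            cache[i] = s;
            changed++;
        }
    }
    return changed;
}
```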

> 
>>
>> The value of this register after device reset is 0. The semantic of
>> all other values can be defined freely by the chosen protocol.
>>
>> ### State Table
>>
>> The State Table is a read-only section at the beginning of the shared
>> memory region. It contains a 32-bit state value for each of the
>> peers. Locating the table in shared memory allows fast checking of
>> remote states without register accesses.
>>
>> The table is updated on each state change of a peers. Whenever a user
>> of an IVSHMEM device writes a value to the Local State register, this
>> value is copied into the corresponding entry of the State Table. When
>> a IVSHMEM device is reset or disconnected from the other peers, zero
>> is written into the corresponding table entry. The initial content of
>> the table is all zeros.
>>
>>      +--------------------------------+
>>      | 32-bit state value of peer n-1 |
>>      +--------------------------------+
>>      :                                :
>>      +--------------------------------+
>>      | 32-bit state value of peer 1   |
>>      +--------------------------------+
>>      | 32-bit state value of peer 0   |
>>      +--------------------------------+ <-- Shared memory base address
>>
>>
>> Protocols
>> ---------
>>
>> The IVSHMEM device shall support the peers of a connection in
>> agreeing on the protocol used over the shared memory devices. For
>> that purpose, the interface byte (offset 09h) and the sub-class byte
>> (offset 0Ah) of the Class Code register encodes a 16-bit protocol
>> type for the users. The following type values are defined:
>>
>> | Protocol Type | Description                                                  |
>> |--------------:|:-------------------------------------------------------------|
>> |         0000h | Undefined type                                               |
>> |         0001h | Virtual peer-to-peer Ethernet                                |
>> |   0002h-3FFFh | Reserved                                                     |
>> |   4000h-7FFFh | User-defined protocols                                       |
>> |   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>> |   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>>
>> Details of the protocols are not in the scope of this specification.
> 
> If I understand correctly the reason for creating IVSHMEM v2 is to
> support non-VIRTIO devices? If you just want VIRTIO devices then IVSHMEM
> v2 + VIRTIO is more complex than a pure VIRTIO solution. But if you
> don't want VIRTIO then IVSHMEM + a minimal device can be simpler than
> VIRTIO.

One reason is to provide non-virtio "devices", or rather simple 
application-to-application protocols. Many embedded hypervisors offer 
this already today, as it is simple and "flexible" (hacky...). But the 
other reason is to provide the simplest possible transport for virtio 
(from the hypervisor POV), one that does not require knowing all the 
countless details of virtio. Such a stack will allow keeping the trusted 
hypervisor core untouched when adding a new virtio device. Try that with 
the PCI or MMIO transport. The price of implementing virtio natively in 
a 5-10 thousand LoC hypervisor is simply too high. Or it ends up in 
non-robust, insecure, or even worse architectures.

Thanks for the feedback!

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
  2020-07-17 16:15     ` Jan Kiszka
@ 2020-07-23  6:54       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2020-07-23  6:54 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel,
	virtio-comment, Alex Bennée

On Fri, Jul 17, 2020 at 06:15:58PM +0200, Jan Kiszka wrote:
> On 15.07.20 15:27, Stefan Hajnoczi wrote:
> > On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:

Thanks for the responses. It would be great to update the spec with
these clarifications.

> > > If BAR 2 is not present, the shared memory region is not relocatable
> > > by the user. In that case, the hypervisor has to implement the Base
> > > Address register in the vendor-specific capability.
> > 
> > What does relocatable mean in this context?
> 
> That the guest can decide (via BAR) where the resource should show up in the
> physical guest address space. We do not want to support this in setups like
> for static partitioning hypervisors, and then we use that side-channel
> read-only configuration.

I see. I'm not sure what is vendor-specific about non-relocatable shared
memory. I guess it could be added to the spec too?

In any case, since "relocatable" hasn't been fully defined, I suggest
making the statement more general:

  If BAR 2 is not present the hypervisor has to implement the Base
  Address Register in the vendor-specific capability. This can be used
  for vendor-specific shared memory functionality.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
  2020-07-23  6:54       ` Stefan Hajnoczi
@ 2020-07-23  7:02         ` Jan Kiszka
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-07-23  7:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel,
	virtio-comment, Alex Bennée

On 23.07.20 08:54, Stefan Hajnoczi wrote:
> On Fri, Jul 17, 2020 at 06:15:58PM +0200, Jan Kiszka wrote:
>> On 15.07.20 15:27, Stefan Hajnoczi wrote:
>>> On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> 
> Thanks for the responses. It would be great to update the spec with
> these clarifications.
> 
>>>> If BAR 2 is not present, the shared memory region is not relocatable
>>>> by the user. In that case, the hypervisor has to implement the Base
>>>> Address register in the vendor-specific capability.
>>>
>>> What does relocatable mean in this context?
>>
>> That the guest can decide (via BAR) where the resource should show up in the
>> physical guest address space. We do not want to support this in setups like
>> for static partitioning hypervisors, and then we use that side-channel
>> read-only configuration.
> 
> I see. I'm not sure what is vendor-specific about non-relocatable shared
> memory. I guess it could be added to the spec too?

That "vendor-specific" comes from the PCI spec, which - to my 
understanding - provides us no other means to introduce registers into 
the config space that are outside of the PCI spec. I could introduce a 
name for the ivshmem vendor cap and use that name here - would that be 
better?

> 
> In any case, since "relocatable" hasn't been fully defined, I suggest
> making the statement more general:
> 
>    If BAR 2 is not present the hypervisor has to implement the Base
>    Address Register in the vendor-specific capability. This can be used
>    for vendor-specific shared memory functionality.
> 

Will integrate this.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] [RFC] ivshmem v2: Shared memory device specification
  2020-07-23  7:02         ` Jan Kiszka
@ 2020-07-27 12:29           ` Stefan Hajnoczi
  -1 siblings, 0 replies; 26+ messages in thread
From: Stefan Hajnoczi @ 2020-07-27 12:29 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Michael S. Tsirkin, qemu-devel,
	virtio-comment, Alex Bennée

On Thu, Jul 23, 2020 at 09:02:09AM +0200, Jan Kiszka wrote:
> On 23.07.20 08:54, Stefan Hajnoczi wrote:
> > On Fri, Jul 17, 2020 at 06:15:58PM +0200, Jan Kiszka wrote:
> > > On 15.07.20 15:27, Stefan Hajnoczi wrote:
> > > > On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> > 
> > Thanks for the responses. It would be great to update the spec with
> > these clarifications.
> > 
> > > > > If BAR 2 is not present, the shared memory region is not relocatable
> > > > > by the user. In that case, the hypervisor has to implement the Base
> > > > > Address register in the vendor-specific capability.
> > > > 
> > > > What does relocatable mean in this context?
> > > 
> > > That the guest can decide (via BAR) where the resource should show up in the
> > > physical guest address space. We do not want to support this in setups like
> > > for static partitioning hypervisors, and then we use that side-channel
> > > read-only configuration.
> > 
> > I see. I'm not sure what is vendor-specific about non-relocatable shared
> > memory. I guess it could be added to the spec too?
> 
> That "vendor-specific" comes from the PCI spec which - to my understanding -
> provides us no other means to introduce registers to the config space that
> are outside of the PCI spec. I could introduce a name for the ivshmem vendor
> cap and use that name here - would that be better?

Sounds good.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] ivshmem v2: Shared memory device specification
  2020-05-25  7:58 ` [virtio-comment] " Jan Kiszka
@ 2020-07-27 13:20   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2020-07-27 13:20 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, virtio-comment

On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> #### Vendor Specific Capability (ID 09h)
> 
> This capability must always be present.
> 
> | Offset | Register            | Content                                        |
> |-------:|:--------------------|:-----------------------------------------------|
> |    00h | ID                  | 09h                                            |
> |    01h | Next Capability     | Pointer to next capability or 00h              |
> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> |    04h | State Table Size    | 32-bit size of read-only State Table           |
> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> |    10h | Output Section Size | 64-bit size of output sections                 |
> |    18h | Base Address        | optional: 64-bit base address of shared memory |
> 
> All registers are read-only. Writes are ignored, except to bit 0 of
> the Privileged Control register.


Is there value in making this follow the virtio vendor-specific
capability format? That will cost several extra bytes - do you envision
having many of these in the config space?
Also, do we want to define an extended capability format in case this
is a pci extended capability?

-- 
MST



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
  2020-07-27 13:20   ` [virtio-comment] " Michael S. Tsirkin
@ 2020-07-27 13:39     ` Jan Kiszka
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-07-27 13:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, virtio-comment

On 27.07.20 15:20, Michael S. Tsirkin wrote:
> On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
>> #### Vendor Specific Capability (ID 09h)
>>
>> This capability must always be present.
>>
>> | Offset | Register            | Content                                        |
>> |-------:|:--------------------|:-----------------------------------------------|
>> |    00h | ID                  | 09h                                            |
>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> |    10h | Output Section Size | 64-bit size of output sections                 |
>> |    18h | Base Address        | optional: 64-bit base address of shared memory |
>>
>> All registers are read-only. Writes are ignored, except to bit 0 of
>> the Privileged Control register.
> 
> 
> Is there value in making this follow the virtio vendor-specific
> capability format? That will cost several extra bytes - do you envision
> having many of these in the config space?

Of course, this could be modeled via virtio_pci_cap as well. That would 
add 12 unused bytes and one type byte. It might help make the device 
look more virtio'ish, but I'm afraid there are more differences at the 
PCI level.

I do not see a use case for having multiple of those caps per device. If 
someone comes around with a valid use case for having multiple, 
non-consecutive shared memory regions for one device, we would need to 
add registers for them. But that would also only work for non-BAR 
regions due to the limited number of BARs.
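For illustration, a driver-side parse of the capability as laid out in the table quoted above could look like this (struct and function names are mine; the sketch assumes a little-endian host, matching the little-endian config space):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of the ivshmem vendor-specific capability,
 * using the register offsets from the quoted table. */
struct ivshmem_caps {
    uint32_t state_table_size;
    uint64_t rw_section_size;
    uint64_t output_section_size;
    int has_base_addr;
};

/* cfg points at the start of the capability as read from config space. */
static void parse_vendor_cap(const uint8_t *cfg, struct ivshmem_caps *out)
{
    memcpy(&out->state_table_size, cfg + 0x04, sizeof(uint32_t));
    memcpy(&out->rw_section_size, cfg + 0x08, sizeof(uint64_t));
    memcpy(&out->output_section_size, cfg + 0x10, sizeof(uint64_t));
    /* Length register: 20h means the Base Address field is present. */
    out->has_base_addr = (cfg[0x02] == 0x20);
}
```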

> Also, do we want to define an extended capability format in case this
> is a pci extended capability?
> 

What would be the practical benefit? Do you see PCIe caps that could 
become useful in virtual setups? We don't do that for regular virtio 
devices either, do we?

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
  2020-07-27 13:39     ` Jan Kiszka
@ 2020-07-27 13:56       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2020-07-27 13:56 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, virtio-comment

On Mon, Jul 27, 2020 at 03:39:32PM +0200, Jan Kiszka wrote:
> On 27.07.20 15:20, Michael S. Tsirkin wrote:
> > On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> > > #### Vendor Specific Capability (ID 09h)
> > > 
> > > This capability must always be present.
> > > 
> > > | Offset | Register            | Content                                        |
> > > |-------:|:--------------------|:-----------------------------------------------|
> > > |    00h | ID                  | 09h                                            |
> > > |    01h | Next Capability     | Pointer to next capability or 00h              |
> > > |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> > > |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> > > |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> > > |    04h | State Table Size    | 32-bit size of read-only State Table           |
> > > |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> > > |    10h | Output Section Size | 64-bit size of output sections                 |
> > > |    18h | Base Address        | optional: 64-bit base address of shared memory |
> > > 
> > > All registers are read-only. Writes are ignored, except to bit 0 of
> > > the Privileged Control register.
> > 
> > 
> > Is there value in making this follow the virtio vendor-specific
> > capability format? That will cost several extra bytes - do you envision
> > having many of these in the config space?
> 
> Of course, this could be modeled via virtio_pci_cap as well. That would add
> 12 unused bytes and one type byte. It might help to make the device look
> more virtio'ish, but I'm afraid there are more differences at the PCI level.

I guess it would be useful if we ever find it handy to make an ivshmem
device also be a virtio device. I can't say why yet, but if we don't care
about the extra bytes it seems like a good idea. It would also be handy
if you ever need another vendor-specific cap: you already get a way to
identify it without breaking drivers.


> I do not see a use case for having multiple of those caps above per device.
> If someone comes around with a valid use case for multiple,
> non-contiguous shared memory regions for one device, we would need to add
> registers for them. But that would also only work for non-BAR regions, due
> to the limited number of BARs.


OK, I guess this answers the below too.

> > Also, do we want to define an extended capability format in case this
> > is a pci extended capability?
> > 
> 
> What would be the practical benefit? Do you see PCIe caps that could become
> useful in virtual setups?

So if we ever have a huge number of these caps, PCIe allows many more
caps.

> We don't do that for regular virtio devices
> either, do we?

We don't, there's a small number of these so we don't run out of config
space.

> 
> Thanks,
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
  2020-07-27 13:56       ` Michael S. Tsirkin
@ 2020-07-27 14:17         ` Jan Kiszka
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Kiszka @ 2020-07-27 14:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, virtio-comment

On 27.07.20 15:56, Michael S. Tsirkin wrote:
> On Mon, Jul 27, 2020 at 03:39:32PM +0200, Jan Kiszka wrote:
>> On 27.07.20 15:20, Michael S. Tsirkin wrote:
>>> On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
>>>> #### Vendor Specific Capability (ID 09h)
>>>>
>>>> This capability must always be present.
>>>>
>>>> | Offset | Register            | Content                                        |
>>>> |-------:|:--------------------|:-----------------------------------------------|
>>>> |    00h | ID                  | 09h                                            |
>>>> |    01h | Next Capability     | Pointer to next capability or 00h              |
>>>> |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
>>>> |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>>>> |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
>>>> |    04h | State Table Size    | 32-bit size of read-only State Table           |
>>>> |    08h | R/W Section Size    | 64-bit size of common read/write section       |
>>>> |    10h | Output Section Size | 64-bit size of output sections                 |
>>>> |    18h | Base Address        | optional: 64-bit base address of shared memory |
>>>>
>>>> All registers are read-only. Writes are ignored, except to bit 0 of
>>>> the Privileged Control register.
>>>
>>>
>>> Is there value in making this follow the virtio vendor-specific
>>> capability format? That will cost several extra bytes - do you envision
>>> having many of these in the config space?
>>
>> Of course, this could be modeled via virtio_pci_cap as well. That would
>> add 12 unused bytes and one type byte. It might help to make the device
>> look more virtio'ish, but I'm afraid there are more differences at the
>> PCI level.
> 
> I guess it would be useful if we ever find it handy to make an ivshmem
> device also be a virtio device. I can't say why yet, but if we don't care
> about the extra bytes it seems like a good idea. It would also be handy
> if you ever need another vendor-specific cap: you already get a way to
> identify it without breaking drivers.
> 

I can look into that. Those 12 wasted bytes are a bit ugly, but so far 
we are not short on config space, even in the non-extended range.

More problematic is that the existing specification of virtio_pci_cap 
assumes that the capability describes a structure located in a PCI 
resource, rather than carrying that data itself, let alone containing a 
register (Privileged Control).

Would it be possible to split the types into two ranges, one for the 
existing structure and one for others, like ivshmem, that only share 
the cfg_type field?

> 
>> I do not see a use case for having multiple of those caps above per
>> device. If someone comes around with a valid use case for multiple,
>> non-contiguous shared memory regions for one device, we would need to
>> add registers for them. But that would also only work for non-BAR
>> regions, due to the limited number of BARs.
> 
> 
> OK, I guess this answers the below too.
> 
>>> Also, do we want to define an extended capability format in case this
>>> is a pci extended capability?
>>>
>>
>> What would be the practical benefit? Do you see PCIe caps that could become
>> useful in virtual setups?
> 
> So if we ever have a huge number of these caps, PCIe allows many more
> caps.
> 
>> We don't do that for regular virtio devices
>> either, do we?
> 
> We don't, there's a small number of these so we don't run out of config
> space.

Right. But then it would not be a problem to add PCIe support later 
(right before adding it becomes impossible) and push new caps into the 
extended space, all without breaking existing drivers. It's just a cap, 
and the spec so far does not state that there must be no other caps, 
neither in current virtio nor in this ivshmem device.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [virtio-comment] Re: [RFC] ivshmem v2: Shared memory device specification
  2020-07-27 14:17         ` Jan Kiszka
@ 2020-07-27 14:19           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2020-07-27 14:19 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Jailhouse, liang yan, Alex Bennée, qemu-devel, virtio-comment

On Mon, Jul 27, 2020 at 04:17:06PM +0200, Jan Kiszka wrote:
> On 27.07.20 15:56, Michael S. Tsirkin wrote:
> > On Mon, Jul 27, 2020 at 03:39:32PM +0200, Jan Kiszka wrote:
> > > On 27.07.20 15:20, Michael S. Tsirkin wrote:
> > > > On Mon, May 25, 2020 at 09:58:28AM +0200, Jan Kiszka wrote:
> > > > > #### Vendor Specific Capability (ID 09h)
> > > > > 
> > > > > This capability must always be present.
> > > > > 
> > > > > | Offset | Register            | Content                                        |
> > > > > |-------:|:--------------------|:-----------------------------------------------|
> > > > > |    00h | ID                  | 09h                                            |
> > > > > |    01h | Next Capability     | Pointer to next capability or 00h              |
> > > > > |    02h | Length              | 20h if Base Address is present, 18h otherwise  |
> > > > > |    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> > > > > |        |                     | Bits 1-7: Reserved (0 on read, writes ignored) |
> > > > > |    04h | State Table Size    | 32-bit size of read-only State Table           |
> > > > > |    08h | R/W Section Size    | 64-bit size of common read/write section       |
> > > > > |    10h | Output Section Size | 64-bit size of output sections                 |
> > > > > |    18h | Base Address        | optional: 64-bit base address of shared memory |
> > > > > 
> > > > > All registers are read-only. Writes are ignored, except to bit 0 of
> > > > > the Privileged Control register.
> > > > 
> > > > 
> > > > Is there value in making this follow the virtio vendor-specific
> > > > capability format? That will cost several extra bytes - do you envision
> > > > having many of these in the config space?
> > > 
> > > Of course, this could be modeled via virtio_pci_cap as well. That would add
> > > 12 unused bytes and one type byte. It might help to make the device look
> > > more virtio'ish, but I'm afraid there are more differences at the PCI level.
> > 
> > I guess it would be useful if we ever find it handy to make an ivshmem
> > device also be a virtio device. I can't say why yet, but if we don't care
> > about the extra bytes it seems like a good idea. It would also be handy
> > if you ever need another vendor-specific cap: you already get a way to
> > identify it without breaking drivers.
> > 
> 
> I can look into that. Those 12 wasted bytes are a bit ugly, but so far we
> are not short on config space, even in the non-extended range.
> 
> More problematic is that the existing specification of virtio_pci_cap
> assumes that the capability describes a structure located in a PCI
> resource, rather than carrying that data itself, let alone containing a
> register (Privileged Control).
> 
> Would it be possible to split the types into two ranges, one for the
> existing structure and one for others, like ivshmem, that only share the
> cfg_type field?

Sure.

> > 
> > > I do not see a use case for having multiple of those caps above per device.
> > > If someone comes around with a valid use case for multiple,
> > > non-contiguous shared memory regions for one device, we would need to add
> > > registers for them. But that would also only work for non-BAR regions, due
> > > to the limited number of BARs.
> > 
> > 
> > OK, I guess this answers the below too.
> > 
> > > > Also, do we want to define an extended capability format in case this
> > > > is a pci extended capability?
> > > > 
> > > 
> > > What would be the practical benefit? Do you see PCIe caps that could become
> > > useful in virtual setups?
> > 
> > So if we ever have a huge number of these caps, PCIe allows many more
> > caps.
> > 
> > > We don't do that for regular virtio devices
> > > either, do we?
> > 
> > We don't, there's a small number of these so we don't run out of config
> > space.
> 
> Right. But then it would not be a problem to add PCIe support later (right
> before adding it becomes impossible) and push new caps into the extended
> space, all without breaking existing drivers. It's just a cap, and the spec
> so far does not state that there must be no other caps, neither in current
> virtio nor in this ivshmem device.
> 
> Jan

Right.


> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-07-27 14:20 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-25  7:58 [RFC] ivshmem v2: Shared memory device specification Jan Kiszka
2020-05-25  7:58 ` [virtio-comment] " Jan Kiszka
2020-06-17 15:32 ` Alex Bennée
2020-06-17 15:32   ` [virtio-comment] " Alex Bennée
2020-06-17 16:10   ` Jan Kiszka
2020-06-17 16:10     ` [virtio-comment] " Jan Kiszka
2020-07-15 13:27 ` [virtio-comment] " Stefan Hajnoczi
2020-07-15 13:27   ` Stefan Hajnoczi
2020-07-17 16:15   ` Jan Kiszka
2020-07-17 16:15     ` Jan Kiszka
2020-07-23  6:54     ` Stefan Hajnoczi
2020-07-23  6:54       ` Stefan Hajnoczi
2020-07-23  7:02       ` Jan Kiszka
2020-07-23  7:02         ` Jan Kiszka
2020-07-27 12:29         ` Stefan Hajnoczi
2020-07-27 12:29           ` Stefan Hajnoczi
2020-07-27 13:20 ` Michael S. Tsirkin
2020-07-27 13:20   ` [virtio-comment] " Michael S. Tsirkin
2020-07-27 13:39   ` Jan Kiszka
2020-07-27 13:39     ` Jan Kiszka
2020-07-27 13:56     ` Michael S. Tsirkin
2020-07-27 13:56       ` Michael S. Tsirkin
2020-07-27 14:17       ` Jan Kiszka
2020-07-27 14:17         ` Jan Kiszka
2020-07-27 14:19         ` Michael S. Tsirkin
2020-07-27 14:19           ` Michael S. Tsirkin
