All of lore.kernel.org
 help / color / mirror / Atom feed
From: Markus Armbruster <armbru@redhat.com>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [PULL 10/40] ivshmem: Rewrite specification document
Date: Fri, 18 Mar 2016 18:00:57 +0100	[thread overview]
Message-ID: <1458320487-19603-11-git-send-email-armbru@redhat.com> (raw)
In-Reply-To: <1458320487-19603-1-git-send-email-armbru@redhat.com>

This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite.  Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code.  If the new text contradicts the old text but not the
code, it's probably a bug in the old text.  If the new text
contradicts both, its probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <1458066895-20632-11-git-send-email-armbru@redhat.com>
---
 docs/specs/ivshmem-spec.txt        | 243 +++++++++++++++++++++++++++++++++++++
 docs/specs/ivshmem_device_spec.txt | 161 ------------------------
 2 files changed, 243 insertions(+), 161 deletions(-)
 create mode 100644 docs/specs/ivshmem-spec.txt
 delete mode 100644 docs/specs/ivshmem_device_spec.txt

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 0000000..0e9185a
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,243 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host.  In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem,shm=NAME,...
+
+  This uses shared memory object NAME.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+  An ivshmem server must already be running on the host.  The device
+  connects to the server's UNIX domain socket via character device
+  CHR.
+
+  Each peer gets assigned a unique ID by the server.  IDs must be
+  between 0 and 65535.
+
+  Interrupts are message-signaled by default (MSI-X).  With msi=off
+  the device has no MSI-X capability, and uses legacy INTx instead.
+  vectors=N configures the number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices.  This way,
+  you have access to the shared memory in the guest and can use it as
+  you see fit.  Memnic, for example, uses ivshmem this way from guest
+  user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
+  likely want to write a kernel driver to handle interrupts.  Requires
+  the device to be configured for interrupts, obviously.
+
+If the device is configured for interrupts, BAR2 is initially invalid.
+It becomes safely accessible only after the ivshmem server provided
+the shared memory.  Guest software should wait for the IVPosition
+register (described below) to become non-negative before accessing
+BAR2.
+
+The device is not capable to tell guest software whether it is
+configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+    Offset  Size  Access      On reset  Function
+        0     4   read/write        0   Interrupt Mask
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        4     4   read/write        0   Interrupt Status
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        8     4   read-only   0 or -1   IVPosition
+       12     4   write-only      N/A   Doorbell
+                                        bit 0..15: vector
+                                        bit 16..31: peer ID
+       16   240   none            N/A   reserved
+
+Software should only access the registers as specified in column
+"Access".  Reserved bits should be ignored on read, and preserved on
+write.
+
+Interrupt Status and Mask Register together control the legacy INTx
+interrupt when the device has no MSI-X capability: INTx is asserted
+when the bit-wise AND of Status and Mask is non-zero and the device
+has no MSI-X capability.  Interrupt Status Register bit 0 becomes 1
+when an interrupt request from a peer is received.  Reading the
+register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero.  Else, it's -1 for a short while after reset, then
+changes to the device's ID (between 0 and 65535).
+
+There is no good way for software to find out whether the device is
+configured for interrupts.  A positive IVPosition means interrupts,
+but zero could be either.  The initial -1 cannot be reliably observed.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
+
+If the device is not configured for interrupts, the write is ignored.
+
+If the interrupt hasn't completed setup, the write is ignored.  The
+device is not capable to tell guest software whether setup is
+complete.  Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has fewer
+interrupt vectors connected, the write is ignored.  The device is not
+capable to tell guest software what peers are connected, or how many
+interrupt vectors are connected.
+
+If the peer doesn't use MSI-X, its Interrupt Status register is set to
+1.  This asserts INTx unless masked by the Interrupt Mask register.
+The device is not capable to communicate the interrupt vector to guest
+software then.
+
+If the peer uses MSI-X, the interrupt for this vector becomes pending.
+There is no way for software to clear the pending bit, and a polling
+mode of operation is therefore impossible with MSI-X.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred.  The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory.  The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+  new client,
+- sends connect notifications for the new client to the other clients
+  (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+  and
+- sends interrupt setup messages to the new client (these contain file
+  descriptors for receiving interrupts).
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue.  They can
+communicate with each other normally, but won't receive disconnect
+notification on disconnect, and no new clients can connect.  There is
+no way for the clients to connect to a restarted server.  The device
+is not capable to tell guest software whether the server is still up.
+
+Example server code is in contrib/ivshmem-server/.  Not to be used in
+production.  It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/.  It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server.  This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
+client and server close the connection on error.
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero.  The client should
+   close the connection on receipt of versions it can't handle.
+
+2. The client's ID.  This is unique among all clients of this server.
+   IDs must be between 0 and 65535, because the Doorbell register
+   provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+   memory.
+
+4. Connect notifications for existing other clients, if any.  This is
+   a peer ID (number between 0 and 65535 other than the client's ID),
+   repeated N times.  Each repetition is accompanied by one file
+   descriptor.  These are for interrupting the peer with that ID using
+   vector 0,..,N-1, in order.  If the client is configured for fewer
+   vectors, it closes the extra file descriptors.  If it is configured
+   for more, the extra vectors remain unconnected.
+
+5. Interrupt setup.  This is the client's own ID, repeated N times.
+   Each repetition is accompanied by one file descriptor.  These are
+   for receiving interrupts from peers using vector 0,..,N-1, in
+   order.  If the client is configured for fewer vectors, it closes
+   the extra file descriptors.  If it is configured for more, the
+   extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification.  This is a peer ID.
+
+  - If the number comes with a file descriptor, it's a connection
+    notification, exactly like in step 4.
+
+  - Else, it's a disconnection notification for the peer with that ID.
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5.  Before, messages
+  were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
deleted file mode 100644
index d318d65..0000000
--- a/docs/specs/ivshmem_device_spec.txt
+++ /dev/null
@@ -1,161 +0,0 @@
-
-Device Specification for Inter-VM shared memory device
-------------------------------------------------------
-
-The Inter-VM shared memory device is designed to share a memory region (created
-on the host via the POSIX shared memory API) between multiple QEMU processes
-running different guests. In order for all guests to be able to pick up the
-shared memory area, it is modeled by QEMU as a PCI device exposing said memory
-to the guest as a PCI BAR.
-The memory region does not belong to any guest, but is a POSIX memory object on
-the host. The host can access this shared memory if needed.
-
-The device also provides an optional communication mechanism between guests
-sharing the same memory object. More details about that in the section 'Guest to
-guest communication' section.
-
-
-The Inter-VM PCI device
------------------------
-
-From the VM point of view, the ivshmem PCI device supports three BARs.
-
-- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
-  not used.
-- BAR1 is used for MSI-X when it is enabled in the device.
-- BAR2 is used to access the shared memory object.
-
-It is your choice how to use the device but you must choose between two
-behaviors :
-
-- basically, if you only need the shared memory part, you will map BAR2.
-  This way, you have access to the shared memory in guest and can use it as you
-  see fit (memnic, for example, uses it in userland
-  http://dpdk.org/browse/memnic).
-
-- BAR0 and BAR1 are used to implement an optional communication mechanism
-  through interrupts in the guests. If you need an event mechanism between the
-  guests accessing the shared memory, you will most likely want to write a
-  kernel driver that will handle interrupts. See details in the section 'Guest
-  to guest communication' section.
-
-The behavior is chosen when starting your QEMU processes:
-- no communication mechanism needed, the first QEMU to start creates the shared
-  memory on the host, subsequent QEMU processes will use it.
-
-- communication mechanism needed, an ivshmem server must be started before any
-  QEMU processes, then each QEMU process connects to the server unix socket.
-
-For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
-
-
-Guest to guest communication
-----------------------------
-
-This section details the communication mechanism between the guests accessing
-the ivhsmem shared memory.
-
-*ivshmem server*
-
-This server code is available in qemu.git/contrib/ivshmem-server.
-
-The server must be started on the host before any guest.
-It creates a shared memory object then waits for clients to connect on a unix
-socket. All the messages are little-endian int64_t integer.
-
-For each client (QEMU process) that connects to the server:
-- the server sends a protocol version, if client does not support it, the client
-  closes the communication,
-- the server assigns an ID for this client and sends this ID to him as the first
-  message,
-- the server sends a fd to the shared memory object to this client,
-- the server creates a new set of host eventfds associated to the new client and
-  sends this set to all already connected clients,
-- finally, the server sends all the eventfds sets for all clients to the new
-  client.
-
-The server signals all clients when one of them disconnects.
-
-The client IDs are limited to 16 bits because of the current implementation (see
-Doorbell register in 'PCI device registers' subsection). Hence only 65536
-clients are supported.
-
-All the file descriptors (fd to the shared memory, eventfds for each client)
-are passed to clients using SCM_RIGHTS over the server unix socket.
-
-Apart from the current ivshmem implementation in QEMU, an ivshmem client has
-been provided in qemu.git/contrib/ivshmem-client for debug.
-
-*QEMU as an ivshmem client*
-
-At initialisation, when creating the ivshmem device, QEMU first receives a
-protocol version and closes communication with server if it does not match.
-Then, QEMU gets its ID from the server then makes it available through BAR0
-IVPosition register for the VM to use (see 'PCI device registers' subsection).
-QEMU then uses the fd to the shared memory to map it to BAR2.
-eventfds for all other clients received from the server are stored to implement
-BAR0 Doorbell register (see 'PCI device registers' subsection).
-Finally, eventfds assigned to this QEMU process are used to send interrupts in
-this VM.
-
-*PCI device registers*
-
-From the VM point of view, the ivshmem PCI device supports 4 registers of
-32-bits each.
-
-enum ivshmem_registers {
-    IntrMask = 0,
-    IntrStatus = 4,
-    IVPosition = 8,
-    Doorbell = 12
-};
-
-The first two registers are the interrupt mask and status registers.  Mask and
-status are only used with pin-based interrupts.  They are unused with MSI
-interrupts.
-
-Status Register: The status register is set to 1 when an interrupt occurs.
-
-Mask Register: The mask register is bitwise ANDed with the interrupt status
-and the result will raise an interrupt if it is non-zero.  However, since 1 is
-the only value the status will be set to, it is only the first bit of the mask
-that has any effect.  Therefore interrupts can be masked by setting the first
-bit to 0 and unmasked by setting the first bit to 1.
-
-IVPosition Register: The IVPosition register is read-only and reports the
-guest's ID number.  The guest IDs are non-negative integers.  When using the
-server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible
-via the device).  If the device is not ready, the IVPosition will return -1.
-Applications should ensure that they have a valid VM ID before accessing the
-shared memory.
-
-Doorbell Register:  To interrupt another guest, a guest must write to the
-Doorbell register.  The doorbell register is 32-bits, logically divided into
-two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
-16-bits are the interrupt vector to trigger.  The semantics of the value
-written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
-the status register.
-
-Regular Interrupts
-
-If regular interrupts are used (due to either a guest not supporting MSI or the
-user specifying not to use them on startup) then the value written to the lower
-16-bits of the Doorbell register results is arbitrary and will trigger an
-interrupt in the destination guest.
-
-Message Signalled Interrupts
-
-An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
-written to the Doorbell register must be between 0 and the maximum number of
-vectors the guest supports.  The lower 16 bits written to the doorbell is the
-MSI vector that will be raised in the destination guest.  The number of MSI
-vectors is configurable but it is set when the VM is started.
-
-The important thing to remember with MSI is that it is only a signal, no status
-is set (since MSI interrupts are not shared).  All information other than the
-interrupt itself should be communicated via the shared memory region.  Devices
-supporting multiple MSI vectors can use different vectors to indicate different
-events have occurred.  The semantics of interrupt vectors are left to the
-user's discretion.
-- 
2.4.3

  parent reply	other threads:[~2016-03-18 17:01 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-18 17:00 [Qemu-devel] [PULL 00/40] ivshmem: Fixes, cleanups, device model split Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 01/40] target-ppc: Document TOCTTOU in hugepage support Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 02/40] ivshmem-server: Fix and clean up command line help Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 03/40] ivshmem-server: Don't overload POSIX shmem and file name Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 04/40] qemu-doc: Fix ivshmem huge page example Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 05/40] event_notifier: Make event_notifier_init_fd() #ifdef CONFIG_EVENTFD Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 06/40] tests/libqos/pci-pc: Fix qpci_pc_iomap() to map BARs aligned Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 07/40] ivshmem-test: Improve test case /ivshmem/single Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 08/40] ivshmem-test: Clean up wait for devices to become operational Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 09/40] ivshmem-test: Improve test cases /ivshmem/server-* Markus Armbruster
2016-03-18 17:00 ` Markus Armbruster [this message]
2016-03-18 17:00 ` [Qemu-devel] [PULL 11/40] ivshmem: Add missing newlines to debug printfs Markus Armbruster
2016-03-18 17:00 ` [Qemu-devel] [PULL 12/40] ivshmem: Compile debug prints unconditionally to prevent bit-rot Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 13/40] ivshmem: Clean up after commit 9940c32 Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 14/40] ivshmem: Drop ivshmem_event() stub Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 15/40] ivshmem: Don't destroy the chardev on version mismatch Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 16/40] ivshmem: Fix harmless misuse of Error Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 17/40] ivshmem: Failed realize() can leave migration blocker behind Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 18/40] ivshmem: Clean up register callbacks Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 19/40] ivshmem: Clean up MSI-X conditions Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 20/40] ivshmem: Leave INTx alone when using MSI-X Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 21/40] ivshmem: Assert interrupts are set up once Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 22/40] ivshmem: Simplify rejection of invalid peer ID from server Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 23/40] ivshmem: Disentangle ivshmem_read() Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 24/40] ivshmem: Plug leaks on unplug, fix peer disconnect Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 25/40] ivshmem: Receive shared memory synchronously in realize() Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 26/40] ivshmem: Propagate errors through ivshmem_recv_setup() Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 27/40] ivshmem: Rely on server sending the ID right after the version Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 28/40] ivshmem: Drop the hackish test for UNIX domain chardev Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 29/40] ivshmem: Simplify how we cope with short reads from server Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 30/40] ivshmem: Tighten check of property "size" Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 31/40] ivshmem: Implement shm=... with a memory backend Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 32/40] ivshmem: Simplify memory regions for BAR 2 (shared memory) Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 33/40] ivshmem: Inline check_shm_size() into its only caller Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 34/40] qdev: New DEFINE_PROP_ON_OFF_AUTO Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 35/40] ivshmem: Replace int role_val by OnOffAuto master Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 36/40] ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 37/40] ivshmem: Clean up after the previous commit Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 38/40] ivshmem: Drop ivshmem property x-memdev Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 39/40] ivshmem: Require master to have ID zero Markus Armbruster
2016-03-18 17:01 ` [Qemu-devel] [PULL 40/40] contrib/ivshmem-server: Print "not for production" warning Markus Armbruster
2016-03-21  9:45 ` [Qemu-devel] [PULL 00/40] ivshmem: Fixes, cleanups, device model split Peter Maydell
2016-03-21 10:05   ` Markus Armbruster
2016-03-21 10:18     ` Peter Maydell
2016-03-21 11:52       ` Markus Armbruster
2016-03-21 12:11   ` Markus Armbruster
2016-03-28  6:02     ` Marcel Apfelbaum
2016-03-28  6:38       ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1458320487-19603-11-git-send-email-armbru@redhat.com \
    --to=armbru@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.