QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
@ 2019-11-11 12:57 Jan Kiszka
  2019-11-11 12:57 ` [RFC][PATCH 1/3] hw/misc: Add implementation of ivshmem revision 2 device Jan Kiszka
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 12:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

To get the ball rolling after my presentation of the topic at KVM Forum
[1] and many fruitful discussions around it, this is a first concrete
code series. As discussed, I'm starting with the IVSHMEM implementation
of a QEMU device and server. It's RFC because, besides specification and
implementation details, there will still be some decisions needed about
how to integrate the new version best into the existing code bases.

If you want to play with this, the basic setup of the shared memory
device is described in patch 1 and 3. UIO driver and also the
virtio-ivshmem prototype can be found at

    http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2

Accessing the device via UIO is trivial enough. If you want to use it
for virtio, this is additionally to the description in patch 3 needed on
the virtio console backend side:

    modprobe uio_ivshmem
    echo "1af4 1110 1af4 1100 ffc003 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
    linux/tools/virtio/virtio-ivshmem-console /dev/uio0

And for virtio block:

    echo "1af4 1110 1af4 1100 ffc002 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
    linux/tools/virtio/virtio-ivshmem-console /dev/uio0 /path/to/disk.img

After that, you can start the QEMU frontend instance with the
virtio-ivshmem driver installed which can use the new /dev/hvc* or
/dev/vda* as usual.

Any feedback welcome!

Jan

PS: Let me know if I missed someone potentially interested in this topic
on CC - or if you would like to be dropped from the list.

PPS: The Jailhouse queues are currently out of sync /wrt minor details
of this one, primarily the device ID. Will update them when the general
direction is clear.

[1] https://kvmforum2019.sched.com/event/TmxI

Jan Kiszka (3):
  hw/misc: Add implementation of ivshmem revision 2 device
  docs/specs: Add specification of ivshmem device revision 2
  contrib: Add server for ivshmem revision 2

 Makefile                                  |    3 +
 Makefile.objs                             |    1 +
 configure                                 |    1 +
 contrib/ivshmem2-server/Makefile.objs     |    1 +
 contrib/ivshmem2-server/ivshmem2-server.c |  462 ++++++++++++
 contrib/ivshmem2-server/ivshmem2-server.h |  158 +++++
 contrib/ivshmem2-server/main.c            |  313 +++++++++
 docs/specs/ivshmem-2-device-spec.md       |  333 +++++++++
 hw/misc/Makefile.objs                     |    2 +-
 hw/misc/ivshmem2.c                        | 1091 +++++++++++++++++++++++++++++
 include/hw/misc/ivshmem2.h                |   48 ++
 11 files changed, 2412 insertions(+), 1 deletion(-)
 create mode 100644 contrib/ivshmem2-server/Makefile.objs
 create mode 100644 contrib/ivshmem2-server/ivshmem2-server.c
 create mode 100644 contrib/ivshmem2-server/ivshmem2-server.h
 create mode 100644 contrib/ivshmem2-server/main.c
 create mode 100644 docs/specs/ivshmem-2-device-spec.md
 create mode 100644 hw/misc/ivshmem2.c
 create mode 100644 include/hw/misc/ivshmem2.h

-- 
2.16.4



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 1/3] hw/misc: Add implementation of ivshmem revision 2 device
  2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
@ 2019-11-11 12:57 ` Jan Kiszka
  2019-11-11 12:57 ` [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2 Jan Kiszka
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 12:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

From: Jan Kiszka <jan.kiszka@siemens.com>

This adds a reimplementation of ivshmem in its new revision 2 as
separate device. The goal of this is not to enable sharing with v1,
rather to allow explore the properties and potential limitation of the
new version prior to discussing its integration with the existing code.

v2 always requires a server to interconnect two more more QEMU
instances because it provides signaling between peers unconditionally.
Therefore, only the interconnecting chardev, master mode, and the usage
of ioeventfd can be configured at device level. All other parameters are
defined by the server instance.

A new server protocol is introduced along this. Its primary difference
is the introduction of a single welcome message that contains all peer
parameters, rather than a series of single-word messages pushing them
piece by piece.

A complicating difference in interrupt handling, compare to v1, is the
auto-disable mode of v2: When this is active, interrupt delivery is
disabled by the device after each interrupt event. This prevents the
usage of irqfd on the receiving side, but it lowers the handling cost
for guests that implemented interrupt throttling this way (specifically
when exposing the device via UIO).

No changes have been made to the ivshmem device regarding migration:
Only the master can live-migrate, slave peers have to hot-unplug the
device first.

The details of the device model will be specified in a succeeding
commit. Drivers for this device can currently be found under

http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2

To instantiate a ivshmem v2 device, just add

 ... -chardev socket,path=/tmp/ivshmem_socket,id=ivshmem \
     -device ivshmem,chardev=ivshmem

provided the server opened its socket under the default path.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/misc/Makefile.objs      |    2 +-
 hw/misc/ivshmem2.c         | 1091 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/misc/ivshmem2.h |   48 ++
 3 files changed, 1140 insertions(+), 1 deletion(-)
 create mode 100644 hw/misc/ivshmem2.c
 create mode 100644 include/hw/misc/ivshmem2.h

diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
index ba898a5781..90a4a6608c 100644
--- a/hw/misc/Makefile.objs
+++ b/hw/misc/Makefile.objs
@@ -26,7 +26,7 @@ common-obj-$(CONFIG_PUV3) += puv3_pm.o
 
 common-obj-$(CONFIG_MACIO) += macio/
 
-common-obj-$(CONFIG_IVSHMEM_DEVICE) += ivshmem.o
+common-obj-$(CONFIG_IVSHMEM_DEVICE) += ivshmem.o ivshmem2.o
 
 common-obj-$(CONFIG_REALVIEW) += arm_sysctl.o
 common-obj-$(CONFIG_NSERIES) += cbus.o
diff --git a/hw/misc/ivshmem2.c b/hw/misc/ivshmem2.c
new file mode 100644
index 0000000000..aeb5c40e0b
--- /dev/null
+++ b/hw/misc/ivshmem2.c
@@ -0,0 +1,1091 @@
+/*
+ * Inter-VM Shared Memory PCI device, version 2.
+ *
+ * Copyright (c) Siemens AG, 2019
+ *
+ * Authors:
+ *  Jan Kiszka <jan.kiszka@siemens.com>
+ *
+ * Based on ivshmem.c by Cam Macdonell <cam@cs.ualberta.ca>
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/qdev-properties.h"
+#include "sysemu/kvm.h"
+#include "migration/blocker.h"
+#include "migration/vmstate.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/module.h"
+#include "qom/object_interfaces.h"
+#include "chardev/char-fe.h"
+#include "sysemu/qtest.h"
+#include "qapi/visitor.h"
+
+#include "hw/misc/ivshmem2.h"
+
+#define PCI_VENDOR_ID_IVSHMEM   PCI_VENDOR_ID_REDHAT_QUMRANET
+#define PCI_DEVICE_ID_IVSHMEM   0x1110
+
+#define IVSHMEM_MAX_PEERS       UINT16_MAX
+#define IVSHMEM_IOEVENTFD       0
+#define IVSHMEM_MSI             1
+
+#define IVSHMEM_REG_BAR_SIZE    0x1000
+
+#define IVSHMEM_REG_ID          0x00
+#define IVSHMEM_REG_MAX_PEERS   0x04
+#define IVSHMEM_REG_FEATURES    0x08
+#define IVSHMEM_REG_INT_CTRL    0x0c
+#define IVSHMEM_REG_DOORBELL    0x10
+#define IVSHMEM_REG_STATE       0x14
+
+#define IVSHMEM_INT_ENABLE      0x1
+
+#define IVSHMEM_ONESHOT_MODE    0x1
+
+#define IVSHMEM_DEBUG 0
+#define IVSHMEM_DPRINTF(fmt, ...)                       \
+    do {                                                \
+        if (IVSHMEM_DEBUG) {                            \
+            printf("IVSHMEM: " fmt, ## __VA_ARGS__);    \
+        }                                               \
+    } while (0)
+
+#define TYPE_IVSHMEM "ivshmem"
+#define IVSHMEM(obj) \
+    OBJECT_CHECK(IVShmemState, (obj), TYPE_IVSHMEM)
+
+typedef struct Peer {
+    int nb_eventfds;
+    EventNotifier *eventfds;
+} Peer;
+
+typedef struct MSIVector {
+    PCIDevice *pdev;
+    int virq;
+    bool unmasked;
+} MSIVector;
+
+typedef struct IVShmemVndrCap {
+    uint8_t id;
+    uint8_t next;
+    uint8_t length;
+    uint8_t priv_ctrl;
+    uint32_t state_tab_sz;
+    uint64_t rw_section_sz;
+    uint64_t output_section_sz;
+} IVShmemVndrCap;
+
+typedef struct IVShmemState {
+    /*< private >*/
+    PCIDevice parent_obj;
+    /*< public >*/
+
+    uint32_t features;
+
+    CharBackend server_chr;
+
+    /* registers */
+    uint8_t *priv_ctrl;
+    uint32_t vm_id;
+    uint32_t intctrl;
+    uint32_t state;
+
+    /* BARs */
+    MemoryRegion ivshmem_mmio; /* BAR 0 (registers) */
+    MemoryRegion ivshmem_bar2; /* BAR 2 (shared memory) */
+
+    void *shmem;
+    size_t shmem_sz;
+    size_t output_section_sz;
+
+    MemoryRegion state_tab;
+    MemoryRegion rw_section;
+    MemoryRegion input_sections;
+    MemoryRegion output_section;
+
+    /* interrupt support */
+    Peer *peers;
+    int nb_peers;               /* space in @peers[] */
+    uint32_t max_peers;
+    uint32_t vectors;
+    MSIVector *msi_vectors;
+
+    uint8_t msg_buf[32];        /* buffer for receiving server messages */
+    int msg_buffered_bytes;     /* #bytes in @msg_buf */
+
+    uint32_t protocol;
+
+    /* migration stuff */
+    OnOffAuto master;
+    Error *migration_blocker;
+} IVShmemState;
+
+static void ivshmem_enable_irqfd(IVShmemState *s);
+static void ivshmem_disable_irqfd(IVShmemState *s);
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs,
+                                           unsigned int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline bool ivshmem_is_master(IVShmemState *s)
+{
+    assert(s->master != ON_OFF_AUTO_AUTO);
+    return s->master == ON_OFF_AUTO_ON;
+}
+
+static bool ivshmem_irqfd_usable(IVShmemState *s)
+{
+    PCIDevice *pdev = PCI_DEVICE(s);
+
+    return (s->intctrl & IVSHMEM_INT_ENABLE) && msix_enabled(pdev) &&
+        !(*s->priv_ctrl & IVSHMEM_ONESHOT_MODE);
+}
+
+static void ivshmem_update_irqfd(IVShmemState *s, bool was_usable)
+{
+    bool is_usable = ivshmem_irqfd_usable(s);
+
+    if (kvm_msi_via_irqfd_enabled()) {
+        if (!was_usable && is_usable) {
+            ivshmem_enable_irqfd(s);
+        } else if (was_usable && !is_usable) {
+            ivshmem_disable_irqfd(s);
+        }
+    }
+}
+
+static void ivshmem_write_intctrl(IVShmemState *s, uint32_t new_state)
+{
+    bool was_usable = ivshmem_irqfd_usable(s);
+
+    s->intctrl = new_state & IVSHMEM_INT_ENABLE;
+    ivshmem_update_irqfd(s, was_usable);
+}
+
+static void ivshmem_write_state(IVShmemState *s, uint32_t new_state)
+{
+    uint32_t *state_table = s->shmem;
+    int peer;
+
+    state_table[s->vm_id] = new_state;
+    smp_mb();
+
+    if (s->state != new_state) {
+        s->state = new_state;
+        for (peer = 0; peer < s->nb_peers; peer++) {
+            if (peer != s->vm_id && s->peers[peer].nb_eventfds > 0) {
+                event_notifier_set(&s->peers[peer].eventfds[0]);
+            }
+        }
+    }
+}
+
+static void ivshmem_io_write(void *opaque, hwaddr addr,
+                             uint64_t val, unsigned size)
+{
+    IVShmemState *s = opaque;
+
+    uint16_t dest = val >> 16;
+    uint16_t vector = val & 0xff;
+
+    addr &= 0xfc;
+
+    IVSHMEM_DPRINTF("writing to addr " TARGET_FMT_plx "\n", addr);
+    switch (addr) {
+    case IVSHMEM_REG_INT_CTRL:
+        ivshmem_write_intctrl(s, val);
+        break;
+
+    case IVSHMEM_REG_DOORBELL:
+        /* check that dest VM ID is reasonable */
+        if (dest >= s->nb_peers) {
+            IVSHMEM_DPRINTF("Invalid destination VM ID (%d)\n", dest);
+            break;
+        }
+
+        /* check doorbell range */
+        if (vector < s->peers[dest].nb_eventfds) {
+            IVSHMEM_DPRINTF("Notifying VM %d on vector %d\n", dest, vector);
+            event_notifier_set(&s->peers[dest].eventfds[vector]);
+        } else {
+            IVSHMEM_DPRINTF("Invalid destination vector %d on VM %d\n",
+                            vector, dest);
+        }
+        break;
+
+    case IVSHMEM_REG_STATE:
+        ivshmem_write_state(s, val);
+        break;
+
+    default:
+        IVSHMEM_DPRINTF("Unhandled write " TARGET_FMT_plx "\n", addr);
+    }
+}
+
+static uint64_t ivshmem_io_read(void *opaque, hwaddr addr,
+                                unsigned size)
+{
+    IVShmemState *s = opaque;
+    uint32_t ret;
+
+    switch (addr) {
+    case IVSHMEM_REG_ID:
+        ret = s->vm_id;
+        break;
+
+    case IVSHMEM_REG_MAX_PEERS:
+        ret = s->max_peers;
+        break;
+
+    case IVSHMEM_REG_FEATURES:
+        ret = 0;
+        break;
+
+    case IVSHMEM_REG_INT_CTRL:
+        ret = s->intctrl;
+        break;
+
+    case IVSHMEM_REG_STATE:
+        ret = s->state;
+        break;
+
+    default:
+        IVSHMEM_DPRINTF("why are we reading " TARGET_FMT_plx "\n", addr);
+        ret = 0;
+    }
+
+    return ret;
+}
+
+static const MemoryRegionOps ivshmem_mmio_ops = {
+    .read = ivshmem_io_read,
+    .write = ivshmem_io_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+};
+
+static void ivshmem_vector_notify(void *opaque)
+{
+    MSIVector *entry = opaque;
+    PCIDevice *pdev = entry->pdev;
+    IVShmemState *s = IVSHMEM(pdev);
+    int vector = entry - s->msi_vectors;
+    EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
+
+    if (!event_notifier_test_and_clear(n) ||
+        !(s->intctrl & IVSHMEM_INT_ENABLE)) {
+        return;
+    }
+
+    IVSHMEM_DPRINTF("interrupt on vector %p %d\n", pdev, vector);
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        if (msix_enabled(pdev)) {
+            msix_notify(pdev, vector);
+        }
+    } else if (pdev->config[PCI_INTERRUPT_PIN]) {
+        pci_set_irq(pdev, 1);
+        pci_set_irq(pdev, 0);
+    }
+    if (*s->priv_ctrl & IVSHMEM_ONESHOT_MODE) {
+        s->intctrl &= ~IVSHMEM_INT_ENABLE;
+    }
+}
+
+static int ivshmem_irqfd_vector_unmask(PCIDevice *dev, unsigned vector,
+                                       MSIMessage msg)
+{
+    IVShmemState *s = IVSHMEM(dev);
+    EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
+    MSIVector *v = &s->msi_vectors[vector];
+    int ret;
+
+    IVSHMEM_DPRINTF("vector unmask %p %d\n", dev, vector);
+    if (!v->pdev) {
+        error_report("ivshmem: vector %d route does not exist", vector);
+        return -EINVAL;
+    }
+    assert(!v->unmasked);
+
+    ret = kvm_irqchip_add_msi_route(kvm_state, vector, dev);
+    if (ret < 0) {
+        error_report("kvm_irqchip_add_msi_route failed");
+        return ret;
+    }
+    v->virq = ret;
+    kvm_irqchip_commit_routes(kvm_state);
+
+    ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, n, NULL, v->virq);
+    if (ret < 0) {
+        error_report("kvm_irqchip_add_irqfd_notifier_gsi failed");
+        return ret;
+    }
+    v->unmasked = true;
+
+    return 0;
+}
+
+static void ivshmem_irqfd_vector_mask(PCIDevice *dev, unsigned vector)
+{
+    IVShmemState *s = IVSHMEM(dev);
+    EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
+    MSIVector *v = &s->msi_vectors[vector];
+    int ret;
+
+    IVSHMEM_DPRINTF("vector mask %p %d\n", dev, vector);
+    if (!v->pdev) {
+        error_report("ivshmem: vector %d route does not exist", vector);
+        return;
+    }
+    assert(v->unmasked);
+
+    ret = kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, n, v->virq);
+    if (ret < 0) {
+        error_report("remove_irqfd_notifier_gsi failed");
+        return;
+    }
+    kvm_irqchip_release_virq(kvm_state, v->virq);
+
+    v->unmasked = false;
+}
+
+static void ivshmem_irqfd_vector_poll(PCIDevice *dev,
+                                      unsigned int vector_start,
+                                      unsigned int vector_end)
+{
+    IVShmemState *s = IVSHMEM(dev);
+    unsigned int vector;
+
+    IVSHMEM_DPRINTF("vector poll %p %d-%d\n", dev, vector_start, vector_end);
+
+    vector_end = MIN(vector_end, s->vectors);
+
+    for (vector = vector_start; vector < vector_end; vector++) {
+        EventNotifier *notifier = &s->peers[s->vm_id].eventfds[vector];
+
+        if (!msix_is_masked(dev, vector)) {
+            continue;
+        }
+
+        if (event_notifier_test_and_clear(notifier)) {
+            msix_set_pending(dev, vector);
+        }
+    }
+}
+
+static void ivshmem_watch_vector_notifier(IVShmemState *s, int vector)
+{
+    EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
+    int eventfd = event_notifier_get_fd(n);
+
+    assert(!s->msi_vectors[vector].pdev);
+    s->msi_vectors[vector].pdev = PCI_DEVICE(s);
+
+    qemu_set_fd_handler(eventfd, ivshmem_vector_notify,
+                        NULL, &s->msi_vectors[vector]);
+}
+
+static void ivshmem_unwatch_vector_notifier(IVShmemState *s, int vector)
+{
+    EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
+    int eventfd = event_notifier_get_fd(n);
+
+    if (!s->msi_vectors[vector].pdev) {
+        return;
+    }
+
+    qemu_set_fd_handler(eventfd, NULL, NULL, NULL);
+
+    s->msi_vectors[vector].pdev = NULL;
+}
+
+static void ivshmem_add_eventfd(IVShmemState *s, int posn, int i)
+{
+    memory_region_add_eventfd(&s->ivshmem_mmio,
+                              IVSHMEM_REG_DOORBELL,
+                              4,
+                              true,
+                              (posn << 16) | i,
+                              &s->peers[posn].eventfds[i]);
+}
+
+static void ivshmem_del_eventfd(IVShmemState *s, int posn, int i)
+{
+    memory_region_del_eventfd(&s->ivshmem_mmio,
+                              IVSHMEM_REG_DOORBELL,
+                              4,
+                              true,
+                              (posn << 16) | i,
+                              &s->peers[posn].eventfds[i]);
+}
+
+static void close_peer_eventfds(IVShmemState *s, int posn)
+{
+    int i, n;
+
+    assert(posn >= 0 && posn < s->nb_peers);
+    n = s->peers[posn].nb_eventfds;
+
+    if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
+        memory_region_transaction_begin();
+        for (i = 0; i < n; i++) {
+            ivshmem_del_eventfd(s, posn, i);
+        }
+        memory_region_transaction_commit();
+    }
+
+    for (i = 0; i < n; i++) {
+        event_notifier_cleanup(&s->peers[posn].eventfds[i]);
+    }
+
+    g_free(s->peers[posn].eventfds);
+    s->peers[posn].nb_eventfds = 0;
+}
+
+static void resize_peers(IVShmemState *s, int nb_peers)
+{
+    int old_nb_peers = s->nb_peers;
+    int i;
+
+    assert(nb_peers > old_nb_peers);
+    IVSHMEM_DPRINTF("bumping storage to %d peers\n", nb_peers);
+
+    s->peers = g_realloc(s->peers, nb_peers * sizeof(Peer));
+    s->nb_peers = nb_peers;
+
+    for (i = old_nb_peers; i < nb_peers; i++) {
+        s->peers[i].eventfds = NULL;
+        s->peers[i].nb_eventfds = 0;
+    }
+}
+
+static void ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector, Error **errp)
+{
+    PCIDevice *pdev = PCI_DEVICE(s);
+
+    IVSHMEM_DPRINTF("ivshmem_add_kvm_msi_virq vector:%d\n", vector);
+    assert(!s->msi_vectors[vector].pdev);
+
+    s->msi_vectors[vector].unmasked = false;
+    s->msi_vectors[vector].pdev = pdev;
+}
+
+static void ivshmem_remove_kvm_msi_virq(IVShmemState *s, int vector)
+{
+    IVSHMEM_DPRINTF("ivshmem_remove_kvm_msi_virq vector:%d\n", vector);
+
+    if (s->msi_vectors[vector].pdev == NULL) {
+        return;
+    }
+
+    if (s->msi_vectors[vector].unmasked) {
+        ivshmem_irqfd_vector_mask(s->msi_vectors[vector].pdev, vector);
+    }
+
+    s->msi_vectors[vector].pdev = NULL;
+}
+
+static void process_msg_disconnect(IVShmemState *s, IvshmemPeerGone *msg,
+                                   Error **errp)
+{
+    if (msg->header.len < sizeof(*msg)) {
+        error_setg(errp, "Invalid peer-gone message size");
+        return;
+    }
+
+    le32_to_cpus(&msg->id);
+
+    IVSHMEM_DPRINTF("peer %d has gone away\n", msg->id);
+    if (msg->id >= s->nb_peers || msg->id == s->vm_id) {
+        error_setg(errp, "invalid peer %d", msg->id);
+        return;
+    }
+    close_peer_eventfds(s, msg->id);
+    event_notifier_set(&s->peers[s->vm_id].eventfds[0]);
+}
+
+static void process_msg_connect(IVShmemState *s, IvshmemEventFd *msg, int fd,
+                                Error **errp)
+{
+    Peer *peer;
+
+    if (msg->header.len < sizeof(*msg)) {
+        error_setg(errp, "Invalid eventfd message size");
+        close(fd);
+        return;
+    }
+
+    le32_to_cpus(&msg->id);
+    le32_to_cpus(&msg->vector);
+
+    if (msg->id >= s->nb_peers) {
+        resize_peers(s, msg->id + 1);
+    }
+
+    peer = &s->peers[msg->id];
+
+    /*
+     * The N-th connect message for this peer comes with the file
+     * descriptor for vector N-1.
+     */
+    if (msg->vector != peer->nb_eventfds) {
+        error_setg(errp, "Received vector %d out of order", msg->vector);
+        close(fd);
+        return;
+    }
+    if (peer->nb_eventfds >= s->vectors) {
+        error_setg(errp, "Too many eventfd received, device has %d vectors",
+                   s->vectors);
+        close(fd);
+        return;
+    }
+    peer->nb_eventfds++;
+
+    if (msg->vector == 0)
+        peer->eventfds = g_new0(EventNotifier, s->vectors);
+
+    IVSHMEM_DPRINTF("eventfds[%d][%d] = %d\n", msg->id, msg->vector, fd);
+    event_notifier_init_fd(&peer->eventfds[msg->vector], fd);
+    fcntl_setfl(fd, O_NONBLOCK); /* msix/irqfd poll non block */
+
+    if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
+        ivshmem_add_eventfd(s, msg->id, msg->vector);
+    }
+
+    if (msg->id == s->vm_id) {
+        ivshmem_watch_vector_notifier(s, peer->nb_eventfds - 1);
+    }
+}
+
+static int ivshmem_can_receive(void *opaque)
+{
+    IVShmemState *s = opaque;
+
+    assert(s->msg_buffered_bytes < sizeof(s->msg_buf));
+    return sizeof(s->msg_buf) - s->msg_buffered_bytes;
+}
+
+static void ivshmem_read(void *opaque, const uint8_t *buf, int size)
+{
+    IVShmemState *s = opaque;
+    IvshmemMsgHeader *header = (IvshmemMsgHeader *)&s->msg_buf;
+    Error *err = NULL;
+    int fd;
+
+    assert(size >= 0 && s->msg_buffered_bytes + size <= sizeof(s->msg_buf));
+    memcpy(s->msg_buf + s->msg_buffered_bytes, buf, size);
+    s->msg_buffered_bytes += size;
+    if (s->msg_buffered_bytes < sizeof(*header) ||
+        s->msg_buffered_bytes < le32_to_cpu(header->len)) {
+        return;
+    }
+
+    fd = qemu_chr_fe_get_msgfd(&s->server_chr);
+
+    le32_to_cpus(&header->type);
+    le32_to_cpus(&header->len);
+
+    switch (header->type) {
+    case IVSHMEM_MSG_EVENT_FD:
+        process_msg_connect(s, (IvshmemEventFd *)header, fd, &err);
+        break;
+    case IVSHMEM_MSG_PEER_GONE:
+        process_msg_disconnect(s, (IvshmemPeerGone *)header, &err);
+        break;
+    default:
+        error_setg(&err, "invalid message, type %d", header->type);
+        break;
+    }
+    if (err) {
+        error_report_err(err);
+    }
+
+    s->msg_buffered_bytes -= header->len;
+    memmove(s->msg_buf, s->msg_buf + header->len, s->msg_buffered_bytes);
+}
+
+static void ivshmem_recv_setup(IVShmemState *s, Error **errp)
+{
+    IvshmemInitialInfo msg;
+    struct stat buf;
+    uint8_t dummy;
+    int fd, n, ret;
+
+    n = 0;
+    do {
+        ret = qemu_chr_fe_read_all(&s->server_chr, (uint8_t *)&msg + n,
+                                   sizeof(msg) - n);
+        if (ret < 0) {
+            if (ret == -EINTR) {
+                continue;
+            }
+            error_setg_errno(errp, -ret, "read from server failed");
+            return;
+        }
+        n += ret;
+    } while (n < sizeof(msg));
+
+    fd = qemu_chr_fe_get_msgfd(&s->server_chr);
+
+    le32_to_cpus(&msg.header.type);
+    le32_to_cpus(&msg.header.len);
+    if (msg.header.type != IVSHMEM_MSG_INIT || msg.header.len < sizeof(msg)) {
+        error_setg(errp, "server sent invalid initial info");
+        return;
+    }
+
+    /* consume additional bytes of message */
+    msg.header.len -= sizeof(msg);
+    while (msg.header.len > 0) {
+        ret = qemu_chr_fe_read_all(&s->server_chr, &dummy, 1);
+        if (ret < 0) {
+            if (ret == -EINTR) {
+                continue;
+            }
+            error_setg_errno(errp, -ret, "read from server failed");
+            return;
+        }
+        msg.header.len -= ret;
+    }
+
+    le32_to_cpus(&msg.compatible_version);
+    if (msg.compatible_version != IVSHMEM_PROTOCOL_VERSION) {
+        error_setg(errp, "server sent compatible version %u, expecting %u",
+                   msg.compatible_version, IVSHMEM_PROTOCOL_VERSION);
+        return;
+    }
+
+    le32_to_cpus(&msg.id);
+    if (msg.id > IVSHMEM_MAX_PEERS) {
+        error_setg(errp, "server sent invalid ID");
+        return;
+    }
+    s->vm_id = msg.id;
+
+    if (fstat(fd, &buf) < 0) {
+        error_setg_errno(errp, errno,
+            "can't determine size of shared memory sent by server");
+        close(fd);
+        return;
+    }
+
+    s->shmem_sz = buf.st_size;
+
+    s->shmem = mmap(NULL, s->shmem_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
+                    fd, 0);
+    if (s->shmem == MAP_FAILED) {
+        error_setg_errno(errp, errno,
+                         "can't map shared memory sent by server");
+        return;
+    }
+
+    le32_to_cpus(&msg.vectors);
+    if (msg.vectors < 1 || msg.vectors > 1024) {
+        error_setg(errp, "server sent invalid number of vectors message");
+        return;
+    }
+    s->vectors = msg.vectors;
+
+    s->max_peers = le32_to_cpu(msg.max_peers);
+    s->protocol = le32_to_cpu(msg.protocol);
+    s->output_section_sz = le64_to_cpu(msg.output_section_size);
+}
+
+/* Select the MSI-X vectors used by device.
+ * ivshmem maps events to vectors statically, so
+ * we just enable all vectors on init and after reset. */
+static void ivshmem_msix_vector_use(IVShmemState *s)
+{
+    PCIDevice *d = PCI_DEVICE(s);
+    int i;
+
+    for (i = 0; i < s->vectors; i++) {
+        msix_vector_use(d, i);
+    }
+}
+
+static void ivshmem_reset(DeviceState *d)
+{
+    IVShmemState *s = IVSHMEM(d);
+
+    ivshmem_disable_irqfd(s);
+
+    s->intctrl = 0;
+    ivshmem_write_state(s, 0);
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        ivshmem_msix_vector_use(s);
+    }
+}
+
+static int ivshmem_setup_interrupts(IVShmemState *s, Error **errp)
+{
+    /* allocate QEMU callback data for receiving interrupts */
+    s->msi_vectors = g_malloc0(s->vectors * sizeof(MSIVector));
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        if (msix_init_exclusive_bar(PCI_DEVICE(s), s->vectors, 1, errp)) {
+            IVSHMEM_DPRINTF("msix requested but not available - disabling\n");
+            s->features &= ~(IVSHMEM_MSI | IVSHMEM_IOEVENTFD);
+        } else {
+            IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
+            ivshmem_msix_vector_use(s);
+        }
+    }
+
+    return 0;
+}
+
+static void ivshmem_enable_irqfd(IVShmemState *s)
+{
+    PCIDevice *pdev = PCI_DEVICE(s);
+    int i;
+
+    for (i = 0; i < s->peers[s->vm_id].nb_eventfds; i++) {
+        Error *err = NULL;
+
+        ivshmem_unwatch_vector_notifier(s, i);
+
+        ivshmem_add_kvm_msi_virq(s, i, &err);
+        if (err) {
+            error_report_err(err);
+            goto undo;
+        }
+    }
+
+    if (msix_set_vector_notifiers(pdev,
+                                  ivshmem_irqfd_vector_unmask,
+                                  ivshmem_irqfd_vector_mask,
+                                  ivshmem_irqfd_vector_poll)) {
+        error_report("ivshmem: msix_set_vector_notifiers failed");
+        goto undo;
+    }
+    return;
+
+undo:
+    while (--i >= 0) {
+        ivshmem_remove_kvm_msi_virq(s, i);
+    }
+}
+
+static void ivshmem_disable_irqfd(IVShmemState *s)
+{
+    PCIDevice *pdev = PCI_DEVICE(s);
+    int i;
+
+    if (!pdev->msix_vector_use_notifier) {
+        return;
+    }
+
+    msix_unset_vector_notifiers(pdev);
+
+    for (i = 0; i < s->peers[s->vm_id].nb_eventfds; i++) {
+        ivshmem_remove_kvm_msi_virq(s, i);
+        ivshmem_watch_vector_notifier(s, i);
+    }
+
+}
+
+static void ivshmem_write_config(PCIDevice *pdev, uint32_t address,
+                                 uint32_t val, int len)
+{
+    IVShmemState *s = IVSHMEM(pdev);
+    bool was_usable = ivshmem_irqfd_usable(s);
+
+    pci_default_write_config(pdev, address, val, len);
+    ivshmem_update_irqfd(s, was_usable);
+}
+
+static void ivshmem_exit(PCIDevice *dev)
+{
+    IVShmemState *s = IVSHMEM(dev);
+    int i;
+
+    if (s->migration_blocker) {
+        migrate_del_blocker(s->migration_blocker);
+        error_free(s->migration_blocker);
+    }
+
+    if (memory_region_is_mapped(&s->rw_section)) {
+        void *addr = memory_region_get_ram_ptr(&s->rw_section);
+        int fd;
+
+        if (munmap(addr, memory_region_size(&s->rw_section) == -1)) {
+            error_report("Failed to munmap shared memory %s",
+                         strerror(errno));
+        }
+
+        fd = memory_region_get_fd(&s->rw_section);
+        close(fd);
+
+        vmstate_unregister_ram(&s->state_tab, DEVICE(dev));
+        vmstate_unregister_ram(&s->rw_section, DEVICE(dev));
+    }
+
+    if (s->peers) {
+        for (i = 0; i < s->nb_peers; i++) {
+            close_peer_eventfds(s, i);
+        }
+        g_free(s->peers);
+    }
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        msix_uninit_exclusive_bar(dev);
+    }
+
+    g_free(s->msi_vectors);
+}
+
+static int ivshmem_pre_load(void *opaque)
+{
+    IVShmemState *s = opaque;
+
+    if (!ivshmem_is_master(s)) {
+        error_report("'peer' devices are not migratable");
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int ivshmem_post_load(void *opaque, int version_id)
+{
+    IVShmemState *s = opaque;
+
+    if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        ivshmem_msix_vector_use(s);
+    }
+    return 0;
+}
+
+static const VMStateDescription ivshmem_vmsd = {
+    .name = TYPE_IVSHMEM,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_load = ivshmem_pre_load,
+    .post_load = ivshmem_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(parent_obj, IVShmemState),
+        VMSTATE_MSIX(parent_obj, IVShmemState),
+        VMSTATE_UINT32(state, IVShmemState),
+        VMSTATE_UINT32(intctrl, IVShmemState),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static Property ivshmem_properties[] = {
+    DEFINE_PROP_CHR("chardev", IVShmemState, server_chr),
+    DEFINE_PROP_BIT("ioeventfd", IVShmemState, features, IVSHMEM_IOEVENTFD,
+                    true),
+    DEFINE_PROP_ON_OFF_AUTO("master", IVShmemState, master, ON_OFF_AUTO_OFF),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void ivshmem_init(Object *obj)
+{
+    IVShmemState *s = IVSHMEM(obj);
+
+    s->features |= (1 << IVSHMEM_MSI);
+}
+
+static void ivshmem_realize(PCIDevice *dev, Error **errp)
+{
+    IVShmemState *s = IVSHMEM(dev);
+    Chardev *chr = qemu_chr_fe_get_driver(&s->server_chr);
+    size_t rw_section_sz, input_sections_sz;
+    IVShmemVndrCap *vndr_cap;
+    Error *err = NULL;
+    uint8_t *pci_conf;
+    int offset, priv_ctrl_pos;
+    off_t shmem_pos;
+
+    if (!qemu_chr_fe_backend_connected(&s->server_chr)) {
+        error_setg(errp, "You must specify a 'chardev'");
+        return;
+    }
+
+    /* IRQFD requires MSI */
+    if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD) &&
+        !ivshmem_has_feature(s, IVSHMEM_MSI)) {
+        error_setg(errp, "ioeventfd/irqfd requires MSI");
+        return;
+    }
+
+    pci_conf = dev->config;
+    pci_conf[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+
+    memory_region_init_io(&s->ivshmem_mmio, OBJECT(s), &ivshmem_mmio_ops, s,
+                          "ivshmem.mmio", IVSHMEM_REG_BAR_SIZE);
+
+    /* region for registers*/
+    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                     &s->ivshmem_mmio);
+
+    assert(chr);
+    IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
+                    chr->filename);
+
+    /*
+     * Receive setup messages from server synchronously.
+     * Older versions did it asynchronously, but that creates a
+     * number of entertaining race conditions.
+     */
+    ivshmem_recv_setup(s, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    /* we allocate enough space for 16 peers and grow as needed */
+    resize_peers(s, 16);
+
+    if (s->master == ON_OFF_AUTO_ON && s->vm_id != 0) {
+        error_setg(errp,
+                   "Master must connect to the server before any peers");
+        return;
+    }
+
+    qemu_chr_fe_set_handlers(&s->server_chr, ivshmem_can_receive,
+                             ivshmem_read, NULL, NULL, s, NULL, true);
+
+    if (ivshmem_setup_interrupts(s, errp) < 0) {
+        error_prepend(errp, "Failed to initialize interrupts: ");
+        return;
+    }
+
+    memory_region_init(&s->ivshmem_bar2, OBJECT(s), "ivshmem.bar2",
+                       s->shmem_sz);
+
+    input_sections_sz = s->output_section_sz * s->max_peers;
+    if (input_sections_sz + 4096 > s->shmem_sz) {
+        error_setg(errp,
+                   "Invalid output section size, shared memory too small");
+        return;
+    }
+    rw_section_sz = s->shmem_sz - input_sections_sz - 4096;
+
+    shmem_pos = 0;
+    memory_region_init_ram_ptr(&s->state_tab, OBJECT(s), "ivshmem.state",
+                               4096, s->shmem + shmem_pos);
+    memory_region_set_readonly(&s->state_tab, true);
+    memory_region_add_subregion(&s->ivshmem_bar2, shmem_pos, &s->state_tab);
+
+    vmstate_register_ram(&s->state_tab, DEVICE(s));
+
+    if (rw_section_sz > 0) {
+        shmem_pos += 4096;
+        memory_region_init_ram_ptr(&s->rw_section, OBJECT(s),
+                                   "ivshmem.rw-section",
+                                   rw_section_sz, s->shmem + shmem_pos);
+        memory_region_add_subregion(&s->ivshmem_bar2, shmem_pos,
+                                    &s->rw_section);
+
+        vmstate_register_ram(&s->rw_section, DEVICE(s));
+    }
+
+    if (s->output_section_sz > 0) {
+        shmem_pos += rw_section_sz;
+        memory_region_init_ram_ptr(&s->input_sections, OBJECT(s),
+                                   "ivshmem.input-sections", input_sections_sz,
+                                   s->shmem + shmem_pos);
+        memory_region_set_readonly(&s->input_sections, true);
+        memory_region_add_subregion(&s->ivshmem_bar2, shmem_pos,
+                                    &s->input_sections);
+
+        shmem_pos += s->vm_id * s->output_section_sz;
+        memory_region_init_ram_ptr(&s->output_section, OBJECT(s),
+                                   "ivshmem.output-section",
+                                   s->output_section_sz, s->shmem + shmem_pos);
+        memory_region_add_subregion_overlap(&s->ivshmem_bar2, shmem_pos,
+                                            &s->output_section, 1);
+
+        vmstate_register_ram(&s->input_sections, DEVICE(s));
+    }
+
+    pci_config_set_class(dev->config, 0xff00 | (s->protocol >> 8));
+    pci_config_set_prog_interface(dev->config, (uint8_t)s->protocol);
+
+    offset = pci_add_capability(dev, PCI_CAP_ID_VNDR, 0, 0x18,
+                                &error_abort);
+    vndr_cap = (IVShmemVndrCap *)(pci_conf + offset);
+    vndr_cap->length = 0x18;
+    vndr_cap->state_tab_sz = cpu_to_le32(4096);
+    vndr_cap->rw_section_sz = cpu_to_le64(rw_section_sz);
+    vndr_cap->output_section_sz = s->output_section_sz;
+
+    priv_ctrl_pos = offset + offsetof(IVShmemVndrCap, priv_ctrl);
+    s->priv_ctrl = &dev->config[priv_ctrl_pos];
+    dev->wmask[priv_ctrl_pos] |= IVSHMEM_ONESHOT_MODE;
+
+    if (s->master == ON_OFF_AUTO_AUTO) {
+        s->master = s->vm_id == 0 ? ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
+    }
+
+    if (!ivshmem_is_master(s)) {
+        error_setg(&s->migration_blocker,
+                   "Migration is disabled when using feature 'peer mode' in device 'ivshmem'");
+        migrate_add_blocker(s->migration_blocker, &err);
+        if (err) {
+            error_propagate(errp, err);
+            error_free(s->migration_blocker);
+            return;
+        }
+    }
+
+    pci_register_bar(PCI_DEVICE(s), 2,
+                     PCI_BASE_ADDRESS_SPACE_MEMORY |
+                     PCI_BASE_ADDRESS_MEM_PREFETCH |
+                     PCI_BASE_ADDRESS_MEM_TYPE_64,
+                     &s->ivshmem_bar2);
+}
+
+static void ivshmem_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+    k->realize = ivshmem_realize;
+    k->exit = ivshmem_exit;
+    k->config_write = ivshmem_write_config;
+    k->vendor_id = PCI_VENDOR_ID_IVSHMEM;
+    k->device_id = PCI_DEVICE_ID_IVSHMEM;
+    k->revision = 2;
+    dc->reset = ivshmem_reset;
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    dc->desc = "Inter-VM shared memory v2";
+
+    dc->props = ivshmem_properties;
+    dc->vmsd = &ivshmem_vmsd;
+}
+
+static const TypeInfo ivshmem_info = {
+    .name          = TYPE_IVSHMEM,
+    .parent        = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(IVShmemState),
+    .instance_init = ivshmem_init,
+    .class_init    = ivshmem_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
+        { },
+    },
+};
+
+static void ivshmem_register_types(void)
+{
+    type_register_static(&ivshmem_info);
+}
+
+type_init(ivshmem_register_types)
diff --git a/include/hw/misc/ivshmem2.h b/include/hw/misc/ivshmem2.h
new file mode 100644
index 0000000000..ee994699e8
--- /dev/null
+++ b/include/hw/misc/ivshmem2.h
@@ -0,0 +1,48 @@
+/*
+ * Inter-VM Shared Memory PCI device, Version 2.
+ *
+ * Copyright (c) Siemens AG, 2019
+ *
+ * Authors:
+ *  Jan Kiszka <jan.kiszka@siemens.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.
+ */
+#ifndef IVSHMEM2_H
+#define IVSHMEM2_H
+
+#define IVSHMEM_PROTOCOL_VERSION    2
+
+#define IVSHMEM_MSG_INIT            0
+#define IVSHMEM_MSG_EVENT_FD        1
+#define IVSHMEM_MSG_PEER_GONE       2
+
+typedef struct IvshmemMsgHeader {
+    uint32_t type;
+    uint32_t len;
+} IvshmemMsgHeader;
+
+typedef struct IvshmemInitialInfo {
+    IvshmemMsgHeader header;
+    uint32_t version;
+    uint32_t compatible_version;
+    uint32_t id;
+    uint32_t max_peers;
+    uint32_t vectors;
+    uint32_t protocol;
+    uint64_t output_section_size;
+} IvshmemInitialInfo;
+
+typedef struct IvshmemEventFd {
+    IvshmemMsgHeader header;
+    uint32_t id;
+    uint32_t vector;
+} IvshmemEventFd;
+
+typedef struct IvshmemPeerGone {
+    IvshmemMsgHeader header;
+    uint32_t id;
+} IvshmemPeerGone;
+
+#endif /* IVSHMEM2_H */
-- 
2.16.4



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
  2019-11-11 12:57 ` [RFC][PATCH 1/3] hw/misc: Add implementation of ivshmem revision 2 device Jan Kiszka
@ 2019-11-11 12:57 ` Jan Kiszka
  2019-11-11 13:45   ` Michael S. Tsirkin
  2019-12-05 11:14   ` Markus Armbruster
  2019-11-11 12:57 ` [RFC][PATCH 3/3] contrib: Add server for ivshmem " Jan Kiszka
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 12:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

From: Jan Kiszka <jan.kiszka@siemens.com>

This imports the ivshmem v2 specification draft from Jailhouse. Its
final home is to be decided, this shall just simplify the review process
at this stage.

Note that specifically the Features register (offset 08h) is still under
consideration. In particular, its bit 0 seems useless now as its benefit
to guests, specifically when they want to be portable, is close to zero.
Maybe the register should still be kept, with all bits RsvdZ, for easier
extensibility.

The rest appears now rather mature and reasonably implementable, as
proven also by a version for Jailhouse.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 docs/specs/ivshmem-2-device-spec.md | 333 ++++++++++++++++++++++++++++++++++++
 1 file changed, 333 insertions(+)
 create mode 100644 docs/specs/ivshmem-2-device-spec.md

diff --git a/docs/specs/ivshmem-2-device-spec.md b/docs/specs/ivshmem-2-device-spec.md
new file mode 100644
index 0000000000..98cfde585a
--- /dev/null
+++ b/docs/specs/ivshmem-2-device-spec.md
@@ -0,0 +1,333 @@
+IVSHMEM Device Specification
+============================
+
+** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
+
+The Inter-VM Shared Memory device provides the following features to its users:
+
+- Interconnection between up to 65536 peers
+
+- Multi-purpose shared memory region
+
+    - common read/writable section
+
+    - unidirectional sections that are read/writable for one peer and only
+      readable for the others
+
+    - section with peer states
+
+- Event signaling via interrupt to remote sides
+
+- Support for life-cycle management via state value exchange and interrupt
+  notification on changes, backed by a shared memory section
+
+- Free choice of protocol to be used on top
+
+- Protocol type declaration
+
+- Unprivileged access to memory-mapped or I/O registers feasible
+
+- Discoverable and configurable via standard PCI mechanisms
+
+
+Hypervisor Model
+----------------
+
+In order to provide a consistent link between peers, all connected instances of
+IVSHMEM devices need to be configured, created and run by the hypervisor
+according to the following requirements:
+
+- The instances of the device need to be accessible via PCI programming
+  interfaces on all sides.
+
+- The read/write shared memory section has to be of the same size for all
+  peers and, if non-zero, has to reflect the same memory content for them.
+
+- If output sections are present (non-zero section size), there must be one
+  reserved for each peer with exclusive write access. All output sections
+  must have the same size and must be readable for all peers. They have to
+  reflect the same memory content for all peers.
+
+- The State Table must have the same size for all peers, must be large enough to
+  hold a state values of all peers, and must be read-only for the user.
+
+- State register changes (explicit writes, peer resets) have to be propagated
+  to the other peers by updating the corresponding State Table entry and issuing
+  an interrupt to all other peers if they enabled reception.
+
+- Interrupts events triggered by a peer have to be delivered to the target peer,
+  provided the receiving side is valid and has enabled the reception.
+
+- All peers must have the same interrupt delivery features available, i.e. MSI-X
+  with the same maximum number of vectors on platforms supporting this
+  mechanism, otherwise INTx with one vector.
+
+
+Guest-side Programming Model
+----------------------------
+
+An IVSHMEM device appears as a PCI device to its users. Unless otherwise noted,
+it conforms to the PCI Local Bus Specification, Revision 3.0 As such, it is
+discoverable via the PCI configuration space and provides a number of standard
+and custom PCI configuration registers.
+
+### Shared Memory Region Layout
+
+The shared memory region is divided into several sections.
+
+    +-----------------------------+   -
+    |                             |   :
+    | Output Section for peer n-1 |   : Output Section Size
+    |     (n = Maximum Peers)     |   :
+    +-----------------------------+   -
+    :                             :
+    :                             :
+    :                             :
+    +-----------------------------+   -
+    |                             |   :
+    |  Output Section for peer 1  |   : Output Section Size
+    |                             |   :
+    +-----------------------------+   -
+    |                             |   :
+    |  Output Section for peer 0  |   : Output Section Size
+    |                             |   :
+    +-----------------------------+   -
+    |                             |   :
+    |     Read/Write Section      |   : R/W Section Size
+    |                             |   :
+    +-----------------------------+   -
+    |                             |   :
+    |         State Table         |   : State Table Size
+    |                             |   :
+    +-----------------------------+   <-- Shared memory base address
+
+The first section consists of the mandatory State Table. Its size is defined by
+the State Table Size register and cannot be zero. This section is read-only for
+all peers.
+
+The second section consists of shared memory that is read/writable for all
+peers. Its size is defined by the R/W Section Size register. A size of zero is
+permitted.
+
+The third and following sections are unidirectional output sections, one for
+each peer. Their sizes are all identical. The size of a single output section is
+defined by the Output Section Size register. An output section is read/writable
+for the corresponding peer and read-only for all other peers. E.g., only the
+peer with ID 3 can write to the fourths output section, but all peers can read
+from this section.
+
+All sizes have to be rounded up to multiples of a mappable page in order to
+allow access control according to the section restrictions.
+
+### Configuration Space Registers
+
+#### Header Registers
+
+| Offset | Register               | Content                                              |
+|-------:|:-----------------------|:-----------------------------------------------------|
+|    00h | Vendor ID              | 1AF4h                                                |
+|    02h | Device ID              | 1110h                                                |
+|    04h | Command Register       | 0000h on reset, implementing bits 1, 2, 10           |
+|    06h | Status Register        | 0010h, static value (bit 3 not implemented)          |
+|    08h | Revision ID            | 02h                                                  |
+|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
+|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
+|    0Bh | Class Code, Base Class | FFh                                                  |
+|    0Eh | Header Type            | 00h                                                  |
+|    10h | BAR 0                  | MMIO or I/O register region                          |
+|    14h | BAR 1                  | MSI-X region                                         |
+|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
+|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
+|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
+|    34h | Capability Pointer     | First capability                                     |
+|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
+
+If BAR 2 is not present, the shared memory region is not relocatable by the
+user. In that case, the hypervisor has to implement the Base Address register in
+the vendor-specific capability.
+
+Other header registers may not be implemented. If not implemented, they return 0
+on read and ignore write accesses.
+
+#### Vendor Specific Capability (ID 09h)
+
+| Offset | Register            | Content                                        |
+|-------:|:--------------------|:-----------------------------------------------|
+|    00h | ID                  | 09h                                            |
+|    01h | Next Capability     | Pointer to next capability or 00h              |
+|    02h | Length              | 18h or 20h                                     |
+|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
+|        |                     | Bits 1-7: RsvdZ                                |
+|    04h | State Table Size    | 32-bit size of read-only State Table           |
+|    08h | R/W Section Size    | 64-bit size of common read/write section       |
+|    10h | Output Section Size | 64-bit size of unidirectional output sections  |
+|    18h | Base Address        | optional: 64-bit base address of shared memory |
+
+All registers are read-only, except for bit 0 of the Privileged Control
+register.
+
+When bit 0 in the Privileged Control register is set to 1, the device clears
+bit 0 in the Interrupt Control register on each interrupt delivery. This enables
+automatic interrupt throttling when re-enabling shall be performed by a
+scheduled unprivileged instance on the user side.
+
+If an IVSHMEM device does not support a relocatable shared memory region, BAR 2
+must not be implemented by the provider. Instead, the Base Address register has
+to be implemented to report the location of the shared memory region in the
+user's address space.
+
+A non-existing shared memory section has to report zero in its Section Size
+register.
+
+#### MSI-X Capability (ID 11h)
+
+On platforms supporting MSI-X, IVSHMEM has to provide interrupt delivery via
+this mechanism. In that case, the legacy INTx delivery mechanism is not
+available, and the Interrupt Pin configuration register returns 0.
+
+The IVSHMEM device has no notion of pending interrupts. Therefore, reading from
+the MSI-X Pending Bit Array will always return 0.
+
+The corresponding MSI-X MMIO region is configured via BAR 1.
+
+The MSI-X table size reported by the MSI-X capability structure is identical for
+all peers.
+
+### Register Region
+
+The register region may be implemented as MMIO or I/O.
+
+When implementing it as MMIO, the hypervisor has to ensure that the register
+region can be mapped as a single page into the address space of the user. Write
+accesses to MMIO region offsets that are not backed by registers have to be
+ignored, read accesses have to return 0. This enables the user to hand out the
+complete region, along with the shared memory, to an unprivileged instance.
+
+The region location in the user's physical address space is configured via BAR
+0. The following table visualizes the region layout:
+
+| Offset | Register                                                            |
+|-------:|:--------------------------------------------------------------------|
+|    00h | ID                                                                  |
+|    04h | Maximum Peers                                                       |
+|    08h | Features                                                            |
+|    0Ch | Interrupt Control                                                   |
+|    10h | Doorbell                                                            |
+|    14h | State                                                               |
+
+All registers support only aligned 32-bit accesses.
+
+#### ID Register (Offset 00h)
+
+Read-only register that reports the ID of the local device. It is unique for all
+of the connected devices and remains unchanged over their lifetime.
+
+#### Maximum Peers Register (Offset 04h)
+
+Read-only register that reports the maximum number of possible peers (including
+the local one). The supported range is between 2 and 65536 and remains constant
+over the lifetime of all peers.
+
+#### Features Register (Offset 08h)
+
+Read-only register that reports features of the local device or the connected
+peers. Its content remains constant over the lifetime of all peers.
+
+| Bits | Content                                                               |
+|-----:|:----------------------------------------------------------------------|
+|    0 | 1: Synchronized shared memory base address                            |
+| 1-31 | RsvdZ                                                                 |
+
+If "synchronized shared memory base address" is reported (bit 0 is set), the
+shared memory region is mapped at the same address into the user address spaces
+of all connected peers. Thus, peers can use physical addresses as pointers when
+exchanging information via the shared memory. This feature flag is never set
+when the shared memory region is relocatable via BAR 2.
+
+#### Interrupt Control Register (Offset 0Ch)
+
+This read/write register controls the generation of interrupts whenever a peer
+writes to the Doorbell register or changes its state.
+
+| Bits | Content                                                               |
+|-----:|:----------------------------------------------------------------------|
+|    0 | 1: Enable interrupt generation                                        |
+| 1-31 | RsvdZ                                                                 |
+
+Note that bit 0 is reset to 0 on interrupt delivery if one-shot interrupt mode
+is enabled in the Enhanced Features register.
+
+The value of this register after device reset is 0.
+
+#### Doorbell Register (Offset 10h)
+
+Write-only register that triggers an interrupt vector in the target device if it
+is enabled there.
+
+| Bits  | Content                                                              |
+|------:|:---------------------------------------------------------------------|
+|  0-15 | Vector number                                                        |
+| 16-31 | Target ID                                                            |
+
+Writing a vector number that is not enabled by the target has no effect. The
+peers can derive the number of available vectors from their own device
+capabilities and are expected to define or negotiate the used ones via the
+selected protocol.
+
+Addressing a non-existing or inactive target has no effect. Peers can identify
+active targets via the State Table.
+
+The behavior on reading from this register is undefined.
+
+#### State Register (Offset 14h)
+
+Read/write register that defines the state of the local device. Writing to this
+register sets the state and triggers interrupt vector 0 on the remote device if
+the written state value differs from the previous one. The user of the remote
+device can read the value written to this register from the State Table.
+
+The value of this register after device reset is 0.
+
+### State Table
+
+The State Table is a read-only section at the beginning of the shared memory
+region. It contains a 32-bit state value for each of the peers. Locating the
+table in shared memory allows fast checking of remote states without register
+accesses.
+
+The table is updated on each state change of a peers. Whenever a user of an
+IVSHMEM device writes a value to the Local State register, this value is copied
+into the corresponding entry of the State Table. When a IVSHMEM device is reset
+or disconnected from the other peers, zero is written into the corresponding
+table entry. The initial content of the table is all zeros.
+
+    +--------------------------------+
+    | 32-bit state value of peer n-1 |
+    +--------------------------------+
+    :                                :
+    +--------------------------------+
+    | 32-bit state value of peer 1   |
+    +--------------------------------+
+    | 32-bit state value of peer 0   |
+    +--------------------------------+ <-- Shared memory base address
+
+
+Protocols
+---------
+
+The IVSHMEM device shall support the peers of a connection in agreeing on the
+protocol used over the shared memory devices. For that purpose, the interface
+byte (offset 09h) and the sub-class byte (offset 0Ah) of the Class Code register
+encodes a 16-bit protocol type for the users. The following type values are
+defined:
+
+| Protocol Type | Description                                                  |
+|--------------:|:-------------------------------------------------------------|
+|         0000h | Undefined type                                               |
+|         0001h | Virtual peer-to-peer Ethernet                                |
+|   0002h-3FFFh | Reserved                                                     |
+|   4000h-7FFFh | User-defined protocols                                       |
+|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
+|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
+
+Details of the protocols are not in the scope of this specification.
-- 
2.16.4



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH 3/3] contrib: Add server for ivshmem revision 2
  2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
  2019-11-11 12:57 ` [RFC][PATCH 1/3] hw/misc: Add implementation of ivshmem revision 2 device Jan Kiszka
  2019-11-11 12:57 ` [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2 Jan Kiszka
@ 2019-11-11 12:57 ` " Jan Kiszka
  2019-11-12  0:56 ` [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU no-reply
  2019-11-27 15:28 ` Liang Yan
  4 siblings, 0 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 12:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

From: Jan Kiszka <jan.kiszka@siemens.com>

This implements the server process for ivshmem v2 device models of QEMU.
Again, no effort has been spent yet on sharing code with the v1 server.
Parts have been copied, others were rewritten.

In addition to parameters of v1, this server now also specifies

 - the maximum number of peers to be connected (required to know in
   advance because of v2's state table)
 - the size of the output sections (can be 0)
 - the protocol ID to be published to all peers

When a virtio protocol ID is chosen, only 2 peers can be connected.
Furthermore, the server will signal the backend variant of the ID to the
master instance and the frontend ID to the slave peer.

To start, e.g., a server that allows virtio console over ivshmem, call

ivshmem2-server -F -l 64K -n 2 -V 3 -P 0x8003

TODO: specify the new server protocol.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 Makefile                                  |   3 +
 Makefile.objs                             |   1 +
 configure                                 |   1 +
 contrib/ivshmem2-server/Makefile.objs     |   1 +
 contrib/ivshmem2-server/ivshmem2-server.c | 462 ++++++++++++++++++++++++++++++
 contrib/ivshmem2-server/ivshmem2-server.h | 158 ++++++++++
 contrib/ivshmem2-server/main.c            | 313 ++++++++++++++++++++
 7 files changed, 939 insertions(+)
 create mode 100644 contrib/ivshmem2-server/Makefile.objs
 create mode 100644 contrib/ivshmem2-server/ivshmem2-server.c
 create mode 100644 contrib/ivshmem2-server/ivshmem2-server.h
 create mode 100644 contrib/ivshmem2-server/main.c

diff --git a/Makefile b/Makefile
index aa9d1a42aa..85b5e985ac 100644
--- a/Makefile
+++ b/Makefile
@@ -430,6 +430,7 @@ dummy := $(call unnest-vars,, \
                 elf2dmp-obj-y \
                 ivshmem-client-obj-y \
                 ivshmem-server-obj-y \
+                ivshmem2-server-obj-y \
                 rdmacm-mux-obj-y \
                 libvhost-user-obj-y \
                 vhost-user-scsi-obj-y \
@@ -662,6 +663,8 @@ ivshmem-client$(EXESUF): $(ivshmem-client-obj-y) $(COMMON_LDADDS)
 	$(call LINK, $^)
 ivshmem-server$(EXESUF): $(ivshmem-server-obj-y) $(COMMON_LDADDS)
 	$(call LINK, $^)
+ivshmem2-server$(EXESUF): $(ivshmem2-server-obj-y) $(COMMON_LDADDS)
+	$(call LINK, $^)
 endif
 vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)
diff --git a/Makefile.objs b/Makefile.objs
index 11ba1a36bd..a4a729d808 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -117,6 +117,7 @@ qga-vss-dll-obj-y = qga/
 elf2dmp-obj-y = contrib/elf2dmp/
 ivshmem-client-obj-$(CONFIG_IVSHMEM) = contrib/ivshmem-client/
 ivshmem-server-obj-$(CONFIG_IVSHMEM) = contrib/ivshmem-server/
+ivshmem2-server-obj-$(CONFIG_IVSHMEM) = contrib/ivshmem2-server/
 libvhost-user-obj-y = contrib/libvhost-user/
 vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
diff --git a/configure b/configure
index efe165edf9..1b466b2461 100755
--- a/configure
+++ b/configure
@@ -6184,6 +6184,7 @@ if test "$want_tools" = "yes" ; then
   fi
   if [ "$ivshmem" = "yes" ]; then
     tools="ivshmem-client\$(EXESUF) ivshmem-server\$(EXESUF) $tools"
+    tools="ivshmem2-server\$(EXESUF) $tools"
   fi
   if [ "$curl" = "yes" ]; then
       tools="elf2dmp\$(EXESUF) $tools"
diff --git a/contrib/ivshmem2-server/Makefile.objs b/contrib/ivshmem2-server/Makefile.objs
new file mode 100644
index 0000000000..d233e18ec8
--- /dev/null
+++ b/contrib/ivshmem2-server/Makefile.objs
@@ -0,0 +1 @@
+ivshmem2-server-obj-y = ivshmem2-server.o main.o
diff --git a/contrib/ivshmem2-server/ivshmem2-server.c b/contrib/ivshmem2-server/ivshmem2-server.c
new file mode 100644
index 0000000000..b341f1fcd0
--- /dev/null
+++ b/contrib/ivshmem2-server/ivshmem2-server.c
@@ -0,0 +1,462 @@
+/*
+ * Copyright 6WIND S.A., 2014
+ * Copyright (c) Siemens AG, 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.  See the COPYING file in the
+ * top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "qemu/sockets.h"
+#include "qemu/atomic.h"
+
+#include <sys/socket.h>
+#include <sys/un.h>
+
+#include "ivshmem2-server.h"
+
+/* log a message on stdout if verbose=1 */
+#define IVSHMEM_SERVER_DEBUG(server, fmt, ...) do { \
+        if ((server)->args.verbose) {         \
+            printf(fmt, ## __VA_ARGS__); \
+        }                                \
+    } while (0)
+
+/** maximum size of a huge page, used by ivshmem_server_ftruncate() */
+#define IVSHMEM_SERVER_MAX_HUGEPAGE_SIZE (1024 * 1024 * 1024)
+
+/** default listen backlog (number of sockets not accepted) */
+#define IVSHMEM_SERVER_LISTEN_BACKLOG 10
+
+/* send message to a client unix socket */
+static int ivshmem_server_send_msg(int sock_fd, void *payload, int len, int fd)
+{
+    int ret;
+    struct msghdr msg;
+    struct iovec iov[1];
+    union {
+        struct cmsghdr cmsg;
+        char control[CMSG_SPACE(sizeof(int))];
+    } msg_control;
+    struct cmsghdr *cmsg;
+
+    iov[0].iov_base = payload;
+    iov[0].iov_len = len;
+
+    memset(&msg, 0, sizeof(msg));
+    msg.msg_iov = iov;
+    msg.msg_iovlen = 1;
+
+    /* if fd is specified, add it in a cmsg */
+    if (fd >= 0) {
+        memset(&msg_control, 0, sizeof(msg_control));
+        msg.msg_control = &msg_control;
+        msg.msg_controllen = sizeof(msg_control);
+        cmsg = CMSG_FIRSTHDR(&msg);
+        cmsg->cmsg_level = SOL_SOCKET;
+        cmsg->cmsg_type = SCM_RIGHTS;
+        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+        memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
+    }
+
+    ret = sendmsg(sock_fd, &msg, 0);
+    if (ret <= 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static int ivshmem_server_send_event_fd(int sock_fd, int peer_id, int vector,
+                                        int fd)
+{
+    IvshmemEventFd msg = {
+        .header = {
+            .type = GUINT32_TO_LE(IVSHMEM_MSG_EVENT_FD),
+            .len = GUINT32_TO_LE(sizeof(msg)),
+        },
+        .id = GUINT32_TO_LE(peer_id),
+        .vector = GUINT32_TO_LE(vector),
+    };
+
+    return ivshmem_server_send_msg(sock_fd, &msg, sizeof(msg), fd);
+}
+
+/* free a peer when the server advertises a disconnection or when the
+ * server is freed */
+static void
+ivshmem_server_free_peer(IvshmemServer *server, IvshmemServerPeer *peer)
+{
+    unsigned vector;
+    IvshmemServerPeer *other_peer;
+    IvshmemPeerGone msg = {
+        .header = {
+            .type = GUINT32_TO_LE(IVSHMEM_MSG_PEER_GONE),
+            .len = GUINT32_TO_LE(sizeof(msg)),
+        },
+        .id = GUINT32_TO_LE(peer->id),
+    };
+
+    IVSHMEM_SERVER_DEBUG(server, "free peer %" PRId64 "\n", peer->id);
+    close(peer->sock_fd);
+    QTAILQ_REMOVE(&server->peer_list, peer, next);
+
+    server->state_table[peer->id] = 0;
+    smp_mb();
+
+    /* advertise the deletion to other peers */
+    QTAILQ_FOREACH(other_peer, &server->peer_list, next) {
+        ivshmem_server_send_msg(other_peer->sock_fd, &msg, sizeof(msg), -1);
+    }
+
+    for (vector = 0; vector < peer->vectors_count; vector++) {
+        event_notifier_cleanup(&peer->vectors[vector]);
+    }
+
+    g_free(peer);
+}
+
+/* send the peer id and the shm_fd just after a new client connection */
+static int
+ivshmem_server_send_initial_info(IvshmemServer *server, IvshmemServerPeer *peer)
+{
+    IvshmemInitialInfo msg = {
+        .header = {
+            .type = GUINT32_TO_LE(IVSHMEM_MSG_INIT),
+            .len = GUINT32_TO_LE(sizeof(msg)),
+        },
+        .version = GUINT32_TO_LE(IVSHMEM_PROTOCOL_VERSION),
+        .compatible_version = GUINT32_TO_LE(IVSHMEM_PROTOCOL_VERSION),
+        .id = GUINT32_TO_LE(peer->id),
+        .max_peers = GUINT32_TO_LE(server->args.max_peers),
+        .vectors = GUINT32_TO_LE(server->args.vectors),
+        .protocol = GUINT32_TO_LE(server->args.protocol),
+        .output_section_size = GUINT64_TO_LE(server->args.output_section_size),
+    };
+    unsigned virtio_protocol;
+    int ret;
+
+    if (server->args.protocol >= 0x8000) {
+        virtio_protocol = server->args.protocol & ~0x4000;
+        msg.protocol &= ~0x4000;
+        if (peer->id == 0) {
+            virtio_protocol |= 0x4000;
+        }
+        msg.protocol = GUINT32_TO_LE(virtio_protocol);
+    }
+
+    ret = ivshmem_server_send_msg(peer->sock_fd, &msg, sizeof(msg),
+                                  server->shm_fd);
+    if (ret < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot send initial info: %s\n",
+                             strerror(errno));
+        return -1;
+    }
+
+    return 0;
+}
+
+/* handle message on listening unix socket (new client connection) */
+static int
+ivshmem_server_handle_new_conn(IvshmemServer *server)
+{
+    IvshmemServerPeer *peer, *other_peer;
+    struct sockaddr_un unaddr;
+    socklen_t unaddr_len;
+    int newfd;
+    unsigned i;
+
+    /* accept the incoming connection */
+    unaddr_len = sizeof(unaddr);
+    newfd = qemu_accept(server->sock_fd,
+                        (struct sockaddr *)&unaddr, &unaddr_len);
+
+    if (newfd < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot accept() %s\n", strerror(errno));
+        return -1;
+    }
+
+    qemu_set_nonblock(newfd);
+    IVSHMEM_SERVER_DEBUG(server, "accept()=%d\n", newfd);
+
+    /* allocate new structure for this peer */
+    peer = g_malloc0(sizeof(*peer));
+    peer->sock_fd = newfd;
+
+    /* get an unused peer id */
+    /* XXX: this could use id allocation such as Linux IDA, or simply
+     * a free-list */
+    for (i = 0; i < G_MAXUINT16; i++) {
+        if (ivshmem_server_search_peer(server, i) == NULL) {
+            break;
+        }
+    }
+    if (i >= server->args.max_peers) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot allocate new client id\n");
+        close(newfd);
+        g_free(peer);
+        return -1;
+    }
+    peer->id = i;
+
+    /* create eventfd, one per vector */
+    peer->vectors_count = server->args.vectors;
+    for (i = 0; i < peer->vectors_count; i++) {
+        if (event_notifier_init(&peer->vectors[i], FALSE) < 0) {
+            IVSHMEM_SERVER_DEBUG(server, "cannot create eventfd\n");
+            goto fail;
+        }
+    }
+
+    /* send peer id and shm fd */
+    if (ivshmem_server_send_initial_info(server, peer) < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot send initial info\n");
+        goto fail;
+    }
+
+    /* advertise the new peer to others */
+    QTAILQ_FOREACH(other_peer, &server->peer_list, next) {
+        for (i = 0; i < peer->vectors_count; i++) {
+            ivshmem_server_send_event_fd(other_peer->sock_fd, peer->id, i,
+                                         peer->vectors[i].wfd);
+        }
+    }
+
+    /* advertise the other peers to the new one */
+    QTAILQ_FOREACH(other_peer, &server->peer_list, next) {
+        for (i = 0; i < peer->vectors_count; i++) {
+            ivshmem_server_send_event_fd(peer->sock_fd, other_peer->id, i,
+                                         other_peer->vectors[i].wfd);
+        }
+    }
+
+    /* advertise the new peer to itself */
+    for (i = 0; i < peer->vectors_count; i++) {
+        ivshmem_server_send_event_fd(peer->sock_fd, peer->id, i,
+                                     event_notifier_get_fd(&peer->vectors[i]));
+    }
+
+    QTAILQ_INSERT_TAIL(&server->peer_list, peer, next);
+    IVSHMEM_SERVER_DEBUG(server, "new peer id = %" PRId64 "\n",
+                         peer->id);
+    return 0;
+
+fail:
+    while (i--) {
+        event_notifier_cleanup(&peer->vectors[i]);
+    }
+    close(newfd);
+    g_free(peer);
+    return -1;
+}
+
+/* Try to ftruncate a file to next power of 2 of shmsize.
+ * If it fails; all power of 2 above shmsize are tested until
+ * we reach the maximum huge page size. This is useful
+ * if the shm file is in a hugetlbfs that cannot be truncated to the
+ * shm_size value. */
+static int
+ivshmem_server_ftruncate(int fd, unsigned shmsize)
+{
+    int ret;
+    struct stat mapstat;
+
+    /* align shmsize to next power of 2 */
+    shmsize = pow2ceil(shmsize);
+
+    if (fstat(fd, &mapstat) != -1 && mapstat.st_size == shmsize) {
+        return 0;
+    }
+
+    while (shmsize <= IVSHMEM_SERVER_MAX_HUGEPAGE_SIZE) {
+        ret = ftruncate(fd, shmsize);
+        if (ret == 0) {
+            return ret;
+        }
+        shmsize *= 2;
+    }
+
+    return -1;
+}
+
+/* Init a new ivshmem server */
+void ivshmem_server_init(IvshmemServer *server)
+{
+    server->sock_fd = -1;
+    server->shm_fd = -1;
+    server->state_table = NULL;
+    QTAILQ_INIT(&server->peer_list);
+}
+
+/* open shm, create and bind to the unix socket */
+int
+ivshmem_server_start(IvshmemServer *server)
+{
+    struct sockaddr_un sun;
+    int shm_fd, sock_fd, ret;
+    void *state_table;
+
+    /* open shm file */
+    if (server->args.use_shm_open) {
+        IVSHMEM_SERVER_DEBUG(server, "Using POSIX shared memory: %s\n",
+                             server->args.shm_path);
+        shm_fd = shm_open(server->args.shm_path, O_CREAT | O_RDWR, S_IRWXU);
+    } else {
+        gchar *filename = g_strdup_printf("%s/ivshmem.XXXXXX",
+                                          server->args.shm_path);
+        IVSHMEM_SERVER_DEBUG(server, "Using file-backed shared memory: %s\n",
+                             server->args.shm_path);
+        shm_fd = mkstemp(filename);
+        unlink(filename);
+        g_free(filename);
+    }
+
+    if (shm_fd < 0) {
+        fprintf(stderr, "cannot open shm file %s: %s\n", server->args.shm_path,
+                strerror(errno));
+        return -1;
+    }
+    if (ivshmem_server_ftruncate(shm_fd, server->args.shm_size) < 0) {
+        fprintf(stderr, "ftruncate(%s) failed: %s\n", server->args.shm_path,
+                strerror(errno));
+        goto err_close_shm;
+    }
+    state_table = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
+                       shm_fd, 0);
+    if (state_table == MAP_FAILED) {
+        fprintf(stderr, "mmap failed: %s\n", strerror(errno));
+        goto err_close_shm;
+    }
+
+    IVSHMEM_SERVER_DEBUG(server, "create & bind socket %s\n",
+                         server->args.unix_socket_path);
+
+    /* create the unix listening socket */
+    sock_fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (sock_fd < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot create socket: %s\n",
+                             strerror(errno));
+        goto err_unmap;
+    }
+
+    sun.sun_family = AF_UNIX;
+    ret = snprintf(sun.sun_path, sizeof(sun.sun_path), "%s",
+                   server->args.unix_socket_path);
+    if (ret < 0 || ret >= sizeof(sun.sun_path)) {
+        IVSHMEM_SERVER_DEBUG(server, "could not copy unix socket path\n");
+        goto err_close_sock;
+    }
+    if (bind(sock_fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "cannot connect to %s: %s\n", sun.sun_path,
+                             strerror(errno));
+        goto err_close_sock;
+    }
+
+    if (listen(sock_fd, IVSHMEM_SERVER_LISTEN_BACKLOG) < 0) {
+        IVSHMEM_SERVER_DEBUG(server, "listen() failed: %s\n", strerror(errno));
+        goto err_close_sock;
+    }
+
+    server->sock_fd = sock_fd;
+    server->shm_fd = shm_fd;
+    server->state_table = state_table;
+
+    return 0;
+
+err_close_sock:
+    close(sock_fd);
+err_unmap:
+    munmap(state_table, 4096);
+err_close_shm:
+    if (server->args.use_shm_open) {
+        shm_unlink(server->args.shm_path);
+    }
+    close(shm_fd);
+    shm_unlink(server->args.shm_path);
+    return -1;
+}
+
+/* close connections to clients, the unix socket and the shm fd */
+void
+ivshmem_server_close(IvshmemServer *server)
+{
+    IvshmemServerPeer *peer, *npeer;
+
+    IVSHMEM_SERVER_DEBUG(server, "close server\n");
+
+    QTAILQ_FOREACH_SAFE(peer, &server->peer_list, next, npeer) {
+        ivshmem_server_free_peer(server, peer);
+    }
+
+    unlink(server->args.unix_socket_path);
+    if (server->args.use_shm_open) {
+        shm_unlink(server->args.shm_path);
+    }
+    close(server->sock_fd);
+    munmap(server->state_table, 4096);
+    close(server->shm_fd);
+    server->sock_fd = -1;
+    server->shm_fd = -1;
+}
+
+/* get the fd_set according to the unix socket and the peer list */
+void
+ivshmem_server_get_fds(const IvshmemServer *server, fd_set *fds, int *maxfd)
+{
+    IvshmemServerPeer *peer;
+
+    if (server->sock_fd == -1) {
+        return;
+    }
+
+    FD_SET(server->sock_fd, fds);
+    if (server->sock_fd >= *maxfd) {
+        *maxfd = server->sock_fd + 1;
+    }
+
+    QTAILQ_FOREACH(peer, &server->peer_list, next) {
+        FD_SET(peer->sock_fd, fds);
+        if (peer->sock_fd >= *maxfd) {
+            *maxfd = peer->sock_fd + 1;
+        }
+    }
+}
+
+/* process incoming messages on the sockets in fd_set */
+int
+ivshmem_server_handle_fds(IvshmemServer *server, fd_set *fds, int maxfd)
+{
+    IvshmemServerPeer *peer, *peer_next;
+
+    if (server->sock_fd < maxfd && FD_ISSET(server->sock_fd, fds) &&
+        ivshmem_server_handle_new_conn(server) < 0 && errno != EINTR) {
+        IVSHMEM_SERVER_DEBUG(server, "ivshmem_server_handle_new_conn() "
+                             "failed\n");
+        return -1;
+    }
+
+    QTAILQ_FOREACH_SAFE(peer, &server->peer_list, next, peer_next) {
+        /* any message from a peer socket result in a close() */
+        IVSHMEM_SERVER_DEBUG(server, "peer->sock_fd=%d\n", peer->sock_fd);
+        if (peer->sock_fd < maxfd && FD_ISSET(peer->sock_fd, fds)) {
+            ivshmem_server_free_peer(server, peer);
+        }
+    }
+
+    return 0;
+}
+
+/* lookup peer from its id */
+IvshmemServerPeer *
+ivshmem_server_search_peer(IvshmemServer *server, int64_t peer_id)
+{
+    IvshmemServerPeer *peer;
+
+    QTAILQ_FOREACH(peer, &server->peer_list, next) {
+        if (peer->id == peer_id) {
+            return peer;
+        }
+    }
+    return NULL;
+}
diff --git a/contrib/ivshmem2-server/ivshmem2-server.h b/contrib/ivshmem2-server/ivshmem2-server.h
new file mode 100644
index 0000000000..3fd6166577
--- /dev/null
+++ b/contrib/ivshmem2-server/ivshmem2-server.h
@@ -0,0 +1,158 @@
+/*
+ * Copyright 6WIND S.A., 2014
+ * Copyright (c) Siemens AG, 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.  See the COPYING file in the
+ * top-level directory.
+ */
+
+#ifndef IVSHMEM2_SERVER_H
+#define IVSHMEM2_SERVER_H
+
+/**
+ * The ivshmem server is a daemon that creates a unix socket in listen
+ * mode. The ivshmem clients (qemu or ivshmem-client) connect to this
+ * unix socket. For each client, the server will create some eventfd
+ * (see EVENTFD(2)), one per vector. These fd are transmitted to all
+ * clients using the SCM_RIGHTS cmsg message. Therefore, each client is
+ * able to send a notification to another client without being
+ * "profixied" by the server.
+ *
+ * We use this mechanism to send interruptions between guests.
+ * qemu is able to transform an event on a eventfd into a PCI MSI-x
+ * interruption in the guest.
+ *
+ * The ivshmem server is also able to share the file descriptor
+ * associated to the ivshmem shared memory.
+ */
+
+#include <sys/select.h>
+
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "hw/misc/ivshmem2.h"
+
+/**
+ * Maximum number of notification vectors supported by the server
+ */
+#define IVSHMEM_SERVER_MAX_VECTORS 64
+
+/**
+ * Structure storing a peer
+ *
+ * Each time a client connects to an ivshmem server, a new
+ * IvshmemServerPeer structure is created. This peer and all its
+ * vectors are advertised to all connected clients through the connected
+ * unix sockets.
+ */
+typedef struct IvshmemServerPeer {
+    QTAILQ_ENTRY(IvshmemServerPeer) next;    /**< next in list*/
+    int sock_fd;                             /**< connected unix sock */
+    int64_t id;                              /**< the id of the peer */
+    EventNotifier vectors[IVSHMEM_SERVER_MAX_VECTORS]; /**< one per vector */
+    unsigned vectors_count;                  /**< number of vectors */
+} IvshmemServerPeer;
+
+/**
+ * Structure describing ivshmem server arguments
+ */
+typedef struct IvshmemServerArgs {
+    bool verbose;                   /**< true to enable verbose mode */
+    const char *unix_socket_path;   /**< pointer to unix socket file name */
+    const char *shm_path;           /**< Path to the shared memory; path
+                                         corresponds to a POSIX shm name or a
+                                         hugetlbfs mount point. */
+    bool use_shm_open;              /**< true to use shm_open, false for
+                                         file-backed shared memory */
+    uint64_t shm_size;              /**< total size of shared memory */
+    uint64_t output_section_size;   /**< size of each output section */
+    unsigned max_peers;             /**< maximum number of peers */
+    unsigned vectors;               /**< interrupt vectors per client */
+    unsigned protocol;              /**< protocol advertised to all clients */
+} IvshmemServerArgs;
+
+/**
+ * Structure describing an ivshmem server
+ *
+ * This structure stores all information related to our server: the name
+ * of the server unix socket and the list of connected peers.
+ */
+typedef struct IvshmemServer {
+    IvshmemServerArgs args;          /**< server arguments */
+    int sock_fd;                     /**< unix sock file descriptor */
+    int shm_fd;                      /**< shm file descriptor */
+    uint32_t *state_table;           /**< mapped state table */
+    QTAILQ_HEAD(, IvshmemServerPeer) peer_list; /**< list of peers */
+} IvshmemServer;
+
+/**
+ * Initialize an ivshmem server
+ *
+ * @server:         A pointer to an uninitialized IvshmemServer structure
+ */
+void ivshmem_server_init(IvshmemServer *server);
+
+/**
+ * Open the shm, then create and bind to the unix socket
+ *
+ * @server: The pointer to the initialized IvshmemServer structure
+ *
+ * Returns: 0 on success, or a negative value on error
+ */
+int ivshmem_server_start(IvshmemServer *server);
+
+/**
+ * Close the server
+ *
+ * Close connections to all clients, close the unix socket and the
+ * shared memory file descriptor. The structure remains initialized, so
+ * it is possible to call ivshmem_server_start() again after a call to
+ * ivshmem_server_close().
+ *
+ * @server: The ivshmem server
+ */
+void ivshmem_server_close(IvshmemServer *server);
+
+/**
+ * Fill a fd_set with file descriptors to be monitored
+ *
+ * This function will fill a fd_set with all file descriptors that must
+ * be polled (unix server socket and peers unix socket). The function
+ * will not initialize the fd_set, it is up to the caller to do it.
+ *
+ * @server: The ivshmem server
+ * @fds:    The fd_set to be updated
+ * @maxfd:  Must be set to the max file descriptor + 1 in fd_set. This value is
+ *          updated if this function adds a greater fd in fd_set.
+ */
+void
+ivshmem_server_get_fds(const IvshmemServer *server, fd_set *fds, int *maxfd);
+
+/**
+ * Read and handle new messages
+ *
+ * Given a fd_set (for instance filled by a call to select()), handle
+ * incoming messages from peers.
+ *
+ * @server: The ivshmem server
+ * @fds:    The fd_set containing the file descriptors to be checked. Note that
+ *          file descriptors that are not related to our server are ignored.
+ * @maxfd:  The maximum fd in fd_set, plus one.
+ *
+ * Returns: 0 on success, or a negative value on error
+ */
+int ivshmem_server_handle_fds(IvshmemServer *server, fd_set *fds, int maxfd);
+
+/**
+ * Search a peer from its identifier
+ *
+ * @server:  The ivshmem server
+ * @peer_id: The identifier of the peer structure
+ *
+ * Returns:  The peer structure, or NULL if not found
+ */
+IvshmemServerPeer *
+ivshmem_server_search_peer(IvshmemServer *server, int64_t peer_id);
+
+#endif /* IVSHMEM2_SERVER_H */
diff --git a/contrib/ivshmem2-server/main.c b/contrib/ivshmem2-server/main.c
new file mode 100644
index 0000000000..35cd6fca0f
--- /dev/null
+++ b/contrib/ivshmem2-server/main.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright 6WIND S.A., 2014
+ * Copyright (c) Siemens AG, 2019
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.  See the COPYING file in the
+ * top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/option.h"
+#include "ivshmem2-server.h"
+
+#define IVSHMEM_SERVER_DEFAULT_FOREGROUND     0
+#define IVSHMEM_SERVER_DEFAULT_PID_FILE       "/var/run/ivshmem-server.pid"
+
+#define IVSHMEM_SERVER_DEFAULT_VERBOSE        0
+#define IVSHMEM_SERVER_DEFAULT_UNIX_SOCK_PATH "/tmp/ivshmem_socket"
+#define IVSHMEM_SERVER_DEFAULT_SHM_PATH       "ivshmem"
+#define IVSHMEM_SERVER_DEFAULT_SHM_SIZE       (4*1024*1024)
+#define IVSHMEM_SERVER_DEFAULT_OUTPUT_SEC_SZ  0
+#define IVSHMEM_SERVER_DEFAULT_MAX_PEERS      2
+#define IVSHMEM_SERVER_DEFAULT_VECTORS        1
+#define IVSHMEM_SERVER_DEFAULT_PROTOCOL       0
+
+/* used to quit on signal SIGTERM */
+static int ivshmem_server_quit;
+
+static bool foreground = IVSHMEM_SERVER_DEFAULT_FOREGROUND;
+static const char *pid_file = IVSHMEM_SERVER_DEFAULT_PID_FILE;
+
+static void
+ivshmem_server_usage(const char *progname)
+{
+    printf("Usage: %s [OPTION]...\n"
+           "  -h: show this help\n"
+           "  -v: verbose mode\n"
+           "  -F: foreground mode (default is to daemonize)\n"
+           "  -p <pid-file>: path to the PID file (used in daemon mode only)\n"
+           "     default " IVSHMEM_SERVER_DEFAULT_PID_FILE "\n"
+           "  -S <unix-socket-path>: path to the unix socket to listen to\n"
+           "     default " IVSHMEM_SERVER_DEFAULT_UNIX_SOCK_PATH "\n"
+           "  -M <shm-name>: POSIX shared memory object to use\n"
+           "     default " IVSHMEM_SERVER_DEFAULT_SHM_PATH "\n"
+           "  -m <dir-name>: where to create shared memory\n"
+           "  -l <size>: size of shared memory in bytes\n"
+           "     suffixes K, M and G can be used, e.g. 1K means 1024\n"
+           "     default %u\n"
+           "  -o <size>: size of each output section in bytes "
+                "(suffixes supported)\n"
+           "     default %u\n"
+           "  -n <peers>: maximum number of peers\n"
+           "     default %u\n"
+           "  -V <vectors>: number of vectors\n"
+           "     default %u\n"
+           "  -P <protocol>: 16-bit protocol to be advertised\n"
+           "     default 0x%04x\n"
+           "     When using virtio (0x8000...0xffff), only two peers are "
+           "supported, peer 0\n"
+           "     will become backend, peer 1 frontend\n",
+           progname, IVSHMEM_SERVER_DEFAULT_SHM_SIZE,
+           IVSHMEM_SERVER_DEFAULT_OUTPUT_SEC_SZ,
+           IVSHMEM_SERVER_DEFAULT_MAX_PEERS, IVSHMEM_SERVER_DEFAULT_VECTORS,
+           IVSHMEM_SERVER_DEFAULT_PROTOCOL);
+}
+
+static void
+ivshmem_server_help(const char *progname)
+{
+    fprintf(stderr, "Try '%s -h' for more information.\n", progname);
+}
+
+/* parse the program arguments, exit on error */
+static void
+ivshmem_server_parse_args(IvshmemServerArgs *args, int argc, char *argv[])
+{
+    int c;
+    unsigned long long v;
+    Error *err = NULL;
+
+    while ((c = getopt(argc, argv, "hvFp:S:m:M:l:o:n:V:P:")) != -1) {
+
+        switch (c) {
+        case 'h': /* help */
+            ivshmem_server_usage(argv[0]);
+            exit(0);
+            break;
+
+        case 'v': /* verbose */
+            args->verbose = 1;
+            break;
+
+        case 'F': /* foreground */
+            foreground = 1;
+            break;
+
+        case 'p': /* pid file */
+            pid_file = optarg;
+            break;
+
+        case 'S': /* unix socket path */
+            args->unix_socket_path = optarg;
+            break;
+
+        case 'M': /* shm name */
+        case 'm': /* dir name */
+            args->shm_path = optarg;
+            args->use_shm_open = c == 'M';
+            break;
+
+        case 'l': /* shm size */
+            parse_option_size("shm_size", optarg, &args->shm_size, &err);
+            if (err) {
+                error_report_err(err);
+                ivshmem_server_help(argv[0]);
+                exit(1);
+            }
+            break;
+
+        case 'o': /* output section size */
+            parse_option_size("output_section_size", optarg,
+                              &args->output_section_size, &err);
+            if (err) {
+                error_report_err(err);
+                ivshmem_server_help(argv[0]);
+                exit(1);
+            }
+            break;
+
+        case 'n': /* maximum number of peers */
+            if (parse_uint_full(optarg, &v, 0) < 0) {
+                fprintf(stderr, "cannot parse max-peers\n");
+                ivshmem_server_help(argv[0]);
+                exit(1);
+            }
+            args->max_peers = v;
+            break;
+
+        case 'V': /* number of vectors */
+            if (parse_uint_full(optarg, &v, 0) < 0) {
+                fprintf(stderr, "cannot parse vectors\n");
+                ivshmem_server_help(argv[0]);
+                exit(1);
+            }
+            args->vectors = v;
+            break;
+
+        case 'P': /* protocol */
+            if (parse_uint_full(optarg, &v, 0) < 0) {
+                fprintf(stderr, "cannot parse protocol\n");
+                ivshmem_server_help(argv[0]);
+                exit(1);
+            }
+            args->protocol = v;
+            break;
+
+        default:
+            ivshmem_server_usage(argv[0]);
+            exit(1);
+            break;
+        }
+    }
+
+    if (args->vectors > IVSHMEM_SERVER_MAX_VECTORS) {
+        fprintf(stderr, "too many requested vectors (max is %d)\n",
+                IVSHMEM_SERVER_MAX_VECTORS);
+        ivshmem_server_help(argv[0]);
+        exit(1);
+    }
+
+    if (args->protocol >= 0x8000 && args->max_peers > 2) {
+        fprintf(stderr, "virtio protocols only support 2 peers\n");
+        ivshmem_server_help(argv[0]);
+        exit(1);
+    }
+
+    if (args->verbose == 1 && foreground == 0) {
+        fprintf(stderr, "cannot use verbose in daemon mode\n");
+        ivshmem_server_help(argv[0]);
+        exit(1);
+    }
+}
+
+/* wait for events on listening server unix socket and connected client
+ * sockets */
+static int
+ivshmem_server_poll_events(IvshmemServer *server)
+{
+    fd_set fds;
+    int ret = 0, maxfd;
+
+    while (!ivshmem_server_quit) {
+
+        FD_ZERO(&fds);
+        maxfd = 0;
+        ivshmem_server_get_fds(server, &fds, &maxfd);
+
+        ret = select(maxfd, &fds, NULL, NULL, NULL);
+
+        if (ret < 0) {
+            if (errno == EINTR) {
+                continue;
+            }
+
+            fprintf(stderr, "select error: %s\n", strerror(errno));
+            break;
+        }
+        if (ret == 0) {
+            continue;
+        }
+
+        if (ivshmem_server_handle_fds(server, &fds, maxfd) < 0) {
+            fprintf(stderr, "ivshmem_server_handle_fds() failed\n");
+            break;
+        }
+    }
+
+    return ret;
+}
+
+static void
+ivshmem_server_quit_cb(int signum)
+{
+    ivshmem_server_quit = 1;
+}
+
+int
+main(int argc, char *argv[])
+{
+    IvshmemServer server = {
+        .args = {
+            .verbose = IVSHMEM_SERVER_DEFAULT_VERBOSE,
+            .unix_socket_path = IVSHMEM_SERVER_DEFAULT_UNIX_SOCK_PATH,
+            .shm_path = IVSHMEM_SERVER_DEFAULT_SHM_PATH,
+            .use_shm_open = true,
+            .shm_size = IVSHMEM_SERVER_DEFAULT_SHM_SIZE,
+            .output_section_size = IVSHMEM_SERVER_DEFAULT_OUTPUT_SEC_SZ,
+            .max_peers = IVSHMEM_SERVER_DEFAULT_MAX_PEERS,
+            .vectors = IVSHMEM_SERVER_DEFAULT_VECTORS,
+            .protocol = IVSHMEM_SERVER_DEFAULT_PROTOCOL,
+        },
+    };
+    struct sigaction sa, sa_quit;
+    int ret = 1;
+
+    /*
+     * Do not remove this notice without adding proper error handling!
+     * Start with handling ivshmem_server_send_one_msg() failure.
+     */
+    printf("*** Example code, do not use in production ***\n");
+
+    /* parse arguments, will exit on error */
+    ivshmem_server_parse_args(&server.args, argc, argv);
+
+    /* Ignore SIGPIPE, see this link for more info:
+     * http://www.mail-archive.com/libevent-users@monkey.org/msg01606.html */
+    sa.sa_handler = SIG_IGN;
+    sa.sa_flags = 0;
+    if (sigemptyset(&sa.sa_mask) == -1 ||
+        sigaction(SIGPIPE, &sa, 0) == -1) {
+        perror("failed to ignore SIGPIPE; sigaction");
+        goto err;
+    }
+
+    sa_quit.sa_handler = ivshmem_server_quit_cb;
+    sa_quit.sa_flags = 0;
+    if (sigemptyset(&sa_quit.sa_mask) == -1 ||
+        sigaction(SIGTERM, &sa_quit, 0) == -1 ||
+        sigaction(SIGINT, &sa_quit, 0) == -1) {
+        perror("failed to add signal handler; sigaction");
+        goto err;
+    }
+
+    /* init the ivshms structure */
+    ivshmem_server_init(&server);
+
+    /* start the ivshmem server (open shm & unix socket) */
+    if (ivshmem_server_start(&server) < 0) {
+        fprintf(stderr, "cannot bind\n");
+        goto err;
+    }
+
+    /* daemonize if asked to */
+    if (!foreground) {
+        FILE *fp;
+
+        if (qemu_daemon(1, 1) < 0) {
+            fprintf(stderr, "cannot daemonize: %s\n", strerror(errno));
+            goto err_close;
+        }
+
+        /* write pid file */
+        fp = fopen(pid_file, "w");
+        if (fp == NULL) {
+            fprintf(stderr, "cannot write pid file: %s\n", strerror(errno));
+            goto err_close;
+        }
+
+        fprintf(fp, "%d\n", (int) getpid());
+        fclose(fp);
+    }
+
+    ivshmem_server_poll_events(&server);
+    fprintf(stdout, "server disconnected\n");
+    ret = 0;
+
+err_close:
+    ivshmem_server_close(&server);
+err:
+    return ret;
+}
-- 
2.16.4



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 12:57 ` [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2 Jan Kiszka
@ 2019-11-11 13:45   ` Michael S. Tsirkin
  2019-11-11 13:59     ` Jan Kiszka
  2019-12-05 11:14   ` Markus Armbruster
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2019-11-11 13:45 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Claudio Fontana, qemu-devel,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> +| Offset | Register               | Content                                              |
> +|-------:|:-----------------------|:-----------------------------------------------------|
> +|    00h | Vendor ID              | 1AF4h                                                |
> +|    02h | Device ID              | 1110h                                                |

Given it's a virtio vendor ID, please reserve a device ID
with the virtio TC.

Thanks!

-- 
MST



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 13:45   ` Michael S. Tsirkin
@ 2019-11-11 13:59     ` Jan Kiszka
  2019-11-11 15:08       ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 13:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: liang yan, Jailhouse, Claudio Fontana, qemu-devel,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On 11.11.19 14:45, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
>> +| Offset | Register               | Content                                              |
>> +|-------:|:-----------------------|:-----------------------------------------------------|
>> +|    00h | Vendor ID              | 1AF4h                                                |
>> +|    02h | Device ID              | 1110h                                                |
> 
> Given it's a virtio vendor ID, please reserve a device ID
> with the virtio TC.

Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this 
finally official.

Jan


-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 13:59     ` Jan Kiszka
@ 2019-11-11 15:08       ` Michael S. Tsirkin
  2019-11-11 15:27         ` Daniel P. Berrangé
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2019-11-11 15:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Claudio Fontana, qemu-devel,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
> On 11.11.19 14:45, Michael S. Tsirkin wrote:
> > On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> > > +| Offset | Register               | Content                                              |
> > > +|-------:|:-----------------------|:-----------------------------------------------------|
> > > +|    00h | Vendor ID              | 1AF4h                                                |
> > > +|    02h | Device ID              | 1110h                                                |
> > 
> > Given it's a virtio vendor ID, please reserve a device ID
> > with the virtio TC.
> 
> Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
> official.
> 
> Jan

And I guess we will just mark it reserved or something right?
Since at least IVSHMEM 1 isn't a virtio device.
And will you be reusing same ID for IVSHMEM 2 or a new one?


> 
> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 15:08       ` Michael S. Tsirkin
@ 2019-11-11 15:27         ` Daniel P. Berrangé
  2019-11-11 15:42           ` Jan Kiszka
  2019-11-11 16:11           ` Michael S. Tsirkin
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2019-11-11 15:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: liang yan, Jailhouse, Claudio Fontana, Jan Kiszka,
	Markus Armbruster, qemu-devel, Hannes Reinecke, Stefan Hajnoczi

On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
> > On 11.11.19 14:45, Michael S. Tsirkin wrote:
> > > On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> > > > +| Offset | Register               | Content                                              |
> > > > +|-------:|:-----------------------|:-----------------------------------------------------|
> > > > +|    00h | Vendor ID              | 1AF4h                                                |
> > > > +|    02h | Device ID              | 1110h                                                |
> > > 
> > > Given it's a virtio vendor ID, please reserve a device ID
> > > with the virtio TC.
> > 
> > Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
> > official.
> > 
> 
> And I guess we will just mark it reserved or something right?
> Since at least IVSHMEM 1 isn't a virtio device.
> And will you be reusing same ID for IVSHMEM 2 or a new one?

1110h isn't under either of the virtio PCI device ID allowed ranges
according to the spec:

  "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
   ID 0x1000 through 0x107F inclusive is a virtio device.
   ...
   Additionally, devices MAY utilize a Transitional PCI Device 
   ID range, 0x1000 to 0x103F depending on the device type. "

So there's no need to reserve 0x1110h from the virtio spec POV.

I have, however, ensured it is assigned to ivshmem from POV of
Red Hat's own internal tracking of allocated device IDs, under
its vendor ID.

If ivshmem 2 is now a virtio device, then it is a good thing that
it will get a new/different PCI device ID, to show that it is not
compatible with the old device impl.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 15:27         ` Daniel P. Berrangé
@ 2019-11-11 15:42           ` Jan Kiszka
  2019-11-11 16:14             ` Michael S. Tsirkin
  2019-11-11 16:11           ` Michael S. Tsirkin
  1 sibling, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 15:42 UTC (permalink / raw)
  To: Daniel P. Berrangé, Michael S. Tsirkin
  Cc: liang yan, Jailhouse, Claudio Fontana, qemu-devel,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On 11.11.19 16:27, Daniel P. Berrangé wrote:
> On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
>> On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
>>> On 11.11.19 14:45, Michael S. Tsirkin wrote:
>>>> On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
>>>>> +| Offset | Register               | Content                                              |
>>>>> +|-------:|:-----------------------|:-----------------------------------------------------|
>>>>> +|    00h | Vendor ID              | 1AF4h                                                |
>>>>> +|    02h | Device ID              | 1110h                                                |
>>>>
>>>> Given it's a virtio vendor ID, please reserve a device ID
>>>> with the virtio TC.
>>>
>>> Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
>>> official.
>>>
>>
>> And I guess we will just mark it reserved or something right?
>> Since at least IVSHMEM 1 isn't a virtio device.
>> And will you be reusing same ID for IVSHMEM 2 or a new one?
> 
> 1110h isn't under either of the virtio PCI device ID allowed ranges
> according to the spec:
> 
>    "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
>     ID 0x1000 through 0x107F inclusive is a virtio device.
>     ...
>     Additionally, devices MAY utilize a Transitional PCI Device
>     ID range, 0x1000 to 0x103F depending on the device type. "
> 
> So there's no need to reserve 0x1110h from the virtio spec POV.

Indeed.

> 
> I have, however, ensured it is assigned to ivshmem from POV of
> Red Hat's own internal tracking of allocated device IDs, under
> its vendor ID.
> 
> If ivshmem 2 is now a virtio device, then it is a good thing that
> it will get a new/different PCI device ID, to show that it is not
> compatible with the old device impl.

At this stage, it is just a PCI device that may be used in combination 
with virtio (stacked on top), but it is not designed like a normal 
virtio (PCI) device. That's because it lacks many properties of regular 
virtio devices, like queues.

So, if such a device could be come part of the virtio spec, it would be 
separate from the rest, and having an ID from the regular range would 
likely not be helpful in this regard.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 15:27         ` Daniel P. Berrangé
  2019-11-11 15:42           ` Jan Kiszka
@ 2019-11-11 16:11           ` Michael S. Tsirkin
  2019-11-11 16:38             ` Jan Kiszka
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2019-11-11 16:11 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: liang yan, Jailhouse, Claudio Fontana, Jan Kiszka,
	Markus Armbruster, qemu-devel, Hannes Reinecke, Stefan Hajnoczi

On Mon, Nov 11, 2019 at 03:27:43PM +0000, Daniel P. Berrangé wrote:
> On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
> > On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
> > > On 11.11.19 14:45, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> > > > > +| Offset | Register               | Content                                              |
> > > > > +|-------:|:-----------------------|:-----------------------------------------------------|
> > > > > +|    00h | Vendor ID              | 1AF4h                                                |
> > > > > +|    02h | Device ID              | 1110h                                                |
> > > > 
> > > > Given it's a virtio vendor ID, please reserve a device ID
> > > > with the virtio TC.
> > > 
> > > Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
> > > official.
> > > 
> > 
> > And I guess we will just mark it reserved or something right?
> > Since at least IVSHMEM 1 isn't a virtio device.
> > And will you be reusing same ID for IVSHMEM 2 or a new one?
> 
> 1110h isn't under either of the virtio PCI device ID allowed ranges
> according to the spec:
> 
>   "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
>    ID 0x1000 through 0x107F inclusive is a virtio device.
>    ...
>    Additionally, devices MAY utilize a Transitional PCI Device 
>    ID range, 0x1000 to 0x103F depending on the device type. "
> 
> So there's no need to reserve 0x1110h from the virtio spec POV.

Well we do have:

	B.3
	What Device Number?
	Device numbers can be reserved by the OASIS committee: email virtio-dev@lists.oasis-open.org to secure
	a unique one.
	Meanwhile for experimental drivers, use 65535 and work backwards.

So it seems it can  in theory conflict at least with experimental virtio devices.

Really it's messy that people are reusing the virtio vendor ID for
random stuff - getting a vendor ID is only hard for a hobbyist, any big
company already has an ID - but if it is a hobbyist and they at least
register then doesn't cause much harm.

E.g. Red Hat switched to 1b36 for new non virtio devices and I think that's
nicer.


> I have, however, ensured it is assigned to ivshmem from POV of
> Red Hat's own internal tracking of allocated device IDs, under
> its vendor ID.

Thanks!

> If ivshmem 2 is now a virtio device, then it is a good thing that
> it will get a new/different PCI device ID, to show that it is not
> compatible with the old device impl.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 15:42           ` Jan Kiszka
@ 2019-11-11 16:14             ` Michael S. Tsirkin
  2019-11-11 16:25               ` Jan Kiszka
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2019-11-11 16:14 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Daniel P. Berrangé,
	Claudio Fontana, qemu-devel, Markus Armbruster, Hannes Reinecke,
	Stefan Hajnoczi

On Mon, Nov 11, 2019 at 04:42:52PM +0100, Jan Kiszka wrote:
> On 11.11.19 16:27, Daniel P. Berrangé wrote:
> > On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
> > > > On 11.11.19 14:45, Michael S. Tsirkin wrote:
> > > > > On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> > > > > > +| Offset | Register               | Content                                              |
> > > > > > +|-------:|:-----------------------|:-----------------------------------------------------|
> > > > > > +|    00h | Vendor ID              | 1AF4h                                                |
> > > > > > +|    02h | Device ID              | 1110h                                                |
> > > > > 
> > > > > Given it's a virtio vendor ID, please reserve a device ID
> > > > > with the virtio TC.
> > > > 
> > > > Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
> > > > official.
> > > > 
> > > 
> > > And I guess we will just mark it reserved or something right?
> > > Since at least IVSHMEM 1 isn't a virtio device.
> > > And will you be reusing same ID for IVSHMEM 2 or a new one?
> > 
> > 1110h isn't under either of the virtio PCI device ID allowed ranges
> > according to the spec:
> > 
> >    "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
> >     ID 0x1000 through 0x107F inclusive is a virtio device.
> >     ...
> >     Additionally, devices MAY utilize a Transitional PCI Device
> >     ID range, 0x1000 to 0x103F depending on the device type. "
> > 
> > So there's no need to reserve 0x1110h from the virtio spec POV.
> 
> Indeed.
> 
> > 
> > I have, however, ensured it is assigned to ivshmem from POV of
> > Red Hat's own internal tracking of allocated device IDs, under
> > its vendor ID.
> > 
> > If ivshmem 2 is now a virtio device, then it is a good thing that
> > it will get a new/different PCI device ID, to show that it is not
> > compatible with the old device impl.
> 
> At this stage, it is just a PCI device that may be used in combination with
> virtio (stacked on top), but it is not designed like a normal virtio (PCI)
> device. That's because it lacks many properties of regular virtio devices,
> like queues.
> 
> So, if such a device could be come part of the virtio spec, it would be
> separate from the rest, and having an ID from the regular range would likely
> not be helpful in this regard.
> 
> Jan

I agree it needs a separate ID not from the regular range.
It's a distinct transport.
Maybe even a distinct vendor ID - we could easily get another one
if needed.

> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 16:14             ` Michael S. Tsirkin
@ 2019-11-11 16:25               ` Jan Kiszka
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: liang yan, Jailhouse, Daniel P. Berrangé,
	Claudio Fontana, qemu-devel, Markus Armbruster, Hannes Reinecke,
	Stefan Hajnoczi

On 11.11.19 17:14, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 04:42:52PM +0100, Jan Kiszka wrote:
>> On 11.11.19 16:27, Daniel P. Berrangé wrote:
>>> On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
>>>> On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
>>>>> On 11.11.19 14:45, Michael S. Tsirkin wrote:
>>>>>> On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
>>>>>>> +| Offset | Register               | Content                                              |
>>>>>>> +|-------:|:-----------------------|:-----------------------------------------------------|
>>>>>>> +|    00h | Vendor ID              | 1AF4h                                                |
>>>>>>> +|    02h | Device ID              | 1110h                                                |
>>>>>>
>>>>>> Given it's a virtio vendor ID, please reserve a device ID
>>>>>> with the virtio TC.
>>>>>
>>>>> Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
>>>>> official.
>>>>>
>>>>
>>>> And I guess we will just mark it reserved or something right?
>>>> Since at least IVSHMEM 1 isn't a virtio device.
>>>> And will you be reusing same ID for IVSHMEM 2 or a new one?
>>>
>>> 1110h isn't under either of the virtio PCI device ID allowed ranges
>>> according to the spec:
>>>
>>>     "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
>>>      ID 0x1000 through 0x107F inclusive is a virtio device.
>>>      ...
>>>      Additionally, devices MAY utilize a Transitional PCI Device
>>>      ID range, 0x1000 to 0x103F depending on the device type. "
>>>
>>> So there's no need to reserve 0x1110h from the virtio spec POV.
>>
>> Indeed.
>>
>>>
>>> I have, however, ensured it is assigned to ivshmem from POV of
>>> Red Hat's own internal tracking of allocated device IDs, under
>>> its vendor ID.
>>>
>>> If ivshmem 2 is now a virtio device, then it is a good thing that
>>> it will get a new/different PCI device ID, to show that it is not
>>> compatible with the old device impl.
>>
>> At this stage, it is just a PCI device that may be used in combination with
>> virtio (stacked on top), but it is not designed like a normal virtio (PCI)
>> device. That's because it lacks many properties of regular virtio devices,
>> like queues.
>>
>> So, if such a device could be come part of the virtio spec, it would be
>> separate from the rest, and having an ID from the regular range would likely
>> not be helpful in this regard.
>>
>> Jan
> 
> I agree it needs a separate ID not from the regular range.
> It's a distinct transport.
> Maybe even a distinct vendor ID - we could easily get another one
> if needed.

That might be useful because I've seen the kernel's virtio-pci driver 
grabbing ivshmem devices from time to time. OTOH, that could likely also 
be improved in Linux, at least for future versions.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 16:11           ` Michael S. Tsirkin
@ 2019-11-11 16:38             ` Jan Kiszka
  2019-11-12  8:04               ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-11-11 16:38 UTC (permalink / raw)
  To: Michael S. Tsirkin, Daniel P. Berrangé
  Cc: liang yan, Jailhouse, Claudio Fontana, qemu-devel,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On 11.11.19 17:11, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 03:27:43PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
>>> On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
>>>> On 11.11.19 14:45, Michael S. Tsirkin wrote:
>>>>> On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
>>>>>> +| Offset | Register               | Content                                              |
>>>>>> +|-------:|:-----------------------|:-----------------------------------------------------|
>>>>>> +|    00h | Vendor ID              | 1AF4h                                                |
>>>>>> +|    02h | Device ID              | 1110h                                                |
>>>>>
>>>>> Given it's a virtio vendor ID, please reserve a device ID
>>>>> with the virtio TC.
>>>>
>>>> Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
>>>> official.
>>>>
>>>
>>> And I guess we will just mark it reserved or something right?
>>> Since at least IVSHMEM 1 isn't a virtio device.
>>> And will you be reusing same ID for IVSHMEM 2 or a new one?
>>
>> 1110h isn't under either of the virtio PCI device ID allowed ranges
>> according to the spec:
>>
>>    "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
>>     ID 0x1000 through 0x107F inclusive is a virtio device.
>>     ...
>>     Additionally, devices MAY utilize a Transitional PCI Device
>>     ID range, 0x1000 to 0x103F depending on the device type. "
>>
>> So there's no need to reserve 0x1110h from the virtio spec POV.
> 
> Well we do have:
> 
> 	B.3
> 	What Device Number?
> 	Device numbers can be reserved by the OASIS committee: email virtio-dev@lists.oasis-open.org to secure
> 	a unique one.
> 	Meanwhile for experimental drivers, use 65535 and work backwards.
> 
> So it seems it can  in theory conflict at least with experimental virtio devices.
> 
> Really it's messy that people are reusing the virtio vendor ID for
> random stuff - getting a vendor ID is only hard for a hobbyist, any big
> company already has an ID - but if it is a hobbyist and they at least
> register then doesn't cause much harm.

Note that ivshmem came from a research environment. I do know if there 
was a check for the IDs at the point the code was merged.

That said, I may get a device ID here as well, provided I can explain 
that not a single "product" will own it, but rather an open specification.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
  2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
                   ` (2 preceding siblings ...)
  2019-11-11 12:57 ` [RFC][PATCH 3/3] contrib: Add server for ivshmem " Jan Kiszka
@ 2019-11-12  0:56 ` no-reply
  2019-11-27 15:28 ` Liang Yan
  4 siblings, 0 replies; 23+ messages in thread
From: no-reply @ 2019-11-12  0:56 UTC (permalink / raw)
  To: jan.kiszka
  Cc: lyan, jailhouse-dev, claudio.fontana, mst, armbru, qemu-devel,
	hare, stefanha

Patchew URL: https://patchew.org/QEMU/cover.1573477032.git.jan.kiszka@siemens.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
Type: series
Message-id: cover.1573477032.git.jan.kiszka@siemens.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
45625de contrib: Add server for ivshmem revision 2
df18ce0 docs/specs: Add specification of ivshmem device revision 2
ff35318 hw/misc: Add implementation of ivshmem revision 2 device

=== OUTPUT BEGIN ===
1/3 Checking commit ff35318fdf84 (hw/misc: Add implementation of ivshmem revision 2 device)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#63: 
new file mode 100644

ERROR: return is not a function, parentheses are not required
#206: FILE: hw/misc/ivshmem2.c:139:
+    return (ivs->features & (1 << feature));

ERROR: memory barrier without comment
#250: FILE: hw/misc/ivshmem2.c:183:
+    smp_mb();

ERROR: braces {} are necessary for all arms of this statement
#625: FILE: hw/misc/ivshmem2.c:558:
+    if (msg->vector == 0)
[...]

WARNING: Block comments use a leading /* on a separate line
#775: FILE: hw/misc/ivshmem2.c:708:
+/* Select the MSI-X vectors used by device.

WARNING: Block comments use a trailing */ on a separate line
#777: FILE: hw/misc/ivshmem2.c:710:
+ * we just enable all vectors on init and after reset. */

total: 3 errors, 3 warnings, 1147 lines checked

Patch 1/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/3 Checking commit df18ce079161 (docs/specs: Add specification of ivshmem device revision 2)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#24: 
new file mode 100644

total: 0 errors, 1 warnings, 333 lines checked

Patch 2/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
3/3 Checking commit 45625def0d51 (contrib: Add server for ivshmem revision 2)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#77: 
new file mode 100644

WARNING: Block comments use a leading /* on a separate line
#174: FILE: contrib/ivshmem2-server/ivshmem2-server.c:86:
+/* free a peer when the server advertises a disconnection or when the

WARNING: Block comments use a trailing */ on a separate line
#175: FILE: contrib/ivshmem2-server/ivshmem2-server.c:87:
+ * server is freed */

ERROR: memory barrier without comment
#194: FILE: contrib/ivshmem2-server/ivshmem2-server.c:106:
+    smp_mb();

WARNING: Block comments use a leading /* on a separate line
#276: FILE: contrib/ivshmem2-server/ivshmem2-server.c:188:
+    /* XXX: this could use id allocation such as Linux IDA, or simply

WARNING: Block comments use a trailing */ on a separate line
#277: FILE: contrib/ivshmem2-server/ivshmem2-server.c:189:
+     * a free-list */

WARNING: Block comments use a leading /* on a separate line
#342: FILE: contrib/ivshmem2-server/ivshmem2-server.c:254:
+/* Try to ftruncate a file to next power of 2 of shmsize.

WARNING: Block comments use a trailing */ on a separate line
#346: FILE: contrib/ivshmem2-server/ivshmem2-server.c:258:
+ * shm_size value. */

WARNING: Block comments use a leading /* on a separate line
#619: FILE: contrib/ivshmem2-server/ivshmem2-server.h:63:
+    const char *shm_path;           /**< Path to the shared memory; path

WARNING: Block comments use * on subsequent lines
#620: FILE: contrib/ivshmem2-server/ivshmem2-server.h:64:
+    const char *shm_path;           /**< Path to the shared memory; path
+                                         corresponds to a POSIX shm name or a

WARNING: Block comments use a trailing */ on a separate line
#621: FILE: contrib/ivshmem2-server/ivshmem2-server.h:65:
+                                         hugetlbfs mount point. */

WARNING: Block comments use a leading /* on a separate line
#622: FILE: contrib/ivshmem2-server/ivshmem2-server.h:66:
+    bool use_shm_open;              /**< true to use shm_open, false for

WARNING: Block comments use * on subsequent lines
#623: FILE: contrib/ivshmem2-server/ivshmem2-server.h:67:
+    bool use_shm_open;              /**< true to use shm_open, false for
+                                         file-backed shared memory */

WARNING: Block comments use a trailing */ on a separate line
#623: FILE: contrib/ivshmem2-server/ivshmem2-server.h:67:
+                                         file-backed shared memory */

ERROR: spaces required around that '*' (ctx:VxV)
#742: FILE: contrib/ivshmem2-server/main.c:22:
+#define IVSHMEM_SERVER_DEFAULT_SHM_SIZE       (4*1024*1024)
                                                 ^

ERROR: spaces required around that '*' (ctx:VxV)
#742: FILE: contrib/ivshmem2-server/main.c:22:
+#define IVSHMEM_SERVER_DEFAULT_SHM_SIZE       (4*1024*1024)
                                                      ^

WARNING: Block comments use a leading /* on a separate line
#906: FILE: contrib/ivshmem2-server/main.c:186:
+/* wait for events on listening server unix socket and connected client

WARNING: Block comments use a trailing */ on a separate line
#907: FILE: contrib/ivshmem2-server/main.c:187:
+ * sockets */

WARNING: Block comments use a leading /* on a separate line
#977: FILE: contrib/ivshmem2-server/main.c:257:
+    /* Ignore SIGPIPE, see this link for more info:

WARNING: Block comments use a trailing */ on a separate line
#978: FILE: contrib/ivshmem2-server/main.c:258:
+     * http://www.mail-archive.com/libevent-users@monkey.org/msg01606.html */

total: 3 errors, 17 warnings, 963 lines checked

Patch 3/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/cover.1573477032.git.jan.kiszka@siemens.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 16:38             ` Jan Kiszka
@ 2019-11-12  8:04               ` Michael S. Tsirkin
  2019-11-20 18:15                 ` Jan Kiszka
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2019-11-12  8:04 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Daniel P. Berrangé,
	Claudio Fontana, qemu-devel, Markus Armbruster, Hannes Reinecke,
	Stefan Hajnoczi

On Mon, Nov 11, 2019 at 05:38:29PM +0100, Jan Kiszka wrote:
> On 11.11.19 17:11, Michael S. Tsirkin wrote:
> > On Mon, Nov 11, 2019 at 03:27:43PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
> > > > > On 11.11.19 14:45, Michael S. Tsirkin wrote:
> > > > > > On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
> > > > > > > +| Offset | Register               | Content                                              |
> > > > > > > +|-------:|:-----------------------|:-----------------------------------------------------|
> > > > > > > +|    00h | Vendor ID              | 1AF4h                                                |
> > > > > > > +|    02h | Device ID              | 1110h                                                |
> > > > > > 
> > > > > > Given it's a virtio vendor ID, please reserve a device ID
> > > > > > with the virtio TC.
> > > > > 
> > > > > Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
> > > > > official.
> > > > > 
> > > > 
> > > > And I guess we will just mark it reserved or something right?
> > > > Since at least IVSHMEM 1 isn't a virtio device.
> > > > And will you be reusing same ID for IVSHMEM 2 or a new one?
> > > 
> > > 1110h isn't under either of the virtio PCI device ID allowed ranges
> > > according to the spec:
> > > 
> > >    "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
> > >     ID 0x1000 through 0x107F inclusive is a virtio device.
> > >     ...
> > >     Additionally, devices MAY utilize a Transitional PCI Device
> > >     ID range, 0x1000 to 0x103F depending on the device type. "
> > > 
> > > So there's no need to reserve 0x1110h from the virtio spec POV.
> > 
> > Well we do have:
> > 
> > 	B.3
> > 	What Device Number?
> > 	Device numbers can be reserved by the OASIS committee: email virtio-dev@lists.oasis-open.org to secure
> > 	a unique one.
> > 	Meanwhile for experimental drivers, use 65535 and work backwards.
> > 
> > So it seems it can  in theory conflict at least with experimental virtio devices.
> > 
> > Really it's messy that people are reusing the virtio vendor ID for
> > random stuff - getting a vendor ID is only hard for a hobbyist, any big
> > company already has an ID - but if it is a hobbyist and they at least
> > register then doesn't cause much harm.
> 
> Note that ivshmem came from a research environment. I do know if there was a
> check for the IDs at the point the code was merged.
> 
> That said, I may get a device ID here as well, provided I can explain that
> not a single "product" will own it, but rather an open specification.
> 
> Jan

OK, up to you - if you decide you want an ID reserved, pls let us know.

At this point I'm not sure I have a good grasp which IDs are
registered where anymore. If someone can write it up, that would
be great too!

> -- 
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-12  8:04               ` Michael S. Tsirkin
@ 2019-11-20 18:15                 ` Jan Kiszka
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-11-20 18:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: liang yan, Jailhouse, Daniel P. Berrangé,
	Claudio Fontana, qemu-devel, Markus Armbruster, Hannes Reinecke,
	Stefan Hajnoczi

On 12.11.19 09:04, Michael S. Tsirkin wrote:
> On Mon, Nov 11, 2019 at 05:38:29PM +0100, Jan Kiszka wrote:
>> On 11.11.19 17:11, Michael S. Tsirkin wrote:
>>> On Mon, Nov 11, 2019 at 03:27:43PM +0000, Daniel P. Berrangé wrote:
>>>> On Mon, Nov 11, 2019 at 10:08:20AM -0500, Michael S. Tsirkin wrote:
>>>>> On Mon, Nov 11, 2019 at 02:59:07PM +0100, Jan Kiszka wrote:
>>>>>> On 11.11.19 14:45, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Nov 11, 2019 at 01:57:11PM +0100, Jan Kiszka wrote:
>>>>>>>> +| Offset | Register               | Content                                              |
>>>>>>>> +|-------:|:-----------------------|:-----------------------------------------------------|
>>>>>>>> +|    00h | Vendor ID              | 1AF4h                                                |
>>>>>>>> +|    02h | Device ID              | 1110h                                                |
>>>>>>>
>>>>>>> Given it's a virtio vendor ID, please reserve a device ID
>>>>>>> with the virtio TC.
>>>>>>
>>>>>> Yeah, QEMU's IVSHMEM was always using that. I'm happy to make this finally
>>>>>> official.
>>>>>>
>>>>>
>>>>> And I guess we will just mark it reserved or something right?
>>>>> Since at least IVSHMEM 1 isn't a virtio device.
>>>>> And will you be reusing same ID for IVSHMEM 2 or a new one?
>>>>
>>>> 1110h isn't under either of the virtio PCI device ID allowed ranges
>>>> according to the spec:
>>>>
>>>>     "Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device
>>>>      ID 0x1000 through 0x107F inclusive is a virtio device.
>>>>      ...
>>>>      Additionally, devices MAY utilize a Transitional PCI Device
>>>>      ID range, 0x1000 to 0x103F depending on the device type. "
>>>>
>>>> So there's no need to reserve 0x1110h from the virtio spec POV.
>>>
>>> Well we do have:
>>>
>>> 	B.3
>>> 	What Device Number?
>>> 	Device numbers can be reserved by the OASIS committee: email virtio-dev@lists.oasis-open.org to secure
>>> 	a unique one.
>>> 	Meanwhile for experimental drivers, use 65535 and work backwards.
>>>
>>> So it seems it can  in theory conflict at least with experimental virtio devices.
>>>
>>> Really it's messy that people are reusing the virtio vendor ID for
>>> random stuff - getting a vendor ID is only hard for a hobbyist, any big
>>> company already has an ID - but if it is a hobbyist and they at least
>>> register then doesn't cause much harm.
>>
>> Note that ivshmem came from a research environment. I do know if there was a
>> check for the IDs at the point the code was merged.
>>
>> That said, I may get a device ID here as well, provided I can explain that
>> not a single "product" will own it, but rather an open specification.
>>
>> Jan
> 
> OK, up to you - if you decide you want an ID reserved, pls let us know.
> 

Turned out to be much simpler than expect:

I reserved device ID 4106h under the vendor ID 110Ah (Siemens AG) for 
the purpose of specifying a shared memory device via the VIRTIO TC. Will 
update this "detail" in the next revision of the patches, also resetting 
the device revision ID to 0 as no longer need to tell us apart from the 
current implementation this way.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
  2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
                   ` (3 preceding siblings ...)
  2019-11-12  0:56 ` [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU no-reply
@ 2019-11-27 15:28 ` Liang Yan
  2019-11-27 17:19   ` Jan Kiszka
  4 siblings, 1 reply; 23+ messages in thread
From: Liang Yan @ 2019-11-27 15:28 UTC (permalink / raw)
  To: Jan Kiszka, qemu-devel
  Cc: Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi



On 11/11/19 7:57 AM, Jan Kiszka wrote:
> To get the ball rolling after my presentation of the topic at KVM Forum
> [1] and many fruitful discussions around it, this is a first concrete
> code series. As discussed, I'm starting with the IVSHMEM implementation
> of a QEMU device and server. It's RFC because, besides specification and
> implementation details, there will still be some decisions needed about
> how to integrate the new version best into the existing code bases.
> 
> If you want to play with this, the basic setup of the shared memory
> device is described in patch 1 and 3. UIO driver and also the
> virtio-ivshmem prototype can be found at
> 
>     http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2
> 
> Accessing the device via UIO is trivial enough. If you want to use it
> for virtio, this is additionally to the description in patch 3 needed on
> the virtio console backend side:
> 
>     modprobe uio_ivshmem
>     echo "1af4 1110 1af4 1100 ffc003 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
>     linux/tools/virtio/virtio-ivshmem-console /dev/uio0
> 
> And for virtio block:
> 
>     echo "1af4 1110 1af4 1100 ffc002 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
>     linux/tools/virtio/virtio-ivshmem-console /dev/uio0 /path/to/disk.img
> 
> After that, you can start the QEMU frontend instance with the
> virtio-ivshmem driver installed which can use the new /dev/hvc* or
> /dev/vda* as usual.
> 
> Any feedback welcome!

Hi, Jan,

I have been playing your code for last few weeks, mostly study and test,
of course. Really nice work. I have a few questions here:

First, qemu part looks good, I tried test between couple VMs, and device
could pop up correctly for all of them, but I had some problems when
trying load driver. For example, if set up two VMs, vm1 and vm2, start
ivshmem server as you suggested. vm1 could load uio_ivshmem and
virtio_ivshmem correctly, vm2 could load uio_ivshmem but could not show
up "/dev/uio0", virtio_ivshmem could not be loaded at all, these still
exist even I switch the load sequence of vm1 and vm2, and sometimes
reset "virtio_ivshmem" could crash both vm1 and vm2. Not quite sure this
is bug or "Ivshmem Mode" issue, but I went through ivshmem-server code,
did not related information.

I started some code work recently, such as fix code style issues and
some work based on above testing, however I know you are also working on
RFC V2, beside the protocol between server-client and client-client is
not finalized yet either, things may change, so much appreciate if you
could squeeze me into your develop schedule and share with me some
plans, :-)  Maybe I could send some pull request in your github repo?

I personally like this project a lot, there would be a lot of potential
and user case for it, especially some devices like
ivshmem-net/ivshmem-block. Anyway, thanks for adding me to the list, and
looking forward to your reply.

Best,
Liang

> 
> Jan
> 
> PS: Let me know if I missed someone potentially interested in this topic
> on CC - or if you would like to be dropped from the list.
> 
> PPS: The Jailhouse queues are currently out of sync /wrt minor details
> of this one, primarily the device ID. Will update them when the general
> direction is clear.
> 
> [1] https://kvmforum2019.sched.com/event/TmxI
> 
> Jan Kiszka (3):
>   hw/misc: Add implementation of ivshmem revision 2 device
>   docs/specs: Add specification of ivshmem device revision 2
>   contrib: Add server for ivshmem revision 2
> 
>  Makefile                                  |    3 +
>  Makefile.objs                             |    1 +
>  configure                                 |    1 +
>  contrib/ivshmem2-server/Makefile.objs     |    1 +
>  contrib/ivshmem2-server/ivshmem2-server.c |  462 ++++++++++++
>  contrib/ivshmem2-server/ivshmem2-server.h |  158 +++++
>  contrib/ivshmem2-server/main.c            |  313 +++++++++
>  docs/specs/ivshmem-2-device-spec.md       |  333 +++++++++
>  hw/misc/Makefile.objs                     |    2 +-
>  hw/misc/ivshmem2.c                        | 1091 +++++++++++++++++++++++++++++
>  include/hw/misc/ivshmem2.h                |   48 ++
>  11 files changed, 2412 insertions(+), 1 deletion(-)
>  create mode 100644 contrib/ivshmem2-server/Makefile.objs
>  create mode 100644 contrib/ivshmem2-server/ivshmem2-server.c
>  create mode 100644 contrib/ivshmem2-server/ivshmem2-server.h
>  create mode 100644 contrib/ivshmem2-server/main.c
>  create mode 100644 docs/specs/ivshmem-2-device-spec.md
>  create mode 100644 hw/misc/ivshmem2.c
>  create mode 100644 include/hw/misc/ivshmem2.h
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
  2019-11-27 15:28 ` Liang Yan
@ 2019-11-27 17:19   ` Jan Kiszka
  2019-12-02  6:16     ` Jan Kiszka
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-11-27 17:19 UTC (permalink / raw)
  To: Liang Yan, qemu-devel
  Cc: Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

Hi Liang,

On 27.11.19 16:28, Liang Yan wrote:
> 
> 
> On 11/11/19 7:57 AM, Jan Kiszka wrote:
>> To get the ball rolling after my presentation of the topic at KVM Forum
>> [1] and many fruitful discussions around it, this is a first concrete
>> code series. As discussed, I'm starting with the IVSHMEM implementation
>> of a QEMU device and server. It's RFC because, besides specification and
>> implementation details, there will still be some decisions needed about
>> how to integrate the new version best into the existing code bases.
>>
>> If you want to play with this, the basic setup of the shared memory
>> device is described in patch 1 and 3. UIO driver and also the
>> virtio-ivshmem prototype can be found at
>>
>>      http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2
>>
>> Accessing the device via UIO is trivial enough. If you want to use it
>> for virtio, this is additionally to the description in patch 3 needed on
>> the virtio console backend side:
>>
>>      modprobe uio_ivshmem
>>      echo "1af4 1110 1af4 1100 ffc003 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0
>>
>> And for virtio block:
>>
>>      echo "1af4 1110 1af4 1100 ffc002 ffffff" > /sys/bus/pci/drivers/uio_ivshmem/new_id
>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0 /path/to/disk.img
>>
>> After that, you can start the QEMU frontend instance with the
>> virtio-ivshmem driver installed which can use the new /dev/hvc* or
>> /dev/vda* as usual.
>>
>> Any feedback welcome!
> 
> Hi, Jan,
> 
> I have been playing your code for last few weeks, mostly study and test,
> of course. Really nice work. I have a few questions here:
> 
> First, qemu part looks good, I tried test between couple VMs, and device
> could pop up correctly for all of them, but I had some problems when
> trying load driver. For example, if set up two VMs, vm1 and vm2, start
> ivshmem server as you suggested. vm1 could load uio_ivshmem and
> virtio_ivshmem correctly, vm2 could load uio_ivshmem but could not show
> up "/dev/uio0", virtio_ivshmem could not be loaded at all, these still
> exist even I switch the load sequence of vm1 and vm2, and sometimes
> reset "virtio_ivshmem" could crash both vm1 and vm2. Not quite sure this
> is bug or "Ivshmem Mode" issue, but I went through ivshmem-server code,
> did not related information.

If we are only talking about one ivshmem link and vm1 is the master, 
there is not role for virtio_ivshmem on it as backend. That is purely a 
frontend driver. Vice versa for vm2: If you want to use its ivshmem 
instance as virtio frontend, uio_ivshmem plays no role.

The "crash" is would be interesting to understand: Do you see kernel 
panics of the guests? Or are they stuck? Or are the QEMU instances 
stuck? Do you know that you can debug the guest kernels via gdb (and 
gdb-scripts of the kernel)?

> 
> I started some code work recently, such as fix code style issues and
> some work based on above testing, however I know you are also working on
> RFC V2, beside the protocol between server-client and client-client is
> not finalized yet either, things may change, so much appreciate if you
> could squeeze me into your develop schedule and share with me some
> plans, :-)  Maybe I could send some pull request in your github repo?

I'm currently working on a refresh of the Jailhouse queue and the kernel 
patches to incorporate just two smaller changes:

  - use Siemens device ID
  - drop "features" register from ivshmem device

I have not yet touched the QEMU code for that so far, thus no conflict 
yet. I would wait for your patches then.

If it helps us to work on this together, I can push things to github as 
well. Will drop you a note when done. We should just present the outcome 
frequently as new series to the list.

> 
> I personally like this project a lot, there would be a lot of potential
> and user case for it, especially some devices like
> ivshmem-net/ivshmem-block. Anyway, thanks for adding me to the list, and
> looking forward to your reply.

Thanks for the positive feedback. I'm looking forward to work on this 
together!

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
  2019-11-27 17:19   ` Jan Kiszka
@ 2019-12-02  6:16     ` Jan Kiszka
       [not found]       ` <877b0cd9-d1c5-00c9-c4b6-567c67740962@suse.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-12-02  6:16 UTC (permalink / raw)
  To: Liang Yan, qemu-devel
  Cc: Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On 27.11.19 18:19, Jan Kiszka wrote:
> Hi Liang,
> 
> On 27.11.19 16:28, Liang Yan wrote:
>>
>>
>> On 11/11/19 7:57 AM, Jan Kiszka wrote:
>>> To get the ball rolling after my presentation of the topic at KVM Forum
>>> [1] and many fruitful discussions around it, this is a first concrete
>>> code series. As discussed, I'm starting with the IVSHMEM implementation
>>> of a QEMU device and server. It's RFC because, besides specification and
>>> implementation details, there will still be some decisions needed about
>>> how to integrate the new version best into the existing code bases.
>>>
>>> If you want to play with this, the basic setup of the shared memory
>>> device is described in patch 1 and 3. UIO driver and also the
>>> virtio-ivshmem prototype can be found at
>>>
>>>      
>>> http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2 
>>>
>>>
>>> Accessing the device via UIO is trivial enough. If you want to use it
>>> for virtio, this is additionally to the description in patch 3 needed on
>>> the virtio console backend side:
>>>
>>>      modprobe uio_ivshmem
>>>      echo "1af4 1110 1af4 1100 ffc003 ffffff" > 
>>> /sys/bus/pci/drivers/uio_ivshmem/new_id
>>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0
>>>
>>> And for virtio block:
>>>
>>>      echo "1af4 1110 1af4 1100 ffc002 ffffff" > 
>>> /sys/bus/pci/drivers/uio_ivshmem/new_id
>>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0 
>>> /path/to/disk.img
>>>
>>> After that, you can start the QEMU frontend instance with the
>>> virtio-ivshmem driver installed which can use the new /dev/hvc* or
>>> /dev/vda* as usual.
>>>
>>> Any feedback welcome!
>>
>> Hi, Jan,
>>
>> I have been playing your code for last few weeks, mostly study and test,
>> of course. Really nice work. I have a few questions here:
>>
>> First, qemu part looks good, I tried test between couple VMs, and device
>> could pop up correctly for all of them, but I had some problems when
>> trying load driver. For example, if set up two VMs, vm1 and vm2, start
>> ivshmem server as you suggested. vm1 could load uio_ivshmem and
>> virtio_ivshmem correctly, vm2 could load uio_ivshmem but could not show
>> up "/dev/uio0", virtio_ivshmem could not be loaded at all, these still
>> exist even I switch the load sequence of vm1 and vm2, and sometimes
>> reset "virtio_ivshmem" could crash both vm1 and vm2. Not quite sure this
>> is bug or "Ivshmem Mode" issue, but I went through ivshmem-server code,
>> did not related information.
> 
> If we are only talking about one ivshmem link and vm1 is the master, 
> there is not role for virtio_ivshmem on it as backend. That is purely a 
> frontend driver. Vice versa for vm2: If you want to use its ivshmem 
> instance as virtio frontend, uio_ivshmem plays no role.
> 
> The "crash" is would be interesting to understand: Do you see kernel 
> panics of the guests? Or are they stuck? Or are the QEMU instances 
> stuck? Do you know that you can debug the guest kernels via gdb (and 
> gdb-scripts of the kernel)?
> 
>>
>> I started some code work recently, such as fix code style issues and
>> some work based on above testing, however I know you are also working on
>> RFC V2, beside the protocol between server-client and client-client is
>> not finalized yet either, things may change, so much appreciate if you
>> could squeeze me into your develop schedule and share with me some
>> plans, :-)  Maybe I could send some pull request in your github repo?
> 
> I'm currently working on a refresh of the Jailhouse queue and the kernel 
> patches to incorporate just two smaller changes:
> 
>   - use Siemens device ID
>   - drop "features" register from ivshmem device
> 
> I have not yet touched the QEMU code for that so far, thus no conflict 
> yet. I would wait for your patches then.
> 
> If it helps us to work on this together, I can push things to github as 
> well. Will drop you a note when done. We should just present the outcome 
> frequently as new series to the list.

I've updated my queues, mostly small changes, mostly to the kernel bits. 
Besides the already announced places, you can also find them as PR 
targets on

https://github.com/siemens/qemu/commits/wip/ivshmem2
https://github.com/siemens/linux/commits/queues/ivshmem2

To give the whole thing broader coverage, I will now also move forward 
and integrate the current state into Jailhouse - at the risk of having 
to rework the interface there once again. But there are a number of 
users already requiring the extended features (or even using them), plus 
this gives a nice test coverage of key components and properties.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU
       [not found]       ` <877b0cd9-d1c5-00c9-c4b6-567c67740962@suse.com>
@ 2019-12-03  7:14         ` Jan Kiszka
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kiszka @ 2019-12-03  7:14 UTC (permalink / raw)
  To: Liang Yan, qemu-devel
  Cc: Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	Markus Armbruster, Hannes Reinecke, Stefan Hajnoczi

On 03.12.19 06:53, Liang Yan wrote:
> 
> On 12/2/19 1:16 AM, Jan Kiszka wrote:
>> On 27.11.19 18:19, Jan Kiszka wrote:
>>> Hi Liang,
>>>
>>> On 27.11.19 16:28, Liang Yan wrote:
>>>>
>>>>
>>>> On 11/11/19 7:57 AM, Jan Kiszka wrote:
>>>>> To get the ball rolling after my presentation of the topic at KVM Forum
>>>>> [1] and many fruitful discussions around it, this is a first concrete
>>>>> code series. As discussed, I'm starting with the IVSHMEM implementation
>>>>> of a QEMU device and server. It's RFC because, besides specification
>>>>> and
>>>>> implementation details, there will still be some decisions needed about
>>>>> how to integrate the new version best into the existing code bases.
>>>>>
>>>>> If you want to play with this, the basic setup of the shared memory
>>>>> device is described in patch 1 and 3. UIO driver and also the
>>>>> virtio-ivshmem prototype can be found at
>>>>>
>>>>>     
>>>>> http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/ivshmem2
>>>>>
>>>>>
>>>>> Accessing the device via UIO is trivial enough. If you want to use it
>>>>> for virtio, this is additionally to the description in patch 3
>>>>> needed on
>>>>> the virtio console backend side:
>>>>>
>>>>>      modprobe uio_ivshmem
>>>>>      echo "1af4 1110 1af4 1100 ffc003 ffffff" >
>>>>> /sys/bus/pci/drivers/uio_ivshmem/new_id
>>>>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0
>>>>>
>>>>> And for virtio block:
>>>>>
>>>>>      echo "1af4 1110 1af4 1100 ffc002 ffffff" >
>>>>> /sys/bus/pci/drivers/uio_ivshmem/new_id
>>>>>      linux/tools/virtio/virtio-ivshmem-console /dev/uio0
>>>>> /path/to/disk.img
>>>>>
>>>>> After that, you can start the QEMU frontend instance with the
>>>>> virtio-ivshmem driver installed which can use the new /dev/hvc* or
>>>>> /dev/vda* as usual.
>>>>>
>>>>> Any feedback welcome!
>>>>
>>>> Hi, Jan,
>>>>
>>>> I have been playing your code for last few weeks, mostly study and test,
>>>> of course. Really nice work. I have a few questions here:
>>>>
>>>> First, qemu part looks good, I tried test between couple VMs, and device
>>>> could pop up correctly for all of them, but I had some problems when
>>>> trying load driver. For example, if set up two VMs, vm1 and vm2, start
>>>> ivshmem server as you suggested. vm1 could load uio_ivshmem and
>>>> virtio_ivshmem correctly, vm2 could load uio_ivshmem but could not show
>>>> up "/dev/uio0", virtio_ivshmem could not be loaded at all, these still
>>>> exist even I switch the load sequence of vm1 and vm2, and sometimes
>>>> reset "virtio_ivshmem" could crash both vm1 and vm2. Not quite sure this
>>>> is bug or "Ivshmem Mode" issue, but I went through ivshmem-server code,
>>>> did not related information.
>>>
>>> If we are only talking about one ivshmem link and vm1 is the master,
>>> there is not role for virtio_ivshmem on it as backend. That is purely
>>> a frontend driver. Vice versa for vm2: If you want to use its ivshmem
>>> instance as virtio frontend, uio_ivshmem plays no role.
>>>
> Hi, Jan,
> 
> Sorry for the late response. Just came back from Thanksgiving holiday.
> 
> I have a few questions here.
> First, how to decide master/slave node? I used two VMs here, they did
> not show same behavior even if I change the boot sequence.

The current mechanism works by selecting the VM gets ID 0 as the
backend, thus sending it also a different protocol ID than the frontend
gets. Could possibly be improved by allowing selection also on the VM
side (QEMU command line parameter etc.).

Inherently, this only affects virtio over ivshmem. Other, symmetric
protocols do not need this differentiation.

> 
> Second, in order to run virtio-ivshmem-console demo, VM1 connect to VM2
> Console. So, need to install virtio frontend driver in VM2, then install
> uio frontend driver in VM1 to get "/dev/uio0", then run demo, right?
> Could you share your procedure?
> 
> Also, I could not get "/dev/uio0" all the time.

OK, should have collected this earlier. This is how I start the console
demo right now:

- ivshmem2-server -F -l 64K -n 2 -V 3 -P 0x8003
- start backend qemu with something like
  "-chardev socket,path=/tmp/ivshmem_socket,id=ivshm
  -device ivshmem,chardev=ivshm" in its command line
- inside that VM
   - modprobe uio_ivshmem
   - echo "110a 4106 1af4 1100 ffc003 ffffff" > \
     /sys/bus/pci/drivers/uio_ivshmem/new_id
   - virtio-ivshmem-console /dev/uio0
- start frontend qemu (can be identical options)

Now the frontend VM should see the ivshmem-virtio transport device and
attach a virtio console driver to it (/dev/hvc0). If you build the
transport into the kernel, you can even do "console=hvc0".

> 
> 
>>> The "crash" is would be interesting to understand: Do you see kernel
>>> panics of the guests? Or are they stuck? Or are the QEMU instances
>>> stuck? Do you know that you can debug the guest kernels via gdb (and
>>> gdb-scripts of the kernel)?
>>>
> 
> They are stuck, no kernel panics, It's like during console connection, I
> try to load/remove/reset module from the other side, then two VMs are
> stuck. One VM could still run if I killed the other VM. Like I said
> above, it may be just wrong operation from my side. I am working on
> ivshmem-block now, it is easier to understand for dual connection case.
> 

As I said, would be good to have an exact description of steps how to
reproduce - or you could attach gdb to the instances and do a some
backtraces on where the VMs are stuck.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-11-11 12:57 ` [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2 Jan Kiszka
  2019-11-11 13:45   ` Michael S. Tsirkin
@ 2019-12-05 11:14   ` Markus Armbruster
  2019-12-05 21:29     ` Jan Kiszka
  1 sibling, 1 reply; 23+ messages in thread
From: Markus Armbruster @ 2019-12-05 11:14 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	qemu-devel, Hannes Reinecke, Stefan Hajnoczi

This has been on the list for more than three weeks already.  I
apologize for the delay.

Jan Kiszka <jan.kiszka@siemens.com> writes:

> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> This imports the ivshmem v2 specification draft from Jailhouse. Its
> final home is to be decided, this shall just simplify the review process
> at this stage.
>
> Note that specifically the Features register (offset 08h) is still under
> consideration. In particular, its bit 0 seems useless now as its benefit
> to guests, specifically when they want to be portable, is close to zero.
> Maybe the register should still be kept, with all bits RsvdZ, for easier
> extensibility.
>
> The rest appears now rather mature and reasonably implementable, as
> proven also by a version for Jailhouse.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  docs/specs/ivshmem-2-device-spec.md | 333 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 333 insertions(+)
>  create mode 100644 docs/specs/ivshmem-2-device-spec.md
>
> diff --git a/docs/specs/ivshmem-2-device-spec.md b/docs/specs/ivshmem-2-device-spec.md
> new file mode 100644
> index 0000000000..98cfde585a
> --- /dev/null
> +++ b/docs/specs/ivshmem-2-device-spec.md
> @@ -0,0 +1,333 @@
> +IVSHMEM Device Specification
> +============================
> +
> +** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
> +
> +The Inter-VM Shared Memory device provides the following features to its users:
> +
> +- Interconnection between up to 65536 peers
> +
> +- Multi-purpose shared memory region
> +
> +    - common read/writable section
> +
> +    - unidirectional sections that are read/writable for one peer and only
> +      readable for the others
> +
> +    - section with peer states
> +
> +- Event signaling via interrupt to remote sides
> +
> +- Support for life-cycle management via state value exchange and interrupt
> +  notification on changes, backed by a shared memory section
> +
> +- Free choice of protocol to be used on top
> +
> +- Protocol type declaration
> +
> +- Unprivileged access to memory-mapped or I/O registers feasible
> +
> +- Discoverable and configurable via standard PCI mechanisms

Stating requirements is much appreciated.  Design rationale would be
even better :)

As pointed out many times, shared memory is not a solution to any
communication problem, it's merely a building block for building such
solutions: you invariably have to layer some protocol on top.  In your
KVM Forum talk, you mention layering virtio on top.  Makes sense to me.
But why does *this* virtio transport have to be an independent device?
Other transports aren't.

Now let me indulge in spec nitpicking :)

> +
> +
> +Hypervisor Model
> +----------------
> +
> +In order to provide a consistent link between peers, all connected instances of
> +IVSHMEM devices need to be configured, created and run by the hypervisor
> +according to the following requirements:
> +
> +- The instances of the device need to be accessible via PCI programming
> +  interfaces on all sides.

What does that mean?

> +
> +- The read/write shared memory section has to be of the same size for all
> +  peers and, if non-zero, has to reflect the same memory content for them.

Isn't "same memory content" redundant with "shared memory"?

> +
> +- If output sections are present (non-zero section size), there must be one
> +  reserved for each peer with exclusive write access. All output sections
> +  must have the same size and must be readable for all peers. They have to
> +  reflect the same memory content for all peers.

Are these the "unidirectional sections" mentioned previously?

> +
> +- The State Table must have the same size for all peers, must be large enough to
> +  hold a state values of all peers, and must be read-only for the user.

"the state values", I guess.

> +
> +- State register changes (explicit writes, peer resets) have to be propagated
> +  to the other peers by updating the corresponding State Table entry and issuing
> +  an interrupt to all other peers if they enabled reception.
> +
> +- Interrupts events triggered by a peer have to be delivered to the target peer,
> +  provided the receiving side is valid and has enabled the reception.
> +
> +- All peers must have the same interrupt delivery features available, i.e. MSI-X
> +  with the same maximum number of vectors on platforms supporting this
> +  mechanism, otherwise INTx with one vector.

Use case for legacy INTx?

For what it's worth, we removed INTx support from ivshmem v1 in rev 1,
QEMU 2.6.

> +
> +
> +Guest-side Programming Model
> +----------------------------
> +
> +An IVSHMEM device appears as a PCI device to its users. Unless otherwise noted,
> +it conforms to the PCI Local Bus Specification, Revision 3.0 As such, it is

Uh, is there anything in it that does *not* conform to this spec?

> +discoverable via the PCI configuration space and provides a number of standard
> +and custom PCI configuration registers.
> +
> +### Shared Memory Region Layout
> +
> +The shared memory region is divided into several sections.
> +
> +    +-----------------------------+   -
> +    |                             |   :
> +    | Output Section for peer n-1 |   : Output Section Size
> +    |     (n = Maximum Peers)     |   :
> +    +-----------------------------+   -
> +    :                             :
> +    :                             :
> +    :                             :
> +    +-----------------------------+   -
> +    |                             |   :
> +    |  Output Section for peer 1  |   : Output Section Size
> +    |                             |   :
> +    +-----------------------------+   -
> +    |                             |   :
> +    |  Output Section for peer 0  |   : Output Section Size
> +    |                             |   :
> +    +-----------------------------+   -
> +    |                             |   :
> +    |     Read/Write Section      |   : R/W Section Size
> +    |                             |   :
> +    +-----------------------------+   -
> +    |                             |   :
> +    |         State Table         |   : State Table Size
> +    |                             |   :
> +    +-----------------------------+   <-- Shared memory base address
> +
> +The first section consists of the mandatory State Table. Its size is defined by
> +the State Table Size register and cannot be zero. This section is read-only for
> +all peers.
> +
> +The second section consists of shared memory that is read/writable for all
> +peers. Its size is defined by the R/W Section Size register. A size of zero is
> +permitted.
> +
> +The third and following sections are unidirectional output sections, one for
> +each peer. Their sizes are all identical. The size of a single output section is
> +defined by the Output Section Size register. An output section is read/writable
> +for the corresponding peer and read-only for all other peers. E.g., only the
> +peer with ID 3 can write to the fourths output section, but all peers can read
> +from this section.
> +
> +All sizes have to be rounded up to multiples of a mappable page in order to
> +allow access control according to the section restrictions.
> +
> +### Configuration Space Registers
> +
> +#### Header Registers
> +
> +| Offset | Register               | Content                                              |
> +|-------:|:-----------------------|:-----------------------------------------------------|
> +|    00h | Vendor ID              | 1AF4h                                                |
> +|    02h | Device ID              | 1110h                                                |

Same as ivshmem v1.  Revision ID (offset 08h) disambiguates.

> +|    04h | Command Register       | 0000h on reset, implementing bits 1, 2, 10           |

What does "implementing bits ..." mean?

> +|    06h | Status Register        | 0010h, static value (bit 3 not implemented)          |

What does "bit 3 not implemented" mean?

> +|    08h | Revision ID            | 02h                                                  |

ivshmem v1 is rev 0 before QEMU 2.6, rev 1 since.  Rev 1 is
backward-compatible to rev 0 for guest software.  Is rev 2 intended to
be backward-compatible, too?

You should probably explan how your v2 relates to v1 in more detail,
possibly in its own top-level section.

> +|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
> +|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
> +|    0Bh | Class Code, Base Class | FFh                                                  |

FFh means "device does not fit any defined class."  For what it's worth,
ivshmem v1 uses 05h "Memory Controller", with sub-class and interface
00h "RAM Controller".

> +|    0Eh | Header Type            | 00h                                                  |
> +|    10h | BAR 0                  | MMIO or I/O register region                          |

Use case for I/O?

For what it's worth, ivshmem v1 never supported I/O.

> +|    14h | BAR 1                  | MSI-X region                                         |
> +|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |

What does "(with BAR 3)" mean?

> +|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
> +|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |

ivshmem v1 leaves these blank.

> +|    34h | Capability Pointer     | First capability                                     |
> +|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |

"If MSI-X is available"?

A PCIe device always provides MSI-X, and may additionally provide legacy
INTx.  A conventional PCI device may provide either or both.  In any
case, the Interrupt Pin register is zero when legacy INTx is not
provided,

> +
> +If BAR 2 is not present, the shared memory region is not relocatable by the
> +user. In that case, the hypervisor has to implement the Base Address register in
> +the vendor-specific capability.
> +
> +Other header registers may not be implemented. If not implemented, they return 0
> +on read and ignore write accesses.

Is this an example of where the device does not conform to the PCI Local
Bus Specification?

> +
> +#### Vendor Specific Capability (ID 09h)
> +
> +| Offset | Register            | Content                                        |
> +|-------:|:--------------------|:-----------------------------------------------|
> +|    00h | ID                  | 09h                                            |
> +|    01h | Next Capability     | Pointer to next capability or 00h              |
> +|    02h | Length              | 18h or 20h                                     |
> +|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
> +|        |                     | Bits 1-7: RsvdZ                                |

Please define RsvdZ somewhere, or use plainer language.

> +|    04h | State Table Size    | 32-bit size of read-only State Table           |
> +|    08h | R/W Section Size    | 64-bit size of common read/write section       |
> +|    10h | Output Section Size | 64-bit size of unidirectional output sections  |
> +|    18h | Base Address        | optional: 64-bit base address of shared memory |

Length is 20h when Base Adress is present, else 18h.  Worth spelling
that out?

> +
> +All registers are read-only, except for bit 0 of the Privileged Control
> +register.

Well, the other bits of that register are writable, they're just
ignored.

For what it's worth, existing ivshmem-spec.txt uses the terms read-only,
write-only, read/write and reserved bits rigorously: "Software should
only access the registers as specified [...]  Reserved bits should be
ignored on read, and preserved on write."

> +
> +When bit 0 in the Privileged Control register is set to 1, the device clears
> +bit 0 in the Interrupt Control register on each interrupt delivery. This enables
> +automatic interrupt throttling when re-enabling shall be performed by a
> +scheduled unprivileged instance on the user side.
> +
> +If an IVSHMEM device does not support a relocatable shared memory region, BAR 2
> +must not be implemented by the provider. Instead, the Base Address register has
> +to be implemented to report the location of the shared memory region in the
> +user's address space.

Rationale for not wanting to support relocatable shared memory?

> +
> +A non-existing shared memory section has to report zero in its Section Size
> +register.

This vendor-specific capability must always be present, I presume.
Worth spelling out.

> +
> +#### MSI-X Capability (ID 11h)
> +
> +On platforms supporting MSI-X, IVSHMEM has to provide interrupt delivery via
> +this mechanism. In that case, the legacy INTx delivery mechanism is not
> +available, and the Interrupt Pin configuration register returns 0.

I'm confused.  Does that mean the device shall support either MSI-X or
legacy INTx, but never both?

> +
> +The IVSHMEM device has no notion of pending interrupts. Therefore, reading from
> +the MSI-X Pending Bit Array will always return 0.

I guess this means a polling mode of operation is not possible.
Correct?

> +
> +The corresponding MSI-X MMIO region is configured via BAR 1.
> +
> +The MSI-X table size reported by the MSI-X capability structure is identical for
> +all peers.
> +
> +### Register Region
> +
> +The register region may be implemented as MMIO or I/O.
> +
> +When implementing it as MMIO, the hypervisor has to ensure that the register
> +region can be mapped as a single page into the address space of the user. Write

"can be mapped as a single page" depends on the host system, not the
device.

For what it's worth, ivshmem v1 fixes the size of BAR0 to 256 bytes.
Any particular reason to keep its size so loosely specified in v2?

> +accesses to MMIO region offsets that are not backed by registers have to be
> +ignored, read accesses have to return 0. This enables the user to hand out the
> +complete region, along with the shared memory, to an unprivileged instance.
> +
> +The region location in the user's physical address space is configured via BAR
> +0. The following table visualizes the region layout:
> +
> +| Offset | Register                                                            |
> +|-------:|:--------------------------------------------------------------------|
> +|    00h | ID                                                                  |
> +|    04h | Maximum Peers                                                       |
> +|    08h | Features                                                            |
> +|    0Ch | Interrupt Control                                                   |
> +|    10h | Doorbell                                                            |
> +|    14h | State                                                               |
> +
> +All registers support only aligned 32-bit accesses.

Definitely not backwards compatible to rev 1.  I figure that means v2
should use a different Vendor ID / Device ID, not just bump the Revsion
ID.

No interrupt status register?

> +
> +#### ID Register (Offset 00h)
> +
> +Read-only register that reports the ID of the local device. It is unique for all
> +of the connected devices and remains unchanged over their lifetime.
> +
> +#### Maximum Peers Register (Offset 04h)
> +
> +Read-only register that reports the maximum number of possible peers (including
> +the local one). The supported range is between 2 and 65536 and remains constant
> +over the lifetime of all peers.

Value 1 would be boring, but is it really impossible?

> +
> +#### Features Register (Offset 08h)
> +
> +Read-only register that reports features of the local device or the connected
> +peers. Its content remains constant over the lifetime of all peers.
> +
> +| Bits | Content                                                               |
> +|-----:|:----------------------------------------------------------------------|
> +|    0 | 1: Synchronized shared memory base address                            |
> +| 1-31 | RsvdZ                                                                 |
> +
> +If "synchronized shared memory base address" is reported (bit 0 is set), the
> +shared memory region is mapped at the same address into the user address spaces
> +of all connected peers. Thus, peers can use physical addresses as pointers when
> +exchanging information via the shared memory. This feature flag is never set
> +when the shared memory region is relocatable via BAR 2.
> +
> +#### Interrupt Control Register (Offset 0Ch)
> +
> +This read/write register controls the generation of interrupts whenever a peer
> +writes to the Doorbell register or changes its state.
> +
> +| Bits | Content                                                               |
> +|-----:|:----------------------------------------------------------------------|
> +|    0 | 1: Enable interrupt generation                                        |
> +| 1-31 | RsvdZ                                                                 |
> +
> +Note that bit 0 is reset to 0 on interrupt delivery if one-shot interrupt mode
> +is enabled in the Enhanced Features register.
> +
> +The value of this register after device reset is 0.

I presume this applies only to legacy INTx.  Correct?

ivshmem v1 calls this Interrupt Mask.

> +
> +#### Doorbell Register (Offset 10h)
> +
> +Write-only register that triggers an interrupt vector in the target device if it
> +is enabled there.
> +
> +| Bits  | Content                                                              |
> +|------:|:---------------------------------------------------------------------|
> +|  0-15 | Vector number                                                        |
> +| 16-31 | Target ID                                                            |
> +
> +Writing a vector number that is not enabled by the target has no effect. The
> +peers can derive the number of available vectors from their own device
> +capabilities and are expected to define or negotiate the used ones via the
> +selected protocol.

That's because all peers use the same number of vectors.  Worth spelling out?

> +
> +Addressing a non-existing or inactive target has no effect. Peers can identify
> +active targets via the State Table.
> +
> +The behavior on reading from this register is undefined.
> +
> +#### State Register (Offset 14h)
> +
> +Read/write register that defines the state of the local device. Writing to this
> +register sets the state and triggers interrupt vector 0 on the remote device if

"the remote device"?  Do you mean all peers?

"interrupt vector 0" assumes MSI-X.  With legacy INTx, it simply
triggers an interrupt.

How can guest software distinguish between "interrupt due to remote
state change" and "interrupt due to doorbell"?

Things become so much easier when you ditch INTx: reserve vector 0 for
state change, done.

> +the written state value differs from the previous one. The user of the remote
> +device can read the value written to this register from the State Table.
> +
> +The value of this register after device reset is 0.

I guess the meaning of state values depends on the protocol type.
Correct?

> +
> +### State Table
> +
> +The State Table is a read-only section at the beginning of the shared memory
> +region. It contains a 32-bit state value for each of the peers. Locating the
> +table in shared memory allows fast checking of remote states without register
> +accesses.
> +
> +The table is updated on each state change of a peers. Whenever a user of an
> +IVSHMEM device writes a value to the Local State register, this value is copied
> +into the corresponding entry of the State Table. When a IVSHMEM device is reset
> +or disconnected from the other peers, zero is written into the corresponding
> +table entry. The initial content of the table is all zeros.
> +
> +    +--------------------------------+
> +    | 32-bit state value of peer n-1 |
> +    +--------------------------------+
> +    :                                :
> +    +--------------------------------+
> +    | 32-bit state value of peer 1   |
> +    +--------------------------------+
> +    | 32-bit state value of peer 0   |
> +    +--------------------------------+ <-- Shared memory base address
> +
> +
> +Protocols
> +---------
> +
> +The IVSHMEM device shall support the peers of a connection in agreeing on the
> +protocol used over the shared memory devices. For that purpose, the interface
> +byte (offset 09h) and the sub-class byte (offset 0Ah) of the Class Code register
> +encodes a 16-bit protocol type for the users. The following type values are
> +defined:
> +
> +| Protocol Type | Description                                                  |
> +|--------------:|:-------------------------------------------------------------|
> +|         0000h | Undefined type                                               |
> +|         0001h | Virtual peer-to-peer Ethernet                                |
> +|   0002h-3FFFh | Reserved                                                     |
> +|   4000h-7FFFh | User-defined protocols                                       |
> +|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
> +|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
> +
> +Details of the protocols are not in the scope of this specification.

Are you sure this use of PCI class code is kosher?

Final request: please break your lines around column 70 for readability.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-12-05 11:14   ` Markus Armbruster
@ 2019-12-05 21:29     ` Jan Kiszka
  2019-12-06 10:08       ` Markus Armbruster
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kiszka @ 2019-12-05 21:29 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	qemu-devel, Hannes Reinecke, Stefan Hajnoczi

On 05.12.19 12:14, Markus Armbruster wrote:
> This has been on the list for more than three weeks already.  I
> apologize for the delay.

No problem! Your feedback is highly appreciated.

> 
> Jan Kiszka <jan.kiszka@siemens.com> writes:
> 
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> This imports the ivshmem v2 specification draft from Jailhouse. Its
>> final home is to be decided, this shall just simplify the review process
>> at this stage.
>>
>> Note that specifically the Features register (offset 08h) is still under
>> consideration. In particular, its bit 0 seems useless now as its benefit
>> to guests, specifically when they want to be portable, is close to zero.
>> Maybe the register should still be kept, with all bits RsvdZ, for easier
>> extensibility.
>>
>> The rest appears now rather mature and reasonably implementable, as
>> proven also by a version for Jailhouse.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>  docs/specs/ivshmem-2-device-spec.md | 333 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 333 insertions(+)
>>  create mode 100644 docs/specs/ivshmem-2-device-spec.md
>>
>> diff --git a/docs/specs/ivshmem-2-device-spec.md b/docs/specs/ivshmem-2-device-spec.md
>> new file mode 100644
>> index 0000000000..98cfde585a
>> --- /dev/null
>> +++ b/docs/specs/ivshmem-2-device-spec.md
>> @@ -0,0 +1,333 @@
>> +IVSHMEM Device Specification
>> +============================
>> +
>> +** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>> +
>> +The Inter-VM Shared Memory device provides the following features to its users:
>> +
>> +- Interconnection between up to 65536 peers
>> +
>> +- Multi-purpose shared memory region
>> +
>> +    - common read/writable section
>> +
>> +    - unidirectional sections that are read/writable for one peer and only
>> +      readable for the others
>> +
>> +    - section with peer states
>> +
>> +- Event signaling via interrupt to remote sides
>> +
>> +- Support for life-cycle management via state value exchange and interrupt
>> +  notification on changes, backed by a shared memory section
>> +
>> +- Free choice of protocol to be used on top
>> +
>> +- Protocol type declaration
>> +
>> +- Unprivileged access to memory-mapped or I/O registers feasible
>> +
>> +- Discoverable and configurable via standard PCI mechanisms
> 
> Stating requirements is much appreciated.  Design rationale would be
> even better :)

The idea is to avoid having to model also a platform device with
platform discovery mechanism. Based on our experiences with providing a
generic virtual PCI host controller, such a complication of
specification and implementation is simply not needed.

> 
> As pointed out many times, shared memory is not a solution to any
> communication problem, it's merely a building block for building such
> solutions: you invariably have to layer some protocol on top.  In your
> KVM Forum talk, you mention layering virtio on top.  Makes sense to me.
> But why does *this* virtio transport have to be an independent device?
> Other transports aren't.

This device is not only a virtio transport. As you correctly point out,
it is a foundation for communication, not the full solution by itself.

Its goal is to define the minimal piece that a hypervisor has to provide
to its guests in order to enable them building full communication
solutions of their preferred flavor on top. It does not want to deal
with those details, may they be virtio, may they be something much
simpler, much older or much more sophisticated (or complicated). All
that shall not be the business of a hypervisor.

> 
> Now let me indulge in spec nitpicking :)
> 
>> +
>> +
>> +Hypervisor Model
>> +----------------
>> +
>> +In order to provide a consistent link between peers, all connected instances of
>> +IVSHMEM devices need to be configured, created and run by the hypervisor
>> +according to the following requirements:
>> +
>> +- The instances of the device need to be accessible via PCI programming
>> +  interfaces on all sides.
> 
> What does that mean?

"From a guest point of view, this shall be a PCI device."

> 
>> +
>> +- The read/write shared memory section has to be of the same size for all
>> +  peers and, if non-zero, has to reflect the same memory content for them.
> 
> Isn't "same memory content" redundant with "shared memory"?

Probably.

> 
>> +
>> +- If output sections are present (non-zero section size), there must be one
>> +  reserved for each peer with exclusive write access. All output sections
>> +  must have the same size and must be readable for all peers. They have to
>> +  reflect the same memory content for all peers.
> 
> Are these the "unidirectional sections" mentioned previously?

Yep. The terms have some history, maybe worth consolidating them.

> 
>> +
>> +- The State Table must have the same size for all peers, must be large enough to
>> +  hold a state values of all peers, and must be read-only for the user.
> 
> "the state values", I guess.

Yes.

> 
>> +
>> +- State register changes (explicit writes, peer resets) have to be propagated
>> +  to the other peers by updating the corresponding State Table entry and issuing
>> +  an interrupt to all other peers if they enabled reception.
>> +
>> +- Interrupts events triggered by a peer have to be delivered to the target peer,
>> +  provided the receiving side is valid and has enabled the reception.
>> +
>> +- All peers must have the same interrupt delivery features available, i.e. MSI-X
>> +  with the same maximum number of vectors on platforms supporting this
>> +  mechanism, otherwise INTx with one vector.
> 
> Use case for legacy INTx?

Platforms without support for MSI-X injection. Quite a few current ARM
platforms still fall into this category, but they get less, indeed.

> 
> For what it's worth, we removed INTx support from ivshmem v1 in rev 1,
> QEMU 2.6.

Maybe worth to look into this again, specifically when we are done with
ITS emulation on the Jailhouse side. Maybe it will be the same story as
with our virtual PCI host controller there: After we had to add PCI
logic for devices with real controller anyway, the virtual one was
literally for free (not a single line of code added).

> 
>> +
>> +
>> +Guest-side Programming Model
>> +----------------------------
>> +
>> +An IVSHMEM device appears as a PCI device to its users. Unless otherwise noted,
>> +it conforms to the PCI Local Bus Specification, Revision 3.0 As such, it is
> 
> Uh, is there anything in it that does *not* conform to this spec?

See below.

> 
>> +discoverable via the PCI configuration space and provides a number of standard
>> +and custom PCI configuration registers.
>> +
>> +### Shared Memory Region Layout
>> +
>> +The shared memory region is divided into several sections.
>> +
>> +    +-----------------------------+   -
>> +    |                             |   :
>> +    | Output Section for peer n-1 |   : Output Section Size
>> +    |     (n = Maximum Peers)     |   :
>> +    +-----------------------------+   -
>> +    :                             :
>> +    :                             :
>> +    :                             :
>> +    +-----------------------------+   -
>> +    |                             |   :
>> +    |  Output Section for peer 1  |   : Output Section Size
>> +    |                             |   :
>> +    +-----------------------------+   -
>> +    |                             |   :
>> +    |  Output Section for peer 0  |   : Output Section Size
>> +    |                             |   :
>> +    +-----------------------------+   -
>> +    |                             |   :
>> +    |     Read/Write Section      |   : R/W Section Size
>> +    |                             |   :
>> +    +-----------------------------+   -
>> +    |                             |   :
>> +    |         State Table         |   : State Table Size
>> +    |                             |   :
>> +    +-----------------------------+   <-- Shared memory base address
>> +
>> +The first section consists of the mandatory State Table. Its size is defined by
>> +the State Table Size register and cannot be zero. This section is read-only for
>> +all peers.
>> +
>> +The second section consists of shared memory that is read/writable for all
>> +peers. Its size is defined by the R/W Section Size register. A size of zero is
>> +permitted.
>> +
>> +The third and following sections are unidirectional output sections, one for
>> +each peer. Their sizes are all identical. The size of a single output section is
>> +defined by the Output Section Size register. An output section is read/writable
>> +for the corresponding peer and read-only for all other peers. E.g., only the
>> +peer with ID 3 can write to the fourths output section, but all peers can read
>> +from this section.
>> +
>> +All sizes have to be rounded up to multiples of a mappable page in order to
>> +allow access control according to the section restrictions.
>> +
>> +### Configuration Space Registers
>> +
>> +#### Header Registers
>> +
>> +| Offset | Register               | Content                                              |
>> +|-------:|:-----------------------|:-----------------------------------------------------|
>> +|    00h | Vendor ID              | 1AF4h                                                |
>> +|    02h | Device ID              | 1110h                                                |
> 
> Same as ivshmem v1.  Revision ID (offset 08h) disambiguates.

Will become 110Ah (Siemens) / 4106h (freshly reserved for a new ivshmem)
in the next round.

> 
>> +|    04h | Command Register       | 0000h on reset, implementing bits 1, 2, 10           |
> 
> What does "implementing bits ..." mean?

Ignore writes to the other bits.

> 
>> +|    06h | Status Register        | 0010h, static value (bit 3 not implemented)          |
> 
> What does "bit 3 not implemented" mean?

Will stay at 0, no "pending INTx" information available.

> 
>> +|    08h | Revision ID            | 02h                                                  |
> 
> ivshmem v1 is rev 0 before QEMU 2.6, rev 1 since.  Rev 1 is
> backward-compatible to rev 0 for guest software.  Is rev 2 intended to
> be backward-compatible, too?
> 
> You should probably explan how your v2 relates to v1 in more detail,
> possibly in its own top-level section.

With that new, separate device IT, this register will start at 0 in the
next revision of this document.

> 
>> +|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>> +|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>> +|    0Bh | Class Code, Base Class | FFh                                                  |
> 
> FFh means "device does not fit any defined class."  For what it's worth,
> ivshmem v1 uses 05h "Memory Controller", with sub-class and interface
> 00h "RAM Controller".

I know. But using that scheme would needlessly wast 8 bits for protocol
ID encoding. In turn, having the protocol encoded into the class code
helps a lot with automatic driver probing on Linux e.g.

> 
>> +|    0Eh | Header Type            | 00h                                                  |
>> +|    10h | BAR 0                  | MMIO or I/O register region                          |
> 
> Use case for I/O?

Faster VM exit handling. But, granted, that was not exploited so far in
existing implementations. I just wanted to keep that door open.

> 
> For what it's worth, ivshmem v1 never supported I/O.
> 
>> +|    14h | BAR 1                  | MSI-X region                                         |
>> +|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
> 
> What does "(with BAR 3)" mean?

64-bit resource regions always use two BARs, see PCI spec.

> 
>> +|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>> +|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
> 
> ivshmem v1 leaves these blank.

Same story as with virtio: This should encode the ID of the provider
(hypervisor), also for the case we ever face deviating implementations,
due to quirks of specification vagueness.

> 
>> +|    34h | Capability Pointer     | First capability                                     |
>> +|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
> 
> "If MSI-X is available"?
> 
> A PCIe device always provides MSI-X, and may additionally provide legacy
> INTx.  A conventional PCI device may provide either or both.  In any
> case, the Interrupt Pin register is zero when legacy INTx is not
> provided,

All known. This shall just "encourage" to provide more powerfull MSI-x
when the platform provides that for other devices anyway. Only if the
platform this device is injected to does not have such support, INTx
shall be there.

> 
>> +
>> +If BAR 2 is not present, the shared memory region is not relocatable by the
>> +user. In that case, the hypervisor has to implement the Base Address register in
>> +the vendor-specific capability.
>> +
>> +Other header registers may not be implemented. If not implemented, they return 0
>> +on read and ignore write accesses.
> 
> Is this an example of where the device does not conform to the PCI Local
> Bus Specification?

- INTx is an edge interrupt
- there is no INTx status bit
- there is no pending MSI-X bitmap

> 
>> +
>> +#### Vendor Specific Capability (ID 09h)
>> +
>> +| Offset | Register            | Content                                        |
>> +|-------:|:--------------------|:-----------------------------------------------|
>> +|    00h | ID                  | 09h                                            |
>> +|    01h | Next Capability     | Pointer to next capability or 00h              |
>> +|    02h | Length              | 18h or 20h                                     |
>> +|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>> +|        |                     | Bits 1-7: RsvdZ                                |
> 
> Please define RsvdZ somewhere, or use plainer language.

OK.

> 
>> +|    04h | State Table Size    | 32-bit size of read-only State Table           |
>> +|    08h | R/W Section Size    | 64-bit size of common read/write section       |
>> +|    10h | Output Section Size | 64-bit size of unidirectional output sections  |
>> +|    18h | Base Address        | optional: 64-bit base address of shared memory |
> 
> Length is 20h when Base Adress is present, else 18h.  Worth spelling
> that out?

Sure.

> 
>> +
>> +All registers are read-only, except for bit 0 of the Privileged Control
>> +register.
> 
> Well, the other bits of that register are writable, they're just
> ignored.
> 
> For what it's worth, existing ivshmem-spec.txt uses the terms read-only,
> write-only, read/write and reserved bits rigorously: "Software should
> only access the registers as specified [...]  Reserved bits should be
> ignored on read, and preserved on write."

Will change.

> 
>> +
>> +When bit 0 in the Privileged Control register is set to 1, the device clears
>> +bit 0 in the Interrupt Control register on each interrupt delivery. This enables
>> +automatic interrupt throttling when re-enabling shall be performed by a
>> +scheduled unprivileged instance on the user side.
>> +
>> +If an IVSHMEM device does not support a relocatable shared memory region, BAR 2
>> +must not be implemented by the provider. Instead, the Base Address register has
>> +to be implemented to report the location of the shared memory region in the
>> +user's address space.
> 
> Rationale for not wanting to support relocatable shared memory?

Locked-down second stage page tables during runtime. We can even
check-sum their content periodically when we do not allow guest to
influence the shared memory guest location. Relevant for safety and
possibly also security scenarios.

> 
>> +
>> +A non-existing shared memory section has to report zero in its Section Size
>> +register.
> 
> This vendor-specific capability must always be present, I presume.
> Worth spelling out.

Ack.

> 
>> +
>> +#### MSI-X Capability (ID 11h)
>> +
>> +On platforms supporting MSI-X, IVSHMEM has to provide interrupt delivery via
>> +this mechanism. In that case, the legacy INTx delivery mechanism is not
>> +available, and the Interrupt Pin configuration register returns 0.
> 
> I'm confused.  Does that mean the device shall support either MSI-X or
> legacy INTx, but never both?

Correct, see above. There is no point in requesting both and requiring
the hypervisor to deal with the guest switching between things during
runtime.

> 
>> +
>> +The IVSHMEM device has no notion of pending interrupts. Therefore, reading from
>> +the MSI-X Pending Bit Array will always return 0.
> 
> I guess this means a polling mode of operation is not possible.
> Correct?

Of course, polling remains possible, but the guest driver should rather
obtain the interrupt-triggering state information from the much faster
shared memory. There is simply no value in also mirroring that into
interrupt registers for this particular device.

> 
>> +
>> +The corresponding MSI-X MMIO region is configured via BAR 1.
>> +
>> +The MSI-X table size reported by the MSI-X capability structure is identical for
>> +all peers.
>> +
>> +### Register Region
>> +
>> +The register region may be implemented as MMIO or I/O.
>> +
>> +When implementing it as MMIO, the hypervisor has to ensure that the register
>> +region can be mapped as a single page into the address space of the user. Write
> 
> "can be mapped as a single page" depends on the host system, not the
> device.

When host system provides this device, it must ensure that the MMIO
register region can be mapped by its guest as one page, without
potential overlap with other resources.

> 
> For what it's worth, ivshmem v1 fixes the size of BAR0 to 256 bytes.
> Any particular reason to keep its size so loosely specified in v2?

UIO. The v1 device can't be handed out to untrusted UIO applications
without intercepting the MMIO writes in the guest kernel in order to
restrict them to the valid range. This device supports that.

> 
>> +accesses to MMIO region offsets that are not backed by registers have to be
>> +ignored, read accesses have to return 0. This enables the user to hand out the
>> +complete region, along with the shared memory, to an unprivileged instance.
>> +
>> +The region location in the user's physical address space is configured via BAR
>> +0. The following table visualizes the region layout:
>> +
>> +| Offset | Register                                                            |
>> +|-------:|:--------------------------------------------------------------------|
>> +|    00h | ID                                                                  |
>> +|    04h | Maximum Peers                                                       |
>> +|    08h | Features                                                            |
>> +|    0Ch | Interrupt Control                                                   |
>> +|    10h | Doorbell                                                            |
>> +|    14h | State                                                               |
>> +
>> +All registers support only aligned 32-bit accesses.
> 
> Definitely not backwards compatible to rev 1.  I figure that means v2
> should use a different Vendor ID / Device ID, not just bump the Revsion
> ID.

Yep, see above.

> 
> No interrupt status register?

Yep, see above.

> 
>> +
>> +#### ID Register (Offset 00h)
>> +
>> +Read-only register that reports the ID of the local device. It is unique for all
>> +of the connected devices and remains unchanged over their lifetime.
>> +
>> +#### Maximum Peers Register (Offset 04h)
>> +
>> +Read-only register that reports the maximum number of possible peers (including
>> +the local one). The supported range is between 2 and 65536 and remains constant
>> +over the lifetime of all peers.
> 
> Value 1 would be boring, but is it really impossible?

Likely not, didn't think about it so far. Would we gain anything by
permitting this?

> 
>> +
>> +#### Features Register (Offset 08h)
>> +
>> +Read-only register that reports features of the local device or the connected
>> +peers. Its content remains constant over the lifetime of all peers.
>> +
>> +| Bits | Content                                                               |
>> +|-----:|:----------------------------------------------------------------------|
>> +|    0 | 1: Synchronized shared memory base address                            |
>> +| 1-31 | RsvdZ                                                                 |
>> +
>> +If "synchronized shared memory base address" is reported (bit 0 is set), the
>> +shared memory region is mapped at the same address into the user address spaces
>> +of all connected peers. Thus, peers can use physical addresses as pointers when
>> +exchanging information via the shared memory. This feature flag is never set
>> +when the shared memory region is relocatable via BAR 2.
>> +
>> +#### Interrupt Control Register (Offset 0Ch)
>> +
>> +This read/write register controls the generation of interrupts whenever a peer
>> +writes to the Doorbell register or changes its state.
>> +
>> +| Bits | Content                                                               |
>> +|-----:|:----------------------------------------------------------------------|
>> +|    0 | 1: Enable interrupt generation                                        |
>> +| 1-31 | RsvdZ                                                                 |
>> +
>> +Note that bit 0 is reset to 0 on interrupt delivery if one-shot interrupt mode
>> +is enabled in the Enhanced Features register.
>> +
>> +The value of this register after device reset is 0.
> 
> I presume this applies only to legacy INTx.  Correct?

Nope, this applies to both INTx and MSI-x. It's a building block for UIO
devices, see one-shot interrupt mode. In that mode, bit 0 will be
cleared by the device after each interrupt delivery, avoiding that the
guest UIO driver needs to do that via a vmexit.

> 
> ivshmem v1 calls this Interrupt Mask.

As "set" does not mean "masked", I chose "enable" here.

> 
>> +
>> +#### Doorbell Register (Offset 10h)
>> +
>> +Write-only register that triggers an interrupt vector in the target device if it
>> +is enabled there.
>> +
>> +| Bits  | Content                                                              |
>> +|------:|:---------------------------------------------------------------------|
>> +|  0-15 | Vector number                                                        |
>> +| 16-31 | Target ID                                                            |
>> +
>> +Writing a vector number that is not enabled by the target has no effect. The
>> +peers can derive the number of available vectors from their own device
>> +capabilities and are expected to define or negotiate the used ones via the
>> +selected protocol.
> 
> That's because all peers use the same number of vectors.  Worth spelling out?

I thought I did that somewhere. Will check.

> 
>> +
>> +Addressing a non-existing or inactive target has no effect. Peers can identify
>> +active targets via the State Table.
>> +
>> +The behavior on reading from this register is undefined.
>> +
>> +#### State Register (Offset 14h)
>> +
>> +Read/write register that defines the state of the local device. Writing to this
>> +register sets the state and triggers interrupt vector 0 on the remote device if
> 
> "the remote device"?  Do you mean all peers?

Historic singular (an earlier revision only supported 2 peers). Will fix.

> 
> "interrupt vector 0" assumes MSI-X.  With legacy INTx, it simply
> triggers an interrupt.

Right, INTx only has vector 0.

> 
> How can guest software distinguish between "interrupt due to remote
> state change" and "interrupt due to doorbell"?

By checking those states. Both are in shared memory, thus quickly readable.

> 
> Things become so much easier when you ditch INTx: reserve vector 0 for
> state change, done.

That aspect does not make things complicated with INTx, but I agree that
we should look at its "TCO" again.

> 
>> +the written state value differs from the previous one. The user of the remote
>> +device can read the value written to this register from the State Table.
>> +
>> +The value of this register after device reset is 0.
> 
> I guess the meaning of state values depends on the protocol type.
> Correct?

Yes, except for value 0 which means "reset state". The protocols are
free to define all other values according to their needs.

> 
>> +
>> +### State Table
>> +
>> +The State Table is a read-only section at the beginning of the shared memory
>> +region. It contains a 32-bit state value for each of the peers. Locating the
>> +table in shared memory allows fast checking of remote states without register
>> +accesses.
>> +
>> +The table is updated on each state change of a peers. Whenever a user of an
>> +IVSHMEM device writes a value to the Local State register, this value is copied
>> +into the corresponding entry of the State Table. When a IVSHMEM device is reset
>> +or disconnected from the other peers, zero is written into the corresponding
>> +table entry. The initial content of the table is all zeros.
>> +
>> +    +--------------------------------+
>> +    | 32-bit state value of peer n-1 |
>> +    +--------------------------------+
>> +    :                                :
>> +    +--------------------------------+
>> +    | 32-bit state value of peer 1   |
>> +    +--------------------------------+
>> +    | 32-bit state value of peer 0   |
>> +    +--------------------------------+ <-- Shared memory base address
>> +
>> +
>> +Protocols
>> +---------
>> +
>> +The IVSHMEM device shall support the peers of a connection in agreeing on the
>> +protocol used over the shared memory devices. For that purpose, the interface
>> +byte (offset 09h) and the sub-class byte (offset 0Ah) of the Class Code register
>> +encodes a 16-bit protocol type for the users. The following type values are
>> +defined:
>> +
>> +| Protocol Type | Description                                                  |
>> +|--------------:|:-------------------------------------------------------------|
>> +|         0000h | Undefined type                                               |
>> +|         0001h | Virtual peer-to-peer Ethernet                                |
>> +|   0002h-3FFFh | Reserved                                                     |
>> +|   4000h-7FFFh | User-defined protocols                                       |
>> +|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>> +|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>> +
>> +Details of the protocols are not in the scope of this specification.
> 
> Are you sure this use of PCI class code is kosher?

Well, we are in the 0xff range. To my understanding, there are no
constraints on what you put in the other bits for that class.

> 
> Final request: please break your lines around column 70 for readability.
> 

Sure, can do.

Thanks a lot,
Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2
  2019-12-05 21:29     ` Jan Kiszka
@ 2019-12-06 10:08       ` Markus Armbruster
  0 siblings, 0 replies; 23+ messages in thread
From: Markus Armbruster @ 2019-12-06 10:08 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: liang yan, Jailhouse, Claudio Fontana, Michael S . Tsirkin,
	qemu-devel, Hannes Reinecke, Stefan Hajnoczi

Jan Kiszka <jan.kiszka@siemens.com> writes:

> On 05.12.19 12:14, Markus Armbruster wrote:
>> This has been on the list for more than three weeks already.  I
>> apologize for the delay.
>
> No problem! Your feedback is highly appreciated.
>
>> 
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>> 
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> This imports the ivshmem v2 specification draft from Jailhouse. Its
>>> final home is to be decided, this shall just simplify the review process
>>> at this stage.
>>>
>>> Note that specifically the Features register (offset 08h) is still under
>>> consideration. In particular, its bit 0 seems useless now as its benefit
>>> to guests, specifically when they want to be portable, is close to zero.
>>> Maybe the register should still be kept, with all bits RsvdZ, for easier
>>> extensibility.
>>>
>>> The rest appears now rather mature and reasonably implementable, as
>>> proven also by a version for Jailhouse.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>  docs/specs/ivshmem-2-device-spec.md | 333 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 333 insertions(+)
>>>  create mode 100644 docs/specs/ivshmem-2-device-spec.md
>>>
>>> diff --git a/docs/specs/ivshmem-2-device-spec.md b/docs/specs/ivshmem-2-device-spec.md
>>> new file mode 100644
>>> index 0000000000..98cfde585a
>>> --- /dev/null
>>> +++ b/docs/specs/ivshmem-2-device-spec.md
>>> @@ -0,0 +1,333 @@
>>> +IVSHMEM Device Specification
>>> +============================
>>> +
>>> +** NOTE: THIS IS WORK-IN-PROGRESS, NOT YET A STABLE INTERFACE SPECIFICATION! **
>>> +
>>> +The Inter-VM Shared Memory device provides the following features to its users:
>>> +
>>> +- Interconnection between up to 65536 peers
>>> +
>>> +- Multi-purpose shared memory region
>>> +
>>> +    - common read/writable section
>>> +
>>> +    - unidirectional sections that are read/writable for one peer and only
>>> +      readable for the others
>>> +
>>> +    - section with peer states
>>> +
>>> +- Event signaling via interrupt to remote sides
>>> +
>>> +- Support for life-cycle management via state value exchange and interrupt
>>> +  notification on changes, backed by a shared memory section
>>> +
>>> +- Free choice of protocol to be used on top
>>> +
>>> +- Protocol type declaration
>>> +
>>> +- Unprivileged access to memory-mapped or I/O registers feasible
>>> +
>>> +- Discoverable and configurable via standard PCI mechanisms
>> 
>> Stating requirements is much appreciated.  Design rationale would be
>> even better :)
>
> The idea is to avoid having to model also a platform device with
> platform discovery mechanism. Based on our experiences with providing a
> generic virtual PCI host controller, such a complication of
> specification and implementation is simply not needed.

Work that into the document please.

>> As pointed out many times, shared memory is not a solution to any
>> communication problem, it's merely a building block for building such
>> solutions: you invariably have to layer some protocol on top.  In your
>> KVM Forum talk, you mention layering virtio on top.  Makes sense to me.
>> But why does *this* virtio transport have to be an independent device?
>> Other transports aren't.
>
> This device is not only a virtio transport. As you correctly point out,
> it is a foundation for communication, not the full solution by itself.
>
> Its goal is to define the minimal piece that a hypervisor has to provide
> to its guests in order to enable them building full communication
> solutions of their preferred flavor on top. It does not want to deal
> with those details, may they be virtio, may they be something much
> simpler, much older or much more sophisticated (or complicated). All
> that shall not be the business of a hypervisor.

Likewise.

>> Now let me indulge in spec nitpicking :)
>> 
>>> +
>>> +
>>> +Hypervisor Model
>>> +----------------
>>> +
>>> +In order to provide a consistent link between peers, all connected instances of
>>> +IVSHMEM devices need to be configured, created and run by the hypervisor
>>> +according to the following requirements:
>>> +
>>> +- The instances of the device need to be accessible via PCI programming
>>> +  interfaces on all sides.
>> 
>> What does that mean?
>
> "From a guest point of view, this shall be a PCI device."

Clearer.

>>> +
>>> +- The read/write shared memory section has to be of the same size for all
>>> +  peers and, if non-zero, has to reflect the same memory content for them.
>> 
>> Isn't "same memory content" redundant with "shared memory"?
>
> Probably.
>
>> 
>>> +
>>> +- If output sections are present (non-zero section size), there must be one
>>> +  reserved for each peer with exclusive write access. All output sections
>>> +  must have the same size and must be readable for all peers. They have to
>>> +  reflect the same memory content for all peers.
>> 
>> Are these the "unidirectional sections" mentioned previously?
>
> Yep. The terms have some history, maybe worth consolidating them.

Yes, please.  Consistent terminology helps.

>>> +
>>> +- The State Table must have the same size for all peers, must be large enough to
>>> +  hold a state values of all peers, and must be read-only for the user.
>> 
>> "the state values", I guess.
>
> Yes.
>
>> 
>>> +
>>> +- State register changes (explicit writes, peer resets) have to be propagated
>>> +  to the other peers by updating the corresponding State Table entry and issuing
>>> +  an interrupt to all other peers if they enabled reception.
>>> +
>>> +- Interrupts events triggered by a peer have to be delivered to the target peer,
>>> +  provided the receiving side is valid and has enabled the reception.
>>> +
>>> +- All peers must have the same interrupt delivery features available, i.e. MSI-X
>>> +  with the same maximum number of vectors on platforms supporting this
>>> +  mechanism, otherwise INTx with one vector.
>> 
>> Use case for legacy INTx?
>
> Platforms without support for MSI-X injection. Quite a few current ARM
> platforms still fall into this category, but they get less, indeed.
>
>> 
>> For what it's worth, we removed INTx support from ivshmem v1 in rev 1,
>> QEMU 2.6.
>
> Maybe worth to look into this again, specifically when we are done with
> ITS emulation on the Jailhouse side. Maybe it will be the same story as
> with our virtual PCI host controller there: After we had to add PCI
> logic for devices with real controller anyway, the virtual one was
> literally for free (not a single line of code added).

I'd like to encourage you to consider this once more.  Dropping legacy
INTx support will simplify this document quite a bit.

>>> +
>>> +
>>> +Guest-side Programming Model
>>> +----------------------------
>>> +
>>> +An IVSHMEM device appears as a PCI device to its users. Unless otherwise noted,
>>> +it conforms to the PCI Local Bus Specification, Revision 3.0 As such, it is
>> 
>> Uh, is there anything in it that does *not* conform to this spec?
>
> See below.

I'm not opposed to reasoned deviations.  I'd like to see the "unless
otherwise noted" promise honored with explicit notes, though.

>> 
>>> +discoverable via the PCI configuration space and provides a number of standard
>>> +and custom PCI configuration registers.
>>> +
>>> +### Shared Memory Region Layout
>>> +
>>> +The shared memory region is divided into several sections.
>>> +
>>> +    +-----------------------------+   -
>>> +    |                             |   :
>>> +    | Output Section for peer n-1 |   : Output Section Size
>>> +    |     (n = Maximum Peers)     |   :
>>> +    +-----------------------------+   -
>>> +    :                             :
>>> +    :                             :
>>> +    :                             :
>>> +    +-----------------------------+   -
>>> +    |                             |   :
>>> +    |  Output Section for peer 1  |   : Output Section Size
>>> +    |                             |   :
>>> +    +-----------------------------+   -
>>> +    |                             |   :
>>> +    |  Output Section for peer 0  |   : Output Section Size
>>> +    |                             |   :
>>> +    +-----------------------------+   -
>>> +    |                             |   :
>>> +    |     Read/Write Section      |   : R/W Section Size
>>> +    |                             |   :
>>> +    +-----------------------------+   -
>>> +    |                             |   :
>>> +    |         State Table         |   : State Table Size
>>> +    |                             |   :
>>> +    +-----------------------------+   <-- Shared memory base address
>>> +
>>> +The first section consists of the mandatory State Table. Its size is defined by
>>> +the State Table Size register and cannot be zero. This section is read-only for
>>> +all peers.
>>> +
>>> +The second section consists of shared memory that is read/writable for all
>>> +peers. Its size is defined by the R/W Section Size register. A size of zero is
>>> +permitted.
>>> +
>>> +The third and following sections are unidirectional output sections, one for
>>> +each peer. Their sizes are all identical. The size of a single output section is
>>> +defined by the Output Section Size register. An output section is read/writable
>>> +for the corresponding peer and read-only for all other peers. E.g., only the
>>> +peer with ID 3 can write to the fourths output section, but all peers can read
>>> +from this section.
>>> +
>>> +All sizes have to be rounded up to multiples of a mappable page in order to
>>> +allow access control according to the section restrictions.
>>> +
>>> +### Configuration Space Registers
>>> +
>>> +#### Header Registers
>>> +
>>> +| Offset | Register               | Content                                              |
>>> +|-------:|:-----------------------|:-----------------------------------------------------|
>>> +|    00h | Vendor ID              | 1AF4h                                                |
>>> +|    02h | Device ID              | 1110h                                                |
>> 
>> Same as ivshmem v1.  Revision ID (offset 08h) disambiguates.
>
> Will become 110Ah (Siemens) / 4106h (freshly reserved for a new ivshmem)
> in the next round.

Better.

>>> +|    04h | Command Register       | 0000h on reset, implementing bits 1, 2, 10           |
>> 
>> What does "implementing bits ..." mean?
>
> Ignore writes to the other bits.
>
>> 
>>> +|    06h | Status Register        | 0010h, static value (bit 3 not implemented)          |
>> 
>> What does "bit 3 not implemented" mean?
>
> Will stay at 0, no "pending INTx" information available.
>
>> 
>>> +|    08h | Revision ID            | 02h                                                  |
>> 
>> ivshmem v1 is rev 0 before QEMU 2.6, rev 1 since.  Rev 1 is
>> backward-compatible to rev 0 for guest software.  Is rev 2 intended to
>> be backward-compatible, too?
>> 
>> You should probably explan how your v2 relates to v1 in more detail,
>> possibly in its own top-level section.
>
> With that new, separate device IT, this register will start at 0 in the
> next revision of this document.

By making the device clearly different that way, you don't need to
discuss differences in detail.  In particular, backward compatibility
can be treated as non-issue.  But I'd still like to see a higher level
discussion of the differences.  In your KVM Forum presentation, you
covered v1's deficits (slide 5), and v2's key differences (slide 6).

docs/specs/ivshmem-spec.txt documents numerous v1 deficits.  Referring
to that document might be helpful.

>>> +|    09h | Class Code, Interface  | Protocol Type bits 0-7, see [Protocols](#Protocols)  |
>>> +|    0Ah | Class Code, Sub-Class  | Protocol Type bits 8-15, see [Protocols](#Protocols) |
>>> +|    0Bh | Class Code, Base Class | FFh                                                  |
>> 
>> FFh means "device does not fit any defined class."  For what it's worth,
>> ivshmem v1 uses 05h "Memory Controller", with sub-class and interface
>> 00h "RAM Controller".
>
> I know. But using that scheme would needlessly wast 8 bits for protocol
> ID encoding. In turn, having the protocol encoded into the class code
> helps a lot with automatic driver probing on Linux e.g.

Less of an issue (or none-issue?) now that you use a fresh Vendor ID /
Device ID pair.

>> 
>>> +|    0Eh | Header Type            | 00h                                                  |
>>> +|    10h | BAR 0                  | MMIO or I/O register region                          |
>> 
>> Use case for I/O?
>
> Faster VM exit handling. But, granted, that was not exploited so far in
> existing implementations. I just wanted to keep that door open.

Write that down, please :)

*Why* things are designed a certain way is often as useful to know as
the actual design.

>> For what it's worth, ivshmem v1 never supported I/O.
>> 
>>> +|    14h | BAR 1                  | MSI-X region                                         |
>>> +|    18h | BAR 2 (with BAR 3)     | optional: 64-bit shared memory region                |
>> 
>> What does "(with BAR 3)" mean?
>
> 64-bit resource regions always use two BARs, see PCI spec.

Right.  Perhaps we can convey that more clearly to readers who forgot
(or never knew) this PCI detail, but I admit I'm drawing blanks right
now.

>>> +|    2Ch | Subsystem Vendor ID    | same as Vendor ID, or provider-specific value        |
>>> +|    2Eh | Subsystem ID           | same as Device ID, or provider-specific value        |
>> 
>> ivshmem v1 leaves these blank.
>
> Same story as with virtio: This should encode the ID of the provider
> (hypervisor), also for the case we ever face deviating implementations,
> due to quirks of specification vagueness.

This intended use makes sense.  Write it down, please.

>>> +|    34h | Capability Pointer     | First capability                                     |
>>> +|    3Eh | Interrupt Pin          | 01h-04h, must be 00h if MSI-X is available           |
>> 
>> "If MSI-X is available"?
>> 
>> A PCIe device always provides MSI-X, and may additionally provide legacy
>> INTx.  A conventional PCI device may provide either or both.  In any
>> case, the Interrupt Pin register is zero when legacy INTx is not
>> provided,
>
> All known. This shall just "encourage" to provide more powerfull MSI-x
> when the platform provides that for other devices anyway. Only if the
> platform this device is injected to does not have such support, INTx
> shall be there.

In case you keep legacy INTx support: I'd phrase this as a (strong)
recommendation, not as (vague) specification.

>>> +
>>> +If BAR 2 is not present, the shared memory region is not relocatable by the
>>> +user. In that case, the hypervisor has to implement the Base Address register in
>>> +the vendor-specific capability.
>>> +
>>> +Other header registers may not be implemented. If not implemented, they return 0
>>> +on read and ignore write accesses.
>> 
>> Is this an example of where the device does not conform to the PCI Local
>> Bus Specification?
>
> - INTx is an edge interrupt
> - there is no INTx status bit
> - there is no pending MSI-X bitmap

Two more reasons to ditch legacy INTx.

For each of these, I'd like to see an clear notice of deviation from the
PCI spec, and rationale (why are deviating).

>>> +
>>> +#### Vendor Specific Capability (ID 09h)
>>> +
>>> +| Offset | Register            | Content                                        |
>>> +|-------:|:--------------------|:-----------------------------------------------|
>>> +|    00h | ID                  | 09h                                            |
>>> +|    01h | Next Capability     | Pointer to next capability or 00h              |
>>> +|    02h | Length              | 18h or 20h                                     |
>>> +|    03h | Privileged Control  | Bit 0 (read/write): one-shot interrupt mode    |
>>> +|        |                     | Bits 1-7: RsvdZ                                |
>> 
>> Please define RsvdZ somewhere, or use plainer language.
>
> OK.
>
>> 
>>> +|    04h | State Table Size    | 32-bit size of read-only State Table           |
>>> +|    08h | R/W Section Size    | 64-bit size of common read/write section       |
>>> +|    10h | Output Section Size | 64-bit size of unidirectional output sections  |
>>> +|    18h | Base Address        | optional: 64-bit base address of shared memory |
>> 
>> Length is 20h when Base Adress is present, else 18h.  Worth spelling
>> that out?
>
> Sure.
>
>> 
>>> +
>>> +All registers are read-only, except for bit 0 of the Privileged Control
>>> +register.
>> 
>> Well, the other bits of that register are writable, they're just
>> ignored.
>> 
>> For what it's worth, existing ivshmem-spec.txt uses the terms read-only,
>> write-only, read/write and reserved bits rigorously: "Software should
>> only access the registers as specified [...]  Reserved bits should be
>> ignored on read, and preserved on write."
>
> Will change.

Could also help with documenting Command and Status register more
clearly.

>>> +
>>> +When bit 0 in the Privileged Control register is set to 1, the device clears
>>> +bit 0 in the Interrupt Control register on each interrupt delivery. This enables
>>> +automatic interrupt throttling when re-enabling shall be performed by a
>>> +scheduled unprivileged instance on the user side.
>>> +
>>> +If an IVSHMEM device does not support a relocatable shared memory region, BAR 2
>>> +must not be implemented by the provider. Instead, the Base Address register has
>>> +to be implemented to report the location of the shared memory region in the
>>> +user's address space.
>> 
>> Rationale for not wanting to support relocatable shared memory?
>
> Locked-down second stage page tables during runtime. We can even
> check-sum their content periodically when we do not allow guest to
> influence the shared memory guest location. Relevant for safety and
> possibly also security scenarios.

Write that down, please.

>>> +
>>> +A non-existing shared memory section has to report zero in its Section Size
>>> +register.
>> 
>> This vendor-specific capability must always be present, I presume.
>> Worth spelling out.
>
> Ack.
>
>> 
>>> +
>>> +#### MSI-X Capability (ID 11h)
>>> +
>>> +On platforms supporting MSI-X, IVSHMEM has to provide interrupt delivery via
>>> +this mechanism. In that case, the legacy INTx delivery mechanism is not
>>> +available, and the Interrupt Pin configuration register returns 0.
>> 
>> I'm confused.  Does that mean the device shall support either MSI-X or
>> legacy INTx, but never both?
>
> Correct, see above. There is no point in requesting both and requiring
> the hypervisor to deal with the guest switching between things during
> runtime.

The PCI spec is cool with devices that only implement one of the two.
But shipping one kind of device come in two such different flavours
while still sharing the same Vendor ID, Device ID is something that
would get a hardware vendor yelled at, and rightfully so.

Now, this isn't physical hardware.  And you might have a high tolerance
for getting yelled at.

Still, yet another reason to ditch legacy INTx.

If not, I'd advocate for a more prominent "caveat emptor" sticker on
this particular, unusual feature.

>>> +
>>> +The IVSHMEM device has no notion of pending interrupts. Therefore, reading from
>>> +the MSI-X Pending Bit Array will always return 0.
>> 
>> I guess this means a polling mode of operation is not possible.
>> Correct?
>
> Of course, polling remains possible, but the guest driver should rather
> obtain the interrupt-triggering state information from the much faster
> shared memory. There is simply no value in also mirroring that into
> interrupt registers for this particular device.

I see.  Mirroring it just for spec compliance may not be worthwhile.
Please explain this in this document.

>>> +
>>> +The corresponding MSI-X MMIO region is configured via BAR 1.
>>> +
>>> +The MSI-X table size reported by the MSI-X capability structure is identical for
>>> +all peers.
>>> +
>>> +### Register Region
>>> +
>>> +The register region may be implemented as MMIO or I/O.
>>> +
>>> +When implementing it as MMIO, the hypervisor has to ensure that the register
>>> +region can be mapped as a single page into the address space of the user. Write
>> 
>> "can be mapped as a single page" depends on the host system, not the
>> device.
>
> When host system provides this device, it must ensure that the MMIO
> register region can be mapped by its guest as one page, without
> potential overlap with other resources.
>
>> 
>> For what it's worth, ivshmem v1 fixes the size of BAR0 to 256 bytes.
>> Any particular reason to keep its size so loosely specified in v2?
>
> UIO. The v1 device can't be handed out to untrusted UIO applications
> without intercepting the MMIO writes in the guest kernel in order to
> restrict them to the valid range. This device supports that.

Sad.  BARs changing size while Vendor ID, Device ID stays the same is
unusual, but the other BARs *have* be that way, so this one hardly makes
it more unusual.

Please write this down.

>>> +accesses to MMIO region offsets that are not backed by registers have to be
>>> +ignored, read accesses have to return 0. This enables the user to hand out the
>>> +complete region, along with the shared memory, to an unprivileged instance.
>>> +
>>> +The region location in the user's physical address space is configured via BAR
>>> +0. The following table visualizes the region layout:
>>> +
>>> +| Offset | Register                                                            |
>>> +|-------:|:--------------------------------------------------------------------|
>>> +|    00h | ID                                                                  |
>>> +|    04h | Maximum Peers                                                       |
>>> +|    08h | Features                                                            |
>>> +|    0Ch | Interrupt Control                                                   |
>>> +|    10h | Doorbell                                                            |
>>> +|    14h | State                                                               |
>>> +
>>> +All registers support only aligned 32-bit accesses.
>> 
>> Definitely not backwards compatible to rev 1.  I figure that means v2
>> should use a different Vendor ID / Device ID, not just bump the Revsion
>> ID.
>
> Yep, see above.
>
>> 
>> No interrupt status register?
>
> Yep, see above.

Where?  Is it "guest driver should rather obtain the
interrupt-triggering state information from the much faster shared
memory"?

>> 
>>> +
>>> +#### ID Register (Offset 00h)
>>> +
>>> +Read-only register that reports the ID of the local device. It is unique for all
>>> +of the connected devices and remains unchanged over their lifetime.
>>> +
>>> +#### Maximum Peers Register (Offset 04h)
>>> +
>>> +Read-only register that reports the maximum number of possible peers (including
>>> +the local one). The supported range is between 2 and 65536 and remains constant
>>> +over the lifetime of all peers.
>> 
>> Value 1 would be boring, but is it really impossible?
>
> Likely not, didn't think about it so far. Would we gain anything by
> permitting this?

Running a lone guest with an ivshmem device wouldn't be out of spec.

Never stipulate an error you don't want to detect and handle :)

>>> +
>>> +#### Features Register (Offset 08h)
>>> +
>>> +Read-only register that reports features of the local device or the connected
>>> +peers. Its content remains constant over the lifetime of all peers.
>>> +
>>> +| Bits | Content                                                               |
>>> +|-----:|:----------------------------------------------------------------------|
>>> +|    0 | 1: Synchronized shared memory base address                            |
>>> +| 1-31 | RsvdZ                                                                 |
>>> +
>>> +If "synchronized shared memory base address" is reported (bit 0 is set), the
>>> +shared memory region is mapped at the same address into the user address spaces
>>> +of all connected peers. Thus, peers can use physical addresses as pointers when
>>> +exchanging information via the shared memory. This feature flag is never set
>>> +when the shared memory region is relocatable via BAR 2.
>>> +
>>> +#### Interrupt Control Register (Offset 0Ch)
>>> +
>>> +This read/write register controls the generation of interrupts whenever a peer
>>> +writes to the Doorbell register or changes its state.
>>> +
>>> +| Bits | Content                                                               |
>>> +|-----:|:----------------------------------------------------------------------|
>>> +|    0 | 1: Enable interrupt generation                                        |
>>> +| 1-31 | RsvdZ                                                                 |
>>> +
>>> +Note that bit 0 is reset to 0 on interrupt delivery if one-shot interrupt mode
>>> +is enabled in the Enhanced Features register.
>>> +
>>> +The value of this register after device reset is 0.
>> 
>> I presume this applies only to legacy INTx.  Correct?
>
> Nope, this applies to both INTx and MSI-x. It's a building block for UIO
> devices, see one-shot interrupt mode. In that mode, bit 0 will be
> cleared by the device after each interrupt delivery, avoiding that the
> guest UIO driver needs to do that via a vmexit.

How does this play with MSI-X per-vector masking?

>> ivshmem v1 calls this Interrupt Mask.
>
> As "set" does not mean "masked", I chose "enable" here.

I'm confused.  Can you explain the difference?

>>> +
>>> +#### Doorbell Register (Offset 10h)
>>> +
>>> +Write-only register that triggers an interrupt vector in the target device if it
>>> +is enabled there.
>>> +
>>> +| Bits  | Content                                                              |
>>> +|------:|:---------------------------------------------------------------------|
>>> +|  0-15 | Vector number                                                        |
>>> +| 16-31 | Target ID                                                            |
>>> +
>>> +Writing a vector number that is not enabled by the target has no effect. The
>>> +peers can derive the number of available vectors from their own device
>>> +capabilities and are expected to define or negotiate the used ones via the
>>> +selected protocol.
>> 
>> That's because all peers use the same number of vectors.  Worth spelling out?
>
> I thought I did that somewhere. Will check.
>
>> 
>>> +
>>> +Addressing a non-existing or inactive target has no effect. Peers can identify
>>> +active targets via the State Table.
>>> +
>>> +The behavior on reading from this register is undefined.
>>> +
>>> +#### State Register (Offset 14h)
>>> +
>>> +Read/write register that defines the state of the local device. Writing to this
>>> +register sets the state and triggers interrupt vector 0 on the remote device if
>> 
>> "the remote device"?  Do you mean all peers?
>
> Historic singular (an earlier revision only supported 2 peers). Will fix.
>
>> 
>> "interrupt vector 0" assumes MSI-X.  With legacy INTx, it simply
>> triggers an interrupt.
>
> Right, INTx only has vector 0.

INTx has no concept of vectors, it uses pins.  Suggest to be a bit more
verbose to avoid confusion.

>> How can guest software distinguish between "interrupt due to remote
>> state change" and "interrupt due to doorbell"?
>
> By checking those states. Both are in shared memory, thus quickly readable.

I think that's worth pointing out somewhere.

>> Things become so much easier when you ditch INTx: reserve vector 0 for
>> state change, done.
>
> That aspect does not make things complicated with INTx, but I agree that
> we should look at its "TCO" again.
>
>> 
>>> +the written state value differs from the previous one. The user of the remote
>>> +device can read the value written to this register from the State Table.
>>> +
>>> +The value of this register after device reset is 0.
>> 
>> I guess the meaning of state values depends on the protocol type.
>> Correct?
>
> Yes, except for value 0 which means "reset state". The protocols are
> free to define all other values according to their needs.
>
>> 
>>> +
>>> +### State Table
>>> +
>>> +The State Table is a read-only section at the beginning of the shared memory
>>> +region. It contains a 32-bit state value for each of the peers. Locating the
>>> +table in shared memory allows fast checking of remote states without register
>>> +accesses.
>>> +
>>> +The table is updated on each state change of a peers. Whenever a user of an
>>> +IVSHMEM device writes a value to the Local State register, this value is copied
>>> +into the corresponding entry of the State Table. When a IVSHMEM device is reset
>>> +or disconnected from the other peers, zero is written into the corresponding
>>> +table entry. The initial content of the table is all zeros.
>>> +
>>> +    +--------------------------------+
>>> +    | 32-bit state value of peer n-1 |
>>> +    +--------------------------------+
>>> +    :                                :
>>> +    +--------------------------------+
>>> +    | 32-bit state value of peer 1   |
>>> +    +--------------------------------+
>>> +    | 32-bit state value of peer 0   |
>>> +    +--------------------------------+ <-- Shared memory base address
>>> +
>>> +
>>> +Protocols
>>> +---------
>>> +
>>> +The IVSHMEM device shall support the peers of a connection in agreeing on the
>>> +protocol used over the shared memory devices. For that purpose, the interface
>>> +byte (offset 09h) and the sub-class byte (offset 0Ah) of the Class Code register
>>> +encodes a 16-bit protocol type for the users. The following type values are
>>> +defined:
>>> +
>>> +| Protocol Type | Description                                                  |
>>> +|--------------:|:-------------------------------------------------------------|
>>> +|         0000h | Undefined type                                               |
>>> +|         0001h | Virtual peer-to-peer Ethernet                                |
>>> +|   0002h-3FFFh | Reserved                                                     |
>>> +|   4000h-7FFFh | User-defined protocols                                       |
>>> +|   8000h-BFFFh | Virtio over Shared Memory, front-end peer                    |
>>> +|   C000h-FFFFh | Virtio over Shared Memory, back-end peer                     |
>>> +
>>> +Details of the protocols are not in the scope of this specification.
>> 
>> Are you sure this use of PCI class code is kosher?
>
> Well, we are in the 0xff range. To my understanding, there are no
> constraints on what you put in the other bits for that class.
>
>> 
>> Final request: please break your lines around column 70 for readability.
>> 
>
> Sure, can do.
>
> Thanks a lot,

You're welcome!



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-11 12:57 [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU Jan Kiszka
2019-11-11 12:57 ` [RFC][PATCH 1/3] hw/misc: Add implementation of ivshmem revision 2 device Jan Kiszka
2019-11-11 12:57 ` [RFC][PATCH 2/3] docs/specs: Add specification of ivshmem device revision 2 Jan Kiszka
2019-11-11 13:45   ` Michael S. Tsirkin
2019-11-11 13:59     ` Jan Kiszka
2019-11-11 15:08       ` Michael S. Tsirkin
2019-11-11 15:27         ` Daniel P. Berrangé
2019-11-11 15:42           ` Jan Kiszka
2019-11-11 16:14             ` Michael S. Tsirkin
2019-11-11 16:25               ` Jan Kiszka
2019-11-11 16:11           ` Michael S. Tsirkin
2019-11-11 16:38             ` Jan Kiszka
2019-11-12  8:04               ` Michael S. Tsirkin
2019-11-20 18:15                 ` Jan Kiszka
2019-12-05 11:14   ` Markus Armbruster
2019-12-05 21:29     ` Jan Kiszka
2019-12-06 10:08       ` Markus Armbruster
2019-11-11 12:57 ` [RFC][PATCH 3/3] contrib: Add server for ivshmem " Jan Kiszka
2019-11-12  0:56 ` [RFC][PATCH 0/3] IVSHMEM version 2 device for QEMU no-reply
2019-11-27 15:28 ` Liang Yan
2019-11-27 17:19   ` Jan Kiszka
2019-12-02  6:16     ` Jan Kiszka
     [not found]       ` <877b0cd9-d1c5-00c9-c4b6-567c67740962@suse.com>
2019-12-03  7:14         ` Jan Kiszka

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git