[Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration
@ 2010-02-25 18:27 Michael S. Tsirkin
  2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
                   ` (12 more replies)
  0 siblings, 13 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:27 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

Here's a patchset with vhost support for upstream qemu,
rabed to latest bits.

Note that irqchip/MSI is no longer required for vhost, but you should
not expect performance gains from vhost unless in-kernel irqchip is
enabled (which is not in upstream qemu now), and unless guest enables
MSI.  A follow-up patchset against qemu-kvm will add irqchip support.

Only virtio-pci is currently supported: I'm interested in supporting
syborg/s390 as well, and tried to make APIs generic to make this
possible.

Also missing is packet socket backend.

Cc'd, you did review of these internally, I would be thankful
for review/ack upstream.

Changes from v1:
  Addressed style comments
  Migration fixes.
  Gracefully fail with non-tap backends.

Michael S. Tsirkin (12):
  tap: add interface to get device fd
  kvm: add API to set ioeventfd
  notifier: event notifier implementation
  virtio: add notifier support
  virtio: add APIs for queue fields
  virtio: add set_status callback
  virtio: move typedef to qemu-common
  virtio-pci: fill in notifier support
  vhost: vhost net support
  tap: add vhost/vhostfd options
  tap: add API to retrieve vhost net header
  virtio-net: vhost net support

 Makefile.target      |    3 +
 configure            |   21 ++
 hw/notifier.c        |   50 ++++
 hw/notifier.h        |   16 ++
 hw/s390-virtio-bus.c |    7 +-
 hw/syborg_virtio.c   |    2 +
 hw/vhost.c           |  631 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vhost.h           |   44 ++++
 hw/vhost_net.c       |  177 ++++++++++++++
 hw/vhost_net.h       |   20 ++
 hw/virtio-net.c      |   71 ++++++-
 hw/virtio-pci.c      |   71 ++++++-
 hw/virtio.c          |   55 +++++-
 hw/virtio.h          |   15 +-
 kvm-all.c            |   22 ++
 kvm.h                |   16 ++
 net.c                |    8 +
 net/tap.c            |   47 ++++
 net/tap.h            |    5 +
 qemu-common.h        |    2 +
 qemu-options.hx      |    4 +-
 21 files changed, 1279 insertions(+), 8 deletions(-)
 create mode 100644 hw/notifier.c
 create mode 100644 hw/notifier.h
 create mode 100644 hw/vhost.c
 create mode 100644 hw/vhost.h
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
@ 2010-02-25 18:27 ` Michael S. Tsirkin
  2010-02-25 18:49   ` Blue Swirl
  2010-02-25 19:25   ` [Qemu-devel] " Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support Michael S. Tsirkin
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:27 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

vhost needs physical addresses for ring and other queue fields,
so add APIs for these.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/virtio.h |   10 +++++++++-
 2 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 1f5e7be..b017d7b 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -74,6 +74,11 @@ struct VirtQueue
     uint16_t vector;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
     VirtIODevice *vdev;
+<<<<<<< HEAD
+=======
+    EventNotifier guest_notifier;
+    EventNotifier host_notifier;
+>>>>>>> 8afa4fd... virtio: add APIs for queue fields
 };
 
 /* virt queue functions */
@@ -593,6 +598,12 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
     return &vdev->vq[i];
 }
 
+void virtio_irq(VirtQueue *vq)
+{
+    vq->vdev->isr |= 0x01;
+    virtio_notify_vector(vq->vdev, vq->vector);
+}
+
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     /* Always notify when queue is empty (when feature acknowledge) */
@@ -736,3 +747,42 @@ void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
     vdev->binding = binding;
     vdev->binding_opaque = opaque;
 }
+
+target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n)
+{
+	return vdev->vq[n].vring.desc;
+}
+
+target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n)
+{
+	return vdev->vq[n].vring.avail;
+}
+
+target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n)
+{
+	return vdev->vq[n].vring.used;
+}
+
+uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n)
+{
+	return vdev->vq[n].last_avail_idx;
+}
+
+void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx)
+{
+	vdev->vq[n].last_avail_idx = idx;
+}
+
+VirtQueue *virtio_queue(VirtIODevice *vdev, int n)
+{
+	return vdev->vq + n;
+}
+
+EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq)
+{
+	return &vq->guest_notifier;
+}
+EventNotifier *virtio_queue_host_notifier(VirtQueue *vq)
+{
+	return &vq->host_notifier;
+}
diff --git a/hw/virtio.h b/hw/virtio.h
index af87889..2ebf2dd 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -184,5 +184,13 @@ void virtio_net_exit(VirtIODevice *vdev);
 	DEFINE_PROP_BIT("indirect_desc", _state, _field, \
 			VIRTIO_RING_F_INDIRECT_DESC, true)
 
-
+target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n);
+target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n);
+target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n);
+uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n);
+void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
+VirtQueue *virtio_queue(VirtIODevice *vdev, int n);
+EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq);
+EventNotifier *virtio_queue_host_notifier(VirtQueue *vq);
+void virtio_irq(VirtQueue *vq);
 #endif
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
  2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:04   ` [Qemu-devel] " Juan Quintela
  2010-02-25 19:44   ` Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 02/12] kvm: add API to set ioeventfd Michael S. Tsirkin
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

This adds vhost net device support in qemu. Will be tied to tap device
and virtio by following patches.  Raw backend is currently missing,
will be worked on/submitted separately.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 Makefile.target |    2 +
 configure       |   21 ++
 hw/vhost.c      |  631 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vhost.h      |   44 ++++
 hw/vhost_net.c  |  177 ++++++++++++++++
 hw/vhost_net.h  |   20 ++
 6 files changed, 895 insertions(+), 0 deletions(-)
 create mode 100644 hw/vhost.c
 create mode 100644 hw/vhost.h
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h

diff --git a/Makefile.target b/Makefile.target
index c1580e9..9b4fd84 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -174,6 +174,8 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
 # need to fix this properly
 obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
 obj-y += notifier.o
+obj-y += vhost_net.o
+obj-$(CONFIG_VHOST_NET) += vhost.o
 obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
diff --git a/configure b/configure
index 8eb5f5b..5eccc7c 100755
--- a/configure
+++ b/configure
@@ -1498,6 +1498,23 @@ EOF
 fi
 
 ##########################################
+# test for vhost net
+
+if test "$kvm" != "no"; then
+	cat > $TMPC <<EOF
+#include <linux/vhost.h>
+int main(void) { return 0; }
+EOF
+	if compile_prog "$kvm_cflags" "" ; then
+	vhost_net=yes
+	else
+	vhost_net=no
+	fi
+else
+	vhost_net=no
+fi
+
+##########################################
 # pthread probe
 PTHREADLIBS_LIST="-lpthread -lpthreadGC2"
 
@@ -1968,6 +1985,7 @@ echo "fdt support       $fdt"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
 echo "uuid support      $uuid"
+echo "vhost-net support $vhost_net"
 
 if test $sdl_too_old = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -2492,6 +2510,9 @@ case "$target_arch2" in
       if test "$kvm_para" = "yes"; then
         echo "CONFIG_KVM_PARA=y" >> $config_target_mak
       fi
+      if test $vhost_net = "yes" ; then
+        echo "CONFIG_VHOST_NET=y" >> $config_target_mak
+      fi
     fi
 esac
 echo "TARGET_PHYS_ADDR_BITS=$target_phys_bits" >> $config_target_mak
diff --git a/hw/vhost.c b/hw/vhost.c
new file mode 100644
index 0000000..4d5ea02
--- /dev/null
+++ b/hw/vhost.c
@@ -0,0 +1,631 @@
+#include "linux/vhost.h"
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "vhost.h"
+#include "hw/hw.h"
+/* For range_get_last */
+#include "pci.h"
+
+static void vhost_dev_sync_region(struct vhost_dev *dev,
+                                  uint64_t mfirst, uint64_t mlast,
+                                  uint64_t rfirst, uint64_t rlast)
+{
+    uint64_t start = MAX(mfirst, rfirst);
+    uint64_t end = MIN(mlast, rlast);
+    vhost_log_chunk_t *from = dev->log + start / VHOST_LOG_CHUNK;
+    vhost_log_chunk_t *to = dev->log + end / VHOST_LOG_CHUNK + 1;
+    uint64_t addr = (start / VHOST_LOG_CHUNK) * VHOST_LOG_CHUNK;
+
+    assert(end / VHOST_LOG_CHUNK < dev->log_size);
+    assert(start / VHOST_LOG_CHUNK < dev->log_size);
+    if (end < start) {
+        return;
+    }
+    for (;from < to; ++from) {
+        vhost_log_chunk_t log;
+        int bit;
+        /* We first check with non-atomic: much cheaper,
+         * and we expect non-dirty to be the common case. */
+        if (!*from) {
+            continue;
+        }
+        /* Data must be read atomically. We don't really
+         * need the barrier semantics of __sync
+         * builtins, but it's easier to use them than
+         * roll our own. */
+        log = __sync_fetch_and_and(from, 0);
+        while ((bit = sizeof(log) > sizeof(int) ?
+                ffsll(log) : ffs(log))) {
+            bit -= 1;
+            cpu_physical_memory_set_dirty(addr + bit * VHOST_LOG_PAGE);
+            log &= ~(0x1ull << bit);
+        }
+        addr += VHOST_LOG_CHUNK;
+    }
+}
+
+static int vhost_client_sync_dirty_bitmap(struct CPUPhysMemoryClient *client,
+                                          target_phys_addr_t start_addr,
+                                          target_phys_addr_t end_addr)
+{
+    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
+    int i;
+    if (!dev->log_enabled || !dev->started) {
+        return 0;
+    }
+    for (i = 0; i < dev->mem->nregions; ++i) {
+        struct vhost_memory_region *reg = dev->mem->regions + i;
+        vhost_dev_sync_region(dev, start_addr, end_addr,
+                              reg->guest_phys_addr,
+                              range_get_last(reg->guest_phys_addr,
+                                             reg->memory_size));
+    }
+    for (i = 0; i < dev->nvqs; ++i) {
+        struct vhost_virtqueue *vq = dev->vqs + i;
+        unsigned size = offsetof(struct vring_used, ring) +
+            sizeof(struct vring_used_elem) * vq->num;
+        vhost_dev_sync_region(dev, start_addr, end_addr, vq->used_phys,
+                              range_get_last(vq->used_phys, size));
+    }
+    return 0;
+}
+
+/* Assign/unassign. Keep an unsorted array of non-overlapping
+ * memory regions in dev->mem. */
+static void vhost_dev_unassign_memory(struct vhost_dev *dev,
+                                      uint64_t start_addr,
+                                      uint64_t size)
+{
+    int from, to, n = dev->mem->nregions;
+    /* Track overlapping/split regions for sanity checking. */
+    int overlap_start = 0, overlap_end = 0, overlap_middle = 0, split = 0;
+
+    for (from = 0, to = 0; from < n; ++from, ++to) {
+        struct vhost_memory_region *reg = dev->mem->regions + to;
+        uint64_t reglast;
+        uint64_t memlast;
+        uint64_t change;
+
+        /* clone old region */
+        if (to != from) {
+            memcpy(reg, dev->mem->regions + from, sizeof *reg);
+        }
+
+        /* No overlap is simple */
+        if (!ranges_overlap(reg->guest_phys_addr, reg->memory_size,
+                            start_addr, size)) {
+            continue;
+        }
+
+        /* Split only happens if supplied region
+         * is in the middle of an existing one. Thus it can not
+         * overlap with any other existing region. */
+        assert(!split);
+
+        reglast = range_get_last(reg->guest_phys_addr, reg->memory_size);
+        memlast = range_get_last(start_addr, size);
+
+        /* Remove whole region */
+        if (start_addr <= reg->guest_phys_addr && memlast >= reglast) {
+            --dev->mem->nregions;
+            --to;
+            assert(to >= 0);
+            ++overlap_middle;
+            continue;
+        }
+
+        /* Shrink region */
+        if (memlast >= reglast) {
+            reg->memory_size = start_addr - reg->guest_phys_addr;
+            assert(reg->memory_size);
+            assert(!overlap_end);
+            ++overlap_end;
+            continue;
+        }
+
+        /* Shift region */
+        if (start_addr <= reg->guest_phys_addr) {
+            change = memlast + 1 - reg->guest_phys_addr;
+            reg->memory_size -= change;
+            reg->guest_phys_addr += change;
+            reg->userspace_addr += change;
+            assert(reg->memory_size);
+            assert(!overlap_start);
+            ++overlap_start;
+            continue;
+        }
+
+        /* This only happens if supplied region
+         * is in the middle of an existing one. Thus it can not
+         * overlap with any other existing region. */
+        assert(!overlap_start);
+        assert(!overlap_end);
+        assert(!overlap_middle);
+        /* Split region: shrink first part, shift second part. */
+        memcpy(dev->mem->regions + n, reg, sizeof *reg);
+        reg->memory_size = start_addr - reg->guest_phys_addr;
+        assert(reg->memory_size);
+        change = memlast + 1 - reg->guest_phys_addr;
+        reg = dev->mem->regions + n;
+        reg->memory_size -= change;
+        assert(reg->memory_size);
+        reg->guest_phys_addr += change;
+        reg->userspace_addr += change;
+        /* Never add more than 1 region */
+        assert(dev->mem->nregions == n);
+        ++dev->mem->nregions;
+        ++split;
+    }
+}
+
+/* Called after unassign, so no regions overlap the given range. */
+static void vhost_dev_assign_memory(struct vhost_dev *dev,
+                                    uint64_t start_addr,
+                                    uint64_t size,
+                                    uint64_t uaddr)
+{
+    int from, to;
+    struct vhost_memory_region *merged = NULL;
+    for (from = 0, to = 0; from < dev->mem->nregions; ++from, ++to) {
+        struct vhost_memory_region *reg = dev->mem->regions + to;
+        uint64_t prlast, urlast;
+        uint64_t pmlast, umlast;
+        uint64_t s, e, u;
+
+        /* clone old region */
+        if (to != from) {
+            memcpy(reg, dev->mem->regions + from, sizeof *reg);
+        }
+        prlast = range_get_last(reg->guest_phys_addr, reg->memory_size);
+        pmlast = range_get_last(start_addr, size);
+        urlast = range_get_last(reg->userspace_addr, reg->memory_size);
+        umlast = range_get_last(uaddr, size);
+
+        /* check for overlapping regions: should never happen. */
+        assert(prlast < start_addr || pmlast < reg->guest_phys_addr);
+        /* Not an adjacent or overlapping region - do not merge. */
+        if ((prlast + 1 != start_addr || urlast + 1 != uaddr) &&
+            (pmlast + 1 != reg->guest_phys_addr ||
+             umlast + 1 != reg->userspace_addr)) {
+            continue;
+        }
+
+        if (merged) {
+            --to;
+            assert(to >= 0);
+        } else {
+            merged = reg;
+        }
+        u = MIN(uaddr, reg->userspace_addr);
+        s = MIN(start_addr, reg->guest_phys_addr);
+        e = MAX(pmlast, prlast);
+        uaddr = merged->userspace_addr = u;
+        start_addr = merged->guest_phys_addr = s;
+        size = merged->memory_size = e - s + 1;
+        assert(merged->memory_size);
+    }
+
+    if (!merged) {
+        struct vhost_memory_region *reg = dev->mem->regions + to;
+        memset(reg, 0, sizeof *reg);
+        reg->memory_size = size;
+        assert(reg->memory_size);
+        reg->guest_phys_addr = start_addr;
+        reg->userspace_addr = uaddr;
+        ++to;
+    }
+    assert(to <= dev->mem->nregions + 1);
+    dev->mem->nregions = to;
+}
+
+static uint64_t vhost_get_log_size(struct vhost_dev *dev)
+{
+    uint64_t log_size = 0;
+    int i;
+    for (i = 0; i < dev->mem->nregions; ++i) {
+        struct vhost_memory_region *reg = dev->mem->regions + i;
+        uint64_t last = range_get_last(reg->guest_phys_addr,
+                                       reg->memory_size);
+        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
+    }
+    for (i = 0; i < dev->nvqs; ++i) {
+        struct vhost_virtqueue *vq = dev->vqs + i;
+        uint64_t last = vq->used_phys +
+            offsetof(struct vring_used, ring) +
+            sizeof(struct vring_used_elem) * vq->num - 1;
+        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
+    }
+    return log_size;
+}
+
+static inline void vhost_dev_log_resize(struct vhost_dev* dev, uint64_t size)
+{
+    vhost_log_chunk_t *log;
+    uint64_t log_base;
+    int r;
+    if (size) {
+        log = qemu_mallocz(size * sizeof *log);
+    } else {
+        log = NULL;
+    }
+    log_base = (uint64_t)(unsigned long)log;
+    r = ioctl(dev->control, VHOST_SET_LOG_BASE, &log_base);
+    assert(r >= 0);
+    vhost_client_sync_dirty_bitmap(&dev->client, 0,
+                                   (target_phys_addr_t)~0x0ull);
+    if (dev->log) {
+        qemu_free(dev->log);
+    }
+    dev->log = log;
+    dev->log_size = size;
+}
+
+static void vhost_client_set_memory(CPUPhysMemoryClient *client,
+                                    target_phys_addr_t start_addr,
+                                    ram_addr_t size,
+                                    ram_addr_t phys_offset)
+{
+    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
+    ram_addr_t flags = phys_offset & ~TARGET_PAGE_MASK;
+    int s = offsetof(struct vhost_memory, regions) +
+        (dev->mem->nregions + 1) * sizeof dev->mem->regions[0];
+    uint64_t log_size;
+    int r;
+    dev->mem = qemu_realloc(dev->mem, s);
+
+    assert(size);
+
+    vhost_dev_unassign_memory(dev, start_addr, size);
+    if (flags == IO_MEM_RAM) {
+        /* Add given mapping, merging adjacent regions if any */
+        vhost_dev_assign_memory(dev, start_addr, size,
+                                (uintptr_t)qemu_get_ram_ptr(phys_offset));
+    } else {
+        /* Remove old mapping for this memory, if any. */
+        vhost_dev_unassign_memory(dev, start_addr, size);
+    }
+
+    if (!dev->started) {
+        return;
+    }
+    if (!dev->log_enabled) {
+        r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
+        assert(r >= 0);
+        return;
+    }
+    log_size = vhost_get_log_size(dev);
+    /* We allocate an extra 4K bytes to log,
+     * to reduce the * number of reallocations. */
+#define VHOST_LOG_BUFFER (0x1000 / sizeof *dev->log)
+    /* To log more, must increase log size before table update. */
+    if (dev->log_size < log_size) {
+        vhost_dev_log_resize(dev, log_size + VHOST_LOG_BUFFER);
+    }
+    r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
+    assert(r >= 0);
+    /* To log less, can only decrease log size after table update. */
+    if (dev->log_size > log_size + VHOST_LOG_BUFFER) {
+        vhost_dev_log_resize(dev, log_size);
+    }
+}
+
+static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
+                                    struct vhost_virtqueue *vq,
+                                    unsigned idx, bool enable_log)
+{
+    struct vhost_vring_addr addr = {
+        .index = idx,
+        .desc_user_addr = (u_int64_t)(unsigned long)vq->desc,
+        .avail_user_addr = (u_int64_t)(unsigned long)vq->avail,
+        .used_user_addr = (u_int64_t)(unsigned long)vq->used,
+        .log_guest_addr = vq->used_phys,
+        .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
+    };
+    int r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
+    if (r < 0) {
+        return -errno;
+    }
+    return 0;
+}
+
+static int vhost_dev_set_features(struct vhost_dev *dev, bool enable_log)
+{
+    uint64_t features = dev->acked_features;
+    int r;
+    if (enable_log) {
+        features |= 0x1 << VHOST_F_LOG_ALL;
+    }
+    r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
+    return r < 0 ? -errno : 0;
+}
+
+static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
+{
+    int r, t, i;
+    r = vhost_dev_set_features(dev, enable_log);
+    if (r < 0)
+        goto err_features;
+    for (i = 0; i < dev->nvqs; ++i) {
+        r = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
+                                     enable_log);
+        if (r < 0)
+            goto err_vq;
+    }
+    return 0;
+err_vq:
+    for (; i >= 0; --i) {
+        t = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
+                                     dev->log_enabled);
+        assert(t >= 0);
+    }
+    t = vhost_dev_set_features(dev, dev->log_enabled);
+    assert(t >= 0);
+err_features:
+    return r;
+}
+
+static int vhost_client_migration_log(struct CPUPhysMemoryClient *client,
+                                      int enable)
+{
+    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
+    int r;
+    if (!!enable == dev->log_enabled) {
+        return 0;
+    }
+    if (!dev->started) {
+        dev->log_enabled = enable;
+        return 0;
+    }
+    if (!enable) {
+        r = vhost_dev_set_log(dev, false);
+        if (r < 0) {
+            return r;
+        }
+        if (dev->log) {
+            qemu_free(dev->log);
+        }
+        dev->log = NULL;
+        dev->log_size = 0;
+    } else {
+        vhost_dev_log_resize(dev, vhost_get_log_size(dev));
+        r = vhost_dev_set_log(dev, true);
+        if (r < 0) {
+            return r;
+        }
+    }
+    dev->log_enabled = enable;
+    return 0;
+}
+
+static int vhost_virtqueue_init(struct vhost_dev *dev,
+                                struct VirtIODevice *vdev,
+                                struct vhost_virtqueue *vq,
+                                unsigned idx)
+{
+    target_phys_addr_t s, l, a;
+    int r;
+    struct vhost_vring_file file = {
+        .index = idx,
+    };
+    struct vhost_vring_state state = {
+        .index = idx,
+    };
+    struct VirtQueue *q = virtio_queue(vdev, idx);
+
+    vq->num = state.num = virtio_queue_get_num(vdev, idx);
+    r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
+    if (r) {
+        return -errno;
+    }
+
+    state.num = virtio_queue_last_avail_idx(vdev, idx);
+    r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
+    if (r) {
+        return -errno;
+    }
+
+    s = l = sizeof(struct vring_desc) * vq->num;
+    a = virtio_queue_get_desc(vdev, idx);
+    vq->desc = cpu_physical_memory_map(a, &l, 0);
+    if (!vq->desc || l != s) {
+        r = -ENOMEM;
+        goto fail_alloc;
+    }
+    s = l = offsetof(struct vring_avail, ring) +
+        sizeof(u_int64_t) * vq->num;
+    a = virtio_queue_get_avail(vdev, idx);
+    vq->avail = cpu_physical_memory_map(a, &l, 0);
+    if (!vq->avail || l != s) {
+        r = -ENOMEM;
+        goto fail_alloc;
+    }
+    s = l = offsetof(struct vring_used, ring) +
+        sizeof(struct vring_used_elem) * vq->num;
+    vq->used_phys = a = virtio_queue_get_used(vdev, idx);
+    vq->used = cpu_physical_memory_map(a, &l, 1);
+    if (!vq->used || l != s) {
+        r = -ENOMEM;
+        goto fail_alloc;
+    }
+
+    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    if (r < 0) {
+        r = -errno;
+        goto fail_alloc;
+    }
+    if (!vdev->binding->guest_notifier || !vdev->binding->host_notifier) {
+        fprintf(stderr, "binding does not support irqfd/queuefd\n");
+        r = -ENOSYS;
+        goto fail_alloc;
+    }
+    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, true);
+    if (r < 0) {
+        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        goto fail_guest_notifier;
+    }
+
+    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, true);
+    if (r < 0) {
+        fprintf(stderr, "Error binding host notifier: %d\n", -r);
+        goto fail_host_notifier;
+    }
+
+    file.fd = event_notifier_get_fd(virtio_queue_host_notifier(q));
+    r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+    if (r) {
+        goto fail_kick;
+    }
+
+    file.fd = event_notifier_get_fd(virtio_queue_guest_notifier(q));
+    r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+    if (r) {
+        goto fail_call;
+    }
+
+    return 0;
+
+fail_call:
+fail_kick:
+    vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
+fail_host_notifier:
+    vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
+fail_guest_notifier:
+fail_alloc:
+    return r;
+}
+
+static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
+                                    struct VirtIODevice *vdev,
+                                    struct vhost_virtqueue *vq,
+                                    unsigned idx)
+{
+    struct vhost_vring_state state = {
+        .index = idx,
+    };
+    int r;
+    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
+    if (r < 0) {
+        fprintf(stderr, "vhost VQ %d guest cleanup failed: %d\n", idx, r);
+        fflush(stderr);
+    }
+    assert (r >= 0);
+
+    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
+    if (r < 0) {
+        fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
+        fflush(stderr);
+    }
+    assert (r >= 0);
+    r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
+    if (r < 0) {
+        fprintf(stderr, "vhost VQ %d ring restore failed: %d\n", idx, r);
+        fflush(stderr);
+    }
+    virtio_queue_set_last_avail_idx(vdev, idx, state.num);
+    assert (r >= 0);
+}
+
+int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+{
+    uint64_t features;
+    int r;
+    if (devfd >= 0) {
+        hdev->control = devfd;
+    } else {
+        hdev->control = open("/dev/vhost-net", O_RDWR);
+        if (hdev->control < 0)
+            return -errno;
+    }
+    r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+    if (r < 0)
+        goto fail;
+
+    r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
+    if (r < 0)
+        goto fail;
+    hdev->features = features;
+
+    hdev->client.set_memory = vhost_client_set_memory;
+    hdev->client.sync_dirty_bitmap = vhost_client_sync_dirty_bitmap;
+    hdev->client.migration_log = vhost_client_migration_log;
+    hdev->mem = qemu_mallocz(offsetof(struct vhost_memory, regions));
+    hdev->log = NULL;
+    hdev->log_size = 0;
+    hdev->log_enabled = false;
+    hdev->started = false;
+    cpu_register_phys_memory_client(&hdev->client);
+    return 0;
+fail:
+    r = -errno;
+    close(hdev->control);
+    return r;
+}
+
+void vhost_dev_cleanup(struct vhost_dev *hdev)
+{
+    cpu_unregister_phys_memory_client(&hdev->client);
+    qemu_free(hdev->mem);
+    close(hdev->control);
+}
+
+int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
+{
+    int i, r;
+
+    r = vhost_dev_set_features(hdev, hdev->log_enabled);
+    if (r < 0)
+        goto fail;
+    r = ioctl(hdev->control, VHOST_SET_MEM_TABLE, hdev->mem);
+    if (r < 0) {
+        r = -errno;
+        goto fail;
+    }
+    if (hdev->log_enabled) {
+        hdev->log_size = vhost_get_log_size(hdev);
+        hdev->log = hdev->log_size ?
+            qemu_mallocz(hdev->log_size * sizeof *hdev->log) : NULL;
+        r = ioctl(hdev->control, VHOST_SET_LOG_BASE,
+                  (uint64_t)(unsigned long)hdev->log);
+        if (r < 0) {
+            r = -errno;
+            goto fail;
+        }
+    }
+
+    for (i = 0; i < hdev->nvqs; ++i) {
+        r = vhost_virtqueue_init(hdev,
+                                 vdev,
+                                 hdev->vqs + i,
+                                 i);
+        if (r < 0)
+            goto fail_vq;
+    }
+    hdev->started = true;
+
+    return 0;
+fail_vq:
+    while (--i >= 0) {
+        vhost_virtqueue_cleanup(hdev,
+                                vdev,
+                                hdev->vqs + i,
+                                i);
+    }
+fail:
+    return r;
+}
+
+void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
+{
+    int i;
+    for (i = 0; i < hdev->nvqs; ++i) {
+        vhost_virtqueue_cleanup(hdev,
+                                vdev,
+                                hdev->vqs + i,
+                                i);
+    }
+    vhost_client_sync_dirty_bitmap(&hdev->client, 0,
+                                   (target_phys_addr_t)~0x0ull);
+    hdev->started = false;
+    qemu_free(hdev->log);
+    hdev->log_size = 0;
+}
diff --git a/hw/vhost.h b/hw/vhost.h
new file mode 100644
index 0000000..8f3e9ce
--- /dev/null
+++ b/hw/vhost.h
@@ -0,0 +1,44 @@
+#ifndef VHOST_H
+#define VHOST_H
+
+#include "hw/hw.h"
+#include "hw/virtio.h"
+
+/* Generic structures common for any vhost based device. */
+struct vhost_virtqueue {
+    int kick;
+    int call;
+    void *desc;
+    void *avail;
+    void *used;
+    int num;
+    unsigned long long used_phys;
+};
+
+typedef unsigned long vhost_log_chunk_t;
+#define VHOST_LOG_PAGE 0x1000
+#define VHOST_LOG_BITS (8 * sizeof(vhost_log_chunk_t))
+#define VHOST_LOG_CHUNK (VHOST_LOG_PAGE * VHOST_LOG_BITS)
+
+struct vhost_memory;
+struct vhost_dev {
+    CPUPhysMemoryClient client;
+    int control;
+    struct vhost_memory *mem;
+    struct vhost_virtqueue *vqs;
+    int nvqs;
+    unsigned long long features;
+    unsigned long long acked_features;
+    unsigned long long backend_features;
+    bool started;
+    bool log_enabled;
+    vhost_log_chunk_t *log;
+    unsigned long long log_size;
+};
+
+int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+void vhost_dev_cleanup(struct vhost_dev *hdev);
+int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
+void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
+
+#endif
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
new file mode 100644
index 0000000..06b7648
--- /dev/null
+++ b/hw/vhost_net.c
@@ -0,0 +1,177 @@
+#include "net.h"
+#include "net/tap.h"
+
+#include "virtio-net.h"
+#include "vhost_net.h"
+
+#include "config.h"
+
+#ifdef CONFIG_VHOST_NET
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+#include <linux/kvm.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/virtio_ring.h>
+#include <netpacket/packet.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+
+#include <stdio.h>
+
+#include "vhost.h"
+
+struct vhost_net {
+    struct vhost_dev dev;
+    struct vhost_virtqueue vqs[2];
+    int backend;
+    VLANClientState *vc;
+};
+
+unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
+{
+    /* Clear features not supported by host kernel. */
+    if (!(net->dev.features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY)))
+        features &= ~(1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    if (!(net->dev.features & (1 << VIRTIO_RING_F_INDIRECT_DESC)))
+        features &= ~(1 << VIRTIO_RING_F_INDIRECT_DESC);
+    if (!(net->dev.features & (1 << VIRTIO_NET_F_MRG_RXBUF)))
+        features &= ~(1 << VIRTIO_NET_F_MRG_RXBUF);
+    return features;
+}
+
+void vhost_net_ack_features(struct vhost_net *net, unsigned features)
+{
+    net->dev.acked_features = net->dev.backend_features;
+    if (features & (1 << VIRTIO_F_NOTIFY_ON_EMPTY))
+        net->dev.acked_features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    if (features & (1 << VIRTIO_RING_F_INDIRECT_DESC))
+        net->dev.acked_features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
+}
+
+static int vhost_net_get_fd(VLANClientState *backend)
+{
+    switch (backend->info->type) {
+    case NET_CLIENT_TYPE_TAP:
+        return tap_get_fd(backend);
+    default:
+        fprintf(stderr, "vhost-net requires tap backend\n");
+        return -EBADFD;
+    }
+}
+
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+{
+    int r;
+    struct vhost_net *net = qemu_malloc(sizeof *net);
+    if (!backend) {
+        fprintf(stderr, "vhost-net requires backend to be setup\n");
+        goto fail;
+    }
+    r = vhost_net_get_fd(backend);
+    if (r < 0)
+        goto fail;
+    net->vc = backend;
+    net->dev.backend_features = tap_has_vnet_hdr(backend) ? 0 :
+        (1 << VHOST_NET_F_VIRTIO_NET_HDR);
+    net->backend = r;
+
+    r = vhost_dev_init(&net->dev, devfd);
+    if (r < 0)
+        goto fail;
+    if (~net->dev.features & net->dev.backend_features) {
+        fprintf(stderr, "vhost lacks feature mask %llu for backend\n",
+                ~net->dev.features & net->dev.backend_features);
+        vhost_dev_cleanup(&net->dev);
+        goto fail;
+    }
+
+    /* Set sane init value. Override when guest acks. */
+    vhost_net_ack_features(net, 0);
+    return net;
+fail:
+    qemu_free(net);
+    return NULL;
+}
+
+int vhost_net_start(struct vhost_net *net,
+                    VirtIODevice *dev)
+{
+    struct vhost_vring_file file = { };
+    int r;
+
+    net->dev.nvqs = 2;
+    net->dev.vqs = net->vqs;
+    r = vhost_dev_start(&net->dev, dev);
+    if (r < 0)
+        return r;
+
+    net->vc->info->poll(net->vc, false);
+    qemu_set_fd_handler(net->backend, NULL, NULL, NULL);
+    file.fd = net->backend;
+    for (file.index = 0; file.index < net->dev.nvqs; ++file.index) {
+        r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND, &file);
+        if (r < 0) {
+            r = -errno;
+            goto fail;
+        }
+    }
+    return 0;
+fail:
+    file.fd = -1;
+    while (--file.index >= 0) {
+        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND, &file);
+        assert(r >= 0);
+    }
+    net->vc->info->poll(net->vc, true);
+    vhost_dev_stop(&net->dev, dev);
+    return r;
+}
+
+void vhost_net_stop(struct vhost_net *net,
+                    VirtIODevice *dev)
+{
+    struct vhost_vring_file file = { .fd = -1 };
+
+    for (file.index = 0; file.index < net->dev.nvqs; ++file.index) {
+        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND, &file);
+        assert(r >= 0);
+    }
+    net->vc->info->poll(net->vc, true);
+    vhost_dev_stop(&net->dev, dev);
+}
+
+void vhost_net_cleanup(struct vhost_net *net)
+{
+    vhost_dev_cleanup(&net->dev);
+    qemu_free(net);
+}
+#else
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+{
+	return NULL;
+}
+
+int vhost_net_start(struct vhost_net *net,
+		    VirtIODevice *dev)
+{
+	return -ENOSYS;
+}
+void vhost_net_stop(struct vhost_net *net,
+		    VirtIODevice *dev)
+{
+}
+
+void vhost_net_cleanup(struct vhost_net *net)
+{
+}
+
+unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
+{
+	return features;
+}
+void vhost_net_ack_features(struct vhost_net *net, unsigned features)
+{
+}
+#endif
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
new file mode 100644
index 0000000..2a10210
--- /dev/null
+++ b/hw/vhost_net.h
@@ -0,0 +1,20 @@
+#ifndef VHOST_NET_H
+#define VHOST_NET_H
+
+#include "net.h"
+
+struct vhost_net;
+
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd);
+
+int vhost_net_start(struct vhost_net *net,
+                    VirtIODevice *dev);
+void vhost_net_stop(struct vhost_net *net,
+                    VirtIODevice *dev);
+
+void vhost_net_cleanup(struct vhost_net *net);
+
+unsigned vhost_net_get_features(struct vhost_net *net, unsigned features);
+void vhost_net_ack_features(struct vhost_net *net, unsigned features);
+
+#endif
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 02/12] kvm: add API to set ioeventfd
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
  2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:19   ` [Qemu-devel] " Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 04/12] virtio: add notifier support Michael S. Tsirkin
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, avi, mtosatti, kraxel, quintela

Comment on kvm usage: rather than require users to do if (kvm_enabled())
and/or ifdefs, this patch adds an API that, internally, is defined to
stub function on non-kvm build, and checks kvm_enabled for non-kvm
run.

While rest of qemu code still uses if (kvm_enabled()), I think this
approach is cleaner, and we should convert rest of code to it
long term.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Avi, Marcelo, pls review/ack.

 kvm-all.c |   22 ++++++++++++++++++++++
 kvm.h     |   16 ++++++++++++++++
 2 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 1a02076..9742791 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1138,3 +1138,25 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
 
     return r;
 }
+
+#ifdef KVM_IOEVENTFD
+int kvm_set_ioeventfd(uint16_t addr, uint16_t data, int fd, bool assigned)
+{
+    struct kvm_ioeventfd kick = {
+        .datamatch = data,
+        .addr = addr,
+        .len = 2,
+        .flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
+        .fd = fd,
+    };
+    int r;
+    if (!kvm_enabled())
+        return -ENOSYS;
+    if (!assigned)
+        kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+    r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
+    if (r < 0)
+        return r;
+    return 0;
+}
+#endif
diff --git a/kvm.h b/kvm.h
index a74dfcb..897efb7 100644
--- a/kvm.h
+++ b/kvm.h
@@ -14,10 +14,16 @@
 #ifndef QEMU_KVM_H
 #define QEMU_KVM_H
 
+#include <stdbool.h>
+#include <errno.h>
 #include "config.h"
 #include "qemu-queue.h"
 
 #ifdef CONFIG_KVM
+#include <linux/kvm.h>
+#endif
+
+#ifdef CONFIG_KVM
 extern int kvm_allowed;
 
 #define kvm_enabled() (kvm_allowed)
@@ -135,4 +141,14 @@ static inline void cpu_synchronize_state(CPUState *env)
     }
 }
 
+#if defined(KVM_IOEVENTFD) && defined(CONFIG_KVM)
+int kvm_set_ioeventfd(uint16_t addr, uint16_t data, int fd, bool assigned);
+#else
+static inline
+int kvm_set_ioeventfd(uint16_t data, uint16_t addr, int fd, bool assigned)
+{
+    return -ENOSYS;
+}
+#endif
+
 #endif
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 04/12] virtio: add notifier support
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 02/12] kvm: add API to set ioeventfd Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 01/12] tap: add interface to get device fd Michael S. Tsirkin
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

Add binding API to set host/guest notifiers.
Will be used by vhost.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio.c |    5 ++++-
 hw/virtio.h |    3 +++
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 7c020a3..1f5e7be 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -73,6 +73,7 @@ struct VirtQueue
     int inuse;
     uint16_t vector;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
+    VirtIODevice *vdev;
 };
 
 /* virt queue functions */
@@ -714,8 +715,10 @@ VirtIODevice *virtio_common_init(const char *name, uint16_t device_id,
     vdev->queue_sel = 0;
     vdev->config_vector = VIRTIO_NO_VECTOR;
     vdev->vq = qemu_mallocz(sizeof(VirtQueue) * VIRTIO_PCI_QUEUE_MAX);
-    for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++)
+    for(i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
         vdev->vq[i].vector = VIRTIO_NO_VECTOR;
+        vdev->vq[i].vdev = vdev;
+    }
 
     vdev->name = name;
     vdev->config_len = config_size;
diff --git a/hw/virtio.h b/hw/virtio.h
index 3baa2a3..af87889 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -19,6 +19,7 @@
 #include "qdev.h"
 #include "sysemu.h"
 #include "block_int.h"
+#include "notifier.h"
 
 /* from Linux's linux/virtio_config.h */
 
@@ -89,6 +90,8 @@ typedef struct {
     int (*load_config)(void * opaque, QEMUFile *f);
     int (*load_queue)(void * opaque, int n, QEMUFile *f);
     unsigned (*get_features)(void * opaque);
+    int (*guest_notifier)(void * opaque, int n, bool assigned);
+    int (*host_notifier)(void * opaque, int n, bool assigned);
 } VirtIOBindings;
 
 #define VIRTIO_PCI_QUEUE_MAX 64
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 01/12] tap: add interface to get device fd
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (3 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 04/12] virtio: add notifier support Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 07/12] virtio: move typedef to qemu-common Michael S. Tsirkin
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

Will be used by vhost to attach/detach to backend.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/tap.c |    7 +++++++
 net/tap.h |    2 ++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 7a7320c..fc59fd4 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -269,6 +269,13 @@ static void tap_poll(VLANClientState *nc, bool enable)
     tap_write_poll(s, enable);
 }
 
+int tap_get_fd(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    assert(nc->info->type == NET_CLIENT_TYPE_TAP);
+    return s->fd;
+}
+
 /* fd support */
 
 static NetClientInfo net_tap_info = {
diff --git a/net/tap.h b/net/tap.h
index 538a562..a244b28 100644
--- a/net/tap.h
+++ b/net/tap.h
@@ -48,4 +48,6 @@ int tap_probe_vnet_hdr(int fd);
 int tap_probe_has_ufo(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 
+int tap_get_fd(VLANClientState *vc);
+
 #endif /* QEMU_NET_TAP_H */
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 07/12] virtio: move typedef to qemu-common
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (4 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 01/12] tap: add interface to get device fd Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 10/12] tap: add vhost/vhostfd options Michael S. Tsirkin
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

make it possible to use type without header include

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio.h   |    1 -
 qemu-common.h |    1 +
 2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.h b/hw/virtio.h
index e12e8e3..b4cd877 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -69,7 +69,6 @@ static inline target_phys_addr_t vring_align(target_phys_addr_t addr,
 }
 
 typedef struct VirtQueue VirtQueue;
-typedef struct VirtIODevice VirtIODevice;
 
 #define VIRTQUEUE_MAX_SIZE 1024
 
diff --git a/qemu-common.h b/qemu-common.h
index f12a8f5..90ca3b8 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -228,6 +228,7 @@ typedef struct I2SCodec I2SCodec;
 typedef struct DeviceState DeviceState;
 typedef struct SSIBus SSIBus;
 typedef struct EventNotifier EventNotifier;
+typedef struct VirtIODevice VirtIODevice;
 
 typedef uint64_t pcibus_t;
 
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (5 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 07/12] virtio: move typedef to qemu-common Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:47   ` [Qemu-devel] " Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 11/12] tap: add API to retrieve vhost net header Michael S. Tsirkin
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

This adds vhost binary option to tap, to enable vhost net accelerator.
Default is off for now, we'll be able to make default on long term
when we know it's stable.

vhostfd option can be used by management, to pass in the fd. Assigning
vhostfd implies vhost=on.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net.c           |    8 ++++++++
 net/tap.c       |   33 +++++++++++++++++++++++++++++++++
 qemu-options.hx |    4 +++-
 3 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/net.c b/net.c
index a1bf49f..d1e23f1 100644
--- a/net.c
+++ b/net.c
@@ -973,6 +973,14 @@ static const struct {
                 .name = "vnet_hdr",
                 .type = QEMU_OPT_BOOL,
                 .help = "enable the IFF_VNET_HDR flag on the tap interface"
+            }, {
+                .name = "vhost",
+                .type = QEMU_OPT_BOOL,
+                .help = "enable vhost-net network accelerator",
+            }, {
+                .name = "vhostfd",
+                .type = QEMU_OPT_STRING,
+                .help = "file descriptor of an already opened vhost net device",
             },
 #endif /* _WIN32 */
             { /* end of list */ }
diff --git a/net/tap.c b/net/tap.c
index fc59fd4..65797ef 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -41,6 +41,8 @@
 
 #include "net/tap-linux.h"
 
+#include "hw/vhost_net.h"
+
 /* Maximum GSO packet size (64k) plus plenty of room for
  * the ethernet and virtio_net headers
  */
@@ -57,6 +59,7 @@ typedef struct TAPState {
     unsigned int has_vnet_hdr : 1;
     unsigned int using_vnet_hdr : 1;
     unsigned int has_ufo: 1;
+    struct vhost_net *vhost_net;
 } TAPState;
 
 static int launch_script(const char *setup_script, const char *ifname, int fd);
@@ -252,6 +255,10 @@ static void tap_cleanup(VLANClientState *nc)
 {
     TAPState *s = DO_UPCAST(TAPState, nc, nc);
 
+    if (s->vhost_net) {
+        vhost_net_cleanup(s->vhost_net);
+    }
+
     qemu_purge_queued_packets(nc);
 
     if (s->down_script[0])
@@ -307,6 +314,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->has_ufo = tap_probe_has_ufo(s->fd);
     tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
     tap_read_poll(s, 1);
+    s->vhost_net = NULL;
     return s;
 }
 
@@ -456,5 +464,30 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan
         }
     }
 
+    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+        int vhostfd, r;
+        if (qemu_opt_get(opts, "vhostfd")) {
+            r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
+            if (r == -1) {
+                return -1;
+            }
+            vhostfd = r;
+        } else {
+            vhostfd = -1;
+        }
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+        if (!s->vhost_net) {
+            qemu_error("vhost-net requested but could not be initialized\n");
+            return -1;
+        }
+    } else if (qemu_opt_get(opts, "vhostfd")) {
+        qemu_error("vhostfd= is not valid without vhost\n");
+        return -1;
+    }
+
+    if (vlan) {
+        vlan->nb_host_devs++;
+    }
+
     return 0;
 }
diff --git a/qemu-options.hx b/qemu-options.hx
index f53922f..1850906 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -879,7 +879,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "-net tap[,vlan=n][,name=str],ifname=name\n"
     "                connect the host TAP network interface to VLAN 'n'\n"
 #else
-    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off]\n"
+    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off][,vhost=on|off][,vhostfd=h]\n"
     "                connect the host TAP network interface to VLAN 'n' and use the\n"
     "                network scripts 'file' (default=" DEFAULT_NETWORK_SCRIPT ")\n"
     "                and 'dfile' (default=" DEFAULT_NETWORK_DOWN_SCRIPT ")\n"
@@ -889,6 +889,8 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "                default of 'sndbuf=1048576' can be disabled using 'sndbuf=0')\n"
     "                use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "                use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
+    "                use vhost=on to enable experimental in kernel accelerator\n"
+    "                use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
     "                connect the vlan 'n' to another VLAN using a socket connection\n"
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 11/12] tap: add API to retrieve vhost net header
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (6 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 10/12] tap: add vhost/vhostfd options Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 06/12] virtio: add set_status callback Michael S. Tsirkin
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

will be used by virtio-net for vhost net support

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/tap.c |    7 +++++++
 net/tap.h |    3 +++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 65797ef..9bb11fc 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -491,3 +491,10 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan
 
     return 0;
 }
+
+struct vhost_net *tap_get_vhost_net(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    assert(nc->info->type == NET_CLIENT_TYPE_TAP);
+    return s->vhost_net;
+}
diff --git a/net/tap.h b/net/tap.h
index a244b28..b8cec83 100644
--- a/net/tap.h
+++ b/net/tap.h
@@ -50,4 +50,7 @@ void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 
 int tap_get_fd(VLANClientState *vc);
 
+struct vhost_net;
+struct vhost_net *tap_get_vhost_net(VLANClientState *vc);
+
 #endif /* QEMU_NET_TAP_H */
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 06/12] virtio: add set_status callback
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (7 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 11/12] tap: add API to retrieve vhost net header Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 08/12] virtio-pci: fill in notifier support Michael S. Tsirkin
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

vhost net backend needs to be notified when
frontend status changes. Add a callback,
similar to set_features.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/s390-virtio-bus.c |    7 ++++++-
 hw/syborg_virtio.c   |    2 ++
 hw/virtio-pci.c      |    9 ++++++++-
 hw/virtio.h          |    1 +
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
index fa0a74f..d7e3ae1 100644
--- a/hw/s390-virtio-bus.c
+++ b/hw/s390-virtio-bus.c
@@ -241,8 +241,13 @@ void s390_virtio_device_update_status(VirtIOS390Device *dev)
 {
     VirtIODevice *vdev = dev->vdev;
     uint32_t features;
+    uint8_t status;
 
-    vdev->status = ldub_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS);
+    status = ldub_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS);
+    if (vdev->set_status) {
+        vdev->set_status(vdev, status);
+    }
+    vdev->status = status;
 
     /* Update guest supported feature bitmap */
 
diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
index 65239a0..7a9e584 100644
--- a/hw/syborg_virtio.c
+++ b/hw/syborg_virtio.c
@@ -149,6 +149,8 @@ static void syborg_virtio_writel(void *opaque, target_phys_addr_t offset,
         virtio_queue_notify(vdev, value);
         break;
     case SYBORG_VIRTIO_STATUS:
+        if (vdev->set_status)
+            vdev->set_status(vdev, value & 0xFF);
         vdev->status = value & 0xFF;
         if (vdev->status == 0)
             virtio_reset(vdev);
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index bcd40f7..006ff38 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -206,6 +206,9 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         virtio_queue_notify(vdev, val);
         break;
     case VIRTIO_PCI_STATUS:
+        if (vdev->set_status) {
+            vdev->set_status(vdev, val & 0xFF);
+        }
         vdev->status = val & 0xFF;
         if (vdev->status == 0) {
             virtio_reset(proxy->vdev);
@@ -377,7 +380,11 @@ static void virtio_write_config(PCIDevice *pci_dev, uint32_t address,
 
     if (PCI_COMMAND == address) {
         if (!(val & PCI_COMMAND_MASTER)) {
-            proxy->vdev->status &= ~VIRTIO_CONFIG_S_DRIVER_OK;
+            uint8_t status = proxy->vdev->status & ~VIRTIO_CONFIG_S_DRIVER_OK;
+            if (proxy->vdev->set_status) {
+                proxy->vdev->set_status(proxy->vdev, status);
+            }
+            proxy->vdev->status = status;
         }
     }
 
diff --git a/hw/virtio.h b/hw/virtio.h
index 2ebf2dd..e12e8e3 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -115,6 +115,7 @@ struct VirtIODevice
     void (*get_config)(VirtIODevice *vdev, uint8_t *config);
     void (*set_config)(VirtIODevice *vdev, const uint8_t *config);
     void (*reset)(VirtIODevice *vdev);
+    void (*set_status)(VirtIODevice *vdev, uint8_t val);
     VirtQueue *vq;
     const VirtIOBindings *binding;
     void *binding_opaque;
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 08/12] virtio-pci: fill in notifier support
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (8 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 06/12] virtio: add set_status callback Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:30   ` [Qemu-devel] " Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 03/12] notifier: event notifier implementation Michael S. Tsirkin
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

Support host/guest notifiers in virtio-pci.
The last one only with kvm, that's okay
because vhost relies on kvm anyway.

Note on kvm usage: kvm ioeventfd API
is implemented on non-kvm systems as well,
this is the reason we don't need if (kvm_enabled())
around it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio-pci.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 006ff38..3f1214c 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -24,6 +24,7 @@
 #include "net.h"
 #include "block_int.h"
 #include "loader.h"
+#include "kvm.h"
 
 /* from Linux's linux/virtio_pci.h */
 
@@ -398,6 +399,65 @@ static unsigned virtio_pci_get_features(void *opaque)
     return proxy->host_features;
 }
 
+static void virtio_pci_guest_notifier_read(void *opaque)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *n = virtio_queue_guest_notifier(vq);
+    if (event_notifier_test_and_clear(n)) {
+        virtio_irq(vq);
+    }
+}
+
+static int virtio_pci_guest_notifier(void *opaque, int n, bool assign)
+{
+    VirtIOPCIProxy *proxy = opaque;
+    VirtQueue *vq = virtio_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_guest_notifier(vq);
+
+    if (assign) {
+        int r = event_notifier_init(notifier, 0);
+	if (r < 0)
+		return r;
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_guest_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+        event_notifier_cleanup(notifier);
+    }
+
+    return 0;
+}
+
+static int virtio_pci_host_notifier(void *opaque, int n, bool assign)
+{
+    VirtIOPCIProxy *proxy = opaque;
+    VirtQueue *vq = virtio_queue(proxy->vdev, n);
+    EventNotifier *notifier = virtio_queue_host_notifier(vq);
+    int r;
+    if (assign) {
+        r = event_notifier_init(notifier, 1);
+        if (r < 0) {
+            return r;
+        }
+        r = kvm_set_ioeventfd(proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                              n, event_notifier_get_fd(notifier),
+                              assign);
+        if (r < 0) {
+            event_notifier_cleanup(notifier);
+        }
+    } else {
+        r = kvm_set_ioeventfd(proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
+                              n, event_notifier_get_fd(notifier),
+                              assign);
+        if (r < 0) {
+            return r;
+        }
+        event_notifier_cleanup(notifier);
+    }
+    return r;
+}
+
 static const VirtIOBindings virtio_pci_bindings = {
     .notify = virtio_pci_notify,
     .save_config = virtio_pci_save_config,
@@ -405,6 +465,8 @@ static const VirtIOBindings virtio_pci_bindings = {
     .save_queue = virtio_pci_save_queue,
     .load_queue = virtio_pci_load_queue,
     .get_features = virtio_pci_get_features,
+    .host_notifier = virtio_pci_host_notifier,
+    .guest_notifier = virtio_pci_guest_notifier,
 };
 
 static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 03/12] notifier: event notifier implementation
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (9 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 08/12] virtio-pci: fill in notifier support Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:22   ` [Qemu-devel] " Anthony Liguori
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 12/12] virtio-net: vhost net support Michael S. Tsirkin
  2010-02-25 19:49 ` [Qemu-devel] Re: [PATCHv2 00/12] vhost-net: upstream integration Anthony Liguori
  12 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

event notifiers are slightly generalized eventfd descriptors. Current
implementation depends on eventfd because vhost is the only user, and
vhost depends on eventfd anyway, but a stub is provided for non-eventfd
case.

We'll be able to further generalize this when another user comes along
and we see how to best do this.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 Makefile.target |    1 +
 hw/notifier.c   |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/notifier.h   |   16 ++++++++++++++++
 qemu-common.h   |    1 +
 4 files changed, 68 insertions(+), 0 deletions(-)
 create mode 100644 hw/notifier.c
 create mode 100644 hw/notifier.h

diff --git a/Makefile.target b/Makefile.target
index 4c4d397..c1580e9 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -173,6 +173,7 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
+obj-y += notifier.o
 obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
diff --git a/hw/notifier.c b/hw/notifier.c
new file mode 100644
index 0000000..dff38de
--- /dev/null
+++ b/hw/notifier.c
@@ -0,0 +1,50 @@
+#include "hw.h"
+#include "notifier.h"
+#ifdef CONFIG_EVENTFD
+#include <sys/eventfd.h>
+#endif
+
+int event_notifier_init(EventNotifier *e, int active)
+{
+#ifdef CONFIG_EVENTFD
+	int fd = eventfd(!!active, EFD_NONBLOCK | EFD_CLOEXEC);
+	if (fd < 0)
+		return -errno;
+	e->fd = fd;
+	return 0;
+#else
+	return -ENOSYS;
+#endif
+}
+
+void event_notifier_cleanup(EventNotifier *e)
+{
+	close(e->fd);
+}
+
+int event_notifier_get_fd(EventNotifier *e)
+{
+	return e->fd;
+}
+
+int event_notifier_test_and_clear(EventNotifier *e)
+{
+	uint64_t value;
+	int r = read(e->fd, &value, sizeof value);
+	return r == sizeof value;
+}
+
+int event_notifier_test(EventNotifier *e)
+{
+	uint64_t value;
+	int r = read(e->fd, &value, sizeof value);
+	if (r == sizeof value) {
+		/* restore previous value. */
+		int s = write(e->fd, &value, sizeof value);
+		/* never blocks because we use EFD_SEMAPHORE.
+		 * If we didn't we'd get EAGAIN on overflow
+		 * and we'd have to write code to ignore it. */
+		assert(s == sizeof value);
+	}
+	return r == sizeof value;
+}
diff --git a/hw/notifier.h b/hw/notifier.h
new file mode 100644
index 0000000..24117ea
--- /dev/null
+++ b/hw/notifier.h
@@ -0,0 +1,16 @@
+#ifndef QEMU_EVENT_NOTIFIER_H
+#define QEMU_EVENT_NOTIFIER_H
+
+#include "qemu-common.h"
+
+struct EventNotifier {
+	int fd;
+};
+
+int event_notifier_init(EventNotifier *, int active);
+void event_notifier_cleanup(EventNotifier *);
+int event_notifier_get_fd(EventNotifier *);
+int event_notifier_test_and_clear(EventNotifier *);
+int event_notifier_test(EventNotifier *);
+
+#endif
diff --git a/qemu-common.h b/qemu-common.h
index 805be1a..f12a8f5 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -227,6 +227,7 @@ typedef struct uWireSlave uWireSlave;
 typedef struct I2SCodec I2SCodec;
 typedef struct DeviceState DeviceState;
 typedef struct SSIBus SSIBus;
+typedef struct EventNotifier EventNotifier;
 
 typedef uint64_t pcibus_t;
 
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [Qemu-devel] [PATCHv2 12/12] virtio-net: vhost net support
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (10 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 03/12] notifier: event notifier implementation Michael S. Tsirkin
@ 2010-02-25 18:28 ` Michael S. Tsirkin
  2010-02-25 19:49 ` [Qemu-devel] Re: [PATCHv2 00/12] vhost-net: upstream integration Anthony Liguori
  12 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-25 18:28 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel; +Cc: amit.shah, kraxel, quintela

This connects virtio-net to vhost net backend.
The code is structured in a way analogous to what we have with vnet
header capability in tap.

We start/stop backend on driver start/stop as
well as on save and vm start (for migration).

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/virtio-net.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..9ddd58c 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -17,6 +17,7 @@
 #include "net/tap.h"
 #include "qemu-timer.h"
 #include "virtio-net.h"
+#include "vhost_net.h"
 
 #define VIRTIO_NET_VM_VERSION    11
 
@@ -47,6 +48,8 @@ typedef struct VirtIONet
     uint8_t nomulti;
     uint8_t nouni;
     uint8_t nobcast;
+    uint8_t vhost_started;
+    VMChangeStateEntry *vmstate;
     struct {
         int in_use;
         int first_multi;
@@ -114,6 +117,10 @@ static void virtio_net_reset(VirtIODevice *vdev)
     n->nomulti = 0;
     n->nouni = 0;
     n->nobcast = 0;
+    if (n->vhost_started) {
+        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), vdev);
+        n->vhost_started = 0;
+    }
 
     /* Flush any MAC and VLAN filter table state */
     n->mac_table.in_use = 0;
@@ -172,7 +179,14 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
         features &= ~(0x1 << VIRTIO_NET_F_HOST_UFO);
     }
 
-    return features;
+    if (!n->nic->nc.peer ||
+        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+        return features;
+    }
+    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+        return features;
+    }
+    return vhost_net_get_features(tap_get_vhost_net(n->nic->nc.peer), features);
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
@@ -698,6 +712,12 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
 {
     VirtIONet *n = opaque;
 
+    if (n->vhost_started) {
+        /* TODO: should we really stop the backend?
+         * If we don't, it might keep writing to memory. */
+        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
+        n->vhost_started = 0;
+    }
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
@@ -810,7 +830,6 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         qemu_mod_timer(n->tx_timer,
                        qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
     }
-
     return 0;
 }
 
@@ -830,6 +849,47 @@ static NetClientInfo net_virtio_info = {
     .link_status_changed = virtio_net_set_link_status,
 };
 
+static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
+{
+    VirtIONet *n = to_virtio_net(vdev);
+    if (!n->nic->nc.peer) {
+        return;
+    }
+    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+        return;
+    }
+
+    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+        return;
+    }
+    if (!!n->vhost_started == !!(status & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        return;
+    }
+    if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+        int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), vdev);
+        if (r < 0) {
+            fprintf(stderr, "unable to start vhost net: %d: "
+                    "falling back on userspace virtio\n", -r);
+        } else {
+            n->vhost_started = 1;
+        }
+    } else {
+        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), vdev);
+        n->vhost_started = 0;
+    }
+}
+
+static void virtio_net_vmstate_change(void *opaque, int running, int reason)
+{
+    VirtIONet *n = opaque;
+    if (!running) {
+        return;
+    }
+    /* This is called when vm is started, it will start vhost backend if
+     * appropriate e.g. after migration. */
+    virtio_net_set_status(&n->vdev, n->vdev.status);
+}
+
 VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
 {
     VirtIONet *n;
@@ -845,6 +905,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
     n->vdev.set_features = virtio_net_set_features;
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
+    n->vdev.set_status = virtio_net_set_status;
     n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
     n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
     n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
@@ -867,6 +928,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
 
     register_savevm("virtio-net", virtio_net_id++, VIRTIO_NET_VM_VERSION,
                     virtio_net_save, virtio_net_load, n);
+    n->vmstate = qemu_add_vm_change_state_handler(virtio_net_vmstate_change, n);
 
     return &n->vdev;
 }
@@ -874,6 +936,11 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
 void virtio_net_exit(VirtIODevice *vdev)
 {
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
+    qemu_del_vm_change_state_handler(n->vmstate);
+
+    if (n->vhost_started) {
+        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), vdev);
+    }
 
     qemu_purge_queued_packets(&n->nic->nc);
 
-- 
1.7.0.18.g0d53a5

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields
  2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
@ 2010-02-25 18:49   ` Blue Swirl
  2010-02-26 14:53     ` Michael S. Tsirkin
  2010-02-25 19:25   ` [Qemu-devel] " Anthony Liguori
  1 sibling, 1 reply; 70+ messages in thread
From: Blue Swirl @ 2010-02-25 18:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 2/25/10, Michael S. Tsirkin <mst@redhat.com> wrote:
> vhost needs physical addresses for ring and other queue fields,
>  so add APIs for these.
>
>  Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>  ---
>   hw/virtio.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   hw/virtio.h |   10 +++++++++-
>   2 files changed, 59 insertions(+), 1 deletions(-)
>
>  diff --git a/hw/virtio.c b/hw/virtio.c
>  index 1f5e7be..b017d7b 100644
>  --- a/hw/virtio.c
>  +++ b/hw/virtio.c
>  @@ -74,6 +74,11 @@ struct VirtQueue
>      uint16_t vector;
>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>      VirtIODevice *vdev;
>  +<<<<<<< HEAD
>  +=======
>  +    EventNotifier guest_notifier;
>  +    EventNotifier host_notifier;
>  +>>>>>>> 8afa4fd... virtio: add APIs for queue fields

Bug.

>   };
>
>   /* virt queue functions */
>  @@ -593,6 +598,12 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>      return &vdev->vq[i];
>   }
>
>  +void virtio_irq(VirtQueue *vq)
>  +{
>  +    vq->vdev->isr |= 0x01;
>  +    virtio_notify_vector(vq->vdev, vq->vector);
>  +}
>  +
>   void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
>   {
>      /* Always notify when queue is empty (when feature acknowledge) */
>  @@ -736,3 +747,42 @@ void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
>      vdev->binding = binding;
>      vdev->binding_opaque = opaque;
>   }
>  +
>  +target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n)
>  +{
>  +       return vdev->vq[n].vring.desc;
>  +}
>  +
>  +target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n)
>  +{
>  +       return vdev->vq[n].vring.avail;
>  +}
>  +
>  +target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n)
>  +{
>  +       return vdev->vq[n].vring.used;
>  +}
>  +
>  +uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n)
>  +{
>  +       return vdev->vq[n].last_avail_idx;
>  +}
>  +
>  +void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx)
>  +{
>  +       vdev->vq[n].last_avail_idx = idx;
>  +}
>  +
>  +VirtQueue *virtio_queue(VirtIODevice *vdev, int n)
>  +{
>  +       return vdev->vq + n;
>  +}
>  +
>  +EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq)
>  +{
>  +       return &vq->guest_notifier;
>  +}
>  +EventNotifier *virtio_queue_host_notifier(VirtQueue *vq)
>  +{
>  +       return &vq->host_notifier;
>  +}
>  diff --git a/hw/virtio.h b/hw/virtio.h
>  index af87889..2ebf2dd 100644
>  --- a/hw/virtio.h
>  +++ b/hw/virtio.h
>  @@ -184,5 +184,13 @@ void virtio_net_exit(VirtIODevice *vdev);
>         DEFINE_PROP_BIT("indirect_desc", _state, _field, \
>                         VIRTIO_RING_F_INDIRECT_DESC, true)
>
>  -
>  +target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n);
>  +target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n);
>  +target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n);
>  +uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n);
>  +void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
>  +VirtQueue *virtio_queue(VirtIODevice *vdev, int n);
>  +EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq);
>  +EventNotifier *virtio_queue_host_notifier(VirtQueue *vq);
>  +void virtio_irq(VirtQueue *vq);
>   #endif
>
> --
>  1.7.0.18.g0d53a5
>
>
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support Michael S. Tsirkin
@ 2010-02-25 19:04   ` Juan Quintela
  2010-02-26 14:32     ` Michael S. Tsirkin
  2010-02-25 19:44   ` Anthony Liguori
  1 sibling, 1 reply; 70+ messages in thread
From: Juan Quintela @ 2010-02-25 19:04 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, qemu-devel, kraxel

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> This adds vhost net device support in qemu. Will be tied to tap device
> and virtio by following patches.  Raw backend is currently missing,
> will be worked on/submitted separately.
>

+obj-y += vhost_net.o
+obj-$(CONFIG_VHOST_NET) += vhost.o

hy is vhost_net.o configured unconditionally?

> --- a/configure
> +++ b/configure
> @@ -1498,6 +1498,23 @@ EOF
>  fi
>
This misses vhost_net var definition at the start of the file and
--enable-vhost/--disable-vhost options.


>  ##########################################
> +# test for vhost net
> +
> +if test "$kvm" != "no"; then
> +	cat > $TMPC <<EOF
> +#include <linux/vhost.h>
> +int main(void) { return 0; }
> +EOF
> +	if compile_prog "$kvm_cflags" "" ; then
> +	vhost_net=yes
> +	else
> +	vhost_net=no
> +	fi

Indent please.

> +else
> +	vhost_net=no
> +fi
> +
> +##########################################
>  # pthread probe
>  PTHREADLIBS_LIST="-lpthread -lpthreadGC2"
>  
> @@ -1968,6 +1985,7 @@ echo "fdt support       $fdt"
>  echo "preadv support    $preadv"
>  echo "fdatasync         $fdatasync"
>  echo "uuid support      $uuid"
> +echo "vhost-net support $vhost_net"

Otherwise this couldo not be there.

>  if test $sdl_too_old = "yes"; then
>  echo "-> Your SDL version is too old - please upgrade to have SDL support"
> @@ -2492,6 +2510,9 @@ case "$target_arch2" in
>        if test "$kvm_para" = "yes"; then
>          echo "CONFIG_KVM_PARA=y" >> $config_target_mak
>        fi
> +      if test $vhost_net = "yes" ; then
> +        echo "CONFIG_VHOST_NET=y" >> $config_target_mak
> +      fi
>      fi
>  esac
>  echo "TARGET_PHYS_ADDR_BITS=$target_phys_bits" >> $config_target_mak

> +    for (;from < to; ++from) {
> +        vhost_log_chunk_t log;

.....

> +                ffsll(log) : ffs(log))) {

  if you defines vhost_log_chuck_t, you also define vhost_log_ffs() and
  you are done without this if.

Later, Juan.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 02/12] kvm: add API to set ioeventfd
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 02/12] kvm: add API to set ioeventfd Michael S. Tsirkin
@ 2010-02-25 19:19   ` Anthony Liguori
  2010-03-02 17:41     ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: quintela, mtosatti, qemu-devel, avi, amit.shah, kraxel

On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
> Comment on kvm usage: rather than require users to do if (kvm_enabled())
> and/or ifdefs, this patch adds an API that, internally, is defined to
> stub function on non-kvm build, and checks kvm_enabled for non-kvm
> run.
>
> While rest of qemu code still uses if (kvm_enabled()), I think this
> approach is cleaner, and we should convert rest of code to it
> long term.
>    

I'm not opposed to that.

> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>
> Avi, Marcelo, pls review/ack.
>
>   kvm-all.c |   22 ++++++++++++++++++++++
>   kvm.h     |   16 ++++++++++++++++
>   2 files changed, 38 insertions(+), 0 deletions(-)
>
> diff --git a/kvm-all.c b/kvm-all.c
> index 1a02076..9742791 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -1138,3 +1138,25 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
>
>       return r;
>   }
> +
> +#ifdef KVM_IOEVENTFD
> +int kvm_set_ioeventfd(uint16_t addr, uint16_t data, int fd, bool assigned)
>    

I think this API could use some love.  You're using a very limited set 
of things that ioeventfd can do and you're multiplexing creation and 
destruction in a single call.

I think:

kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t data);
kvm_unset_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t data);

Would be better.  Alternatively, an API that matched ioeventfd exactly:

kvm_set_ioeventfd(int fd, uint64_t addr, int size, uint64_t data, int 
flags);
kvm_unset_ioeventfd(...);

Could work too.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 03/12] notifier: event notifier implementation
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 03/12] notifier: event notifier implementation Michael S. Tsirkin
@ 2010-02-25 19:22   ` Anthony Liguori
  2010-02-28 19:59     ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
> event notifiers are slightly generalized eventfd descriptors. Current
> implementation depends on eventfd because vhost is the only user, and
> vhost depends on eventfd anyway, but a stub is provided for non-eventfd
> case.
>
> We'll be able to further generalize this when another user comes along
> and we see how to best do this.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   Makefile.target |    1 +
>   hw/notifier.c   |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   hw/notifier.h   |   16 ++++++++++++++++
>   qemu-common.h   |    1 +
>   4 files changed, 68 insertions(+), 0 deletions(-)
>   create mode 100644 hw/notifier.c
>   create mode 100644 hw/notifier.h
>
> diff --git a/Makefile.target b/Makefile.target
> index 4c4d397..c1580e9 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -173,6 +173,7 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
>   # virtio has to be here due to weird dependency between PCI and virtio-net.
>   # need to fix this properly
>   obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
> +obj-y += notifier.o
>   obj-y += rwhandler.o
>   obj-$(CONFIG_KVM) += kvm.o kvm-all.o
>   obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
> diff --git a/hw/notifier.c b/hw/notifier.c
> new file mode 100644
> index 0000000..dff38de
> --- /dev/null
> +++ b/hw/notifier.c
> @@ -0,0 +1,50 @@
> +#include "hw.h"
> +#include "notifier.h"
> +#ifdef CONFIG_EVENTFD
> +#include<sys/eventfd.h>
> +#endif
> +
> +int event_notifier_init(EventNotifier *e, int active)
> +{
> +#ifdef CONFIG_EVENTFD
> +	int fd = eventfd(!!active, EFD_NONBLOCK | EFD_CLOEXEC);
> +	if (fd<  0)
> +		return -errno;
> +	e->fd = fd;
> +	return 0;
> +#else
> +	return -ENOSYS;
> +#endif
> +}
> +
> +void event_notifier_cleanup(EventNotifier *e)
> +{
> +	close(e->fd);
> +}
> +
> +int event_notifier_get_fd(EventNotifier *e)
> +{
> +	return e->fd;
> +}
> +
> +int event_notifier_test_and_clear(EventNotifier *e)
> +{
> +	uint64_t value;
> +	int r = read(e->fd,&value, sizeof value);
> +	return r == sizeof value;
>    

Probably should handle EINTR, no?

> +}
> +
> +int event_notifier_test(EventNotifier *e)
> +{
> +	uint64_t value;
> +	int r = read(e->fd,&value, sizeof value);
>    

Coding Style is not quite explicit here but we always use sizeof(value).

> +	if (r == sizeof value) {
> +		/* restore previous value. */
> +		int s = write(e->fd,&value, sizeof value);
> +		/* never blocks because we use EFD_SEMAPHORE.
> +		 * If we didn't we'd get EAGAIN on overflow
> +		 * and we'd have to write code to ignore it. */
> +		assert(s == sizeof value);
> +	}
> +	return r == sizeof value;
> +}
> diff --git a/hw/notifier.h b/hw/notifier.h
> new file mode 100644
> index 0000000..24117ea
> --- /dev/null
> +++ b/hw/notifier.h
> @@ -0,0 +1,16 @@
>    

Needs copyright/license.

Thanks for doing this abstraction, I'm really happy with it over direct 
eventfd usage.

Regards,

Anthony Liguori

> +#ifndef QEMU_EVENT_NOTIFIER_H
> +#define QEMU_EVENT_NOTIFIER_H
> +
> +#include "qemu-common.h"
> +
> +struct EventNotifier {
> +	int fd;
> +};
> +
> +int event_notifier_init(EventNotifier *, int active);
> +void event_notifier_cleanup(EventNotifier *);
> +int event_notifier_get_fd(EventNotifier *);
> +int event_notifier_test_and_clear(EventNotifier *);
> +int event_notifier_test(EventNotifier *);
> +
> +#endif
> diff --git a/qemu-common.h b/qemu-common.h
> index 805be1a..f12a8f5 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -227,6 +227,7 @@ typedef struct uWireSlave uWireSlave;
>   typedef struct I2SCodec I2SCodec;
>   typedef struct DeviceState DeviceState;
>   typedef struct SSIBus SSIBus;
> +typedef struct EventNotifier EventNotifier;
>
>   typedef uint64_t pcibus_t;
>
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 05/12] virtio: add APIs for queue fields
  2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
  2010-02-25 18:49   ` Blue Swirl
@ 2010-02-25 19:25   ` Anthony Liguori
  2010-02-26  8:46     ` Gleb Natapov
  1 sibling, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:27 PM, Michael S. Tsirkin wrote:
> vhost needs physical addresses for ring and other queue fields,
> so add APIs for these.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   hw/virtio.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   hw/virtio.h |   10 +++++++++-
>   2 files changed, 59 insertions(+), 1 deletions(-)
>
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 1f5e7be..b017d7b 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -74,6 +74,11 @@ struct VirtQueue
>       uint16_t vector;
>       void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>       VirtIODevice *vdev;
> +<<<<<<<  HEAD
> +=======
> +    EventNotifier guest_notifier;
> +    EventNotifier host_notifier;
> +>>>>>>>  8afa4fd... virtio: add APIs for queue fields
>    

That's clearly not right :-)

>   };
>
>   /* virt queue functions */
> @@ -593,6 +598,12 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
>       return&vdev->vq[i];
>   }
>
> +void virtio_irq(VirtQueue *vq)
> +{
> +    vq->vdev->isr |= 0x01;
> +    virtio_notify_vector(vq->vdev, vq->vector);
> +}
> +
>   void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
>   {
>       /* Always notify when queue is empty (when feature acknowledge) */
> @@ -736,3 +747,42 @@ void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
>       vdev->binding = binding;
>       vdev->binding_opaque = opaque;
>   }
> +
> +target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n)
> +{
> +	return vdev->vq[n].vring.desc;
> +}
> +
> +target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n)
> +{
> +	return vdev->vq[n].vring.avail;
> +}
> +
> +target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n)
> +{
> +	return vdev->vq[n].vring.used;
> +}
> +
> +uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n)
> +{
> +	return vdev->vq[n].last_avail_idx;
> +}
> +
> +void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx)
> +{
> +	vdev->vq[n].last_avail_idx = idx;
> +}
>    

Is it really necessary to return last_avail?  Can't vhost maintain it's 
own last_avail?

I'm not a huge fan of returning physical addresses for each queue 
element.  I think it makes more sense to just return the start of the 
ring queue.  The ABI defines the queue to have a very specific layout 
afterall.

Regards,

Anthony Liguori

> +VirtQueue *virtio_queue(VirtIODevice *vdev, int n)
> +{
> +	return vdev->vq + n;
> +}
> +
> +EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq)
> +{
> +	return&vq->guest_notifier;
> +}
> +EventNotifier *virtio_queue_host_notifier(VirtQueue *vq)
> +{
> +	return&vq->host_notifier;
> +}
> diff --git a/hw/virtio.h b/hw/virtio.h
> index af87889..2ebf2dd 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -184,5 +184,13 @@ void virtio_net_exit(VirtIODevice *vdev);
>   	DEFINE_PROP_BIT("indirect_desc", _state, _field, \
>   			VIRTIO_RING_F_INDIRECT_DESC, true)
>
> -
> +target_phys_addr_t virtio_queue_get_desc(VirtIODevice *vdev, int n);
> +target_phys_addr_t virtio_queue_get_avail(VirtIODevice *vdev, int n);
> +target_phys_addr_t virtio_queue_get_used(VirtIODevice *vdev, int n);
> +uint16_t virtio_queue_last_avail_idx(VirtIODevice *vdev, int n);
> +void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
> +VirtQueue *virtio_queue(VirtIODevice *vdev, int n);
> +EventNotifier *virtio_queue_guest_notifier(VirtQueue *vq);
> +EventNotifier *virtio_queue_host_notifier(VirtQueue *vq);
> +void

> virtio_irq(VirtQueue *vq);
>   #endif
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 08/12] virtio-pci: fill in notifier support
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 08/12] virtio-pci: fill in notifier support Michael S. Tsirkin
@ 2010-02-25 19:30   ` Anthony Liguori
  2010-02-28 20:02     ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
> Support host/guest notifiers in virtio-pci.
> The last one only with kvm, that's okay
> because vhost relies on kvm anyway.
>
> Note on kvm usage: kvm ioeventfd API
> is implemented on non-kvm systems as well,
> this is the reason we don't need if (kvm_enabled())
> around it.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   hw/virtio-pci.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 006ff38..3f1214c 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -24,6 +24,7 @@
>   #include "net.h"
>   #include "block_int.h"
>   #include "loader.h"
> +#include "kvm.h"
>
>   /* from Linux's linux/virtio_pci.h */
>
> @@ -398,6 +399,65 @@ static unsigned virtio_pci_get_features(void *opaque)
>       return proxy->host_features;
>   }
>
> +static void virtio_pci_guest_notifier_read(void *opaque)
> +{
> +    VirtQueue *vq = opaque;
> +    EventNotifier *n = virtio_queue_guest_notifier(vq);
> +    if (event_notifier_test_and_clear(n)) {
> +        virtio_irq(vq);
> +    }
> +}
> +
> +static int virtio_pci_guest_notifier(void *opaque, int n, bool assign)
> +{
> +    VirtIOPCIProxy *proxy = opaque;
> +    VirtQueue *vq = virtio_queue(proxy->vdev, n);
> +    EventNotifier *notifier = virtio_queue_guest_notifier(vq);
> +
> +    if (assign) {
> +        int r = event_notifier_init(notifier, 0);
> +	if (r<  0)
> +		return r;
> +        qemu_set_fd_handler(event_notifier_get_fd(notifier),
> +                            virtio_pci_guest_notifier_read, NULL, vq);
>    

While not super important, it would be nice to have this a bit more 
common.  IOW:

     r = read_event_notifier_init(notifier, 
virtio_pci_guest_notifier_read, vq);

and:

     r = kvm_eventfd_notifier_init(notifier, proxy->addr + 
VIRTIO_PCI_QUEUE_NOTIFY, n, assign);

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support Michael S. Tsirkin
  2010-02-25 19:04   ` [Qemu-devel] " Juan Quintela
@ 2010-02-25 19:44   ` Anthony Liguori
  2010-02-26 14:49     ` Michael S. Tsirkin
  1 sibling, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
> This adds vhost net device support in qemu. Will be tied to tap device
> and virtio by following patches.  Raw backend is currently missing,
> will be worked on/submitted separately.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   Makefile.target |    2 +
>   configure       |   21 ++
>   hw/vhost.c      |  631 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   hw/vhost.h      |   44 ++++
>   hw/vhost_net.c  |  177 ++++++++++++++++
>   hw/vhost_net.h  |   20 ++
>   6 files changed, 895 insertions(+), 0 deletions(-)
>   create mode 100644 hw/vhost.c
>   create mode 100644 hw/vhost.h
>   create mode 100644 hw/vhost_net.c
>   create mode 100644 hw/vhost_net.h
>
> diff --git a/Makefile.target b/Makefile.target
> index c1580e9..9b4fd84 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -174,6 +174,8 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
>   # need to fix this properly
>   obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
>   obj-y += notifier.o
> +obj-y += vhost_net.o
> +obj-$(CONFIG_VHOST_NET) += vhost.o
>   obj-y += rwhandler.o
>   obj-$(CONFIG_KVM) += kvm.o kvm-all.o
>   obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
> diff --git a/configure b/configure
> index 8eb5f5b..5eccc7c 100755
> --- a/configure
> +++ b/configure
> @@ -1498,6 +1498,23 @@ EOF
>   fi
>
>   ##########################################
> +# test for vhost net
> +
> +if test "$kvm" != "no"; then
> +	cat>  $TMPC<<EOF
> +#include<linux/vhost.h>
> +int main(void) { return 0; }
> +EOF
> +	if compile_prog "$kvm_cflags" "" ; then
> +	vhost_net=yes
> +	else
> +	vhost_net=no
> +	fi
> +else
> +	vhost_net=no
> +fi
> +
> +##########################################
>   # pthread probe
>   PTHREADLIBS_LIST="-lpthread -lpthreadGC2"
>
> @@ -1968,6 +1985,7 @@ echo "fdt support       $fdt"
>   echo "preadv support    $preadv"
>   echo "fdatasync         $fdatasync"
>   echo "uuid support      $uuid"
> +echo "vhost-net support $vhost_net"
>
>   if test $sdl_too_old = "yes"; then
>   echo "->  Your SDL version is too old - please upgrade to have SDL support"
> @@ -2492,6 +2510,9 @@ case "$target_arch2" in
>         if test "$kvm_para" = "yes"; then
>           echo "CONFIG_KVM_PARA=y">>  $config_target_mak
>         fi
> +      if test $vhost_net = "yes" ; then
> +        echo "CONFIG_VHOST_NET=y">>  $config_target_mak
> +      fi
>       fi
>   esac
>   echo "TARGET_PHYS_ADDR_BITS=$target_phys_bits">>  $config_target_mak
> diff --git a/hw/vhost.c b/hw/vhost.c
> new file mode 100644
> index 0000000..4d5ea02
> --- /dev/null
> +++ b/hw/vhost.c
>    

Needs copyright/license.

> @@ -0,0 +1,631 @@
> +#include "linux/vhost.h"
> +#include<sys/ioctl.h>
> +#include<sys/eventfd.h>
> +#include "vhost.h"
> +#include "hw/hw.h"
> +/* For range_get_last */
> +#include "pci.h"
> +
> +static void vhost_dev_sync_region(struct vhost_dev *dev,
> +                                  uint64_t mfirst, uint64_t mlast,
> +                                  uint64_t rfirst, uint64_t rlast)
> +{
> +    uint64_t start = MAX(mfirst, rfirst);
> +    uint64_t end = MIN(mlast, rlast);
> +    vhost_log_chunk_t *from = dev->log + start / VHOST_LOG_CHUNK;
> +    vhost_log_chunk_t *to = dev->log + end / VHOST_LOG_CHUNK + 1;
> +    uint64_t addr = (start / VHOST_LOG_CHUNK) * VHOST_LOG_CHUNK;
> +
> +    assert(end / VHOST_LOG_CHUNK<  dev->log_size);
> +    assert(start / VHOST_LOG_CHUNK<  dev->log_size);
> +    if (end<  start) {
> +        return;
> +    }
> +    for (;from<  to; ++from) {
> +        vhost_log_chunk_t log;
> +        int bit;
> +        /* We first check with non-atomic: much cheaper,
> +         * and we expect non-dirty to be the common case. */
> +        if (!*from) {
> +            continue;
> +        }
> +        /* Data must be read atomically. We don't really
> +         * need the barrier semantics of __sync
> +         * builtins, but it's easier to use them than
> +         * roll our own. */
> +        log = __sync_fetch_and_and(from, 0);
>    

Is this too non-portable for us?

Technically speaking, it would be better to use qemu address types 
instead of uint64_t.

> +        while ((bit = sizeof(log)>  sizeof(int) ?
> +                ffsll(log) : ffs(log))) {
> +            bit -= 1;
> +            cpu_physical_memory_set_dirty(addr + bit * VHOST_LOG_PAGE);
> +            log&= ~(0x1ull<<  bit);
> +        }
> +        addr += VHOST_LOG_CHUNK;
> +    }
> +}
> +
> +static int vhost_client_sync_dirty_bitmap(struct CPUPhysMemoryClient *client,
> +                                          target_phys_addr_t start_addr,
> +                                          target_phys_addr_t end_addr)
>    

The 'struct' shouldn't be needed...

> +{
> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
> +    int i;
> +    if (!dev->log_enabled || !dev->started) {
> +        return 0;
> +    }
> +    for (i = 0; i<  dev->mem->nregions; ++i) {
> +        struct vhost_memory_region *reg = dev->mem->regions + i;
> +        vhost_dev_sync_region(dev, start_addr, end_addr,
> +                              reg->guest_phys_addr,
> +                              range_get_last(reg->guest_phys_addr,
> +                                             reg->memory_size));
> +    }
> +    for (i = 0; i<  dev->nvqs; ++i) {
> +        struct vhost_virtqueue *vq = dev->vqs + i;
> +        unsigned size = offsetof(struct vring_used, ring) +
> +            sizeof(struct vring_used_elem) * vq->num;
> +        vhost_dev_sync_region(dev, start_addr, end_addr, vq->used_phys,
> +                              range_get_last(vq->used_phys, size));
> +    }
> +    return 0;
> +}
> +
> +/* Assign/unassign. Keep an unsorted array of non-overlapping
> + * memory regions in dev->mem. */
> +static void vhost_dev_unassign_memory(struct vhost_dev *dev,
> +                                      uint64_t start_addr,
> +                                      uint64_t size)
> +{
> +    int from, to, n = dev->mem->nregions;
> +    /* Track overlapping/split regions for sanity checking. */
> +    int overlap_start = 0, overlap_end = 0, overlap_middle = 0, split = 0;
> +
> +    for (from = 0, to = 0; from<  n; ++from, ++to) {
> +        struct vhost_memory_region *reg = dev->mem->regions + to;
> +        uint64_t reglast;
> +        uint64_t memlast;
> +        uint64_t change;
> +
> +        /* clone old region */
> +        if (to != from) {
> +            memcpy(reg, dev->mem->regions + from, sizeof *reg);
> +        }
> +
> +        /* No overlap is simple */
> +        if (!ranges_overlap(reg->guest_phys_addr, reg->memory_size,
> +                            start_addr, size)) {
> +            continue;
> +        }
> +
> +        /* Split only happens if supplied region
> +         * is in the middle of an existing one. Thus it can not
> +         * overlap with any other existing region. */
> +        assert(!split);
> +
> +        reglast = range_get_last(reg->guest_phys_addr, reg->memory_size);
> +        memlast = range_get_last(start_addr, size);
> +
> +        /* Remove whole region */
> +        if (start_addr<= reg->guest_phys_addr&&  memlast>= reglast) {
> +            --dev->mem->nregions;
> +            --to;
> +            assert(to>= 0);
> +            ++overlap_middle;
> +            continue;
> +        }
> +
> +        /* Shrink region */
> +        if (memlast>= reglast) {
> +            reg->memory_size = start_addr - reg->guest_phys_addr;
> +            assert(reg->memory_size);
> +            assert(!overlap_end);
> +            ++overlap_end;
> +            continue;
> +        }
> +
> +        /* Shift region */
> +        if (start_addr<= reg->guest_phys_addr) {
> +            change = memlast + 1 - reg->guest_phys_addr;
> +            reg->memory_size -= change;
> +            reg->guest_phys_addr += change;
> +            reg->userspace_addr += change;
> +            assert(reg->memory_size);
> +            assert(!overlap_start);
> +            ++overlap_start;
> +            continue;
> +        }
> +
> +        /* This only happens if supplied region
> +         * is in the middle of an existing one. Thus it can not
> +         * overlap with any other existing region. */
> +        assert(!overlap_start);
> +        assert(!overlap_end);
> +        assert(!overlap_middle);
> +        /* Split region: shrink first part, shift second part. */
> +        memcpy(dev->mem->regions + n, reg, sizeof *reg);
> +        reg->memory_size = start_addr - reg->guest_phys_addr;
> +        assert(reg->memory_size);
> +        change = memlast + 1 - reg->guest_phys_addr;
> +        reg = dev->mem->regions + n;
> +        reg->memory_size -= change;
> +        assert(reg->memory_size);
> +        reg->guest_phys_addr += change;
> +        reg->userspace_addr += change;
> +        /* Never add more than 1 region */
> +        assert(dev->mem->nregions == n);
> +        ++dev->mem->nregions;
> +        ++split;
> +    }
> +}
>    

This code is basically replicating the code in kvm-all with respect to 
translating qemu ram registrations into a list of non-overlapping 
slots.  We should commonize the code and perhaps even change the 
notification API to deal with non-overlapping slots since that's what 
users seem to want.
> +
> +/* Called after unassign, so no regions overlap the given range. */
> +static void vhost_dev_assign_memory(struct vhost_dev *dev,
> +                                    uint64_t start_addr,
> +                                    uint64_t size,
> +                                    uint64_t uaddr)
> +{
> +    int from, to;
> +    struct vhost_memory_region *merged = NULL;
> +    for (from = 0, to = 0; from<  dev->mem->nregions; ++from, ++to) {
> +        struct vhost_memory_region *reg = dev->mem->regions + to;
> +        uint64_t prlast, urlast;
> +        uint64_t pmlast, umlast;
> +        uint64_t s, e, u;
> +
> +        /* clone old region */
> +        if (to != from) {
> +            memcpy(reg, dev->mem->regions + from, sizeof *reg);
> +        }
> +        prlast = range_get_last(reg->guest_phys_addr, reg->memory_size);
> +        pmlast = range_get_last(start_addr, size);
> +        urlast = range_get_last(reg->userspace_addr, reg->memory_size);
> +        umlast = range_get_last(uaddr, size);
> +
> +        /* check for overlapping regions: should never happen. */
> +        assert(prlast<  start_addr || pmlast<  reg->guest_phys_addr);
> +        /* Not an adjacent or overlapping region - do not merge. */
> +        if ((prlast + 1 != start_addr || urlast + 1 != uaddr)&&
> +            (pmlast + 1 != reg->guest_phys_addr ||
> +             umlast + 1 != reg->userspace_addr)) {
> +            continue;
> +        }
> +
> +        if (merged) {
> +            --to;
> +            assert(to>= 0);
> +        } else {
> +            merged = reg;
> +        }
> +        u = MIN(uaddr, reg->userspace_addr);
> +        s = MIN(start_addr, reg->guest_phys_addr);
> +        e = MAX(pmlast, prlast);
> +        uaddr = merged->userspace_addr = u;
> +        start_addr = merged->guest_phys_addr = s;
> +        size = merged->memory_size = e - s + 1;
> +        assert(merged->memory_size);
> +    }
> +
> +    if (!merged) {
> +        struct vhost_memory_region *reg = dev->mem->regions + to;
> +        memset(reg, 0, sizeof *reg);
> +        reg->memory_size = size;
> +        assert(reg->memory_size);
> +        reg->guest_phys_addr = start_addr;
> +        reg->userspace_addr = uaddr;
> +        ++to;
> +    }
> +    assert(to<= dev->mem->nregions + 1);
> +    dev->mem->nregions = to;
> +}
>    

See above.  Unifying the two bits of code is important IMHO because we 
had an unending supply of bugs with the code in kvm-all it seems.

> +static uint64_t vhost_get_log_size(struct vhost_dev *dev)
> +{
> +    uint64_t log_size = 0;
> +    int i;
> +    for (i = 0; i<  dev->mem->nregions; ++i) {
> +        struct vhost_memory_region *reg = dev->mem->regions + i;
> +        uint64_t last = range_get_last(reg->guest_phys_addr,
> +                                       reg->memory_size);
> +        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
> +    }
> +    for (i = 0; i<  dev->nvqs; ++i) {
> +        struct vhost_virtqueue *vq = dev->vqs + i;
> +        uint64_t last = vq->used_phys +
> +            offsetof(struct vring_used, ring) +
> +            sizeof(struct vring_used_elem) * vq->num - 1;
> +        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
> +    }
> +    return log_size;
> +}
> +
> +static inline void vhost_dev_log_resize(struct vhost_dev* dev, uint64_t size)
> +{
> +    vhost_log_chunk_t *log;
> +    uint64_t log_base;
> +    int r;
> +    if (size) {
> +        log = qemu_mallocz(size * sizeof *log);
> +    } else {
> +        log = NULL;
> +    }
> +    log_base = (uint64_t)(unsigned long)log;
> +    r = ioctl(dev->control, VHOST_SET_LOG_BASE,&log_base);
> +    assert(r>= 0);
> +    vhost_client_sync_dirty_bitmap(&dev->client, 0,
> +                                   (target_phys_addr_t)~0x0ull);
> +    if (dev->log) {
> +        qemu_free(dev->log);
> +    }
> +    dev->log = log;
> +    dev->log_size = size;
> +}
> +
> +static void vhost_client_set_memory(CPUPhysMemoryClient *client,
> +                                    target_phys_addr_t start_addr,
> +                                    ram_addr_t size,
> +                                    ram_addr_t phys_offset)
> +{
> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
> +    ram_addr_t flags = phys_offset&  ~TARGET_PAGE_MASK;
> +    int s = offsetof(struct vhost_memory, regions) +
> +        (dev->mem->nregions + 1) * sizeof dev->mem->regions[0];
> +    uint64_t log_size;
> +    int r;
> +    dev->mem = qemu_realloc(dev->mem, s);
> +
> +    assert(size);
> +
> +    vhost_dev_unassign_memory(dev, start_addr, size);
> +    if (flags == IO_MEM_RAM) {
> +        /* Add given mapping, merging adjacent regions if any */
> +        vhost_dev_assign_memory(dev, start_addr, size,
> +                                (uintptr_t)qemu_get_ram_ptr(phys_offset));
> +    } else {
> +        /* Remove old mapping for this memory, if any. */
> +        vhost_dev_unassign_memory(dev, start_addr, size);
> +    }
> +
> +    if (!dev->started) {
> +        return;
> +    }
> +    if (!dev->log_enabled) {
> +        r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> +        assert(r>= 0);
> +        return;
> +    }
> +    log_size = vhost_get_log_size(dev);
> +    /* We allocate an extra 4K bytes to log,
> +     * to reduce the * number of reallocations. */
> +#define VHOST_LOG_BUFFER (0x1000 / sizeof *dev->log)
> +    /* To log more, must increase log size before table update. */
> +    if (dev->log_size<  log_size) {
> +        vhost_dev_log_resize(dev, log_size + VHOST_LOG_BUFFER);
> +    }
> +    r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> +    assert(r>= 0);
> +    /* To log less, can only decrease log size after table update. */
> +    if (dev->log_size>  log_size + VHOST_LOG_BUFFER) {
> +        vhost_dev_log_resize(dev, log_size);
> +    }
> +}
> +
> +static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
> +                                    struct vhost_virtqueue *vq,
> +                                    unsigned idx, bool enable_log)
> +{
> +    struct vhost_vring_addr addr = {
> +        .index = idx,
> +        .desc_user_addr = (u_int64_t)(unsigned long)vq->desc,
> +        .avail_user_addr = (u_int64_t)(unsigned long)vq->avail,
> +        .used_user_addr = (u_int64_t)(unsigned long)vq->used,
> +        .log_guest_addr = vq->used_phys,
> +        .flags = enable_log ? (1<<  VHOST_VRING_F_LOG) : 0,
> +    };
> +    int r = ioctl(dev->control, VHOST_SET_VRING_ADDR,&addr);
> +    if (r<  0) {
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +static int vhost_dev_set_features(struct vhost_dev *dev, bool enable_log)
> +{
> +    uint64_t features = dev->acked_features;
> +    int r;
> +    if (enable_log) {
> +        features |= 0x1<<  VHOST_F_LOG_ALL;
> +    }
> +    r = ioctl(dev->control, VHOST_SET_FEATURES,&features);
> +    return r<  0 ? -errno : 0;
> +}
> +
> +static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
> +{
> +    int r, t, i;
> +    r = vhost_dev_set_features(dev, enable_log);
> +    if (r<  0)
> +        goto err_features;
>    

Coding Style is off with single line ifs.

> +    for (i = 0; i<  dev->nvqs; ++i) {
>    

C++ habits die hard :-)

> +        r = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
> +                                     enable_log);
> +        if (r<  0)
> +            goto err_vq;
> +    }
> +    return 0;
> +err_vq:
> +    for (; i>= 0; --i) {
> +        t = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
> +                                     dev->log_enabled);
> +        assert(t>= 0);
> +    }
> +    t = vhost_dev_set_features(dev, dev->log_enabled);
> +    assert(t>= 0);
> +err_features:
> +    return r;
> +}
> +
> +static int vhost_client_migration_log(struct CPUPhysMemoryClient *client,
> +                                      int enable)
> +{
> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
> +    int r;
> +    if (!!enable == dev->log_enabled) {
> +        return 0;
> +    }
> +    if (!dev->started) {
> +        dev->log_enabled = enable;
> +        return 0;
> +    }
> +    if (!enable) {
> +        r = vhost_dev_set_log(dev, false);
> +        if (r<  0) {
> +            return r;
> +        }
> +        if (dev->log) {
> +            qemu_free(dev->log);
> +        }
> +        dev->log = NULL;
> +        dev->log_size = 0;
> +    } else {
> +        vhost_dev_log_resize(dev, vhost_get_log_size(dev));
> +        r = vhost_dev_set_log(dev, true);
> +        if (r<  0) {
> +            return r;
> +        }
> +    }
> +    dev->log_enabled = enable;
> +    return 0;
> +}
> +
> +static int vhost_virtqueue_init(struct vhost_dev *dev,
> +                                struct VirtIODevice *vdev,
> +                                struct vhost_virtqueue *vq,
> +                                unsigned idx)
> +{
> +    target_phys_addr_t s, l, a;
> +    int r;
> +    struct vhost_vring_file file = {
> +        .index = idx,
> +    };
> +    struct vhost_vring_state state = {
> +        .index = idx,
> +    };
> +    struct VirtQueue *q = virtio_queue(vdev, idx);
> +
> +    vq->num = state.num = virtio_queue_get_num(vdev, idx);
> +    r = ioctl(dev->control, VHOST_SET_VRING_NUM,&state);
> +    if (r) {
> +        return -errno;
> +    }
> +
> +    state.num = virtio_queue_last_avail_idx(vdev, idx);
> +    r = ioctl(dev->control, VHOST_SET_VRING_BASE,&state);
> +    if (r) {
> +        return -errno;
> +    }
> +
> +    s = l = sizeof(struct vring_desc) * vq->num;
> +    a = virtio_queue_get_desc(vdev, idx);
> +    vq->desc = cpu_physical_memory_map(a,&l, 0);
> +    if (!vq->desc || l != s) {
> +        r = -ENOMEM;
> +        goto fail_alloc;
> +    }
> +    s = l = offsetof(struct vring_avail, ring) +
> +        sizeof(u_int64_t) * vq->num;
> +    a = virtio_queue_get_avail(vdev, idx);
> +    vq->avail = cpu_physical_memory_map(a,&l, 0);
> +    if (!vq->avail || l != s) {
> +        r = -ENOMEM;
> +        goto fail_alloc;
> +    }
>    

You don't unmap avail/desc on failure.  map() may fail because the ring 
cross MMIO memory and you run out of a bounce buffer.

IMHO, it would be better to attempt to map the full ring at once and 
then if that doesn't succeed, bail out.  You can still pass individual 
pointers via vhost ioctls but within qemu, it's much easier to deal with 
the whole ring at a time.

> +    s = l = offsetof(struct vring_used, ring) +
> +        sizeof(struct vring_used_elem) * vq->num;
>    

This is unfortunate.  We redefine this structures in qemu to avoid 
depending on Linux headers.  But you're using the linux versions instead 
of the qemu versions.  Is it really necessary for vhost.h to include 
virtio.h?

> +    vq->used_phys = a = virtio_queue_get_used(vdev, idx);
> +    vq->used = cpu_physical_memory_map(a,&l, 1);
> +    if (!vq->used || l != s) {
> +        r = -ENOMEM;
> +        goto fail_alloc;
> +    }
> +
> +    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
> +    if (r<  0) {
> +        r = -errno;
> +        goto fail_alloc;
> +    }
> +    if (!vdev->binding->guest_notifier || !vdev->binding->host_notifier) {
> +        fprintf(stderr, "binding does not support irqfd/queuefd\n");
> +        r = -ENOSYS;
> +        goto fail_alloc;
> +    }
> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, true);
> +    if (r<  0) {
> +        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> +        goto fail_guest_notifier;
> +    }
> +
> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, true);
> +    if (r<  0) {
> +        fprintf(stderr, "Error binding host notifier: %d\n", -r);
> +        goto fail_host_notifier;
> +    }
> +
> +    file.fd = event_notifier_get_fd(virtio_queue_host_notifier(q));
> +    r = ioctl(dev->control, VHOST_SET_VRING_KICK,&file);
> +    if (r) {
> +        goto fail_kick;
> +    }
> +
> +    file.fd = event_notifier_get_fd(virtio_queue_guest_notifier(q));
> +    r = ioctl(dev->control, VHOST_SET_VRING_CALL,&file);
> +    if (r) {
> +        goto fail_call;
> +    }
>    

This function would be a bit more reasonable if it were split into 
sections FWIW.

> +    return 0;
> +
> +fail_call:
> +fail_kick:
> +    vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
> +fail_host_notifier:
> +    vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
> +fail_guest_notifier:
> +fail_alloc:
> +    return r;
> +}
> +
> +static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
> +                                    struct VirtIODevice *vdev,
> +                                    struct vhost_virtqueue *vq,
> +                                    unsigned idx)
> +{
> +    struct vhost_vring_state state = {
> +        .index = idx,
> +    };
> +    int r;
> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
> +    if (r<  0) {
> +        fprintf(stderr, "vhost VQ %d guest cleanup failed: %d\n", idx, r);
> +        fflush(stderr);
> +    }
> +    assert (r>= 0);
> +
> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
> +    if (r<  0) {
> +        fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
> +        fflush(stderr);
> +    }
> +    assert (r>= 0);
> +    r = ioctl(dev->control, VHOST_GET_VRING_BASE,&state);
> +    if (r<  0) {
> +        fprintf(stderr, "vhost VQ %d ring restore failed: %d\n", idx, r);
> +        fflush(stderr);
> +    }
> +    virtio_queue_set_last_avail_idx(vdev, idx, state.num);
> +    assert (r>= 0);
>    

You never unmap() the mapped memory and you're cheating by assuming that 
the virtio rings have a constant mapping for the life time of a guest.  
That's not technically true.  My concern is that since a guest can 
trigger remappings (by adjusting PCI mappings) badness can ensue.
> diff --git a/hw/vhost_net.c b/hw/vhost_net.c
> new file mode 100644
> index 0000000..06b7648
>    

Need copyright/license.

> --- /dev/null
> +++ b/hw/vhost_net.c
> @@ -0,0 +1,177 @@
> +#include "net.h"
> +#include "net/tap.h"
> +
> +#include "virtio-net.h"
> +#include "vhost_net.h"
> +
> +#include "config.h"
> +
> +#ifdef CONFIG_VHOST_NET
> +#include<sys/eventfd.h>
> +#include<sys/socket.h>
> +#include<linux/kvm.h>
> +#include<fcntl.h>
> +#include<sys/ioctl.h>
> +#include<linux/virtio_ring.h>
> +#include<netpacket/packet.h>
> +#include<net/ethernet.h>
> +#include<net/if.h>
> +#include<netinet/in.h>
> +
> +#include<stdio.h>
> +
> +#include "vhost.h"
> +
> +struct vhost_net {
>    


VHostNetState.

> +    struct vhost_dev dev;
> +    struct vhost_virtqueue vqs[2];
> +    int backend;
> +    VLANClientState *vc;
> +};
> +
> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
> +{
> +    /* Clear features not supported by host kernel. */
> +    if (!(net->dev.features&  (1<<  VIRTIO_F_NOTIFY_ON_EMPTY)))
> +        features&= ~(1<<  VIRTIO_F_NOTIFY_ON_EMPTY);
> +    if (!(net->dev.features&  (1<<  VIRTIO_RING_F_INDIRECT_DESC)))
> +        features&= ~(1<<  VIRTIO_RING_F_INDIRECT_DESC);
> +    if (!(net->dev.features&  (1<<  VIRTIO_NET_F_MRG_RXBUF)))
> +        features&= ~(1<<  VIRTIO_NET_F_MRG_RXBUF);
> +    return features;
> +}
> +
> +void vhost_net_ack_features(struct vhost_net *net, unsigned features)
> +{
> +    net->dev.acked_features = net->dev.backend_features;
> +    if (features&  (1<<  VIRTIO_F_NOTIFY_ON_EMPTY))
> +        net->dev.acked_features |= (1<<  VIRTIO_F_NOTIFY_ON_EMPTY);
> +    if (features&  (1<<  VIRTIO_RING_F_INDIRECT_DESC))
> +        net->dev.acked_features |= (1<<  VIRTIO_RING_F_INDIRECT_DESC);
> +}
> +
> +static int vhost_net_get_fd(VLANClientState *backend)
> +{
> +    switch (backend->info->type) {
> +    case NET_CLIENT_TYPE_TAP:
> +        return tap_get_fd(backend);
> +    default:
> +        fprintf(stderr, "vhost-net requires tap backend\n");
> +        return -EBADFD;
> +    }
> +}
> +
> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
> +{
> +    int r;
> +    struct vhost_net *net = qemu_malloc(sizeof *net);
> +    if (!backend) {
> +        fprintf(stderr, "vhost-net requires backend to be setup\n");
> +        goto fail;
> +    }
> +    r = vhost_net_get_fd(backend);
> +    if (r<  0)
> +        goto fail;
> +    net->vc = backend;
> +    net->dev.backend_features = tap_has_vnet_hdr(backend) ? 0 :
> +        (1<<  VHOST_NET_F_VIRTIO_NET_HDR);
> +    net->backend = r;
> +
> +    r = vhost_dev_init(&net->dev, devfd);
> +    if (r<  0)
> +        goto fail;
> +    if (~net->dev.features&  net->dev.backend_features) {
> +        fprintf(stderr, "vhost lacks feature mask %llu for backend\n",
> +                ~net->dev.features&  net->dev.backend_features);
> +        vhost_dev_cleanup(&net->dev);
> +        goto fail;
> +    }
> +
> +    /* Set sane init value. Override when guest acks. */
> +    vhost_net_ack_features(net, 0);
> +    return net;
> +fail:
> +    qemu_free(net);
> +    return NULL;
> +}
> +
> +int vhost_net_start(struct vhost_net *net,
> +                    VirtIODevice *dev)
> +{
> +    struct vhost_vring_file file = { };
> +    int r;
> +
> +    net->dev.nvqs = 2;
> +    net->dev.vqs = net->vqs;
> +    r = vhost_dev_start(&net->dev, dev);
> +    if (r<  0)
> +        return r;
> +
> +    net->vc->info->poll(net->vc, false);
> +    qemu_set_fd_handler(net->backend, NULL, NULL, NULL);
> +    file.fd = net->backend;
> +    for (file.index = 0; file.index<  net->dev.nvqs; ++file.index) {
> +        r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
> +        if (r<  0) {
> +            r = -errno;
> +            goto fail;
> +        }
> +    }
> +    return 0;
> +fail:
> +    file.fd = -1;
> +    while (--file.index>= 0) {
> +        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
> +        assert(r>= 0);
> +    }
> +    net->vc->info->poll(net->vc, true);
> +    vhost_dev_stop(&net->dev, dev);
> +    return r;
> +}
> +
> +void vhost_net_stop(struct vhost_net *net,
> +                    VirtIODevice *dev)
> +{
> +    struct vhost_vring_file file = { .fd = -1 };
> +
> +    for (file.index = 0; file.index<  net->dev.nvqs; ++file.index) {
> +        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
> +        assert(r>= 0);
> +    }
> +    net->vc->info->poll(net->vc, true);
> +    vhost_dev_stop(&net->dev, dev);
> +}
> +
> +void vhost_net_cleanup(struct vhost_net *net)
> +{
> +    vhost_dev_cleanup(&net->dev);
> +    qemu_free(net);
> +}
> +#else
>    

If you're going this way, I'd suggest making static inlines in the 
header file instead of polluting the C file.  It's more common to search 
within a C file and having two declarations can get annoying.

Regards,

Anthony Liguori

> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
> +{
> +	return NULL;
> +}
> +
> +int vhost_net_start(struct vhost_net *net,
> +		    VirtIODevice *dev)
> +{
> +	return -ENOSYS;
> +}
> +void vhost_net_stop(struct vhost_net *net,
> +		    VirtIODevice *dev)
> +{
> +}
> +
> +void vhost_net_cleanup(struct vhost_net *net)
> +{
> +}
> +
> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
> +{
> +	return features;
> +}
> +void vhost_net_ack_features(struct vhost_net *net, unsigned features)
> +{
> +}
> +#endif
> diff --git a/hw/vhost_net.h b/hw/vhost_net.h
> new file mode 100644
> index 0000000..2a10210
> --- /dev/null
> +++ b/hw/vhost_net.h
> @@ -0,0 +1,20 @@
> +#ifndef VHOST_NET_H
> +#define VHOST_NET_H
> +
> +#include "net.h"
> +
> +struct vhost_net;
> +
> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd);
> +
> +int vhost_net_start(struct vhost_net *net,
> +                    VirtIODevice *dev);
> +void vhost_net_stop(struct vhost_net *net,
> +                    VirtIODevice *dev);
> +
> +void vhost_net_cleanup(struct vhost_net *net);
> +
> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features);
> +void vhost_net_ack_features(struct vhost_net *net, unsigned features);
> +
> +#endif
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 10/12] tap: add vhost/vhostfd options Michael S. Tsirkin
@ 2010-02-25 19:47   ` Anthony Liguori
  2010-02-26 14:51     ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
> This adds vhost binary option to tap, to enable vhost net accelerator.
> Default is off for now, we'll be able to make default on long term
> when we know it's stable.
>
> vhostfd option can be used by management, to pass in the fd. Assigning
> vhostfd implies vhost=on.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>    

Since the thinking these days is that macvtap and tap is pretty much all 
we'll ever need for vhost-net, perhaps we should revisit -net vhost vs. 
-net tap,vhost=X?

I think -net vhost,fd=X makes a lot more sense than -net 
tap,vhost=on,vhostfd=X.

Regards,

Anthony Liguori

> ---
>   net.c           |    8 ++++++++
>   net/tap.c       |   33 +++++++++++++++++++++++++++++++++
>   qemu-options.hx |    4 +++-
>   3 files changed, 44 insertions(+), 1 deletions(-)
>
> diff --git a/net.c b/net.c
> index a1bf49f..d1e23f1 100644
> --- a/net.c
> +++ b/net.c
> @@ -973,6 +973,14 @@ static const struct {
>                   .name = "vnet_hdr",
>                   .type = QEMU_OPT_BOOL,
>                   .help = "enable the IFF_VNET_HDR flag on the tap interface"
> +            }, {
> +                .name = "vhost",
> +                .type = QEMU_OPT_BOOL,
> +                .help = "enable vhost-net network accelerator",
> +            }, {
> +                .name = "vhostfd",
> +                .type = QEMU_OPT_STRING,
> +                .help = "file descriptor of an already opened vhost net device",
>               },
>   #endif /* _WIN32 */
>               { /* end of list */ }
> diff --git a/net/tap.c b/net/tap.c
> index fc59fd4..65797ef 100644
> --- a/net/tap.c
> +++ b/net/tap.c
> @@ -41,6 +41,8 @@
>
>   #include "net/tap-linux.h"
>
> +#include "hw/vhost_net.h"
> +
>   /* Maximum GSO packet size (64k) plus plenty of room for
>    * the ethernet and virtio_net headers
>    */
> @@ -57,6 +59,7 @@ typedef struct TAPState {
>       unsigned int has_vnet_hdr : 1;
>       unsigned int using_vnet_hdr : 1;
>       unsigned int has_ufo: 1;
> +    struct vhost_net *vhost_net;
>   } TAPState;
>
>   static int launch_script(const char *setup_script, const char *ifname, int fd);
> @@ -252,6 +255,10 @@ static void tap_cleanup(VLANClientState *nc)
>   {
>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
>
> +    if (s->vhost_net) {
> +        vhost_net_cleanup(s->vhost_net);
> +    }
> +
>       qemu_purge_queued_packets(nc);
>
>       if (s->down_script[0])
> @@ -307,6 +314,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
>       s->has_ufo = tap_probe_has_ufo(s->fd);
>       tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
>       tap_read_poll(s, 1);
> +    s->vhost_net = NULL;
>       return s;
>   }
>
> @@ -456,5 +464,30 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan
>           }
>       }
>
> +    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
> +        int vhostfd, r;
> +        if (qemu_opt_get(opts, "vhostfd")) {
> +            r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
> +            if (r == -1) {
> +                return -1;
> +            }
> +            vhostfd = r;
> +        } else {
> +            vhostfd = -1;
> +        }
> +        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
> +        if (!s->vhost_net) {
> +            qemu_error("vhost-net requested but could not be initialized\n");
> +            return -1;
> +        }
> +    } else if (qemu_opt_get(opts, "vhostfd")) {
> +        qemu_error("vhostfd= is not valid without vhost\n");
> +        return -1;
> +    }
> +
> +    if (vlan) {
> +        vlan->nb_host_devs++;
> +    }
> +
>       return 0;
>   }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index f53922f..1850906 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -879,7 +879,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>       "-net tap[,vlan=n][,name=str],ifname=name\n"
>       "                connect the host TAP network interface to VLAN 'n'\n"
>   #else
> -    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off]\n"
> +    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off][,vhost=on|off][,vhostfd=h]\n"
>       "                connect the host TAP network interface to VLAN 'n' and use the\n"
>       "                network scripts 'file' (default=" DEFAULT_NETWORK_SCRIPT ")\n"
>       "                and 'dfile' (default=" DEFAULT_NETWORK_DOWN_SCRIPT ")\n"
> @@ -889,6 +889,8 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>       "                default of 'sndbuf=1048576' can be disabled using 'sndbuf=0')\n"
>       "                use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
>       "                use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
> +    "                use vhost=on to enable experimental in kernel accelerator\n"
> +    "                use 'vhostfd=h' to connect to an already opened vhost net device\n"
>   #endif
>       "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
>       "                connect the vlan 'n' to another VLAN using a socket connection\n"
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 00/12] vhost-net: upstream integration
  2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
                   ` (11 preceding siblings ...)
  2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 12/12] virtio-net: vhost net support Michael S. Tsirkin
@ 2010-02-25 19:49 ` Anthony Liguori
  12 siblings, 0 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/25/2010 12:27 PM, Michael S. Tsirkin wrote:
> Here's a patchset with vhost support for upstream qemu,
> rabed to latest bits.
>
> Note that irqchip/MSI is no longer required for vhost, but you should
> not expect performance gains from vhost unless in-kernel irqchip is
> enabled (which is not in upstream qemu now), and unless guest enables
> MSI.  A follow-up patchset against qemu-kvm will add irqchip support.
>
> Only virtio-pci is currently supported: I'm interested in supporting
> syborg/s390 as well, and tried to make APIs generic to make this
> possible.
>
> Also missing is packet socket backend.
>    

Looks pretty good overall.

Regards,

Anthony Liguori

> Cc'd, you did review of these internally, I would be thankful
> for review/ack upstream.
>
> Changes from v1:
>    Addressed style comments
>    Migration fixes.
>    Gracefully fail with non-tap backends.
>
> Michael S. Tsirkin (12):
>    tap: add interface to get device fd
>    kvm: add API to set ioeventfd
>    notifier: event notifier implementation
>    virtio: add notifier support
>    virtio: add APIs for queue fields
>    virtio: add set_status callback
>    virtio: move typedef to qemu-common
>    virtio-pci: fill in notifier support
>    vhost: vhost net support
>    tap: add vhost/vhostfd options
>    tap: add API to retrieve vhost net header
>    virtio-net: vhost net support
>
>   Makefile.target      |    3 +
>   configure            |   21 ++
>   hw/notifier.c        |   50 ++++
>   hw/notifier.h        |   16 ++
>   hw/s390-virtio-bus.c |    7 +-
>   hw/syborg_virtio.c   |    2 +
>   hw/vhost.c           |  631 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   hw/vhost.h           |   44 ++++
>   hw/vhost_net.c       |  177 ++++++++++++++
>   hw/vhost_net.h       |   20 ++
>   hw/virtio-net.c      |   71 ++++++-
>   hw/virtio-pci.c      |   71 ++++++-
>   hw/virtio.c          |   55 +++++-
>   hw/virtio.h          |   15 +-
>   kvm-all.c            |   22 ++
>   kvm.h                |   16 ++
>   net.c                |    8 +
>   net/tap.c            |   47 ++++
>   net/tap.h            |    5 +
>   qemu-common.h        |    2 +
>   qemu-options.hx      |    4 +-
>   21 files changed, 1279 insertions(+), 8 deletions(-)
>   create mode 100644 hw/notifier.c
>   create mode 100644 hw/notifier.h
>   create mode 100644 hw/vhost.c
>   create mode 100644 hw/vhost.h
>   create mode 100644 hw/vhost_net.c
>   create mode 100644 hw/vhost_net.h
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 05/12] virtio: add APIs for queue fields
  2010-02-25 19:25   ` [Qemu-devel] " Anthony Liguori
@ 2010-02-26  8:46     ` Gleb Natapov
  0 siblings, 0 replies; 70+ messages in thread
From: Gleb Natapov @ 2010-02-26  8:46 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

On Thu, Feb 25, 2010 at 01:25:46PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:27 PM, Michael S. Tsirkin wrote:
> >vhost needs physical addresses for ring and other queue fields,
> >so add APIs for these.
> >
> >Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >---
> >  hw/virtio.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/virtio.h |   10 +++++++++-
> >  2 files changed, 59 insertions(+), 1 deletions(-)
> >
> >diff --git a/hw/virtio.c b/hw/virtio.c
> >index 1f5e7be..b017d7b 100644
> >--- a/hw/virtio.c
> >+++ b/hw/virtio.c
> >@@ -74,6 +74,11 @@ struct VirtQueue
> >      uint16_t vector;
> >      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >      VirtIODevice *vdev;
> >+<<<<<<<  HEAD
> >+=======
> >+    EventNotifier guest_notifier;
> >+    EventNotifier host_notifier;
> >+>>>>>>>  8afa4fd... virtio: add APIs for queue fields
> 
> That's clearly not right :-)
> 
Merge conflict resolution is left as exercise for the reader.

--
			Gleb.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-25 19:04   ` [Qemu-devel] " Juan Quintela
@ 2010-02-26 14:32     ` Michael S. Tsirkin
  2010-02-26 14:38       ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-26 14:32 UTC (permalink / raw)
  To: Juan Quintela; +Cc: amit.shah, qemu-devel, kraxel

On Thu, Feb 25, 2010 at 08:04:21PM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > This adds vhost net device support in qemu. Will be tied to tap device
> > and virtio by following patches.  Raw backend is currently missing,
> > will be worked on/submitted separately.
> >
> 
> +obj-y += vhost_net.o
> +obj-$(CONFIG_VHOST_NET) += vhost.o
> 
> hy is vhost_net.o configured unconditionally?


It includes stubs that return error. This way callers do not
need to include any ifdefs.

> > --- a/configure
> > +++ b/configure
> > @@ -1498,6 +1498,23 @@ EOF
> >  fi
> >
> This misses vhost_net var definition at the start of the file

Seems to work without. Why is it needed?

> and
> --enable-vhost/--disable-vhost options.
> 

I don't really see why we need --enable-vhost/--disable-vhost.
Runtime flag is enough.

> >  ##########################################
> > +# test for vhost net
> > +
> > +if test "$kvm" != "no"; then
> > +	cat > $TMPC <<EOF
> > +#include <linux/vhost.h>
> > +int main(void) { return 0; }
> > +EOF
> > +	if compile_prog "$kvm_cflags" "" ; then
> > +	vhost_net=yes
> > +	else
> > +	vhost_net=no
> > +	fi
> 
> Indent please.
> 
> > +else
> > +	vhost_net=no
> > +fi
> > +
> > +##########################################
> >  # pthread probe
> >  PTHREADLIBS_LIST="-lpthread -lpthreadGC2"
> >  
> > @@ -1968,6 +1985,7 @@ echo "fdt support       $fdt"
> >  echo "preadv support    $preadv"
> >  echo "fdatasync         $fdatasync"
> >  echo "uuid support      $uuid"
> > +echo "vhost-net support $vhost_net"
> 
> Otherwise this couldo not be there.


What do you mean? vhost_net always gets a value, so it is safe to use
here.

> >  if test $sdl_too_old = "yes"; then
> >  echo "-> Your SDL version is too old - please upgrade to have SDL support"
> > @@ -2492,6 +2510,9 @@ case "$target_arch2" in
> >        if test "$kvm_para" = "yes"; then
> >          echo "CONFIG_KVM_PARA=y" >> $config_target_mak
> >        fi
> > +      if test $vhost_net = "yes" ; then
> > +        echo "CONFIG_VHOST_NET=y" >> $config_target_mak
> > +      fi
> >      fi
> >  esac
> >  echo "TARGET_PHYS_ADDR_BITS=$target_phys_bits" >> $config_target_mak
> 
> > +    for (;from < to; ++from) {
> > +        vhost_log_chunk_t log;
> 
> .....
> 
> > +                ffsll(log) : ffs(log))) {
> 
>   if you defines vhost_log_chuck_t, you also define vhost_log_ffs() and
>   you are done without this if.
> 
> Later, Juan.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-26 14:32     ` Michael S. Tsirkin
@ 2010-02-26 14:38       ` Anthony Liguori
  2010-02-26 14:54         ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-26 14:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, kraxel, qemu-devel, Juan Quintela

On 02/26/2010 08:32 AM, Michael S. Tsirkin wrote:
>> and
>> --enable-vhost/--disable-vhost options.
>>
>>      
> I don't really see why we need --enable-vhost/--disable-vhost.
> Runtime flag is enough.
>    

So that packagers can disable features at build time that they don't 
want to support.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-25 19:44   ` Anthony Liguori
@ 2010-02-26 14:49     ` Michael S. Tsirkin
  2010-02-26 15:18       ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-26 14:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Thu, Feb 25, 2010 at 01:44:34PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>> This adds vhost net device support in qemu. Will be tied to tap device
>> and virtio by following patches.  Raw backend is currently missing,
>> will be worked on/submitted separately.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>   Makefile.target |    2 +
>>   configure       |   21 ++
>>   hw/vhost.c      |  631 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vhost.h      |   44 ++++
>>   hw/vhost_net.c  |  177 ++++++++++++++++
>>   hw/vhost_net.h  |   20 ++
>>   6 files changed, 895 insertions(+), 0 deletions(-)
>>   create mode 100644 hw/vhost.c
>>   create mode 100644 hw/vhost.h
>>   create mode 100644 hw/vhost_net.c
>>   create mode 100644 hw/vhost_net.h
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index c1580e9..9b4fd84 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -174,6 +174,8 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
>>   # need to fix this properly
>>   obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
>>   obj-y += notifier.o
>> +obj-y += vhost_net.o
>> +obj-$(CONFIG_VHOST_NET) += vhost.o
>>   obj-y += rwhandler.o
>>   obj-$(CONFIG_KVM) += kvm.o kvm-all.o
>>   obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
>> diff --git a/configure b/configure
>> index 8eb5f5b..5eccc7c 100755
>> --- a/configure
>> +++ b/configure
>> @@ -1498,6 +1498,23 @@ EOF
>>   fi
>>
>>   ##########################################
>> +# test for vhost net
>> +
>> +if test "$kvm" != "no"; then
>> +	cat>  $TMPC<<EOF
>> +#include<linux/vhost.h>
>> +int main(void) { return 0; }
>> +EOF
>> +	if compile_prog "$kvm_cflags" "" ; then
>> +	vhost_net=yes
>> +	else
>> +	vhost_net=no
>> +	fi
>> +else
>> +	vhost_net=no
>> +fi
>> +
>> +##########################################
>>   # pthread probe
>>   PTHREADLIBS_LIST="-lpthread -lpthreadGC2"
>>
>> @@ -1968,6 +1985,7 @@ echo "fdt support       $fdt"
>>   echo "preadv support    $preadv"
>>   echo "fdatasync         $fdatasync"
>>   echo "uuid support      $uuid"
>> +echo "vhost-net support $vhost_net"
>>
>>   if test $sdl_too_old = "yes"; then
>>   echo "->  Your SDL version is too old - please upgrade to have SDL support"
>> @@ -2492,6 +2510,9 @@ case "$target_arch2" in
>>         if test "$kvm_para" = "yes"; then
>>           echo "CONFIG_KVM_PARA=y">>  $config_target_mak
>>         fi
>> +      if test $vhost_net = "yes" ; then
>> +        echo "CONFIG_VHOST_NET=y">>  $config_target_mak
>> +      fi
>>       fi
>>   esac
>>   echo "TARGET_PHYS_ADDR_BITS=$target_phys_bits">>  $config_target_mak
>> diff --git a/hw/vhost.c b/hw/vhost.c
>> new file mode 100644
>> index 0000000..4d5ea02
>> --- /dev/null
>> +++ b/hw/vhost.c
>>    
>
> Needs copyright/license.
>
>> @@ -0,0 +1,631 @@
>> +#include "linux/vhost.h"
>> +#include<sys/ioctl.h>
>> +#include<sys/eventfd.h>
>> +#include "vhost.h"
>> +#include "hw/hw.h"
>> +/* For range_get_last */
>> +#include "pci.h"
>> +
>> +static void vhost_dev_sync_region(struct vhost_dev *dev,
>> +                                  uint64_t mfirst, uint64_t mlast,
>> +                                  uint64_t rfirst, uint64_t rlast)
>> +{
>> +    uint64_t start = MAX(mfirst, rfirst);
>> +    uint64_t end = MIN(mlast, rlast);
>> +    vhost_log_chunk_t *from = dev->log + start / VHOST_LOG_CHUNK;
>> +    vhost_log_chunk_t *to = dev->log + end / VHOST_LOG_CHUNK + 1;
>> +    uint64_t addr = (start / VHOST_LOG_CHUNK) * VHOST_LOG_CHUNK;
>> +
>> +    assert(end / VHOST_LOG_CHUNK<  dev->log_size);
>> +    assert(start / VHOST_LOG_CHUNK<  dev->log_size);
>> +    if (end<  start) {
>> +        return;
>> +    }
>> +    for (;from<  to; ++from) {
>> +        vhost_log_chunk_t log;
>> +        int bit;
>> +        /* We first check with non-atomic: much cheaper,
>> +         * and we expect non-dirty to be the common case. */
>> +        if (!*from) {
>> +            continue;
>> +        }
>> +        /* Data must be read atomically. We don't really
>> +         * need the barrier semantics of __sync
>> +         * builtins, but it's easier to use them than
>> +         * roll our own. */
>> +        log = __sync_fetch_and_and(from, 0);
>>    
>
> Is this too non-portable for us?
>
> Technically speaking, it would be better to use qemu address types  
> instead of uint64_t.
>
>> +        while ((bit = sizeof(log)>  sizeof(int) ?
>> +                ffsll(log) : ffs(log))) {
>> +            bit -= 1;
>> +            cpu_physical_memory_set_dirty(addr + bit * VHOST_LOG_PAGE);
>> +            log&= ~(0x1ull<<  bit);
>> +        }
>> +        addr += VHOST_LOG_CHUNK;
>> +    }
>> +}
>> +
>> +static int vhost_client_sync_dirty_bitmap(struct CPUPhysMemoryClient *client,
>> +                                          target_phys_addr_t start_addr,
>> +                                          target_phys_addr_t end_addr)
>>    
>
> The 'struct' shouldn't be needed...
>
>> +{
>> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
>> +    int i;
>> +    if (!dev->log_enabled || !dev->started) {
>> +        return 0;
>> +    }
>> +    for (i = 0; i<  dev->mem->nregions; ++i) {
>> +        struct vhost_memory_region *reg = dev->mem->regions + i;
>> +        vhost_dev_sync_region(dev, start_addr, end_addr,
>> +                              reg->guest_phys_addr,
>> +                              range_get_last(reg->guest_phys_addr,
>> +                                             reg->memory_size));
>> +    }
>> +    for (i = 0; i<  dev->nvqs; ++i) {
>> +        struct vhost_virtqueue *vq = dev->vqs + i;
>> +        unsigned size = offsetof(struct vring_used, ring) +
>> +            sizeof(struct vring_used_elem) * vq->num;
>> +        vhost_dev_sync_region(dev, start_addr, end_addr, vq->used_phys,
>> +                              range_get_last(vq->used_phys, size));
>> +    }
>> +    return 0;
>> +}
>> +
>> +/* Assign/unassign. Keep an unsorted array of non-overlapping
>> + * memory regions in dev->mem. */
>> +static void vhost_dev_unassign_memory(struct vhost_dev *dev,
>> +                                      uint64_t start_addr,
>> +                                      uint64_t size)
>> +{
>> +    int from, to, n = dev->mem->nregions;
>> +    /* Track overlapping/split regions for sanity checking. */
>> +    int overlap_start = 0, overlap_end = 0, overlap_middle = 0, split = 0;
>> +
>> +    for (from = 0, to = 0; from<  n; ++from, ++to) {
>> +        struct vhost_memory_region *reg = dev->mem->regions + to;
>> +        uint64_t reglast;
>> +        uint64_t memlast;
>> +        uint64_t change;
>> +
>> +        /* clone old region */
>> +        if (to != from) {
>> +            memcpy(reg, dev->mem->regions + from, sizeof *reg);
>> +        }
>> +
>> +        /* No overlap is simple */
>> +        if (!ranges_overlap(reg->guest_phys_addr, reg->memory_size,
>> +                            start_addr, size)) {
>> +            continue;
>> +        }
>> +
>> +        /* Split only happens if supplied region
>> +         * is in the middle of an existing one. Thus it can not
>> +         * overlap with any other existing region. */
>> +        assert(!split);
>> +
>> +        reglast = range_get_last(reg->guest_phys_addr, reg->memory_size);
>> +        memlast = range_get_last(start_addr, size);
>> +
>> +        /* Remove whole region */
>> +        if (start_addr<= reg->guest_phys_addr&&  memlast>= reglast) {
>> +            --dev->mem->nregions;
>> +            --to;
>> +            assert(to>= 0);
>> +            ++overlap_middle;
>> +            continue;
>> +        }
>> +
>> +        /* Shrink region */
>> +        if (memlast>= reglast) {
>> +            reg->memory_size = start_addr - reg->guest_phys_addr;
>> +            assert(reg->memory_size);
>> +            assert(!overlap_end);
>> +            ++overlap_end;
>> +            continue;
>> +        }
>> +
>> +        /* Shift region */
>> +        if (start_addr<= reg->guest_phys_addr) {
>> +            change = memlast + 1 - reg->guest_phys_addr;
>> +            reg->memory_size -= change;
>> +            reg->guest_phys_addr += change;
>> +            reg->userspace_addr += change;
>> +            assert(reg->memory_size);
>> +            assert(!overlap_start);
>> +            ++overlap_start;
>> +            continue;
>> +        }
>> +
>> +        /* This only happens if supplied region
>> +         * is in the middle of an existing one. Thus it can not
>> +         * overlap with any other existing region. */
>> +        assert(!overlap_start);
>> +        assert(!overlap_end);
>> +        assert(!overlap_middle);
>> +        /* Split region: shrink first part, shift second part. */
>> +        memcpy(dev->mem->regions + n, reg, sizeof *reg);
>> +        reg->memory_size = start_addr - reg->guest_phys_addr;
>> +        assert(reg->memory_size);
>> +        change = memlast + 1 - reg->guest_phys_addr;
>> +        reg = dev->mem->regions + n;
>> +        reg->memory_size -= change;
>> +        assert(reg->memory_size);
>> +        reg->guest_phys_addr += change;
>> +        reg->userspace_addr += change;
>> +        /* Never add more than 1 region */
>> +        assert(dev->mem->nregions == n);
>> +        ++dev->mem->nregions;
>> +        ++split;
>> +    }
>> +}
>>    
>
> This code is basically replicating the code in kvm-all with respect to  
> translating qemu ram registrations into a list of non-overlapping slots.  
> We should commonize the code and perhaps even change the notification API 
> to deal with non-overlapping slots since that's what users seem to want.

KVM code needs all kind of work-arounds for KVM specific issues.
It also assumes that KVM is registered at startup, so it
does not try to optimize finding slots.

I propose merging this as is, and then someone who has an idea
how to do this better can come and unify the code.

>> +
>> +/* Called after unassign, so no regions overlap the given range. */
>> +static void vhost_dev_assign_memory(struct vhost_dev *dev,
>> +                                    uint64_t start_addr,
>> +                                    uint64_t size,
>> +                                    uint64_t uaddr)
>> +{
>> +    int from, to;
>> +    struct vhost_memory_region *merged = NULL;
>> +    for (from = 0, to = 0; from<  dev->mem->nregions; ++from, ++to) {
>> +        struct vhost_memory_region *reg = dev->mem->regions + to;
>> +        uint64_t prlast, urlast;
>> +        uint64_t pmlast, umlast;
>> +        uint64_t s, e, u;
>> +
>> +        /* clone old region */
>> +        if (to != from) {
>> +            memcpy(reg, dev->mem->regions + from, sizeof *reg);
>> +        }
>> +        prlast = range_get_last(reg->guest_phys_addr, reg->memory_size);
>> +        pmlast = range_get_last(start_addr, size);
>> +        urlast = range_get_last(reg->userspace_addr, reg->memory_size);
>> +        umlast = range_get_last(uaddr, size);
>> +
>> +        /* check for overlapping regions: should never happen. */
>> +        assert(prlast<  start_addr || pmlast<  reg->guest_phys_addr);
>> +        /* Not an adjacent or overlapping region - do not merge. */
>> +        if ((prlast + 1 != start_addr || urlast + 1 != uaddr)&&
>> +            (pmlast + 1 != reg->guest_phys_addr ||
>> +             umlast + 1 != reg->userspace_addr)) {
>> +            continue;
>> +        }
>> +
>> +        if (merged) {
>> +            --to;
>> +            assert(to>= 0);
>> +        } else {
>> +            merged = reg;
>> +        }
>> +        u = MIN(uaddr, reg->userspace_addr);
>> +        s = MIN(start_addr, reg->guest_phys_addr);
>> +        e = MAX(pmlast, prlast);
>> +        uaddr = merged->userspace_addr = u;
>> +        start_addr = merged->guest_phys_addr = s;
>> +        size = merged->memory_size = e - s + 1;
>> +        assert(merged->memory_size);
>> +    }
>> +
>> +    if (!merged) {
>> +        struct vhost_memory_region *reg = dev->mem->regions + to;
>> +        memset(reg, 0, sizeof *reg);
>> +        reg->memory_size = size;
>> +        assert(reg->memory_size);
>> +        reg->guest_phys_addr = start_addr;
>> +        reg->userspace_addr = uaddr;
>> +        ++to;
>> +    }
>> +    assert(to<= dev->mem->nregions + 1);
>> +    dev->mem->nregions = to;
>> +}
>>    
>
> See above.  Unifying the two bits of code is important IMHO because we  
> had an unending supply of bugs with the code in kvm-all it seems.

Mine has no bugs, let's switch to it!

Seriously, need to tread very carefully here.
This is why I say: merge it, then look at how to reuse code.

>> +static uint64_t vhost_get_log_size(struct vhost_dev *dev)
>> +{
>> +    uint64_t log_size = 0;
>> +    int i;
>> +    for (i = 0; i<  dev->mem->nregions; ++i) {
>> +        struct vhost_memory_region *reg = dev->mem->regions + i;
>> +        uint64_t last = range_get_last(reg->guest_phys_addr,
>> +                                       reg->memory_size);
>> +        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
>> +    }
>> +    for (i = 0; i<  dev->nvqs; ++i) {
>> +        struct vhost_virtqueue *vq = dev->vqs + i;
>> +        uint64_t last = vq->used_phys +
>> +            offsetof(struct vring_used, ring) +
>> +            sizeof(struct vring_used_elem) * vq->num - 1;
>> +        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
>> +    }
>> +    return log_size;
>> +}
>> +
>> +static inline void vhost_dev_log_resize(struct vhost_dev* dev, uint64_t size)
>> +{
>> +    vhost_log_chunk_t *log;
>> +    uint64_t log_base;
>> +    int r;
>> +    if (size) {
>> +        log = qemu_mallocz(size * sizeof *log);
>> +    } else {
>> +        log = NULL;
>> +    }
>> +    log_base = (uint64_t)(unsigned long)log;
>> +    r = ioctl(dev->control, VHOST_SET_LOG_BASE,&log_base);
>> +    assert(r>= 0);
>> +    vhost_client_sync_dirty_bitmap(&dev->client, 0,
>> +                                   (target_phys_addr_t)~0x0ull);
>> +    if (dev->log) {
>> +        qemu_free(dev->log);
>> +    }
>> +    dev->log = log;
>> +    dev->log_size = size;
>> +}
>> +
>> +static void vhost_client_set_memory(CPUPhysMemoryClient *client,
>> +                                    target_phys_addr_t start_addr,
>> +                                    ram_addr_t size,
>> +                                    ram_addr_t phys_offset)
>> +{
>> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
>> +    ram_addr_t flags = phys_offset&  ~TARGET_PAGE_MASK;
>> +    int s = offsetof(struct vhost_memory, regions) +
>> +        (dev->mem->nregions + 1) * sizeof dev->mem->regions[0];
>> +    uint64_t log_size;
>> +    int r;
>> +    dev->mem = qemu_realloc(dev->mem, s);
>> +
>> +    assert(size);
>> +
>> +    vhost_dev_unassign_memory(dev, start_addr, size);
>> +    if (flags == IO_MEM_RAM) {
>> +        /* Add given mapping, merging adjacent regions if any */
>> +        vhost_dev_assign_memory(dev, start_addr, size,
>> +                                (uintptr_t)qemu_get_ram_ptr(phys_offset));
>> +    } else {
>> +        /* Remove old mapping for this memory, if any. */
>> +        vhost_dev_unassign_memory(dev, start_addr, size);
>> +    }
>> +
>> +    if (!dev->started) {
>> +        return;
>> +    }
>> +    if (!dev->log_enabled) {
>> +        r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
>> +        assert(r>= 0);
>> +        return;
>> +    }
>> +    log_size = vhost_get_log_size(dev);
>> +    /* We allocate an extra 4K bytes to log,
>> +     * to reduce the * number of reallocations. */
>> +#define VHOST_LOG_BUFFER (0x1000 / sizeof *dev->log)
>> +    /* To log more, must increase log size before table update. */
>> +    if (dev->log_size<  log_size) {
>> +        vhost_dev_log_resize(dev, log_size + VHOST_LOG_BUFFER);
>> +    }
>> +    r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
>> +    assert(r>= 0);
>> +    /* To log less, can only decrease log size after table update. */
>> +    if (dev->log_size>  log_size + VHOST_LOG_BUFFER) {
>> +        vhost_dev_log_resize(dev, log_size);
>> +    }
>> +}
>> +
>> +static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
>> +                                    struct vhost_virtqueue *vq,
>> +                                    unsigned idx, bool enable_log)
>> +{
>> +    struct vhost_vring_addr addr = {
>> +        .index = idx,
>> +        .desc_user_addr = (u_int64_t)(unsigned long)vq->desc,
>> +        .avail_user_addr = (u_int64_t)(unsigned long)vq->avail,
>> +        .used_user_addr = (u_int64_t)(unsigned long)vq->used,
>> +        .log_guest_addr = vq->used_phys,
>> +        .flags = enable_log ? (1<<  VHOST_VRING_F_LOG) : 0,
>> +    };
>> +    int r = ioctl(dev->control, VHOST_SET_VRING_ADDR,&addr);
>> +    if (r<  0) {
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vhost_dev_set_features(struct vhost_dev *dev, bool enable_log)
>> +{
>> +    uint64_t features = dev->acked_features;
>> +    int r;
>> +    if (enable_log) {
>> +        features |= 0x1<<  VHOST_F_LOG_ALL;
>> +    }
>> +    r = ioctl(dev->control, VHOST_SET_FEATURES,&features);
>> +    return r<  0 ? -errno : 0;
>> +}
>> +
>> +static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
>> +{
>> +    int r, t, i;
>> +    r = vhost_dev_set_features(dev, enable_log);
>> +    if (r<  0)
>> +        goto err_features;
>>    
>
> Coding Style is off with single line ifs.
>
>> +    for (i = 0; i<  dev->nvqs; ++i) {
>>    
>
> C++ habits die hard :-)


What's that about?

>> +        r = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
>> +                                     enable_log);
>> +        if (r<  0)
>> +            goto err_vq;
>> +    }
>> +    return 0;
>> +err_vq:
>> +    for (; i>= 0; --i) {
>> +        t = vhost_virtqueue_set_addr(dev, dev->vqs + i, i,
>> +                                     dev->log_enabled);
>> +        assert(t>= 0);
>> +    }
>> +    t = vhost_dev_set_features(dev, dev->log_enabled);
>> +    assert(t>= 0);
>> +err_features:
>> +    return r;
>> +}
>> +
>> +static int vhost_client_migration_log(struct CPUPhysMemoryClient *client,
>> +                                      int enable)
>> +{
>> +    struct vhost_dev *dev = container_of(client, struct vhost_dev, client);
>> +    int r;
>> +    if (!!enable == dev->log_enabled) {
>> +        return 0;
>> +    }
>> +    if (!dev->started) {
>> +        dev->log_enabled = enable;
>> +        return 0;
>> +    }
>> +    if (!enable) {
>> +        r = vhost_dev_set_log(dev, false);
>> +        if (r<  0) {
>> +            return r;
>> +        }
>> +        if (dev->log) {
>> +            qemu_free(dev->log);
>> +        }
>> +        dev->log = NULL;
>> +        dev->log_size = 0;
>> +    } else {
>> +        vhost_dev_log_resize(dev, vhost_get_log_size(dev));
>> +        r = vhost_dev_set_log(dev, true);
>> +        if (r<  0) {
>> +            return r;
>> +        }
>> +    }
>> +    dev->log_enabled = enable;
>> +    return 0;
>> +}
>> +
>> +static int vhost_virtqueue_init(struct vhost_dev *dev,
>> +                                struct VirtIODevice *vdev,
>> +                                struct vhost_virtqueue *vq,
>> +                                unsigned idx)
>> +{
>> +    target_phys_addr_t s, l, a;
>> +    int r;
>> +    struct vhost_vring_file file = {
>> +        .index = idx,
>> +    };
>> +    struct vhost_vring_state state = {
>> +        .index = idx,
>> +    };
>> +    struct VirtQueue *q = virtio_queue(vdev, idx);
>> +
>> +    vq->num = state.num = virtio_queue_get_num(vdev, idx);
>> +    r = ioctl(dev->control, VHOST_SET_VRING_NUM,&state);
>> +    if (r) {
>> +        return -errno;
>> +    }
>> +
>> +    state.num = virtio_queue_last_avail_idx(vdev, idx);
>> +    r = ioctl(dev->control, VHOST_SET_VRING_BASE,&state);
>> +    if (r) {
>> +        return -errno;
>> +    }
>> +
>> +    s = l = sizeof(struct vring_desc) * vq->num;
>> +    a = virtio_queue_get_desc(vdev, idx);
>> +    vq->desc = cpu_physical_memory_map(a,&l, 0);
>> +    if (!vq->desc || l != s) {
>> +        r = -ENOMEM;
>> +        goto fail_alloc;
>> +    }
>> +    s = l = offsetof(struct vring_avail, ring) +
>> +        sizeof(u_int64_t) * vq->num;
>> +    a = virtio_queue_get_avail(vdev, idx);
>> +    vq->avail = cpu_physical_memory_map(a,&l, 0);
>> +    if (!vq->avail || l != s) {
>> +        r = -ENOMEM;
>> +        goto fail_alloc;
>> +    }
>>    
>
> You don't unmap avail/desc on failure.  map() may fail because the ring  
> cross MMIO memory and you run out of a bounce buffer.
>
> IMHO, it would be better to attempt to map the full ring at once and  
> then if that doesn't succeed, bail out.  You can still pass individual  
> pointers via vhost ioctls but within qemu, it's much easier to deal with  
> the whole ring at a time.

I prefer to keep as much logic about ring layout as possible
in virtio.c

>> +    s = l = offsetof(struct vring_used, ring) +
>> +        sizeof(struct vring_used_elem) * vq->num;
>>    
>
> This is unfortunate.  We redefine this structures in qemu to avoid  
> depending on Linux headers.

And we should for e.g. windows portability.

>  But you're using the linux versions instead  
> of the qemu versions.  Is it really necessary for vhost.h to include  
> virtio.h?

Yes. And anyway, vhost does not exist on non-linux systems so there
is no issue IMO.

>> +    vq->used_phys = a = virtio_queue_get_used(vdev, idx);
>> +    vq->used = cpu_physical_memory_map(a,&l, 1);
>> +    if (!vq->used || l != s) {
>> +        r = -ENOMEM;
>> +        goto fail_alloc;
>> +    }
>> +
>> +    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
>> +    if (r<  0) {
>> +        r = -errno;
>> +        goto fail_alloc;
>> +    }
>> +    if (!vdev->binding->guest_notifier || !vdev->binding->host_notifier) {
>> +        fprintf(stderr, "binding does not support irqfd/queuefd\n");
>> +        r = -ENOSYS;
>> +        goto fail_alloc;
>> +    }
>> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, true);
>> +    if (r<  0) {
>> +        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
>> +        goto fail_guest_notifier;
>> +    }
>> +
>> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, true);
>> +    if (r<  0) {
>> +        fprintf(stderr, "Error binding host notifier: %d\n", -r);
>> +        goto fail_host_notifier;
>> +    }
>> +
>> +    file.fd = event_notifier_get_fd(virtio_queue_host_notifier(q));
>> +    r = ioctl(dev->control, VHOST_SET_VRING_KICK,&file);
>> +    if (r) {
>> +        goto fail_kick;
>> +    }
>> +
>> +    file.fd = event_notifier_get_fd(virtio_queue_guest_notifier(q));
>> +    r = ioctl(dev->control, VHOST_SET_VRING_CALL,&file);
>> +    if (r) {
>> +        goto fail_call;
>> +    }
>>    
>
> This function would be a bit more reasonable if it were split into  
> sections FWIW.

Not sure what do you mean here.

>> +    return 0;
>> +
>> +fail_call:
>> +fail_kick:
>> +    vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
>> +fail_host_notifier:
>> +    vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
>> +fail_guest_notifier:
>> +fail_alloc:
>> +    return r;
>> +}
>> +
>> +static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
>> +                                    struct VirtIODevice *vdev,
>> +                                    struct vhost_virtqueue *vq,
>> +                                    unsigned idx)
>> +{
>> +    struct vhost_vring_state state = {
>> +        .index = idx,
>> +    };
>> +    int r;
>> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, false);
>> +    if (r<  0) {
>> +        fprintf(stderr, "vhost VQ %d guest cleanup failed: %d\n", idx, r);
>> +        fflush(stderr);
>> +    }
>> +    assert (r>= 0);
>> +
>> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, false);
>> +    if (r<  0) {
>> +        fprintf(stderr, "vhost VQ %d host cleanup failed: %d\n", idx, r);
>> +        fflush(stderr);
>> +    }
>> +    assert (r>= 0);
>> +    r = ioctl(dev->control, VHOST_GET_VRING_BASE,&state);
>> +    if (r<  0) {
>> +        fprintf(stderr, "vhost VQ %d ring restore failed: %d\n", idx, r);
>> +        fflush(stderr);
>> +    }
>> +    virtio_queue_set_last_avail_idx(vdev, idx, state.num);
>> +    assert (r>= 0);
>>    
>
> You never unmap() the mapped memory and you're cheating by assuming that  
> the virtio rings have a constant mapping for the life time of a guest.   
> That's not technically true.  My concern is that since a guest can  
> trigger remappings (by adjusting PCI mappings) badness can ensue.

I do not know how this can happen. What do PCI mappings have to do with this?
Please explain. If it can, vhost will need notification to update.

>> diff --git a/hw/vhost_net.c b/hw/vhost_net.c
>> new file mode 100644
>> index 0000000..06b7648
>>    
>
> Need copyright/license.
>
>> --- /dev/null
>> +++ b/hw/vhost_net.c
>> @@ -0,0 +1,177 @@
>> +#include "net.h"
>> +#include "net/tap.h"
>> +
>> +#include "virtio-net.h"
>> +#include "vhost_net.h"
>> +
>> +#include "config.h"
>> +
>> +#ifdef CONFIG_VHOST_NET
>> +#include<sys/eventfd.h>
>> +#include<sys/socket.h>
>> +#include<linux/kvm.h>
>> +#include<fcntl.h>
>> +#include<sys/ioctl.h>
>> +#include<linux/virtio_ring.h>
>> +#include<netpacket/packet.h>
>> +#include<net/ethernet.h>
>> +#include<net/if.h>
>> +#include<netinet/in.h>
>> +
>> +#include<stdio.h>
>> +
>> +#include "vhost.h"
>> +
>> +struct vhost_net {
>>    
>
>
> VHostNetState.
>
>> +    struct vhost_dev dev;
>> +    struct vhost_virtqueue vqs[2];
>> +    int backend;
>> +    VLANClientState *vc;
>> +};
>> +
>> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
>> +{
>> +    /* Clear features not supported by host kernel. */
>> +    if (!(net->dev.features&  (1<<  VIRTIO_F_NOTIFY_ON_EMPTY)))
>> +        features&= ~(1<<  VIRTIO_F_NOTIFY_ON_EMPTY);
>> +    if (!(net->dev.features&  (1<<  VIRTIO_RING_F_INDIRECT_DESC)))
>> +        features&= ~(1<<  VIRTIO_RING_F_INDIRECT_DESC);
>> +    if (!(net->dev.features&  (1<<  VIRTIO_NET_F_MRG_RXBUF)))
>> +        features&= ~(1<<  VIRTIO_NET_F_MRG_RXBUF);
>> +    return features;
>> +}
>> +
>> +void vhost_net_ack_features(struct vhost_net *net, unsigned features)
>> +{
>> +    net->dev.acked_features = net->dev.backend_features;
>> +    if (features&  (1<<  VIRTIO_F_NOTIFY_ON_EMPTY))
>> +        net->dev.acked_features |= (1<<  VIRTIO_F_NOTIFY_ON_EMPTY);
>> +    if (features&  (1<<  VIRTIO_RING_F_INDIRECT_DESC))
>> +        net->dev.acked_features |= (1<<  VIRTIO_RING_F_INDIRECT_DESC);
>> +}
>> +
>> +static int vhost_net_get_fd(VLANClientState *backend)
>> +{
>> +    switch (backend->info->type) {
>> +    case NET_CLIENT_TYPE_TAP:
>> +        return tap_get_fd(backend);
>> +    default:
>> +        fprintf(stderr, "vhost-net requires tap backend\n");
>> +        return -EBADFD;
>> +    }
>> +}
>> +
>> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
>> +{
>> +    int r;
>> +    struct vhost_net *net = qemu_malloc(sizeof *net);
>> +    if (!backend) {
>> +        fprintf(stderr, "vhost-net requires backend to be setup\n");
>> +        goto fail;
>> +    }
>> +    r = vhost_net_get_fd(backend);
>> +    if (r<  0)
>> +        goto fail;
>> +    net->vc = backend;
>> +    net->dev.backend_features = tap_has_vnet_hdr(backend) ? 0 :
>> +        (1<<  VHOST_NET_F_VIRTIO_NET_HDR);
>> +    net->backend = r;
>> +
>> +    r = vhost_dev_init(&net->dev, devfd);
>> +    if (r<  0)
>> +        goto fail;
>> +    if (~net->dev.features&  net->dev.backend_features) {
>> +        fprintf(stderr, "vhost lacks feature mask %llu for backend\n",
>> +                ~net->dev.features&  net->dev.backend_features);
>> +        vhost_dev_cleanup(&net->dev);
>> +        goto fail;
>> +    }
>> +
>> +    /* Set sane init value. Override when guest acks. */
>> +    vhost_net_ack_features(net, 0);
>> +    return net;
>> +fail:
>> +    qemu_free(net);
>> +    return NULL;
>> +}
>> +
>> +int vhost_net_start(struct vhost_net *net,
>> +                    VirtIODevice *dev)
>> +{
>> +    struct vhost_vring_file file = { };
>> +    int r;
>> +
>> +    net->dev.nvqs = 2;
>> +    net->dev.vqs = net->vqs;
>> +    r = vhost_dev_start(&net->dev, dev);
>> +    if (r<  0)
>> +        return r;
>> +
>> +    net->vc->info->poll(net->vc, false);
>> +    qemu_set_fd_handler(net->backend, NULL, NULL, NULL);
>> +    file.fd = net->backend;
>> +    for (file.index = 0; file.index<  net->dev.nvqs; ++file.index) {
>> +        r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
>> +        if (r<  0) {
>> +            r = -errno;
>> +            goto fail;
>> +        }
>> +    }
>> +    return 0;
>> +fail:
>> +    file.fd = -1;
>> +    while (--file.index>= 0) {
>> +        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
>> +        assert(r>= 0);
>> +    }
>> +    net->vc->info->poll(net->vc, true);
>> +    vhost_dev_stop(&net->dev, dev);
>> +    return r;
>> +}
>> +
>> +void vhost_net_stop(struct vhost_net *net,
>> +                    VirtIODevice *dev)
>> +{
>> +    struct vhost_vring_file file = { .fd = -1 };
>> +
>> +    for (file.index = 0; file.index<  net->dev.nvqs; ++file.index) {
>> +        int r = ioctl(net->dev.control, VHOST_NET_SET_BACKEND,&file);
>> +        assert(r>= 0);
>> +    }
>> +    net->vc->info->poll(net->vc, true);
>> +    vhost_dev_stop(&net->dev, dev);
>> +}
>> +
>> +void vhost_net_cleanup(struct vhost_net *net)
>> +{
>> +    vhost_dev_cleanup(&net->dev);
>> +    qemu_free(net);
>> +}
>> +#else
>>    
>
> If you're going this way, I'd suggest making static inlines in the  
> header file instead of polluting the C file.  It's more common to search  
> within a C file and having two declarations can get annoying.
>
> Regards,
>
> Anthony Liguori

The issue with inline is that this means that virtio net will depend on
target (need to be recompiled).  As it is, a single object can link with
vhost and non-vhost versions.

>> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
>> +{
>> +	return NULL;
>> +}
>> +
>> +int vhost_net_start(struct vhost_net *net,
>> +		    VirtIODevice *dev)
>> +{
>> +	return -ENOSYS;
>> +}
>> +void vhost_net_stop(struct vhost_net *net,
>> +		    VirtIODevice *dev)
>> +{
>> +}
>> +
>> +void vhost_net_cleanup(struct vhost_net *net)
>> +{
>> +}
>> +
>> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features)
>> +{
>> +	return features;
>> +}
>> +void vhost_net_ack_features(struct vhost_net *net, unsigned features)
>> +{
>> +}
>> +#endif
>> diff --git a/hw/vhost_net.h b/hw/vhost_net.h
>> new file mode 100644
>> index 0000000..2a10210
>> --- /dev/null
>> +++ b/hw/vhost_net.h
>> @@ -0,0 +1,20 @@
>> +#ifndef VHOST_NET_H
>> +#define VHOST_NET_H
>> +
>> +#include "net.h"
>> +
>> +struct vhost_net;
>> +
>> +struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd);
>> +
>> +int vhost_net_start(struct vhost_net *net,
>> +                    VirtIODevice *dev);
>> +void vhost_net_stop(struct vhost_net *net,
>> +                    VirtIODevice *dev);
>> +
>> +void vhost_net_cleanup(struct vhost_net *net);
>> +
>> +unsigned vhost_net_get_features(struct vhost_net *net, unsigned features);
>> +void vhost_net_ack_features(struct vhost_net *net, unsigned features);
>> +
>> +#endif
>>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-25 19:47   ` [Qemu-devel] " Anthony Liguori
@ 2010-02-26 14:51     ` Michael S. Tsirkin
  2010-02-26 15:23       ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-26 14:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Thu, Feb 25, 2010 at 01:47:27PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>> This adds vhost binary option to tap, to enable vhost net accelerator.
>> Default is off for now, we'll be able to make default on long term
>> when we know it's stable.
>>
>> vhostfd option can be used by management, to pass in the fd. Assigning
>> vhostfd implies vhost=on.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>    
>
> Since the thinking these days is that macvtap and tap is pretty much all  
> we'll ever need for vhost-net, perhaps we should revisit -net vhost vs.  
> -net tap,vhost=X?
>
> I think -net vhost,fd=X makes a lot more sense than -net  
> tap,vhost=on,vhostfd=X.
>
> Regards,
>
> Anthony Liguori

We'll have to duplicate all tap options.
I think long term we will just make vhost=on the default.
Users do not really care about vhost, it just makes tap
go fater. So promoting it to 1st class options is wrong IMO.

>> ---
>>   net.c           |    8 ++++++++
>>   net/tap.c       |   33 +++++++++++++++++++++++++++++++++
>>   qemu-options.hx |    4 +++-
>>   3 files changed, 44 insertions(+), 1 deletions(-)
>>
>> diff --git a/net.c b/net.c
>> index a1bf49f..d1e23f1 100644
>> --- a/net.c
>> +++ b/net.c
>> @@ -973,6 +973,14 @@ static const struct {
>>                   .name = "vnet_hdr",
>>                   .type = QEMU_OPT_BOOL,
>>                   .help = "enable the IFF_VNET_HDR flag on the tap interface"
>> +            }, {
>> +                .name = "vhost",
>> +                .type = QEMU_OPT_BOOL,
>> +                .help = "enable vhost-net network accelerator",
>> +            }, {
>> +                .name = "vhostfd",
>> +                .type = QEMU_OPT_STRING,
>> +                .help = "file descriptor of an already opened vhost net device",
>>               },
>>   #endif /* _WIN32 */
>>               { /* end of list */ }
>> diff --git a/net/tap.c b/net/tap.c
>> index fc59fd4..65797ef 100644
>> --- a/net/tap.c
>> +++ b/net/tap.c
>> @@ -41,6 +41,8 @@
>>
>>   #include "net/tap-linux.h"
>>
>> +#include "hw/vhost_net.h"
>> +
>>   /* Maximum GSO packet size (64k) plus plenty of room for
>>    * the ethernet and virtio_net headers
>>    */
>> @@ -57,6 +59,7 @@ typedef struct TAPState {
>>       unsigned int has_vnet_hdr : 1;
>>       unsigned int using_vnet_hdr : 1;
>>       unsigned int has_ufo: 1;
>> +    struct vhost_net *vhost_net;
>>   } TAPState;
>>
>>   static int launch_script(const char *setup_script, const char *ifname, int fd);
>> @@ -252,6 +255,10 @@ static void tap_cleanup(VLANClientState *nc)
>>   {
>>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
>>
>> +    if (s->vhost_net) {
>> +        vhost_net_cleanup(s->vhost_net);
>> +    }
>> +
>>       qemu_purge_queued_packets(nc);
>>
>>       if (s->down_script[0])
>> @@ -307,6 +314,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
>>       s->has_ufo = tap_probe_has_ufo(s->fd);
>>       tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
>>       tap_read_poll(s, 1);
>> +    s->vhost_net = NULL;
>>       return s;
>>   }
>>
>> @@ -456,5 +464,30 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan
>>           }
>>       }
>>
>> +    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
>> +        int vhostfd, r;
>> +        if (qemu_opt_get(opts, "vhostfd")) {
>> +            r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
>> +            if (r == -1) {
>> +                return -1;
>> +            }
>> +            vhostfd = r;
>> +        } else {
>> +            vhostfd = -1;
>> +        }
>> +        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
>> +        if (!s->vhost_net) {
>> +            qemu_error("vhost-net requested but could not be initialized\n");
>> +            return -1;
>> +        }
>> +    } else if (qemu_opt_get(opts, "vhostfd")) {
>> +        qemu_error("vhostfd= is not valid without vhost\n");
>> +        return -1;
>> +    }
>> +
>> +    if (vlan) {
>> +        vlan->nb_host_devs++;
>> +    }
>> +
>>       return 0;
>>   }
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index f53922f..1850906 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -879,7 +879,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>       "-net tap[,vlan=n][,name=str],ifname=name\n"
>>       "                connect the host TAP network interface to VLAN 'n'\n"
>>   #else
>> -    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off]\n"
>> +    "-net tap[,vlan=n][,name=str][,fd=h][,ifname=name][,script=file][,downscript=dfile][,sndbuf=nbytes][,vnet_hdr=on|off][,vhost=on|off][,vhostfd=h]\n"
>>       "                connect the host TAP network interface to VLAN 'n' and use the\n"
>>       "                network scripts 'file' (default=" DEFAULT_NETWORK_SCRIPT ")\n"
>>       "                and 'dfile' (default=" DEFAULT_NETWORK_DOWN_SCRIPT ")\n"
>> @@ -889,6 +889,8 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>       "                default of 'sndbuf=1048576' can be disabled using 'sndbuf=0')\n"
>>       "                use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
>>       "                use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
>> +    "                use vhost=on to enable experimental in kernel accelerator\n"
>> +    "                use 'vhostfd=h' to connect to an already opened vhost net device\n"
>>   #endif
>>       "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
>>       "                connect the vlan 'n' to another VLAN using a socket connection\n"
>>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields
  2010-02-25 18:49   ` Blue Swirl
@ 2010-02-26 14:53     ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-26 14:53 UTC (permalink / raw)
  To: Blue Swirl; +Cc: amit.shah, quintela, qemu-devel, kraxel

> Bug.

Ouch.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-26 14:38       ` Anthony Liguori
@ 2010-02-26 14:54         ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-26 14:54 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, kraxel, qemu-devel, Juan Quintela

On Fri, Feb 26, 2010 at 08:38:27AM -0600, Anthony Liguori wrote:
> On 02/26/2010 08:32 AM, Michael S. Tsirkin wrote:
>>> and
>>> --enable-vhost/--disable-vhost options.
>>>
>>>      
>> I don't really see why we need --enable-vhost/--disable-vhost.
>> Runtime flag is enough.
>>    
>
> So that packagers can disable features at build time that they don't  
> want to support.
>
> Regards,
>
> Anthony Liguori

Fair enough.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-26 14:49     ` Michael S. Tsirkin
@ 2010-02-26 15:18       ` Anthony Liguori
  2010-02-27 19:38         ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-26 15:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/26/2010 08:49 AM, Michael S. Tsirkin wrote:
>
> KVM code needs all kind of work-arounds for KVM specific issues.
> It also assumes that KVM is registered at startup, so it
> does not try to optimize finding slots.
>    

No, the slot mapping changes dynamically so KVM certainly needs to 
optimize this.

But the point is, why can't we keep a central list of slots somewhere 
that KVM and vhost-net can both use?  I'm not saying we use a common 
function to do this work, I'm saying qemu should maintain a proper slot 
list than anyone can access.

> I propose merging this as is, and then someone who has an idea
> how to do this better can come and unify the code.
>    

Like I said, this has been a huge source of very subtle bugs in the 
past.  I'm open to hearing what other people think, but I'm concerned 
that if we merge this code, we'll end up facing some nasty bugs that 
could easily be eliminated by just using the code in kvm-all that has 
already been tested rather extensively.

There really aren't that many work-arounds in the code BTW.  The work 
arounds just result in a couple of extra slots too so they shouldn't be 
a burden to vhost.

> Mine has no bugs, let's switch to it!
>
> Seriously, need to tread very carefully here.
> This is why I say: merge it, then look at how to reuse code.
>    

Once it's merged, there's no incentive to look at reusing code.  Again, 
I don't think this is a huge burden to vhost.  The two bits of code 
literally do exactly the same thing.  They just use different data 
structures that ultimately contain the same values.

>> C++ habits die hard :-)
>>      
>
> What's that about?
>    

'++i' is an odd thing to do in C in a for() loop.  We're not explicit 
about it in Coding Style but the vast majority of code just does 'i++'.

>>> +    vq->desc = cpu_physical_memory_map(a,&l, 0);
>>> +    if (!vq->desc || l != s) {
>>> +        r = -ENOMEM;
>>> +        goto fail_alloc;
>>> +    }
>>> +    s = l = offsetof(struct vring_avail, ring) +
>>> +        sizeof(u_int64_t) * vq->num;
>>> +    a = virtio_queue_get_avail(vdev, idx);
>>> +    vq->avail = cpu_physical_memory_map(a,&l, 0);
>>> +    if (!vq->avail || l != s) {
>>> +        r = -ENOMEM;
>>> +        goto fail_alloc;
>>> +    }
>>>
>>>        
>> You don't unmap avail/desc on failure.  map() may fail because the ring
>> cross MMIO memory and you run out of a bounce buffer.
>>
>> IMHO, it would be better to attempt to map the full ring at once and
>> then if that doesn't succeed, bail out.  You can still pass individual
>> pointers via vhost ioctls but within qemu, it's much easier to deal with
>> the whole ring at a time.
>>      
> + a = virtio_queue_get_desc(vdev, idx);
> I prefer to keep as much logic about ring layout as possible
> in virtio.c
>    

Well, the downside is that you need to deal with the error path and 
cleanup paths and it becomes more complicated.

>>> +    s = l = offsetof(struct vring_used, ring) +
>>> +        sizeof(struct vring_used_elem) * vq->num;
>>>
>>>        
>> This is unfortunate.  We redefine this structures in qemu to avoid
>> depending on Linux headers.
>>      
> And we should for e.g. windows portability.
>
>    
>>   But you're using the linux versions instead
>> of the qemu versions.  Is it really necessary for vhost.h to include
>> virtio.h?
>>      
> Yes. And anyway, vhost does not exist on non-linux systems so there
> is no issue IMO.
>    

Yeah, like I said, it's unfortunate because it means a read of vhost and 
a reader of virtio.c is likely to get confused.  I'm not saying there's 
an easy solution, it's just unfortunate.

>>> +    vq->used_phys = a = virtio_queue_get_used(vdev, idx);
>>> +    vq->used = cpu_physical_memory_map(a,&l, 1);
>>> +    if (!vq->used || l != s) {
>>> +        r = -ENOMEM;
>>> +        goto fail_alloc;
>>> +    }
>>> +
>>> +    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
>>> +    if (r<   0) {
>>> +        r = -errno;
>>> +        goto fail_alloc;
>>> +    }
>>> +    if (!vdev->binding->guest_notifier || !vdev->binding->host_notifier) {
>>> +        fprintf(stderr, "binding does not support irqfd/queuefd\n");
>>> +        r = -ENOSYS;
>>> +        goto fail_alloc;
>>> +    }
>>> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, true);
>>> +    if (r<   0) {
>>> +        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
>>> +        goto fail_guest_notifier;
>>> +    }
>>> +
>>> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, true);
>>> +    if (r<   0) {
>>> +        fprintf(stderr, "Error binding host notifier: %d\n", -r);
>>> +        goto fail_host_notifier;
>>> +    }
>>> +
>>> +    file.fd = event_notifier_get_fd(virtio_queue_host_notifier(q));
>>> +    r = ioctl(dev->control, VHOST_SET_VRING_KICK,&file);
>>> +    if (r) {
>>> +        goto fail_kick;
>>> +    }
>>> +
>>> +    file.fd = event_notifier_get_fd(virtio_queue_guest_notifier(q));
>>> +    r = ioctl(dev->control, VHOST_SET_VRING_CALL,&file);
>>> +    if (r) {
>>> +        goto fail_call;
>>> +    }
>>>
>>>        
>> This function would be a bit more reasonable if it were split into
>> sections FWIW.
>>      
> Not sure what do you mean here.
>    

Just a suggestion.  For instance, moving the setting up of the notifiers 
to a separate function would help with readability IMHO.

>
>> You never unmap() the mapped memory and you're cheating by assuming that
>> the virtio rings have a constant mapping for the life time of a guest.
>> That's not technically true.  My concern is that since a guest can
>> trigger remappings (by adjusting PCI mappings) badness can ensue.
>>      
> I do not know how this can happen. What do PCI mappings have to do with this?
> Please explain. If it can, vhost will need notification to update.
>    

If a guest modifies the bar for an MMIO region such that it happens to 
exist in RAM, while this is a bad thing for the guest to do, I don't 
think we do anything to stop it.  When the region gets remapped, the 
result will be that the mapping will change.

Within qemu, because we carry the qemu_mutex, we know that the mappings 
are fixed as long as we're in qemu.  We're very careful to assume that 
we don't rely on a mapping past when we drop the qemu_mutex.

With vhost, you register a slot table and update it whenever mappings 
change.  I think that's good enough for dealing with ram addresses.  But 
you pass the virtual address for the rings and assume those mappings 
never change.

I'm pretty sure a guest can cause those to change and I'm not 100% sure, 
but I think it's a potential source of exploits if you assume a 
mapping.  In the very least, a guest can trick vhost into writing to ram 
that it wouldn't normally write to.

>> If you're going this way, I'd suggest making static inlines in the
>> header file instead of polluting the C file.  It's more common to search
>> within a C file and having two declarations can get annoying.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> The issue with inline is that this means that virtio net will depend on
> target (need to be recompiled).  As it is, a single object can link with
> vhost and non-vhost versions.
>    

Fair enough.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-26 14:51     ` Michael S. Tsirkin
@ 2010-02-26 15:23       ` Anthony Liguori
  2010-02-27 19:44         ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-26 15:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/26/2010 08:51 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 25, 2010 at 01:47:27PM -0600, Anthony Liguori wrote:
>    
>> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>>      
>>> This adds vhost binary option to tap, to enable vhost net accelerator.
>>> Default is off for now, we'll be able to make default on long term
>>> when we know it's stable.
>>>
>>> vhostfd option can be used by management, to pass in the fd. Assigning
>>> vhostfd implies vhost=on.
>>>
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>
>>>        
>> Since the thinking these days is that macvtap and tap is pretty much all
>> we'll ever need for vhost-net, perhaps we should revisit -net vhost vs.
>> -net tap,vhost=X?
>>
>> I think -net vhost,fd=X makes a lot more sense than -net
>> tap,vhost=on,vhostfd=X.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> We'll have to duplicate all tap options.
> I think long term we will just make vhost=on the default.
>    

I don't think we can.  vhost only works when using KVM and it doesn't 
support all of the features of userspace virtio.  Since it's in upstream 
Linux without supporting all of the virtio-net features, it's something 
we're going to have to deal with for a long time.

Furthermore, vhost reduces a virtual machine's security.  It offers an 
impressive performance boost (particularly when dealing with 10gbit+ 
networking) but for a user that doesn't have such strong networking 
performance requirements, I think it's reasonable for them to not want 
to make a security trade off.

One reason I like -net vhost is that it's a much less obscure syntax and 
it's the sort of thing that is easy to tell users that they should use.  
I understand you're argument for -net tap if you assume vhost=on will 
become the default because that means that users never really have to be 
aware of vhost once it becomes the default.  But as I said above, I 
don't think it's reasonable to make it on by default with -net tap.

> Users do not really care about vhost, it just makes tap
> go fater. So promoting it to 1st class options is wrong IMO.
>    

User's should care about vhost because it impacts the features supported 
by the virtual machine and it has security ramifications.  It's a great 
feature and I think the most users will want to use it, but I do think 
it's something that users ought to be aware of.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-26 15:18       ` Anthony Liguori
@ 2010-02-27 19:38         ` Michael S. Tsirkin
  2010-02-28  1:59           ` Paul Brook
  2010-02-28 16:02           ` Anthony Liguori
  0 siblings, 2 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-27 19:38 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Fri, Feb 26, 2010 at 09:18:03AM -0600, Anthony Liguori wrote:
> On 02/26/2010 08:49 AM, Michael S. Tsirkin wrote:
>>
>> KVM code needs all kind of work-arounds for KVM specific issues.
>> It also assumes that KVM is registered at startup, so it
>> does not try to optimize finding slots.
>>    
>
> No, the slot mapping changes dynamically so KVM certainly needs to  
> optimize this.

Maybe, but it does not, KVM algorithms are n^2 or worse.

> But the point is, why can't we keep a central list of slots somewhere  
> that KVM and vhost-net can both use?  I'm not saying we use a common  
> function to do this work, I'm saying qemu should maintain a proper slot  
> list than anyone can access.
>
>> I propose merging this as is, and then someone who has an idea
>> how to do this better can come and unify the code.
>>    
>
> Like I said, this has been a huge source of very subtle bugs in the  
> past.  I'm open to hearing what other people think, but I'm concerned  
> that if we merge this code, we'll end up facing some nasty bugs that  
> could easily be eliminated by just using the code in kvm-all that has  
> already been tested rather extensively.
>
> There really aren't that many work-arounds in the code BTW.  The work  
> arounds just result in a couple of extra slots too so they shouldn't be  
> a burden to vhost.
>
>> Mine has no bugs, let's switch to it!
>>
>> Seriously, need to tread very carefully here.
>> This is why I say: merge it, then look at how to reuse code.
>>    
>
> Once it's merged, there's no incentive to look at reusing code.
> Again, I don't think this is a huge burden to vhost.  The two bits of code  
> literally do exactly the same thing.  They just use different data  
> structures that ultimately contain the same values.

Not exactly. For example kvm track ROM and video ram addresses.

>>> C++ habits die hard :-)
>>>      
>>
>> What's that about?
>>    
>
> '++i' is an odd thing to do in C in a for() loop.  We're not explicit
> about it in Coding Style but the vast majority of code just does
> 'i++'.

Ugh. Do we really need to specify every little thing?

>>>> +    vq->desc = cpu_physical_memory_map(a,&l, 0);
>>>> +    if (!vq->desc || l != s) {
>>>> +        r = -ENOMEM;
>>>> +        goto fail_alloc;
>>>> +    }
>>>> +    s = l = offsetof(struct vring_avail, ring) +
>>>> +        sizeof(u_int64_t) * vq->num;
>>>> +    a = virtio_queue_get_avail(vdev, idx);
>>>> +    vq->avail = cpu_physical_memory_map(a,&l, 0);
>>>> +    if (!vq->avail || l != s) {
>>>> +        r = -ENOMEM;
>>>> +        goto fail_alloc;
>>>> +    }
>>>>
>>>>        
>>> You don't unmap avail/desc on failure.  map() may fail because the ring
>>> cross MMIO memory and you run out of a bounce buffer.
>>>
>>> IMHO, it would be better to attempt to map the full ring at once and
>>> then if that doesn't succeed, bail out.  You can still pass individual
>>> pointers via vhost ioctls but within qemu, it's much easier to deal with
>>> the whole ring at a time.
>>>      
>> + a = virtio_queue_get_desc(vdev, idx);
>> I prefer to keep as much logic about ring layout as possible
>> in virtio.c
>>    
>
> Well, the downside is that you need to deal with the error path and  
> cleanup paths and it becomes more complicated.
>
>>>> +    s = l = offsetof(struct vring_used, ring) +
>>>> +        sizeof(struct vring_used_elem) * vq->num;
>>>>
>>>>        
>>> This is unfortunate.  We redefine this structures in qemu to avoid
>>> depending on Linux headers.
>>>      
>> And we should for e.g. windows portability.
>>
>>    
>>>   But you're using the linux versions instead
>>> of the qemu versions.  Is it really necessary for vhost.h to include
>>> virtio.h?
>>>      
>> Yes. And anyway, vhost does not exist on non-linux systems so there
>> is no issue IMO.
>>    
>
> Yeah, like I said, it's unfortunate because it means a read of vhost and  
> a reader of virtio.c is likely to get confused.  I'm not saying there's  
> an easy solution, it's just unfortunate.
>
>>>> +    vq->used_phys = a = virtio_queue_get_used(vdev, idx);
>>>> +    vq->used = cpu_physical_memory_map(a,&l, 1);
>>>> +    if (!vq->used || l != s) {
>>>> +        r = -ENOMEM;
>>>> +        goto fail_alloc;
>>>> +    }
>>>> +
>>>> +    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
>>>> +    if (r<   0) {
>>>> +        r = -errno;
>>>> +        goto fail_alloc;
>>>> +    }
>>>> +    if (!vdev->binding->guest_notifier || !vdev->binding->host_notifier) {
>>>> +        fprintf(stderr, "binding does not support irqfd/queuefd\n");
>>>> +        r = -ENOSYS;
>>>> +        goto fail_alloc;
>>>> +    }
>>>> +    r = vdev->binding->guest_notifier(vdev->binding_opaque, idx, true);
>>>> +    if (r<   0) {
>>>> +        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
>>>> +        goto fail_guest_notifier;
>>>> +    }
>>>> +
>>>> +    r = vdev->binding->host_notifier(vdev->binding_opaque, idx, true);
>>>> +    if (r<   0) {
>>>> +        fprintf(stderr, "Error binding host notifier: %d\n", -r);
>>>> +        goto fail_host_notifier;
>>>> +    }
>>>> +
>>>> +    file.fd = event_notifier_get_fd(virtio_queue_host_notifier(q));
>>>> +    r = ioctl(dev->control, VHOST_SET_VRING_KICK,&file);
>>>> +    if (r) {
>>>> +        goto fail_kick;
>>>> +    }
>>>> +
>>>> +    file.fd = event_notifier_get_fd(virtio_queue_guest_notifier(q));
>>>> +    r = ioctl(dev->control, VHOST_SET_VRING_CALL,&file);
>>>> +    if (r) {
>>>> +        goto fail_call;
>>>> +    }
>>>>
>>>>        
>>> This function would be a bit more reasonable if it were split into
>>> sections FWIW.
>>>      
>> Not sure what do you mean here.
>>    
>
> Just a suggestion.  For instance, moving the setting up of the notifiers  
> to a separate function would help with readability IMHO.


Hmm. I'll look into it.
I actually think that for functions that just do a list of things
unconditionally, without branches or loops, or with just error handling
as here, it is perfectly fine for them to be of any length.

>>
>>> You never unmap() the mapped memory and you're cheating by assuming that
>>> the virtio rings have a constant mapping for the life time of a guest.
>>> That's not technically true.  My concern is that since a guest can
>>> trigger remappings (by adjusting PCI mappings) badness can ensue.
>>>      
>> I do not know how this can happen. What do PCI mappings have to do with this?
>> Please explain. If it can, vhost will need notification to update.
>>    
>
> If a guest modifies the bar for an MMIO region such that it happens to  
> exist in RAM, while this is a bad thing for the guest to do, I don't  
> think we do anything to stop it.  When the region gets remapped, the  
> result will be that the mapping will change.

So IMO this is the bug. If there's a BAR that matches RAM
physical address, it should never get mapped. Any idea how
to check this?

> Within qemu, because we carry the qemu_mutex, we know that the mappings  
> are fixed as long as we're in qemu.  We're very careful to assume that  
> we don't rely on a mapping past when we drop the qemu_mutex.
>
> With vhost, you register a slot table and update it whenever mappings  
> change.  I think that's good enough for dealing with ram addresses.  But  
> you pass the virtual address for the rings and assume those mappings  
> never change.

So, the issue IMO is that an MMIO address gets passed instead of RAM.
There's no reason to put virtio rings not in RAM, we just need to
verify this.

> 
> I'm pretty sure a guest can cause those to change and I'm not 100% sure,  
> but I think it's a potential source of exploits if you assume a mapping.  
> In the very least, a guest can trick vhost into writing to ram that it 
> wouldn't normally write to.

This seems harmless. guest can write anywhere in ram, anyway.

>>> If you're going this way, I'd suggest making static inlines in the
>>> header file instead of polluting the C file.  It's more common to search
>>> within a C file and having two declarations can get annoying.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>      
>> The issue with inline is that this means that virtio net will depend on
>> target (need to be recompiled).  As it is, a single object can link with
>> vhost and non-vhost versions.
>>    
>
> Fair enough.
>
> Regards,
>
> Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-26 15:23       ` Anthony Liguori
@ 2010-02-27 19:44         ` Michael S. Tsirkin
  2010-02-28 16:08           ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-27 19:44 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Fri, Feb 26, 2010 at 09:23:01AM -0600, Anthony Liguori wrote:
> On 02/26/2010 08:51 AM, Michael S. Tsirkin wrote:
>> On Thu, Feb 25, 2010 at 01:47:27PM -0600, Anthony Liguori wrote:
>>    
>>> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>>>      
>>>> This adds vhost binary option to tap, to enable vhost net accelerator.
>>>> Default is off for now, we'll be able to make default on long term
>>>> when we know it's stable.
>>>>
>>>> vhostfd option can be used by management, to pass in the fd. Assigning
>>>> vhostfd implies vhost=on.
>>>>
>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>
>>>>        
>>> Since the thinking these days is that macvtap and tap is pretty much all
>>> we'll ever need for vhost-net, perhaps we should revisit -net vhost vs.
>>> -net tap,vhost=X?
>>>
>>> I think -net vhost,fd=X makes a lot more sense than -net
>>> tap,vhost=on,vhostfd=X.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>      
>> We'll have to duplicate all tap options.
>> I think long term we will just make vhost=on the default.
>>    
>
> I don't think we can.  vhost only works when using KVM

Yes, default to on with KVM.

> and it doesn't  
> support all of the features of userspace virtio.  Since it's in upstream  
> Linux without supporting all of the virtio-net features, it's something  
> we're going to have to deal with for a long time.

Speaking of vlan filtering etc?  It's just a matter of time before it
supports all interesting features. Kernel support is there in net-next
already, userspace should be easy too. I should be able to code it up
once I finish bothering about upstream merge (hint hint :)).

> Furthermore, vhost reduces a virtual machine's security.  It offers an  
> impressive performance boost (particularly when dealing with 10gbit+  
> networking) but for a user that doesn't have such strong networking  
> performance requirements, I think it's reasonable for them to not want  
> to make a security trade off.

It's hard for me to see how it reduces VM security. If it does, it's
not by design and will be fixed.

> One reason I like -net vhost is that it's a much less obscure syntax and  
> it's the sort of thing that is easy to tell users that they should use.   
> I understand you're argument for -net tap if you assume vhost=on will  
> become the default because that means that users never really have to be  
> aware of vhost once it becomes the default.  But as I said above, I  
> don't think it's reasonable to make it on by default with -net tap.

Not yet, but we'll get there.

>> Users do not really care about vhost, it just makes tap
>> go fater. So promoting it to 1st class options is wrong IMO.
>>    
>
> User's should care about vhost because it impacts the features supported  
> by the virtual machine and it has security ramifications.  It's a great  
> feature and I think the most users will want to use it, but I do think  
> it's something that users ought to be aware of.
>
> Regards,
>
> Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-27 19:38         ` Michael S. Tsirkin
@ 2010-02-28  1:59           ` Paul Brook
  2010-02-28 10:15             ` Michael S. Tsirkin
  2010-02-28 16:02           ` Anthony Liguori
  1 sibling, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-02-28  1:59 UTC (permalink / raw)
  To: qemu-devel; +Cc: amit.shah, quintela, kraxel, Michael S. Tsirkin

> > I'm pretty sure a guest can cause those to change and I'm not 100%
> > sure,   but I think it's a potential source of exploits if you assume a
> > mapping. In the very least, a guest can trick vhost into writing to ram
> > that it wouldn't normally write to.
> 
> This seems harmless. guest can write anywhere in ram, anyway.

Surely writing to the wrong address is always a fatal flaw.  There certainly 
exist machines that can change physical RAM mapping.  While I wouldn't expect 
this to happen during normal operation, it could occur between a (virtio-
aware) bootloader/BIOS and real kernel.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-28  1:59           ` Paul Brook
@ 2010-02-28 10:15             ` Michael S. Tsirkin
  2010-02-28 12:45               ` Paul Brook
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 10:15 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, kraxel, qemu-devel, quintela

On Sun, Feb 28, 2010 at 01:59:27AM +0000, Paul Brook wrote:
> > > I'm pretty sure a guest can cause those to change and I'm not 100%
> > > sure,   but I think it's a potential source of exploits if you assume a
> > > mapping. In the very least, a guest can trick vhost into writing to ram
> > > that it wouldn't normally write to.
> > 
> > This seems harmless. guest can write anywhere in ram, anyway.
> 
> Surely writing to the wrong address is always a fatal flaw.

If guest does an illegal operation, it can corrupt its own memory.
This is the case with physical devices as well.

>  There certainly 
> exist machines that can change physical RAM mapping.

I am talking about mapping between phy RAM offset and qemu virt address.
When can it change without RAM in question going away?

> While I wouldn't expect 
> this to happen during normal operation, it could occur between a (virtio-
> aware) bootloader/BIOS and real kernel.
> 
> Paul

Should not matter for vhost, it is only active if driver is active ...

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-28 10:15             ` Michael S. Tsirkin
@ 2010-02-28 12:45               ` Paul Brook
  2010-02-28 14:44                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-02-28 12:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: amit.shah, quintela, kraxel, Michael S. Tsirkin

> >  There certainly
> > exist machines that can change physical RAM mapping.
> 
> I am talking about mapping between phy RAM offset and qemu virt address.
> When can it change without RAM in question going away?

RAM offset or guest physical address? The two are very different.
Some machines have chip selects that allow given physical address range to be 
mapped to different banks of ram.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-28 12:45               ` Paul Brook
@ 2010-02-28 14:44                 ` Michael S. Tsirkin
  2010-02-28 15:23                   ` Paul Brook
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 14:44 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Sun, Feb 28, 2010 at 12:45:07PM +0000, Paul Brook wrote:
> > >  There certainly
> > > exist machines that can change physical RAM mapping.
> > 
> > I am talking about mapping between phy RAM offset and qemu virt address.
> > When can it change without RAM in question going away?
> 
> RAM offset or guest physical address? The two are very different.

RAM offset. virtio only cares about where the rings are.

> Some machines have chip selects that allow given physical address range to be 
> mapped to different banks of ram.
> 
> Paul

So guest can cause vhost to write to a wrong place in RAM, but it can
just pass a wrong address directly.  As long as vhost does not access a
non-RAM address, we are definitely fine.

-- 
MST

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-28 14:44                 ` Michael S. Tsirkin
@ 2010-02-28 15:23                   ` Paul Brook
  2010-02-28 15:37                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-02-28 15:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

> So guest can cause vhost to write to a wrong place in RAM, but it can
> just pass a wrong address directly.  

That's not the point. Obviously any DMA capable device can be used to 
compromise a system. However if a device writes to address B after being told 
to write to address A, then you have a completely broken system.

> As long as vhost does not access a
> non-RAM address, we are definitely fine.

Why does it matter what it's changed to? The virtio DMA addresses guest 
physical addresses. If guest physical address mappings change then the virtio 
device must respect those changes. The extreme case is a system with an IOMMU 
(not currently implemented in QEMU). In that case it's likely that physical-
RAM mappings will change frequently.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-28 15:23                   ` Paul Brook
@ 2010-02-28 15:37                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 15:37 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Sun, Feb 28, 2010 at 03:23:06PM +0000, Paul Brook wrote:
> > So guest can cause vhost to write to a wrong place in RAM, but it can
> > just pass a wrong address directly.  
> 
> That's not the point. Obviously any DMA capable device can be used to 
> compromise a system. However if a device writes to address B after being told 
> to write to address A, then you have a completely broken system.

Yes, but I do not see how this can happen with vhost backed to virtio.

> > As long as vhost does not access a
> > non-RAM address, we are definitely fine.
> 
> Why does it matter what it's changed to? The virtio DMA addresses guest 
> physical addresses. If guest physical address mappings change then the virtio 
> device must respect those changes. The extreme case is a system with an IOMMU 
> (not currently implemented in QEMU). In that case it's likely that physical-
> RAM mappings will change frequently.
> 
> Paul

Yes, but this is already supported. The one thing that my patches assume
does not change while device is active, is physical to qemu virtual
mapping for virtio ring.

Since virtio device is allowed to access the ring at any time,
such changes would only legal when device is not active IMO,
and my code translates physical to virtual when device is
made active.

So I do not see a bug.


-- 
MST

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 09/12] vhost: vhost net support
  2010-02-27 19:38         ` Michael S. Tsirkin
  2010-02-28  1:59           ` Paul Brook
@ 2010-02-28 16:02           ` Anthony Liguori
  1 sibling, 0 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-02-28 16:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/27/2010 01:38 PM, Michael S. Tsirkin wrote:
> On Fri, Feb 26, 2010 at 09:18:03AM -0600, Anthony Liguori wrote:
>    
>> On 02/26/2010 08:49 AM, Michael S. Tsirkin wrote:
>>      
>>> KVM code needs all kind of work-arounds for KVM specific issues.
>>> It also assumes that KVM is registered at startup, so it
>>> does not try to optimize finding slots.
>>>
>>>        
>> No, the slot mapping changes dynamically so KVM certainly needs to
>> optimize this.
>>      
> Maybe, but it does not, KVM algorithms are n^2 or worse.
>    

But n is small and the mappings don't change frequently.

More importantly, they change at the exact same times for vhost as they 
do for kvm.  So even if vhost has an O(n) algorithm, the KVM code gets 
executed either immediately before or immediately after the vhost code 
so your optimizations are lost in KVM's O(n^2) algorithm.

>>> Mine has no bugs, let's switch to it!
>>>
>>> Seriously, need to tread very carefully here.
>>> This is why I say: merge it, then look at how to reuse code.
>>>
>>>        
>> Once it's merged, there's no incentive to look at reusing code.
>> Again, I don't think this is a huge burden to vhost.  The two bits of code
>> literally do exactly the same thing.  They just use different data
>> structures that ultimately contain the same values.
>>      
> Not exactly. For example kvm track ROM and video ram addresses.
>    

KVM treats ROM and RAM the same (it even maps ROM as RAM).  There is no 
special handling for video ram addresses.

There is some magic in the VGA code to switch the VGA LFB from mmio to 
ram when possible but that happens at a higher layer.

>> '++i' is an odd thing to do in C in a for() loop.  We're not explicit
>> about it in Coding Style but the vast majority of code just does
>> 'i++'.
>>      
> Ugh. Do we really need to specify every little thing?
>    

I don't care that much about coding style.  I don't care if there are 
curly brackets on single line ifs.

However, it's been made very clear to me that most other people do and 
that it's something that's important to enforce.

> Hmm. I'll look into it.
> I actually think that for functions that just do a list of things
> unconditionally, without branches or loops, or with just error handling
> as here, it is perfectly fine for them to be of any length.
>    

Like I said, just a suggestion.

>>>        
>>>> You never unmap() the mapped memory and you're cheating by assuming that
>>>> the virtio rings have a constant mapping for the life time of a guest.
>>>> That's not technically true.  My concern is that since a guest can
>>>> trigger remappings (by adjusting PCI mappings) badness can ensue.
>>>>
>>>>          
>>> I do not know how this can happen. What do PCI mappings have to do with this?
>>> Please explain. If it can, vhost will need notification to update.
>>>
>>>        
>> If a guest modifies the bar for an MMIO region such that it happens to
>> exist in RAM, while this is a bad thing for the guest to do, I don't
>> think we do anything to stop it.  When the region gets remapped, the
>> result will be that the mapping will change.
>>      
> So IMO this is the bug. If there's a BAR that matches RAM
> physical address, it should never get mapped. Any idea how
> to check this?
>    

We could check it when the BAR is mapped in the PCI layer.  I'm 
suspicious there are other ways a guest can enforce/determine mappings 
though.

Generally speaking, I think it's necessary to assume that a guest can 
manipulate memory mappings.  If we can prove that a guest cannot, it 
would definitely simplify the code a lot.  I'd love to make the same 
assumptions in virtio userspace before it's actually a big source of 
overhead.

I'm pretty sure though that we have to let a guest control mappings though.

>> Within qemu, because we carry the qemu_mutex, we know that the mappings
>> are fixed as long as we're in qemu.  We're very careful to assume that
>> we don't rely on a mapping past when we drop the qemu_mutex.
>>
>> With vhost, you register a slot table and update it whenever mappings
>> change.  I think that's good enough for dealing with ram addresses.  But
>> you pass the virtual address for the rings and assume those mappings
>> never change.
>>      
> So, the issue IMO is that an MMIO address gets passed instead of RAM.
> There's no reason to put virtio rings not in RAM, we just need to
> verify this.
>    

Yes, but we don't always map PCI IO regions as MMIO or PIO.  In 
particular, for VGA devices (particularly VMware VGA), we map certain IO 
regions as RAM because that's how the device is designed.  Likewise, if 
we do shared memory PCI devices using IO regions as the ram contents, we 
would be mapping those as ram too.

So just checking to see if the virtio ring area is RAM or not is not 
enough.  A guest may do something that causes a virtio ring to still be 
ram, but a different ram address.  Now the vhost code is writing to RAM 
that it thinks is physical address X but is really guest physical address Y.

This is not something that a guest can use to break into qemu, but it is 
an emulation bug and depending on the guest OS, it may be possible to 
use it to do a privilege escalation within the guest.

I think the only way to handle this is to explicitly check for changes 
in the physical addresses the rings are mapped at and do the appropriate 
ioctls to vhost to let it know if the ring's address has changed.

>> I'm pretty sure a guest can cause those to change and I'm not 100% sure,
>> but I think it's a potential source of exploits if you assume a mapping.
>> In the very least, a guest can trick vhost into writing to ram that it
>> wouldn't normally write to.
>>      
> This seems harmless. guest can write anywhere in ram, anyway.
>    

Not all guest code is created equal and if we're writing to the wrong 
guest ram location, it can potentially circumvent the guest's security 
architecture.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-27 19:44         ` Michael S. Tsirkin
@ 2010-02-28 16:08           ` Anthony Liguori
  2010-02-28 17:19             ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-02-28 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, qemu-devel, kraxel

On 02/27/2010 01:44 PM, Michael S. Tsirkin wrote:
>> and it doesn't
>> support all of the features of userspace virtio.  Since it's in upstream
>> Linux without supporting all of the virtio-net features, it's something
>> we're going to have to deal with for a long time.
>>      
> Speaking of vlan filtering etc?  It's just a matter of time before it
> supports all interesting features. Kernel support is there in net-next
> already, userspace should be easy too. I should be able to code it up
> once I finish bothering about upstream merge (hint hint :)).
>    

:-)  As I've said in the past, I'm willing to live with -net tap,vhost 
but I really think -net vhost would be better in the long run.

The only two real issues I have with the series is the ring address 
mapping stability and the duplicated slot management code.  Both have 
security implications so I think it's important that they be addressed.  
Otherwise, I'm pretty happy with how things are.

>> Furthermore, vhost reduces a virtual machine's security.  It offers an
>> impressive performance boost (particularly when dealing with 10gbit+
>> networking) but for a user that doesn't have such strong networking
>> performance requirements, I think it's reasonable for them to not want
>> to make a security trade off.
>>      
> It's hard for me to see how it reduces VM security. If it does, it's
> not by design and will be fixed.
>    

If you have a bug in vhost-net (would never happen of course) then it's 
a host-kernel exploit whereas if we have a bug in virtio-net userspace, 
it's a local user exploit.  We have a pretty robust architecture to deal 
with local user exploits (qemu can run unprivilieged, SELinux enforces 
mandatory access control) but a host-kernel can not be protected against.

I'm not saying that we should never put things in the kernel, but 
there's definitely a security vs. performance trade off here.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 16:08           ` Anthony Liguori
@ 2010-02-28 17:19             ` Michael S. Tsirkin
  2010-02-28 20:57               ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 17:19 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Sun, Feb 28, 2010 at 10:08:26AM -0600, Anthony Liguori wrote:
> On 02/27/2010 01:44 PM, Michael S. Tsirkin wrote:
>>> and it doesn't
>>> support all of the features of userspace virtio.  Since it's in upstream
>>> Linux without supporting all of the virtio-net features, it's something
>>> we're going to have to deal with for a long time.
>>>      
>> Speaking of vlan filtering etc?  It's just a matter of time before it
>> supports all interesting features. Kernel support is there in net-next
>> already, userspace should be easy too. I should be able to code it up
>> once I finish bothering about upstream merge (hint hint :)).
>>    
>
> :-)  As I've said in the past, I'm willing to live with -net tap,vhost  
> but I really think -net vhost would be better in the long run.
>
> The only two real issues I have with the series is the ring address  
> mapping stability

This one I do not yet understand completely to be able solve.  Is the
only case where PCI BAR overlays RAM?  I think this case is best dealt
with by disabling BAR mapping.



> and the duplicated slot management code.

If you look at qemu-kvm, it's even triplicated :) I just would like to
get the code merged, then work at adding more infrastructure to prettify
it.

> Both have  security implications so I think it's important that they
> be addressed.   Otherwise, I'm pretty happy with how things are.

Care suggesting some solutions?

>
>>> Furthermore, vhost reduces a virtual machine's security.  It offers an
>>> impressive performance boost (particularly when dealing with 10gbit+
>>> networking) but for a user that doesn't have such strong networking
>>> performance requirements, I think it's reasonable for them to not want
>>> to make a security trade off.
>>>      
>> It's hard for me to see how it reduces VM security. If it does, it's
>> not by design and will be fixed.
>>    
>
> If you have a bug in vhost-net (would never happen of course) then it's  
> a host-kernel exploit whereas if we have a bug in virtio-net userspace,  
> it's a local user exploit.  We have a pretty robust architecture to deal  
> with local user exploits (qemu can run unprivilieged, SELinux enforces  
> mandatory access control) but a host-kernel can not be protected against.
> 
> I'm not saying that we should never put things in the kernel, but  
> there's definitely a security vs. performance trade off here.
>
> Regards,
>
> Anthony Liguori

Not sure I get the argument completely. Any kernel service with a bug
might be exploited for priveledge escalation. Yes, more kernel code
gives you more attack surface, but given we use rich interfaces such as
ones exposed by kvm, I am not sure by how much.

Also note that vhost net does not take qemu out of the equation for
everything, just for datapath operations.

-- 
MST

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 03/12] notifier: event notifier implementation
  2010-02-25 19:22   ` [Qemu-devel] " Anthony Liguori
@ 2010-02-28 19:59     ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 19:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, kraxel, qemu-devel, quintela

On Thu, Feb 25, 2010 at 01:22:04PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>> event notifiers are slightly generalized eventfd descriptors. Current
>> implementation depends on eventfd because vhost is the only user, and
>> vhost depends on eventfd anyway, but a stub is provided for non-eventfd
>> case.
>>
>> We'll be able to further generalize this when another user comes along
>> and we see how to best do this.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>   Makefile.target |    1 +
>>   hw/notifier.c   |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/notifier.h   |   16 ++++++++++++++++
>>   qemu-common.h   |    1 +
>>   4 files changed, 68 insertions(+), 0 deletions(-)
>>   create mode 100644 hw/notifier.c
>>   create mode 100644 hw/notifier.h
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 4c4d397..c1580e9 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -173,6 +173,7 @@ obj-y = vl.o async.o monitor.o pci.o pci_host.o pcie_host.o machine.o gdbstub.o
>>   # virtio has to be here due to weird dependency between PCI and virtio-net.
>>   # need to fix this properly
>>   obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-pci.o virtio-serial-bus.o
>> +obj-y += notifier.o
>>   obj-y += rwhandler.o
>>   obj-$(CONFIG_KVM) += kvm.o kvm-all.o
>>   obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
>> diff --git a/hw/notifier.c b/hw/notifier.c
>> new file mode 100644
>> index 0000000..dff38de
>> --- /dev/null
>> +++ b/hw/notifier.c
>> @@ -0,0 +1,50 @@
>> +#include "hw.h"
>> +#include "notifier.h"
>> +#ifdef CONFIG_EVENTFD
>> +#include<sys/eventfd.h>
>> +#endif
>> +
>> +int event_notifier_init(EventNotifier *e, int active)
>> +{
>> +#ifdef CONFIG_EVENTFD
>> +	int fd = eventfd(!!active, EFD_NONBLOCK | EFD_CLOEXEC);
>> +	if (fd<  0)
>> +		return -errno;
>> +	e->fd = fd;
>> +	return 0;
>> +#else
>> +	return -ENOSYS;
>> +#endif
>> +}
>> +
>> +void event_notifier_cleanup(EventNotifier *e)
>> +{
>> +	close(e->fd);
>> +}
>> +
>> +int event_notifier_get_fd(EventNotifier *e)
>> +{
>> +	return e->fd;
>> +}
>> +
>> +int event_notifier_test_and_clear(EventNotifier *e)
>> +{
>> +	uint64_t value;
>> +	int r = read(e->fd,&value, sizeof value);
>> +	return r == sizeof value;
>>    
>
> Probably should handle EINTR, no?

No, nonblocking eventfd read never returns EINTR.


>> +}
>> +
>> +int event_notifier_test(EventNotifier *e)
>> +{
>> +	uint64_t value;
>> +	int r = read(e->fd,&value, sizeof value);
>>    
>
> Coding Style is not quite explicit here but we always use sizeof(value).
>
>> +	if (r == sizeof value) {
>> +		/* restore previous value. */
>> +		int s = write(e->fd,&value, sizeof value);
>> +		/* never blocks because we use EFD_SEMAPHORE.
>> +		 * If we didn't we'd get EAGAIN on overflow
>> +		 * and we'd have to write code to ignore it. */
>> +		assert(s == sizeof value);
>> +	}
>> +	return r == sizeof value;
>> +}
>> diff --git a/hw/notifier.h b/hw/notifier.h
>> new file mode 100644
>> index 0000000..24117ea
>> --- /dev/null
>> +++ b/hw/notifier.h
>> @@ -0,0 +1,16 @@
>>    
>
> Needs copyright/license.
>
> Thanks for doing this abstraction, I'm really happy with it over direct  
> eventfd usage.
>
> Regards,
>
> Anthony Liguori
>
>> +#ifndef QEMU_EVENT_NOTIFIER_H
>> +#define QEMU_EVENT_NOTIFIER_H
>> +
>> +#include "qemu-common.h"
>> +
>> +struct EventNotifier {
>> +	int fd;
>> +};
>> +
>> +int event_notifier_init(EventNotifier *, int active);
>> +void event_notifier_cleanup(EventNotifier *);
>> +int event_notifier_get_fd(EventNotifier *);
>> +int event_notifier_test_and_clear(EventNotifier *);
>> +int event_notifier_test(EventNotifier *);
>> +
>> +#endif
>> diff --git a/qemu-common.h b/qemu-common.h
>> index 805be1a..f12a8f5 100644
>> --- a/qemu-common.h
>> +++ b/qemu-common.h
>> @@ -227,6 +227,7 @@ typedef struct uWireSlave uWireSlave;
>>   typedef struct I2SCodec I2SCodec;
>>   typedef struct DeviceState DeviceState;
>>   typedef struct SSIBus SSIBus;
>> +typedef struct EventNotifier EventNotifier;
>>
>>   typedef uint64_t pcibus_t;
>>
>>    
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 08/12] virtio-pci: fill in notifier support
  2010-02-25 19:30   ` [Qemu-devel] " Anthony Liguori
@ 2010-02-28 20:02     ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 20:02 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, kraxel, qemu-devel, quintela

On Thu, Feb 25, 2010 at 01:30:40PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>> Support host/guest notifiers in virtio-pci.
>> The last one only with kvm, that's okay
>> because vhost relies on kvm anyway.
>>
>> Note on kvm usage: kvm ioeventfd API
>> is implemented on non-kvm systems as well,
>> this is the reason we don't need if (kvm_enabled())
>> around it.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>   hw/virtio-pci.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 files changed, 62 insertions(+), 0 deletions(-)
>>
>> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
>> index 006ff38..3f1214c 100644
>> --- a/hw/virtio-pci.c
>> +++ b/hw/virtio-pci.c
>> @@ -24,6 +24,7 @@
>>   #include "net.h"
>>   #include "block_int.h"
>>   #include "loader.h"
>> +#include "kvm.h"
>>
>>   /* from Linux's linux/virtio_pci.h */
>>
>> @@ -398,6 +399,65 @@ static unsigned virtio_pci_get_features(void *opaque)
>>       return proxy->host_features;
>>   }
>>
>> +static void virtio_pci_guest_notifier_read(void *opaque)
>> +{
>> +    VirtQueue *vq = opaque;
>> +    EventNotifier *n = virtio_queue_guest_notifier(vq);
>> +    if (event_notifier_test_and_clear(n)) {
>> +        virtio_irq(vq);
>> +    }
>> +}
>> +
>> +static int virtio_pci_guest_notifier(void *opaque, int n, bool assign)
>> +{
>> +    VirtIOPCIProxy *proxy = opaque;
>> +    VirtQueue *vq = virtio_queue(proxy->vdev, n);
>> +    EventNotifier *notifier = virtio_queue_guest_notifier(vq);
>> +
>> +    if (assign) {
>> +        int r = event_notifier_init(notifier, 0);
>> +	if (r<  0)
>> +		return r;
>> +        qemu_set_fd_handler(event_notifier_get_fd(notifier),
>> +                            virtio_pci_guest_notifier_read, NULL, vq);
>>    
>
> While not super important, it would be nice to have this a bit more  
> common.  IOW:
>
>     r = read_event_notifier_init(notifier,  
> virtio_pci_guest_notifier_read, vq);
>
> and:
>
>     r = kvm_eventfd_notifier_init(notifier, proxy->addr +  
> VIRTIO_PCI_QUEUE_NOTIFY, n, assign);
>
> Regards,
>
> Anthony Liguori


Hmm. Note that on qemu-kvm we also have interrupts using notifiers.
Possibly it's best to merge this, merge irqchip from qemu-kvm,
then think about the best API.

-- 
MST

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 17:19             ` Michael S. Tsirkin
@ 2010-02-28 20:57               ` Anthony Liguori
  2010-02-28 21:01                 ` Michael S. Tsirkin
                                   ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-02-28 20:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, Paul Brook, quintela, qemu-devel, kraxel

On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
>> Both have  security implications so I think it's important that they
>> be addressed.   Otherwise, I'm pretty happy with how things are.
>>      
> Care suggesting some solutions?
>    

The obvious thing to do would be to use the memory notifier in vhost to 
keep track of whenever something remaps the ring's memory region and if 
that happens, issue an ioctl to vhost to change the location of the 
ring.  Also, you would need to merge the vhost slot management code with 
the KVM slot management code.

I'm sympathetic to your arguments though.  As qemu is today, the above 
is definitely the right thing to do.  But ram is always ram and ram 
always has a fixed (albeit non-linear) mapping within a guest.  We can 
probably be smarter in qemu.

There are areas of MMIO/ROM address space that *sometimes* end up 
behaving like ram, but that's a different case.  The one other case to 
consider is ram hot add/remove in which case, ram may be removed or 
added (but it's location will never change during its lifetime).

Here's what I'll propose, and I'd really like to hear what Paul think 
about it before we start down this path.

I think we should add a new API that's:

void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);

This API would do two things.  It would call qemu_ram_alloc() and 
cpu_register_physical_memory() just as code does today.  It would also 
add this region into a new table.

There would be:

void *cpu_ram_map(target_phys_addr_t start, ram_addr_t *size);
void cpu_ram_unmap(void *mem);

These calls would use this new table to lookup ram addresses.  These 
mappings are valid as long as the guest is executed.  Within the table, 
each region would have a reference count.  When it comes time to do hot 
add/remove, we would wait to remove a region until the reference count 
went to zero to avoid unmapping during DMA.

cpu_ram_add() never gets called with overlapping regions.  We'll modify 
cpu_register_physical_memory() to ensure that a ram mapping is never 
changed after initial registration.

vhost no longer needs to bother keeping the dynamic table up to date so 
it removes all of the slot management code from vhost.  KVM still needs 
the code to handle rom/ram mappings but we can take care of that next.  
virtio-net's userspace code can do the same thing as vhost and only map 
the ring once which should be a big performance improvement.

It also introduces a place to do madvise() reset registrations.

This is definitely appropriate for target-i386.  I suspect it is for 
other architectures too.

Regards,

Anthony Liguori

>>      
>>>> Furthermore, vhost reduces a virtual machine's security.  It offers an
>>>> impressive performance boost (particularly when dealing with 10gbit+
>>>> networking) but for a user that doesn't have such strong networking
>>>> performance requirements, I think it's reasonable for them to not want
>>>> to make a security trade off.
>>>>
>>>>          
>>> It's hard for me to see how it reduces VM security. If it does, it's
>>> not by design and will be fixed.
>>>
>>>        
>> If you have a bug in vhost-net (would never happen of course) then it's
>> a host-kernel exploit whereas if we have a bug in virtio-net userspace,
>> it's a local user exploit.  We have a pretty robust architecture to deal
>> with local user exploits (qemu can run unprivilieged, SELinux enforces
>> mandatory access control) but a host-kernel can not be protected against.
>>
>> I'm not saying that we should never put things in the kernel, but
>> there's definitely a security vs. performance trade off here.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> Not sure I get the argument completely. Any kernel service with a bug
> might be exploited for priveledge escalation. Yes, more kernel code
> gives you more attack surface, but given we use rich interfaces such as
> ones exposed by kvm, I am not sure by how much.
>
> Also note that vhost net does not take qemu out of the equation for
> everything, just for datapath operations.
>
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 20:57               ` Anthony Liguori
@ 2010-02-28 21:01                 ` Michael S. Tsirkin
  2010-02-28 22:38                   ` Anthony Liguori
  2010-02-28 22:39                 ` Paul Brook
  2010-03-02 16:12                 ` Marcelo Tosatti
  2 siblings, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-02-28 21:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, Paul Brook, quintela, qemu-devel, kraxel

On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
> On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
>>> Both have  security implications so I think it's important that they
>>> be addressed.   Otherwise, I'm pretty happy with how things are.
>>>      
>> Care suggesting some solutions?
>>    
>
> The obvious thing to do would be to use the memory notifier in vhost to  
> keep track of whenever something remaps the ring's memory region and if  
> that happens, issue an ioctl to vhost to change the location of the  
> ring.

It would be easy to do, but what I wondered about, is what happens in the
guest meanwhile. Which ring address has the correct descriptors: the old
one?  The new one? Both?  This question leads me to the belief that well-behaved
guest will never encounter this.

>  Also, you would need to merge the vhost slot management code with  
> the KVM slot management code.
>
> I'm sympathetic to your arguments though.  As qemu is today, the above  
> is definitely the right thing to do.  But ram is always ram and ram  
> always has a fixed (albeit non-linear) mapping within a guest.  We can  
> probably be smarter in qemu.
>
> There are areas of MMIO/ROM address space that *sometimes* end up  
> behaving like ram, but that's a different case.  The one other case to  
> consider is ram hot add/remove in which case, ram may be removed or  
> added (but it's location will never change during its lifetime).
>
> Here's what I'll propose, and I'd really like to hear what Paul think  
> about it before we start down this path.
>
> I think we should add a new API that's:
>
> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
>
> This API would do two things.  It would call qemu_ram_alloc() and  
> cpu_register_physical_memory() just as code does today.  It would also  
> add this region into a new table.
>
> There would be:
>
> void *cpu_ram_map(target_phys_addr_t start, ram_addr_t *size);
> void cpu_ram_unmap(void *mem);
>
> These calls would use this new table to lookup ram addresses.  These  
> mappings are valid as long as the guest is executed.  Within the table,  
> each region would have a reference count.  When it comes time to do hot  
> add/remove, we would wait to remove a region until the reference count  
> went to zero to avoid unmapping during DMA.
>
> cpu_ram_add() never gets called with overlapping regions.  We'll modify  
> cpu_register_physical_memory() to ensure that a ram mapping is never  
> changed after initial registration.
>
> vhost no longer needs to bother keeping the dynamic table up to date so  
> it removes all of the slot management code from vhost.  KVM still needs  
> the code to handle rom/ram mappings but we can take care of that next.   
> virtio-net's userspace code can do the same thing as vhost and only map  
> the ring once which should be a big performance improvement.
>
> It also introduces a place to do madvise() reset registrations.
>
> This is definitely appropriate for target-i386.  I suspect it is for  
> other architectures too.
>
> Regards,
>
> Anthony Liguori
>
>>>      
>>>>> Furthermore, vhost reduces a virtual machine's security.  It offers an
>>>>> impressive performance boost (particularly when dealing with 10gbit+
>>>>> networking) but for a user that doesn't have such strong networking
>>>>> performance requirements, I think it's reasonable for them to not want
>>>>> to make a security trade off.
>>>>>
>>>>>          
>>>> It's hard for me to see how it reduces VM security. If it does, it's
>>>> not by design and will be fixed.
>>>>
>>>>        
>>> If you have a bug in vhost-net (would never happen of course) then it's
>>> a host-kernel exploit whereas if we have a bug in virtio-net userspace,
>>> it's a local user exploit.  We have a pretty robust architecture to deal
>>> with local user exploits (qemu can run unprivilieged, SELinux enforces
>>> mandatory access control) but a host-kernel can not be protected against.
>>>
>>> I'm not saying that we should never put things in the kernel, but
>>> there's definitely a security vs. performance trade off here.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>      
>> Not sure I get the argument completely. Any kernel service with a bug
>> might be exploited for priveledge escalation. Yes, more kernel code
>> gives you more attack surface, but given we use rich interfaces such as
>> ones exposed by kvm, I am not sure by how much.
>>
>> Also note that vhost net does not take qemu out of the equation for
>> everything, just for datapath operations.
>>
>>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 21:01                 ` Michael S. Tsirkin
@ 2010-02-28 22:38                   ` Anthony Liguori
  0 siblings, 0 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-02-28 22:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, Paul Brook, quintela, qemu-devel, kraxel

On 02/28/2010 03:01 PM, Michael S. Tsirkin wrote:
> On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
>    
>> On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
>>      
>>>> Both have  security implications so I think it's important that they
>>>> be addressed.   Otherwise, I'm pretty happy with how things are.
>>>>
>>>>          
>>> Care suggesting some solutions?
>>>
>>>        
>> The obvious thing to do would be to use the memory notifier in vhost to
>> keep track of whenever something remaps the ring's memory region and if
>> that happens, issue an ioctl to vhost to change the location of the
>> ring.
>>      
> It would be easy to do, but what I wondered about, is what happens in the
> guest meanwhile. Which ring address has the correct descriptors: the old
> one?  The new one? Both?  This question leads me to the belief that well-behaved
> guest will never encounter this.
>    

This is not a question of well-behaved guests.  It's a question about 
what our behaviour is in the face of a malicious guest.  While I agree 
with you that that behaviour can be undefined, writing to an invalid ram 
location I believe could lead to guest privilege escalation.

I think the two solutions we could implement would be to always use the 
latest mapping (which is what all code does today) or to actively 
prevent ram from being remapped (which is my proposal below).

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 20:57               ` Anthony Liguori
  2010-02-28 21:01                 ` Michael S. Tsirkin
@ 2010-02-28 22:39                 ` Paul Brook
  2010-03-01 19:27                   ` Michael S. Tsirkin
  2010-03-02 14:07                   ` Anthony Liguori
  2010-03-02 16:12                 ` Marcelo Tosatti
  2 siblings, 2 replies; 70+ messages in thread
From: Paul Brook @ 2010-02-28 22:39 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

> I'm sympathetic to your arguments though.  As qemu is today, the above
> is definitely the right thing to do.  But ram is always ram and ram
> always has a fixed (albeit non-linear) mapping within a guest.

I think this assumption is unsafe. There are machines where RAM mappings can 
change. It's not uncommon for a chip select (i.e. physical memory address 
region) to be switchable to several different sources, one of which may be 
RAM.  I'm pretty sure this functionality is present (but not actually 
implemented) on some of the current qemu targets.

I agree that changing RAM mappings under an active DMA is a fairly suspect 
thing to do. However I think we need to avoid cache mappings between separate 
DMA transactions i.e. when the guest can know that no DMA will occur, and 
safely remap things.

I'm also of the opinion that virtio devices should behave the same as any 
other device. i.e. if you put a virtio-net-pci device on a PCI bus behind an 
IOMMU, then it should see the same address space as any other PCI device in 
that location.  Apart from anything else, failure to do this breaks nested 
virtualization.  While qemu doesn't currently implement an IOMMU, the DMA 
interfaces have been designed to allow it.

> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);

We need to support aliased memory regions. For example the ARM RealView boards 
expose the first 256M RAM at both address 0x0 and 0x70000000. It's also common 
for systems to create aliases by ignoring certain address bits. e.g. each sim 
slot is allocated a fixed 256M region. Populating that slot with a 128M stick 
will cause the contents to be aliased in both the top and bottom halves of 
that region.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 22:39                 ` Paul Brook
@ 2010-03-01 19:27                   ` Michael S. Tsirkin
  2010-03-01 21:54                     ` Anthony Liguori
  2010-03-02 14:07                   ` Anthony Liguori
  1 sibling, 1 reply; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-03-01 19:27 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Sun, Feb 28, 2010 at 10:39:21PM +0000, Paul Brook wrote:
> > I'm sympathetic to your arguments though.  As qemu is today, the above
> > is definitely the right thing to do.  But ram is always ram and ram
> > always has a fixed (albeit non-linear) mapping within a guest.
> 
> I think this assumption is unsafe. There are machines where RAM mappings can 
> change. It's not uncommon for a chip select (i.e. physical memory address 
> region) to be switchable to several different sources, one of which may be 
> RAM.  I'm pretty sure this functionality is present (but not actually 
> implemented) on some of the current qemu targets.
> 
> I agree that changing RAM mappings under an active DMA is a fairly suspect 
> thing to do. However I think we need to avoid cache mappings between separate 
> DMA transactions i.e. when the guest can know that no DMA will occur, and 
> safely remap things.
> 
> I'm also of the opinion that virtio devices should behave the same as any 
> other device. i.e. if you put a virtio-net-pci device on a PCI bus behind an 
> IOMMU, then it should see the same address space as any other PCI device in 
> that location.

It already doesn't. virtio passes physical memory addresses
to device instead of DMA addresses.

> Apart from anything else, failure to do this breaks nested 
> virtualization.

Assigning PV device in nested virtualization? It could work, but not
sure what the point would be.

>  While qemu doesn't currently implement an IOMMU, the DMA 
> interfaces have been designed to allow it.
> 
> > void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
> 
> We need to support aliased memory regions. For example the ARM RealView boards 
> expose the first 256M RAM at both address 0x0 and 0x70000000. It's also common 
> for systems to create aliases by ignoring certain address bits. e.g. each sim 
> slot is allocated a fixed 256M region. Populating that slot with a 128M stick 
> will cause the contents to be aliased in both the top and bottom halves of 
> that region.
> 
> Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-01 19:27                   ` Michael S. Tsirkin
@ 2010-03-01 21:54                     ` Anthony Liguori
  2010-03-02  9:57                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-03-01 21:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: amit.shah, quintela, kraxel, Paul Brook, qemu-devel

On 03/01/2010 01:27 PM, Michael S. Tsirkin wrote:
> On Sun, Feb 28, 2010 at 10:39:21PM +0000, Paul Brook wrote:
>    
>>> I'm sympathetic to your arguments though.  As qemu is today, the above
>>> is definitely the right thing to do.  But ram is always ram and ram
>>> always has a fixed (albeit non-linear) mapping within a guest.
>>>        
>> I think this assumption is unsafe. There are machines where RAM mappings can
>> change. It's not uncommon for a chip select (i.e. physical memory address
>> region) to be switchable to several different sources, one of which may be
>> RAM.  I'm pretty sure this functionality is present (but not actually
>> implemented) on some of the current qemu targets.
>>
>> I agree that changing RAM mappings under an active DMA is a fairly suspect
>> thing to do. However I think we need to avoid cache mappings between separate
>> DMA transactions i.e. when the guest can know that no DMA will occur, and
>> safely remap things.
>>
>> I'm also of the opinion that virtio devices should behave the same as any
>> other device. i.e. if you put a virtio-net-pci device on a PCI bus behind an
>> IOMMU, then it should see the same address space as any other PCI device in
>> that location.
>>      
> It already doesn't. virtio passes physical memory addresses
> to device instead of DMA addresses.
>    

That's technically a bug.

>> Apart from anything else, failure to do this breaks nested
>> virtualization.
>>      
> Assigning PV device in nested virtualization? It could work, but not
> sure what the point would be.
>    

It misses the point really.

vhost-net is not a device model and it shouldn't have to care about 
things like PCI IOMMU.  If we did ever implement a PCI IOMMU, then we 
would perform ring translation (or not use vhost-net).

Regards,

Anthony Liguori

>>   While qemu doesn't currently implement an IOMMU, the DMA
>> interfaces have been designed to allow it.
>>
>>      
>>> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
>>>        
>> We need to support aliased memory regions. For example the ARM RealView boards
>> expose the first 256M RAM at both address 0x0 and 0x70000000. It's also common
>> for systems to create aliases by ignoring certain address bits. e.g. each sim
>> slot is allocated a fixed 256M region. Populating that slot with a 128M stick
>> will cause the contents to be aliased in both the top and bottom halves of
>> that region.
>>
>> Paul
>>      

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-01 21:54                     ` Anthony Liguori
@ 2010-03-02  9:57                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-03-02  9:57 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: amit.shah, quintela, kraxel, Paul Brook, qemu-devel

On Mon, Mar 01, 2010 at 03:54:00PM -0600, Anthony Liguori wrote:
> On 03/01/2010 01:27 PM, Michael S. Tsirkin wrote:
>> On Sun, Feb 28, 2010 at 10:39:21PM +0000, Paul Brook wrote:
>>    
>>>> I'm sympathetic to your arguments though.  As qemu is today, the above
>>>> is definitely the right thing to do.  But ram is always ram and ram
>>>> always has a fixed (albeit non-linear) mapping within a guest.
>>>>        
>>> I think this assumption is unsafe. There are machines where RAM mappings can
>>> change. It's not uncommon for a chip select (i.e. physical memory address
>>> region) to be switchable to several different sources, one of which may be
>>> RAM.  I'm pretty sure this functionality is present (but not actually
>>> implemented) on some of the current qemu targets.
>>>
>>> I agree that changing RAM mappings under an active DMA is a fairly suspect
>>> thing to do. However I think we need to avoid cache mappings between separate
>>> DMA transactions i.e. when the guest can know that no DMA will occur, and
>>> safely remap things.
>>>
>>> I'm also of the opinion that virtio devices should behave the same as any
>>> other device. i.e. if you put a virtio-net-pci device on a PCI bus behind an
>>> IOMMU, then it should see the same address space as any other PCI device in
>>> that location.
>>>      
>> It already doesn't. virtio passes physical memory addresses
>> to device instead of DMA addresses.
>>    
>
> That's technically a bug.
>
>>> Apart from anything else, failure to do this breaks nested
>>> virtualization.
>>>      
>> Assigning PV device in nested virtualization? It could work, but not
>> sure what the point would be.
>>    
>
> It misses the point really.
>
> vhost-net is not a device model and it shouldn't have to care about  
> things like PCI IOMMU.  If we did ever implement a PCI IOMMU, then we  
> would perform ring translation (or not use vhost-net).
>
> Regards,
>
> Anthony Liguori

Right.

>>>   While qemu doesn't currently implement an IOMMU, the DMA
>>> interfaces have been designed to allow it.
>>>
>>>      
>>>> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
>>>>        
>>> We need to support aliased memory regions. For example the ARM RealView boards
>>> expose the first 256M RAM at both address 0x0 and 0x70000000. It's also common
>>> for systems to create aliases by ignoring certain address bits. e.g. each sim
>>> slot is allocated a fixed 256M region. Populating that slot with a 128M stick
>>> will cause the contents to be aliased in both the top and bottom halves of
>>> that region.
>>>
>>> Paul
>>>      

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 22:39                 ` Paul Brook
  2010-03-01 19:27                   ` Michael S. Tsirkin
@ 2010-03-02 14:07                   ` Anthony Liguori
  2010-03-02 14:33                     ` Paul Brook
  1 sibling, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 14:07 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

On 02/28/2010 04:39 PM, Paul Brook wrote:
>> I'm sympathetic to your arguments though.  As qemu is today, the above
>> is definitely the right thing to do.  But ram is always ram and ram
>> always has a fixed (albeit non-linear) mapping within a guest.
>>      
> I think this assumption is unsafe. There are machines where RAM mappings can
> change. It's not uncommon for a chip select (i.e. physical memory address
> region) to be switchable to several different sources, one of which may be
> RAM.  I'm pretty sure this functionality is present (but not actually
> implemented) on some of the current qemu targets.
>    

But I presume this is more about switching a dim to point at a different 
region in memory.  It's a rare event similar to memory hot plug.

Either way, if there are platforms where we don't treat ram with the new 
ram api, that's okay.

> I agree that changing RAM mappings under an active DMA is a fairly suspect
> thing to do. However I think we need to avoid cache mappings between separate
> DMA transactions i.e. when the guest can know that no DMA will occur, and
> safely remap things.
>    

One thing I like about having a new ram api is it gives us a stronger 
interface than what we have today.  Today, we don't have a strong 
guarantee that mappings won't be changed during a DMA transaction.

With a new api, cpu_physical_memory_map() changes semantics.  It only 
returns pointers for static ram mappings.  Everything else is bounced 
which guarantees that an address can't change during DMA.

> I'm also of the opinion that virtio devices should behave the same as any
> other device. i.e. if you put a virtio-net-pci device on a PCI bus behind an
> IOMMU, then it should see the same address space as any other PCI device in
> that location.  Apart from anything else, failure to do this breaks nested
> virtualization.  While qemu doesn't currently implement an IOMMU, the DMA
> interfaces have been designed to allow it.
>    

Yes, I've been working on that.  virtio is a bit more complicated than a 
normal PCI device because it can be on top of two busses.  It needs an 
additional layer of abstraction to deal with this.

>> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
>>      
> We need to support aliased memory regions. For example the ARM RealView boards
> expose the first 256M RAM at both address 0x0 and 0x70000000. It's also common
> for systems to create aliases by ignoring certain address bits. e.g. each sim
> slot is allocated a fixed 256M region. Populating that slot with a 128M stick
> will cause the contents to be aliased in both the top and bottom halves of
> that region.
>    

Okay, I'd prefer to add an explicit aliasing API.  That gives us more 
information to work with.

Regards,

Anthony Liguori

> Paul
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 14:07                   ` Anthony Liguori
@ 2010-03-02 14:33                     ` Paul Brook
  2010-03-02 14:39                       ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-03-02 14:33 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

> > I think this assumption is unsafe. There are machines where RAM mappings
> > can change. It's not uncommon for a chip select (i.e. physical memory
> > address region) to be switchable to several different sources, one of
> > which may be RAM.  I'm pretty sure this functionality is present (but not
> > actually implemented) on some of the current qemu targets.
> 
> But I presume this is more about switching a dim to point at a different
> region in memory.  It's a rare event similar to memory hot plug.

Approximately that, yes. One use is to start with ROM at address zero, then 
switch to RAM once you've initialised the DRAM controller (using a small 
internal SRAM as a trampoline). 
 
> Either way, if there are platforms where we don't treat ram with the new
> ram api, that's okay.
> 
> > I agree that changing RAM mappings under an active DMA is a fairly
> > suspect thing to do. However I think we need to avoid cache mappings
> > between separate DMA transactions i.e. when the guest can know that no
> > DMA will occur, and safely remap things.
> 
> One thing I like about having a new ram api is it gives us a stronger
> interface than what we have today.  Today, we don't have a strong
> guarantee that mappings won't be changed during a DMA transaction.
> 
> With a new api, cpu_physical_memory_map() changes semantics.  It only
> returns pointers for static ram mappings.  Everything else is bounced
> which guarantees that an address can't change during DMA.

Doesn't this mean that only the initial RAM is directly DMA-able?

While memory hotplug(and unplug) may be an infrequent event, having the 
majority of ram be hotplug seems much more likely.  Giving each guest a small 
base allocation, then hotplug the rest as required/paid for seems an entirely 
reasonable setup.  I use VMs which are normally hosted on a multi-core machine 
with buckets of ram, but may be migrated to a single CPU host with limited ram 
in an emergency.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 14:33                     ` Paul Brook
@ 2010-03-02 14:39                       ` Anthony Liguori
  2010-03-02 14:55                         ` Paul Brook
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 14:39 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

On 03/02/2010 08:33 AM, Paul Brook wrote:
>>> I think this assumption is unsafe. There are machines where RAM mappings
>>> can change. It's not uncommon for a chip select (i.e. physical memory
>>> address region) to be switchable to several different sources, one of
>>> which may be RAM.  I'm pretty sure this functionality is present (but not
>>> actually implemented) on some of the current qemu targets.
>>>        
>> But I presume this is more about switching a dim to point at a different
>> region in memory.  It's a rare event similar to memory hot plug.
>>      
> Approximately that, yes. One use is to start with ROM at address zero, then
> switch to RAM once you've initialised the DRAM controller (using a small
> internal SRAM as a trampoline).
>
>    
>> Either way, if there are platforms where we don't treat ram with the new
>> ram api, that's okay.
>>
>>      
>>> I agree that changing RAM mappings under an active DMA is a fairly
>>> suspect thing to do. However I think we need to avoid cache mappings
>>> between separate DMA transactions i.e. when the guest can know that no
>>> DMA will occur, and safely remap things.
>>>        
>> One thing I like about having a new ram api is it gives us a stronger
>> interface than what we have today.  Today, we don't have a strong
>> guarantee that mappings won't be changed during a DMA transaction.
>>
>> With a new api, cpu_physical_memory_map() changes semantics.  It only
>> returns pointers for static ram mappings.  Everything else is bounced
>> which guarantees that an address can't change during DMA.
>>      
> Doesn't this mean that only the initial RAM is directly DMA-able?
>
> While memory hotplug(and unplug) may be an infrequent event, having the
> majority of ram be hotplug seems much more likely.

Hotplug works fine for direct DMA'ing.  map/unmap would maintain a 
reference count on the registered RAM region and hot unplug would not be 
allowed until that reference dropped to zero.  For something like 
virtio, it means that the driver has to be unloaded in the guest before 
you hot unplug the region of memory if it happens to be using that 
region of memory for the ring storage.

The key difference is that these regions are created and destroyed 
rarely and in such a way that the destruction is visible to the guest.

If you compare that to IO memory, we currently flip IO memory's mappings 
dynamically without the guest really being aware (such as the VGA 
optimization).  An API like this wouldn't work for IO memory today 
without some serious thinking about how to model this sort of thing.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 14:39                       ` Anthony Liguori
@ 2010-03-02 14:55                         ` Paul Brook
  2010-03-02 15:33                           ` Anthony Liguori
  0 siblings, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-03-02 14:55 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

> >> With a new api, cpu_physical_memory_map() changes semantics.  It only
> >> returns pointers for static ram mappings.  Everything else is bounced
> >> which guarantees that an address can't change during DMA.
> >
> > Doesn't this mean that only the initial RAM is directly DMA-able?
> >
> > While memory hotplug(and unplug) may be an infrequent event, having the
> > majority of ram be hotplug seems much more likely.
> 
> Hotplug works fine for direct DMA'ing.  map/unmap would maintain a
> reference count on the registered RAM region and hot unplug would not be
> allowed until that reference dropped to zero.  For something like
> virtio, it means that the driver has to be unloaded in the guest before
> you hot unplug the region of memory if it happens to be using that
> region of memory for the ring storage.
> 
> The key difference is that these regions are created and destroyed
> rarely and in such a way that the destruction is visible to the guest.

So you're making ram unmap an asynchronous process, and requiring that the 
address space not be reused until that umap has completed?

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 14:55                         ` Paul Brook
@ 2010-03-02 15:33                           ` Anthony Liguori
  2010-03-02 15:53                             ` Paul Brook
  0 siblings, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 15:33 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

On 03/02/2010 08:55 AM, Paul Brook wrote:
>>>> With a new api, cpu_physical_memory_map() changes semantics.  It only
>>>> returns pointers for static ram mappings.  Everything else is bounced
>>>> which guarantees that an address can't change during DMA.
>>>>          
>>> Doesn't this mean that only the initial RAM is directly DMA-able?
>>>
>>> While memory hotplug(and unplug) may be an infrequent event, having the
>>> majority of ram be hotplug seems much more likely.
>>>        
>> Hotplug works fine for direct DMA'ing.  map/unmap would maintain a
>> reference count on the registered RAM region and hot unplug would not be
>> allowed until that reference dropped to zero.  For something like
>> virtio, it means that the driver has to be unloaded in the guest before
>> you hot unplug the region of memory if it happens to be using that
>> region of memory for the ring storage.
>>
>> The key difference is that these regions are created and destroyed
>> rarely and in such a way that the destruction is visible to the guest.
>>      
> So you're making ram unmap an asynchronous process, and requiring that the
> address space not be reused until that umap has completed?
>    

It technically already would be.  If you've got a pending DMA 
transaction and you try to hot unplug badness will happen.  This is 
something that is certainly exploitable.

Regards,

Anthony Liguori

> Paul
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 15:33                           ` Anthony Liguori
@ 2010-03-02 15:53                             ` Paul Brook
  2010-03-02 15:56                               ` Michael S. Tsirkin
  2010-03-02 16:12                               ` Anthony Liguori
  0 siblings, 2 replies; 70+ messages in thread
From: Paul Brook @ 2010-03-02 15:53 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

> >> The key difference is that these regions are created and destroyed
> >> rarely and in such a way that the destruction is visible to the guest.
> >
> > So you're making ram unmap an asynchronous process, and requiring that
> > the address space not be reused until that umap has completed?
> 
> It technically already would be.  If you've got a pending DMA
> transaction and you try to hot unplug badness will happen.  This is
> something that is certainly exploitable.

Hmm, I guess we probably want to make this work with all mappings then. DMA to 
a ram backed PCI BAR (e.g. video ram) is certainly feasible.
Technically it's not the unmap that causes badness, it's freeing the 
underlying ram.

For these reasons I'm tempted to push the refcounting down to the ram 
allocation level. This has a couple of nice properties.

Firstly we don't care about dynamic allocation any more. We just say that 
mapping changes may not effect active DMA transactions. If virtio chooses to 
define that the vring DMA transaction starts when the device is enabled and 
ends when disabled, that's fine by me.  This probably requires revisiting the 
memory barrier issues - barriers are pointless if you don't guarantee cache 
coherence (i.e. no bounce buffers).

Secondly, ram deallocation is not guest visible. The guest visible parts 
(memory unmapping) can happen immediately, and we avoid a whole set of 
unplug/replug race conditions. We may want to delay the completion of a 
monitor hotplug command until the actual deallocation occurs, but that's a 
largely separate issue.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 15:53                             ` Paul Brook
@ 2010-03-02 15:56                               ` Michael S. Tsirkin
  2010-03-02 16:12                               ` Anthony Liguori
  1 sibling, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-03-02 15:56 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, qemu-devel, kraxel

On Tue, Mar 02, 2010 at 03:53:30PM +0000, Paul Brook wrote:
> > >> The key difference is that these regions are created and destroyed
> > >> rarely and in such a way that the destruction is visible to the guest.
> > >
> > > So you're making ram unmap an asynchronous process, and requiring that
> > > the address space not be reused until that umap has completed?
> > 
> > It technically already would be.  If you've got a pending DMA
> > transaction and you try to hot unplug badness will happen.  This is
> > something that is certainly exploitable.
> 
> Hmm, I guess we probably want to make this work with all mappings then. DMA to 
> a ram backed PCI BAR (e.g. video ram) is certainly feasible.

This used to be possible with PCI/PCI-X.  But as far as I know, with PCI
Express, devices can not access each other's BARs.

> Technically it's not the unmap that causes badness, it's freeing the 
> underlying ram.
> 
> For these reasons I'm tempted to push the refcounting down to the ram 
> allocation level. This has a couple of nice properties.
> 
> Firstly we don't care about dynamic allocation any more. We just say that 
> mapping changes may not effect active DMA transactions. If virtio chooses to 
> define that the vring DMA transaction starts when the device is enabled and 
> ends when disabled, that's fine by me.  This probably requires revisiting the 
> memory barrier issues - barriers are pointless if you don't guarantee cache 
> coherence (i.e. no bounce buffers).
> 
> Secondly, ram deallocation is not guest visible. The guest visible parts 
> (memory unmapping) can happen immediately, and we avoid a whole set of 
> unplug/replug race conditions. We may want to delay the completion of a 
> monitor hotplug command until the actual deallocation occurs, but that's a 
> largely separate issue.
> 
> Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 15:53                             ` Paul Brook
  2010-03-02 15:56                               ` Michael S. Tsirkin
@ 2010-03-02 16:12                               ` Anthony Liguori
  2010-03-02 16:21                                 ` Marcelo Tosatti
  1 sibling, 1 reply; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 16:12 UTC (permalink / raw)
  To: Paul Brook; +Cc: amit.shah, quintela, kraxel, qemu-devel, Michael S. Tsirkin

On 03/02/2010 09:53 AM, Paul Brook wrote:
>>>> The key difference is that these regions are created and destroyed
>>>> rarely and in such a way that the destruction is visible to the guest.
>>>>          
>>> So you're making ram unmap an asynchronous process, and requiring that
>>> the address space not be reused until that umap has completed?
>>>        
>> It technically already would be.  If you've got a pending DMA
>> transaction and you try to hot unplug badness will happen.  This is
>> something that is certainly exploitable.
>>      
> Hmm, I guess we probably want to make this work with all mappings then. DMA to
> a ram backed PCI BAR (e.g. video ram) is certainly feasible.
> Technically it's not the unmap that causes badness, it's freeing the
> underlying ram.
>    

Let's avoid confusing terminology.  We have RAM mappings and then we 
have PCI BARs that are mapped as IO_MEM_RAM.

PCI BARs mapped as IO_MEM_RAM are allocated by the device and live for 
the duration of the device.  If you did something that changed the BAR's 
mapping from IO_MEM_RAM to an actual IO memory type, then you'd continue 
to DMA to the allocated device memory instead of doing MMIO operations.[1]

That's completely accurate and safe.  If you did this to bare metal, I 
expect you'd get very similar results.

This is different from DMA'ing to a RAM region and then removing the RAM 
region while the IO is in flight.  In this case, the mapping disappears 
and you potentially have the guest writing to an invalid host pointer.

[1] I don't think it's useful to support DMA'ing to arbitrary IO_MEM_RAM 
areas.  Instead, I think we should always bounce to this memory.  The 
benefit is that we avoid the complications resulting from PCI hot unplug 
and reference counting.

> For these reasons I'm tempted to push the refcounting down to the ram
> allocation level. This has a couple of nice properties.
>
> Firstly we don't care about dynamic allocation any more. We just say that
> mapping changes may not effect active DMA transactions.

Only if we think it's necessary to support native DMA to arbitrary 
IO_MEM_RAM.  I contend this is never a normal or performance sensitive 
case and it's not worth supporting.

>   If virtio chooses to
> define that the vring DMA transaction starts when the device is enabled and
> ends when disabled, that's fine by me.  This probably requires revisiting the
> memory barrier issues - barriers are pointless if you don't guarantee cache
> coherence (i.e. no bounce buffers).
>
> Secondly, ram deallocation is not guest visible. The guest visible parts
> (memory unmapping) can happen immediately, and we avoid a whole set of
> unplug/replug race conditions. We may want to delay the completion of a
> monitor hotplug command until the actual deallocation occurs, but that's a
> largely separate issue.
>    

You can do the same thing and always bounce IO_MEM_RAM IO regions.  It's 
just a question of whether we think it's worth the effort to do native 
DMA to this type of memory.  I personally don't think it is at least in 
the beginning.

Regards,

Anthony Liguori

> Paul
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-02-28 20:57               ` Anthony Liguori
  2010-02-28 21:01                 ` Michael S. Tsirkin
  2010-02-28 22:39                 ` Paul Brook
@ 2010-03-02 16:12                 ` Marcelo Tosatti
  2010-03-02 16:56                   ` Anthony Liguori
  2 siblings, 1 reply; 70+ messages in thread
From: Marcelo Tosatti @ 2010-03-02 16:12 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: quintela, Michael S. Tsirkin, qemu-devel, kraxel, amit.shah, Paul Brook

On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
> On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
> >>Both have  security implications so I think it's important that they
> >>be addressed.   Otherwise, I'm pretty happy with how things are.
> >Care suggesting some solutions?
> 
> The obvious thing to do would be to use the memory notifier in vhost
> to keep track of whenever something remaps the ring's memory region
> and if that happens, issue an ioctl to vhost to change the location
> of the ring.  Also, you would need to merge the vhost slot
> management code with the KVM slot management code.

There are no security implications as long as vhost uses the qemu
process mappings.

But your point is valid.

> I'm sympathetic to your arguments though.  As qemu is today, the
> above is definitely the right thing to do.  But ram is always ram
> and ram always has a fixed (albeit non-linear) mapping within a
> guest.  We can probably be smarter in qemu.
> 
> There are areas of MMIO/ROM address space that *sometimes* end up
> behaving like ram, but that's a different case.  The one other case
> to consider is ram hot add/remove in which case, ram may be removed
> or added (but it's location will never change during its lifetime).
> 
> Here's what I'll propose, and I'd really like to hear what Paul
> think about it before we start down this path.
> 
> I think we should add a new API that's:
> 
> void cpu_ram_add(target_phys_addr_t start, ram_addr_t size);
> 
> This API would do two things.  It would call qemu_ram_alloc() and
> cpu_register_physical_memory() just as code does today.  It would
> also add this region into a new table.
> 
> There would be:
> 
> void *cpu_ram_map(target_phys_addr_t start, ram_addr_t *size);
> void cpu_ram_unmap(void *mem);
> 
> These calls would use this new table to lookup ram addresses.  These
> mappings are valid as long as the guest is executed.  Within the
> table, each region would have a reference count.  When it comes time
> to do hot add/remove, we would wait to remove a region until the
> reference count went to zero to avoid unmapping during DMA.
>
> cpu_ram_add() never gets called with overlapping regions.  We'll
> modify cpu_register_physical_memory() to ensure that a ram mapping
> is never changed after initial registration.

What is the difference between your proposal and
cpu_physical_memory_map?

What i'd like to see is binding between cpu_physical_memory_map and qdev 
devices, so that you can use different host memory mappings for device
context and for CPU context (and provide the possibility for, say, map 
a certain memory region as read-only).

> vhost no longer needs to bother keeping the dynamic table up to date
> so it removes all of the slot management code from vhost.  KVM still
> needs the code to handle rom/ram mappings but we can take care of
> that next.  virtio-net's userspace code can do the same thing as
> vhost and only map the ring once which should be a big performance
> improvement.

Pinning the host virtual addresses as you propose reduces flexibility.
See above about different mappings for DMA/CPU contexes.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 16:12                               ` Anthony Liguori
@ 2010-03-02 16:21                                 ` Marcelo Tosatti
  0 siblings, 0 replies; 70+ messages in thread
From: Marcelo Tosatti @ 2010-03-02 16:21 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, quintela, qemu-devel, kraxel, amit.shah, Paul Brook

On Tue, Mar 02, 2010 at 10:12:05AM -0600, Anthony Liguori wrote:
> On 03/02/2010 09:53 AM, Paul Brook wrote:
> >>>>The key difference is that these regions are created and destroyed
> >>>>rarely and in such a way that the destruction is visible to the guest.
> >>>So you're making ram unmap an asynchronous process, and requiring that
> >>>the address space not be reused until that umap has completed?
> >>It technically already would be.  If you've got a pending DMA
> >>transaction and you try to hot unplug badness will happen.  This is
> >>something that is certainly exploitable.
> >Hmm, I guess we probably want to make this work with all mappings then. DMA to
> >a ram backed PCI BAR (e.g. video ram) is certainly feasible.
> >Technically it's not the unmap that causes badness, it's freeing the
> >underlying ram.
> 
> Let's avoid confusing terminology.  We have RAM mappings and then we
> have PCI BARs that are mapped as IO_MEM_RAM.
> 
> PCI BARs mapped as IO_MEM_RAM are allocated by the device and live
> for the duration of the device.  If you did something that changed
> the BAR's mapping from IO_MEM_RAM to an actual IO memory type, then
> you'd continue to DMA to the allocated device memory instead of
> doing MMIO operations.[1]
> 
> That's completely accurate and safe.  If you did this to bare metal,
> I expect you'd get very similar results.
> 
> This is different from DMA'ing to a RAM region and then removing the
> RAM region while the IO is in flight.  In this case, the mapping
> disappears and you potentially have the guest writing to an invalid
> host pointer.
> 
> [1] I don't think it's useful to support DMA'ing to arbitrary
> IO_MEM_RAM areas.  Instead, I think we should always bounce to this
> memory.  The benefit is that we avoid the complications resulting
> from PCI hot unplug and reference counting.

Agree. Thus the suggestion to tie cpu_physical_memory_map to qdev
infrastructure.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 16:12                 ` Marcelo Tosatti
@ 2010-03-02 16:56                   ` Anthony Liguori
  2010-03-02 17:00                     ` Michael S. Tsirkin
                                       ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 16:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: quintela, Michael S. Tsirkin, qemu-devel, kraxel, amit.shah, Paul Brook

On 03/02/2010 10:12 AM, Marcelo Tosatti wrote:
> On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
>    
>> On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
>>      
>>>> Both have  security implications so I think it's important that they
>>>> be addressed.   Otherwise, I'm pretty happy with how things are.
>>>>          
>>> Care suggesting some solutions?
>>>        
>> The obvious thing to do would be to use the memory notifier in vhost
>> to keep track of whenever something remaps the ring's memory region
>> and if that happens, issue an ioctl to vhost to change the location
>> of the ring.  Also, you would need to merge the vhost slot
>> management code with the KVM slot management code.
>>      
> There are no security implications as long as vhost uses the qemu
> process mappings.
>    

There potentially are within a guest.  If a guest can trigger a qemu bug 
that results in qemu writing to a different location than what the guest 
told it to write, a malicious software may use this to escalate it's 
privileges within a guest.

>> cpu_ram_add() never gets called with overlapping regions.  We'll
>> modify cpu_register_physical_memory() to ensure that a ram mapping
>> is never changed after initial registration.
>>      
> What is the difference between your proposal and
> cpu_physical_memory_map?
>    

cpu_physical_memory_map() has the following semantics:

- it always returns a transient mapping
- it may (transparently) bounce
- it may fail to bounce, caller must deal

The new function I'm proposing has the following semantics:

- it always returns a persistent mapping
- it never bounces
- it will only fail if the mapping isn't ram

A caller can use the new function to implement an optimization to force 
the device to only work with real ram.  IOW, this is something we can 
use in virtio, but very little else.  cpu_physical_memory_map can be 
used in more circumstances.

> What i'd like to see is binding between cpu_physical_memory_map and qdev
> devices, so that you can use different host memory mappings for device
> context and for CPU context (and provide the possibility for, say, map
> a certain memory region as read-only).
>    

We really want per-bus mappings.  At the lowest level, we'll have 
sysbus_memory_map() but we'll also have pci_memory_map(), 
virtio_memory_map(), etc.

Nothing should ever call cpu_physical_memory_map() directly.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 16:56                   ` Anthony Liguori
@ 2010-03-02 17:00                     ` Michael S. Tsirkin
  2010-03-02 18:00                     ` Marcelo Tosatti
  2010-03-02 22:41                     ` Paul Brook
  2 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-03-02 17:00 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: quintela, Marcelo Tosatti, qemu-devel, kraxel, amit.shah, Paul Brook

On Tue, Mar 02, 2010 at 10:56:48AM -0600, Anthony Liguori wrote:
> On 03/02/2010 10:12 AM, Marcelo Tosatti wrote:
>> On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
>>    
>>> On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
>>>      
>>>>> Both have  security implications so I think it's important that they
>>>>> be addressed.   Otherwise, I'm pretty happy with how things are.
>>>>>          
>>>> Care suggesting some solutions?
>>>>        
>>> The obvious thing to do would be to use the memory notifier in vhost
>>> to keep track of whenever something remaps the ring's memory region
>>> and if that happens, issue an ioctl to vhost to change the location
>>> of the ring.  Also, you would need to merge the vhost slot
>>> management code with the KVM slot management code.
>>>      
>> There are no security implications as long as vhost uses the qemu
>> process mappings.
>>    
>
> There potentially are within a guest.  If a guest can trigger a qemu bug  
> that results in qemu writing to a different location than what the guest  
> told it to write, a malicious software may use this to escalate it's  
> privileges within a guest.

If malicious software has access to hardware that does DMA,
game is likely over :)

>>> cpu_ram_add() never gets called with overlapping regions.  We'll
>>> modify cpu_register_physical_memory() to ensure that a ram mapping
>>> is never changed after initial registration.
>>>      
>> What is the difference between your proposal and
>> cpu_physical_memory_map?
>>    
>
> cpu_physical_memory_map() has the following semantics:
>
> - it always returns a transient mapping
> - it may (transparently) bounce
> - it may fail to bounce, caller must deal
>
> The new function I'm proposing has the following semantics:
>
> - it always returns a persistent mapping
> - it never bounces
> - it will only fail if the mapping isn't ram
>
> A caller can use the new function to implement an optimization to force  
> the device to only work with real ram.  IOW, this is something we can  
> use in virtio, but very little else.  cpu_physical_memory_map can be  
> used in more circumstances.
>
>> What i'd like to see is binding between cpu_physical_memory_map and qdev
>> devices, so that you can use different host memory mappings for device
>> context and for CPU context (and provide the possibility for, say, map
>> a certain memory region as read-only).
>>    
>
> We really want per-bus mappings.  At the lowest level, we'll have  
> sysbus_memory_map() but we'll also have pci_memory_map(),  
> virtio_memory_map(), etc.
>
> Nothing should ever call cpu_physical_memory_map() directly.
>
> Regards,
>
> Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [Qemu-devel] Re: [PATCHv2 02/12] kvm: add API to set ioeventfd
  2010-02-25 19:19   ` [Qemu-devel] " Anthony Liguori
@ 2010-03-02 17:41     ` Michael S. Tsirkin
  0 siblings, 0 replies; 70+ messages in thread
From: Michael S. Tsirkin @ 2010-03-02 17:41 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: quintela, mtosatti, qemu-devel, avi, amit.shah, kraxel

On Thu, Feb 25, 2010 at 01:19:30PM -0600, Anthony Liguori wrote:
> On 02/25/2010 12:28 PM, Michael S. Tsirkin wrote:
>> Comment on kvm usage: rather than require users to do if (kvm_enabled())
>> and/or ifdefs, this patch adds an API that, internally, is defined to
>> stub function on non-kvm build, and checks kvm_enabled for non-kvm
>> run.
>>
>> While rest of qemu code still uses if (kvm_enabled()), I think this
>> approach is cleaner, and we should convert rest of code to it
>> long term.
>>    
>
> I'm not opposed to that.
>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>
>> Avi, Marcelo, pls review/ack.
>>
>>   kvm-all.c |   22 ++++++++++++++++++++++
>>   kvm.h     |   16 ++++++++++++++++
>>   2 files changed, 38 insertions(+), 0 deletions(-)
>>
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 1a02076..9742791 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -1138,3 +1138,25 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
>>
>>       return r;
>>   }
>> +
>> +#ifdef KVM_IOEVENTFD
>> +int kvm_set_ioeventfd(uint16_t addr, uint16_t data, int fd, bool assigned)
>>    
>
> I think this API could use some love.  You're using a very limited set  
> of things that ioeventfd can do and you're multiplexing creation and  
> destruction in a single call.
>
> I think:
>
> kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t data);
> kvm_unset_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t data);
>
> Would be better.  Alternatively, an API that matched ioeventfd exactly:
>
> kvm_set_ioeventfd(int fd, uint64_t addr, int size, uint64_t data, int  
> flags);
> kvm_unset_ioeventfd(...);
>
> Could work too.
>
> Regards,
>
> Anthony Liguori
>

So I renamed to kvm_set_ioeventfd_pio_word, but I still left assign
boolean in place because both implementation and usage take up
less code this way.

It's just an internal function, so no biggie to change it later ...

-- 
MST

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 16:56                   ` Anthony Liguori
  2010-03-02 17:00                     ` Michael S. Tsirkin
@ 2010-03-02 18:00                     ` Marcelo Tosatti
  2010-03-02 18:13                       ` Anthony Liguori
  2010-03-02 22:41                     ` Paul Brook
  2 siblings, 1 reply; 70+ messages in thread
From: Marcelo Tosatti @ 2010-03-02 18:00 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, quintela, qemu-devel, Paul Brook, amit.shah, kraxel

On Tue, Mar 02, 2010 at 10:56:48AM -0600, Anthony Liguori wrote:
> On 03/02/2010 10:12 AM, Marcelo Tosatti wrote:
> >On Sun, Feb 28, 2010 at 02:57:56PM -0600, Anthony Liguori wrote:
> >>On 02/28/2010 11:19 AM, Michael S. Tsirkin wrote:
> >>>>Both have  security implications so I think it's important that they
> >>>>be addressed.   Otherwise, I'm pretty happy with how things are.
> >>>Care suggesting some solutions?
> >>The obvious thing to do would be to use the memory notifier in vhost
> >>to keep track of whenever something remaps the ring's memory region
> >>and if that happens, issue an ioctl to vhost to change the location
> >>of the ring.  Also, you would need to merge the vhost slot
> >>management code with the KVM slot management code.
> >There are no security implications as long as vhost uses the qemu
> >process mappings.
> 
> There potentially are within a guest.  If a guest can trigger a qemu
> bug that results in qemu writing to a different location than what
> the guest told it to write, a malicious software may use this to
> escalate it's privileges within a guest.
> 
> >>cpu_ram_add() never gets called with overlapping regions.  We'll
> >>modify cpu_register_physical_memory() to ensure that a ram mapping
> >>is never changed after initial registration.
> >What is the difference between your proposal and
> >cpu_physical_memory_map?
> 
> cpu_physical_memory_map() has the following semantics:


> 
> - it always returns a transient mapping
> - it may (transparently) bounce
> - it may fail to bounce, caller must deal
> 
> The new function I'm proposing has the following semantics:

What exactly are the purposes of the new function?

> - it always returns a persistent mapping

For hotplug? What exactly you mean persistent?

> - it never bounces
> - it will only fail if the mapping isn't ram
> 
> A caller can use the new function to implement an optimization to
> force the device to only work with real ram.

To bypass the address translation in exec.c? 

>   IOW, this is something we can use in virtio, but very little else.
> cpu_physical_memory_map can be used in more circumstances.

Does not make much sense to me. The qdev <-> memory map mapping seems
more important. Following your security enhancement drive, you can for
example check whether the region can actually be mapped by the device
and deny otherwise, or do whatever host-side memory protection tricks 
you'd like.

And "cpu_ram_map" seems like the memory is accessed through cpu context, 
while it is really always device context.

> >What i'd like to see is binding between cpu_physical_memory_map and qdev
> >devices, so that you can use different host memory mappings for device
> >context and for CPU context (and provide the possibility for, say, map
> >a certain memory region as read-only).
> 
> We really want per-bus mappings.  At the lowest level, we'll have
> sysbus_memory_map() but we'll also have pci_memory_map(),
> virtio_memory_map(), etc.
>
> Nothing should ever call cpu_physical_memory_map() directly.

Yep.

> 
> Regards,
> 
> Anthony Liguori
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 18:00                     ` Marcelo Tosatti
@ 2010-03-02 18:13                       ` Anthony Liguori
  0 siblings, 0 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-03-02 18:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michael S. Tsirkin, quintela, qemu-devel, Paul Brook, amit.shah, kraxel

On 03/02/2010 12:00 PM, Marcelo Tosatti wrote:
>> - it always returns a transient mapping
>> - it may (transparently) bounce
>> - it may fail to bounce, caller must deal
>>
>> The new function I'm proposing has the following semantics:
>>      
> What exactly are the purposes of the new function?
>    

We need an API that can be used to obtain a persistent mapping of a 
given guest physical address.  We don't have that today.  We can 
potentially change cpu_physical_memory_map() to also accommodate this 
use case but that's a second step.

>> - it always returns a persistent mapping
>>      
> For hotplug? What exactly you mean persistent?
>    

Hotplug cannot happen as long as a persistent mapping exists for an 
address within that region.  This is okay.  You cannot have an active 
device driver DMA'ing to a DIMM and then hot unplug it.  The guest is 
responsible for making sure this doesn't happen.

>> - it never bounces
>> - it will only fail if the mapping isn't ram
>>
>> A caller can use the new function to implement an optimization to
>> force the device to only work with real ram.
>>      
> To bypass the address translation in exec.c?
>    

No, the ultimate goal is to convert virtio ring accesses from:

static inline uint32_t vring_desc_len(target_phys_addr_t desc_pa, int i)
{
     target_phys_addr_t pa;
     pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
     return ldl_phys(pa);
}

len = vring_desc_len(vring.desc_pa, i)

To:

len = ldl_w(vring->desc[i].len);

When host == arch, ldl_w is a nop.  Otherwise, it's just a byte swap.  
ldl_phys() today turns into cpu_physical_memory_read() which is very slow.

To support this, we must enforce that when a guest passes us a physical 
address, we can safely obtain a persistent mapping to it.  This is true 
for any ram address.  It's not true for MMIO memory.  We have no way to 
do this with cpu_physical_memory_map().

>>    IOW, this is something we can use in virtio, but very little else.
>> cpu_physical_memory_map can be used in more circumstances.
>>      
> Does not make much sense to me. The qdev<->  memory map mapping seems
> more important. Following your security enhancement drive, you can for
> example check whether the region can actually be mapped by the device
> and deny otherwise, or do whatever host-side memory protection tricks
> you'd like.
>    

It's two independent things.  Part of what makes virtio so complicated 
to convert to proper bus accessors is it's use of 
ldl_phys/stl_phys/etc.  No other device use those functions.  If we 
reduce virtio to just use a map() function, it simplifies the bus 
accessor conversion.

> And "cpu_ram_map" seems like the memory is accessed through cpu context,
> while it is really always device context.
>    

Yes, but that's a separate effort.  In fact, see 
http://wiki.qemu.org/Features/RamAPI vs. 
http://wiki.qemu.org/Features/PCIMemoryAPI

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 16:56                   ` Anthony Liguori
  2010-03-02 17:00                     ` Michael S. Tsirkin
  2010-03-02 18:00                     ` Marcelo Tosatti
@ 2010-03-02 22:41                     ` Paul Brook
  2010-03-03 14:15                       ` Anthony Liguori
  2 siblings, 1 reply; 70+ messages in thread
From: Paul Brook @ 2010-03-02 22:41 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: quintela, Marcelo Tosatti, qemu-devel, Michael S. Tsirkin,
	kraxel, amit.shah

> The new function I'm proposing has the following semantics:
> 
> - it always returns a persistent mapping
> - it never bounces
> - it will only fail if the mapping isn't ram

So you're assuming that virtio rings are in ram that is not hot-pluggable or 
remapable, and the whole region is contiguous?
That sounds like it's likely to come back and bite you. The guest has no idea 
which areas of ram happen to be contiguous on the host.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-02 22:41                     ` Paul Brook
@ 2010-03-03 14:15                       ` Anthony Liguori
  2010-03-03 14:43                         ` Paul Brook
  2010-03-03 16:24                         ` Marcelo Tosatti
  0 siblings, 2 replies; 70+ messages in thread
From: Anthony Liguori @ 2010-03-03 14:15 UTC (permalink / raw)
  To: Paul Brook
  Cc: quintela, Marcelo Tosatti, qemu-devel, Michael S. Tsirkin,
	kraxel, amit.shah

On 03/02/2010 04:41 PM, Paul Brook wrote:
>> The new function I'm proposing has the following semantics:
>>
>> - it always returns a persistent mapping
>> - it never bounces
>> - it will only fail if the mapping isn't ram
>>      
> So you're assuming that virtio rings are in ram that is not hot-pluggable

As long as the device is active, yes.  This would be true with bare 
metal.  Memory allocated for the virtio-pci ring is not reclaimable and 
you have to be able to reclaim the entire area of ram covered by a DIMM 
to hot unplug it.  A user would have to unload the virtio-pci module to 
release the memory before hot unplug would be an option.

NB, almost nothing supports memory hot remove because it's very 
difficult for an OS to actually do.

>   or
> remapable,

Yes, it cannot be remapable.

>   and the whole region is contiguous?
>    

Yes, it has to be contiguous.

> That sounds like it's likely to come back and bite you. The guest has no idea
> which areas of ram happen to be contiguous on the host.
>    

Practically speaking, with target-i386 anything that is contiguous in 
guest physical memory is contiguous in the host address space provided 
it's ram.

These assumptions are important.  I have a local branch (that I'll send 
out soon) that implements a ram API and converted virtio to make use of 
it.  I'm seeing a ~50% increase in tx throughput.

If you try to support discontiguous, remapable ram for the virtio ring, 
then you have to do an l1_phys_map lookup for every single ring variable 
access followed by a memory copy.  This ends up costing an awful lot 
practically speaking.

The changes should work equally well with syborg although I don't think 
I can do meaningful performance testing there (I don't actually have a 
syborg image to test).

Regards,

Anthony Liguori

> Paul
>    

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-03 14:15                       ` Anthony Liguori
@ 2010-03-03 14:43                         ` Paul Brook
  2010-03-03 16:24                         ` Marcelo Tosatti
  1 sibling, 0 replies; 70+ messages in thread
From: Paul Brook @ 2010-03-03 14:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: quintela, Marcelo Tosatti, qemu-devel, Michael S. Tsirkin,
	kraxel, amit.shah

> > That sounds like it's likely to come back and bite you. The guest has no
> > idea which areas of ram happen to be contiguous on the host.
> 
> Practically speaking, with target-i386 anything that is contiguous in
> guest physical memory is contiguous in the host address space provided
> it's ram.
> 
> These assumptions are important.  I have a local branch (that I'll send
> out soon) that implements a ram API and converted virtio to make use of
> it.  I'm seeing a ~50% increase in tx throughput.

IMO supporting discontiguous regions is a requirement. target-i386 might get 
away with contiguous memory, because it omits most of the board level details. 
For everything else I'd expect this to be a real problem.

I'm not happy about the not-remapable assumption either.  In my experience 
this is fairly common.  In fact many real x86 machines have this capability 
(to workaround the 4G PCI hole).

By my reading the ppc440_bamboo board fails both your assumptions.
I imagine the KVM-PPC folks would be upset if you decided that virtio no 
longer worked on this board.

This is all somewhat disappointing, given virtio is supposed to be a DMA based 
architecture, and not rely on shared memory semantics.

Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] Re: [PATCHv2 10/12] tap: add vhost/vhostfd options
  2010-03-03 14:15                       ` Anthony Liguori
  2010-03-03 14:43                         ` Paul Brook
@ 2010-03-03 16:24                         ` Marcelo Tosatti
  1 sibling, 0 replies; 70+ messages in thread
From: Marcelo Tosatti @ 2010-03-03 16:24 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: quintela, Michael S. Tsirkin, qemu-devel, kraxel, amit.shah, Paul Brook

On Wed, Mar 03, 2010 at 08:15:15AM -0600, Anthony Liguori wrote:
> On 03/02/2010 04:41 PM, Paul Brook wrote:
> >>The new function I'm proposing has the following semantics:
> >>
> >>- it always returns a persistent mapping
> >>- it never bounces
> >>- it will only fail if the mapping isn't ram
> >So you're assuming that virtio rings are in ram that is not hot-pluggable
> 
> As long as the device is active, yes.  This would be true with bare
> metal.  Memory allocated for the virtio-pci ring is not reclaimable
> and you have to be able to reclaim the entire area of ram covered by
> a DIMM to hot unplug it.  A user would have to unload the virtio-pci
> module to release the memory before hot unplug would be an option.
> 
> NB, almost nothing supports memory hot remove because it's very
> difficult for an OS to actually do.
> 
> >  or
> >remapable,
> 
> Yes, it cannot be remapable.
> 
> >  and the whole region is contiguous?
> 
> Yes, it has to be contiguous.
> 
> >That sounds like it's likely to come back and bite you. The guest has no idea
> >which areas of ram happen to be contiguous on the host.
> 
> Practically speaking, with target-i386 anything that is contiguous
> in guest physical memory is contiguous in the host address space
> provided it's ram.
> 
> These assumptions are important.  I have a local branch (that I'll
> send out soon) that implements a ram API and converted virtio to
> make use of it.  I'm seeing a ~50% increase in tx throughput.
> 
> If you try to support discontiguous, remapable ram for the virtio
> ring, then you have to do an l1_phys_map lookup for every single
> ring variable access followed by a memory copy.  This ends up
> costing an awful lot practically speaking.

Speed up the lookup instead?

> 
> The changes should work equally well with syborg although I don't
> think I can do meaningful performance testing there (I don't
> actually have a syborg image to test).
> 
> Regards,
> 
> Anthony Liguori
> 
> >Paul

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2010-03-03 16:24 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-25 18:27 [Qemu-devel] [PATCHv2 00/12] vhost-net: upstream integration Michael S. Tsirkin
2010-02-25 18:27 ` [Qemu-devel] [PATCHv2 05/12] virtio: add APIs for queue fields Michael S. Tsirkin
2010-02-25 18:49   ` Blue Swirl
2010-02-26 14:53     ` Michael S. Tsirkin
2010-02-25 19:25   ` [Qemu-devel] " Anthony Liguori
2010-02-26  8:46     ` Gleb Natapov
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 09/12] vhost: vhost net support Michael S. Tsirkin
2010-02-25 19:04   ` [Qemu-devel] " Juan Quintela
2010-02-26 14:32     ` Michael S. Tsirkin
2010-02-26 14:38       ` Anthony Liguori
2010-02-26 14:54         ` Michael S. Tsirkin
2010-02-25 19:44   ` Anthony Liguori
2010-02-26 14:49     ` Michael S. Tsirkin
2010-02-26 15:18       ` Anthony Liguori
2010-02-27 19:38         ` Michael S. Tsirkin
2010-02-28  1:59           ` Paul Brook
2010-02-28 10:15             ` Michael S. Tsirkin
2010-02-28 12:45               ` Paul Brook
2010-02-28 14:44                 ` Michael S. Tsirkin
2010-02-28 15:23                   ` Paul Brook
2010-02-28 15:37                     ` Michael S. Tsirkin
2010-02-28 16:02           ` Anthony Liguori
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 02/12] kvm: add API to set ioeventfd Michael S. Tsirkin
2010-02-25 19:19   ` [Qemu-devel] " Anthony Liguori
2010-03-02 17:41     ` Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 04/12] virtio: add notifier support Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 01/12] tap: add interface to get device fd Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 07/12] virtio: move typedef to qemu-common Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 10/12] tap: add vhost/vhostfd options Michael S. Tsirkin
2010-02-25 19:47   ` [Qemu-devel] " Anthony Liguori
2010-02-26 14:51     ` Michael S. Tsirkin
2010-02-26 15:23       ` Anthony Liguori
2010-02-27 19:44         ` Michael S. Tsirkin
2010-02-28 16:08           ` Anthony Liguori
2010-02-28 17:19             ` Michael S. Tsirkin
2010-02-28 20:57               ` Anthony Liguori
2010-02-28 21:01                 ` Michael S. Tsirkin
2010-02-28 22:38                   ` Anthony Liguori
2010-02-28 22:39                 ` Paul Brook
2010-03-01 19:27                   ` Michael S. Tsirkin
2010-03-01 21:54                     ` Anthony Liguori
2010-03-02  9:57                       ` Michael S. Tsirkin
2010-03-02 14:07                   ` Anthony Liguori
2010-03-02 14:33                     ` Paul Brook
2010-03-02 14:39                       ` Anthony Liguori
2010-03-02 14:55                         ` Paul Brook
2010-03-02 15:33                           ` Anthony Liguori
2010-03-02 15:53                             ` Paul Brook
2010-03-02 15:56                               ` Michael S. Tsirkin
2010-03-02 16:12                               ` Anthony Liguori
2010-03-02 16:21                                 ` Marcelo Tosatti
2010-03-02 16:12                 ` Marcelo Tosatti
2010-03-02 16:56                   ` Anthony Liguori
2010-03-02 17:00                     ` Michael S. Tsirkin
2010-03-02 18:00                     ` Marcelo Tosatti
2010-03-02 18:13                       ` Anthony Liguori
2010-03-02 22:41                     ` Paul Brook
2010-03-03 14:15                       ` Anthony Liguori
2010-03-03 14:43                         ` Paul Brook
2010-03-03 16:24                         ` Marcelo Tosatti
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 11/12] tap: add API to retrieve vhost net header Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 06/12] virtio: add set_status callback Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 08/12] virtio-pci: fill in notifier support Michael S. Tsirkin
2010-02-25 19:30   ` [Qemu-devel] " Anthony Liguori
2010-02-28 20:02     ` Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 03/12] notifier: event notifier implementation Michael S. Tsirkin
2010-02-25 19:22   ` [Qemu-devel] " Anthony Liguori
2010-02-28 19:59     ` Michael S. Tsirkin
2010-02-25 18:28 ` [Qemu-devel] [PATCHv2 12/12] virtio-net: vhost net support Michael S. Tsirkin
2010-02-25 19:49 ` [Qemu-devel] Re: [PATCHv2 00/12] vhost-net: upstream integration Anthony Liguori

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.