* [PATCH v4 0/5] vhost-user block device backend implementation
@ 2020-02-18  5:07 Coiby Xu
  2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

v4:
 * add object properties in class_init
 * relocate vhost-user-blk-test
 * other changes including using SocketAddress, coding style, etc.

v3:
 * separate generic vhost-user-server code from vhost-user-blk-server
   code
 * re-write vu_message_read and the kick handler function as coroutines
   that directly call blk_co_preadv, blk_co_pwritev, etc.
 * add aio_context notifier functions to support multi-threading model
 * other fixes regarding coding style, warning report, etc.

v2:
 * Only enable this feature for Linux because eventfd is a Linux-specific
   mechanism


This patch series implements a vhost-user block device backend server,
thanks to Stefan's and Kevin's guidance.

The vhost-user block device backend server is a UserCreatable object and
can be started using object_add:

 (qemu) object_add vhost-user-blk-server,id=ID,unix-socket=/tmp/vhost-user-blk_vhost.socket,node-name=DRIVE_NAME,writable=off,blk-size=512
 (qemu) object_del ID

or by appending the "-object" option when starting QEMU:

  $ -object vhost-user-blk-server,id=disk,unix-socket=/tmp/vhost-user-blk_vhost.socket,node-name=DRIVE_NAME,writable=off,blk-size=512

A vhost-user client can then connect to the server backend.
For example, QEMU itself can act as a client:

  $ -m 256 -object memory-backend-memfd,id=mem,size=256M,share=on -numa node,memdev=mem -chardev socket,id=char1,path=/tmp/vhost-user-blk_vhost.socket -device vhost-user-blk-pci,id=blk0,chardev=char1

The guest OS can then access this vhost-user block device after mounting it.

Coiby Xu (5):
  extend libvhost to support IOThread and coroutine
  generic vhost user server
  vhost-user block device backend server
  a stand-alone tool to directly share disk image file via vhost-user
    protocol
  new qTest case to test the vhost-user-blk-server

 Makefile                              |   4 +
 Makefile.target                       |   1 +
 backends/Makefile.objs                |   2 +
 backends/vhost-user-blk-server.c      | 718 ++++++++++++++++++++++++++
 backends/vhost-user-blk-server.h      |  21 +
 configure                             |   3 +
 contrib/libvhost-user/libvhost-user.c |  54 +-
 contrib/libvhost-user/libvhost-user.h |  38 +-
 qemu-vu.c                             | 252 +++++++++
 tests/Makefile.include                |   3 +-
 tests/qtest/Makefile.include          |   2 +
 tests/qtest/libqos/vhost-user-blk.c   | 126 +++++
 tests/qtest/libqos/vhost-user-blk.h   |  44 ++
 tests/qtest/vhost-user-blk-test.c     | 694 +++++++++++++++++++++++++
 util/Makefile.objs                    |   3 +
 util/vhost-user-server.c              | 427 +++++++++++++++
 util/vhost-user-server.h              |  56 ++
 vl.c                                  |   4 +
 18 files changed, 2444 insertions(+), 8 deletions(-)
 create mode 100644 backends/vhost-user-blk-server.c
 create mode 100644 backends/vhost-user-blk-server.h
 create mode 100644 qemu-vu.c
 create mode 100644 tests/qtest/libqos/vhost-user-blk.c
 create mode 100644 tests/qtest/libqos/vhost-user-blk.h
 create mode 100644 tests/qtest/vhost-user-blk-test.c
 create mode 100644 util/vhost-user-server.c
 create mode 100644 util/vhost-user-server.h

--
2.25.0




* [PATCH v4 1/5] extend libvhost to support IOThread and coroutine
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
@ 2020-02-18  5:07 ` Coiby Xu
  2020-02-19 16:33   ` Stefan Hajnoczi
  2020-02-25 14:55   ` Kevin Wolf
  2020-02-18  5:07 ` [PATCH v4 2/5] generic vhost user server Coiby Xu
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

Previously, libvhost-user dispatched events in its own GMainContext. Now
the vhost-user client's kick event can be dispatched in the block
device's AioContext, so IOThread is supported. Also allow vu_message_read
and vu_kick_cb to be replaced so that QEMU can run them as coroutines.
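The override mechanism this patch introduces can be sketched in isolation: the device carries an optional interface table, and the built-in blocking reader is used only when no replacement is installed. A minimal stand-alone sketch (simplified names, not the actual libvhost-user types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Dev Dev;
typedef bool (*read_msg_cb)(Dev *dev, int fd);

/* Stand-in for CoIface: replaceable callbacks, e.g. coroutine versions */
typedef struct Iface {
    read_msg_cb read_msg;
} Iface;

struct Dev {
    const Iface *co_iface;   /* NULL means "use the default reader" */
};

static bool default_read(Dev *dev, int fd)
{
    (void)dev; (void)fd;
    return true;             /* pretend a blocking recvmsg() succeeded */
}

/* Mirrors the vu_message_read() wrapper: pick the override if installed */
static bool read_msg(Dev *dev, int fd)
{
    read_msg_cb cb = dev->co_iface ? dev->co_iface->read_msg : default_read;
    return cb(dev, fd);
}
```

In the patch, the same selection happens once per message in vu_message_read(), and the kick callback is chosen analogously in vu_set_vring_kick_exec().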

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 contrib/libvhost-user/libvhost-user.c | 54 ++++++++++++++++++++++++---
 contrib/libvhost-user/libvhost-user.h | 38 ++++++++++++++++++-
 2 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index b89bf18501..f95664bb22 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -67,8 +67,6 @@
 /* The version of inflight buffer */
 #define INFLIGHT_VERSION 1
 
-#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
-
 /* The version of the protocol we support */
 #define VHOST_USER_VERSION 1
 #define LIBVHOST_USER_DEBUG 0
@@ -260,7 +258,7 @@ have_userfault(void)
 }
 
 static bool
-vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
+vu_message_read_(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
     char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
     struct iovec iov = {
@@ -328,6 +326,17 @@ fail:
     return false;
 }
 
+static bool vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
+{
+    vu_read_msg_cb read_msg;
+    if (dev->co_iface) {
+        read_msg = dev->co_iface->read_msg;
+    } else {
+        read_msg = vu_message_read_;
+    }
+    return read_msg(dev, conn_fd, vmsg);
+}
+
 static bool
 vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
@@ -1075,9 +1084,14 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
     }
 
     if (dev->vq[index].kick_fd != -1 && dev->vq[index].handler) {
+        if (dev->set_watch_packed_data) {
+            dev->set_watch_packed_data(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
+                                       dev->co_iface->kick_callback,
+                                       (void *)(long)index);
+        } else {
         dev->set_watch(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
                        vu_kick_cb, (void *)(long)index);
-
+        }
         DPRINT("Waiting for kicks on fd: %d for vq: %d\n",
                dev->vq[index].kick_fd, index);
     }
@@ -1097,8 +1111,14 @@ void vu_set_queue_handler(VuDev *dev, VuVirtq *vq,
     vq->handler = handler;
     if (vq->kick_fd >= 0) {
         if (handler) {
+            if (dev->set_watch_packed_data) {
+                dev->set_watch_packed_data(dev, vq->kick_fd, VU_WATCH_IN,
+                                           dev->co_iface->kick_callback,
+                                           (void *)(long)qidx);
+            } else {
             dev->set_watch(dev, vq->kick_fd, VU_WATCH_IN,
                            vu_kick_cb, (void *)(long)qidx);
+            }
         } else {
             dev->remove_watch(dev, vq->kick_fd);
         }
@@ -1627,6 +1647,12 @@ vu_deinit(VuDev *dev)
         }
 
         if (vq->kick_fd != -1) {
+            /* Remove the watch for kick_fd. Otherwise, if the client
+             * process is quit from within gdb, QEMU will still
+             * dispatch the event, causing a segmentation fault in the
+             * callback function.
+             */
+            dev->remove_watch(dev, vq->kick_fd);
             close(vq->kick_fd);
             vq->kick_fd = -1;
         }
@@ -1682,7 +1708,7 @@ vu_init(VuDev *dev,
 
     assert(max_queues > 0);
     assert(socket >= 0);
-    assert(set_watch);
+    /* assert(set_watch); */
     assert(remove_watch);
     assert(iface);
     assert(panic);
@@ -1715,6 +1741,24 @@ vu_init(VuDev *dev,
     return true;
 }
 
+bool
+vu_init_packed_data(VuDev *dev,
+        uint16_t max_queues,
+        int socket,
+        vu_panic_cb panic,
+        vu_set_watch_cb_packed_data set_watch_packed_data,
+        vu_remove_watch_cb remove_watch,
+        const VuDevIface *iface,
+        const CoIface *co_iface)
+{
+    if (vu_init(dev, max_queues, socket, panic, NULL, remove_watch, iface)) {
+        dev->set_watch_packed_data = set_watch_packed_data;
+        dev->co_iface = co_iface;
+        return true;
+    }
+    return false;
+}
+
 VuVirtq *
 vu_get_queue(VuDev *dev, int qidx)
 {
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 5cb7708559..6aadeaa0f2 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -30,6 +30,8 @@
 
 #define VHOST_MEMORY_MAX_NREGIONS 8
 
+#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
+
 typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MASTER = 0,
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
@@ -201,6 +203,7 @@ typedef uint64_t (*vu_get_features_cb) (VuDev *dev);
 typedef void (*vu_set_features_cb) (VuDev *dev, uint64_t features);
 typedef int (*vu_process_msg_cb) (VuDev *dev, VhostUserMsg *vmsg,
                                   int *do_reply);
+typedef bool (*vu_read_msg_cb) (VuDev *dev, int sock, VhostUserMsg *vmsg);
 typedef void (*vu_queue_set_started_cb) (VuDev *dev, int qidx, bool started);
 typedef bool (*vu_queue_is_processed_in_order_cb) (VuDev *dev, int qidx);
 typedef int (*vu_get_config_cb) (VuDev *dev, uint8_t *config, uint32_t len);
@@ -208,6 +211,20 @@ typedef int (*vu_set_config_cb) (VuDev *dev, const uint8_t *data,
                                  uint32_t offset, uint32_t size,
                                  uint32_t flags);
 
+typedef void (*vu_watch_cb_packed_data) (void *packed_data);
+
+typedef void (*vu_set_watch_cb_packed_data) (VuDev *dev, int fd, int condition,
+                                             vu_watch_cb_packed_data cb,
+                                             void *data);
+/*
+ * allowing vu_read_msg_cb and kick_callback to be replaced so QEMU
+ * can run them as coroutines
+ */
+typedef struct CoIface {
+    vu_read_msg_cb read_msg;
+    vu_watch_cb_packed_data kick_callback;
+} CoIface;
+
 typedef struct VuDevIface {
     /* called by VHOST_USER_GET_FEATURES to get the features bitmask */
     vu_get_features_cb get_features;
@@ -372,7 +389,8 @@ struct VuDev {
     /* @set_watch: add or update the given fd to the watch set,
      * call cb when condition is met */
     vu_set_watch_cb set_watch;
-
+    /* AIO dispatch can pass only one data pointer to the callback function */
+    vu_set_watch_cb_packed_data set_watch_packed_data;
     /* @remove_watch: remove the given fd from the watch set */
     vu_remove_watch_cb remove_watch;
 
@@ -380,7 +398,7 @@ struct VuDev {
      * re-initialize */
     vu_panic_cb panic;
     const VuDevIface *iface;
-
+    const CoIface *co_iface;
     /* Postcopy data */
     int postcopy_ufd;
     bool postcopy_listening;
@@ -417,6 +435,22 @@ bool vu_init(VuDev *dev,
              const VuDevIface *iface);
 
 
+/**
+ * vu_init_packed_data:
+ * Same as vu_init() except that set_watch_packed_data packs two
+ * parameters into a struct so that QEMU's aio_dispatch can pass the
+ * required data to the callback function.
+ *
+ * Returns: true on success, false on failure.
+ **/
+bool vu_init_packed_data(VuDev *dev,
+                         uint16_t max_queues,
+                         int socket,
+                         vu_panic_cb panic,
+                         vu_set_watch_cb_packed_data set_watch_packed_data,
+                         vu_remove_watch_cb remove_watch,
+                         const VuDevIface *iface,
+                         const CoIface *co_iface);
 /**
  * vu_deinit:
  * @dev: a VuDev context
-- 
2.25.0




* [PATCH v4 2/5] generic vhost user server
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
  2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
@ 2020-02-18  5:07 ` Coiby Xu
  2020-02-25 15:44   ` Kevin Wolf
  2020-02-18  5:07 ` [PATCH v4 3/5] vhost-user block device backend server Coiby Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

Sharing QEMU devices via vhost-user protocol
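One detail worth noting in the server below: vhost_user_server_stop() walks the client list with QTAILQ_FOREACH_SAFE so entries can be unlinked mid-iteration. The same remove-while-iterating idiom, reduced to a hand-rolled singly linked list (illustrative only, not the QEMU QTAILQ macros):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Client {
    bool closed;
    struct Client *next;
} Client;

/* Close and unlink every client that is still open; return how many */
static int close_all(Client **head)
{
    int n = 0;
    Client **link = head;
    while (*link) {
        Client *c = *link;
        if (!c->closed) {
            *link = c->next;   /* unlink the live client... */
            free(c);           /* ...then release it */
            n++;
        } else {
            link = &c->next;   /* skip clients that are already closed */
        }
    }
    return n;
}
```

Iterating with a pointer-to-link (or a saved "next" pointer, as FOREACH_SAFE does) is what makes removal during traversal safe.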

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 util/Makefile.objs       |   3 +
 util/vhost-user-server.c | 427 +++++++++++++++++++++++++++++++++++++++
 util/vhost-user-server.h |  56 +++++
 3 files changed, 486 insertions(+)
 create mode 100644 util/vhost-user-server.c
 create mode 100644 util/vhost-user-server.h

diff --git a/util/Makefile.objs b/util/Makefile.objs
index 11262aafaf..5e450e501c 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -36,6 +36,9 @@ util-obj-y += readline.o
 util-obj-y += rcu.o
 util-obj-$(CONFIG_MEMBARRIER) += sys_membarrier.o
 util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
+ifdef CONFIG_LINUX
+util-obj-y += vhost-user-server.o
+endif
 util-obj-y += qemu-coroutine-sleep.o
 util-obj-y += qemu-co-shared-resource.o
 util-obj-y += coroutine-$(CONFIG_COROUTINE_BACKEND).o
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
new file mode 100644
index 0000000000..70ff6d6701
--- /dev/null
+++ b/util/vhost-user-server.c
@@ -0,0 +1,427 @@
+/*
+ * Sharing QEMU devices via vhost-user protocol
+ *
+ * Author: Coiby Xu <coiby.xu@gmail.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <sys/eventfd.h>
+#include "qemu/main-loop.h"
+#include "vhost-user-server.h"
+
+static void vmsg_close_fds(VhostUserMsg *vmsg)
+{
+    int i;
+    for (i = 0; i < vmsg->fd_num; i++) {
+        close(vmsg->fds[i]);
+    }
+}
+
+static void vmsg_unblock_fds(VhostUserMsg *vmsg)
+{
+    int i;
+    for (i = 0; i < vmsg->fd_num; i++) {
+        qemu_set_nonblock(vmsg->fds[i]);
+    }
+}
+
+
+static void close_client(VuClient *client)
+{
+    vu_deinit(&client->parent);
+    client->sioc = NULL;
+    object_unref(OBJECT(client->ioc));
+    client->closed = true;
+
+}
+
+static void panic_cb(VuDev *vu_dev, const char *buf)
+{
+    if (buf) {
+        error_report("vu_panic: %s", buf);
+    }
+
+    VuClient *client = container_of(vu_dev, VuClient, parent);
+    VuServer *server = client->server;
+    if (!client->closed) {
+        close_client(client);
+        QTAILQ_REMOVE(&server->clients, client, next);
+    }
+
+    if (server->device_panic_notifier) {
+        server->device_panic_notifier(client);
+    }
+}
+
+
+
+static bool coroutine_fn
+vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
+{
+    struct iovec iov = {
+        .iov_base = (char *)vmsg,
+        .iov_len = VHOST_USER_HDR_SIZE,
+    };
+    int rc, read_bytes = 0;
+    /*
+     * VhostUserMsg is a packed structure, gcc will complain about passing
+     * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
+     * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
+     * thus two temporary variables nfds and fds are used here.
+     */
+    size_t nfds = 0, nfds_t = 0;
+    int *fds = NULL, *fds_t = NULL;
+    VuClient *client = container_of(vu_dev, VuClient, parent);
+    QIOChannel *ioc = client->ioc;
+
+    Error *erp = NULL;
+    assert(qemu_in_coroutine());
+    do {
+        /*
+         * qio_channel_readv_full may return a short read; keep calling it
+         * until VHOST_USER_HDR_SIZE or 0 bytes have been read in total
+         */
+        rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &erp);
+        if (rc < 0) {
+            if (rc == QIO_CHANNEL_ERR_BLOCK) {
+                qio_channel_yield(ioc, G_IO_IN);
+                continue;
+            } else {
+                error_report("Error while recvmsg: %s", strerror(errno));
+                return false;
+            }
+        }
+        read_bytes += rc;
+        fds = g_renew(int, fds, nfds + nfds_t);
+        memcpy(fds + nfds, fds_t, nfds_t * sizeof(int));
+        nfds += nfds_t;
+        if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
+            break;
+        }
+    } while (true);
+
+    vmsg->fd_num = nfds;
+    memcpy(vmsg->fds, fds, nfds * sizeof(int));
+    g_free(fds);
+    /* qio_channel_readv_full will make socket fds blocking, unblock them */
+    vmsg_unblock_fds(vmsg);
+    if (vmsg->size > sizeof(vmsg->payload)) {
+        error_report("Error: too big message request: %d, "
+                     "size: vmsg->size: %u, "
+                     "while sizeof(vmsg->payload) = %zu",
+                     vmsg->request, vmsg->size, sizeof(vmsg->payload));
+        goto fail;
+    }
+
+    struct iovec iov_payload = {
+        .iov_base = (char *)&vmsg->payload,
+        .iov_len = vmsg->size,
+    };
+    if (vmsg->size) {
+        rc = qio_channel_readv_all_eof(ioc, &iov_payload, 1, &erp);
+        if (rc == -1) {
+            error_report("Error while reading: %s", strerror(errno));
+            goto fail;
+        }
+    }
+
+    return true;
+
+fail:
+    vmsg_close_fds(vmsg);
+
+    return false;
+}
+
+
+static coroutine_fn void vu_client_next_trip(VuClient *client);
+
+static coroutine_fn void vu_client_trip(void *opaque)
+{
+    VuClient *client = opaque;
+
+    vu_dispatch(&client->parent);
+    client->co_trip = NULL;
+    if (!client->closed) {
+        vu_client_next_trip(client);
+    }
+}
+
+static coroutine_fn void vu_client_next_trip(VuClient *client)
+{
+    if (!client->co_trip) {
+        client->co_trip = qemu_coroutine_create(vu_client_trip, client);
+        aio_co_schedule(client->ioc->ctx, client->co_trip);
+    }
+}
+
+static void vu_client_start(VuClient *client)
+{
+    client->co_trip = qemu_coroutine_create(vu_client_trip, client);
+    aio_co_enter(client->ioc->ctx, client->co_trip);
+}
+
+static void coroutine_fn vu_kick_cb_next(VuClient *client,
+                                          kick_info *data);
+
+static void coroutine_fn vu_kick_cb(void *opaque)
+{
+    kick_info *data = (kick_info *) opaque;
+    int index = data->index;
+    VuDev *dev = data->vu_dev;
+    VuClient *client;
+    client = container_of(dev, VuClient, parent);
+    VuVirtq *vq = &dev->vq[index];
+    int sock = vq->kick_fd;
+    if (sock == -1) {
+        return;
+    }
+    assert(sock == data->fd);
+    eventfd_t kick_data;
+    ssize_t rc;
+    /*
+     * When eventfd is closed, the revent is POLLNVAL (=G_IO_NVAL) and
+     * reading eventfd will return errno=EBADF (Bad file number).
+     * Calling qio_channel_yield(ioc, G_IO_IN) will set reading handler
+     * for QIOChannel, but aio_dispatch_handlers will only dispatch
+     * G_IO_IN | G_IO_HUP | G_IO_ERR revents while ignoring
+     * G_IO_NVAL (POLLNVAL) revents.
+     *
+     * Thus when the eventfd is closed by the vhost-user client, QEMU will
+     * ignore G_IO_NVAL and keep polling by repeatedly calling qemu_poll_ns,
+     * which leads to 100% CPU usage.
+     *
+     * To avoid this issue, make sure set_watch and remove_watch use the same
+     * AioContext for the QIOChannel. Thus remove_watch will eventually
+     * succeed in removing the eventfd from the set of file descriptors
+     * polled for the corresponding GSource.
+     */
+    rc = read(sock, &kick_data, sizeof(eventfd_t));
+    if (rc != sizeof(eventfd_t)) {
+        if (errno == EAGAIN) {
+            qio_channel_yield(data->ioc, G_IO_IN);
+        } else if (errno != EINTR) {
+            data->co = NULL;
+            return;
+        }
+    } else {
+        vq->handler(dev, index);
+    }
+    data->co = NULL;
+    vu_kick_cb_next(client, data);
+
+}
+
+static void coroutine_fn vu_kick_cb_next(VuClient *client,
+                                          kick_info *cb_data)
+{
+    if (!cb_data->co) {
+        cb_data->co = qemu_coroutine_create(vu_kick_cb, cb_data);
+        aio_co_schedule(client->ioc->ctx, cb_data->co);
+    }
+}
+static const CoIface co_iface = {
+    .read_msg = vu_message_read,
+    .kick_callback = vu_kick_cb,
+};
+
+
+static void
+set_watch(VuDev *vu_dev, int fd, int vu_evt,
+          vu_watch_cb_packed_data cb, void *pvt)
+{
+    /*
+     * Since aio_dispatch can only pass one user data pointer to the
+     * callback function, pack VuDev and pvt into a struct.
+     */
+
+    VuClient *client;
+
+    client = container_of(vu_dev, VuClient, parent);
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+    long index = (intptr_t) pvt;
+    g_assert(cb);
+    kick_info *kick_info = &client->kick_info[index];
+    if (!kick_info->co) {
+        kick_info->fd = fd;
+        QIOChannelFile *fioc = qio_channel_file_new_fd(fd);
+        QIOChannel *ioc = QIO_CHANNEL(fioc);
+        ioc->ctx = client->ioc->ctx;
+        qio_channel_set_blocking(QIO_CHANNEL(ioc), false, NULL);
+        kick_info->fioc = fioc;
+        kick_info->ioc = ioc;
+        kick_info->vu_dev = vu_dev;
+        kick_info->co = qemu_coroutine_create(cb, kick_info);
+        aio_co_enter(client->ioc->ctx, kick_info->co);
+    }
+}
+
+
+static void remove_watch(VuDev *vu_dev, int fd)
+{
+    VuClient *client;
+    int i;
+    int index = -1;
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+
+    client = container_of(vu_dev, VuClient, parent);
+    for (i = 0; i < vu_dev->max_queues; i++) {
+        if (client->kick_info[i].fd == fd) {
+            index = i;
+            break;
+        }
+    }
+
+    if (index == -1) {
+        return;
+    }
+
+    kick_info *kick_info = &client->kick_info[index];
+    if (kick_info->ioc) {
+        aio_set_fd_handler(client->ioc->ctx, fd, false, NULL,
+                           NULL, NULL, NULL);
+        kick_info->ioc = NULL;
+        g_free(kick_info->fioc);
+        kick_info->co = NULL;
+        kick_info->fioc = NULL;
+    }
+}
+
+
+static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc,
+                      gpointer opaque)
+{
+    VuClient *client;
+    VuServer *server = opaque;
+    client = g_new0(VuClient, 1);
+
+    if (!vu_init_packed_data(&client->parent, server->max_queues,
+                             sioc->fd, panic_cb,
+                             set_watch, remove_watch,
+                             server->vu_iface, &co_iface)) {
+        error_report("Failed to initialize libvhost-user");
+        g_free(client);
+        return;
+    }
+
+    client->server = server;
+    client->sioc = sioc;
+    client->kick_info = g_new0(struct kick_info, server->max_queues);
+    /*
+     * Increase the object reference, so sioc will not be freed by
+     * qio_net_listener_channel_func which will call object_unref(OBJECT(sioc))
+     */
+    object_ref(OBJECT(client->sioc));
+    qio_channel_set_name(QIO_CHANNEL(sioc), "vhost-user client");
+    client->ioc = QIO_CHANNEL(sioc);
+    object_ref(OBJECT(client->ioc));
+    object_ref(OBJECT(sioc));
+    qio_channel_attach_aio_context(client->ioc, server->ctx);
+    qio_channel_set_blocking(QIO_CHANNEL(client->sioc), false, NULL);
+    client->closed = false;
+    QTAILQ_INSERT_TAIL(&server->clients, client, next);
+    vu_client_start(client);
+}
+
+
+void vhost_user_server_stop(VuServer *server)
+{
+    if (!server) {
+        return;
+    }
+
+    VuClient *client, *next;
+    QTAILQ_FOREACH_SAFE(client, &server->clients, next, next) {
+        if (!client->closed) {
+            close_client(client);
+            QTAILQ_REMOVE(&server->clients, client, next);
+        }
+    }
+
+    if (server->listener) {
+        qio_net_listener_disconnect(server->listener);
+        object_unref(OBJECT(server->listener));
+    }
+}
+
+static void detach_context(VuServer *server)
+{
+    VuClient *client;
+    int i;
+    QTAILQ_FOREACH(client, &server->clients, next) {
+        qio_channel_detach_aio_context(client->ioc);
+        for (i = 0; i < client->parent.max_queues; i++) {
+            if (client->kick_info[i].ioc) {
+                qio_channel_detach_aio_context(client->kick_info[i].ioc);
+            }
+        }
+    }
+}
+
+static void attach_context(VuServer *server, AioContext *ctx)
+{
+    VuClient *client;
+    int i;
+    QTAILQ_FOREACH(client, &server->clients, next) {
+        qio_channel_attach_aio_context(client->ioc, ctx);
+        if (client->co_trip) {
+            aio_co_schedule(ctx, client->co_trip);
+        }
+        for (i = 0; i < client->parent.max_queues; i++) {
+            if (client->kick_info[i].co) {
+                qio_channel_attach_aio_context(client->kick_info[i].ioc, ctx);
+                aio_co_schedule(ctx, client->kick_info[i].co);
+            }
+        }
+    }
+}
+void change_vu_context(AioContext *ctx, VuServer *server)
+{
+    AioContext *acquire_ctx = ctx ? ctx : server->ctx;
+    aio_context_acquire(acquire_ctx);
+    server->ctx = ctx ? ctx : qemu_get_aio_context();
+    if (ctx) {
+        attach_context(server, ctx);
+    } else {
+        detach_context(server);
+    }
+    aio_context_release(acquire_ctx);
+}
+
+
+VuServer *vhost_user_server_start(uint16_t max_queues,
+                                  SocketAddress *socket_addr,
+                                  AioContext *ctx,
+                                  void *server_ptr,
+                                  void *device_panic_notifier,
+                                  const VuDevIface *vu_iface,
+                                  Error **errp)
+{
+    VuServer *server = g_new0(VuServer, 1);
+    server->ptr_in_device = server_ptr;
+    server->listener = qio_net_listener_new();
+    if (qio_net_listener_open_sync(server->listener, socket_addr, 1, errp) < 0) {
+        goto error;
+    }
+
+    qio_net_listener_set_name(server->listener, "vhost-user-backend-listener");
+
+    server->vu_iface = vu_iface;
+    server->max_queues = max_queues;
+    server->ctx = ctx;
+    server->device_panic_notifier = device_panic_notifier;
+    qio_net_listener_set_client_func(server->listener,
+                                     vu_accept,
+                                     server,
+                                     NULL);
+
+    QTAILQ_INIT(&server->clients);
+    return server;
+error:
+    g_free(server);
+    return NULL;
+}
diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
new file mode 100644
index 0000000000..ff6d3145cd
--- /dev/null
+++ b/util/vhost-user-server.h
@@ -0,0 +1,56 @@
+#include "io/channel-socket.h"
+#include "io/channel-file.h"
+#include "io/net-listener.h"
+#include "contrib/libvhost-user/libvhost-user.h"
+#include "standard-headers/linux/virtio_blk.h"
+#include "qemu/error-report.h"
+
+typedef struct VuClient VuClient;
+
+typedef struct VuServer {
+    QIONetListener *listener;
+    AioContext *ctx;
+    QTAILQ_HEAD(, VuClient) clients;
+    void (*device_panic_notifier)(struct VuClient *client);
+    int max_queues;
+    const VuDevIface *vu_iface;
+    /*
+     * @ptr_in_device: location of the VuServer pointer within the vhost-user
+     * device struct; container_of() can later be used to recover the device.
+     */
+    void *ptr_in_device;
+    bool close;
+} VuServer;
+
+typedef struct kick_info {
+    VuDev *vu_dev;
+    int fd; /* kick fd */
+    long index; /* queue index */
+    QIOChannel *ioc; /* I/O channel for kick fd */
+    QIOChannelFile *fioc; /* underlying data channel for kick fd */
+    Coroutine *co;
+} kick_info;
+
+struct VuClient {
+    VuDev parent;
+    VuServer *server;
+    QIOChannel *ioc; /* The current I/O channel */
+    QIOChannelSocket *sioc; /* The underlying data channel */
+    Coroutine *co_trip;
+    struct kick_info *kick_info;
+    QTAILQ_ENTRY(VuClient) next;
+    bool closed;
+};
+
+
+VuServer *vhost_user_server_start(uint16_t max_queues,
+                                  SocketAddress *unix_socket,
+                                  AioContext *ctx,
+                                  void *server_ptr,
+                                  void *device_panic_notifier,
+                                  const VuDevIface *vu_iface,
+                                  Error **errp);
+
+void vhost_user_server_stop(VuServer *server);
+
+void change_vu_context(AioContext *ctx, VuServer *server);
-- 
2.25.0




* [PATCH v4 3/5] vhost-user block device backend server
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
  2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
  2020-02-18  5:07 ` [PATCH v4 2/5] generic vhost user server Coiby Xu
@ 2020-02-18  5:07 ` Coiby Xu
  2020-02-25 16:09   ` Kevin Wolf
  2020-02-18  5:07 ` [PATCH v4 4/5] a stand-alone tool to directly share disk image file via vhost-user protocol Coiby Xu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

By making use of libvhost-user, multiple block device drives can be
exported and each drive can serve multiple clients simultaneously.
Since vhost-user-server needs a block drive to be created first, the
creation of this object is delayed.
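The request handlers below recover the VuBlockDev from the server pointer via container_of() in get_vu_block_device_by_client(): the address of the device's vu_server member is stored, and subtracting the member's offset yields the enclosing struct. A minimal, self-contained sketch of that pattern (simplified structs, not the actual QEMU definitions):

```c
#include <assert.h>
#include <stddef.h>

/* container_of: recover the enclosing struct from a pointer to one member */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct VuServer VuServer;   /* stand-in, details omitted */

typedef struct VuBlockDev {
    int writable;
    VuServer *vu_server;            /* &vu_server is handed to the server */
} VuBlockDev;

/* Given the stored &vdev_blk->vu_server, recover the device */
static VuBlockDev *dev_from_member(VuServer **member_ptr)
{
    return container_of(member_ptr, VuBlockDev, vu_server);
}
```

This is the same idiom QEMU uses throughout (and libvhost-user uses for VuClient/VuDev): embed or point to the shared object, then walk back to the container.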

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 Makefile.target                  |   1 +
 backends/Makefile.objs           |   2 +
 backends/vhost-user-blk-server.c | 718 +++++++++++++++++++++++++++++++
 backends/vhost-user-blk-server.h |  21 +
 vl.c                             |   4 +
 5 files changed, 746 insertions(+)
 create mode 100644 backends/vhost-user-blk-server.c
 create mode 100644 backends/vhost-user-blk-server.h

diff --git a/Makefile.target b/Makefile.target
index 6e61f607b1..8c6c01eb3a 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -159,6 +159,7 @@ obj-y += monitor/
 obj-y += qapi/
 obj-y += memory.o
 obj-y += memory_mapping.o
+obj-$(CONFIG_LINUX) += ../contrib/libvhost-user/libvhost-user.o
 obj-y += migration/ram.o
 LIBS := $(libs_softmmu) $(LIBS)
 
diff --git a/backends/Makefile.objs b/backends/Makefile.objs
index 28a847cd57..4e7be731e0 100644
--- a/backends/Makefile.objs
+++ b/backends/Makefile.objs
@@ -14,6 +14,8 @@ common-obj-y += cryptodev-vhost.o
 common-obj-$(CONFIG_VHOST_CRYPTO) += cryptodev-vhost-user.o
 endif
 
+common-obj-$(CONFIG_LINUX) += vhost-user-blk-server.o
+
 common-obj-$(call land,$(CONFIG_VHOST_USER),$(CONFIG_VIRTIO)) += vhost-user.o
 
 common-obj-$(CONFIG_LINUX) += hostmem-memfd.o
diff --git a/backends/vhost-user-blk-server.c b/backends/vhost-user-blk-server.c
new file mode 100644
index 0000000000..1bf7f7b544
--- /dev/null
+++ b/backends/vhost-user-blk-server.c
@@ -0,0 +1,718 @@
+/*
+ * Sharing QEMU block devices via vhost-user protocol
+ *
+ * Author: Coiby Xu <coiby.xu@gmail.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "block/block.h"
+#include "vhost-user-blk-server.h"
+#include "qapi/error.h"
+#include "qom/object_interfaces.h"
+#include "sysemu/block-backend.h"
+
+enum {
+    VHOST_USER_BLK_MAX_QUEUES = 1,
+};
+struct virtio_blk_inhdr {
+    unsigned char status;
+};
+
+static QTAILQ_HEAD(, VuBlockDev) vu_block_devs =
+                                 QTAILQ_HEAD_INITIALIZER(vu_block_devs);
+
+
+typedef struct VuBlockReq {
+    VuVirtqElement *elem;
+    int64_t sector_num;
+    size_t size;
+    struct virtio_blk_inhdr *in;
+    struct virtio_blk_outhdr out;
+    VuClient *client;
+    struct VuVirtq *vq;
+} VuBlockReq;
+
+
+static void vu_block_req_complete(VuBlockReq *req)
+{
+    VuDev *vu_dev = &req->client->parent;
+
+    /* IO size with 1 extra status byte */
+    vu_queue_push(vu_dev, req->vq, req->elem, req->size + 1);
+    vu_queue_notify(vu_dev, req->vq);
+
+    if (req->elem) {
+        free(req->elem);
+    }
+
+    g_free(req);
+}
+
+static VuBlockDev *get_vu_block_device_by_client(VuClient *client)
+{
+    return container_of(client->server->ptr_in_device, VuBlockDev, vu_server);
+}
+
+static int coroutine_fn
+vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
+                              uint32_t iovcnt, uint32_t type)
+{
+    struct virtio_blk_discard_write_zeroes desc;
+    ssize_t size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
+    if (unlikely(size != sizeof(desc))) {
+        error_report("Invalid size %zd, expected %zu", size, sizeof(desc));
+        return -EINVAL;
+    }
+
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(req->client);
+    uint64_t range[2] = { le64toh(desc.sector) << 9,
+                          le32toh(desc.num_sectors) << 9 };
+    if (type == VIRTIO_BLK_T_DISCARD) {
+        if (blk_co_pdiscard(vdev_blk->backend, range[0], range[1]) == 0) {
+            return 0;
+        }
+    } else if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
+        if (blk_co_pwrite_zeroes(vdev_blk->backend,
+                                 range[0], range[1], 0) == 0) {
+            return 0;
+        }
+    }
+
+    return -EINVAL;
+}
+
+
+static void coroutine_fn vu_block_flush(VuBlockReq *req)
+{
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(req->client);
+    BlockBackend *backend = vdev_blk->backend;
+    blk_co_flush(backend);
+}
+
+
+static int coroutine_fn vu_block_virtio_process_req(VuClient *client,
+                                                    VuVirtq *vq)
+{
+    VuDev *vu_dev = &client->parent;
+    VuVirtqElement *elem;
+    uint32_t type;
+    VuBlockReq *req;
+
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
+    BlockBackend *backend = vdev_blk->backend;
+    elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) +
+                                    sizeof(VuBlockReq));
+    if (!elem) {
+        return -1;
+    }
+
+    struct iovec *in_iov = elem->in_sg;
+    struct iovec *out_iov = elem->out_sg;
+    unsigned in_num = elem->in_num;
+    unsigned out_num = elem->out_num;
+    /* refer to hw/block/virtio-blk.c */
+    if (elem->out_num < 1 || elem->in_num < 1) {
+        error_report("virtio-blk request missing headers");
+        free(elem);
+        return -EINVAL;
+    }
+
+    req = g_new0(VuBlockReq, 1);
+    req->client = client;
+    req->vq = vq;
+    req->elem = elem;
+
+    if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out,
+                            sizeof(req->out)) != sizeof(req->out))) {
+        error_report("virtio-blk request outhdr too short");
+        goto err;
+    }
+
+    iov_discard_front(&out_iov, &out_num, sizeof(req->out));
+
+    if (in_iov[in_num - 1].iov_len < sizeof(struct virtio_blk_inhdr)) {
+        error_report("virtio-blk request inhdr too short");
+        goto err;
+    }
+
+    /* We always touch the last byte, so just see how big in_iov is.  */
+    req->in = (void *)in_iov[in_num - 1].iov_base
+              + in_iov[in_num - 1].iov_len
+              - sizeof(struct virtio_blk_inhdr);
+    iov_discard_back(in_iov, &in_num, sizeof(struct virtio_blk_inhdr));
+
+
+    type = le32toh(req->out.type);
+    switch (type & ~VIRTIO_BLK_T_BARRIER) {
+    case VIRTIO_BLK_T_IN:
+    case VIRTIO_BLK_T_OUT: {
+        ssize_t ret = 0;
+        bool is_write = type & VIRTIO_BLK_T_OUT;
+        req->sector_num = le64toh(req->out.sector);
+
+        int64_t offset = req->sector_num * vdev_blk->blk_size;
+        QEMUIOVector *qiov = g_new0(QEMUIOVector, 1);
+        if (is_write) {
+            qemu_iovec_init_external(qiov, out_iov, out_num);
+            ret = blk_co_pwritev(backend, offset, qiov->size,
+                                 qiov, 0);
+        } else {
+            qemu_iovec_init_external(qiov, in_iov, in_num);
+            ret = blk_co_preadv(backend, offset, qiov->size,
+                                 qiov, 0);
+        }
+        aio_wait_kick();
+        if (ret >= 0) {
+            req->in->status = VIRTIO_BLK_S_OK;
+        } else {
+            req->in->status = VIRTIO_BLK_S_IOERR;
+        }
+        vu_block_req_complete(req);
+        break;
+    }
+    case VIRTIO_BLK_T_FLUSH:
+        vu_block_flush(req);
+        req->in->status = VIRTIO_BLK_S_OK;
+        vu_block_req_complete(req);
+        break;
+    case VIRTIO_BLK_T_GET_ID: {
+        size_t size = MIN(iov_size(&elem->in_sg[0], in_num),
+                          VIRTIO_BLK_ID_BYTES);
+        snprintf(elem->in_sg[0].iov_base, size, "%s", "vhost_user_blk_server");
+        req->in->status = VIRTIO_BLK_S_OK;
+        req->size = elem->in_sg[0].iov_len;
+        vu_block_req_complete(req);
+        break;
+    }
+    case VIRTIO_BLK_T_DISCARD:
+    case VIRTIO_BLK_T_WRITE_ZEROES: {
+        int rc;
+        rc = vu_block_discard_write_zeroes(req, &elem->out_sg[1],
+                                           out_num, type);
+        if (rc == 0) {
+            req->in->status = VIRTIO_BLK_S_OK;
+        } else {
+            req->in->status = VIRTIO_BLK_S_IOERR;
+        }
+        vu_block_req_complete(req);
+        break;
+    }
+    default:
+        req->in->status = VIRTIO_BLK_S_UNSUPP;
+        vu_block_req_complete(req);
+        break;
+    }
+
+    return 0;
+
+err:
+    free(elem);
+    g_free(req);
+    return -EINVAL;
+}
+
+
+static void vu_block_process_vq(VuDev *vu_dev, int idx)
+{
+    VuClient *client;
+    VuVirtq *vq;
+    int ret;
+
+    client = container_of(vu_dev, VuClient, parent);
+    assert(client);
+
+    vq = vu_get_queue(vu_dev, idx);
+    assert(vq);
+
+    while (1) {
+        ret = vu_block_virtio_process_req(client, vq);
+        if (ret) {
+            break;
+        }
+    }
+}
+
+static void vu_block_queue_set_started(VuDev *vu_dev, int idx, bool started)
+{
+    VuVirtq *vq;
+
+    assert(vu_dev);
+
+    vq = vu_get_queue(vu_dev, idx);
+    vu_set_queue_handler(vu_dev, vq, started ? vu_block_process_vq : NULL);
+}
+
+static uint64_t vu_block_get_features(VuDev *dev)
+{
+    uint64_t features;
+    VuClient *client = container_of(dev, VuClient, parent);
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
+    features = 1ull << VIRTIO_BLK_F_SIZE_MAX |
+               1ull << VIRTIO_BLK_F_SEG_MAX |
+               1ull << VIRTIO_BLK_F_TOPOLOGY |
+               1ull << VIRTIO_BLK_F_BLK_SIZE |
+               1ull << VIRTIO_BLK_F_FLUSH |
+               1ull << VIRTIO_BLK_F_DISCARD |
+               1ull << VIRTIO_BLK_F_WRITE_ZEROES |
+               1ull << VIRTIO_BLK_F_CONFIG_WCE |
+               1ull << VIRTIO_F_VERSION_1 |
+               1ull << VIRTIO_RING_F_INDIRECT_DESC |
+               1ull << VIRTIO_RING_F_EVENT_IDX |
+               1ull << VHOST_USER_F_PROTOCOL_FEATURES;
+
+    if (!vdev_blk->writable) {
+        features |= 1ull << VIRTIO_BLK_F_RO;
+    }
+
+    return features;
+}
+
+static uint64_t vu_block_get_protocol_features(VuDev *dev)
+{
+    return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
+           1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
+}
+
+static int
+vu_block_get_config(VuDev *vu_dev, uint8_t *config, uint32_t len)
+{
+    VuClient *client = container_of(vu_dev, VuClient, parent);
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
+    memcpy(config, &vdev_blk->blkcfg, len);
+
+    return 0;
+}
+
+static int
+vu_block_set_config(VuDev *vu_dev, const uint8_t *data,
+                    uint32_t offset, uint32_t size, uint32_t flags)
+{
+    VuClient *client = container_of(vu_dev, VuClient, parent);
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
+    uint8_t wce;
+
+    /* don't support live migration */
+    if (flags != VHOST_SET_CONFIG_TYPE_MASTER) {
+        return -EINVAL;
+    }
+
+
+    if (offset != offsetof(struct virtio_blk_config, wce) ||
+        size != 1) {
+        return -EINVAL;
+    }
+
+    wce = *data;
+    if (wce == vdev_blk->blkcfg.wce) {
+        /* Do nothing if the value is unchanged */
+        return 0;
+    }
+
+    vdev_blk->blkcfg.wce = wce;
+    blk_set_enable_write_cache(vdev_blk->backend, wce);
+    return 0;
+}
+
+
+/*
+ * When the client disconnects, it sends a VHOST_USER_NONE request
+ * and vu_process_message will simply call exit(), which causes the
+ * VM to exit abruptly.
+ * To avoid this issue, process the VHOST_USER_NONE request ahead
+ * of vu_process_message.
+ */
+static int vu_block_process_msg(VuDev *dev, VhostUserMsg *vmsg, int *do_reply)
+{
+    if (vmsg->request == VHOST_USER_NONE) {
+        dev->panic(dev, "disconnect");
+        return true;
+    }
+    return false;
+}
+
+
+static const VuDevIface vu_block_iface = {
+    .get_features          = vu_block_get_features,
+    .queue_set_started     = vu_block_queue_set_started,
+    .get_protocol_features = vu_block_get_protocol_features,
+    .get_config            = vu_block_get_config,
+    .set_config            = vu_block_set_config,
+    .process_msg           = vu_block_process_msg,
+};
+
+static void blk_aio_attached(AioContext *ctx, void *opaque)
+{
+    VuBlockDev *vub_dev = opaque;
+    aio_context_acquire(ctx);
+    change_vu_context(ctx, vub_dev->vu_server);
+    aio_context_release(ctx);
+}
+
+static void blk_aio_detach(void *opaque)
+{
+    VuBlockDev *vub_dev = opaque;
+    AioContext *ctx = vub_dev->vu_server->ctx;
+    aio_context_acquire(ctx);
+    change_vu_context(NULL, vub_dev->vu_server);
+    aio_context_release(ctx);
+}
+
+static void vu_block_free(VuBlockDev *vu_block_dev)
+{
+    if (!vu_block_dev) {
+        return;
+    }
+
+    if (vu_block_dev->backend) {
+        blk_remove_aio_context_notifier(vu_block_dev->backend, blk_aio_attached,
+                                        blk_aio_detach, vu_block_dev);
+    }
+
+    blk_unref(vu_block_dev->backend);
+
+    if (vu_block_dev->next.tqe_circ.tql_prev) {
+        /*
+         * If vu_block_dev->next.tqe_circ.tql_prev is NULL,
+         * vu_block_dev hasn't been inserted into the queue and
+         * vu_block_free was called by obj->instance_finalize.
+         */
+        QTAILQ_REMOVE(&vu_block_devs, vu_block_dev, next);
+    }
+}
+
+static void
+vu_block_initialize_config(BlockDriverState *bs,
+                           struct virtio_blk_config *config, uint32_t blk_size)
+{
+    config->capacity = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
+    config->blk_size = blk_size;
+    config->size_max = 0;
+    config->seg_max = 128 - 2;
+    config->min_io_size = 1;
+    config->opt_io_size = 1;
+    config->num_queues = VHOST_USER_BLK_MAX_QUEUES;
+    config->max_discard_sectors = 32768;
+    config->max_discard_seg = 1;
+    config->discard_sector_alignment = config->blk_size >> 9;
+    config->max_write_zeroes_sectors = 32768;
+    config->max_write_zeroes_seg = 1;
+}
+
+
+static VuBlockDev *vu_block_init(VuBlockDev *vu_block_device, Error **errp)
+{
+
+    BlockBackend *blk;
+    Error *local_error = NULL;
+    const char *node_name = vu_block_device->node_name;
+    bool writable = vu_block_device->writable;
+    /*
+     * Don't allow resize while the vhost user server is running,
+     * otherwise we don't care what happens with the node.
+     */
+    uint64_t perm = BLK_PERM_CONSISTENT_READ;
+    int ret;
+
+    AioContext *ctx;
+
+    BlockDriverState *bs = bdrv_lookup_bs(node_name, node_name, &local_error);
+
+    if (!bs) {
+        error_propagate(errp, local_error);
+        return NULL;
+    }
+
+    if (bdrv_is_read_only(bs)) {
+        writable = false;
+    }
+
+    if (writable) {
+        perm |= BLK_PERM_WRITE;
+    }
+
+    ctx = bdrv_get_aio_context(bs);
+    aio_context_acquire(ctx);
+    bdrv_invalidate_cache(bs, NULL);
+    aio_context_release(ctx);
+
+    blk = blk_new(bdrv_get_aio_context(bs), perm,
+                  BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
+                  BLK_PERM_WRITE | BLK_PERM_GRAPH_MOD);
+    ret = blk_insert_bs(blk, bs, errp);
+
+    if (ret < 0) {
+        goto fail;
+    }
+
+    blk_set_enable_write_cache(blk, false);
+
+    blk_set_allow_aio_context_change(blk, true);
+
+    vu_block_device->blkcfg.wce = 0;
+    vu_block_device->backend = blk;
+    if (!vu_block_device->blk_size) {
+        vu_block_device->blk_size = BDRV_SECTOR_SIZE;
+    }
+    vu_block_device->blkcfg.blk_size = vu_block_device->blk_size;
+    blk_set_guest_block_size(blk, vu_block_device->blk_size);
+    vu_block_initialize_config(bs, &vu_block_device->blkcfg,
+                                   vu_block_device->blk_size);
+    return vu_block_device;
+
+fail:
+    blk_unref(blk);
+    return NULL;
+}
+
+static void vhost_user_blk_server_free(VuBlockDev *vu_block_device)
+{
+    if (!vu_block_device) {
+        return;
+    }
+    vhost_user_server_stop(vu_block_device->vu_server);
+    vu_block_free(vu_block_device);
+
+}
+
+/*
+ * An exported drive can serve multiple clients simultaneously,
+ * so there is no need to export the same drive twice.
+ */
+static VuBlockDev *vu_block_dev_find(const char *node_name)
+{
+    VuBlockDev *vu_block_device;
+    QTAILQ_FOREACH(vu_block_device, &vu_block_devs, next) {
+        if (strcmp(node_name, vu_block_device->node_name) == 0) {
+            return vu_block_device;
+        }
+    }
+
+    return NULL;
+}
+
+
+static VuBlockDev
+*vu_block_dev_find_by_unix_socket(const char *unix_socket)
+{
+    VuBlockDev *vu_block_device;
+    QTAILQ_FOREACH(vu_block_device, &vu_block_devs, next) {
+        if (strcmp(unix_socket, vu_block_device->addr->u.q_unix.path) == 0) {
+            return vu_block_device;
+        }
+    }
+
+    return NULL;
+}
+
+
+static void device_panic_notifier(VuClient *client)
+{
+    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
+    if (vdev_blk->exit_when_panic) {
+        vdev_blk->vu_server->close = true;
+    }
+}
+
+static void vhost_user_blk_server_start(VuBlockDev *vu_block_device,
+                                        Error **errp)
+{
+
+    const char *name = vu_block_device->node_name;
+    SocketAddress *addr = vu_block_device->addr;
+    char *unix_socket = vu_block_device->addr->u.q_unix.path;
+
+    if (vu_block_dev_find(name)) {
+        error_setg(errp, "Vhost-user-blk server with node-name '%s' "
+                   "has already been started",
+                   name);
+        return;
+    }
+
+    if (vu_block_dev_find_by_unix_socket(unix_socket)) {
+        error_setg(errp, "Vhost-user-blk server with socket path '%s' "
+                   "has already been started", unix_socket);
+        return;
+    }
+
+    if (!vu_block_init(vu_block_device, errp)) {
+        return;
+    }
+
+
+    AioContext *ctx = bdrv_get_aio_context(blk_bs(vu_block_device->backend));
+    VuServer *vu_server = vhost_user_server_start(VHOST_USER_BLK_MAX_QUEUES,
+                                                  addr,
+                                                  ctx,
+                                                  &vu_block_device->vu_server,
+                                                  device_panic_notifier,
+                                                  &vu_block_iface,
+                                                  errp);
+
+    if (!vu_server) {
+        goto error;
+    }
+    vu_block_device->vu_server = vu_server;
+    QTAILQ_INSERT_TAIL(&vu_block_devs, vu_block_device, next);
+    blk_add_aio_context_notifier(vu_block_device->backend, blk_aio_attached,
+                                 blk_aio_detach, vu_block_device);
+    return;
+
+ error:
+    vu_block_free(vu_block_device);
+}
+
+static void vu_set_node_name(Object *obj, const char *value, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+
+    if (vus->node_name) {
+        error_setg(errp, "node-name property already set");
+        return;
+    }
+
+    vus->node_name = g_strdup(value);
+}
+
+static char *vu_get_node_name(Object *obj, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+    return g_strdup(vus->node_name);
+}
+
+
+static void vu_set_unix_socket(Object *obj, const char *value,
+                               Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+
+    if (vus->addr) {
+        error_setg(errp, "unix_socket property already set");
+        return;
+    }
+
+    SocketAddress *addr = g_new0(SocketAddress, 1);
+    addr->type = SOCKET_ADDRESS_TYPE_UNIX;
+    addr->u.q_unix.path = g_strdup(value);
+    vus->addr = addr;
+}
+
+static char *vu_get_unix_socket(Object *obj, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+    return g_strdup(vus->addr->u.q_unix.path);
+}
+
+static bool vu_get_block_writable(Object *obj, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+    return vus->writable;
+}
+
+static void vu_set_block_writable(Object *obj, bool value, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+
+    vus->writable = value;
+}
+
+static void vu_get_blk_size(Object *obj, Visitor *v, const char *name,
+                            void *opaque, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+    uint32_t value = vus->blk_size;
+
+    visit_type_uint32(v, name, &value, errp);
+}
+
+static void vu_set_blk_size(Object *obj, Visitor *v, const char *name,
+                            void *opaque, Error **errp)
+{
+    VuBlockDev *vus = VHOST_USER_BLK_SERVER(obj);
+
+    Error *local_err = NULL;
+    uint32_t value;
+
+    visit_type_uint32(v, name, &value, &local_err);
+    if (local_err) {
+        goto out;
+    }
+    if (value != BDRV_SECTOR_SIZE && value != 4096) {
+        error_setg(&local_err,
+                   "Property '%s.%s' can only take value 512 or 4096",
+                   object_get_typename(obj), name);
+        goto out;
+    }
+
+    vus->blk_size = value;
+
+out:
+    error_propagate(errp, local_err);
+}
+
+
+static void vhost_user_blk_server_instance_finalize(Object *obj)
+{
+    VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
+
+    vhost_user_blk_server_free(vub);
+}
+
+static void vhost_user_blk_server_complete(UserCreatable *obj, Error **errp)
+{
+    Error *local_error = NULL;
+    VuBlockDev *vub = VHOST_USER_BLK_SERVER(obj);
+
+    vhost_user_blk_server_start(vub, &local_error);
+
+    if (local_error) {
+        error_propagate(errp, local_error);
+        return;
+    }
+}
+
+static void vhost_user_blk_server_class_init(ObjectClass *klass,
+                                             void *class_data)
+{
+    UserCreatableClass *ucc = USER_CREATABLE_CLASS(klass);
+    ucc->complete = vhost_user_blk_server_complete;
+
+    object_class_property_add_bool(klass, "writable",
+                                   vu_get_block_writable,
+                                   vu_set_block_writable, NULL);
+
+    object_class_property_add_str(klass, "node-name",
+                                  vu_get_node_name,
+                                  vu_set_node_name, NULL);
+
+    object_class_property_add_str(klass, "unix-socket",
+                                  vu_get_unix_socket,
+                                  vu_set_unix_socket, NULL);
+
+    object_class_property_add(klass, "blk-size", "uint32",
+                              vu_get_blk_size, vu_set_blk_size,
+                              NULL, NULL, NULL);
+}
+
+static const TypeInfo vhost_user_blk_server_info = {
+    .name = TYPE_VHOST_USER_BLK_SERVER,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(VuBlockDev),
+    .instance_finalize = vhost_user_blk_server_instance_finalize,
+    .class_init = vhost_user_blk_server_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        {TYPE_USER_CREATABLE},
+        {}
+    },
+};
+
+static void vhost_user_blk_server_register_types(void)
+{
+    type_register_static(&vhost_user_blk_server_info);
+}
+
+type_init(vhost_user_blk_server_register_types)
diff --git a/backends/vhost-user-blk-server.h b/backends/vhost-user-blk-server.h
new file mode 100644
index 0000000000..03381d11f4
--- /dev/null
+++ b/backends/vhost-user-blk-server.h
@@ -0,0 +1,21 @@
+#ifndef VHOST_USER_BLK_SERVER_H
+#define VHOST_USER_BLK_SERVER_H
+
+#include "util/vhost-user-server.h"
+
+typedef struct VuBlockDev VuBlockDev;
+#define TYPE_VHOST_USER_BLK_SERVER "vhost-user-blk-server"
+#define VHOST_USER_BLK_SERVER(obj) \
+   OBJECT_CHECK(VuBlockDev, obj, TYPE_VHOST_USER_BLK_SERVER)
+
+/* vhost user block device */
+struct VuBlockDev {
+    Object parent_obj;
+    char *node_name;
+    SocketAddress *addr;
+    bool exit_when_panic;
+    AioContext *ctx;
+    VuServer *vu_server;
+    uint32_t blk_size;
+    BlockBackend *backend;
+    QIOChannelSocket *sioc;
+    QTAILQ_ENTRY(VuBlockDev) next;
+    struct virtio_blk_config blkcfg;
+    bool writable;
+};
+
+#endif /* VHOST_USER_BLK_SERVER_H */
diff --git a/vl.c b/vl.c
index 794f2e5733..0332ab70da 100644
--- a/vl.c
+++ b/vl.c
@@ -2538,6 +2538,10 @@ static bool object_create_initial(const char *type, QemuOpts *opts)
     }
 #endif
 
+    /* Reason: vhost-user-blk-server property "node-name" */
+    if (g_str_equal(type, "vhost-user-blk-server")) {
+        return false;
+    }
     /*
      * Reason: filter-* property "netdev" etc.
      */
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 4/5] a stand-alone tool to directly share disk image file via vhost-user protocol
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
                   ` (2 preceding siblings ...)
  2020-02-18  5:07 ` [PATCH v4 3/5] vhost-user block device backend server Coiby Xu
@ 2020-02-18  5:07 ` Coiby Xu
  2020-02-18  5:07 ` [PATCH v4 5/5] new qTest case to test the vhost-user-blk-server Coiby Xu
  2020-02-19 16:38 ` [PATCH v4 0/5] vhost-user block device backend implementation Stefan Hajnoczi
  5 siblings, 0 replies; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

vhost-user-blk could have served as the vhost-user backend, but it only
supports raw files and doesn't support the VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES operations on raw files (ioctl(fd, BLKDISCARD)
is only valid for real block devices).

In the future, Kevin's qemu-storage-daemon will replace this tool.
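For reference, a hypothetical invocation could look like the following (option
names are taken from the usage text in the patch below; the binary location and
image name are assumptions):

```shell
# Export a qcow2 image over a vhost-user socket, read-only.
# Note: the tool requires an absolute socket path.
./qemu-vu --socket=/tmp/vhost-user-blk.sock --format=qcow2 --read-only disk.qcow2
```

A vhost-user client such as QEMU could then connect to /tmp/vhost-user-blk.sock
with a vhost-user-blk-pci device, as shown in the cover letter.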

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 Makefile  |   4 +
 configure |   3 +
 qemu-vu.c | 252 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 259 insertions(+)
 create mode 100644 qemu-vu.c

diff --git a/Makefile b/Makefile
index b5a7377cb1..74fb109675 100644
--- a/Makefile
+++ b/Makefile
@@ -572,6 +572,10 @@ qemu-img.o: qemu-img-cmds.h
 
 qemu-img$(EXESUF): qemu-img.o $(authz-obj-y) $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
 qemu-nbd$(EXESUF): qemu-nbd.o $(authz-obj-y) $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
+
+ifdef CONFIG_LINUX
+qemu-vu$(EXESUF): qemu-vu.o backends/vhost-user-blk-server.o $(authz-obj-y) $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS) libvhost-user.a
+endif
 qemu-io$(EXESUF): qemu-io.o $(authz-obj-y) $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
 
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o $(COMMON_LDADDS)
diff --git a/configure b/configure
index 6f5d850949..2b846cecf0 100755
--- a/configure
+++ b/configure
@@ -6239,6 +6239,9 @@ if test "$want_tools" = "yes" ; then
   if [ "$linux" = "yes" -o "$bsd" = "yes" -o "$solaris" = "yes" ] ; then
     tools="qemu-nbd\$(EXESUF) $tools"
   fi
+  if [ "$linux" = "yes" ] ; then
+    tools="qemu-vu\$(EXESUF) $tools"
+  fi
   if [ "$ivshmem" = "yes" ]; then
     tools="ivshmem-client\$(EXESUF) ivshmem-server\$(EXESUF) $tools"
   fi
diff --git a/qemu-vu.c b/qemu-vu.c
new file mode 100644
index 0000000000..dd1032b205
--- /dev/null
+++ b/qemu-vu.c
@@ -0,0 +1,252 @@
+/*
+ *  Copyright (C) 2020  Coiby Xu <coiby.xu@gmail.com>
+ *
+ *  stand-alone vhost-user-blk device server backend
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <getopt.h>
+#include <libgen.h>
+#include "backends/vhost-user-blk-server.h"
+#include "block/block_int.h"
+#include "io/net-listener.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "qemu/config-file.h"
+#include "qemu/cutils.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "qemu/option.h"
+#include "qemu-common.h"
+#include "qemu-version.h"
+#include "qom/object_interfaces.h"
+#include "sysemu/block-backend.h"
+#define QEMU_VU_OPT_CACHE         256
+#define QEMU_VU_OPT_AIO           257
+#define QEMU_VU_OBJ_ID   "vu_disk"
+static QemuOptsList qemu_object_opts = {
+    .name = "object",
+    .implied_opt_name = "qom-type",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_object_opts.head),
+    .desc = {
+        { }
+    },
+};
+static char *srcpath;
+
+static void usage(const char *name)
+{
+    (printf) (
+"Usage: %s [OPTIONS] FILE\n"
+"  or:  %s -L [OPTIONS]\n"
+"QEMU Vhost-user Server Utility\n"
+"\n"
+"  -h, --help                display this help and exit\n"
+"  -V, --version             output version information and exit\n"
+"\n"
+"Connection properties:\n"
+"  -k, --socket=PATH         path to the unix socket\n"
+"\n"
+"General purpose options:\n"
"  -e, --exit-panic          When the panic callback is called, the program\n"
+"                            will exit. Useful for make check-qtest.\n"
+"\n"
+"Block device options:\n"
+"  -f, --format=FORMAT       set image format (raw, qcow2, ...)\n"
+"  -r, --read-only           export read-only\n"
+"  -n, --nocache             disable host cache\n"
+"      --cache=MODE          set cache mode (none, writeback, ...)\n"
+"      --aio=MODE            set AIO mode (native or threads)\n"
+"\n"
+QEMU_HELP_BOTTOM "\n"
+    , name, name);
+}
+
+static void version(const char *name)
+{
+    printf(
+"%s " QEMU_FULL_VERSION "\n"
+"Written by Coiby Xu, based on qemu-nbd by Anthony Liguori\n"
+"\n"
+QEMU_COPYRIGHT "\n"
+"This is free software; see the source for copying conditions.  There is NO\n"
+"warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n"
+    , name);
+}
+
+static VuBlockDev *vu_block_device;
+
+static void vus_shutdown(void)
+{
+
+    Error *local_err = NULL;
+    job_cancel_sync_all();
+    bdrv_close_all();
+    user_creatable_del(QEMU_VU_OBJ_ID, &local_err);
+}
+
+int main(int argc, char **argv)
+{
+    BlockBackend *blk;
+    BlockDriverState *bs;
+    bool readonly = false;
+    char *sockpath = NULL;
+    const char *sopt = "hVrnvek:f:";
+    struct option lopt[] = {
+        { "help", no_argument, NULL, 'h' },
+        { "version", no_argument, NULL, 'V' },
+        { "exit-panic", no_argument, NULL, 'e' },
+        { "socket", required_argument, NULL, 'k' },
+        { "read-only", no_argument, NULL, 'r' },
+        { "nocache", no_argument, NULL, 'n' },
+        { "cache", required_argument, NULL, QEMU_VU_OPT_CACHE },
+        { "aio", required_argument, NULL, QEMU_VU_OPT_AIO },
+        { "format", required_argument, NULL, 'f' },
+        { NULL, 0, NULL, 0 }
+    };
+    int ch;
+    int opt_ind = 0;
+    int flags = BDRV_O_RDWR;
+    bool seen_cache = false;
+    bool seen_aio = false;
+    const char *fmt = NULL;
+    Error *local_err = NULL;
+    QDict *options = NULL;
+    bool writethrough = true;
+    bool exit_when_panic = false;
+
+    error_init(argv[0]);
+
+    module_call_init(MODULE_INIT_QOM);
+    qemu_init_exec_dir(argv[0]);
+
+    while ((ch = getopt_long(argc, argv, sopt, lopt, &opt_ind)) != -1) {
+        switch (ch) {
+        case 'e':
+            exit_when_panic = true;
+            break;
+        case 'n':
+            optarg = (char *) "none";
+            /* fallthrough */
+        case QEMU_VU_OPT_CACHE:
+            if (seen_cache) {
+                error_report("-n and --cache can only be specified once");
+                exit(EXIT_FAILURE);
+            }
+            seen_cache = true;
+            if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) == -1) {
+                error_report("Invalid cache mode `%s'", optarg);
+                exit(EXIT_FAILURE);
+            }
+            break;
+        case QEMU_VU_OPT_AIO:
+            if (seen_aio) {
+                error_report("--aio can only be specified once");
+                exit(EXIT_FAILURE);
+            }
+            seen_aio = true;
+            if (!strcmp(optarg, "native")) {
+                flags |= BDRV_O_NATIVE_AIO;
+            } else if (!strcmp(optarg, "threads")) {
+                /* this is the default */
+            } else {
+               error_report("invalid aio mode `%s'", optarg);
+               exit(EXIT_FAILURE);
+            }
+            break;
+        case 'r':
+            readonly = true;
+            flags &= ~BDRV_O_RDWR;
+            break;
+        case 'k':
+            sockpath = optarg;
+            if (sockpath[0] != '/') {
+                error_report("socket path must be absolute");
+                exit(EXIT_FAILURE);
+            }
+            break;
+        case 'f':
+            fmt = optarg;
+            break;
+        case 'V':
+            version(argv[0]);
+            exit(0);
+            break;
+        case 'h':
+            usage(argv[0]);
+            exit(0);
+            break;
+        case '?':
+            error_report("Try `%s --help' for more information.", argv[0]);
+            exit(EXIT_FAILURE);
+        }
+    }
+
+    if ((argc - optind) != 1) {
+        error_report("Invalid number of arguments");
+        error_printf("Try `%s --help' for more information.\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+    if (qemu_init_main_loop(&local_err)) {
+        error_report_err(local_err);
+        exit(EXIT_FAILURE);
+    }
+    bdrv_init();
+
+    srcpath = argv[optind];
+    if (fmt) {
+        options = qdict_new();
+        qdict_put_str(options, "driver", fmt);
+    }
+    blk = blk_new_open(srcpath, NULL, options, flags, &local_err);
+
+    if (!blk) {
+        error_reportf_err(local_err, "Failed to blk_new_open '%s': ",
+                          argv[optind]);
+        exit(EXIT_FAILURE);
+    }
+    bs = blk_bs(blk);
+
+    char buf[300];
+    snprintf(buf, 300, "%s,id=%s,node-name=%s,unix-socket=%s,writable=%s",
+             TYPE_VHOST_USER_BLK_SERVER, QEMU_VU_OBJ_ID, bdrv_get_node_name(bs),
+             sockpath, !readonly ? "on" : "off");
+    /* While calling user_creatable_del, 'object' group is required */
+    qemu_add_opts(&qemu_object_opts);
+    QemuOpts *opts = qemu_opts_parse(&qemu_object_opts, buf, true, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        goto error;
+    }
+
+    Object *obj = user_creatable_add_opts(opts, &local_err);
+
+    if (local_err) {
+        error_report_err(local_err);
+        goto error;
+    }
+
+    vu_block_device = VHOST_USER_BLK_SERVER(obj);
+    vu_block_device->exit_when_panic = exit_when_panic;
+
+    do {
+        main_loop_wait(false);
+    } while (!vu_block_device->exit_when_panic || !vu_block_device->vu_server->close);
+
+ error:
+    vus_shutdown();
+    exit(EXIT_SUCCESS);
+}
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 5/5] new qTest case to test the vhost-user-blk-server
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
                   ` (3 preceding siblings ...)
  2020-02-18  5:07 ` [PATCH v4 4/5] a standone-alone tool to directly share disk image file via vhost-user protocol Coiby Xu
@ 2020-02-18  5:07 ` Coiby Xu
  2020-02-19 16:38 ` [PATCH v4 0/5] vhost-user block device backend implementation Stefan Hajnoczi
  5 siblings, 0 replies; 22+ messages in thread
From: Coiby Xu @ 2020-02-18  5:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, bharatlkmlkvm, Coiby Xu, stefanha

This test case contains the same tests as tests/virtio-blk-test.c, except
for the tests that rely on block_resize.

Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
---
 tests/Makefile.include              |   3 +-
 tests/qtest/Makefile.include        |   2 +
 tests/qtest/libqos/vhost-user-blk.c | 126 +++++
 tests/qtest/libqos/vhost-user-blk.h |  44 ++
 tests/qtest/vhost-user-blk-test.c   | 694 ++++++++++++++++++++++++++++
 5 files changed, 868 insertions(+), 1 deletion(-)
 create mode 100644 tests/qtest/libqos/vhost-user-blk.c
 create mode 100644 tests/qtest/libqos/vhost-user-blk.h
 create mode 100644 tests/qtest/vhost-user-blk-test.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 2f1cafed72..4b8637b5d4 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -629,7 +629,8 @@ endef
 $(patsubst %, check-qtest-%, $(QTEST_TARGETS)): check-qtest-%: %-softmmu/all $(check-qtest-y)
 	$(call do_test_human,$(check-qtest-$*-y:%=tests/qtest/%$(EXESUF)) $(check-qtest-generic-y:%=tests/qtest/%$(EXESUF)), \
 	  QTEST_QEMU_BINARY=$*-softmmu/qemu-system-$* \
-	  QTEST_QEMU_IMG=qemu-img$(EXESUF))
+	  QTEST_QEMU_IMG=./qemu-img$(EXESUF) \
+	  QTEST_QEMU_VU_BINARY=./qemu-vu$(EXESUF))
 
 check-unit: $(check-unit-y)
 	$(call do_test_human, $^)
diff --git a/tests/qtest/Makefile.include b/tests/qtest/Makefile.include
index eb0f23b108..f587fe9f4d 100644
--- a/tests/qtest/Makefile.include
+++ b/tests/qtest/Makefile.include
@@ -182,6 +182,7 @@ qos-test-obj-y += tests/qtest/libqos/virtio.o
 qos-test-obj-$(CONFIG_VIRTFS) += tests/qtest/libqos/virtio-9p.o
 qos-test-obj-y += tests/qtest/libqos/virtio-balloon.o
 qos-test-obj-y += tests/qtest/libqos/virtio-blk.o
+qos-test-obj-$(CONFIG_LINUX) += tests/qtest/libqos/vhost-user-blk.o
 qos-test-obj-y += tests/qtest/libqos/virtio-mmio.o
 qos-test-obj-y += tests/qtest/libqos/virtio-net.o
 qos-test-obj-y += tests/qtest/libqos/virtio-pci.o
@@ -224,6 +225,7 @@ qos-test-obj-$(CONFIG_VHOST_NET_USER) += tests/qtest/vhost-user-test.o $(chardev
 qos-test-obj-y += tests/qtest/virtio-test.o
 qos-test-obj-$(CONFIG_VIRTFS) += tests/qtest/virtio-9p-test.o
 qos-test-obj-y += tests/qtest/virtio-blk-test.o
+qos-test-obj-$(CONFIG_LINUX) += tests/qtest/vhost-user-blk-test.o
 qos-test-obj-y += tests/qtest/virtio-net-test.o
 qos-test-obj-y += tests/qtest/virtio-rng-test.o
 qos-test-obj-y += tests/qtest/virtio-scsi-test.o
diff --git a/tests/qtest/libqos/vhost-user-blk.c b/tests/qtest/libqos/vhost-user-blk.c
new file mode 100644
index 0000000000..ec46b7ddb4
--- /dev/null
+++ b/tests/qtest/libqos/vhost-user-blk.c
@@ -0,0 +1,126 @@
+/*
+ * libqos driver framework
+ *
+ * Copyright (c) 2018 Emanuele Giuseppe Esposito <e.emanuelegiuseppe@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License version 2 as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu/module.h"
+#include "standard-headers/linux/virtio_blk.h"
+#include "libqos/qgraph.h"
+#include "libqos/vhost-user-blk.h"
+
+#define PCI_SLOT                0x04
+#define PCI_FN                  0x00
+
+/* virtio-blk-device */
+static void *qvhost_user_blk_get_driver(QVhostUserBlk *v_blk,
+                                    const char *interface)
+{
+    if (!g_strcmp0(interface, "vhost-user-blk")) {
+        return v_blk;
+    }
+    if (!g_strcmp0(interface, "virtio")) {
+        return v_blk->vdev;
+    }
+
+    fprintf(stderr, "%s not present in vhost-user-blk-device\n", interface);
+    g_assert_not_reached();
+}
+
+static void *qvhost_user_blk_device_get_driver(void *object,
+                                           const char *interface)
+{
+    QVhostUserBlkDevice *v_blk = object;
+    return qvhost_user_blk_get_driver(&v_blk->blk, interface);
+}
+
+static void *vhost_user_blk_device_create(void *virtio_dev,
+                                      QGuestAllocator *t_alloc,
+                                      void *addr)
+{
+    QVhostUserBlkDevice *vhost_user_blk = g_new0(QVhostUserBlkDevice, 1);
+    QVhostUserBlk *interface = &vhost_user_blk->blk;
+
+    interface->vdev = virtio_dev;
+
+    vhost_user_blk->obj.get_driver = qvhost_user_blk_device_get_driver;
+
+    return &vhost_user_blk->obj;
+}
+
+/* virtio-blk-pci */
+static void *qvhost_user_blk_pci_get_driver(void *object, const char *interface)
+{
+    QVhostUserBlkPCI *v_blk = object;
+    if (!g_strcmp0(interface, "pci-device")) {
+        return v_blk->pci_vdev.pdev;
+    }
+    return qvhost_user_blk_get_driver(&v_blk->blk, interface);
+}
+
+static void *vhost_user_blk_pci_create(void *pci_bus, QGuestAllocator *t_alloc,
+                                      void *addr)
+{
+    QVhostUserBlkPCI *vhost_user_blk = g_new0(QVhostUserBlkPCI, 1);
+    QVhostUserBlk *interface = &vhost_user_blk->blk;
+    QOSGraphObject *obj = &vhost_user_blk->pci_vdev.obj;
+
+    virtio_pci_init(&vhost_user_blk->pci_vdev, pci_bus, addr);
+    interface->vdev = &vhost_user_blk->pci_vdev.vdev;
+
+    g_assert_cmphex(interface->vdev->device_type, ==, VIRTIO_ID_BLOCK);
+
+    obj->get_driver = qvhost_user_blk_pci_get_driver;
+
+    return obj;
+}
+
+static void vhost_user_blk_register_nodes(void)
+{
+    /*
+     * FIXME: every test using these two nodes needs to setup a
+     * -drive,id=drive0 otherwise QEMU is not going to start.
+     * Therefore, we do not include "produces" edge for virtio
+     * and pci-device yet.
+     */
+
+    char *arg = g_strdup_printf("id=drv0,chardev=char1,addr=%x.%x",
+                                PCI_SLOT, PCI_FN);
+
+    QPCIAddress addr = {
+        .devfn = QPCI_DEVFN(PCI_SLOT, PCI_FN),
+    };
+
+    QOSGraphEdgeOptions opts = { };
+
+    /* virtio-blk-device */
+    /** opts.extra_device_opts = "drive=drive0"; */
+    qos_node_create_driver("vhost-user-blk-device", vhost_user_blk_device_create);
+    qos_node_consumes("vhost-user-blk-device", "virtio-bus", &opts);
+    qos_node_produces("vhost-user-blk-device", "vhost-user-blk");
+
+    /* virtio-blk-pci */
+    opts.extra_device_opts = arg;
+    add_qpci_address(&opts, &addr);
+    qos_node_create_driver("vhost-user-blk-pci", vhost_user_blk_pci_create);
+    qos_node_consumes("vhost-user-blk-pci", "pci-bus", &opts);
+    qos_node_produces("vhost-user-blk-pci", "vhost-user-blk");
+
+    g_free(arg);
+}
+
+libqos_init(vhost_user_blk_register_nodes);
diff --git a/tests/qtest/libqos/vhost-user-blk.h b/tests/qtest/libqos/vhost-user-blk.h
new file mode 100644
index 0000000000..ef4ef09cca
--- /dev/null
+++ b/tests/qtest/libqos/vhost-user-blk.h
@@ -0,0 +1,44 @@
+/*
+ * libqos driver framework
+ *
+ * Copyright (c) 2018 Emanuele Giuseppe Esposito <e.emanuelegiuseppe@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License version 2 as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#ifndef TESTS_LIBQOS_VHOST_USER_BLK_H
+#define TESTS_LIBQOS_VHOST_USER_BLK_H
+
+#include "libqos/qgraph.h"
+#include "libqos/virtio.h"
+#include "libqos/virtio-pci.h"
+
+typedef struct QVhostUserBlk QVhostUserBlk;
+typedef struct QVhostUserBlkPCI QVhostUserBlkPCI;
+typedef struct QVhostUserBlkDevice QVhostUserBlkDevice;
+
+struct QVhostUserBlk {
+    QVirtioDevice *vdev;
+};
+
+struct QVhostUserBlkPCI {
+    QVirtioPCIDevice pci_vdev;
+    QVhostUserBlk blk;
+};
+
+struct QVhostUserBlkDevice {
+    QOSGraphObject obj;
+    QVhostUserBlk blk;
+};
+
+#endif
diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
new file mode 100644
index 0000000000..528f034b55
--- /dev/null
+++ b/tests/qtest/vhost-user-blk-test.c
@@ -0,0 +1,694 @@
+/*
+ * QTest testcase for VirtIO Block Device
+ *
+ * Copyright (c) 2014 SUSE LINUX Products GmbH
+ * Copyright (c) 2014 Marc Marí
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest-single.h"
+#include "qemu/bswap.h"
+#include "qemu/module.h"
+#include "standard-headers/linux/virtio_blk.h"
+#include "standard-headers/linux/virtio_pci.h"
+#include "libqos/qgraph.h"
+#include "libqos/vhost-user-blk.h"
+#include "libqos/libqos-pc.h"
+
+/* TODO actually test the results and get rid of this */
+#define qmp_discard_response(...) qobject_unref(qmp(__VA_ARGS__))
+
+#define TEST_IMAGE_SIZE         (64 * 1024 * 1024)
+#define QVIRTIO_BLK_TIMEOUT_US  (30 * 1000 * 1000)
+#define PCI_SLOT_HP             0x06
+
+typedef struct QVirtioBlkReq {
+    uint32_t type;
+    uint32_t ioprio;
+    uint64_t sector;
+    char *data;
+    uint8_t status;
+} QVirtioBlkReq;
+
+#ifdef HOST_WORDS_BIGENDIAN
+static const bool host_is_big_endian = true;
+#else
+static const bool host_is_big_endian; /* false */
+#endif
+
+static inline void virtio_blk_fix_request(QVirtioDevice *d, QVirtioBlkReq *req)
+{
+    if (qvirtio_is_big_endian(d) != host_is_big_endian) {
+        req->type = bswap32(req->type);
+        req->ioprio = bswap32(req->ioprio);
+        req->sector = bswap64(req->sector);
+    }
+}
+
+static inline void virtio_blk_fix_dwz_hdr(QVirtioDevice *d,
+    struct virtio_blk_discard_write_zeroes *dwz_hdr)
+{
+    if (qvirtio_is_big_endian(d) != host_is_big_endian) {
+        dwz_hdr->sector = bswap64(dwz_hdr->sector);
+        dwz_hdr->num_sectors = bswap32(dwz_hdr->num_sectors);
+        dwz_hdr->flags = bswap32(dwz_hdr->flags);
+    }
+}
+
+static uint64_t virtio_blk_request(QGuestAllocator *alloc, QVirtioDevice *d,
+                                   QVirtioBlkReq *req, uint64_t data_size)
+{
+    uint64_t addr;
+    uint8_t status = 0xFF;
+
+    switch (req->type) {
+    case VIRTIO_BLK_T_IN:
+    case VIRTIO_BLK_T_OUT:
+        g_assert_cmpuint(data_size % 512, ==, 0);
+        break;
+    case VIRTIO_BLK_T_DISCARD:
+    case VIRTIO_BLK_T_WRITE_ZEROES:
+        g_assert_cmpuint(data_size %
+                         sizeof(struct virtio_blk_discard_write_zeroes), ==, 0);
+        break;
+    default:
+        g_assert_cmpuint(data_size, ==, 0);
+    }
+
+    addr = guest_alloc(alloc, sizeof(*req) + data_size);
+
+    virtio_blk_fix_request(d, req);
+
+    memwrite(addr, req, 16);
+    memwrite(addr + 16, req->data, data_size);
+    memwrite(addr + 16 + data_size, &status, sizeof(status));
+
+    return addr;
+}
+
+/* Returns the request virtqueue so the caller can perform further tests */
+static QVirtQueue *test_basic(QVirtioDevice *dev, QGuestAllocator *alloc)
+{
+    QVirtioBlkReq req;
+    uint64_t req_addr;
+    uint64_t capacity;
+    uint64_t features;
+    uint32_t free_head;
+    uint8_t status;
+    char *data;
+    QTestState *qts = global_qtest;
+    QVirtQueue *vq;
+
+    features = qvirtio_get_features(dev);
+    features = features & ~(QVIRTIO_F_BAD_FEATURE |
+                    (1u << VIRTIO_RING_F_INDIRECT_DESC) |
+                    (1u << VIRTIO_RING_F_EVENT_IDX) |
+                    (1u << VIRTIO_BLK_F_SCSI));
+    qvirtio_set_features(dev, features);
+
+    capacity = qvirtio_config_readq(dev, 0);
+    g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
+
+    vq = qvirtqueue_setup(dev, alloc, 0);
+
+    qvirtio_set_driver_ok(dev);
+
+    /* Write and read with 3 descriptor layout */
+    /* Write request */
+    req.type = VIRTIO_BLK_T_OUT;
+    req.ioprio = 1;
+    req.sector = 0;
+    req.data = g_malloc0(512);
+    strcpy(req.data, "TEST");
+
+    req_addr = virtio_blk_request(alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+    status = readb(req_addr + 528);
+    g_assert_cmpint(status, ==, 0);
+
+    guest_free(alloc, req_addr);
+
+    /* Read request */
+    req.type = VIRTIO_BLK_T_IN;
+    req.ioprio = 1;
+    req.sector = 0;
+    req.data = g_malloc0(512);
+
+    req_addr = virtio_blk_request(alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
+    qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+    status = readb(req_addr + 528);
+    g_assert_cmpint(status, ==, 0);
+
+    data = g_malloc0(512);
+    memread(req_addr + 16, data, 512);
+    g_assert_cmpstr(data, ==, "TEST");
+    g_free(data);
+
+    guest_free(alloc, req_addr);
+
+    if (features & (1u << VIRTIO_BLK_F_WRITE_ZEROES)) {
+        struct virtio_blk_discard_write_zeroes dwz_hdr;
+        void *expected;
+
+        /*
+         * WRITE_ZEROES request on the same sector of previous test where
+         * we wrote "TEST".
+         */
+        req.type = VIRTIO_BLK_T_WRITE_ZEROES;
+        req.data = (char *) &dwz_hdr;
+        dwz_hdr.sector = 0;
+        dwz_hdr.num_sectors = 1;
+        dwz_hdr.flags = 0;
+
+        virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
+
+        req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
+
+        free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr), 1, true,
+                       false);
+
+        qvirtqueue_kick(qts, dev, vq, free_head);
+
+        qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                               QVIRTIO_BLK_TIMEOUT_US);
+        status = readb(req_addr + 16 + sizeof(dwz_hdr));
+        g_assert_cmpint(status, ==, 0);
+
+        guest_free(alloc, req_addr);
+
+        /* Read request to check if the sector contains all zeroes */
+        req.type = VIRTIO_BLK_T_IN;
+        req.ioprio = 1;
+        req.sector = 0;
+        req.data = g_malloc0(512);
+
+        req_addr = virtio_blk_request(alloc, dev, &req, 512);
+
+        g_free(req.data);
+
+        free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
+        qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+
+        qvirtqueue_kick(qts, dev, vq, free_head);
+
+        qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                               QVIRTIO_BLK_TIMEOUT_US);
+        status = readb(req_addr + 528);
+        g_assert_cmpint(status, ==, 0);
+
+        data = g_malloc(512);
+        expected = g_malloc0(512);
+        memread(req_addr + 16, data, 512);
+        g_assert_cmpmem(data, 512, expected, 512);
+        g_free(expected);
+        g_free(data);
+
+        guest_free(alloc, req_addr);
+    }
+
+    if (features & (1u << VIRTIO_BLK_F_DISCARD)) {
+        struct virtio_blk_discard_write_zeroes dwz_hdr;
+
+        req.type = VIRTIO_BLK_T_DISCARD;
+        req.data = (char *) &dwz_hdr;
+        dwz_hdr.sector = 0;
+        dwz_hdr.num_sectors = 1;
+        dwz_hdr.flags = 0;
+
+        virtio_blk_fix_dwz_hdr(dev, &dwz_hdr);
+
+        req_addr = virtio_blk_request(alloc, dev, &req, sizeof(dwz_hdr));
+
+        free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16, sizeof(dwz_hdr), false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16 + sizeof(dwz_hdr),
+                       1, true, false);
+
+        qvirtqueue_kick(qts, dev, vq, free_head);
+
+        qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                               QVIRTIO_BLK_TIMEOUT_US);
+        status = readb(req_addr + 16 + sizeof(dwz_hdr));
+        g_assert_cmpint(status, ==, 0);
+
+        guest_free(alloc, req_addr);
+    }
+
+    if (features & (1u << VIRTIO_F_ANY_LAYOUT)) {
+        /* Write and read with 2 descriptor layout */
+        /* Write request */
+        req.type = VIRTIO_BLK_T_OUT;
+        req.ioprio = 1;
+        req.sector = 1;
+        req.data = g_malloc0(512);
+        strcpy(req.data, "TEST");
+
+        req_addr = virtio_blk_request(alloc, dev, &req, 512);
+
+        g_free(req.data);
+
+        free_head = qvirtqueue_add(qts, vq, req_addr, 528, false, true);
+        qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+        qvirtqueue_kick(qts, dev, vq, free_head);
+
+        qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                               QVIRTIO_BLK_TIMEOUT_US);
+        status = readb(req_addr + 528);
+        g_assert_cmpint(status, ==, 0);
+
+        guest_free(alloc, req_addr);
+
+        /* Read request */
+        req.type = VIRTIO_BLK_T_IN;
+        req.ioprio = 1;
+        req.sector = 1;
+        req.data = g_malloc0(512);
+
+        req_addr = virtio_blk_request(alloc, dev, &req, 512);
+
+        g_free(req.data);
+
+        free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+        qvirtqueue_add(qts, vq, req_addr + 16, 513, true, false);
+
+        qvirtqueue_kick(qts, dev, vq, free_head);
+
+        qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                               QVIRTIO_BLK_TIMEOUT_US);
+        status = readb(req_addr + 528);
+        g_assert_cmpint(status, ==, 0);
+
+        data = g_malloc0(512);
+        memread(req_addr + 16, data, 512);
+        g_assert_cmpstr(data, ==, "TEST");
+        g_free(data);
+
+        guest_free(alloc, req_addr);
+    }
+
+    return vq;
+}
+
+static void basic(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QVhostUserBlk *blk_if = obj;
+    QVirtQueue *vq;
+
+    vq = test_basic(blk_if->vdev, t_alloc);
+    qvirtqueue_cleanup(blk_if->vdev->bus, vq, t_alloc);
+}
+
+static void indirect(void *obj, void *u_data, QGuestAllocator *t_alloc)
+{
+    QVirtQueue *vq;
+    QVhostUserBlk *blk_if = obj;
+    QVirtioDevice *dev = blk_if->vdev;
+    QVirtioBlkReq req;
+    QVRingIndirectDesc *indirect;
+    uint64_t req_addr;
+    uint64_t capacity;
+    uint64_t features;
+    uint32_t free_head;
+    uint8_t status;
+    char *data;
+    QTestState *qts = global_qtest;
+
+    features = qvirtio_get_features(dev);
+    g_assert_cmphex(features & (1u << VIRTIO_RING_F_INDIRECT_DESC), !=, 0);
+    features = features & ~(QVIRTIO_F_BAD_FEATURE |
+                            (1u << VIRTIO_RING_F_EVENT_IDX) |
+                            (1u << VIRTIO_BLK_F_SCSI));
+    qvirtio_set_features(dev, features);
+
+    capacity = qvirtio_config_readq(dev, 0);
+    g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
+
+    vq = qvirtqueue_setup(dev, t_alloc, 0);
+    qvirtio_set_driver_ok(dev);
+
+    /* Write request */
+    req.type = VIRTIO_BLK_T_OUT;
+    req.ioprio = 1;
+    req.sector = 0;
+    req.data = g_malloc0(512);
+    strcpy(req.data, "TEST");
+
+    req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    indirect = qvring_indirect_desc_setup(qts, dev, t_alloc, 2);
+    qvring_indirect_desc_add(dev, qts, indirect, req_addr, 528, false);
+    qvring_indirect_desc_add(dev, qts, indirect, req_addr + 528, 1, true);
+    free_head = qvirtqueue_add_indirect(qts, vq, indirect);
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+    status = readb(req_addr + 528);
+    g_assert_cmpint(status, ==, 0);
+
+    g_free(indirect);
+    guest_free(t_alloc, req_addr);
+
+    /* Read request */
+    req.type = VIRTIO_BLK_T_IN;
+    req.ioprio = 1;
+    req.sector = 0;
+    req.data = g_malloc0(512);
+    strcpy(req.data, "TEST");
+
+    req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    indirect = qvring_indirect_desc_setup(qts, dev, t_alloc, 2);
+    qvring_indirect_desc_add(dev, qts, indirect, req_addr, 16, false);
+    qvring_indirect_desc_add(dev, qts, indirect, req_addr + 16, 513, true);
+    free_head = qvirtqueue_add_indirect(qts, vq, indirect);
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+    status = readb(req_addr + 528);
+    g_assert_cmpint(status, ==, 0);
+
+    data = g_malloc0(512);
+    memread(req_addr + 16, data, 512);
+    g_assert_cmpstr(data, ==, "TEST");
+    g_free(data);
+
+    g_free(indirect);
+    guest_free(t_alloc, req_addr);
+    qvirtqueue_cleanup(dev->bus, vq, t_alloc);
+}
+
+static void idx(void *obj, void *u_data, QGuestAllocator *t_alloc)
+{
+    QVirtQueue *vq;
+    QVhostUserBlkPCI *blk = obj;
+    QVirtioPCIDevice *pdev = &blk->pci_vdev;
+    QVirtioDevice *dev = &pdev->vdev;
+    QVirtioBlkReq req;
+    uint64_t req_addr;
+    uint64_t capacity;
+    uint64_t features;
+    uint32_t free_head;
+    uint32_t write_head;
+    uint32_t desc_idx;
+    uint8_t status;
+    char *data;
+    QOSGraphObject *blk_object = obj;
+    QPCIDevice *pci_dev = blk_object->get_driver(blk_object, "pci-device");
+    QTestState *qts = global_qtest;
+
+    if (qpci_check_buggy_msi(pci_dev)) {
+        return;
+    }
+
+    qpci_msix_enable(pdev->pdev);
+    qvirtio_pci_set_msix_configuration_vector(pdev, t_alloc, 0);
+
+    features = qvirtio_get_features(dev);
+    features = features & ~(QVIRTIO_F_BAD_FEATURE |
+                            (1u << VIRTIO_RING_F_INDIRECT_DESC) |
+                            (1u << VIRTIO_F_NOTIFY_ON_EMPTY) |
+                            (1u << VIRTIO_BLK_F_SCSI));
+    qvirtio_set_features(dev, features);
+
+    capacity = qvirtio_config_readq(dev, 0);
+    g_assert_cmpint(capacity, ==, TEST_IMAGE_SIZE / 512);
+
+    vq = qvirtqueue_setup(dev, t_alloc, 0);
+    qvirtqueue_pci_msix_setup(pdev, (QVirtQueuePCI *)vq, t_alloc, 1);
+
+    qvirtio_set_driver_ok(dev);
+
+    /* Write request */
+    req.type = VIRTIO_BLK_T_OUT;
+    req.ioprio = 1;
+    req.sector = 0;
+    req.data = g_malloc0(512);
+    strcpy(req.data, "TEST");
+
+    req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    qvirtio_wait_used_elem(qts, dev, vq, free_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+
+    /* Write request */
+    req.type = VIRTIO_BLK_T_OUT;
+    req.ioprio = 1;
+    req.sector = 1;
+    req.data = g_malloc0(512);
+    strcpy(req.data, "TEST");
+
+    req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    /* Notify after processing the third request */
+    qvirtqueue_set_used_event(qts, vq, 2);
+    free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 16, 512, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+    qvirtqueue_kick(qts, dev, vq, free_head);
+    write_head = free_head;
+
+    /* No notification expected */
+    status = qvirtio_wait_status_byte_no_isr(qts, dev,
+                                             vq, req_addr + 528,
+                                             QVIRTIO_BLK_TIMEOUT_US);
+    g_assert_cmpint(status, ==, 0);
+
+    guest_free(t_alloc, req_addr);
+
+    /* Read request */
+    req.type = VIRTIO_BLK_T_IN;
+    req.ioprio = 1;
+    req.sector = 1;
+    req.data = g_malloc0(512);
+
+    req_addr = virtio_blk_request(t_alloc, dev, &req, 512);
+
+    g_free(req.data);
+
+    free_head = qvirtqueue_add(qts, vq, req_addr, 16, false, true);
+    qvirtqueue_add(qts, vq, req_addr + 16, 512, true, true);
+    qvirtqueue_add(qts, vq, req_addr + 528, 1, true, false);
+
+    qvirtqueue_kick(qts, dev, vq, free_head);
+
+    /* We get just one notification for both requests */
+    qvirtio_wait_used_elem(qts, dev, vq, write_head, NULL,
+                           QVIRTIO_BLK_TIMEOUT_US);
+    g_assert(qvirtqueue_get_buf(qts, vq, &desc_idx, NULL));
+    g_assert_cmpint(desc_idx, ==, free_head);
+
+    status = readb(req_addr + 528);
+    g_assert_cmpint(status, ==, 0);
+
+    data = g_malloc0(512);
+    memread(req_addr + 16, data, 512);
+    g_assert_cmpstr(data, ==, "TEST");
+    g_free(data);
+
+    guest_free(t_alloc, req_addr);
+
+    /* End test */
+    qpci_msix_disable(pdev->pdev);
+
+    qvirtqueue_cleanup(dev->bus, vq, t_alloc);
+}
+
+static void pci_hotplug(void *obj, void *data, QGuestAllocator *t_alloc)
+{
+    QVirtioPCIDevice *dev1 = obj;
+    QVirtioPCIDevice *dev;
+    QTestState *qts = dev1->pdev->bus->qts;
+
+    /* plug secondary disk */
+    qtest_qmp_device_add(qts, "vhost-user-blk-pci", "drv1",
+                         "{'addr': %s, 'chardev': 'char2'}",
+                         stringify(PCI_SLOT_HP) ".0");
+
+    dev = virtio_pci_new(dev1->pdev->bus,
+                         &(QPCIAddress) { .devfn = QPCI_DEVFN(PCI_SLOT_HP, 0)
+                                        });
+    g_assert_nonnull(dev);
+    g_assert_cmpint(dev->vdev.device_type, ==, VIRTIO_ID_BLOCK);
+    qvirtio_pci_device_disable(dev);
+    qos_object_destroy((QOSGraphObject *)dev);
+
+    /* unplug secondary disk */
+    qpci_unplug_acpi_device_test(qts, "drv1", PCI_SLOT_HP);
+}
+
+/*
+ * Check that setting the vring addr on a non-existent virtqueue does
+ * not crash.
+ */
+static void test_nonexistent_virtqueue(void *obj, void *data,
+                                       QGuestAllocator *t_alloc)
+{
+    QVhostUserBlkPCI *blk = obj;
+    QVirtioPCIDevice *pdev = &blk->pci_vdev;
+    QPCIBar bar0;
+    QPCIDevice *dev;
+
+    dev = qpci_device_find(pdev->pdev->bus, QPCI_DEVFN(4, 0));
+    g_assert(dev != NULL);
+    qpci_device_enable(dev);
+
+    bar0 = qpci_iomap(dev, 0, NULL);
+
+    qpci_io_writeb(dev, bar0, VIRTIO_PCI_QUEUE_SEL, 2);
+    qpci_io_writel(dev, bar0, VIRTIO_PCI_QUEUE_PFN, 1);
+
+    g_free(dev);
+}
+
+static const char *qtest_qemu_vu_binary(void)
+{
+    const char *qemu_vu_bin;
+
+    qemu_vu_bin = getenv("QTEST_QEMU_VU_BINARY");
+    if (!qemu_vu_bin) {
+        fprintf(stderr, "Environment variable QTEST_QEMU_VU_BINARY required\n");
+        exit(EXIT_FAILURE);
+    }
+
+    return qemu_vu_bin;
+}
+
+static void drive_destroy(void *path)
+{
+    unlink(path);
+    g_free(path);
+    qos_invalidate_command_line();
+}
+
+
+static char *drive_create(void)
+{
+    int fd, ret;
+    /* vhost-user-blk does not recognize a drive located in /tmp */
+    char *t_path = g_strdup("qtest.XXXXXX");
+
+    /* Create a temporary raw image */
+    fd = mkstemp(t_path);
+    g_assert_cmpint(fd, >=, 0);
+    ret = ftruncate(fd, TEST_IMAGE_SIZE);
+    g_assert_cmpint(ret, ==, 0);
+    close(fd);
+
+    g_test_queue_destroy(drive_destroy, t_path);
+    return t_path;
+}
+
+static void start_vhost_user_blk(const char *img_path, const char *sock_path)
+{
+    const char *vhost_user_blk_bin = qtest_qemu_vu_binary();
+    /*
+     * "qemu-vu -e" exits when the client disconnects, so the launched
+     * qemu-vu process will not block scripts/tap-driver.pl
+     */
+    gchar *command = g_strdup_printf("exec %s "
+                                     "-e "
+                                     "-k %s "
+                                     "-f raw "
+                                     "%s",
+                                     vhost_user_blk_bin,
+                                     sock_path, img_path);
+    g_test_message("starting vhost-user backend: %s", command);
+    pid_t pid = fork();
+    if (pid == 0) {
+        execlp("/bin/sh", "sh", "-c", command, NULL);
+        exit(1);
+    }
+    /*
+     * Make sure the socket server in qemu-vu has started before the
+     * tests run, otherwise QEMU will complain "Failed to connect
+     * socket ... Connection refused".
+     */
+    g_usleep(G_USEC_PER_SEC);
+}
+
+static void *vhost_user_blk_test_setup(GString *cmd_line, void *arg)
+{
+    /* create image file */
+    const char *img_path = drive_create();
+    const char *sock_path = "/tmp/vhost-user-blk_vhost.socket";
+    start_vhost_user_blk(img_path, sock_path);
+    /* "-chardev socket,id=char2" is used for pci_hotplug */
+    g_string_append_printf(cmd_line,
+                           " -object memory-backend-memfd,id=mem,size=128M,share=on -numa node,memdev=mem "
+                           "-chardev socket,id=char1,path=%s "
+                           "-chardev socket,id=char2,path=%s",
+                           sock_path, sock_path);
+    return arg;
+}
+
+static void register_vhost_user_blk_test(void)
+{
+    QOSGraphTestOptions opts = {
+        .before = vhost_user_blk_test_setup,
+    };
+
+    /*
+     * Tests for vhost-user-blk and vhost-user-blk-pci, borrowed from
+     * tests/virtio-blk-test.c. The vhost-user-blk device has no -drive,
+     * so the tests that rely on block_resize are dropped:
+     *  - config
+     *  - resize
+     */
+    qos_add_test("basic", "vhost-user-blk", basic, &opts);
+    qos_add_test("indirect", "vhost-user-blk", indirect, &opts);
+    qos_add_test("idx", "vhost-user-blk-pci", idx, &opts);
+    qos_add_test("nxvirtq", "vhost-user-blk-pci",
+                      test_nonexistent_virtqueue, &opts);
+    qos_add_test("hotplug", "vhost-user-blk-pci", pci_hotplug, &opts);
+}
+
+libqos_init(register_vhost_user_blk_test);
-- 
2.25.0




* Re: [PATCH v4 1/5] extend libvhost to support IOThread and coroutine
  2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
@ 2020-02-19 16:33   ` Stefan Hajnoczi
  2020-02-25 14:55   ` Kevin Wolf
  1 sibling, 0 replies; 22+ messages in thread
From: Stefan Hajnoczi @ 2020-02-19 16:33 UTC (permalink / raw)
  To: Coiby Xu; +Cc: kwolf, bharatlkmlkvm, qemu-devel


On Tue, Feb 18, 2020 at 01:07:07PM +0800, Coiby Xu wrote:
> +static bool vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
> +{
> +    vu_read_msg_cb read_msg;
> +    if (dev->co_iface) {
> +        read_msg = dev->co_iface->read_msg;
> +    } else {
> +        read_msg = vu_message_read_;
> +    }
> +    return read_msg(dev, conn_fd, vmsg);
> +}

libvhost-user is already partially asynchronous (->set/remove_watch()
are used for eventfds).

Can you make the vhost-user socket I/O asynchronous too?

> +
>  static bool
>  vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>  {
> @@ -1075,9 +1084,14 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>      }
>  
>      if (dev->vq[index].kick_fd != -1 && dev->vq[index].handler) {
> +        if (dev->set_watch_packed_data) {
> +            dev->set_watch_packed_data(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
> +                                       dev->co_iface->kick_callback,
> +                                       (void *)(long)index);
> +        } else {
>          dev->set_watch(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
>                         vu_kick_cb, (void *)(long)index);

Why is it necessary to replace vu_kick_cb()?

The user can set a custom vq->handler() function with
vu_set_queue_handler().

Coroutines aren't needed for this part since eventfd_read() always
returns right away.
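
That a read from an already-signalled eventfd completes immediately can be checked with a tiny stand-alone program (a sketch, Linux only; `read_kick` is an illustrative helper, not code from this series):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* A kick eventfd carries a counter; reading after a write never blocks. */
static int read_kick(void)
{
    int fd = eventfd(0, 0);
    eventfd_t val = 0;

    if (fd < 0) {
        return -1;
    }
    eventfd_write(fd, 1);            /* the "kick" */
    int rc = eventfd_read(fd, &val); /* completes immediately, no yield needed */
    close(fd);
    return rc == 0 ? (int)val : -1;
}
```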

> @@ -1627,6 +1647,12 @@ vu_deinit(VuDev *dev)
>          }
>  
>          if (vq->kick_fd != -1) {
> +            /* remove watch for kick_fd
> +             * When client process is running in gdb and
> +             * quit command is run in gdb, QEMU will still dispatch the event
> +             * which will cause segment fault in the callback function
> +             */

Code and comments in libvhost-user should not refer to QEMU specifics.
Removing the watch is a good idea regardless of the application or event
loop implementation.  No comment is needed here.

> +            dev->remove_watch(dev, vq->kick_fd);
>              close(vq->kick_fd);
>              vq->kick_fd = -1;
>          }
> @@ -1682,7 +1708,7 @@ vu_init(VuDev *dev,
>  
>      assert(max_queues > 0);
>      assert(socket >= 0);
> -    assert(set_watch);
> +    /* assert(set_watch); */

?

> @@ -372,7 +389,8 @@ struct VuDev {
>      /* @set_watch: add or update the given fd to the watch set,
>       * call cb when condition is met */
>      vu_set_watch_cb set_watch;
> -
> +    /* AIO dispatch will only one data pointer to callback function */

I don't understand what this comment is saying.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
                   ` (4 preceding siblings ...)
  2020-02-18  5:07 ` [PATCH v4 5/5] new qTest case to test the vhost-user-blk-server Coiby Xu
@ 2020-02-19 16:38 ` Stefan Hajnoczi
  2020-02-26 15:18   ` Coiby Xu
  5 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2020-02-19 16:38 UTC (permalink / raw)
  To: Coiby Xu; +Cc: kwolf, bharatlkmlkvm, qemu-devel, Marc-André Lureau

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

On Tue, Feb 18, 2020 at 01:07:06PM +0800, Coiby Xu wrote:
> v4:
>  * add object properties in class_init
>  * relocate vhost-user-blk-test
>  * other changes including using SocketAddress, coding style, etc.

Thanks!  I think the vhost-user server code can be simplified if
libvhost-user uses the event loop for asynchronous socket I/O.  Then
it's no longer necessary to duplicate vu_message_read() and
vu_kick_cb().  I've replied to Patch 1 and we can discuss on IRC if you
want to chat about it.

I've also CCed Marc-André to see what he thinks about removing the
blocking recv from libvhost-user and instead using the event loop (just
like for eventfds).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/5] extend libvhost to support IOThread and coroutine
  2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
  2020-02-19 16:33   ` Stefan Hajnoczi
@ 2020-02-25 14:55   ` Kevin Wolf
  1 sibling, 0 replies; 22+ messages in thread
From: Kevin Wolf @ 2020-02-25 14:55 UTC (permalink / raw)
  To: Coiby Xu; +Cc: bharatlkmlkvm, qemu-devel, stefanha

Am 18.02.2020 um 06:07 hat Coiby Xu geschrieben:
> Previously libvhost dispatched events in its own GMainContext. Now a vhost-user
> client's kick event can be dispatched in the block drive's AioContext,
> so IOThread is supported. Also allow vu_message_read and
> vu_kick_cb to be replaced so QEMU can run them as coroutines.
> 
> Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
> ---
>  contrib/libvhost-user/libvhost-user.c | 54 ++++++++++++++++++++++++---
>  contrib/libvhost-user/libvhost-user.h | 38 ++++++++++++++++++-
>  2 files changed, 85 insertions(+), 7 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 5cb7708559..6aadeaa0f2 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -30,6 +30,8 @@
>  
>  #define VHOST_MEMORY_MAX_NREGIONS 8
>  
> +#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
> +
>  typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MASTER = 0,
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> @@ -201,6 +203,7 @@ typedef uint64_t (*vu_get_features_cb) (VuDev *dev);
>  typedef void (*vu_set_features_cb) (VuDev *dev, uint64_t features);
>  typedef int (*vu_process_msg_cb) (VuDev *dev, VhostUserMsg *vmsg,
>                                    int *do_reply);
> +typedef bool (*vu_read_msg_cb) (VuDev *dev, int sock, VhostUserMsg *vmsg);
>  typedef void (*vu_queue_set_started_cb) (VuDev *dev, int qidx, bool started);
>  typedef bool (*vu_queue_is_processed_in_order_cb) (VuDev *dev, int qidx);
>  typedef int (*vu_get_config_cb) (VuDev *dev, uint8_t *config, uint32_t len);
> @@ -208,6 +211,20 @@ typedef int (*vu_set_config_cb) (VuDev *dev, const uint8_t *data,
>                                   uint32_t offset, uint32_t size,
>                                   uint32_t flags);
>  
> +typedef void (*vu_watch_cb_packed_data) (void *packed_data);
> +
> +typedef void (*vu_set_watch_cb_packed_data) (VuDev *dev, int fd, int condition,
> +                                             vu_watch_cb_packed_data cb,
> +                                             void *data);
> +/*
> + * allowing vu_read_msg_cb and kick_callback to be replaced so QEMU
> + * can run them as coroutines
> + */
> +typedef struct CoIface {
> +    vu_read_msg_cb read_msg;
> +    vu_watch_cb_packed_data kick_callback;
> +} CoIface;

I think this should be part of VuDevIface, so that it becomes a properly
integrated part of the design instead of an adapter hacked on top.
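
A rough sketch of that integration (illustrative only; the stand-in types, the field names, and the default-filling init are assumptions, not code from this series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the libvhost-user types. */
typedef struct VuDev VuDev;
typedef struct VhostUserMsg VhostUserMsg;

typedef bool (*vu_read_msg_cb)(VuDev *dev, int sock, VhostUserMsg *vmsg);
typedef void (*vu_watch_cb)(VuDev *dev, int condition, void *data);

/* read_msg/kick_callback folded into the one device interface */
typedef struct VuDevIface {
    vu_read_msg_cb read_msg;   /* NULL: fall back to the blocking reader */
    vu_watch_cb kick_callback; /* NULL: fall back to vu_kick_cb */
} VuDevIface;

struct VuDev {
    VuDevIface iface;
};

static bool vu_message_read_blocking(VuDev *dev, int sock, VhostUserMsg *vmsg)
{
    (void)dev; (void)sock; (void)vmsg;
    return true; /* placeholder for the existing recvmsg() path */
}

static void vu_kick_cb_default(VuDev *dev, int condition, void *data)
{
    (void)dev; (void)condition; (void)data;
}

/* vu_init() fills in the defaults, so call sites never need to branch */
static void vu_init_iface(VuDev *dev, const VuDevIface *iface)
{
    dev->iface = *iface;
    if (!dev->iface.read_msg) {
        dev->iface.read_msg = vu_message_read_blocking;
    }
    if (!dev->iface.kick_callback) {
        dev->iface.kick_callback = vu_kick_cb_default;
    }
}
```

With defaults filled in at init time, vu_set_vring_kick_exec() and vu_set_queue_handler() can always call dev->iface.kick_callback unconditionally.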

>  typedef struct VuDevIface {
>      /* called by VHOST_USER_GET_FEATURES to get the features bitmask */
>      vu_get_features_cb get_features;
> @@ -372,7 +389,8 @@ struct VuDev {
>      /* @set_watch: add or update the given fd to the watch set,
>       * call cb when condition is met */
>      vu_set_watch_cb set_watch;
> -
> +    /* AIO dispatch will only one data pointer to callback function */
> +    vu_set_watch_cb_packed_data set_watch_packed_data;
>      /* @remove_watch: remove the given fd from the watch set */
>      vu_remove_watch_cb remove_watch;
>  
> @@ -380,7 +398,7 @@ struct VuDev {
>       * re-initialize */
>      vu_panic_cb panic;
>      const VuDevIface *iface;
> -
> +    const CoIface *co_iface;
>      /* Postcopy data */
>      int postcopy_ufd;
>      bool postcopy_listening;
> @@ -417,6 +435,22 @@ bool vu_init(VuDev *dev,
>               const VuDevIface *iface);
>  
>  
> +/**
> + * vu_init_packed_data:
> + * Same as vu_init except for set_watch_packed_data which will pack
> + * two parameters into a struct

Be specific: Which two parameters and which struct?

I think it would be more helpful to name the function after the
additional piece of information that it uses rather than the fact that
it stores it internally in a struct.

We have:

typedef void (*vu_set_watch_cb) (VuDev *dev, int fd, int condition,
                                 vu_watch_cb cb, void *data);
typedef void (*vu_set_watch_cb_packed_data) (VuDev *dev, int fd, int condition,
                                             vu_watch_cb_packed_data cb,
                                             void *data);

Without looking at the implementation, they have the same set of
parameters. I suspect that the difference is in the content of *data,
but since it is declared void*, I suppose it's treated as an opaque data
type and will only be passed unchanged (and uninspected) to cb.

If so, there is no difference between the two types.

> thus QEMU aio_dispatch can pass the
> + * required data to callback function.
> + *
> + * Returns: true on success, false on failure.
> + **/
> +bool vu_init_packed_data(VuDev *dev,
> +                         uint16_t max_queues,
> +                         int socket,
> +                         vu_panic_cb panic,
> +                         vu_set_watch_cb_packed_data set_watch_packed_data,
> +                         vu_remove_watch_cb remove_watch,
> +                         const VuDevIface *iface,
> +                         const CoIface *co_iface);
>  /**
>   * vu_deinit:
>   * @dev: a VuDev context
> -- 
> 2.25.0
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index b89bf18501..f95664bb22 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -67,8 +67,6 @@
>  /* The version of inflight buffer */
>  #define INFLIGHT_VERSION 1
>  
> -#define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
> -
>  /* The version of the protocol we support */
>  #define VHOST_USER_VERSION 1
>  #define LIBVHOST_USER_DEBUG 0
> @@ -260,7 +258,7 @@ have_userfault(void)
>  }
>  
>  static bool
> -vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
> +vu_message_read_(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)

Just adding a trailing underscore isn't a good name. It doesn't tell the
reader what the difference between vu_message_read_ and vu_message_read
is.

>  {
>      char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
>      struct iovec iov = {
> @@ -328,6 +326,17 @@ fail:
>      return false;
>  }
>  
> +static bool vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
> +{
> +    vu_read_msg_cb read_msg;
> +    if (dev->co_iface) {
> +        read_msg = dev->co_iface->read_msg;
> +    } else {
> +        read_msg = vu_message_read_;
> +    }
> +    return read_msg(dev, conn_fd, vmsg);
> +}

If you change VuDevIface so that it contains the fields of CoIface
directly, you can just initialise dev->iface->read_msg with what is
called vu_message_read_() now for the non-QEMU case, and this whole
wrapper becomes unnecessary because the code path is the same for both
cases.

>  static bool
>  vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>  {
> @@ -1075,9 +1084,14 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>      }
>  
>      if (dev->vq[index].kick_fd != -1 && dev->vq[index].handler) {
> +        if (dev->set_watch_packed_data) {
> +            dev->set_watch_packed_data(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
> +                                       dev->co_iface->kick_callback,
> +                                       (void *)(long)index);
> +        } else {
>          dev->set_watch(dev, dev->vq[index].kick_fd, VU_WATCH_IN,
>                         vu_kick_cb, (void *)(long)index);
> -
> +        }

Indentation is off here.

Also, this is almost exactly the same code for both cases. If you
generalise things to have a dev->iface->kick_callback that can be
initialised with vu_kick_cb in the non-QEMU case, you get rid of this
duplication, too.

>          DPRINT("Waiting for kicks on fd: %d for vq: %d\n",
>                 dev->vq[index].kick_fd, index);
>      }
> @@ -1097,8 +1111,14 @@ void vu_set_queue_handler(VuDev *dev, VuVirtq *vq,
>      vq->handler = handler;
>      if (vq->kick_fd >= 0) {
>          if (handler) {
> +            if (dev->set_watch_packed_data) {
> +                dev->set_watch_packed_data(dev, vq->kick_fd, VU_WATCH_IN,
> +                                           dev->co_iface->kick_callback,
> +                                           (void *)(long)qidx);
> +            } else {
>              dev->set_watch(dev, vq->kick_fd, VU_WATCH_IN,
>                             vu_kick_cb, (void *)(long)qidx);
> +            }

Same as above. (Indentation and duplicated code.)

>          } else {
>              dev->remove_watch(dev, vq->kick_fd);
>          }
> @@ -1627,6 +1647,12 @@ vu_deinit(VuDev *dev)
>          }
>  
>          if (vq->kick_fd != -1) {
> +            /* remove watch for kick_fd
> +             * When client process is running in gdb and
> +             * quit command is run in gdb, QEMU will still dispatch the event
> +             * which will cause segment fault in the callback function
> +             */

Reformat this comment to use a consistent line width, maybe like this:

            /*
             * remove watch for kick_fd.
             *
             * When client process is running in gdb and quit command is
             * run in gdb, QEMU will still dispatch the event which will
             * cause segment fault in the callback function
             */

I'm not sure what the comment wants to tell me: Is this an existing
problem in the code that we can run into segfaults, or do we remove the
watch to avoid segfaults?

> +            dev->remove_watch(dev, vq->kick_fd);
>              close(vq->kick_fd);
>              vq->kick_fd = -1;
>          }
> @@ -1682,7 +1708,7 @@ vu_init(VuDev *dev,
>  
>      assert(max_queues > 0);
>      assert(socket >= 0);
> -    assert(set_watch);
> +    /* assert(set_watch); */

Don't leave commented code around. Either leave it in, or remove it
completely.

I think this one should be left in. If you integrate CoIface into
VuDevIface, the assertion will hold true again.

>      assert(remove_watch);
>      assert(iface);
>      assert(panic);
> @@ -1715,6 +1741,24 @@ vu_init(VuDev *dev,
>      return true;
>  }
>  
> +bool
> +vu_init_packed_data(VuDev *dev,
> +        uint16_t max_queues,
> +        int socket,
> +        vu_panic_cb panic,
> +        vu_set_watch_cb_packed_data set_watch_packed_data,
> +        vu_remove_watch_cb remove_watch,
> +        const VuDevIface *iface,
> +        const CoIface *co_iface)
> +{
> +    if (vu_init(dev, max_queues, socket, panic, NULL, remove_watch, iface)) {
> +        dev->set_watch_packed_data = set_watch_packed_data;
> +        dev->co_iface = co_iface;
> +        return true;
> +    }
> +    return false;
> +}

With the integrated VuDevIface, this wrapper becomes unnecessary.

Kevin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/5] generic vhost user server
  2020-02-18  5:07 ` [PATCH v4 2/5] generic vhost user server Coiby Xu
@ 2020-02-25 15:44   ` Kevin Wolf
  2020-02-28  4:23     ` Coiby Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2020-02-25 15:44 UTC (permalink / raw)
  To: Coiby Xu; +Cc: bharatlkmlkvm, qemu-devel, stefanha

Am 18.02.2020 um 06:07 hat Coiby Xu geschrieben:
> Sharing QEMU devices via vhost-user protocol
> 
> Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
> ---
>  util/Makefile.objs       |   3 +
>  util/vhost-user-server.c | 427 +++++++++++++++++++++++++++++++++++++++
>  util/vhost-user-server.h |  56 +++++
>  3 files changed, 486 insertions(+)
>  create mode 100644 util/vhost-user-server.c
>  create mode 100644 util/vhost-user-server.h
> 
> diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
> new file mode 100644
> index 0000000000..ff6d3145cd
> --- /dev/null
> +++ b/util/vhost-user-server.h
> @@ -0,0 +1,56 @@
> +#include "io/channel-socket.h"
> +#include "io/channel-file.h"
> +#include "io/net-listener.h"
> +#include "contrib/libvhost-user/libvhost-user.h"
> +#include "standard-headers/linux/virtio_blk.h"
> +#include "qemu/error-report.h"
> +
> +typedef struct VuClient VuClient;

I find the terminology a bit confusing here: VuClient is really the
connection to a single client, but it's part of the server. The name
gives the impression that this is client-side code. (This is
something that already tends to confuse me in the NBD code.)

I'm not sure what a better name could be, though. Maybe
VuServerConnection, VuExportClient or VuExportConnection?

> +typedef struct VuServer {
> +    QIONetListener *listener;
> +    AioContext *ctx;
> +    QTAILQ_HEAD(, VuClient) clients;
> +    void (*device_panic_notifier)(struct VuClient *client) ;
> +    int max_queues;
> +    const VuDevIface *vu_iface;
> +    /*
> +     * @ptr_in_device: VuServer pointer memory location in vhost-user device
> +     * struct, so later container_of can be used to get device destruct
> +     */
> +    void *ptr_in_device;
> +    bool close;
> +} VuServer;
> +
> +typedef struct kick_info {
> +    VuDev *vu_dev;

I suppose this could specifically be VuClient?

> +    int fd; /*kick fd*/
> +    long index; /*queue index*/
> +    QIOChannel *ioc; /*I/O channel for kick fd*/
> +    QIOChannelFile *fioc; /*underlying data channel for kick fd*/
> +    Coroutine *co;
> +} kick_info;
> +
> +struct VuClient {
> +    VuDev parent;
> +    VuServer *server;
> +    QIOChannel *ioc; /* The current I/O channel */
> +    QIOChannelSocket *sioc; /* The underlying data channel */
> +    Coroutine *co_trip;
> +    struct kick_info *kick_info;

If each struct kick_info (btw, QEMU coding style requires CamelCase) has
exactly one VuClient and each VuClient has exactly one kick_info, should
this be a single struct containing both?

[ Coming back from reading the code below - it's because this is in
fact an array. This should be made clear in the definition. ]

> +    QTAILQ_ENTRY(VuClient) next;
> +    bool closed;
> +};
> +
> +
> +VuServer *vhost_user_server_start(uint16_t max_queues,
> +                                  SocketAddress *unix_socket,
> +                                  AioContext *ctx,
> +                                  void *server_ptr,
> +                                  void *device_panic_notifier,
> +                                  const VuDevIface *vu_iface,
> +                                  Error **errp);
> +
> +void vhost_user_server_stop(VuServer *server);
> +
> +void change_vu_context(AioContext *ctx, VuServer *server);

Let's call this vhost_user_server_set_aio_context() for consistency.

> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 11262aafaf..5e450e501c 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -36,6 +36,9 @@ util-obj-y += readline.o
>  util-obj-y += rcu.o
>  util-obj-$(CONFIG_MEMBARRIER) += sys_membarrier.o
>  util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
> +ifdef CONFIG_LINUX
> +util-obj-y += vhost-user-server.o
> +endif
>  util-obj-y += qemu-coroutine-sleep.o
>  util-obj-y += qemu-co-shared-resource.o
>  util-obj-y += coroutine-$(CONFIG_COROUTINE_BACKEND).o
> diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
> new file mode 100644
> index 0000000000..70ff6d6701
> --- /dev/null
> +++ b/util/vhost-user-server.c
> @@ -0,0 +1,427 @@
> +/*
> + * Sharing QEMU devices via vhost-user protocol
> + *
> + * Author: Coiby Xu <coiby.xu@gmail.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <sys/eventfd.h>
> +#include "qemu/main-loop.h"
> +#include "vhost-user-server.h"
> +
> +static void vmsg_close_fds(VhostUserMsg *vmsg)
> +{
> +    int i;
> +    for (i = 0; i < vmsg->fd_num; i++) {
> +        close(vmsg->fds[i]);
> +    }
> +}
> +
> +static void vmsg_unblock_fds(VhostUserMsg *vmsg)
> +{
> +    int i;
> +    for (i = 0; i < vmsg->fd_num; i++) {
> +        qemu_set_nonblock(vmsg->fds[i]);
> +    }
> +}
> +
> +
> +static void close_client(VuClient *client)
> +{
> +    vu_deinit(&client->parent);
> +    client->sioc = NULL;
> +    object_unref(OBJECT(client->ioc));
> +    client->closed = true;
> +
> +}
> +
> +static void panic_cb(VuDev *vu_dev, const char *buf)
> +{
> +    if (buf) {
> +        error_report("vu_panic: %s", buf);
> +    }
> +
> +    VuClient *client = container_of(vu_dev, VuClient, parent);
> +    VuServer *server = client->server;

Please put declarations at the start of the block.

> +    if (!client->closed) {
> +        close_client(client);
> +        QTAILQ_REMOVE(&server->clients, client, next);
> +    }
> +
> +    if (server->device_panic_notifier) {
> +        server->device_panic_notifier(client);
> +    }
> +}
> +
> +
> +
> +static bool coroutine_fn
> +vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
> +{
> +    struct iovec iov = {
> +        .iov_base = (char *)vmsg,
> +        .iov_len = VHOST_USER_HDR_SIZE,
> +    };
> +    int rc, read_bytes = 0;
> +    /*
> +     * VhostUserMsg is a packed structure, gcc will complain about passing
> +     * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
> +     * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
> +     * thus two temporary variables nfds and fds are used here.
> +     */
> +    size_t nfds = 0, nfds_t = 0;
> +    int *fds = NULL, *fds_t = NULL;
> +    VuClient *client = container_of(vu_dev, VuClient, parent);
> +    QIOChannel *ioc = client->ioc;
> +
> +    Error *erp;

The convention is to call this local_err, and it should be initialised
to NULL.

> +    assert(qemu_in_coroutine());
> +    do {
> +        /*
> +         * qio_channel_readv_full may have short reads, keeping calling it
> +         * until getting VHOST_USER_HDR_SIZE or 0 bytes in total
> +         */
> +        rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &erp);
> +        if (rc < 0) {
> +            if (rc == QIO_CHANNEL_ERR_BLOCK) {
> +                qio_channel_yield(ioc, G_IO_IN);
> +                continue;
> +            } else {
> +                error_report("Error while recvmsg: %s", strerror(errno));

I don't think the qio_channel_*() functions promise anything about the
value in errno. (They also don't promise to use recvmsg().)

Instead, use error_report_err() because erp contains the real error
message.

> +                return false;
> +            }
> +        }
> +        read_bytes += rc;
> +        fds = g_renew(int, fds_t, nfds + nfds_t);
> +        memcpy(fds + nfds, fds_t, nfds_t);
> +        nfds += nfds_t;
> +        if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
> +            break;
> +        }
> +    } while (true);
> +
> +    vmsg->fd_num = nfds;
> +    memcpy(vmsg->fds, fds, nfds * sizeof(int));
> +    g_free(fds);
> +    /* qio_channel_readv_full will make socket fds blocking, unblock them */
> +    vmsg_unblock_fds(vmsg);
> +    if (vmsg->size > sizeof(vmsg->payload)) {
> +        error_report("Error: too big message request: %d, "
> +                     "size: vmsg->size: %u, "
> +                     "while sizeof(vmsg->payload) = %zu",
> +                     vmsg->request, vmsg->size, sizeof(vmsg->payload));
> +        goto fail;
> +    }
> +
> +    struct iovec iov_payload = {
> +        .iov_base = (char *)&vmsg->payload,
> +        .iov_len = vmsg->size,
> +    };
> +    if (vmsg->size) {
> +        rc = qio_channel_readv_all_eof(ioc, &iov_payload, 1, &erp);
> +        if (rc == -1) {
> +            error_report("Error while reading: %s", strerror(errno));

error_report_err() again.

> +            goto fail;
> +        }
> +    }
> +
> +    return true;
> +
> +fail:
> +    vmsg_close_fds(vmsg);
> +
> +    return false;
> +}
> +
> +
> +static coroutine_fn void vu_client_next_trip(VuClient *client);
> +
> +static coroutine_fn void vu_client_trip(void *opaque)
> +{
> +    VuClient *client = opaque;
> +
> +    vu_dispatch(&client->parent);
> +    client->co_trip = NULL;
> +    if (!client->closed) {
> +        vu_client_next_trip(client);
> +    }
> +}

The last part is atypical coroutine code: it spawns a new coroutine
with vu_client_trip() as its entry point, and then terminates the
current one.

Why don't we just put the whole thing in a while (!client->closed) loop
and stay in the same coroutine instead of terminating the old one and
starting a new one all the time?

> +static coroutine_fn void vu_client_next_trip(VuClient *client)
> +{
> +    if (!client->co_trip) {
> +        client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> +        aio_co_schedule(client->ioc->ctx, client->co_trip);
> +    }
> +}
> +
> +static void vu_client_start(VuClient *client)
> +{
> +    client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> +    aio_co_enter(client->ioc->ctx, client->co_trip);
> +}

This is essentially a duplicate of vu_client_next_trip(). The only
place where it is called (vu_accept()) knows that client->co_trip is
already NULL, so it could just call vu_client_next_trip().

Or in fact, if vu_client_trip() gets turned into a loop, it's
vu_client_next_trip() that becomes unnecessary.

> +static void coroutine_fn vu_kick_cb_next(VuClient *client,
> +                                          kick_info *data);
> +
> +static void coroutine_fn vu_kick_cb(void *opaque)
> +{
> +    kick_info *data = (kick_info *) opaque;
> +    int index = data->index;
> +    VuDev *dev = data->vu_dev;
> +    VuClient *client;
> +    client = container_of(dev, VuClient, parent);
> +    VuVirtq *vq = &dev->vq[index];
> +    int sock = vq->kick_fd;
> +    if (sock == -1) {
> +        return;
> +    }
> +    assert(sock == data->fd);
> +    eventfd_t kick_data;
> +    ssize_t rc;
> +    /*
> +     * When eventfd is closed, the revent is POLLNVAL (=G_IO_NVAL) and
> +     * reading eventfd will return errno=EBADF (Bad file number).
> +     * Calling qio_channel_yield(ioc, G_IO_IN) will set reading handler
> +     * for QIOChannel, but aio_dispatch_handlers will only dispatch
> +     * G_IO_IN | G_IO_HUP | G_IO_ERR revents while ignoring
> +     * G_IO_NVAL (POLLNVAL) revents.
> +     *
> +     * Thus when eventfd is closed by vhost-user client, QEMU will ignore
> +     * G_IO_NVAL and keeping polling by repeatedly calling qemu_poll_ns which
> +     * will lead to 100% CPU usage.
> +     *
> +     * To aovid this issue, make sure set_watch and remove_watch use the same

s/aovid/avoid/

> +     * AIOContext for QIOChannel. Thus remove_watch will eventually succefully
> +     * remove eventfd from the set of file descriptors polled for
> +     * corresponding GSource.
> +     */
> +    rc = read(sock, &kick_data, sizeof(eventfd_t));

Why not a QIOChannel function like for vu_message_read() above?

> +    if (rc != sizeof(eventfd_t)) {
> +        if (errno == EAGAIN) {
> +            qio_channel_yield(data->ioc, G_IO_IN);
> +        } else if (errno != EINTR) {
> +            data->co = NULL;
> +            return;
> +        }
> +    } else {
> +        vq->handler(dev, index);
> +    }
> +    data->co = NULL;
> +    vu_kick_cb_next(client, data);

This can be a loop, too, instead of terminating the coroutine and
starting a new one for the same function.

> +}
> +
> +static void coroutine_fn vu_kick_cb_next(VuClient *client,
> +                                          kick_info *cb_data)
> +{
> +    if (!cb_data->co) {
> +        cb_data->co = qemu_coroutine_create(vu_kick_cb, cb_data);
> +        aio_co_schedule(client->ioc->ctx, cb_data->co);
> +    }
> +}

Kevin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 3/5] vhost-user block device backend server
  2020-02-18  5:07 ` [PATCH v4 3/5] vhost-user block device backend server Coiby Xu
@ 2020-02-25 16:09   ` Kevin Wolf
  0 siblings, 0 replies; 22+ messages in thread
From: Kevin Wolf @ 2020-02-25 16:09 UTC (permalink / raw)
  To: Coiby Xu; +Cc: bharatlkmlkvm, qemu-devel, stefanha

Am 18.02.2020 um 06:07 hat Coiby Xu geschrieben:
> By making use of libvhost, multiple block device drives can be exported
> and each drive can serve multiple clients simultaneously.
> Since vhost-user-server needs a block drive to be created first, delay
> the creation of this object.
> 
> Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
> ---
>  Makefile.target                  |   1 +
>  backends/Makefile.objs           |   2 +
>  backends/vhost-user-blk-server.c | 718 +++++++++++++++++++++++++++++++
>  backends/vhost-user-blk-server.h |  21 +
>  vl.c                             |   4 +
>  5 files changed, 746 insertions(+)
>  create mode 100644 backends/vhost-user-blk-server.c
>  create mode 100644 backends/vhost-user-blk-server.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index 6e61f607b1..8c6c01eb3a 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -159,6 +159,7 @@ obj-y += monitor/
>  obj-y += qapi/
>  obj-y += memory.o
>  obj-y += memory_mapping.o
> +obj-$(CONFIG_LINUX) += ../contrib/libvhost-user/libvhost-user.o
>  obj-y += migration/ram.o
>  LIBS := $(libs_softmmu) $(LIBS)
>  
> diff --git a/backends/Makefile.objs b/backends/Makefile.objs
> index 28a847cd57..4e7be731e0 100644
> --- a/backends/Makefile.objs
> +++ b/backends/Makefile.objs
> @@ -14,6 +14,8 @@ common-obj-y += cryptodev-vhost.o
>  common-obj-$(CONFIG_VHOST_CRYPTO) += cryptodev-vhost-user.o
>  endif
>  
> +common-obj-$(CONFIG_LINUX) += vhost-user-blk-server.o
> +
>  common-obj-$(call land,$(CONFIG_VHOST_USER),$(CONFIG_VIRTIO)) += vhost-user.o
>  
>  common-obj-$(CONFIG_LINUX) += hostmem-memfd.o
> diff --git a/backends/vhost-user-blk-server.c b/backends/vhost-user-blk-server.c
> new file mode 100644
> index 0000000000..1bf7f7b544
> --- /dev/null
> +++ b/backends/vhost-user-blk-server.c

Please move this to block/export/vhost-user-blk.c. This will
automatically have it covered as a block layer component in MAINTAINERS,
and it will provide a place where other exports can be put, too.

> @@ -0,0 +1,718 @@
> +/*
> + * Sharing QEMU block devices via vhost-user protocal
> + *
> + * Author: Coiby Xu <coiby.xu@gmail.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include "block/block.h"
> +#include "vhost-user-blk-server.h"
> +#include "qapi/error.h"
> +#include "qom/object_interfaces.h"
> +#include "sysemu/block-backend.h"
> +
> +enum {
> +    VHOST_USER_BLK_MAX_QUEUES = 1,
> +};
> +struct virtio_blk_inhdr {
> +    unsigned char status;
> +};
> +
> +static QTAILQ_HEAD(, VuBlockDev) vu_block_devs =
> +                                 QTAILQ_HEAD_INITIALIZER(vu_block_devs);
> +
> +
> +typedef struct VuBlockReq {
> +    VuVirtqElement *elem;
> +    int64_t sector_num;
> +    size_t size;
> +    struct virtio_blk_inhdr *in;
> +    struct virtio_blk_outhdr out;
> +    VuClient *client;
> +    struct VuVirtq *vq;
> +} VuBlockReq;
> +
> +
> +static void vu_block_req_complete(VuBlockReq *req)
> +{
> +    VuDev *vu_dev = &req->client->parent;
> +
> +    /* IO size with 1 extra status byte */
> +    vu_queue_push(vu_dev, req->vq, req->elem, req->size + 1);
> +    vu_queue_notify(vu_dev, req->vq);
> +
> +    if (req->elem) {
> +        free(req->elem);
> +    }
> +
> +    g_free(req);
> +}
> +
> +static VuBlockDev *get_vu_block_device_by_client(VuClient *client)
> +{
> +    return container_of(client->server->ptr_in_device, VuBlockDev, vu_server);
> +}
> +
> +static int coroutine_fn
> +vu_block_discard_write_zeroes(VuBlockReq *req, struct iovec *iov,
> +                              uint32_t iovcnt, uint32_t type)
> +{
> +    struct virtio_blk_discard_write_zeroes desc;
> +    ssize_t size = iov_to_buf(iov, iovcnt, 0, &desc, sizeof(desc));
> +    if (unlikely(size != sizeof(desc))) {
> +        error_report("Invalid size %ld, expect %ld", size, sizeof(desc));
> +        return -EINVAL;
> +    }
> +
> +    VuBlockDev *vdev_blk = get_vu_block_device_by_client(req->client);
> +    uint64_t range[2] = { le64toh(desc.sector) << 9,
> +                          le32toh(desc.num_sectors) << 9 };
> +    if (type == VIRTIO_BLK_T_DISCARD) {
> +        if (blk_co_pdiscard(vdev_blk->backend, range[0], range[1]) == 0) {
> +            return 0;
> +        }
> +    } else if (type == VIRTIO_BLK_T_WRITE_ZEROES) {
> +        if (blk_co_pwrite_zeroes(vdev_blk->backend,
> +                                 range[0], range[1], 0) == 0) {
> +            return 0;
> +        }
> +    }
> +
> +    return -EINVAL;
> +}
> +
> +
> +static void coroutine_fn vu_block_flush(VuBlockReq *req)
> +{
> +    VuBlockDev *vdev_blk = get_vu_block_device_by_client(req->client);
> +    BlockBackend *backend = vdev_blk->backend;
> +    blk_co_flush(backend);
> +}
> +
> +
> +static int coroutine_fn vu_block_virtio_process_req(VuClient *client,
> +                                                    VuVirtq *vq)
> +{
> +    VuDev *vu_dev = &client->parent;
> +    VuVirtqElement *elem;
> +    uint32_t type;
> +    VuBlockReq *req;
> +
> +    VuBlockDev *vdev_blk = get_vu_block_device_by_client(client);
> +    BlockBackend *backend = vdev_blk->backend;
> +    elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) +
> +                                    sizeof(VuBlockReq));
> +    if (!elem) {
> +        return -1;
> +    }
> +
> +    struct iovec *in_iov = elem->in_sg;
> +    struct iovec *out_iov = elem->out_sg;
> +    unsigned in_num = elem->in_num;
> +    unsigned out_num = elem->out_num;
> +    /* refer to hw/block/virtio_blk.c */
> +    if (elem->out_num < 1 || elem->in_num < 1) {
> +        error_report("virtio-blk request missing headers");
> +        free(elem);
> +        return -EINVAL;
> +    }
> +
> +    req = g_new0(VuBlockReq, 1);
> +    req->client = client;
> +    req->vq = vq;
> +    req->elem = elem;

You can keep req on the stack, it will be freed before this function
returns.

> +    if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out,
> +                            sizeof(req->out)) != sizeof(req->out))) {
> +        error_report("virtio-blk request outhdr too short");
> +        goto err;
> +    }
> +
> +    iov_discard_front(&out_iov, &out_num, sizeof(req->out));
> +
> +    if (in_iov[in_num - 1].iov_len < sizeof(struct virtio_blk_inhdr)) {
> +        error_report("virtio-blk request inhdr too short");
> +        goto err;
> +    }
> +
> +    /* We always touch the last byte, so just see how big in_iov is.  */
> +    req->in = (void *)in_iov[in_num - 1].iov_base
> +              + in_iov[in_num - 1].iov_len
> +              - sizeof(struct virtio_blk_inhdr);
> +    iov_discard_back(in_iov, &in_num, sizeof(struct virtio_blk_inhdr));
> +
> +
> +    type = le32toh(req->out.type);
> +    switch (type & ~VIRTIO_BLK_T_BARRIER) {
> +    case VIRTIO_BLK_T_IN:
> +    case VIRTIO_BLK_T_OUT: {
> +        ssize_t ret = 0;
> +        bool is_write = type & VIRTIO_BLK_T_OUT;
> +        req->sector_num = le64toh(req->out.sector);
> +
> +        int64_t offset = req->sector_num * vdev_blk->blk_size;
> +        QEMUIOVector *qiov = g_new0(QEMUIOVector, 1);
> +        if (is_write) {
> +            qemu_iovec_init_external(qiov, out_iov, out_num);
> +            ret = blk_co_pwritev(backend, offset, qiov->size,
> +                                 qiov, 0);
> +        } else {
> +            qemu_iovec_init_external(qiov, in_iov, in_num);
> +            ret = blk_co_preadv(backend, offset, qiov->size,
> +                                 qiov, 0);
> +        }
> +        aio_wait_kick();

What purpose does this aio_wait_kick() serve? Usually you do this if you
want to signal completion of something to a different thread, but I
don't see that we changed any control state that could be visible to
other threads.

> +        if (ret >= 0) {
> +            req->in->status = VIRTIO_BLK_S_OK;
> +        } else {
> +            req->in->status = VIRTIO_BLK_S_IOERR;
> +        }
> +        vu_block_req_complete(req);
> +        break;
> +    }
> +    case VIRTIO_BLK_T_FLUSH:
> +        vu_block_flush(req);
> +        req->in->status = VIRTIO_BLK_S_OK;
> +        vu_block_req_complete(req);
> +        break;
> +    case VIRTIO_BLK_T_GET_ID: {
> +        size_t size = MIN(iov_size(&elem->in_sg[0], in_num),
> +                          VIRTIO_BLK_ID_BYTES);
> +        snprintf(elem->in_sg[0].iov_base, size, "%s", "vhost_user_blk_server");
> +        req->in->status = VIRTIO_BLK_S_OK;
> +        req->size = elem->in_sg[0].iov_len;
> +        vu_block_req_complete(req);
> +        break;
> +    }
> +    case VIRTIO_BLK_T_DISCARD:
> +    case VIRTIO_BLK_T_WRITE_ZEROES: {
> +        int rc;
> +        rc = vu_block_discard_write_zeroes(req, &elem->out_sg[1],
> +                                           out_num, type);
> +        if (rc == 0) {
> +            req->in->status = VIRTIO_BLK_S_OK;
> +        } else {
> +            req->in->status = VIRTIO_BLK_S_IOERR;
> +        }
> +        vu_block_req_complete(req);
> +        break;
> +    }
> +    default:
> +        req->in->status = VIRTIO_BLK_S_UNSUPP;
> +        vu_block_req_complete(req);
> +        break;
> +    }

You have vu_block_req_complete() as the last thing in every branch. You
could have it just once after the switch block.

> +    return 0;
> +
> +err:
> +    free(elem);
> +    g_free(req);
> +    return -EINVAL;
> +}

Okay, so vu_block_virtio_process_req() takes a single request from the
virtqueue and processes it. This makes sense as a building block, but it
doesn't allow parallelism yet. I expect to see the creation of multiple
coroutines below that can run in parallel.

> +
> +static void vu_block_process_vq(VuDev *vu_dev, int idx)
> +{
> +    VuClient *client;
> +    VuVirtq *vq;
> +    int ret;
> +
> +    client = container_of(vu_dev, VuClient, parent);
> +    assert(client);
> +
> +    vq = vu_get_queue(vu_dev, idx);
> +    assert(vq);
> +
> +    while (1) {
> +        ret = vu_block_virtio_process_req(client, vq);

If you call vu_block_virtio_process_req(), then this one needs to be a
coroutine_fn, too.

> +        if (ret) {
> +            break;
> +        }
> +    }
> +}
> +
> +static void vu_block_queue_set_started(VuDev *vu_dev, int idx, bool started)
> +{
> +    VuVirtq *vq;
> +
> +    assert(vu_dev);
> +
> +    vq = vu_get_queue(vu_dev, idx);
> +    vu_set_queue_handler(vu_dev, vq, started ? vu_block_process_vq : NULL);
> +}

If vu_block_process_vq() is used as a vq->handler, will it not be called
outside coroutine context? How can this work when
vu_block_virtio_process_req() is required to run in coroutine context?

Another problem is that vu_block_process_vq() runs a loop as long as
there are requests that aren't completed yet. During this time,
vu_kick_cb() will block. I don't think this is what was intended.
Instead, we should start the requests (without waiting for their
completion) and return.

> diff --git a/vl.c b/vl.c
> index 794f2e5733..0332ab70da 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2538,6 +2538,10 @@ static bool object_create_initial(const char *type, QemuOpts *opts)
>      }
>  #endif
>  
> +    /* Reason: vhost-user-server property "name" */

"node-name" now.

> +    if (g_str_equal(type, "vhost-user-blk-server")) {
> +        return false;
> +    }
>      /*
>       * Reason: filter-* property "netdev" etc.
>       */

Kevin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-19 16:38 ` [PATCH v4 0/5] vhost-user block device backend implementation Stefan Hajnoczi
@ 2020-02-26 15:18   ` Coiby Xu
  2020-02-27  7:41     ` Stefan Hajnoczi
  0 siblings, 1 reply; 22+ messages in thread
From: Coiby Xu @ 2020-02-26 15:18 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, bharatlkmlkvm, qemu-devel, Marc-André Lureau

Hi Stefan,

Thank you for reviewing my code!

I tried to reach you on IRC. But somehow either you missed my message
or I missed your reply. So I will reply by email instead.

If we use qio_channel_set_aio_fd_handler to monitor the G_IO_IN event,
i.e. use vu_dispatch as the read handler, then we can re-use
vu_message_read. And "removing the blocking recv from libvhost-user"
isn't necessary because "the operation of poll() and ppoll() is not
affected by the O_NONBLOCK flag", even though we use
qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
to make recv non-blocking.

Previously I needed to run a customized vu_kick_cb as a coroutine to
call blk_co_preadv/blk_co_pwritev directly. After taking Kevin's
feedback on v4 into consideration, I now use aio_set_fd_handler to set
a read handler for kick_fd, and this read handler will then call
vu_kick_cb.


On Thu, Feb 20, 2020 at 1:58 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Feb 18, 2020 at 01:07:06PM +0800, Coiby Xu wrote:
> > v4:
> >  * add object properties in class_init
> >  * relocate vhost-user-blk-test
> >  * other changes including using SocketAddress, coding style, etc.
>
> Thanks!  I think the vhost-user server code can be simplified if
> libvhost-user uses the event loop for asynchronous socket I/O.  Then
> it's no longer necessary to duplicate vu_message_read() and
> vu_kick_cb().  I've replied to Patch 1 and we can discuss on IRC if you
> want to chat about it.
>
> I've also CCed Marc-André to see what he thinks about removing the
> blocking recv from libvhost-user and instead using the event loop (just
> like for eventfds).
>
> Stefan



-- 
Best regards,
Coiby



* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-26 15:18   ` Coiby Xu
@ 2020-02-27  7:41     ` Stefan Hajnoczi
  2020-02-27  9:53       ` Coiby Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Hajnoczi @ 2020-02-27  7:41 UTC (permalink / raw)
  To: Coiby Xu; +Cc: kwolf, bharatlkmlkvm, qemu-devel, Marc-André Lureau


On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> Hi Stefan,
> 
> Thank you for reviewing my code!
> 
> I tried to reach you on IRC. But somehow either you missed my message
> or I missed your reply. So I will reply by email instead.
> 
> If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> i.e. use vu_dispatch as the read handler, then we can re-use
> vu_message_read. And "removing the blocking recv from libvhost-user"
> isn't necessary because "the operation of poll() and ppoll() is not
> affected by the O_NONBLOCK flag" despite that we use
> qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> to make recv non-blocking.

I'm not sure I understand.  poll() just says whether the file descriptor
is readable.  It does not say whether enough bytes are readable :).  So
our callback will be invoked if there is 1 byte ready, but when we try
to read 20 bytes either it will block (without O_NONBLOCK) or return
only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
code changes will be necessary.

But please go ahead and send the next revision and I'll take a look.

Stefan



* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27  7:41     ` Stefan Hajnoczi
@ 2020-02-27  9:53       ` Coiby Xu
  2020-02-27 10:02         ` Kevin Wolf
  0 siblings, 1 reply; 22+ messages in thread
From: Coiby Xu @ 2020-02-27  9:53 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, bharatlkmlkvm, qemu-devel, Marc-André Lureau

Thank you for reminding me of this socket short read issue! It seems
we still need a customized vu_message_read because libvhost-user assumes
we will always get a full-size VhostUserMsg and hasn't taken care of
the short read case. I will improve libvhost-user's vu_message_read
by making it keep reading from the socket until it gets enough bytes. I
assume short reads are a rare case, so the introduced performance
penalty would be negligible.


On Thu, Feb 27, 2020 at 3:41 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> > Hi Stefan,
> >
> > Thank you for reviewing my code!
> >
> > I tried to reach you on IRC. But somehow either you missed my message
> > or I missed your reply. So I will reply by email instead.
> >
> > If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> > i.e. use vu_dispatch as the read handler, then we can re-use
> > vu_message_read. And "removing the blocking recv from libvhost-user"
> > isn't necessary because "the operation of poll() and ppoll() is not
> > affected by the O_NONBLOCK flag" despite that we use
> > qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> > to make recv non-blocking.
>
> I'm not sure I understand.  poll() just says whether the file descriptor
> is readable.  It does not say whether enough bytes are readable :).  So
> our callback will be invoked if there is 1 byte ready, but when we try
> to read 20 bytes either it will block (without O_NONBLOCK) or return
> only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
> code changes will be necessary.
>
> But please go ahead and send the next revision and I'll take a look.
>
> Stefan



--
Best regards,
Coiby



* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27  9:53       ` Coiby Xu
@ 2020-02-27 10:02         ` Kevin Wolf
  2020-02-27 10:28           ` Coiby Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2020-02-27 10:02 UTC (permalink / raw)
  To: Coiby Xu
  Cc: bharatlkmlkvm, Marc-André Lureau, qemu-devel, Stefan Hajnoczi

Am 27.02.2020 um 10:53 hat Coiby Xu geschrieben:
> Thank you for reminding me of this socket short read issue! It seems
> we still need customized vu_message_read because libvhost-user assumes
> we will always get a full-size VhostUserMsg and hasn't taken care of
> this short read case. I will improve libvhost-user's vu_message_read
> by making it keep reading from socket util getting enough bytes. I
> assume short read is a rare case thus introduced performance penalty
> would be negligible.

In any case, please make sure that we use the QIOChannel functions
called from a coroutine in QEMU so that it will never block, but the
coroutine can just yield while it's waiting for more bytes.

Kevin

> On Thu, Feb 27, 2020 at 3:41 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> > > Hi Stefan,
> > >
> > > Thank you for reviewing my code!
> > >
> > > I tried to reach you on IRC. But somehow either you missed my message
> > > or I missed your reply. So I will reply by email instead.
> > >
> > > If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> > > i.e. use vu_dispatch as the read handler, then we can re-use
> > > vu_message_read. And "removing the blocking recv from libvhost-user"
> > > isn't necessary because "the operation of poll() and ppoll() is not
> > > affected by the O_NONBLOCK flag" despite that we use
> > > qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> > > to make recv non-blocking.
> >
> > I'm not sure I understand.  poll() just says whether the file descriptor
> > is readable.  It does not say whether enough bytes are readable :).  So
> > our callback will be invoked if there is 1 byte ready, but when we try
> > to read 20 bytes either it will block (without O_NONBLOCK) or return
> > only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
> > code changes will be necessary.
> >
> > But please go ahead and send the next revision and I'll take a look.
> >
> > Stefan
> 
> 
> 
> --
> Best regards,
> Coiby
> 




* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 10:02         ` Kevin Wolf
@ 2020-02-27 10:28           ` Coiby Xu
  2020-02-27 10:55             ` Kevin Wolf
  0 siblings, 1 reply; 22+ messages in thread
From: Coiby Xu @ 2020-02-27 10:28 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: bharatlkmlkvm, Marc-André Lureau, qemu-devel, Stefan Hajnoczi

> > we still need customized vu_message_read because libvhost-user assumes
> > we will always get a full-size VhostUserMsg and hasn't taken care of
> > this short read case. I will improve libvhost-user's vu_message_read
> > by making it keep reading from socket util getting enough bytes. I
> > assume short read is a rare case thus introduced performance penalty
> > would be negligible.

> In any case, please make sure that we use the QIOChannel functions
> called from a coroutine in QEMU so that it will never block, but the
> coroutine can just yield while it's waiting for more bytes.

But if I am not wrong, libvhost-user is supposed to be independent from
the main QEMU code. So it can't use the QIOChannel functions if we
simply modify the existing vu_message_read to address the short read
issue. In v3 & v4, I extended libvhost-user to allow vu_message_read to
be replaced by one which will depend on the main QEMU code. I'm not
sure which way is better.

On Thu, Feb 27, 2020 at 6:02 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 27.02.2020 um 10:53 hat Coiby Xu geschrieben:
> > Thank you for reminding me of this socket short read issue! It seems
> > we still need customized vu_message_read because libvhost-user assumes
> > we will always get a full-size VhostUserMsg and hasn't taken care of
> > this short read case. I will improve libvhost-user's vu_message_read
> > by making it keep reading from socket util getting enough bytes. I
> > assume short read is a rare case thus introduced performance penalty
> > would be negligible.
>
> In any case, please make sure that we use the QIOChannel functions
> called from a coroutine in QEMU so that it will never block, but the
> coroutine can just yield while it's waiting for more bytes.
>
> Kevin
>
> > On Thu, Feb 27, 2020 at 3:41 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> > > > Hi Stefan,
> > > >
> > > > Thank you for reviewing my code!
> > > >
> > > > I tried to reach you on IRC. But somehow either you missed my message
> > > > or I missed your reply. So I will reply by email instead.
> > > >
> > > > If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> > > > i.e. use vu_dispatch as the read handler, then we can re-use
> > > > vu_message_read. And "removing the blocking recv from libvhost-user"
> > > > isn't necessary because "the operation of poll() and ppoll() is not
> > > > affected by the O_NONBLOCK flag" despite that we use
> > > > qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> > > > to make recv non-blocking.
> > >
> > > I'm not sure I understand.  poll() just says whether the file descriptor
> > > is readable.  It does not say whether enough bytes are readable :).  So
> > > our callback will be invoked if there is 1 byte ready, but when we try
> > > to read 20 bytes either it will block (without O_NONBLOCK) or return
> > > only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
> > > code changes will be necessary.
> > >
> > > But please go ahead and send the next revision and I'll take a look.
> > >
> > > Stefan
> >
> >
> >
> > --
> > Best regards,
> > Coiby
> >
>


-- 
Best regards,
Coiby



* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 10:28           ` Coiby Xu
@ 2020-02-27 10:55             ` Kevin Wolf
  2020-02-27 11:07               ` Marc-André Lureau
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2020-02-27 10:55 UTC (permalink / raw)
  To: Coiby Xu
  Cc: bharatlkmlkvm, Marc-André Lureau, qemu-devel, Stefan Hajnoczi

Am 27.02.2020 um 11:28 hat Coiby Xu geschrieben:
> > > we still need customized vu_message_read because libvhost-user assumes
> > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > this short read case. I will improve libvhost-user's vu_message_read
> > > by making it keep reading from socket util getting enough bytes. I
> > > assume short read is a rare case thus introduced performance penalty
> > > would be negligible.
> 
> > In any case, please make sure that we use the QIOChannel functions
> > called from a coroutine in QEMU so that it will never block, but the
> > coroutine can just yield while it's waiting for more bytes.
> 
> But if I am not wrong, libvhost-user is supposed to be indepdent from
> the main QEMU code. So it can't use the QIOChannel functions if we
> simply modify exiting vu_message_read to address the short read issue.
> In v3 & v4, I extended libvhost-user to allow vu_message_read to be
> replaced by one which will depend on the main QEMU code. I'm not sure
> which way is better.

The way your latest patches have it, with a separate read function,
works for me.

You could probably change libvhost-user to reimplement the same
functionality, and it might be an improvement for other users of the
library, but it's also code duplication and doesn't provide more value
in the context of the vhost-user export in QEMU.

The point that's really important to me is just that we never block when
we run inside QEMU because that would actually stall the guest. This
means busy waiting in a tight loop until read() returns enough bytes is
not acceptable in QEMU.

Kevin

> On Thu, Feb 27, 2020 at 6:02 PM Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Am 27.02.2020 um 10:53 hat Coiby Xu geschrieben:
> > > Thank you for reminding me of this socket short read issue! It seems
> > > we still need customized vu_message_read because libvhost-user assumes
> > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > this short read case. I will improve libvhost-user's vu_message_read
> > > by making it keep reading from socket util getting enough bytes. I
> > > assume short read is a rare case thus introduced performance penalty
> > > would be negligible.
> >
> > In any case, please make sure that we use the QIOChannel functions
> > called from a coroutine in QEMU so that it will never block, but the
> > coroutine can just yield while it's waiting for more bytes.
> >
> > Kevin
> >
> > > On Thu, Feb 27, 2020 at 3:41 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> > > > > Hi Stefan,
> > > > >
> > > > > Thank you for reviewing my code!
> > > > >
> > > > > I tried to reach you on IRC. But somehow either you missed my message
> > > > > or I missed your reply. So I will reply by email instead.
> > > > >
> > > > > If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> > > > > i.e. use vu_dispatch as the read handler, then we can re-use
> > > > > vu_message_read. And "removing the blocking recv from libvhost-user"
> > > > > isn't necessary because "the operation of poll() and ppoll() is not
> > > > > affected by the O_NONBLOCK flag" despite that we use
> > > > > qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> > > > > to make recv non-blocking.
> > > >
> > > > I'm not sure I understand.  poll() just says whether the file descriptor
> > > > is readable.  It does not say whether enough bytes are readable :).  So
> > > > our callback will be invoked if there is 1 byte ready, but when we try
> > > > to read 20 bytes either it will block (without O_NONBLOCK) or return
> > > > only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
> > > > code changes will be necessary.
> > > >
> > > > But please go ahead and send the next revision and I'll take a look.
> > > >
> > > > Stefan
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Coiby
> > >
> >
> 
> 
> -- 
> Best regards,
> Coiby
> 




* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 10:55             ` Kevin Wolf
@ 2020-02-27 11:07               ` Marc-André Lureau
  2020-02-27 11:19                 ` Kevin Wolf
  0 siblings, 1 reply; 22+ messages in thread
From: Marc-André Lureau @ 2020-02-27 11:07 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: bharatlkmlkvm, Stefan Hajnoczi, Coiby Xu, qemu-devel

Hi

On Thu, Feb 27, 2020 at 11:55 AM Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 27.02.2020 um 11:28 hat Coiby Xu geschrieben:
> > > > we still need customized vu_message_read because libvhost-user assumes
> > > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > > this short read case. I will improve libvhost-user's vu_message_read
> > > > by making it keep reading from socket util getting enough bytes. I
> > > > assume short read is a rare case thus introduced performance penalty
> > > > would be negligible.
> >
> > > In any case, please make sure that we use the QIOChannel functions
> > > called from a coroutine in QEMU so that it will never block, but the
> > > coroutine can just yield while it's waiting for more bytes.
> >
> > But if I am not wrong, libvhost-user is supposed to be indepdent from
> > the main QEMU code. So it can't use the QIOChannel functions if we
> > simply modify exiting vu_message_read to address the short read issue.
> > In v3 & v4, I extended libvhost-user to allow vu_message_read to be
> > replaced by one which will depend on the main QEMU code. I'm not sure
> > which way is better.
>
> The way your latest patches have it, with a separate read function,
> works for me.

Done right, I am not against it, fwiw

> You could probably change libvhost-user to reimplement the same
> functionality, and it might be an improvement for other users of the
> library, but it's also code duplication and doesn't provide more value
> in the context of the vhost-user export in QEMU.
>
> The point that's really important to me is just that we never block when
> we run inside QEMU because that would actually stall the guest. This
> means busy waiting in a tight loop until read() returns enough bytes is
> not acceptable in QEMU.

In the context of vhost-user, with local unix sockets and short messages
(do we have >1k messages?), I am not sure this is really a problem.

And isn't it possible to run libvhost-user in its own thread for this series?


>
> Kevin
>
> > On Thu, Feb 27, 2020 at 6:02 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > Am 27.02.2020 um 10:53 hat Coiby Xu geschrieben:
> > > > Thank you for reminding me of this socket short read issue! It seems
> > > > we still need customized vu_message_read because libvhost-user assumes
> > > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > > this short read case. I will improve libvhost-user's vu_message_read
> > > > by making it keep reading from socket util getting enough bytes. I
> > > > assume short read is a rare case thus introduced performance penalty
> > > > would be negligible.
> > >
> > > In any case, please make sure that we use the QIOChannel functions
> > > called from a coroutine in QEMU so that it will never block, but the
> > > coroutine can just yield while it's waiting for more bytes.
> > >
> > > Kevin
> > >
> > > > On Thu, Feb 27, 2020 at 3:41 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Wed, Feb 26, 2020 at 11:18:41PM +0800, Coiby Xu wrote:
> > > > > > Hi Stefan,
> > > > > >
> > > > > > Thank you for reviewing my code!
> > > > > >
> > > > > > I tried to reach you on IRC. But somehow either you missed my message
> > > > > > or I missed your reply. So I will reply by email instead.
> > > > > >
> > > > > > If we use qio_channel_set_aio_fd_handler to monitor G_IO_IN event,
> > > > > > i.e. use vu_dispatch as the read handler, then we can re-use
> > > > > > vu_message_read. And "removing the blocking recv from libvhost-user"
> > > > > > isn't necessary because "the operation of poll() and ppoll() is not
> > > > > > affected by the O_NONBLOCK flag" despite that we use
> > > > > > qio_channel_set_blocking before calling qio_channel_set_aio_fd_handler
> > > > > > to make recv non-blocking.
> > > > >
> > > > > I'm not sure I understand.  poll() just says whether the file descriptor
> > > > > is readable.  It does not say whether enough bytes are readable :).  So
> > > > > our callback will be invoked if there is 1 byte ready, but when we try
> > > > > to read 20 bytes either it will block (without O_NONBLOCK) or return
> > > > > only 1 byte (with O_NONBLOCK).  Neither case is okay, so I expect that
> > > > > code changes will be necessary.
> > > > >
> > > > > But please go ahead and send the next revision and I'll take a look.
> > > > >
> > > > > Stefan
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Coiby
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Coiby
> >
>




* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 11:07               ` Marc-André Lureau
@ 2020-02-27 11:19                 ` Kevin Wolf
  2020-02-27 11:38                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin Wolf @ 2020-02-27 11:19 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: bharatlkmlkvm, Stefan Hajnoczi, Coiby Xu, qemu-devel

Am 27.02.2020 um 12:07 hat Marc-André Lureau geschrieben:
> On Thu, Feb 27, 2020 at 11:55 AM Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 27.02.2020 um 11:28 hat Coiby Xu geschrieben:
> > > > > we still need customized vu_message_read because libvhost-user assumes
> > > > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > > > this short read case. I will improve libvhost-user's vu_message_read
> > > > > by making it keep reading from socket util getting enough bytes. I
> > > > > assume short read is a rare case thus introduced performance penalty
> > > > > would be negligible.
> > >
> > > > In any case, please make sure that we use the QIOChannel functions
> > > > called from a coroutine in QEMU so that it will never block, but the
> > > > coroutine can just yield while it's waiting for more bytes.
> > >
> > > But if I am not wrong, libvhost-user is supposed to be indepdent from
> > > the main QEMU code. So it can't use the QIOChannel functions if we
> > > simply modify exiting vu_message_read to address the short read issue.
> > > In v3 & v4, I extended libvhost-user to allow vu_message_read to be
> > > replaced by one which will depend on the main QEMU code. I'm not sure
> > > which way is better.
> >
> > The way your latest patches have it, with a separate read function,
> > works for me.
> 
> Done right, I am not against it, fwiw
> 
> > You could probably change libvhost-user to reimplement the same
> > functionality, and it might be an improvement for other users of the
> > library, but it's also code duplication and doesn't provide more value
> > in the context of the vhost-user export in QEMU.
> >
> > The point that's really important to me is just that we never block when
> > we run inside QEMU because that would actually stall the guest. This
> > means busy waiting in a tight loop until read() returns enough bytes is
> > not acceptable in QEMU.
> 
> In the context of vhost-user, local unix sockets with short messages
> (do we have >1k messages?), I am not sure if this is really a problem.

I'm not sure how much of a problem it is in practice, and whether we
can consider the vhost-user client trusted. But using QIOChannel from
within a coroutine just avoids the problem completely, so it feels like
a natural choice to just do that.

> And isn't it possible to run libvhost-user in its own thread for this
> series?

You need to run the actual block I/O requests in the iothread where the
block device happens to run. So if you move the message processing to a
separate thread, you would have to communicate between threads for the
actual request processing. Possible, but probably slower than necessary
and certainly more complex.

Kevin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 11:19                 ` Kevin Wolf
@ 2020-02-27 11:38                   ` Daniel P. Berrangé
  2020-02-27 13:07                     ` Marc-André Lureau
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel P. Berrangé @ 2020-02-27 11:38 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: bharatlkmlkvm, Marc-André Lureau, Coiby Xu, Stefan Hajnoczi,
	qemu-devel

On Thu, Feb 27, 2020 at 12:19:58PM +0100, Kevin Wolf wrote:
> On 27.02.2020 at 12:07, Marc-André Lureau wrote:
> > On Thu, Feb 27, 2020 at 11:55 AM Kevin Wolf <kwolf@redhat.com> wrote:
> > > On 27.02.2020 at 11:28, Coiby Xu wrote:
> > > > > > we still need customized vu_message_read because libvhost-user assumes
> > > > > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > > > > this short read case. I will improve libvhost-user's vu_message_read
> > > > > > by making it keep reading from the socket until getting enough bytes.
> > > > > > I assume a short read is a rare case, so the introduced performance
> > > > > > penalty would be negligible.
> > > >
> > > > > In any case, please make sure that we use the QIOChannel functions
> > > > > called from a coroutine in QEMU so that it will never block, but the
> > > > > coroutine can just yield while it's waiting for more bytes.
> > > >
> > > > But if I am not wrong, libvhost-user is supposed to be independent from
> > > > the main QEMU code. So it can't use the QIOChannel functions if we
> > > > simply modify the existing vu_message_read to address the short read issue.
> > > > In v3 & v4, I extended libvhost-user to allow vu_message_read to be
> > > > replaced by one which will depend on the main QEMU code. I'm not sure
> > > > which way is better.
> > >
> > > The way your latest patches have it, with a separate read function,
> > > works for me.
> > 
> > Done right, I am not against it, fwiw
> > 
> > > You could probably change libvhost-user to reimplement the same
> > > functionality, and it might be an improvement for other users of the
> > > library, but it's also code duplication and doesn't provide more value
> > > in the context of the vhost-user export in QEMU.
> > >
> > > The point that's really important to me is just that we never block when
> > > we run inside QEMU because that would actually stall the guest. This
> > > means busy waiting in a tight loop until read() returns enough bytes is
> > > not acceptable in QEMU.
> > 
> > In the context of vhost-user, local unix sockets with short messages
> > (do we have >1k messages?), I am not sure if this is really a problem.
> 
> I'm not sure how much of a problem it is in practice, and whether we
> can consider the vhost-user client trusted. But using QIOChannel from
> within a coroutine just avoids the problem completely, so it feels like
> a natural choice to just do that.

It isn't clear to me that we have a consistent plan for how we intend
libvhost-user to develop & what it is permitted to use. The information
I see in the source code and in this thread is contradictory.

For example, in the text quoted above:

  "libvhost-user is supposed to be independent from the main QEMU code."

which did match my overall understanding too. At the top of libvhost-user.c
there is a comment

   /* this code avoids GLib dependency */

but a few lines later it does

  #include "qemu/atomic.h"
  #include "qemu/osdep.h"
  #include "qemu/memfd.h"

and in the Makefile we link it to much of QEMU util code:

  libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)

this in turn pulls in GLib code, and looking at symbols we can see
over 100 GLib functions used:

  $ nm ./libvhost-user.a | grep 'U g_' | sort | uniq | wc -l
  128

And over 200 QEMU source object files included:

  $ nm ./libvhost-user.a | grep '.o:' | sort | wc -l
  224

So unless I'm missing something, we have lost any independence from
both QEMU and GLib code that we might have had in the past.

Note this also has licensing implications, as I expect this means that
via the QEMU source objects it pulls in, libvhost-user.a has become
a GPLv2-only combined work, not a GPLv2-or-later combined work.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v4 0/5] vhost-user block device backend implementation
  2020-02-27 11:38                   ` Daniel P. Berrangé
@ 2020-02-27 13:07                     ` Marc-André Lureau
  0 siblings, 0 replies; 22+ messages in thread
From: Marc-André Lureau @ 2020-02-27 13:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Kevin Wolf, bharatlkmlkvm, Stefan Hajnoczi, Coiby Xu, qemu-devel

Hi

On Thu, Feb 27, 2020 at 12:39 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Feb 27, 2020 at 12:19:58PM +0100, Kevin Wolf wrote:
> > > On 27.02.2020 at 12:07, Marc-André Lureau wrote:
> > > On Thu, Feb 27, 2020 at 11:55 AM Kevin Wolf <kwolf@redhat.com> wrote:
> > > > > On 27.02.2020 at 11:28, Coiby Xu wrote:
> > > > > > > we still need customized vu_message_read because libvhost-user assumes
> > > > > > > we will always get a full-size VhostUserMsg and hasn't taken care of
> > > > > > > this short read case. I will improve libvhost-user's vu_message_read
> > > > > > > by making it keep reading from the socket until getting enough bytes.
> > > > > > > I assume a short read is a rare case, so the introduced performance
> > > > > > > penalty would be negligible.
> > > > >
> > > > > > In any case, please make sure that we use the QIOChannel functions
> > > > > > called from a coroutine in QEMU so that it will never block, but the
> > > > > > coroutine can just yield while it's waiting for more bytes.
> > > > >
> > > > > But if I am not wrong, libvhost-user is supposed to be independent from
> > > > > the main QEMU code. So it can't use the QIOChannel functions if we
> > > > > simply modify the existing vu_message_read to address the short read issue.
> > > > > In v3 & v4, I extended libvhost-user to allow vu_message_read to be
> > > > > replaced by one which will depend on the main QEMU code. I'm not sure
> > > > > which way is better.
> > > >
> > > > The way your latest patches have it, with a separate read function,
> > > > works for me.
> > >
> > > Done right, I am not against it, fwiw
> > >
> > > > You could probably change libvhost-user to reimplement the same
> > > > functionality, and it might be an improvement for other users of the
> > > > library, but it's also code duplication and doesn't provide more value
> > > > in the context of the vhost-user export in QEMU.
> > > >
> > > > The point that's really important to me is just that we never block when
> > > > we run inside QEMU because that would actually stall the guest. This
> > > > means busy waiting in a tight loop until read() returns enough bytes is
> > > > not acceptable in QEMU.
> > >
> > > In the context of vhost-user, local unix sockets with short messages
> > > (do we have >1k messages?), I am not sure if this is really a problem.
> >
> > I'm not sure how much of a problem it is in practice, and whether we
> > can consider the vhost-user client trusted. But using QIOChannel from
> > within a coroutine just avoids the problem completely, so it feels like
> > a natural choice to just do that.
>
> It isn't clear to me that we have a consistent plan for how we intend
> libvhost-user to develop & what it is permitted to use. The information
> I see in the source code and in this thread is contradictory.
>
> For example, in the text quoted above:
>
>   "libvhost-user is supposed to be independent from the main QEMU code."
>
> which did match my overall understanding too. At the top of libvhost-user.c
> there is a comment
>
>    /* this code avoids GLib dependency */
>
> but a few lines later it does
>
>   #include "qemu/atomic.h"
>   #include "qemu/osdep.h"
>   #include "qemu/memfd.h"
>
> and in the Makefile we link it to much of QEMU util code:
>
>   libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
>
> this in turn pulls in GLib code, and looking at symbols we can see
> over 100 GLib functions used:
>
>   $ nm ./libvhost-user.a | grep 'U g_' | sort | uniq | wc -l
>   128
>
> And over 200 QEMU source object files included:
>
>   $ nm ./libvhost-user.a | grep '.o:' | sort | wc -l
>   224
>
> So unless I'm missing something, we have lost any independence from
> both QEMU and GLib code that we might have had in the past.

Yep, I think this is mostly due to commit 5f9ff1eff38 ("libvhost-user:
Support tracking inflight I/O in shared memory")

It may not be that hard to bring back GLib independence. Is it worth it, though?

>
> Note this also has licensing implications, as I expect this means that
> via the QEMU source objects it pulls in, libvhost-user.a has become
> a GPLv2-only combined work, not a GPLv2-or-later combined work.
>

libvhost-user.c is GPLv2-or-later because tests/vhost-user-bridge.c
was, and still is, as well. Do we need to change that?

> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
>


-- 
Marc-André Lureau



* Re: [PATCH v4 2/5] generic vhost user server
  2020-02-25 15:44   ` Kevin Wolf
@ 2020-02-28  4:23     ` Coiby Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Coiby Xu @ 2020-02-28  4:23 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: bharatlkmlkvm, qemu-devel, stefanha

> > +static coroutine_fn void vu_client_next_trip(VuClient *client);
> > +
> > +static coroutine_fn void vu_client_trip(void *opaque)
> > +{
> > +    VuClient *client = opaque;
> > +
> > +    vu_dispatch(&client->parent);
> > +    client->co_trip = NULL;
> > +    if (!client->closed) {
> > +        vu_client_next_trip(client);
> > +    }
> > +}

> The last part is very untypical coroutine code: It says that we want to
> spawn a new coroutine with vu_client_trip() as its entry point, and then
> terminates the current one.
>
> Why don't we just put the whole thing in a while (!client->closed) loop
> and stay in the same coroutine instead of terminating the old one and
> starting a new one all the time?

> > +static coroutine_fn void vu_client_next_trip(VuClient *client)
> > +{
> > +    if (!client->co_trip) {
> > +        client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> > +        aio_co_schedule(client->ioc->ctx, client->co_trip);
> > +    }
> > +}
> > +
> > +static void vu_client_start(VuClient *client)
> > +{
> > +    client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> > +    aio_co_enter(client->ioc->ctx, client->co_trip);
> > +}

> This is essentially a duplicate of vu_client_next_trip(). The only
> place where it is called (vu_accept()) knows that client->co_trip is
> already NULL, so it could just call vu_client_next_trip().

> Or in fact, if vu_client_trip() gets turned into a loop, it's
> vu_client_next_trip() that becomes unnecessary.

This part of the code imitates nbd_client_trip in nbd/server.c.
I think the reason for repeatedly creating/starting/terminating
vu_client_trip is to support BlockBackendAioNotifier. In v5, I will
keep the spawned coroutine running in a loop until it is informed of a
change of the block device backend's AioContext, i.e. vu_client_trip
will only be restarted when the block device backend is attached to a
different AioContext.

> > +    if (rc != sizeof(eventfd_t)) {
> > +        if (errno == EAGAIN) {
> > +            qio_channel_yield(data->ioc, G_IO_IN);
> > +        } else if (errno != EINTR) {
> > +            data->co = NULL;
> > +            return;
> > +        }
> > +    } else {
> > +        vq->handler(dev, index);
> > +    }
> > +    data->co = NULL;
> > +    vu_kick_cb_next(client, data);

> This can be a loop, too, instead of terminating the coroutine and
> starting a new one for the same function.

In v5, I plan to use aio_set_fd_handler to set a read handler which
is a wrapper for vu_kick_cb to deal with kick events, since eventfd
doesn't have the short read issue that sockets have. Thus vu_kick_cb
in libvhost-user can be re-used. My only concern is whether this could
lead to worse performance compared to keeping reading from the eventfd
until getting EAGAIN.

On Tue, Feb 25, 2020 at 11:44 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> On 18.02.2020 at 06:07, Coiby Xu wrote:
> > Sharing QEMU devices via vhost-user protocol
> >
> > Signed-off-by: Coiby Xu <coiby.xu@gmail.com>
> > ---
> >  util/Makefile.objs       |   3 +
> >  util/vhost-user-server.c | 427 +++++++++++++++++++++++++++++++++++++++
> >  util/vhost-user-server.h |  56 +++++
> >  3 files changed, 486 insertions(+)
> >  create mode 100644 util/vhost-user-server.c
> >  create mode 100644 util/vhost-user-server.h
> >
> > diff --git a/util/vhost-user-server.h b/util/vhost-user-server.h
> > new file mode 100644
> > index 0000000000..ff6d3145cd
> > --- /dev/null
> > +++ b/util/vhost-user-server.h
> > @@ -0,0 +1,56 @@
> > +#include "io/channel-socket.h"
> > +#include "io/channel-file.h"
> > +#include "io/net-listener.h"
> > +#include "contrib/libvhost-user/libvhost-user.h"
> > +#include "standard-headers/linux/virtio_blk.h"
> > +#include "qemu/error-report.h"
> > +
> > +typedef struct VuClient VuClient;
>
> I find the terminology a bit confusing here: VuClient is really the
> connection to a single client, but it's part of the server. The name
> gives the impression as if this were client-side code. (This is
> something that already tends to confuse me in the NBD code.)
>
> I'm not sure what a better name could be, though. Maybe
> VuServerConnection or VuExportClient or VuExportConnection?
>
> > +typedef struct VuServer {
> > +    QIONetListener *listener;
> > +    AioContext *ctx;
> > +    QTAILQ_HEAD(, VuClient) clients;
> > +    void (*device_panic_notifier)(struct VuClient *client) ;
> > +    int max_queues;
> > +    const VuDevIface *vu_iface;
> > +    /*
> > +     * @ptr_in_device: VuServer pointer memory location in vhost-user device
> > +     * struct, so later container_of can be used to get device destruct
> > +     */
> > +    void *ptr_in_device;
> > +    bool close;
> > +} VuServer;
> > +
> > +typedef struct kick_info {
> > +    VuDev *vu_dev;
>
> I suppose this could specifically be VuClient?
>
> > +    int fd; /*kick fd*/
> > +    long index; /*queue index*/
> > +    QIOChannel *ioc; /*I/O channel for kick fd*/
> > +    QIOChannelFile *fioc; /*underlying data channel for kick fd*/
> > +    Coroutine *co;
> > +} kick_info;
> > +
> > +struct VuClient {
> > +    VuDev parent;
> > +    VuServer *server;
> > +    QIOChannel *ioc; /* The current I/O channel */
> > +    QIOChannelSocket *sioc; /* The underlying data channel */
> > +    Coroutine *co_trip;
> > +    struct kick_info *kick_info;
>
> If each struct kick_info (btw, QEMU coding style requires CamelCase) has
> exactly one VuClient and each VuClient has exactly one kick_info, should
> this be a single struct containing both?
>
> [ Coming back from reading the code below - it's because this is in
> fact an array. This should be made clear in the definition. ]
>
> > +    QTAILQ_ENTRY(VuClient) next;
> > +    bool closed;
> > +};
> > +
> > +
> > +VuServer *vhost_user_server_start(uint16_t max_queues,
> > +                                  SocketAddress *unix_socket,
> > +                                  AioContext *ctx,
> > +                                  void *server_ptr,
> > +                                  void *device_panic_notifier,
> > +                                  const VuDevIface *vu_iface,
> > +                                  Error **errp);
> > +
> > +void vhost_user_server_stop(VuServer *server);
> > +
> > +void change_vu_context(AioContext *ctx, VuServer *server);
>
> Let's call this vhost_user_server_set_aio_context() for consistency.
>
> > diff --git a/util/Makefile.objs b/util/Makefile.objs
> > index 11262aafaf..5e450e501c 100644
> > --- a/util/Makefile.objs
> > +++ b/util/Makefile.objs
> > @@ -36,6 +36,9 @@ util-obj-y += readline.o
> >  util-obj-y += rcu.o
> >  util-obj-$(CONFIG_MEMBARRIER) += sys_membarrier.o
> >  util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
> > +ifdef CONFIG_LINUX
> > +util-obj-y += vhost-user-server.o
> > +endif
> >  util-obj-y += qemu-coroutine-sleep.o
> >  util-obj-y += qemu-co-shared-resource.o
> >  util-obj-y += coroutine-$(CONFIG_COROUTINE_BACKEND).o
> > diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
> > new file mode 100644
> > index 0000000000..70ff6d6701
> > --- /dev/null
> > +++ b/util/vhost-user-server.c
> > @@ -0,0 +1,427 @@
> > +/*
> > + * Sharing QEMU devices via vhost-user protocol
> > + *
> > + * Author: Coiby Xu <coiby.xu@gmail.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or
> > + * later.  See the COPYING file in the top-level directory.
> > + */
> > +#include "qemu/osdep.h"
> > +#include <sys/eventfd.h>
> > +#include "qemu/main-loop.h"
> > +#include "vhost-user-server.h"
> > +
> > +static void vmsg_close_fds(VhostUserMsg *vmsg)
> > +{
> > +    int i;
> > +    for (i = 0; i < vmsg->fd_num; i++) {
> > +        close(vmsg->fds[i]);
> > +    }
> > +}
> > +
> > +static void vmsg_unblock_fds(VhostUserMsg *vmsg)
> > +{
> > +    int i;
> > +    for (i = 0; i < vmsg->fd_num; i++) {
> > +        qemu_set_nonblock(vmsg->fds[i]);
> > +    }
> > +}
> > +
> > +
> > +static void close_client(VuClient *client)
> > +{
> > +    vu_deinit(&client->parent);
> > +    client->sioc = NULL;
> > +    object_unref(OBJECT(client->ioc));
> > +    client->closed = true;
> > +
> > +}
> > +
> > +static void panic_cb(VuDev *vu_dev, const char *buf)
> > +{
> > +    if (buf) {
> > +        error_report("vu_panic: %s", buf);
> > +    }
> > +
> > +    VuClient *client = container_of(vu_dev, VuClient, parent);
> > +    VuServer *server = client->server;
>
> Please put declarations at the start of the block.
>
> > +    if (!client->closed) {
> > +        close_client(client);
> > +        QTAILQ_REMOVE(&server->clients, client, next);
> > +    }
> > +
> > +    if (server->device_panic_notifier) {
> > +        server->device_panic_notifier(client);
> > +    }
> > +}
> > +
> > +
> > +
> > +static bool coroutine_fn
> > +vu_message_read(VuDev *vu_dev, int conn_fd, VhostUserMsg *vmsg)
> > +{
> > +    struct iovec iov = {
> > +        .iov_base = (char *)vmsg,
> > +        .iov_len = VHOST_USER_HDR_SIZE,
> > +    };
> > +    int rc, read_bytes = 0;
> > +    /*
> > +     * VhostUserMsg is a packed structure, gcc will complain about passing
> > +     * pointer to a packed structure member if we pass &VhostUserMsg.fd_num
> > +     * and &VhostUserMsg.fds directly when calling qio_channel_readv_full,
> > +     * thus two temporary variables nfds and fds are used here.
> > +     */
> > +    size_t nfds = 0, nfds_t = 0;
> > +    int *fds = NULL, *fds_t = NULL;
> > +    VuClient *client = container_of(vu_dev, VuClient, parent);
> > +    QIOChannel *ioc = client->ioc;
> > +
> > +    Error *erp;
>
> The convention is to call this local_err. It should be initialised as
> NULL.
>
> > +    assert(qemu_in_coroutine());
> > +    do {
> > +        /*
> > +         * qio_channel_readv_full may have short reads, keeping calling it
> > +         * until getting VHOST_USER_HDR_SIZE or 0 bytes in total
> > +         */
> > +        rc = qio_channel_readv_full(ioc, &iov, 1, &fds_t, &nfds_t, &erp);
> > +        if (rc < 0) {
> > +            if (rc == QIO_CHANNEL_ERR_BLOCK) {
> > +                qio_channel_yield(ioc, G_IO_IN);
> > +                continue;
> > +            } else {
> > +                error_report("Error while recvmsg: %s", strerror(errno));
>
> I don't think qio_channel_*() promises anything about the value in
> errno. (They also don't promise to use recvmsg().)
>
> Instead, use error_report_err() because erp contains the real error
> message.
>
> > +                return false;
> > +            }
> > +        }
> > +        read_bytes += rc;
> > +        fds = g_renew(int, fds_t, nfds + nfds_t);
> > +        memcpy(fds + nfds, fds_t, nfds_t);
> > +        nfds += nfds_t;
> > +        if (read_bytes == VHOST_USER_HDR_SIZE || rc == 0) {
> > +            break;
> > +        }
> > +    } while (true);
> > +
> > +    vmsg->fd_num = nfds;
> > +    memcpy(vmsg->fds, fds, nfds * sizeof(int));
> > +    g_free(fds);
> > +    /* qio_channel_readv_full will make socket fds blocking, unblock them */
> > +    vmsg_unblock_fds(vmsg);
> > +    if (vmsg->size > sizeof(vmsg->payload)) {
> > +        error_report("Error: too big message request: %d, "
> > +                     "size: vmsg->size: %u, "
> > +                     "while sizeof(vmsg->payload) = %zu",
> > +                     vmsg->request, vmsg->size, sizeof(vmsg->payload));
> > +        goto fail;
> > +    }
> > +
> > +    struct iovec iov_payload = {
> > +        .iov_base = (char *)&vmsg->payload,
> > +        .iov_len = vmsg->size,
> > +    };
> > +    if (vmsg->size) {
> > +        rc = qio_channel_readv_all_eof(ioc, &iov_payload, 1, &erp);
> > +        if (rc == -1) {
> > +            error_report("Error while reading: %s", strerror(errno));
>
> error_report_err() again.
>
> > +            goto fail;
> > +        }
> > +    }
> > +
> > +    return true;
> > +
> > +fail:
> > +    vmsg_close_fds(vmsg);
> > +
> > +    return false;
> > +}
> > +
> > +
> > +static coroutine_fn void vu_client_next_trip(VuClient *client);
> > +
> > +static coroutine_fn void vu_client_trip(void *opaque)
> > +{
> > +    VuClient *client = opaque;
> > +
> > +    vu_dispatch(&client->parent);
> > +    client->co_trip = NULL;
> > +    if (!client->closed) {
> > +        vu_client_next_trip(client);
> > +    }
> > +}
>
> The last part is very untypical coroutine code: It says that we want to
> spawn a new coroutine with vu_client_trip() as its entry point, and then
> terminates the current one.
>
> Why don't we just put the whole thing in a while (!client->closed) loop
> and stay in the same coroutine instead of terminating the old one and
> starting a new one all the time?
>
> > +static coroutine_fn void vu_client_next_trip(VuClient *client)
> > +{
> > +    if (!client->co_trip) {
> > +        client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> > +        aio_co_schedule(client->ioc->ctx, client->co_trip);
> > +    }
> > +}
> > +
> > +static void vu_client_start(VuClient *client)
> > +{
> > +    client->co_trip = qemu_coroutine_create(vu_client_trip, client);
> > +    aio_co_enter(client->ioc->ctx, client->co_trip);
> > +}
>
> This is essentially a duplicate of vu_client_next_trip(). The only
> place where it is called (vu_accept()) knows that client->co_trip is
> already NULL, so it could just call vu_client_next_trip().
>
> Or in fact, if vu_client_trip() gets turned into a loop, it's
> vu_client_next_trip() that becomes unnecessary.
>
> > +static void coroutine_fn vu_kick_cb_next(VuClient *client,
> > +                                          kick_info *data);
> > +
> > +static void coroutine_fn vu_kick_cb(void *opaque)
> > +{
> > +    kick_info *data = (kick_info *) opaque;
> > +    int index = data->index;
> > +    VuDev *dev = data->vu_dev;
> > +    VuClient *client;
> > +    client = container_of(dev, VuClient, parent);
> > +    VuVirtq *vq = &dev->vq[index];
> > +    int sock = vq->kick_fd;
> > +    if (sock == -1) {
> > +        return;
> > +    }
> > +    assert(sock == data->fd);
> > +    eventfd_t kick_data;
> > +    ssize_t rc;
> > +    /*
> > +     * When eventfd is closed, the revent is POLLNVAL (=G_IO_NVAL) and
> > +     * reading eventfd will return errno=EBADF (Bad file number).
> > +     * Calling qio_channel_yield(ioc, G_IO_IN) will set reading handler
> > +     * for QIOChannel, but aio_dispatch_handlers will only dispatch
> > +     * G_IO_IN | G_IO_HUP | G_IO_ERR revents while ignoring
> > +     * G_IO_NVAL (POLLNVAL) revents.
> > +     *
> > +     * Thus when eventfd is closed by vhost-user client, QEMU will ignore
> > +     * G_IO_NVAL and keeping polling by repeatedly calling qemu_poll_ns which
> > +     * will lead to 100% CPU usage.
> > +     *
> > +     * To aovid this issue, make sure set_watch and remove_watch use the same
>
> s/aovid/avoid/
>
> > +     * AIOContext for QIOChannel. Thus remove_watch will eventually succefully
> > +     * remove eventfd from the set of file descriptors polled for
> > +     * corresponding GSource.
> > +     */
> > +    rc = read(sock, &kick_data, sizeof(eventfd_t));
>
> Why not a QIOChannel function like for vu_message_read() above?
>
> > +    if (rc != sizeof(eventfd_t)) {
> > +        if (errno == EAGAIN) {
> > +            qio_channel_yield(data->ioc, G_IO_IN);
> > +        } else if (errno != EINTR) {
> > +            data->co = NULL;
> > +            return;
> > +        }
> > +    } else {
> > +        vq->handler(dev, index);
> > +    }
> > +    data->co = NULL;
> > +    vu_kick_cb_next(client, data);
>
> This can be a loop, too, instead of terminating the coroutine and
> starting a new one for the same function.
>
> > +}
> > +
> > +static void coroutine_fn vu_kick_cb_next(VuClient *client,
> > +                                          kick_info *cb_data)
> > +{
> > +    if (!cb_data->co) {
> > +        cb_data->co = qemu_coroutine_create(vu_kick_cb, cb_data);
> > +        aio_co_schedule(client->ioc->ctx, cb_data->co);
> > +    }
> > +}
>
> Kevin
>


-- 
Best regards,
Coiby



end of thread, other threads:[~2020-02-28  4:24 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-18  5:07 [PATCH v4 0/5] vhost-user block device backend implementation Coiby Xu
2020-02-18  5:07 ` [PATCH v4 1/5] extend libvhost to support IOThread and coroutine Coiby Xu
2020-02-19 16:33   ` Stefan Hajnoczi
2020-02-25 14:55   ` Kevin Wolf
2020-02-18  5:07 ` [PATCH v4 2/5] generic vhost user server Coiby Xu
2020-02-25 15:44   ` Kevin Wolf
2020-02-28  4:23     ` Coiby Xu
2020-02-18  5:07 ` [PATCH v4 3/5] vhost-user block device backend server Coiby Xu
2020-02-25 16:09   ` Kevin Wolf
2020-02-18  5:07 ` [PATCH v4 4/5] a standone-alone tool to directly share disk image file via vhost-user protocol Coiby Xu
2020-02-18  5:07 ` [PATCH v4 5/5] new qTest case to test the vhost-user-blk-server Coiby Xu
2020-02-19 16:38 ` [PATCH v4 0/5] vhost-user block device backend implementation Stefan Hajnoczi
2020-02-26 15:18   ` Coiby Xu
2020-02-27  7:41     ` Stefan Hajnoczi
2020-02-27  9:53       ` Coiby Xu
2020-02-27 10:02         ` Kevin Wolf
2020-02-27 10:28           ` Coiby Xu
2020-02-27 10:55             ` Kevin Wolf
2020-02-27 11:07               ` Marc-André Lureau
2020-02-27 11:19                 ` Kevin Wolf
2020-02-27 11:38                   ` Daniel P. Berrangé
2020-02-27 13:07                     ` Marc-André Lureau
