* [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
@ 2019-01-09 11:27 elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets elohimes
                   ` (8 more replies)
  0 siblings, 9 replies; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patchset allows QEMU to reconnect to the vhost-user-blk
backend after the backend crashes or restarts.

Patch 1 uses the existing wait/nowait options to make QEMU skip
the connect on client sockets during initialization of the chardev.

Patch 2 introduces two new messages, VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD, to support providing shared
memory to the backend.

Patches 3 and 4 are the corresponding libvhost-user changes for
patch 2: they make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD.

Patch 5 allows vhost-user-blk to use the two new messages
to get/set the inflight buffer from/to the backend.

Patch 6 makes vhost-user-blk reconnect to the backend when the
connection is closed.

Patch 7 enables VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD in the
vhost-user-blk backend, which tells QEMU that reconnecting is
now supported.

To use it, we can start QEMU with:

qemu-system-x86_64 \
        -chardev socket,id=char0,path=/path/vhost.socket,nowait,reconnect=1 \
        -device vhost-user-blk-pci,chardev=char0

and start the vhost-user-blk backend with:

vhost-user-blk -b /path/file -s /path/vhost.socket

Then we can restart the vhost-user-blk backend at any time while the VM
is running.
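
For example (illustrative only; assumes the backend is the
contrib/vhost-user-blk binary started as above), the backend can simply
be killed and started again while the guest keeps running:

kill $(pidof vhost-user-blk)
vhost-user-blk -b /path/file -s /path/vhost.socket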

V3 to V4:
- Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
- Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
  and VHOST_USER_SET_INFLIGHT_FD
- Allocate inflight buffer in backend rather than in qemu
- Document a recommended format for inflight buffer

V2 to V3:
- Use the existing wait/nowait options to control connection on
  client sockets instead of introducing a "disconnected" option.
- Support the case where the vhost-user backend restarts during
  initialization of the vhost-user-blk device.

V1 to V2:
- Introduce "disconnected" option for chardev instead of reuse "wait"
  option
- Support the case where QEMU starts before the vhost-user backend
- Drop message VHOST_USER_SET_VRING_INFLIGHT
- Introduce two new messages VHOST_USER_GET_SHM_SIZE
  and VHOST_USER_SET_SHM_FD

Xie Yongji (7):
  char-socket: Enable "nowait" option on client sockets
  vhost-user: Support transferring inflight buffer between qemu and
    backend
  libvhost-user: Introduce vu_queue_map_desc()
  libvhost-user: Support tracking inflight I/O in shared memory
  vhost-user-blk: Add support to get/set inflight buffer
  vhost-user-blk: Add support to reconnect backend
  contrib/vhost-user-blk: enable inflight I/O tracking

 Makefile                                |   2 +-
 chardev/char-socket.c                   |  56 ++--
 contrib/libvhost-user/libvhost-user.c   | 346 ++++++++++++++++++++----
 contrib/libvhost-user/libvhost-user.h   |  29 ++
 contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
 docs/interop/vhost-user.txt             |  60 ++++
 hw/block/vhost-user-blk.c               | 223 ++++++++++++---
 hw/virtio/vhost-user.c                  | 108 ++++++++
 hw/virtio/vhost.c                       | 108 ++++++++
 include/hw/virtio/vhost-backend.h       |   9 +
 include/hw/virtio/vhost-user-blk.h      |   5 +
 include/hw/virtio/vhost.h               |  19 ++
 qapi/char.json                          |   3 +-
 qemu-options.hx                         |   9 +-
 14 files changed, 851 insertions(+), 129 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-10 12:49   ` Daniel P. Berrangé
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

Enable "nowait" option to make QEMU not do a connect
on client sockets during initialization of the chardev.
Then we can use qemu_chr_fe_wait_connected() to connect
when necessary. Now it would be used for unix domain
socket of vhost-user-blk device to support reconnect.
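
For example (usage sketch taken from the cover letter, not part of this
patch), a reconnecting client chardev for vhost-user-blk would be
created with:

-chardev socket,id=char0,path=/path/vhost.socket,nowait,reconnect=1 \
-device vhost-user-blk-pci,chardev=char0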

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
 qapi/char.json        |  3 +--
 qemu-options.hx       |  9 ++++---
 3 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index eaa8e8b68f..f803f4f7d3 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
         s->reconnect_time = reconnect;
     }
 
-    if (s->reconnect_time) {
-        tcp_chr_connect_async(chr);
-    } else {
-        if (s->is_listen) {
-            char *name;
-            s->listener = qio_net_listener_new();
+    if (s->is_listen) {
+        char *name;
+        s->listener = qio_net_listener_new();
 
-            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
-            qio_net_listener_set_name(s->listener, name);
-            g_free(name);
+        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
+        qio_net_listener_set_name(s->listener, name);
+        g_free(name);
 
-            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
-                object_unref(OBJECT(s->listener));
-                s->listener = NULL;
-                goto error;
-            }
+        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
+            object_unref(OBJECT(s->listener));
+            s->listener = NULL;
+            goto error;
+        }
 
-            qapi_free_SocketAddress(s->addr);
-            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
-            update_disconnected_filename(s);
+        qapi_free_SocketAddress(s->addr);
+        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
+        update_disconnected_filename(s);
 
-            if (is_waitconnect &&
-                qemu_chr_wait_connected(chr, errp) < 0) {
-                return;
-            }
-            if (!s->ioc) {
-                qio_net_listener_set_client_func_full(s->listener,
-                                                      tcp_chr_accept,
-                                                      chr, NULL,
-                                                      chr->gcontext);
-            }
+        if (is_waitconnect &&
+            qemu_chr_wait_connected(chr, errp) < 0) {
+            return;
+        }
+        if (!s->ioc) {
+            qio_net_listener_set_client_func_full(s->listener,
+                                                  tcp_chr_accept,
+                                                  chr, NULL,
+                                                  chr->gcontext);
+        }
+    } else if (is_waitconnect) {
+        if (s->reconnect_time) {
+            tcp_chr_connect_async(chr);
         } else if (qemu_chr_wait_connected(chr, errp) < 0) {
             goto error;
         }
@@ -1120,7 +1120,7 @@ static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
                                   Error **errp)
 {
     bool is_listen      = qemu_opt_get_bool(opts, "server", false);
-    bool is_waitconnect = is_listen && qemu_opt_get_bool(opts, "wait", true);
+    bool is_waitconnect = qemu_opt_get_bool(opts, "wait", true);
     bool is_telnet      = qemu_opt_get_bool(opts, "telnet", false);
     bool is_tn3270      = qemu_opt_get_bool(opts, "tn3270", false);
     bool is_websock     = qemu_opt_get_bool(opts, "websocket", false);
diff --git a/qapi/char.json b/qapi/char.json
index 77ed847972..6a3b5bcd71 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -249,8 +249,7 @@
 #        or connect to (server=false)
 # @tls-creds: the ID of the TLS credentials object (since 2.6)
 # @server: create server socket (default: true)
-# @wait: wait for incoming connection on server
-#        sockets (default: false).
+# @wait: wait for being connected or connecting to (default: false)
 # @nodelay: set TCP_NODELAY socket option (default: false)
 # @telnet: enable telnet protocol on server
 #          sockets (default: false)
diff --git a/qemu-options.hx b/qemu-options.hx
index d4f3564b78..ebd11220c4 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2556,8 +2556,9 @@ undefined if TCP options are specified for a unix socket.
 
 @option{server} specifies that the socket shall be a listening socket.
 
-@option{nowait} specifies that QEMU should not block waiting for a client to
-connect to a listening socket.
+@option{nowait} specifies that QEMU should not wait for being connected on
+server sockets or try to do a sync/async connect on client sockets during
+initialization of the chardev.
 
 @option{telnet} specifies that traffic on the socket should interpret telnet
 escape sequences.
@@ -3093,7 +3094,9 @@ I/O to a location or wait for a connection from a location.  By default
 the TCP Net Console is sent to @var{host} at the @var{port}.  If you use
 the @var{server} option QEMU will wait for a client socket application
 to connect to the port before continuing, unless the @code{nowait}
-option was specified.  The @code{nodelay} option disables the Nagle buffering
+option was specified. And the @code{nowait} option could also be
+used when @var{noserver} is set to disallow QEMU to connect during
+initialization.  The @code{nodelay} option disables the Nagle buffering
 algorithm.  The @code{reconnect} option only applies if @var{noserver} is
 set, if the connection goes down it will attempt to reconnect at the
 given interval.  If @var{host} is omitted, 0.0.0.0 is assumed. Only
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-14 22:25   ` Michael S. Tsirkin
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch introduces two new messages, VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD, to support transferring a shared
buffer between QEMU and the backend.

First, QEMU uses VHOST_USER_GET_INFLIGHT_FD to get the shared buffer
from the backend. QEMU then sends it back through
VHOST_USER_SET_INFLIGHT_FD each time vhost-user is started.

The backend uses this shared buffer to track inflight I/O. QEMU
should clear it on VM reset.
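
The master-side flow is roughly (illustrative sketch; the actual code is
vhost_user_get_inflight_fd()/vhost_user_set_inflight_fd() and the
vhost_dev_*_inflight() helpers added below):

    /* device setup: ask the slave for the inflight buffer and mmap it */
    vhost_dev_get_inflight(dev, inflight);   /* VHOST_USER_GET_INFLIGHT_FD */

    /* each time vhost-user is started: hand the same buffer back */
    vhost_dev_set_inflight(dev, inflight);   /* VHOST_USER_SET_INFLIGHT_FD */

    /* on VM reset: clear the tracking area */
    vhost_dev_reset_inflight(inflight);      /* memset(addr, 0, size) */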

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Chai Wen <chaiwen@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 docs/interop/vhost-user.txt       |  60 +++++++++++++++++
 hw/virtio/vhost-user.c            | 108 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 | 108 ++++++++++++++++++++++++++++++
 include/hw/virtio/vhost-backend.h |   9 +++
 include/hw/virtio/vhost.h         |  19 ++++++
 5 files changed, 304 insertions(+)

diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index c2194711d9..67da41fdd2 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -142,6 +142,18 @@ Depending on the request type, payload can be:
    Offset: a 64-bit offset of this area from the start of the
        supplied file descriptor
 
+ * Inflight description
+   ----------------------------------------------------------
+   | mmap size | mmap offset | align | num queues | version |
+   ----------------------------------------------------------
+
+   mmap size: a 64-bit size of area to track inflight I/O
+   mmap offset: a 64-bit offset of this area from the start
+                of the supplied file descriptor
+   align: a 32-bit align of each region in this area
+   num queues: a 16-bit number of virtqueues
+   version: a 16-bit version of this area
+
 In QEMU the vhost-user message is implemented with the following struct:
 
 typedef struct VhostUserMsg {
@@ -157,6 +169,7 @@ typedef struct VhostUserMsg {
         struct vhost_iotlb_msg iotlb;
         VhostUserConfig config;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
     };
 } QEMU_PACKED VhostUserMsg;
 
@@ -175,6 +188,7 @@ the ones that do:
  * VHOST_USER_GET_PROTOCOL_FEATURES
  * VHOST_USER_GET_VRING_BASE
  * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+ * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 [ Also see the section on REPLY_ACK protocol extension. ]
 
@@ -188,6 +202,7 @@ in the ancillary data:
  * VHOST_USER_SET_VRING_CALL
  * VHOST_USER_SET_VRING_ERR
  * VHOST_USER_SET_SLAVE_REQ_FD
+ * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 If Master is unable to send the full message or receives a wrong reply it will
 close the connection. An optional reconnection mechanism can be implemented.
@@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
 slave can send file descriptors (at most 8 descriptors in each message)
 to master via ancillary data using this fd communication channel.
 
+Inflight I/O tracking
+---------------------
+
+To support slave reconnecting, slave need to track inflight I/O in a
+shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
+are used to transfer the memory between master and slave. And to encourage
+consistency, we provide a recommended format for this memory:
+
+offset	 width	  description
+0x0      0x400    region for queue0
+0x400    0x400    region for queue1
+0x800    0x400    region for queue2
+...      ...      ...
+
+For each virtqueue, we have a 1024 bytes region. The region's format is like:
+
+offset   width    description
+0x0      0x1      descriptor 0 is in use or not
+0x1      0x1      descriptor 1 is in use or not
+0x2      0x1      descriptor 2 is in use or not
+...      ...      ...
+
+For each descriptor, we use one byte to specify whether it's in use or not.
+
 Protocol features
 -----------------
 
@@ -397,6 +436,7 @@ Protocol features
 #define VHOST_USER_PROTOCOL_F_CONFIG         9
 #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD  10
 #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER  11
+#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
 
 Master message types
 --------------------
@@ -761,6 +801,26 @@ Master message types
       was previously sent.
       The value returned is an error indication; 0 is success.
 
+ * VHOST_USER_GET_INFLIGHT_FD
+      Id: 31
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to get
+      a shared memory from slave. The shared memory will be used to track
+      inflight I/O by slave. Master should clear it when vm reset.
+
+ * VHOST_USER_SET_INFLIGHT_FD
+      Id: 32
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to send
+      the shared inflight buffer back to slave so that slave could get
+      inflight I/O after a crash or restart.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e09bed0e4a..4d118c6e14 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -52,6 +52,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_CONFIG = 9,
     VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
     VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -89,6 +90,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_POSTCOPY_ADVISE  = 28,
     VHOST_USER_POSTCOPY_LISTEN  = 29,
     VHOST_USER_POSTCOPY_END     = 30,
+    VHOST_USER_GET_INFLIGHT_FD = 31,
+    VHOST_USER_SET_INFLIGHT_FD = 32,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -147,6 +150,14 @@ typedef struct VhostUserVringArea {
     uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserInflight {
+    uint64_t mmap_size;
+    uint64_t mmap_offset;
+    uint32_t align;
+    uint16_t num_queues;
+    uint16_t version;
+} VhostUserInflight;
+
 typedef struct {
     VhostUserRequest request;
 
@@ -169,6 +180,7 @@ typedef union {
         VhostUserConfig config;
         VhostUserCryptoSession session;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1739,6 +1751,100 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
     return result;
 }
 
+static int vhost_user_get_inflight_fd(struct vhost_dev *dev,
+                                      struct vhost_inflight *inflight)
+{
+    void *addr;
+    int fd;
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->user->chr;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_GET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.num_queues = dev->nvqs,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        return -1;
+    }
+
+    if (vhost_user_read(dev, &msg) < 0) {
+        return -1;
+    }
+
+    if (msg.hdr.request != VHOST_USER_GET_INFLIGHT_FD) {
+        error_report("Received unexpected msg type. "
+                     "Expected %d received %d",
+                     VHOST_USER_GET_INFLIGHT_FD, msg.hdr.request);
+        return -1;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.inflight)) {
+        error_report("Received bad msg size.");
+        return -1;
+    }
+
+    if (!msg.payload.inflight.mmap_size) {
+        return 0;
+    }
+
+    fd = qemu_chr_fe_get_msgfd(chr);
+    if (fd < 0) {
+        error_report("Failed to get mem fd");
+        return -1;
+    }
+
+    addr = mmap(0, msg.payload.inflight.mmap_size, PROT_READ | PROT_WRITE,
+                MAP_SHARED, fd, msg.payload.inflight.mmap_offset);
+
+    if (addr == MAP_FAILED) {
+        error_report("Failed to mmap mem fd");
+        close(fd);
+        return -1;
+    }
+
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = msg.payload.inflight.mmap_size;
+    inflight->offset = msg.payload.inflight.mmap_offset;
+    inflight->align = msg.payload.inflight.align;
+    inflight->version = msg.payload.inflight.version;
+
+    return 0;
+}
+
+static int vhost_user_set_inflight_fd(struct vhost_dev *dev,
+                                      struct vhost_inflight *inflight)
+{
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_SET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.mmap_size = inflight->size,
+        .payload.inflight.mmap_offset = inflight->offset,
+        .payload.inflight.align = inflight->align,
+        .payload.inflight.num_queues = dev->nvqs,
+        .payload.inflight.version = inflight->version,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, &inflight->fd, 1) < 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
 VhostUserState *vhost_user_init(void)
 {
     VhostUserState *user = g_new0(struct VhostUserState, 1);
@@ -1790,4 +1896,6 @@ const VhostOps user_ops = {
         .vhost_crypto_create_session = vhost_user_crypto_create_session,
         .vhost_crypto_close_session = vhost_user_crypto_close_session,
         .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
+        .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
+        .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 569c4053ea..730f436692 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1481,6 +1481,114 @@ void vhost_dev_set_config_notifier(struct vhost_dev *hdev,
     hdev->config_ops = ops;
 }
 
+void vhost_dev_reset_inflight(struct vhost_inflight *inflight)
+{
+    if (inflight->addr) {
+        memset(inflight->addr, 0, inflight->size);
+    }
+}
+
+void vhost_dev_free_inflight(struct vhost_inflight *inflight)
+{
+    if (inflight->addr) {
+        qemu_memfd_free(inflight->addr, inflight->size, inflight->fd);
+        inflight->addr = NULL;
+        inflight->fd = -1;
+    }
+}
+
+static int vhost_dev_resize_inflight(struct vhost_inflight *inflight,
+                                     uint64_t new_size)
+{
+    Error *err = NULL;
+    int fd = -1;
+    void *addr = qemu_memfd_alloc("vhost-inflight", new_size,
+                                  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+                                  &fd, &err);
+
+    if (err) {
+        error_report_err(err);
+        return -1;
+    }
+
+    vhost_dev_free_inflight(inflight);
+    inflight->offset = 0;
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = new_size;
+
+    return 0;
+}
+
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    if (inflight->addr) {
+        qemu_put_be64(f, inflight->size);
+        qemu_put_be64(f, inflight->offset);
+        qemu_put_be32(f, inflight->align);
+        qemu_put_be16(f, inflight->version);
+        qemu_put_buffer(f, inflight->addr, inflight->size);
+    } else {
+        qemu_put_be64(f, 0);
+    }
+}
+
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    uint64_t size;
+
+    size = qemu_get_be64(f);
+    if (!size) {
+        return 0;
+    }
+
+    if (inflight->size != size) {
+        if (vhost_dev_resize_inflight(inflight, size)) {
+            return -1;
+        }
+    }
+    inflight->size = size;
+    inflight->offset = qemu_get_be64(f);
+    inflight->align = qemu_get_be32(f);
+    inflight->version = qemu_get_be16(f);
+
+    qemu_get_buffer(f, inflight->addr, size);
+
+    return 0;
+}
+
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_set_inflight_fd && inflight->addr) {
+        r = dev->vhost_ops->vhost_set_inflight_fd(dev, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_set_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
+int vhost_dev_get_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_get_inflight_fd) {
+        r = dev->vhost_ops->vhost_get_inflight_fd(dev, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_get_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
 /* Host notifiers must be enabled at this point. */
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 81283ec50f..97676bd237 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -25,6 +25,7 @@ typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
 struct vhost_memory;
@@ -104,6 +105,12 @@ typedef int (*vhost_crypto_close_session_op)(struct vhost_dev *dev,
 typedef bool (*vhost_backend_mem_section_filter_op)(struct vhost_dev *dev,
                                                 MemoryRegionSection *section);
 
+typedef int (*vhost_get_inflight_fd_op)(struct vhost_dev *dev,
+                                        struct vhost_inflight *inflight);
+
+typedef int (*vhost_set_inflight_fd_op)(struct vhost_dev *dev,
+                                        struct vhost_inflight *inflight);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -142,6 +149,8 @@ typedef struct VhostOps {
     vhost_crypto_create_session_op vhost_crypto_create_session;
     vhost_crypto_close_session_op vhost_crypto_close_session;
     vhost_backend_mem_section_filter_op vhost_backend_mem_section_filter;
+    vhost_get_inflight_fd_op vhost_get_inflight_fd;
+    vhost_set_inflight_fd_op vhost_set_inflight_fd;
 } VhostOps;
 
 extern const VhostOps user_ops;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a7f449fa87..0a71596d8b 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -7,6 +7,16 @@
 #include "exec/memory.h"
 
 /* Generic structures common for any vhost based device. */
+
+struct vhost_inflight {
+    int fd;
+    void *addr;
+    uint64_t size;
+    uint64_t offset;
+    uint32_t align;
+    uint16_t version;
+};
+
 struct vhost_virtqueue {
     int kick;
     int call;
@@ -120,4 +130,13 @@ int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
  */
 void vhost_dev_set_config_notifier(struct vhost_dev *dev,
                                    const VhostDevConfigOps *ops);
+
+void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
+void vhost_dev_free_inflight(struct vhost_inflight *inflight);
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight);
+int vhost_dev_get_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight);
 #endif
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 3/7] libvhost-user: Introduce vu_queue_map_desc()
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

Introduce vu_queue_map_desc(), which factors the descriptor-mapping
logic out of vu_queue_pop() so it can be used independently.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 88 ++++++++++++++++-----------
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index a6b46cdc03..23bd52264c 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -1853,49 +1853,20 @@ virtqueue_alloc_element(size_t sz,
     return elem;
 }
 
-void *
-vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+static void *
+vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
 {
-    unsigned int i, head, max, desc_len;
+    struct vring_desc *desc = vq->vring.desc;
     uint64_t desc_addr, read_len;
+    unsigned int desc_len;
+    unsigned int max = vq->vring.num;
+    unsigned int i = idx;
     VuVirtqElement *elem;
-    unsigned out_num, in_num;
+    unsigned int out_num = 0, in_num = 0;
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
     struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
-    struct vring_desc *desc;
     int rc;
 
-    if (unlikely(dev->broken) ||
-        unlikely(!vq->vring.avail)) {
-        return NULL;
-    }
-
-    if (vu_queue_empty(dev, vq)) {
-        return NULL;
-    }
-    /* Needed after virtio_queue_empty(), see comment in
-     * virtqueue_num_heads(). */
-    smp_rmb();
-
-    /* When we start there are none of either input nor output. */
-    out_num = in_num = 0;
-
-    max = vq->vring.num;
-    if (vq->inuse >= vq->vring.num) {
-        vu_panic(dev, "Virtqueue size exceeded");
-        return NULL;
-    }
-
-    if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
-        return NULL;
-    }
-
-    if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_set_avail_event(vq, vq->last_avail_idx);
-    }
-
-    i = head;
-    desc = vq->vring.desc;
     if (desc[i].flags & VRING_DESC_F_INDIRECT) {
         if (desc[i].len % sizeof(struct vring_desc)) {
             vu_panic(dev, "Invalid size for indirect buffer table");
@@ -1947,12 +1918,13 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
     } while (rc == VIRTQUEUE_READ_DESC_MORE);
 
     if (rc == VIRTQUEUE_READ_DESC_ERROR) {
+        vu_panic(dev, "read descriptor error");
         return NULL;
     }
 
     /* Now copy what we have collected and mapped */
     elem = virtqueue_alloc_element(sz, out_num, in_num);
-    elem->index = head;
+    elem->index = idx;
     for (i = 0; i < out_num; i++) {
         elem->out_sg[i] = iov[i];
     }
@@ -1960,6 +1932,48 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
         elem->in_sg[i] = iov[out_num + i];
     }
 
+    return elem;
+}
+
+void *
+vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+{
+    unsigned int head;
+    VuVirtqElement *elem;
+
+    if (unlikely(dev->broken) ||
+        unlikely(!vq->vring.avail)) {
+        return NULL;
+    }
+
+    if (vu_queue_empty(dev, vq)) {
+        return NULL;
+    }
+    /*
+     * Needed after virtio_queue_empty(), see comment in
+     * virtqueue_num_heads().
+     */
+    smp_rmb();
+
+    if (vq->inuse >= vq->vring.num) {
+        vu_panic(dev, "Virtqueue size exceeded");
+        return NULL;
+    }
+
+    if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
+        return NULL;
+    }
+
+    if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+        vring_set_avail_event(vq, vq->last_avail_idx);
+    }
+
+    elem = vu_queue_map_desc(dev, vq, head, sz);
+
+    if (!elem) {
+        return NULL;
+    }
+
     vq->inuse++;
 
     return elem;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (2 preceding siblings ...)
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-11  3:56   ` Jason Wang
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch adds support for the VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD messages so that the backend can provide
shared memory to QEMU and get it back. We then maintain a "bitmap" of
all descriptors in the shared memory for each queue to track inflight
I/O.
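
The tracking itself is one byte per descriptor in the per-queue region
(illustrative sketch of the vu_queue_inflight_get()/vu_queue_inflight_put()
helpers added below):

    vq->inflight->desc[head] = 1;         /* vu_queue_pop(): I/O now inflight */
    /* ... backend processes the request ... */
    vq->inflight->desc[elem->index] = 0;  /* vu_queue_push(): I/O completed */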

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 Makefile                              |   2 +-
 contrib/libvhost-user/libvhost-user.c | 258 ++++++++++++++++++++++++--
 contrib/libvhost-user/libvhost-user.h |  29 +++
 3 files changed, 268 insertions(+), 21 deletions(-)

diff --git a/Makefile b/Makefile
index dd53965f77..b5c9092605 100644
--- a/Makefile
+++ b/Makefile
@@ -473,7 +473,7 @@ Makefile: $(version-obj-y)
 # Build libraries
 
 libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
-libvhost-user.a: $(libvhost-user-obj-y)
+libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
 
 ######################################################################
 
diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 23bd52264c..e73ce04619 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -41,6 +41,8 @@
 #endif
 
 #include "qemu/atomic.h"
+#include "qemu/osdep.h"
+#include "qemu/memfd.h"
 
 #include "libvhost-user.h"
 
@@ -53,6 +55,18 @@
             _min1 < _min2 ? _min1 : _min2; })
 #endif
 
+/* Round number down to multiple */
+#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
+
+/* Round number up to multiple */
+#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
+
+/* Align each region to cache line size in inflight buffer */
+#define INFLIGHT_ALIGNMENT 64
+
+/* The version of inflight buffer */
+#define INFLIGHT_VERSION 1
+
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
 /* The version of the protocol we support */
@@ -66,6 +80,20 @@
         }                                       \
     } while (0)
 
+static inline
+bool has_feature(uint64_t features, unsigned int fbit)
+{
+    assert(fbit < 64);
+    return !!(features & (1ULL << fbit));
+}
+
+static inline
+bool vu_has_feature(VuDev *dev,
+                    unsigned int fbit)
+{
+    return has_feature(dev->features, fbit);
+}
+
 static const char *
 vu_request_to_string(unsigned int req)
 {
@@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_POSTCOPY_LISTEN),
         REQ(VHOST_USER_POSTCOPY_END),
+        REQ(VHOST_USER_GET_INFLIGHT_FD),
+        REQ(VHOST_USER_SET_INFLIGHT_FD),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -890,6 +920,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
     return true;
 }
 
+static int
+vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
+{
+    int i = 0;
+
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    vq->used_idx = vq->vring.used->idx;
+    vq->inflight_num = 0;
+    for (i = 0; i < vq->vring.num; i++) {
+        if (vq->inflight->desc[i] == 0) {
+            continue;
+        }
+
+        vq->inflight_desc[vq->inflight_num++] = i;
+        vq->inuse++;
+    }
+    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
+
+    /* in case of I/O hang after reconnecting */
+    if (eventfd_write(vq->kick_fd, 1) ||
+        eventfd_write(vq->call_fd, 1)) {
+        return -1;
+    }
+
+    return 0;
+}
+
 static bool
 vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -925,6 +990,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
                dev->vq[index].kick_fd, index);
     }
 
+    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
+        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
+    }
+
     return false;
 }
 
@@ -1215,6 +1284,117 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
     return true;
 }
 
+static bool
+vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int fd;
+    void *addr;
+    uint64_t mmap_size;
+
+    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
+        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
+        vmsg->payload.inflight.mmap_size = 0;
+        return true;
+    }
+
+    DPRINT("set_inflight_fd num_queues: %"PRId16"\n",
+           vmsg->payload.inflight.num_queues);
+
+    mmap_size = vmsg->payload.inflight.num_queues *
+                ALIGN_UP(sizeof(VuVirtqInflight), INFLIGHT_ALIGNMENT);
+
+    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
+                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+                            &fd, NULL);
+
+    if (!addr) {
+        vu_panic(dev, "Failed to alloc vhost inflight area");
+        vmsg->payload.inflight.mmap_size = 0;
+        return true;
+    }
+
+    dev->inflight_info.addr = addr;
+    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
+    vmsg->payload.inflight.mmap_offset = 0;
+    vmsg->payload.inflight.align = INFLIGHT_ALIGNMENT;
+    vmsg->payload.inflight.version = INFLIGHT_VERSION;
+    vmsg->fd_num = 1;
+    dev->inflight_info.fd = vmsg->fds[0] = fd;
+
+    DPRINT("send inflight mmap_size: %"PRId64"\n",
+           vmsg->payload.inflight.mmap_size);
+    DPRINT("send inflight mmap offset: %"PRId64"\n",
+           vmsg->payload.inflight.mmap_offset);
+    DPRINT("send inflight align: %"PRId32"\n",
+           vmsg->payload.inflight.align);
+    DPRINT("send inflight version: %"PRId16"\n",
+           vmsg->payload.inflight.version);
+
+    return true;
+}
+
+static bool
+vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int fd, i;
+    uint64_t mmap_size, mmap_offset;
+    uint32_t align;
+    uint16_t num_queues, version;
+    void *rc;
+
+    if (vmsg->fd_num != 1 ||
+        vmsg->size != sizeof(vmsg->payload.inflight)) {
+        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
+                 vmsg->size, vmsg->fd_num);
+        return false;
+    }
+
+    fd = vmsg->fds[0];
+    mmap_size = vmsg->payload.inflight.mmap_size;
+    mmap_offset = vmsg->payload.inflight.mmap_offset;
+    align = vmsg->payload.inflight.align;
+    num_queues = vmsg->payload.inflight.num_queues;
+    version = vmsg->payload.inflight.version;
+
+    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
+    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
+    DPRINT("set_inflight_fd align: %"PRId32"\n", align);
+    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
+    DPRINT("set_inflight_fd version: %"PRId16"\n", version);
+
+    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+              fd, mmap_offset);
+
+    if (rc == MAP_FAILED) {
+        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
+        return false;
+    }
+
+    if (version != INFLIGHT_VERSION) {
+        vu_panic(dev, "Invalid set_inflight_fd version: %d", version);
+        return false;
+    }
+
+    if (dev->inflight_info.fd) {
+        close(dev->inflight_info.fd);
+    }
+
+    if (dev->inflight_info.addr) {
+        munmap(dev->inflight_info.addr, dev->inflight_info.size);
+    }
+
+    dev->inflight_info.fd = fd;
+    dev->inflight_info.addr = rc;
+    dev->inflight_info.size = mmap_size;
+
+    for (i = 0; i < num_queues; i++) {
+        dev->vq[i].inflight = (VuVirtqInflight *)rc;
+        rc = (void *)((char *)rc + ALIGN_UP(sizeof(VuVirtqInflight), align));
+    }
+
+    return false;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1292,6 +1472,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_postcopy_listen(dev, vmsg);
     case VHOST_USER_POSTCOPY_END:
         return vu_set_postcopy_end(dev, vmsg);
+    case VHOST_USER_GET_INFLIGHT_FD:
+        return vu_get_inflight_fd(dev, vmsg);
+    case VHOST_USER_SET_INFLIGHT_FD:
+        return vu_set_inflight_fd(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
@@ -1359,8 +1543,18 @@ vu_deinit(VuDev *dev)
             close(vq->err_fd);
             vq->err_fd = -1;
         }
+        vq->inflight = NULL;
     }
 
+    if (dev->inflight_info.addr) {
+        munmap(dev->inflight_info.addr, dev->inflight_info.size);
+        dev->inflight_info.addr = NULL;
+    }
+
+    if (dev->inflight_info.fd > 0) {
+        close(dev->inflight_info.fd);
+        dev->inflight_info.fd = -1;
+    }
 
     vu_close_log(dev);
     if (dev->slave_fd != -1) {
@@ -1687,20 +1881,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
     return vring_avail_idx(vq) == vq->last_avail_idx;
 }
 
-static inline
-bool has_feature(uint64_t features, unsigned int fbit)
-{
-    assert(fbit < 64);
-    return !!(features & (1ULL << fbit));
-}
-
-static inline
-bool vu_has_feature(VuDev *dev,
-                    unsigned int fbit)
-{
-    return has_feature(dev->features, fbit);
-}
-
 static bool
 vring_notify(VuDev *dev, VuVirtq *vq)
 {
@@ -1829,12 +2009,6 @@ virtqueue_map_desc(VuDev *dev,
     *p_num_sg = num_sg;
 }
 
-/* Round number down to multiple */
-#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
-
-/* Round number up to multiple */
-#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
-
 static void *
 virtqueue_alloc_element(size_t sz,
                                      unsigned out_num, unsigned in_num)
@@ -1935,9 +2109,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
     return elem;
 }
 
+static int
+vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    vq->inflight->desc[desc_idx] = 1;
+
+    return 0;
+}
+
+static int
+vu_queue_inflight_put(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    vq->inflight->desc[desc_idx] = 0;
+
+    return 0;
+}
+
 void *
 vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 {
+    int i;
     unsigned int head;
     VuVirtqElement *elem;
 
@@ -1946,6 +2155,12 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
         return NULL;
     }
 
+    if (unlikely(vq->inflight_num > 0)) {
+        i = (--vq->inflight_num);
+        elem = vu_queue_map_desc(dev, vq, vq->inflight_desc[i], sz);
+        return elem;
+    }
+
     if (vu_queue_empty(dev, vq)) {
         return NULL;
     }
@@ -1976,6 +2191,8 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 
     vq->inuse++;
 
+    vu_queue_inflight_get(dev, vq, head);
+
     return elem;
 }
 
@@ -2121,4 +2338,5 @@ vu_queue_push(VuDev *dev, VuVirtq *vq,
 {
     vu_queue_fill(dev, vq, elem, len, 0);
     vu_queue_flush(dev, vq, 1);
+    vu_queue_inflight_put(dev, vq, elem->index);
 }
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 4aa55b4d2d..5afb80ea5c 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -53,6 +53,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_CONFIG = 9,
     VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
     VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 
     VHOST_USER_PROTOCOL_F_MAX
 };
@@ -91,6 +92,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_POSTCOPY_ADVISE  = 28,
     VHOST_USER_POSTCOPY_LISTEN  = 29,
     VHOST_USER_POSTCOPY_END     = 30,
+    VHOST_USER_GET_INFLIGHT_FD = 31,
+    VHOST_USER_SET_INFLIGHT_FD = 32,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -138,6 +141,14 @@ typedef struct VhostUserVringArea {
     uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserInflight {
+    uint64_t mmap_size;
+    uint64_t mmap_offset;
+    uint32_t align;
+    uint16_t num_queues;
+    uint16_t version;
+} VhostUserInflight;
+
 #if defined(_WIN32)
 # define VU_PACKED __attribute__((gcc_struct, packed))
 #else
@@ -163,6 +174,7 @@ typedef struct VhostUserMsg {
         VhostUserLog log;
         VhostUserConfig config;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
     } payload;
 
     int fds[VHOST_MEMORY_MAX_NREGIONS];
@@ -234,9 +246,19 @@ typedef struct VuRing {
     uint32_t flags;
 } VuRing;
 
+typedef struct VuVirtqInflight {
+    char desc[VIRTQUEUE_MAX_SIZE];
+} VuVirtqInflight;
+
 typedef struct VuVirtq {
     VuRing vring;
 
+    VuVirtqInflight *inflight;
+
+    uint16_t inflight_desc[VIRTQUEUE_MAX_SIZE];
+
+    uint16_t inflight_num;
+
     /* Next head to pop */
     uint16_t last_avail_idx;
 
@@ -279,11 +301,18 @@ typedef void (*vu_set_watch_cb) (VuDev *dev, int fd, int condition,
                                  vu_watch_cb cb, void *data);
 typedef void (*vu_remove_watch_cb) (VuDev *dev, int fd);
 
+typedef struct VuDevInflightInfo {
+    int fd;
+    void *addr;
+    uint64_t size;
+} VuDevInflightInfo;
+
 struct VuDev {
     int sock;
     uint32_t nregions;
     VuDevRegion regions[VHOST_MEMORY_MAX_NREGIONS];
     VuVirtq vq[VHOST_MAX_NR_VIRTQUEUE];
+    VuDevInflightInfo inflight_info;
     int log_call_fd;
     int slave_fd;
     uint64_t log_size;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 5/7] vhost-user-blk: Add support to get/set inflight buffer
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (3 preceding siblings ...)
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 6/7] vhost-user-blk: Add support to reconnect backend elohimes
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch adds support for the vhost-user-blk device to get/set the
inflight buffer from/to the backend.
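
Roughly, the new helpers are wired up as follows (sketch of the diff
below, not a complete listing):

    /* realize: fetch the inflight buffer once from the backend */
    ret = vhost_dev_get_inflight(&s->dev, s->inflight);

    /* start: hand the buffer back before vhost_dev_start() */
    ret = vhost_dev_set_inflight(&s->dev, s->inflight);

    /* device reset: clear the tracking area */
    vhost_dev_reset_inflight(s->inflight);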

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 hw/block/vhost-user-blk.c          | 26 ++++++++++++++++++++++++++
 include/hw/virtio/vhost-user-blk.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 1451940845..e1c48b938c 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -126,6 +126,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
     }
 
     s->dev.acked_features = vdev->guest_features;
+
+    ret = vhost_dev_set_inflight(&s->dev, s->inflight);
+    if (ret < 0) {
+        error_report("Error set inflight: %d", -ret);
+        goto err_guest_notifiers;
+    }
+
     ret = vhost_dev_start(&s->dev, vdev);
     if (ret < 0) {
         error_report("Error starting vhost: %d", -ret);
@@ -245,6 +252,13 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     }
 }
 
+static void vhost_user_blk_reset(VirtIODevice *vdev)
+{
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    vhost_dev_reset_inflight(s->inflight);
+}
+
 static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -284,6 +298,8 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
                          vhost_user_blk_handle_output);
     }
 
+    s->inflight = g_new0(struct vhost_inflight, 1);
+
     s->dev.nvqs = s->num_queues;
     s->dev.vqs = g_new(struct vhost_virtqueue, s->dev.nvqs);
     s->dev.vq_index = 0;
@@ -309,12 +325,19 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
         s->blkcfg.num_queues = s->num_queues;
     }
 
+    ret = vhost_dev_get_inflight(&s->dev, s->inflight);
+    if (ret < 0) {
+        error_setg(errp, "vhost-user-blk: get inflight failed");
+        goto vhost_err;
+    }
+
     return;
 
 vhost_err:
     vhost_dev_cleanup(&s->dev);
 virtio_err:
     g_free(s->dev.vqs);
+    g_free(s->inflight);
     virtio_cleanup(vdev);
 
     vhost_user_cleanup(user);
@@ -329,7 +352,9 @@ static void vhost_user_blk_device_unrealize(DeviceState *dev, Error **errp)
 
     vhost_user_blk_set_status(vdev, 0);
     vhost_dev_cleanup(&s->dev);
+    vhost_dev_free_inflight(s->inflight);
     g_free(s->dev.vqs);
+    g_free(s->inflight);
     virtio_cleanup(vdev);
 
     if (s->vhost_user) {
@@ -379,6 +404,7 @@ static void vhost_user_blk_class_init(ObjectClass *klass, void *data)
     vdc->set_config = vhost_user_blk_set_config;
     vdc->get_features = vhost_user_blk_get_features;
     vdc->set_status = vhost_user_blk_set_status;
+    vdc->reset = vhost_user_blk_reset;
 }
 
 static const TypeInfo vhost_user_blk_info = {
diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
index d52944aeeb..445516604a 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -36,6 +36,7 @@ typedef struct VHostUserBlk {
     uint32_t queue_size;
     uint32_t config_wce;
     struct vhost_dev dev;
+    struct vhost_inflight *inflight;
     VhostUserState *vhost_user;
 } VHostUserBlk;
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 6/7] vhost-user-blk: Add support to reconnect backend
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (4 preceding siblings ...)
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

Since we now support the VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD messages, the backend is able to restart
safely because it can track inflight I/O in shared memory. This patch
allows QEMU to reconnect to the backend after the connection is
closed.
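
Reconnection is driven by chardev events (sketch of
vhost_user_blk_event() in the diff below):

    case CHR_EVENT_OPENED:
        vhost_user_blk_connect(dev);     /* vhost_dev_init() + restore vhost state */
        break;
    case CHR_EVENT_CLOSED:
        vhost_user_blk_disconnect(dev);  /* stop vhost and clean up until reopened */
        break;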

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Ni Xun <nixun@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 hw/block/vhost-user-blk.c          | 205 +++++++++++++++++++++++------
 include/hw/virtio/vhost-user-blk.h |   4 +
 2 files changed, 168 insertions(+), 41 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index e1c48b938c..a551486151 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -101,7 +101,7 @@ const VhostDevConfigOps blk_ops = {
     .vhost_dev_config_notifier = vhost_user_blk_handle_config_change,
 };
 
-static void vhost_user_blk_start(VirtIODevice *vdev)
+static int vhost_user_blk_start(VirtIODevice *vdev)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
@@ -110,13 +110,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
 
     if (!k->set_guest_notifiers) {
         error_report("binding does not support guest notifiers");
-        return;
+        return -ENOSYS;
     }
 
     ret = vhost_dev_enable_notifiers(&s->dev, vdev);
     if (ret < 0) {
         error_report("Error enabling host notifiers: %d", -ret);
-        return;
+        return ret;
     }
 
     ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, true);
@@ -147,12 +147,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
         vhost_virtqueue_mask(&s->dev, vdev, i, false);
     }
 
-    return;
+    return ret;
 
 err_guest_notifiers:
     k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
 err_host_notifiers:
     vhost_dev_disable_notifiers(&s->dev, vdev);
+    return ret;
 }
 
 static void vhost_user_blk_stop(VirtIODevice *vdev)
@@ -171,7 +172,6 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
     ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
     if (ret < 0) {
         error_report("vhost guest notifier cleanup failed: %d", ret);
-        return;
     }
 
     vhost_dev_disable_notifiers(&s->dev, vdev);
@@ -181,21 +181,43 @@ static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     bool should_start = status & VIRTIO_CONFIG_S_DRIVER_OK;
+    int ret;
 
     if (!vdev->vm_running) {
         should_start = false;
     }
 
-    if (s->dev.started == should_start) {
+    if (s->should_start == should_start) {
+        return;
+    }
+
+    if (!s->connected || s->dev.started == should_start) {
+        s->should_start = should_start;
         return;
     }
 
     if (should_start) {
-        vhost_user_blk_start(vdev);
+        s->should_start = true;
+        /*
+         * make sure vhost_user_blk_handle_output() ignores fake
+         * guest kick by vhost_dev_enable_notifiers()
+         */
+        barrier();
+        ret = vhost_user_blk_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-blk: vhost start failed: %s",
+                         strerror(-ret));
+            qemu_chr_fe_disconnect(&s->chardev);
+        }
     } else {
         vhost_user_blk_stop(vdev);
+        /*
+         * make sure vhost_user_blk_handle_output() ignore fake
+         * guest kick by vhost_dev_disable_notifiers()
+         */
+        barrier();
+        s->should_start = false;
     }
-
 }
 
 static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
@@ -225,13 +247,22 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
 static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
-    int i;
+    int i, ret;
 
     if (!(virtio_host_has_feature(vdev, VIRTIO_F_VERSION_1) &&
         !virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1))) {
         return;
     }
 
+    if (s->should_start) {
+        return;
+    }
+    s->should_start = true;
+
+    if (!s->connected) {
+        return;
+    }
+
     if (s->dev.started) {
         return;
     }
@@ -239,7 +270,13 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
      * vhost here instead of waiting for .set_status().
      */
-    vhost_user_blk_start(vdev);
+    ret = vhost_user_blk_start(vdev);
+    if (ret < 0) {
+        error_report("vhost-user-blk: vhost start failed: %s",
+                     strerror(-ret));
+        qemu_chr_fe_disconnect(&s->chardev);
+        return;
+    }
 
     /* Kick right away to begin processing requests already in vring */
     for (i = 0; i < s->dev.nvqs; i++) {
@@ -259,12 +296,105 @@ static void vhost_user_blk_reset(VirtIODevice *vdev)
     vhost_dev_reset_inflight(s->inflight);
 }
 
+static int vhost_user_blk_connect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+    int ret = 0;
+
+    if (s->connected) {
+        return 0;
+    }
+    s->connected = true;
+
+    s->dev.nvqs = s->num_queues;
+    s->dev.vqs = s->vqs;
+    s->dev.vq_index = 0;
+    s->dev.backend_features = 0;
+
+    vhost_dev_set_config_notifier(&s->dev, &blk_ops);
+
+    ret = vhost_dev_init(&s->dev, s->vhost_user, VHOST_BACKEND_TYPE_USER, 0);
+    if (ret < 0) {
+        error_report("vhost-user-blk: vhost initialization failed: %s",
+                     strerror(-ret));
+        return ret;
+    }
+
+    /* restore vhost state */
+    if (s->should_start) {
+        ret = vhost_user_blk_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-blk: vhost start failed: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void vhost_user_blk_disconnect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    if (!s->connected) {
+        return;
+    }
+    s->connected = false;
+
+    if (s->dev.started) {
+        vhost_user_blk_stop(vdev);
+    }
+
+    vhost_dev_cleanup(&s->dev);
+}
+
+static gboolean vhost_user_blk_watch(GIOChannel *chan, GIOCondition cond,
+                                     void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    qemu_chr_fe_disconnect(&s->chardev);
+
+    return true;
+}
+
+static void vhost_user_blk_event(void *opaque, int event)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    switch (event) {
+    case CHR_EVENT_OPENED:
+        if (vhost_user_blk_connect(dev) < 0) {
+            qemu_chr_fe_disconnect(&s->chardev);
+            return;
+        }
+        s->watch = qemu_chr_fe_add_watch(&s->chardev, G_IO_HUP,
+                                         vhost_user_blk_watch, dev);
+        break;
+    case CHR_EVENT_CLOSED:
+        vhost_user_blk_disconnect(dev);
+        if (s->watch) {
+            g_source_remove(s->watch);
+            s->watch = 0;
+        }
+        break;
+    }
+}
+
 static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     VhostUserState *user;
     int i, ret;
+    Error *err = NULL;
 
     if (!s->chardev.chr) {
         error_setg(errp, "vhost-user-blk: chardev is mandatory");
@@ -299,26 +429,28 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     s->inflight = g_new0(struct vhost_inflight, 1);
-
-    s->dev.nvqs = s->num_queues;
-    s->dev.vqs = g_new(struct vhost_virtqueue, s->dev.nvqs);
-    s->dev.vq_index = 0;
-    s->dev.backend_features = 0;
-
-    vhost_dev_set_config_notifier(&s->dev, &blk_ops);
-
-    ret = vhost_dev_init(&s->dev, s->vhost_user, VHOST_BACKEND_TYPE_USER, 0);
-    if (ret < 0) {
-        error_setg(errp, "vhost-user-blk: vhost initialization failed: %s",
-                   strerror(-ret));
-        goto virtio_err;
-    }
+    s->vqs = g_new(struct vhost_virtqueue, s->num_queues);
+    s->watch = 0;
+    s->should_start = false;
+    s->connected = false;
+
+    qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, vhost_user_blk_event,
+                             NULL, (void *)dev, NULL, true);
+
+reconnect:
+    do {
+        if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) {
+            error_report_err(err);
+            err = NULL;
+            sleep(1);
+        }
+    } while (!s->connected);
 
     ret = vhost_dev_get_config(&s->dev, (uint8_t *)&s->blkcfg,
-                              sizeof(struct virtio_blk_config));
+                                   sizeof(struct virtio_blk_config));
     if (ret < 0) {
-        error_setg(errp, "vhost-user-blk: get block config failed");
-        goto vhost_err;
+        error_report("vhost-user-blk: get block config failed");
+        goto reconnect;
     }
 
     if (s->blkcfg.num_queues != s->num_queues) {
@@ -327,22 +459,11 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
 
     ret = vhost_dev_get_inflight(&s->dev, s->inflight);
     if (ret < 0) {
-        error_setg(errp, "vhost-user-blk: get inflight failed");
-        goto vhost_err;
+        error_report("vhost-user-blk: get inflight failed");
+        goto reconnect;
     }
 
     return;
-
-vhost_err:
-    vhost_dev_cleanup(&s->dev);
-virtio_err:
-    g_free(s->dev.vqs);
-    g_free(s->inflight);
-    virtio_cleanup(vdev);
-
-    vhost_user_cleanup(user);
-    g_free(user);
-    s->vhost_user = NULL;
 }
 
 static void vhost_user_blk_device_unrealize(DeviceState *dev, Error **errp)
@@ -351,9 +472,11 @@ static void vhost_user_blk_device_unrealize(DeviceState *dev, Error **errp)
     VHostUserBlk *s = VHOST_USER_BLK(dev);
 
     vhost_user_blk_set_status(vdev, 0);
+    qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, NULL,
+                             NULL, NULL, NULL, false);
     vhost_dev_cleanup(&s->dev);
     vhost_dev_free_inflight(s->inflight);
-    g_free(s->dev.vqs);
+    g_free(s->vqs);
     g_free(s->inflight);
     virtio_cleanup(vdev);
 
diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
index 445516604a..4849aa5eb5 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -38,6 +38,10 @@ typedef struct VHostUserBlk {
     struct vhost_dev dev;
     struct vhost_inflight *inflight;
     VhostUserState *vhost_user;
+    struct vhost_virtqueue *vqs;
+    guint watch;
+    bool should_start;
+    bool connected;
 } VHostUserBlk;
 
 #endif
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [Qemu-devel] [PATCH v4 for-4.0 7/7] contrib/vhost-user-blk: enable inflight I/O tracking
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (5 preceding siblings ...)
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 6/7] vhost-user-blk: Add support to reconnect backend elohimes
@ 2019-01-09 11:27 ` elohimes
  2019-01-10 10:25 ` [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting Stefan Hajnoczi
  2019-01-10 10:39 ` Marc-André Lureau
  8 siblings, 0 replies; 54+ messages in thread
From: elohimes @ 2019-01-09 11:27 UTC (permalink / raw)
  To: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch enables inflight I/O tracking for the
vhost-user-blk backend so that it can be restarted safely.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 contrib/vhost-user-blk/vhost-user-blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/vhost-user-blk/vhost-user-blk.c b/contrib/vhost-user-blk/vhost-user-blk.c
index 858221ad95..8cc033946a 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -327,7 +327,8 @@ vub_get_features(VuDev *dev)
 static uint64_t
 vub_get_protocol_features(VuDev *dev)
 {
-    return 1ull << VHOST_USER_PROTOCOL_F_CONFIG;
+    return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
+           1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
 }
 
 static int
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (6 preceding siblings ...)
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
@ 2019-01-10 10:25 ` Stefan Hajnoczi
  2019-01-10 10:59   ` Yongji Xie
  2019-01-10 10:39 ` Marc-André Lureau
  8 siblings, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2019-01-10 10:25 UTC (permalink / raw)
  To: elohimes
  Cc: mst, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh, nixun, qemu-devel, lilin24, zhangyu31,
	chaiwen, Xie Yongji

[-- Attachment #1: Type: text/plain, Size: 653 bytes --]

On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
> 
> This patchset is aimed at supporting qemu to reconnect
> vhost-user-blk backend after vhost-user-blk backend crash or
> restart.
> 
> The patch 1 uses exisiting wait/nowait options to make QEMU not
> do a connect on client sockets during initialization of the chardev.
> 
> The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> memory to backend.

Can you describe the problem that the inflight I/O shared memory region
solves?

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (7 preceding siblings ...)
  2019-01-10 10:25 ` [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting Stefan Hajnoczi
@ 2019-01-10 10:39 ` Marc-André Lureau
  2019-01-10 11:09   ` Yongji Xie
  8 siblings, 1 reply; 54+ messages in thread
From: Marc-André Lureau @ 2019-01-10 10:39 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Daniel P. Berrange, Jason Wang,
	Maxime Coquelin, yury-kotov, wrfsh, nixun, QEMU, lilin24,
	zhangyu31, chaiwen, Xie Yongji

Hi

On Wed, Jan 9, 2019 at 3:28 PM <elohimes@gmail.com> wrote:
>
> From: Xie Yongji <xieyongji@baidu.com>
>
> This patchset is aimed at supporting qemu to reconnect
> vhost-user-blk backend after vhost-user-blk backend crash or
> restart.
>
> The patch 1 uses exisiting wait/nowait options to make QEMU not
> do a connect on client sockets during initialization of the chardev.
>
> The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> memory to backend.
>
> The patch 3,4 are the corresponding libvhost-user patches of
> patch 2. Make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD.
>
> The patch 5 allows vhost-user-blk to use the two new messages
> to get/set inflight buffer from/to backend.
>
> The patch 6 supports vhost-user-blk to reconnect backend when
> connection closed.
>
> The patch 7 introduces VHOST_USER_PROTOCOL_F_SLAVE_SHMFD
> to vhost-user-blk backend which is used to tell qemu that
> we support reconnecting now.
>
> To use it, we could start qemu with:
>
> qemu-system-x86_64 \
>         -chardev socket,id=char0,path=/path/vhost.socket,nowait,reconnect=1, \
>         -device vhost-user-blk-pci,chardev=char0 \
>
> and start vhost-user-blk backend with:
>
> vhost-user-blk -b /path/file -s /path/vhost.socket
>
> Then we can restart vhost-user-blk at any time during VM running.
>
> V3 to V4:
> - Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
> - Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
>   and VHOST_USER_SET_INFLIGHT_FD
> - Allocate inflight buffer in backend rather than in qemu
> - Document a recommended format for inflight buffer
>
> V2 to V3:
> - Using exisiting wait/nowait options to control connection on
>   client sockets instead of introducing "disconnected" option.
> - Support the case that vhost-user backend restart during initialzation
>   of vhost-user-blk device.
>
> V1 to V2:
> - Introduce "disconnected" option for chardev instead of reuse "wait"
>   option
> - Support the case that QEMU starts before vhost-user backend
> - Drop message VHOST_USER_SET_VRING_INFLIGHT
> - Introduce two new messages VHOST_USER_GET_SHM_SIZE
>   and VHOST_USER_SET_SHM_FD
>
> Xie Yongji (7):
>   char-socket: Enable "nowait" option on client sockets

This patch breaks make check.

It would be nice to add a test for the new nowait behaviour.

>   vhost-user: Support transferring inflight buffer between qemu and
>     backend
>   libvhost-user: Introduce vu_queue_map_desc()
>   libvhost-user: Support tracking inflight I/O in shared memory
>   vhost-user-blk: Add support to get/set inflight buffer
>   vhost-user-blk: Add support to reconnect backend
>   contrib/vhost-user-blk: enable inflight I/O tracking
>
>  Makefile                                |   2 +-
>  chardev/char-socket.c                   |  56 ++--
>  contrib/libvhost-user/libvhost-user.c   | 346 ++++++++++++++++++++----
>  contrib/libvhost-user/libvhost-user.h   |  29 ++
>  contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
>  docs/interop/vhost-user.txt             |  60 ++++
>  hw/block/vhost-user-blk.c               | 223 ++++++++++++---
>  hw/virtio/vhost-user.c                  | 108 ++++++++
>  hw/virtio/vhost.c                       | 108 ++++++++
>  include/hw/virtio/vhost-backend.h       |   9 +
>  include/hw/virtio/vhost-user-blk.h      |   5 +
>  include/hw/virtio/vhost.h               |  19 ++
>  qapi/char.json                          |   3 +-
>  qemu-options.hx                         |   9 +-
>  14 files changed, 851 insertions(+), 129 deletions(-)
>
> --
> 2.17.1
>
>


-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-10 10:25 ` [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting Stefan Hajnoczi
@ 2019-01-10 10:59   ` Yongji Xie
  2019-01-11 15:53     ` Stefan Hajnoczi
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-10 10:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > This patchset is aimed at supporting qemu to reconnect
> > vhost-user-blk backend after vhost-user-blk backend crash or
> > restart.
> >
> > The patch 1 uses exisiting wait/nowait options to make QEMU not
> > do a connect on client sockets during initialization of the chardev.
> >
> > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > memory to backend.
>
> Can you describe the problem that the inflight I/O shared memory region
> solves?
>

The backend needs to get the inflight I/O and replay it after a restart.
Currently we can only get used_idx from the used ring, which is not enough
because descriptors are not always processed in the same order in which
they were made available. A simple example:
https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
shared memory region to track inflight I/O.
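
As a rough sketch of the idea (the names and layout below are illustrative
only, not the exact format documented in the series): the backend marks a
descriptor as inflight in a shared memory region before processing it and
clears the mark once the used ring has been updated, so a restarted backend
can walk the region and resubmit whatever is still marked:

/* Toy model, not the real layout: one byte per descriptor in a region
 * that survives a backend restart (e.g. shared via
 * VHOST_USER_SET_INFLIGHT_FD). */
#include <stdint.h>

struct inflight_region {
    uint16_t desc_num;      /* number of descriptors in the virtqueue */
    uint8_t inflight[];     /* 1 = made available but not yet used */
};

static void mark_inflight(struct inflight_region *r, uint16_t head)
{
    r->inflight[head] = 1;  /* before the request is submitted */
}

static void mark_done(struct inflight_region *r, uint16_t head)
{
    r->inflight[head] = 0;  /* after the used ring entry is written */
}

/* After reconnect: resubmit every request that never completed. */
static void replay_inflight(struct inflight_region *r,
                            void (*resubmit)(uint16_t head))
{
    uint16_t i;

    for (i = 0; i < r->desc_num; i++) {
        if (r->inflight[i]) {
            resubmit(i);
        }
    }
}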

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-10 10:39 ` Marc-André Lureau
@ 2019-01-10 11:09   ` Yongji Xie
  0 siblings, 0 replies; 54+ messages in thread
From: Yongji Xie @ 2019-01-10 11:09 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Michael S. Tsirkin, Daniel P. Berrange, Jason Wang,
	Maxime Coquelin, Yury Kotov,
	Евгений
	Яковлев,
	nixun, QEMU, lilin24, zhangyu31, chaiwen, Xie Yongji

On Thu, 10 Jan 2019 at 18:39, Marc-André Lureau
<marcandre.lureau@gmail.com> wrote:
>
> Hi
>
> On Wed, Jan 9, 2019 at 3:28 PM <elohimes@gmail.com> wrote:
> >
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > This patchset is aimed at supporting qemu to reconnect
> > vhost-user-blk backend after vhost-user-blk backend crash or
> > restart.
> >
> > The patch 1 uses exisiting wait/nowait options to make QEMU not
> > do a connect on client sockets during initialization of the chardev.
> >
> > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > memory to backend.
> >
> > The patch 3,4 are the corresponding libvhost-user patches of
> > patch 2. Make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD.
> >
> > The patch 5 allows vhost-user-blk to use the two new messages
> > to get/set inflight buffer from/to backend.
> >
> > The patch 6 supports vhost-user-blk to reconnect backend when
> > connection closed.
> >
> > The patch 7 introduces VHOST_USER_PROTOCOL_F_SLAVE_SHMFD
> > to vhost-user-blk backend which is used to tell qemu that
> > we support reconnecting now.
> >
> > To use it, we could start qemu with:
> >
> > qemu-system-x86_64 \
> >         -chardev socket,id=char0,path=/path/vhost.socket,nowait,reconnect=1, \
> >         -device vhost-user-blk-pci,chardev=char0 \
> >
> > and start vhost-user-blk backend with:
> >
> > vhost-user-blk -b /path/file -s /path/vhost.socket
> >
> > Then we can restart vhost-user-blk at any time during VM running.
> >
> > V3 to V4:
> > - Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
> > - Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
> >   and VHOST_USER_SET_INFLIGHT_FD
> > - Allocate inflight buffer in backend rather than in qemu
> > - Document a recommended format for inflight buffer
> >
> > V2 to V3:
> > - Using exisiting wait/nowait options to control connection on
> >   client sockets instead of introducing "disconnected" option.
> > - Support the case that vhost-user backend restart during initialzation
> >   of vhost-user-blk device.
> >
> > V1 to V2:
> > - Introduce "disconnected" option for chardev instead of reuse "wait"
> >   option
> > - Support the case that QEMU starts before vhost-user backend
> > - Drop message VHOST_USER_SET_VRING_INFLIGHT
> > - Introduce two new messages VHOST_USER_GET_SHM_SIZE
> >   and VHOST_USER_SET_SHM_FD
> >
> > Xie Yongji (7):
> >   char-socket: Enable "nowait" option on client sockets
>
> This patch breaks make check.
>
> It would be nice to add a test for the new nowait behaviour.
>

Oh, sorry. I'll fix this issue and add a test for the "nowait" option. Thank you!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets elohimes
@ 2019-01-10 12:49   ` Daniel P. Berrangé
  2019-01-10 13:19     ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-10 12:49 UTC (permalink / raw)
  To: elohimes
  Cc: mst, marcandre.lureau, jasowang, maxime.coquelin, yury-kotov,
	wrfsh, qemu-devel, zhangyu31, chaiwen, nixun, lilin24,
	Xie Yongji

On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
> 
> Enable "nowait" option to make QEMU not do a connect
> on client sockets during initialization of the chardev.
> Then we can use qemu_chr_fe_wait_connected() to connect
> when necessary. Now it would be used for unix domain
> socket of vhost-user-blk device to support reconnect.
> 
> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> ---
>  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
>  qapi/char.json        |  3 +--
>  qemu-options.hx       |  9 ++++---
>  3 files changed, 35 insertions(+), 33 deletions(-)
> 
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index eaa8e8b68f..f803f4f7d3 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
>          s->reconnect_time = reconnect;
>      }
>  
> -    if (s->reconnect_time) {
> -        tcp_chr_connect_async(chr);
> -    } else {
> -        if (s->is_listen) {
> -            char *name;
> -            s->listener = qio_net_listener_new();
> +    if (s->is_listen) {
> +        char *name;
> +        s->listener = qio_net_listener_new();
>  
> -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> -            qio_net_listener_set_name(s->listener, name);
> -            g_free(name);
> +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> +        qio_net_listener_set_name(s->listener, name);
> +        g_free(name);
>  
> -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> -                object_unref(OBJECT(s->listener));
> -                s->listener = NULL;
> -                goto error;
> -            }
> +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> +            object_unref(OBJECT(s->listener));
> +            s->listener = NULL;
> +            goto error;
> +        }
>  
> -            qapi_free_SocketAddress(s->addr);
> -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> -            update_disconnected_filename(s);
> +        qapi_free_SocketAddress(s->addr);
> +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> +        update_disconnected_filename(s);
>  
> -            if (is_waitconnect &&
> -                qemu_chr_wait_connected(chr, errp) < 0) {
> -                return;
> -            }
> -            if (!s->ioc) {
> -                qio_net_listener_set_client_func_full(s->listener,
> -                                                      tcp_chr_accept,
> -                                                      chr, NULL,
> -                                                      chr->gcontext);
> -            }
> +        if (is_waitconnect &&
> +            qemu_chr_wait_connected(chr, errp) < 0) {
> +            return;
> +        }
> +        if (!s->ioc) {
> +            qio_net_listener_set_client_func_full(s->listener,
> +                                                  tcp_chr_accept,
> +                                                  chr, NULL,
> +                                                  chr->gcontext);
> +        }
> +    } else if (is_waitconnect) {
> +        if (s->reconnect_time) {
> +            tcp_chr_connect_async(chr);
>          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
>              goto error;
>          }

This skips everything when 'is_waitconnect' is false.

This combines with a bug in tests/libqtest.c which adds the 'nowait'
flag to the -chardevs it creates. This mistake was previously ignored
because the chardevs were socket clients, but now we honour it.

We should remove 'nowait' from the qtest chardevs, but separately
from that this code should also still attempt a non-blocking
connect when is_waitconnect is false.

ie

    } else if (is_waitconnect) {
        if (s->reconnect_time || !is_waitconnect) {
            tcp_chr_connect_async(chr);
        } else if (qemu_chr_wait_connected(chr, errp) < 0) {
            goto error;
        }
    }


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 12:49   ` Daniel P. Berrangé
@ 2019-01-10 13:19     ` Yongji Xie
  2019-01-10 13:24       ` Daniel P. Berrangé
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-10 13:19 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > Enable "nowait" option to make QEMU not do a connect
> > on client sockets during initialization of the chardev.
> > Then we can use qemu_chr_fe_wait_connected() to connect
> > when necessary. Now it would be used for unix domain
> > socket of vhost-user-blk device to support reconnect.
> >
> > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > ---
> >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> >  qapi/char.json        |  3 +--
> >  qemu-options.hx       |  9 ++++---
> >  3 files changed, 35 insertions(+), 33 deletions(-)
> >
> > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > index eaa8e8b68f..f803f4f7d3 100644
> > --- a/chardev/char-socket.c
> > +++ b/chardev/char-socket.c
> > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> >          s->reconnect_time = reconnect;
> >      }
> >
> > -    if (s->reconnect_time) {
> > -        tcp_chr_connect_async(chr);
> > -    } else {
> > -        if (s->is_listen) {
> > -            char *name;
> > -            s->listener = qio_net_listener_new();
> > +    if (s->is_listen) {
> > +        char *name;
> > +        s->listener = qio_net_listener_new();
> >
> > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > -            qio_net_listener_set_name(s->listener, name);
> > -            g_free(name);
> > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > +        qio_net_listener_set_name(s->listener, name);
> > +        g_free(name);
> >
> > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > -                object_unref(OBJECT(s->listener));
> > -                s->listener = NULL;
> > -                goto error;
> > -            }
> > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > +            object_unref(OBJECT(s->listener));
> > +            s->listener = NULL;
> > +            goto error;
> > +        }
> >
> > -            qapi_free_SocketAddress(s->addr);
> > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > -            update_disconnected_filename(s);
> > +        qapi_free_SocketAddress(s->addr);
> > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > +        update_disconnected_filename(s);
> >
> > -            if (is_waitconnect &&
> > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > -                return;
> > -            }
> > -            if (!s->ioc) {
> > -                qio_net_listener_set_client_func_full(s->listener,
> > -                                                      tcp_chr_accept,
> > -                                                      chr, NULL,
> > -                                                      chr->gcontext);
> > -            }
> > +        if (is_waitconnect &&
> > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > +            return;
> > +        }
> > +        if (!s->ioc) {
> > +            qio_net_listener_set_client_func_full(s->listener,
> > +                                                  tcp_chr_accept,
> > +                                                  chr, NULL,
> > +                                                  chr->gcontext);
> > +        }
> > +    } else if (is_waitconnect) {
> > +        if (s->reconnect_time) {
> > +            tcp_chr_connect_async(chr);
> >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> >              goto error;
> >          }
>
> This skips everything when 'is_waitconnect' is false.
>
> This combines with a bug in tests/libqtest.c which adds the 'nowait'
> flag to the -chardevs it creates. This mistake was previously ignored
> because the chardevs were socket clients, but now we honour it.
>
> We should remove 'nowait' from the qtest chardevs, but separately
> from that this code should also still attempt a non-blocking
> connect when is_waitconnect is false.
>

Do you mean we still need to connect to the server in the background with
the "nowait" option? But my purpose is not to connect to the server until we
manually call qemu_chr_fe_wait_connected() elsewhere.

> ie
>
>     } else if (is_waitconnect) {
>         if (s->reconnect_time || !is_waitconnect) {

"!is_waitconnect" should always be false here?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 13:19     ` Yongji Xie
@ 2019-01-10 13:24       ` Daniel P. Berrangé
  2019-01-10 14:08         ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-10 13:24 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > From: Xie Yongji <xieyongji@baidu.com>
> > >
> > > Enable "nowait" option to make QEMU not do a connect
> > > on client sockets during initialization of the chardev.
> > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > when necessary. Now it would be used for unix domain
> > > socket of vhost-user-blk device to support reconnect.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > ---
> > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > >  qapi/char.json        |  3 +--
> > >  qemu-options.hx       |  9 ++++---
> > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > index eaa8e8b68f..f803f4f7d3 100644
> > > --- a/chardev/char-socket.c
> > > +++ b/chardev/char-socket.c
> > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > >          s->reconnect_time = reconnect;
> > >      }
> > >
> > > -    if (s->reconnect_time) {
> > > -        tcp_chr_connect_async(chr);
> > > -    } else {
> > > -        if (s->is_listen) {
> > > -            char *name;
> > > -            s->listener = qio_net_listener_new();
> > > +    if (s->is_listen) {
> > > +        char *name;
> > > +        s->listener = qio_net_listener_new();
> > >
> > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > -            qio_net_listener_set_name(s->listener, name);
> > > -            g_free(name);
> > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > +        qio_net_listener_set_name(s->listener, name);
> > > +        g_free(name);
> > >
> > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > -                object_unref(OBJECT(s->listener));
> > > -                s->listener = NULL;
> > > -                goto error;
> > > -            }
> > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > +            object_unref(OBJECT(s->listener));
> > > +            s->listener = NULL;
> > > +            goto error;
> > > +        }
> > >
> > > -            qapi_free_SocketAddress(s->addr);
> > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > -            update_disconnected_filename(s);
> > > +        qapi_free_SocketAddress(s->addr);
> > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > +        update_disconnected_filename(s);
> > >
> > > -            if (is_waitconnect &&
> > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > -                return;
> > > -            }
> > > -            if (!s->ioc) {
> > > -                qio_net_listener_set_client_func_full(s->listener,
> > > -                                                      tcp_chr_accept,
> > > -                                                      chr, NULL,
> > > -                                                      chr->gcontext);
> > > -            }
> > > +        if (is_waitconnect &&
> > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > +            return;
> > > +        }
> > > +        if (!s->ioc) {
> > > +            qio_net_listener_set_client_func_full(s->listener,
> > > +                                                  tcp_chr_accept,
> > > +                                                  chr, NULL,
> > > +                                                  chr->gcontext);
> > > +        }
> > > +    } else if (is_waitconnect) {
> > > +        if (s->reconnect_time) {
> > > +            tcp_chr_connect_async(chr);
> > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > >              goto error;
> > >          }
> >
> > This skips everything when 'is_waitconnect' is false.
> >
> > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > flag to the -chardevs it creates. This mistake was previously ignored
> > because the chardevs were socket clients, but now we honour it.
> >
> > We should remove 'nowait' from the qtest chardevs, but separately
> > from that this code should also still attempt a non-blocking
> > connect when is_waitconnect is false.
> >
> 
> Do you mean we still need to connect server in background with
> "nowait" option? But my purpose is not to connect server until we
> manually call qemu_chr_fe_wait_connected() in other place.

I don't see a need to delay the connect. We can start a
background connect right away. The later code you have
merely needs to wait for that background connect to
finish, which qemu_chr_fe_wait_connected still accomplishes.
This keeps the chardev code clearer, only having 2 distinct
code paths to worry about - blocking or non-blocking connect.
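
Loosely sketched (a toy model in plain C, with a thread standing in for
the chardev's asynchronous connect; this is not QEMU code), the proposed
ordering looks like this:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the chardev: the connect runs in the background as soon
 * as the chardev is opened; the device later only waits for it. */
struct toy_chardev {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool connected;
};

static void *background_connect(void *opaque)
{
    struct toy_chardev *c = opaque;

    /* a real chardev would call connect() on the socket here */
    pthread_mutex_lock(&c->lock);
    c->connected = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* like chardev open kicking off the background connect */
static void toy_chardev_open(struct toy_chardev *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->connected = false;
    pthread_create(&c->thread, NULL, background_connect, c);
}

/* like qemu_chr_fe_wait_connected(): just wait for the pending connect */
static void toy_wait_connected(struct toy_chardev *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->connected) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}

int main(void)
{
    struct toy_chardev c;

    toy_chardev_open(&c);   /* 1. chardev starts connect() */
    toy_wait_connected(&c); /* 2. device waits for it to complete */
    printf("connected; device can now read the config\n"); /* 3. */
    pthread_join(c.thread, NULL);
    return 0;
}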

> 
> > ie
> >
> >     } else if (is_waitconnect) {
> >         if (s->reconnect_time || !is_waitconnect) {
> 
> "!is_waitconnect" should always be false here?

Oops, I meant

     } else  {
         if (s->reconnect_time || !is_waitconnect) {

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 13:24       ` Daniel P. Berrangé
@ 2019-01-10 14:08         ` Yongji Xie
  2019-01-10 14:11           ` Daniel P. Berrangé
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-10 14:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > From: Xie Yongji <xieyongji@baidu.com>
> > > >
> > > > Enable "nowait" option to make QEMU not do a connect
> > > > on client sockets during initialization of the chardev.
> > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > when necessary. Now it would be used for unix domain
> > > > socket of vhost-user-blk device to support reconnect.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > ---
> > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > >  qapi/char.json        |  3 +--
> > > >  qemu-options.hx       |  9 ++++---
> > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > >
> > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > --- a/chardev/char-socket.c
> > > > +++ b/chardev/char-socket.c
> > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > >          s->reconnect_time = reconnect;
> > > >      }
> > > >
> > > > -    if (s->reconnect_time) {
> > > > -        tcp_chr_connect_async(chr);
> > > > -    } else {
> > > > -        if (s->is_listen) {
> > > > -            char *name;
> > > > -            s->listener = qio_net_listener_new();
> > > > +    if (s->is_listen) {
> > > > +        char *name;
> > > > +        s->listener = qio_net_listener_new();
> > > >
> > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > -            qio_net_listener_set_name(s->listener, name);
> > > > -            g_free(name);
> > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > +        qio_net_listener_set_name(s->listener, name);
> > > > +        g_free(name);
> > > >
> > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > -                object_unref(OBJECT(s->listener));
> > > > -                s->listener = NULL;
> > > > -                goto error;
> > > > -            }
> > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > +            object_unref(OBJECT(s->listener));
> > > > +            s->listener = NULL;
> > > > +            goto error;
> > > > +        }
> > > >
> > > > -            qapi_free_SocketAddress(s->addr);
> > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > -            update_disconnected_filename(s);
> > > > +        qapi_free_SocketAddress(s->addr);
> > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > +        update_disconnected_filename(s);
> > > >
> > > > -            if (is_waitconnect &&
> > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > -                return;
> > > > -            }
> > > > -            if (!s->ioc) {
> > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > -                                                      tcp_chr_accept,
> > > > -                                                      chr, NULL,
> > > > -                                                      chr->gcontext);
> > > > -            }
> > > > +        if (is_waitconnect &&
> > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > +            return;
> > > > +        }
> > > > +        if (!s->ioc) {
> > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > +                                                  tcp_chr_accept,
> > > > +                                                  chr, NULL,
> > > > +                                                  chr->gcontext);
> > > > +        }
> > > > +    } else if (is_waitconnect) {
> > > > +        if (s->reconnect_time) {
> > > > +            tcp_chr_connect_async(chr);
> > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > >              goto error;
> > > >          }
> > >
> > > This skips everything when 'is_waitconnect' is false.
> > >
> > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > flag to the -chardevs it creates. This mistake was previously ignored
> > > because the chardevs were socket clients, but now we honour it.
> > >
> > > We should remove 'nowait' from the qtest chardevs, but separately
> > > from that this code should also still attempt a non-blocking
> > > connect when is_waitconnect is false.
> > >
> >
> > Do you mean we still need to connect server in background with
> > "nowait" option? But my purpose is not to connect server until we
> > manually call qemu_chr_fe_wait_connected() in other place.
>
> I don't see a need to delay the connect. We can start a
> background connect right away. The later code you have
> merely needs to wait for that background connect  to
> finish, which qemu_chr_fe_wait_connected still accomplishes.
> This keeps the chardev code clearer only having 2 distinct
> code paths to worry about - blocking or non-blocking connect.
>

Now the problem is that we have a server that only accepts one
connection, and we want to read something from it during device
initialization.

If the background connect happens before we call qemu_chr_fe_wait_connected()
during device initialization, qemu_chr_fe_wait_connected() will
complete but we can't read anything, and we have no way to release
the background connection. So what I want to do in this patch is to
disable the background connect.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 14:08         ` Yongji Xie
@ 2019-01-10 14:11           ` Daniel P. Berrangé
  2019-01-10 14:29             ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-10 14:11 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > >
> > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > on client sockets during initialization of the chardev.
> > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > when necessary. Now it would be used for unix domain
> > > > > socket of vhost-user-blk device to support reconnect.
> > > > >
> > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > ---
> > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > >  qapi/char.json        |  3 +--
> > > > >  qemu-options.hx       |  9 ++++---
> > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > >
> > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > --- a/chardev/char-socket.c
> > > > > +++ b/chardev/char-socket.c
> > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > >          s->reconnect_time = reconnect;
> > > > >      }
> > > > >
> > > > > -    if (s->reconnect_time) {
> > > > > -        tcp_chr_connect_async(chr);
> > > > > -    } else {
> > > > > -        if (s->is_listen) {
> > > > > -            char *name;
> > > > > -            s->listener = qio_net_listener_new();
> > > > > +    if (s->is_listen) {
> > > > > +        char *name;
> > > > > +        s->listener = qio_net_listener_new();
> > > > >
> > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > -            g_free(name);
> > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > +        g_free(name);
> > > > >
> > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > -                object_unref(OBJECT(s->listener));
> > > > > -                s->listener = NULL;
> > > > > -                goto error;
> > > > > -            }
> > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > +            object_unref(OBJECT(s->listener));
> > > > > +            s->listener = NULL;
> > > > > +            goto error;
> > > > > +        }
> > > > >
> > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > -            update_disconnected_filename(s);
> > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > +        update_disconnected_filename(s);
> > > > >
> > > > > -            if (is_waitconnect &&
> > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > -                return;
> > > > > -            }
> > > > > -            if (!s->ioc) {
> > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > -                                                      tcp_chr_accept,
> > > > > -                                                      chr, NULL,
> > > > > -                                                      chr->gcontext);
> > > > > -            }
> > > > > +        if (is_waitconnect &&
> > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > +            return;
> > > > > +        }
> > > > > +        if (!s->ioc) {
> > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > +                                                  tcp_chr_accept,
> > > > > +                                                  chr, NULL,
> > > > > +                                                  chr->gcontext);
> > > > > +        }
> > > > > +    } else if (is_waitconnect) {
> > > > > +        if (s->reconnect_time) {
> > > > > +            tcp_chr_connect_async(chr);
> > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > >              goto error;
> > > > >          }
> > > >
> > > > This skips everything when 'is_waitconnect' is false.
> > > >
> > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > flag to the -chardevs it creates. This mistake was previously ignored
> > > > because the chardevs were socket clients, but now we honour it.
> > > >
> > > > We should remove 'nowait' from the qtest chardevs, but separately
> > > > from that this code should also still attempt a non-blocking
> > > > connect when is_waitconnect is false.
> > > >
> > >
> > > Do you mean we still need to connect server in background with
> > > "nowait" option? But my purpose is not to connect server until we
> > > manually call qemu_chr_fe_wait_connected() in other place.
> >
> > I don't see a need to delay the connect. We can start a
> > background connect right away. The later code you have
> > merely needs to wait for that background connect  to
> > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > This keeps the chardev code clearer only having 2 distinct
> > code paths to worry about - blocking or non-blocking connect.
> >
> 
> Now the problem is that we have a server that only accept one
> connection. And we want to read something from it during device
> initializtion.
> 
> If background connect it before we call qemu_chr_fe_wait_connected()
> during device initializtion, qemu_chr_fe_wait_connected() will
> accomplish but we can't read anything. And we have no way to release
> the background connection. So what I want to do in this patch is to
> disable background connect.

I'm not seeing the problem here. What I proposed results in:

  1. chardev starts connect()
  2. vhost backend waits for connect() to complete
  3. vhost backend reads from chardev

vs the flow

  1. vhost backend starts connect()
  2. vhost backend waits for connect() to complete
  3. vhost backend reads from chardev

in both cases there's only a single connection established to the
server.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 14:11           ` Daniel P. Berrangé
@ 2019-01-10 14:29             ` Yongji Xie
  2019-01-10 16:41               ` Daniel P. Berrangé
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-10 14:29 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, 10 Jan 2019 at 22:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> > On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > >
> > > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > > on client sockets during initialization of the chardev.
> > > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > > when necessary. Now it would be used for unix domain
> > > > > > socket of vhost-user-blk device to support reconnect.
> > > > > >
> > > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > > ---
> > > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > > >  qapi/char.json        |  3 +--
> > > > > >  qemu-options.hx       |  9 ++++---
> > > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > > >
> > > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > > --- a/chardev/char-socket.c
> > > > > > +++ b/chardev/char-socket.c
> > > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > > >          s->reconnect_time = reconnect;
> > > > > >      }
> > > > > >
> > > > > > -    if (s->reconnect_time) {
> > > > > > -        tcp_chr_connect_async(chr);
> > > > > > -    } else {
> > > > > > -        if (s->is_listen) {
> > > > > > -            char *name;
> > > > > > -            s->listener = qio_net_listener_new();
> > > > > > +    if (s->is_listen) {
> > > > > > +        char *name;
> > > > > > +        s->listener = qio_net_listener_new();
> > > > > >
> > > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > > -            g_free(name);
> > > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > > +        g_free(name);
> > > > > >
> > > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > -                object_unref(OBJECT(s->listener));
> > > > > > -                s->listener = NULL;
> > > > > > -                goto error;
> > > > > > -            }
> > > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > +            object_unref(OBJECT(s->listener));
> > > > > > +            s->listener = NULL;
> > > > > > +            goto error;
> > > > > > +        }
> > > > > >
> > > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > -            update_disconnected_filename(s);
> > > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > +        update_disconnected_filename(s);
> > > > > >
> > > > > > -            if (is_waitconnect &&
> > > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > -                return;
> > > > > > -            }
> > > > > > -            if (!s->ioc) {
> > > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > > -                                                      tcp_chr_accept,
> > > > > > -                                                      chr, NULL,
> > > > > > -                                                      chr->gcontext);
> > > > > > -            }
> > > > > > +        if (is_waitconnect &&
> > > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > +            return;
> > > > > > +        }
> > > > > > +        if (!s->ioc) {
> > > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > > +                                                  tcp_chr_accept,
> > > > > > +                                                  chr, NULL,
> > > > > > +                                                  chr->gcontext);
> > > > > > +        }
> > > > > > +    } else if (is_waitconnect) {
> > > > > > +        if (s->reconnect_time) {
> > > > > > +            tcp_chr_connect_async(chr);
> > > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > >              goto error;
> > > > > >          }
> > > > >
> > > > > This skips everything when 'is_waitconnect' is false.
> > > > >
> > > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > > flag to the -chardevs it creates. This mistake was previously ignored
> > > > > because the chardevs were socket clients, but now we honour it.
> > > > >
> > > > > We should remove 'nowait' from the qtest chardevs, but separately
> > > > > from that this code should also still attempt a non-blocking
> > > > > connect when is_waitconnect is false.
> > > > >
> > > >
> > > > Do you mean we still need to connect server in background with
> > > > "nowait" option? But my purpose is not to connect server until we
> > > > manually call qemu_chr_fe_wait_connected() in other place.
> > >
> > > I don't see a need to delay the connect. We can start a
> > > background connect right away. The later code you have
> > > merely needs to wait for that background connect  to
> > > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > > This keeps the chardev code clearer only having 2 distinct
> > > code paths to worry about - blocking or non-blocking connect.
> > >
> >
> > Now the problem is that we have a server that only accept one
> > connection. And we want to read something from it during device
> > initializtion.
> >
> > If background connect it before we call qemu_chr_fe_wait_connected()
> > during device initializtion, qemu_chr_fe_wait_connected() will
> > accomplish but we can't read anything. And we have no way to release
> > the background connection. So what I want to do in this patch is to
> > disable background connect.
>
> I'm not seeing the problem here. What I proposed results in
>
>   1. chardev starts connect()

This should be asynchronous with the "reconnect" option. Another thread
may connect before the vhost backend does?

>   2. vhost backend waits for connect() to complete

Sorry, I'm not sure I get your point here. Do you mean the vhost backend
calls qemu_chr_fe_wait_connected()? It seems that
qemu_chr_fe_wait_connected() will connect directly rather than wait
for the connect() to complete?

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 14:29             ` Yongji Xie
@ 2019-01-10 16:41               ` Daniel P. Berrangé
  2019-01-11  7:50                 ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-10 16:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, Jan 10, 2019 at 10:29:20PM +0800, Yongji Xie wrote:
> On Thu, 10 Jan 2019 at 22:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> > > On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > > >
> > > > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > > > on client sockets during initialization of the chardev.
> > > > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > > > when necessary. Now it would be used for unix domain
> > > > > > > socket of vhost-user-blk device to support reconnect.
> > > > > > >
> > > > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > > > ---
> > > > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > > > >  qapi/char.json        |  3 +--
> > > > > > >  qemu-options.hx       |  9 ++++---
> > > > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > > > >
> > > > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > > > --- a/chardev/char-socket.c
> > > > > > > +++ b/chardev/char-socket.c
> > > > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > > > >          s->reconnect_time = reconnect;
> > > > > > >      }
> > > > > > >
> > > > > > > -    if (s->reconnect_time) {
> > > > > > > -        tcp_chr_connect_async(chr);
> > > > > > > -    } else {
> > > > > > > -        if (s->is_listen) {
> > > > > > > -            char *name;
> > > > > > > -            s->listener = qio_net_listener_new();
> > > > > > > +    if (s->is_listen) {
> > > > > > > +        char *name;
> > > > > > > +        s->listener = qio_net_listener_new();
> > > > > > >
> > > > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > > > -            g_free(name);
> > > > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > > > +        g_free(name);
> > > > > > >
> > > > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > -                object_unref(OBJECT(s->listener));
> > > > > > > -                s->listener = NULL;
> > > > > > > -                goto error;
> > > > > > > -            }
> > > > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > +            object_unref(OBJECT(s->listener));
> > > > > > > +            s->listener = NULL;
> > > > > > > +            goto error;
> > > > > > > +        }
> > > > > > >
> > > > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > -            update_disconnected_filename(s);
> > > > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > +        update_disconnected_filename(s);
> > > > > > >
> > > > > > > -            if (is_waitconnect &&
> > > > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > -                return;
> > > > > > > -            }
> > > > > > > -            if (!s->ioc) {
> > > > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > > > -                                                      tcp_chr_accept,
> > > > > > > -                                                      chr, NULL,
> > > > > > > -                                                      chr->gcontext);
> > > > > > > -            }
> > > > > > > +        if (is_waitconnect &&
> > > > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > +            return;
> > > > > > > +        }
> > > > > > > +        if (!s->ioc) {
> > > > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > > > +                                                  tcp_chr_accept,
> > > > > > > +                                                  chr, NULL,
> > > > > > > +                                                  chr->gcontext);
> > > > > > > +        }
> > > > > > > +    } else if (is_waitconnect) {
> > > > > > > +        if (s->reconnect_time) {
> > > > > > > +            tcp_chr_connect_async(chr);
> > > > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > >              goto error;
> > > > > > >          }
> > > > > >
> > > > > > This skips everything when 'is_waitconnect' is false.
> > > > > >
> > > > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > > > flag to the -chardevs it cteates. This mistake was previously ignored
> > > > > > because the chardevs were socket clients, but now we honour it.
> > > > > >
> > > > > > We shoul remove 'nowait' from the qtest chardevs, but separately
> > > > > > from that this code should also still attempt a non-blocking
> > > > > > connect when is_waitconnect is false.
> > > > > >
> > > > >
> > > > > Do you mean we still need to connect server in background with
> > > > > "nowait" option? But my purpose is not to connect server until we
> > > > > manually call qemu_chr_fe_wait_connected() in other place.
> > > >
> > > > I don't see a need to delay the connect. We can start a
> > > > background connect right away. The later code you have
> > > > merely needs to wait for that background connect  to
> > > > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > > > This keeps the chardev code clearer only having 2 distinct
> > > > code paths to worry about - blocking or non-blocking connect.
> > > >
> > >
> > > Now the problem is that we have a server that only accept one
> > > connection. And we want to read something from it during device
> > > initializtion.
> > >
> > > If background connect it before we call qemu_chr_fe_wait_connected()
> > > during device initializtion, qemu_chr_fe_wait_connected() will
> > > accomplish but we can't read anything. And we have no way to release
> > > the background connection. So what I want to do in this patch is to
> > > disable background connect.
> >
> > I'm not seeing the problem here. What I proposed results in
> >
> >   1. chardev starts connect()
> 
> This should be asynchronous with "reconnect" option. Another thread
> may connect before vhost backend?
> 
> >   2. vhost backend waits for connect() to complete
> 
> Sorry, I'm not sure I get your point here. Do you mean vhost backend
> call qemu_chr_fe_wait_connected()? Seems like
> qemu_chr_fe_wait_connected() will connect directly rather than wait
> for connect() to complete?

Ahhhh, yes, you are right.

qemu_chr_fe_wait_connected will potentially cause a second connection to
be established.

Looking at qemu_chr_fe_wait_connected(), I believe it is seriously
broken even before this patch series.

The intended usage is that a device can call qemu_chr_fe_wait_connected
to wait for a new connection to be established, and then do I/O on the
chardev.  This does not in fact work if TLS, websock or telnet modes
are enabled for the socket, due to a mistake introduced when we previously
tried to fix this:

  commit 1dc8a6695c731abb7461c637b2512c3670d82be4
  Author: Marc-André Lureau <marcandre.lureau@redhat.com>
  Date:   Tue Aug 16 12:33:32 2016 +0400

    char: fix waiting for TLS and telnet connection

That commit fixed the problem where we continued to accept() new sockets
when TLS/telnet was enabled, because the 's->connected' flag isn't set
immediately.

Unfortunately what this means is that when qemu_chr_fe_wait_connected
returns, the chardev is *not* ready to read/write data. The TLS/telnet
handshake has not been run, and is still pending in the background.

So we'll end up with the device backend trying to do I/O on the chardev
at the same time as it is trying to do the TLS/telnet handshake.

We need to fix qemu_chr_fe_wait_connected so that it does explicit
synchronization with respect to any ongoing background connection
process. It must only return once all TLS/telnet/websock handshakes
have completed. If we fix that correctly, then I believe it will also
solve the problem you're trying to address.
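
Roughly, the shape of the fix could look like the sketch below
(illustrative only: the "state" field, the TCP_CHARDEV_STATE_* values
and the chr_drive_main_context()/tcp_chr_connect_sync() helpers are
made-up names for this example, not existing chardev code):

static int tcp_chr_wait_connected(Chardev *chr, Error **errp)
{
    SocketChardev *s = SOCKET_CHARDEV(chr);

    for (;;) {
        switch (s->state) {
        case TCP_CHARDEV_STATE_CONNECTED:
            /* connect *and* any TLS/telnet/websock handshake are
             * finished, so the caller may safely start doing I/O */
            return 0;
        case TCP_CHARDEV_STATE_CONNECTING:
            /* a background connect/handshake is already in flight:
             * wait for it instead of racing it with a second connect */
            chr_drive_main_context(chr);
            break;
        case TCP_CHARDEV_STATE_DISCONNECTED:
            /* nothing in progress yet: do a synchronous connect,
             * including the handshakes, before returning */
            if (tcp_chr_connect_sync(chr, errp) < 0) {
                return -1;
            }
            break;
        }
    }
}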

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
@ 2019-01-11  3:56   ` Jason Wang
  2019-01-11  6:10     ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Wang @ 2019-01-11  3:56 UTC (permalink / raw)
  To: elohimes, mst, marcandre.lureau, berrange, maxime.coquelin,
	yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji


On 2019/1/9 7:27 PM, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
>
> This patch adds support for VHOST_USER_GET_INFLIGHT_FD and
> VHOST_USER_SET_INFLIGHT_FD message to set/get shared memory
> to/from qemu. Then we maintain a "bitmap" of all descriptors in
> the shared memory for each queue to track inflight I/O.
>
> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> ---
>   Makefile                              |   2 +-
>   contrib/libvhost-user/libvhost-user.c | 258 ++++++++++++++++++++++++--
>   contrib/libvhost-user/libvhost-user.h |  29 +++
>   3 files changed, 268 insertions(+), 21 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index dd53965f77..b5c9092605 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -473,7 +473,7 @@ Makefile: $(version-obj-y)
>   # Build libraries
>   
>   libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
> -libvhost-user.a: $(libvhost-user-obj-y)
> +libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
>   
>   ######################################################################
>   
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 23bd52264c..e73ce04619 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -41,6 +41,8 @@
>   #endif
>   
>   #include "qemu/atomic.h"
> +#include "qemu/osdep.h"
> +#include "qemu/memfd.h"
>   
>   #include "libvhost-user.h"
>   
> @@ -53,6 +55,18 @@
>               _min1 < _min2 ? _min1 : _min2; })
>   #endif
>   
> +/* Round number down to multiple */
> +#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> +
> +/* Round number up to multiple */
> +#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> +
> +/* Align each region to cache line size in inflight buffer */
> +#define INFLIGHT_ALIGNMENT 64
> +
> +/* The version of inflight buffer */
> +#define INFLIGHT_VERSION 1
> +
>   #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
>   
>   /* The version of the protocol we support */
> @@ -66,6 +80,20 @@
>           }                                       \
>       } while (0)
>   
> +static inline
> +bool has_feature(uint64_t features, unsigned int fbit)
> +{
> +    assert(fbit < 64);
> +    return !!(features & (1ULL << fbit));
> +}
> +
> +static inline
> +bool vu_has_feature(VuDev *dev,
> +                    unsigned int fbit)
> +{
> +    return has_feature(dev->features, fbit);
> +}
> +
>   static const char *
>   vu_request_to_string(unsigned int req)
>   {
> @@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
>           REQ(VHOST_USER_POSTCOPY_ADVISE),
>           REQ(VHOST_USER_POSTCOPY_LISTEN),
>           REQ(VHOST_USER_POSTCOPY_END),
> +        REQ(VHOST_USER_GET_INFLIGHT_FD),
> +        REQ(VHOST_USER_SET_INFLIGHT_FD),
>           REQ(VHOST_USER_MAX),
>       };
>   #undef REQ
> @@ -890,6 +920,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
>       return true;
>   }
>   
> +static int
> +vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
> +{
> +    int i = 0;
> +
> +    if (!has_feature(dev->protocol_features,
> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +        return 0;
> +    }
> +
> +    if (unlikely(!vq->inflight)) {
> +        return -1;
> +    }
> +
> +    vq->used_idx = vq->vring.used->idx;
> +    vq->inflight_num = 0;
> +    for (i = 0; i < vq->vring.num; i++) {
> +        if (vq->inflight->desc[i] == 0) {
> +            continue;
> +        }
> +
> +        vq->inflight_desc[vq->inflight_num++] = i;
> +        vq->inuse++;
> +    }
> +    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
> +
> +    /* in case of I/O hang after reconnecting */
> +    if (eventfd_write(vq->kick_fd, 1) ||
> +        eventfd_write(vq->call_fd, 1)) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>   static bool
>   vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>   {
> @@ -925,6 +990,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>                  dev->vq[index].kick_fd, index);
>       }
>   
> +    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
> +        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
> +    }
> +
>       return false;
>   }
>   
> @@ -1215,6 +1284,117 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
>       return true;
>   }
>   
> +static bool
> +vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    int fd;
> +    void *addr;
> +    uint64_t mmap_size;
> +
> +    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
> +        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
> +        vmsg->payload.inflight.mmap_size = 0;
> +        return true;
> +    }
> +
> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n",
> +           vmsg->payload.inflight.num_queues);
> +
> +    mmap_size = vmsg->payload.inflight.num_queues *
> +                ALIGN_UP(sizeof(VuVirtqInflight), INFLIGHT_ALIGNMENT);
> +
> +    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
> +                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> +                            &fd, NULL);
> +
> +    if (!addr) {
> +        vu_panic(dev, "Failed to alloc vhost inflight area");
> +        vmsg->payload.inflight.mmap_size = 0;
> +        return true;
> +    }
> +
> +    dev->inflight_info.addr = addr;
> +    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
> +    vmsg->payload.inflight.mmap_offset = 0;
> +    vmsg->payload.inflight.align = INFLIGHT_ALIGNMENT;
> +    vmsg->payload.inflight.version = INFLIGHT_VERSION;
> +    vmsg->fd_num = 1;
> +    dev->inflight_info.fd = vmsg->fds[0] = fd;
> +
> +    DPRINT("send inflight mmap_size: %"PRId64"\n",
> +           vmsg->payload.inflight.mmap_size);
> +    DPRINT("send inflight mmap offset: %"PRId64"\n",
> +           vmsg->payload.inflight.mmap_offset);
> +    DPRINT("send inflight align: %"PRId32"\n",
> +           vmsg->payload.inflight.align);
> +    DPRINT("send inflight version: %"PRId16"\n",
> +           vmsg->payload.inflight.version);
> +
> +    return true;
> +}
> +
> +static bool
> +vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> +{
> +    int fd, i;
> +    uint64_t mmap_size, mmap_offset;
> +    uint32_t align;
> +    uint16_t num_queues, version;
> +    void *rc;
> +
> +    if (vmsg->fd_num != 1 ||
> +        vmsg->size != sizeof(vmsg->payload.inflight)) {
> +        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
> +                 vmsg->size, vmsg->fd_num);
> +        return false;
> +    }
> +
> +    fd = vmsg->fds[0];
> +    mmap_size = vmsg->payload.inflight.mmap_size;
> +    mmap_offset = vmsg->payload.inflight.mmap_offset;
> +    align = vmsg->payload.inflight.align;
> +    num_queues = vmsg->payload.inflight.num_queues;
> +    version = vmsg->payload.inflight.version;
> +
> +    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
> +    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
> +    DPRINT("set_inflight_fd align: %"PRId32"\n", align);
> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
> +    DPRINT("set_inflight_fd version: %"PRId16"\n", version);
> +
> +    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> +              fd, mmap_offset);
> +
> +    if (rc == MAP_FAILED) {
> +        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
> +        return false;
> +    }
> +
> +    if (version != INFLIGHT_VERSION) {
> +        vu_panic(dev, "Invalid set_inflight_fd version: %d", version);
> +        return false;
> +    }
> +
> +    if (dev->inflight_info.fd) {
> +        close(dev->inflight_info.fd);
> +    }
> +
> +    if (dev->inflight_info.addr) {
> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> +    }
> +
> +    dev->inflight_info.fd = fd;
> +    dev->inflight_info.addr = rc;
> +    dev->inflight_info.size = mmap_size;
> +
> +    for (i = 0; i < num_queues; i++) {
> +        dev->vq[i].inflight = (VuVirtqInflight *)rc;
> +        rc = (void *)((char *)rc + ALIGN_UP(sizeof(VuVirtqInflight), align));
> +    }
> +
> +    return false;
> +}
> +
>   static bool
>   vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>   {
> @@ -1292,6 +1472,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>           return vu_set_postcopy_listen(dev, vmsg);
>       case VHOST_USER_POSTCOPY_END:
>           return vu_set_postcopy_end(dev, vmsg);
> +    case VHOST_USER_GET_INFLIGHT_FD:
> +        return vu_get_inflight_fd(dev, vmsg);
> +    case VHOST_USER_SET_INFLIGHT_FD:
> +        return vu_set_inflight_fd(dev, vmsg);
>       default:
>           vmsg_close_fds(vmsg);
>           vu_panic(dev, "Unhandled request: %d", vmsg->request);
> @@ -1359,8 +1543,18 @@ vu_deinit(VuDev *dev)
>               close(vq->err_fd);
>               vq->err_fd = -1;
>           }
> +        vq->inflight = NULL;
>       }
>   
> +    if (dev->inflight_info.addr) {
> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> +        dev->inflight_info.addr = NULL;
> +    }
> +
> +    if (dev->inflight_info.fd > 0) {
> +        close(dev->inflight_info.fd);
> +        dev->inflight_info.fd = -1;
> +    }
>   
>       vu_close_log(dev);
>       if (dev->slave_fd != -1) {
> @@ -1687,20 +1881,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
>       return vring_avail_idx(vq) == vq->last_avail_idx;
>   }
>   
> -static inline
> -bool has_feature(uint64_t features, unsigned int fbit)
> -{
> -    assert(fbit < 64);
> -    return !!(features & (1ULL << fbit));
> -}
> -
> -static inline
> -bool vu_has_feature(VuDev *dev,
> -                    unsigned int fbit)
> -{
> -    return has_feature(dev->features, fbit);
> -}
> -
>   static bool
>   vring_notify(VuDev *dev, VuVirtq *vq)
>   {
> @@ -1829,12 +2009,6 @@ virtqueue_map_desc(VuDev *dev,
>       *p_num_sg = num_sg;
>   }
>   
> -/* Round number down to multiple */
> -#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> -
> -/* Round number up to multiple */
> -#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> -
>   static void *
>   virtqueue_alloc_element(size_t sz,
>                                        unsigned out_num, unsigned in_num)
> @@ -1935,9 +2109,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
>       return elem;
>   }
>   
> +static int
> +vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
> +{
> +    if (!has_feature(dev->protocol_features,
> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +        return 0;
> +    }
> +
> +    if (unlikely(!vq->inflight)) {
> +        return -1;
> +    }
> +


Just wondering what happens if the backend gets killed at this point?

You want to survive a backend crash, but you still depend on the backend
to get and put inflight descriptors, which seems somewhat contradictory.

Thanks


> +    vq->inflight->desc[desc_idx] = 1;
> +
> +    return 0;
> +}
> +
> +static int
> +vu_queue_inflight_put(VuDev *dev, VuVirtq *vq, int desc_idx)
> +{
> +    if (!has_feature(dev->protocol_features,
> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +        return 0;
> +    }
> +
> +    if (unlikely(!vq->inflight)) {
> +        return -1;
> +    }
> +
> +    vq->inflight->desc[desc_idx] = 0;
> +
> +    return 0;
> +}
> +
>   void *
>   vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
>   {
> +    int i;
>       unsigned int head;
>       VuVirtqElement *elem;
>   
> @@ -1946,6 +2155,12 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
>           return NULL;
>       }
>   
> +    if (unlikely(vq->inflight_num > 0)) {
> +        i = (--vq->inflight_num);
> +        elem = vu_queue_map_desc(dev, vq, vq->inflight_desc[i], sz);
> +        return elem;
> +    }
> +
>       if (vu_queue_empty(dev, vq)) {
>           return NULL;
>       }
> @@ -1976,6 +2191,8 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
>   
>       vq->inuse++;
>   
> +    vu_queue_inflight_get(dev, vq, head);
> +
>       return elem;
>   }
>   
> @@ -2121,4 +2338,5 @@ vu_queue_push(VuDev *dev, VuVirtq *vq,
>   {
>       vu_queue_fill(dev, vq, elem, len, 0);
>       vu_queue_flush(dev, vq, 1);
> +    vu_queue_inflight_put(dev, vq, elem->index);
>   }
> diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
> index 4aa55b4d2d..5afb80ea5c 100644
> --- a/contrib/libvhost-user/libvhost-user.h
> +++ b/contrib/libvhost-user/libvhost-user.h
> @@ -53,6 +53,7 @@ enum VhostUserProtocolFeature {
>       VHOST_USER_PROTOCOL_F_CONFIG = 9,
>       VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
>       VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
>   
>       VHOST_USER_PROTOCOL_F_MAX
>   };
> @@ -91,6 +92,8 @@ typedef enum VhostUserRequest {
>       VHOST_USER_POSTCOPY_ADVISE  = 28,
>       VHOST_USER_POSTCOPY_LISTEN  = 29,
>       VHOST_USER_POSTCOPY_END     = 30,
> +    VHOST_USER_GET_INFLIGHT_FD = 31,
> +    VHOST_USER_SET_INFLIGHT_FD = 32,
>       VHOST_USER_MAX
>   } VhostUserRequest;
>   
> @@ -138,6 +141,14 @@ typedef struct VhostUserVringArea {
>       uint64_t offset;
>   } VhostUserVringArea;
>   
> +typedef struct VhostUserInflight {
> +    uint64_t mmap_size;
> +    uint64_t mmap_offset;
> +    uint32_t align;
> +    uint16_t num_queues;
> +    uint16_t version;
> +} VhostUserInflight;
> +
>   #if defined(_WIN32)
>   # define VU_PACKED __attribute__((gcc_struct, packed))
>   #else
> @@ -163,6 +174,7 @@ typedef struct VhostUserMsg {
>           VhostUserLog log;
>           VhostUserConfig config;
>           VhostUserVringArea area;
> +        VhostUserInflight inflight;
>       } payload;
>   
>       int fds[VHOST_MEMORY_MAX_NREGIONS];
> @@ -234,9 +246,19 @@ typedef struct VuRing {
>       uint32_t flags;
>   } VuRing;
>   
> +typedef struct VuVirtqInflight {
> +    char desc[VIRTQUEUE_MAX_SIZE];
> +} VuVirtqInflight;
> +
>   typedef struct VuVirtq {
>       VuRing vring;
>   
> +    VuVirtqInflight *inflight;
> +
> +    uint16_t inflight_desc[VIRTQUEUE_MAX_SIZE];
> +
> +    uint16_t inflight_num;
> +
>       /* Next head to pop */
>       uint16_t last_avail_idx;
>   
> @@ -279,11 +301,18 @@ typedef void (*vu_set_watch_cb) (VuDev *dev, int fd, int condition,
>                                    vu_watch_cb cb, void *data);
>   typedef void (*vu_remove_watch_cb) (VuDev *dev, int fd);
>   
> +typedef struct VuDevInflightInfo {
> +    int fd;
> +    void *addr;
> +    uint64_t size;
> +} VuDevInflightInfo;
> +
>   struct VuDev {
>       int sock;
>       uint32_t nregions;
>       VuDevRegion regions[VHOST_MEMORY_MAX_NREGIONS];
>       VuVirtq vq[VHOST_MAX_NR_VIRTQUEUE];
> +    VuDevInflightInfo inflight_info;
>       int log_call_fd;
>       int slave_fd;
>       uint64_t log_size;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-11  3:56   ` Jason Wang
@ 2019-01-11  6:10     ` Yongji Xie
  2019-01-15  7:52       ` Jason Wang
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-11  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 11 Jan 2019 at 11:56, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2019/1/9 下午7:27, elohimes@gmail.com wrote:
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > This patch adds support for VHOST_USER_GET_INFLIGHT_FD and
> > VHOST_USER_SET_INFLIGHT_FD message to set/get shared memory
> > to/from qemu. Then we maintain a "bitmap" of all descriptors in
> > the shared memory for each queue to track inflight I/O.
> >
> > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > ---
> >   Makefile                              |   2 +-
> >   contrib/libvhost-user/libvhost-user.c | 258 ++++++++++++++++++++++++--
> >   contrib/libvhost-user/libvhost-user.h |  29 +++
> >   3 files changed, 268 insertions(+), 21 deletions(-)
> >
> > diff --git a/Makefile b/Makefile
> > index dd53965f77..b5c9092605 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -473,7 +473,7 @@ Makefile: $(version-obj-y)
> >   # Build libraries
> >
> >   libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
> > -libvhost-user.a: $(libvhost-user-obj-y)
> > +libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
> >
> >   ######################################################################
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 23bd52264c..e73ce04619 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -41,6 +41,8 @@
> >   #endif
> >
> >   #include "qemu/atomic.h"
> > +#include "qemu/osdep.h"
> > +#include "qemu/memfd.h"
> >
> >   #include "libvhost-user.h"
> >
> > @@ -53,6 +55,18 @@
> >               _min1 < _min2 ? _min1 : _min2; })
> >   #endif
> >
> > +/* Round number down to multiple */
> > +#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> > +
> > +/* Round number up to multiple */
> > +#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> > +
> > +/* Align each region to cache line size in inflight buffer */
> > +#define INFLIGHT_ALIGNMENT 64
> > +
> > +/* The version of inflight buffer */
> > +#define INFLIGHT_VERSION 1
> > +
> >   #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
> >
> >   /* The version of the protocol we support */
> > @@ -66,6 +80,20 @@
> >           }                                       \
> >       } while (0)
> >
> > +static inline
> > +bool has_feature(uint64_t features, unsigned int fbit)
> > +{
> > +    assert(fbit < 64);
> > +    return !!(features & (1ULL << fbit));
> > +}
> > +
> > +static inline
> > +bool vu_has_feature(VuDev *dev,
> > +                    unsigned int fbit)
> > +{
> > +    return has_feature(dev->features, fbit);
> > +}
> > +
> >   static const char *
> >   vu_request_to_string(unsigned int req)
> >   {
> > @@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
> >           REQ(VHOST_USER_POSTCOPY_ADVISE),
> >           REQ(VHOST_USER_POSTCOPY_LISTEN),
> >           REQ(VHOST_USER_POSTCOPY_END),
> > +        REQ(VHOST_USER_GET_INFLIGHT_FD),
> > +        REQ(VHOST_USER_SET_INFLIGHT_FD),
> >           REQ(VHOST_USER_MAX),
> >       };
> >   #undef REQ
> > @@ -890,6 +920,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
> >       return true;
> >   }
> >
> > +static int
> > +vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
> > +{
> > +    int i = 0;
> > +
> > +    if (!has_feature(dev->protocol_features,
> > +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> > +        return 0;
> > +    }
> > +
> > +    if (unlikely(!vq->inflight)) {
> > +        return -1;
> > +    }
> > +
> > +    vq->used_idx = vq->vring.used->idx;
> > +    vq->inflight_num = 0;
> > +    for (i = 0; i < vq->vring.num; i++) {
> > +        if (vq->inflight->desc[i] == 0) {
> > +            continue;
> > +        }
> > +
> > +        vq->inflight_desc[vq->inflight_num++] = i;
> > +        vq->inuse++;
> > +    }
> > +    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
> > +
> > +    /* in case of I/O hang after reconnecting */
> > +    if (eventfd_write(vq->kick_fd, 1) ||
> > +        eventfd_write(vq->call_fd, 1)) {
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >   static bool
> >   vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
> >   {
> > @@ -925,6 +990,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
> >                  dev->vq[index].kick_fd, index);
> >       }
> >
> > +    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
> > +        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
> > +    }
> > +
> >       return false;
> >   }
> >
> > @@ -1215,6 +1284,117 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> >       return true;
> >   }
> >
> > +static bool
> > +vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    int fd;
> > +    void *addr;
> > +    uint64_t mmap_size;
> > +
> > +    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
> > +        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
> > +        vmsg->payload.inflight.mmap_size = 0;
> > +        return true;
> > +    }
> > +
> > +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n",
> > +           vmsg->payload.inflight.num_queues);
> > +
> > +    mmap_size = vmsg->payload.inflight.num_queues *
> > +                ALIGN_UP(sizeof(VuVirtqInflight), INFLIGHT_ALIGNMENT);
> > +
> > +    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
> > +                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> > +                            &fd, NULL);
> > +
> > +    if (!addr) {
> > +        vu_panic(dev, "Failed to alloc vhost inflight area");
> > +        vmsg->payload.inflight.mmap_size = 0;
> > +        return true;
> > +    }
> > +
> > +    dev->inflight_info.addr = addr;
> > +    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
> > +    vmsg->payload.inflight.mmap_offset = 0;
> > +    vmsg->payload.inflight.align = INFLIGHT_ALIGNMENT;
> > +    vmsg->payload.inflight.version = INFLIGHT_VERSION;
> > +    vmsg->fd_num = 1;
> > +    dev->inflight_info.fd = vmsg->fds[0] = fd;
> > +
> > +    DPRINT("send inflight mmap_size: %"PRId64"\n",
> > +           vmsg->payload.inflight.mmap_size);
> > +    DPRINT("send inflight mmap offset: %"PRId64"\n",
> > +           vmsg->payload.inflight.mmap_offset);
> > +    DPRINT("send inflight align: %"PRId32"\n",
> > +           vmsg->payload.inflight.align);
> > +    DPRINT("send inflight version: %"PRId16"\n",
> > +           vmsg->payload.inflight.version);
> > +
> > +    return true;
> > +}
> > +
> > +static bool
> > +vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> > +{
> > +    int fd, i;
> > +    uint64_t mmap_size, mmap_offset;
> > +    uint32_t align;
> > +    uint16_t num_queues, version;
> > +    void *rc;
> > +
> > +    if (vmsg->fd_num != 1 ||
> > +        vmsg->size != sizeof(vmsg->payload.inflight)) {
> > +        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
> > +                 vmsg->size, vmsg->fd_num);
> > +        return false;
> > +    }
> > +
> > +    fd = vmsg->fds[0];
> > +    mmap_size = vmsg->payload.inflight.mmap_size;
> > +    mmap_offset = vmsg->payload.inflight.mmap_offset;
> > +    align = vmsg->payload.inflight.align;
> > +    num_queues = vmsg->payload.inflight.num_queues;
> > +    version = vmsg->payload.inflight.version;
> > +
> > +    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
> > +    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
> > +    DPRINT("set_inflight_fd align: %"PRId32"\n", align);
> > +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
> > +    DPRINT("set_inflight_fd version: %"PRId16"\n", version);
> > +
> > +    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> > +              fd, mmap_offset);
> > +
> > +    if (rc == MAP_FAILED) {
> > +        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
> > +        return false;
> > +    }
> > +
> > +    if (version != INFLIGHT_VERSION) {
> > +        vu_panic(dev, "Invalid set_inflight_fd version: %d", version);
> > +        return false;
> > +    }
> > +
> > +    if (dev->inflight_info.fd) {
> > +        close(dev->inflight_info.fd);
> > +    }
> > +
> > +    if (dev->inflight_info.addr) {
> > +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> > +    }
> > +
> > +    dev->inflight_info.fd = fd;
> > +    dev->inflight_info.addr = rc;
> > +    dev->inflight_info.size = mmap_size;
> > +
> > +    for (i = 0; i < num_queues; i++) {
> > +        dev->vq[i].inflight = (VuVirtqInflight *)rc;
> > +        rc = (void *)((char *)rc + ALIGN_UP(sizeof(VuVirtqInflight), align));
> > +    }
> > +
> > +    return false;
> > +}
> > +
> >   static bool
> >   vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >   {
> > @@ -1292,6 +1472,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >           return vu_set_postcopy_listen(dev, vmsg);
> >       case VHOST_USER_POSTCOPY_END:
> >           return vu_set_postcopy_end(dev, vmsg);
> > +    case VHOST_USER_GET_INFLIGHT_FD:
> > +        return vu_get_inflight_fd(dev, vmsg);
> > +    case VHOST_USER_SET_INFLIGHT_FD:
> > +        return vu_set_inflight_fd(dev, vmsg);
> >       default:
> >           vmsg_close_fds(vmsg);
> >           vu_panic(dev, "Unhandled request: %d", vmsg->request);
> > @@ -1359,8 +1543,18 @@ vu_deinit(VuDev *dev)
> >               close(vq->err_fd);
> >               vq->err_fd = -1;
> >           }
> > +        vq->inflight = NULL;
> >       }
> >
> > +    if (dev->inflight_info.addr) {
> > +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> > +        dev->inflight_info.addr = NULL;
> > +    }
> > +
> > +    if (dev->inflight_info.fd > 0) {
> > +        close(dev->inflight_info.fd);
> > +        dev->inflight_info.fd = -1;
> > +    }
> >
> >       vu_close_log(dev);
> >       if (dev->slave_fd != -1) {
> > @@ -1687,20 +1881,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
> >       return vring_avail_idx(vq) == vq->last_avail_idx;
> >   }
> >
> > -static inline
> > -bool has_feature(uint64_t features, unsigned int fbit)
> > -{
> > -    assert(fbit < 64);
> > -    return !!(features & (1ULL << fbit));
> > -}
> > -
> > -static inline
> > -bool vu_has_feature(VuDev *dev,
> > -                    unsigned int fbit)
> > -{
> > -    return has_feature(dev->features, fbit);
> > -}
> > -
> >   static bool
> >   vring_notify(VuDev *dev, VuVirtq *vq)
> >   {
> > @@ -1829,12 +2009,6 @@ virtqueue_map_desc(VuDev *dev,
> >       *p_num_sg = num_sg;
> >   }
> >
> > -/* Round number down to multiple */
> > -#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> > -
> > -/* Round number up to multiple */
> > -#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> > -
> >   static void *
> >   virtqueue_alloc_element(size_t sz,
> >                                        unsigned out_num, unsigned in_num)
> > @@ -1935,9 +2109,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
> >       return elem;
> >   }
> >
> > +static int
> > +vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
> > +{
> > +    if (!has_feature(dev->protocol_features,
> > +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> > +        return 0;
> > +    }
> > +
> > +    if (unlikely(!vq->inflight)) {
> > +        return -1;
> > +    }
> > +
>
>
> Just wonder what happens if backend get killed at this point?
>

We will re-calculate last_avail_idx as: last_avail_idx = inuse + used_idx

At this point, the backend can still consume this entry correctly after
reconnecting.
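
For example (hypothetical numbers, just to illustrate the arithmetic):

/* Suppose the guest made 12 buffers available and the old backend had
 * completed 10 of them (used->idx == 10) when it crashed, with
 * descriptors 3 and 7 still set in the inflight bitmap.  On reconnect,
 * vu_check_queue_inflights() then sees:
 *
 *     used_idx       = 10
 *     inuse          = 2                     (descriptors 3 and 7)
 *     last_avail_idx = inuse + used_idx = 12
 *
 * so no available entry is lost, and descriptors 3 and 7 are queued in
 * inflight_desc[] to be resubmitted first by vu_queue_pop(). */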

> You want to survive from the backend crash but you still depend on
> backend to get and put inflight descriptors which seems somehow conflict.
>

But if the backend gets killed in vu_queue_push(), I think you are
right, there is a conflict: a descriptor may already have been consumed
by the guest while still being marked as in use in the inflight buffer.
Then we would re-send this stale descriptor after restart.

Maybe we can add something like the following to fix this issue:

void vu_queue_push(VuDev *dev, VuVirtq *vq,
                   const VuVirtqElement *elem, unsigned int len)
{
    /* record which element we are completing, so a crash inside
     * this function can be detected after reconnect */
    vq->inflight->elem_idx = elem->index;
    vu_queue_fill(dev, vq, elem, len, 0);
    vu_queue_flush(dev, vq, 1);
    /* clear the inflight bit only after the used ring is updated,
     * and publish the new used_idx last */
    vq->inflight->desc[elem->index] = 0;
    vq->inflight->used_idx = vq->vring.used->idx;
}

static int vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
{
    ....
    if (vq->inflight->used_idx != vq->vring.used->idx) {
        /* we crashed in vu_queue_push() after flushing the used ring
         * but before clearing the inflight bit: drop the stale entry */
        vq->inflight->desc[vq->inflight->elem_idx] = 0;
    }
    ....
}

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-10 16:41               ` Daniel P. Berrangé
@ 2019-01-11  7:50                 ` Yongji Xie
  2019-01-11  8:32                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-11  7:50 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Jan 10, 2019 at 10:29:20PM +0800, Yongji Xie wrote:
> > On Thu, 10 Jan 2019 at 22:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> > > > On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > > > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > > > >
> > > > > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > > > > on client sockets during initialization of the chardev.
> > > > > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > > > > when necessary. Now it would be used for unix domain
> > > > > > > > socket of vhost-user-blk device to support reconnect.
> > > > > > > >
> > > > > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > > > > ---
> > > > > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > > > > >  qapi/char.json        |  3 +--
> > > > > > > >  qemu-options.hx       |  9 ++++---
> > > > > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > > > > --- a/chardev/char-socket.c
> > > > > > > > +++ b/chardev/char-socket.c
> > > > > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > > > > >          s->reconnect_time = reconnect;
> > > > > > > >      }
> > > > > > > >
> > > > > > > > -    if (s->reconnect_time) {
> > > > > > > > -        tcp_chr_connect_async(chr);
> > > > > > > > -    } else {
> > > > > > > > -        if (s->is_listen) {
> > > > > > > > -            char *name;
> > > > > > > > -            s->listener = qio_net_listener_new();
> > > > > > > > +    if (s->is_listen) {
> > > > > > > > +        char *name;
> > > > > > > > +        s->listener = qio_net_listener_new();
> > > > > > > >
> > > > > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > > > > -            g_free(name);
> > > > > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > > > > +        g_free(name);
> > > > > > > >
> > > > > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > -                object_unref(OBJECT(s->listener));
> > > > > > > > -                s->listener = NULL;
> > > > > > > > -                goto error;
> > > > > > > > -            }
> > > > > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > +            object_unref(OBJECT(s->listener));
> > > > > > > > +            s->listener = NULL;
> > > > > > > > +            goto error;
> > > > > > > > +        }
> > > > > > > >
> > > > > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > -            update_disconnected_filename(s);
> > > > > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > +        update_disconnected_filename(s);
> > > > > > > >
> > > > > > > > -            if (is_waitconnect &&
> > > > > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > -                return;
> > > > > > > > -            }
> > > > > > > > -            if (!s->ioc) {
> > > > > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > -                                                      tcp_chr_accept,
> > > > > > > > -                                                      chr, NULL,
> > > > > > > > -                                                      chr->gcontext);
> > > > > > > > -            }
> > > > > > > > +        if (is_waitconnect &&
> > > > > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > +            return;
> > > > > > > > +        }
> > > > > > > > +        if (!s->ioc) {
> > > > > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > +                                                  tcp_chr_accept,
> > > > > > > > +                                                  chr, NULL,
> > > > > > > > +                                                  chr->gcontext);
> > > > > > > > +        }
> > > > > > > > +    } else if (is_waitconnect) {
> > > > > > > > +        if (s->reconnect_time) {
> > > > > > > > +            tcp_chr_connect_async(chr);
> > > > > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > >              goto error;
> > > > > > > >          }
> > > > > > >
> > > > > > > This skips everything when 'is_waitconnect' is false.
> > > > > > >
> > > > > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > > > > flag to the -chardevs it cteates. This mistake was previously ignored
> > > > > > > because the chardevs were socket clients, but now we honour it.
> > > > > > >
> > > > > > > We shoul remove 'nowait' from the qtest chardevs, but separately
> > > > > > > from that this code should also still attempt a non-blocking
> > > > > > > connect when is_waitconnect is false.
> > > > > > >
> > > > > >
> > > > > > Do you mean we still need to connect server in background with
> > > > > > "nowait" option? But my purpose is not to connect server until we
> > > > > > manually call qemu_chr_fe_wait_connected() in other place.
> > > > >
> > > > > I don't see a need to delay the connect. We can start a
> > > > > background connect right away. The later code you have
> > > > > merely needs to wait for that background connect  to
> > > > > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > > > > This keeps the chardev code clearer only having 2 distinct
> > > > > code paths to worry about - blocking or non-blocking connect.
> > > > >
> > > >
> > > > Now the problem is that we have a server that only accept one
> > > > connection. And we want to read something from it during device
> > > > initializtion.
> > > >
> > > > If background connect it before we call qemu_chr_fe_wait_connected()
> > > > during device initializtion, qemu_chr_fe_wait_connected() will
> > > > accomplish but we can't read anything. And we have no way to release
> > > > the background connection. So what I want to do in this patch is to
> > > > disable background connect.
> > >
> > > I'm not seeing the problem here. What I proposed results in
> > >
> > >   1. chardev starts connect()
> >
> > This should be asynchronous with "reconnect" option. Another thread
> > may connect before vhost backend?
> >
> > >   2. vhost backend waits for connect() to complete
> >
> > Sorry, I'm not sure I get your point here. Do you mean vhost backend
> > call qemu_chr_fe_wait_connected()? Seems like
> > qemu_chr_fe_wait_connected() will connect directly rather than wait
> > for connect() to complete?
>
> Ahhhh, yes, you are right.
>
> qemu_chr_fe_wait_connected will potentially cause a second connection to
> be established
>
> Looking at it the qemu_chr_fe_wait_connected() I believe it is seriously
> broken even before this patch series.
>
> The intended usage is that a device can all qemu_chr_fe_wait_connected
> to wait for a new connection to be established, and then do I/O on the
> chardev.  This does not in fact work if TLS, websock or telnet modes
> are enabled for the socket, due to a mistake introduced when we previously
> tried to fix this:
>
>   commit 1dc8a6695c731abb7461c637b2512c3670d82be4
>   Author: Marc-André Lureau <marcandre.lureau@redhat.com>
>   Date:   Tue Aug 16 12:33:32 2016 +0400
>
>     char: fix waiting for TLS and telnet connection
>
> That commit fixed the problem where we continued to accept() new sockets
> when TLS/telnet was enabled, because the 's->connected' flag isn't set
> immediately.
>
> Unfortunately what this means is that when qemu_chr_fe_wait_connected
> returns, the chardev is *not* ready to read/write data. The TLS/telnet
> handshake has not been run, and is still pending in the background.
>
> So we'll end up with device backend trying todo I/O on the chardev
> at the same time as it is trying todo the TLS/telnet handshake.
>
> We need to fix qemu_chr_fe_wait_connected so that it does explicit
> synchronization wrt to any ongoing background connection process.
> It must only return once all TLS/telnet/websock handshakes have
> completed.  If we fix that correctly, then I believe it will  also
> solve the problem you're trying to address.
>

Yes, I think this should be the right way to go. To fix it, my thought
is to track the asynchronous QIOChannelSocket in SocketChardev, so that
we can easily see the connection progress in
qemu_chr_fe_wait_connected(). Do you have any suggestions?
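
Just to illustrate the idea (a sketch only; the new field names below
are made up, not a concrete proposal):

typedef struct {
    Chardev parent;
    QIOChannel *ioc;                /* communication channel once ready */
    QIONetListener *listener;       /* server side */
    /* ... existing fields ... */
    QIOChannelSocket *connect_sioc; /* non-NULL while an async client
                                       connect is still in progress */
    bool handshake_done;            /* TLS/telnet/websock handshake state */
} SocketChardev;

qemu_chr_fe_wait_connected() could then wait on connect_sioc and the
handshake state instead of opening a second connection.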

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-11  7:50                 ` Yongji Xie
@ 2019-01-11  8:32                   ` Daniel P. Berrangé
  2019-01-11  8:36                     ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-11  8:32 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
> On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 10:29:20PM +0800, Yongji Xie wrote:
> > > On Thu, 10 Jan 2019 at 22:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> > > > > On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > > > > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > > > > >
> > > > > > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > > > > > on client sockets during initialization of the chardev.
> > > > > > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > > > > > when necessary. Now it would be used for unix domain
> > > > > > > > > socket of vhost-user-blk device to support reconnect.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > > > > > ---
> > > > > > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > > > > > >  qapi/char.json        |  3 +--
> > > > > > > > >  qemu-options.hx       |  9 ++++---
> > > > > > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > > > > > --- a/chardev/char-socket.c
> > > > > > > > > +++ b/chardev/char-socket.c
> > > > > > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > > > > > >          s->reconnect_time = reconnect;
> > > > > > > > >      }
> > > > > > > > >
> > > > > > > > > -    if (s->reconnect_time) {
> > > > > > > > > -        tcp_chr_connect_async(chr);
> > > > > > > > > -    } else {
> > > > > > > > > -        if (s->is_listen) {
> > > > > > > > > -            char *name;
> > > > > > > > > -            s->listener = qio_net_listener_new();
> > > > > > > > > +    if (s->is_listen) {
> > > > > > > > > +        char *name;
> > > > > > > > > +        s->listener = qio_net_listener_new();
> > > > > > > > >
> > > > > > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > > > > > -            g_free(name);
> > > > > > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > > > > > +        g_free(name);
> > > > > > > > >
> > > > > > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > > -                object_unref(OBJECT(s->listener));
> > > > > > > > > -                s->listener = NULL;
> > > > > > > > > -                goto error;
> > > > > > > > > -            }
> > > > > > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > > +            object_unref(OBJECT(s->listener));
> > > > > > > > > +            s->listener = NULL;
> > > > > > > > > +            goto error;
> > > > > > > > > +        }
> > > > > > > > >
> > > > > > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > > -            update_disconnected_filename(s);
> > > > > > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > > +        update_disconnected_filename(s);
> > > > > > > > >
> > > > > > > > > -            if (is_waitconnect &&
> > > > > > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > > -                return;
> > > > > > > > > -            }
> > > > > > > > > -            if (!s->ioc) {
> > > > > > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > > -                                                      tcp_chr_accept,
> > > > > > > > > -                                                      chr, NULL,
> > > > > > > > > -                                                      chr->gcontext);
> > > > > > > > > -            }
> > > > > > > > > +        if (is_waitconnect &&
> > > > > > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > > +            return;
> > > > > > > > > +        }
> > > > > > > > > +        if (!s->ioc) {
> > > > > > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > > +                                                  tcp_chr_accept,
> > > > > > > > > +                                                  chr, NULL,
> > > > > > > > > +                                                  chr->gcontext);
> > > > > > > > > +        }
> > > > > > > > > +    } else if (is_waitconnect) {
> > > > > > > > > +        if (s->reconnect_time) {
> > > > > > > > > +            tcp_chr_connect_async(chr);
> > > > > > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > >              goto error;
> > > > > > > > >          }
> > > > > > > >
> > > > > > > > This skips everything when 'is_waitconnect' is false.
> > > > > > > >
> > > > > > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > > > > > flag to the -chardevs it cteates. This mistake was previously ignored
> > > > > > > > because the chardevs were socket clients, but now we honour it.
> > > > > > > >
> > > > > > > > We shoul remove 'nowait' from the qtest chardevs, but separately
> > > > > > > > from that this code should also still attempt a non-blocking
> > > > > > > > connect when is_waitconnect is false.
> > > > > > > >
> > > > > > >
> > > > > > > Do you mean we still need to connect server in background with
> > > > > > > "nowait" option? But my purpose is not to connect server until we
> > > > > > > manually call qemu_chr_fe_wait_connected() in other place.
> > > > > >
> > > > > > I don't see a need to delay the connect. We can start a
> > > > > > background connect right away. The later code you have
> > > > > > merely needs to wait for that background connect  to
> > > > > > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > > > > > This keeps the chardev code clearer only having 2 distinct
> > > > > > code paths to worry about - blocking or non-blocking connect.
> > > > > >
> > > > >
> > > > > Now the problem is that we have a server that only accept one
> > > > > connection. And we want to read something from it during device
> > > > > initializtion.
> > > > >
> > > > > If background connect it before we call qemu_chr_fe_wait_connected()
> > > > > during device initializtion, qemu_chr_fe_wait_connected() will
> > > > > accomplish but we can't read anything. And we have no way to release
> > > > > the background connection. So what I want to do in this patch is to
> > > > > disable background connect.
> > > >
> > > > I'm not seeing the problem here. What I proposed results in
> > > >
> > > >   1. chardev starts connect()
> > >
> > > This should be asynchronous with "reconnect" option. Another thread
> > > may connect before vhost backend?
> > >
> > > >   2. vhost backend waits for connect() to complete
> > >
> > > Sorry, I'm not sure I get your point here. Do you mean vhost backend
> > > call qemu_chr_fe_wait_connected()? Seems like
> > > qemu_chr_fe_wait_connected() will connect directly rather than wait
> > > for connect() to complete?
> >
> > Ahhhh, yes, you are right.
> >
> > qemu_chr_fe_wait_connected will potentially cause a second connection to
> > be established
> >
> > Looking at it the qemu_chr_fe_wait_connected() I believe it is seriously
> > broken even before this patch series.
> >
> > The intended usage is that a device can all qemu_chr_fe_wait_connected
> > to wait for a new connection to be established, and then do I/O on the
> > chardev.  This does not in fact work if TLS, websock or telnet modes
> > are enabled for the socket, due to a mistake introduced when we previously
> > tried to fix this:
> >
> >   commit 1dc8a6695c731abb7461c637b2512c3670d82be4
> >   Author: Marc-André Lureau <marcandre.lureau@redhat.com>
> >   Date:   Tue Aug 16 12:33:32 2016 +0400
> >
> >     char: fix waiting for TLS and telnet connection
> >
> > That commit fixed the problem where we continued to accept() new sockets
> > when TLS/telnet was enabled, because the 's->connected' flag isn't set
> > immediately.
> >
> > Unfortunately what this means is that when qemu_chr_fe_wait_connected
> > returns, the chardev is *not* ready to read/write data. The TLS/telnet
> > handshake has not been run, and is still pending in the background.
> >
> > So we'll end up with device backend trying todo I/O on the chardev
> > at the same time as it is trying todo the TLS/telnet handshake.
> >
> > We need to fix qemu_chr_fe_wait_connected so that it does explicit
> > synchronization wrt to any ongoing background connection process.
> > It must only return once all TLS/telnet/websock handshakes have
> > completed.  If we fix that correctly, then I believe it will  also
> > solve the problem you're trying to address.
> >
> 
> Yes, I think this should be the right way to go. To fix it, my thought
> is to track the async QIOChannelSocket in SocketChardev. Then we can
> easily get the connection progress in qemu_chr_fe_wait_connected(). Do
> you have any suggestion?

I've got a few patches that refactor the code to fix this. I'll send them
today and CC you on them.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-11  8:32                   ` Daniel P. Berrangé
@ 2019-01-11  8:36                     ` Yongji Xie
  2019-01-15 15:39                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-11  8:36 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 11 Jan 2019 at 16:32, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
> > On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Jan 10, 2019 at 10:29:20PM +0800, Yongji Xie wrote:
> > > > On Thu, 10 Jan 2019 at 22:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Jan 10, 2019 at 10:08:54PM +0800, Yongji Xie wrote:
> > > > > > On Thu, 10 Jan 2019 at 21:24, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jan 10, 2019 at 09:19:41PM +0800, Yongji Xie wrote:
> > > > > > > > On Thu, 10 Jan 2019 at 20:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jan 09, 2019 at 07:27:22PM +0800, elohimes@gmail.com wrote:
> > > > > > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > > > > > >
> > > > > > > > > > Enable "nowait" option to make QEMU not do a connect
> > > > > > > > > > on client sockets during initialization of the chardev.
> > > > > > > > > > Then we can use qemu_chr_fe_wait_connected() to connect
> > > > > > > > > > when necessary. Now it would be used for unix domain
> > > > > > > > > > socket of vhost-user-blk device to support reconnect.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > > > > > > > > > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > > > > > > > > > ---
> > > > > > > > > >  chardev/char-socket.c | 56 +++++++++++++++++++++----------------------
> > > > > > > > > >  qapi/char.json        |  3 +--
> > > > > > > > > >  qemu-options.hx       |  9 ++++---
> > > > > > > > > >  3 files changed, 35 insertions(+), 33 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> > > > > > > > > > index eaa8e8b68f..f803f4f7d3 100644
> > > > > > > > > > --- a/chardev/char-socket.c
> > > > > > > > > > +++ b/chardev/char-socket.c
> > > > > > > > > > @@ -1072,37 +1072,37 @@ static void qmp_chardev_open_socket(Chardev *chr,
> > > > > > > > > >          s->reconnect_time = reconnect;
> > > > > > > > > >      }
> > > > > > > > > >
> > > > > > > > > > -    if (s->reconnect_time) {
> > > > > > > > > > -        tcp_chr_connect_async(chr);
> > > > > > > > > > -    } else {
> > > > > > > > > > -        if (s->is_listen) {
> > > > > > > > > > -            char *name;
> > > > > > > > > > -            s->listener = qio_net_listener_new();
> > > > > > > > > > +    if (s->is_listen) {
> > > > > > > > > > +        char *name;
> > > > > > > > > > +        s->listener = qio_net_listener_new();
> > > > > > > > > >
> > > > > > > > > > -            name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > > > -            qio_net_listener_set_name(s->listener, name);
> > > > > > > > > > -            g_free(name);
> > > > > > > > > > +        name = g_strdup_printf("chardev-tcp-listener-%s", chr->label);
> > > > > > > > > > +        qio_net_listener_set_name(s->listener, name);
> > > > > > > > > > +        g_free(name);
> > > > > > > > > >
> > > > > > > > > > -            if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > > > -                object_unref(OBJECT(s->listener));
> > > > > > > > > > -                s->listener = NULL;
> > > > > > > > > > -                goto error;
> > > > > > > > > > -            }
> > > > > > > > > > +        if (qio_net_listener_open_sync(s->listener, s->addr, errp) < 0) {
> > > > > > > > > > +            object_unref(OBJECT(s->listener));
> > > > > > > > > > +            s->listener = NULL;
> > > > > > > > > > +            goto error;
> > > > > > > > > > +        }
> > > > > > > > > >
> > > > > > > > > > -            qapi_free_SocketAddress(s->addr);
> > > > > > > > > > -            s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > > > -            update_disconnected_filename(s);
> > > > > > > > > > +        qapi_free_SocketAddress(s->addr);
> > > > > > > > > > +        s->addr = socket_local_address(s->listener->sioc[0]->fd, errp);
> > > > > > > > > > +        update_disconnected_filename(s);
> > > > > > > > > >
> > > > > > > > > > -            if (is_waitconnect &&
> > > > > > > > > > -                qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > > > -                return;
> > > > > > > > > > -            }
> > > > > > > > > > -            if (!s->ioc) {
> > > > > > > > > > -                qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > > > -                                                      tcp_chr_accept,
> > > > > > > > > > -                                                      chr, NULL,
> > > > > > > > > > -                                                      chr->gcontext);
> > > > > > > > > > -            }
> > > > > > > > > > +        if (is_waitconnect &&
> > > > > > > > > > +            qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > > > +            return;
> > > > > > > > > > +        }
> > > > > > > > > > +        if (!s->ioc) {
> > > > > > > > > > +            qio_net_listener_set_client_func_full(s->listener,
> > > > > > > > > > +                                                  tcp_chr_accept,
> > > > > > > > > > +                                                  chr, NULL,
> > > > > > > > > > +                                                  chr->gcontext);
> > > > > > > > > > +        }
> > > > > > > > > > +    } else if (is_waitconnect) {
> > > > > > > > > > +        if (s->reconnect_time) {
> > > > > > > > > > +            tcp_chr_connect_async(chr);
> > > > > > > > > >          } else if (qemu_chr_wait_connected(chr, errp) < 0) {
> > > > > > > > > >              goto error;
> > > > > > > > > >          }
> > > > > > > > >
> > > > > > > > > This skips everything when 'is_waitconnect' is false.
> > > > > > > > >
> > > > > > > > > This combines with a bug in tests/libqtest.c which adds the 'nowait'
> > > > > > > > > flag to the -chardevs it creates. This mistake was previously ignored
> > > > > > > > > because the chardevs were socket clients, but now we honour it.
> > > > > > > > >
> > > > > > > > > We should remove 'nowait' from the qtest chardevs, but separately
> > > > > > > > > from that this code should also still attempt a non-blocking
> > > > > > > > > connect when is_waitconnect is false.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Do you mean we still need to connect to the server in the background
> > > > > > > > with the "nowait" option? But my purpose is not to connect to the server
> > > > > > > > until we manually call qemu_chr_fe_wait_connected() somewhere else.
> > > > > > >
> > > > > > > I don't see a need to delay the connect. We can start a
> > > > > > > background connect right away. The later code you have
> > > > > > > merely needs to wait for that background connect  to
> > > > > > > finish, which qemu_chr_fe_wait_connected still accomplishes.
> > > > > > > This keeps the chardev code clearer only having 2 distinct
> > > > > > > code paths to worry about - blocking or non-blocking connect.
> > > > > > >
> > > > > >
> > > > > > Now the problem is that we have a server that only accepts one
> > > > > > connection. And we want to read something from it during device
> > > > > > initialization.
> > > > > >
> > > > > > If the background connect happens before we call qemu_chr_fe_wait_connected()
> > > > > > during device initialization, qemu_chr_fe_wait_connected() will
> > > > > > complete but we can't read anything. And we have no way to release
> > > > > > the background connection. So what I want to do in this patch is to
> > > > > > disable the background connect.
> > > > >
> > > > > I'm not seeing the problem here. What I proposed results in
> > > > >
> > > > >   1. chardev starts connect()
> > > >
> > > > This would be asynchronous with the "reconnect" option. Another thread
> > > > may connect before the vhost backend does?
> > > >
> > > > >   2. vhost backend waits for connect() to complete
> > > >
> > > > Sorry, I'm not sure I get your point here. Do you mean the vhost backend
> > > > calls qemu_chr_fe_wait_connected()? It seems that
> > > > qemu_chr_fe_wait_connected() will connect directly rather than wait
> > > > for the connect() to complete?
> > >
> > > Ahhhh, yes, you are right.
> > >
> > > qemu_chr_fe_wait_connected will potentially cause a second connection to
> > > be established
> > >
> > > Looking at qemu_chr_fe_wait_connected(), I believe it is seriously
> > > broken even before this patch series.
> > >
> > > The intended usage is that a device can call qemu_chr_fe_wait_connected
> > > to wait for a new connection to be established, and then do I/O on the
> > > chardev.  This does not in fact work if TLS, websock or telnet modes
> > > are enabled for the socket, due to a mistake introduced when we previously
> > > tried to fix this:
> > >
> > >   commit 1dc8a6695c731abb7461c637b2512c3670d82be4
> > >   Author: Marc-André Lureau <marcandre.lureau@redhat.com>
> > >   Date:   Tue Aug 16 12:33:32 2016 +0400
> > >
> > >     char: fix waiting for TLS and telnet connection
> > >
> > > That commit fixed the problem where we continued to accept() new sockets
> > > when TLS/telnet was enabled, because the 's->connected' flag isn't set
> > > immediately.
> > >
> > > Unfortunately what this means is that when qemu_chr_fe_wait_connected
> > > returns, the chardev is *not* ready to read/write data. The TLS/telnet
> > > handshake has not been run, and is still pending in the background.
> > >
> > > So we'll end up with the device backend trying to do I/O on the chardev
> > > at the same time as it is trying to do the TLS/telnet handshake.
> > >
> > > We need to fix qemu_chr_fe_wait_connected so that it does explicit
> > > synchronization wrt to any ongoing background connection process.
> > > It must only return once all TLS/telnet/websock handshakes have
> > > completed.  If we fix that correctly, then I believe it will  also
> > > solve the problem you're trying to address.
> > >
> >
> > Yes, I think this should be the right way to go. To fix it, my thought
> > is to track the async QIOChannelSocket in SocketChardev. Then we can
> > easily get the connection progress in qemu_chr_fe_wait_connected(). Do
> > you have any suggestion?
>
> I've got a few patches that refactor the code to fix this. I'll send them
> today and CC you on them.
>

That would be great! Thank you.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-10 10:59   ` Yongji Xie
@ 2019-01-11 15:53     ` Stefan Hajnoczi
  2019-01-11 17:24       ` Michael S. Tsirkin
  2019-01-12  4:50       ` Yongji Xie
  0 siblings, 2 replies; 54+ messages in thread
From: Stefan Hajnoczi @ 2019-01-11 15:53 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

[-- Attachment #1: Type: text/plain, Size: 2752 bytes --]

On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > From: Xie Yongji <xieyongji@baidu.com>
> > >
> > > This patchset is aimed at supporting qemu to reconnect
> > > vhost-user-blk backend after vhost-user-blk backend crash or
> > > restart.
> > >
> > > The patch 1 uses existing wait/nowait options to make QEMU not
> > > do a connect on client sockets during initialization of the chardev.
> > >
> > > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > > memory to backend.
> >
> > Can you describe the problem that the inflight I/O shared memory region
> > solves?
> >
> 
> The backend needs to get the inflight I/O and do I/O replay after a restart.
> Now we can only get used_idx in the used ring. That's not enough, because we
> might not process descriptors in the same order in which they were
> made available. A simple example:
> https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> shared memory region to track inflight I/O.

The inflight shared memory region is something that needs to be
discussed in detail to make sure it is correct not just for vhost-blk
but also for other vhost-user devices in the future.

Please expand the protocol documentation so that someone can implement
this feature without looking at the code.  Explain the reason for the
inflight shared memory region and how exactly it is used.

After a quick look at the shared memory region documentation I have a
few questions:

1. The avail ring only contains "head" indices, each of which may chain
   non-head descriptors.  Does the inflight memory region track only the
   heads?

2. Does the inflight shared memory region preserve avail ring order?
   For example, if the guest submitted 5 2 4, will the new vhost-user
   backend keep that order or does it see 2 4 5?

3. What are the exact memory region size requirements?  Imagine a device
   with a large virtqueue.  Or a device with multiple virtqueues of
   different sizes.  Can your structure handle those cases?

I'm concerned that this approach to device recovery is invasive and hard
to test.  Instead I would use VIRTIO's Device Status Field
DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
necessary.  This is more disruptive - drivers either have to resubmit or
fail I/O with EIO - but it's also simple and more likely to work
correctly (it only needs to be implemented correctly in the guest
driver, not in the many available vhost-user backend implementations).
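
A minimal sketch of that status-field approach, using the bit value defined
by the VIRTIO spec (everything else here is illustrative, not an existing
QEMU or backend API):

    #include <stdint.h>

    /* Device Status Field bit from the VIRTIO spec. */
    #define VIRTIO_CONFIG_S_NEEDS_RESET 0x40

    /* Instead of replaying inflight requests itself, a device that lost its
     * backend state flags itself as needing a reset; the guest driver must
     * then reset the device and resubmit or fail its outstanding I/O. */
    static void mark_needs_reset(uint8_t *device_status)
    {
        *device_status |= VIRTIO_CONFIG_S_NEEDS_RESET;
    }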

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-11 15:53     ` Stefan Hajnoczi
@ 2019-01-11 17:24       ` Michael S. Tsirkin
  2019-01-12  4:50       ` Yongji Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-11 17:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

On Fri, Jan 11, 2019 at 03:53:42PM +0000, Stefan Hajnoczi wrote:
> I'm concerned that this approach to device recovery is invasive and hard
> to test.  Instead I would use VIRTIO's Device Status Field
> DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> necessary.  This is more disruptive - drivers either have to resubmit or
> fail I/O with EIO - but it's also simple and more likely to work
> correctly (it only needs to be implemented correctly in the guest
> driver, not in the many available vhost-user backend implementations).
> 
> Stefan

Unfortunately drivers don't support DEVICE_NEEDS_RESET yet.
I'll be happy to accept patches, but this means we
can't depend on DEVICE_NEEDS_RESET for basic functionality
like reconnect. And given virtio 1.0 has been there
for a while now, I suspect the only way to start using
it is by adding a new feature flag.

Unfortunate but true.

-- 
MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-11 15:53     ` Stefan Hajnoczi
  2019-01-11 17:24       ` Michael S. Tsirkin
@ 2019-01-12  4:50       ` Yongji Xie
  2019-01-14 10:22         ` Stefan Hajnoczi
  1 sibling, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-12  4:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

On Fri, 11 Jan 2019 at 23:53, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> > On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > > From: Xie Yongji <xieyongji@baidu.com>
> > > >
> > > > This patchset is aimed at supporting qemu to reconnect
> > > > vhost-user-blk backend after vhost-user-blk backend crash or
> > > > restart.
> > > >
> > > > The patch 1 uses existing wait/nowait options to make QEMU not
> > > > do a connect on client sockets during initialization of the chardev.
> > > >
> > > > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > > > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > > > memory to backend.
> > >
> > > Can you describe the problem that the inflight I/O shared memory region
> > > solves?
> > >
> >
> > The backend needs to get the inflight I/O and do I/O replay after a restart.
> > Now we can only get used_idx in the used ring. That's not enough, because we
> > might not process descriptors in the same order in which they were
> > made available. A simple example:
> > https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> > shared memory region to track inflight I/O.
>
> The inflight shared memory region is something that needs to be
> discussed in detail to make sure it is correct not just for vhost-blk
> but also for other vhost-user devices in the future.
>
> Please expand the protocol documentation so that someone can implement
> this feature without looking at the code.  Explain the reason for the
> > inflight shared memory region and how exactly it is used.
>

OK, will do it in v5.

> After a quick look at the shared memory region documentation I have a
> few questions:
>
> 1. The avail ring only contains "head" indices, each of which may chain
>    non-head descriptors.  Does the inflight memory region track only the
>    heads?
>

Yes, we only track the head in inflight region.

> 2. Does the inflight shared memory region preserve avail ring order?
>    For example, if the guest submitted 5 2 4, will the new vhost-user
>    backend keep that order or does it see 2 4 5?
>

Currently we don't support resubmitting I/O in order. But I think we can add
a timestamp for each inflight descriptor in shared memory to achieve that.
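
A minimal sketch of what such an entry could look like, assuming the
per-descriptor records proposed in this series are widened with a
monotonically increasing counter (names and field widths here are
illustrative, not a settled format):

    #include <stdint.h>

    /* Hypothetical per-descriptor record in the shared inflight region. */
    struct inflight_desc_sketch {
        uint8_t  inflight;  /* non-zero while the head is being processed */
        uint64_t counter;   /* bumped when the head is fetched, so sorting
                               by it recovers the submission order */
    };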

> 3. What are the exact memory region size requirements?  Imagine a device
>    with a large virtqueue.  Or a device with multiple virtqueues of
>    different sizes.  Can your structure handle those cases?
>

Each available virtqueue should have a region in the inflight memory. The
size of the region should be fixed and large enough to handle the maximum
number of elements in one virtqueue. There may be a little waste, but I
think it makes the structure clearer.
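
As a rough sketch of that layout and how a reconnected backend could use it,
assuming the one-byte-per-descriptor format described in patch 2 (the names
here are illustrative, not the actual libvhost-user code):

    #include <stdint.h>

    #define MAX_QUEUE_SIZE 1024   /* region sized for the largest queue */

    /* Fixed-size per-queue region: desc[i] != 0 means descriptor head i was
     * made available to the backend but never marked used. */
    struct queue_inflight_region_sketch {
        uint8_t desc[MAX_QUEUE_SIZE];
    };

    /* After reconnect, rebuild the set of heads to replay by scanning the
     * region; used_idx alone cannot recover heads completed out of order. */
    static unsigned collect_inflight(const struct queue_inflight_region_sketch *r,
                                     uint16_t queue_size, uint16_t *heads_out)
    {
        unsigned n = 0;
        uint16_t i;

        for (i = 0; i < queue_size; i++) {
            if (r->desc[i]) {
                heads_out[n++] = i;
            }
        }
        return n;
    }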

> I'm concerned that this approach to device recovery is invasive and hard
> to test.  Instead I would use VIRTIO's Device Status Field
> DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> necessary.  This is more disruptive - drivers either have to resubmit or
> fail I/O with EIO - but it's also simple and more likely to work
> correctly (it only needs to be implemented correctly in the guest
> driver, not in the many available vhost-user backend implementations).
>

So you mean adding a way to notify the guest to resubmit inflight I/O. I
think it's a good idea. But wouldn't it be more flexible to implement
this in the backend? We could support old guests, and it would be easy to
fix bugs or add features.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-12  4:50       ` Yongji Xie
@ 2019-01-14 10:22         ` Stefan Hajnoczi
  2019-01-14 10:55           ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Stefan Hajnoczi @ 2019-01-14 10:22 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

[-- Attachment #1: Type: text/plain, Size: 4035 bytes --]

On Sat, Jan 12, 2019 at 12:50:12PM +0800, Yongji Xie wrote:
> On Fri, 11 Jan 2019 at 23:53, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> > > On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > >
> > > > > This patchset is aimed at supporting qemu to reconnect
> > > > > vhost-user-blk backend after vhost-user-blk backend crash or
> > > > > restart.
> > > > >
> > > > > The patch 1 uses existing wait/nowait options to make QEMU not
> > > > > do a connect on client sockets during initialization of the chardev.
> > > > >
> > > > > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > > > > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > > > > memory to backend.
> > > >
> > > > Can you describe the problem that the inflight I/O shared memory region
> > > > solves?
> > > >
> > >
> > > The backend needs to get the inflight I/O and do I/O replay after a restart.
> > > Now we can only get used_idx in the used ring. That's not enough, because we
> > > might not process descriptors in the same order in which they were
> > > made available. A simple example:
> > > https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> > > shared memory region to track inflight I/O.
> >
> > The inflight shared memory region is something that needs to be
> > discussed in detail to make sure it is correct not just for vhost-blk
> > but also for other vhost-user devices in the future.
> >
> > Please expand the protocol documentation so that someone can implement
> > this feature without looking at the code.  Explain the reason for the
> > > inflight shared memory region and how exactly it is used.
> >
> 
> OK, will do it in v5.
> 
> > After a quick look at the shared memory region documentation I have a
> > few questions:
> >
> > 1. The avail ring only contains "head" indices, each of which may chain
> >    non-head descriptors.  Does the inflight memory region track only the
> >    heads?
> >
> 
> Yes, we only track the head in inflight region.

Okay, thanks.  That is useful information to include in the
specification.

> > 2. Does the inflight shared memory region preserve avail ring order?
> >    For example, if the guest submitted 5 2 4, will the new vhost-user
> >    backend keep that order or does it see 2 4 5?
> >
> 
> Currently we don't support resubmitting I/O in order. But I think we can add
> a timestamp for each inflight descriptor in shared memory to achieve that.

Great, the reason I think that feature is interesting is that other
device types may need to preserve order.  It depends on the exact
meaning of the device's requests...

> > I'm concerned that this approach to device recovery is invasive and hard
> > to test.  Instead I would use VIRTIO's Device Status Field
> > DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> > necessary.  This is more disruptive - drivers either have to resubmit or
> > fail I/O with EIO - but it's also simple and more likely to work
> > correctly (it only needs to be implemented correctly in the guest
> > driver, not in the many available vhost-user backend implementations).
> >
> 
> So you mean adding a way to notify the guest to resubmit inflight I/O. I
> think it's a good idea. But wouldn't it be more flexible to implement
> this in the backend? We could support old guests, and it would be easy to
> fix bugs or add features.

There are trade-offs with either approach.  In the long run I think it's
beneficial to minimize non-trivial logic in vhost-user backends.  There
will be more vhost-user backend implementations and therefore more bugs
if we put the logic into the backend.  This is why I think a simple
mechanism for marking the device as needing a reset will be more
reliable and less trouble.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-14 10:22         ` Stefan Hajnoczi
@ 2019-01-14 10:55           ` Yongji Xie
  2019-01-16 14:28             ` Stefan Hajnoczi
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-14 10:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

On Mon, 14 Jan 2019 at 18:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Sat, Jan 12, 2019 at 12:50:12PM +0800, Yongji Xie wrote:
> > On Fri, 11 Jan 2019 at 23:53, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> > > > On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > > > > From: Xie Yongji <xieyongji@baidu.com>
> > > > > >
> > > > > > This patchset is aimed at supporting qemu to reconnect
> > > > > > vhost-user-blk backend after vhost-user-blk backend crash or
> > > > > > restart.
> > > > > >
> > > > > > The patch 1 uses existing wait/nowait options to make QEMU not
> > > > > > do a connect on client sockets during initialization of the chardev.
> > > > > >
> > > > > > The patch 2 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > > > > > and VHOST_USER_SET_INFLIGHT_FD to support providing shared
> > > > > > memory to backend.
> > > > >
> > > > > Can you describe the problem that the inflight I/O shared memory region
> > > > > solves?
> > > > >
> > > >
> > > > The backend needs to get the inflight I/O and do I/O replay after a restart.
> > > > Now we can only get used_idx in the used ring. That's not enough, because we
> > > > might not process descriptors in the same order in which they were
> > > > made available. A simple example:
> > > > https://patchwork.kernel.org/cover/10715305/#22375607. So we need a
> > > > shared memory region to track inflight I/O.
> > >
> > > The inflight shared memory region is something that needs to be
> > > discussed in detail to make sure it is correct not just for vhost-blk
> > > but also for other vhost-user devices in the future.
> > >
> > > Please expand the protocol documentation so that someone can implement
> > > this feature without looking at the code.  Explain the reason for the
> > > > inflight shared memory region and how exactly it is used.
> > >
> >
> > OK, will do it in v5.
> >
> > > After a quick look at the shared memory region documentation I have a
> > > few questions:
> > >
> > > 1. The avail ring only contains "head" indices, each of which may chain
> > >    non-head descriptors.  Does the inflight memory region track only the
> > >    heads?
> > >
> >
> > Yes, we only track the head in inflight region.
>
> Okay, thanks.  That is useful information to include in the
> specification.
>

OK, will do it.

> > > 2. Does the inflight shared memory region preserve avail ring order?
> > >    For example, if the guest submitted 5 2 4, will the new vhost-user
> > >    backend keep that order or does it see 2 4 5?
> > >
> >
> > Currently we don't support resubmitting I/O in order. But I think we can add
> > a timestamp for each inflight descriptor in shared memory to achieve that.
>
> Great, the reason I think that feature is interesting is that other
> device types may need to preserve order.  It depends on the exact
> meaning of the device's requests...
>

Yes, this would be a useful feature.

> > > I'm concerned that this approach to device recovery is invasive and hard
> > > to test.  Instead I would use VIRTIO's Device Status Field
> > > DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> > > necessary.  This is more disruptive - drivers either have to resubmit or
> > > fail I/O with EIO - but it's also simple and more likely to work
> > > correctly (it only needs to be implemented correctly in the guest
> > > driver, not in the many available vhost-user backend implementations).
> > >
> >
> > So you mean adding a way to notify the guest to resubmit inflight I/O. I
> > think it's a good idea. But wouldn't it be more flexible to implement
> > this in the backend? We could support old guests, and it would be easy to
> > fix bugs or add features.
>
> There are trade-offs with either approach.  In the long run I think it's
> beneficial to minimize non-trivial logic in vhost-user backends.  There
> will be more vhost-user backend implementations and therefore more bugs
> if we put the logic into the backend.  This is why I think a simple
> mechanism for marking the device as needing a reset will be more
> reliable and less trouble.
>

I agree. So is it possible to implement both? In the long run,
updating the guest driver to support this is better. And at that point, the
logic in the backend would only be used to support legacy guest drivers.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
@ 2019-01-14 22:25   ` Michael S. Tsirkin
  2019-01-15  6:46     ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-14 22:25 UTC (permalink / raw)
  To: elohimes
  Cc: marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh, qemu-devel, zhangyu31, chaiwen, nixun,
	lilin24, Xie Yongji

On Wed, Jan 09, 2019 at 07:27:23PM +0800, elohimes@gmail.com wrote:
> @@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
>  slave can send file descriptors (at most 8 descriptors in each message)
>  to master via ancillary data using this fd communication channel.
>  
> +Inflight I/O tracking
> +---------------------
> +
> +To support slave reconnecting, slave need to track inflight I/O in a
> +shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
> +are used to transfer the memory between master and slave. And to encourage
> +consistency, we provide a recommended format for this memory:

I think we should make a stronger statement and actually
just say what the format is. Not recommend it weakly.

> +
> +offset	 width	  description
> +0x0      0x400    region for queue0
> +0x400    0x400    region for queue1
> +0x800    0x400    region for queue2
> +...      ...      ...
> +
> +For each virtqueue, we have a 1024 bytes region.


Why is the size hardcoded? Why not a function of VQ size?


> The region's format is like:
> +
> +offset   width    description
> +0x0      0x1      descriptor 0 is in use or not
> +0x1      0x1      descriptor 1 is in use or not
> +0x2      0x1      descriptor 2 is in use or not
> +...      ...      ...
> +
> +For each descriptor, we use one byte to specify whether it's in use or not.
> +
>  Protocol features
>  -----------------
> 

I think that it's a good idea to have a version in this region.
Otherwise how are you going to handle compatibility when
this needs to be extended?
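
A sketch of what a version field at a known offset inside the buffer could
look like (field names and widths here are illustrative, not a defined
format):

    #include <stdint.h>

    /* Hypothetical header at offset 0 of the shared inflight buffer. */
    struct inflight_header_sketch {
        uint16_t version;   /* layout revision; a freshly zeroed buffer
                               reads back as 0, so a reset is detectable */
        uint16_t desc_num;  /* number of per-descriptor entries that follow */
    };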



 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-14 22:25   ` Michael S. Tsirkin
@ 2019-01-15  6:46     ` Yongji Xie
  2019-01-15 12:54       ` Michael S. Tsirkin
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-15  6:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, 15 Jan 2019 at 06:25, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jan 09, 2019 at 07:27:23PM +0800, elohimes@gmail.com wrote:
> > @@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
> >  slave can send file descriptors (at most 8 descriptors in each message)
> >  to master via ancillary data using this fd communication channel.
> >
> > +Inflight I/O tracking
> > +---------------------
> > +
> > +To support slave reconnecting, slave need to track inflight I/O in a
> > +shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
> > +are used to transfer the memory between master and slave. And to encourage
> > +consistency, we provide a recommended format for this memory:
>
> I think we should make a stronger statement and actually
> just say what the format is. Not recommend it weakly.
>

Okay, will do it.

> > +
> > +offset        width    description
> > +0x0      0x400    region for queue0
> > +0x400    0x400    region for queue1
> > +0x800    0x400    region for queue2
> > +...      ...      ...
> > +
> > +For each virtqueue, we have a 1024 bytes region.
>
>
> Why is the size hardcoded? Why not a function of VQ size?
>

Sorry, I didn't get your point. Should the region's size be fixed? Do
you mean we need to document a function for the region's size?

>
> > The region's format is like:
> > +
> > +offset   width    description
> > +0x0      0x1      descriptor 0 is in use or not
> > +0x1      0x1      descriptor 1 is in use or not
> > +0x2      0x1      descriptor 2 is in use or not
> > +...      ...      ...
> > +
> > +For each descriptor, we use one byte to specify whether it's in use or not.
> > +
> >  Protocol features
> >  -----------------
> >
>
> I think that it's a good idea to have a version in this region.
> Otherwise how are you going to handle compatibility when
> this needs to be extended?
>

I have put the version into the message's payload: VhostUserInflight. Is it OK?
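
For reference, a sketch of that payload as it can be reconstructed from the
fields the libvhost-user side of this series reads (the exact declaration in
the patch may differ):

    #include <stdint.h>

    typedef struct VhostUserInflightSketch {
        uint64_t mmap_size;    /* size of the area to mmap */
        uint64_t mmap_offset;  /* offset into the fd where the area starts */
        uint32_t align;        /* alignment of each per-queue region */
        uint16_t num_queues;   /* number of virtqueues covered */
        uint16_t version;      /* version of the inflight buffer layout */
    } VhostUserInflightSketch;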

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-11  6:10     ` Yongji Xie
@ 2019-01-15  7:52       ` Jason Wang
  2019-01-15 14:51         ` Yongji Xie
  2019-01-15 15:58         ` Michael S. Tsirkin
  0 siblings, 2 replies; 54+ messages in thread
From: Jason Wang @ 2019-01-15  7:52 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji


On 2019/1/11 2:10 PM, Yongji Xie wrote:
> On Fri, 11 Jan 2019 at 11:56, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2019/1/9 7:27 PM, elohimes@gmail.com wrote:
>>> From: Xie Yongji <xieyongji@baidu.com>
>>>
>>> This patch adds support for VHOST_USER_GET_INFLIGHT_FD and
>>> VHOST_USER_SET_INFLIGHT_FD message to set/get shared memory
>>> to/from qemu. Then we maintain a "bitmap" of all descriptors in
>>> the shared memory for each queue to track inflight I/O.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
>>> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
>>> ---
>>>    Makefile                              |   2 +-
>>>    contrib/libvhost-user/libvhost-user.c | 258 ++++++++++++++++++++++++--
>>>    contrib/libvhost-user/libvhost-user.h |  29 +++
>>>    3 files changed, 268 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/Makefile b/Makefile
>>> index dd53965f77..b5c9092605 100644
>>> --- a/Makefile
>>> +++ b/Makefile
>>> @@ -473,7 +473,7 @@ Makefile: $(version-obj-y)
>>>    # Build libraries
>>>
>>>    libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
>>> -libvhost-user.a: $(libvhost-user-obj-y)
>>> +libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
>>>
>>>    ######################################################################
>>>
>>> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
>>> index 23bd52264c..e73ce04619 100644
>>> --- a/contrib/libvhost-user/libvhost-user.c
>>> +++ b/contrib/libvhost-user/libvhost-user.c
>>> @@ -41,6 +41,8 @@
>>>    #endif
>>>
>>>    #include "qemu/atomic.h"
>>> +#include "qemu/osdep.h"
>>> +#include "qemu/memfd.h"
>>>
>>>    #include "libvhost-user.h"
>>>
>>> @@ -53,6 +55,18 @@
>>>                _min1 < _min2 ? _min1 : _min2; })
>>>    #endif
>>>
>>> +/* Round number down to multiple */
>>> +#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>> +
>>> +/* Round number up to multiple */
>>> +#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
>>> +
>>> +/* Align each region to cache line size in inflight buffer */
>>> +#define INFLIGHT_ALIGNMENT 64
>>> +
>>> +/* The version of inflight buffer */
>>> +#define INFLIGHT_VERSION 1
>>> +
>>>    #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
>>>
>>>    /* The version of the protocol we support */
>>> @@ -66,6 +80,20 @@
>>>            }                                       \
>>>        } while (0)
>>>
>>> +static inline
>>> +bool has_feature(uint64_t features, unsigned int fbit)
>>> +{
>>> +    assert(fbit < 64);
>>> +    return !!(features & (1ULL << fbit));
>>> +}
>>> +
>>> +static inline
>>> +bool vu_has_feature(VuDev *dev,
>>> +                    unsigned int fbit)
>>> +{
>>> +    return has_feature(dev->features, fbit);
>>> +}
>>> +
>>>    static const char *
>>>    vu_request_to_string(unsigned int req)
>>>    {
>>> @@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
>>>            REQ(VHOST_USER_POSTCOPY_ADVISE),
>>>            REQ(VHOST_USER_POSTCOPY_LISTEN),
>>>            REQ(VHOST_USER_POSTCOPY_END),
>>> +        REQ(VHOST_USER_GET_INFLIGHT_FD),
>>> +        REQ(VHOST_USER_SET_INFLIGHT_FD),
>>>            REQ(VHOST_USER_MAX),
>>>        };
>>>    #undef REQ
>>> @@ -890,6 +920,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
>>>        return true;
>>>    }
>>>
>>> +static int
>>> +vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
>>> +{
>>> +    int i = 0;
>>> +
>>> +    if (!has_feature(dev->protocol_features,
>>> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (unlikely(!vq->inflight)) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    vq->used_idx = vq->vring.used->idx;
>>> +    vq->inflight_num = 0;
>>> +    for (i = 0; i < vq->vring.num; i++) {
>>> +        if (vq->inflight->desc[i] == 0) {
>>> +            continue;
>>> +        }
>>> +
>>> +        vq->inflight_desc[vq->inflight_num++] = i;
>>> +        vq->inuse++;
>>> +    }
>>> +    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
>>> +
>>> +    /* in case of I/O hang after reconnecting */
>>> +    if (eventfd_write(vq->kick_fd, 1) ||
>>> +        eventfd_write(vq->call_fd, 1)) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    static bool
>>>    vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>>>    {
>>> @@ -925,6 +990,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
>>>                   dev->vq[index].kick_fd, index);
>>>        }
>>>
>>> +    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
>>> +        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
>>> +    }
>>> +
>>>        return false;
>>>    }
>>>
>>> @@ -1215,6 +1284,117 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
>>>        return true;
>>>    }
>>>
>>> +static bool
>>> +vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
>>> +{
>>> +    int fd;
>>> +    void *addr;
>>> +    uint64_t mmap_size;
>>> +
>>> +    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
>>> +        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
>>> +        vmsg->payload.inflight.mmap_size = 0;
>>> +        return true;
>>> +    }
>>> +
>>> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n",
>>> +           vmsg->payload.inflight.num_queues);
>>> +
>>> +    mmap_size = vmsg->payload.inflight.num_queues *
>>> +                ALIGN_UP(sizeof(VuVirtqInflight), INFLIGHT_ALIGNMENT);
>>> +
>>> +    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
>>> +                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
>>> +                            &fd, NULL);
>>> +
>>> +    if (!addr) {
>>> +        vu_panic(dev, "Failed to alloc vhost inflight area");
>>> +        vmsg->payload.inflight.mmap_size = 0;
>>> +        return true;
>>> +    }
>>> +
>>> +    dev->inflight_info.addr = addr;
>>> +    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
>>> +    vmsg->payload.inflight.mmap_offset = 0;
>>> +    vmsg->payload.inflight.align = INFLIGHT_ALIGNMENT;
>>> +    vmsg->payload.inflight.version = INFLIGHT_VERSION;
>>> +    vmsg->fd_num = 1;
>>> +    dev->inflight_info.fd = vmsg->fds[0] = fd;
>>> +
>>> +    DPRINT("send inflight mmap_size: %"PRId64"\n",
>>> +           vmsg->payload.inflight.mmap_size);
>>> +    DPRINT("send inflight mmap offset: %"PRId64"\n",
>>> +           vmsg->payload.inflight.mmap_offset);
>>> +    DPRINT("send inflight align: %"PRId32"\n",
>>> +           vmsg->payload.inflight.align);
>>> +    DPRINT("send inflight version: %"PRId16"\n",
>>> +           vmsg->payload.inflight.version);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static bool
>>> +vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
>>> +{
>>> +    int fd, i;
>>> +    uint64_t mmap_size, mmap_offset;
>>> +    uint32_t align;
>>> +    uint16_t num_queues, version;
>>> +    void *rc;
>>> +
>>> +    if (vmsg->fd_num != 1 ||
>>> +        vmsg->size != sizeof(vmsg->payload.inflight)) {
>>> +        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
>>> +                 vmsg->size, vmsg->fd_num);
>>> +        return false;
>>> +    }
>>> +
>>> +    fd = vmsg->fds[0];
>>> +    mmap_size = vmsg->payload.inflight.mmap_size;
>>> +    mmap_offset = vmsg->payload.inflight.mmap_offset;
>>> +    align = vmsg->payload.inflight.align;
>>> +    num_queues = vmsg->payload.inflight.num_queues;
>>> +    version = vmsg->payload.inflight.version;
>>> +
>>> +    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
>>> +    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
>>> +    DPRINT("set_inflight_fd align: %"PRId32"\n", align);
>>> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
>>> +    DPRINT("set_inflight_fd version: %"PRId16"\n", version);
>>> +
>>> +    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
>>> +              fd, mmap_offset);
>>> +
>>> +    if (rc == MAP_FAILED) {
>>> +        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
>>> +        return false;
>>> +    }
>>> +
>>> +    if (version != INFLIGHT_VERSION) {
>>> +        vu_panic(dev, "Invalid set_inflight_fd version: %d", version);
>>> +        return false;
>>> +    }
>>> +
>>> +    if (dev->inflight_info.fd) {
>>> +        close(dev->inflight_info.fd);
>>> +    }
>>> +
>>> +    if (dev->inflight_info.addr) {
>>> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
>>> +    }
>>> +
>>> +    dev->inflight_info.fd = fd;
>>> +    dev->inflight_info.addr = rc;
>>> +    dev->inflight_info.size = mmap_size;
>>> +
>>> +    for (i = 0; i < num_queues; i++) {
>>> +        dev->vq[i].inflight = (VuVirtqInflight *)rc;
>>> +        rc = (void *)((char *)rc + ALIGN_UP(sizeof(VuVirtqInflight), align));
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>> +
>>>    static bool
>>>    vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>>>    {
>>> @@ -1292,6 +1472,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
>>>            return vu_set_postcopy_listen(dev, vmsg);
>>>        case VHOST_USER_POSTCOPY_END:
>>>            return vu_set_postcopy_end(dev, vmsg);
>>> +    case VHOST_USER_GET_INFLIGHT_FD:
>>> +        return vu_get_inflight_fd(dev, vmsg);
>>> +    case VHOST_USER_SET_INFLIGHT_FD:
>>> +        return vu_set_inflight_fd(dev, vmsg);
>>>        default:
>>>            vmsg_close_fds(vmsg);
>>>            vu_panic(dev, "Unhandled request: %d", vmsg->request);
>>> @@ -1359,8 +1543,18 @@ vu_deinit(VuDev *dev)
>>>                close(vq->err_fd);
>>>                vq->err_fd = -1;
>>>            }
>>> +        vq->inflight = NULL;
>>>        }
>>>
>>> +    if (dev->inflight_info.addr) {
>>> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
>>> +        dev->inflight_info.addr = NULL;
>>> +    }
>>> +
>>> +    if (dev->inflight_info.fd > 0) {
>>> +        close(dev->inflight_info.fd);
>>> +        dev->inflight_info.fd = -1;
>>> +    }
>>>
>>>        vu_close_log(dev);
>>>        if (dev->slave_fd != -1) {
>>> @@ -1687,20 +1881,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
>>>        return vring_avail_idx(vq) == vq->last_avail_idx;
>>>    }
>>>
>>> -static inline
>>> -bool has_feature(uint64_t features, unsigned int fbit)
>>> -{
>>> -    assert(fbit < 64);
>>> -    return !!(features & (1ULL << fbit));
>>> -}
>>> -
>>> -static inline
>>> -bool vu_has_feature(VuDev *dev,
>>> -                    unsigned int fbit)
>>> -{
>>> -    return has_feature(dev->features, fbit);
>>> -}
>>> -
>>>    static bool
>>>    vring_notify(VuDev *dev, VuVirtq *vq)
>>>    {
>>> @@ -1829,12 +2009,6 @@ virtqueue_map_desc(VuDev *dev,
>>>        *p_num_sg = num_sg;
>>>    }
>>>
>>> -/* Round number down to multiple */
>>> -#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>> -
>>> -/* Round number up to multiple */
>>> -#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
>>> -
>>>    static void *
>>>    virtqueue_alloc_element(size_t sz,
>>>                                         unsigned out_num, unsigned in_num)
>>> @@ -1935,9 +2109,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
>>>        return elem;
>>>    }
>>>
>>> +static int
>>> +vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
>>> +{
>>> +    if (!has_feature(dev->protocol_features,
>>> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (unlikely(!vq->inflight)) {
>>> +        return -1;
>>> +    }
>>> +
>>
>> Just wonder what happens if backend get killed at this point?
>>
> We will re-calculate last_avail_idx like: last_avail_idx = inuse + used_idx


I'm not sure I get you here, but it looks to me at least one pending 
descriptor is missed since you don't set vq->inflight->desc[desc_idx] to 1?


>
> At this point, backend could consume this entry correctly after reconnect.
>
>> You want to survive from the backend crash but you still depend on
>> backend to get and put inflight descriptors which seems somehow conflict.
>>
> But if the backend gets killed in vu_queue_inflight_put(), I think you are
> right, there is a conflict. One descriptor is consumed by the
> guest but still marked as in use in the inflight buffer. Then we will
> re-send this old descriptor after restart.
>
> Maybe we can add something like that to fix this issue:
>
> void vu_queue_push()
> {
>      vq->inflight->elem_idx = elem->idx;
>      vu_queue_fill();
>      vu_queue_flush();
>      vq->inflight->desc[elem->idx] = 0;


Is it safe to be killed here?


>      vq->inflight->used_idx = vq->vring.used->idx;
> }
>
> static int vu_check_queue_inflights()
> {
>      ....
>      if (vq->inflight->used_idx != vq->vring.used->idx) {
>          /* Crash in vu_queue_push() */
>          vq->inflight->desc[vq->inflight->elem_idx] = 0;
>      }
>      ....
> }
>
> Thanks,
> Yongji


Well, this may work but here are my points:

1) The code wants to recover from a backend crash by introducing extra space
to store inflight data, but it still depends on the backend to set/get
the inflight state.

2) Since the backend could be killed at any time, the backend must have
the ability to recover from a partial inflight state.

So it looks to me that 1) tends to be self-contradictory and 2) tends to be
recursive. The above lines show how tricky the code could look.

Solving this at the vhost-user level in the backend is probably wrong.
It's time to consider support from virtio itself.

Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-15  6:46     ` Yongji Xie
@ 2019-01-15 12:54       ` Michael S. Tsirkin
  2019-01-15 14:18         ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-15 12:54 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, Jan 15, 2019 at 02:46:42PM +0800, Yongji Xie wrote:
> On Tue, 15 Jan 2019 at 06:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jan 09, 2019 at 07:27:23PM +0800, elohimes@gmail.com wrote:
> > > @@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
> > >  slave can send file descriptors (at most 8 descriptors in each message)
> > >  to master via ancillary data using this fd communication channel.
> > >
> > > +Inflight I/O tracking
> > > +---------------------
> > > +
> > > +To support slave reconnecting, slave need to track inflight I/O in a
> > > +shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
> > > +are used to transfer the memory between master and slave. And to encourage
> > > +consistency, we provide a recommended format for this memory:
> >
> > I think we should make a stronger statement and actually
> > just say what the format is. Not recommend it weakly.
> >
> 
> Okay, will do it.
> 
> > > +
> > > +offset        width    description
> > > +0x0      0x400    region for queue0
> > > +0x400    0x400    region for queue1
> > > +0x800    0x400    region for queue2
> > > +...      ...      ...
> > > +
> > > +For each virtqueue, we have a 1024 bytes region.
> >
> >
> > Why is the size hardcoded? Why not a function of VQ size?
> >
> 
> Sorry, I didn't get your point. Should the region's size be fixed? Do
> you mean we need to document a function for the region's size?


Well you are saying 0x0 to 0x400 is for queue0.
How do you know that's enough? And why are 0x400
bytes necessary? After all max queue size can be very small.



> >
> > > The region's format is like:
> > > +
> > > +offset   width    description
> > > +0x0      0x1      descriptor 0 is in use or not
> > > +0x1      0x1      descriptor 1 is in use or not
> > > +0x2      0x1      descriptor 2 is in use or not
> > > +...      ...      ...
> > > +
> > > +For each descriptor, we use one byte to specify whether it's in use or not.
> > > +
> > >  Protocol features
> > >  -----------------
> > >
> >
> > I think that it's a good idea to have a version in this region.
> > Otherwise how are you going to handle compatibility when
> > this needs to be extended?
> >
> 
> I have put the version into the message's payload: VhostUserInflight. Is it OK?
> 
> Thanks,
> Yongji

I'm not sure I like it.  So is qemu expected to maintain it? Reset it?
Also don't you want to be able to detect that qemu has reset the buffer?
If we have version 1 at a known offset that can serve both purposes.
Given it only has value within the buffer why not store it there?

-- 
MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-15 12:54       ` Michael S. Tsirkin
@ 2019-01-15 14:18         ` Yongji Xie
  2019-01-18  2:45           ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-15 14:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, 15 Jan 2019 at 20:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jan 15, 2019 at 02:46:42PM +0800, Yongji Xie wrote:
> > On Tue, 15 Jan 2019 at 06:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jan 09, 2019 at 07:27:23PM +0800, elohimes@gmail.com wrote:
> > > > @@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
> > > >  slave can send file descriptors (at most 8 descriptors in each message)
> > > >  to master via ancillary data using this fd communication channel.
> > > >
> > > > +Inflight I/O tracking
> > > > +---------------------
> > > > +
> > > > +To support slave reconnecting, slave need to track inflight I/O in a
> > > > +shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
> > > > +are used to transfer the memory between master and slave. And to encourage
> > > > +consistency, we provide a recommended format for this memory:
> > >
> > > I think we should make a stronger statement and actually
> > > just say what the format is. Not recommend it weakly.
> > >
> >
> > Okay, will do it.
> >
> > > > +
> > > > +offset        width    description
> > > > +0x0      0x400    region for queue0
> > > > +0x400    0x400    region for queue1
> > > > +0x800    0x400    region for queue2
> > > > +...      ...      ...
> > > > +
> > > > +For each virtqueue, we have a 1024 bytes region.
> > >
> > >
> > > Why is the size hardcoded? Why not a function of VQ size?
> > >
> >
> > Sorry, I didn't get your point. Should the region's size be fixed? Do
> > you mean we need to document a function for the region's size?
>
>
> Well you are saying 0x0 to 0x400 is for queue0.
> How do you know that's enough? And why are 0x400
> bytes necessary? After all max queue size can be very small.
>
>

OK, I think I get your point. So we need something like:

region size = max_queue_size * 32 bytes + xxx bytes (if any)

Right?
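
In code form, a sketch of that sizing rule (the 32 bytes per descriptor is
only the figure proposed in this mail, and the extra header bytes are left
as a parameter because they are not settled):

    #include <stddef.h>
    #include <stdint.h>

    #define BYTES_PER_DESC 32   /* proposed per-descriptor record size */

    static size_t queue_region_size(uint16_t max_queue_size, size_t header_bytes)
    {
        return (size_t)max_queue_size * BYTES_PER_DESC + header_bytes;
    }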

>
> > >
> > > > The region's format is like:
> > > > +
> > > > +offset   width    description
> > > > +0x0      0x1      descriptor 0 is in use or not
> > > > +0x1      0x1      descriptor 1 is in use or not
> > > > +0x2      0x1      descriptor 2 is in use or not
> > > > +...      ...      ...
> > > > +
> > > > +For each descriptor, we use one byte to specify whether it's in use or not.
> > > > +
> > > >  Protocol features
> > > >  -----------------
> > > >
> > >
> > > I think that it's a good idea to have a version in this region.
> > > Otherwise how are you going to handle compatibility when
> > > this needs to be extended?
> > >
> >
> > I have put the version into the message's payload: VhostUserInflight. Is it OK?
> >
> > Thanks,
> > Yongji
>
> I'm not sure I like it.  So is qemu expected to maintain it? Reset it?
> Also don't you want to be able to detect that qemu has reset the buffer?
> If we have version 1 at a known offset that can serve both purposes.
> Given it only has value within the buffer why not store it there?
>

Yes, that looks better. Will update it in v5.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-15  7:52       ` Jason Wang
@ 2019-01-15 14:51         ` Yongji Xie
  2019-01-17  9:57           ` Jason Wang
  2019-01-15 15:58         ` Michael S. Tsirkin
  1 sibling, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-15 14:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi

On Tue, 15 Jan 2019 at 15:52, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2019/1/11 下午2:10, Yongji Xie wrote:
> > On Fri, 11 Jan 2019 at 11:56, Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2019/1/9 下午7:27, elohimes@gmail.com wrote:
> >>> From: Xie Yongji <xieyongji@baidu.com>
> >>>
> >>> This patch adds support for VHOST_USER_GET_INFLIGHT_FD and
> >>> VHOST_USER_SET_INFLIGHT_FD message to set/get shared memory
> >>> to/from qemu. Then we maintain a "bitmap" of all descriptors in
> >>> the shared memory for each queue to track inflight I/O.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> >>> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> >>> ---
> >>>    Makefile                              |   2 +-
> >>>    contrib/libvhost-user/libvhost-user.c | 258 ++++++++++++++++++++++++--
> >>>    contrib/libvhost-user/libvhost-user.h |  29 +++
> >>>    3 files changed, 268 insertions(+), 21 deletions(-)
> >>>
> >>> diff --git a/Makefile b/Makefile
> >>> index dd53965f77..b5c9092605 100644
> >>> --- a/Makefile
> >>> +++ b/Makefile
> >>> @@ -473,7 +473,7 @@ Makefile: $(version-obj-y)
> >>>    # Build libraries
> >>>
> >>>    libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
> >>> -libvhost-user.a: $(libvhost-user-obj-y)
> >>> +libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
> >>>
> >>>    ######################################################################
> >>>
> >>> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> >>> index 23bd52264c..e73ce04619 100644
> >>> --- a/contrib/libvhost-user/libvhost-user.c
> >>> +++ b/contrib/libvhost-user/libvhost-user.c
> >>> @@ -41,6 +41,8 @@
> >>>    #endif
> >>>
> >>>    #include "qemu/atomic.h"
> >>> +#include "qemu/osdep.h"
> >>> +#include "qemu/memfd.h"
> >>>
> >>>    #include "libvhost-user.h"
> >>>
> >>> @@ -53,6 +55,18 @@
> >>>                _min1 < _min2 ? _min1 : _min2; })
> >>>    #endif
> >>>
> >>> +/* Round number down to multiple */
> >>> +#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> >>> +
> >>> +/* Round number up to multiple */
> >>> +#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> >>> +
> >>> +/* Align each region to cache line size in inflight buffer */
> >>> +#define INFLIGHT_ALIGNMENT 64
> >>> +
> >>> +/* The version of inflight buffer */
> >>> +#define INFLIGHT_VERSION 1
> >>> +
> >>>    #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
> >>>
> >>>    /* The version of the protocol we support */
> >>> @@ -66,6 +80,20 @@
> >>>            }                                       \
> >>>        } while (0)
> >>>
> >>> +static inline
> >>> +bool has_feature(uint64_t features, unsigned int fbit)
> >>> +{
> >>> +    assert(fbit < 64);
> >>> +    return !!(features & (1ULL << fbit));
> >>> +}
> >>> +
> >>> +static inline
> >>> +bool vu_has_feature(VuDev *dev,
> >>> +                    unsigned int fbit)
> >>> +{
> >>> +    return has_feature(dev->features, fbit);
> >>> +}
> >>> +
> >>>    static const char *
> >>>    vu_request_to_string(unsigned int req)
> >>>    {
> >>> @@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
> >>>            REQ(VHOST_USER_POSTCOPY_ADVISE),
> >>>            REQ(VHOST_USER_POSTCOPY_LISTEN),
> >>>            REQ(VHOST_USER_POSTCOPY_END),
> >>> +        REQ(VHOST_USER_GET_INFLIGHT_FD),
> >>> +        REQ(VHOST_USER_SET_INFLIGHT_FD),
> >>>            REQ(VHOST_USER_MAX),
> >>>        };
> >>>    #undef REQ
> >>> @@ -890,6 +920,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
> >>>        return true;
> >>>    }
> >>>
> >>> +static int
> >>> +vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
> >>> +{
> >>> +    int i = 0;
> >>> +
> >>> +    if (!has_feature(dev->protocol_features,
> >>> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> >>> +        return 0;
> >>> +    }
> >>> +
> >>> +    if (unlikely(!vq->inflight)) {
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    vq->used_idx = vq->vring.used->idx;
> >>> +    vq->inflight_num = 0;
> >>> +    for (i = 0; i < vq->vring.num; i++) {
> >>> +        if (vq->inflight->desc[i] == 0) {
> >>> +            continue;
> >>> +        }
> >>> +
> >>> +        vq->inflight_desc[vq->inflight_num++] = i;
> >>> +        vq->inuse++;
> >>> +    }
> >>> +    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
> >>> +
> >>> +    /* in case of I/O hang after reconnecting */
> >>> +    if (eventfd_write(vq->kick_fd, 1) ||
> >>> +        eventfd_write(vq->call_fd, 1)) {
> >>> +        return -1;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>>    static bool
> >>>    vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
> >>>    {
> >>> @@ -925,6 +990,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
> >>>                   dev->vq[index].kick_fd, index);
> >>>        }
> >>>
> >>> +    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
> >>> +        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
> >>> +    }
> >>> +
> >>>        return false;
> >>>    }
> >>>
> >>> @@ -1215,6 +1284,117 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
> >>>        return true;
> >>>    }
> >>>
> >>> +static bool
> >>> +vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> >>> +{
> >>> +    int fd;
> >>> +    void *addr;
> >>> +    uint64_t mmap_size;
> >>> +
> >>> +    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
> >>> +        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
> >>> +        vmsg->payload.inflight.mmap_size = 0;
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n",
> >>> +           vmsg->payload.inflight.num_queues);
> >>> +
> >>> +    mmap_size = vmsg->payload.inflight.num_queues *
> >>> +                ALIGN_UP(sizeof(VuVirtqInflight), INFLIGHT_ALIGNMENT);
> >>> +
> >>> +    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
> >>> +                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> >>> +                            &fd, NULL);
> >>> +
> >>> +    if (!addr) {
> >>> +        vu_panic(dev, "Failed to alloc vhost inflight area");
> >>> +        vmsg->payload.inflight.mmap_size = 0;
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    dev->inflight_info.addr = addr;
> >>> +    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
> >>> +    vmsg->payload.inflight.mmap_offset = 0;
> >>> +    vmsg->payload.inflight.align = INFLIGHT_ALIGNMENT;
> >>> +    vmsg->payload.inflight.version = INFLIGHT_VERSION;
> >>> +    vmsg->fd_num = 1;
> >>> +    dev->inflight_info.fd = vmsg->fds[0] = fd;
> >>> +
> >>> +    DPRINT("send inflight mmap_size: %"PRId64"\n",
> >>> +           vmsg->payload.inflight.mmap_size);
> >>> +    DPRINT("send inflight mmap offset: %"PRId64"\n",
> >>> +           vmsg->payload.inflight.mmap_offset);
> >>> +    DPRINT("send inflight align: %"PRId32"\n",
> >>> +           vmsg->payload.inflight.align);
> >>> +    DPRINT("send inflight version: %"PRId16"\n",
> >>> +           vmsg->payload.inflight.version);
> >>> +
> >>> +    return true;
> >>> +}
> >>> +
> >>> +static bool
> >>> +vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
> >>> +{
> >>> +    int fd, i;
> >>> +    uint64_t mmap_size, mmap_offset;
> >>> +    uint32_t align;
> >>> +    uint16_t num_queues, version;
> >>> +    void *rc;
> >>> +
> >>> +    if (vmsg->fd_num != 1 ||
> >>> +        vmsg->size != sizeof(vmsg->payload.inflight)) {
> >>> +        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
> >>> +                 vmsg->size, vmsg->fd_num);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    fd = vmsg->fds[0];
> >>> +    mmap_size = vmsg->payload.inflight.mmap_size;
> >>> +    mmap_offset = vmsg->payload.inflight.mmap_offset;
> >>> +    align = vmsg->payload.inflight.align;
> >>> +    num_queues = vmsg->payload.inflight.num_queues;
> >>> +    version = vmsg->payload.inflight.version;
> >>> +
> >>> +    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
> >>> +    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
> >>> +    DPRINT("set_inflight_fd align: %"PRId32"\n", align);
> >>> +    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
> >>> +    DPRINT("set_inflight_fd version: %"PRId16"\n", version);
> >>> +
> >>> +    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> >>> +              fd, mmap_offset);
> >>> +
> >>> +    if (rc == MAP_FAILED) {
> >>> +        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    if (version != INFLIGHT_VERSION) {
> >>> +        vu_panic(dev, "Invalid set_inflight_fd version: %d", version);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    if (dev->inflight_info.fd) {
> >>> +        close(dev->inflight_info.fd);
> >>> +    }
> >>> +
> >>> +    if (dev->inflight_info.addr) {
> >>> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> >>> +    }
> >>> +
> >>> +    dev->inflight_info.fd = fd;
> >>> +    dev->inflight_info.addr = rc;
> >>> +    dev->inflight_info.size = mmap_size;
> >>> +
> >>> +    for (i = 0; i < num_queues; i++) {
> >>> +        dev->vq[i].inflight = (VuVirtqInflight *)rc;
> >>> +        rc = (void *)((char *)rc + ALIGN_UP(sizeof(VuVirtqInflight), align));
> >>> +    }
> >>> +
> >>> +    return false;
> >>> +}
> >>> +
> >>>    static bool
> >>>    vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >>>    {
> >>> @@ -1292,6 +1472,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
> >>>            return vu_set_postcopy_listen(dev, vmsg);
> >>>        case VHOST_USER_POSTCOPY_END:
> >>>            return vu_set_postcopy_end(dev, vmsg);
> >>> +    case VHOST_USER_GET_INFLIGHT_FD:
> >>> +        return vu_get_inflight_fd(dev, vmsg);
> >>> +    case VHOST_USER_SET_INFLIGHT_FD:
> >>> +        return vu_set_inflight_fd(dev, vmsg);
> >>>        default:
> >>>            vmsg_close_fds(vmsg);
> >>>            vu_panic(dev, "Unhandled request: %d", vmsg->request);
> >>> @@ -1359,8 +1543,18 @@ vu_deinit(VuDev *dev)
> >>>                close(vq->err_fd);
> >>>                vq->err_fd = -1;
> >>>            }
> >>> +        vq->inflight = NULL;
> >>>        }
> >>>
> >>> +    if (dev->inflight_info.addr) {
> >>> +        munmap(dev->inflight_info.addr, dev->inflight_info.size);
> >>> +        dev->inflight_info.addr = NULL;
> >>> +    }
> >>> +
> >>> +    if (dev->inflight_info.fd > 0) {
> >>> +        close(dev->inflight_info.fd);
> >>> +        dev->inflight_info.fd = -1;
> >>> +    }
> >>>
> >>>        vu_close_log(dev);
> >>>        if (dev->slave_fd != -1) {
> >>> @@ -1687,20 +1881,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
> >>>        return vring_avail_idx(vq) == vq->last_avail_idx;
> >>>    }
> >>>
> >>> -static inline
> >>> -bool has_feature(uint64_t features, unsigned int fbit)
> >>> -{
> >>> -    assert(fbit < 64);
> >>> -    return !!(features & (1ULL << fbit));
> >>> -}
> >>> -
> >>> -static inline
> >>> -bool vu_has_feature(VuDev *dev,
> >>> -                    unsigned int fbit)
> >>> -{
> >>> -    return has_feature(dev->features, fbit);
> >>> -}
> >>> -
> >>>    static bool
> >>>    vring_notify(VuDev *dev, VuVirtq *vq)
> >>>    {
> >>> @@ -1829,12 +2009,6 @@ virtqueue_map_desc(VuDev *dev,
> >>>        *p_num_sg = num_sg;
> >>>    }
> >>>
> >>> -/* Round number down to multiple */
> >>> -#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
> >>> -
> >>> -/* Round number up to multiple */
> >>> -#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
> >>> -
> >>>    static void *
> >>>    virtqueue_alloc_element(size_t sz,
> >>>                                         unsigned out_num, unsigned in_num)
> >>> @@ -1935,9 +2109,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
> >>>        return elem;
> >>>    }
> >>>
> >>> +static int
> >>> +vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
> >>> +{
> >>> +    if (!has_feature(dev->protocol_features,
> >>> +        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> >>> +        return 0;
> >>> +    }
> >>> +
> >>> +    if (unlikely(!vq->inflight)) {
> >>> +        return -1;
> >>> +    }
> >>> +
> >>
> >> Just wonder what happens if backend get killed at this point?
> >>
> > We will re-caculate last_avail_idx like: last_avail_idx = inuse + used_idx
>
>
> I'm not sure I get you here, but it looks to me at least one pending
> descriptor is missed since you don't set vq->inflight->desc[desc_idx] to 1?
>
>
> >
> > At this point, backend could consume this entry correctly after reconnect.
> >
> >> You want to survive from the backend crash but you still depend on
> >> backend to get and put inflight descriptors which seems somehow conflict.
> >>
> > But if backend get killed in vu_queue_inflight_put(), I think you are
> > right, there is something conflict. One descriptor is consumed by
> > guest but still marked as inused in inflight buffer. Then we will
> > re-send this old descriptor after restart.
> >
> > Maybe we can add something like that to fix this issue:
> >
> > void vu_queue_push()
> > {
> >      vq->inflight->elem_idx = elem->idx;
> >      vu_queue_fill();
> >      vu_queue_flush();
> >      vq->inflight->desc[elem->idx] = 0;
>
>
> Does this safe to be killed here?
>
>
> >      vq->inflight->used_idx = vq->vring.used->idx;
> > }
> >
> > static int vu_check_queue_inflights()
> > {
> >      ....
> >      if (vq->inflight->used_idx != vq->vring.used->idx) {
> >          /* Crash in vu_queue_push() */
> >          vq->inflight->desc[vq->inflight->elem_idx] = 0;
> >      }
> >      ....
> > }
> >
> > Thanks,
> > Yongji
>
>
> Well, this may work but here're my points:
>
> 1) The code want to recover from backed crash by introducing extra space
> to store inflight data, but it still depends on the backend to set/get
> the inflight state
>
> 2) Since the backend could be killed at any time, the backend must have
> the ability to recover from the partial inflight state
>
> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> recursive. The above lines show how tricky could the code looks like.
>
> Solving this at vhost-user level through at backend is probably wrong.
> It's time to consider the support from virtio itself.
>

I agree that supporting this at the virtio level may be better. For
example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set, as in
Stefan's proposal. But I still think QEMU should be able to provide
this ability too. Suppose that one vhost-user backend needs to support
multiple VMs. We can't enable the reconnect ability until all VMs' guest
drivers support the new feature, which is limiting. But if QEMU has the
ability to store the inflight buffer, the backend at least has a
chance to support this case. Maybe the backend could find another way to
avoid the tricky code.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-11  8:36                     ` Yongji Xie
@ 2019-01-15 15:39                       ` Daniel P. Berrangé
  2019-01-15 16:53                         ` Yury Kotov
  2019-01-16  5:39                         ` Yongji Xie
  0 siblings, 2 replies; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-15 15:39 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, Jan 11, 2019 at 04:36:11PM +0800, Yongji Xie wrote:
> On Fri, 11 Jan 2019 at 16:32, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
> > > On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > We need to fix qemu_chr_fe_wait_connected so that it does explicit
> > > > synchronization wrt to any ongoing background connection process.
> > > > It must only return once all TLS/telnet/websock handshakes have
> > > > completed.  If we fix that correctly, then I believe it will  also
> > > > solve the problem you're trying to address.
> > > >
> > >
> > > Yes, I think this should be the right way to go. To fix it, my thought
> > > is to track the async QIOChannelSocket in SocketChardev. Then we can
> > > easily get the connection progress in qemu_chr_fe_wait_connected(). Do
> > > you have any suggestion?
> >
> > I've got a few patches that refactor the code to fix this. I'll send them
> > today and CC you on them.
> >
> 
> That would be great! Thank you.

It took me rather longer than expected to fully debug all scenarios, but
I've finally sent patches:

https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-15  7:52       ` Jason Wang
  2019-01-15 14:51         ` Yongji Xie
@ 2019-01-15 15:58         ` Michael S. Tsirkin
  2019-01-17 10:01           ` Jason Wang
  1 sibling, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-15 15:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, Jan 15, 2019 at 03:52:21PM +0800, Jason Wang wrote:
> Well, this may work but here're my points:
> 
> 1) The code want to recover from backed crash by introducing extra space to
> store inflight data, but it still depends on the backend to set/get the
> inflight state
> 
> 2) Since the backend could be killed at any time, the backend must have the
> ability to recover from the partial inflight state
> 
> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> recursive. The above lines show how tricky could the code looks like.

This is a well-studied field. Basically you make sure you commit with an
atomic write. Restartable sequences allow accelerating this even
further.
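
E.g. (a rough, generic sketch with made-up names, not the layout proposed
in this series): write the payload first and let a final, single atomic
store be the commit point; a reader then only trusts entries whose commit
word actually landed.

#include <stdint.h>

/* illustrative journal entry: only becomes valid once 'seq' is stored */
struct commit_entry {
    uint16_t desc_idx;   /* payload, written first */
    uint16_t len;
    uint32_t seq;        /* written last, atomically: the commit point */
};

static void commit(struct commit_entry *e, uint16_t desc_idx,
                   uint16_t len, uint32_t seq)
{
    e->desc_idx = desc_idx;
    e->len = len;
    __atomic_store_n(&e->seq, seq, __ATOMIC_RELEASE);
}

A reader that finds seq unchanged after a crash simply ignores the entry,
so the state is always either fully old or fully new.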

> Solving this at vhost-user level through at backend is probably wrong. It's
> time to consider the support from virtio itself.
> 
> Thanks

I think both approaches have their place.

-- 
MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-15 15:39                       ` Daniel P. Berrangé
@ 2019-01-15 16:53                         ` Yury Kotov
  2019-01-15 17:15                           ` Daniel P. Berrangé
  2019-01-16  5:39                         ` Yongji Xie
  1 sibling, 1 reply; 54+ messages in thread
From: Yury Kotov @ 2019-01-15 16:53 UTC (permalink / raw)
  To: Daniel P. Berrangé, Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

15.01.2019, 18:39, "Daniel P. Berrangé" <berrange@redhat.com>:
> On Fri, Jan 11, 2019 at 04:36:11PM +0800, Yongji Xie wrote:
>>  On Fri, 11 Jan 2019 at 16:32, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>  >
>>  > On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
>>  > > On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>  > > >
>>  > > > We need to fix qemu_chr_fe_wait_connected so that it does explicit
>>  > > > synchronization wrt to any ongoing background connection process.
>>  > > > It must only return once all TLS/telnet/websock handshakes have
>>  > > > completed. If we fix that correctly, then I believe it will also
>>  > > > solve the problem you're trying to address.
>>  > > >
>>  > >
>>  > > Yes, I think this should be the right way to go. To fix it, my thought
>>  > > is to track the async QIOChannelSocket in SocketChardev. Then we can
>>  > > easily get the connection progress in qemu_chr_fe_wait_connected(). Do
>>  > > you have any suggestion?
>>  >
>>  > I've got a few patches that refactor the code to fix this. I'll send them
>>  > today and CC you on them.
>>  >
>>
>>  That would be great! Thank you.
>
> It took me rather longer than expected to fully debug all scenarios, but
> I've finally sent patches:
>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

I didn't do a deep review and may be wrong, but I think the race is still
possible in the hotplug case.

Example:
1. User adds a chardev (qmp: chardev-add) with 'reconnect' option;
2. Some main-loop iterations...
3. Reconnect timer triggers and starts connection thread;
4. User adds a vhost-user-blk (qmp: device-add): device realize -> wait_connected.

Here, there is a chance that we are in the wait_connected and the connection
thread is still running.

Regards,
Yury

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-15 16:53                         ` Yury Kotov
@ 2019-01-15 17:15                           ` Daniel P. Berrangé
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel P. Berrangé @ 2019-01-15 17:15 UTC (permalink / raw)
  To: Yury Kotov
  Cc: Yongji Xie, Michael S. Tsirkin, Marc-André Lureau,
	Jason Wang, Coquelin, Maxime,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, Jan 15, 2019 at 07:53:51PM +0300, Yury Kotov wrote:
> 15.01.2019, 18:39, "Daniel P. Berrangé" <berrange@redhat.com>:
> > On Fri, Jan 11, 2019 at 04:36:11PM +0800, Yongji Xie wrote:
> >>  On Fri, 11 Jan 2019 at 16:32, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>  >
> >>  > On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
> >>  > > On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>  > > >
> >>  > > > We need to fix qemu_chr_fe_wait_connected so that it does explicit
> >>  > > > synchronization wrt to any ongoing background connection process.
> >>  > > > It must only return once all TLS/telnet/websock handshakes have
> >>  > > > completed. If we fix that correctly, then I believe it will also
> >>  > > > solve the problem you're trying to address.
> >>  > > >
> >>  > >
> >>  > > Yes, I think this should be the right way to go. To fix it, my thought
> >>  > > is to track the async QIOChannelSocket in SocketChardev. Then we can
> >>  > > easily get the connection progress in qemu_chr_fe_wait_connected(). Do
> >>  > > you have any suggestion?
> >>  >
> >>  > I've got a few patches that refactor the code to fix this. I'll send them
> >>  > today and CC you on them.
> >>  >
> >>
> >>  That would be great! Thank you.
> >
> > It took me rather longer than expected to fully debug all scenarios, but
> > I've finally sent patches:
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html
> 
> I didn't do a deep review and may be wrong, but I think the race is still
> possible in the hotplug case.
> 
> Example:
> 1. User adds a chardev (qmp: chardev-add) with 'reconnect' option;
> 2. Some main-loop iterations...
> 3. Reconnect timer triggers and starts connection thread;
> 4. User adds a vhost-user-blk (qmp: device-add): device realize -> wait_connected.
> 
> Here, there is a chance that we are in the wait_connected and the connection
> thread is still running.

Hmm, dealing with device-add is tricky. tcp_chr_wait_connected() rather
assumes no main loop is running. I'll have to think about how we could
possibly handle hotplug safely.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets
  2019-01-15 15:39                       ` Daniel P. Berrangé
  2019-01-15 16:53                         ` Yury Kotov
@ 2019-01-16  5:39                         ` Yongji Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Yongji Xie @ 2019-01-16  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Marc-André Lureau, Jason Wang, Coquelin,
	Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, 15 Jan 2019 at 23:39, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Jan 11, 2019 at 04:36:11PM +0800, Yongji Xie wrote:
> > On Fri, 11 Jan 2019 at 16:32, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Fri, Jan 11, 2019 at 03:50:40PM +0800, Yongji Xie wrote:
> > > > On Fri, 11 Jan 2019 at 00:41, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > We need to fix qemu_chr_fe_wait_connected so that it does explicit
> > > > > synchronization wrt to any ongoing background connection process.
> > > > > It must only return once all TLS/telnet/websock handshakes have
> > > > > completed.  If we fix that correctly, then I believe it will  also
> > > > > solve the problem you're trying to address.
> > > > >
> > > >
> > > > Yes, I think this should be the right way to go. To fix it, my thought
> > > > is to track the async QIOChannelSocket in SocketChardev. Then we can
> > > > easily get the connection progress in qemu_chr_fe_wait_connected(). Do
> > > > you have any suggestion?
> > >
> > > I've got a few patches that refactor the code to fix this. I'll send them
> > > today and CC you on them.
> > >
> >
> > That would be great! Thank you.
>
> It took me rather longer than expected to fully debug all scenarios, but
> I've finally sent patches:
>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html
>

I will test my series based on this. Thank you.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-01-14 10:55           ` Yongji Xie
@ 2019-01-16 14:28             ` Stefan Hajnoczi
  0 siblings, 0 replies; 54+ messages in thread
From: Stefan Hajnoczi @ 2019-01-16 14:28 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	nixun, qemu-devel, lilin24, zhangyu31, chaiwen, Xie Yongji

On Mon, Jan 14, 2019 at 06:55:40PM +0800, Yongji Xie wrote:
> On Mon, 14 Jan 2019 at 18:22, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Sat, Jan 12, 2019 at 12:50:12PM +0800, Yongji Xie wrote:
> > > On Fri, 11 Jan 2019 at 23:53, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 10, 2019 at 06:59:27PM +0800, Yongji Xie wrote:
> > > > > On Thu, 10 Jan 2019 at 18:25, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jan 09, 2019 at 07:27:21PM +0800, elohimes@gmail.com wrote:
> > > > I'm concerned that this approach to device recovery is invasive and hard
> > > > to test.  Instead I would use VIRTIO's Device Status Field
> > > > DEVICE_NEEDS_RESET bit to tell the guest driver that a reset is
> > > > necessary.  This is more disruptive - drivers either have to resubmit or
> > > > fail I/O with EIO - but it's also simple and more likely to work
> > > > correctly (it only needs to be implemented correctly in the guest
> > > > driver, not in the many available vhost-user backend implementations).
> > > >
> > >
> > > So you mean adding one way to notify guest to resubmit inflight I/O. I
> > > think it's a good idea. But would it be more flexible to implement
> > > this in backend. We can support old guest. And it will be easy to fix
> > > bug or add feature.
> >
> > There are trade-offs with either approach.  In the long run I think it's
> > beneficial minimize non-trivial logic in vhost-user backends.  There
> > will be more vhost-user backend implementations and therefore more bugs
> > if we put the logic into the backend.  This is why I think a simple
> > mechanism for marking the device as needing a reset will be more
> > reliable and less trouble.
> >
> 
> I agree. So is it possible to implement both? In the long run,
> updating guest driver to support this is better. And at that time, the
> logic in backend can be only used to support legacy guest driver.

Yes, in theory both approaches could be available.

Stefan


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-15 14:51         ` Yongji Xie
@ 2019-01-17  9:57           ` Jason Wang
  2019-01-17 14:59             ` Michael S. Tsirkin
  2019-01-18  3:32             ` Yongji Xie
  0 siblings, 2 replies; 54+ messages in thread
From: Jason Wang @ 2019-01-17  9:57 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi


On 2019/1/15 10:51 PM, Yongji Xie wrote:
>> Well, this may work but here're my points:
>>
>> 1) The code want to recover from backed crash by introducing extra space
>> to store inflight data, but it still depends on the backend to set/get
>> the inflight state
>>
>> 2) Since the backend could be killed at any time, the backend must have
>> the ability to recover from the partial inflight state
>>
>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
>> recursive. The above lines show how tricky could the code looks like.
>>
>> Solving this at vhost-user level through at backend is probably wrong.
>> It's time to consider the support from virtio itself.
>>
> I agree that supporting this in virtio level may be better. For
> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> Stefan's proposal. But I still think QEMU should be able to provide
> this ability too. Supposed that one vhost-user backend need to support
> multiple VMs. We can't enable reconnect ability until all VMs' guest
> driver support the new feature. It's limited.


That's the way virtio evolves.


>   But if QEMU have the
> ability to store inflight buffer, the backend could at least have a
> chance to support this case.


The problem is, you need a carefully designed protocol described somewhere
(is vhost-user.txt a good place for this?). And this work will be
(partially) duplicated by the future support from the virtio spec itself.


> Maybe backend could have other way to
> avoid the tricky code.


I'm not sure, but it's probably not easy.

Thanks


>
> Thanks,

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-15 15:58         ` Michael S. Tsirkin
@ 2019-01-17 10:01           ` Jason Wang
  2019-01-17 15:04             ` Michael S. Tsirkin
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Wang @ 2019-01-17 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji


On 2019/1/15 11:58 PM, Michael S. Tsirkin wrote:
> On Tue, Jan 15, 2019 at 03:52:21PM +0800, Jason Wang wrote:
>> Well, this may work but here're my points:
>>
>> 1) The code want to recover from backed crash by introducing extra space to
>> store inflight data, but it still depends on the backend to set/get the
>> inflight state
>>
>> 2) Since the backend could be killed at any time, the backend must have the
>> ability to recover from the partial inflight state
>>
>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
>> recursive. The above lines show how tricky could the code looks like.
> This is a well studied field. Basically you make sure you commit with an
> atomic write.  Restartable sequences allow accelerating this even
> further.


I'm not sure I get this. But the issue is to exactly deduce all the
inflight descriptors even if the backend could be killed while doing the
logging. If we cannot be 100% accurate, it has much less value.


>
>> Solving this at vhost-user level through at backend is probably wrong. It's
>> time to consider the support from virtio itself.
>>
>> Thanks
> I think both approaches have their space.


But there will be a lot of duplicated work if we decide to support it 
from virtio.

Thanks


>
> -- MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-17  9:57           ` Jason Wang
@ 2019-01-17 14:59             ` Michael S. Tsirkin
  2019-01-18  3:57               ` Jason Wang
  2019-01-18  3:32             ` Yongji Xie
  1 sibling, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-17 14:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi

On Thu, Jan 17, 2019 at 05:57:29PM +0800, Jason Wang wrote:
> 
> On 2019/1/15 下午10:51, Yongji Xie wrote:
> > > Well, this may work but here're my points:
> > > 
> > > 1) The code want to recover from backed crash by introducing extra space
> > > to store inflight data, but it still depends on the backend to set/get
> > > the inflight state
> > > 
> > > 2) Since the backend could be killed at any time, the backend must have
> > > the ability to recover from the partial inflight state
> > > 
> > > So it looks to me 1) tends to be self-contradictory and 2) tends to be
> > > recursive. The above lines show how tricky could the code looks like.
> > > 
> > > Solving this at vhost-user level through at backend is probably wrong.
> > > It's time to consider the support from virtio itself.
> > > 
> > I agree that supporting this in virtio level may be better. For
> > example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> > Stefan's proposal. But I still think QEMU should be able to provide
> > this ability too. Supposed that one vhost-user backend need to support
> > multiple VMs. We can't enable reconnect ability until all VMs' guest
> > driver support the new feature. It's limited.
> 
> 
> That's the way virtio evolves.
> 
> 
> >   But if QEMU have the
> > ability to store inflight buffer, the backend could at least have a
> > chance to support this case.
> 
> 
> The problem is, you need a careful designed protocol described somewhere (is
> vhost-user.txt a good place for this?). And this work will be (partial)
> duplicated for the future support from virtio spec itself.
> 
> 
> > Maybe backend could have other way to
> > avoid the tricky code.
> 
> 
> I'm not sure, but it was probably not easy.

I see an implementation in libvhost-user.

Do you see an issue there?


> Thanks
> 
> 
> > 
> > Thanks,

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-17 10:01           ` Jason Wang
@ 2019-01-17 15:04             ` Michael S. Tsirkin
  0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-17 15:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, Jan 17, 2019 at 06:01:16PM +0800, Jason Wang wrote:
> 
> On 2019/1/15 下午11:58, Michael S. Tsirkin wrote:
> > On Tue, Jan 15, 2019 at 03:52:21PM +0800, Jason Wang wrote:
> > > Well, this may work but here're my points:
> > > 
> > > 1) The code want to recover from backed crash by introducing extra space to
> > > store inflight data, but it still depends on the backend to set/get the
> > > inflight state
> > > 
> > > 2) Since the backend could be killed at any time, the backend must have the
> > > ability to recover from the partial inflight state
> > > 
> > > So it looks to me 1) tends to be self-contradictory and 2) tends to be
> > > recursive. The above lines show how tricky could the code looks like.
> > This is a well studied field. Basically you make sure you commit with an
> > atomic write.  Restartable sequences allow accelerating this even
> > further.
> 
> 
> I'm not sure I get this. But the issue is to exactly deduce all the inflight
> descriptors even if backend could be killed when doing the logging. If we
> could not be 100% accurate, it's have much less value.


I agree. But why discuss theoretical issues?
Can you point out a problem in the contrib/ code
included here?

If yes, it must be fixed, I think.

I personally think it's not too hard.
Consider the packed ring, for example: just maintain a list of the inflight
descriptors and, as the last step, write out the flags atomically.
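
Roughly like this (a sketch with locally defined constants, not actual
QEMU code; only the AVAIL/USED bits of the flags are shown and the
separate inflight list is omitted):

#include <stdint.h>
#include <stdbool.h>

struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

#define PACKED_DESC_F_AVAIL (1 << 7)
#define PACKED_DESC_F_USED  (1 << 15)

static void push_used_packed(struct vring_packed_desc *ring, uint16_t slot,
                             uint16_t id, uint32_t len, bool used_wrap)
{
    uint16_t flags = used_wrap ?
        (PACKED_DESC_F_AVAIL | PACKED_DESC_F_USED) : 0;

    ring[slot].id  = id;
    ring[slot].len = len;
    /* last step: one atomic store of the flags publishes the used element;
     * everything written before it becomes valid at that instant */
    __atomic_store_n(&ring[slot].flags, flags, __ATOMIC_RELEASE);
}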




> 
> > 
> > > Solving this at vhost-user level through at backend is probably wrong. It's
> > > time to consider the support from virtio itself.
> > > 
> > > Thanks
> > I think both approaches have their space.
> 
> 
> But there will be a lot of duplicated work if we decide to support it from
> virtio.
> 
> Thanks
> 
> 
> > 
> > -- MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-01-15 14:18         ` Yongji Xie
@ 2019-01-18  2:45           ` Yongji Xie
  0 siblings, 0 replies; 54+ messages in thread
From: Yongji Xie @ 2019-01-18  2:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Tue, 15 Jan 2019 at 22:18, Yongji Xie <elohimes@gmail.com> wrote:
>
> On Tue, 15 Jan 2019 at 20:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Jan 15, 2019 at 02:46:42PM +0800, Yongji Xie wrote:
> > > On Tue, 15 Jan 2019 at 06:25, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Wed, Jan 09, 2019 at 07:27:23PM +0800, elohimes@gmail.com wrote:
> > > > > @@ -382,6 +397,30 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
> > > > >  slave can send file descriptors (at most 8 descriptors in each message)
> > > > >  to master via ancillary data using this fd communication channel.
> > > > >
> > > > > +Inflight I/O tracking
> > > > > +---------------------
> > > > > +
> > > > > +To support slave reconnecting, slave need to track inflight I/O in a
> > > > > +shared memory. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD
> > > > > +are used to transfer the memory between master and slave. And to encourage
> > > > > +consistency, we provide a recommended format for this memory:
> > > >
> > > > I think we should make a stronger statement and actually
> > > > just say what the format is. Not recommend it weakly.
> > > >
> > >
> > > Okey, will do it.
> > >
> > > > > +
> > > > > +offset        width    description
> > > > > +0x0      0x400    region for queue0
> > > > > +0x400    0x400    region for queue1
> > > > > +0x800    0x400    region for queue2
> > > > > +...      ...      ...
> > > > > +
> > > > > +For each virtqueue, we have a 1024 bytes region.
> > > >
> > > >
> > > > Why is the size hardcoded? Why not a function of VQ size?
> > > >
> > >
> > > Sorry, I didn't get your point. Should the region's size be fixed? Do
> > > you mean we need to document a function for the region's size?
> >
> >
> > Well you are saying 0x0 to 0x400 is for queue0.
> > How do you know that's enough? And why are 0x400
> > bytes necessary? After all max queue size can be very small.
> >
> >
>
> OK, I think I get your point. So we need something like:
>
> region's size = max_queue_size * 32 byte + xxx byte (if any)
>
> Right?
>
> >
> > > >
> > > > > The region's format is like:
> > > > > +
> > > > > +offset   width    description
> > > > > +0x0      0x1      descriptor 0 is in use or not
> > > > > +0x1      0x1      descriptor 1 is in use or not
> > > > > +0x2      0x1      descriptor 2 is in use or not
> > > > > +...      ...      ...
> > > > > +
> > > > > +For each descriptor, we use one byte to specify whether it's in use or not.
> > > > > +
> > > > >  Protocol features
> > > > >  -----------------
> > > > >
> > > >
> > > > I think that it's a good idea to have a version in this region.
> > > > Otherwise how are you going to handle compatibility when
> > > > this needs to be extended?
> > > >
> > >
> > > I have put the version into the message's payload: VhostUserInflight. Is it OK?
> > >
> > > Thanks,
> > > Yongji
> >
> > I'm not sure I like it.  So is qemu expected to maintain it? Reset it?
> > Also don't you want to be able to detect that qemu has reset the buffer?
> > If we have version 1 at a known offset that can serve both purposes.
> > Given it only has value within the buffer why not store it there?
> >
>
> Yes, that looks better. Will update it in v5.
>

Hi Michael,

I found a problem while implementing this. If we put the version into
the shared buffer, QEMU will reset it when the VM is reset. Then if the
backend restarts at the same time, the version of this buffer will be
lost. So maybe QEMU still needs to maintain it.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-17  9:57           ` Jason Wang
  2019-01-17 14:59             ` Michael S. Tsirkin
@ 2019-01-18  3:32             ` Yongji Xie
  2019-01-18  3:56               ` Michael S. Tsirkin
  2019-01-18  3:59               ` Jason Wang
  1 sibling, 2 replies; 54+ messages in thread
From: Yongji Xie @ 2019-01-18  3:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Marc-André Lureau,
	Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi

On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2019/1/15 下午10:51, Yongji Xie wrote:
> >> Well, this may work but here're my points:
> >>
> >> 1) The code want to recover from backed crash by introducing extra space
> >> to store inflight data, but it still depends on the backend to set/get
> >> the inflight state
> >>
> >> 2) Since the backend could be killed at any time, the backend must have
> >> the ability to recover from the partial inflight state
> >>
> >> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> >> recursive. The above lines show how tricky could the code looks like.
> >>
> >> Solving this at vhost-user level through at backend is probably wrong.
> >> It's time to consider the support from virtio itself.
> >>
> > I agree that supporting this in virtio level may be better. For
> > example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> > Stefan's proposal. But I still think QEMU should be able to provide
> > this ability too. Supposed that one vhost-user backend need to support
> > multiple VMs. We can't enable reconnect ability until all VMs' guest
> > driver support the new feature. It's limited.
>
>
> That's the way virtio evolves.
>
>
> >   But if QEMU have the
> > ability to store inflight buffer, the backend could at least have a
> > chance to support this case.
>
>
> The problem is, you need a careful designed protocol described somewhere

That's what we should discuss in detail in this series.

> (is vhost-user.txt a good place for this?). And this work will be
> (partial) duplicated for the future support from virtio spec itself.
>

I think the duplicated code is the maintenance of the inflight descriptor
list, which should be in the backend. That's not the main work in this
series. And the backend could choose to include it or not.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  3:32             ` Yongji Xie
@ 2019-01-18  3:56               ` Michael S. Tsirkin
  2019-01-18  3:59               ` Jason Wang
  1 sibling, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-18  3:56 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi

On Fri, Jan 18, 2019 at 11:32:03AM +0800, Yongji Xie wrote:
> On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2019/1/15 下午10:51, Yongji Xie wrote:
> > >> Well, this may work but here're my points:
> > >>
> > >> 1) The code want to recover from backed crash by introducing extra space
> > >> to store inflight data, but it still depends on the backend to set/get
> > >> the inflight state
> > >>
> > >> 2) Since the backend could be killed at any time, the backend must have
> > >> the ability to recover from the partial inflight state
> > >>
> > >> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> > >> recursive. The above lines show how tricky could the code looks like.
> > >>
> > >> Solving this at vhost-user level through at backend is probably wrong.
> > >> It's time to consider the support from virtio itself.
> > >>
> > > I agree that supporting this in virtio level may be better. For
> > > example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> > > Stefan's proposal. But I still think QEMU should be able to provide
> > > this ability too. Supposed that one vhost-user backend need to support
> > > multiple VMs. We can't enable reconnect ability until all VMs' guest
> > > driver support the new feature. It's limited.
> >
> >
> > That's the way virtio evolves.
> >
> >
> > >   But if QEMU have the
> > > ability to store inflight buffer, the backend could at least have a
> > > chance to support this case.
> >
> >
> > The problem is, you need a careful designed protocol described somewhere
> 
> That's what we should discuss in detail in this series.
> 
> > (is vhost-user.txt a good place for this?). And this work will be
> > (partial) duplicated for the future support from virtio spec itself.
> >
> 
> I think the duplicated code is to maintain the inflight descriptor
> list which should be in backend. That's not main work in this series.
> And backend could choose to include it or not.
> 
> Thanks,
> Yongji

It would be, if someone volunteered to rewrite the informal vhost-user
description that we have in qemu and make it a full spec.
So far the text plus the implementation in contrib/ seems plenty to me.

-- 
MST

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-17 14:59             ` Michael S. Tsirkin
@ 2019-01-18  3:57               ` Jason Wang
  0 siblings, 0 replies; 54+ messages in thread
From: Jason Wang @ 2019-01-18  3:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yongji Xie, Marc-André Lureau, Daniel P. Berrangé,
	Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji,
	Stefan Hajnoczi


On 2019/1/17 10:59 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 17, 2019 at 05:57:29PM +0800, Jason Wang wrote:
>> On 2019/1/15 下午10:51, Yongji Xie wrote:
>>>> Well, this may work but here're my points:
>>>>
>>>> 1) The code want to recover from backed crash by introducing extra space
>>>> to store inflight data, but it still depends on the backend to set/get
>>>> the inflight state
>>>>
>>>> 2) Since the backend could be killed at any time, the backend must have
>>>> the ability to recover from the partial inflight state
>>>>
>>>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
>>>> recursive. The above lines show how tricky could the code looks like.
>>>>
>>>> Solving this at vhost-user level through at backend is probably wrong.
>>>> It's time to consider the support from virtio itself.
>>>>
>>> I agree that supporting this in virtio level may be better. For
>>> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
>>> Stefan's proposal. But I still think QEMU should be able to provide
>>> this ability too. Supposed that one vhost-user backend need to support
>>> multiple VMs. We can't enable reconnect ability until all VMs' guest
>>> driver support the new feature. It's limited.
>>
>> That's the way virtio evolves.
>>
>>
>>>    But if QEMU have the
>>> ability to store inflight buffer, the backend could at least have a
>>> chance to support this case.
>>
>> The problem is, you need a careful designed protocol described somewhere (is
>> vhost-user.txt a good place for this?). And this work will be (partial)
>> duplicated for the future support from virtio spec itself.
>>
>>
>>> Maybe backend could have other way to
>>> avoid the tricky code.
>>
>> I'm not sure, but it was probably not easy.
> I see an implementation in libvhost-user.
>
> Do you see an issue there?


I've asked some questions in this thread; it looks like we can still
miss some inflight descriptors.

Thanks


>
>
>> Thanks
>>
>>
>>> Thanks,

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  3:32             ` Yongji Xie
  2019-01-18  3:56               ` Michael S. Tsirkin
@ 2019-01-18  3:59               ` Jason Wang
  2019-01-18  4:03                 ` Michael S. Tsirkin
  2019-01-18  7:01                 ` Yongji Xie
  1 sibling, 2 replies; 54+ messages in thread
From: Jason Wang @ 2019-01-18  3:59 UTC (permalink / raw)
  To: Yongji Xie
  Cc: zhangyu31, Michael S. Tsirkin, Xie Yongji, qemu-devel, lilin24,
	Yury Kotov, Coquelin, Maxime, chaiwen, Stefan Hajnoczi,
	Marc-André Lureau, nixun,
	Евгений
	Яковлев


On 2019/1/18 11:32 AM, Yongji Xie wrote:
> On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2019/1/15 下午10:51, Yongji Xie wrote:
>>>> Well, this may work but here're my points:
>>>>
>>>> 1) The code want to recover from backed crash by introducing extra space
>>>> to store inflight data, but it still depends on the backend to set/get
>>>> the inflight state
>>>>
>>>> 2) Since the backend could be killed at any time, the backend must have
>>>> the ability to recover from the partial inflight state
>>>>
>>>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
>>>> recursive. The above lines show how tricky could the code looks like.
>>>>
>>>> Solving this at vhost-user level through at backend is probably wrong.
>>>> It's time to consider the support from virtio itself.
>>>>
>>> I agree that supporting this in virtio level may be better. For
>>> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
>>> Stefan's proposal. But I still think QEMU should be able to provide
>>> this ability too. Supposed that one vhost-user backend need to support
>>> multiple VMs. We can't enable reconnect ability until all VMs' guest
>>> driver support the new feature. It's limited.
>>
>> That's the way virtio evolves.
>>
>>
>>>    But if QEMU have the
>>> ability to store inflight buffer, the backend could at least have a
>>> chance to support this case.
>>
>> The problem is, you need a careful designed protocol described somewhere
> That's what we should discuss in detail in this series.


Well, I asked some questions about this patch, but it looks like they are
still not answered. No?


>
>> (is vhost-user.txt a good place for this?). And this work will be
>> (partial) duplicated for the future support from virtio spec itself.
>>
> I think the duplicated code is to maintain the inflight descriptor
> list which should be in backend. That's not main work in this series.
> And backend could choose to include it or not.


You need documentation describing the protocol. Otherwise, it
would be very hard for other backends to implement it.

Thanks


>
> Thanks,
> Yongji
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  3:59               ` Jason Wang
@ 2019-01-18  4:03                 ` Michael S. Tsirkin
  2019-01-18  7:01                 ` Yongji Xie
  1 sibling, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2019-01-18  4:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yongji Xie, zhangyu31, Xie Yongji, qemu-devel, lilin24,
	Yury Kotov, Coquelin, Maxime, chaiwen, Stefan Hajnoczi,
	Marc-André Lureau, nixun,
	Евгений
	Яковлев

On Fri, Jan 18, 2019 at 11:59:50AM +0800, Jason Wang wrote:
> 
> On 2019/1/18 上午11:32, Yongji Xie wrote:
> > On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
> > > 
> > > On 2019/1/15 下午10:51, Yongji Xie wrote:
> > > > > Well, this may work but here're my points:
> > > > > 
> > > > > 1) The code want to recover from backed crash by introducing extra space
> > > > > to store inflight data, but it still depends on the backend to set/get
> > > > > the inflight state
> > > > > 
> > > > > 2) Since the backend could be killed at any time, the backend must have
> > > > > the ability to recover from the partial inflight state
> > > > > 
> > > > > So it looks to me 1) tends to be self-contradictory and 2) tends to be
> > > > > recursive. The above lines show how tricky could the code looks like.
> > > > > 
> > > > > Solving this at vhost-user level through at backend is probably wrong.
> > > > > It's time to consider the support from virtio itself.
> > > > > 
> > > > I agree that supporting this in virtio level may be better. For
> > > > example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> > > > Stefan's proposal. But I still think QEMU should be able to provide
> > > > this ability too. Supposed that one vhost-user backend need to support
> > > > multiple VMs. We can't enable reconnect ability until all VMs' guest
> > > > driver support the new feature. It's limited.
> > > 
> > > That's the way virtio evolves.
> > > 
> > > 
> > > >    But if QEMU have the
> > > > ability to store inflight buffer, the backend could at least have a
> > > > chance to support this case.
> > > 
> > > The problem is, you need a careful designed protocol described somewhere
> > That's what we should discuss in detail in this series.
> 
> 
> Well, I ask some questions for this patch, but it looks like they were still
> not answered. No?
> 

Oh absolutely. I can't say I like the implementation;
I think it's neither the most robust nor optimal.

> > 
> > > (is vhost-user.txt a good place for this?). And this work will be
> > > (partial) duplicated for the future support from virtio spec itself.
> > > 
> > I think the duplicated code is to maintain the inflight descriptor
> > list which should be in backend. That's not main work in this series.
> > And backend could choose to include it or not.
> 
> 
> You need to have a documentation to describe the protocol. Otherwise, it
> would be very hard for other backend to implement.
> 
> Thanks


Meaning how the inflight descriptors are saved in the buffer.
Yes, I agree.

> 
> > 
> > Thanks,
> > Yongji
> > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  3:59               ` Jason Wang
  2019-01-18  4:03                 ` Michael S. Tsirkin
@ 2019-01-18  7:01                 ` Yongji Xie
  2019-01-18  9:26                   ` Jason Wang
  1 sibling, 1 reply; 54+ messages in thread
From: Yongji Xie @ 2019-01-18  7:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: zhangyu31, Michael S. Tsirkin, Xie Yongji, qemu-devel, lilin24,
	Yury Kotov, Coquelin, Maxime, chaiwen, Stefan Hajnoczi,
	Marc-André Lureau, nixun,
	Евгений
	Яковлев

On Fri, 18 Jan 2019 at 12:00, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2019/1/18 上午11:32, Yongji Xie wrote:
> > On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2019/1/15 下午10:51, Yongji Xie wrote:
> >>>> Well, this may work but here're my points:
> >>>>
> >>>> 1) The code want to recover from backed crash by introducing extra space
> >>>> to store inflight data, but it still depends on the backend to set/get
> >>>> the inflight state
> >>>>
> >>>> 2) Since the backend could be killed at any time, the backend must have
> >>>> the ability to recover from the partial inflight state
> >>>>
> >>>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> >>>> recursive. The above lines show how tricky could the code looks like.
> >>>>
> >>>> Solving this at vhost-user level through at backend is probably wrong.
> >>>> It's time to consider the support from virtio itself.
> >>>>
> >>> I agree that supporting this in virtio level may be better. For
> >>> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> >>> Stefan's proposal. But I still think QEMU should be able to provide
> >>> this ability too. Supposed that one vhost-user backend need to support
> >>> multiple VMs. We can't enable reconnect ability until all VMs' guest
> >>> driver support the new feature. It's limited.
> >>
> >> That's the way virtio evolves.
> >>
> >>
> >>>    But if QEMU have the
> >>> ability to store inflight buffer, the backend could at least have a
> >>> chance to support this case.
> >>
> >> The problem is, you need a careful designed protocol described somewhere
> > That's what we should discuss in detail in this series.
>
>
> Well, I ask some questions for this patch, but it looks like they were
> still not answered. No?
>
>

Oh, sorry, I missed those questions. Let me try to answer them here.

Q1: If the backend gets killed in vu_queue_inflight_get() without setting
vq->inflight->desc[desc_idx] to 1, is there any problem?

The entry that stores the head of this inflight descriptor is not
lost from the avail ring, so we can still get this inflight descriptor
from the avail ring even though we didn't set vq->inflight->desc[desc_idx] to 1.
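
As a minimal illustration of why the head is not lost (the field names follow
libvhost-user's split-ring layout; this snippet is an assumption for the
discussion, not code from the series): the guest writes the head of each
submitted chain into the avail ring, and that slot is not reused while the
chain is still outstanding, so after a reconnect the backend can simply read
the same head again:

    /* idx is the avail-ring position being consumed (the backend's last_avail_idx) */
    uint16_t head = vq->vring.avail->ring[idx % vq->vring.num];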

Q2:
void vu_queue_push()
{
    vq->inflight->elem_idx = elem->idx;
    vu_queue_fill();
    vu_queue_flush();
    vq->inflight->desc[elem->idx] = 0;
                                        <-------- Is it safe for the backend to be killed here?
    vq->inflight->used_idx = vq->vring.used->idx;
}

Because there is no concurrency between vu_queue_push() and
vu_queue_pop(), I don't see any problem here.

Basically we just need to make sure these two operations
(vq->vring.used->idx++ and vq->inflight->desc[elem->idx] = 0) are
effectively atomic. I think there are several approaches to achieve that.
I'm not sure the approach here is good enough, but it should work.
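
A sketch of the recovery-side check this implies, run when the backend
reconnects; the helper name is made up here and the inflight fields are the
ones from the pseudo-code above, not a final libvhost-user API:

static void vq_fixup_inflight_after_restart(VuVirtq *vq)
{
    /* If the backend died between vu_queue_flush() and the bookkeeping that
     * follows it, vring.used->idx has already advanced but inflight->used_idx
     * has not, so the element recorded in inflight->elem_idx was already
     * visible to the guest and must not be resubmitted. */
    if (vq->inflight->used_idx != vq->vring.used->idx) {
        vq->inflight->desc[vq->inflight->elem_idx] = 0;
        vq->inflight->used_idx = vq->vring.used->idx;
    }
}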

> >
> >> (is vhost-user.txt a good place for this?). And this work will be
> >> (partial) duplicated for the future support from virtio spec itself.
> >>
> > I think the duplicated code is to maintain the inflight descriptor
> > list which should be in backend. That's not main work in this series.
> > And backend could choose to include it or not.
>
>
> You need to have a documentation to describe the protocol. Otherwise, it
> would be very hard for other backend to implement.
>

Yes, actually now I'm working on adding more detail to vhost-user.txt.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  7:01                 ` Yongji Xie
@ 2019-01-18  9:26                   ` Jason Wang
  2019-01-19 12:19                     ` Yongji Xie
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Wang @ 2019-01-18  9:26 UTC (permalink / raw)
  To: Yongji Xie
  Cc: zhangyu31, Michael S. Tsirkin, Xie Yongji, qemu-devel, lilin24,
	Yury Kotov, Coquelin, Maxime, chaiwen, Stefan Hajnoczi,
	Marc-André Lureau, nixun,
	Евгений
	Яковлев


On 2019/1/18 下午3:01, Yongji Xie wrote:
> On Fri, 18 Jan 2019 at 12:00, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2019/1/18 上午11:32, Yongji Xie wrote:
>>> On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2019/1/15 下午10:51, Yongji Xie wrote:
>>>>>> Well, this may work but here're my points:
>>>>>>
>>>>>> 1) The code want to recover from backed crash by introducing extra space
>>>>>> to store inflight data, but it still depends on the backend to set/get
>>>>>> the inflight state
>>>>>>
>>>>>> 2) Since the backend could be killed at any time, the backend must have
>>>>>> the ability to recover from the partial inflight state
>>>>>>
>>>>>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
>>>>>> recursive. The above lines show how tricky could the code looks like.
>>>>>>
>>>>>> Solving this at vhost-user level through at backend is probably wrong.
>>>>>> It's time to consider the support from virtio itself.
>>>>>>
>>>>> I agree that supporting this in virtio level may be better. For
>>>>> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
>>>>> Stefan's proposal. But I still think QEMU should be able to provide
>>>>> this ability too. Supposed that one vhost-user backend need to support
>>>>> multiple VMs. We can't enable reconnect ability until all VMs' guest
>>>>> driver support the new feature. It's limited.
>>>> That's the way virtio evolves.
>>>>
>>>>
>>>>>     But if QEMU have the
>>>>> ability to store inflight buffer, the backend could at least have a
>>>>> chance to support this case.
>>>> The problem is, you need a careful designed protocol described somewhere
>>> That's what we should discuss in detail in this series.
>>
>> Well, I ask some questions for this patch, but it looks like they were
>> still not answered. No?
>>
>>
> Oh, sorry, I missed those questions. Let me try to answer them here.
>
> Q1: If backend get killed in vu_queue_inflight_get() without setting
> vq->inflight->desc[desc_idx] to 1, is there any problem?
>
> The entry which stores the head of this inflight descriptor is not
> lost in avail ring. So we can still get this inflight descriptor from
> avail ring although we didn't set vq->inflight->desc[desc_idx] to 1.


Ok I get this.


> Q2:
> void vu_queue_push()
> {
>      vq->inflight->elem_idx = elem->idx;
>      vu_queue_fill();
>      vu_queue_flush();
>      vq->inflight->desc[elem->idx] = 0;
>                                                      <-------- Does
> this safe to be killed here?
>      vq->inflight->used_idx = vq->vring.used->idx;
> }
>
> Because there are no concurrency between vu_queue_push() and
> vu_queue_pop(), I don't see any problem here.
>
> Basically we just need to make sure this two operations
> (vq->vring.used->idx++ and vq->inflight->desc[elem->idx] = 0) are
> atomic. I think there are some approach to achieve that. I'm not sure
> my approach here is good enough, but it should work.


Rethinking about this, some findings:

- What you suggest requires strict ordering in some parts of
vu_queue_push(). E.g. it looks to me like you need a compiler barrier to
make sure used_idx is set before clearing desc[elem->idx]? This is a side
effect of introducing new metadata (inflight->elem_idx and
inflight->used_idx) to recover from a crash while logging the
inflight->desc[] array.

- Modern backends like dpdk do batching aggressively, which means you 
probably need an array of elem_idx[]. This tends to be more complex 
since we probably need to negotiate the size of this array and the 
overhead is probably noticeable.

I haven't audited all the other places in the code, but I suspect it would
be hard to be 100% correct. And what happens for packed virtqueues?
Basically, we don't want to produce tricky code that is hard to debug in
the future. Thinking in another direction, in-order completion will be
supported by virtio 1.1. With that, the inflight descriptors could be
deduced much more easily, e.g. as [used_idx, avail_idx). So it looks to me
that supporting this from the virtio layer is much easier.
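
As a rough illustration of that last point (this assumes VIRTIO_F_IN_ORDER,
which is not part of this series, and that the device writes one used entry
per chain): with in-order completion, the inflight chains are exactly the
ones whose heads sit at avail-ring positions [used_idx, avail_idx), so they
can be enumerated without any per-descriptor log:

    for (uint16_t i = vq->vring.used->idx; i != vq->vring.avail->idx; i++) {
        uint16_t head = vq->vring.avail->ring[i % vq->vring.num];
        /* re-process the descriptor chain starting at 'head' */
    }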

Thoughts?

Thanks


>
>>>> (is vhost-user.txt a good place for this?). And this work will be
>>>> (partial) duplicated for the future support from virtio spec itself.
>>>>
>>> I think the duplicated code is to maintain the inflight descriptor
>>> list which should be in backend. That's not main work in this series.
>>> And backend could choose to include it or not.
>>
>> You need to have a documentation to describe the protocol. Otherwise, it
>> would be very hard for other backend to implement.
>>
> Yes, actually now I'm working on adding more detail to vhost-user.txt.
>
> Thanks,
> Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-01-18  9:26                   ` Jason Wang
@ 2019-01-19 12:19                     ` Yongji Xie
  0 siblings, 0 replies; 54+ messages in thread
From: Yongji Xie @ 2019-01-19 12:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: zhangyu31, Michael S. Tsirkin, Xie Yongji, qemu-devel, lilin24,
	Yury Kotov, Coquelin, Maxime, chaiwen, Stefan Hajnoczi,
	Marc-André Lureau, nixun,
	Евгений
	Яковлев

On Fri, 18 Jan 2019 at 17:27, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2019/1/18 下午3:01, Yongji Xie wrote:
> > On Fri, 18 Jan 2019 at 12:00, Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2019/1/18 上午11:32, Yongji Xie wrote:
> >>> On Thu, 17 Jan 2019 at 17:57, Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2019/1/15 下午10:51, Yongji Xie wrote:
> >>>>>> Well, this may work but here're my points:
> >>>>>>
> >>>>>> 1) The code want to recover from backed crash by introducing extra space
> >>>>>> to store inflight data, but it still depends on the backend to set/get
> >>>>>> the inflight state
> >>>>>>
> >>>>>> 2) Since the backend could be killed at any time, the backend must have
> >>>>>> the ability to recover from the partial inflight state
> >>>>>>
> >>>>>> So it looks to me 1) tends to be self-contradictory and 2) tends to be
> >>>>>> recursive. The above lines show how tricky could the code looks like.
> >>>>>>
> >>>>>> Solving this at vhost-user level through at backend is probably wrong.
> >>>>>> It's time to consider the support from virtio itself.
> >>>>>>
> >>>>> I agree that supporting this in virtio level may be better. For
> >>>>> example, resubmitting inflight I/O once DEVICE_NEEDS_RESET is set in
> >>>>> Stefan's proposal. But I still think QEMU should be able to provide
> >>>>> this ability too. Supposed that one vhost-user backend need to support
> >>>>> multiple VMs. We can't enable reconnect ability until all VMs' guest
> >>>>> driver support the new feature. It's limited.
> >>>> That's the way virtio evolves.
> >>>>
> >>>>
> >>>>>     But if QEMU have the
> >>>>> ability to store inflight buffer, the backend could at least have a
> >>>>> chance to support this case.
> >>>> The problem is, you need a careful designed protocol described somewhere
> >>> That's what we should discuss in detail in this series.
> >>
> >> Well, I ask some questions for this patch, but it looks like they were
> >> still not answered. No?
> >>
> >>
> > Oh, sorry, I missed those questions. Let me try to answer them here.
> >
> > Q1: If backend get killed in vu_queue_inflight_get() without setting
> > vq->inflight->desc[desc_idx] to 1, is there any problem?
> >
> > The entry which stores the head of this inflight descriptor is not
> > lost in avail ring. So we can still get this inflight descriptor from
> > avail ring although we didn't set vq->inflight->desc[desc_idx] to 1.
>
>
> Ok I get this.
>
>
> > Q2:
> > void vu_queue_push()
> > {
> >      vq->inflight->elem_idx = elem->idx;
> >      vu_queue_fill();
> >      vu_queue_flush();
> >      vq->inflight->desc[elem->idx] = 0;
> >                                                      <-------- Does
> > this safe to be killed here?
> >      vq->inflight->used_idx = vq->vring.used->idx;
> > }
> >
> > Because there are no concurrency between vu_queue_push() and
> > vu_queue_pop(), I don't see any problem here.
> >
> > Basically we just need to make sure this two operations
> > (vq->vring.used->idx++ and vq->inflight->desc[elem->idx] = 0) are
> > atomic. I think there are some approach to achieve that. I'm not sure
> > my approach here is good enough, but it should work.
>
>
> Rethink about this, some findings:
>
> - What you suggest requires strict ordering in some part of
> vu_queue_push(). E.g it looks to me you need a compiler barrier to make
> sure used_idx is set before clearing desc[elem->idx]? This a an side
> effect of introduce new metadata (inflight->elem_idx and
> inflight->used_idx) to recover from crashing when logging
> inflgith->desc[] array.
>

Yes, compiler barriers are needed here.

> - Modern backends like dpdk do batching aggressively, which means you
> probably need an array of elem_idx[]. This tends to be more complex
> since we probably need to negotiate the size of this array and the
> overhead is probably noticeable.
>

Maybe this new approach could make things easier:

void vu_queue_flush(const VuVirtqElement *elem, unsigned int count)
{
    uint16_t old = vring.used->idx;
    uint16_t new = old + count;

    inflight->used_idx = new;
    barrier();
    for (i = 0; i < count; i++) {
        inflight->desc[elem[i].idx] = FLAG_PROCESS;
    }
    barrier();
    vring.used->idx = new;
    barrier();
    for (i = 0; i < count; i++) {
        inflight->desc[elem[i].idx] = FLAG_AVAIL;
    }
}

static int vu_check_queue_inflights()
{
    ....
    for (i = 0; i < desc_count; i++) {
        if (inflight->desc[i] == FLAG_PROCESS) {
            if (inflight->used_idx != vring.used->idx) {
                /* rollback */
                inflight->desc[i] = FLAG_INUSE;
            } else {
                /* commit */
                inflight->desc[i] = FLAG_AVAIL;
            }
        }
    }
    ....
}
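
In other words: FLAG_PROCESS marks entries whose completion is in the
middle of being published. On restart, if inflight->used_idx still differs
from vring.used->idx, the crash happened before the used index update
reached the guest, so the entry is rolled back to FLAG_INUSE and
resubmitted; if they match, the completion was already visible to the
guest and the entry is committed to FLAG_AVAIL.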

> I don't audit all other places of the codes, but I suspect it would be
> hard to be 100% correct. And what happens for packed virtqueue?
> Basically, we don't want to produce tricky codes that is hard to debug
> in the future. Think in another direction, in order will be supported by
> virtio 1.1. With that, the inflight descriptors could be much more
> easier to be deduced like [used_idx, avail_idx)? So it looks to me
> supporting this from virtio layer is much more easier.
>

Yes, I agree that we should support this at the virtio layer for packed
virtqueues or newer virtio devices, and I'll be happy to do that. But this
series also makes sense to me, because we can use it to support legacy
virtio 1.0 or 0.9 devices.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2019-01-19 12:20 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-09 11:27 [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 1/7] char-socket: Enable "nowait" option on client sockets elohimes
2019-01-10 12:49   ` Daniel P. Berrangé
2019-01-10 13:19     ` Yongji Xie
2019-01-10 13:24       ` Daniel P. Berrangé
2019-01-10 14:08         ` Yongji Xie
2019-01-10 14:11           ` Daniel P. Berrangé
2019-01-10 14:29             ` Yongji Xie
2019-01-10 16:41               ` Daniel P. Berrangé
2019-01-11  7:50                 ` Yongji Xie
2019-01-11  8:32                   ` Daniel P. Berrangé
2019-01-11  8:36                     ` Yongji Xie
2019-01-15 15:39                       ` Daniel P. Berrangé
2019-01-15 16:53                         ` Yury Kotov
2019-01-15 17:15                           ` Daniel P. Berrangé
2019-01-16  5:39                         ` Yongji Xie
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 2/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
2019-01-14 22:25   ` Michael S. Tsirkin
2019-01-15  6:46     ` Yongji Xie
2019-01-15 12:54       ` Michael S. Tsirkin
2019-01-15 14:18         ` Yongji Xie
2019-01-18  2:45           ` Yongji Xie
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
2019-01-11  3:56   ` Jason Wang
2019-01-11  6:10     ` Yongji Xie
2019-01-15  7:52       ` Jason Wang
2019-01-15 14:51         ` Yongji Xie
2019-01-17  9:57           ` Jason Wang
2019-01-17 14:59             ` Michael S. Tsirkin
2019-01-18  3:57               ` Jason Wang
2019-01-18  3:32             ` Yongji Xie
2019-01-18  3:56               ` Michael S. Tsirkin
2019-01-18  3:59               ` Jason Wang
2019-01-18  4:03                 ` Michael S. Tsirkin
2019-01-18  7:01                 ` Yongji Xie
2019-01-18  9:26                   ` Jason Wang
2019-01-19 12:19                     ` Yongji Xie
2019-01-15 15:58         ` Michael S. Tsirkin
2019-01-17 10:01           ` Jason Wang
2019-01-17 15:04             ` Michael S. Tsirkin
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 6/7] vhost-user-blk: Add support to reconnect backend elohimes
2019-01-09 11:27 ` [Qemu-devel] [PATCH v4 for-4.0 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
2019-01-10 10:25 ` [Qemu-devel] [PATCH v4 for-4.0 0/7] vhost-user-blk: Add support for backend reconnecting Stefan Hajnoczi
2019-01-10 10:59   ` Yongji Xie
2019-01-11 15:53     ` Stefan Hajnoczi
2019-01-11 17:24       ` Michael S. Tsirkin
2019-01-12  4:50       ` Yongji Xie
2019-01-14 10:22         ` Stefan Hajnoczi
2019-01-14 10:55           ` Yongji Xie
2019-01-16 14:28             ` Stefan Hajnoczi
2019-01-10 10:39 ` Marc-André Lureau
2019-01-10 11:09   ` Yongji Xie
