qemu-devel.nongnu.org archive mirror
* [PATCH 0/4] virtiofsd: Avoid potential deadlocks
@ 2021-03-08 12:31 Greg Kurz
  2021-03-08 12:31 ` [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read() Greg Kurz
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-08 12:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Michael S. Tsirkin, Vivek Goyal, Stefan Hajnoczi,
	Dr. David Alan Gilbert

clone of "master"

Greg Kurz (4):
  vhost-user: Introduce nested event loop in vhost_user_read()
  vhost-user: Convert slave channel to QIOChannelSocket
  vhost-user: Monitor slave channel in vhost_user_read()
  virtiofsd: Release vu_dispatch_lock when stopping queue

 hw/virtio/vhost-user.c        | 185 ++++++++++++++++++++++------------
 tools/virtiofsd/fuse_virtio.c |   6 ++
 2 files changed, 129 insertions(+), 62 deletions(-)

-- 
2.26.2





* [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read()
  2021-03-08 12:31 [PATCH 0/4] virtiofsd: Avoid potential deadlocks Greg Kurz
@ 2021-03-08 12:31 ` Greg Kurz
  2021-03-09 15:02   ` Stefan Hajnoczi
  2021-03-08 12:31 ` [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket Greg Kurz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Greg Kurz @ 2021-03-08 12:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Michael S. Tsirkin, Vivek Goyal, Stefan Hajnoczi,
	Dr. David Alan Gilbert

A deadlock condition potentially exists if a vhost-user process needs
to send a request to QEMU on the slave channel while processing a
vhost-user message.

This doesn't seem to affect any vhost-user implementation so far, but
this is currently biting the upcoming enablement of DAX with virtio-fs.
The issue is observed when the guest does an emergency reboot while
a mapping still exists in the DAX window, which is very easy to trigger
with a busy enough workload (e.g. as simulated by blogbench [1]):

- QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.

- In order to complete the request, virtiofsd then asks QEMU to remove
  the mapping on the slave channel.

All these dialogs are synchronous, hence the deadlock.

As pointed out by Stefan Hajnoczi:

When QEMU's vhost-user master implementation sends a vhost-user protocol
message, vhost_user_read() does a "blocking" read during which slave_fd
is not monitored by QEMU.

As a preliminary step to address this, split vhost_user_read() into a
nested even loop and a one-shot callback that does the actual reading.
A subsequent patch will teach the loop to monitor and process messages
from the slave channel as well.

[1] https://github.com/jedisct1/Blogbench
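
For readers less familiar with GLib, the pattern boils down to running a
temporary, private GMainContext with its own GMainLoop, attaching a
one-shot watch to that context and quitting the loop from the callback.
A minimal, self-contained sketch of just that pattern in plain GLib
(purely illustrative; the names are made up and this is not the QEMU code):

#include <glib.h>

struct read_data {
    GMainLoop *loop;
    int ret;
};

/* One-shot callback: do the actual read, record the result, leave the loop */
static gboolean on_ready(GIOChannel *chan, GIOCondition cond, gpointer opaque)
{
    struct read_data *data = opaque;

    /* ... read the reply from 'chan' here and set data->ret accordingly ... */
    data->ret = 0;
    g_main_loop_quit(data->loop);
    return G_SOURCE_REMOVE;
}

static int wait_for_reply(GIOChannel *chan)
{
    GMainContext *ctxt = g_main_context_new();
    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
    struct read_data data = { .loop = loop, .ret = -1 };
    GSource *src = g_io_create_watch(chan, G_IO_IN | G_IO_HUP);

    g_source_set_callback(src, (GSourceFunc)on_ready, &data, NULL);
    g_source_attach(src, ctxt);   /* the watch lives in the private context */

    g_main_loop_run(loop);        /* only dispatches the private context */

    g_source_unref(src);
    g_main_loop_unref(loop);
    g_main_context_unref(ctxt);
    return data.ret;
}

The actual implementation below additionally has to move the chardev's read
handlers into the private context before running the loop and restore them
afterwards.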

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 hw/virtio/vhost-user.c | 59 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2fdd5daf74bb..8a0574d5f959 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -294,15 +294,27 @@ static int vhost_user_read_header(struct vhost_dev *dev, VhostUserMsg *msg)
     return 0;
 }
 
-static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
+struct vhost_user_read_cb_data {
+    struct vhost_dev *dev;
+    VhostUserMsg *msg;
+    GMainLoop *loop;
+    int ret;
+};
+
+static gboolean vhost_user_read_cb(GIOChannel *source, GIOCondition condition,
+                                   gpointer opaque)
 {
+    struct vhost_user_read_cb_data *data = opaque;
+    struct vhost_dev *dev = data->dev;
+    VhostUserMsg *msg = data->msg;
     struct vhost_user *u = dev->opaque;
     CharBackend *chr = u->user->chr;
     uint8_t *p = (uint8_t *) msg;
     int r, size;
 
     if (vhost_user_read_header(dev, msg) < 0) {
-        return -1;
+        data->ret = -1;
+        goto end;
     }
 
     /* validate message size is sane */
@@ -310,7 +322,8 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
         error_report("Failed to read msg header."
                 " Size %d exceeds the maximum %zu.", msg->hdr.size,
                 VHOST_USER_PAYLOAD_SIZE);
-        return -1;
+        data->ret = -1;
+        goto end;
     }
 
     if (msg->hdr.size) {
@@ -320,11 +333,47 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
         if (r != size) {
             error_report("Failed to read msg payload."
                          " Read %d instead of %d.", r, msg->hdr.size);
-            return -1;
+            data->ret = -1;
+            goto end;
         }
     }
 
-    return 0;
+end:
+    g_main_loop_quit(data->loop);
+    return G_SOURCE_REMOVE;
+}
+
+static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
+{
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->user->chr;
+    GMainContext *prev_ctxt = chr->chr->gcontext;
+    GMainContext *ctxt = g_main_context_new();
+    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
+    struct vhost_user_read_cb_data data = {
+        .dev = dev,
+        .loop = loop,
+        .msg = msg,
+        .ret = 0
+    };
+
+    /* Switch context and add a new watch to monitor chardev activity */
+    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
+    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
+
+    g_main_loop_run(loop);
+
+    /*
+     * Restore the previous context. This also destroys/recreates event
+     * sources : this guarantees that all pending events in the original
+     * context that have been processed by the nested loop are purged.
+     */
+    qemu_chr_be_update_read_handlers(chr->chr, prev_ctxt);
+
+    g_main_loop_unref(loop);
+    g_main_context_unref(ctxt);
+
+    return data.ret;
 }
 
 static int process_message_reply(struct vhost_dev *dev,
-- 
2.26.2




* [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-08 12:31 [PATCH 0/4] virtiofsd: Avoid potential deadlocks Greg Kurz
  2021-03-08 12:31 ` [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read() Greg Kurz
@ 2021-03-08 12:31 ` Greg Kurz
  2021-03-09 15:17   ` Stefan Hajnoczi
  2021-03-08 12:31 ` [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read() Greg Kurz
  2021-03-08 12:31 ` [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue Greg Kurz
  3 siblings, 1 reply; 18+ messages in thread
From: Greg Kurz @ 2021-03-08 12:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Michael S. Tsirkin, Vivek Goyal, Stefan Hajnoczi,
	Dr. David Alan Gilbert

The slave channel is implemented with socketpair(): QEMU creates
the pair, passes one of the sockets to virtiofsd and monitors the
other one with the main event loop using qemu_set_fd_handler().

In order to fix a potential deadlock between QEMU and a vhost-user
external process (e.g. virtiofsd with DAX), we want the nested
event loop in vhost_user_read() to monitor the slave channel as
well.

Prepare the ground for this by converting the slave channel to a
QIOChannelSocket. This will make monitoring of the slave channel
as simple as calling qio_channel_add_watch_source() from
vhost_user_read() instead of open-coding it.

This also allows us to get rid of the ancillary data parsing since
QIOChannelSocket can do this for us. Note that the MSG_CTRUNC
check is dropped on the way because QIOChannelSocket ignores this
case. This isn't a problem since slave_read() provisions space for
8 file descriptors, but the affected vhost-user slave protocol messages
generally only convey one. If for some reason a buggy implementation
passes more file descriptors, there is no need to break the connection,
just like we don't break it if some other type of ancillary data is
received: this isn't explicitly violating the protocol per se, so
it seems better to ignore it.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 hw/virtio/vhost-user.c | 99 ++++++++++++++++++------------------------
 1 file changed, 42 insertions(+), 57 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8a0574d5f959..e395d0d1fd81 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -16,6 +16,7 @@
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/virtio-net.h"
 #include "chardev/char-fe.h"
+#include "io/channel-socket.h"
 #include "sysemu/kvm.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -237,7 +238,8 @@ struct vhost_user {
     struct vhost_dev *dev;
     /* Shared between vhost devs of the same virtio device */
     VhostUserState *user;
-    int slave_fd;
+    QIOChannel *slave_ioc;
+    GSource *slave_src;
     NotifierWithReturn postcopy_notifier;
     struct PostCopyFD  postcopy_fd;
     uint64_t           postcopy_client_bases[VHOST_USER_MAX_RAM_SLOTS];
@@ -1441,56 +1443,33 @@ static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
     return 0;
 }
 
-static void slave_read(void *opaque)
+static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
+                           gpointer opaque)
 {
     struct vhost_dev *dev = opaque;
     struct vhost_user *u = dev->opaque;
     VhostUserHeader hdr = { 0, };
     VhostUserPayload payload = { 0, };
-    int size, ret = 0;
+    ssize_t size;
+    int ret = 0;
     struct iovec iov;
-    struct msghdr msgh;
-    int fd[VHOST_USER_SLAVE_MAX_FDS];
-    char control[CMSG_SPACE(sizeof(fd))];
-    struct cmsghdr *cmsg;
-    int i, fdsize = 0;
-
-    memset(&msgh, 0, sizeof(msgh));
-    msgh.msg_iov = &iov;
-    msgh.msg_iovlen = 1;
-    msgh.msg_control = control;
-    msgh.msg_controllen = sizeof(control);
-
-    memset(fd, -1, sizeof(fd));
+    g_autofree int *fd = NULL;
+    size_t fdsize = 0;
+    int i;
 
     /* Read header */
     iov.iov_base = &hdr;
     iov.iov_len = VHOST_USER_HDR_SIZE;
 
     do {
-        size = recvmsg(u->slave_fd, &msgh, 0);
-    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
+        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
+    } while (size == QIO_CHANNEL_ERR_BLOCK);
 
     if (size != VHOST_USER_HDR_SIZE) {
         error_report("Failed to read from slave.");
         goto err;
     }
 
-    if (msgh.msg_flags & MSG_CTRUNC) {
-        error_report("Truncated message.");
-        goto err;
-    }
-
-    for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg != NULL;
-         cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
-            if (cmsg->cmsg_level == SOL_SOCKET &&
-                cmsg->cmsg_type == SCM_RIGHTS) {
-                    fdsize = cmsg->cmsg_len - CMSG_LEN(0);
-                    memcpy(fd, CMSG_DATA(cmsg), fdsize);
-                    break;
-            }
-    }
-
     if (hdr.size > VHOST_USER_PAYLOAD_SIZE) {
         error_report("Failed to read msg header."
                 " Size %d exceeds the maximum %zu.", hdr.size,
@@ -1500,8 +1479,8 @@ static void slave_read(void *opaque)
 
     /* Read payload */
     do {
-        size = read(u->slave_fd, &payload, hdr.size);
-    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
+        size = qio_channel_read(ioc, (char *) &payload, hdr.size, NULL);
+    } while (size == QIO_CHANNEL_ERR_BLOCK);
 
     if (size != hdr.size) {
         error_report("Failed to read payload from slave.");
@@ -1517,7 +1496,7 @@ static void slave_read(void *opaque)
         break;
     case VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG:
         ret = vhost_user_slave_handle_vring_host_notifier(dev, &payload.area,
-                                                          fd[0]);
+                                                          fd ? fd[0] : -1);
         break;
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
@@ -1525,8 +1504,8 @@ static void slave_read(void *opaque)
     }
 
     /* Close the remaining file descriptors. */
-    for (i = 0; i < fdsize; i++) {
-        if (fd[i] != -1) {
+    if (fd) {
+        for (i = 0; i < fdsize; i++) {
             close(fd[i]);
         }
     }
@@ -1551,8 +1530,8 @@ static void slave_read(void *opaque)
         iovec[1].iov_len = hdr.size;
 
         do {
-            size = writev(u->slave_fd, iovec, ARRAY_SIZE(iovec));
-        } while (size < 0 && (errno == EINTR || errno == EAGAIN));
+            size = qio_channel_writev(ioc, iovec, ARRAY_SIZE(iovec), NULL);
+        } while (size == QIO_CHANNEL_ERR_BLOCK);
 
         if (size != VHOST_USER_HDR_SIZE + hdr.size) {
             error_report("Failed to send msg reply to slave.");
@@ -1560,18 +1539,20 @@ static void slave_read(void *opaque)
         }
     }
 
-    return;
+    return G_SOURCE_CONTINUE;
 
 err:
-    qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
-    close(u->slave_fd);
-    u->slave_fd = -1;
-    for (i = 0; i < fdsize; i++) {
-        if (fd[i] != -1) {
+    g_source_destroy(u->slave_src);
+    g_source_unref(u->slave_src);
+    u->slave_src = NULL;
+    object_unref(OBJECT(ioc));
+    u->slave_ioc = NULL;
+    if (fd) {
+        for (i = 0; i < fdsize; i++) {
             close(fd[i]);
         }
     }
-    return;
+    return G_SOURCE_REMOVE;
 }
 
 static int vhost_setup_slave_channel(struct vhost_dev *dev)
@@ -1595,8 +1576,9 @@ static int vhost_setup_slave_channel(struct vhost_dev *dev)
         return -1;
     }
 
-    u->slave_fd = sv[0];
-    qemu_set_fd_handler(u->slave_fd, slave_read, NULL, dev);
+    u->slave_ioc = QIO_CHANNEL(qio_channel_socket_new_fd(sv[0], &error_abort));
+    u->slave_src = qio_channel_add_watch_source(u->slave_ioc, G_IO_IN,
+                                                slave_read, dev, NULL, NULL);
 
     if (reply_supported) {
         msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
@@ -1614,9 +1596,11 @@ static int vhost_setup_slave_channel(struct vhost_dev *dev)
 out:
     close(sv[1]);
     if (ret) {
-        qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
-        close(u->slave_fd);
-        u->slave_fd = -1;
+        g_source_destroy(u->slave_src);
+        g_source_unref(u->slave_src);
+        u->slave_src = NULL;
+        object_unref(OBJECT(u->slave_ioc));
+        u->slave_ioc = NULL;
     }
 
     return ret;
@@ -1853,7 +1837,6 @@ static int vhost_user_backend_init(struct vhost_dev *dev, void *opaque)
 
     u = g_new0(struct vhost_user, 1);
     u->user = opaque;
-    u->slave_fd = -1;
     u->dev = dev;
     dev->opaque = u;
 
@@ -1968,10 +1951,12 @@ static int vhost_user_backend_cleanup(struct vhost_dev *dev)
         close(u->postcopy_fd.fd);
         u->postcopy_fd.handler = NULL;
     }
-    if (u->slave_fd >= 0) {
-        qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
-        close(u->slave_fd);
-        u->slave_fd = -1;
+    if (u->slave_ioc) {
+        g_source_destroy(u->slave_src);
+        g_source_unref(u->slave_src);
+        u->slave_src = NULL;
+        object_unref(OBJECT(u->slave_ioc));
+        u->slave_ioc = NULL;
     }
     g_free(u->region_rb);
     u->region_rb = NULL;
-- 
2.26.2




* [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read()
  2021-03-08 12:31 [PATCH 0/4] virtiofsd: Avoid potential deadlocks Greg Kurz
  2021-03-08 12:31 ` [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read() Greg Kurz
  2021-03-08 12:31 ` [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket Greg Kurz
@ 2021-03-08 12:31 ` Greg Kurz
  2021-03-09 15:18   ` Stefan Hajnoczi
  2021-03-08 12:31 ` [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue Greg Kurz
  3 siblings, 1 reply; 18+ messages in thread
From: Greg Kurz @ 2021-03-08 12:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Michael S. Tsirkin, Vivek Goyal, Stefan Hajnoczi,
	Dr. David Alan Gilbert

Now that everything is in place, have the nested event loop monitor
the slave channel. The source in the main event loop is destroyed and
recreated to ensure that any pending event for the slave channel that was
previously detected is purged. This guarantees that the main loop
won't invoke slave_read() based on an event that was already handled
by the nested loop.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 hw/virtio/vhost-user.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e395d0d1fd81..7669b3f2a99f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -345,6 +345,9 @@ end:
     return G_SOURCE_REMOVE;
 }
 
+static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
+                           gpointer opaque);
+
 static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
 {
     struct vhost_user *u = dev->opaque;
@@ -352,6 +355,7 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
     GMainContext *prev_ctxt = chr->chr->gcontext;
     GMainContext *ctxt = g_main_context_new();
     GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
+    GSource *slave_src = NULL;
     struct vhost_user_read_cb_data data = {
         .dev = dev,
         .loop = loop,
@@ -363,8 +367,30 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
     qemu_chr_be_update_read_handlers(chr->chr, ctxt);
     qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
 
+    if (u->slave_ioc) {
+        /*
+         * This guarantees that all pending events in the main context
+         * for the slave channel are purged. They will be re-detected
+         * and processed now by the nested loop.
+         */
+        g_source_destroy(u->slave_src);
+        g_source_unref(u->slave_src);
+        u->slave_src = NULL;
+        slave_src = qio_channel_add_watch_source(u->slave_ioc, G_IO_IN,
+                                                 slave_read, dev, NULL,
+                                                 ctxt);
+    }
+
     g_main_loop_run(loop);
 
+    if (u->slave_ioc) {
+        g_source_destroy(slave_src);
+        g_source_unref(slave_src);
+        u->slave_src = qio_channel_add_watch_source(u->slave_ioc, G_IO_IN,
+                                                    slave_read, dev, NULL,
+                                                    NULL);
+    }
+
     /*
      * Restore the previous context. This also destroys/recreates event
      * sources : this guarantees that all pending events in the original
@@ -372,6 +398,7 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
      */
     qemu_chr_be_update_read_handlers(chr->chr, prev_ctxt);
 
+
     g_main_loop_unref(loop);
     g_main_context_unref(ctxt);
 
-- 
2.26.2




* [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue
  2021-03-08 12:31 [PATCH 0/4] virtiofsd: Avoid potential deadlocks Greg Kurz
                   ` (2 preceding siblings ...)
  2021-03-08 12:31 ` [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read() Greg Kurz
@ 2021-03-08 12:31 ` Greg Kurz
  2021-03-09 14:00   ` Vivek Goyal
  2021-03-09 15:20   ` Stefan Hajnoczi
  3 siblings, 2 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-08 12:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Michael S. Tsirkin, Vivek Goyal, Stefan Hajnoczi,
	Dr. David Alan Gilbert

QEMU can stop a virtqueue by sending a VHOST_USER_GET_VRING_BASE request
to virtiofsd. As with all other vhost-user protocol messages, the thread
that runs the main event loop in virtiofsd takes the vu_dispatch lock in
write mode. This ensures that no other thread can access virtqueues or
memory tables at the same time.

In the case of VHOST_USER_GET_VRING_BASE, the main thread basically
notifies the queue thread that it should terminate and waits for its
termination:

main()
 virtio_loop()
  vu_dispatch_wrlock()
  vu_dispatch()
   vu_process_message()
    vu_get_vring_base_exec()
     fv_queue_cleanup_thread()
      pthread_join()

Unfortunately, the queue thread ends up calling virtio_send_msg()
at some point, which itself needs to grab the lock:

fv_queue_thread()
 g_list_foreach()
  fv_queue_worker()
   fuse_session_process_buf_int()
    do_release()
     lo_release()
      fuse_reply_err()
       send_reply()
        send_reply_iov()
         fuse_send_reply_iov_nofree()
          fuse_send_msg()
           virtio_send_msg()
            vu_dispatch_rdlock() <-- Deadlock !

Simply have the main thread release the lock before going to
sleep and take it back afterwards. A very similar patch was already
sent by Vivek Goyal some time back:

https://listman.redhat.com/archives/virtio-fs/2021-January/msg00073.html

The only difference here is that this is done in fv_queue_set_started()
because fv_queue_cleanup_thread() can also be called from virtio_loop()
without the lock being held.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 tools/virtiofsd/fuse_virtio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 523ee64fb7ae..3e13997406bf 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -792,7 +792,13 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
             assert(0);
         }
     } else {
+        /*
+         * Temporarily drop write-lock taken in virtio_loop() so that
+         * the queue thread doesn't block in virtio_send_msg().
+         */
+        vu_dispatch_unlock(vud);
         fv_queue_cleanup_thread(vud, qidx);
+        vu_dispatch_wrlock(vud);
     }
 }
 
-- 
2.26.2




* Re: [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue
  2021-03-08 12:31 ` [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue Greg Kurz
@ 2021-03-09 14:00   ` Vivek Goyal
  2021-03-09 15:20   ` Stefan Hajnoczi
  1 sibling, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2021-03-09 14:00 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Stefan Hajnoczi, Dr. David Alan Gilbert

On Mon, Mar 08, 2021 at 01:31:41PM +0100, Greg Kurz wrote:
> QEMU can stop a virtqueue by sending a VHOST_USER_GET_VRING_BASE request
> to virtiofsd. As with all other vhost-user protocol messages, the thread
> that runs the main event loop in virtiofsd takes the vu_dispatch lock in
> write mode. This ensures that no other thread can access virtqueues or
> memory tables at the same time.
> 
> In the case of VHOST_USER_GET_VRING_BASE, the main thread basically
> notifies the queue thread that it should terminate and waits for its
> termination:
> 
> main()
>  virtio_loop()
>   vu_dispatch_wrlock()
>   vu_dispatch()
>    vu_process_message()
>     vu_get_vring_base_exec()
>      fv_queue_cleanup_thread()
>       pthread_join()
> 
> Unfortunately, the queue thread ends up calling virtio_send_msg()
> at some point, which itself needs to grab the lock:
> 
> fv_queue_thread()
>  g_list_foreach()
>   fv_queue_worker()
>    fuse_session_process_buf_int()
>     do_release()
>      lo_release()
>       fuse_reply_err()
>        send_reply()
>         send_reply_iov()
>          fuse_send_reply_iov_nofree()
>           fuse_send_msg()
>            virtio_send_msg()
>             vu_dispatch_rdlock() <-- Deadlock !
> 
> Simply have the main thread to release the lock before going to
> sleep and take it back afterwards. A very similar patch was already
> sent by Vivek Goyal sometime back:
> 
> https://listman.redhat.com/archives/virtio-fs/2021-January/msg00073.html
> 
> The only difference here is that this done in fv_queue_set_started()
> because fv_queue_cleanup_thread() can also be called from virtio_loop()
> without the lock being held.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>

Looks good to me.

Reviewed-by: Vivek Goyal <vgoyal@redhat.com>

Vivek
> ---
>  tools/virtiofsd/fuse_virtio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 523ee64fb7ae..3e13997406bf 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -792,7 +792,13 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
>              assert(0);
>          }
>      } else {
> +        /*
> +         * Temporarily drop write-lock taken in virtio_loop() so that
> +         * the queue thread doesn't block in virtio_send_msg().
> +         */
> +        vu_dispatch_unlock(vud);
>          fv_queue_cleanup_thread(vud, qidx);
> +        vu_dispatch_wrlock(vud);
>      }
>  }
>  
> -- 
> 2.26.2
> 




* Re: [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read()
  2021-03-08 12:31 ` [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read() Greg Kurz
@ 2021-03-09 15:02   ` Stefan Hajnoczi
  2021-03-09 18:35     ` Greg Kurz
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2021-03-09 15:02 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Mon, Mar 08, 2021 at 01:31:38PM +0100, Greg Kurz wrote:
> A deadlock condition potentially exists if a vhost-user process needs
> to request something to QEMU on the slave channel while processing a
> vhost-user message.
> 
> This doesn't seem to affect any vhost-user implementation so far, but
> this is currently biting the upcoming enablement of DAX with virtio-fs.
> The issue is being observed when the guest does an emergency reboot while
> a mapping still exits in the DAX window, which is very easy to get with
> a busy enough workload (e.g. as simulated by blogbench [1]) :
> 
> - QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.
> 
> - In order to complete the request, virtiofsd then asks QEMU to remove
>   the mapping on the slave channel.
> 
> All these dialogs are synchronous, hence the deadlock.
> 
> As pointed out by Stefan Hajnoczi:
> 
> When QEMU's vhost-user master implementation sends a vhost-user protocol
> message, vhost_user_read() does a "blocking" read during which slave_fd
> is not monitored by QEMU.
> 
> As a preliminary step to address this, split vhost_user_read() into a
> nested even loop and a one-shot callback that does the actual reading.

In case you respin:
s/even/event/

> +static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
> +{
> +    struct vhost_user *u = dev->opaque;
> +    CharBackend *chr = u->user->chr;
> +    GMainContext *prev_ctxt = chr->chr->gcontext;
> +    GMainContext *ctxt = g_main_context_new();
> +    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
> +    struct vhost_user_read_cb_data data = {
> +        .dev = dev,
> +        .loop = loop,
> +        .msg = msg,
> +        .ret = 0
> +    };
> +
> +    /* Switch context and add a new watch to monitor chardev activity */
> +    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
> +    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);

This comment could be expanded to explain why the nested event loop is
necessary. The goal is to monitor the slave_fd while waiting for chr
I/O so we'll need an event loop. prev_ctxt cannot be run nested since
its fd handlers may not be prepared (e.g. re-entrancy).
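
Something along these lines, perhaps (wording only illustrative, building
on the two calls already in the patch):

    /*
     * We need a nested event loop here: the goal is to also monitor the
     * slave channel while waiting for chr I/O, so an event loop is
     * required, and prev_ctxt cannot be run nested since its fd handlers
     * may not be prepared for re-entrancy.  Switch the chardev to a
     * private context and add a one-shot watch for its activity.
     */
    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);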


* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-08 12:31 ` [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket Greg Kurz
@ 2021-03-09 15:17   ` Stefan Hajnoczi
  2021-03-09 20:23     ` Greg Kurz
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2021-03-09 15:17 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> +    g_autofree int *fd = NULL;
> +    size_t fdsize = 0;
> +    int i;
>  
>      /* Read header */
>      iov.iov_base = &hdr;
>      iov.iov_len = VHOST_USER_HDR_SIZE;
>  
>      do {
> -        size = recvmsg(u->slave_fd, &msgh, 0);
> -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> +    } while (size == QIO_CHANNEL_ERR_BLOCK);

Is it possible to leak file descriptors and fd[] memory if we receive a
short message and then loop around? qio_channel_readv_full() will
overwrite &fd and that's how the leak occurs.

On the other hand, it looks like ioc is in blocking mode. I'm not sure
QIO_CHANNEL_ERR_BLOCK can occur?

> @@ -1500,8 +1479,8 @@ static void slave_read(void *opaque)
>  
>      /* Read payload */
>      do {
> -        size = read(u->slave_fd, &payload, hdr.size);
> -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> +        size = qio_channel_read(ioc, (char *) &payload, hdr.size, NULL);
> +    } while (size == QIO_CHANNEL_ERR_BLOCK);

Same question here.


* Re: [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read()
  2021-03-08 12:31 ` [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read() Greg Kurz
@ 2021-03-09 15:18   ` Stefan Hajnoczi
  2021-03-09 22:56     ` Greg Kurz
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2021-03-09 15:18 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Mon, Mar 08, 2021 at 01:31:40PM +0100, Greg Kurz wrote:
> @@ -363,8 +367,30 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
>      qemu_chr_be_update_read_handlers(chr->chr, ctxt);
>      qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
>  
> +    if (u->slave_ioc) {
> +        /*
> +         * This guarantees that all pending events in the main context
> +         * for the slave channel are purged. They will be re-detected
> +         * and processed now by the nested loop.
> +         */
> +        g_source_destroy(u->slave_src);
> +        g_source_unref(u->slave_src);
> +        u->slave_src = NULL;
> +        slave_src = qio_channel_add_watch_source(u->slave_ioc, G_IO_IN,

Why does slave_ioc use G_IO_IN while chr uses G_IO_IN | G_IO_HUP?


* Re: [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue
  2021-03-08 12:31 ` [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue Greg Kurz
  2021-03-09 14:00   ` Vivek Goyal
@ 2021-03-09 15:20   ` Stefan Hajnoczi
  1 sibling, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2021-03-09 15:20 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Mon, Mar 08, 2021 at 01:31:41PM +0100, Greg Kurz wrote:
> QEMU can stop a virtqueue by sending a VHOST_USER_GET_VRING_BASE request
> to virtiofsd. As with all other vhost-user protocol messages, the thread
> that runs the main event loop in virtiofsd takes the vu_dispatch lock in
> write mode. This ensures that no other thread can access virtqueues or
> memory tables at the same time.
> 
> In the case of VHOST_USER_GET_VRING_BASE, the main thread basically
> notifies the queue thread that it should terminate and waits for its
> termination:
> 
> main()
>  virtio_loop()
>   vu_dispatch_wrlock()
>   vu_dispatch()
>    vu_process_message()
>     vu_get_vring_base_exec()
>      fv_queue_cleanup_thread()
>       pthread_join()
> 
> Unfortunately, the queue thread ends up calling virtio_send_msg()
> at some point, which itself needs to grab the lock:
> 
> fv_queue_thread()
>  g_list_foreach()
>   fv_queue_worker()
>    fuse_session_process_buf_int()
>     do_release()
>      lo_release()
>       fuse_reply_err()
>        send_reply()
>         send_reply_iov()
>          fuse_send_reply_iov_nofree()
>           fuse_send_msg()
>            virtio_send_msg()
>             vu_dispatch_rdlock() <-- Deadlock !
> 
> Simply have the main thread to release the lock before going to
> sleep and take it back afterwards. A very similar patch was already
> sent by Vivek Goyal sometime back:
> 
> https://listman.redhat.com/archives/virtio-fs/2021-January/msg00073.html
> 
> The only difference here is that this done in fv_queue_set_started()
> because fv_queue_cleanup_thread() can also be called from virtio_loop()
> without the lock being held.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
>  tools/virtiofsd/fuse_virtio.c | 6 ++++++
>  1 file changed, 6 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read()
  2021-03-09 15:02   ` Stefan Hajnoczi
@ 2021-03-09 18:35     ` Greg Kurz
  0 siblings, 0 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-09 18:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Tue, 9 Mar 2021 15:02:48 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Mar 08, 2021 at 01:31:38PM +0100, Greg Kurz wrote:
> > A deadlock condition potentially exists if a vhost-user process needs
> > to request something to QEMU on the slave channel while processing a
> > vhost-user message.
> > 
> > This doesn't seem to affect any vhost-user implementation so far, but
> > this is currently biting the upcoming enablement of DAX with virtio-fs.
> > The issue is being observed when the guest does an emergency reboot while
> > a mapping still exits in the DAX window, which is very easy to get with
> > a busy enough workload (e.g. as simulated by blogbench [1]) :
> > 
> > - QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.
> > 
> > - In order to complete the request, virtiofsd then asks QEMU to remove
> >   the mapping on the slave channel.
> > 
> > All these dialogs are synchronous, hence the deadlock.
> > 
> > As pointed out by Stefan Hajnoczi:
> > 
> > When QEMU's vhost-user master implementation sends a vhost-user protocol
> > message, vhost_user_read() does a "blocking" read during which slave_fd
> > is not monitored by QEMU.
> > 
> > As a preliminary step to address this, split vhost_user_read() into a
> > nested even loop and a one-shot callback that does the actual reading.
> 
> In case you respin:
> s/even/event/
> 

Fixed.

> > +static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
> > +{
> > +    struct vhost_user *u = dev->opaque;
> > +    CharBackend *chr = u->user->chr;
> > +    GMainContext *prev_ctxt = chr->chr->gcontext;
> > +    GMainContext *ctxt = g_main_context_new();
> > +    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
> > +    struct vhost_user_read_cb_data data = {
> > +        .dev = dev,
> > +        .loop = loop,
> > +        .msg = msg,
> > +        .ret = 0
> > +    };
> > +
> > +    /* Switch context and add a new watch to monitor chardev activity */
> > +    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
> > +    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
> 
> This comment could be expanded to explain why the nested event loop is
> necessary. The goal is to monitor the slave_fd while waiting for chr
> I/O so we'll need an event loop. prev_ctxt cannot be run nested since
> its fd handlers may not be prepared (e.g. re-entrancy).

Ok, will do.


* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-09 15:17   ` Stefan Hajnoczi
@ 2021-03-09 20:23     ` Greg Kurz
  2021-03-10 11:27       ` Stefan Hajnoczi
  2021-03-10 11:43       ` Daniel P. Berrangé
  0 siblings, 2 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-09 20:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Tue, 9 Mar 2021 15:17:21 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > +    g_autofree int *fd = NULL;
> > +    size_t fdsize = 0;
> > +    int i;
> >  
> >      /* Read header */
> >      iov.iov_base = &hdr;
> >      iov.iov_len = VHOST_USER_HDR_SIZE;
> >  
> >      do {
> > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> 
> Is it possible to leak file descriptors and fd[] memory if we receive a
> short message and then loop around? qio_channel_readv_full() will
> overwrite &fd and that's how the leak occurs.
> 

qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
channel is in non-blocking mode and no data is available. Under the hood,
this translates to this code in qio_channel_socket_readv():

 retry:
    ret = recvmsg(sioc->fd, &msg, sflags);
    if (ret < 0) {
        if (errno == EAGAIN) {
            return QIO_CHANNEL_ERR_BLOCK;
        }
        if (errno == EINTR) {
            goto retry;
        }

        error_setg_errno(errp, errno,
                         "Unable to read from socket");
        return -1;
    }

This is strictly equivalent to what we currently have. This cannot
leak file descriptors because we would only loop until the first
byte of real data is available and ancillary data cannot fly without
real data, i.e. no file descriptors when recvmsg() returns EAGAIN.

> On the other hand, it looks like ioc is in blocking mode. I'm not sure
> QIO_CHANNEL_ERR_BLOCK can occur?
> 

You're right but the existing code is ready to handle the non-blocking
case, so I just kept this behavior.

> > @@ -1500,8 +1479,8 @@ static void slave_read(void *opaque)
> >  
> >      /* Read payload */
> >      do {
> > -        size = read(u->slave_fd, &payload, hdr.size);
> > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > +        size = qio_channel_read(ioc, (char *) &payload, hdr.size, NULL);
> > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> 
> Same question here.

This also ends up in qio_channel_socket_readv().


* Re: [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read()
  2021-03-09 15:18   ` Stefan Hajnoczi
@ 2021-03-09 22:56     ` Greg Kurz
  0 siblings, 0 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-09 22:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Tue, 9 Mar 2021 15:18:56 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Mar 08, 2021 at 01:31:40PM +0100, Greg Kurz wrote:
> > @@ -363,8 +367,30 @@ static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
> >      qemu_chr_be_update_read_handlers(chr->chr, ctxt);
> >      qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
> >  
> > +    if (u->slave_ioc) {
> > +        /*
> > +         * This guarantees that all pending events in the main context
> > +         * for the slave channel are purged. They will be re-detected
> > +         * and processed now by the nested loop.
> > +         */
> > +        g_source_destroy(u->slave_src);
> > +        g_source_unref(u->slave_src);
> > +        u->slave_src = NULL;
> > +        slave_src = qio_channel_add_watch_source(u->slave_ioc, G_IO_IN,
> 
> Why does slave_ioc use G_IO_IN while chr uses G_IO_IN | G_IO_HUP?

Oops my bad... this is copy&paste of the change introduced in
vhost_setup_slave_channel() by patch 2, which is lacking G_IO_HUP.

It should even actually be G_IO_IN | G_IO_HUP | G_IO_ERR to match
what was done before when calling qemu_set_fd_handler() and which
is recommended by the glib documentation:

https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html#GPollFD

So I'm now wondering why callers of qemu_chr_fe_add_watch() never pass
G_IO_ERR... I'll sort this out for v2.
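
Concretely, both qio_channel_add_watch_source() call sites would grow the
extra conditions, e.g. the one in vhost_setup_slave_channel() would become
something like:

    u->slave_src = qio_channel_add_watch_source(u->slave_ioc,
                                                G_IO_IN | G_IO_HUP | G_IO_ERR,
                                                slave_read, dev, NULL, NULL);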


* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-09 20:23     ` Greg Kurz
@ 2021-03-10 11:27       ` Stefan Hajnoczi
  2021-03-10 13:08         ` Greg Kurz
  2021-03-10 11:43       ` Daniel P. Berrangé
  1 sibling, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2021-03-10 11:27 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Tue, Mar 09, 2021 at 09:23:22PM +0100, Greg Kurz wrote:
> On Tue, 9 Mar 2021 15:17:21 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > > +    g_autofree int *fd = NULL;
> > > +    size_t fdsize = 0;
> > > +    int i;
> > >  
> > >      /* Read header */
> > >      iov.iov_base = &hdr;
> > >      iov.iov_len = VHOST_USER_HDR_SIZE;
> > >  
> > >      do {
> > > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > 
> > Is it possible to leak file descriptors and fd[] memory if we receive a
> > short message and then loop around? qio_channel_readv_full() will
> > overwrite &fd and that's how the leak occurs.
> > 
> 
> qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
> channel is non-blocking mode and no data is available. Under the hood,
> this translates to this code in qio_channel_socket_readv():
> 
>  retry:
>     ret = recvmsg(sioc->fd, &msg, sflags);
>     if (ret < 0) {
>         if (errno == EAGAIN) {
>             return QIO_CHANNEL_ERR_BLOCK;
>         }
>         if (errno == EINTR) {
>             goto retry;
>         }
> 
>         error_setg_errno(errp, errno,
>                          "Unable to read from socket");
>         return -1;
>     }
> 
> This is strictly equivalent to what we currently have. This cannot
> leak file descriptors because we would only loop until the first
> byte of real data is available and ancillary data cannot fly without
> real data, i.e. no file descriptors when recvmsg() returns EAGAIN.
> 
> > On the other hand, it looks like ioc is in blocking mode. I'm not sure
> > QIO_CHANNEL_ERR_BLOCK can occur?
> > 
> 
> You're right but the existing code is ready to handle the non-blocking
> case, so I just kept this behavior.

I'm confused by this tentative non-blocking support because if we set
the fd to non-blocking, then qio_channel_readv_full() can return less
than size bytes (a short read) and that will cause a failure:

  if (size != VHOST_USER_HDR_SIZE) {
      error_report("Failed to read from slave.");
      goto err;
  }

So although the while QIO_CHANNEL_ERR_BLOCK loop suggests the code
supports non-blocking, it doesn't really support it :).

I think it would be clearer to remove the while QIO_CHANNEL_ERR_BLOCK
loop. However, this is not directly related to the bug that this series
fixes, so if you prefer to keep it, feel free.
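
Dropping the loop would leave just a plain call, e.g.:

    size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
    if (size != VHOST_USER_HDR_SIZE) {
        error_report("Failed to read from slave.");
        goto err;
    }

with the existing size check catching both errors and short reads.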


* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-09 20:23     ` Greg Kurz
  2021-03-10 11:27       ` Stefan Hajnoczi
@ 2021-03-10 11:43       ` Daniel P. Berrangé
  2021-03-10 13:45         ` Greg Kurz
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel P. Berrangé @ 2021-03-10 11:43 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Vivek Goyal, Dr. David Alan Gilbert, qemu-devel, Stefan Hajnoczi,
	Michael S. Tsirkin

On Tue, Mar 09, 2021 at 09:23:22PM +0100, Greg Kurz wrote:
> On Tue, 9 Mar 2021 15:17:21 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > > +    g_autofree int *fd = NULL;
> > > +    size_t fdsize = 0;
> > > +    int i;
> > >  
> > >      /* Read header */
> > >      iov.iov_base = &hdr;
> > >      iov.iov_len = VHOST_USER_HDR_SIZE;
> > >  
> > >      do {
> > > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > 
> > Is it possible to leak file descriptors and fd[] memory if we receive a
> > short message and then loop around? qio_channel_readv_full() will
> > overwrite &fd and that's how the leak occurs.
> > 
> 
> qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
> channel is non-blocking mode and no data is available. Under the hood,
> this translates to this code in qio_channel_socket_readv():
> 
>  retry:
>     ret = recvmsg(sioc->fd, &msg, sflags);
>     if (ret < 0) {
>         if (errno == EAGAIN) {
>             return QIO_CHANNEL_ERR_BLOCK;
>         }
>         if (errno == EINTR) {
>             goto retry;
>         }
> 
>         error_setg_errno(errp, errno,
>                          "Unable to read from socket");
>         return -1;
>     }
> 
> This is strictly equivalent to what we currently have. This cannot
> leak file descriptors because we would only loop until the first
> byte of real data is available and ancillary data cannot fly without
> real data, i.e. no file descriptors when recvmsg() returns EAGAIN.

Yep, EAGAIN will only happen if you have no data AND no FDs.

> 
> > On the other hand, it looks like ioc is in blocking mode. I'm not sure
> > QIO_CHANNEL_ERR_BLOCK can occur?
> > 
> 
> You're right but the existing code is ready to handle the non-blocking
> case, so I just kept this behavior.

The existing code is *not* handling the non-blocking case in any
useful way IMHO. It will block execution of this QEMU thread, and
sit in a tight loop burning CPU in the EAGAIN case.

Handling non-blocking means using an I/O watch with the event loop
to be notified when something becomes ready.

I notice the code that follows is also doing


    if (size != VHOST_USER_HDR_SIZE) {
        error_report("Failed to read from slave.");
        goto err;
    }

which is also dubious because it assumes the previous recvmsg is
guaranteed to return VHOST_USER_HDR_SIZE bytes of data in a single
call.

It does at least look like the code is intentionally blocking
execution of the QEMU thread until the expected amount of I/O
is received.

Given this, you should remove the while loop entirely and
just call

   qio_channel_readv_full_all()

this will block execution until *all* the requested data bytes
are read. It will re-try the recvmsg in the event of a partial
data read, and will intelligently wait in poll() on EAGAIN.
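
i.e. roughly (a sketch only, assuming qio_channel_readv_full_all() takes
the same arguments as qio_channel_readv_full() and returns 0 once the
whole buffer and any accompanying FDs have been received):

    if (qio_channel_readv_full_all(ioc, &iov, 1, &fd, &fdsize, NULL) < 0) {
        error_report("Failed to read from slave.");
        goto err;
    }

which would also make the "size != VHOST_USER_HDR_SIZE" check redundant.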

> 
> > > @@ -1500,8 +1479,8 @@ static void slave_read(void *opaque)
> > >  
> > >      /* Read payload */
> > >      do {
> > > -        size = read(u->slave_fd, &payload, hdr.size);
> > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > +        size = qio_channel_read(ioc, (char *) &payload, hdr.size, NULL);
> > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > 
> > Same question here.
> 
> This also ends up in qio_channel_socket_readv().



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-10 11:27       ` Stefan Hajnoczi
@ 2021-03-10 13:08         ` Greg Kurz
  0 siblings, 0 replies; 18+ messages in thread
From: Greg Kurz @ 2021-03-10 13:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, qemu-devel, Vivek Goyal, Dr. David Alan Gilbert

On Wed, 10 Mar 2021 11:27:16 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Tue, Mar 09, 2021 at 09:23:22PM +0100, Greg Kurz wrote:
> > On Tue, 9 Mar 2021 15:17:21 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > > On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > > > +    g_autofree int *fd = NULL;
> > > > +    size_t fdsize = 0;
> > > > +    int i;
> > > >  
> > > >      /* Read header */
> > > >      iov.iov_base = &hdr;
> > > >      iov.iov_len = VHOST_USER_HDR_SIZE;
> > > >  
> > > >      do {
> > > > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > > 
> > > Is it possible to leak file descriptors and fd[] memory if we receive a
> > > short message and then loop around? qio_channel_readv_full() will
> > > overwrite &fd and that's how the leak occurs.
> > > 
> > 
> > qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
> > channel is non-blocking mode and no data is available. Under the hood,
> > this translates to this code in qio_channel_socket_readv():
> > 
> >  retry:
> >     ret = recvmsg(sioc->fd, &msg, sflags);
> >     if (ret < 0) {
> >         if (errno == EAGAIN) {
> >             return QIO_CHANNEL_ERR_BLOCK;
> >         }
> >         if (errno == EINTR) {
> >             goto retry;
> >         }
> > 
> >         error_setg_errno(errp, errno,
> >                          "Unable to read from socket");
> >         return -1;
> >     }
> > 
> > This is strictly equivalent to what we currently have. This cannot
> > leak file descriptors because we would only loop until the first
> > byte of real data is available and ancillary data cannot fly without
> > real data, i.e. no file descriptors when recvmsg() returns EAGAIN.
> > 
> > > On the other hand, it looks like ioc is in blocking mode. I'm not sure
> > > QIO_CHANNEL_ERR_BLOCK can occur?
> > > 
> > 
> > You're right but the existing code is ready to handle the non-blocking
> > case, so I just kept this behavior.
> 
> I'm confused by this tentative non-blocking support because if we set
> the fd to non-blocking, then qio_channel_readv_full() can return less
> than size bytes (a short read) and that will cause a failure:
> 
>   if (size != VHOST_USER_HDR_SIZE) {
>       error_report("Failed to read from slave.");
>       goto err;
>   }
> 
> So although the while QIO_CHANNEL_ERR_BLOCK loop suggests the code
> supports non-blocking, it doesn't really support it :).
> 

Yeah I agree. These EAGAIN checks that we have today are really
misleading. I'll get rid of them in a preliminary patch to
reduce the noise on the slave channel conversion.

> I think it would be clearer to remove the while QIO_CHANNEL_ERR_BLOCK
> loop. However, this is not directly related to the bug that this series
> fixes, so if you prefer to keep it, feel free.



* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-10 11:43       ` Daniel P. Berrangé
@ 2021-03-10 13:45         ` Greg Kurz
  2021-03-10 13:48           ` Daniel P. Berrangé
  0 siblings, 1 reply; 18+ messages in thread
From: Greg Kurz @ 2021-03-10 13:45 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Dr. David Alan Gilbert,
	Vivek Goyal, qemu-devel

On Wed, 10 Mar 2021 11:43:58 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Mar 09, 2021 at 09:23:22PM +0100, Greg Kurz wrote:
> > On Tue, 9 Mar 2021 15:17:21 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > > On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > > > +    g_autofree int *fd = NULL;
> > > > +    size_t fdsize = 0;
> > > > +    int i;
> > > >  
> > > >      /* Read header */
> > > >      iov.iov_base = &hdr;
> > > >      iov.iov_len = VHOST_USER_HDR_SIZE;
> > > >  
> > > >      do {
> > > > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > > 
> > > Is it possible to leak file descriptors and fd[] memory if we receive a
> > > short message and then loop around? qio_channel_readv_full() will
> > > overwrite &fd and that's how the leak occurs.
> > > 
> > 
> > qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
> > channel is non-blocking mode and no data is available. Under the hood,
> > this translates to this code in qio_channel_socket_readv():
> > 
> >  retry:
> >     ret = recvmsg(sioc->fd, &msg, sflags);
> >     if (ret < 0) {
> >         if (errno == EAGAIN) {
> >             return QIO_CHANNEL_ERR_BLOCK;
> >         }
> >         if (errno == EINTR) {
> >             goto retry;
> >         }
> > 
> >         error_setg_errno(errp, errno,
> >                          "Unable to read from socket");
> >         return -1;
> >     }
> > 
> > This is strictly equivalent to what we currently have. This cannot
> > leak file descriptors because we would only loop until the first
> > byte of real data is available and ancillary data cannot fly without
> > real data, i.e. no file descriptors when recvmsg() returns EAGAIN.
> 
> Yep, EAGAIN will only happen if you have no data AND no FDs.
> 
> > 
> > > On the other hand, it looks like ioc is in blocking mode. I'm not sure
> > > QIO_CHANNEL_ERR_BLOCK can occur?
> > > 
> > 
> > You're right but the existing code is ready to handle the non-blocking
> > case, so I just kept this behavior.
> 
> The existing code is *not* handling the non-blocking case in any
> useful way IMHO. It will block execution of this QEMU thread, and
> sit in a tight loop burning CPU in the EAGAIN case.
> 
> Handling non-blocking means using an I/O watch with the event loop
> to be notified when something becomes ready.
> 

Hmm... slave_read() is a handler registered with qemu_set_fd_handler().
Isn't it already the case then? Can the first call to recvmsg() even
return EAGAIN actually?

> I notice the code that follows is also doing
> 
> 
>     if (size != VHOST_USER_HDR_SIZE) {
>         error_report("Failed to read from slave.");
>         goto err;
>     }
> 
> which is also dubious because it assumes the previous recvmsg is
> guaranteed to return VHOST_USER_HDR_SIZE bytes of data in a single
> call.
> 

Yeah this is broken since recvmsg() doesn't guarantee full reads.

> It does at least look like the code is intentionally blocking
> execution fo the QEMU thread until the expected amount of I/O
> is receieved.
> 
> Given this, you should remove the while loop entirely and
> just call
> 
>    qio_channel_readv_full_all()
> 
> this will block execution until *all* the requestd data bytes
> are read. It will re-try the recvmsg in the event of a partial
> data read, and will intelligently wait in poll() on EAGAIN.
> 

Thanks for the suggestion!

> > 
> > > > @@ -1500,8 +1479,8 @@ static void slave_read(void *opaque)
> > > >  
> > > >      /* Read payload */
> > > >      do {
> > > > -        size = read(u->slave_fd, &payload, hdr.size);
> > > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > > +        size = qio_channel_read(ioc, (char *) &payload, hdr.size, NULL);
> > > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > > 
> > > Same question here.
> > 
> > This also ends up in qio_channel_socket_readv().
> 
> 
> 
> Regards,
> Daniel




* Re: [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket
  2021-03-10 13:45         ` Greg Kurz
@ 2021-03-10 13:48           ` Daniel P. Berrangé
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2021-03-10 13:48 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael S. Tsirkin, Stefan Hajnoczi, Dr. David Alan Gilbert,
	Vivek Goyal, qemu-devel

On Wed, Mar 10, 2021 at 02:45:25PM +0100, Greg Kurz wrote:
> On Wed, 10 Mar 2021 11:43:58 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Tue, Mar 09, 2021 at 09:23:22PM +0100, Greg Kurz wrote:
> > > On Tue, 9 Mar 2021 15:17:21 +0000
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > 
> > > > On Mon, Mar 08, 2021 at 01:31:39PM +0100, Greg Kurz wrote:
> > > > > +    g_autofree int *fd = NULL;
> > > > > +    size_t fdsize = 0;
> > > > > +    int i;
> > > > >  
> > > > >      /* Read header */
> > > > >      iov.iov_base = &hdr;
> > > > >      iov.iov_len = VHOST_USER_HDR_SIZE;
> > > > >  
> > > > >      do {
> > > > > -        size = recvmsg(u->slave_fd, &msgh, 0);
> > > > > -    } while (size < 0 && (errno == EINTR || errno == EAGAIN));
> > > > > +        size = qio_channel_readv_full(ioc, &iov, 1, &fd, &fdsize, NULL);
> > > > > +    } while (size == QIO_CHANNEL_ERR_BLOCK);
> > > > 
> > > > Is it possible to leak file descriptors and fd[] memory if we receive a
> > > > short message and then loop around? qio_channel_readv_full() will
> > > > overwrite &fd and that's how the leak occurs.
> > > > 
> > > 
> > > qio_channel_readv_full() only returns QIO_CHANNEL_ERR_BLOCK when the
> > > channel is non-blocking mode and no data is available. Under the hood,
> > > this translates to this code in qio_channel_socket_readv():
> > > 
> > >  retry:
> > >     ret = recvmsg(sioc->fd, &msg, sflags);
> > >     if (ret < 0) {
> > >         if (errno == EAGAIN) {
> > >             return QIO_CHANNEL_ERR_BLOCK;
> > >         }
> > >         if (errno == EINTR) {
> > >             goto retry;
> > >         }
> > > 
> > >         error_setg_errno(errp, errno,
> > >                          "Unable to read from socket");
> > >         return -1;
> > >     }
> > > 
> > > This is strictly equivalent to what we currently have. This cannot
> > > leak file descriptors because we would only loop until the first
> > > byte of real data is available and ancillary data cannot fly without
> > > real data, i.e. no file descriptors when recvmsg() returns EAGAIN.
> > 
> > Yep, EAGAIN will only happen if you have no data AND no FDs.
> > 
> > > 
> > > > On the other hand, it looks like ioc is in blocking mode. I'm not sure
> > > > QIO_CHANNEL_ERR_BLOCK can occur?
> > > > 
> > > 
> > > You're right but the existing code is ready to handle the non-blocking
> > > case, so I just kept this behavior.
> > 
> > The existing code is *not* handling the non-blocking case in any
> > useful way IMHO. It will block execution of this QEMU thread, and
> > sit in a tight loop burning CPU in the EAGAIN case.
> > 
> > Handling non-blocking means using an I/O watch with the event loop
> > to be notified when something becomes ready.
> > 
> 
> Hmm... slave_read() is a handler registered with qemu_set_fd_handler().
> Isn't it already the case then ? Can the first call to recvmsg() even
> return EAGAIN actually ?

IIUC it shouldn't be able to return EAGAIN in that scenario for stream
sockets (TCP, UNIX); only a datagram socket (UDP) would be at risk of
EAGAIN.



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




Thread overview: 18+ messages
2021-03-08 12:31 [PATCH 0/4] virtiofsd: Avoid potential deadlocks Greg Kurz
2021-03-08 12:31 ` [PATCH 1/4] vhost-user: Introduce nested event loop in vhost_user_read() Greg Kurz
2021-03-09 15:02   ` Stefan Hajnoczi
2021-03-09 18:35     ` Greg Kurz
2021-03-08 12:31 ` [PATCH 2/4] vhost-user: Convert slave channel to QIOChannelSocket Greg Kurz
2021-03-09 15:17   ` Stefan Hajnoczi
2021-03-09 20:23     ` Greg Kurz
2021-03-10 11:27       ` Stefan Hajnoczi
2021-03-10 13:08         ` Greg Kurz
2021-03-10 11:43       ` Daniel P. Berrangé
2021-03-10 13:45         ` Greg Kurz
2021-03-10 13:48           ` Daniel P. Berrangé
2021-03-08 12:31 ` [PATCH 3/4] vhost-user: Monitor slave channel in vhost_user_read() Greg Kurz
2021-03-09 15:18   ` Stefan Hajnoczi
2021-03-09 22:56     ` Greg Kurz
2021-03-08 12:31 ` [PATCH 4/4] virtiofsd: Release vu_dispatch_lock when stopping queue Greg Kurz
2021-03-09 14:00   ` Vivek Goyal
2021-03-09 15:20   ` Stefan Hajnoczi
