* [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
@ 2021-03-18 13:35 Stefan Reiter
  2021-03-22 11:08 ` Wolfgang Bumiller
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan Reiter @ 2021-03-18 13:35 UTC (permalink / raw)
  To: Markus Armbruster, Kevin Wolf, Paolo Bonzini, Wolfgang Bumiller,
	Thomas Lamprecht
  Cc: qemu-devel

If OOB is disabled, events received in monitor_qmp_event will be handled
in the main context. Thus, we must not acquire a qmp_queue_lock there,
as the dispatcher coroutine holds one over a yield point, where it
expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
event is received just then, it can race and block the main thread by
waiting on the queue lock.

Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
thread, so the main thread can always make progress during the
reschedule.

The delaying of the cleanup is safe, since the dispatcher always moves
back to the iothread afterward, and thus the cleanup will happen before
it gets to its next iteration.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---


We've had sporadic reports on our forum of (usually multiple) total VM hangs,
and finally got our hands on some stack traces:

Thread 1 (Thread 0x7fa59d476340 (LWP 1306954)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007fa5a8335714 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55722bf3a848) at ../nptl/pthread_mutex_lock.c:80
#2 0x000055722a475e39 in qemu_mutex_lock_impl (mutex=0x55722bf3a848, file=0x55722a5d7149 "../monitor/qmp.c", line=80) at ../util/qemu-thread-posix.c:79
#3 0x000055722a3fb686 in monitor_qmp_cleanup_queue_and_resume (mon=0x55722bf3a730) at ../monitor/qmp.c:80
#4 monitor_qmp_event (opaque=0x55722bf3a730, event=<optimized out>) at ../monitor/qmp.c:421
#5 0x000055722a3f9505 in tcp_chr_disconnect_locked (chr=0x55722bc68fa0) at ../chardev/char-socket.c:507
#6 0x000055722a3f9550 in tcp_chr_disconnect (chr=0x55722bc68fa0) at ../chardev/char-socket.c:517
#7 0x000055722a3f959e in tcp_chr_hup (channel=<optimized out>, cond=<optimized out>, opaque=<optimized out>) at ../chardev/char-socket.c:557
#8 0x00007fa5a9c2edd8 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#9 0x000055722a47a848 in glib_pollfds_poll () at ../util/main-loop.c:221
#10 os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#11 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#12 0x000055722a2570a1 in qemu_main_loop () at ../softmmu/vl.c:1678
#13 0x0000557229fc178e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

I managed to reproduce it reliably by adding a 'sleep(1)' to the beginning of
handle_qmp_command, then running this:

  ./build/qemu-system-x86_64 \
    -chardev 'socket,id=qmp,path=./qmp.socket,server=on,wait=off' \
    -mon 'chardev=qmp,mode=control'
  yes "query-version" | timeout 2s ./scripts/qmp/qmp-shell ./qmp.socket

When the timeout hits, the main loop always ends up in the trace above, with the
dispatcher coroutine having just yielded to reschedule itself.

Holding a QemuMutex over a yield point seems pretty dangerous in general, but it
seems to be working for now...
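
To make the shape of the hang visible without any QEMU internals, here is a
standalone pthread sketch (my own illustration, not QEMU code) that hangs the
same way: a "dispatcher" thread takes a lock and then waits to be woken by the
"main loop", while the main loop first blocks on that very lock. Build with
'gcc -pthread'; it never prints anything, which is the point.

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ qmp_queue_lock */
  static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  sched_cond = PTHREAD_COND_INITIALIZER;
  static int rescheduled;

  static void *dispatcher(void *arg)   /* ~ the dispatcher coroutine */
  {
      pthread_mutex_lock(&queue_lock); /* "pop a request with the lock held" */

      /* "yield" while still holding queue_lock, waiting for the main
       * loop to reschedule us */
      pthread_mutex_lock(&sched_lock);
      while (!rescheduled) {
          pthread_cond_wait(&sched_cond, &sched_lock);
      }
      pthread_mutex_unlock(&sched_lock);

      pthread_mutex_unlock(&queue_lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, dispatcher, NULL);
      usleep(100 * 1000);              /* let the dispatcher grab the lock */

      /* main loop handles CHR_EVENT_CLOSED: the cleanup wants queue_lock... */
      pthread_mutex_lock(&queue_lock); /* <- blocks forever */

      /* ...so it never gets here to wake the dispatcher again */
      pthread_mutex_lock(&sched_lock);
      rescheduled = 1;
      pthread_cond_signal(&sched_cond);
      pthread_mutex_unlock(&sched_lock);
      pthread_mutex_unlock(&queue_lock);

      pthread_join(t, NULL);
      puts("no deadlock");
      return 0;
  }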


 monitor/qmp.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/monitor/qmp.c b/monitor/qmp.c
index 2b0308f933..83eb440b29 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -74,8 +74,9 @@ static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
     }
 }
 
-static void monitor_qmp_cleanup_queue_and_resume(MonitorQMP *mon)
+static void monitor_qmp_cleanup_queue_and_resume(void *opaque)
 {
+    MonitorQMP *mon = opaque;
     QEMU_LOCK_GUARD(&mon->qmp_queue_lock);
 
     /*
@@ -453,8 +454,18 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
          * backend is still open.  For example, when the backend is
          * stdio, it's possible that stdout is still open when stdin
          * is closed.
+         *
+         * monitor_qmp_cleanup_queue_and_resume must not run in main
+         * context, as it acquires a qmp_queue_lock, which is held by
+         * the dispatcher coroutine during a reschedule/yield to the
+         * main context, and could thus lead to a deadlock.
+         *
+         * This is safe to delay, since the dispatcher also moves
+         * back to the iohandler context before reacquiring any
+         * queue locks.
          */
-        monitor_qmp_cleanup_queue_and_resume(mon);
+        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
+                                monitor_qmp_cleanup_queue_and_resume, mon);
         json_message_parser_destroy(&mon->parser);
         json_message_parser_init(&mon->parser, handle_qmp_command,
                                  mon, NULL);
-- 
2.20.1





* Re: [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
  2021-03-18 13:35 [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB Stefan Reiter
@ 2021-03-22 11:08 ` Wolfgang Bumiller
  2021-03-22 15:26   ` Stefan Reiter
  2021-03-26 14:48   ` Markus Armbruster
  0 siblings, 2 replies; 5+ messages in thread
From: Wolfgang Bumiller @ 2021-03-22 11:08 UTC (permalink / raw)
  To: Stefan Reiter
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Markus Armbruster,
	Thomas Lamprecht

On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote:
> If OOB is disabled, events received in monitor_qmp_event will be handled
> in the main context. Thus, we must not acquire a qmp_queue_lock there,
> as the dispatcher coroutine holds one over a yield point, where it
> expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
> event is received just then, it can race and block the main thread by
> waiting on the queue lock.
> 
> Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
> thread, so the main thread can always make progress during the
> reschedule.
> 
> The delaying of the cleanup is safe, since the dispatcher always moves
> back to the iothread afterward, and thus the cleanup will happen before
> it gets to its next iteration.
> 
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---

This is a tough one. It *may* be fine, but I wonder if we can approach
this differently:

From what I can gather we have the following call stacks & contexts:

Guarded lock (lock & release):
  * monitor_qmp_cleanup_queue_and_resume
    by monitor_qmp_event
    by file handler (from I/O loop)
    ^ iohandler_context (assuming that's where the file handling happens...)
    (after this patch as BH though)

  * handle_qmp_command
    a) by the json parser (which is also re-initialized by
       monitor_qmp_event btw., haven't checked if that can also
       "trigger" its methods immediately)
    b) by monitor_qmp_read
    by file handler (from I/O loop)
    ^ iohandler_context

Lock-"returning":
  * monitor_qmp_requests_pop_any_with_lock
    by coroutine_fn monitor_qmp_dispatcher_co
    ^ iohandler_context

Lock-releasing:
  * coroutine_fn monitor_qmp_dispatcher_co
    ^ qemu_aio_context

The only *weird* thing that immediately pops out here is
`monitor_qmp_requests_pop_any_with_lock()` keeping a lock while
switching contexts.
This is done in order to allow `AIO_WAIT_WHILE` to work while making
progress on the events, but do we actually already need to be in this
context for the OOB `monitor_resume()` call or can we defer the context
switch to after having done that and released the lock?
`monitor_resume()` itself seems to simply schedule a BH which should
work regardless if I'm not mistaken. There's also a
`readline_show_prompt()` call, but that *looks* harmless?
`monitor_resume()` is also called without the lock later on, so even if
it needs to be in this context at that point for whatever reason, does
it need the lock?
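
To spell out the part that worries me, the dispatcher currently does roughly
this (simplified from memory, not the literal code):

  /* coroutine_fn monitor_qmp_dispatcher_co, simplified sketch */
  req_obj = monitor_qmp_requests_pop_any_with_lock();
  /* returns with the corresponding mon->qmp_queue_lock still held */

  /*
   * Reschedule from iohandler_ctx into qemu_aio_context so the command
   * handler can make progress under AIO_WAIT_WHILE().
   */
  aio_co_schedule(qemu_get_aio_context(), qmp_dispatcher_co);
  qemu_coroutine_yield();
  /*
   * The queue lock is held across this yield.  If the main loop handles
   * CHR_EVENT_CLOSED right now and tries to take the same lock, it blocks,
   * and nothing ever re-enters the coroutine.
   */

  /* ... monitor_resume() / unlock / actual dispatch follow here ... */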




* Re: [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
  2021-03-22 11:08 ` Wolfgang Bumiller
@ 2021-03-22 15:26   ` Stefan Reiter
  2021-03-26 14:48   ` Markus Armbruster
  1 sibling, 0 replies; 5+ messages in thread
From: Stefan Reiter @ 2021-03-22 15:26 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Markus Armbruster,
	Thomas Lamprecht

On 3/22/21 12:08 PM, Wolfgang Bumiller wrote:
> On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote:
>> If OOB is disabled, events received in monitor_qmp_event will be handled
>> in the main context. Thus, we must not acquire a qmp_queue_lock there,
>> as the dispatcher coroutine holds one over a yield point, where it
>> expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
>> event is received just then, it can race and block the main thread by
>> waiting on the queue lock.
>>
>> Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
>> thread, so the main thread can always make progress during the
>> reschedule.
>>
>> The delaying of the cleanup is safe, since the dispatcher always moves
>> back to the iothread afterward, and thus the cleanup will happen before
>> it gets to its next iteration.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
> 
> This is a tough one. It *may* be fine, but I wonder if we can approach
> this differently:
> 
>  From what I can gather we have the following call stacks & contexts:
> 
> Guarded lock (lock & release):
>    * monitor_qmp_cleanup_queue_and_resume
>      by monitor_qmp_event
>      by file handler (from I/O loop)
>      ^ iohandler_context (assuming that's where the file handling happens...)
>      (after this patch as BH though)
> 
>    * handle_qmp_command
>      a) by the json parser (which is also re-initialized by
>         monitor_qmp_event btw., haven't checked if that can also
>         "trigger" its methods immediately)
>      b) by monitor_qmp_read
>      by file handler (from I/O loop)
>      ^ iohandler_context
> 
> Lock-"returning":
>    * monitor_qmp_requests_pop_any_with_lock
>      by coroutine_fn monitor_qmp_dispatcher_co
>      ^ iohandler_context
> 
> Lock-releasing:
>    * coroutine_fn monitor_qmp_dispatcher_co
>      ^ qemu_aio_context
> 
> The only *weird* thing that immediately pops out here is
> `monitor_qmp_requests_pop_any_with_lock()` keeping a lock while
> switching contexts.

monitor_qmp_dispatcher_co? _pop_any_ doesn't switch contexts...

But yes, that is weird, as I mentioned in my original mail too.

> This is done in order to allow `AIO_WAIT_WHILE` to work while making
> progress on the events, but do we actually already need to be in this
> context for the OOB `monitor_resume()` call or can we defer the context
> switch to after having done that and released the lock?
> `monitor_resume()` itself seems to simply schedule a BH which should
> work regardless if I'm not mistaken. There's also a
> `readline_show_prompt()` call, but that *looks* harmless?

The BH should indeed be harmless, since we don't schedule it on 
qemu_get_current_aio_context, and we can ignore the readline_show_prompt 
call here since it's guarded by "!monitor_is_qmp(mon)".

> `monitor_resume()` is also called without the lock later on, so even if
> it needs to be in this context at that point for whatever reason, does
> it need the lock?
> 

It doesn't access the queue, so I don't see why it'd need the lock. And 
as you said, it is already called without the lock later on; in fact, 
before commit 88daf0996c ("qmp: Resume OOB-enabled monitor before 
processing the request") it was always called that way.

I'll cobble together a v2 with this in mind.
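
Roughly what I have in mind (untested sketch, the details may still change 
for the actual v2): compute need_resume, drop the lock and call 
monitor_resume() while still in iohandler_ctx, and only then move over to 
qemu_aio_context:

  req_obj = monitor_qmp_requests_pop_any_with_lock();
  mon = req_obj->mon;

  need_resume = !qmp_oob_enabled(mon) ||
      mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
  qemu_mutex_unlock(&mon->qmp_queue_lock);

  if (need_resume) {
      /* pairs with the monitor_suspend() in handle_qmp_command() */
      monitor_resume(&mon->common);
  }

  /* only now reschedule to qemu_aio_context, without holding any lock */
  aio_co_schedule(qemu_get_aio_context(), qmp_dispatcher_co);
  qemu_coroutine_yield();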




* Re: [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
  2021-03-22 11:08 ` Wolfgang Bumiller
  2021-03-22 15:26   ` Stefan Reiter
@ 2021-03-26 14:48   ` Markus Armbruster
  2021-03-29  9:49     ` Stefan Reiter
  1 sibling, 1 reply; 5+ messages in thread
From: Markus Armbruster @ 2021-03-26 14:48 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Kevin Wolf, Paolo Bonzini, Stefan Reiter, qemu-devel, Thomas Lamprecht

Wolfgang Bumiller <w.bumiller@proxmox.com> writes:

> On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote:
>> If OOB is disabled, events received in monitor_qmp_event will be handled
>> in the main context. Thus, we must not acquire a qmp_queue_lock there,
>> as the dispatcher coroutine holds one over a yield point, where it
>> expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
>> event is received just then, it can race and block the main thread by
>> waiting on the queue lock.
>> 
>> Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
>> thread, so the main thread can always make progress during the
>> reschedule.
>> 
>> The delaying of the cleanup is safe, since the dispatcher always moves
>> back to the iothread afterward, and thus the cleanup will happen before
>> it gets to its next iteration.
>> 
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>
> This is a tough one. It *may* be fine, but I wonder if we can approach
> this differently:

You guys make my head hurt.

I understand we're talking about a bug.  Is it a recent regression, or
an older bug?  How badly does the bug affect users?

I'm about to vanish for my Easter break...  If the bug must be fixed for
6.0, just waiting for me to come back seems unadvisable.

[...]




* Re: [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
  2021-03-26 14:48   ` Markus Armbruster
@ 2021-03-29  9:49     ` Stefan Reiter
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Reiter @ 2021-03-29  9:49 UTC (permalink / raw)
  To: Markus Armbruster, Wolfgang Bumiller
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Thomas Lamprecht

On 3/26/21 3:48 PM, Markus Armbruster wrote:
> Wolfgang Bumiller <w.bumiller@proxmox.com> writes:
> 
>> On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote:
>>> If OOB is disabled, events received in monitor_qmp_event will be handled
>>> in the main context. Thus, we must not acquire a qmp_queue_lock there,
>>> as the dispatcher coroutine holds one over a yield point, where it
>>> expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
>>> event is received just then, it can race and block the main thread by
>>> waiting on the queue lock.
>>>
>>> Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
>>> thread, so the main thread can always make progress during the
>>> reschedule.
>>>
>>> The delaying of the cleanup is safe, since the dispatcher always moves
>>> back to the iothread afterward, and thus the cleanup will happen before
>>> it gets to its next iteration.
>>>
>>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>>> ---
>>
>> This is a tough one. It *may* be fine, but I wonder if we can approach
>> this differently:
> 
> You guys make my head hurt.
> 

That makes three of us, I think.

> I understand we're talking about a bug.  Is it a recent regression, or
> an older bug?  How badly does the bug affect users?
> 

It's a regression introduced with the coroutinization of QMP in 5.2. It 
only occurs when OOB is disabled - in our downstream we have it disabled 
unconditionally, as it caused some issues in the past.

It affected quite a lot of our users, some when the host was under CPU 
load, some seemingly at random. When it happened, it usually hit multiple 
VMs at once, hanging them completely.

Just for reference, our forum has the full story:
https://forum.proxmox.com/threads/all-vms-locking-up-after-latest-pve-update.85397/

> I'm about to vanish for my Easter break...  If the bug must be fixed for
> 6.0, just waiting for me to come back seems unadvisable.
> 

Since it doesn't happen with OOB (so, by default), I don't think it's 
that urgent.

BTW, I've sent a v2 as well:
https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07590.html

We have shipped that one to our users for now, with mostly good success, 
though there have been a few reports that something still hangs - which 
could be people failing to upgrade, or some other issue still unsolved...

And happy easter break :)



