Kevin Wolf writes:

> Am 11.09.2019 um 23:33 hat Eric Blake geschrieben:
>> On 9/11/19 12:21 PM, Eric Blake wrote:
>> > On 9/11/19 11:15 AM, Sergio Lopez wrote:
>> >> On creation, the export's AioContext is set to the same one as the
>> >> BlockBackend, while the AioContext in the client QIOChannel is left
>> >> untouched.
>> >>
>> >> As a result, when using data-plane, nbd_client_receive_next_request()
>> >> schedules coroutines in the IOThread AioContext, while the client's
>> >> QIOChannel is serviced from the main_loop, potentially triggering the
>> >> assertion at qio_channel_restart_[read|write].
>> >>
>> >> To fix this, as soon we have the export corresponding to the client,
>> >> we call qio_channel_attach_aio_context() to attach the QIOChannel
>> >> context to the export's AioContext. This matches with the logic in
>> >> blk_aio_attached().
>> >>
>> >> RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1748253
>> >> Signed-off-by: Sergio Lopez
>> >> ---
>> >>  nbd/server.c | 2 ++
>> >>  1 file changed, 2 insertions(+)
>> >
>> > I'd like a second opinion from Kevin, but the description makes sense to
>> > me. I'm happy to queue this through my NBD tree.
>> >
>> > Reviewed-by: Eric Blake
>>
>> I tried to test this patch, but even with it applied, I still got an
>> aio-context crasher by attempting an nbd-server-start, nbd-server-add,
>> nbd-server-stop (intentionally skipping the nbd-server-remove step) on a
>> domain using iothreads, with a backtrace of:
>>
>> #0  0x00007ff09d070e35 in raise () from target:/lib64/libc.so.6
>> #1  0x00007ff09d05b895 in abort () from target:/lib64/libc.so.6
>> #2  0x000055dd03b9ab86 in error_exit (err=1, msg=0x55dd03d59fb0
>>     <__func__.15769> "qemu_mutex_unlock_impl") at util/qemu-thread-posix.c:36
>> #3  0x000055dd03b9adcf in qemu_mutex_unlock_impl (mutex=0x55dd062d5090,
>>     file=0x55dd03d59041 "util/async.c", line=523) at util/qemu-thread-posix.c:96
>> #4  0x000055dd03b93433 in aio_context_release (ctx=0x55dd062d5030)
>>     at util/async.c:523
>> #5  0x000055dd03ac421b in bdrv_do_drained_begin (bs=0x55dd0673a2d0,
>>     recursive=false, parent=0x0, ignore_bds_parents=false, poll=true)
>>     at block/io.c:428
>> #6  0x000055dd03ac4299 in bdrv_drained_begin (bs=0x55dd0673a2d0)
>>     at block/io.c:434
>> #7  0x000055dd03aafb54 in blk_drain (blk=0x55dd06a3ec40)
>>     at block/block-backend.c:1605
>> #8  0x000055dd03aae054 in blk_remove_bs (blk=0x55dd06a3ec40)
>>     at block/block-backend.c:800
>> #9  0x000055dd03aad54a in blk_delete (blk=0x55dd06a3ec40)
>>     at block/block-backend.c:420
>> #10 0x000055dd03aad7d6 in blk_unref (blk=0x55dd06a3ec40)
>>     at block/block-backend.c:475
>> #11 0x000055dd03aecb68 in nbd_export_put (exp=0x55dd0726f920)
>>     at nbd/server.c:1666
>> #12 0x000055dd03aec8fe in nbd_export_close (exp=0x55dd0726f920)
>>     at nbd/server.c:1616
>> #13 0x000055dd03aecbf1 in nbd_export_close_all () at nbd/server.c:1689
>> #14 0x000055dd03748845 in qmp_nbd_server_stop (errp=0x7ffcdf3cb4e8)
>>     at blockdev-nbd.c:233
>> ...
>>
>> Does that sound familiar to what you were seeing? Does it mean we
>> missed another spot where the context is set incorrectly?
>
> I think nbd_export_close_all() or one of the NBD functions called by it
> needs to take the AioContext lock of the associated BlockBackend.
>
> The crash is because AIO_POLL_WHILE() wants to temporarily drop the lock
> that we're not even holding.

Yes, I think locking the context during the "if (exp->blk) {" block at
nbd/server.c:1646 should do the trick.
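Something along these lines, as an untested sketch (assuming the cleanup
in nbd_export_put() is roughly the "if (exp->blk)" block mentioned above,
and using the existing blk_get_aio_context()/aio_context_acquire()/
aio_context_release() API; the exact surrounding code may differ):

    /* Untested sketch: take the BlockBackend's AioContext lock around
     * the final cleanup in nbd_export_put(), so that the drained
     * section entered via blk_remove_bs() can legally drop and
     * re-acquire it while polling. */
    if (exp->blk) {
        AioContext *ctx = blk_get_aio_context(exp->blk);

        aio_context_acquire(ctx);
        /* ... any existing cleanup of the export stays here ... */
        blk_unref(exp->blk);
        exp->blk = NULL;
        aio_context_release(ctx);
    }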
On the other hand, I wonder if there is any situation in which calling
blk_unref() without locking the context could be safe. If there isn't
any, perhaps we should assert that the lock is held whenever
blk->ctx != NULL, to catch this kind of bug earlier?
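A hypothetical sketch of what I mean (note that aio_context_lock_held()
does not exist today and would have to be added, e.g. by recording the
owner of the AioContext's recursive mutex):

    /* Hypothetical sketch: aio_context_lock_held() is not an existing
     * helper.  The point is to catch callers that drop a BlockBackend
     * bound to an AioContext without holding that context's lock. */
    void blk_unref(BlockBackend *blk)
    {
        if (blk && blk->ctx) {
            assert(aio_context_lock_held(blk->ctx));
        }
        /* ... existing refcount handling ... */
    }

Sergio.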