On 9/12/19 6:31 AM, Kevin Wolf wrote:
>>
>> Yes, I think locking the context during the "if (exp->blk) {" block at
>> nbd/server.c:1646 should do the trick.

That line number has moved over time; which function are you referring
to?

> We need to be careful to avoid locking things twice, so maybe
> nbd_export_put() is already too deep inside the NBD server.
>
> Its callers are:
>
> * qmp_nbd_server_add(). Like all other QMP handlers in blockdev-nbd.c it
>   neglects to lock the AioContext, but it should do so. The lock is not
>   only needed for the nbd_export_put() call, but even before.
>
> * nbd_export_close(), which in turn is called from:
>   * nbd_eject_notifier(): run in the main thread, not locked
>   * nbd_export_remove():
>     * qmp_nbd_server_remove(): see above
>   * nbd_export_close_all():
>     * bdrv_close_all()
>     * qmp_nbd_server_stop()

Even weirder: nbd_export_put() calls nbd_export_close(), and
nbd_export_close() calls nbd_export_put(). The mutual recursion is
mind-numbing, and the fact that we use get/put instead of ref/unref like
most other qemu code does not make it any easier to reason about.

> There are also calls from qemu-nbd, but these can be ignored because we
> don't have iothreads there.
>
> I think the cleanest would be to take the lock in the outermost callers,
> i.e. all QMP handlers that deal with a specific export, in the eject
> notifier and in nbd_export_close_all().

Okay, I'm trying that; a rough sketch of the eject-notifier piece is at
the end of this mail. (I already tried grabbing the aio_context in
nbd_export_close(), but as you predicted, that deadlocked when a nested
call encountered the lock already taken by an outer call.)

>> On the other hand, I wonder if there is any situation in which calling
>> blk_unref() without locking the context could be safe. If there isn't
>> any, perhaps we should assert that the lock is held if blk->ctx != NULL
>> to catch this kind of bug earlier?
>
> blk_unref() must be called from the main thread, and if the BlockBackend
> to be unreferenced is not in the main AioContext, the lock must be held.
>
> I'm not sure how to assert that locks are held, though. I once looked
> for a way to do this, but it involved either looking at the internal
> state of pthreads mutexes or hacking up QemuMutex with debug state.

Even if we can only test that in a debug build but not during normal
builds, could any of our CI builds set up that configuration? (A second
sketch of what such debug state might look like is below as well.)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
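P.S. The eject-notifier sketch mentioned above. Completely untested, and
it assumes exp->ctx is still the field that tracks the export's current
AioContext; the idea is just to bracket the outermost caller:

/* nbd/server.c -- untested sketch */
static void nbd_eject_notifier(Notifier *n, void *data)
{
    NBDExport *exp = container_of(n, NBDExport, eject_notifier);
    AioContext *aio_context = exp->ctx;

    /*
     * We run in the main thread, but the export's BlockBackend may live
     * in an iothread; take its context so that the blk_unref() reachable
     * via nbd_export_close() -> nbd_export_put() happens under the lock.
     */
    aio_context_acquire(aio_context);
    nbd_export_close(exp);
    aio_context_release(aio_context);
}

The QMP handlers in blockdev-nbd.c would presumably get the same
acquire/release bracket around their nbd_export_remove()/nbd_export_put()
calls, per your list above.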
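P.P.S. For the "assert the lock is held" idea, here is the kind of
debug-only state you mention hacking into QemuMutex. None of the names
below exist today (CONFIG_MUTEX_OWNER, the owner/owned fields,
qemu_mutex_assert_locked() are made up for illustration), and IIRC the
AioContext lock is actually a QemuRecMutex, which would need the same
treatment; but it shows the shape of the thing:

/* include/qemu/thread-posix.h -- sketch only */
struct QemuMutex {
    pthread_mutex_t lock;
#ifdef CONFIG_MUTEX_OWNER        /* hypothetical configure switch */
    QemuThread owner;            /* recorded by qemu_mutex_lock() */
    bool owned;                  /* cleared by qemu_mutex_unlock() */
#endif
    /* existing fields unchanged */
};

/* include/qemu/thread.h -- sketch only */
static inline void qemu_mutex_assert_locked(QemuMutex *mutex)
{
#ifdef CONFIG_MUTEX_OWNER
    assert(mutex->owned && qemu_thread_is_self(&mutex->owner));
#endif
}

blk_unref() (or a helper it calls) could then assert that the context
lock is held whenever the BlockBackend is not in the main AioContext, and
only a CI job built with that switch would pay the cost.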