From: Roman Kagan <rvkagan@yandex-team.ru>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: kwolf@redhat.com, qemu-block@nongnu.org, qemu-devel@nongnu.org,
mreitz@redhat.com, den@openvz.org
Subject: Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay
Date: Wed, 3 Feb 2021 18:00:20 +0300 [thread overview]
Message-ID: <20210203150020.GE2786@rvkaganb.lan> (raw)
In-Reply-To: <e4387e05-4963-e335-3002-ddbbcf5aaf82@virtuozzo.com>
On Wed, Feb 03, 2021 at 05:44:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.02.2021 17:21, Roman Kagan wrote:
> > On Wed, Feb 03, 2021 at 04:10:41PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > 03.02.2021 13:53, Roman Kagan wrote:
> > > > On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > > > We pause the reconnect process during a drained section. So, if we
> > > > > have requests waiting for reconnect, we should cancel them; otherwise
> > > > > they deadlock the drained section.
> > > > >
> > > > > How to reproduce:
> > > > >
> > > > > 1. Create an image:
> > > > > qemu-img create -f qcow2 xx 100M
> > > > >
> > > > > 2. Start NBD server:
> > > > > qemu-nbd xx
> > > > >
> > > > > 3. Start vm with second nbd disk on node2, like this:
> > > > >
> > > > > ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
> > > > > file=/work/images/cent7.qcow2 -drive \
> > > > > driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60 \
> > > > > -vnc :0 -m 2G -enable-kvm -vga std
> > > > >
> > > > > 4. Access the vm through vnc (or some other way?), and check that NBD
> > > > > drive works:
> > > > >
> > > > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > >
> > > > > - the command should succeed.
> > > > >
> > > > > 5. Now, kill the nbd server, and run dd in the guest again:
> > > > >
> > > > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > >
> > > > > Now QEMU is trying to reconnect, and dd-generated requests are waiting
> > > > > for the connection (they will wait up to 60 seconds (see the
> > > > > reconnect-delay option above) and then fail). But sometimes the VM may
> > > > > hang completely in the deadlock. You may need to increase the
> > > > > reconnect-delay period to catch the deadlock.
> > > > >
> > > > > The VM doesn't respond because the drain deadlock happens in the CPU
> > > > > thread with the global mutex taken. That's not a good thing by itself
> > > > > and is not fixed by this commit (the proper way is to use iothreads).
> > > > > Still, this commit fixes the drain deadlock itself.
> > > > >
> > > > > Note: probably, we can instead continue to reconnect during the
> > > > > drained section. To achieve this, we may move negotiation to the
> > > > > connect thread to make it independent of the bs aio context. But
> > > > > extending the drained section doesn't seem good anyway. So, let's fix
> > > > > the bug in the simplest way for now.
> > > > >
> > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > > > ---
> > > > > block/nbd.c | 5 +++++
> > > > > 1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/block/nbd.c b/block/nbd.c
> > > > > index 9daf003bea..912ea27be7 100644
> > > > > --- a/block/nbd.c
> > > > > +++ b/block/nbd.c
> > > > > @@ -242,6 +242,11 @@ static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
> > > > > }
> > > > > nbd_co_establish_connection_cancel(bs, false);
> > > > > +
> > > > > + if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> > > > > + s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> > > > > + qemu_co_queue_restart_all(&s->free_sema);
> > > > > + }
> > > > > }
> > > > > static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
> > > >
> > > > This basically defeats the whole purpose of reconnect: if the nbd client
> > > > is trying to reconnect, drain effectively cancels that and makes all
> > > > in-flight requests complete with an error.
> > > >
> > > > I'm not suggesting to revert this patch (it's now in the tree as commit
> > > > 8c517de24a), because the deadlock is no better, but I'm afraid the only
> > > > real fix is to implement reconnect during the drain section. I'm still
> > > > trying to get my head around it so no patch yet, but I just wanted to
> > > > bring this up in case anybody beats me to it.
> > > >
> > >
> > >
> > > What do you mean by "reconnect during drained section"? Trying to
> > > establish the connection? Or keeping in-flight requests instead of
> > > cancelling them? We can't keep in-flight requests during a drained
> > > section, as that's the whole point of a drained section: no in-flight
> > > requests. So we'll have to wait for them at drain_begin (waiting up to
> > > reconnect-delay, which doesn't seem good and triggers the deadlock in a
> > > non-iothread environment), or just cancel them.
> > >
> > > Do you argue that waiting on drained-begin is somehow better than
> > > cancelling?
> >
> > Sorry I should have used more accurate wording to be clear.
> >
> > Yes, my point is that canceling the requests on entry to a drained
> > section is incorrect. And no, it doesn't matter if it can be long:
> > that's the price you pay for doing the drain. Think of reconnect as a
> > special case of a slow connection: if an nbd reply from the server is
> > delayed for whatever reason without dropping the connection, you have to
> > wait here, too. (In fact, reconnect is even slightly better here since
> > it has a configurable finite timeout while the message delay does not
> > AFAIK.)
> >
> > Does it make sense?
>
> Hmm, yes..
>
> But then we should fix the original deadlock some other way.
Exactly.
> Probably it will only be possible by using an iothread for the nbd node
> (I failed to find the original thread where someone said that iothreads
> are a solution). And then we can do the cancel in
> nbd_client_co_drain_begin() only if bs doesn't have a separate iothread.
Without this commit, I see qemu hang with nbd in an iothread, too. I'll
double-check if it's that very same deadlock or something else.
Thanks,
Roman.