Date: Mon, 10 Apr 2017 09:43:56 +0800
From: Fam Zheng
Message-ID: <20170410014356.GA15038@lemon>
References: <20170407065414.9143-1-famz@redhat.com>
 <20170407065414.9143-6-famz@redhat.com>
 <20170407151445.GF4716@noname.redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 5/6] coroutine: Explicitly specify AioContext when entering coroutine
To: Paolo Bonzini
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, qemu-block@nongnu.org, Max Reitz

On Mon, 04/10 08:56, Paolo Bonzini wrote:
> On 07/04/2017 23:14, Kevin Wolf wrote:
> > One part of the reason is that BDRV_POLL_WHILE() drops the lock
> > since commit c9d1a561, so just calling aio_context_acquire() doesn't
> > protect you any more if there is any chance that a nested function
> > calls BDRV_POLL_WHILE().
> 
> Hi guys, sorry for the late reply.
> 
> There wasn't actually much that changed in commit c9d1a561. Everything
> that could break was also broken before.
> 
> With commit c9d1a561 the logic is
> 
>     aio_context_acquire
>         prepare I/O
>     aio_context_release
>     poll main AioContext
>         aio_context_acquire
>         aio_poll secondary AioContext
>             complete I/O
>     <-- bdrv_wakeup
>         aio_context_acquire
>         aio_context_release
>     Sync I/O is complete
> 
> while before it was
> 
>     aio_context_acquire
>         prepare I/O
>         aio_poll secondary AioContext
>             complete I/O
>     Sync I/O is complete
> 
> The code that can run in "aio_poll secondary AioContext" is the same
> in both cases.

This is not completely true. Before this change, a blocked
aio_context_acquire() in virtio_scsi_data_plane_handle_cmd() wouldn't
be woken up during the sync I/O. Of course the aio context pushdown
(9d4566544) happened after c9d1a561, so this is a compound consequence
of the two.

Also, since 997235485, bdrv_drain_poll doesn't poll even once if
!bdrv_requests_pending(bs), which made the situation a bit worse: had
we run the BDRV_POLL_WHILE body at least once even when the condition
is false, virtio_scsi_data_plane_handle_cmd could have returned before
bdrv_drained_begin() returns.
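To make the lock-dropping hazard concrete, a minimal sketch of the
caller side (illustrative only, not code from the tree; do_io() and
busy are made-up placeholders for whatever the caller really does and
polls on):

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    do_io(bs);                  /* touches state "protected" by ctx */
    BDRV_POLL_WHILE(bs, busy);  /* since c9d1a561 this releases ctx
                                 * around the poll of the main context,
                                 * so an iothread blocked in
                                 * aio_context_acquire(), e.g. in
                                 * virtio_scsi_data_plane_handle_cmd(),
                                 * can now run in the middle of the
                                 * sync I/O */
    aio_context_release(ctx);
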
> The difference is that, after c9d1a561, it always runs in one thread,
> which eliminates the need for RFifoLock's contention callbacks (and a
> bunch of bdrv_drain_all deadlocks that arose from the combination of
> contention callbacks and recursive locking).
> 
> The patch that introduced the bug is the one that introduced the
> "home AioContext" co->ctx for coroutines, because now you have
> 
>     aio_context_acquire
>         prepare I/O
>     aio_context_release
>     poll main AioContext
>         aio_context_acquire
>         aio_poll secondary AioContext
>             aio_co_wake
>                 complete I/O
>                 bdrv_wakeup
>         aio_context_acquire
>         aio_context_release
>     Sync I/O is complete
> 
> and if "complete I/O" causes anything to happen in the iothread, bad
> things happen.
> 
> I think Fam's analysis is right. This patch will hopefully be
> reverted in 2.10, but right now it's the right thing to do. However,
> I don't like very much adding an argument to qemu_coroutine_enter
> because:
> 
> 1) coroutines can be used without AioContexts (CoMutex has a
> dependency on aio_co_wake, but it is hidden behind the API, and if
> you have a single AioContext you could just define aio_co_wake to be
> qemu_coroutine_enter).
> 
> 2) the value that is assigned to co->ctx is not the AioContext in
> which the coroutine is running.
> 
> It would be better to split aio_co_wake so that aio_co_wake is
> 
>     /* Read coroutine before co->ctx.  Matches smp_wmb in
>      * qemu_coroutine_enter.
>      */
>     smp_read_barrier_depends();
>     ctx = atomic_read(&co->ctx);
>     aio_co_enter(ctx, co);
> 
> and aio_co_enter is the rest of the logic.

Is aio_co_enter what solves 1) for you too? If so I can add it in v3.

Fam

> Then the coroutine code always runs in the right thread, which
> obviously is thread-safe.
> 
> Paolo
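For concreteness, the other half of the split might look roughly like
this (a sketch of one reading of the proposal above, not necessarily
what v3 will contain; the recursive-entry case, entering from inside
another coroutine, is deliberately elided):

    void aio_co_enter(AioContext *ctx, Coroutine *co)
    {
        if (ctx != qemu_get_current_aio_context()) {
            /* Wrong thread: bounce the wakeup over to the coroutine's
             * home context so the coroutine always runs there.
             */
            aio_co_schedule(ctx, co);
            return;
        }

        /* Already in the right thread: enter directly, holding the
         * AioContext lock around the coroutine body.
         */
        aio_context_acquire(ctx);
        qemu_coroutine_enter(co);
        aio_context_release(ctx);
    }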