On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote: > 25.03.2020 18:50, Stefan Reiter wrote: > > backup_clean is only ever called as a handler via job_exit, which > > Hmm.. I'm afraid it's not quite correct. > > job_clean > > job_finalize_single > > job_completed_txn_abort (lock aio context) > > job_do_finalize > > > Hmm. job_do_finalize calls job_completed_txn_abort, which cares to lock aio context.. > And on the same time, it directaly calls job_txn_apply(job->txn, job_finalize_single) > without locking. Is it a bug? Indeed, looks like a bug to me. In fact, that's the one causing the issue that Dietmar initially reported. In think the proper fix is drop the context acquisition/release that in backup_clean that I added in 0abf2581717a19, as Stefan proposed, and also acquire the context of "foreign" jobs at job_txn_apply, just as job_completed_txn_abort does. Thanks, Sergio. > And, even if job_do_finalize called always with locked context, where is guarantee that all > context of all jobs in txn are locked? > > Still, let's look through its callers. > > job_finalize > > qmp_block_job_finalize (lock aio context) > qmp_job_finalize (lock aio context) > test_cancel_concluded (doesn't lock, but it's a test) > > job_completed_txn_success > > job_completed > > job_exit (lock aio context) > > job_cancel > > blockdev_mark_auto_del (lock aio context) > > job_user_cancel > > qmp_block_job_cancel (locks context) > qmp_job_cancel (locks context) > > job_cancel_err > > job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback) > > replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it ?) > > replication_stop (locks context) > > drive_backup_abort (locks context) > > blockdev_backup_abort (locks context) > > job_cancel_sync_all (locks context) > > cancel_common (locks context) > > test_* (I don't care) > > > already acquires the job's context. The job's context is guaranteed to > > be the same as the one used by backup_top via backup_job_create. > > > > Since the previous logic effectively acquired the lock twice, this > > broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE > > in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock > > once, thus deadlocking with the IO thread. > > > > Signed-off-by: Stefan Reiter > > Just note, that this thing were recently touched by 0abf2581717a19 , so add Sergio (its author) to CC. > > > --- > > > > This is a fix for the issue discussed in this part of the thread: > > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html > > ...not the original problem (core dump) posted by Dietmar. > > > > I've still seen it occasionally hang during a backup abort. I'm trying to figure > > out why that happens, stack trace indicates a similar problem with the main > > thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet. > > > > block/backup.c | 4 ---- > > 1 file changed, 4 deletions(-) > > > > diff --git a/block/backup.c b/block/backup.c > > index 7430ca5883..a7a7dcaf4c 100644 > > --- a/block/backup.c > > +++ b/block/backup.c > > @@ -126,11 +126,7 @@ static void backup_abort(Job *job) > > static void backup_clean(Job *job) > > { > > BackupBlockJob *s = container_of(job, BackupBlockJob, common.job); > > - AioContext *aio_context = bdrv_get_aio_context(s->backup_top); > > - > > - aio_context_acquire(aio_context); > > bdrv_backup_top_drop(s->backup_top); > > - aio_context_release(aio_context); > > } > > void backup_do_checkpoint(BlockJob *job, Error **errp) > > > > > -- > Best regards, > Vladimir >