On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
> > backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
> 
>   job_finalize_single
> 
>      job_completed_txn_abort (lock aio context)
> 
>      job_do_finalize
> 
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which cares to lock aio context..
> And on the same time, it directaly calls job_txn_apply(job->txn, job_finalize_single)
> without locking. Is it a bug?

Indeed, looks like a bug to me. In fact, that's the one causing the
issue that Dietmar initially reported.

In think the proper fix is drop the context acquisition/release that
in backup_clean that I added in 0abf2581717a19, as Stefan proposed,
and also acquire the context of "foreign" jobs at job_txn_apply, just
as job_completed_txn_abort does.

Thanks,
Sergio.

> And, even if job_do_finalize called always with locked context, where is guarantee that all
> context of all jobs in txn are locked?
> 
> Still, let's look through its callers.
> 
> 	  job_finalize
> 
>                qmp_block_job_finalize (lock aio context)
>                qmp_job_finalize (lock aio context)
>                test_cancel_concluded (doesn't lock, but it's a test)
> 
>           job_completed_txn_success
> 
>                job_completed
> 
>                     job_exit (lock aio context)
> 
>                     job_cancel
> 	 	
>                          blockdev_mark_auto_del (lock aio context)
> 
>                          job_user_cancel
> 
>                              qmp_block_job_cancel (locks context)
>                              qmp_job_cancel  (locks context)
> 
>                          job_cancel_err
> 
>                               job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback)
> 
>                                    replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it ?)
> 
>                                    replication_stop (locks context)
> 
>                                    drive_backup_abort (locks context)
> 
>                                    blockdev_backup_abort (locks context)
> 
>                                    job_cancel_sync_all (locks context)
> 
>                                    cancel_common (locks context)
> 
>                          test_* (I don't care)
> 
> > already acquires the job's context. The job's context is guaranteed to
> > be the same as the one used by backup_top via backup_job_create.
> > 
> > Since the previous logic effectively acquired the lock twice, this
> > broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
> > in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
> > once, thus deadlocking with the IO thread.
> > 
> > Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> 
> Just note, that this thing were recently touched by 0abf2581717a19 , so add Sergio (its author) to CC.
> 
> > ---
> > 
> > This is a fix for the issue discussed in this part of the thread:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
> > ...not the original problem (core dump) posted by Dietmar.
> > 
> > I've still seen it occasionally hang during a backup abort. I'm trying to figure
> > out why that happens, stack trace indicates a similar problem with the main
> > thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.
> > 
> >   block/backup.c | 4 ----
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/block/backup.c b/block/backup.c
> > index 7430ca5883..a7a7dcaf4c 100644
> > --- a/block/backup.c
> > +++ b/block/backup.c
> > @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
> >   static void backup_clean(Job *job)
> >   {
> >       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> > -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
> > -
> > -    aio_context_acquire(aio_context);
> >       bdrv_backup_top_drop(s->backup_top);
> > -    aio_context_release(aio_context);
> >   }
> >   void backup_do_checkpoint(BlockJob *job, Error **errp)
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
>