From: Stefan Reiter <s.reiter@proxmox.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, slp@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, jsnow@redhat.com, dietmar@proxmox.com
Subject: Re: [PATCH] backup: don't acquire aio_context in backup_clean
Date: Thu, 26 Mar 2020 10:43:47 +0100	[thread overview]
Message-ID: <1d1984b3-14f5-5a17-b477-d70561f75e8f@proxmox.com> (raw)
In-Reply-To: <2b288000-7c09-ba31-82a7-02c5ed55f4e7@virtuozzo.com>

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
>> backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
> 
>    job_finalize_single
> 
>       job_completed_txn_abort (lock aio context)
> 
>       job_do_finalize
> 
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context.
> At the same time, it directly calls job_txn_apply(job->txn,
> job_finalize_single) without locking. Is it a bug?
> 

I think, as you say, the idea is that job_do_finalize is always called 
with the lock acquired. That's why job_completed_txn_abort takes care to 
release the lock on the "outer_ctx" (as it calls it) before taking each 
job's context in turn, and reacquires it at the end.
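
From memory (simplified, so don't take this as matching job.c line for
line), the pattern in job_completed_txn_abort looks roughly like this:

/* Sketch only: the caller holds the lock of job->aio_context ("outer_ctx"). */
static void job_completed_txn_abort(Job *job)
{
    AioContext *outer_ctx = job->aio_context;
    JobTxn *txn = job->txn;
    Job *other_job;

    /* Drop the caller's lock so each job's context can be taken in turn
     * without nesting on top of it. */
    aio_context_release(outer_ctx);

    QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
        AioContext *ctx = other_job->aio_context;
        aio_context_acquire(ctx);
        /* cancel/finalize the job while holding its context */
        aio_context_release(ctx);
    }

    /* Give the caller back the lock it expects to still hold. */
    aio_context_acquire(outer_ctx);
}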

> And, even if job_do_finalize is always called with the context locked,
> where is the guarantee that the contexts of all jobs in the txn are
> locked?
> 

I also don't see anything that guarantees that... I guess it could be 
adapted to handle locks like job_completed_txn_abort does?
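
Something like this, perhaps (untested sketch, not what job.c does today;
the real job_txn_apply just calls fn() on every job with no locking):

static void job_txn_apply_locked(JobTxn *txn, void fn(Job *))
{
    Job *other_job, *next;

    QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
        AioContext *ctx = other_job->aio_context;

        aio_context_acquire(ctx);
        fn(other_job);
        aio_context_release(ctx);
    }
}

Of course, if the caller already holds one of those contexts, we'd be
nesting the acquisition again, which is exactly the kind of thing the
patch below tries to avoid, so it would need the same release/reacquire
dance as job_completed_txn_abort.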

Haven't looked into transactions too much, but does it even make sense 
to have jobs in different contexts in one transaction?

> Still, let's look through its callers.
> 
>        job_finalize
> 
>                 qmp_block_job_finalize (lock aio context)
>                 qmp_job_finalize (lock aio context)
>                 test_cancel_concluded (doesn't lock, but it's a test)
> 
>            job_completed_txn_success
> 
>                 job_completed
> 
>                      job_exit (lock aio context)
> 
>                      job_cancel
> 
>                           blockdev_mark_auto_del (lock aio context)
> 
>                           job_user_cancel
> 
>                               qmp_block_job_cancel (locks context)
>                               qmp_job_cancel  (locks context)
> 
>                           job_cancel_err
> 
>                                job_cancel_sync
>                                    (return job_finish_sync(job, &job_cancel_err, NULL);
>                                     job_finish_sync just calls callback)
> 
>                                     replication_close (it's .bdrv_close..
>                                         Hmm, I don't see context locking,
>                                         where is it?)

Hm, I don't see it either. This might indeed be a way to get to job_clean
without the lock held.

I don't have any testing set up for replication at the moment, but if you
believe this would be correct, I can send a patch for that as well (just
acquire the lock in replication_close before job_cancel_async?).
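
Purely as a sketch of what I mean (untested, and the names are
illustrative rather than the actual block/replication.c fields):

/* Hypothetical: take the job's AioContext around the cancel so that the
 * clean/finalize handlers run with the lock held. "rep_job" stands for
 * whichever job pointer replication keeps for its backup job. */
AioContext *ctx = rep_job->aio_context;

aio_context_acquire(ctx);
job_cancel_sync(rep_job);   /* or whichever cancel entry point is used */
aio_context_release(ctx);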

> 
>                                     replication_stop (locks context)
> 
>                                     drive_backup_abort (locks context)
> 
>                                     blockdev_backup_abort (locks context)
> 
>                                     job_cancel_sync_all (locks context)
> 
>                                     cancel_common (locks context)
> 
>                           test_* (I don't care)
> 

To clarify: aside from the commit message, the patch itself does not 
appear to be wrong? All paths (apart from replication_close mentioned 
above) guarantee that the job's AioContext lock is already held.
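
To restate the deadlock from the commit message quoted below as a
pseudo-trace (illustration only, not compilable as-is):

/* aio_context_acquire() nests, so acquiring twice "succeeds"... */
aio_context_acquire(ctx);              /* taken by job_exit()               */
aio_context_acquire(ctx);              /* taken again by old backup_clean() */
bdrv_backup_top_drop(s->backup_top);   /* drain: BDRV_POLL_WHILE releases
                                        * ctx only once, so it is still
                                        * held once; the IO thread can't
                                        * acquire it, in-flight requests
                                        * never complete -> deadlock        */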

>> already acquires the job's context. The job's context is guaranteed to
>> be the same as the one used by backup_top via backup_job_create.
>>
>> Since the previous logic effectively acquired the lock twice, this
>> broke cleanup of backups for disks using IO threads, since the
>> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
>> only release the lock once, thus deadlocking with the IO thread.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> 
> Just note that this code was recently touched by 0abf2581717a19, so I'm
> adding Sergio (its author) to CC.
> 
>> ---
>>
>> This is a fix for the issue discussed in this part of the thread:
>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>> ...not the original problem (core dump) posted by Dietmar.
>>
>> I've still seen it occasionally hang during a backup abort. I'm trying
>> to figure out why that happens; the stack trace indicates a similar
>> problem with the main thread hanging at bdrv_do_drained_begin, though I
>> have no clue why as of yet.
>>
>>   block/backup.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 7430ca5883..a7a7dcaf4c 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>   static void backup_clean(Job *job)
>>   {
>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>> -
>> -    aio_context_acquire(aio_context);
>>       bdrv_backup_top_drop(s->backup_top);
>> -    aio_context_release(aio_context);
>>   }
>>   void backup_do_checkpoint(BlockJob *job, Error **errp)
>>
> 
> 



Thread overview: 23+ messages
2020-03-24 11:13 backup transaction with io-thread core dumps Dietmar Maurer
2020-03-24 13:30 ` Dietmar Maurer
2020-03-24 13:33   ` Dietmar Maurer
2020-03-24 13:44     ` John Snow
2020-03-24 14:00       ` Dietmar Maurer
2020-03-24 13:47     ` Max Reitz
2020-03-24 14:02       ` Dietmar Maurer
2020-03-24 16:49       ` Dietmar Maurer
2020-03-25 11:40         ` Stefan Reiter
2020-03-25 12:23           ` Vladimir Sementsov-Ogievskiy
2020-03-25 15:50             ` [PATCH] backup: don't acquire aio_context in backup_clean Stefan Reiter
2020-03-26  5:54               ` Vladimir Sementsov-Ogievskiy
2020-03-26  9:43                 ` Stefan Reiter [this message]
2020-03-26 12:46                   ` Vladimir Sementsov-Ogievskiy
2020-03-26 11:53                 ` Sergio Lopez
2020-03-25  8:13       ` backup transaction with io-thread core dumps Sergio Lopez
2020-03-25 11:46         ` Sergio Lopez
2020-03-25 12:29           ` Dietmar Maurer
2020-03-25 12:39             ` Sergio Lopez
2020-03-25 15:40               ` Dietmar Maurer
2020-03-26  7:50                 ` Sergio Lopez
2020-03-26  8:14                   ` Dietmar Maurer
2020-03-26  9:23                     ` Dietmar Maurer
