From: Stefan Reiter <s.reiter@proxmox.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, slp@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, jsnow@redhat.com, dietmar@proxmox.com
Subject: Re: [PATCH] backup: don't acquire aio_context in backup_clean
Date: Thu, 26 Mar 2020 10:43:47 +0100	[thread overview]
Message-ID: <1d1984b3-14f5-5a17-b477-d70561f75e8f@proxmox.com> (raw)
In-Reply-To: <2b288000-7c09-ba31-82a7-02c5ed55f4e7@virtuozzo.com>

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
>> backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
> 
>    job_finalize_single
> 
>       job_completed_txn_abort (lock aio context)
> 
>       job_do_finalize
> 
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context.
> At the same time, it directly calls job_txn_apply(job->txn,
> job_finalize_single) without locking. Is it a bug?
> 

I think, as you say, the idea is that job_do_finalize is always called 
with the lock acquired. That's why job_completed_txn_abort takes care to 
release the lock on the "outer_ctx" (as it calls it) before taking each 
job's context in turn, and reacquires it at the end.
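
From memory (simplified, so don't take this as matching job.c line for
line), the pattern in job_completed_txn_abort looks roughly like this:

/* Sketch only: the caller holds the lock of job->aio_context ("outer_ctx"). */
static void job_completed_txn_abort(Job *job)
{
    AioContext *outer_ctx = job->aio_context;
    JobTxn *txn = job->txn;
    Job *other_job;

    /* Drop the caller's lock so each job's context can be taken in turn
     * without nesting on top of it. */
    aio_context_release(outer_ctx);

    QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
        AioContext *ctx = other_job->aio_context;
        aio_context_acquire(ctx);
        /* cancel/finalize the job while holding its context */
        aio_context_release(ctx);
    }

    /* Give the caller back the lock it expects to still hold. */
    aio_context_acquire(outer_ctx);
}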

> And, even if job_do_finalize is always called with the context locked,
> where is the guarantee that the contexts of all jobs in the txn are
> locked?
> 

I also don't see anything that guarantees that... I guess it could be 
adapted to handle locks like job_completed_txn_abort does?
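
Something like this, perhaps (untested sketch, not what job.c does today;
the real job_txn_apply just calls fn() on every job with no locking):

static void job_txn_apply_locked(JobTxn *txn, void fn(Job *))
{
    Job *other_job, *next;

    QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
        AioContext *ctx = other_job->aio_context;

        aio_context_acquire(ctx);
        fn(other_job);
        aio_context_release(ctx);
    }
}

Of course, if the caller already holds one of those contexts, we'd be
nesting the acquisition again, which is exactly the kind of thing the
patch below tries to avoid, so it would need the same release/reacquire
dance as job_completed_txn_abort.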

Haven't looked into transactions too much, but does it even make sense 
to have jobs in different contexts in one transaction?

> Still, let's look through its callers.
> 
>        job_finalize
> 
>                 qmp_block_job_finalize (lock aio context)
>                 qmp_job_finalize (lock aio context)
>                 test_cancel_concluded (doesn't lock, but it's a test)
> 
>            job_completed_txn_success
> 
>                 job_completed
> 
>                      job_exit (lock aio context)
> 
>                      job_cancel
> 
>                           blockdev_mark_auto_del (lock aio context)
> 
>                           job_user_cancel
> 
>                               qmp_block_job_cancel (locks context)
>                               qmp_job_cancel  (locks context)
> 
>                           job_cancel_err
> 
>                                job_cancel_sync
>                                    (return job_finish_sync(job, &job_cancel_err, NULL);
>                                     job_finish_sync just calls callback)
> 
>                                     replication_close (it's .bdrv_close..
>                                         Hmm, I don't see context locking,
>                                         where is it?)

Hm, I don't see it either. This might indeed be a way to get to job_clean
without the lock held.

I don't have any testing set up for replication at the moment, but if you
believe this would be correct, I can send a patch for that as well (just
acquire the lock in replication_close before job_cancel_async?).
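
Purely as a sketch of what I mean (untested, and the names are
illustrative rather than the actual block/replication.c fields):

/* Hypothetical: take the job's AioContext around the cancel so that the
 * clean/finalize handlers run with the lock held. "rep_job" stands for
 * whichever job pointer replication keeps for its backup job. */
AioContext *ctx = rep_job->aio_context;

aio_context_acquire(ctx);
job_cancel_sync(rep_job);   /* or whichever cancel entry point is used */
aio_context_release(ctx);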

> 
>                                     replication_stop (locks context)
> 
>                                     drive_backup_abort (locks context)
> 
>                                     blockdev_backup_abort (locks context)
> 
>                                     job_cancel_sync_all (locks context)
> 
>                                     cancel_common (locks context)
> 
>                           test_* (I don't care)
> 

To clarify: aside from the commit message, the patch itself does not 
appear to be wrong? All paths (apart from replication_close mentioned 
above) guarantee that the job's AioContext lock is already held.
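
To restate the deadlock from the commit message quoted below as a
pseudo-trace (illustration only, not compilable as-is):

/* aio_context_acquire() nests, so acquiring twice "succeeds"... */
aio_context_acquire(ctx);              /* taken by job_exit()               */
aio_context_acquire(ctx);              /* taken again by old backup_clean() */
bdrv_backup_top_drop(s->backup_top);   /* drain: BDRV_POLL_WHILE releases
                                        * ctx only once, so it is still
                                        * held once; the IO thread can't
                                        * acquire it, in-flight requests
                                        * never complete -> deadlock        */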

>> already acquires the job's context. The job's context is guaranteed to
>> be the same as the one used by backup_top via backup_job_create.
>>
>> Since the previous logic effectively acquired the lock twice, this
>> broke cleanup of backups for disks using IO threads, since the
>> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
>> only release the lock once, thus deadlocking with the IO thread.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> 
> Just note that this code was recently touched by 0abf2581717a19, so I'm
> adding Sergio (its author) to CC.
> 
>> ---
>>
>> This is a fix for the issue discussed in this part of the thread:
>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>> ...not the original problem (core dump) posted by Dietmar.
>>
>> I've still seen it occasionally hang during a backup abort. I'm trying
>> to figure out why that happens; the stack trace indicates a similar
>> problem with the main thread hanging at bdrv_do_drained_begin, though I
>> have no clue why as of yet.
>>
>>   block/backup.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 7430ca5883..a7a7dcaf4c 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>   static void backup_clean(Job *job)
>>   {
>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>> -
>> -    aio_context_acquire(aio_context);
>>       bdrv_backup_top_drop(s->backup_top);
>> -    aio_context_release(aio_context);
>>   }
>>   void backup_do_checkpoint(BlockJob *job, Error **errp)
>>
> 
> 



Thread overview: 23+ messages
2020-03-24 11:13 backup transaction with io-thread core dumps Dietmar Maurer
2020-03-24 13:30 ` Dietmar Maurer
2020-03-24 13:33   ` Dietmar Maurer
2020-03-24 13:44     ` John Snow
2020-03-24 14:00       ` Dietmar Maurer
2020-03-24 13:47     ` Max Reitz
2020-03-24 14:02       ` Dietmar Maurer
2020-03-24 16:49       ` Dietmar Maurer
2020-03-25 11:40         ` Stefan Reiter
2020-03-25 12:23           ` Vladimir Sementsov-Ogievskiy
2020-03-25 15:50             ` [PATCH] backup: don't acquire aio_context in backup_clean Stefan Reiter
2020-03-26  5:54               ` Vladimir Sementsov-Ogievskiy
2020-03-26  9:43                 ` Stefan Reiter [this message]
2020-03-26 12:46                   ` Vladimir Sementsov-Ogievskiy
2020-03-26 11:53                 ` Sergio Lopez
2020-03-25  8:13       ` backup transaction with io-thread core dumps Sergio Lopez
2020-03-25 11:46         ` Sergio Lopez
2020-03-25 12:29           ` Dietmar Maurer
2020-03-25 12:39             ` Sergio Lopez
2020-03-25 15:40               ` Dietmar Maurer
2020-03-26  7:50                 ` Sergio Lopez
2020-03-26  8:14                   ` Dietmar Maurer
2020-03-26  9:23                     ` Dietmar Maurer
