From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Stefan Reiter <s.reiter@proxmox.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, slp@redhat.com, mreitz@redhat.com,
	stefanha@redhat.com, jsnow@redhat.com, dietmar@proxmox.com
Subject: Re: [PATCH] backup: don't acquire aio_context in backup_clean
Date: Thu, 26 Mar 2020 15:46:18 +0300
Message-ID: <143bd416-c18d-669a-e569-0de3338c740a@virtuozzo.com>
In-Reply-To: <1d1984b3-14f5-5a17-b477-d70561f75e8f@proxmox.com>

26.03.2020 12:43, Stefan Reiter wrote:
> On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
>> 25.03.2020 18:50, Stefan Reiter wrote:
>>> backup_clean is only ever called as a handler via job_exit, which
>>
>> Hmm.. I'm afraid it's not quite correct.
>>
>> job_clean
>>
>>    job_finalize_single
>>
>>       job_completed_txn_abort (lock aio context)
>>
>>       job_do_finalize
>>
>>
>> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock the aio context.
>> At the same time, it directly calls job_txn_apply(job->txn, job_finalize_single)
>> without locking. Is that a bug?
>>
> 
> I think, as you say, the idea is that job_do_finalize is always called with the lock acquired. That's why job_completed_txn_abort takes care to release the lock (at least of the "outer_ctx" as it calls it) before reacquiring it.
> 
>> And even if job_do_finalize is always called with the context locked, where is the guarantee
>> that the contexts of all jobs in the txn are locked?
>>
> 
> I also don't see anything that guarantees that... I guess it could be adapted to handle locks like job_completed_txn_abort does?
> 
> Haven't looked into transactions too much, but does it even make sense to have jobs in different contexts in one transaction?

Why not? Consider backing up two disks in one transaction, each in a separate IO thread. (Honestly, I don't know whether that works.)

> 
>> Still, let's look through its callers.
>>
>>        job_finalize
>>
>>                 qmp_block_job_finalize (lock aio context)
>>                 qmp_job_finalize (lock aio context)
>>                 test_cancel_concluded (doesn't lock, but it's a test)
>>
>>            job_completed_txn_success
>>
>>                 job_completed
>>
>>                      job_exit (lock aio context)
>>
>>                      job_cancel
>>
>>                           blockdev_mark_auto_del (lock aio context)
>>
>>                           job_user_cancel
>>
>>                               qmp_block_job_cancel (locks context)
>>                               qmp_job_cancel  (locks context)
>>
>>                           job_cancel_err
>>
>>                                job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback)
>>
>>                                     replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it ?)
> Hm, don't see it either. This might indeed be a way to get to job_clean without a lock held.
> 
> I don't have any testing set up for replication atm, but if you believe this would be correct I can send a patch for that as well (just acquire the lock in replication_close before job_cancel_sync?).

I don't know. But sending a patch is a good way to start a discussion :)

> 
>>
>>                                     replication_stop (locks context)
>>
>>                                     drive_backup_abort (locks context)
>>
>>                                     blockdev_backup_abort (locks context)
>>
>>                                     job_cancel_sync_all (locks context)
>>
>>                                     cancel_common (locks context)
>>
>>                           test_* (I don't care)
>>
> 
> To clarify, aside from the commit message the patch itself does not appear to be wrong? All paths (aside from replication_close mentioned above) guarantee the job lock to be held.

I worry more about the case of a transaction with jobs from different aio contexts than about replication.
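
To make the concern concrete, here is a rough toy model in C. It is not QEMU code: ToyContext, ToyJob and the toy_* functions are hypothetical stand-ins, and the "lock" is only a hold counter in a single thread. It sketches one possible shape, following the pattern Stefan describes for job_completed_txn_abort: drop the caller's context, take each job's own context around the callback, then reacquire the caller's context.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an AioContext: it only counts how often it is held. */
typedef struct ToyContext {
    int depth;
} ToyContext;

/* Toy stand-in for a job that lives in exactly one context. */
typedef struct ToyJob {
    ToyContext *ctx;
    int finalized;
} ToyJob;

static void toy_context_acquire(ToyContext *ctx)
{
    ctx->depth++;
}

static void toy_context_release(ToyContext *ctx)
{
    assert(ctx->depth > 0);   /* releasing a context we do not hold is a bug */
    ctx->depth--;
}

/* The per-job callback expects its own context to be held exactly once. */
static void toy_finalize_single(ToyJob *job)
{
    assert(job->ctx->depth == 1);
    job->finalized = 1;
}

/*
 * Apply fn to every job in the toy transaction.
 * Assumption: the caller holds outer_ctx exactly once on entry and
 * expects to hold it exactly once on return.
 */
static void toy_txn_apply_locked(ToyJob *jobs, size_t n, ToyContext *outer_ctx,
                                 void (*fn)(ToyJob *))
{
    size_t i;

    /* Drop the caller's context so that no context is held twice. */
    toy_context_release(outer_ctx);

    for (i = 0; i < n; i++) {
        toy_context_acquire(jobs[i].ctx);
        fn(&jobs[i]);
        toy_context_release(jobs[i].ctx);
    }

    /* Restore the state the caller expects. */
    toy_context_acquire(outer_ctx);
}
```

With two jobs in two different toy contexts, a caller that holds only the first context can call toy_txn_apply_locked(jobs, 2, &first_ctx, toy_finalize_single), and each callback then runs with exactly its own context held. Whether the real job_txn_apply could be changed this way is exactly the open question.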

Anyway, I hope that someone with a better understanding of these things will look at this.

It's usually not a good idea to send a [PATCH] inside a discussion thread; it's better to start a separate thread so that the patch is more visible.

Maybe you could send a separate series that includes this patch, a fix for replication, and an attempt to fix job_do_finalize in some way, and we continue the discussion from that new series?

> 
>>> already acquires the job's context. The job's context is guaranteed to
>>> be the same as the one used by backup_top via backup_job_create.
>>>
>>> Since the previous logic effectively acquired the lock twice, this
>>> broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
>>> in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
>>> once, thus deadlocking with the IO thread.
>>>
>>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>>
>> Just a note: this code was recently touched by 0abf2581717a19, so adding Sergio (its author) to CC.
>>
>>> ---
>>>
>>> This is a fix for the issue discussed in this part of the thread:
>>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>>> ...not the original problem (core dump) posted by Dietmar.
>>>
>>> I've still seen it occasionally hang during a backup abort. I'm trying to figure
>>> out why that happens, stack trace indicates a similar problem with the main
>>> thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.
>>>
>>>   block/backup.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 7430ca5883..a7a7dcaf4c 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>>   static void backup_clean(Job *job)
>>>   {
>>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>>> -
>>> -    aio_context_acquire(aio_context);
>>>       bdrv_backup_top_drop(s->backup_top);
>>> -    aio_context_release(aio_context);
>>>   }
>>>   void backup_do_checkpoint(BlockJob *job, Error **errp)
>>>
>>
>>
> 


-- 
Best regards,
Vladimir

