From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "fam@euphon.net" <fam@euphon.net>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"jcody@redhat.com" <jcody@redhat.com>,
	"kwolf@redhat.com" <kwolf@redhat.com>,
	Denis Lunev <den@virtuozzo.com>,
	"eblake@redhat.com" <eblake@redhat.com>,
	"jsnow@redhat.com" <jsnow@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 16:59:40 +0100
Message-ID: <ea0a19fd-2db5-bfa5-180b-68c3ef80e046@redhat.com>
In-Reply-To: <643f318f-392d-1370-f952-b2a01e7d9bb7@virtuozzo.com>


On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2019 17:56, Max Reitz wrote:
>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>>> Drop write notifiers and use filter node instead. Changes:
>>>
>>> 1. copy-before-write is now handled by the filter node, so drop all
>>>     is_write_notifier arguments.
>>>
>>> 2. We no longer have intersecting requests, so their handling is dropped.
>>> Instead, synchronization works as follows:
>>> when backup or backup-top starts copying some area, it first clears the
>>> copy-bitmap bits, and nobody touches areas that are not marked with
>>> dirty bits in the copy-bitmap, so there is no intersection. Also, backup
>>> job copy operations are surrounded by a bdrv region lock, which is
>>> actually a serializing request, so as not to interfere with guest writes
>>> and not to read changed data from the source (before reading we clear the
>>> corresponding bit in the copy-bitmap, so the area is no longer handled by
>>> backup-top).
>>>
>>> 3. To sync with in-flight requests, we now just drain the hook node; we
>>> don't need the rw-lock.
>>>
>>> 4. After the whole backup loop (top, full, incremental modes), we need
>>> to check for clusters that were not copied: we skipped them because they
>>> were under a backup-top operation, but that operation failed, the error
>>> was returned to the guest, and the dirty bits were set back.
>>>
>>> 5. Don't create an additional blk; use backup-top's children for copy
>>> operations.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>>   1 file changed, 149 insertions(+), 136 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 88c0242b4e..e332909fb7 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>
>> [...]
>>
>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>>   static void backup_clean(Job *job)
>>>   {
>>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> -    assert(s->target);
>>> -    blk_unref(s->target);
>>> +
>>> +    /* We must clean it to not crash in backup_drain. */
>>>       s->target = NULL;
>>
>> Why not set s->source to NULL along with it?  It makes sense if you're
>> going to drop the backup-top node because both of these are its children.
> 
> agree.
> 
>>
>>>   
>>>       if (s->copy_bitmap) {
>>>           hbitmap_free(s->copy_bitmap);
>>>           s->copy_bitmap = NULL;
>>>       }
>>> +
>>> +    bdrv_backup_top_drop(s->backup_top);
>>>   }
>>
>> [...]
>>
>>> @@ -386,21 +319,45 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>>>       bool error_is_read;
>>>       int64_t offset;
>>>       HBitmapIter hbi;
>>> +    void *lock = NULL;
>>>   
>>>       hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> -    while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>>> +    while (hbitmap_count(job->copy_bitmap)) {
>>> +        offset = hbitmap_iter_next(&hbi);
>>> +        if (offset == -1) {
>>> +            /*
>>> +             * we may have skipped some clusters, which were handled by
>>> +             * backup-top, but failed and finished by returning error to
>>> +             * the guest and set dirty bit back.
>>> +             */
>>> +            hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> +            offset = hbitmap_iter_next(&hbi);
>>> +            assert(offset);
>>
>> I think you want to assert "offset >= 0".
>>
>>> +        }
>>> +
>>> +        lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>>> +        /*
>>> +         * Dirty bit is set, which means that there are no in-flight
>>> +         * write requests on this area. We must succeed.
>>> +         */
>>> +        assert(lock);
>>
>> I'm not sure that is true right now, but more on that below in backup_run().
>>
>>> +
>>>           do {
>>>               if (yield_and_check(job)) {
>>> +                bdrv_co_unlock(lock);
>>>                   return 0;
>>>               }
>>> -            ret = backup_do_cow(job, offset,
>>> -                                job->cluster_size, &error_is_read, false);
>>> +            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
>>>               if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>>>                              BLOCK_ERROR_ACTION_REPORT)
>>>               {
>>> +                bdrv_co_unlock(lock);
>>>                   return ret;
>>>               }
>>>           } while (ret < 0);
>>> +
>>> +        bdrv_co_unlock(lock);
>>> +        lock = NULL;
>>
>> This statement seems unnecessary here.
>>
>>>       }
>>>   
>>>       return 0;
>>
>> [...]
>>
>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>           hbitmap_set(s->copy_bitmap, 0, s->len);
>>>       }
>>>   
>>> -    s->before_write.notify = backup_before_write_notify;
>>> -    bdrv_add_before_write_notifier(bs, &s->before_write);
>>> -
>>>       if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>>           /* All bits are set in copy_bitmap to allow any cluster to be copied.
>>>            * This does not actually require them to be copied. */
>>>           while (!job_is_cancelled(job)) {
>>> -            /* Yield until the job is cancelled.  We just let our before_write
>>> -             * notify callback service CoW requests. */
>>> +            /*
>>> +             * Yield until the job is cancelled.  We just let our backup-top
>>> +             * fileter driver service CbW requests.
>>
>> *filter
>>
>>> +             */
>>>               job_yield(job);
>>>           }
>>>       } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>           ret = backup_run_incremental(s);
>>>       } else {
>>
>> [...]
>>
>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>               if (alloced < 0) {
>>>                   ret = alloced;
>>>               } else {
>>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>>> +                    trace_backup_do_cow_skip(job, offset);
>>> +                    continue; /* already copied */
>>> +                }
>>> +                if (!lock) {
>>> +                    lock = bdrv_co_try_lock(s->source, offset, s->cluster_size);
>>> +                    /*
>>> +                     * Dirty bit is set, which means that there are no in-flight
>>> +                     * write requests on this area. We must succeed.
>>> +                     */
>>> +                    assert(lock);
>>
>> What if I have a different parent node for the source that issues
>> concurrent writes?  This used to work fine because the before_write
>> notifier would still work.  After this patch, that would be broken
>> because those writes would not cause a CbW.
> 
> But how could you have this different parent node? After appending the filter,
> there should not be such nodes.

Unless you append them afterwards:

> And I think that during backup it should be
> forbidden to append new parents to the source that bypass the filter, as that
> definitely breaks what the filter does.

Agreed, but then this needs to be implemented.

> And it applies to other block jobs with their filters.
> If we have appended a filter, we don't want anyone else to write around it.
> 
>>
>> That's not so bad because we just have to make sure that all writes go
>> through the backup-top node.  That would make this assertion valid
>> again, too.  But that means we cannot share PERM_WRITE; see [1].
> 
> But we don't share PERM_WRITE on source in backup_top, only on target.

Are you sure?  The job itself shares it, and the filter shares it, too,
as far as I can see.  It uses bdrv_filter_default_perms(), and that does
seem to share PERM_WRITE.
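
If the idea is that during the backup all writes to the source must go
through backup-top, then backup_top_child_perm() should probably stop
sharing PERM_WRITE on anything but the target child.  A rough sketch of
what I mean (untested and written from memory, so the prototype may be
slightly off; BDRVBackupTopState and its target field are from your patch):

static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                  const BdrvChildRole *role,
                                  BlockReopenQueue *reopen_queue,
                                  uint64_t perm, uint64_t shared,
                                  uint64_t *nperm, uint64_t *nshared)
{
    BDRVBackupTopState *s = bs->opaque;

    /* Start from the default filter permissions */
    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
                              nperm, nshared);

    if (c != s->target) {
        /* Nobody but backup-top itself may write to the source */
        *nshared &= ~BLK_PERM_WRITE;
    }
}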

>>
>>> +                }
>>>                   ret = backup_do_cow(s, offset, s->cluster_size,
>>> -                                    &error_is_read, false);
>>> +                                    &error_is_read);
>>>               }
>>>               if (ret < 0) {
>>>                   /* Depending on error action, fail now or retry cluster */
>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>                       break;
>>>                   } else {
>>>                       offset -= s->cluster_size;
>>> +                    retry = true;
>>>                       continue;
>>>                   }
>>>               }
>>>           }
>>> +        if (lock) {
>>> +            bdrv_co_unlock(lock);
>>> +            lock = NULL;
>>> +        }
>>> +        if (ret == 0 && !job_is_cancelled(job) &&
>>> +            hbitmap_count(s->copy_bitmap))
>>> +        {
>>> +            /*
>>> +             * we may have skipped some clusters, which were handled by
>>> +             * backup-top, but failed and finished by returning error to
>>> +             * the guest and set dirty bit back.
>>
>> So it's a matter of a race?
>>
>>> +             */
>>> +            goto iteration;
>>> +        }
>>
>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
>> instead?  Because it would mean reindenting everything?
> 
> I don't remember, but I assume so. And this code is "to be refactored" anyway;
> I want all of FULL/TOP/INCREMENTAL to go through (mostly) the same code path.

Hm, well, if you want to refactor it later anyway...  But I don't like
gotos that go backwards very much, unless there is a good reason to have
them (and there isn't here).
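
Roughly what I had in mind, using only what is already in this function
(just the skeleton; the loop body stays as it is):

    do {
        for (offset = 0; offset < s->len; offset += s->cluster_size) {
            /* ... existing TOP/FULL copying loop body ... */
        }
    } while (ret == 0 && !job_is_cancelled(job) &&
             hbitmap_count(s->copy_bitmap));

Then the backwards goto and the iteration label would just go away.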

>>>       }
>>>   
>>> -    notifier_with_return_remove(&s->before_write);
>>> +    /* wait pending CBW operations in backup-top */
>>> +    bdrv_drain(s->backup_top);
>>>   
>>> -    /* wait until pending backup_do_cow() calls have completed */
>>> -    qemu_co_rwlock_wrlock(&s->flush_rwlock);
>>> -    qemu_co_rwlock_unlock(&s->flush_rwlock);
>>> +    backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>>> +    job_progress_update(job, ret + backup_top_progress -
>>
>> Why the "ret"?
> 
> oops, it looks like a copy-paste bug ("ret" is reasonable in backup_do_cow())
> 
>>
>>> +                        s->backup_top_progress);
>>> +    s->backup_top_progress = backup_top_progress;
>>
>> So the backup-top progress is ignored during basically all of the block
>> job until it suddenly jumps to 100 % completion?  That doesn't sound ideal.
>>
>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>>   (And the while() loop of MODE_NONE)
> 
> 
> It is done in backup_do_cow(), so FULL and TOP are covered.
> 
> But you are right that MODE_NONE seems to have a problem here... And just updating it
> in a while loop would not work, as I doubt that job_yield() will return before the job
> finishes or there is user interaction like pause/continue/cancel...
> 
> So now, it looks better to call job_progress_update() from backup_top directly, and drop
> this hack.

Hmmm...  I don't think job_*() calls belong in backup_top.  How about
adding a callback to bdrv_backup_top_append()?
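
Completely hypothetical sketch, just to illustrate what I mean -- the job
would pass a callback when appending the filter, and backup-top would
invoke that instead of calling job_progress_update() itself:

typedef void (*BackupTopProgressCallback)(uint64_t bytes_done, void *opaque);

BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
                                         BlockDriverState *target,
                                         HBitmap *copy_bitmap,
                                         BackupTopProgressCallback progress_cb,
                                         void *progress_opaque,
                                         Error **errp);

The backup job could then install a callback that simply forwards
bytes_done to job_progress_update() on its own Job, and the
backup_top_progress bookkeeping here would disappear.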

>>>   
>>>       return ret;
>>>   }
>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>       int ret;
>>>       int64_t cluster_size;
>>>       HBitmap *copy_bitmap = NULL;
>>> +    BlockDriverState *backup_top = NULL;
>>> +    uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> +                                 BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD;
>>
>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>>
>> [1] But we probably do not want to share either PERM_WRITE or
>> PERM_WRITE_UNCHANGED because during the duration of the backup,
>> everything should go through the backup-top filter (not sure about
>> PERM_WRITE_UNCHANGED right now).  Or is that something that the filter
>> node should enforce in backup_top_child_perm()?
> 
> It's not the shared perm of backup_top, it's the shared perm of the block job's common.blk,
> which is used only for "job->len is fixed, so we can't allow resize", so this part is not changed.
> 
> So yes, the problem you mean by [1] is about backup_top_child_perm(), where we share PERM_WRITE.
> And as the comment there describes, we must share this write perm, otherwise we break guest writes.

For the target, yes, but the problem is sharing it on the source.

> We share PERM_WRITE in backup_top to force its target child to share PERM_WRITE on its backing,
> as the backing of the target is the source.
> 
> But again, we share PERM_WRITE only on the target, and it is shared in the current code too.

I'm not so sure whether PERM_WRITE is shared only on the target.

>>
>>>   
>>>       assert(bs);
>>>       assert(target);
>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>   
>>>       copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>   
>>> -    /* job->len is fixed, so we can't allow resize */
>>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>> -                           BLK_PERM_CONSISTENT_READ,
>>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>> -                           speed, creation_flags, cb, opaque, errp);
>>> -    if (!job) {
>>> +    /*
>>> +     * bdrv_get_device_name will not help to find device name starting from
>>> +     * @bs after backup-top append,
>>
>> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
>> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
>> same result)
> 
> bdrv_get_device_name() eventually goes through role->get_name, and only the root role has
> this handler. After the append, we'll have a backing role instead of root.

Ah, I see, I asked the wrong question.

Why is block_job_create() called on bs and not on backup_top?  mirror
calls it on mirror_top_bs.
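
That is, something like this (sketch only; whether the permission values
are right would need more thought):

    job = block_job_create(job_id, &backup_job_driver, txn, backup_top, 0,
                           all_except_resize, speed, creation_flags,
                           cb, opaque, errp);

I would assume that would also make the job_id/bdrv_get_device_name()
workaround unnecessary, just like it is for mirror.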

>>>                                       so let's calculate job_id before. Do
>>> +     * it in the same way like block_job_create
>>> +     */
>>> +    if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>>> +        job_id = bdrv_get_device_name(bs);
>>> +    }
>>> +
>>> +    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>>> +    if (!backup_top) {
>>>           goto error;
>>>       }
>>>   
>>> -    /* The target must match the source in size, so no resize here either */
>>> -    job->target = blk_new(BLK_PERM_WRITE,
>>> -                          BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                          BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>>> -    ret = blk_insert_bs(job->target, target, errp);
>>> -    if (ret < 0) {
>>> +    /* job->len is fixed, so we can't allow resize */
>>> +    job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>>
>> Is there a reason you dropped PERM_CONSISTENT_READ here?
> 
> Because we don't use this blk for reads now; we read through the backup_top child.

Makes sense.

>>> +                           all_except_resize, speed, creation_flags,
>>> +                           cb, opaque, errp);
>>> +    if (!job) {
>>>           goto error;
>>>       }
>>>   
>>> +    job->source = backup_top->backing;
>>> +    job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>>
>> This looks really ugly.  I think as long as the block job performs
>> writes itself, it should use its own BlockBackend.
> 
> They are not BlockBackends, they are BdrvChildren.

Exactly, which is what I don't like.  They are children of a node that
is implemented in a different file; it looks weird to use them here.

> It was Kevin's idea to reuse the filter's
> children in the backup job:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html

It's still ugly if backup_top is in a different file.  Well, maybe just
to me.

> Hmm, and this is also why I need PERM_WRITE in backup_top, to write to target.
> 
>>
>> Alternatively, I think it would make sense for the backup-top filter to
>> offer functions that this job can use to issue writes to the target.
>> Then the backup job would no longer need direct access to the target as
>> a BdrvChild.

So what would be the problem with this?
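
For example, backup-top could export something like this (names are
completely made up, of course):

/*
 * Copy @bytes bytes starting at @offset from backup-top's source to the
 * same offset in its target, clearing the corresponding copy-bitmap bits.
 * Meant to be used by the backup job instead of poking at backup-top's
 * BdrvChildren directly.
 */
int coroutine_fn bdrv_backup_top_co_copy_range(BlockDriverState *backup_top,
                                               uint64_t offset,
                                               uint64_t bytes,
                                               bool *error_is_read);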

Max


