From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"qemu-block@nongnu.org" <qemu-block@nongnu.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "fam@euphon.net" <fam@euphon.net>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"jcody@redhat.com" <jcody@redhat.com>,
	"kwolf@redhat.com" <kwolf@redhat.com>,
	Denis Lunev <den@virtuozzo.com>,
	"eblake@redhat.com" <eblake@redhat.com>,
	"jsnow@redhat.com" <jsnow@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 16:59:40 +0100
Message-ID: <ea0a19fd-2db5-bfa5-180b-68c3ef80e046@redhat.com>
In-Reply-To: <643f318f-392d-1370-f952-b2a01e7d9bb7@virtuozzo.com>


On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2019 17:56, Max Reitz wrote:
>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>>> Drop write notifiers and use filter node instead. Changes:
>>>
>>> 1. copy-before-write is now handled by the filter node, so drop all
>>>     is_write_notifier arguments.
>>>
>>> 2. We no longer have intersecting requests, so their handling is dropped.
>>> Instead, synchronization works as follows:
>>> when backup or backup-top starts copying some area, it first clears the
>>> copy-bitmap bits, and nobody touches areas that are not marked with
>>> dirty bits in the copy-bitmap, so there is no intersection. Also, backup
>>> job copy operations are surrounded by a bdrv region lock, which is
>>> actually a serializing request, so as not to interfere with guest writes
>>> and not to read changed data from the source (before reading we clear the
>>> corresponding bit in the copy-bitmap, so the area is no longer handled by
>>> backup-top).
>>>
>>> 3. To sync with in-flight requests, we now just drain the hook node; we
>>> don't need the rw-lock.
>>>
>>> 4. After the whole backup loop (top, full, incremental modes), we need
>>> to check for clusters that were not copied: we skipped them because they
>>> were under a backup-top operation, but that operation failed, the error
>>> was returned to the guest, and the dirty bits were set back.
>>>
>>> 5. Don't create an additional blk; use backup-top's children for copy
>>> operations.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>>   1 file changed, 149 insertions(+), 136 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 88c0242b4e..e332909fb7 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>
>> [...]
>>
>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>>   static void backup_clean(Job *job)
>>>   {
>>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> -    assert(s->target);
>>> -    blk_unref(s->target);
>>> +
>>> +    /* We must clean it to not crash in backup_drain. */
>>>       s->target = NULL;
>>
>> Why not set s->source to NULL along with it?  It makes sense if you're
>> going to drop the backup-top node because both of these are its children.
> 
> agree.
> 
>>
>>>   
>>>       if (s->copy_bitmap) {
>>>           hbitmap_free(s->copy_bitmap);
>>>           s->copy_bitmap = NULL;
>>>       }
>>> +
>>> +    bdrv_backup_top_drop(s->backup_top);
>>>   }
>>
>> [...]
>>
>>> @@ -386,21 +319,45 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>>>       bool error_is_read;
>>>       int64_t offset;
>>>       HBitmapIter hbi;
>>> +    void *lock = NULL;
>>>   
>>>       hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> -    while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>>> +    while (hbitmap_count(job->copy_bitmap)) {
>>> +        offset = hbitmap_iter_next(&hbi);
>>> +        if (offset == -1) {
>>> +            /*
>>> +             * we may have skipped some clusters, which were handled by
>>> +             * backup-top, but failed and finished by returning error to
>>> +             * the guest and set dirty bit back.
>>> +             */
>>> +            hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> +            offset = hbitmap_iter_next(&hbi);
>>> +            assert(offset);
>>
>> I think you want to assert "offset >= 0".
>>
>>> +        }
>>> +
>>> +        lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>>> +        /*
>>> +         * Dirty bit is set, which means that there are no in-flight
>>> +         * write requests on this area. We must succeed.
>>> +         */
>>> +        assert(lock);
>>
>> I'm not sure that is true right now, but more on that below in backup_run().
>>
>>> +
>>>           do {
>>>               if (yield_and_check(job)) {
>>> +                bdrv_co_unlock(lock);
>>>                   return 0;
>>>               }
>>> -            ret = backup_do_cow(job, offset,
>>> -                                job->cluster_size, &error_is_read, false);
>>> +            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
>>>               if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>>>                              BLOCK_ERROR_ACTION_REPORT)
>>>               {
>>> +                bdrv_co_unlock(lock);
>>>                   return ret;
>>>               }
>>>           } while (ret < 0);
>>> +
>>> +        bdrv_co_unlock(lock);
>>> +        lock = NULL;
>>
>> This statement seems unnecessary here.
>>
>>>       }
>>>   
>>>       return 0;
>>
>> [...]
>>
>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>           hbitmap_set(s->copy_bitmap, 0, s->len);
>>>       }
>>>   
>>> -    s->before_write.notify = backup_before_write_notify;
>>> -    bdrv_add_before_write_notifier(bs, &s->before_write);
>>> -
>>>       if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>>           /* All bits are set in copy_bitmap to allow any cluster to be copied.
>>>            * This does not actually require them to be copied. */
>>>           while (!job_is_cancelled(job)) {
>>> -            /* Yield until the job is cancelled.  We just let our before_write
>>> -             * notify callback service CoW requests. */
>>> +            /*
>>> +             * Yield until the job is cancelled.  We just let our backup-top
>>> +             * fileter driver service CbW requests.
>>
>> *filter
>>
>>> +             */
>>>               job_yield(job);
>>>           }
>>>       } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>           ret = backup_run_incremental(s);
>>>       } else {
>>
>> [...]
>>
>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>               if (alloced < 0) {
>>>                   ret = alloced;
>>>               } else {
>>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>>> +                    trace_backup_do_cow_skip(job, offset);
>>> +                    continue; /* already copied */
>>> +                }
>>> +                if (!lock) {
>>> +                    lock = bdrv_co_try_lock(s->source, offset, s->cluster_size);
>>> +                    /*
>>> +                     * Dirty bit is set, which means that there are no in-flight
>>> +                     * write requests on this area. We must succeed.
>>> +                     */
>>> +                    assert(lock);
>>
>> What if I have a different parent node for the source that issues
>> concurrent writes?  This used to work fine because the before_write
>> notifier would still work.  After this patch, that would be broken
>> because those writes would not cause a CbW.
> 
> But how could you have this different parent node? After appending the filter,
> there should not be such nodes.

Unless you append them afterwards:

> And I think that during backup it should be
> forbidden to append new parents to the source that bypass the filter, as that
> definitely breaks what the filter does.

Agreed, but then this needs to be implemented.

> And it applies to other block jobs with their filters.
> If we have appended a filter, we don't want anyone else to write around it.
> 
>>
>> That's not so bad because we just have to make sure that all writes go
>> through the backup-top node.  That would make this assertion valid
>> again, too.  But that means we cannot share PERM_WRITE; see [1].
> 
> But we don't share PERM_WRITE on source in backup_top, only on target.

Are you sure?  The job itself shares it, and the filter shares it, too,
as far as I can see.  It uses bdrv_filter_default_perms(), and that does
seem to share PERM_WRITE.
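
If the idea is that during the backup all writes to the source must go
through backup-top, then backup_top_child_perm() should probably stop
sharing PERM_WRITE on anything but the target child.  A rough sketch of
what I mean (untested and written from memory, so the prototype may be
slightly off; BDRVBackupTopState and its target field are from your patch):

static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                  const BdrvChildRole *role,
                                  BlockReopenQueue *reopen_queue,
                                  uint64_t perm, uint64_t shared,
                                  uint64_t *nperm, uint64_t *nshared)
{
    BDRVBackupTopState *s = bs->opaque;

    /* Start from the default filter permissions */
    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
                              nperm, nshared);

    if (c != s->target) {
        /* Nobody but backup-top itself may write to the source */
        *nshared &= ~BLK_PERM_WRITE;
    }
}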

>>
>>> +                }
>>>                   ret = backup_do_cow(s, offset, s->cluster_size,
>>> -                                    &error_is_read, false);
>>> +                                    &error_is_read);
>>>               }
>>>               if (ret < 0) {
>>>                   /* Depending on error action, fail now or retry cluster */
>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>                       break;
>>>                   } else {
>>>                       offset -= s->cluster_size;
>>> +                    retry = true;
>>>                       continue;
>>>                   }
>>>               }
>>>           }
>>> +        if (lock) {
>>> +            bdrv_co_unlock(lock);
>>> +            lock = NULL;
>>> +        }
>>> +        if (ret == 0 && !job_is_cancelled(job) &&
>>> +            hbitmap_count(s->copy_bitmap))
>>> +        {
>>> +            /*
>>> +             * we may have skipped some clusters, which were handled by
>>> +             * backup-top, but failed and finished by returning error to
>>> +             * the guest and set dirty bit back.
>>
>> So it's a matter of a race?
>>
>>> +             */
>>> +            goto iteration;
>>> +        }
>>
>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
>> instead?  Because it would mean reindenting everything?
> 
> I don't remember, but I assume so. And this code is "to be refactored" anyway;
> I want all of FULL/TOP/INCREMENTAL to go through (mostly) the same code path.

Hm, well, if you want to refactor it later anyway...  But I don't like
gotos that go backwards very much, unless there is a good reason to have
them (and there isn't here).
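
Roughly what I had in mind, using only what is already in this function
(just the skeleton; the loop body stays as it is):

    do {
        for (offset = 0; offset < s->len; offset += s->cluster_size) {
            /* ... existing TOP/FULL copying loop body ... */
        }
    } while (ret == 0 && !job_is_cancelled(job) &&
             hbitmap_count(s->copy_bitmap));

Then the backwards goto and the iteration label would just go away.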

>>>       }
>>>   
>>> -    notifier_with_return_remove(&s->before_write);
>>> +    /* wait pending CBW operations in backup-top */
>>> +    bdrv_drain(s->backup_top);
>>>   
>>> -    /* wait until pending backup_do_cow() calls have completed */
>>> -    qemu_co_rwlock_wrlock(&s->flush_rwlock);
>>> -    qemu_co_rwlock_unlock(&s->flush_rwlock);
>>> +    backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>>> +    job_progress_update(job, ret + backup_top_progress -
>>
>> Why the "ret"?
> 
> oops, it looks like a copy-paste bug ("ret" is reasonable in backup_do_cow())
> 
>>
>>> +                        s->backup_top_progress);
>>> +    s->backup_top_progress = backup_top_progress;
>>
>> So the backup-top progress is ignored during basically all of the block
>> job until it suddenly jumps to 100 % completion?  That doesn't sound ideal.
>>
>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>>   (And the while() loop of MODE_NONE)
> 
> 
> It is done in backup_do_cow(), so FULL and TOP are covered.
> 
> But you are right that MODE_NONE seems to have a problem here... And just updating it
> in a while loop would not work, as I doubt that job_yield() will return before the job
> finishes or there is user interaction like pause/continue/cancel...
> 
> So now, it looks better to call job_progress_update() from backup_top directly, and drop
> this hack.

Hmmm...  I don't think job_*() calls belong in backup_top.  How about
adding a callback to bdrv_backup_top_append()?
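
Completely hypothetical sketch, just to illustrate what I mean -- the job
would pass a callback when appending the filter, and backup-top would
invoke that instead of calling job_progress_update() itself:

typedef void (*BackupTopProgressCallback)(uint64_t bytes_done, void *opaque);

BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
                                         BlockDriverState *target,
                                         HBitmap *copy_bitmap,
                                         BackupTopProgressCallback progress_cb,
                                         void *progress_opaque,
                                         Error **errp);

The backup job could then install a callback that simply forwards
bytes_done to job_progress_update() on its own Job, and the
backup_top_progress bookkeeping here would disappear.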

>>>   
>>>       return ret;
>>>   }
>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>       int ret;
>>>       int64_t cluster_size;
>>>       HBitmap *copy_bitmap = NULL;
>>> +    BlockDriverState *backup_top = NULL;
>>> +    uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> +                                 BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD;
>>
>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>>
>> [1] But we probably do not want to share either PERM_WRITE or
>> PERM_WRITE_UNCHANGED because during the duration of the backup,
>> everything should go through the backup-top filter (not sure about
>> PERM_WRITE_UNCHANGED right now).  Or is that something that the filter
>> node should enforce in backup_top_child_perm()?
> 
> It's not the shared perm of backup_top, it's the shared perm of the block job's common.blk,
> which is used only for "job->len is fixed, so we can't allow resize", so this part is not changed.
> 
> So yes, the problem you mean by [1] is about backup_top_child_perm(), where we share PERM_WRITE.
> And as the comment there describes, we must share this write perm, otherwise we break guest writes.

For the target, yes, but the problem is sharing it on the source.

> We share PERM_WRITE in backup_top to force its target child to share PERM_WRITE on its backing,
> as the backing of the target is the source.
> 
> But again, we share PERM_WRITE only on the target, and it is shared in the current code too.

I'm not so sure whether PERM_WRITE is shared only on the target.

>>
>>>   
>>>       assert(bs);
>>>       assert(target);
>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>   
>>>       copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>   
>>> -    /* job->len is fixed, so we can't allow resize */
>>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>> -                           BLK_PERM_CONSISTENT_READ,
>>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>> -                           speed, creation_flags, cb, opaque, errp);
>>> -    if (!job) {
>>> +    /*
>>> +     * bdrv_get_device_name will not help to find device name starting from
>>> +     * @bs after backup-top append,
>>
>> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
>> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
>> same result)
> 
> bdrv_get_device_name() eventually goes through role->get_name, and only the root role has
> this handler. After the append, we'll have a backing role instead of root.

Ah, I see, I asked the wrong question.

Why is block_job_create() called on bs and not on backup_top?  mirror
calls it on mirror_top_bs.
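
That is, something like this (sketch only; whether the permission values
are right would need more thought):

    job = block_job_create(job_id, &backup_job_driver, txn, backup_top, 0,
                           all_except_resize, speed, creation_flags,
                           cb, opaque, errp);

I would assume that would also make the job_id/bdrv_get_device_name()
workaround unnecessary, just like it is for mirror.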

>>>                                       so let's calculate job_id before. Do
>>> +     * it in the same way like block_job_create
>>> +     */
>>> +    if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>>> +        job_id = bdrv_get_device_name(bs);
>>> +    }
>>> +
>>> +    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>>> +    if (!backup_top) {
>>>           goto error;
>>>       }
>>>   
>>> -    /* The target must match the source in size, so no resize here either */
>>> -    job->target = blk_new(BLK_PERM_WRITE,
>>> -                          BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                          BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>>> -    ret = blk_insert_bs(job->target, target, errp);
>>> -    if (ret < 0) {
>>> +    /* job->len is fixed, so we can't allow resize */
>>> +    job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>>
>> Is there a reason you dropped PERM_CONSISTENT_READ here?
> 
> Because we don't use this blk for reads now; we read through the backup_top child.

Makes sense.

>>> +                           all_except_resize, speed, creation_flags,
>>> +                           cb, opaque, errp);
>>> +    if (!job) {
>>>           goto error;
>>>       }
>>>   
>>> +    job->source = backup_top->backing;
>>> +    job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>>
>> This looks really ugly.  I think as long as the block job performs
>> writes itself, it should use its own BlockBackend.
> 
> They are not BlockBackends, they are BdrvChildren.

Exactly, which is what I don't like.  They are children of a node that
is implemented in a different file; it looks weird to use them here.

> It was Kevin's idea to reuse the filter's
> children in the backup job:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html

It's still ugly if backup_top is in a different file.  Well, maybe just
to me.

> Hmm, and this is also why I need PERM_WRITE in backup_top, to write to target.
> 
>>
>> Alternatively, I think it would make sense for the backup-top filter to
>> offer functions that this job can use to issue writes to the target.
>> Then the backup job would no longer need direct access to the target as
>> a BdrvChild.

So what would be the problem with this?
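
For example, backup-top could export something like this (names are
completely made up, of course):

/*
 * Copy @bytes bytes starting at @offset from backup-top's source to the
 * same offset in its target, clearing the corresponding copy-bitmap bits.
 * Meant to be used by the backup job instead of poking at backup-top's
 * BdrvChildren directly.
 */
int coroutine_fn bdrv_backup_top_co_copy_range(BlockDriverState *backup_top,
                                               uint64_t offset,
                                               uint64_t bytes,
                                               bool *error_is_read);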

Max


