All of lore.kernel.org
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org
Cc: fam@euphon.net, stefanha@redhat.com, jcody@redhat.com,
	kwolf@redhat.com, den@openvz.org, eblake@redhat.com,
	jsnow@redhat.com
Subject: Re: [Qemu-devel] [PATCH v5 07/11] block: introduce backup-top filter driver
Date: Wed, 16 Jan 2019 17:02:07 +0100	[thread overview]
Message-ID: <5a822e18-6967-5059-bf21-6891aa701af4@redhat.com> (raw)
In-Reply-To: <20181229122027.42245-8-vsementsov@virtuozzo.com>

[-- Attachment #1: Type: text/plain, Size: 15568 bytes --]

On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
> Backup-top filter does copy-before-write operation. It should be
> inserted above active disk and has a target node for CBW, like the
> following:
> 
>     +-------+
>     | Guest |
>     +---+---+
>         |r,w
>         v
>     +---+-----------+  target   +---------------+
>     | backup_top    |---------->| target(qcow2) |
>     +---+-----------+   CBW     +---+-----------+
>         |
> backing |r,w
>         v
>     +---+---------+
>     | Active disk |
>     +-------------+
> 
> The driver will be used in backup instead of write-notifiers.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/backup-top.h  |  43 +++++++
>  block/backup-top.c  | 306 ++++++++++++++++++++++++++++++++++++++++++++
>  block/Makefile.objs |   2 +
>  3 files changed, 351 insertions(+)
>  create mode 100644 block/backup-top.h
>  create mode 100644 block/backup-top.c

Looks good to me overall, comments below:

> diff --git a/block/backup-top.h b/block/backup-top.h
> new file mode 100644
> index 0000000000..abe4dda574
> --- /dev/null
> +++ b/block/backup-top.h
> @@ -0,0 +1,43 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +

Include guards are missing here.

> +#include "qemu/osdep.h"
> +
> +#include "block/block_int.h"
> +
> +typedef struct BDRVBackupTopState {
> +    HBitmap *copy_bitmap; /* what should be copied to @target on guest write. */
> +    BdrvChild *target;
> +
> +    uint64_t bytes_copied;
> +} BDRVBackupTopState;
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs);
> +uint64_t bdrv_backup_top_progress(BlockDriverState *bs);
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp);

A comment describing at least parts of the interface would be nice; for
instance, that @target will be forced into the same AioContext as
@source, or that the backup-top node will be an implicit node.

> diff --git a/block/backup-top.c b/block/backup-top.c
> new file mode 100644
> index 0000000000..0e7b3e3142
> --- /dev/null
> +++ b/block/backup-top.c
> @@ -0,0 +1,306 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/cutils.h"
> +#include "qapi/error.h"
> +#include "block/block_int.h"
> +#include "block/qdict.h"
> +
> +#include "block/backup-top.h"
> +
> +static coroutine_fn int backup_top_co_preadv(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, int flags)
> +{
> +    /*
> +     * Features to be implemented:
> +     * F1. COR. save read data to fleecing target for fast access
> +     *     (to reduce reads). This possibly may be done with use of copy-on-read
> +     *     filter, but we need an ability to make COR requests optional: for
> +     *     example, if target is a ram-cache, and if it is full now, we should
> +     *     skip doing COR request, as it is actually not necessary.
> +     *
> +     * F2. Feature for guest: read from fleecing target if data is in ram-cache
> +     *     and is unchanged
> +     */
> +
> +    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
> +                                          uint64_t bytes)

I don't know if there is a real guideline, but usually parameters are
aligned at the opening parenthesis.

> +{
> +    int ret = 0;
> +    BDRVBackupTopState *s = bs->opaque;
> +    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
> +    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
> +    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
> +    size_t align = MAX(bdrv_opt_mem_align(bs->backing->bs),
> +                       bdrv_opt_mem_align(s->target->bs));
> +    struct iovec iov = {
> +        .iov_base = qemu_memalign(align, end - off),

I think it would be more correct to use qemu_blockalign() here and set
bs->bl.opt_mem_alignment somewhere.

> +        .iov_len = end - off

How about just leaving this out here... [1]

> +    };
> +    QEMUIOVector qiov;
> +
> +    qemu_iovec_init_external(&qiov, &iov, 1);

[1] ...and move this... [1]

> +
> +    /*
> +     * Features to be implemented:
> +     * F3. parallelize copying loop
> +     * F4. detect zeros

Isn't there detect-zeroes for that?

> +     * F5. use block_status ?
> +     * F6. don't copy clusters which are already cached by COR [see F1]
> +     * F7. if target is ram-cache and it is full, there should be a possibility
> +     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
> +     *     fast.

Another idea: Read all dirty data from bs->backing (in parallel, more or
less), and then perform all writes (to bs->backing and s->target) in
parallel as well.

(So the writes do not have to wait on one another)

> +     */
> +
> +    len = end - off;
> +    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
> +        iov.iov_len = qiov.size = len;

[1] ...here, and not setting qiov.size?

> +
> +        hbitmap_reset(s->copy_bitmap, off, len);
> +
> +        ret = bdrv_co_preadv(bs->backing, off, len, &qiov,
> +                             BDRV_REQ_NO_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto finish;
> +        }
> +
> +        ret = bdrv_co_pwritev(s->target, off, len, &qiov, BDRV_REQ_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto finish;
> +        }
> +
> +        s->bytes_copied += len;
> +        off += len;
> +        if (off >= end) {
> +            break;
> +        }
> +        len = end - off;
> +    }
> +
> +finish:
> +    qemu_vfree(iov.iov_base);
> +
> +    /*
> +     * F8. we fail guest request in case of error. We can alter it by
> +     * possibility to fail copying process instead, or retry several times, or
> +     * may be guest pause, etc.
> +     */
> +    return ret;
> +}
> +
> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
> +                                                  int64_t offset, int bytes)

(Indentation)

> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /*
> +     * Features to be implemented:
> +     * F9. possibility of lazy discard: just defer the discard after fleecing
> +     *     completion. If write (or new discard) occurs to the same area, just
> +     *     drop deferred discard.
> +     */
> +
> +    return bdrv_co_pdiscard(bs->backing, offset, bytes);
> +}
> +
> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
> +                                                 uint64_t offset,
> +                                                 uint64_t bytes,
> +                                                 QEMUIOVector *qiov, int flags)

(Indentation)

> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
> +{
> +    if (!bs->backing) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_flush(bs->backing->bs);

Why not the target as well?

> +}
> +
> +static void backup_top_refresh_filename(BlockDriverState *bs, QDict *opts)
> +{
> +    if (bs->backing == NULL) {
> +        /*
> +         * we can be here after failed bdrv_attach_child in
> +         * bdrv_set_backing_hd
> +         */
> +        return;
> +    }
> +    bdrv_refresh_filename(bs->backing->bs);
> +    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
> +            bs->backing->bs->filename);
> +}
> +
> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                       const BdrvChildRole *role,
> +                                       BlockReopenQueue *reopen_queue,
> +                                       uint64_t perm, uint64_t shared,
> +                                       uint64_t *nperm, uint64_t *nshared)

(Indentation)

> +{
> +    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared, nperm,
> +                              nshared);
> +
> +    if (role == &child_file) {
> +        /*
> +         * share write to target, to not interfere guest writes to it's disk

*interfere with guest writes to its disk

> +         * which will be in target backing chain
> +         */
> +        *nshared = *nshared | BLK_PERM_WRITE;
> +        *nperm = *nperm | BLK_PERM_WRITE;

Why do you need to take the write permission?  This filter does not
perform any writes unless it receives writes, right?  (So
bdrv_filter_default_perms() should take care of this.)

> +    } else {
> +        *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> +    }
> +}
> +
> +BlockDriver bdrv_backup_top_filter = {
> +    .format_name = "backup-top",
> +    .instance_size = sizeof(BDRVBackupTopState),
> +
> +    .bdrv_co_preadv             = backup_top_co_preadv,
> +    .bdrv_co_pwritev            = backup_top_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
> +    .bdrv_co_flush              = backup_top_co_flush,
> +
> +    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
> +
> +    .bdrv_refresh_filename      = backup_top_refresh_filename,
> +
> +    .bdrv_child_perm            = backup_top_child_perm,
> +
> +    .is_filter = true,
> +};
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp)
> +{
> +    Error *local_err = NULL;
> +    BDRVBackupTopState *state;
> +    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
> +                                                 NULL, BDRV_O_RDWR, errp);
> +
> +    if (!top) {
> +        return NULL;
> +    }
> +
> +    top->implicit = true;
> +    top->total_sectors = source->total_sectors;
> +    top->opaque = state = g_new0(BDRVBackupTopState, 1);
> +    state->copy_bitmap = copy_bitmap;
> +
> +    bdrv_ref(target);
> +    state->target = bdrv_attach_child(top, target, "target", &child_file, errp);
> +    if (!state->target) {
> +        bdrv_unref(target);
> +        bdrv_unref(top);
> +        return NULL;
> +    }
> +
> +    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
> +    bdrv_set_aio_context(target, bdrv_get_aio_context(source));
> +
> +    bdrv_drained_begin(source);
> +
> +    bdrv_ref(top);
> +    bdrv_append(top, source, &local_err);
> +
> +    if (local_err) {
> +        bdrv_unref(top);

This is done automatically by bdrv_append().

> +    }
> +
> +    bdrv_drained_end(source);
> +
> +    if (local_err) {
> +        bdrv_unref_child(top, state->target);
> +        bdrv_unref(top);
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    return top;
> +}
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    AioContext *aio_context = bdrv_get_aio_context(bs);
> +
> +    aio_context_acquire(aio_context);
> +
> +    bdrv_drained_begin(bs);
> +
> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
> +    bdrv_set_backing_hd(bs, NULL, &error_abort);

This is done automatically in bdrv_close(), and after bs has been
replaced by backing_bs(bs), I don't think new requests should come in,
so I don't think this needs to be done here.

> +
> +    bdrv_drained_end(bs);
> +
> +    if (s->target) {
> +        bdrv_unref_child(bs, s->target);
> +    }

And this should be done in a .bdrv_close() implementation, I think.

Max

> +    bdrv_unref(bs);
> +
> +    aio_context_release(aio_context);
> +}
> +
> +uint64_t bdrv_backup_top_progress(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    return s->bytes_copied;
> +}
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 7a81892a52..9072115c09 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>  
>  block-obj-y += crypto.o
>  
> +block-obj-y += backup-top.o
> +
>  common-obj-y += stream.o
>  
>  nfs.o-libs         := $(LIBNFS_LIBS)
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2019-01-16 16:02 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-29 12:20 [Qemu-devel] [PATCH v5 00/11] backup-top filter driver for backup Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 01/11] block/backup: simplify backup_incremental_init_copy_bitmap Vladimir Sementsov-Ogievskiy
2019-01-14 13:10   ` Max Reitz
2019-01-14 13:13     ` Max Reitz
2019-01-14 14:01     ` Vladimir Sementsov-Ogievskiy
2019-01-14 14:13       ` Max Reitz
2019-01-14 14:48         ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:05           ` Max Reitz
2019-01-23  8:20             ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:19               ` Max Reitz
2019-01-23 14:36               ` Eric Blake
2019-01-24 14:20                 ` Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 02/11] block/backup: move to copy_bitmap with granularity Vladimir Sementsov-Ogievskiy
2019-01-14 14:10   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 03/11] block: improve should_update_child Vladimir Sementsov-Ogievskiy
2019-01-14 14:32   ` Max Reitz
2019-01-14 16:13     ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:17       ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 04/11] iotests: handle -f argument correctly for qemu_io_silent Vladimir Sementsov-Ogievskiy
2019-01-14 14:36   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 05/11] iotests: allow resume_drive by node name Vladimir Sementsov-Ogievskiy
2019-01-14 14:46   ` Max Reitz
2019-01-14 16:06     ` Vladimir Sementsov-Ogievskiy
2019-01-16 13:11       ` Max Reitz
2019-01-23 13:22         ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:31           ` Vladimir Sementsov-Ogievskiy
2019-01-23 13:33             ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 06/11] iotests: prepare 055 to graph changes during backup job Vladimir Sementsov-Ogievskiy
2019-01-16 13:48   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 07/11] block: introduce backup-top filter driver Vladimir Sementsov-Ogievskiy
2019-01-16 16:02   ` Max Reitz [this message]
2019-01-17 12:13     ` Vladimir Sementsov-Ogievskiy
2019-01-18 12:05       ` Max Reitz
2019-01-23 13:47         ` Vladimir Sementsov-Ogievskiy
2019-04-13 16:08     ` Vladimir Sementsov-Ogievskiy
2019-04-13 16:08       ` Vladimir Sementsov-Ogievskiy
2019-04-13 17:03       ` Vladimir Sementsov-Ogievskiy
2019-04-13 17:03         ` Vladimir Sementsov-Ogievskiy
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 08/11] block/io: refactor wait_serialising_requests Vladimir Sementsov-Ogievskiy
2019-01-16 16:18   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 09/11] block: add lock/unlock range functions Vladimir Sementsov-Ogievskiy
2019-01-16 16:36   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 10/11] block/backup: tiny refactor backup_job_create Vladimir Sementsov-Ogievskiy
2019-01-18 13:00   ` Max Reitz
2018-12-29 12:20 ` [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers Vladimir Sementsov-Ogievskiy
2019-01-18 14:56   ` Max Reitz
2019-01-28 11:29     ` Vladimir Sementsov-Ogievskiy
2019-01-28 15:59       ` Max Reitz
2019-01-28 16:44         ` Vladimir Sementsov-Ogievskiy
2019-01-28 16:53           ` Max Reitz
2019-01-28 17:14             ` Vladimir Sementsov-Ogievskiy
2019-01-28 17:40           ` Kevin Wolf
2019-01-28 19:00             ` Vladimir Sementsov-Ogievskiy
2019-01-23 15:26 ` [Qemu-devel] [PATCH v5 00/11] backup-top filter driver for backup no-reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5a822e18-6967-5059-bf21-6891aa701af4@redhat.com \
    --to=mreitz@redhat.com \
    --cc=den@openvz.org \
    --cc=eblake@redhat.com \
    --cc=fam@euphon.net \
    --cc=jcody@redhat.com \
    --cc=jsnow@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.