From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:40632)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <jcody@redhat.com>) id 1T9ipC-00024N-1z
	for qemu-devel@nongnu.org; Thu, 06 Sep 2012 16:37:19 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <jcody@redhat.com>) id 1T9ip9-0002fu-VY
	for qemu-devel@nongnu.org; Thu, 06 Sep 2012 16:37:17 -0400
Received: from mx1.redhat.com ([209.132.183.28]:47114)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <jcody@redhat.com>) id 1T9ip9-0002fh-MP
	for qemu-devel@nongnu.org; Thu, 06 Sep 2012 16:37:15 -0400
Message-ID: <50490975.8040702@redhat.com>
Date: Thu, 06 Sep 2012 16:37:09 -0400
From: Jeff Cody <jcody@redhat.com>
MIME-Version: 1.0
References: <cover.1346351079.git.jcody@redhat.com>
	<812f2aca86713b00473144bb272f014cc5e8a7f4.1346351079.git.jcody@redhat.com>
	<5048AC83.8050508@redhat.com>
In-Reply-To: <5048AC83.8050508@redhat.com>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC v2 PATCH 2/6] block: add live block commit
 functionality
Reply-To: jcody@redhat.com
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: stefanha@gmail.com, supriyak@linux.vnet.ibm.com, eblake@redhat.com, qemu-devel@nongnu.org, pbonzini@redhat.com

On 09/06/2012 10:00 AM, Kevin Wolf wrote:
> Am 30.08.2012 20:47, schrieb Jeff Cody:
>> This adds the live commit coroutine.  This iteration focuses on the
>> commit only below the active layer, and not the active layer itself.
>>
>> The behaviour is similar to block streaming; the sectors are walked
>> through, and anything that exists above 'base' is committed back down
>> into base.  At the end, intermediate images are deleted, and the
>> chain stitched together.  Images are restored to their original open
>> flags upon completion.
>>
>> Signed-off-by: Jeff Cody <jcody@redhat.com>
>> ---
>>  block/Makefile.objs |   1 +
>>  block/commit.c      | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  block_int.h         |  19 +++++
>>  trace-events        |   2 +
>>  4 files changed, 224 insertions(+)
>>  create mode 100644 block/commit.c
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index b5754d3..4a136b8 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -4,6 +4,7 @@ block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>>  block-obj-y += qed-check.o
>>  block-obj-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
>>  block-obj-y += stream.o
>> +block-obj-y += commit.o
>>  block-obj-$(CONFIG_WIN32) += raw-win32.o
>>  block-obj-$(CONFIG_POSIX) += raw-posix.o
>>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>> diff --git a/block/commit.c b/block/commit.c
>> new file mode 100644
>> index 0000000..bd3d882
>> --- /dev/null
>> +++ b/block/commit.c
>> @@ -0,0 +1,202 @@
>> +/*
>> + * Live block commit
>> + *
>> + * Copyright Red Hat, Inc. 2012
>> + *
>> + * Authors:
>> + *  Jeff Cody   <jcody@redhat.com>
>> + *  Based on stream.c by Stefan Hajnoczi
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
>> + * See the COPYING.LIB file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "trace.h"
>> +#include "block_int.h"
>> +#include "qemu/ratelimit.h"
>> +
>> +enum {
>> +    /*
>> +     * Size of data buffer for populating the image file.  This should be large
>> +     * enough to process multiple clusters in a single call, so that populating
>> +     * contiguous regions of the image is efficient.
>> +     */
>> +    COMMIT_BUFFER_SIZE = 512 * 1024, /* in bytes */
>> +};
>> +
>> +#define SLICE_TIME 100000000ULL /* ns */
>> +
>> +typedef struct CommitBlockJob {
>> +    BlockJob common;
>> +    RateLimit limit;
>> +    BlockDriverState *active;
>> +    BlockDriverState *top;
>> +    BlockDriverState *base;
>> +    BlockErrorAction on_error;
>> +    int base_flags;
>> +    int top_flags;
>> +} CommitBlockJob;
>> +
>> +static int coroutine_fn commit_populate(BlockDriverState *bs,
>> +                                        BlockDriverState *base,
>> +                                        int64_t sector_num, int nb_sectors,
>> +                                        void *buf)
>> +{
>> +    if (bdrv_read(bs, sector_num, buf, nb_sectors)) {
>> +        return -EIO;
>> +    }
>> +    if (bdrv_write(base, sector_num, buf, nb_sectors)) {
>> +        return -EIO;
>> +    }
> 
> Don't throw error codes away.
> 

OK, I'll pass the return from bdrv_read/write to the caller.

> What should we do with backing files that are smaller than the image to
> commit? In this version, data is copied up to the size of the backing
> file, and then we get -EIO from bdrv_co_do_writev().
> 

We could leave it like that, and let it receive -EIO in that case.
Alternatively, we could try and determine before the commit if the data
will fit in the base, and return -ENOSPC if not.

>> +    return 0;
>> +}
>> +
>> +static void coroutine_fn commit_run(void *opaque)
>> +{
>> +    CommitBlockJob *s = opaque;
>> +    BlockDriverState *active = s->active;
>> +    BlockDriverState *top = s->top;
>> +    BlockDriverState *base = s->base;
>> +    BlockDriverState *top_child = NULL;
>> +    int64_t sector_num, end;
>> +    int error = 0;
>> +    int ret = 0;
>> +    int n = 0;
>> +    void *buf;
>> +    int bytes_written = 0;
>> +
>> +    s->common.len = bdrv_getlength(top);
>> +    if (s->common.len < 0) {
>> +        block_job_complete(&s->common, s->common.len);
>> +        return;
>> +    }
>> +
>> +    top_child = bdrv_find_child(active, top);
>> +
>> +    end = s->common.len >> BDRV_SECTOR_BITS;
>> +    buf = qemu_blockalign(top, COMMIT_BUFFER_SIZE);
>> +
>> +    for (sector_num = 0; sector_num < end; sector_num += n) {
>> +        uint64_t delay_ns = 0;
>> +        bool copy;
>> +
>> +wait:
>> +        /* Note that even when no rate limit is applied we need to yield
>> +         * with no pending I/O here so that qemu_aio_flush() returns.
>> +         */
>> +        block_job_sleep_ns(&s->common, rt_clock, delay_ns);
>> +        if (block_job_is_cancelled(&s->common)) {
>> +            break;
>> +        }
>> +        /* Copy if allocated above the base */
>> +        ret = bdrv_co_is_allocated_above(top, base, sector_num,
>> +                                         COMMIT_BUFFER_SIZE / BDRV_SECTOR_SIZE,
>> +                                         &n);
>> +        copy = (ret == 1);
>> +        trace_commit_one_iteration(s, sector_num, n, ret);
>> +        if (ret >= 0 && copy) {
> 
> If ret == 1, it's automatically >= 0...
> 
> By the way, I'm not sure if the interface is 0/1/-errno or
> 0/positive/-errno.
> 

The header for bdrv_co_is_allocated() states that it returns true if
it is allocated, and false otherwise.  However, the function actually
returns true if allocated, false if not, and something < 0 on failure.

But either way, you are right, I can just use 'copy' here.

>> +            if (s->common.speed) {
>> +                delay_ns = ratelimit_calculate_delay(&s->limit, n);
>> +                if (delay_ns > 0) {
>> +                    goto wait;
>> +                }
>> +            }
>> +            ret = commit_populate(top, base, sector_num, n, buf);
>> +            bytes_written += n * BDRV_SECTOR_SIZE;
>> +        }
>> +        if (ret < 0) {
>> +            if (s->on_error == BLOCK_ERR_STOP_ANY ||
>> +                s->on_error == BLOCK_ERR_STOP_ENOSPC) {
> 
> Shouldn't there be a check for ret == -ENOSPC then...? And does this
> error handling do anything useful if you can't pause the job? Wouldn't
> it retry all the time?
> 

We should check for ret == -ENOSPC, yes.  On the error handling, we
should probably just stop completely here, and return error.

>> +                n = 0;
>> +                continue;
>> +            }
>> +            if (error == 0) {
>> +                error = ret;
>> +            }
>> +            if (s->on_error == BLOCK_ERR_REPORT) {
>> +                break;
>> +            }
>> +        }
>> +        ret = 0;
>> +
>> +        /* Publish progress */
>> +        s->common.offset += n * BDRV_SECTOR_SIZE;
>> +    }
>> +
>> +    if (!block_job_is_cancelled(&s->common) && sector_num == end && ret == 0) {
>> +        /* success */
>> +        if (bdrv_delete_intermediate(active, top, base)) {
>> +            /* something went wrong! */
>> +            /* TODO:add error reporting here */
> 
> Indeed :-)
> 
>> +        }
>> +    }
>> +
>> +    /* restore base open flags here if appropriate (e.g., change the base back
>> +     * to r/o). These reopens do not need to be atomic, since we won't abort
>> +     * even on failure here */
>> +
>> +    if (s->base_flags != bdrv_get_flags(base)) {
>> +        bdrv_reopen(base, s->base_flags, NULL);
>> +    }
>> +    if (s->top_flags != bdrv_get_flags(top_child)) {
>> +        bdrv_reopen(top_child, s->top_flags, NULL);
>> +    }
>> +
>> +    qemu_vfree(buf);
>> +    block_job_complete(&s->common, ret);
>> +}
>> +
>> +static void commit_set_speed(BlockJob *job, int64_t speed, Error **errp)
>> +{
>> +    CommitBlockJob *s = container_of(job, CommitBlockJob, common);
>> +
>> +    if (speed < 0) {
>> +        error_set(errp, QERR_INVALID_PARAMETER, "speed");
>> +        return;
>> +    }
>> +    ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
>> +}
>> +
>> +static BlockJobType commit_job_type = {
>> +    .instance_size = sizeof(CommitBlockJob),
>> +    .job_type      = "commit",
>> +    .set_speed     = commit_set_speed,
>> +};
>> +
>> +void commit_start(BlockDriverState *bs, BlockDriverState *base,
>> +                 BlockDriverState *top, int64_t speed,
>> +                 BlockErrorAction on_error, BlockDriverCompletionFunc *cb,
>> +                 void *opaque, int orig_base_flags, int orig_top_flags,
>> +                 Error **errp)
>> +{
>> +    CommitBlockJob *s;
>> +
>> +    if ((on_error == BLOCK_ERR_STOP_ANY ||
>> +         on_error == BLOCK_ERR_STOP_ENOSPC) &&
>> +        !bdrv_iostatus_is_enabled(bs)) {
>> +        error_set(errp, QERR_INVALID_PARAMETER_COMBINATION);
>> +        return;
>> +    }
>> +
>> +    s = block_job_create(&commit_job_type, bs, speed, cb, opaque, errp);
>> +    if (!s) {
>> +        return;
>> +    }
>> +
>> +    s->base   = base;
>> +    s->top    = top;
>> +    s->active = bs;
>> +
>> +    s->base_flags = orig_base_flags;
>> +    s->top_flags  = orig_top_flags;
> 
> So it's the caller who is expected to reopen r/w and then pass the
> original flags in? Can't we do both of it here?
> 

Yes, that would be cleaner - I'll move the reopen() to commit_start().


>> +
>> +    s->on_error = on_error;
>> +    s->common.co = qemu_coroutine_create(commit_run);
>> +
>> +    trace_commit_start(bs, base, top, s, s->common.co, opaque,
>> +                       orig_base_flags, orig_top_flags);
>> +    qemu_coroutine_enter(s->common.co, s);
>> +
>> +    return;
> 
> Unnecessary return.
> 
> Kevin
>