From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4F4BB712.8090005@redhat.com>
Date: Mon, 27 Feb 2012 12:02:10 -0500
From: Jeff Cody
Reply-To: jcody@redhat.com
MIME-Version: 1.0
References: <4F4BA56B.7040608@redhat.com>
In-Reply-To: <4F4BA56B.7040608@redhat.com>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v2 1/2] qapi: Introduce blockdev-group-snapshot-sync command
To: Kevin Wolf
Cc: pbonzini@redhat.com, eblake@redhat.com, qemu-devel@nongnu.org, armbru@redhat.com, lcapitulino@redhat.com

On 02/27/2012 10:46 AM, Kevin Wolf wrote:
> Am 26.02.2012 20:31, schrieb Jeff Cody:
>> This is a QAPI/QMP only command to take a snapshot of a group of
>> devices. This is similar to the blockdev-snapshot-sync command, except
>> blockdev-group-snapshot-sync accepts a list of devices, filenames, and
>> formats.
>>
>> An attempt is made to keep the group snapshot atomic; if the
>> creation or open of any of the new snapshots fails, then all of
>> the new snapshots are abandoned, and the name of the snapshot image
>> that failed is returned. The failure case should not interrupt
>> any operations.
>>
>> Rather than use bdrv_close() along with a subsequent bdrv_open() to
>> perform the pivot, the original image is never closed and the new
>> image is placed 'in front' of the original image via manipulation
>> of the BlockDriverState fields. Thus, once the new snapshot image
>> has been successfully created, there are no more failure points
>> before pivoting to the new snapshot.
>>
>> This allows the group of disks to remain consistent with each other,
>> even across snapshot failures.
>>
>> Signed-off-by: Jeff Cody
>> ---
>>  block.c          |   47 ++++++++++++++++++++
>>  block.h          |    1 +
>>  blockdev.c       |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  qapi-schema.json |   38 ++++++++++++++++
>>  4 files changed, 214 insertions(+), 0 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 3621d11..0045ab1 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -880,6 +880,53 @@ void bdrv_make_anon(BlockDriverState *bs)
>>      bs->device_name[0] = '\0';
>>  }
>>
>> +/*
>> + * Add new bs contents at the top of an image chain while the chain is live,
>> + * while keeping required fields on the top layer.
>> + *
>> + * It is assumed that bs_new already points to an existing image,
>> + * with the correct backing filename of top->backing_file
>> + */
>> +void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top)
>> +{
>> +    BlockDriverState tmp;
>> +
>> +    /* the new bs must not be in bdrv_states */
>> +    bdrv_make_anon(bs_new);
>> +
>> +    tmp = *bs_new;
>> +    tmp.backing_hd = bs_new;
>> +
>> +    /* there are some fields that need to stay on the top layer: */
>> +
>> +    /* dev info */
>> +    tmp.dev_ops = bs_top->dev_ops;
>> +    tmp.dev_opaque = bs_top->dev_opaque;
>> +    tmp.dev = bs_top->dev;
>> +    tmp.buffer_alignment = bs_top->buffer_alignment;
>> +    tmp.copy_on_read = bs_top->copy_on_read;
>> +
>> +    /* i/o timing parameters */
>> +    tmp.slice_time = bs_top->slice_time;
>> +    tmp.slice_start = bs_top->slice_start;
>> +    tmp.slice_end = bs_top->slice_end;
>> +    tmp.io_limits = bs_top->io_limits;
>> +    tmp.io_base = bs_top->io_base;
>> +    tmp.throttled_reqs = bs_top->throttled_reqs;
>> +    tmp.block_timer = bs_top->block_timer;
>> +    tmp.io_limits_enabled = bs_top->io_limits_enabled;
>
> What about iostatus_enabled/iostatus? Possibly on_read/write_error and
> geometry, too. (Maybe the correct answer is that you're doing the right
> thing, but I wanted to bring it up)

I think you are right, those should probably be preserved as well; I
don't know if it hurts to not preserve them, but since these are all set
from drive_init, I think it makes sense to keep them on the top layer.

>
>> +
>> +    /* keep the same entry in bdrv_states */
>> +    pstrcpy(tmp.device_name, sizeof(tmp.device_name), bs_top->device_name);
>> +    tmp.list = bs_top->list;
>> +
>> +    /* swap contents of the fixed new bs and the current top */
>> +    *bs_new = *bs_top;
>> +    *bs_top = tmp;
>> +
>> +    bdrv_detach_dev(bs_new, bs_new->dev);
>> +}
>
> The last line would actually deserve a comment /* clear the copied
> fields in the new backing file */, which makes clear that there are
> probably some more fields missing in this section.

OK, added.

>
>> +
>>  void bdrv_delete(BlockDriverState *bs)
>>  {
>>      assert(!bs->dev);
>> diff --git a/block.h b/block.h
>> index cae289b..190a780 100644
>> --- a/block.h
>> +++ b/block.h
>> @@ -114,6 +114,7 @@ int bdrv_create(BlockDriver *drv, const char* filename,
>>  int bdrv_create_file(const char* filename, QEMUOptionParameter *options);
>>  BlockDriverState *bdrv_new(const char *device_name);
>>  void bdrv_make_anon(BlockDriverState *bs);
>> +void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top);
>>  void bdrv_delete(BlockDriverState *bs);
>>  int bdrv_parse_cache_flags(const char *mode, int *flags);
>>  int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags);
>> diff --git a/blockdev.c b/blockdev.c
>> index 05e7c5e..560f7e8 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -714,6 +714,134 @@ void qmp_blockdev_snapshot_sync(const char *device, const char *snapshot_file,
>>      }
>>  }
>>
>> +
>> +/* New and old BlockDriverState structs for group snapshots */
>> +typedef struct BlkGroupSnapshotStates {
>> +    BlockDriverState *old_bs;
>> +    BlockDriverState *new_bs;
>> +    bool is_open;
>> +    QSIMPLEQ_ENTRY(BlkGroupSnapshotStates) entry;
>> +} BlkGroupSnapshotStates;
>> +
>> +/*
>> + * 'Atomic' group snapshots. The snapshots are taken as a set, and if any fail
>> + * then we do not pivot any of the devices in the group, and abandon the
>> + * snapshots
>> + */
>> +void qmp_blockdev_group_snapshot_sync(SnapshotDevList *dev_list,
>> +                                                    Error **errp)
>
> Too much indentation?

Sure :) Fixed.
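Coming back to your iostatus/geometry question above: what I have in
mind is roughly the following addition to the "fields that need to stay
on the top layer" section. This is only a sketch; I still need to verify
that all of these are plain members of BlockDriverState that can be
copied the same way as the existing ones (the geometry hints in
particular):

    /* i/o error handling and status reporting (assuming these can be
     * copied like the fields above) */
    tmp.on_read_error = bs_top->on_read_error;
    tmp.on_write_error = bs_top->on_write_error;
    tmp.iostatus_enabled = bs_top->iostatus_enabled;
    tmp.iostatus = bs_top->iostatus;

    /* geometry hints set up by drive_init */
    tmp.cyls = bs_top->cyls;
    tmp.heads = bs_top->heads;
    tmp.secs = bs_top->secs;
    tmp.translation = bs_top->translation;

The idea is the same as for the existing fields: everything that
drive_init sets up stays attached to the device, not to the image.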
>
>> +{
>> +    int ret = 0;
>> +    SnapshotDevList *dev_entry = dev_list;
>> +    SnapshotDev *dev_info = NULL;
>> +    BlkGroupSnapshotStates *states;
>> +    BlockDriver *proto_drv;
>> +    BlockDriver *drv;
>> +    int flags;
>> +    const char *format;
>> +    const char *snapshot_file;
>> +
>> +    QSIMPLEQ_HEAD(snap_bdrv_states, BlkGroupSnapshotStates) snap_bdrv_states;
>> +    QSIMPLEQ_INIT(&snap_bdrv_states);
>> +
>> +    /* We don't do anything in this loop that commits us to the snapshot */
>> +    while (NULL != dev_entry) {
>> +        dev_info = dev_entry->value;
>> +        dev_entry = dev_entry->next;
>> +
>> +        states = g_malloc0(sizeof(BlkGroupSnapshotStates));
>> +        states->is_open = false;
>> +
>> +        states->old_bs = bdrv_find(dev_info->device);
>> +
>> +        if (!states->old_bs) {
>> +            error_set(errp, QERR_DEVICE_NOT_FOUND, dev_info->device);
>> +            goto fail;
>> +        }
>> +
>> +        if (bdrv_in_use(states->old_bs)) {
>> +            error_set(errp, QERR_DEVICE_IN_USE, dev_info->device);
>> +            goto fail;
>> +        }
>> +
>> +        snapshot_file = dev_info->snapshot_file;
>> +
>> +        flags = states->old_bs->open_flags;
>> +
>> +        if (!dev_info->has_format) {
>> +            format = "qcow2";
>> +        } else {
>> +            format = dev_info->format;
>> +        }
>> +
>> +        drv = bdrv_find_format(format);
>> +        if (!drv) {
>> +            error_set(errp, QERR_INVALID_BLOCK_FORMAT, format);
>> +            goto fail;
>> +        }
>> +
>> +        proto_drv = bdrv_find_protocol(snapshot_file);
>> +        if (!proto_drv) {
>> +            error_set(errp, QERR_INVALID_BLOCK_FORMAT, format);
>> +            goto fail;
>> +        }
>> +
>> +        /* create new image w/backing file */
>> +        ret = bdrv_img_create(snapshot_file, format,
>> +                              states->old_bs->filename,
>> +                              drv->format_name, NULL, -1, flags);
>> +        if (ret) {
>> +            error_set(errp, QERR_OPEN_FILE_FAILED, snapshot_file);
>> +            goto fail;
>> +        }
>> +
>> +        /* We will manually add the backing_hd field to the bs later */
>> +        states->new_bs = bdrv_new("");
>> +        ret = bdrv_open(states->new_bs, snapshot_file,
>> +                        flags | BDRV_O_NO_BACKING, drv);
>> +        states->is_open = true;
>> +
>> +        if (ret != 0) {
>> +            error_set(errp, QERR_OPEN_FILE_FAILED, snapshot_file);
>> +            goto close_and_fail;
>> +        }
>> +        QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, states, entry);
>
> Inserting it only here means that all the error cases above leak states.

Yes. Will move it up to immediately after the bdrv_open().

>
>> +    }
>> +
>> +    /*
>> +     * Now we will drain, flush, & pivot everything - we are committed at this
>> +     * point.
>> +     */
>> +    bdrv_drain_all();
>
> I would feel more comfortable if we could do the bdrv_drain_all() at the
> very beginning of the function. Not that I know of a specific scenario
> that would go wrong, but you have a nested main loop here that could do
> more or less anything.

I can move it up to the beginning if desired... My thought was that it
was best to drain closer to the point of commit.

>
>> +    QSIMPLEQ_FOREACH(states, &snap_bdrv_states, entry) {
>> +        bdrv_flush(states->old_bs);
>
> This can return an error which must be checked. And of course, we must
> do it before committing to the snapshot (but after bdrv_drain_all).

Hmm. If the flush returns an error, do we abandon at this point?
Perhaps it would be best to loop through and flush each device first,
and if no flushes fail, _then_ loop through and perform the
bdrv_append(). Once we are calling bdrv_append(), we want no possible
failure points.
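Concretely, I am picturing something like this (untested sketch; the
error argument is just a placeholder for whatever error class fits best
here):

    /* first pass: flush all devices while we can still abandon the new
     * images if anything fails */
    QSIMPLEQ_FOREACH(states, &snap_bdrv_states, entry) {
        ret = bdrv_flush(states->old_bs);
        if (ret < 0) {
            error_set(errp, QERR_IO_ERROR); /* placeholder error class */
            goto close_and_fail;
        }
    }

    /* second pass: past the point of no return, pivot every device;
     * bdrv_append() cannot fail */
    QSIMPLEQ_FOREACH(states, &snap_bdrv_states, entry) {
        bdrv_append(states->new_bs, states->old_bs);
    }

That keeps every possible failure before the first bdrv_append(), so the
group either pivots completely or not at all.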
>
>> +        /* This removes our old bs from the bdrv_states, and adds the new bs */
>> +        bdrv_append(states->new_bs, states->old_bs);
>> +    }
>> +
>> +    /* success */
>> +    goto exit;
>> +
>> +close_and_fail:
>> +    /*
>> +     * failure, and it is all-or-none; abandon each new bs, and keep using
>> +     * the original bs for all images
>> +     */
>> +    QSIMPLEQ_FOREACH(states, &snap_bdrv_states, entry) {
>> +        if (states->is_open) {
>> +            bdrv_close(states->new_bs);
>
> bdrv_delete, as Stefan already mentioned.
>
>> +        }
>> +    }
>> +fail:
>> +exit:
>> +    QSIMPLEQ_FOREACH(states, &snap_bdrv_states, entry) {
>> +        g_free(states);
>> +    }
>> +    return;
>> +}
>
> Kevin