Date: Wed, 19 Jun 2013 12:50:16 +0200
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v5 03/11] block: add basic backup support to block driver
To: Stefan Hajnoczi
Cc: Fam Zheng, qemu-devel@nongnu.org, dietmar@proxmox.com, imain@redhat.com,
 Paolo Bonzini, xiawenc@linux.vnet.ibm.com
Message-ID: <20130619105015.GA2934@dhcp-200-207.str.redhat.com>
In-Reply-To: <1369917299-5725-4-git-send-email-stefanha@redhat.com>
References: <1369917299-5725-1-git-send-email-stefanha@redhat.com>
 <1369917299-5725-4-git-send-email-stefanha@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Am 30.05.2013 um 14:34 hat Stefan Hajnoczi geschrieben:
> From: Dietmar Maurer
>
> backup_start() creates a block job that copies a point-in-time snapshot
> of a block device to a target block device.
>
> We call backup_do_cow() for each write during backup. That function
> reads the original data from the block device before it gets
> overwritten. The data is then written to the target device.
>
> Currently backup cluster size is hardcoded to 65536 bytes.
>
> [I made a number of changes to Dietmar's original patch and folded them
> in to make code review easy. Here is the full list:
>
> * Drop BackupDumpFunc interface in favor of a target block device
> * Detect zero clusters with buffer_is_zero() and use bdrv_co_write_zeroes()
> * Use 0 delay instead of 1us, like other block jobs
> * Unify creation/start functions into backup_start()
> * Simplify cleanup, free bitmap in backup_run() instead of cb
>   function
> * Use HBitmap to avoid duplicating bitmap code
> * Use bdrv_getlength() instead of accessing ->total_sectors
>   directly
> * Delete the backup.h header file, it is no longer necessary
> * Move ./backup.c to block/backup.c
> * Remove #ifdefed out code
> * Coding style and whitespace cleanups
> * Use bdrv_add_before_write_notifier() instead of blockjob-specific hooks
> * Keep our own in-flight CowRequest list instead of using block.c
>   tracked requests. This means a little code duplication but is much
>   simpler than trying to share the tracked requests list and use the
>   backup block size.
> * Add on_source_error and on_target_error error handling.
>
> -- stefanha]
>
> Signed-off-by: Dietmar Maurer
> Signed-off-by: Stefan Hajnoczi
>
> backup size fixes
>
> Signed-off-by: Stefan Hajnoczi
> ---
>  block/Makefile.objs       |   1 +
>  block/backup.c            | 355 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/block/block_int.h |  19 +++
>  3 files changed, 375 insertions(+)
>  create mode 100644 block/backup.c
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 5f0358a..88bd101 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -20,5 +20,6 @@ endif
>  common-obj-y += stream.o
>  common-obj-y += commit.o
>  common-obj-y += mirror.o
> +common-obj-y += backup.o
>  
>  $(obj)/curl.o: QEMU_CFLAGS+=$(CURL_CFLAGS)
> diff --git a/block/backup.c b/block/backup.c
> new file mode 100644
> index 0000000..466744a
> --- /dev/null
> +++ b/block/backup.c
> @@ -0,0 +1,355 @@
> +/*
> + * QEMU backup
> + *
> + * Copyright (C) 2013 Proxmox Server Solutions
> + *
> + * Authors:
> + *  Dietmar Maurer (dietmar@proxmox.com)
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include
> +#include
> +#include
> +
> +#include "block/block.h"
> +#include "block/block_int.h"
> +#include "block/blockjob.h"
> +#include "qemu/ratelimit.h"
> +
> +#define DEBUG_BACKUP 0
> +
> +#define DPRINTF(fmt, ...) \
> +    do { \
> +        if (DEBUG_BACKUP) { \
> +            fprintf(stderr, "backup: " fmt, ## __VA_ARGS__); \
> +        } \
> +    } while (0)
> +
> +#define BACKUP_CLUSTER_BITS 16
> +#define BACKUP_CLUSTER_SIZE (1 << BACKUP_CLUSTER_BITS)
> +#define BACKUP_SECTORS_PER_CLUSTER (BACKUP_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
> +
> +#define SLICE_TIME 100000000ULL /* ns */
> +
> +typedef struct CowRequest {
> +    int64_t start;
> +    int64_t end;
> +    QLIST_ENTRY(CowRequest) list;
> +    CoQueue wait_queue; /* coroutines blocked on this request */
> +} CowRequest;
> +
> +typedef struct BackupBlockJob {
> +    BlockJob common;
> +    BlockDriverState *target;
> +    RateLimit limit;
> +    BlockdevOnError on_source_error;
> +    BlockdevOnError on_target_error;
> +    CoRwlock flush_rwlock;
> +    uint64_t sectors_read;
> +    HBitmap *bitmap;
> +    QLIST_HEAD(, CowRequest) inflight_reqs;
> +} BackupBlockJob;
> +
> +/* See if in-flight requests overlap and wait for them to complete */
> +static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
> +                                                       int64_t start,
> +                                                       int64_t end)
> +{
> +    CowRequest *req;
> +    bool retry;
> +
> +    do {
> +        retry = false;
> +        QLIST_FOREACH(req, &job->inflight_reqs, list) {
> +            if (end > req->start && start < req->end) {
> +                qemu_co_queue_wait(&req->wait_queue);
> +                retry = true;
> +                break;
> +            }
> +        }
> +    } while (retry);
> +}
> +
> +/* Keep track of an in-flight request */
> +static void cow_request_begin(CowRequest *req, BackupBlockJob *job,
> +                              int64_t start, int64_t end)
> +{
> +    req->start = start;
> +    req->end = end;
> +    qemu_co_queue_init(&req->wait_queue);
> +    QLIST_INSERT_HEAD(&job->inflight_reqs, req, list);
> +}
> +
> +/* Forget about a completed request */
> +static void cow_request_end(CowRequest *req)
> +{
> +    QLIST_REMOVE(req, list);
> +    qemu_co_queue_restart_all(&req->wait_queue);
> +}
> +
> +static int coroutine_fn backup_do_cow(BlockDriverState *bs,
> +                                      int64_t sector_num, int nb_sectors,
> +                                      bool *error_is_read)
> +{
> +    BackupBlockJob *job = (BackupBlockJob *)bs->job;
> +    CowRequest cow_request;
> +    struct iovec iov;
> +    QEMUIOVector bounce_qiov;
> +    void *bounce_buffer = NULL;
> +    int ret = 0;
> +    int64_t start, end, total_sectors;
> +    int n;
> +
> +    qemu_co_rwlock_rdlock(&job->flush_rwlock);
> +
> +    start = sector_num / BACKUP_SECTORS_PER_CLUSTER;
> +    end = DIV_ROUND_UP(sector_num + nb_sectors, BACKUP_SECTORS_PER_CLUSTER);
> +
> +    DPRINTF("%s enter %s C%" PRId64 " %" PRId64 " %d\n",
> +            __func__, bdrv_get_device_name(bs), start, sector_num, nb_sectors);

Maybe put the first "%s" and __func__ directly into the DPRINTF macro?

> +    wait_for_overlapping_requests(job, start, end);
> +    cow_request_begin(&cow_request, job, start, end);
> +
> +    total_sectors = bdrv_getlength(bs);
> +    if (total_sectors < 0) {
> +        if (error_is_read) {
> +            *error_is_read = true;
> +        }
> +        ret = total_sectors;
> +        goto out;
> +    }
> +    total_sectors /= BDRV_SECTOR_SIZE;

Why is this needed? There are two callers of the function: One is the
write notifier, which already has a sector number that is checked
against the image size. The other one is the background job, which gets
the size once when it starts.

I don't think there's a reason to call the block driver each time and
potentially do an expensive request. At first I thought it had something
to do with resizing images, but resizing is forbidden while a block job
is running (otherwise the job's main loop exit condition would be wrong,
too), so that doesn't make it necessary.
> +
> +    for (; start < end; start++) {
> +        if (hbitmap_get(job->bitmap, start)) {
> +            DPRINTF("%s skip C%" PRId64 "\n", __func__, start);
> +            continue; /* already copied */
> +        }
> +
> +        DPRINTF("%s C%" PRId64 "\n", __func__, start);
> +
> +        n = MIN(BACKUP_SECTORS_PER_CLUSTER,
> +                total_sectors - start * BACKUP_SECTORS_PER_CLUSTER);
> +
> +        if (!bounce_buffer) {
> +            bounce_buffer = qemu_blockalign(bs, BACKUP_CLUSTER_SIZE);
> +        }
> +        iov.iov_base = bounce_buffer;
> +        iov.iov_len = n * BDRV_SECTOR_SIZE;
> +        qemu_iovec_init_external(&bounce_qiov, &iov, 1);
> +
> +        ret = bdrv_co_readv(bs, start * BACKUP_SECTORS_PER_CLUSTER, n,
> +                            &bounce_qiov);
> +        if (ret < 0) {
> +            DPRINTF("%s bdrv_co_readv C%" PRId64 " failed\n", __func__, start);
> +            if (error_is_read) {
> +                *error_is_read = true;
> +            }
> +            goto out;
> +        }
> +
> +        if (buffer_is_zero(iov.iov_base, iov.iov_len)) {
> +            ret = bdrv_co_write_zeroes(job->target,
> +                                       start * BACKUP_SECTORS_PER_CLUSTER, n);
> +        } else {
> +            ret = bdrv_co_writev(job->target,
> +                                 start * BACKUP_SECTORS_PER_CLUSTER, n,
> +                                 &bounce_qiov);
> +        }
> +        if (ret < 0) {
> +            DPRINTF("%s write C%" PRId64 " failed\n", __func__, start);
> +            if (error_is_read) {
> +                *error_is_read = false;
> +            }
> +            goto out;
> +        }
> +
> +        hbitmap_set(job->bitmap, start, 1);
> +
> +        /* Publish progress */
> +        job->sectors_read += n;
> +        job->common.offset += n * BDRV_SECTOR_SIZE;

This is interesting, because the function is not only called by the
background job, but also by write notifiers. So 'offset' in a literal
sense doesn't make too much sense, because we're not operating purely
sequentially. The QAPI documentation describes 'offset' like this:

    # @offset: the current progress value

If we take it as just that, I think we could actually consider this code
correct, because it's a useful measure of progress (each sector is
copied only once, either by the job or by a notifier), even though it
really has nothing to do with an offset into the image. Maybe a comment
would be appropriate.

> +
> +        DPRINTF("%s done C%" PRId64 "\n", __func__, start);
> +    }
> +
> +out:
> +    if (bounce_buffer) {
> +        qemu_vfree(bounce_buffer);
> +    }
> +
> +    cow_request_end(&cow_request);
> +
> +    qemu_co_rwlock_unlock(&job->flush_rwlock);
> +
> +    return ret;
> +}
> +
> +static int coroutine_fn backup_before_write_notify(
> +        NotifierWithReturn *notifier,
> +        void *opaque)
> +{
> +    BdrvTrackedRequest *req = opaque;
> +
> +    return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
> +}
> +
> +static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
> +{
> +    BackupBlockJob *s = container_of(job, BackupBlockJob, common);
> +
> +    if (speed < 0) {
> +        error_set(errp, QERR_INVALID_PARAMETER, "speed");
> +        return;
> +    }
> +    ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
> +}
> +
> +static void backup_iostatus_reset(BlockJob *job)
> +{
> +    BackupBlockJob *s = container_of(job, BackupBlockJob, common);
> +
> +    bdrv_iostatus_reset(s->target);
> +}
> +
> +static BlockJobType backup_job_type = {
> +    .instance_size = sizeof(BackupBlockJob),
> +    .job_type = "backup",
> +    .set_speed = backup_set_speed,
> +    .iostatus_reset = backup_iostatus_reset,
> +};

Align the '=' on the same column? The struct should probably be const,
too.
> +
> +static BlockErrorAction backup_error_action(BackupBlockJob *job,
> +                                            bool read, int error)
> +{
> +    if (read) {
> +        return block_job_error_action(&job->common, job->common.bs,
> +                                      job->on_source_error, true, error);
> +    } else {
> +        return block_job_error_action(&job->common, job->target,
> +                                      job->on_target_error, false, error);
> +    }
> +}
> +
> +static void coroutine_fn backup_run(void *opaque)
> +{
> +    BackupBlockJob *job = opaque;
> +    BlockDriverState *bs = job->common.bs;
> +    BlockDriverState *target = job->target;
> +    BlockdevOnError on_target_error = job->on_target_error;
> +    NotifierWithReturn before_write = {
> +        .notify = backup_before_write_notify,
> +    };
> +    int64_t start, end;
> +    int ret = 0;
> +
> +    QLIST_INIT(&job->inflight_reqs);
> +    qemu_co_rwlock_init(&job->flush_rwlock);
> +
> +    start = 0;
> +    end = DIV_ROUND_UP(bdrv_getlength(bs) / BDRV_SECTOR_SIZE,

bdrv_getlength() can fail.

> +                       BACKUP_SECTORS_PER_CLUSTER);
> +
> +    job->bitmap = hbitmap_alloc(end, 0);
> +
> +    bdrv_set_on_error(target, on_target_error, on_target_error);
> +    bdrv_iostatus_enable(target);

Mirroring also has:

    bdrv_set_enable_write_cache(s->target, true);

Wouldn't it make sense for backup as well?

> +
> +    bdrv_add_before_write_notifier(bs, &before_write);
> +
> +    DPRINTF("backup_run start %s %" PRId64 " %" PRId64 "\n",
> +            bdrv_get_device_name(bs), start, end);
> +
> +    for (; start < end; start++) {
> +        bool error_is_read;
> +
> +        if (block_job_is_cancelled(&job->common)) {
> +            break;
> +        }
> +
> +        /* we need to yield so that qemu_aio_flush() returns.
> +         * (without, VM does not reboot)
> +         */
> +        if (job->common.speed) {
> +            uint64_t delay_ns = ratelimit_calculate_delay(
> +                &job->limit, job->sectors_read);
> +            job->sectors_read = 0;
> +            block_job_sleep_ns(&job->common, rt_clock, delay_ns);

Here's the other implication of counting the progress of copies
initiated by write notifiers: The more copies they trigger, the fewer
additional copies are made by the background job. I'm willing to count
this as a feature rather than a bug.

> +        } else {
> +            block_job_sleep_ns(&job->common, rt_clock, 0);
> +        }
> +
> +        if (block_job_is_cancelled(&job->common)) {
> +            break;
> +        }
> +
> +        DPRINTF("backup_run loop C%" PRId64 "\n", start);
> +
> +        ret = backup_do_cow(bs, start * BACKUP_SECTORS_PER_CLUSTER, 1,
> +                            &error_is_read);

You're taking advantage of the fact that backup_do_cow() rounds this one
sector up to 64k, which is a much more reasonable size. But why not
specify BACKUP_SECTORS_PER_CLUSTER as nb_sectors when this is what you
really assume?

> +        if (ret < 0) {
> +            /* Depending on error action, fail now or retry cluster */
> +            BlockErrorAction action =
> +                backup_error_action(job, error_is_read, -ret);
> +            if (action == BDRV_ACTION_REPORT) {
> +                break;
> +            } else {
> +                start--;
> +                continue;
> +            }
> +        }
> +    }
> +
> +    notifier_with_return_remove(&before_write);
> +
> +    /* wait until pending backup_do_cow() calls have completed */
> +    qemu_co_rwlock_wrlock(&job->flush_rwlock);
> +    qemu_co_rwlock_unlock(&job->flush_rwlock);
> +
> +    hbitmap_free(job->bitmap);
> +
> +    bdrv_iostatus_disable(target);
> +    bdrv_delete(job->target);
> +
> +    DPRINTF("backup_run complete %d\n", ret);
> +    block_job_completed(&job->common, ret);
> +}

Kevin