On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2019 17:56, Max Reitz wrote:
>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>>> Drop write notifiers and use filter node instead. Changes:
>>>
>>> 1. copy-before-writes now handled by filter node, so, drop all
>>> is_write_notifier arguments.
>>>
>>> 2. we don't have intersecting requests, so their handling is dropped.
>>> Instead, synchronization works as follows:
>>> when backup or backup-top starts copying of some area it firstly
>>> clears copy-bitmap bits, and nobody touches areas, not marked with
>>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
>>> job copy operations are surrounded by bdrv region lock, which is
>>> actually serializing request, to not interfer with guest writes and
>>> not read changed data from source (before reading we clear
>>> corresponding bit in copy-bitmap, so, this area is not more handled by
>>> backup-top).
>>>
>>> 3. To sync with in-flight requests we now just drain hook node, we
>>> don't need rw-lock.
>>>
>>> 4. After the whole backup loop (top, full, incremental modes), we need
>>> to check for not copied clusters, which were under backup-top operation
>>> and we skipped them, but backup-top operation failed, error returned to
>>> the guest and dirty bits set back.
>>>
>>> 5. Don't create additional blk, use backup-top children for copy
>>> operations.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>>> ---
>>>   block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>>   1 file changed, 149 insertions(+), 136 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 88c0242b4e..e332909fb7 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>
>> [...]
>>
>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>>   static void backup_clean(Job *job)
>>>   {
>>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> -    assert(s->target);
>>> -    blk_unref(s->target);
>>> +
>>> +    /* We must clean it to not crash in backup_drain. */
>>>       s->target = NULL;
>>
>> Why not set s->source to NULL along with it?  It makes sense if you're
>> going to drop the backup-top node because both of these are its children.

> agree.

>>
>>>
>>>       if (s->copy_bitmap) {
>>>           hbitmap_free(s->copy_bitmap);
>>>           s->copy_bitmap = NULL;
>>>       }
>>> +
>>> +    bdrv_backup_top_drop(s->backup_top);
>>>   }
>>
>> [...]
>>
>>> @@ -386,21 +319,45 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
>>>       bool error_is_read;
>>>       int64_t offset;
>>>       HBitmapIter hbi;
>>> +    void *lock = NULL;
>>>
>>>       hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> -    while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>>> +    while (hbitmap_count(job->copy_bitmap)) {
>>> +        offset = hbitmap_iter_next(&hbi);
>>> +        if (offset == -1) {
>>> +            /*
>>> +             * we may have skipped some clusters, which were handled by
>>> +             * backup-top, but failed and finished by returning error to
>>> +             * the guest and set dirty bit back.
>>> +             */
>>> +            hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> +            offset = hbitmap_iter_next(&hbi);
>>> +            assert(offset);
>>
>> I think you want to assert "offset >= 0".
>>
>>> +        }
>>> +
>>> +        lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>>> +        /*
>>> +         * Dirty bit is set, which means that there are no in-flight
>>> +         * write requests on this area. We must succeed.
>>> +         */
>>> +        assert(lock);
>>
>> I'm not sure that is true right now, but more on that below in backup_run().
>>
>>> +
>>>           do {
>>>               if (yield_and_check(job)) {
>>> +                bdrv_co_unlock(lock);
>>>                   return 0;
>>>               }
>>> -            ret = backup_do_cow(job, offset,
>>> -                                job->cluster_size, &error_is_read, false);
>>> +            ret = backup_do_cow(job, offset, job->cluster_size, &error_is_read);
>>>               if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>>>                              BLOCK_ERROR_ACTION_REPORT)
>>>               {
>>> +                bdrv_co_unlock(lock);
>>>                   return ret;
>>>               }
>>>           } while (ret < 0);
>>> +
>>> +        bdrv_co_unlock(lock);
>>> +        lock = NULL;
>>
>> This statement seems unnecessary here.
>>
>>>       }
>>>
>>>       return 0;
>>
>> [...]
>>
>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>           hbitmap_set(s->copy_bitmap, 0, s->len);
>>>       }
>>>
>>> -    s->before_write.notify = backup_before_write_notify;
>>> -    bdrv_add_before_write_notifier(bs, &s->before_write);
>>> -
>>>       if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>>           /* All bits are set in copy_bitmap to allow any cluster to be copied.
>>>            * This does not actually require them to be copied. */
>>>           while (!job_is_cancelled(job)) {
>>> -            /* Yield until the job is cancelled. We just let our before_write
>>> -             * notify callback service CoW requests. */
>>> +            /*
>>> +             * Yield until the job is cancelled. We just let our backup-top
>>> +             * fileter driver service CbW requests.
>>
>> *filter
>>
>>> +             */
>>>               job_yield(job);
>>>           }
>>>       } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>           ret = backup_run_incremental(s);
>>>       } else {
>>
>> [...]
>>
>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>               if (alloced < 0) {
>>>                   ret = alloced;
>>>               } else {
>>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>>> +                    trace_backup_do_cow_skip(job, offset);
>>> +                    continue; /* already copied */
>>> +                }
>>> +                if (!lock) {
>>> +                    lock = bdrv_co_try_lock(s->source, offset, s->cluster_size);
>>> +                    /*
>>> +                     * Dirty bit is set, which means that there are no in-flight
>>> +                     * write requests on this area. We must succeed.
>>> +                     */
>>> +                    assert(lock);
>>
>> What if I have a different parent node for the source that issues
>> concurrent writes?  This used to work fine because the before_write
>> notifier would still work.  After this patch, that would be broken
>> because those writes would not cause a CbW.

> But how could you have this different parent node? After appending the
> filter, there should not be such nodes.

Unless you append them afterwards:

> And I think, during backup it should be forbidden to append new parents
> to source, ignoring the filter, as it definitely breaks what the filter
> does.

Agreed, but then this needs to be implemented.

> And it applies to other block jobs with their filters.
> If we appended a filter, we don't want someone else to write, omitting
> our filter.

>>
>> That's not so bad because we just have to make sure that all writes go
>> through the backup-top node.  That would make this assertion valid
>> again, too.  But that means we cannot share PERM_WRITE; see [1].

> But we don't share PERM_WRITE on source in backup_top, only on target.

Are you sure?  The job itself shares it, and the filter shares it, too, as
far as I can see.  It uses bdrv_filter_default_perms(), and that does seem
to share PERM_WRITE.

>>
>>> +                }
>>>                   ret = backup_do_cow(s, offset, s->cluster_size,
>>> -                                    &error_is_read, false);
>>> +                                    &error_is_read);
>>>               }
>>>               if (ret < 0) {
>>>                   /* Depending on error action, fail now or retry cluster */
>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
>>>                   break;
>>>               } else {
>>>                   offset -= s->cluster_size;
>>> +                retry = true;
>>>                   continue;
>>>               }
>>>           }
>>>       }
>>> +    if (lock) {
>>> +        bdrv_co_unlock(lock);
>>> +        lock = NULL;
>>> +    }
>>> +    if (ret == 0 && !job_is_cancelled(job) &&
>>> +        hbitmap_count(s->copy_bitmap))
>>> +    {
>>> +        /*
>>> +         * we may have skipped some clusters, which were handled by
>>> +         * backup-top, but failed and finished by returning error to
>>> +         * the guest and set dirty bit back.
>>
>> So it's a matter of a race?
>>
>>> +         */
>>> +        goto iteration;
>>> +    }
>>
>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
>> instead?  Because it would mean reindenting everything?

> Don't remember, but assume that yes. And this code is anyway "To be
> refactored"; I want all of FULL/TOP/INCREMENTAL to go through the same
> (mostly) code path.

Hm, well, if you want to refactor it later anyway...  But I don't like
gotos that go backwards very much, unless there is a good reason to have
them (and there isn't here).

>>>   }
>>>
>>> -    notifier_with_return_remove(&s->before_write);
>>> +    /* wait pending CBW operations in backup-top */
>>> +    bdrv_drain(s->backup_top);
>>>
>>> -    /* wait until pending backup_do_cow() calls have completed */
>>> -    qemu_co_rwlock_wrlock(&s->flush_rwlock);
>>> -    qemu_co_rwlock_unlock(&s->flush_rwlock);
>>> +    backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>>> +    job_progress_update(job, ret + backup_top_progress -
>>
>> Why the "ret"?

> oops, it looks like a copy-paste bug ("ret" is reasonable in
> backup_do_cow())

>>
>>> +                        s->backup_top_progress);
>>> +    s->backup_top_progress = backup_top_progress;
>>
>> So the backup-top progress is ignored during basically all of the block
>> job until it suddenly jumps to 100 % completion?  That doesn't sound ideal.
>>
>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>> (And the while () loop of MODE_NONE)

> It is done in backup_do_cow(), so FULL and TOP are covered.
>
> But you are right that MODE_NONE seems to have a problem about it. And
> just updating it in a while loop would not work, as I doubt that
> job_yield will return until the job finishes or there is user
> interaction like pause/continue/cancel.
>
> So now, it looks better to call job_progress_update() from backup_top
> directly, and drop this hack.

Hmm...  I don't think job_*() calls belong in backup_top.  How about
adding a callback to bdrv_backup_top_append()?
>>>
>>>       return ret;
>>>   }
>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>       int ret;
>>>       int64_t cluster_size;
>>>       HBitmap *copy_bitmap = NULL;
>>> +    BlockDriverState *backup_top = NULL;
>>> +    uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> +                                 BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD;
>>
>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>>
>> [1] But we probably do not want to share either PERM_WRITE or
>> PERM_WRITE_UNCHANGED because during the duration of the backup,
>> everything should go through the backup-top filter (not sure about
>> PERM_WRITE_UNCHANGED right now).  Or is that something that the filter
>> node should enforce in backup_top_child_perm()?

> It's not the shared perm of backup_top, it's the shared perm of the
> block job's common.blk, which is used only for "job->len is fixed, so we
> can't allow resize"; so this part is not changed.
>
> So yes, the problem you mean by [1] is about backup_top_child_perm(),
> where we share PERM_WRITE. And it is described by a comment: we must
> share this write perm, otherwise we break guest writes.

For the target, yes, but the problem is sharing it on the source.

> We share PERM_WRITE in backup_top to force its target child to share
> PERM_WRITE on its backing, as the backing of target is source.
>
> But again, we share PERM_WRITE only on target, and it is shared in the
> current code too.

I'm not so sure whether PERM_WRITE is shared only on the target.
>>
>>>
>>>       assert(bs);
>>>       assert(target);
>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>>>
>>>       copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>
>>> -    /* job->len is fixed, so we can't allow resize */
>>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>> -                           BLK_PERM_CONSISTENT_READ,
>>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>> -                           speed, creation_flags, cb, opaque, errp);
>>> -    if (!job) {
>>> +    /*
>>> +     * bdrv_get_device_name will not help to find device name starting from
>>> +     * @bs after backup-top append,
>>
>> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
>> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
>> same result)

> bdrv_get_device_name finally goes through role->get_name, and only the
> root role has this handler. After the append, we'll have the backing role
> instead of root.

Ah, I see, I asked the wrong question.  Why is block_job_create() called
on bs and not on backup_top?  mirror calls it on mirror_top_bs.

>>>                                   so let's calculate job_id before. Do
>>> +     * it in the same way like block_job_create
>>> +     */
>>> +    if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>>> +        job_id = bdrv_get_device_name(bs);
>>> +    }
>>> +
>>> +    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>>> +    if (!backup_top) {
>>>           goto error;
>>>       }
>>>
>>> -    /* The target must match the source in size, so no resize here either */
>>> -    job->target = blk_new(BLK_PERM_WRITE,
>>> -                          BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> -                          BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>>> -    ret = blk_insert_bs(job->target, target, errp);
>>> -    if (ret < 0) {
>>> +    /* job->len is fixed, so we can't allow resize */
>>> +    job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>>
>> Is there a reason you dropped PERM_CONSISTENT_READ here?
>
> Because we don't use this blk for read now; we read through the
> backup_top child.

Makes sense.

>>> +                           all_except_resize, speed, creation_flags,
>>> +                           cb, opaque, errp);
>>> +    if (!job) {
>>>           goto error;
>>>       }
>>>
>>> +    job->source = backup_top->backing;
>>> +    job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>>
>> This looks really ugly.  I think as long as the block job performs
>> writes itself, it should use its own BlockBackend.

> They are not BlockBackends, they are BdrvChildren.

Exactly, which is what I don't like.  They are children of a node that is
implemented in a different file; it looks weird to use them here.

> It was Kevin's idea to reuse the filter's children in the backup job:
>
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html

It's still ugly if backup_top is in a different file.  Well, maybe just
to me.

> Hmm, and this is also why I need PERM_WRITE in backup_top, to write to
> target.

>>
>> Alternatively, I think it would make sense for the backup-top filter to
>> offer functions that this job can use to issue writes to the target.
>> Then the backup job would no longer need direct access to the target as
>> a BdrvChild.

So what would be the problem with this?

Max