From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([209.51.188.92]:42524) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gkVZz-0001uL-Gn for qemu-devel@nongnu.org; Fri, 18 Jan 2019 09:57:08 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gkVZx-0005q6-WD for qemu-devel@nongnu.org; Fri, 18 Jan 2019 09:57:07 -0500 References: <20181229122027.42245-1-vsementsov@virtuozzo.com> <20181229122027.42245-12-vsementsov@virtuozzo.com> From: Max Reitz Message-ID: <1ea09d99-04ff-711e-9137-cdae33ca8681@redhat.com> Date: Fri, 18 Jan 2019 15:56:45 +0100 MIME-Version: 1.0 In-Reply-To: <20181229122027.42245-12-vsementsov@virtuozzo.com> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="Xfe2USpCVcdiQTDILprq6YpAY2jyVk4KB" Subject: Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladimir Sementsov-Ogievskiy , qemu-block@nongnu.org, qemu-devel@nongnu.org Cc: fam@euphon.net, stefanha@redhat.com, jcody@redhat.com, kwolf@redhat.com, den@openvz.org, eblake@redhat.com, jsnow@redhat.com This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --Xfe2USpCVcdiQTDILprq6YpAY2jyVk4KB From: Max Reitz To: Vladimir Sementsov-Ogievskiy , qemu-block@nongnu.org, qemu-devel@nongnu.org Cc: fam@euphon.net, stefanha@redhat.com, jcody@redhat.com, kwolf@redhat.com, den@openvz.org, eblake@redhat.com, jsnow@redhat.com Message-ID: <1ea09d99-04ff-711e-9137-cdae33ca8681@redhat.com> Subject: Re: [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers References: <20181229122027.42245-1-vsementsov@virtuozzo.com> <20181229122027.42245-12-vsementsov@virtuozzo.com> In-Reply-To: <20181229122027.42245-12-vsementsov@virtuozzo.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote: > Drop write notifiers and use filter node instead. Changes: >=20 > 1. copy-before-writes now handled by filter node, so, drop all > is_write_notifier arguments. >=20 > 2. we don't have intersecting requests, so their handling is dropped. > Instead, synchronization works as follows: > when backup or backup-top starts copying of some area it firstly > clears copy-bitmap bits, and nobody touches areas, not marked with > dirty bits in copy-bitmap, so there is no intersection. Also, backup > job copy operations are surrounded by bdrv region lock, which is > actually serializing request, to not interfer with guest writes and > not read changed data from source (before reading we clear > corresponding bit in copy-bitmap, so, this area is not more handled by > backup-top). >=20 > 3. To sync with in-flight requests we now just drain hook node, we > don't need rw-lock. >=20 > 4. After the whole backup loop (top, full, incremental modes), we need > to check for not copied clusters, which were under backup-top operation= > and we skipped them, but backup-top operation failed, error returned to= > the guest and dirty bits set back. >=20 > 5. Don't create additional blk, use backup-top children for copy > operations. >=20 > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > block/backup.c | 285 ++++++++++++++++++++++++++-----------------------= > 1 file changed, 149 insertions(+), 136 deletions(-) >=20 > diff --git a/block/backup.c b/block/backup.c > index 88c0242b4e..e332909fb7 100644 > --- a/block/backup.c > +++ b/block/backup.c [...] > @@ -300,21 +231,23 @@ static void backup_abort(Job *job) > static void backup_clean(Job *job) > { > BackupBlockJob *s =3D container_of(job, BackupBlockJob, common.job= ); > - assert(s->target); > - blk_unref(s->target); > + > + /* We must clean it to not crash in backup_drain. */ > s->target =3D NULL; Why not set s->source to NULL along with it? It makes sense if you're going to drop the backup-top node because both of these are its children.= > =20 > if (s->copy_bitmap) { > hbitmap_free(s->copy_bitmap); > s->copy_bitmap =3D NULL; > } > + > + bdrv_backup_top_drop(s->backup_top); > } [...] > @@ -386,21 +319,45 @@ static int coroutine_fn backup_run_incremental(Ba= ckupBlockJob *job) > bool error_is_read; > int64_t offset; > HBitmapIter hbi; > + void *lock =3D NULL; > =20 > hbitmap_iter_init(&hbi, job->copy_bitmap, 0); > - while ((offset =3D hbitmap_iter_next(&hbi)) !=3D -1) { > + while (hbitmap_count(job->copy_bitmap)) { > + offset =3D hbitmap_iter_next(&hbi); > + if (offset =3D=3D -1) { > + /* > + * we may have skipped some clusters, which were handled b= y > + * backup-top, but failed and finished by returning error = to > + * the guest and set dirty bit back. > + */ > + hbitmap_iter_init(&hbi, job->copy_bitmap, 0); > + offset =3D hbitmap_iter_next(&hbi); > + assert(offset); I think you want to assert "offset >=3D 0". > + } > + > + lock =3D bdrv_co_try_lock(job->source, offset, job->cluster_si= ze); > + /* > + * Dirty bit is set, which means that there are no in-flight > + * write requests on this area. We must succeed. > + */ > + assert(lock); I'm not sure that is true right now, but more on that below in backup_run= (). > + > do { > if (yield_and_check(job)) { > + bdrv_co_unlock(lock); > return 0; > } > - ret =3D backup_do_cow(job, offset, > - job->cluster_size, &error_is_read, fal= se); > + ret =3D backup_do_cow(job, offset, job->cluster_size, &err= or_is_read); > if (ret < 0 && backup_error_action(job, error_is_read, -re= t) =3D=3D > BLOCK_ERROR_ACTION_REPORT) > { > + bdrv_co_unlock(lock); > return ret; > } > } while (ret < 0); > + > + bdrv_co_unlock(lock); > + lock =3D NULL; This statement seems unnecessary here. > } > =20 > return 0; [...] > @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Erro= r **errp) > hbitmap_set(s->copy_bitmap, 0, s->len); > } > =20 > - s->before_write.notify =3D backup_before_write_notify; > - bdrv_add_before_write_notifier(bs, &s->before_write); > - > if (s->sync_mode =3D=3D MIRROR_SYNC_MODE_NONE) { > /* All bits are set in copy_bitmap to allow any cluster to be = copied. > * This does not actually require them to be copied. */ > while (!job_is_cancelled(job)) { > - /* Yield until the job is cancelled. We just let our befo= re_write > - * notify callback service CoW requests. */ > + /* > + * Yield until the job is cancelled. We just let our back= up-top > + * fileter driver service CbW requests. *filter > + */ > job_yield(job); > } > } else if (s->sync_mode =3D=3D MIRROR_SYNC_MODE_INCREMENTAL) { > ret =3D backup_run_incremental(s); > } else { [...] > @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error= **errp) > if (alloced < 0) { > ret =3D alloced; > } else { > + if (!hbitmap_get(s->copy_bitmap, offset)) { > + trace_backup_do_cow_skip(job, offset); > + continue; /* already copied */ > + } > + if (!lock) { > + lock =3D bdrv_co_try_lock(s->source, offset, s->cl= uster_size); > + /* > + * Dirty bit is set, which means that there are no= in-flight > + * write requests on this area. We must succeed. > + */ > + assert(lock); What if I have a different parent node for the source that issues concurrent writes? This used to work fine because the before_write notifier would still work. After this patch, that would be broken because those writes would not cause a CbW. That's not so bad because we just have to make sure that all writes go through the backup-top node. That would make this assertion valid again, too. But that means we cannot share PERM_WRITE; see [1]. > + } > ret =3D backup_do_cow(s, offset, s->cluster_size, > - &error_is_read, false); > + &error_is_read); > } > if (ret < 0) { > /* Depending on error action, fail now or retry cluste= r */ > @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Erro= r **errp) > break; > } else { > offset -=3D s->cluster_size; > + retry =3D true; > continue; > } > } > } > + if (lock) { > + bdrv_co_unlock(lock); > + lock =3D NULL; > + } > + if (ret =3D=3D 0 && !job_is_cancelled(job) && > + hbitmap_count(s->copy_bitmap)) > + { > + /* > + * we may have skipped some clusters, which were handled b= y > + * backup-top, but failed and finished by returning error = to > + * the guest and set dirty bit back. So it's a matter of a race? > + */ > + goto iteration; > + } Why not wrap everything in a do {} while (ret =3D=3D 0 && !job_is...) instead? Because it would mean reindenting everything? > } > =20 > - notifier_with_return_remove(&s->before_write); > + /* wait pending CBW operations in backup-top */ > + bdrv_drain(s->backup_top); > =20 > - /* wait until pending backup_do_cow() calls have completed */ > - qemu_co_rwlock_wrlock(&s->flush_rwlock); > - qemu_co_rwlock_unlock(&s->flush_rwlock); > + backup_top_progress =3D bdrv_backup_top_progress(s->backup_top); > + job_progress_update(job, ret + backup_top_progress - Why the "ret"? > + s->backup_top_progress); > + s->backup_top_progress =3D backup_top_progress; So the backup-top progress is ignored during basically all of the block job until it suddenly jumps to 100 % completion? That doesn't sound idea= l. Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL? (And the while() loop of MODE_NONE) > =20 > return ret; > } > @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, Blo= ckDriverState *bs, > int ret; > int64_t cluster_size; > HBitmap *copy_bitmap =3D NULL; > + BlockDriverState *backup_top =3D NULL; > + uint64_t all_except_resize =3D BLK_PERM_CONSISTENT_READ | BLK_PERM= _WRITE | > + BLK_PERM_WRITE_UNCHANGED | BLK_PERM_G= RAPH_MOD; BLK_PERM_ALL & ~BLK_PERM_RESIZE? [1] But we probably do not want to share either PERM_WRITE or PERM_WRITE_UNCHANGED because during the duration of the backup, everything should go through the backup-top filter (not sure about PERM_WRITE_UNCHANGED right now). Or is that something that the filter node should enforce in backup_top_child_perm()? > =20 > assert(bs); > assert(target); > @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, B= lockDriverState *bs, > =20 > copy_bitmap =3D hbitmap_alloc(len, ctz32(cluster_size)); > =20 > - /* job->len is fixed, so we can't allow resize */ > - job =3D block_job_create(job_id, &backup_job_driver, txn, bs, > - BLK_PERM_CONSISTENT_READ, > - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |= > - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_M= OD, > - speed, creation_flags, cb, opaque, errp); > - if (!job) { > + /* > + * bdrv_get_device_name will not help to find device name starting= from > + * @bs after backup-top append, Why not? Since backup-top is appended, shouldn't all parents of @bs be parents of @backup_top then? (Making bdrv_get_parent_name() return the same result) > so let's calculate job_id before. = Do > + * it in the same way like block_job_create > + */ > + if (job_id =3D=3D NULL && !(creation_flags & JOB_INTERNAL)) { > + job_id =3D bdrv_get_device_name(bs); > + } > + > + backup_top =3D bdrv_backup_top_append(bs, target, copy_bitmap, err= p); > + if (!backup_top) { > goto error; > } > =20 > - /* The target must match the source in size, so no resize here eit= her */ > - job->target =3D blk_new(BLK_PERM_WRITE, > - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE | > - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MO= D); > - ret =3D blk_insert_bs(job->target, target, errp); > - if (ret < 0) { > + /* job->len is fixed, so we can't allow resize */ > + job =3D block_job_create(job_id, &backup_job_driver, txn, bs, 0, Is there a reason you dropped PERM_CONSISTENT_READ here? > + all_except_resize, speed, creation_flags, > + cb, opaque, errp); > + if (!job) { > goto error; > } > =20 > + job->source =3D backup_top->backing; > + job->target =3D ((BDRVBackupTopState *)backup_top->opaque)->target= ; This looks really ugly. I think as long as the block job performs writes itself, it should use its own BlockBackend. Alternatively, I think it would make sense for the backup-top filter to offer functions that this job can use to issue writes to the target. Then the backup job would no longer need direct access to the target as a BdrvChild. > + > job->on_source_error =3D on_source_error; > job->on_target_error =3D on_target_error; > job->sync_mode =3D sync_mode; [...] > @@ -712,6 +722,9 @@ BlockJob *backup_job_create(const char *job_id, Blo= ckDriverState *bs, > backup_clean(&job->common.job); > job_early_fail(&job->common.job); > } > + if (backup_top) { > + bdrv_backup_top_drop(backup_top); > + } While it shouldn't be a problem in practice, backup_top has a reference to copy_bitmap, so that bitmap should be freed after backup_top is droppe= d. Max > =20 > return NULL; > } >=20 --Xfe2USpCVcdiQTDILprq6YpAY2jyVk4KB Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEkb62CjDbPohX0Rgp9AfbAGHVz0AFAlxB6S0ACgkQ9AfbAGHV z0Beqwf9EeX7d+pQUvjQFHRxgqmxMnrdqog0VcmtvKI01+dtBOhU9A3RoaH3ycQg Dl+quZTcZRorszbXNQQyoBfsQ1te3NblbqQ2mUfKm9PE3Hvz86AN8VkQTIGxnllR QqTYrNJ99nIvvs/ck8rr1fDdZLKf7zacVeNUnN5vjnKiHslNx17EeLDn2lFXxijf fxfGFdITDqyEhGFIdu8fPyZj/3LquHWCRnAqHuCprufIlP2zZHun1sAUqhv/stBS jWLNXi1gx2TQnVzHtzo1s9ZZIBipXskUGm5UlQGxvKBXsp8m6R7yziFwMXcf6DRU HM82ArcCOuY3flE5ZE0vfc3kA+pohg== =hiGV -----END PGP SIGNATURE----- --Xfe2USpCVcdiQTDILprq6YpAY2jyVk4KB--