From: Ilya Dryomov <idryomov@gmail.com>
To: Peter Lieven <pl@kamp.de>
Cc: "Kevin Wolf" <kwolf@redhat.com>,
"Ilya Dryomov" <idryomov@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>,
qemu-block@nongnu.org, ct@flyingcircus.io, qemu-devel@nongnu.org,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Max Reitz" <mreitz@redhat.com>,
"Jason Dillaman" <dillaman@redhat.com>
Subject: Re: [PATCH V2] block/rbd: implement bdrv_co_block_status
Date: Sun, 5 Sep 2021 21:04:00 +0200 [thread overview]
Message-ID: <CAOi1vP-=A4oU9Tm=3KtFuN0DTPhSryPY_wym5byR-0R74Tp3+g@mail.gmail.com> (raw)
In-Reply-To: <92d7d28c-5f55-7364-8699-9adbb4166f91@kamp.de>
On Thu, Sep 2, 2021 at 4:12 PM Peter Lieven <pl@kamp.de> wrote:
>
> Am 24.08.21 um 22:39 schrieb Ilya Dryomov:
> > On Mon, Aug 23, 2021 at 11:38 AM Peter Lieven <pl@kamp.de> wrote:
> >> Am 22.08.21 um 23:02 schrieb Ilya Dryomov:
> >>> On Tue, Aug 10, 2021 at 3:41 PM Peter Lieven <pl@kamp.de> wrote:
> >>>> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> >>>> This results mainly in incorrect progress during block operations (e.g.
> >>>> qemu-img convert with an rbd image as source).
> >>>>
> >>>> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> >>>> allocated and unallocated (all zero areas).
> >>>>
> >>>> To avoid querying the ceph OSDs for the answer this is only done if
> >>>> the image has the fast-diff features which depends on the object-map
> >>> Hi Peter,
> >>>
> >>> Nit: "has the fast-diff feature which depends on the object-map and
> >>> exclusive-lock features"
> >>
> >> will reword in V3.
> >>
> >>
> >>>> and exclusive-lock. In this case it is guaranteed that the information
> >>>> is present in memory in the librbd client and thus very fast.
> >>>>
> >>>> If fast-diff is not available all areas are reported to be allocated
> >>>> which is the current behaviour if bdrv_co_block_status is not implemented.
> >>>>
> >>>> Signed-off-by: Peter Lieven <pl@kamp.de>
> >>>> ---
> >>>> V1->V2:
> >>>> - add commit comment [Stefano]
> >>>> - use failed_post_open [Stefano]
> >>>> - remove redundant assert [Stefano]
> >>>> - add macro+comment for the magic -9000 value [Stefano]
> >>>> - always set *file if its non NULL [Stefano]
> >>>>
> >>>> block/rbd.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> 1 file changed, 125 insertions(+)
> >>>>
> >>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>> index dcf82b15b8..8692e76f40 100644
> >>>> --- a/block/rbd.c
> >>>> +++ b/block/rbd.c
> >>>> @@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
> >>>> char *namespace;
> >>>> uint64_t image_size;
> >>>> uint64_t object_size;
> >>>> + uint64_t features;
> >>>> } BDRVRBDState;
> >>>>
> >>>> typedef struct RBDTask {
> >>>> @@ -983,6 +984,13 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
> >>>> s->image_size = info.size;
> >>>> s->object_size = info.obj_size;
> >>>>
> >>>> + r = rbd_get_features(s->image, &s->features);
> >>>> + if (r < 0) {
> >>>> + error_setg_errno(errp, -r, "error getting image features from %s",
> >>>> + s->image_name);
> >>>> + goto failed_post_open;
> >>>> + }
> >>> The object-map and fast-diff features can be enabled/disabled while the
> >>> image is open so this should probably go to qemu_rbd_co_block_status().
> >>>
> >>>> +
> >>>> /* If we are using an rbd snapshot, we must be r/o, otherwise
> >>>> * leave as-is */
> >>>> if (s->snap != NULL) {
> >>>> @@ -1259,6 +1267,122 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
> >>>> return spec_info;
> >>>> }
> >>>>
> >>>> +typedef struct rbd_diff_req {
> >>>> + uint64_t offs;
> >>>> + uint64_t bytes;
> >>>> + int exists;
> >>>> +} rbd_diff_req;
> >>>> +
> >>>> +/*
> >>>> + * rbd_diff_iterate2 allows to interrupt the exection by returning a negative
> >>>> + * value in the callback routine. Choose a value that does not conflict with
> >>>> + * an existing exitcode and return it if we want to prematurely stop the
> >>>> + * execution because we detected a change in the allocation status.
> >>>> + */
> >>>> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> >>>> +
> >>>> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> >>>> + int exists, void *opaque)
> >>>> +{
> >>>> + struct rbd_diff_req *req = opaque;
> >>>> +
> >>>> + assert(req->offs + req->bytes <= offs);
> >>>> +
> >>>> + if (req->exists && offs > req->offs + req->bytes) {
> >>>> + /*
> >>>> + * we started in an allocated area and jumped over an unallocated area,
> >>>> + * req->bytes contains the length of the allocated area before the
> >>>> + * unallocated area. stop further processing.
> >>>> + */
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>>> + }
> >>>> + if (req->exists && !exists) {
> >>>> + /*
> >>>> + * we started in an allocated area and reached a hole. req->bytes
> >>>> + * contains the length of the allocated area before the hole.
> >>>> + * stop further processing.
> >>>> + */
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>>> + }
> >>>> + if (!req->exists && exists && offs > req->offs) {
> >>>> + /*
> >>>> + * we started in an unallocated area and hit the first allocated
> >>>> + * block. req->bytes must be set to the length of the unallocated area
> >>>> + * before the allocated area. stop further processing.
> >>>> + */
> >>>> + req->bytes = offs - req->offs;
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * assert that we catched all cases above and allocation state has not
> >>> catched -> caught
> >>>
> >>>> + * changed during callbacks.
> >>>> + */
> >>>> + assert(exists == req->exists || !req->bytes);
> >>>> + req->exists = exists;
> >>>> +
> >>>> + /*
> >>>> + * assert that we either return an unallocated block or have got callbacks
> >>>> + * for all allocated blocks present.
> >>>> + */
> >>>> + assert(!req->exists || offs == req->offs + req->bytes);
> >>>> + req->bytes = offs + len - req->offs;
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
> >>>> + bool want_zero, int64_t offset,
> >>>> + int64_t bytes, int64_t *pnum,
> >>>> + int64_t *map,
> >>>> + BlockDriverState **file)
> >>>> +{
> >>>> + BDRVRBDState *s = bs->opaque;
> >>>> + int ret, r;
> >>>> + struct rbd_diff_req req = { .offs = offset };
> >>>> +
> >>>> + assert(offset + bytes <= s->image_size);
> >>>> +
> >>>> + /* default to all sectors allocated */
> >>>> + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> >>> I'm a little confused by the meaning of these flags (but I haven't
> >>> looked at the other drivers yet). Looks like this patch always sets
> >>> BDRV_BLOCK_OFFSET_VALID (makes sense since the "host" offset is always
> >>> known for rbd) and returns either BDRV_BLOCK_DATA or BDRV_BLOCK_ZERO.
> >>>
> >>> DATA ZERO OFFSET_VALID
> >>> t t t sectors read as zero, returned file is zero at offset
> >>> t f t sectors read as valid from file at offset
> >>> f t t sectors preallocated, read as zero, returned file not
> >>> necessarily zero at offset
> >>> f f t sectors preallocated but read from backing_hd,
> >>> returned file contains garbage at offset
> >>>
> >>> What about the first case (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)?
> >>> What is the practical difference to just BDRV_BLOCK_ZERO?
> >>
> >> Actually I don't know, I adapted the flags from other drivers. Looking at your
> >> table it seems that DATA + ZERO would be more appropriate for rbd, right?
> > It would seem that way but I haven't had a chance to look at how qemu
> > block layer interprets these yet.
> >
> >> We do not preallocate in qem/rbd yet.
> > I guess it depends on what preallocated means here, if it talks about
> > the logical address space or the physical sectors. If the former, rbd
> > can be thought of as always preallocated.
> >
> > Separately, rbd_read() at an offset that corresponds to an unallocated
> > area (including in parent image(s)) would always return zeroes. I wonder
> > if that is the same as "returned file is zero at offset".
>
>
> It seems that all other drivers either set DATA or ZERO. I would leave it
>
> the same for RBD.
Agreed. It looks like it has to do with layering and backing files and
since rbd layering isn't exposed to qemu let's not over complicate this.
>
>
> >
> >>
> >>>> + if (map) {
> >>>> + *map = offset;
> >>>> + }
> >>>> + if (file) {
> >>>> + *file = bs;
> >>>> + }
> >>> A comment in block_int.h says that map and file are guaranteed to be
> >>> non-NULL so these tests seem redundant?
> >>
> >> This is also Copy&Paste from other drivers, will change it (and maybe
> >>
> >> in other drivers as well).
> >>
> >>
> >>
> >>>> + *pnum = bytes;
> >>>> +
> >>>> + /* RBD image does not support fast-diff */
> >>>> + if (!(s->features & RBD_FEATURE_FAST_DIFF)) {
> >>>> + goto out;
> >>>> + }
> >>> Need to make sure that fast-diff is valid here: call rbd_get_flags()
> >>> on the image and test for !RBD_FLAG_FAST_DIFF_INVALID.
> >>
> >> Do I have to check for feature FAST_DIFF + flag !FAST_DIFF_INVALID or
> >>
> >> is the second enough? Is this call fast?
> > For both. The call should be as fast as rbd_get_features().
>
>
> The question was if this is handled solely in the client?
Yes, most of the time. Both image features and flags are stored in
the header object and the in-memory copy of the header may need to be
refreshed if the client gets a header update notification in the
interim.
Thanks,
Ilya
prev parent reply other threads:[~2021-09-05 19:33 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-10 13:41 [PATCH V2] block/rbd: implement bdrv_co_block_status Peter Lieven
2021-08-22 21:02 ` Ilya Dryomov
2021-08-23 9:38 ` Peter Lieven
2021-08-24 20:39 ` Ilya Dryomov
2021-09-02 14:12 ` Peter Lieven
2021-09-05 19:04 ` Ilya Dryomov [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAOi1vP-=A4oU9Tm=3KtFuN0DTPhSryPY_wym5byR-0R74Tp3+g@mail.gmail.com' \
--to=idryomov@gmail.com \
--cc=berrange@redhat.com \
--cc=ct@flyingcircus.io \
--cc=dillaman@redhat.com \
--cc=idryomov@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=pbonzini@redhat.com \
--cc=pl@kamp.de \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).