Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension - Vladimir Sementsov-Ogievskiy

From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Alex Bligh <alex@alex.org.uk>
Cc: "nbd-general@lists.sourceforge.net"
	<nbd-general@lists.sourceforge.net>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Kevin Wolf <kwolf@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Pavel Borzenkov <pborzenkov@virtuozzo.com>,
	"Stefan stefanha@redhat. com" <stefanha@redhat.com>,
	"Denis V. Lunev" <den@openvz.org>, Wouter Verhelst <w@uter.be>,
	Eric Blake <eblake@redhat.com>,
	mpa@pengutronix.de
Subject: Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Date: Tue, 29 Nov 2016 17:36:13 +0300	[thread overview]
Message-ID: <b55c59cc-626c-ba45-1d89-d7c5583616af@virtuozzo.com> (raw)
In-Reply-To: <F075D231-9CBE-437C-B9F0-0DE2AB98007B@alex.org.uk>

29.11.2016 15:57, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
>
> Rather than comment on many of the points individual, the root
> of my confusion and to some extent uncomfortableness about this
> proposal is 'who owns the meaning the the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read only disks, but the text talks about:
>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> I now think what you are talking about backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.

Yes, something like this.

Note: in Qemu it may not be a snapshot (actually, I didn't see a way in 
Qemu to export snapshots not switching to them (except opening external 
snapshot as a separate block dev)), but just any read-only drive, or 
temporary drive, used in the image fleecing scheme:

driveA is online normal drive
driveB is empty nbd export

- start backup driveA->driveB with sync=none, which means that the only 
copy which is done is copying old data from A to B before every write to A
- and set driveA as backing for driveB (on read from B, if data is not 
allocated it will be read from A)

after that, driveB is something like a snapshot for backup through NBD.

It's all just to say: calling backup-point-in-time-state just a 
'snapshot' confuses me and I hope not to see this word in the spec (in 
this context).

>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'). If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally come to just bitmaps, not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmap: allocation natively have two 
bits per block: zero and allocated. We can of course separate this into 
two bitmaps, but this will not be similar with classic get_block_status.

> Some more nits:
>
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
>
>
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size, granularity
>>          is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 byes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>           pros: - simple
>>           cons: - it's a hack. flags field is for other uses.
>>                 - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and exploit this
>>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: exteranl tool should select which bitmap to export. So, in case of Qemu
>>          it should be something like qmp command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the lesser evil)
>>          note: Hmm, external tool can make chose between allocated/dirty data too,
>>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
>
>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to 
define something like 'multibitmap', which allows several bits per 
chunk, as allocation data has two: zero and allocated.

>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.

the issue is: too many descriptors possible. So, (1) solves it. (2) is 
optional, just to simplify/optimize client side.

>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if found unknown bits, client should consider replay
>>    with unknown bits as an error)
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> MUST NOT is what we say elsewhere I believe.
>
>> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured  (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.

I agree.

>
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.

if the bitmap is 010101010101 we will have too many descriptors. For 
example, 16tb disk, 64k granularity -> 2G of descriptors payload.

>
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> Surely better would be an an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
>

Agree. Anyway it is just a strong recommendation, not requirement.

-- 
Best regards,
Vladimir