From: John Snow
Date: Mon, 28 Nov 2016 18:15:39 -0500
Subject: Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
To: Wouter Verhelst, Vladimir Sementsov-Ogievskiy
Cc: nbd-general@lists.sourceforge.net, kwolf@redhat.com, qemu-devel@nongnu.org, pborzenkov@virtuozzo.com, stefanha@redhat.com, pbonzini@redhat.com, mpa@pengutronix.de, den@openvz.org
Message-ID: <8095acbc-93c7-a310-56f6-61efa8812a78@redhat.com>
In-Reply-To: <20161127191714.7wrvzbx7zpotad22@grep.be>

Hi Wouter,

Some of this mess may be partially my fault, but I have not been
following the NBD extension proposals up until this point. Are you
familiar with the genesis behind this idea and what we are trying to
accomplish in general?

We had the thought to propose an extension roughly similar to SCSI's
'get lba status', but the existing status bits there did not correlate
semantically to what we are hoping to convey. There was some debate
over whether it would be an abuse of the protocol to attempt to use
them as such.
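For concreteness, those existing SCSI status bits can be sketched as
follows; the numeric values come from SBC's GET LBA STATUS, but the
constant names and the little helper are hypothetical, mine only:

```python
# Hypothetical helper (names are mine): the SCSI GET LBA STATUS
# provisioning-status values under discussion, with the reading of
# them that did not line up with what we want NBD to convey.
LBA_STATUS_MAPPED = 0x0       # mapped, or unknown state
LBA_STATUS_DEALLOCATED = 0x1  # deallocated
LBA_STATUS_ANCHORED = 0x2     # anchored

def describe_lba_status(status):
    """Label a provisioning-status value; unknown values are reserved."""
    return {
        LBA_STATUS_MAPPED: "mapped (physically allocated, or unknown)",
        LBA_STATUS_DEALLOCATED: "deallocated (unallocated)",
        LBA_STATUS_ANCHORED: "anchored",
    }.get(status, "reserved")
```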
For a quick recap, GET LBA STATUS appears to offer three canonical
statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of "mapped" was simply that it was physically
allocated, and "deallocated" was simply unallocated. (I, uh, am not
actually clear on what "anchored" means exactly.)

Either way, we felt at the time that it would be wrong to propose an
analogous command for NBD and then immediately abuse the existing
semantics, hence a new command like -- but not identical to -- the
SCSI one.

On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
> Hi Vladimir,
>
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without
>   defining everything in the NBD protocol. "We're reading some data,
>   but it's not about you." That didn't feel right.
> - There were a number of questions still unanswered (you're answering
>   a few below, so that's good).
>
> For clarity, I have no objection whatsoever to adding more commands
> if they're useful, but I would prefer that they're also useful with
> NBD on its own, i.e., without requiring an initiation or correlation
> of some state through another protocol or network connection or
> whatever. If that's needed, that feels like I didn't do my job
> properly, if you get my point.
>
> On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> With the availability of sparse storage formats, it is often needed
>> to query the status of a particular range and read only those
>> blocks of data that are actually present on the block device.
>>
>> To provide such information, the patch adds the BLOCK_STATUS
>> extension with one new NBD_CMD_BLOCK_STATUS command, a new
>> structured reply chunk format, and a new transmission flag.
>>
>> There exists a concept of data dirtiness, which is required
>> during, for example, incremental block device backup. To express
>> this concept via the NBD protocol, this patch also adds a flag to
>> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
>> provisioning information; however, with the current proposal, data
>> dirtiness is only useful with additional coordination outside of
>> the NBD protocol (such as a way to start and stop the server from
>> tracking dirty sectors). Future NBD extensions may add commands
>> to control dirtiness through NBD.
>>
>> Since the NBD protocol has no notion of block size, and to mimic
>> the SCSI "GET LBA STATUS" command more closely, it has been chosen
>> to return a list of extents in the response of the
>> NBD_CMD_BLOCK_STATUS command, instead of a bitmap.
>>
>> CC: Pavel Borzenkov
>> CC: Denis V. Lunev
>> CC: Wouter Verhelst
>> CC: Paolo Bonzini
>> CC: Kevin Wolf
>> CC: Stefan Hajnoczi
>> Signed-off-by: Eric Blake
>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>> ---
>>
>> v3:
>>
>> Hi all. This is almost a resend of v2 (by Eric Blake). The only
>> change is removing the restriction that the sum of status
>> descriptor lengths must be equal to the requested length. I.e.,
>> let's permit the server to reply with less data than requested if
>> it wants.
>
> Reasonable, yes. The length that the client requests should be a
> maximum (i.e. "I'm interested in this range"), not an exact request.
>
>> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8
>> is now NBD_FLAG_CAN_MULTI_CONN in the master branch.
>
> Right.
>
>> And, finally, I've rebased this onto the current state of the
>> extension-structured-reply branch (which itself should be rebased
>> on master IMHO).
>
> Probably a good idea, given the above.
>
>> By this resend I just want to continue the discussion, started
>> about half a year ago. Here is a summary of some questions and
>> ideas from the v2 discussion:
>>
>> 1. Q: Synchronisation.
>>    Is such data (dirty/allocated) reliable?
>>    A: This is all for read-only disks, so the data is static and
>>    unchangeable.
>
> I think we should declare that it's up to the client to ensure no
> other writes happen without its knowledge. This may be because the
> client and server communicate out of band about state changes, or
> because the client somehow knows that it's the only writer, or
> whatever.
>
> We can easily do that by declaring that the result of that command
> only talks about *current* state, and that concurrent writes by
> different clients may invalidate the state. This is true for NBD in
> general (i.e., concurrent read or write commands from other clients
> may confuse file systems on top of NBD), so it doesn't change
> expectations in any way.
>

Agree with you here. "This was correct when I sent it, but it's up to
you to ensure that nothing has invalidated it in the meantime" is
fine semantically to me.

Of course, in our implementation we intend to only export essentially
read-only snapshots of data, so we don't personally expect to run
into any of these kinds of semantic problems.

>> 2. Q: Different granularities of dirty/allocated bitmaps. Any
>>    problems?
>>    A: 1: the server replies with status descriptors of any size;
>>          granularity is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to
>>          each other, so their granularities are not intersecting
>
> Not entirely sure anymore what this is about?
>

We have a concept of granularity for dirty tracking; consider it a
block size. I don't think it's necessarily relevant to NBD, except
for the case of false positives. Consider the case of a 512-byte
block that is assumed dirty simply because it's adjacent to actually
dirty data.

That may have some meaning for how we write the NBD spec; i.e., the
meaning of the status bit changes from "this is definitely modified"
to "this was possibly modified," due to limitations in the management
of this information by the server.
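To make the false-positive point concrete, here is a minimal sketch
(not QEMU code; the helper name and default granularity are
assumptions) of how fixed-granularity tracking rounds a write out to
block boundaries:

```python
def dirty_range(offset, length, granularity=512):
    """Round a written byte range out to tracking-block boundaries.

    Hypothetical helper: everything inside the covering blocks is
    reported dirty, so bytes merely adjacent to the actual write
    become false positives ("possibly modified", not "definitely").
    """
    start = (offset // granularity) * granularity
    end = -(-(offset + length) // granularity) * granularity  # round up
    return start, end - start

# A 10-byte write at offset 700 dirties a whole 512-byte block:
# dirty_range(700, 10) -> (512, 512)
```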
If we expose the granularity information via NBD, it at least makes
it clear how fuzzy the presented results may be. Otherwise, it's not
really required. [Again, as seen below, a user-defined bit would be
plenty sufficient...!]

>> 3. Q: Selecting a dirty bitmap to export.
>>    A: Several variants:
>>    1: the id of the bitmap is in the flags field of the request
>>       pros: - simple
>>       cons: - it's a hack; the flags field is for other uses
>>             - we'll have to map bitmap names to these "ids"
>>    2: introduce extended NBD requests with variable length and
>>       exploit this feature for the BLOCK_STATUS command, specifying
>>       a bitmap identifier.
>>       pros: - looks like a true way
>>       cons: - we have to create an additional extension
>>             - possibly we have to create a map, { <=> }
>>    3: an external tool should select which bitmap to export. So, in
>>       the case of Qemu it should be something like a qmp command
>>       block-export-dirty-bitmap.
>>       pros: - simple
>>             - we can extend it to behave like (2) later
>>       cons: - an additional qmp command to implement (possibly the
>>               lesser evil)
>>       note: Hmm, the external tool can make the choice between
>>             allocated/dirty data too, so we could remove the
>>             'NBD_FLAG_STATUS_DIRTY' flag entirely.
>
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the
> protocol messages are not entirely defined anymore, and their
> meaning depends on the clients and servers in use).
>
> To avoid this, we should have a clear definition of what the reply
> means *by default*, but then we can add a note that clients and
> servers can possibly define other meanings out of band if they want
> to.
>

I had personally only ever considered #3, where a command would be
issued to QEMU to begin offering NBD data for some point in time as
associated with a particular reference/snapshot/backup/etc. This
leaves it up to the out-of-band client to order up the right data.

That does appear to make choosing a meaningful name for the status
bits a bit more difficult...
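For illustration, variant 3's out-of-band selection might look
something like the following from a management client. The command
name is taken from the sketch above and, like the argument names, is
entirely hypothetical -- not an existing QMP interface:

```python
import json

# Hypothetical QMP exchange for variant 3: tell QEMU which dirty
# bitmap the NBD export should report. The command and argument
# names are illustrative only.
def build_select_bitmap_cmd(node, bitmap):
    """Serialize the (hypothetical) bitmap-selection command."""
    return json.dumps({
        "execute": "block-export-dirty-bitmap",
        "arguments": {"node": node, "bitmap": bitmap},
    })

cmd = build_select_bitmap_cmd("drive0", "backup-delta-1")
```

After such a command, NBD_CMD_BLOCK_STATUS replies would describe the
chosen bitmap, with the mapping from bitmap to backup kept out of band.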
[...but reading ahead, a 'user defined' bit would fit the bill just
fine.]

When Vladimir authored a "persistence" feature for bitmaps to be
stored alongside QCOW2 files, we had difficulty describing exactly
what a dirty bit meant for the data -- ultimately it is reliant on
external information. The meaning is, simply but ambiguously, "The
data associated with this bit has been changed since the last time
the bit was reset." We don't record or stipulate the conditions for a
reset. (For our purposes, a reset would occur during a full or
incremental backup.)

>> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost the same semantics and similar
>>          means?
>>    pros: However, here is a good point of separating the clearly
>>          defined and, for block devices, native GET_BLOCK_STATUS
>>          from user-driven and actually undefined data called
>>          'dirtiness'.
>
> Yeah, having them separate commands might be a bad idea indeed.
>
>> 5. The number of status descriptors sent by the server should be
>>    restricted.
>>    Variants:
>>    1: just allow the server to restrict this as it wants (which was
>>       done in v3)
>>    2: (not excluding 1) the client somehow specifies the maximum
>>       number of descriptors
>>    2.1: add a command flag which will request only one descriptor
>>         (otherwise, no restrictions from the client)
>>    2.2: again, introduce extended NBD requests, and add a field to
>>         specify this maximum
>
> I think having a flag which requests just one descriptor can be
> useful, but I'm hesitant to add it unless it's actually going to be
> used; so in other words, I'll leave the decision on that bit to you.
>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (The server
>>    should return EINVAL if it finds unknown bits; the client should
>>    consider a reply with unknown bits an error.)
>
> Right, probably best to do that, yes.
>
>> ======
>>
>> Also, an idea on 2-4:
>>
>> As we say that dirtiness is unknown to NBD, and an external tool
>> should specify, manage and understand which data is actually
>> transmitted, why not just call it user_data and leave the status
>> field of the reply chunk unspecified in this case?
>>
>> So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>> NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>> Eric's 'Block provisioning status' paragraph. If it is set, we just
>> leave the status field for some external... protocol? Who knows
>> what this user data is.
>
> Yes, this sounds like a reasonable approach.
>

I'd be pretty happy (personally) with some user-defined bits. That
keeps a lot of the ambiguity of exactly what we're trying to convey
out of the NBD spec, which is nice.

>> Note: I'm not sure that I like this (my) proposal. It's just an
>> idea; maybe someone likes it. And, I think, it represents what we
>> are trying to do more honestly.
>
> Indeed.
>
>> Note2: the next step of generalization will be NBD_CMD_USER, with
>> variable request size, structured reply and no definition :)
>
> Well, er, no please, if we can avoid it :-)
>
>> Another idea, about backups themselves:
>>
>> Why do we need allocated/zero status for backup? IMHO we don't.
>
> Well, I've been thinking so all along, but then I don't really know
> what it is, in detail, that you want to do :-)
>
> I can understand a "has this changed since time X" request, which
> the "dirty" thing seems to want to be. Whether something is
> allocated is just a special case of that.
>
> Actually, come to think of that: what is the exact use case for this
> thing? I understand you're trying to create incremental backups of
> things, which would imply you don't write from the client that is
> getting the block status thingies, right? If so, how about:
>

Essentially you can create a bitmap object in-memory in QEMU and then
associate it with a drive.
It records changes to the drive, and QEMU can be instructed to write
out any changes since the last backup to disk, while clearing the
bits of the bitmap. It doesn't record a specific point in time; it's
just implicit to the last time you cleared the bitmap -- usually the
last incremental or full backup you've made. So it does describe
"blocks changed since time X"; it's just that we don't really know
exactly when time X was.

This is all well and dandy, except there is some desire from third
parties to be able to ask QEMU about this dirty block information --
to be able to see this bitmap, more or less. We already use NBD for
exporting data, and instead of inventing a new side-band, we decided
that we wanted to use NBD to let external users get this information.
What exactly they do with that info is beyond QEMU's scope.

> - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
>   snapshots. Not required, optional, includes a machine-readable
>   form, not defined by NBD, which explains what the snapshot is
>   about (e.g., a qemu json file). The "base" version of that is just
>   "allocation status", and is implied (i.e., you don't need to run
>   NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
>   allocation status).

We don't necessarily have /snapshots/ per se, but we do have some
block device and one or more bitmaps describing deltas to one or more
backups that we do not necessarily have access to. E.g., a given
bitmap may describe a delta to an off-site backup. We know the delta,
but do not maintain any meaningful handle to the given off-site
backup, including name, URI, or date. QEMU leaves this association up
to upper management, and provides only IDs to facilitate any
additional correlation.

We could offer up descriptions of these bitmaps in response to such a
command, but they're not... quite snapshots. It is more the case that
we use them to create snapshots.
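As a rough model of the behaviour I described above -- a toy sketch
with assumed names, not QEMU's actual implementation -- the bitmap
acts something like this:

```python
class DirtyBitmap:
    """Toy model of dirty tracking as described in this thread: a set
    bit means only "changed since the last clear"; what a clear
    corresponds to (e.g. a completed full or incremental backup) is
    external information that the bitmap itself does not record."""

    def __init__(self, nbits, granularity=512):
        self.granularity = granularity
        self.bits = [False] * nbits

    def mark_write(self, offset, length):
        """Called for every write to the associated drive."""
        first = offset // self.granularity
        last = (offset + length - 1) // self.granularity
        for i in range(first, last + 1):
            self.bits[i] = True

    def pop_dirty(self):
        """Report dirty block indices and clear them, as a backup job
        would: the implicit "time X" resets to now."""
        dirty = [i for i, b in enumerate(self.bits) if b]
        self.bits = [False] * len(self.bits)
        return dirty
```

An external consumer reading this state over NBD would see only the
current bit values; correlating them with a particular backup is its
problem, not the protocol's.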
> - NBD_CMD_BLOCK_STATUS (during transmission): returns block
>   descriptors which tell you what the status of a block of data is
>   for each of the relevant snapshots that we know about.
>
> Perhaps this is somewhat overengineered, but it does bring most of
> the definition of what a snapshot is back into the NBD protocol,
> without having to say "this could be anything", and without
> requiring connectivity over two ports for this to be useful (e.g.,
> you could store the machine-readable form of the snapshot
> description into your backup program and match what they mean with
> what you're interested in at restore time, etc).
>
> This wouldn't work if you're interested in new snapshots that get
> created once we've already moved into transmission, but hey.
>
> Thoughts?
>
>> Full backup: just do a structured read -- it will show us which
>> chunks may be treated as zeroes.
>
> Right.
>
>> Incremental backup: get the dirty bitmap (somehow, for example
>> through the user-defined part of the proposed command), then, for
>> dirty blocks, read them through structured read, so information
>> about zero/unallocated areas is included.
>>
>> For me all the variants above are OK. Let's finally choose
>> something.
>>
>> v2:
>> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>>
>> Since then, we've added the STRUCTURED_REPLY extension, which
>> necessitates a rather larger rebase; I've also changed things
>> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
>> modes to be determined by boolean flags (rather than by fixed
>> values of the 16-bit flags field), changed the reply status fields
>> to be bitwise-or values (with a default of 0 always being sane),
>> and changed the descriptor layout to drop an offset but to include
>> a 32-bit status so that the descriptor is nicely 8-byte aligned
>> without padding.
>>
>> doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 154 insertions(+), 1 deletion(-)
>
> [...]
>
> I'll commit this in a minute into a separate branch called
> "extension-blockstatus", under the understanding that changes are
> still required, as per above (i.e., don't assume that just because
> there's a branch I'm happy with the current result ;-)
>
> Regards
>

Err, I hope I haven't confused everything to heck and back now....

Thanks,
--js